The impact of high volumes of data in automating legal review is well understood. What is less talked about is the impact of diversity of data. The accuracy of legal AI solutions is heavily dependent on the quality of the sample data on which the algorithms are trained. The higher the quality of the legal training data, the higher the accuracy of AI-enabled legal review. This makes a high-quality data set an imperative starting point for any organisation implementing AI-based review.
What makes a good training data set for automating legal review? An example will illustrate this. Entity X typically enters into approximately 70 agreements with external vendors every month, requiring the legal team to spend considerable time manually reviewing the terms of each contract. Given the high contract volumes involved, and wanting the legal team to focus on more business-critical matters, X was keen to automate the legal review of vendor agreements. The starting point was compiling 100 legacy agreements previously executed by entity X to serve as a training data set for AI review.
The training data set was divided into two halves of 50 contracts each: the first set of 50 contracts contained only one type of payment obligation clause (annual payment), while the other set of 50 contracts contained a mixture of varied payment clauses (annual, monthly, lump-sum, milestone-linked).
When models were trained on the first set of contracts, which contained only annual payment clauses, the models ‘learned’ these to be the only kind of payment obligation; hence model accuracy at detecting and understanding any other variation of payment obligation was low. When algorithms are trained on a homogeneous data set, where every contract looks like every other contract in that data set, this homogeneity compromises the AI solution’s capacity to handle the real-life variations in contract language that it will inevitably encounter.
This was in stark contrast to the models trained on the second set of contracts, containing a mix of diverse payment clauses, where the models essentially ‘learned’ that payment clauses in real-world contracts occur in multiple distinct forms. The accuracy of the machine learning systems at undertaking legal review was significantly better for the second set.
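The effect described above can be sketched with a toy similarity-based classifier. This is a deliberate simplification, not the system the article describes: real legal-review models are far richer, and every clause text and label below is invented for illustration. The point it demonstrates is the same, though: a model whose only payment example uses annual-fee wording has nothing to match a monthly-instalment clause against, while a model trained on varied payment wordings does.

```python
# Toy illustration of how training-set diversity affects clause detection.
# All clause texts and labels are invented for this sketch; a real legal AI
# system would use far richer models and features than word overlap.

def tokens(text):
    """Lowercase a clause and split it into a set of words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Similarity of two token sets: shared words / total distinct words."""
    return len(a & b) / len(a | b)

def classify(clause, training):
    """Label a clause with the label of its most similar training example."""
    _, label = max(training, key=lambda ex: jaccard(tokens(clause), tokens(ex[0])))
    return label

# Homogeneous set: the only payment example is an annual-fee clause.
HOMOGENEOUS = [
    ("The vendor shall be paid an annual fee of 10000 payable each year", "payment"),
    ("Each party shall keep the terms of this agreement confidential", "confidentiality"),
    ("This agreement shall terminate upon thirty days written notice", "termination"),
]

# Diverse set: the same examples plus varied payment-clause wordings.
DIVERSE = HOMOGENEOUS + [
    ("Fees shall be paid in monthly instalments on the first day of each month", "payment"),
    ("A lumpsum fee shall be payable upon signing of this agreement", "payment"),
    ("Payment shall be linked to completion of each project milestone", "payment"),
]

clause = "The client shall pay fees in monthly instalments due each month"
print(classify(clause, HOMOGENEOUS))  # → confidentiality (wrong: no monthly wording was seen)
print(classify(clause, DIVERSE))      # → payment (a monthly variant was in training)
```

The homogeneous model misfires because its single payment example shares almost no vocabulary with the monthly-instalment clause; the diverse model matches it easily. This mirrors, in miniature, the accuracy gap observed between the two sets of 50 contracts.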
The takeaway? A training data set that contains multiple diverse variations of clauses and concepts contributes to greater accuracy in automating legal review. The higher the quality of training data, the stronger the performance of machine learning systems in ‘understanding’ legal concepts.