Using Named Entity Recognition to extract Legal Information from Contracts

At SpotDraft, we use machine learning to solve challenging problems on legal contracts. One such problem is the extraction of entities like dates, durations and geographical locations from contracts.

Why extract entities from Contracts

The general-purpose entities present in contracts are used to extract specific legal concepts like Effective Date, Termination Date, Jurisdiction, Notice Period, etc. Extracting this information from contracts allows users of our platform to manage and search through their contracts with ease. Without it, finding this information means opening each document and using Ctrl+F to look for specific keywords. Not only is this extremely time-consuming, it’s not guaranteed to work, as no single search phrase works across all documents. The deal points extracted by SpotDraft AI solve this problem: users no longer need such keyword searches and can instead search across their entire repository of contracts.


Figure 1: Examples of Key Pointers and other Legal Entities extracted using SpotDraft AI
Figure 2: Examples of Key Pointers and other Legal Entities extracted using SpotDraft AI


These entities also serve as a core part of SpotDraft’s knowledge about a document, allowing us to build more complex features like smart redline and notifications.

How AI is used to extract data from Contracts

To solve this problem, we use a transformer-based Named Entity Recognition (NER) model. It is built using a combination of two things:

  1. a domain adapted pre-trained model based on the bert-base-cased architecture 
  2. the excellent transformers library from HuggingFace. 

This model (which we call SpotLegalBERT*) is trained on tens of thousands of internally tagged and verified data points created from real-world contracts from across the English-speaking world.
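To make the NER step more concrete, here is a minimal sketch of how token-level predictions from such a model might be grouped into entities. It assumes the model emits one BIO tag per token (B- for the first token of an entity, I- for a continuation, O for everything else), which is a common convention but not something confirmed above. The label names DATE, DURATION and LOCATION and the function group_bio_tags are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple

# (label, text) pair, e.g. ("DATE", "1 January 2021")
Entity = Tuple[str, str]


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Entity]:
    """Merge per-token BIO tags into (label, text) entities.

    Assumes one tag per token; the tagging scheme and label names
    are illustrative assumptions, not SpotDraft's actual schema.
    """
    entities: List[Entity] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal current_label, current_tokens
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))
        current_label, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity currently being built.
            current_tokens.append(token)
        else:
            flush()
            if tag.startswith("I-"):
                # Stray I- tag with no matching B-: start a new entity.
                current_label, current_tokens = tag[2:], [token]
    flush()
    return entities


tokens = ["Effective", "from", "1", "January", "2021", "for",
          "two", "years", "in", "New", "York", "."]
tags = ["O", "O", "B-DATE", "I-DATE", "I-DATE", "O",
        "B-DURATION", "I-DURATION", "O", "B-LOCATION", "I-LOCATION", "O"]

for label, text in group_bio_tags(tokens, tags):
    print(label, "->", text)
```

General-purpose entities grouped this way are the raw material from which legal concepts such as Effective Date or Jurisdiction can then be derived.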

As part of the evaluation and training, we performed experiments to determine our model’s efficacy by comparing it to models trained using Google’s AutoML Entity Extraction and other transformer architectures. 

We were pleasantly surprised to see that our model performs better than AutoML. The following tables compare the performance of SpotDraft’s model (Table 1) with that of AutoML’s model (Table 2):


Table 1: SpotDraft’s model performance in extracting general purpose entities from legal contracts.
Table 2: AutoML’s model performance in extracting general purpose entities from legal contracts.

It is evident from the tables above that our model extracts entities with far greater accuracy across all labels. This is a very exciting result, since AutoML performs hyperparameter tuning and iterative Neural Architecture Search to find the best possible model for the task.
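For readers who want to run a similar per-label comparison, the sketch below shows one common way to score an entity extractor. The metric behind the tables is not stated here, so this assumes exact-match, entity-level precision, recall and F1. Each entity is represented as a (label, start, end) tuple, and the function per_label_scores and the sample spans are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# (label, start_offset, end_offset); an illustrative representation.
Span = Tuple[str, int, int]


def per_label_scores(gold: Set[Span], predicted: Set[Span]) -> Dict[str, Dict[str, float]]:
    """Exact-match precision, recall and F1 for each entity label."""
    tp: Counter = Counter()  # predicted and present in gold
    fp: Counter = Counter()  # predicted but absent from gold
    fn: Counter = Counter()  # in gold but not predicted

    for span in predicted:
        if span in gold:
            tp[span[0]] += 1
        else:
            fp[span[0]] += 1
    for span in gold:
        if span not in predicted:
            fn[span[0]] += 1

    scores: Dict[str, Dict[str, float]] = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up spans: one DATE boundary is wrong, so it counts as both
# a false positive and a false negative for DATE.
gold = {("DATE", 0, 3), ("DATE", 10, 12), ("LOCATION", 20, 22)}
predicted = {("DATE", 0, 3), ("DATE", 10, 11), ("LOCATION", 20, 22)}

for label, metrics in per_label_scores(gold, predicted).items():
    print(label, metrics)
```

Running the same scoring function over the predictions of two models on a shared test set gives a like-for-like comparison per label.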

We believe that the major contributor to our model’s high accuracy is the domain adaptation of the transformer model to legal contracts. Domain adaptation allows our model to understand complex legal nuances and deal with specific legal vocabulary that is usually not present in pre-trained language models.

Sounds interesting? Come work at SpotDraft and help automate the mundane! Check out spotdraft.com/careers