Machine Learning-Based Contract Provision Extraction on Poor Quality Scans

Written by: Noah Waisberg


ML-based contract provision models can be robust enough to work on poor quality scans and unfamiliar agreements. Includes a video showing this.

The CTO of another contract review software company (in a post about their “Next Generation Contract Analytics”, a keyword-based contract provision extraction system with a “semantic based engine that also incorporates the use of Natural Language Processing”) recently wrote:

No system is capable of resolving poor quality data, where the optical character recognition (OCR) quality – the process of converting image files or paper documents into computer readable and searchable media – affects all the key features. Some companies have indicated that machine learning and or predictive coding can compensate for OCR quality. This is simply not the case. Some OCR errors can be accounted for within any system, but at their core, highly accurate systems need as much clean data as possible.

This is likely in response to our post on how we use machine learning to build contract provision extraction models, and the plusses (greater accuracy on unfamiliar documents and scans) and minuses (hard to build) of our approach. Specifically, we wrote:

**Even Form Agreements can Become Unfamiliar Documents if they are Poor Quality Scans.** Some documents in contract review can be of poor enough quality that OCR gives mixed results. Documents with imperfect OCR results become like unfamiliar agreements; even though they may be written off a company form, manual rules tailored to the company form could miss them. Would a manual rules-based system (not specifically set up for this wording) pick up a change of control provision written like this?
Mengesnorter iigernent or Control

tf-any-material change occurs in the management or control of the Supplieror_the_Business,save accordance-with-the provisions of this Agreement.

Maybe, maybe not. But our machine learning based system did. And we certainly never trained it on change of control provisions that were worded quite like this!

We weren’t exactly saying “that machine learning … can compensate for OCR quality” (though we think it can), but rather that a machine learning-based contract metadata extraction system can often identify provisions in poor quality scans, despite many words being mangled in the OCR process. We agree that clean scans are preferable. But (i) poor quality scans often occur in contract review and (ii) contracts need to be reviewed in whatever form they are in.
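To make that concrete, here is a toy sketch of why character-level features tolerate OCR noise where exact keyword matching does not. Everything in it (the helper names, the clean wording, the similarity measure) is invented for illustration; it is not a description of how our system actually scores text:

```python
# Illustrative only: a toy comparison of exact keyword matching vs.
# character n-gram overlap on OCR-mangled text.

def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of lowercase character n-grams in text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 0))}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the n-gram sets of two strings."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# The OCR-mangled change of control text from the example above.
ocr_text = ("tf-any-material change occurs in the management or control "
            "of the Supplieror_the_Business,save accordance-with-the "
            "provisions of this Agreement.")

# Hypothetical clean wording a keyword rule author might target,
# plus an unrelated clause for comparison.
clean_coc = ("if any material change occurs in the management or control "
             "of the Supplier or the Business")
unrelated = "This Agreement shall be governed by the laws of New York."

# An exact phrase rule misses: the mangled text never contains it.
print("change of control" in ocr_text.lower())   # False

# But n-gram overlap still clearly separates the candidates.
print(jaccard(ocr_text, clean_coc))   # substantially higher...
print(jaccard(ocr_text, unrelated))   # ...than this
```

The mangled text never contains the literal phrase a keyword rule would hunt for, yet it still shares a large fraction of its character n-grams with cleanly worded change of control language; that graded, redundant signal is the kind a trained statistical model can exploit.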

We had planned to keep the Contract Review Software Buyer’s Guide free of cross-vendor squabbles. However, there are very significant technical differences between available contract metadata extraction systems. This dispute fits with a major point of the Guide: users are impacted by the details of how contract review systems actually work. The keyword-using CTO’s assertion provides a perfect opportunity to show why the greater robustness of machine learning-based contract provision extraction matters.

There are compelling computer science theory reasons why a machine learning system should outperform keyword and database contract provision search systems on poor quality scans (and in general). Nearly all recent academic research “starts from the assumption that statistical machine learning is the best approach to solving information extraction problems.”* But, since a number of previous Guide posts have described the plusses and minuses of keyword (see also keyword extraction case study here), comparison, and machine learning contract review, we thought it would be more helpful to show how our system performs on a poor quality scan it was not trained on.

Why OCR Matters

To properly explain why poor quality scans can be an issue for contract abstraction software, it is important to first explain how OCR software fits in. Contract metadata extraction software like ours works by applying contract provision models to text. If an image file (e.g., a scan) is uploaded, the system first needs the scan converted into text. Optical character recognition (“OCR”) software does this image-to-text conversion. Contract review software vendors like us generally (probably exclusively, but I can’t definitively say) integrate third-party OCR into their systems for this functionality (sometimes with modifications to the OCR). Once a document has been converted into text, the contract metadata extraction system can apply its provision extraction models to the text and capture relevant results. (Here is more detail on how our contract review system actually works.) OCR accuracy can get rough on poor quality scans, and there are a lot of poor quality scans in contract review. This means provision extraction models often must be used on text unlike anything they were explicitly built for.
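In outline, that pipeline looks something like the sketch below. The names (ocr_engine, provision_models, Extraction) are placeholders made up for this post; this is a simplified illustration of the flow just described, not our actual code or any vendor’s real API:

```python
# A simplified sketch of the flow described above. The names here are
# placeholders for illustration, not any vendor's real API.

from dataclasses import dataclass

@dataclass
class Extraction:
    provision: str   # e.g., "Change of Control"
    text: str        # the (possibly OCR-mangled) passage found
    offset: int      # character position in the OCR'd text

def extract_provisions(scan_path: str, ocr_engine, provision_models):
    """Run a scanned contract through OCR, then apply provision models."""
    # Step 1: a third-party OCR engine converts the image file to text.
    text = ocr_engine.to_text(scan_path)

    # Step 2: each trained provision model reads the text and returns
    # the passages it classifies as instances of its provision.
    results = []
    for model in provision_models:
        for passage, offset in model.find(text):
            results.append(Extraction(model.name, passage, offset))
    return results
```

The hand-off in step 1 is the key point: the provision models only ever see the OCR engine’s text output, so whatever the OCR engine mistranscribes is exactly what the models must cope with.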

We’re happy to send you the full scanned document (get in touch). It is a poor quality document, but companies have documents in this sort of condition, and they still need to be reviewed.

Here is what two different OCR programs spit out on those pages.** The first OCR output was generated by a popular OCR program that several large law firms we have worked with use.

Not all OCR programs are equivalently accurate. Based on our research and tests, two systems are relatively comparable to each other and better than others. We have integrated one of these two into our system. Here is the OCR output of the exact same text from one of the leading two:

No' therehire. fct good and valuable consideration. the tecerpt and sufficiency of which ate hereby acknowledged. the parties agree as follows.

ARTICLE I
Ter ntand ‘frortnatIM of 1993 Steno]v_Agrrr’tnetrt

LI The initial term of tins Agreement (“Initial Term’) shall be effective a3 of the Effactive Data and ellitre on December 31, h)21. provided shat die
initial Term shall be automatically extended for succengive one-yearterms (each. Extension Term” and together

r:C: CO. a’y 04 2C I

with the initial Tenn, the “Term') Wail Mitten notice of non-renewal or termination is provided by one patty to the clher or leapt meetly-four inotelte prior
to die end oldie InitialTenn or any Extension ‘Fenn.
II. The particle agree he commence negotiations on the terms of all eNteraiOri of this Aemernent at least twenty’-folli (2-I) nioitlhs prior to tile end oldie
Initial Tenn Unless the partite otherwise agree in viliting, the terms awl conditions of this Agreement shall apply during any Extension Tenn,
The 1993 Supply Agreement is tenni/tilted effective as of the infective Date: provided. however, that any tights of Miler party with respect to
Containenirklivered to Don Pietro under Inch agreement pnor to the Effective Dale hereofAtall not be affected by IlUs termination.

These OCR outputs show what a contract metadata extraction system like ours sees as it reads through a poor quality scan. As the video below shows, our system correctly identifies contract provisions from text identical to that in the second OCR output.

Show Instead of Tell

Without further ado, here is a video showing our system’s performance on the poor quality scan shown above. I recommend viewing it in full screen mode, as the very messed up OCR transcription won’t be legible at the embedded size.

A few things to note:

  • Text in our system’s automatically created summary chart looks improperly transcribed. That is because the scan quality was so poor: the system transcribed what it read, and what it read was one of the OCR’d versions of the document reproduced above. The cool thing is that, despite reading text as mangled as the second OCR result above, the system still correctly identified provisions. A user can quickly retype or edit the system’s self-generated results; importantly, the system found and directed its user to the right provision.
  • Our system includes a warning flag indicating that OCR quality on this document is poor. These flags allow users to prioritize review of documents that are more likely to have transcription problems. (One possible flagging heuristic is sketched after this list.)
  • This document was a very poor quality scan, but there are even worse ones out there. While our system accurately identified relevant provisions in this document, its performance was likely worse than it would otherwise have been (mistranscribed results, as well as a missed date). And a scan can eventually be of poor enough quality that our system will not be able to automatically return results on it. On the plus side, in these situations, users will be (i) alerted to problems by the system’s automatic OCR warning flag, (ii) still otherwise able to use our system to collect and track data (including setting links from the summary chart to a view of the original document text), and (iii) generally better off than they would be if not using our system.
  • This video showed our system’s actual performance on a poor quality scan it was never trained on. This last point is quite important: it is easy to cheat demos of contract provision extraction software by showing contracts the system was trained in advance to perform well on. Performance on pre-trained documents is useful if the demo viewer intends to use the contract metadata extraction system primarily to review (i) contracts they have in advance (e.g., based off their company forms) and (ii) cleanly scanned documents. It otherwise isn’t a good indicator of real world performance. Make sure you test any contract provision extraction system you are considering on documents (including poor quality scans) you haven’t given the vendor in advance.
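For the curious, here is one way an OCR-quality warning flag like the one mentioned in the second bullet could be computed. This is a hypothetical heuristic sketched for illustration (the word list and threshold are made up, and we are not describing our system’s actual implementation):

```python
# A hypothetical heuristic for an OCR-quality flag, for illustration only:
# flag a document when too few of its alphabetic tokens look like real words.

import re

ENGLISH_WORDS = {  # stand-in for a full dictionary
    "the", "of", "and", "to", "this", "agreement", "term", "shall",
    "party", "parties", "notice", "effective", "date", "any", "be", "or",
}

def looks_poorly_ocred(text: str, threshold: float = 0.5) -> bool:
    """Flag a document when the share of recognizable words is too low."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return True  # nothing readable came out of OCR at all
    recognizable = sum(1 for t in tokens if t in ENGLISH_WORDS)
    return recognizable / len(tokens) < threshold
```

On a clean document, nearly every alphabetic token is a real word, so the recognizable share is high; on output like the second OCR reproduction above, a large share of tokens (“tecerpt,” “ellitre,” “succengive”) match nothing, and the flag trips.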

The Big Picture: Why Performance on Poor Quality Scans is Important Beyond Poor Quality Scans

This post and demo video discussed and showed how our machine learning-based contract provision extraction system performs on a poor quality scan. This robust performance has significance beyond poor quality scans themselves: poor quality scans are a great analogy for unfamiliar documents. We never trained our system on provisions that looked like these, yet the system still correctly identified them. Not every contract reviewed will have OCR-mangled text, but many will have differently drafted provisions. There is a lot of variability in how contracts are drafted, yet contract reviewers are expected to identify provisions no matter how they are written. If contracts were all phrased the same way (and all on good quality scans), human-written rules would be a fine way to accurately extract contract provisions. But that isn’t how the contract review we’re familiar with works. Machine learning-based automated contract review software is robust enough to work on poor quality scans and unfamiliar agreements.

* From a paper by keyword-using vendor employees griping that they can’t get no respect, backed up by data. Note that this quote focused on “information extraction,” whereas we see most of what our system does as “classification.” Our sense is that the use of machine learning in classification is at least as overwhelming as in information extraction.

** I changed one party name to “Don Pietro” in the one instance in each reproduction where it was legible.

