Predictive Coding Primer Part Two: Key Variables in a Predictive Coding Driven Review

As discussed in “Predictive Coding Primer Part One: Estimating Cost Savings,” the potential cost savings from predictive coding are substantial. If you are planning to implement the technology in your discovery plan, this article should be a useful primer on the key points you’ll need to be aware of, the variables at play, and their associated costs.

(1)   Predictive Coding Technology Processing

Processing costs for predictive coding technology generally range from $300 per GB to $700 per GB, or 6 cents to 15 cents per document for vendors determining the processing cost on a per document basis.  However, vendors often bundle predictive coding with other e-discovery tools and most provide volume discounts, so you will want to discuss pricing options with your preferred vendors before making any assumptions about the processing costs specific to your case. Since a party will pay a significant processing fee to utilize predictive coding technology, reducing the data set beforehand — through deduplication, keyword searches, conceptual searches, and other advanced culling techniques — is a crucial first step to managing overall discovery spend.
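To see why culling before processing matters, the per-GB range above can be turned into a quick estimate. This is only a sketch: the function name is hypothetical, the 100 GB and 40 GB volumes are made-up inputs, and it ignores the bundling and volume discounts that a vendor may offer.

```python
def processing_cost_range(volume_gb, low_per_gb=300.0, high_per_gb=700.0):
    """Return (low, high) predictive coding processing cost in dollars.

    The default rates are the $300 to $700 per GB range quoted in the text;
    actual vendor pricing may differ.
    """
    return volume_gb * low_per_gb, volume_gb * high_per_gb


# Hypothetical example: culling reduces a 100 GB collection to 40 GB.
before_culling = processing_cost_range(100)
after_culling = processing_cost_range(40)

print(f"Before culling: ${before_culling[0]:,.0f} to ${before_culling[1]:,.0f}")
print(f"After culling:  ${after_culling[0]:,.0f} to ${after_culling[1]:,.0f}")
```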

(2)   Technology Vendor Consulting

Many litigators and e-discovery attorneys are versed in a variety of electronic database platforms, culling techniques, and workflow designs. However, since predictive coding is a nascent technology, some attorneys may find themselves relying more on the vendor to design the predictive coding workflow and to direct the training of the predictive coding model. As noted in a previous blog entry, “How to Budget for E-Discovery: The Big 5 Expenses,” technology vendor consultant fees may range from $200 to $500 per hour. Accordingly, depending on how well the law firm or client understands predictive coding technology, a party may want to budget for vendor consulting related to setup and training of the predictive coding model. This would be in addition to vendor consultant fees incurred for general troubleshooting, running advanced queries, and assisting in production preparation.

(3)   The Richness and Makeup of Your Data Set

Predictive coding technology has the potential for substantial cost savings: anywhere from approximately 20 percent to 80 percent over traditional linear review, according to vendor respondents in the 2012 RAND study Where the Money Goes: Understanding Litigant Expenditures for Producing Electronic Discovery. This wide range reflects variability in the proportion of documents within the review set that will still require traditional eyes-on review. Two factors largely drive that proportion: the richness of the review set (the percentage of documents in the collection that are responsive) and the presence of files that the software’s algorithm cannot read, such as certain types of spreadsheets, non-text documents, image files, and other technical documents. Thus, to develop a clear idea of the potential cost of a review, you will need to know not only the volume of the data set, but also its expected richness and its makeup of file types.

As I discussed in “Predictive Coding Primer Part One: Estimating Cost Savings,” Ralph Losey was able to achieve 75 percent savings using predictive coding technology over traditional linear review in a case study reviewing 699,082 Enron emails and attachments.  However, in Losey’s “Predictive Coding Narrative,” the richness rate was approximately one tenth of one percent – extremely low. The average richness in a universe of data is likely to be much higher, in the range of 5 percent to 15 percent.  Thus, in Losey’s case study, any second level attorney review and associated cost would have been minimal when compared to review of a more representative data set, contributing to the significant cost savings achieved.  So while astronomical savings are possible, it is important to remember that a data set’s expected richness and file makeup will determine in large part where on the 20% to 80% predictive coding cost savings spectrum your review will fall.
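A rough calculation shows how much richness changes the size of the responsive set. The sketch below applies the richness figures mentioned above to the 699,082-document collection. The function name is hypothetical, and the result is a simple expected count, not a prediction of what any model would actually retrieve.

```python
def expected_responsive(total_docs, richness):
    """Expected number of responsive documents at a given richness rate."""
    return round(total_docs * richness)


total_docs = 699_082  # size of the collection in the case study cited above

# 0.1% is the case study's approximate richness; 5% and 15% bound the
# "more typical" range suggested in the text.
for richness in (0.001, 0.05, 0.15):
    count = expected_responsive(total_docs, richness)
    print(f"{richness:.1%} richness -> about {count:,} responsive documents")
```

Under these assumptions, a collection of the same size at 5 to 15 percent richness would send tens of thousands of documents to second level review rather than a few hundred.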

(4)   Standard E-Discovery Processing Services

In addition to the cost incurred to employ predictive coding, your review will still encounter the more typical e-discovery service fees, such as (1) forensic collection fees, (2) native file processing (de-duping, de-NISTing, indexing) fees, (3) database/review platform hosting fees, and (4) user fees. For further discussion of these services, see my previous post, “How to Budget for E-Discovery.”

(5)   Subject Matter Experts: Training the Model

Predictive coding technology is only as good at identifying responsive documents as the attorneys who train the model. The axiom often mentioned by electronic discovery vendors offering these services is “garbage in, garbage out.” Thus, the recommended best practice is to train the predictive coding system using a small team of two to four attorneys who are subject matter experts with in-depth knowledge of the facts of the case, the client’s business, and the issues related to the claims and defenses of the litigation. Accordingly, these subject matter experts are not typically junior associates or contract attorneys, but rather mid-level or senior associates, or even partners. Of course, more senior attorneys will bill out at more expensive rates. Depending on the volume and type of data in your collection, this up-front investment in more expensive subject matter expert attorneys is likely to translate into substantial document review savings.

But how many hours will the subject matter experts need? How many documents will they need to review to train the predictive coding system? The answer is that it depends. The subject matter experts will first need to review a random control set of documents. This random control set provides baseline metrics for the predictive coding software to calculate recall, precision, and F-Measure, which are key statistics needed to prepare the model for training. Recall is the proportion of all responsive documents in the collection that the predictive coding model identifies and retrieves. Precision is the proportion of the documents retrieved by the model that are actually responsive. F-Measure is the harmonic mean of recall and precision.
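The three statistics can be computed directly from counts taken on the control set. The following is a minimal sketch using the standard definitions. The function name and the example counts are hypothetical, and a given vendor’s software may report these figures differently.

```python
def review_metrics(true_positives, false_positives, false_negatives):
    """Return (recall, precision, f_measure) from control-set counts.

    true_positives:  responsive documents the model retrieved
    false_positives: non-responsive documents the model retrieved
    false_negatives: responsive documents the model missed
    """
    retrieved = true_positives + false_positives
    responsive = true_positives + false_negatives

    recall = true_positives / responsive if responsive else 0.0
    precision = true_positives / retrieved if retrieved else 0.0

    # F-Measure is the harmonic mean of recall and precision.
    if recall + precision == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


# Hypothetical control set: the model retrieved 100 documents, 80 of which
# were responsive, and missed 20 other responsive documents.
recall, precision, f_measure = review_metrics(80, 20, 20)
print(f"recall={recall:.2f} precision={precision:.2f} F={f_measure:.2f}")
```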

After the control set has been reviewed, the subject matter experts will then need to train the predictive coding model further through review of a statistically significant sample of documents. The number of documents that require review will depend on the size of the document collection, the estimated richness of the data (which is calculated from the completed control set), and the desired confidence level and F-Measure.
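For a sense of scale, the textbook sample-size formula for estimating a proportion (with a finite population correction) gives a ballpark figure. This is an illustration only: it targets a margin of error on a proportion, not an F-Measure, and predictive coding tools use their own methods. The function name, the 2 percent margin, and the default expected rate of 0.5 (the most conservative choice) are all assumptions.

```python
import math
from statistics import NormalDist


def sample_size(population, confidence=0.95, margin=0.02, expected_rate=0.5):
    """Documents to sample to estimate a proportion within +/- margin.

    Uses the normal-approximation formula with a finite population
    correction. This is a generic statistical estimate, not the method of
    any particular predictive coding product.
    """
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    n_infinite = z ** 2 * expected_rate * (1 - expected_rate) / margin ** 2
    n_corrected = n_infinite / (1 + (n_infinite - 1) / population)
    return math.ceil(n_corrected)


# Hypothetical example: a 699,082-document collection, 95% confidence,
# +/- 2% margin of error.
print(sample_size(699_082))
```

Tightening the margin of error or raising the confidence level increases the sample, and with it the subject matter expert hours to budget.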

You need not worry about pulling out your dusty college statistics course books or undertaking a course on algorithms, as a quality electronic discovery vendor consultant should be able to educate you on the estimated number of documents that the subject matter experts will need to review for a particular collection of data, based upon your desired confidence and F-Measure levels. Further, most predictive coding software allows you to easily adjust the parameters of confidence, recall, precision, and F-Measure to assist you in achieving your desired results.

In your effort to fully appreciate the potential cost of discovery and document review utilizing predictive coding technology, be sure to factor in the cost of subject matter expert attorneys, who will be reviewing a control set of documents and then will be training the system at higher billable rates.

(6)   Subject Matter Experts: Retraining for Rolling Loads of Data

Depending on your document collection, training the predictive coding model may need to occur on more than one occasion. Often during the course of complex document reviews, data is loaded on a rolling basis from a variety of sources. If you are confident that your subsequent data loads are more of the same (e.g., an update of newer data from the same custodians) then perhaps re-training the predictive coding model is not necessary. If, however, subsequent loads are from different departments and different custodians, then your team of subject matter experts may need to re-train the model so that it can accurately code the new data. If you do not anticipate a complete data set at project outset, time spent on re-training the model to accommodate rolling loads of data should be factored into your budget projections.

(7)   Attorney Review Team: Second Level Review and Quality Control

One misconception is that predictive coding technology does all the work and that document review by junior associates or contract attorneys will no longer be needed. That is unlikely to be the case, although the extent of attorney review will depend on your workflow, your agreements with the other side regarding privilege and confidentiality, and your risk tolerance. More likely, attorneys will remain a vital component in protecting a client’s interest and fulfilling discovery obligations even when utilizing predictive coding technology.

The following portions of any review utilizing predictive coding will still need to be performed by attorneys:

(a) Review of Spreadsheets, Technical, or other Non-text Documents –  At this time, predictive coding technology is unable to classify certain types of data including image files, scanned non-searchable hardcopy documents, Excel spreadsheets, audio and video files, graphics, photos, and computer-generated graphs. Accordingly, to the extent these types of files are in your data set, they will need to be incorporated into the workflow for attorney review.

(b) Quality Control Review: Validation of Predictive Coding Results  – Your workflow will need to include a quality control component to validate the results of the predictive coding. A random and statistically significant sample of documents classified as not-responsive (slated not to be produced or not receiving any responsive classification score by the predictive coding technology) should be reviewed to validate the predictive coding results.

(c) Review of Presumptively Responsive Documents for Confidentiality, Issue Tags, and Privilege – Depending on how your workflow is designed, you may engage a contract attorney review team to review the documents that have been identified by the predictive coding technology as presumptively responsive. In this scenario, predictive coding technology replaces first pass review by contract attorneys. At this stage the contract attorney team will review these presumptively responsive documents, focusing on confidentiality, issue tags, and privilege review.
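The quality control sample described in item (b) lends itself to a simple point estimate: the share of responsive documents found in the sample, projected onto the full not-responsive set. The sketch below assumes a simple random sample. The function name and all of the numbers are hypothetical, and it reports no confidence interval, which a vendor’s validation report would normally include.

```python
def estimate_missed_responsive(not_responsive_set_size, sample_size,
                               responsive_in_sample):
    """Project a QC sample onto the full not-responsive set.

    Returns (rate, estimated_missed): the fraction of sampled documents
    that reviewers found responsive, and the projected number of
    responsive documents left in the not-responsive set.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    rate = responsive_in_sample / sample_size
    return rate, rate * not_responsive_set_size


# Hypothetical example: 500,000 documents classified not-responsive,
# 1,500 sampled for QC, 15 of those found responsive on attorney review.
rate, missed = estimate_missed_responsive(500_000, 1_500, 15)
print(f"Estimated rate: {rate:.1%}, about {missed:,.0f} responsive documents missed")
```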

If you are producing responsive documents in complete families, remember that predictive coding technology works on a document level, not a family level.  So, when you are budgeting for the cost of your second pass review, the size of the second pass data set will be the documents identified by the predictive coding technology as presumptively responsive plus their family members.

The workflow for this work can incorporate the predictive coding data by batching out documents with high responsive scores to more experienced attorneys and documents with lower responsive scores to a contract attorney team.  Further, other assisted-review techniques can be implemented to gain efficiencies, including clustering, targeted searching, near duplicate identification, and grouping of email threads.

Case studies utilizing predictive coding have illustrated that review of the presumptively responsive set of documents tends to be slower than traditional first pass review. Because the non-responsive and junk documents have already been culled out by the predictive coding software, the remaining documents are all presumptively responsive, so there are fewer quick determinations. So whereas industry norms for document review pace without predictive coding technology might be in the range of 50 documents per hour, the pace when utilizing predictive coding may drop to 40 or 45 documents per hour. As in all cases, the document review pace metric depends on the types of documents, the coding required, and the skill level of the review attorneys.

As you can see, attorney review time, depending on your workflow, will need to be factored in as one of your variables in budgeting and estimating the cost of a document review utilizing predictive coding technology.
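The family expansion and review pace figures above can be combined into a rough second pass budget. This is a sketch under stated assumptions: the function name, the 50,000-document responsive set, the 1.5 family multiplier, and the $60 hourly rate are all hypothetical inputs chosen for illustration.

```python
import math


def second_pass_budget(responsive_docs, family_multiplier, docs_per_hour,
                       hourly_rate):
    """Return (docs_to_review, hours, cost) for second pass review.

    family_multiplier scales the presumptively responsive set up to include
    family members (for example, 1.5 means families add 50 percent).
    """
    docs_to_review = math.ceil(responsive_docs * family_multiplier)
    hours = docs_to_review / docs_per_hour
    return docs_to_review, hours, hours * hourly_rate


# Hypothetical inputs: 50,000 presumptively responsive documents, families
# adding 50 percent, and a contract attorney rate of $60 per hour.
for pace in (50, 45, 40):
    docs, hours, cost = second_pass_budget(50_000, 1.5, pace, 60.0)
    print(f"{pace} docs/hour: {docs:,} docs, {hours:,.0f} hours, ${cost:,.0f}")
```

Comparing the three paces shows how a drop from 50 to 40 documents per hour raises hours and cost by a quarter for the same document set.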

The Takeaway

In the right cases, properly executed predictive coding will substantially reduce the universe of documents for attorney review, saving the client considerable expense.  Yet, predictive coding is a tool, not an “Easy” button, and those utilizing it will need to do so thoughtfully in order to fulfill discovery obligations and achieve savings for their client.

Questions to Ask at Project Outset

  • How much data are we collecting for this litigation or investigation?
  • How much of the data consists of spreadsheets, technical, image, or non-text documents? How are we incorporating review of these types of documents into our workflow?
  • What are the vendor processing costs (usually per GB) for the predictive coding software? What are the other vendor costs for collection, processing, hosting, user fees, consulting, and production preparation?
  • Who are the subject matter experts to train the predictive model?
  • What are the billable rates of the subject matter experts?
  • What is the estimated number of documents that the subject matter experts will need to review in order to train the predictive coding model? How many estimated billable hours does this translate to?
  • What is the status of our data collection? Are we expecting rolling loads of data? If so, how many?
  • What is the estimated number of documents that the subject matter experts will need to review in order to re-train or recalibrate the predictive coding model due to subsequent loads of data? How many estimated billable hours does this translate to?
  • What is our validation and quality control workflow? Who is going to perform quality control to validate the results of the predictive coding model? What are the estimated billable hours required for validation?
  • What is our workflow to review documents for confidentiality, issue tags, and privilege?
  • What is our baseline estimated responsive rate so we can estimate the number of presumptively responsive documents from the total universe requiring second pass review by a contract attorney team for confidentiality, issue tags, and privilege? Does the baseline estimate include family members?
  • Will more experienced attorneys be reviewing documents that are classified by the predictive coding technology with high responsive scores? If so, what are their billable rates?
  • What is the estimated review rate of documents per hour for contract attorneys to complete second pass review?
  • What is the estimated number of contract attorney hours to complete second pass review?
  • What is the estimated number of hours to manage, supervise, and perform quality control of the contract attorney team second pass review?

