Numerous published articles have explored how predictive coding technology is a powerful e-discovery tool that, used properly, can save clients considerable time and expense in the electronic discovery process. The practical application of predictive coding is perhaps best illustrated in Ralph Losey’s Predictive Coding Narrative, where he thoughtfully walks the reader through his 52-hour project of training a predictive coding engine to analyze 699,082 documents. Using what he calls his hybrid multimodal approach (i.e., training the predictive coding model both with random seed sets provided by the review platform and with training documents identified through his own advanced search and filter techniques), Losey identified 691 responsive documents and concluded with 95% confidence that he had found all of the responsive documents in the database. He achieved these results in a mere 52 hours, whereas, he points out, an old-fashioned linear review of all 699,082 documents would have taken a team of contract attorneys more than 13,000 hours to complete. According to Losey, this translates into a 93% savings for the client (with Losey hypothetically billing himself out at $1,000 per hour).
In his Corrections and Refinements to the Predictive Coding Narrative, Losey tweaks his conclusion and acknowledges that a 93% savings may not be possible under real-world conditions. Let’s explore this further to see how much savings can actually be realized, comparing his conclusions with some economic realities of the discovery process.
From the outset, it is important to note that Losey’s expected responsiveness rate (richness) was close to one-tenth of one percent: his baseline control sample identified two responsive documents in a random control set of 1,507 documents (0.13%). Many real-world reviews have significantly higher responsiveness rates. In addition, Losey was working with a static data set consisting entirely of emails and their attachments, and he did not have to factor in the cost of a second-pass attorney review for confidentiality and privilege. Taken together, these features of his experimental review would make it a real-world anomaly.
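For concreteness, the richness estimate follows directly from those control-sample counts. A quick, purely illustrative sketch of the arithmetic:

```python
# Richness (prevalence) estimate from Losey's baseline control sample:
# 2 responsive documents in a random control set of 1,507 documents.
responsive_in_sample = 2
control_set_size = 1_507

richness = responsive_in_sample / control_set_size
print(f"Estimated richness: {richness:.2%}")  # prints "Estimated richness: 0.13%"
```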
As litigators and e-discovery attorneys will tell you, more often than not, real-world reviews involve rolling loads of data from ongoing collections. Moreover, these data sets typically contain spreadsheets, image files, media files, and other non-text documents in addition to emails and their attachments; because predictive coding software classifies documents based on their extracted text, these non-text files will require eyes-on review. Finally, we often see higher responsiveness rates and must factor in confidentiality and privilege review concerns.
Additionally, it seems unlikely that a contract attorney team would be engaged for a linear review of all 699,082 documents when the expected responsiveness rate is a mere one-tenth of one percent. With such a small expected responsive percentage, most experienced electronic discovery attorneys and project managers would have worked to further cull the data set (by date, custodian, keyword search, clustering, filters, other advanced analytics, or some combination thereof). Losey acknowledges this point, but notes that the data had already been culled to some extent by vertical deduplication and DeNISTing (removing operating-system and program files from the data set).
For cost-comparison purposes, we will assume that by using date restrictions, keyword searches, clustering, filters, or other advanced analytics, we are able to cull the data set down by 60% for a more focused review. This is a sizeable culling assumption given that the data was already deduplicated and DeNISTed, but since our baseline responsiveness rate is close to one-tenth of one percent, such an approach may be reasonable.
After culling, instead of a linear review of all 699,082 documents, we would be left with a linear review of roughly 279,633 documents. At an assumed review rate of 50 documents per hour, a first-pass review would still require about 5,600 hours of contract attorney time, and at an assumed contract-attorney rate of $50 per hour, a total cost of approximately $280,000. So, even after reducing the size of the database by 60% prior to undertaking a linear review, Losey’s multimodal approach utilizing predictive coding technology would still yield significant cost savings. As is always the case, culling methods need to be transparent. A party will want agreement from the other side, or at a minimum the ability to justify what was culled and why, in order to defend the reasonableness of its process.
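The culled-review arithmetic can be sketched as follows. Note that the 50-documents-per-hour pace and the $50-per-hour contract-attorney rate are my own illustrative assumptions, back-solved from the hour and dollar figures above:

```python
# Cost of a linear first-pass review after a 60% cull (illustrative figures).
TOTAL_DOCS = 699_082
CULL_RATE = 0.60          # assumed cull percentage
DOCS_PER_HOUR = 50        # assumed contract-attorney review pace
RATE_PER_HOUR = 50        # assumed contract-attorney billing rate ($)

docs_remaining = round(TOTAL_DOCS * (1 - CULL_RATE))   # 279,633 documents
review_hours = docs_remaining / DOCS_PER_HOUR          # ~5,593 hours (about 5,600)
review_cost = review_hours * RATE_PER_HOUR             # ~$280,000
```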
However, we also need to factor in the estimated predictive coding processing fees. Losey utilized a slice of the Enron database assembled by EDRM, approximately 43 GB in size. Predictive coding fees can range anywhere from $250 to $700 per GB, though they are sometimes quoted on a per-document basis rather than per GB and may be bundled with other vendor software and services. Here we will assume a middle-of-the-road fee of $450 per GB, for a total predictive coding expense of $19,350. Adding Losey’s 52 hours at his hypothetical $1,000-per-hour rate ($52,000) puts his hybrid multimodal approach at a total cost of $71,350, against the smaller linear review’s $280,000: still a whopping 75% savings.
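Putting the comparison together, using the middle-of-the-road per-GB fee and Losey’s hypothetical billing rate, the 75% figure can be sketched as:

```python
# Hybrid multimodal cost vs. culled linear review (illustrative figures).
GB_PROCESSED = 43
FEE_PER_GB = 450              # middle of the $250-$700 per-GB range
LOSEY_HOURS = 52
LOSEY_RATE = 1_000            # Losey's hypothetical hourly rate ($)
LINEAR_REVIEW_COST = 280_000  # culled linear review estimate from above

pc_processing = GB_PROCESSED * FEE_PER_GB        # $19,350
attorney_time = LOSEY_HOURS * LOSEY_RATE         # $52,000
hybrid_total = pc_processing + attorney_time     # $71,350

savings = 1 - hybrid_total / LINEAR_REVIEW_COST  # ~0.745, i.e. roughly 75%
```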
Predictive coding is a powerful tool, but many real-world variables will influence the extent of its efficacy in a particular case. As Losey notes in his Corrections and Refinements to the Predictive Coding Narrative, a 93% savings is not likely achievable under real-world conditions, but the potential cost savings are clearly still substantial. Cutting the cost of the process by 75% with these tools leaves little doubt that your clients will be happy with the significant reduction in overall discovery spend. I will discuss these real-world variables, and their associated costs, in the next article in this series: Predictive Coding Primer Part Two: Identifying Key Expenses in Predictive Coding Driven Review.