Predicting the Future of Predictive Coding in the EDiscovery Industry

Written By

Sean Goldstein

Discovery is often one of the most expensive phases of the legal process, which may be why predictive coding was one of the biggest legal technology topics of 2012. While the concept appeals to some, it remains the subject of debate among legal professionals, and many questions remain about its accuracy, effectiveness and level of acceptance in the industry.
Predictive coding is a computerized process that uses sophisticated algorithms to determine relevance based on interaction with a human reviewer. Using “coding” techniques, documents are digitally categorized as responsive or unresponsive to a discovery request. The idea is that once the legal team has created a sample set of documents, the software classifies additional documents according to how well they match the keywords and concepts in that set. After relevant documents are identified, the legal team can review the subset. This process greatly reduces the number of irrelevant and non-responsive documents that need to be reviewed manually.
Those who support the use of predictive coding highlight its cost and time savings. Critics, however, say that it is not as reliable as human review and worry that potentially relevant documents could be overlooked.
While predictive coding has yet to receive a universal “judicial stamp of approval”, it is beginning to gain traction in the ediscovery industry, and in February 2012 a landmark decision on the subject was made in the Southern District of New York. In Da Silva Moore v. Publicis Groupe, Case No. 11-cv-01279 (S.D.N.Y. April 2012), the U.S. District Court became the first court to approve the use of predictive coding for reviewing electronically stored documents in certain cases.
The parties agreed to the use of predictive coding; however, they disagreed on how the process should be implemented. According to the stipulation submitted by the parties, the defendants were required to first identify a small number of documents representing the categories to be reviewed and coded. This document set was referred to as the seed set. The seed set for each category is then used to train the software, which begins to prioritize and identify similar documents within the larger population of documents. From there, the “relevant” documents are reviewed and recoded, further training the software to re-categorize the documents.
The plaintiffs argued that this methodology lacked “generally accepted standards of reliability and violated Federal Rule of Civil Procedure 26 and Federal Rule of Evidence 702”. The judge disagreed, stating that the protocol contained standards for measuring reliability, required active participation from the plaintiffs at various stages of the process, and provided for judicial review in the event of a dispute prior to production.
While this ruling certainly helps pave the way for predictive coding as a defensible method of discovery, many questions remain unanswered. How will custodians be chosen? How many documents need to be reviewed to form a statistically reliable sample of the overall population? How many rounds of recoding are necessary within the software training process? Could a small error in the initial phases result in missed hits or false positives?
Perhaps it’s too soon to tell where predictive coding is headed. What we do know is that the technology is available today, is getting better, and helps reduce the cost of the ediscovery process. However, without a keen understanding of the technology, a mutually agreed-upon protocol, and the blessing of the court, it can create a colossal mess and may even doom the proper outcome of your case.
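To make the sample-size question raised above a little more concrete, here is a minimal sketch in Python of one textbook way to estimate how many documents a team might need to review. It is an illustration only, not the protocol approved in the case: it assumes simple random sampling, that the quantity being estimated is a proportion (for example, the share of responsive documents), and it uses the standard normal approximation with a finite population correction. The function name `sample_size` and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(population: int, margin: float = 0.02,
                confidence: float = 0.95, p: float = 0.5) -> int:
    """Estimate how many randomly sampled documents to review.

    Assumptions (illustrative, not taken from any court protocol):
    - documents are drawn by simple random sampling
    - we are estimating a proportion, e.g. the share of responsive documents
    - p = 0.5 is the most conservative guess for that proportion
    """
    # z-score for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # sample size for an effectively infinite population
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # finite population correction shrinks the sample for smaller collections
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    # Example: a collection of one million documents, +/- 2% at 95% confidence
    print(sample_size(1_000_000))
```

Under these assumptions, a collection of one million documents calls for a sample of roughly 2,400 documents, and tightening the margin of error raises that number quickly. The point is that the sample is driven far more by the desired margin and confidence level than by the size of the collection, which is one reason the parties need to agree on those figures in the protocol.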
What do you think? Will the legal system catch up to the technology, or are we heading towards another multi-year debate on the virtues of this advanced ediscovery tool?