Predictive Coding

Predictive coding uses machine learning algorithms (e.g. adaptive logistic regression) to reduce the cost of e-Discovery. It helps users identify relevant documents within a larger set without having to manually open and review every document in the search space.

To obtain a useful prediction, users train the system to recognize documents that are relevant (i.e. responsive) and those that are not (i.e. non-responsive). The machine learning algorithm then uses this training to predict which further documents might be relevant. If the results are unsatisfactory, the training set can be refined further until the desired level of accuracy is achieved.

A model is created by identifying a sample of responsive (i.e. relevant) and non-responsive (i.e. not relevant) documents. Flagging a document as responsive trains the model to seek out documents of a similar kind. Marking a document as unresponsive indicates that the document, and those similar to it, are not relevant.
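To make the training idea concrete, here is a minimal sketch in Python of a plain logistic regression trained on word presence. It is not the product's implementation and it is not the adaptive variant mentioned above; the function names, the sample documents, and the learning settings are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict

def tokenize(text):
    """Lowercase a document and split it into a set of distinct words."""
    return set(text.lower().split())

def sigmoid(score):
    """Map a raw score to a probability, clamping to avoid overflow."""
    score = max(-30.0, min(30.0, score))
    return 1.0 / (1.0 + math.exp(-score))

def train(labelled_docs, epochs=200, learning_rate=0.5):
    """Fit one weight per word using stochastic gradient descent.

    labelled_docs is a list of (text, label) pairs, where label is
    1 for responsive and 0 for unresponsive (hypothetical encoding).
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, label in labelled_docs:
            words = tokenize(text)
            predicted = sigmoid(bias + sum(weights[w] for w in words))
            error = label - predicted
            # Move the bias and each present word's weight toward the label.
            bias += learning_rate * error
            for w in words:
                weights[w] += learning_rate * error
    return weights, bias

def predict(model, text):
    """Return the estimated probability that a document is responsive."""
    weights, bias = model
    return sigmoid(bias + sum(weights.get(w, 0.0) for w in tokenize(text)))

# Invented sample: three responsive and three unresponsive documents.
training_set = [
    ("merger agreement draft attached for review", 1),
    ("revised merger terms from counsel", 1),
    ("merger closing schedule and agreement", 1),
    ("lunch menu for friday", 0),
    ("office parking reminder", 0),
    ("friday social event schedule", 0),
]
model = train(training_set)
p_relevant = predict(model, "updated merger agreement")
p_irrelevant = predict(model, "parking for friday lunch")
print(f"{p_relevant:.2f} {p_irrelevant:.2f}")
```

Documents that share words with the responsive examples score closer to 1, and documents that share words with the unresponsive examples score closer to 0, which is the behaviour the flagging steps rely on.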

A training model requires a sufficient number of both responsive and non-responsive documents. As a general rule, one should initially aim for an equal number of each. If the training set does not contain enough responsive or non-responsive documents, the user is informed at the time of prediction.
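The kind of check described here can be sketched in Python as follows. The minimum of thirty items per class is an assumption borrowed from the "thirty or so" guideline in the training steps, not a documented limit of the system, and the function and label names are hypothetical.

```python
from collections import Counter

# Assumed threshold; the actual minimum enforced by the system is not documented here.
MIN_PER_CLASS = 30

def check_training_set(labels, min_per_class=MIN_PER_CLASS):
    """Count each class and report classes that fall short of the minimum.

    Returns (counts, balance, problems), where balance is the smaller
    class size divided by the larger one (1.0 means perfectly equal).
    """
    counts = Counter(labels)
    sizes = [counts["responsive"], counts["unresponsive"]]
    balance = min(sizes) / max(sizes) if max(sizes) else 0.0
    problems = []
    for name in ("responsive", "unresponsive"):
        if counts[name] < min_per_class:
            problems.append(
                f"only {counts[name]} {name} items, need at least {min_per_class}"
            )
    return counts, balance, problems

# Invented example: plenty of responsive items, too few unresponsive ones.
labels = ["responsive"] * 32 + ["unresponsive"] * 12
counts, balance, problems = check_training_set(labels)
for message in problems:
    print(message)
```

A balance well below 1.0, as in this example, suggests flagging more items of the smaller class before predicting.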

Train Model

The steps to create a training model are as follows:

Perform a regular keyword search that is likely to return some of the items that you are looking for (i.e. responsive items).
Enter a model name in the text box to the left of the Predict Model button on the toolbar.
Select the checkbox next to each relevant (i.e. "responsive") item.
Click the down arrow to the right of the Predict button and select Responsive from the button drop-down.
Select the checkbox next to each non-relevant (i.e. "unresponsive") item.
Click the down arrow to the right of the Predict button and select Unresponsive from the button drop-down.
Repeat the procedure above until roughly thirty responsive and unresponsive items have been chosen.
To examine all the items trained in the model, click the down arrow to the right of the Predict button and select Load Model from the button drop-down.

Predict

The steps to initiate a prediction are as follows:

Perform a broad keyword search to obtain the result set from which the prediction is to occur.
Type or select the model name in the text box to the left of the Predict Model button on the toolbar.
Click the Predict Model button.
To verify the quality of the model, click the down arrow to the right of the Predict button and select View Model from the button drop-down. The AUC (ROC) value should be in the range 0.8 to 0.9. A value lower than 0.8 indicates a bad fit. A value of 1.0 indicates that the model is overfitting and is unlikely to recognize new documents.
Prediction results are obtained in one of two ways:
1. If you predict over one thousand items or fewer (i.e. the result set from the search in the first step), the prediction results are loaded automatically.
2. If you predict over more than one thousand items, a background prediction task is executed. The progress of the task is shown in Tasks. When the task is complete, the results are saved to a tag under the model name. To load them, click the Tagging button on the search toolbar, enter the prediction model name, and click Load Tag in the button drop-down.
Review the results manually to determine whether the predicted items are indeed relevant.
Refine the model as necessary.
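For readers who want to see what the AUC (ROC) figure in the model verification step measures, the following Python sketch computes it directly from its definition: the probability that a randomly chosen responsive item is scored higher than a randomly chosen unresponsive one. The function names and sample scores are invented for illustration, and the label used for values between 0.9 and 1.0 is an assumption, since the guidance above does not cover that range.

```python
def roc_auc(labels, scores):
    """Fraction of (responsive, unresponsive) pairs ranked correctly.

    labels uses 1 for responsive and 0 for unresponsive; ties count half.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one item of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def interpret_auc(auc):
    """Apply the thresholds from the verification step to an AUC value."""
    if auc < 0.8:
        return "bad fit"
    if auc <= 0.9:
        return "target range"
    if auc < 1.0:
        # Not covered by the guidance above; labelled here as an assumption.
        return "above target range"
    return "likely overfitting"

# Invented scores for four responsive and four unresponsive items.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.55, 0.4, 0.6, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(auc, interpret_auc(auc))
```

In this example 14 of the 16 responsive/unresponsive pairs are ordered correctly, giving an AUC of 0.875, which falls inside the recommended range.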