De-Duplication -- Different Tools, Different Results

If two emails are identical, shouldn’t they be considered duplicates?

Unfortunately, in eDiscovery it is not quite so simple. The industry standard is to calculate an MD5 hash value for every email in a population and then identify duplicate emails by matching hash values (this is referred to as de-duping). An MD5 hash value is the output of a mathematical algorithm that reduces its input to a fixed-length, 128-bit fingerprint. Identical input always produces the identical hash, which provides a way to identify each unique document. Ralph Losey has written some very thoughtful commentary on hash values. He makes a very interesting case for using hash values as the 21st-century replacement for Bates numbering.

The challenge is that each of the major eDiscovery software tools uses its own proprietary definition of the inputs used in calculating the hash values. In the hash value world, even a very small difference in those inputs produces a completely different hash, so two emails that are truly identical can be considered distinct. As a result, a certain set of documents ends up being reviewed and/or produced more than once. The table below summarizes the inputs used to calculate the hash values in three leading tools: Clearwell, LAW and IPRO. As you can see, each is different.

 

| Input field            | Clearwell | LAW | IPRO |
|------------------------|-----------|-----|------|
| From                   | Yes       | Yes | Yes  |
| To                     | Yes       | Yes | Yes  |
| Cc                     | Yes       | Yes | Yes  |
| Bcc                    | Yes       | Yes | Yes  |
| Subject of the email   | Yes       | No  | Yes  |
| Email date (sent date) | UTC       | No  | GMT  |
| Body content           | Yes       | Yes | Yes  |
| Attachment Names       | No        | Yes | Yes  |
| IntMsgID               | No        | Yes | No   |

Notes:

Yes indicates that the field is included in the hash computation.

No indicates that the field is not included in the hash computation.

IPRO hash methodology can be customized based on the settings outlined above.
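To make the effect of different inputs concrete, here is a minimal Python sketch. It is not any vendor's actual algorithm: the two profiles, the field names, the separator and the sample email are all assumptions made for illustration. It hashes the same pair of emails under a profile that includes the subject line and one that does not, mirroring the Subject row of the table above.

```python
import hashlib

# Hypothetical input profiles for illustration only; they are not the
# real field lists or algorithms of any eDiscovery tool.
PROFILES = {
    "with_subject": ["from", "to", "cc", "bcc", "subject", "body"],
    "without_subject": ["from", "to", "cc", "bcc", "body"],
}

def email_md5(email, fields):
    """Return the MD5 hex digest of the selected fields, joined by a separator."""
    joined = "\x1f".join(email.get(name, "") for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

original = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "cc": "",
    "bcc": "",
    "subject": "Q3 forecast",
    "body": "Numbers attached.",
}
# Same email, except for one trailing space in the subject line.
variant = dict(original, subject="Q3 forecast ")

for name, fields in PROFILES.items():
    same = email_md5(original, fields) == email_md5(variant, fields)
    print(f"{name}: {'duplicate' if same else 'distinct'}")
```

Under these assumptions, the profile that hashes the subject treats the two emails as distinct, while the profile that skips the subject treats them as duplicates. Neither result is wrong; they simply answer different questions.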

 

As a recent real-world example, we worked on an eDiscovery project where the custodian sent out an email to eight people within his company. By any reasonable standard, this means that there were eight exact duplicates of this email in the population set. However, the software tool used to process this data categorized this email as four different emails. The cause was that the company had various internal email servers (a fairly common occurrence in larger corporations), and each time the email was handed off to a different internal server, that server placed a slightly different time in one of the metadata fields.
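The scenario above can be approximated with a short Python sketch. Everything in it is hypothetical: the field names, the sample times, and the assumption that each of four servers stamped exactly two of the eight copies. It is not the processing tool's actual logic, only an illustration of how one server-stamped time field can split a set of duplicates.

```python
import hashlib
from collections import defaultdict

def group_by_hash(emails, fields):
    """Group email copies by the MD5 of the selected fields."""
    groups = defaultdict(list)
    for email in emails:
        joined = "\x1f".join(email[name] for name in fields)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        groups[digest].append(email["copy_id"])
    return dict(groups)

# Hypothetical data: eight copies of one email, stamped by four internal
# servers with slightly different times (two copies per server is an assumption).
server_times = ["09:00:01", "09:00:02", "09:00:03", "09:00:04"]
copies = [
    {
        "copy_id": f"copy-{i + 1}",
        "from": "custodian@example.com",
        "subject": "Project update",
        "body": "Please see the latest status.",
        "server_time": server_times[i // 2],
    }
    for i in range(8)
]

with_time = group_by_hash(copies, ["from", "subject", "body", "server_time"])
without_time = group_by_hash(copies, ["from", "subject", "body"])

print(len(with_time))     # 4 groups: one per distinct server time
print(len(without_time))  # 1 group: all eight copies are duplicates
```

Including the server-stamped time yields four groups, while leaving it out collapses all eight copies into one. Which fields belong in the hash is therefore a policy decision, not just a technical detail.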

Conclusion

Although each software tool calculates the hash values slightly differently, this does not necessarily mean that one tool is better or worse than another or that one is inherently more accurate. If hash values were to become the Bates stamp of the 21st century, the electronic discovery industry could benefit from a standard method of calculating hash values. Absent a standard, it is important to know which inputs your tool hashes so that you can recognize this issue when you run across it.

Big Changes in Early Case Assessment

There are some very exciting trends and developments going on in the Early Case Assessment (“ECA”) phase of litigation. ECA is a critical part of the litigation process because it is the time to perform a preliminary analysis of the merits of a case, the claims, the likely defenses and the estimated cost of the case. The ECA is usually conducted in the first 90 days after a case is filed. There is general agreement that ECA improves litigation outcomes. For example, a survey by LexisNexis showed that ECA results in favorable outcomes in 76% of cases and reduces litigation expenses in 50% of cases.

In the digital era, the BIG CHALLENGE is getting access to the electronic data early in the ECA process and having the tools that allow the legal team to evaluate the case based on a preliminary review of the evidence. This is both a technology challenge and a cost challenge. The good news is that there are now a number of early case assessment tools on the market that can solve this problem. We are big fans of Clearwell for this, and our clients are seeing the value.

The key benefits of using such a tool are:

  1. Speeding up access to client data. The documents can be fully indexed and available to review within hours rather than weeks.
  2. An easy-to-use web interface. This means it is available anywhere and anytime. There is no need to rely on internal IT resources and no need to purchase additional software or hardware.
  3. Collaboration between in-house counsel and outside counsel. It is very easy to have the legal team work together to examine key documents.

Effective use of an early case assessment tool makes it possible to prepare an Early Case Assessment in the digital era. A good understanding of the documents allows the legal team to prepare a more complete litigation strategy. It also helps lower the overall cost of the case by reducing the amount of data that needs to be processed for review and, correspondingly, the number of legal hours required for review. An added benefit is that the legal team will be able to create a more accurate budget for the case based on its insight into the data size and its nuances.