De-Duplication -- Different Tools, Different Results
If two emails are identical, shouldn’t they be considered duplicates?
Unfortunately, in eDiscovery it is not quite so simple. The industry standard is to calculate an MD5 hash value for every email in a population and then identify the duplicates (a process referred to as de-duping). An MD5 hash value is the output of a mathematical algorithm that acts as a digital fingerprint: identical inputs always produce the same value, and virtually any change to the inputs produces a different one. Ralph Losey has written some very thoughtful commentary on hash values, including an interesting case for using them as the 21st-century replacement for Bates numbering.
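As a minimal sketch of hash-based de-duping, the Python below computes an MD5 value over a chosen set of email fields and keeps one email per distinct value. The function names, the dictionary layout, and the separator character are assumptions made for illustration; they do not describe how any particular eDiscovery tool works.

```python
import hashlib

def email_hash(email, fields):
    """Return an MD5 hex digest over the chosen fields of an email dict."""
    # Join the selected values with a separator that is unlikely to appear
    # in the data, so adjacent fields cannot run together.
    joined = "\x1f".join(str(email.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def dedupe(emails, fields):
    """Keep the first email seen for each distinct hash value."""
    seen = {}
    for email in emails:
        seen.setdefault(email_hash(email, fields), email)
    return list(seen.values())

# Hypothetical sample data: a and b are identical, c has a different body.
FIELDS = ["from", "to", "subject", "body"]
a = {"from": "alice@example.com", "to": "bob@example.com",
     "subject": "Lunch", "body": "Noon works."}
b = dict(a)
c = dict(a, body="Noon works!")

unique = dedupe([a, b, c], FIELDS)
print(len(unique))  # 2: a and b collapse into one, c stays separate
```

Note that a single changed character in the body is enough to give c its own hash value.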
The challenge is that each of the major eDiscovery software tools uses its own proprietary definition of the inputs to the hash calculation. With hash values, even a very small difference in those inputs means that two documents that are truly identical are treated as distinct. As a result, some documents end up being reviewed and/or produced more than once. The table below summarizes the inputs used to calculate hash values in three leading tools: Clearwell, LAW and IPRO. As you can see, each one is different.
| Hash input | Clearwell | LAW | IPRO |
|---|---|---|---|
| From | Yes | Yes | Yes |
| To | Yes | Yes | Yes |
| Cc | Yes | Yes | Yes |
| Bcc | Yes | Yes | Yes |
| Subject of the email | Yes | No | Yes |
| Email date (sent date) | UTC | No | GMT |
| Body content | Yes | Yes | Yes |
| Attachment Names | No | Yes | Yes |
| IntMsgID | No | Yes | No |

Notes:
Yes indicates it is included in the hash computation.
No indicates that it is not included in the hash computation.
IPRO hash methodology can be customized based on the settings outlined above.
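The effect of these different input sets can be sketched in Python. The field lists below are transcribed from the table, but the field names, their ordering, the separator, and the sample emails are assumptions for illustration only; the actual normalization and concatenation each vendor performs is proprietary and is not reproduced here.

```python
import hashlib

# Field sets transcribed from the table; names and ordering are hypothetical
# and do not represent any vendor's actual method.
PROFILES = {
    "Clearwell": ["from", "to", "cc", "bcc", "subject", "sent_utc", "body"],
    "LAW": ["from", "to", "cc", "bcc", "body", "attachment_names",
            "int_msg_id"],
    "IPRO": ["from", "to", "cc", "bcc", "subject", "sent_utc", "body",
             "attachment_names"],
}

def email_hash(email, fields):
    """Return an MD5 hex digest over the chosen fields of an email dict."""
    joined = "\x1f".join(str(email.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def is_duplicate(a, b, tool):
    """True if the two emails hash identically under the tool's field set."""
    return email_hash(a, PROFILES[tool]) == email_hash(b, PROFILES[tool])

# Hypothetical sample email and two near-copies of it.
base = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "cc": "",
    "bcc": "",
    "subject": "Q3 report",
    "sent_utc": "2010-03-01T14:05:00Z",
    "body": "Report attached.",
    "attachment_names": "Q3-report.pdf",
    "int_msg_id": "0001",
}
renamed_attachment = dict(base, attachment_names="Q3-report-final.pdf")
different_subject = dict(base, subject="FW: Q3 report")

for tool in PROFILES:
    print(tool,
          is_duplicate(base, renamed_attachment, tool),
          is_duplicate(base, different_subject, tool))
```

In this sketch, only the Clearwell-style profile treats the copy with the renamed attachment as a duplicate, and only the LAW-style profile treats the copy with the changed subject as a duplicate. The same pair of emails can therefore be de-duped by one tool and kept as two documents by another.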
As a recent real-world example, we worked on an eDiscovery project in which the custodian sent an email to eight people within his company. By any reasonable standard, there were eight exact duplicates of this email in the population. However, the software tool used to process the data categorized them as four different emails. The company had several internal email servers (a fairly common occurrence in larger corporations), and each time the email was handed off to a different internal server, a slightly different time was placed in one of the metadata fields.
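A scenario like this one can be reproduced in a few lines of Python. Everything specific here is invented for illustration: the `server_time` field name, the four timestamp values, and the addresses are hypothetical stand-ins, since the project's actual metadata field and values are not given above.

```python
import hashlib
from collections import defaultdict

def email_hash(email, fields):
    """Return an MD5 hex digest over the chosen fields of an email dict."""
    joined = "\x1f".join(str(email.get(name, "")) for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

def group_by_hash(emails, fields):
    """Map each distinct hash value to the list of emails that produced it."""
    groups = defaultdict(list)
    for email in emails:
        groups[email_hash(email, fields)].append(email)
    return groups

# Hypothetical: four internal servers, each stamping a slightly different time.
server_times = [
    "2010-03-01T14:05:01Z",
    "2010-03-01T14:05:02Z",
    "2010-03-01T14:05:03Z",
    "2010-03-01T14:05:04Z",
]

# Eight copies of the same message, one collected from each recipient.
copies = [
    {
        "from": "custodian@example.com",
        "subject": "Budget",
        "body": "See attached.",
        "collected_from": f"user{i}@example.com",  # not part of the hash
        "server_time": server_times[i % 4],
    }
    for i in range(8)
]

with_time = group_by_hash(copies, ["from", "subject", "body", "server_time"])
without_time = group_by_hash(copies, ["from", "subject", "body"])

print(len(with_time))     # 4 distinct "emails" when the server time is hashed
print(len(without_time))  # 1 when it is left out
```

The content of all eight copies is identical; only the choice to hash a server-stamped time field splits them into four groups.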
Conclusion
Although each software tool calculates the hash values slightly differently, this does not necessarily mean that one tool is better or worse than another or that one is inherently more accurate. If hash values were to become the Bates stamp of the 21st century, the electronic discovery industry could benefit from a standard method of calculating hash values. Absent a standard, it is important to be aware of this issue in case you run across it.