De-Duplication -- Different Tools, Different Results

If two emails are identical, shouldn’t they be considered duplicates?

Unfortunately, in eDiscovery it is not quite so simple. The industry standard is to calculate an MD5 hash value for every email in a population and then identify the duplicates (a process referred to as de-duping). An MD5 hash value is the 128-bit output of a hashing algorithm; identical inputs always produce the same value, so it serves as a fingerprint for each unique document. Ralph Losey has written some very thoughtful commentary on hash values. He makes an interesting case for using hash values as the replacement for Bates numbering: the 21st century version of Bates numbering.

The challenge is that each of the major eDiscovery software tools uses its own proprietary definition of the inputs to the hash calculation. With hash values, even a very small difference in the input produces a completely different hash, so two emails that are truly identical can be treated as distinct. As a result, some documents are reviewed and/or produced more than once. The table below summarizes the inputs used to calculate the hash values in three leading tools: Clearwell, LAW and IPRO. As you can see, each is different.

 

| Input                  | Clearwell | LAW | IPRO |
|------------------------|-----------|-----|------|
| From                   | Yes       | Yes | Yes  |
| To                     | Yes       | Yes | Yes  |
| Cc                     | Yes       | Yes | Yes  |
| Bcc                    | Yes       | Yes | Yes  |
| Subject of the email   | Yes       | No  | Yes  |
| Email date (sent date) | UTC       | No  | GMT  |
| Body content           | Yes       | Yes | Yes  |
| Attachment Names       | No        | Yes | Yes  |
| IntMsgID               | No        | Yes | No   |

Notes:

Yes indicates that the input is included in the hash computation.

No indicates that the input is not included in the hash computation.

IPRO hash methodology can be customized based on the settings outlined above.
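To make the effect of these differences concrete, here is a minimal Python sketch. It treats the UTC and GMT entries in the table as meaning the sent date is included, and everything else about it is an assumption made for illustration: the profile names, the dictionary keys, the field order, and the way fields are joined before hashing are hypothetical and are not the vendors' actual algorithms.

```python
import hashlib

# Field selections modeled on the table above. The profile names, the key
# names, and the join/ordering scheme are illustrative assumptions only.
PROFILES = {
    "clearwell_like": ["from", "to", "cc", "bcc", "subject", "sent", "body"],
    "law_like": ["from", "to", "cc", "bcc", "body", "attachments", "int_msg_id"],
    "ipro_like": ["from", "to", "cc", "bcc", "subject", "sent", "body", "attachments"],
}


def email_hash(email, fields):
    """Return the MD5 hex digest of the selected fields of one email."""
    parts = [f"{name}={email.get(name, '')}" for name in fields]
    # Join with a separator that is unlikely to occur inside a field value.
    return hashlib.md5("\x1f".join(parts).encode("utf-8")).hexdigest()


def count_unique(emails, fields):
    """Count the distinct hashes, i.e. the emails left after de-duping."""
    return len({email_hash(e, fields) for e in emails})


base = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "cc": "",
    "bcc": "",
    "subject": "Q1 report",
    "sent": "2010-03-01T14:15:02Z",
    "body": "Please see the attached report.",
    "attachments": "report.pdf",
    "int_msg_id": "A1",
}
# Same message, but stored with a different internal message ID.
copy_other_id = dict(base, int_msg_id="B2")
# Same message, but the attachment was saved under a different name.
copy_renamed = dict(base, attachments="Report_final.pdf")

emails = [base, copy_other_id, copy_renamed]
results = {name: count_unique(emails, fields) for name, fields in PROFILES.items()}
print(results)
```

With these three copies of one message, the Clearwell-like profile keeps one unique email, the LAW-like profile keeps three, and the IPRO-like profile keeps two, purely because of which fields feed the hash.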

 

As a recent real-world example, we worked on an eDiscovery project where the custodian sent an email to eight people within his company. By any reasonable standard, there were eight exact duplicates of this email in the population. However, the software tool used to process the data categorized it as four different emails. The company had several internal email servers (a fairly common occurrence in larger corporations), and each time the email was handed off to a different internal server, that server placed a slightly different time in one of the metadata fields.
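The sensitivity to time values can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the helper names, the sample timestamps, and the idea of hashing the date together with the body are hypothetical, and no particular tool is claimed to work this way. It shows that converting timestamps to UTC reconciles copies that differ only in time zone offset, while copies stamped a few seconds apart still hash differently.

```python
import hashlib
from datetime import datetime, timezone


def normalized_sent(iso_timestamp):
    """Convert an ISO 8601 timestamp with a UTC offset to a UTC string."""
    dt = datetime.fromisoformat(iso_timestamp)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def hash_with_date(body, sent_value):
    """Hypothetical hash over the date and body only, for illustration."""
    return hashlib.md5(f"{sent_value}\x1f{body}".encode("utf-8")).hexdigest()


body = "Please see the attached report."

# The same instant, written with two different time zone offsets.
sent_new_york = "2010-03-01T09:15:02-05:00"
sent_london = "2010-03-01T14:15:02+00:00"
# A copy stamped three seconds later by another server.
sent_skewed = "2010-03-01T14:15:05+00:00"

# Hashing the raw strings treats the two offsets as different emails.
raw_match = hash_with_date(body, sent_new_york) == hash_with_date(body, sent_london)

# Normalizing to UTC first makes the two copies hash identically.
utc_match = (
    hash_with_date(body, normalized_sent(sent_new_york))
    == hash_with_date(body, normalized_sent(sent_london))
)

# A three-second difference survives normalization, so the hashes differ.
skew_match = (
    hash_with_date(body, normalized_sent(sent_london))
    == hash_with_date(body, normalized_sent(sent_skewed))
)

print(raw_match, utc_match, skew_match)
```

Time zone normalization alone therefore would not have merged emails whose recorded times genuinely differ from copy to copy, which is the situation described in this example.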

Conclusion

Although each software tool calculates the hash values slightly differently, this does not necessarily mean that one tool is better or worse than another or that one is inherently more accurate. If hash values were to become the Bates stamp of the 21st century, the electronic discovery industry could benefit from a standard method of calculating hash values. Absent a standard, it is important to be aware of this issue in case you run across it.

 

 

Comments (2)
Pathik Jayani - March 10, 2010 12:04 PM

I think there might be some other reason why only four instances were identified as duplicates. When an email is sent, the sent date/time is recorded in the protocol header of the email, and it always remains the same even if the message is passed along internally by different email servers. That time is recorded as the sent date/time. The receive time might differ (depending on speed and how busy the email server is at that particular point in time), but that does not affect the hash at all. If recipients are located in different time zones (much more likely in larger organizations), then time corrections must be performed before generating the hash.

David Rostov - March 15, 2010 10:22 AM

Pathik,

This is a very good comment. We had a recent case where this is exactly what happened. The exact same emails were not considered dupes due to the fact that the time on the various servers in the corporation was slightly off from each other.

David
