Top Five Questions to Ask When Choosing an E-Discovery Vendor

By David Rostov and Debora Motyka Jones

Clients often ask us how best to select an electronic discovery vendor. Important considerations include which questions to ask, how best to compare vendors, and which important issues are typically missed in the selection process. In particular, clients tell us that they struggle during the selection phase to assess a vendor’s quality and capabilities. Given the challenges of choosing the right vendor, we often hear that law firms default to making their decision almost exclusively on price.

We put together a short list of key questions that can help in the eDiscovery vendor selection process. 

 

 

 

Top Questions To Ask When Choosing an E-Discovery Vendor

  • Scope of Services

        What services does the vendor offer?

        If case parameters change, will the vendor be able to meet your needs and time frames?

        Are there volume benefits/discounts if you use multiple services (e.g. processing, hosting and production versus just hosting)?

        What services are sub-contracted out and does data ever leave the vendor’s site?

        What size or type of case is too big for the vendor?

        What have been the vendor’s toughest cases?

  • Expertise (Not all vendors are created equal, and it is not all about price)

        What is the vendor’s knowledge level of the technical issues?

        Are the vendor’s employees certified in the tools they use?

        What is the vendor’s level of understanding of the legal process?

        Are there legal professionals on staff?

        How does the vendor’s expertise compare to other vendors?

  • Quality of Services

        Is this a vendor with whom you could see yourself establishing a longer term relationship?

        How does the vendor ensure consistently high quality service that is both accurate and on time?

        Are errors tracked? What are considered errors? How are errors addressed?

        What do the references say about the vendor?

  • Customer Service

        What hours does the vendor operate?

        How available are the vendor’s employees during non-business hours?

        How much lead time is needed for processing and production?

        How are cases staffed?

        Who is the primary point of contact? Is it the same throughout the case? 

        What is the nature of the vendor’s project management team and approach?

        How are issues escalated?

  • Technical Specifications

        Does the vendor use proprietary or non-proprietary software, and what are the benefits/trade-offs?

        If the data is not being processed locally, what are the vendor’s FTP connection speeds and how do they compare with the law firm’s FTP speeds?

        What is the vendor’s policy on backing up data?

        What is the vendor’s policy regarding storing data?

 

 

De-Duplication -- Different Tools, Different Results

If two emails are identical, shouldn’t they be considered duplicates?

Unfortunately, in eDiscovery it is not quite so simple. The industry standard is to calculate an MD5 hash value for every email in a population and then identify the duplicate emails (this is referred to as de-duping). An MD5 hash value is the output of a complex mathematical algorithm; it acts as a fingerprint that, in practice, identifies each unique document. Ralph Losey has written some very thoughtful commentary on hash values. He makes a very interesting case for using hash values as the replacement for Bates numbering: the 21st century version of Bates numbering.

The challenge is that each of the major eDiscovery software tools uses its own proprietary definition of the inputs used in calculating the hash values. In the hash value world, even a very small difference in the inputs produces a different hash value, which means that two documents that are truly identical can be considered distinct. As a result, a certain set of documents ends up being reviewed and/or produced more than once. The table below summarizes the inputs used to calculate the hash values in three leading tools: Clearwell, LAW and IPRO. As you can see, each is different.

 

Input                     Clearwell   LAW   IPRO
------------------------  ---------   ---   ----
From                      Yes         Yes   Yes
To                        Yes         Yes   Yes
Cc                        Yes         Yes   Yes
Bcc                       Yes         Yes   Yes
Subject of the email      Yes         No    Yes
Email date (sent date)    UTC         No    GMT
Body content              Yes         Yes   Yes
Attachment Names          No          Yes   Yes
IntMsgID                  No          Yes   No

       

Notes:

Yes indicates that the input is included in the hash computation.

No indicates that the input is not included in the hash computation.

IPRO hash methodology can be customized based on the settings outlined above.
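
To make the effect of different inputs concrete, here is a minimal Python sketch. It is an illustration only: the two field profiles, the `email_hash` helper, the field separator and the sample email are all hypothetical, and none of them reproduces how Clearwell, LAW or IPRO actually build their hash values. It simply shows that the same pair of emails can be duplicates under one set of inputs and distinct under another.

```python
import hashlib

# Hypothetical field profiles for illustration only. They are NOT any
# vendor's actual algorithm; they differ only in whether the subject is hashed.
PROFILE_WITH_SUBJECT = ("from", "to", "cc", "bcc", "subject", "body")
PROFILE_WITHOUT_SUBJECT = ("from", "to", "cc", "bcc", "body")


def email_hash(email: dict, fields: tuple) -> str:
    """Return the MD5 hex digest of the selected fields, joined in a fixed order."""
    joined = "\x1f".join(email.get(name, "") for name in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Made-up sample message.
original = {
    "from": "alice@example.com",
    "to": "bob@example.com",
    "cc": "",
    "bcc": "",
    "subject": "Q3 forecast",
    "body": "Numbers attached.",
}
# The same message, except one system stored the subject with a trailing space.
copy = dict(original, subject="Q3 forecast ")

same_with_subject = (
    email_hash(original, PROFILE_WITH_SUBJECT) == email_hash(copy, PROFILE_WITH_SUBJECT)
)
same_without_subject = (
    email_hash(original, PROFILE_WITHOUT_SUBJECT) == email_hash(copy, PROFILE_WITHOUT_SUBJECT)
)

print(same_with_subject)     # False: the profile that hashes the subject sees two emails
print(same_without_subject)  # True: the profile that ignores the subject sees one
```

A one-character difference in a single hashed field is enough to change the result, which is why two tools with different input lists can disagree on the duplicate count for the same data set.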

 

As a recent real-world example, we worked on an eDiscovery project where the custodian sent an email to eight people within his company. By any reasonable standard, this means that there were eight exact duplicates of this email in the population set. However, the software tool used to process this data categorized this email as four different emails. The reason was that the company had various internal email servers (a fairly common occurrence in larger corporations), and each time the email was handed off to a different internal server, that server placed a slightly different time in one of the metadata fields.
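
The scenario can be sketched in a few lines of Python. Everything here is assumed for illustration: the field names, the timestamps, the `group_by_hash` helper, and the split of the eight copies into four server times of two copies each (the actual distribution in our project is not reproduced here). The point is only that hashing a server-stamped time field splits otherwise identical copies into separate groups.

```python
import hashlib
from collections import defaultdict


def group_by_hash(emails, fields):
    """Group emails by the MD5 digest of the chosen fields."""
    groups = defaultdict(list)
    for email in emails:
        joined = "\x1f".join(email[name] for name in fields)
        digest = hashlib.md5(joined.encode("utf-8")).hexdigest()
        groups[digest].append(email["copy_id"])
    return groups


# Hypothetical data: eight copies of one email, stamped by four internal
# servers with slightly different times (two copies per server, by assumption).
server_times = ["10:00:01", "10:00:01", "10:00:02", "10:00:02",
                "10:00:03", "10:00:03", "10:00:04", "10:00:04"]
copies = [
    {
        "copy_id": f"copy-{i + 1}",
        "from": "custodian@example.com",
        "sent": "10:00:00",
        "body": "Please review the attached.",
        "server_time": stamp,
    }
    for i, stamp in enumerate(server_times)
]

with_server_time = group_by_hash(copies, ("from", "sent", "body", "server_time"))
without_server_time = group_by_hash(copies, ("from", "sent", "body"))

print(len(with_server_time))     # 4 distinct "emails" when the server time is hashed
print(len(without_server_time))  # 1 email when only sender, sent time and body are hashed
```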

Conclusion

Although each software tool calculates the hash values slightly differently, this does not necessarily mean that one tool is better or worse than another or that one is inherently more accurate. If hash values were to become the Bates stamp of the 21st century, the electronic discovery industry could benefit from a standard method of calculating hash values. Absent a standard, it is important to be aware of this issue in case you run across it.

 

 

Not All TIFFs Are Created Equal

Processing of electronic discovery data can lead to interesting surprises in terms of the complexity and/or size of the data. This can sometimes make it challenging to accurately estimate a timeline for a project prior to loading the data and performing some preliminary analysis.

For example, we recently received 20 spreadsheets that needed to be converted into TIFF images and produced to opposing counsel. The client called and asked us for an estimated time to complete the project. Based on the fact that it was only 20 spreadsheets, we estimated that we would have this project completed within a few hours. Assuming 50 pages per spreadsheet, our estimate was that this was going to be about 1,000 TIFF images.

After we received the data, we loaded it in our system and created TIFF images of the spreadsheets. It turned out that the 20 spreadsheets generated close to 100,000 TIFF images or pages (an average of 5,000 pages per spreadsheet). One spreadsheet converted into approximately 20,000 TIFF images. This meant that the page count was almost 100 times bigger than we had expected. As a result, the project took longer than our original estimate. The good news was that most of the spreadsheets actually had a lot of blank pages and other “quirky” formatting issues. In the case of the 20,000 page spreadsheet, we were able to fix the formatting (without, of course, changing any of the original data), which reduced the spreadsheet to a few hundred pages. We were also able to reduce the page count for the other spreadsheets by a similar proportion. The additional time that we took to fix the formatting ended up saving counsel countless review hours and cost.

Bottom line: when requesting a firm timeline and cost estimate from an electronic discovery vendor, it is always best to give the vendor the actual data and request a preliminary analysis of the data before the estimate is finalized. This will ensure a much more realistic estimate.