Leveraging Deduplication to Streamline Information Management – From ROT Clean Up to eDiscovery

By Ed Moke

Whether you’re undertaking a data profiling project to understand the data across the enterprise, preserving legacy data from backup tapes, or collecting data for an eDiscovery event, the ability to identify and manage duplicate content is critical to the success of the initiative.

Why Deduplication Matters

Many tools use deduplication to filter and cull irrelevant data. Duplicates can represent up to 30% of what exists on corporate networks and up to 95% of what exists on legacy backups. Many of these tools rely on very simplistic metadata analysis, which can easily miss large volumes of duplicate content, making the discovery and classification task more complex and time-consuming. Without comprehensive and customizable deduplication technology, finding and managing data to support migration and governance becomes considerably harder.

Index Engines provides comprehensive and flexible deduplication features that adapt to the task at hand. In some cases it may be important to use metadata deduplication, while in others more comprehensive MD5 hashing is required. Either way, Index Engines provides deduplication options to support the workflow.

Configuring the Deduplication Method

Index Engines’ deduplication functionality is fully customizable and lets users adapt the deduplication method to a specific workflow. To start, let’s look at the properties that can be used to identify duplicates and how Index Engines implements each method.

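The available parameters include ‘Path’, ‘Size’, ‘Filename’, ‘Modification Time’, ‘Content’ and ‘Message id else Content’, along with the scope options ‘Owner’, ‘Family’ and ‘Production’.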


The user can select one or more of these items to use as the deduplication method. The customer workflow and the level of indexing determine which combination achieves the desired outcome. The ‘Path’, ‘Size’, ‘Filename’ and ‘Modification Time’ options are metadata properties of the file taken from the filesystem.
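To make the idea of combining metadata properties concrete, here is a minimal Python sketch. It is not Index Engines code: the FileRecord class, the field names and the find_unique helper are hypothetical, and the sketch only assumes that a deduplication method behaves like a composite key built from whichever properties are selected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    # Hypothetical stand-ins for the filesystem metadata properties.
    path: str
    filename: str
    size: int
    mtime: int  # modification time, seconds since the epoch


def dedup_key(record, properties):
    """Build a comparison key from the selected metadata properties."""
    return tuple(getattr(record, name) for name in properties)


def find_unique(records, properties):
    """Keep the first record seen for each key; later matches are duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = dedup_key(record, properties)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


records = [
    FileRecord("/share/a", "report.doc", 1024, 1700000000),
    FileRecord("/share/b", "report.doc", 1024, 1700000000),
    FileRecord("/share/b", "notes.txt", 200, 1700000500),
]

# Filename + size + modification time: the two report.doc copies collapse.
loose = find_unique(records, ("filename", "size", "mtime"))

# Adding the path makes every record distinct.
strict = find_unique(records, ("path", "filename", "size", "mtime"))
```

Selecting fewer properties makes the method more aggressive, while adding ‘Path’ treats copies in different folders as distinct files.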

When indexing data from tape, backup or network sources, Index Engines calculates an MD5 signature of each file and stores it in the index. When ‘Content’ is selected, this MD5 signature is used for deduplication. ‘Message id else Content’ is an option for projects that contain both email and loose files. It deduplicates email using the internal Message ID of each email, and uses the content signature for non-email files.
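The following Python sketch illustrates the two ideas in this paragraph under stated assumptions. The content signature uses the standard hashlib MD5 implementation, read in chunks so large files need not fit in memory. The item dictionaries, their field names and the message_id_else_content function are hypothetical and are not the product's internal data model.

```python
import hashlib
import io


def content_signature(stream, chunk_size=65536):
    """Return the MD5 hex digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def message_id_else_content(item):
    """Use the Message ID for email; fall back to the content signature."""
    message_id = item.get("message_id")
    if message_id:
        return ("message_id", message_id)
    return ("content", content_signature(io.BytesIO(item["data"])))


items = [
    # Two copies of one email whose bytes differ (e.g. an added footer).
    {"name": "a.msg", "message_id": "<123@example.com>", "data": b"Hello"},
    {"name": "b.msg", "message_id": "<123@example.com>", "data": b"Hello + footer"},
    # Two loose files with identical content.
    {"name": "c.txt", "message_id": None, "data": b"Quarterly numbers"},
    {"name": "d.txt", "message_id": None, "data": b"Quarterly numbers"},
]

keys = [message_id_else_content(item) for item in items]
```

In this sketch the two emails share a key even though their bytes differ, which is one reason a Message ID can be a better match for email than a raw content hash.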

The user can then set the scope of the deduplication. Selecting ‘Owner’ applies the selected deduplication method within the set of files owned by an email or Active Directory user. The ‘Family’ option deduplicates files across email families (e.g. when a query returns matches in different email families). The ‘Production’ option is used for implementing rolling delivery of files and for creating load files for eDiscovery workflows, and will be discussed in a subsequent post.
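One way to picture the ‘Owner’ scope is as an extra component added to the deduplication key, so that matches are only collapsed within a single owner's files. The Python sketch below assumes exactly that. The dictionaries, the owner and signature fields, and both helper functions are hypothetical illustrations rather than the product's implementation.

```python
def dedup_global(items):
    """Collapse every item that shares a content signature."""
    seen = set()
    unique = []
    for item in items:
        key = item["signature"]
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def dedup_within_owner(items):
    """Collapse matching signatures only when the owner is also the same."""
    seen = set()
    unique = []
    for item in items:
        key = (item["owner"], item["signature"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


items = [
    {"owner": "alice", "name": "budget.xls", "signature": "aa11"},
    {"owner": "alice", "name": "budget-copy.xls", "signature": "aa11"},
    {"owner": "bob", "name": "budget.xls", "signature": "aa11"},
]

global_unique = dedup_global(items)       # one copy overall
owner_unique = dedup_within_owner(items)  # one copy per owner
```

With owner scoping, each custodian keeps one copy of the file, which can matter when results need to show who held a document.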

Deduplication Process and Viewing Results

The deduplication method can be set or changed at any time, since Index Engines applies it at query time. The user query is run, the set of results is deduplicated using the method set in the project preferences, and the deduplicated results are returned to the user in the search GUI.

The search GUI lets the user view the unique (deduplicated) set of results and switch to viewing just the duplicate files or all the files.
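A rough Python sketch of query-time deduplication and the three views follows. It assumes the index stores raw properties and that the method is only a list of property names applied to the query hits. The search function, the index layout and the field names are all hypothetical.

```python
def search(index, predicate, method):
    """Run the query first, then deduplicate the hits with the current method."""
    hits = [item for item in index if predicate(item)]
    seen = set()
    unique, duplicates = [], []
    for item in hits:
        key = tuple(item[name] for name in method)
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
            unique.append(item)
    return {"unique": unique, "duplicates": duplicates, "all": hits}


def is_doc(item):
    """Example query: match Word documents by extension."""
    return item["filename"].endswith(".doc")


index = [
    {"filename": "plan.doc", "size": 10, "signature": "aa"},
    {"filename": "plan-copy.doc", "size": 10, "signature": "aa"},
    {"filename": "plan.doc", "size": 12, "signature": "bb"},
    {"filename": "memo.txt", "size": 5, "signature": "cc"},
]

# Same index, same query, two different deduplication methods.
by_content = search(index, is_doc, ("signature",))
by_name_size = search(index, is_doc, ("filename", "size"))

# A per-view count, similar in spirit to a summary report.
summary = {view: len(items) for view, items in by_content.items()}
```

Because the method is applied to the hits rather than baked into the index, switching methods in this sketch requires no re-indexing, only a new query.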


By selecting the set of files (unique files, duplicate files or all files), the user can quickly generate results and summary reports for any view.