Where to Find the Data Protected by the GDPR

by Tim Williams

You’ve decided to get serious about the GDPR, and are ready to take the first step: building a comprehensive data map of all your data under management. Maybe you are following the road map I outlined in my previous post How to Find the Data Protected by the GDPR. Maybe you are using a GDPR advisor or consultant – most of them have come to similar conclusions.

Once you have figured out how you are going to search for personal and pseudonymous data, the next step is figuring out where you are going to search. The question you need answered: is anyplace off limits to the GDPR?

Does the GDPR apply equally to all three classes of storage: primary, backup and archive?

Keep Reading on LinkedIn

How to Find the Data Protected by the GDPR

By Tim Williams

My company sells software used frequently by Data Governance professionals. Their backgrounds range from data management, to eDiscovery, to regulatory compliance, to data protection, cyber security and ransomware. Lately, their focus has been on the EU’s General Data Protection Regulation, or GDPR, which becomes effective next May.

The question I hear from them most is “how do I search for that kind of data?”.

If your background falls into one of those categories, but you haven’t heard of the GDPR, you should study up on it. I’ve written an introduction in a previous post: The Difference Between Privacy and Security: How to comply with the GDPR.

If you share their curiosity, I can tell you that it dictates some new processes, creates some new use cases for existing technology, and requires the ability to scale that technology to petabyte class data sets. But most of all, it requires an understanding of a regulation that, in my experience, is frequently being mistranslated to the tools they have used in past projects.

Keep Reading on LinkedIn

Privacy by Deletion: 5 Steps to Reducing Data Risk

Hosted by FTI Consulting and Index Engines, Wednesday, July 19 at 1 pm ET, 10 am PT

Organizations are seeing 40-60% annual growth in data capacity. As data grows, so does the significant data center cost and liability for your organization.

In the midst of this rapid data growth, how can IT and legal teams work together to better protect sensitive corporate data and stave off data breaches?

Join Jake Frazier, Senior Managing Director, FTI Consulting, Jim McGann, Vice President, Index Engines, and Anthony J. Diana, Partner, Reed Smith as they discuss how to mitigate these risks and reduce costs.

During this webinar, hosted by Bloomberg BNA, you’ll develop a greater understanding of:

• Data Pitfalls that are putting your organization at risk
• Must Ask Questions for your legal and IT teams
• Best Practices and Policies to practice in your own organization
• 5 Steps to Reducing Data Risk

Register Here

How Regular Expressions Empower Support for the GDPR and Privacy Assessments

By Ed Moke

Whether you’re preparing for the GDPR or monitoring your primary storage network or secondary backup data to identify and manage sensitive data, one of the key attributes of any solution is the ability to find ‘Personal Data.’ The GDPR defines ‘Personal Data’ as:

“any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”

Index Engines delivers several key features that allow for the identification of personal data, including data that is defined as personal based on specific geographies.

Leveraging the Index Engines search, classification and management capabilities, policies can be defined to support specific personal data based on a pattern or regex query. Once these policies are defined and stored, they can be run on online or offline data sources to support the GDPR requirement to access, rectify, erase, restrict or migrate personal information.

There are several popular pattern matching capabilities within the Index Engines application. These include Social Security and credit card numbers (American Express, Visa, MasterCard and Discover Card) as well as bank routing numbers. The advantage of embedding the patterns within the Index Engines processing platform is that the files and email containing these patterns are immediately identified at indexing time and tagged in the index. This enables rapid search to quickly return files containing any of the embedded patterns across large data sets.
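To make the idea of pattern-based identification concrete, here is a minimal Python sketch. It does not show Index Engines' built-in patterns, which are not reproduced in this post. The two regular expressions, the function names and the added Luhn checksum step are all illustrative assumptions.

```python
import re

# Illustrative patterns only (assumptions, not Index Engines' definitions).
# US Social Security number written as 3-2-4 digits with hyphens.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_personal_data(text: str) -> dict:
    """Return candidate SSNs and checksum-valid card numbers found in text."""
    ssns = SSN_RE.findall(text)
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return {"ssn": ssns, "card": cards}
```

The checksum step is there because a bare digit pattern also matches order numbers and other long identifiers. A regex hit is best treated as a candidate to review, not as proof of personal data.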

Read more about our Regex searches in our new Tech Primer

Supporting GDPR Compliance through Data Classification

by Jim McGann

The GDPR consists of 99 articles that mandate how data is to be handled, but how do you manage decades of data on various platforms?

Index Engines delivers speeds up to 1TB per hour using a single indexing node, with federated support for distributed environments. The index is highly compressed, typically only 1% of the original data size for metadata, allowing for extreme scalability. Architected for the enterprise, Index Engines will provide the knowledge to manage data across global environments.
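As a rough planning aid, the figures above can be turned into a back-of-the-envelope estimate. This sketch assumes the best-case rate of 1TB per hour per node, perfectly linear scaling across nodes, and an index of 1% of the source size. The function name and the linear-scaling assumption are mine, and real throughput will vary.

```python
def estimate_indexing(data_tb: float, nodes: int = 1,
                      tb_per_hour_per_node: float = 1.0,
                      index_ratio: float = 0.01) -> dict:
    """Estimate indexing time and metadata index size for a data set.

    Assumes throughput scales linearly with node count (an assumption,
    not a documented guarantee).
    """
    hours = data_tb / (tb_per_hour_per_node * nodes)
    return {
        "hours": hours,
        "days": hours / 24,
        "index_tb": data_tb * index_ratio,
    }
```

Under these assumptions, 500 TB spread over four nodes would take about 125 hours and produce a metadata index of roughly 5 TB.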

Learn more about our patented technology in our new offering “Index Engines’ Guide to: Supporting GDPR Compliance through Data Classification”

What’s Driving Data Center Classification? Privacy or Costs?

by Jim McGann

Data centers increase storage capacity year after year, relentlessly trying to keep up with prolific users. This strategy has worked since the cost of storage has plummeted over the years. Adding capacity has been solved through a simple call to the vendor of your choice. Problem solved?

Recently we have seen a surge in data classification projects. These projects are attacking a number of very different use cases. Here are some recent examples:

  1. Classification of 2.8PBs of user data on a shared network server. Found that 36% of the content was aged system files that had no business value. 1PB of storage capacity reclaimed. Happy client. Use Case: Cost Savings
  2. Security assessment on 25 network servers to determine if sensitive email/PSTs exist on the network. Found 2,376 PSTs, 57% had not been accessed in years, 28% were abandoned by ex-employees, and 32% were active and accessed in the last 6 months. Use Case: Privacy
  3. 676 TBs of high profile network storage classified for ROT (redundant, obsolete, trivial) analysis. 12% of data was purged. 19% was archived in the cloud. 275,000 files containing high value intellectual property were found and secured. Use Case: Cost Savings and Privacy

These projects started with the classification of unstructured data, providing the knowledge to make decisions and develop a disposition strategy. Many were initially driven by costs; however, the end game resulted in support for privacy.
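A classification pass of this kind can be sketched in a few lines of Python. The labels below echo the examples above (aged, abandoned, active), but the record fields, the three-year staleness threshold and the function names are hypothetical choices for illustration, not how any particular product defines them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    path: str
    size_bytes: int
    last_accessed: datetime
    owner_active: bool  # False if the owner has left the organization


def classify(rec: FileRecord, now: datetime,
             stale_after: timedelta = timedelta(days=3 * 365)) -> str:
    """Label a file as abandoned, aged or active (thresholds are assumptions)."""
    if not rec.owner_active:
        return "abandoned"
    if now - rec.last_accessed > stale_after:
        return "aged"
    return "active"


def summarize(records, now: datetime) -> dict:
    """Total bytes per label, the raw input for a disposition decision."""
    totals = {}
    for rec in records:
        label = classify(rec, now)
        totals[label] = totals.get(label, 0) + rec.size_bytes
    return totals
```

The summary is what supports both use cases: the byte totals speak to cost, and the abandoned and aged sets are where privacy review usually starts.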

Is cost the driving factor for the resurgence of data classification?

With the looming EU General Data Protection Regulation (GDPR), will privacy take a front seat?

Download our newest eBook, Harnessing Metadata for Streamlined Data Management and Governance, to find out how data classification can drive down costs and help privacy efforts.

Leveraging Deduplication to Streamline Information Management – From ROT Clean Up to eDiscovery

By Ed Moke

Whether you’re undertaking a data profiling project to understand the data across the enterprise, preserving legacy data from backup tapes, or collecting data for an eDiscovery event, the ability to identify and manage duplicate content is extremely important and powerful to the success of any of these initiatives.

Why Deduplication Matters

Many tools use deduplication to filter and cull irrelevant data. Duplicates can represent up to 30% of what exists on corporate networks and up to 95% on legacy backups. Many tools use very simplistic metadata analysis, which can easily miss large volumes of content, making the discovery and classification task more complex and time consuming. Without comprehensive and customizable deduplication technology, your task of finding and managing data to support migration and governance will be complex.

Index Engines provides comprehensive and flexible deduplication features. These features can adapt to the task at hand. In some cases it may be important to utilize metadata deduplication, while in others the more comprehensive MD5 hashing is required. Either way, Index Engines provides deduplication options to support the task at hand.

Configuring the Deduplication Method

Index Engines deduplication functionality is fully customizable and empowers users to adapt the deduplication method to a specific workflow. To start, let’s first talk about the various properties that can be used to identify duplicates and how Index Engines implements those methods.

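The parameters covered in this post are ‘Path’, ‘Size’, ‘Filename’, ‘Modification Time’, ‘Content’ and ‘Message id else Content’, with ‘Owner’, ‘Family’ and ‘Production’ available to set the scope.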


The user can select one or many items to use as the deduplication method. The customer workflow and level of indexing will allow the user to customize a deduplication method to achieve the desired outcome. The ‘Path’, ‘Size’, ‘Filename’ and ‘Modification Time’ are the metadata properties of the file from the filesystem.

When indexing data from tape, backup or network data, Index Engines calculates an MD5 signature of each file that is stored in the index. When selecting ‘Content’ for deduplication, this MD5 signature is used for deduplication. ‘Message id else Content’ is an option for users working on a project that has both email and loose files. This option will deduplicate email data using the internal Message ID within the individual emails and the content signature for non-email files.
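The ‘Content’ and ‘Message id else Content’ options can be illustrated with a short Python sketch. It shows the general technique only: the function names and the key prefixes are hypothetical, and nothing here describes how Index Engines stores signatures internally.

```python
import hashlib
from typing import Optional


def content_signature(data: bytes) -> str:
    """MD5 hex digest of the file content, used as a duplicate signature."""
    return hashlib.md5(data).hexdigest()


def dedup_key(data: bytes, message_id: Optional[str] = None) -> str:
    """'Message id else Content': use the email Message ID when present,
    otherwise fall back to the content signature."""
    if message_id:
        return "msgid:" + message_id
    return "md5:" + content_signature(data)
```

Two copies of the same email can differ byte for byte once they have passed through different mail stores, which is why keying email on Message ID can catch duplicates that a content hash alone would miss.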

The user can then select the scope of the deduplication by further selecting ‘Owner’, which will apply the selected deduplication method within the set of files owned by an email or Active Directory user. The ‘Family’ option is used to deduplicate files across email families (e.g. return query matches in different email families). The ‘Production’ option is used for implementing rolling delivery of files and for creating load files for eDiscovery workflows, and will be discussed in a subsequent post.

Deduplication Process and Viewing Results

The deduplication method can be set or changed at any time. Index Engines applies the deduplication method at query time. The user query is run and the set of results is then deduplicated, using the method set in the project preferences, and then the results are returned to the user in the search GUI.

The search GUI provides the power to the user to view the unique set of results (deduplicated) and switch between viewing just the duplicate files or all the files.
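Query-time deduplication with an optional scope can be sketched as follows. This is a simplified illustration, assuming results arrive as dictionaries. The `dedup_views` function, its `key` and `scope` parameters and the field names are hypothetical, not the product's API.

```python
def dedup_views(results, key, scope=None):
    """Split query results into unique, duplicate and all views.

    key:   function returning the dedup key for a result (e.g. its MD5,
           or a tuple of metadata fields such as size and filename).
    scope: optional function returning a grouping value (e.g. the owner);
           duplicates are then only detected within the same group.
    """
    seen = set()
    views = {"unique": [], "duplicates": [], "all": []}
    for item in results:
        views["all"].append(item)
        group = scope(item) if scope else None
        marker = (group, key(item))
        if marker in seen:
            views["duplicates"].append(item)
        else:
            seen.add(marker)
            views["unique"].append(item)
    return views
```

Because the key and scope are applied after the query runs, changing the method only changes how the same result set is partitioned. Nothing has to be re-indexed.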


By selecting the set of files, unique files, duplicate files or all files, the user can quickly generate results and summary reports for all views with relative ease.



What Symantec Got Wrong About Data Protection

by Tim Williams

On June 24, 2005, in the face of analyst skepticism, security giant Symantec’s “wary investors” finally approved its merger with storage giant Veritas in an all-stock deal valued at 13.5 billion dollars. Symantec CEO John W. Thompson’s rationale, as reported in the New York Times, was to “create a convenient one-stop shop for customers seeking computer security software, which Symantec sells, and data storage software, which Veritas sells”.

Ten years later, Symantec sold off the Veritas assets it acquired to private investors led by the Carlyle Group for 8 billion dollars, implying that the assets had lost 40% of their value under Symantec’s stewardship.

“Ironically”, according to CRN, “the fact that the legacy Symantec and legacy Veritas business never really did merge appears to be making the process to split the two easier than it might have been.” At the time, the split was justified by Symantec Chief Product Officer Matt Cain, who CRN quoted as saying “When we did a strategic analysis of the company, it became obvious quickly that the focus on data management and security were diverging quickly”.

Allow me to translate. At the time of the merger, Symantec saw this:

and after ten years of failing to find the synergy, they gave up. Or maybe they just didn’t try that hard, since the businesses “never really did merge“.

Meanwhile, something important changed in the market over those ten years that apparently went unappreciated at Symantec. The answer lies hidden in the two above referenced news sources. In 2005, the world saw Symantec as a “security” company and Veritas as a “data storage” company. But by 2015, while Cain was still characterizing Veritas as a player in data management, analysts more properly described the spinout as “the newest startup in the data protection market”.

Pop quiz:

What’s the difference between security and data protection?

Time’s up. Hand in your answers.

If you answered that data protection means the essential disaster recovery processes that every enterprise relies on, you only get partial credit. Why? Think of the problem from the point of view of the data itself, and what exactly it needs to be protected from:

It’s not just system failures, the traditional province of data protection. It’s also bad actors launching viruses and ransomware attacks on your data…in other words, Symantec’s data protection market. And it’s also bad governance policies for data retention and management, the ones forbidden by regulations like the GDPR (the “General Data Protection Regulation”), a market that was within the reach of the two companies’ combined assets.

Security is data protection. So is backup. So is archive. So are data profiling, classification, remediation and disposition.

So what did Symantec get wrong about Data Protection? At a point when the IT market was coming to agreement on a comprehensive understanding of data protection, they missed the opportunity to be the one vendor that could supply an implementation of that understanding. They didn’t look at the market from the point of view of the data, the way their customers do.

If you are interested in learning about the connective tissue we provide between these three data protection needs, you can visit our website here, contact us here, or register for a webinar here.