Privacy by Deletion: 5 Steps to Reducing Data Risk

Hosted by FTI Consulting and Index Engines, Wednesday, July 19 at 1 pm ET, 10 am PT

Organizations are seeing 40-60% annual growth in data capacity. As data grows, so do data center costs and liability for your organization.

In the midst of this rapid data growth, how can IT and legal teams work together to better protect sensitive corporate data and stave off data breaches?

Join Jake Frazier, Senior Managing Director, FTI Consulting; Jim McGann, Vice President, Index Engines; and Anthony J. Diana, Partner, Reed Smith, as they discuss how to mitigate these risks and reduce costs.

During this webinar, hosted by Bloomberg BNA, you’ll develop a greater understanding of:

• Data Pitfalls that are putting your organization at risk
• Must Ask Questions for your legal and IT teams
• Best Practices and Policies to practice in your own organization
• 5 Steps to Reducing Data Risk

Register Here

How Regular Expressions Empower Support for the GDPR and Privacy Assessments

By Ed Moke

Whether you’re preparing for the GDPR or monitoring your primary storage network or secondary backup data to identify and manage sensitive data, one of the key attributes of any solution is the ability to find ‘Personal Data.’ The GDPR defines ‘Personal Data’ as:

“any information relating to an identified or identifiable natural person; an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person”

Index Engines delivers several key features that allow for the identification of personal data, including data that is defined as personal based on specific geographies.

Leveraging the Index Engines search, classification and management capabilities, policies can be defined to support specific personal data based on a pattern or regex query. Once these policies are defined and stored, they can be run on online or offline data sources to support the GDPR requirement to access, rectify, erase, restrict or migrate personal information.

There are several popular pattern matching capabilities within the Index Engines application. These include Social Security and credit card numbers (American Express, Visa, MasterCard and Discover Card) as well as bank routing numbers. The advantage of embedding the patterns within the Index Engines processing platform is that the files and email containing these patterns are immediately identified at indexing time and tagged in the index. This enables rapid search to quickly return files containing any of the embedded patterns across large data sets.
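To make the idea of tagging at indexing time concrete, here is a minimal Python sketch. It is not the Index Engines implementation: the three patterns, the `build_tag_index` helper and the sample documents are all assumptions made for illustration, and the regexes are deliberately simplified.

```python
import re
from collections import defaultdict

# Hypothetical, simplified patterns for illustration only.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "visa": re.compile(r"\b4\d{3}(?:[ -]?\d{4}){3}\b"),
    "amex": re.compile(r"\b3[47]\d{2}[ -]?\d{6}[ -]?\d{5}\b"),
}

def build_tag_index(documents):
    """Scan each document once and record which patterns it contains."""
    tag_index = defaultdict(set)
    for name, text in documents.items():
        for tag, pattern in PATTERNS.items():
            if pattern.search(text):
                tag_index[tag].add(name)
    return tag_index

# Made-up sample documents.
documents = {
    "hr/payroll.txt": "Employee SSN: 078-05-1120",
    "mail/order.eml": "Card 4111 1111 1111 1111 exp 01/30",
    "notes/todo.txt": "Call the vendor on Monday",
}

tag_index = build_tag_index(documents)
# A later search for "files containing a Visa number" is a dictionary
# lookup rather than a rescan of every document.
visa_files = tag_index.get("visa", set())
```

Because the scan happens once, when the data is indexed, each later query only reads the tags, which is why this kind of search stays fast on large data sets.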

Read more about our Regex searches in our new Tech Primer

Supporting GDPR Compliance through Data Classification

by Jim McGann

The GDPR consists of 99 articles that mandate how data is to be handled, but how do you manage decades of data on various platforms?

Index Engines delivers speeds up to 1TB per hour using a single indexing node, with federated support for distributed environments. The index is highly compressed, typically only 1% of the original data size for metadata, allowing for extreme scalability. Architected for the enterprise, Index Engines will provide the knowledge to manage data across global environments.

Learn more about our patented technology in our new offering, “Index Engines’ Guide to: Supporting GDPR Compliance through Data Classification.”

What’s Driving Data Center Classification? Privacy or Costs?

by Jim McGann

Data centers increase storage capacity year after year, relentlessly trying to keep up with prolific users. This strategy has worked since the cost of storage has plummeted over the years. Adding capacity has been solved through a simple call to the vendor of your choice. Problem solved?

Recently we have seen a surge in data classification projects. These projects are attacking a number of very different use cases. Here are some recent examples:

  1. Classification of 2.8PBs of user data on a shared network server. Found that 36% of the content was aged system files that had no business value. 1PB of storage capacity reclaimed. Happy client. Use Case: Cost Savings
  2. Security assessment on 25 network servers to determine if sensitive email/PSTs exist on the network. Found 2,376 PSTs, 57% had not been accessed in years, 28% were abandoned by ex-employees, and 32% were active and accessed in the last 6 months. Use Case: Privacy
  3. 676 TBs of high profile network storage classified for ROT (redundant, obsolete, trivial) analysis. 12% of data was purged. 19% was archived in the cloud. 275,000 files containing high value intellectual property were found and secured. Use Case: Cost Savings and Privacy

These projects started with the classification of unstructured data, providing the knowledge to make decisions and develop a disposition strategy. Many were initially driven by costs; however, the end game resulted in support for privacy.

Is cost the driving factor for the resurgence of data classification?

With the looming EU General Data Protection Regulation (GDPR), will privacy take a front seat?

Download our newest eBook, Harnessing Metadata for Streamlined Data Management and Governance to find out how data classification can drive down costs and help privacy efforts.

Leveraging Deduplication to Streamline Information Management – From ROT Clean Up to eDiscovery

By Ed Moke

Whether you’re undertaking a data profiling project to understand the data across the enterprise, preserving legacy data from backup tapes, or collecting data for an eDiscovery event, the ability to identify and manage duplicate content is extremely important to the success of any of these initiatives.

Why Deduplication Matters

Many tools use deduplication to filter and cull irrelevant data. Duplicates can represent up to 30% of what exists on corporate networks and up to 95% on legacy backups. Many tools use very simplistic metadata analysis, which can easily miss large volumes of content, making the discovery and classification task more complex and time consuming. Without comprehensive and customizable deduplication technology, your task of finding and managing data to support migration and governance will be complex.

Index Engines provides comprehensive and flexible deduplication features. These features can adapt to the task at hand. In some cases it may be important to utilize metadata deduplication, while in others the use of more comprehensive MD5 hashing is required. Either way, Index Engines provides deduplication options to support the task at hand.

Configuring the Deduplication Method

Index Engines deduplication functionality is fully customizable and empowers users to adapt the deduplication method to a specific workflow. To start, let’s first talk about the various properties that can be used to identify duplicates and how Index Engines implements those methods.

The available parameters are described below.


The user can select one or many items to use as the deduplication method. The customer workflow and level of indexing will allow the user to customize a deduplication method to achieve the desired outcome. The ‘Path’, ‘Size’, ‘Filename’ and ‘Modification Time’ are the metadata properties of the file from the filesystem.

When indexing data from tape, backup or network data, Index Engines calculates an MD5 signature of each file that is stored in the index. When selecting ‘Content’ for deduplication, this MD5 signature is used for deduplication. ‘Message id else Content’ is an option for users working on a project that has both email and loose files. This option will deduplicate email data using the internal Message ID within the individual emails and the content signature for non-email files.
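The ‘Message id else Content’ behavior can be sketched in a few lines of Python. This is an illustration of the idea only, not the product's code: the item dictionaries, the `message_id` field name and the helper functions are assumptions, and the sample data is made up.

```python
import hashlib

def content_signature(data: bytes) -> str:
    """MD5 signature of the file content, as a hex string."""
    return hashlib.md5(data).hexdigest()

def dedup_key(item: dict) -> str:
    """Use the Message ID for email, else the content signature."""
    message_id = item.get("message_id")
    if message_id:
        return "msgid:" + message_id
    return "md5:" + content_signature(item["data"])

def deduplicate(items):
    """Split items into first occurrences and later duplicates."""
    seen = set()
    unique, duplicates = [], []
    for item in items:
        key = dedup_key(item)
        if key in seen:
            duplicates.append(item)
        else:
            seen.add(key)
            unique.append(item)
    return unique, duplicates

# Made-up sample: two copies of a loose file and two copies of one email.
items = [
    {"path": "/share/a/plan.doc", "data": b"Q3 plan"},
    {"path": "/backup/plan_copy.doc", "data": b"Q3 plan"},
    {"path": "inbox/1.eml", "message_id": "<abc@example.com>", "data": b"hello"},
    {"path": "sent/7.eml", "message_id": "<abc@example.com>", "data": b"hello (sent copy)"},
    {"path": "/share/a/notes.txt", "data": b"notes"},
]
unique, duplicates = deduplicate(items)
```

Note that the two email copies are treated as duplicates even though their bytes differ, because the Message ID is what identifies them. The renamed loose file is caught by its content signature.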

The user can then select the scope of the deduplication by further selecting ‘Owner’, which will apply the selected deduplication method within the set of files owned by an email or Active Directory user. The ‘Family’ option is used to deduplicate files across email families (e.g. return query matches in different email families). The ‘Production’ option is used for implementing rolling delivery of files and for creating load files for eDiscovery workflows, and will be discussed in a subsequent post.

Deduplication Process and Viewing Results

The deduplication method can be set or changed at any time. Index Engines applies the deduplication method at query time. The user query is run and the set of results is then deduplicated, using the method set in the project preferences, and then the results are returned to the user in the search GUI.

The search GUI lets the user view the unique (deduplicated) set of results and switch to viewing just the duplicate files or all the files.


By selecting the set of files (unique files, duplicate files or all files), the user can quickly generate results and summary reports for each view.
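As a rough sketch of deduplication applied at query time, the Python below takes a result set and a tuple of metadata properties and returns the three views. The function name, the record layout and the field names are assumptions for illustration, not the product's API.

```python
def dedup_views(results, method=("filename", "size")):
    """Deduplicate query results on the chosen properties and return all three views."""
    seen = set()
    views = {"unique": [], "duplicates": [], "all": list(results)}
    for record in results:
        key = tuple(record[field] for field in method)
        if key in seen:
            views["duplicates"].append(record)
        else:
            seen.add(key)
            views["unique"].append(record)
    return views

# Made-up query results: the same file name and size in two folders.
results = [
    {"path": "/share/a/plan.doc", "filename": "plan.doc", "size": 1024},
    {"path": "/share/b/plan.doc", "filename": "plan.doc", "size": 1024},
    {"path": "/share/a/notes.txt", "filename": "notes.txt", "size": 200},
]

by_name_size = dedup_views(results, ("filename", "size"))
by_path = dedup_views(results, ("path", "filename", "size"))
```

Because the method is only applied when the query runs, changing it (here, adding the path) changes which files count as duplicates without re-indexing anything.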



What Symantec Got Wrong About Data Protection

by Tim Williams

On June 24, 2005, in the face of analyst skepticism, security giant Symantec’s “wary investors” finally approved its merger with storage giant Veritas in an all-stock deal valued at 13.5 billion dollars. Symantec CEO John W. Thompson’s rationale, as reported in the New York Times, was to “create a convenient one-stop shop for customers seeking computer security software, which Symantec sells, and data storage software, which Veritas sells”.

Ten years later, Symantec sold off the Veritas assets it acquired to private investors led by the Carlyle Group for 8 billion dollars, implying that the assets had lost 40% of their value under Symantec’s stewardship.

“Ironically”, according to CRN, “the fact that the legacy Symantec and legacy Veritas business never really did merge appears to be making the process to split the two easier than it might have been.” At the time, the split was justified by Symantec Chief Product Officer Matt Cain, whom CRN quoted as saying “When we did a strategic analysis of the company, it became obvious quickly that the focus on data management and security were diverging quickly”.

Allow me to translate. At the time of the merger, Symantec saw this:

and after ten years of failing to find the synergy, they gave up. Or maybe they just didn’t try that hard, since the businesses “never really did merge”.

Meanwhile, something important changed in the market over those ten years that apparently went unappreciated at Symantec. The answer lies hidden in the two news sources referenced above. In 2005, the world saw Symantec as a “security” company and Veritas as a “data storage” company. But by 2015, while Cain was still characterizing Veritas as a player in data management, analysts more properly described the spinout as “the newest startup in the data protection market”.

Pop quiz:

What’s the difference between security and data protection?

Time’s up. Hand in your answers.

If you answered that data protection means the essential disaster recovery processes that every enterprise relies on, you only get partial credit. Why? Think of the problem from the point of view of the data itself, and what exactly it needs to be protected from:

It’s not just system failures, the traditional province of data protection. It’s also bad actors launching viruses and ransomware attacks on your data…in other words, Symantec’s data protection market. And it’s also bad governance policies for data retention and management, the ones forbidden by regulations like the GDPR (the “General Data Protection Regulation”), a market that was within the reach of the two companies’ combined assets.

Security is data protection. So is backup. So is archive. So are data profiling, classification, remediation and disposition.

So what did Symantec get wrong about Data Protection? At a point when the IT market was coming to agreement on a comprehensive understanding of data protection, they missed the opportunity to be the one vendor that could supply an implementation of that understanding. They didn’t look at the market from the point of view of the data, the way their customers do.

If you are interested in learning about the connective tissue we provide between these three data protection needs, you can visit our website here, contact us here, or register for a webinar here.

Eating the Elephant. How to Get Started with Data Governance

by Tim Williams

Memory is malleable. Not only does our ability to recall the past degrade over time, but we often find ourselves remembering things that never happened. Our memory can even be altered by outside influences. According to memory scholar Elizabeth Loftus, it’s possible to produce false memories in a person either intentionally through overt manipulation, or unintentionally through prompting with misleading cues.

I was reminded of this while reading Exterro Content Marketing Manager Jim Gill’s blog post 4 Data Mapping Challenges and How to Overcome Them. Data maps, high level functional reports of a company’s data under management, are essential prerequisites for building a data compliance program. Gill’s focus is on data maps used to respond to litigation, but the principles would apply to any other governance initiative. Gill warns that many companies start data mapping projects only to abandon them before completion. Nevertheless, he writes that systematic interviews with data stewards are the most efficient way to collect info for a data map.

That’s not my experience. The people that manage the storage have zero insight into the content of the data under management, and frequently weren’t even employed when the data store was implemented. The people that created the data have long ago lost track of most of it, and the creators may not even work there anymore. Trying to build a data map by relying on the memory of either will generate highly inaccurate results. And when you add Gill’s four challenges (too time consuming to build, impossible to keep information up to date, incomplete information, and not comprehensive), it is understandable why most data mapping projects are considered failures.

How to get started? Well, start with the assumption that memory is at best a rough approximation, and that the best way to help someone recall the truth is to provide them with detailed and reliable cues grounded in facts. Start with the data. Organize it, classify it, present summary and detailed reports of it to your data stewards and data owners. Get them working directly with it, discovering what’s really there, and building the governance rules based upon what actually exists, rather than what they remember.

But don’t make the mistake of trying to do it all at once. Contrary to popular wisdom, the best way to eat an elephant is not one bite at a time. Really big problems resist solutions that involve breaking them up into small pieces and tackling each piece one by one. As Mike Martel warns, pretty soon you are going to get really sick of the taste of elephant and give up.

Start with a high level outline, and fill in the details iteratively. Let technology do the heavy lifting. Leverage a petabyte-class indexing and classification platform that can scale to meet the needs of massive data centers, one that can focus back and forth on the data like a camera does on the world, from wide angle landscapes to high resolution detailed shots.

The real problem with taking it step by step is that most people lose interest and end up quitting – Mike Martel

Your first pass should be focused on remembering…getting a rough idea of what types of data are stored where. Classify the data based upon the file system metadata only. You can get an estimate of file types from their names, a sense of the amount of storage they consume from their sizes, an estimate of the storage wasted from the metadata deduplication, and a sense of what’s valuable from last access and modification times. Share that information with the data stewards during your first interview with them and you will be surprised at how eye-opening the conversation will be.
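A first pass like this needs nothing more than the metadata itself. The Python sketch below summarizes a handful of made-up file records. The record fields, the one-year staleness threshold and the choice of name, size and modification time as the duplicate key are all assumptions for illustration.

```python
import os
from collections import Counter
from datetime import datetime, timedelta

def first_pass_summary(records, now, stale_after_days=365):
    """Summarize storage by file type, metadata duplicates and stale data."""
    bytes_by_type = Counter()
    seen = set()
    duplicate_bytes = 0
    stale_bytes = 0
    cutoff = now - timedelta(days=stale_after_days)
    for rec in records:
        # File type estimated from the name only; content is never opened.
        ext = os.path.splitext(rec["name"])[1].lower() or "(none)"
        bytes_by_type[ext] += rec["size"]
        # Metadata deduplication: same name, size and modification time.
        key = (rec["name"], rec["size"], rec["modified"])
        if key in seen:
            duplicate_bytes += rec["size"]
        seen.add(key)
        # Data not accessed since the cutoff is a candidate for review.
        if rec["accessed"] < cutoff:
            stale_bytes += rec["size"]
    return {
        "bytes_by_type": dict(bytes_by_type),
        "duplicate_bytes": duplicate_bytes,
        "stale_bytes": stale_bytes,
    }

# Made-up records standing in for a file system scan.
records = [
    {"name": "report.docx", "size": 2000,
     "modified": datetime(2012, 3, 1), "accessed": datetime(2012, 3, 5)},
    {"name": "report.docx", "size": 2000,
     "modified": datetime(2012, 3, 1), "accessed": datetime(2012, 4, 1)},
    {"name": "budget.xlsx", "size": 500,
     "modified": datetime(2017, 6, 1), "accessed": datetime(2017, 6, 20)},
]
summary = first_pass_summary(records, now=datetime(2017, 7, 1))
```

Even a summary this crude gives the data stewards something factual to react to: how much space each file type takes, how much is probably duplicated, and how much nobody has touched in a year.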

Your second pass should be focused on reorganization. Go deeper and index the content. Identify the redundant, trivial and outdated data that can be deleted…the responsive, sensitive or personal data that needs to be protected…the hidden corporate intellectual property and historically valuable content that needs to be made more accessible…the archival-class data consuming primary storage that needs to be moved to cheaper long term storage.

Your third pass should be focused on risk. You should know where your key content is at this point. Support your legal and compliance teams with classified data that identifies data that should be on legal hold in support of eDiscovery, content that is sensitive and should be secured and preserved, email that is required for regulatory purposes, even content that contains personally identifiable information (PII) that should be managed according to corporate governance policies. Your legal team will spend less time trying to find data, and more time protecting the organization from harm.

After each pass, show the results to your client. Help them get a better understanding of their data. Find out what they want to learn in the next pass. At the end, they will be able to develop informed governance policies derived from their actual data experience. And as the data changes over time, the data map you’ve created is just as easily updated.

If you’d like to learn more about how to build data maps using our petabyte class indexing platform, you can contact us here, or attend a webinar by registering here.

When I was younger I could remember anything, whether it happened or not; but my faculties are decaying now, and soon I shall be so I cannot remember any but the latter – Mark Twain

If you are wondering if it is true that an elephant never forgets, read this.

What Most People Don’t Know About PII

By Tim Williams

If I’m not lying, and that’s really my credit card number, most people would agree that I’m in trouble – we’re talking LifeLock class trouble. But what if I took that same piece of paper, and taped it anonymously to the entrance of the New York Hilton? Am I still in trouble?

The answer is no, and the reason is that Credit Card Numbers are not Personally Identifiable Information (PII).

Surprised? It’s a very common misconception that credit card numbers are PII, but the truth is, PII is your email address, your home address, your phone number…any information that can be used to identify you. A single piece of PII is like a loose thread. Once you have it, you can use the Internet to start pulling on it, and get more and more of it. In the world we live in, unless you are prepared to move completely off the grid, you can’t protect your PII.

My credit card number is certainly very Sensitive Information (let’s call it SI), but it can’t be used to identify me. Only when combined with PII, in this case my LinkedIn profile, does it create the problems associated with identity theft. If you don’t know who owns a credit card number, there’s just not that much mischief you can do with it.

Why does this matter? Well, have you just convinced your company to invest in technology that scans your network for Social Security Numbers, Credit Card Numbers, Bank Account Numbers, Routing Numbers, HealthCare Identifiers, etc. because you were charged with finding and eliminating these kinds of threats? Unless you understand the differences between SI and PII, the task will be much more difficult than you imagined. That’s because there are two different strategies that vendors use to find SI, and both of them have flaws.

Let’s call the first method the Optimist method. The Optimist assumes an orderly world where Social Security Numbers are always stored in a format like NNN-NN-NNNN, nine digits separated into three groups (three digits, two digits and four digits) by dashes. Maybe your Optimist has had a brush with reality, and will recognize Social Security Numbers with single spaces instead of dashes, but that is as realistic as they will get. Unfortunately, reality can be cruel, often storing Social Security numbers as nine digits without dashes grouping them, or with lots of space between the groups. They can even be stored in three separate fields that alone are unrecognizable as a Social Security number, and only make sense when displayed in a companion form that supplies the dashes and readability of the data. For these reasons, the Optimist can, and does, miss SI. (Most vendors use this method, so odds are this is what you bought).

Compare that with the Pessimist method. The Pessimist knows how disorderly reality is and casts as wide a net as possible when searching. Not only will they match any sequence of nine consecutive digits, they will also match any series of three, then two, then four digits separated by any number of non-alphanumeric characters. The Pessimist isn’t likely to miss any SI at all. The problem is all the false positive matches they will find. You will be surprised how many nine digit numbers you will find that aren’t really Social Security numbers. While both methods generate false positives and while there are well known practices used by both methods to minimize those false positives, you’ll get far more of them from a Pessimist than an Optimist. In some datasets, the false positives can be overwhelming.
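The difference between the two methods is easy to see with a pair of regular expressions. The Python below is a sketch under my own assumptions: the two patterns are one plausible reading of each method, not any vendor's actual rules, and the sample strings are made up.

```python
import re

# Optimist: only the tidy NNN-NN-NNNN layout.
OPTIMIST = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Pessimist: three, two, then four digits with any number of
# non-alphanumeric characters (including none) between the groups,
# as long as the run is not part of a longer number.
PESSIMIST = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)

samples = [
    "SSN 078-05-1120",            # tidy format
    "SSN 078051120",              # nine digits, no dashes
    "SSN 078   05   1120",        # lots of space between groups
    "Invoice 123456789 shipped",  # nine digits that are not an SSN
]

optimist_hits = [bool(OPTIMIST.search(s)) for s in samples]
pessimist_hits = [bool(PESSIMIST.search(s)) for s in samples]
```

The Optimist finds only the first sample and misses the next two. The Pessimist finds all three real numbers, and also flags the invoice number, which is exactly the false positive problem described above.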

It’s possible to further minimize Pessimist false positives by, for example, excluding search results that aren’t near strings like “Social” or “Security” or “SSN” or “Employee” when searching for Social Security Numbers, or “Credit Card”, “Amex”, “Visa”, “MasterCard” when searching for Credit Card numbers. A search like that would hit the credit card number above, regardless of how the number was formatted. Using that technique on a dataset that was pronounced “Clean of SI” after it was processed by an Optimist, you will find lots of examples they missed. It’s a very effective way to quickly find the flaws in an Optimist implementation. Of course, that is also likely to end up excluding SI that the Optimist found that did not have those strings.
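Here is what that kind of keyword filter might look like in Python. Treat it as a sketch: the 40-character window, the function name and the sample text are my own assumptions, and the wide-net pattern is just one way to write a Pessimist search.

```python
import re

# Wide-net pattern: 3, 2, then 4 digits with any non-alphanumeric separators.
PESSIMIST = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)
KEYWORDS = re.compile(r"social|security|ssn|employee", re.IGNORECASE)

def hits_near_keywords(text, window=40):
    """Keep only matches with a keyword within `window` characters."""
    hits = []
    for match in PESSIMIST.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        if KEYWORDS.search(context):
            hits.append(match.group())
    return hits

kept = hits_near_keywords("Employee SSN: 078 05 1120.")
dropped = hits_near_keywords("Tracking number 123456789 left the warehouse.")
```

The oddly formatted number next to “SSN” is kept, while the tracking number is dropped. The trade-off is the one noted above: a real Social Security Number with none of the keywords nearby would be dropped too.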

So if an Optimist is not foolproof, and a Pessimist can generate too many false positives, what’s to be done? That’s the true value of searching for PII. Since SI is only a problem when it is matched with PII, it follows that by using a tool that implements the Pessimistic method to search for SI only where it is near PII (the last names of all your employees or customers, for instance), you can efficiently find all the SI that truly puts your organization at risk. That means that if your dataset is large, you will need a pretty powerful indexing engine and a well thought out search process, but at least you can be confident that the task can be successfully completed.
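A rough Python sketch of searching for SI only where it sits near PII follows. Everything specific in it is an assumption for illustration: the last names are invented, the 40-character window is arbitrary, and a real deployment would run this kind of query against an index rather than raw strings.

```python
import re

# Wide-net SI pattern: 3, 2, then 4 digits with any non-alphanumeric separators.
SI_PATTERN = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)

def risky_hits(text, last_names, window=40):
    """Return (name, number) pairs where a wide-net match sits near a known name."""
    if not last_names:
        return []
    name_pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in last_names) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for match in SI_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        context = text[start:match.end() + window]
        name = name_pattern.search(context)
        if name:
            hits.append((name.group(), match.group()))
    return hits

# Hypothetical employee last names.
last_names = ["Moreno", "Okafor"]

flagged = risky_hits("Payroll record for Alice Moreno, 078051120.", last_names)
ignored = risky_hits("Part number 987654321 is back-ordered.", last_names)
```

The unformatted number next to an employee name is flagged, and the part number, which no Optimist or keyword rule could tell apart from it by format alone, is ignored because no PII is nearby.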

Still don’t believe me? You’re welcome to argue the point with me on LinkedIn. And no, that’s not my credit card number, and now that you have my PII, I won’t be posting the real one.