Eating the Elephant. How to Get Started with Data Governance

by Tim Williams

Memory is malleable. Not only does our ability to recall the past degrade over time, but we often find ourselves remembering things that never happened. Our memory can even be altered by outside influences. According to memory scholar Elizabeth Loftus, it’s possible to produce false memories in a person either intentionally through overt manipulation, or unintentionally through prompting with misleading cues.

I was reminded of this while reading Exterro Content Marketing Manager Jim Gill’s blog post 4 Data Mapping Challenges and How to Overcome Them. Data maps, high-level functional reports of a company’s data under management, are essential prerequisites for building a data compliance program. Gill’s focus is on data maps used to respond to litigation, but the principles apply to any other governance initiative. Gill warns that many companies start data mapping projects only to abandon them before completion. Nevertheless, he writes that systematic interviews with data stewards are the most efficient way to collect information for a data map.

That’s not my experience. The people who manage the storage have zero insight into the content of the data under management, and frequently weren’t even employed when the data store was implemented. The people who created the data have long ago lost track of most of it, and may not even work there anymore. Trying to build a data map by relying on the memory of either group will generate highly inaccurate results. And when you add Gill’s four challenges (too time consuming to build, impossible to keep information up to date, incomplete information, and not comprehensive), it is understandable why most data mapping projects are considered failures.

How to get started? Well, start with the assumption that memory is at best a rough approximation, and that the best way to help someone recall the truth is to provide them with detailed and reliable cues grounded in facts. Start with the data. Organize it, classify it, and present summary and detailed reports of it to your data stewards and data owners. Get them working directly with it, discovering what’s really there, and building the governance rules based upon what actually exists, rather than what they remember.

But don’t make the mistake of trying to do it all at once. Contrary to popular wisdom, the best way to eat an elephant is not one bite at a time. Really big problems resist solutions that involve breaking them up into small pieces and tackling each piece one by one. As Mike Martel warns, pretty soon you are going to get really sick of the taste of elephant and give up.

Start with a high level outline, and fill in the details iteratively. Let technology do the heavy lifting. Leverage a petabyte-class indexing and classification platform that can scale to meet the needs of massive data centers, one that can focus back and forth on the data like a camera does on the world, from wide angle landscapes to high resolution detailed shots.

The real problem with taking it step by step is that most people lose interest and end up quitting – Mike Martel

Your first pass should be focused on remembering…getting a rough idea of what types of data are stored where. Classify the data based on file system metadata only. You can get an estimate of file types from their names, a sense of the amount of storage they consume from their sizes, an estimate of the storage wasted from metadata-based deduplication, and a sense of what’s valuable from last access and modification times. Share that information with the data stewards during your first interview with them and you will be surprised at how eye-opening the conversation will be.
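As a rough illustration of a metadata-only first pass, here is a minimal Python sketch. It is not how any particular indexing platform works; the function name `metadata_summary`, the `stale_days` threshold, and the heuristic of treating files with the same name, size and modification time as likely duplicates are all assumptions made for the example. It also uses modification time only, since last access times are often not recorded reliably.

```python
import os
import time
from collections import defaultdict


def metadata_summary(root, stale_days=365):
    """Summarize a directory tree using file system metadata only.

    Returns (per_extension_stats, likely_duplicate_bytes). No file content
    is read; duplicates are guessed from (name, size, mtime), which is an
    assumption of this sketch and can be wrong in both directions.
    """
    now = time.time()
    by_ext = defaultdict(lambda: {"files": 0, "bytes": 0, "stale_bytes": 0})
    seen = defaultdict(int)  # (name, size, mtime) -> times seen so far
    duplicate_bytes = 0

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # unreadable or vanished file; skip it

            # File type is estimated from the name alone.
            ext = os.path.splitext(name)[1].lower() or "(none)"
            entry = by_ext[ext]
            entry["files"] += 1
            entry["bytes"] += st.st_size

            # "Stale" here means not modified within stale_days.
            if now - st.st_mtime > stale_days * 86400:
                entry["stale_bytes"] += st.st_size

            # Count every copy after the first as wasted storage.
            key = (name, st.st_size, int(st.st_mtime))
            if seen[key]:
                duplicate_bytes += st.st_size
            seen[key] += 1

    return dict(by_ext), duplicate_bytes
```

Calling `metadata_summary("/some/share")` (a placeholder path) returns a per-extension breakdown plus a rough count of bytes held in likely duplicates, which is the kind of summary you could bring to a first interview with a data steward.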

Your second pass should be focused on reorganization. Go deeper and index the content. Identify the redundant, trivial and outdated data that can be deleted…the responsive, sensitive or personal data that needs to be protected…the hidden corporate intellectual property and historically valuable content that needs to be made more accessible…the archival-class data consuming primary storage that needs to be moved to cheaper long-term storage.

Your third pass should be focused on risk. You should know where your key content is at this point. Support your legal and compliance teams with classified data that identifies content that should be on legal hold in support of eDiscovery, content that is sensitive and should be secured and preserved, emails that must be retained to meet regulatory requirements, and even content that contains personally identifiable information (PII) that should be managed according to corporate governance policies. Your legal team will spend less time trying to find data, and more time protecting the organization from harm.

After each pass, show the results to your client. Help them get a better understanding of their data. Find out what they want to learn in the next pass. At the end, they will be able to develop informed governance policies derived from their actual data experience. And as the data changes over time, the data map you’ve created is just as easily updated.

If you’d like to learn more about how to build data maps using our petabyte class indexing platform, you can contact us here, or attend a webinar by registering here.

When I was younger I could remember anything, whether it happened or not; but my faculties are decaying now, and soon I shall be so I cannot remember any but the latter – Mark Twain

If you are wondering if it is true that an elephant never forgets, read this.

What Most People Don’t Know About PII

By Tim Williams

If I’m not lying, and that’s really my credit card number, most people would agree that I’m in trouble – we’re talking LifeLock class trouble. But what if I took that same piece of paper, and taped it anonymously to the entrance of the New York Hilton? Am I still in trouble?

The answer is no, and the reason is that Credit Card Numbers are not Personally Identifiable Information (PII).

Surprised? It’s a very common misconception that credit card numbers are PII, but the truth is, PII is your email address, your home address, your phone number…any information that can be used to identify you. A single piece of PII is like a loose thread. Once you have it, you can use the Internet to start pulling on it, and get more and more of it. In the world we live in, unless you are prepared to move completely off the grid, you can’t protect your PII.

My credit card number is certainly very Sensitive Information (let’s call it SI), but it can’t be used to identify me. Only when combined with PII, in this case my LinkedIn profile, does it create the problems associated with identity theft. If you don’t know who owns a credit card number, there’s just not that much mischief you can do with it.

Why does this matter? Well, have you just convinced your company to invest in technology that scans your network for Social Security Numbers, Credit Card Numbers, Bank Account Numbers, Routing Numbers, HealthCare Identifiers, etc. because you were charged with finding and eliminating these kinds of threats? Unless you understand the differences between SI and PII, the task will be much more difficult than you imagined. That’s because there are two different strategies that vendors use to find SI, and both of them have flaws.

Let’s call the first method the Optimist method. The Optimist assumes an orderly world where Social Security Numbers are always stored in a format like NNN-NN-NNNN: nine digits separated by dashes into three groups (three digits, two digits and four digits). Maybe your Optimist has had a brush with reality and will also recognize Social Security Numbers with single spaces instead of dashes, but that is as realistic as they will get. Unfortunately, reality can be cruel, often storing Social Security Numbers as nine digits with no dashes grouping them, or with lots of space between the groups. A number can even be stored in three separate fields that are unrecognizable on their own as a Social Security Number, and only make sense when displayed in a companion form that supplies the dashes and makes the data readable. For these reasons, the Optimist can, and does, miss SI. (Most vendors use this method, so odds are this is what you bought.)

Compare that with the Pessimist method. The Pessimist knows how disorderly reality is and casts as wide a net as possible when searching. Not only will they match any sequence of nine consecutive digits, they will also match any series of three, then two, then four digits separated by any number of non-alphanumeric characters. The Pessimist isn’t likely to miss any SI at all. The problem is all the false positive matches they will find. You will be surprised how many nine-digit numbers you will find that aren’t really Social Security Numbers. While both methods generate false positives, and while there are well-known practices used by both methods to minimize those false positives, you’ll get far more of them from a Pessimist than an Optimist. In some datasets, the false positives can be overwhelming.
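The difference between the two methods can be sketched with two regular expressions. These patterns are illustrative assumptions based on the descriptions above, not any vendor’s actual rules, and the names `OPTIMIST_SSN` and `PESSIMIST_SSN` are made up for the example.

```python
import re

# Optimist: three, two and four digits, with each group separated by
# exactly one dash or one space (NNN-NN-NNNN or NNN NN NNNN).
OPTIMIST_SSN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")

# Pessimist: three, two and four digits separated by any number of
# non-alphanumeric characters, including none at all, so nine
# consecutive digits also match. The lookarounds stop the pattern from
# matching inside a longer run of digits.
PESSIMIST_SSN = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)


def find_candidates(text, pattern):
    """Return every substring of text that the given pattern matches."""
    return [m.group() for m in pattern.finditer(text)]


sample = "SSN 123-45-6789, alt 123 45 6789, raw 123456789, spaced 123   45   6789"
optimist_hits = find_candidates(sample, OPTIMIST_SSN)
pessimist_hits = find_candidates(sample, PESSIMIST_SSN)
```

On this sample the Optimist pattern finds only the first two values, while the Pessimist pattern finds all four. The Pessimist pattern would equally match an unrelated nine-digit part number, which is where the false positives come from.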

It’s possible to further minimize Pessimist false positives by, for example, excluding search results that aren’t near strings like “Social” or “Security” or “SSN” or “Employee” when searching for Social Security Numbers, or “Credit Card”, “Amex”, “Visa”, “MasterCard” when searching for Credit Card numbers. A search like that would hit the credit card number above, regardless of how the number was formatted. Using that technique on a dataset that was pronounced “Clean of SI” after it was processed by an Optimist, you will find lots of examples they missed. It’s a very effective way to quickly find the flaws in an Optimist implementation. Of course, that is also likely to end up excluding SI that the Optimist found that was not near any of those strings.
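One way to express “near strings like” in code is a fixed character window around each match. The sketch below assumes a 40-character window and a simple case-insensitive substring test; both are choices made for illustration, and the names `hits_near_cues` and `SSN_CUES` are hypothetical.

```python
import re

# Wide-net pattern: three, two and four digits with any number of
# non-alphanumeric separators (including none) between the groups.
PESSIMIST_SSN = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)

# Cue strings taken from the text; matched case-insensitively.
SSN_CUES = ("social", "security", "ssn", "employee")


def hits_near_cues(text, pattern=PESSIMIST_SSN, cues=SSN_CUES, window=40):
    """Keep only matches with a cue string within `window` characters."""
    hits = []
    for m in pattern.finditer(text):
        start = max(0, m.start() - window)
        context = text[start:m.end() + window].lower()
        if any(cue in context for cue in cues):
            hits.append(m.group())
    return hits
```

A nine-digit invoice total with no cue nearby is dropped, while “Employee SSN: 123456789” is kept. The trade-off described above still applies: a genuine number stored far from any cue string is dropped too.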

So if an Optimist isn’t foolproof, and a Pessimist can generate too many false positives, what’s to be done? That’s the true value of searching for PII. Since SI is only a problem when it is matched with PII, it follows that by using a tool that implements the Pessimist method to search for SI only where it is near PII (the last names of all your employees or customers, for instance), you can efficiently find all the SI that truly puts your organization at risk. That means that if your dataset is large, you will need a pretty powerful indexing engine and a well thought out search process, but at least you can be confident that the task can be successfully completed.
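The idea of searching for SI only where it is near PII can be sketched as a proximity join between two sets of matches. This is a small in-memory illustration, not a description of how an indexing engine would do it at scale; the function name `si_near_pii`, the 60-character window, and the sample names are assumptions.

```python
import re

# Wide-net SI pattern: three, two and four digits with any number of
# non-alphanumeric separators (including none) between the groups.
PESSIMIST_SSN = re.compile(
    r"(?<!\d)\d{3}[^A-Za-z0-9]*\d{2}[^A-Za-z0-9]*\d{4}(?!\d)"
)


def si_near_pii(text, last_names, window=60):
    """Return (name, number) pairs where a candidate number sits within
    `window` characters of one of the given last names."""
    if not last_names:
        return []
    names = re.compile(
        r"\b(?:" + "|".join(re.escape(n) for n in last_names) + r")\b",
        re.IGNORECASE,
    )
    name_hits = [(m.start(), m.end(), m.group()) for m in names.finditer(text)]

    findings = []
    for si in PESSIMIST_SSN.finditer(text):
        for n_start, n_end, name in name_hits:
            # Character gap between the two spans; 0 if they touch or overlap.
            gap = max(si.start() - n_end, n_start - si.end(), 0)
            if gap <= window:
                findings.append((name, si.group()))
                break  # one nearby name is enough to flag this number
    return findings
```

For a real dataset the list of last names would come from an employee or customer directory, and the window size would need tuning against the formats actually found in the data.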

Still don’t believe me? You’re welcome to argue the point with me on LinkedIn. And no, that’s not my credit card number, and now that you have my PII, I won’t be posting the real one.

From Civil to Criminal – When the Coverup is Worse than the Crime

by Jim McGann

Legal history is replete with stories of persons or companies turning a manageable legal problem into a more serious one by trying to hide or destroy evidence; Watergate and Arthur Andersen/Enron are two notable examples. A recent case involving a bus company executive provides a good case study in what not to do when facing a government investigation, and in the consequences of trying to hide or destroy evidence.

Article from Leonard L. Gordon, partner Venable LLP Read the rest on

Index Engines Helps Enterprise Customers Migrate Long Term Retention Data on Backup Tapes to Amazon Web Services

Companies using legacy backup tapes for long term retention of key business records could see a 76% cost savings in just three years by having Index Engines migrate this content from tape to AWS.

by Index Engines

HOLMDEL, NJ–Information management company Index Engines is delivering a cost-effective solution to migrate data of value from legacy backup tapes to the AWS Cloud.

Index Engines eliminates the need for the legacy backup software and provides an intelligent migration path to AWS. This facilitates improved access and management of the content as well as the retirement of the legacy tapes and infrastructure, saving significant data center expenses.

Current ROI analysis shows potential savings of up to 76%* after three years, based on current maintenance, offsite storage, eDiscovery service provider and associated legacy backup data fees.

“Our customers want to access their corporate data including valuable intellectual property that is hidden on offline tapes,” said Sabina Joseph, Head of Global Storage Partnerships and Alliances, Amazon Web Services. “Index Engines makes it possible to move a single-instance or culled data set of data from legacy tape onto AWS so it can be accessed anytime, anywhere by legal teams or any knowledge workers who can benefit from the data assets.”

Index Engines simplifies migration of data from legacy backup tapes to Amazon Simple Storage Service (Amazon S3). A culled data set, or single instance of the tape contents, is migrated ensuring all metadata remains forensically sound.

The native deduped data, including unstructured files, email and databases, is stored in AWS. A metadata or full content index is available to search, manage based on retention policies, and access so data can be quickly retrieved based on business needs.

The costs and risks associated with leaving “dark” and unknown user data on legacy backup tapes are significant and go well beyond the compliance and regulatory risks of not knowing what exists. They include:

  • Old backup software maintenance costs as well as the manpower required to support it
  • Aged libraries and media servers under maintenance
  • Offsite tape storage costs and retrieval fees
  • eDiscovery restore requests by expensive service providers for specific files or user mailboxes
  • Hidden intellectual property and business assets that are not leveraged

Index Engines supports access to and migration from all common backup formats.

“Legacy backup tapes are hard to access, impossible to search and expensive to restore, yet they contain vital corporate records that must be preserved to meet legal and regulatory requirements,” Index Engines CEO Tim Williams said. “We’re excited to work with AWS to make that data accessible, responsive and governable.”

Index Engines provides a number of flexible pricing models that allow organizations to implement the solution in their data center or ship tapes to a secure Index Engines certified processing lab.

When deploying inside the firewall, the technology can be managed by internal resources or via Index Engines’ remote Assurance Program. The Assurance Program provides all the technical resources necessary to successfully execute a tape-to-AWS migration, from remotely installing the software to processing the tapes and migrating data of value to the cloud.

For more information on how to use Index Engines to migrate data to AWS, visit

*Savings of 76% based on this cost scenario.

Unburdening Undue Burden: Why are backup tapes still a burden?

by Jim McGann

A federal court in Washington ruled in favor of Franciscan Health System’s motion to not produce data from backup tapes as it was expensive and not easily accessible.

Franciscan Health claimed it would need to restore, search and review data from 100 backup tapes, which at 14 hours of labor per tape would require 1,400 hours and $157,500 in costs. That’s over $1,500 a tape to discover and collect the responsive ESI! You can read more about the case on

Index Engines’ Data Processing Lab recently performed a very similar job that required 25 mailboxes be restored from 100 tapes. The job was completed in 20 hours for less than $30,000. That’s a couple of days compared to many months of effort and for a fraction of the cost.
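The figures quoted above can be checked with a few lines of arithmetic. The sketch uses only the numbers stated in the text; the lab cost is given as “less than $30,000”, so it is treated as a ceiling, and the two jobs are described as very similar rather than identical, so the ratios are rough.

```python
# Figures quoted in the text.
TAPES = 100
CLAIMED_HOURS_PER_TAPE = 14
CLAIMED_COST = 157_500        # dollars, as claimed in the case
LAB_HOURS = 20
LAB_COST_CEILING = 30_000     # dollars; "less than $30,000", so an upper bound

claimed_hours = TAPES * CLAIMED_HOURS_PER_TAPE
claimed_cost_per_tape = CLAIMED_COST / TAPES
claimed_cost_per_hour = CLAIMED_COST / claimed_hours
lab_cost_per_tape_ceiling = LAB_COST_CEILING / TAPES
hours_ratio = claimed_hours / LAB_HOURS
cost_ratio_floor = CLAIMED_COST / LAB_COST_CEILING

print(f"Claimed effort: {claimed_hours} hours")
print(f"Claimed cost per tape: ${claimed_cost_per_tape:,.2f}")
print(f"Claimed cost per hour: ${claimed_cost_per_hour:,.2f}")
print(f"Lab cost per tape (at most): ${lab_cost_per_tape_ceiling:,.2f}")
print(f"Claimed hours vs. lab hours: {hours_ratio:.0f}x")
print(f"Claimed cost vs. lab cost (at least): {cost_ratio_floor:.2f}x")
```

On these figures the claimed effort works out to 1,400 hours and $1,575 per tape, against at most $300 per tape for the lab job.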

The Index Engines approach to tape is fundamentally different from, and less expensive than, traditional restoration services. A simple scan of the tape allows full content or metadata search and reporting on the tape contents. Quickly find what is needed, whether it is a user mailbox, single file, entire directory or more. This content, and only this content, is then restored back online quickly and reliably.

Even when both parties “did not dispute that its backup tapes would contain at least some emails that were discoverable under Rule 26(b)(1)” the court once again fell for the burden argument and moved forward without doing a little research on the technology that exists to remove the burdensome nature of backup tapes.

Learn more about the Index Engines’ tape advantage here:

Using Backup Tape as an Archive with Today’s Data Governance Requirements


by Jim McGann

Backup tapes have always provided a reliable and cost effective backup and data preservation solution. Even for those users who are backing up to disk, tapes have provided a cost-effective replication target. As a result, organizations have amassed stockpiles of legacy tapes in offsite storage vaults that have long outlived their disaster recovery usefulness. These tapes often represent the only copy of sensitive files and documents required to support legal and compliance requirements. These tapes are your corporate legacy data archive.

This paper will discuss a new approach towards managing and archiving legacy data in the cloud that is not only cost effective, but will help support today’s more challenging data governance requirements.

Tape Was Never Designed to be an Archive

Tape is a low cost, portable media which can be used to preserve data in support of disaster recovery. Continue Reading…

Confessions of a Data Hoarder

You can overcome your data hoarding addiction. The first step is Honesty. Admit that you are powerless over your addiction, and that your life has become unmanageable.

So go ahead, step up to the mic, introduce yourself, and say these words… “I am a data hoarder”. You simply cannot bring yourself to delete your redundant, out of date, and trivial data. You continue to store it, back it up, archive it, waste your company’s money and contribute to its out of control IT budget and escalating legal liability. Continue Reading…

Putting a price on undue burden – $136,000 isn’t big enough

Three things we can learn from Guardiola v. Renown Health

1. $136,000 to restore and review email from backup tapes is not enough to show “undue burden.”

2. Organizations must bear some responsibility for using a backup solution that did not maintain data in an indexed or searchable manner.

3. Restoration of legacy tape data is “technologically feasible” when bringing in a third-party vendor, alleviating the burden of in-house production.

Read more about the case here

Discover how to avoid a $136,000 eDiscovery bill here

EMC and Index Engines Partner for Backup Migration

If you would like to take advantage of EMC’s best-of-breed backup solutions, including NetWorker, Avamar and Data Domain, but feel locked in to your current provider because you’re using old backups for long-term retention, Index Engines has the solution.

Index Engines has partnered with EMC to help clients migrate to a new backup solution while still maintaining access to the legacy data without the need for the original software. Additionally, Index Engines takes advantage of EMC’s ECS cloud storage to migrate data of value from tape to cloud, enabling clients to go tapeless and eliminate tape as a long-term retention (LTR) strategy.

Benefits of the solution include:

– Freedom to move to an EMC backup solution.
– Retire non-production backup software and infrastructure.
– Use cloud for LTR and apply retention to data.
– Go tapeless, recoup offsite storage fees.
– Manage risk and liability hidden in legacy data.

See how: