What’s Driving Data Center Classification? Privacy or Costs?

by Jim McGann

Data centers add storage capacity year after year, relentlessly trying to keep up with prolific users. This strategy has worked because the cost of storage has plummeted over the years: adding capacity is a simple call to the vendor of your choice. Problem solved?

Recently we have seen a surge in data classification projects. These projects are attacking a number of very different use cases. Here are some recent examples:

  1. Classification of 2.8PB of user data on a shared network server. Found that 36% of the content was aged system files with no business value; 1PB of storage capacity was reclaimed. Happy client. Use Case: Cost Savings
  2. Security assessment of 25 network servers to determine if sensitive email/PSTs exist on the network. Found 2,376 PSTs: 57% had not been accessed in years, 28% were abandoned by ex-employees, and 32% were active and accessed in the last 6 months. Use Case: Privacy
  3. 676TB of high-profile network storage classified for ROT (redundant, obsolete, trivial) analysis. 12% of the data was purged and 19% was archived in the cloud. 275,000 files containing high-value intellectual property were found and secured. Use Case: Cost Savings and Privacy

These projects started with the classification of unstructured data, providing the knowledge needed to make decisions and develop a disposition strategy. Many were initially driven by cost; however, the end game resulted in support for privacy.

Is cost the driving factor for the resurgence of data classification?

With the EU General Data Protection Regulation (GDPR) looming, will privacy take a front seat?

Download our newest eBook, Harnessing Metadata for Streamlined Data Management and Governance, to find out how data classification can drive down costs and help privacy efforts.

Leveraging Deduplication to Streamline Information Management – From ROT Clean Up to eDiscovery

By Ed Moke

Whether you’re undertaking a data profiling project to understand the data across the enterprise, preserving legacy data from backup tapes, or collecting data for an eDiscovery event, the ability to identify and manage duplicate content is critical to the success of the initiative.

Why Deduplication Matters

Many tools use deduplication to filter and cull irrelevant data. Duplicates can represent up to 30% of what exists on corporate networks and up to 95% of what sits on legacy backups. Yet many tools rely on very simplistic metadata analysis, which can easily miss large volumes of duplicate content and make discovery and classification more complex and time consuming. Without comprehensive, customizable deduplication technology, finding and managing data to support migration and governance becomes a daunting task.

Index Engines provides comprehensive and flexible deduplication features that adapt to the task at hand. In some cases metadata deduplication is sufficient; in others, more comprehensive MD5 content hashing is required. Either way, Index Engines provides deduplication options to suit the job.

Configuring the Deduplication Method

Index Engines’ deduplication functionality is fully customizable and empowers users to adapt the deduplication method to a specific workflow. To start, let’s walk through the properties that can be used to identify duplicates and how Index Engines implements those methods.

The available parameters are as follows:

[Screenshot A: the deduplication parameter options]

The user can select one or many of these items as the deduplication method. The customer’s workflow and level of indexing determine how the deduplication method can be customized to achieve the desired outcome. ‘Path’, ‘Size’, ‘Filename’ and ‘Modification Time’ are metadata properties of the file taken from the filesystem.

When indexing data from tape, backup or network sources, Index Engines calculates an MD5 signature for each file and stores it in the index. When ‘Content’ is selected, this MD5 signature is used for deduplication. ‘Message id else Content’ is an option for projects that contain both email and loose files: email is deduplicated using the internal Message ID within each message, while non-email files use the content signature.

The user can then narrow the scope of deduplication by selecting ‘Owner’, which applies the chosen deduplication method within the set of files owned by an email or Active Directory user. The ‘Family’ option deduplicates files across email families (e.g. returning query matches that appear in different email families). The ‘Production’ option is used for implementing rolling delivery of files and for creating load files in eDiscovery workflows, and will be discussed in a subsequent post.
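To make the parameter choices concrete, here is a minimal Python sketch of how selected properties might combine into a deduplication key. It is an illustration only, not Index Engines’ implementation; the IndexedItem fields and the dedup_key helper are hypothetical names chosen for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexedItem:
    path: str
    filename: str
    size: int
    mtime: float                      # modification time (epoch seconds)
    content: bytes
    message_id: Optional[str] = None  # populated for email items only
    owner: Optional[str] = None       # email or Active Directory owner

def dedup_key(item, fields, scope_by_owner=False):
    """Build a deduplication key from the selected parameters.

    `fields` may contain any of: 'path', 'size', 'filename', 'mtime',
    'content', 'message_id_else_content'.
    """
    parts = []
    if "path" in fields:
        parts.append(item.path)
    if "filename" in fields:
        parts.append(item.filename)
    if "size" in fields:
        parts.append(item.size)
    if "mtime" in fields:
        parts.append(item.mtime)
    if "content" in fields:
        # content signature: MD5 of the file body
        parts.append(hashlib.md5(item.content).hexdigest())
    if "message_id_else_content" in fields:
        # email dedupes on its internal Message ID; loose files fall back
        # to the MD5 content signature
        if item.message_id:
            parts.append(("msgid", item.message_id))
        else:
            parts.append(("md5", hashlib.md5(item.content).hexdigest()))
    if scope_by_owner:
        # 'Owner' scope: duplicates are only collapsed within one owner's files
        parts.append(item.owner)
    return tuple(parts)
```

In this sketch, any two items that produce the same key would be treated as duplicates of one another.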

Deduplication Process and Viewing Results

The deduplication method can be set or changed at any time because Index Engines applies it at query time: the user’s query is run, the result set is deduplicated using the method set in the project preferences, and the deduplicated results are returned in the search GUI.

The search GUI gives the user the power to view the unique (deduplicated) set of results and to switch between viewing only the duplicate files or all of the files.

[Screenshot B: the search GUI result views]

By selecting the unique files, the duplicate files or all files, the user can quickly generate results and summary reports for each view with relative ease.
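Continuing the hypothetical sketch above, the query-time behavior could be approximated as follows: group the query results by their deduplication key, then expose the three views (unique, duplicates, all) from those groups. The key_fn parameter stands in for whichever key-building function the project preferences select.

```python
from collections import defaultdict

def deduplicate(results, key_fn):
    """Group query results by a deduplication key (for example, the
    dedup_key helper sketched earlier) and split them into the three
    views offered in the search GUI: unique, duplicates, and all files."""
    groups = defaultdict(list)
    for item in results:
        groups[key_fn(item)].append(item)

    unique = [items[0] for items in groups.values()]                  # one representative per key
    duplicates = [i for items in groups.values() for i in items[1:]]  # the remaining copies
    return {"unique": unique, "duplicates": duplicates, "all": list(results)}
```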


What Symantec Got Wrong About Data Protection

by Tim Williams

On June 24, 2005, in the face of analyst skepticism, security giant Symantec’s “wary investors” finally approved its merger with storage giant Veritas in an all-stock deal valued at $13.5 billion. Symantec CEO John W. Thompson’s rationale, as reported in the New York Times, was to “create a convenient one-stop shop for customers seeking computer security software, which Symantec sells, and data storage software, which Veritas sells”.

Ten years later, Symantec sold the Veritas assets it had acquired to private investors led by the Carlyle Group for $8 billion, implying that the assets had lost roughly 40% of their value under Symantec’s stewardship.

“Ironically”, according to CRN, “the fact that the legacy Symantec and legacy Veritas business never really did merge appears to be making the process to split the two easier than it might have been.” At the time, the split was justified by Symantec Chief Product Officer Matt Cain, who CRN quoted as saying “When we did a strategic analysis of the company, it became obvious quickly that the focus on data management and security were diverging quickly”.

Allow me to translate. At the time of the merger, Symantec saw this:

and after ten years of failing to find the synergy, they gave up. Or maybe they just didn’t try that hard, since the businesses “never really did merge“.

Meanwhile, something important changed in the market over those ten years that apparently went unappreciated at Symantec. The answer lies hidden in the two news sources referenced above. In 2005, the world saw Symantec as a “security” company and Veritas as a “data storage” company. But by 2015, while Cain was still characterizing Veritas as a player in data management, analysts more properly described the spinout as “the newest startup in the data protection market”.

Pop quiz:

What’s the difference between security and data protection?

Time’s up. Hand in your answers.

If you answered that data protection means the essential disaster recovery processes that every enterprise relies on, you only get partial credit. Why? Think of the problem from the point of view of the data itself, and what exactly it needs to be protected from:

It’s not just system failures, the traditional province of data protection. It’s also bad actors launching viruses and ransomware attacks on your data…in other words, Symantec’s data protection market. And it’s also bad governance policies for data retention and management, the ones forbidden by regulations like the GDPR (the “General Data Protection Regulation”), a market that was within the reach of the two companies’ combined assets.

Security is data protection. So is backup. So is archive. So are data profiling, classification, remediation and disposition.

So what did Symantec get wrong about Data Protection? At a point when the IT market was coming to agreement on a comprehensive understanding of data protection, they missed the opportunity to be the one vendor that could supply an implementation of that understanding. They didn’t look at the market from the point of view of the data, the way their customers do.

If you are interested in learning about the connective tissue we provide between these three data protection needs, you can visit our website here, contact us here, or register for a webinar here.

Eating the Elephant. How to Get Started with Data Governance

by Tim Williams

Memory is malleable. Not only does our ability to recall the past degrade over time, but we often find ourselves remembering things that never happened. Our memory can even be altered by outside influences. According to memory scholar Elizabeth Loftus, it’s possible to produce false memories in a person either intentionally through overt manipulation, or unintentionally through prompting with misleading cues.

I was reminded of this while reading Exterro Content Marketing Manager Jim Gill’s blog post 4 Data Mapping Challenges and How to Overcome Them. Data maps, high-level functional reports of a company’s data under management, are essential prerequisites for building a data compliance program. Gill’s focus is on data maps used to respond to litigation, but the principles apply to any other governance initiative. Gill warns that many companies start data mapping projects only to abandon them before completion. Nevertheless, he writes that systematic interviews with data stewards are the most efficient way to collect information for a data map.

That’s not my experience. The people who manage the storage have zero insight into the content of the data under management, and frequently weren’t even employed when the data store was implemented. The people who created the data lost track of most of it long ago, and may not even work there anymore. Trying to build a data map by relying on the memory of either group will generate highly inaccurate results. And when you add Gill’s four challenges (too time consuming to build, impossible to keep up to date, incomplete information, and not comprehensive), it’s understandable why most data mapping projects are considered failures.

How to get started? Well, start with the assumption that memory is at best a rough approximation, and that the best way to help someone recall the truth is to provide them with detailed and reliable cues grounded in facts. Start with the data. Organize it, classify it, and present summary and detailed reports of it to your data stewards and data owners. Get them working directly with it, discovering what’s really there, and building the governance rules based upon what actually exists, rather than what they remember.

But don’t make the mistake of trying to do it all at once. Contrary to popular wisdom, the best way to eat an elephant is not one bite at a time. Really big problems resist solutions that involve breaking them up into small pieces and tackling each piece one by one. As Mike Martel warns, pretty soon you are going to get really sick of the taste of elephant and give up.

Start with a high level outline, and fill in the details iteratively. Let technology do the heavy lifting. Leverage a petabyte-class indexing and classification platform that can scale to meet the needs of massive data centers, one that can focus back and forth on the data like a camera does on the world, from wide angle landscapes to high resolution detailed shots.

The real problem with taking it step by step is that most people lose interest and end up quitting – Mike Martel

Your first pass should be focused on remembering…getting a rough idea of what types of data are stored where. Classify the data based on file system metadata alone. You can get an estimate of file types from their names, a sense of the storage they consume from their sizes, an estimate of the storage wasted from metadata deduplication, and a sense of what’s still valuable from last access and modification times. Share that information with the data stewards during your first interview with them and you will be surprised at how eye-opening the conversation will be.
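As a rough illustration of what that metadata-only first pass looks like, here is a small Python sketch that walks a share and tallies file types, capacity, and aging without ever reading file content. It is a toy that assumes direct file system access; in practice this pass would run against a petabyte-class index rather than a script.

```python
import os
import time
from collections import Counter

def first_pass(root):
    """Metadata-only sweep: file types by extension, capacity by type,
    and an aging profile from last-access times. No content is read."""
    now = time.time()
    type_counts, type_bytes, age_buckets = Counter(), Counter(), Counter()

    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.stat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip unreadable entries
            ext = os.path.splitext(name)[1].lower() or "<no extension>"
            type_counts[ext] += 1
            type_bytes[ext] += st.st_size
            years_idle = (now - st.st_atime) / (365 * 24 * 3600)
            if years_idle > 3:
                age_buckets["idle more than 3 years"] += 1
            elif years_idle > 1:
                age_buckets["idle 1-3 years"] += 1
            else:
                age_buckets["recently accessed"] += 1

    return type_counts, type_bytes, age_buckets
```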

Your second pass should be focused on reorganization. Go deeper and index the content. Identify the redundant, trivial and outdated data that can be deleted…the responsive, sensitive or personal data that needs to be protected…the hidden corporate intellectual property and historically valuable content that needs to be made more accessible…the archival-class data consuming primary storage that needs to be moved to cheaper long-term storage.
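A second-pass decision might be sketched like this: hash content to spot redundant copies, check age for archival candidates, and scan text for a marker of sensitivity. The bucket names, the “CONFIDENTIAL” marker, and the five-year threshold are illustrative assumptions, not a real ROT or sensitivity policy.

```python
import hashlib
import os
import time

def second_pass(paths, archive_after_years=5):
    """Content-level pass (illustrative only): flag redundant copies by
    content hash, sensitive files by a simple marker, and aged files as
    archive candidates. Real ROT and sensitivity rules are far richer."""
    seen, buckets = set(), {"redundant": [], "sensitive": [], "archive": [], "keep": []}
    now = time.time()
    for path in paths:
        with open(path, "rb") as fh:
            data = fh.read()
        digest = hashlib.md5(data).hexdigest()
        if digest in seen:
            buckets["redundant"].append(path)       # duplicate content: candidate for deletion
            continue
        seen.add(digest)
        if b"CONFIDENTIAL" in data.upper():
            buckets["sensitive"].append(path)       # needs protection, not deletion
        elif now - os.path.getmtime(path) > archive_after_years * 365 * 24 * 3600:
            buckets["archive"].append(path)         # move to cheaper long-term storage
        else:
            buckets["keep"].append(path)
    return buckets
```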

Your third pass should be focused on risk. You should know where your key content is at this point. Support your legal and compliance teams with classified data that identifies what should be on legal hold in support of eDiscovery, content that is sensitive and should be secured and preserved, email that must be retained for regulatory requirements, even content that contains personally identifiable information (PII) and should be managed according to corporate governance policies. Your legal team will spend less time trying to find data, and more time protecting the organization from harm.

After each pass, show the results to your client. Help them get a better understanding of their data. Find out what they want to learn in the next pass. At the end, they will be able to develop informed governance policies derived from their actual data experience. And as the data changes over time, the data map you’ve created is just as easily updated.

If you’d like to learn more about how to build data maps using our petabyte class indexing platform, you can contact us here, or attend a webinar by registering here.

When I was younger I could remember anything, whether it happened or not; but my faculties are decaying now, and soon I shall be so I cannot remember any but the latter – Mark Twain

If you are wondering if it is true that an elephant never forgets, read this.

What Most People Don’t Know About PII

By Tim Williams

If I’m not lying, and that’s really my credit card number, most people would agree that I’m in trouble – we’re talking LifeLock class trouble. But what if I took that same piece of paper, and taped it anonymously to the entrance of the New York Hilton? Am I still in trouble?

The answer is no, and the reason is that Credit Card Numbers are not Personally Identifiable Information (PII).

Surprised? It’s a very common misconception that credit card numbers are PII, but the truth is, PII is your email address, your home address, your phone number…any information that can be used to identify you. A single piece of PII is like a loose thread. Once you have it, you can use the Internet to start pulling on it, and get more and more of it. In the world we live in, unless you are prepared to move completely off the grid, you can’t protect your PII.

My credit card number is certainly very Sensitive Information (let’s call it SI), but it can’t be used to identify me. Only when combined with PII, in this case my LinkedIn profile, does it create the problems associated with identity theft. If you don’t know who owns a credit card number, there’s just not that much mischief you can do with it.

Why does this matter? Well, have you just convinced your company to invest in technology that scans your network for Social Security Numbers, Credit Card Numbers, Bank Account Numbers, Routing Numbers, HealthCare Identifiers, etc. because you were charged with finding and eliminating these kinds of threats? Unless you understand the differences between SI and PII, the task will be much more difficult than you imagined. That’s because there are two different strategies that vendors use to find SI, and both of them have flaws.

Let’s call the first method the Optimist method. The Optimist assumes an orderly world where Social Security Numbers are always stored in a format like NNN-NN-NNNN, nine digits separated into three groups (three digits, two digits and four digits) by dashes. Maybe your Optimist has had a brush with reality, and will recognize Social Security Numbers with single spaces instead of dashes, but that is as realistic as they will get. Unfortunately, reality can be cruel, often storing Social Security Numbers as nine digits without any dashes grouping them, or with extra spaces between the groups. A number can even be stored in three separate fields that alone are unrecognizable as a Social Security Number, and only make sense when displayed in a companion form that supplies the dashes and readability. For these reasons, the Optimist can, and does, miss SI. (Most vendors use this method, so odds are this is what you bought.)

Compare that with the Pessimist method. The Pessimist knows how disorderly reality is and casts as wide a net as possible when searching. Not only will they match any sequence of nine consecutive digits, they will also match any series of three, then two, then four digits separated by any number of non-alphanumeric characters. The Pessimist isn’t likely to miss any SI at all. The problem is all the false positive matches they will find. You will be surprised how many nine-digit numbers you find that aren’t really Social Security Numbers. While both methods generate false positives, and while there are well-known practices used to minimize them, you’ll get far more of them from a Pessimist than an Optimist. In some datasets, the false positives can be overwhelming.
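To make the contrast concrete, here is a toy Python comparison of the two strategies. These patterns are simplified illustrations, not any vendor’s actual detection rules, and the sample strings are obviously fake.

```python
import re

# "Optimist": expects the canonical NNN-NN-NNNN layout (or single spaces)
optimist = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")

# "Pessimist": nine consecutive digits, or three/two/four digit groups
# separated by any run of non-alphanumeric characters
pessimist = re.compile(r"\b(?:\d{9}|\d{3}[^0-9A-Za-z]+\d{2}[^0-9A-Za-z]+\d{4})\b")

samples = [
    "SSN: 123-45-6789",    # canonical form: both match
    "ssn 123 45 6789",     # single spaces: both match
    "123456789",           # no separators: only the Pessimist matches
    "Invoice #987654321",  # nine digits that are not an SSN: Pessimist false positive
]

for text in samples:
    print(f"{text!r:25} optimist={bool(optimist.search(text))} "
          f"pessimist={bool(pessimist.search(text))}")
```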

It’s possible to further minimize Pessimist false positives by, for example, excluding search results that aren’t near strings like “Social” or “Security” or “SSN” or “Employee” when searching for Social Security Numbers, or “Credit Card”, “Amex”, “Visa” or “MasterCard” when searching for credit card numbers. A search like that would hit the credit card number above, regardless of how the number was formatted. Using that technique on a dataset that was pronounced “clean of SI” after it was processed by an Optimist, you will find lots of examples they missed. It’s a very effective way to quickly find the flaws in an Optimist implementation. Of course, it is also likely to end up excluding SI that the Optimist found that did not have those strings nearby.

So if an Optimist isn’t foolproof, and a Pessimist can generate too many false positives, what’s to be done? That’s the true value of searching for PII. Since SI is only a problem when it is matched with PII, it follows that by using a tool that implements the Pessimist method to search for SI only where it is near PII (the last names of all your employees or customers, for instance), you can efficiently find all the SI that truly puts your organization at risk. If your dataset is large, you will need a pretty powerful indexing engine and a well-thought-out search process, but at least you can be confident that the task can be successfully completed.
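Continuing the toy example above, the SI-near-PII idea might look like this: keep a Pessimist match only when a known surname appears within a small window of text around it. The window size and the name list are arbitrary assumptions for illustration.

```python
import re

pessimist = re.compile(r"\b(?:\d{9}|\d{3}[^0-9A-Za-z]+\d{2}[^0-9A-Za-z]+\d{4})\b")

def risky_hits(text, known_names, window=100):
    """Keep a Pessimist match only if a known employee/customer surname
    appears within `window` characters of it (SI that sits near PII)."""
    names = [n.lower() for n in known_names]
    hits = []
    for m in pessimist.finditer(text):
        nearby = text[max(0, m.start() - window): m.end() + window].lower()
        if any(name in nearby for name in names):
            hits.append(m.group())
    return hits

print(risky_hits("Employee Jane Smith, SSN 123 45 6789, start date 2017-03-01",
                 ["Smith", "Jones"]))   # -> ['123 45 6789']
```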

Still don’t believe me? You’re welcome to argue the point with me on LinkedIn. And no, that’s not my credit card number, and now that you have my PII, I won’t be posting the real one.

From Civil to Criminal – When the Coverup is Worse than the Crime

by Jim McGann

Legal history is replete with stories of people or companies turning a manageable legal problem into a more serious one by trying to hide or destroy evidence; see Watergate and Arthur Andersen/Enron for two notable examples. A recent case involving a bus company executive provides a good case study in what not to do when facing a government investigation, and in the consequences of trying to hide or destroy evidence in an investigation.

Article from Leonard L. Gordon, partner at Venable LLP. Read the rest on Lexology.com.

Index Engines Helps Enterprise Customers Migrate Long Term Retention Data on Backup Tapes to Amazon Web Services

Companies using legacy backup tapes for long term retention of key business records could see a 76% cost savings in just three years by having Index Engines migrate this content from tape to AWS.

by Index Engines

HOLMDEL, NJ–Information management company Index Engines is delivering a cost-effective solution to migrate data of value from legacy backup tapes to the AWS Cloud.

Index Engines eliminates the need for the legacy backup software and provides an intelligent migration path to AWS. This facilitates improved access and management of the content as well as the retirement of the legacy tapes and infrastructure, saving significant data center expenses.

Current ROI analysis shows a potential of up to 76% savings* after three years, based on current maintenance, offsite storage, eDiscovery service provider and associated legacy backup data fees.

“Our customers want to access their corporate data including valuable intellectual property that is hidden on offline tapes,” said Sabina Joseph, Head of Global Storage Partnerships and Alliances, Amazon Web Services. “Index Engines makes it possible to move a single-instance or culled data set of data from legacy tape onto AWS so it can be accessed anytime, anywhere by legal teams or any knowledge workers who can benefit from the data assets.”

Index Engines simplifies migration of data from legacy backup tapes to Amazon Simple Storage Service (Amazon S3). A culled data set, or single instance of the tape contents, is migrated, ensuring all metadata remains forensically sound.

The native deduplicated data, including unstructured files, email and databases, is stored in AWS. A metadata or full-content index makes the data searchable and manageable according to retention policies, so content can be quickly retrieved based on business needs.
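As a rough idea of what landing a single restored, deduplicated file in Amazon S3 with its source metadata preserved could look like, here is a minimal boto3 sketch. The bucket name, key prefix and metadata fields are assumptions for illustration; this is not Index Engines’ migration tooling.

```python
import hashlib
import os

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

def migrate_file(local_path, bucket, key_prefix="tape-archive/"):
    """Upload one restored file to Amazon S3, carrying its original
    modification time and an MD5 content signature as object metadata so
    the copy can later be verified and governed by retention policy."""
    with open(local_path, "rb") as fh:
        body = fh.read()
    st = os.stat(local_path)
    s3.put_object(
        Bucket=bucket,
        Key=key_prefix + os.path.basename(local_path),
        Body=body,
        Metadata={
            "original-mtime": str(int(st.st_mtime)),
            "content-md5": hashlib.md5(body).hexdigest(),
        },
    )

# Example call (hypothetical bucket): migrate_file("restored/mailbox.pst", "my-archive-bucket")
```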

The costs and risks associated with leaving “dark” and unknown user data on legacy backup tapes are significant and go well beyond the compliance and regulatory risks of not knowing what exists, including:

  • Old backup software maintenance costs as well as the manpower required to support it
  • Aged libraries and media servers under maintenance
  • Offsite tape storage costs and retrieval fees
  • eDiscovery restore requests by expensive service providers for specific files or user mailboxes
  • Hidden intellectual property and business assets that are not leveraged

Index Engines supports access to and migration from all common backup formats.

“Legacy backup tapes are hard to access, impossible to search and expensive to restore, yet they contain vital corporate records that must be preserved to meet legal and regulatory requirements,” Index Engines CEO Tim Williams said. “We’re excited to work with AWS to make that data accessible, responsive and governable.”

Index Engines provides a number of flexible pricing models that allow organizations to implement the solution in their data center or ship tapes to a secure Index Engines certified processing lab.

When deploying inside the firewall, the technology can be managed by internal resources or via Index Engines’ remote Assurance Program. The Assurance Program provides all the technical resources necessary to successfully execute a tape-to-AWS migration, from remotely installing the software to processing the tapes and migrating data of value to the cloud.

For more information on how to use Index Engines to migrate data to AWS, visit www.indexengines.com/aws

*Savings of 76% based on this cost scenario.

Unburdening Undue Burden: Why are backup tapes still a burden?

by Jim McGann

A federal court in Washington ruled in favor of Franciscan Health System’s motion not to produce data from backup tapes because the data was expensive to restore and not easily accessible.

Franciscan Health claimed it would need to restore, search and review data from 100 backup tapes, which at 14 hours of labor per tape would require 1,400 hours and $157,500 in costs. That’s over $1,500 a tape to discover and collect the responsive ESI! You can read more about the case on Lexology.com.

Index Engines’ Data Processing Lab recently performed a very similar job that required 25 mailboxes be restored from 100 tapes. The job was completed in 20 hours for less than $30,000. That’s a couple of days compared to many months of effort and for a fraction of the cost.

The Index Engines approach to tape is fundamentally different from, and less expensive than, traditional restoration services. A simple scan of the tape allows full content or metadata search and reporting on the tape contents. Quickly find what is needed, whether it is a user mailbox, a single file, an entire directory or more. This content, and only this content, is then restored back online quickly and reliably.

Even though both parties “did not dispute that its backup tapes would contain at least some emails that were discoverable under Rule 26(b)(1)”, the court once again fell for the burden argument and moved forward without researching the technology that exists to circumvent the burdensome nature of backup tapes.

Learn more about the Index Engines’ tape advantage here: http://www.indexengines.com/backup-tape/about/tape-restoration-vs-direct-access

Using Backup Tape as an Archive with Today’s Data Governance Requirements

[Image: tape vs. cloud]

by Jim McGann

Backup tapes have always provided a reliable and cost effective backup and data preservation solution. Even for those users who are backing up to disk, tapes have provided a cost-effective replication target. As a result, organizations have amassed stockpiles of legacy tapes in offsite storage vaults that have long outlived their disaster recovery usefulness. These tapes often represent the only copy of sensitive files and documents required to support legal and compliance requirements. These tapes are your corporate legacy data archive.

This paper will discuss a new approach towards managing and archiving legacy data in the cloud that is not only cost effective, but will help support today’s more challenging data governance requirements.

Tape Was Never Designed to be an Archive

Tape is a low cost, portable media which can be used to preserve data in support of disaster recovery. Continue Reading…

Confessions of a Data Hoarder

You can overcome your data hoarding addiction. The first step is Honesty. Admit that you are powerless over your addiction, and that your life has become unmanageable.

So go ahead, step up to the mic, introduce yourself, and say these words… “I am a data hoarder.” You simply cannot bring yourself to delete your redundant, out-of-date, and trivial data. You continue to store it, back it up, archive it, waste your company’s money and contribute to its out-of-control IT budget and escalating legal liability. Continue Reading…