The importance of metadata in content management

Accurate metadata is critical to reliable unstructured data classification. Yet many tools on the market corrupt metadata, making the management of this content nearly impossible.

Organizations are learning that once metadata becomes unreliable, it is difficult to make decisions about the data, and the content becomes lost and abandoned.

Take, for example, tools that crawl networks and servers to index metadata and content. To accomplish this they must access each document, changing its last accessed time. Some tools reset this time afterward; some do not.

As a result, a key date field becomes inaccurate. Data that has been languishing on the network for a decade, untouched and long forgotten, can suddenly be indexed and its last accessed date made current. Data center and records managers would then treat this content as valuable information rather than classifying it as the outdated and trivial material it is.
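For tools that must read files without corrupting this field, the fix is straightforward: capture the timestamps before reading and restore them afterward. A minimal sketch in Python (the function name is illustrative, not from any particular product):

```python
import os

def read_preserving_atime(path):
    """Read a file's contents, then restore its original access time.

    Illustrates how a well-behaved crawler can index content without
    corrupting the "last accessed" metadata field.
    """
    st = os.stat(path)            # capture timestamps before reading
    with open(path, "rb") as f:
        data = f.read()           # this read may update atime
    # restore atime (mtime is passed through unchanged)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
    return data
```

The same pattern applies in any language: stat first, read, then reset the timestamps.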

Another, more common example is data center tools that change the owner of a document from a specific user to “administrator” or “admin.” These tools are widespread, and they alter one of the most critical metadata fields required for classifying content by department and business unit.

As these tools scan the network they can change the ownership of thousands of documents to the useless “admin” owner, and the documents lose context and importance. One financial services company found that 50 percent of its unstructured data belonged to “admin” after being corrupted during consolidations and migrations; last accessed dates also changed during this period, rendering the data useless and unmanageable.

Metadata is key to managing content and determining its disposition. As long as organizations continue to use tools that corrupt and cause spoliation of metadata, content that has value or is sensitive will become lost in a complex infrastructure.

Understanding how a tool or platform extracts metadata and indexes content is important to ensuring long-term metadata accuracy and confidence that user data will remain reliable.

Forbes: Three things companies can learn from A-Rod

After investigating alleged steroid use by New York Yankees third baseman Alex Rodriguez, Major League Baseball has reportedly offered him a plea deal. It’s the latest installment in a sad story, with important lessons for companies and workers, both inside and outside the ballpark.

Read Jim McGann’s entire guest post on Forbes.

Abandoned Data Clogging Corporate Networks

When an employee leaves an organization, their data lives on. Their computer’s hard drive may be wiped, but they leave many footprints scattered about the data center. A very small portion of this content may be useful to existing employees; the vast majority is abandoned content that has outlived its business value.

According to the Bureau of Labor Statistics, organizations currently face a 3.3% turnover rate. For a 5,000-employee organization, that represents 165 ex-employees annually. If each of those ex-employees was generating 5GB of unstructured content per year, that is almost 1TB of newly abandoned data every year.

However, corporate data is never just the single copy created by the user. It is replicated over and over: copies are made for backup and archiving, and copies are attached to email and sent to other users for review and consumption. Over time, a single document can easily turn into 10 copies of the same document.

Applying that to the previous example, the 1TB of data abandoned on networks by ex-employees quickly becomes 10TB of useless content cluttering the data center each year. Over 10 years this explodes to 100TB of abandoned data.
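The arithmetic behind these estimates is easy to verify; note the text rounds 825GB up to roughly 1TB per year before multiplying:

```python
employees = 5_000
turnover_rate = 0.033        # BLS turnover figure cited above
gb_per_user_per_year = 5
copies_per_document = 10     # backups, archives, email attachments, etc.
years = 10

departures = employees * turnover_rate               # 165 ex-employees/year
abandoned_gb = departures * gb_per_user_per_year     # 825 GB, "almost 1 TB"/year
with_copies_tb = round(abandoned_gb / 1000) * copies_per_document  # ~10 TB/year
decade_tb = with_copies_tb * years                   # ~100 TB over a decade
```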

Abandoned data is a hidden class of content that takes up valuable storage capacity and creates long-term risk and liability in today’s legal climate. Organizations rarely think about this content and are often unaware they are managing and storing it, upgrading server capacity annually to make room for it.

Classification of user content is gaining steam due to the challenges data centers face in managing decades of user content. Once classified, abandoned data owned by ex-employees can be identified as content that no longer has a place on the primary network. Understanding this data and acting on it has been complex, but data classification software can integrate with the Active Directory/LDAP environment to find and tag data owned by inactive users or ex-employees.
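As a hypothetical sketch of what that Active Directory/LDAP integration might look like: the standard bitwise-AND matching rule (OID 1.2.840.113556.1.4.803) can select accounts with the ACCOUNTDISABLE bit set in `userAccountControl`, and files can then be tagged by owner. The function names here are illustrative, not any vendor's API:

```python
ACCOUNTDISABLE = 0x2  # userAccountControl bit for disabled AD accounts

def disabled_accounts_filter():
    """Build an LDAP filter matching disabled (ex-employee) user accounts."""
    return (
        "(&(objectCategory=person)(objectClass=user)"
        f"(userAccountControl:1.2.840.113556.1.4.803:={ACCOUNTDISABLE}))"
    )

def tag_abandoned(files_by_owner, disabled_owners):
    """Return the paths whose owner is an inactive account."""
    return [path
            for owner, paths in files_by_owner.items()
            if owner in disabled_owners
            for path in paths]
```

A classification tool would run the filter against the directory, collect the disabled account names, and feed them to the tagging step during a file-system scan.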

Once data is classified it can be managed according to policy. Legal and compliance can then make disposition decisions about the content and, in some cases, purge it from the network. At worst, this content should be moved to less expensive or offline storage to recoup capacity on the primary storage network.

Using Data Profiling to Mitigate 7 “Red Flag” Information Risks

Data profiling technology can help an organization identify what electronic information it has and where it is located. This is the first step to ensuring that information governance policies are applied to it, reducing the organization’s costs and mitigating its seven greatest information risks.

Uncover these red flags in the summer edition of ARMA’s Information Management magazine.

Read Using Data Profiling to Mitigate 7 “Red Flag” Information Risks

Read the entire July/August 2013 issue of Information Management

Whitepaper: Leverage Data Profiling to Support Intelligent Disposition

Only with an understanding of unstructured data – owner, age, last accessed, file type – can decisions be made on its value and disposition.

While file-level analysis of data was previously nearly impossible to achieve, new technology enables organizations to classify data into categories, allowing manageable, simplified disposition strategies to be implemented.

Download this complimentary whitepaper from Index Engines to learn more.

Why time to data matters more than ever

Time to data has always been a big push for us at Index Engines, as we know that service providers and counsel need to have confidence that the ESI they need can be found and delivered on deadline.

But as data volumes increased and queries became more in-depth, it became much harder for some vendor technology to keep up with demand and deliver the needed information.

Now, more than ever, we see the legal ramifications of not being able to complete ESI culling as one vendor is being held financially and legally accountable.

This shines the light back on accelerating time to data. Some ESPs still consider 20GB/hour fast; it is not when terabytes or even petabytes of data need to be processed. That data then needs to be culled, deduped, deNISTed and compared across platforms before being moved into legal hold for review.
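The culling steps named above reduce to a simple core idea: hash every item, drop hashes on the known-system ("NIST") list, and keep one copy per unique hash. A toy sketch under those assumptions (real ESI processing also handles containers, custodian scoping, and near-duplicates):

```python
import hashlib

def cull(documents, nist_hashes):
    """Dedupe and deNIST a batch of documents (as bytes), keeping order."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.sha1(doc).hexdigest()
        if digest in nist_hashes:   # deNIST: drop known system files
            continue
        if digest in seen:          # dedupe: keep only the first copy
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```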

Time to data is not just a reflection of the technology; it reflects on the service provider. ESPs need to do their homework before accepting a job and partnering with a technology vendor, as they will be linked to that technology’s performance. Poor performance from the technology ultimately leads to less work for the ESP and less trust within the legal community.

The ability to provide defensible and auditable ESI in a timely, cost effective manner has never been more important, and neither has the technology vendor ESPs choose to work with.

Is internet and data privacy a thing of the past?

Privacy has been in the news and on our minds of late. The NSA entered the privacy debate when Edward Snowden exposed the fact that it was monitoring cell phone calls to uncover terror plots. If the government monitors private citizens’ records in the name of safety, is this OK? What about when Google or Facebook is required to hand over records to find criminals? If records are accessed by the government to protect and secure our citizens, is that OK? Many people would welcome this and feel more secure.

Where is the line drawn on privacy? How do organizations manage private and sensitive data? People constantly submit private data to websites when they buy goods or services. When you obtain a mortgage, significant details of your life are delivered to trusted providers. Is this data secure? What happens when this content gets into the wrong hands? Have we become too trusting with our personal information?

What about those that grew up on Facebook? Facebook owns everything you post on their site. Does the average Facebook user understand the contract they accepted when they created an account? Can you accept that contract at 13? Is Facebook chipping away at privacy and making it more acceptable to share private details of our lives? Is the information shared only bad when it gets in the wrong hands? Are we relying on complexity and technology to hide personal data and hope no one will ever see it?

The recent dialogue regarding the Enron data set shows how our community treats privacy. Many stated that it was common knowledge that private data, including personal tax records, was in the data set. The difference is that here we didn’t have an Edward Snowden to blow the whistle. Was privacy an issue in this case? I would think that if it were your credit card or Social Security number, it would be. If not, then you can make a statement like: the value of the data set outweighed any issues related to privacy.

As technology provides more streamlined access to all data, both content created today and content created many years ago, privacy must be front and center. Without privacy and control we harm people. The NSA uses private data for the protection of citizens; others would like to hack private data and use it for harm.

Managing ESI to control Risk and Liability

Uncover how unstructured data profiling can provide true information governance

Join eDiscovery Journal analyst Greg Buckles and Index Engines Vice President Jim McGann as they explore how unstructured data profiling technology is revolutionizing the way we look at ESI.

In less than 60 minutes, you’ll:

– Explore how data profiling works to mitigate risks and control liability associated with stored data,
– Discover how others are using this new technology to solve complex compliance and regulatory problems, and
– Evolve your information governance and data policies with immediately actionable and implementable strategies.

[Watch the webinar recording on YouTube: video ID 4Q8KblI8TZg]

The Enron PII Fallout: What dirty data really causes

Since we at Index Engines announced that Nuix’s re-release of the Enron PST data set still contained PII despite its press release’s claim that it was ‘cleansed,’ a lot of questions have been posed and many reactions raised: ethical, legal and moral.

Our first reaction on finding PII was disappointment over the distribution of the PST data set before it was audited or validated by a third party, especially since it was intended for public consumption. Despite what lawyers say about the legal accountability of republishing this set, we easily found names, addresses, birthdates and Social Security numbers in the SAME document. The eDiscovery community knows the ramifications of breaches better than anyone. Why allow this to happen?

We were confused that few really seemed to care that there was PII in a data set being publicly promoted. “It’s been around for a long time and I don’t think anyone’s been harmed, so oh well, it’s public.” That is the strangest logic and attitude I’ve ever seen come out of the legal community, no matter what some prior ruling stated. The world is a far different place than it used to be, and we don’t believe in data breaches for the ‘greater good.’

Then our disappointment turned to fear. Much of what we found was buried deep within attachments, sent folders and Outlook notes, which happens as data ages: it becomes buried and harder to find. eDiscovery tools are supposed to make finding this information easier, but if they’re missing PII, they could be missing vital evidence. Is, for argument’s sake, finding 99% of the needed files enough? What about 95%? Or 97%? Where’s the accountability, and what happens when what’s missing is the deciding factor in a case? The mortgage industry is likely to be the first to experience this issue: emails sent by loan originators who haven’t worked for the company in five or more years will be needed. How many tools can find ALL of them? There’s a difference between mitigating risks beforehand and missing some documents, and not being able to produce all the information needed during eDiscovery.
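For a sense of how shallow a surface-level PII scan can be, consider a minimal Social Security number pattern match. It catches obvious formatted hits in plain text but nothing inside attachments, PSTs, or images, which is exactly where buried PII hides. Purely illustrative:

```python
import re

# Conservative pattern for formatted SSNs (###-##-####) only.
# Real scanners also validate area/group ranges, match unformatted digits,
# and must recurse into attachments and container files.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssns(text):
    """Return formatted SSN-like strings found in plain text."""
    return SSN_RE.findall(text)
```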

Hindsight may be 20-20, but there’s some regret that this wasn’t a vendor-blind community effort. EDRM is a great group that does a lot of good work. What if a handful of vendors could locate PII, and EDRM could then remove it without knowing which vendor found what? Sure, a marketing opportunity or two might be missed, but that would have had the best chance of actually producing a truly cleansed data set. Until such a clean data set can be achieved, we don’t support the publishing of any data breach and can’t figure out why it’s still published.

Finally, a bit of advice for law firms and service providers: use caution if you’re using a new vendor to uncover information for litigation readiness or eDiscovery. If you, or another company you trust, haven’t audited this third party, get a second look. Depending on the depth of the job and the accuracy needed, the vendor you want to use may change. Every vendor has different strengths; just make sure you find one with the right tools for the job. Ask the tough questions about validation, where their software comes from and whether they can complete the job you need.