The Enron PII Fallout: What dirty data really causes

Since we at Index Engines announced that Nuix’s re-release of the Enron PST data set still contained PII, despite the press release’s claim that it was ‘cleansed,’ many questions and reactions have been raised on ethical, legal and moral grounds.

Our first reaction to finding PII was disappointment that the PST data set was distributed before it was audited or validated by a third party, especially since it was intended for public consumption. Whatever lawyers say about the legal accountability of republishing this set, we easily found names, addresses, birthdates and social security numbers in the SAME document. The eDiscovery community knows the ramifications of breaches better than anyone. Why allow this to happen?

We were confused that so few seemed to care that there was PII in a data set being actively promoted. “It’s been around for a long time and I don’t think anyone’s been harmed, so oh well, it’s public.” That is the strangest logic and attitude I’ve ever seen come out of the legal community, no matter what some prior ruling stated. The world is a far different place than it used to be and we don’t believe in data breaches for the ‘greater good.’

Then our disappointment turned to fear. Much of what we found was buried deep within attachments, sent folders and Outlook notes, which is what happens as data ages: it becomes buried and harder to find. eDiscovery tools are supposed to make finding this information easier, but if they’re missing PII, they could be missing vital evidence. Is, for argument’s sake, finding 99% of the needed files enough? What about 97%? Or 95%? Where’s the accountability, and what happens when what’s missing is the deciding factor in a case? The mortgage industry is likely going to be the first to experience this issue. Emails sent by loan originators who haven’t worked for the company in five or more years are going to be needed. How many tools can find ALL of them? Missing some documents while mitigating risks beforehand is one thing; being unable to produce all the information needed during eDiscovery is another.

Hindsight may be 20-20, but there’s some regret that this wasn’t a vendor-blind community effort. EDRM is a great group that does a lot of good work. What if a handful of vendors located the PII and EDRM then removed it, without vendors knowing who found what? Sure, there may have been a missed marketing opportunity or two, but that approach would have had the best chance of actually producing a truly cleansed data set. Until a clean data set can be achieved, we don’t support the publishing of any data breach, and we can’t figure out why this one is still published.

Finally, a bit of advice for law firms and service providers. Use caution if you’re using a new vendor to uncover information for litigation readiness or eDiscovery. If neither you nor another company you trust has audited this third party, get a second look. Depending on the depth of the job and the accuracy needed, the vendor you want to use may change. Every vendor has different strengths, so make sure you find one with the right tools for the job. Ask the tough questions about validation, where their software comes from and whether they can complete the job you need.
