At first glance this seems like an impossible task. When Google last published their stats, they had indexed approximately 8 billion web pages as of the first quarter of 2005 (they have since stopped publishing a number, though speculation puts it well above 10 billion). That sounds like a lot, until you consider that most of the Global 500 can easily exceed this number with historical files and email alone (one investment bank told me they have well over 2 billion active emails, not counting their archives). A large enterprise would therefore be burdened with the same level of indexing that Google faces every day. No wonder it is commonly considered an impossible task.
What does it take to index a billion objects? Well, if you apply technology similar to the Internet's, you will need 20,000 to 80,000 computers to process the data and store the index. That amount of compute resources is not practical for an enterprise to apply to any problem. The enterprise indexing problem is so daunting that most search companies recommend enterprises first decide which information is important and searchable and which is not, in order to reduce the problem set.
Recently I read the transcript of an Interop speech from a vendor in this space who was quoted as saying that "before you attempt indexing of enterprise data you first need to determine what is important". Of course, this is not practical, as you need an understanding of all your content before any segmentation can be done. The problem with indexing large volumes of enterprise data is that it must be approached far more efficiently than traditional indexing, which works well on the Internet.
Index Engines attacked the two major inhibitors to enterprise-wide indexing. The first challenge is the speed of indexing, and the second is the size of the index. We found that traditional document scraping tools were too slow, at around 2 MB per second, and often required that a copy of the data be created for dedicated processing. We developed word scraping technology that scrapes words at line speeds (approximately 200 MB/sec). Why is this important? Because an enterprise creates data constantly, much faster than users on the Internet create web pages, and the indexing of this data needs to occur quickly in order to maintain currency and accuracy without burdening the IT infrastructure with extensive processing requirements.
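To put those two rates in perspective, here is a back-of-the-envelope calculation. The 2 MB/sec and 200 MB/sec figures come from the text above; the 100 TB corpus size is an illustrative assumption:

```python
# Back-of-the-envelope: how long does it take to scrape a corpus
# at ~2 MB/s (traditional tools) vs. ~200 MB/s (line speed)?

def hours_to_index(corpus_tb: float, rate_mb_per_s: float) -> float:
    """Hours needed to scrape corpus_tb terabytes at rate_mb_per_s MB/s."""
    corpus_mb = corpus_tb * 1024 * 1024  # TB -> MB (binary units)
    return corpus_mb / rate_mb_per_s / 3600  # seconds -> hours

for rate in (2, 200):
    hours = hours_to_index(100, rate)
    print(f"{rate:>3} MB/s: {hours:,.0f} hours (~{hours / 24:,.0f} days)")
```

At 2 MB/sec, 100 TB takes on the order of 600 days of continuous scraping; at 200 MB/sec, about 6 days, which is the difference between an index that is perpetually stale and one that can keep pace with new data.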
The second, and most challenging, inhibitor to enterprise-wide indexing is the size of the index itself. Consider 100 TB of email and unstructured files: the index storage requirement of typical enterprise indexing solutions ranges from 40% to well over 100% of the original document size. This results in terabytes, even petabytes, of additional storage just for the index! The Index Engines database can process 4 million words per second with a resulting index only 8% of the size of the content. Using the above example of 100 TB of data, this results in approximately 8 TB of index storage, which is a far more realistic number and easier for a CIO to accept.
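The storage arithmetic behind those numbers is simple enough to sketch. The 40%, 100%, and 8% ratios come from the text above; the 100 TB corpus is the same illustrative example:

```python
# Index storage needed for a 100 TB corpus at different
# index-to-content size ratios.

def index_size_tb(corpus_tb: float, ratio: float) -> float:
    """Index storage in TB for a corpus indexed at the given size ratio."""
    return corpus_tb * ratio

corpus = 100  # TB of email and unstructured files
for label, ratio in [("traditional, low end", 0.40),
                     ("traditional, high end", 1.00),
                     ("8% index", 0.08)]:
    print(f"{label:>21}: {index_size_tb(corpus, ratio):.0f} TB")
```

Running this shows the gap: a traditional index of the same corpus consumes 40 to 100 TB of storage, while an 8% index needs only 8 TB.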
Indexing the enterprise must be done with speed and efficiency. Any IT manager will tell you they don't want to solve one problem by creating new ones. Traditional indexing approaches may provide a solution for data discovery, but they also create challenges in processing power and storage that most firms will not accept. This is why a new approach was required, one that now makes enterprise-wide indexing possible.