Index Engines, a leader in enterprise discovery solutions, today announced a partnership with BlueArc® Corporation, a leader in scalable, high-performance network storage. Together the companies have developed an optimized version of the Index Engines LAN engine to support BlueArc's high-performance Titan 2000 unified network storage systems, enabling fast, economical search and discovery of unstructured online data. The new Index Engines/BlueArc solution was developed as part of BlueArc's participation in the Index Engines Litigation Ready™ Partner Program. The resulting product reduces overall eDiscovery costs and turnaround time while giving companies the dynamic interaction with data they need for on-demand decision making for legal and regulatory compliance.
Taneja Group just wrote a research paper on us. “Every so often, we find a vendor that cuts through a Gordian knot of IT complexity with one well conceived reframing of the problem. The Index Engines Appliance does precisely this, and we believe customers will agree. For enterprises exploring the complex world of classification and indexing tools, we recommend an evaluation of the Index Engines Appliance.”
Get a free copy of the full report on our website.
My last blog entry talked about multiple cores being the most cost-effective way to increase the number of instructions processed per clock cycle. With many-core processors soon to be standard, it is worth taking a look at what application writers can do to maximize the performance of multi-threaded code.
First, locks are very expensive to execute: they stall pipelines and lock memory cache lines. A typical multi-threaded program with locks, even with no lock contention, can run as much as 30% slower than a non-multi-threaded version. This slowdown is the overhead of executing the exclusive memory access instructions, which significantly reduces the instructions executed per clock. Mutexes also consume a lot of space: in Linux, a POSIX pthread_mutex is 40 bytes.
One way to avoid this locking overhead is to use locks as sparingly as possible. Take advantage of the processor's core guaranteed atomic operations. All modern processors guarantee that simple loads and stores to aligned memory addresses are atomic. The Linux atomic_t and atomic64_t types support a set of simple increments and decrements that perform much better than a lock/unlock pair. Use algorithms based on atomic fetch-and-add principles to resolve conflicts. There are many public algorithms for list management and other common tasks that take advantage of these primitives. Do a quick search on the Web and keep those cores busy.
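To make the fetch-and-add idea concrete, here is a minimal sketch of a shared counter updated with GCC's __sync_fetch_and_add builtin instead of a pthread_mutex. The function names and the thread/iteration parameters are mine, purely for illustration; this is not code from our product.

```c
/* Sketch: lock-free shared counter via GCC's __sync_fetch_and_add.
 * Each increment is a single atomic read-modify-write; no mutex,
 * no 40-byte lock structure, no lock/unlock pair per update. */
#include <pthread.h>
#include <stdlib.h>

static long counter;

struct job { long iters; };

static void *worker(void *arg)
{
    struct job *j = arg;
    for (long i = 0; i < j->iters; i++)
        __sync_fetch_and_add(&counter, 1);  /* atomic fetch-and-add */
    return NULL;
}

/* Spawn nthreads threads, each adding per_thread to the counter;
 * returns the final counter value. */
long parallel_count(int nthreads, long per_thread)
{
    pthread_t *tids = malloc(nthreads * sizeof *tids);
    struct job j = { per_thread };

    counter = 0;
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tids[t], NULL, worker, &j);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    free(tids);
    return counter;
}
```

With a plain non-atomic increment, concurrent threads would lose updates; the atomic builtin guarantees every increment lands, without ever serializing the threads behind a mutex.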
In my earlier entry I spoke about the challenges of indexing the enterprise. The biggest challenge is the speed at which indexing has to occur: enterprises are creating data faster than traditional indexing methods can index it. In order to process billions of files at high speed we needed to implement a new approach to scraping words. This method, which utilizes our advanced text scanning algorithms, works best on CPU architectures with very high memory bandwidth and low latency to that memory. After much analysis we found that the AMD Opteron CPU was the best fit: because of Opteron's Direct Connect Architecture, the latency for accessing random data from main memory is minimized. This problem has also benefited from the gaming market, which has driven the CAS latency of DDR400 memory down to 2 clock cycles.
What can we expect in the future? It is clear that the future of processors is multiple cores. Thermal issues have put a damper on increasing clock speeds, so any new available real estate is being used to add more cores, which effectively increases the number of instructions executed per clock cycle. Existing CPU-intensive applications will need to be modified to take advantage of these new architectures. Even though multi-threading has been around for a long time, it is still worth examining on multi-core systems, and I will address it in my next blog entry.
What would I love to see in a future processor? Like everyone else's, our application would benefit from more L1 and L2 cache. This is obviously important to AMD as well; they recently licensed the Z-RAM high-density memory IP from Innovative Silicon, and hopefully some of that technology will significantly increase cache space. We could also use a simple built-in hash instruction for hashing strings. The best public domain hash functions take about 20 operations per word, and I would guess a 10- to 20-fold speedup for a silicon approach.
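As a reference point for the per-word cost mentioned above, here is FNV-1a, one of the well-known public-domain string hashes. It is shown only to illustrate the kind of software work a hash instruction would replace; it is not the hash function we use.

```c
/* FNV-1a: a classic public-domain string hash.  Every input byte
 * costs an XOR plus a multiply, which is why a dedicated silicon
 * hash instruction would be such a win. */
#include <stdint.h>

uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;      /* FNV offset basis */
    while (*s) {
        h ^= (uint8_t)*s++;        /* mix in one byte */
        h *= 16777619u;            /* FNV prime */
    }
    return h;
}
```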
Overall I look forward to many more cores and integrated DDR2 memory controllers. Just keep in mind that your code has to be tuned to take advantage of the new CPUs, otherwise you will have a lot of idle cores.
At first glance this seems like an impossible task. When Google last published their stats, they had indexed approximately 8 billion web pages as of the first quarter of 2005 (they have since stopped publishing a number, but speculation has it well above 10 billion). That sounds like a lot, until you consider that most of the global 500 can easily exceed this number with historical files and email alone (one investment bank told me they have well over 2 billion active emails, not counting their archives). A large enterprise is therefore burdened with the same level of indexing that Google faces every day. No wonder it is commonly considered an impossible task.
What does it take to index a billion objects? Well, if you apply technology similar to the Internet's, you will need 20,000 to 80,000 computers to process the data and store the index. That amount of compute resources is not practical for an enterprise to apply to any problem. The enterprise indexing problem is so daunting that most search companies recommend enterprises first decide which information is important and searchable and which is not, in order to reduce the problem set.
Recently I read the transcript of an Interop speech from a vendor in this space who was quoted as saying that "before you attempt indexing of enterprise data you first need to determine what is important." Of course, this is not practical, as you need an understanding of all content before any segmentation can be done. Indexing large volumes of enterprise data must be approached far more efficiently than traditional indexing, which works well on the Internet.
Index Engines attacked the two major inhibitors to enterprise-wide indexing. The first challenge is the speed of indexing; the second is the size of the index. We found that traditional document scraping tools were too slow, at around 2 MB per second, and often required that a copy of the data be created for dedicated processing. We developed word scraping technology that scrapes words at line speed (approximately 200 MB/sec). Why is this important? Because an enterprise creates data constantly, much faster than users on the Internet create web pages, and the indexing of this data needs to occur quickly in order to maintain currency and accuracy without burdening the IT infrastructure with extensive processing requirements.
The second, and most challenging, inhibitor to enterprise-wide indexing is the size of the index itself. Consider 100 TB of email and unstructured files: the index storage requirements of traditional enterprise indexing solutions range from 40% to well over 100% of the original document size. This results in terabytes, even petabytes, just to store the index! The Index Engines database can process 4 million words per second with a resulting index of only 8% of the size of the content. Using the above example of 100 TB of data, this results in approximately 8 TB of index storage, a far more realistic number and one much easier for a CIO to accept.
Indexing the enterprise must be done with speed and efficiency. Any IT manager will tell you they don't want to solve one problem by creating new problems. Traditional indexing approaches may provide a solution for data discovery, but they also result in challenges related to processing power and storage that most firms will not accept. This is why a new approach was required, one that now makes enterprise-wide indexing possible.
I decided to write my own blog. Why? My current work at Index Engines is one of the most challenging development projects of my career. This blog will allow me to share that challenge with the development community and hopefully initiate some dialog. Please feel free to pass along your comments.