Efficient programming on multi-core systems

My last blog entry talked about multiple cores being the most cost effective way to improve the number of instructions processed per-clock cycle. With future processors having many cores standard, it is worthwhile to take a look at what application writers can do to maximize the performance of multi-threaded code.

First, Locks are very expensive to execute in that they stall pipelines and lock memory cache lines. A typical multi-threaded program with locks and no lock contention, can run as much as 30% slower when compared to a non multi-threaded version. This slowdown is the overhead of executing the exclusive memory access instructions which significantly reduces the instructions per-clock executed. Also, mutex’s consume a lot of space. In Linux, a POSIX pthread_mutex is 40 bytes.

One way to avoid this locking overhead is to lessen the use of locks as much as possible. Take advantage of some of the core guaranteed atomic operations of the processor. All modern processors guarantee that simple loads and stores to aligned memory addresses are atomic. The Linux atomic_t and atomic64_t types supports a set of simple increments and decrements that will perform much better than a lock/unlock pair. Use algorithms that are based on atomic fetch_and_add principles to resolve conflicts. There are many public algorithms for list management and other common algorithms that take advantage of these primitives. Do a quick search on the Web and keep those cores busy.