Shared-memory multiprocessors feature parallelism and a steep cache hierarchy, two salient characteristics that distinguish them from the commodity processors of ten years ago. Both of these characteristics can be exploited effectively using the same general strategy: divide-and-conquer recursion. This talk overviews the Cilk multithreaded programming language being developed in the MIT Laboratory for Computer Science, which allows a programmer to exploit parallelism through divide-and-conquer. In addition, I show how divide-and-conquer allows caches to be used effectively.
In the first part of my talk, I introduce the Cilk programming language. Cilk minimally extends the C programming language to allow interactions among computational threads to be specified in a simple and highlevel fashion. Cilk’s provably efficient runtime system dynamically maps a user’s program onto available physical resources, freeing the programmer from concerns of communication protocols and load balancing. In addition, Cilk provides an abstract performance model that a programmer can use to predict the multiprocessor performance of his application from its execution on a single processor. Not only do Cilk programs scale up to run efficiently on multiple processors, but unlike existing parallel-programming environments, such as MPI and HPF, Cilk programs “scale down”: the efficiency of a Cilk program on one processor rivals that of a comparable C program. I illustrate Cilk programming through the example of divide-and-conquer matrix multiplication.
The second part of my talk presents a strategy for designing algorithms to exploit multiple levels of caching effectively. Unlike previous algorithms, these algorithms are “cache oblivious”: no variables dependent on hardware parameters, such as cache size and cache-line length, need to be tuned to achieve optimality. Nevertheless, I show that cache-oblivious algorithms can be developed that use an optimal amount of work and move data optimally among multiple levels of cache. Problems that can be solved efficiently by cache-oblivious algorithms include matrix multiplication, FFT, sorting, and matrix transpose, all of which are solved in a divide-and-conquer fashion.
Together, these two technologies provide a foundation for the design and analysis of efficient algorithms for shared-memory multiprocessors.