Current trends in system design point toward machines with ever more processing and storage units in a single system. To generate programs for such distributed-memory machines, the challenge is to coordinate and schedule multiple functional units so that computations are performed efficiently.
In this paper, we describe how our compiler automates this process, generating efficient parallel programs from sequential ones. We first show how to turn straight-line code into a task graph that exposes the maximum available parallelism. We then give an algorithm for assigning computations to processors so as to minimize communication cost. Finally, we give an algorithm for allocating registers across processors using an interference graph.
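To illustrate the first step, here is a minimal sketch (not the paper's algorithm) of building a task graph from straight-line three-address code: each statement gets an edge from the most recent definition of every operand it reads, so statements with no path between them may run in parallel. The `Stmt` representation is an assumption for illustration.

```python
from collections import namedtuple

# dest = variable written, ops = tuple of variables read (assumed form)
Stmt = namedtuple("Stmt", ["dest", "ops"])

def build_task_graph(stmts):
    """Return flow-dependence edges (i, j): statement j reads the
    value statement i defines, so j must execute after i."""
    last_def = {}   # variable -> index of its most recent definition
    edges = set()
    for j, s in enumerate(stmts):
        for v in s.ops:
            if v in last_def:
                edges.add((last_def[v], j))
        last_def[s.dest] = j
    return edges

# Example: t1 = a+b; t2 = c+d; t3 = t1*t2
prog = [Stmt("t1", ("a", "b")),
        Stmt("t2", ("c", "d")),
        Stmt("t3", ("t1", "t2"))]
print(sorted(build_task_graph(prog)))  # [(0, 2), (1, 2)]
```

In this example the first two statements have no edge between them, so they can be scheduled on different processors; only flow dependences are tracked here, on the assumption that anti- and output dependences in straight-line code have been removed by renaming temporaries.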