We present an auto-tuning mapping strategy for mapping grid blocks to multi-processors and multi-nodes in a parallel CFD program. We first calculate the communication matrices from the topology construction of the grids, then use two heuristics in tuning the possible partitions, which largely shrinks the searching space and obtains the optimal mapping under these searching constraints, making out a compromise between tuning cost and system performance. Experiments are carried out for a CFD calculation case on a high performance computing platform. Compared with general block mapping, cycle mapping and random mapping strategies, our strategy has an extraordinary advantage over the others in load balance and communication overhead.