We describe a new algorithm for prefetching arrays that are accessed with a compile-time known constant stride. We demonstrate a hardware limitation that has not previously been dealt with (limited off-chip bandwidth) and show its effect on prefetching. Our new algorithm is designed to cope with this hardware limitation. The new algorithm generates prefetches that are more efficient than the standard algorithm because it avoids cache conflicts and issues prefetches based on the machine’s ability to process memory transactions in parallel. As a bonus, the new algorithm is independent of the time to execute an iteration. This allows the compiler to perform prefetch analysis early, before good estimates of execution time are available. The result is that we prefetch as far ahead as is profitable given machine resources, but no farther. We show the performance gained by prefetching for our algorithm and for the traditional algorithm on the SPEC CPU2000 floating-point benchmarks. The new algorithm is demonstrably faster for this set of programs.