This work is motivated by the advance of heterogeneous computing and the strong demands of workload acceleration in practice. By considering pipeline workloads over FPGA, this paper explores a systematic methodology to configure the hardware instances of each pipeline stage such that the maximum of the execution time of each stage is minimized, where the FPGA allocation with the memory bandwidth constraint is considered. For the target problem, an algorithm is proposed and proved being optimal, and a real implementation study is conducted. In the experimental results, an image filter FPGA implementation can outperform the CPU, GPU, and baseline FPGA solutions by 460%, 73%, and 1030%, respectively. Extensive simulations were also conducted with a large FPGA size to show the scalability of this work.