The Fast Fourier Transform (FFT) is an important algorithm in science and engineering, with applications in communications, signal processing, instrumentation, and image and video analysis. It is a fast implementation of the Discrete Fourier Transform (DFT) that reduces the asymptotic complexity from the O(n^2) of direct DFT evaluation to O(n log n). In this paper, the radix-2 decimation-in-time FFT algorithm is implemented and investigated on Field Programmable Gate Arrays (FPGAs) and Graphics Processing Units (GPUs). The hardware description language Verilog HDL (VHDL) is used for the FPGA, while the Open Computing Language (OpenCL) is used for the GPU. Both implementations are compared with various pre-installed IP-core modules of Xilinx and MATLAB for complex inputs of various sample sizes. The results show that the FPGA is faster for a large number of small FFTs, whereas the GPU is more promising for a large number of large FFTs. The results also confirm that the FPGA-based implementation is faster than the built-in IP-core modules of Xilinx. A hardware synthesis for the FPGA is also provided.
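
To make the complexity claim concrete, the following is a minimal software sketch of a recursive radix-2 decimation-in-time FFT next to a direct O(n^2) DFT. It is written in Python purely for illustration and is not the Verilog or OpenCL implementation evaluated in this paper. The function names `dft_naive` and `fft_radix2_dit` are hypothetical, and the sketch assumes the input length is a power of two.

```python
import cmath


def dft_naive(x):
    """Direct evaluation of the DFT definition: O(n^2) complex multiplications."""
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT: O(n log n).

    Assumption for this sketch: len(x) is a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Decimation in time: split into even- and odd-indexed samples.
    even = fft_radix2_dit(x[0::2])
    odd = fft_radix2_dit(x[1::2])

    out = [0j] * n
    half = n // 2
    for k in range(half):
        # Butterfly: combine the two half-size transforms with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out
```

For a power-of-two input such as `[1, 2, 3, 4, 0, 0, 0, 0]`, both functions should agree to within floating-point rounding. The recursion halves the problem at each level and performs n/2 butterflies per level, which is where the log n factor comes from.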