The inverse discrete cosine transform (IDCT) is a key component of today's JPEG and MPEG decoders. Of all the stages in decoding a JPEG file, the IDCT is the most computationally intensive, so fast and efficient implementations are required, whether in software or in hardware. Numerous designs for computing the 1D-IDCT have been proposed. In this paper, we describe a fast hardware implementation of a two-dimensional IDCT (2D-IDCT) architecture that implements a variation of the modified Loeffler algorithm. Our 2D-IDCT incorporates two of our 1D-IDCT cores and a transpose network to provide a stall-free pipeline. The design has been functionally verified, synthesized, and tested on a Xilinx Virtex-II FPGA. The FPGA implementation is an eight-wide pipeline running at a clock frequency of 102 MHz, which gives a throughput of over 800 M coefficients per second. We suggest ideas to parallelize the design and further enhance performance. We also describe an ASIC design of the HDL model that operates at a clock frequency of 154 MHz using TSMC's 0.18 µm CMOS technology. Our VHDL implementation is released as "open source".
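
As an illustration of the row-column structure summarized above (a 1D-IDCT pass, a transpose, and a second 1D-IDCT pass), the following minimal Python sketch may be helpful. It is not the VHDL design and does not use the modified Loeffler algorithm: it assumes the direct, orthonormally scaled 1D-IDCT formula for clarity, and the names `idct_1d`, `transpose`, `idct_2d`, `LANES`, and `CLOCK_HZ` are hypothetical. The last lines only reproduce the throughput arithmetic from the figures quoted above.

```python
import math

def idct_1d(coeffs):
    """Direct O(N^2) 1D inverse DCT with orthonormal scaling.

    Assumption: this stands in for a 1D-IDCT core; a fast algorithm
    such as Loeffler's computes the same result with fewer multiplies.
    """
    n = len(coeffs)
    out = []
    for x in range(n):
        acc = 0.0
        for u in range(n):
            scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
            acc += scale * coeffs[u] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
        out.append(acc)
    return out

def transpose(block):
    """Swap rows and columns (the role of the transpose network)."""
    return [list(col) for col in zip(*block)]

def idct_2d(block):
    """Separable 2D IDCT: 1D pass on rows, transpose, 1D pass on columns."""
    first_pass = [idct_1d(row) for row in block]
    second_pass = [idct_1d(col) for col in transpose(first_pass)]
    # Transpose back so the output has the same orientation as the input.
    return transpose(second_pass)

# Throughput arithmetic from the quoted figures (hypothetical names).
LANES = 8                 # eight-wide pipeline
CLOCK_HZ = 102_000_000    # 102 MHz FPGA clock
throughput = LANES * CLOCK_HZ  # coefficients per second
```

For example, an 8x8 block whose only nonzero coefficient is a DC value of 64 decodes to a flat block of 8.0 under this scaling, and the throughput expression evaluates to 816 million coefficients per second, consistent with the "over 800 M" figure.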