A highly efficient VLSI architecture for the (9/7) 2-D DWT based on a lifting scheme is presented in the paper. The proposed architecture processes the row and column transforms simultaneously, eliminates the memory buffer for the column transform coefficients. The hardware utilization is improved up to 100% by processing two independent data streams together using shared arithmetic functional blocks. And the embedded boundary extension circuit is exploited to optimize the architecture. Compared to previous architectures, the proposed architecture has more efficiency on critical path, power consumption, temporal storage usage and hardware utilization.