Singular value decomposition (SVD) is one of the most important factorizations in matrix computation. However, computing SVD is still time-consuming, especially when the dimension of matrices exceeds tens of thousands. In this paper, we present a high performance approach called “Bisection and Twisted” (BT) for solving bidiagonal SVD. As modern general purpose GPUs have shown their extreme computational advantages in parallel computing, we implement the BT algorithm on single and multiple GPUs. With our carefully designed GPU kernels, the BT algorithm is about 10 times faster than MKL divide-and-conquer routine DBDSDC on an 8-core 2.53GHz CPU, and 36 times faster than CULA QR routine DBDSQR on the same GPUs. Additionally, the BT algorithm is able to compute SVD for matrices of size 1 million by 1 million with only two GPUs. To the best of our knowledge, no implementation has achieved such a scale.