Energy consumption is a critical issue in embedded system design. One way to improve energy efficiency is to complete execution as early as possible. Multi-threaded processors reduce execution time by exploiting both instruction-level and thread-level parallelism, and thus offer an effective solution for saving energy. In a typical multi-threaded processor design, whenever the instruction pipeline has to stall due to a high-latency operation, execution is switched to another thread so that the computing resources are effectively utilized and processor throughput is improved. However, traditional designs select threads with basic scheduling schemes, such as round robin, which are not suitable for real-time execution and are inefficient for a set of threads with unbalanced execution durations. In this paper, we propose 1) a thread scheduling approach that extends the life span of short threads to ensure efficient utilization of processor resources, and 2) a zero-switching-time hardware design, so as to achieve a minimal execution time for a given set of applications. We demonstrate the effectiveness of our design through experiments.
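The inefficiency of round robin on unbalanced threads can be illustrated with a toy model. The sketch below is not the scheduling approach proposed in this paper; it is a minimal simulation under stated assumptions: a single-issue processor, zero switching cost, and threads made of one-cycle compute bursts, each followed by a fixed stall. The function names (`simulate`, `round_robin`, `longest_remaining_first`) and the longest-remaining-first policy are hypothetical stand-ins, chosen only to show what happens when short threads are spread across the stalls of a long thread.

```python
def simulate(bursts, latency, pick):
    """Return the number of cycles needed to issue every burst of every thread.

    bursts[i] -- number of one-cycle compute bursts in thread i (assumed model)
    latency   -- stall cycles after each burst, during which the thread cannot issue
    pick      -- policy function: (ready_ids, remaining, last_id) -> thread id
    """
    remaining = list(bursts)
    ready_at = [0] * len(bursts)   # first cycle at which each thread may issue
    last, cycle = -1, 0
    while any(remaining):
        ready = [i for i, r in enumerate(remaining)
                 if r > 0 and ready_at[i] <= cycle]
        if ready:                  # switch-on-stall with zero switching cost
            last = pick(ready, remaining, last)
            remaining[last] -= 1
            ready_at[last] = cycle + 1 + latency
        cycle += 1                 # an empty `ready` list means an idle cycle
    return cycle


def round_robin(ready, remaining, last):
    """Pick the first ready thread in cyclic order after the last one issued."""
    n = len(remaining)
    return min(ready, key=lambda i: (i - last - 1) % n)


def longest_remaining_first(ready, remaining, last):
    """Hypothetical alternative: favor the ready thread with the most work left."""
    return max(ready, key=lambda i: remaining[i])


if __name__ == "__main__":
    threads = [1, 1, 1, 1, 6]      # four short threads and one long thread
    print(simulate(threads, 2, round_robin))              # prints 20
    print(simulate(threads, 2, longest_remaining_first))  # prints 16
```

In this model, round robin retires the four short threads first, so the long thread later runs alone and none of its stalls can be hidden. Favoring the long thread keeps the short threads alive to fill those stalls, which reaches the lower bound of 16 cycles set by the long thread itself.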