Instruction fetching accounts for a significant portion of total processor energy consumption, so energy as well as performance must be considered when designing high-performance embedded processors. In this paper, we present a hardware-based loop detection technique that reduces the energy consumed in the instruction fetch unit (the instruction cache and the branch prediction logic) of high-performance embedded processors. The proposed instruction fetch unit reduces instruction cache energy by replacing accesses to the large main instruction cache with accesses to a small selectively accessed cache (SAC). It also reduces the energy consumed in the branch prediction logic by reducing the number of unnecessary accesses to it. We evaluate the proposed design using a simulation infrastructure based on SimpleScalar and CACTI. Simulation results show that the proposed technique reduces energy consumption in the instruction cache and the branch prediction logic by 20% and 24% on average, respectively, with little performance loss compared to the traditional scheme.
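As an informal illustration of the idea, the following sketch counts how many fetches in an instruction address trace could be served by a small cache once a loop has been detected. It is not the proposed hardware design: it assumes that a loop is recognized when control transfers backward to an earlier address, that the loop body must fit in the small cache, and that the small cache is filled during the first iteration after detection. The names `simulate_fetch` and `sac_capacity` are hypothetical, the model reports access counts rather than energy, and the branch prediction logic is not modeled.

```python
def simulate_fetch(trace, sac_capacity=16):
    """Toy fetch model (an assumption-laden sketch, not the paper's design).

    trace        -- list of instruction addresses in execution order
    sac_capacity -- hypothetical small-cache size, in instructions
    Returns (main_cache_accesses, sac_accesses).
    """
    main_accesses = 0
    sac_accesses = 0
    loop = None      # (start, end) addresses of the loop being tracked
    sac = set()      # addresses currently held in the small cache

    for i, pc in enumerate(trace):
        # Leaving the tracked loop invalidates the small cache.
        if loop is not None and not (loop[0] <= pc <= loop[1]):
            loop, sac = None, set()

        if loop is not None and pc in sac:
            sac_accesses += 1          # served by the small cache
        else:
            main_accesses += 1         # served by the main instruction cache
            if loop is not None:
                sac.add(pc)            # fill the small cache on this pass

        # Assumed detection rule: a backward transfer marks a loop,
        # tracked only if its body fits in the small cache.
        nxt = trace[i + 1] if i + 1 < len(trace) else None
        if loop is None and nxt is not None and nxt <= pc:
            if pc - nxt + 1 <= sac_capacity:
                loop = (nxt, pc)

    return main_accesses, sac_accesses


# A four-instruction loop executed five times, with straight-line code around it.
trace = [8, 9] + [10, 11, 12, 13] * 5 + [14, 15]
main_accesses, sac_accesses = simulate_fetch(trace)
print(main_accesses, sac_accesses)
```

In this toy trace, the first iteration triggers detection, the second fills the small cache, and the remaining three iterations are served from it. If the loop body does not fit (for example `sac_capacity=2`), every fetch goes to the main cache.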