The complexity of explicit parallel programming limits the performance gains that programmers can achieve on chip multiprocessors (CMPs). To simplify programming for large-scale CMPs, we present a task-level superscalar microarchitecture that acts as the Control Processor (CP) of the Multi-Level Computing Architecture (MLCA), a novel multicore architecture targeted at embedded multimedia and streaming applications. The task-level superscalar processor consists of a ten-stage task pipeline and exploits parallelism among tasks through hardware register renaming and out-of-order execution, much as a traditional superscalar processor exploits instruction-level parallelism. Experimental results show that our design scales up to 256 processors while maintaining a relatively low resource overhead, and that it achieves a much faster task dependency decode rate, and consequently a greater performance speedup, than the software runtime.
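
The idea of renaming registers at task granularity can be illustrated with a small software analogy. The sketch below is not the hardware design presented here; it only assumes that each task declares the registers it reads and writes, and all names (`Task`, `rename`, `schedule`) are hypothetical. Each write is given a fresh physical register, so only true read-after-write dependencies remain, and tasks whose producers have finished can be issued out of program order.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    reads: list   # architectural registers read by the task
    writes: list  # architectural registers written by the task


def rename(tasks):
    """Rename task operands and collect true (read-after-write) dependencies.

    Hypothetical helper: every write gets a fresh physical register, which
    removes write-after-write and write-after-read hazards between tasks.
    """
    latest = {}      # architectural register -> (physical register, producer)
    next_phys = 0
    renamed, deps = [], {}
    for t in tasks:
        # Sources use the most recent mapping; unmapped registers keep
        # their architectural name (treated as initial values).
        srcs = [latest[r][0] if r in latest else r for r in t.reads]
        deps[t.name] = {latest[r][1] for r in t.reads if r in latest}
        dsts = []
        for w in t.writes:
            phys = f"p{next_phys}"
            next_phys += 1
            latest[w] = (phys, t.name)
            dsts.append(phys)
        renamed.append((t.name, srcs, dsts))
    return renamed, deps


def schedule(tasks):
    """Group tasks into waves; a task issues once all its producers are done."""
    _, deps = rename(tasks)
    done, waves = set(), []
    pending = [t.name for t in tasks]
    while pending:
        ready = [n for n in pending if deps[n] <= done]
        waves.append(ready)
        done.update(ready)
        pending = [n for n in pending if n not in done]
    return waves


if __name__ == "__main__":
    program = [
        Task("A", reads=[], writes=["r1"]),
        Task("B", reads=["r1"], writes=["r2"]),
        Task("C", reads=[], writes=["r1"]),   # reuses r1: a false dependency
        Task("D", reads=["r1"], writes=["r3"]),
    ]
    print(schedule(program))  # [['A', 'C'], ['B', 'D']]
```

In this example, task C reuses `r1` but is renamed to a different physical register, so it can run alongside A rather than waiting for B to finish reading the old value.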