We propose an efficient programmable parallel architecture for DSP and matrix algebra applications that exploits parallelism at the algorithm (topology) level via systolic/SIMD array processing and at the instruction level via a multiple-issue control processor capable of multithreading. Our premise is »One array – one chip«: systolic and SIMD processing are integrated on the same processor array together with the required data storage. Multithreading on systolic/SIMD arrays is analysed on examples, which show that substantial speedups (100%–800%) are possible when up to four threads are interleaved on a cycle-by-cycle basis. We target word-level processing element (PE) granularity and include support for floating-point operations. Furthermore, the array integrates data memory at two levels of hierarchy: local memory per PE (SIMD) and global memory for the whole processing array (systolic). The complexity of such a system is explored in detail, and it is shown that a 32-PE array can be implemented on a 120-million-transistor chip.
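As an illustration of why cycle-by-cycle interleaving of threads can raise throughput, the following is a minimal sketch of a single PE that issues at most one operation per cycle in round-robin order. It is a toy model under stated assumptions, not the evaluation method used in this work: the function name `cycles_to_finish`, the 4-cycle operation latency, and the assumption that every thread is a chain of dependent operations are all hypothetical choices made for illustration.

```python
def cycles_to_finish(num_threads: int, ops_per_thread: int, latency: int) -> int:
    """Cycle count for round-robin, cycle-by-cycle issue on one PE.

    Assumption (illustrative only): each thread is a chain of dependent
    operations, so its next operation may issue only `latency` cycles
    after its previous one issued. At most one operation issues per cycle.
    """
    remaining = [ops_per_thread] * num_threads
    ready_at = [0] * num_threads  # earliest cycle at which each thread may issue
    cycle = 0
    last_done = 0     # cycle at which the most recently issued op completes
    next_thread = 0   # round-robin pointer

    while any(remaining):
        # Scan threads in round-robin order and issue the first ready one.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + latency
                last_done = cycle + latency
                next_thread = (t + 1) % num_threads
                break
        cycle += 1  # the cycle is lost if no thread was ready
    return last_done


if __name__ == "__main__":
    LATENCY = 4  # assumed pipeline latency in cycles (hypothetical value)
    OPS = 100    # operations per thread
    for threads in (1, 2, 4):
        cycles = cycles_to_finish(threads, OPS, LATENCY)
        print(f"{threads} thread(s): {threads * OPS} ops in {cycles} cycles")
```

In this model a single thread leaves the PE idle while it waits for each result, whereas four interleaved threads keep it issuing on nearly every cycle. The speedup figures quoted in the abstract come from the analysed examples, not from this sketch.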