In multiprocessor systems, low-latency synchronization is essential for exploiting fine-grain data parallelism effectively and improving overall performance. This brief presents an efficient synchronization mechanism for embedded distributed multiprocessors. The proposed solution works in a completely decentralized request–response manner via explicit message exchange among the processing elements. Scalable lock and barrier synchronization algorithms, derived from the inherent distributed characteristics of the underlying architecture, are proposed to enable fair, orderly, and contention-free synchronization. We implement the proposed synchronization model in a distributed 32-core architecture using a commercial cycle-accurate SystemC simulation platform. Experimental results show that the proposed approach achieves ultralow synchronization latency and almost ideal scalability as the core count increases.