Many multiprocessor list scheduling heuristics that account for interprocessor communication delay have been proposed in recent years. However, no uniform comparative study of published heuristics has been performed in almost 20 years. This paper presents the results of a large quantitative study using random, but program-like input graphs. We found differences in the performance of the various heuristics...
New field-programmable gate array (FPGA) technologies have increased the industrial interest in tools which map a DSP application and a set of performance constraints to a specific VLSI architecture. This paper presents an optimization methodology for mapping a DSP application and a set of performance constraints into an architecture targeted for FPGA technologies with user-programmable RAM blocks...
Directory-based protocols are currently the method of choice for enforcing cache coherence in large-scale shared-memory multiprocessors. These hardware schemes suffer from two main problems: limited scalability, although various suggestions have been made to ameliorate this drawback, and performance loss due to false sharing. Software-controlled cache coherence (SCCC) is an alternative...
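The directory mechanism the abstract refers to can be pictured with a textbook full-map sketch: the directory records which processors hold a copy of a line and invalidates sharers on a write. This is a generic MSI-style illustration with hypothetical names, not the SCCC scheme or any protocol from the paper:

```python
class Directory:
    """Minimal full-map directory for one cache line (textbook sketch)."""

    def __init__(self):
        self.sharers = set()   # processors holding a read-only copy
        self.owner = None      # processor holding an exclusive (modified) copy

    def read(self, pid):
        """A processor reads the line; a dirty copy must be written back first."""
        msgs = []
        if self.owner is not None and self.owner != pid:
            msgs.append(f"writeback from P{self.owner}")
            self.sharers.add(self.owner)
            self.owner = None
        self.sharers.add(pid)
        return msgs

    def write(self, pid):
        """A processor writes the line; all other copies are invalidated."""
        msgs = [f"invalidate P{p}" for p in sorted(self.sharers - {pid})]
        if self.owner is not None and self.owner != pid:
            msgs.append(f"invalidate P{self.owner}")
        self.sharers = set()
        self.owner = pid
        return msgs
```

The invalidation messages produced on each write are exactly the traffic that false sharing inflates: two processors touching disjoint words of the same line still invalidate each other.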
Presents an adaptive unicast and multicast routing algorithm, called adaptive-cast, for 3D mesh networks with wormhole routing and virtual channel flow control. The unique feature of adaptive-cast is that it remains valid when messages with a single destination (unicast) and with multiple destinations (multicast) are mixed together, which drastically simplifies the implementation of the router...
Write buffers have a significant impact on performance, especially in wide-issue superscalar systems with write-through caching. We develop fast, efficient simulation methods for evaluating multiple write-buffer configurations together in a single pass. Our results are also applicable to the simulation of other buffer structures. We first consider simulating non-coalescing write buffers. We show that...
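As a concrete picture of why buffer depth matters, here is a minimal cycle-level model of a non-coalescing FIFO write buffer. This is an illustrative sketch with assumed parameters (one serial memory port, fixed retire latency), not the single-pass method the paper develops:

```python
def write_buffer_stalls(write_times, depth, retire_latency):
    """Count processor stall cycles for a non-coalescing FIFO write buffer.

    write_times: cycles at which the processor issues writes (non-decreasing)
    depth: number of buffer entries
    retire_latency: cycles for one buffered write to retire to memory
    """
    buffer = []   # completion (retire) times of in-flight writes, oldest first
    stall = 0
    for t in write_times:
        t += stall                              # earlier stalls delay this write
        buffer = [c for c in buffer if c > t]   # drop entries already retired
        if len(buffer) == depth:                # buffer full: stall until the
            stall += buffer[0] - t              # oldest entry drains
            t = buffer[0]
            buffer.pop(0)
        last = buffer[-1] if buffer else t      # serial port: retire after the
        buffer.append(last + retire_latency)    # previous buffered write
    return stall
```

With a burst of back-to-back writes, deepening the buffer monotonically reduces stalls until the burst fits entirely, e.g. `write_buffer_stalls([0, 1, 2], 1, 5)` stalls far more than the same trace with depth 3.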
A new, improved version of the classic binary non-restoring division algorithm is presented. It is implemented on a systolic on-line architecture targeted at digital signal processing applications. The overall goal is to implement DSP algorithms using redundant data representations throughout the algorithm, and to obtain a balanced architecture according to the specifications of the application...
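For reference, the classic binary non-restoring algorithm that the paper improves upon works by choosing a quotient digit in {-1, +1} from the sign of the partial remainder at each step, with a single correction at the end. The integer formulation and function name below are illustrative; the paper's systolic, redundant-representation on-line variant is not reproduced here:

```python
def nonrestoring_divide(dividend, divisor, n):
    """Classic binary non-restoring integer division over n iterations."""
    assert divisor > 0 and 0 <= dividend < (divisor << n)
    r = dividend      # partial remainder
    q = 0             # quotient digits encoded as bits: 1 -> +1, 0 -> -1
    for i in range(n - 1, -1, -1):
        if r >= 0:    # remainder non-negative: subtract, digit +1
            q = (q << 1) | 1
            r -= divisor << i
        else:         # remainder negative: add back, digit -1
            q = (q << 1)
            r += divisor << i
    # convert the {-1, +1} digit string to an ordinary binary quotient:
    # Q = 2B - (2^n - 1), where B is the bit-encoded digit string
    q = (q << 1) + 1 - (1 << n)
    if r < 0:         # single final correction (no per-step restoring)
        r += divisor
        q -= 1
    return q, r
```

Unlike restoring division, a wrong subtraction is never undone immediately; the sign simply flips the next operation, which is what makes the recurrence attractive for pipelined and systolic hardware.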
We present a new scalar processor for high-speed vector processing, and its evaluation. The proposed processor can hide long main-memory access latency by introducing slide-windowed floating-point registers with a data-preloading feature, together with a pipelined memory. Owing to the slide-window structure, the proposed processor can utilize more floating-point registers while keeping upward compatibility with existing...
Extracting fine-grain parallelism is essential to further increases in processor performance. This paper investigates an extension of the VLIW architecture, called V++, which retains the ability of VLIW architectures to effectively exploit fine-grain parallelism while introducing facilities for restructuring very long instruction words dynamically. V++ adopts two types of restructuring methods:...
Advances in technology and computer design are resulting in impressive increases in raw processor power. Currently, new processor implementations are showing almost a doubling in clock frequency. Moreover, with each new generation, processor designers are incorporating more advanced architectural techniques, such as instruction-level parallelism, into these implementations. Memory technology also continues...
The increasing disparity between processor speed and main-memory speed makes way for multi-level cache hierarchies in almost all of today's computer systems; specifically, second-level (L2) caches, with larger capacity but longer access time than first-level (L1) caches, have been adopted to reduce this memory gap. In this study an enhanced one-pass trace-driven simulation technique is used...
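One-pass trace-driven simulation in the classic sense of Mattson et al. can be sketched as follows: a single LRU stack walk over the reference trace records each reference's stack distance, from which the hit count of every fully associative LRU cache size falls out at once. This is a generic illustration of the single-pass idea, not the enhanced L2 technique of this study:

```python
def lru_stack_distances(trace):
    """One pass over the trace: histogram of LRU stack distances.

    A distance of None marks a cold (first-reference) miss.
    """
    stack = []       # block addresses, most recently used at index 0
    hist = {}        # stack distance -> number of references
    for block in trace:
        if block in stack:
            d = stack.index(block)   # 0 = re-reference of the MRU block
            stack.pop(d)
        else:
            d = None                 # never seen before
        stack.insert(0, block)
        hist[d] = hist.get(d, 0) + 1
    return hist

def hits_for_size(hist, size):
    """A reference hits in an LRU cache of `size` blocks iff distance < size."""
    return sum(n for d, n in hist.items() if d is not None and d < size)
```

The inclusion property of LRU is what makes this work: one simulation run answers "how many hits?" for all cache sizes simultaneously, instead of re-running the trace per configuration.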
As microprocessor speeds increase, memory bandwidth is rapidly becoming the performance bottleneck in the execution of vector-like algorithms. Although caching provides adequate performance for many problems, caching alone is an insufficient solution for vector applications with poor temporal and spatial locality. Moreover, the nature of memories themselves has changed. Current DRAM components should...
We examine the impact of using flash memory as a second-level file system buffer cache to reduce power consumption and file access latency on a mobile computer. We use trace-driven simulation to evaluate the impact of what we call a FLASHCACHE. We relate the power consumption and access latency of the storage sub-system to the characteristics of the FLASHCACHE: its size, the unit of erasure, and access...
We propose that optical packet switching networks are better implemented through space division switching (SDS) approaches than through wavelength division multiplexing (WDM) approaches. We show that active optical networks designed from optically controlled nonblocking networks can provide higher efficiency and lower latency than other approaches. Our self-routing optical crossbar...
The single address space that shared-memory architectures offer simplifies programming, problem partitioning, and dynamic load balancing compared to other programming models for parallel computing systems, such as message passing. Unfortunately, as we scale shared-memory architectures to large configurations, the resulting memory-system latencies may limit their performance potential. Finding...
Shared memory architectures often have caches to reduce the number of slow remote memory accesses. The largest possible caches exist in shared memory architectures called Cache-Only Memory Architectures (COMAs). In a COMA all the memory resources are used to implement large caches. Unfortunately, these large caches also have their price. Due to its lack of physically shared memory, COMA may suffer...
Presents two hardware-controlled update-based cache coherence protocols. The authors discuss the two major disadvantages of update protocols: the inefficiency of updates and the mismatch between the granularity of synchronization and that of data transfer. They present two enhancements to the update-based protocols, a write-combining scheme and finer-grain synchronization, to overcome these disadvantages...
Shared memory multiprocessors generally use caches to improve the performance. This introduces the cache coherence problem. Multiple copies of the data need to be kept consistent by using a suitable mechanism. The paper presents a novel mechanism for organizing the memory modules in order to provide an inexpensive implementation for cache coherence. The interleaved directory scheme uses a unique address...
As the number of processors increases, so does communication latency and the probability of component failure. A technique that addresses these problems is data replication, which provides faster access and greater availability. Its drawback is that the replicas must be kept consistent. The author describes a family of fault-tolerant algorithms for maintaining the consistency of cacheable data. Processors...
Synchronization and remote memory access delays cause staggering inefficiency in most shared-memory programs when run on thousands of processors. The authors introduce efficient lock synchronization using the combination of group write consistency, which guarantees write ordering within groups of processors, and eagersharing distributed memory, which sends newly written data values over a fast network...
Performance in large-scale shared-memory multiprocessors depends on finding a scalable solution to the memory-latency problem. The author shows that protect consistency (PRC) relaxes previous consistency models with two distinct performance benefits. First, PRC is used to expose and exploit more parallelism in the computation, giving better support to latency tolerance. Second, assuming that visible...