Binary Decision Diagrams (BDDs) are used extensively in VLSI CAD for verification, synthesis, logic minimization and testing. Parallel algorithms for Boolean Function Manipulation using BDDs have been proposed and implemented on a Connection Machine (CM-5). Abstractions have been developed to support the design of these algorithms using the message passing model of parallel programming. A Distributed...
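For readers unfamiliar with the data structure behind the abstract above, the following is a minimal C sketch of a standard reduced ordered BDD node representation; it is illustrative only and does not reproduce the paper's parallel algorithms or its Connection Machine implementation.

```c
#include <stdio.h>
#include <stdlib.h>

/* Minimal ROBDD node representation -- an illustrative sketch only, not the
 * data structure used in the paper. Terminals use var == -1 and a 0/1 value. */
typedef struct bdd_node {
    int var;                   /* variable index; -1 for a terminal */
    int value;                 /* 0 or 1, meaningful only for terminals */
    struct bdd_node *low;      /* cofactor with var = 0 */
    struct bdd_node *high;     /* cofactor with var = 1 */
} bdd_node;

static bdd_node bdd_false = { -1, 0, NULL, NULL };
static bdd_node bdd_true  = { -1, 1, NULL, NULL };

/* Create a node, applying the reduction rule that removes redundant tests.
 * A full package would also hash-cons nodes so isomorphic subgraphs are shared. */
static bdd_node *mk(int var, bdd_node *low, bdd_node *high)
{
    if (low == high)
        return low;
    bdd_node *n = malloc(sizeof *n);
    n->var = var; n->value = 0; n->low = low; n->high = high;
    return n;
}

int main(void)
{
    /* BDD for x0 AND x1 with variable order x0 < x1 */
    bdd_node *x1 = mk(1, &bdd_false, &bdd_true);
    bdd_node *f  = mk(0, &bdd_false, x1);
    printf("root tests x%d\n", f->var);
    return 0;
}
```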
S3.mp (Sun's Scalable Shared memory MultiProcessor) is a research project to demonstrate a low overhead, high throughput communication system that is based on cache coherent distributed shared memory (DSM). S3.mp uses distributed directories and point-to-point messages that are sent over a packet switched interconnect fabric to achieve scalability over a wide range of configurations. S3.mp uses a...
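As a generic illustration of the distributed-directory idea mentioned above, a full-map directory tracks, for each memory line, which nodes hold a copy; the sketch below is a conventional textbook form, not the actual S3.mp directory format or protocol.

```c
#include <stdint.h>
#include <stdio.h>

/* Generic full-map directory entry for one memory line -- an illustrative
 * sketch, not the S3.mp directory format. */
#define MAX_NODES 64

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state;

typedef struct {
    dir_state state;    /* coherence state of the line */
    uint64_t  sharers;  /* one presence bit per node holding a copy */
    int       owner;    /* owning node when state == DIR_EXCLUSIVE */
} dir_entry;

/* On a write miss the home node consults the entry and sends a point-to-point
 * invalidation to each node whose presence bit is set; here we just count them. */
static int invalidations_needed(const dir_entry *e, int writer)
{
    int count = 0;
    for (int node = 0; node < MAX_NODES; node++)
        if (((e->sharers >> node) & 1) && node != writer)
            count++;
    return count;
}

int main(void)
{
    dir_entry e = { DIR_SHARED, (1ull << 3) | (1ull << 7) | (1ull << 12), -1 };
    printf("invalidations for a write by node 3: %d\n", invalidations_needed(&e, 3));
    return 0;
}
```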
Directory-based protocols are currently the method of choice for enforcing cache coherence in large-scale shared-memory multiprocessors. The problems associated with these hardware schemes include their lack of scalability (although various suggestions have been made to ameliorate this drawback) and the loss of performance due to false sharing. Software controlled cache coherence (SCCC) is an alternative...
This paper presents a new fast way to simulate large networks of computers. The method uses a frontend EC, which accepts a parallel C program and translates it into a program in an intermediate language for parallel system simulations. An event driven simulator for distributed shared memory systems, DSIM, uses the intermediate language to simulate and obtain efficiency results in networks of thousands...
Fast computer simulation is an essential tool in the design of large parallel computers. We discuss the design and performance of our Fast Accurate Simulation Tool, FAST. We start by summarizing the tradeoffs made in the designs of this and other simulators. The key ideas used in this simulator involve execution driven simulation techniques that modify the object code of the application program being...
The single address space that shared-memory architectures offer simplifies programming, problem partitioning, and dynamic load balancing compared to other programming models for parallel computing systems, such as message passing. Unfortunately, as we scale shared-memory architectures to large configurations, the resulting memory system latencies may limit their performance potential. Finding...
Shared memory architectures often have caches to reduce the number of slow remote memory accesses. The largest possible caches exist in shared memory architectures called Cache-Only Memory Architectures (COMAs). In a COMA all the memory resources are used to implement large caches. Unfortunately, these large caches also have their price. Due to its lack of physically shared memory, COMA may suffer...
Presents two hardware-controlled update-based cache coherence protocols. The authors discuss the two major disadvantages of update protocols: the inefficiency of updates and the mismatch between the granularity of synchronization and that of data transfer. They present two enhancements to the update-based protocols, a write combining scheme and a finer-grain synchronization, to overcome these disadvantages...
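To make the write-combining idea concrete, here is a small software model of a write-combining buffer; it is a sketch of the general technique only, not the hardware enhancement described in the paper. Consecutive stores that hit the same line are merged and shipped as one update message instead of one message per store.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 64

typedef struct {
    uint64_t line_addr;          /* address of the line being combined */
    uint8_t  data[LINE_BYTES];   /* combined data */
    uint64_t dirty_mask;         /* one bit per dirty byte in the line */
    int      valid;
} wc_buffer;

/* Stand-in for the network send of an update message. */
static void send_update(uint64_t line_addr, const uint8_t *data, uint64_t mask)
{
    (void)data;
    printf("update line 0x%llx, dirty mask 0x%llx\n",
           (unsigned long long)line_addr, (unsigned long long)mask);
}

static void wc_write(wc_buffer *b, uint64_t addr, uint8_t byte)
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
    if (b->valid && b->line_addr != line) {       /* new line: flush the old one */
        send_update(b->line_addr, b->data, b->dirty_mask);
        b->valid = 0;
    }
    if (!b->valid) {
        b->line_addr = line;
        b->dirty_mask = 0;
        memset(b->data, 0, LINE_BYTES);
        b->valid = 1;
    }
    b->data[addr & (LINE_BYTES - 1)] = byte;
    b->dirty_mask |= 1ull << (addr & (LINE_BYTES - 1));
}

int main(void)
{
    wc_buffer b = { 0 };
    for (uint64_t a = 0x1000; a < 0x1008; a++)    /* eight stores to one line */
        wc_write(&b, a, (uint8_t)a);
    wc_write(&b, 0x2000, 1);                      /* next line flushes the first */
    return 0;
}
```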
Shared memory multiprocessors generally use caches to improve performance. This introduces the cache coherence problem: multiple copies of the data need to be kept consistent by a suitable mechanism. The paper presents a novel mechanism for organizing the memory modules in order to provide an inexpensive implementation of cache coherence. The interleaved directory scheme uses a unique address...
Compares the performance, in shared-memory multiprocessors, of locating translation-lookaside buffers (TLBs) at processors with that of locating TLBs at memory. The comparison is based on trace-driven simulations of multiprocessors with log N-stage networks interconnecting N processors and N memory modules. For the systems and workloads studied, memory-based TLBs perform noticeably better than processor-based...
Synchronization and remote memory access delays cause staggering inefficiency in most shared memory programs when run on thousands of processors. The authors introduce efficient lock synchronization using the combination of group write consistency, which guarantees write ordering within groups of processors, and eager sharing distributed memory, which sends newly written data values over a fast network...
Performance in large-scale shared-memory multiprocessors depends on finding a scalable solution to the memory-latency problem. The author shows that protect consistency (PRC) relaxes previous consistency models with two distinct performance benefits. First, PRC can be used to expose and exploit more parallelism in the computation, giving better support for latency tolerance. Second, assuming that visible...
We outline an approach to compiling for distributed-memory multiprocessors that is inherited from compiler technologies for shared-memory multiprocessors. We believe this approach to compiling for distributed-memory machines is promising because it is a logical extension of the shared-memory parallel programming model, a model that is easier for programmers to work with and that has been studied...
This paper addresses a purely software-based solution to the multiprocessor cache coherence problem by structuring an operating system to provide for the coherence of its own data while exporting coherent memory to user processes. Also covered are the results of a proof-of-concept port of Mach 3.0, using the principles in this paper, to a prototype of the IBM Shared Memory System POWER/4, a Shared Memory...
Parallel computing on a network of workstations can saturate the communication network leading to excessive message delays and consequently poor application performance. We examine empirically the consequences of integrating a flow control protocol, called Warp control, into Mermera, a software shared memory system that supports parallel computing on distributed systems. For an asynchronous iterative...
Architecture neutrality, reliability, and support for reactive programs are the primary goals of the coordination and programming language model TSD (Transactions on Shared Data). The basic execution units are transactions that communicate through shared data. Data assigned to variables through unification are immutable; the presence or absence of data is used for synchronization. Transactions report...
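Synchronization by the presence or absence of data resembles a single-assignment variable: readers block until a value has been bound, and a bound value never changes. The C sketch below illustrates that general idea only; it is not TSD's implementation, and the names are invented.

```c
#include <pthread.h>
#include <stdio.h>

/* Illustrative single-assignment cell (loosely analogous to presence/absence
 * synchronization; not taken from TSD). */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  bound;
    int             present;   /* 0 = absent, 1 = a value has been assigned */
    long            value;
} sa_cell;

static void sa_init(sa_cell *c)
{
    pthread_mutex_init(&c->lock, NULL);
    pthread_cond_init(&c->bound, NULL);
    c->present = 0;
}

static int sa_bind(sa_cell *c, long v)   /* 0 on success, -1 if already bound */
{
    pthread_mutex_lock(&c->lock);
    if (c->present) { pthread_mutex_unlock(&c->lock); return -1; }
    c->value = v;
    c->present = 1;
    pthread_cond_broadcast(&c->bound);
    pthread_mutex_unlock(&c->lock);
    return 0;
}

static long sa_read(sa_cell *c)          /* blocks until the cell is bound */
{
    pthread_mutex_lock(&c->lock);
    while (!c->present)
        pthread_cond_wait(&c->bound, &c->lock);
    long v = c->value;
    pthread_mutex_unlock(&c->lock);
    return v;
}

int main(void)
{
    sa_cell c;
    sa_init(&c);
    sa_bind(&c, 42);
    printf("value: %ld\n", sa_read(&c));
    return 0;
}
```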
In conventional parallel processing, the main objective of scheduling is to reduce the processors' idle time. However, in Time Warp (TW), which is an optimistic parallel discrete event simulation approach, keeping the processors busy does not necessarily lead to good performance, since the processors may be performing erroneous computations that must eventually be rolled back. Hence, the existing...
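The rollback behavior that makes TW scheduling unusual can be pictured schematically as below; the structure and names are invented for illustration and are not the scheduling scheme proposed by the authors. An event arriving in the past of local virtual time (a straggler) means the optimistic computation went too far and must be undone.

```c
#include <stdio.h>

typedef struct {
    double lvt;    /* local virtual time of this logical process */
    /* state queue, input queue, and output queue omitted */
} lp_t;

typedef struct {
    double timestamp;
    /* event payload omitted */
} event_t;

static void restore_state_before(lp_t *lp, double t)
{
    (void)lp; (void)t;           /* stub: reload the last state saved before t */
}

static void send_antimessages_after(lp_t *lp, double t)
{
    (void)lp; (void)t;           /* stub: cancel messages sent with timestamps > t */
}

static void on_event(lp_t *lp, const event_t *ev)
{
    if (ev->timestamp < lp->lvt) {                /* straggler: roll back */
        restore_state_before(lp, ev->timestamp);
        send_antimessages_after(lp, ev->timestamp);
        printf("rollback from lvt %.1f to %.1f\n", lp->lvt, ev->timestamp);
    }
    lp->lvt = ev->timestamp;                      /* process the event */
}

int main(void)
{
    lp_t lp = { 0.0 };
    event_t e1 = { 10.0 }, e2 = { 5.0 };
    on_event(&lp, &e1);
    on_event(&lp, &e2);                           /* arrives in the past: rollback */
    return 0;
}
```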
We present a linear time algorithm for scheduling iterations of a loop that has no loop-carried dependences. The algorithm is optimal in the sense that any p consecutive iterations in the schedule can be executed simultaneously without any possibility of false sharing, where p is the number of processors, and the algorithm uses at most two wait synchronizations per iteration. Our algorithm is asynchronous...
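One plausible way to see how a schedule can rule out false sharing (a hedged sketch, not necessarily the authors' algorithm): give each processor a contiguous block of iterations whose output elements are aligned to cache-line boundaries, so no two concurrently running iterations write into the same line. The element size and line size below are assumptions for the example.

```c
#include <stdio.h>

#define LINE_BYTES     64
#define ELEM_BYTES      8                   /* assuming 8-byte array elements */
#define ELEMS_PER_LINE (LINE_BYTES / ELEM_BYTES)

/* Compute processor proc's half-open iteration range [*lo, *hi) out of n
 * iterations, with block boundaries aligned to cache lines of the output array. */
static void block_schedule(int n, int nprocs, int proc, int *lo, int *hi)
{
    int lines = (n + ELEMS_PER_LINE - 1) / ELEMS_PER_LINE;   /* lines touched */
    int lines_per_proc = (lines + nprocs - 1) / nprocs;      /* ceiling division */
    *lo = proc * lines_per_proc * ELEMS_PER_LINE;
    *hi = (proc + 1) * lines_per_proc * ELEMS_PER_LINE;
    if (*lo > n) *lo = n;
    if (*hi > n) *hi = n;
}

int main(void)
{
    int lo, hi;
    for (int p = 0; p < 4; p++) {
        block_schedule(1000, 4, p, &lo, &hi);
        printf("proc %d: iterations [%d, %d)\n", p, lo, hi);
    }
    return 0;
}
```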
In writing a highly-portable parallel program, we developed a library of parallel primitives to shield our application code from the various multiprocessors. Rather than adapt the different multiprocessors to a program-specific library via extensive implementation, we chose to first find a common 'intersection' of the virtual machine models provided by each vendor, define a common interface to that...
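The "common intersection" approach can be pictured as a small machine-independent interface that each vendor's library is wrapped behind; the header below is a hypothetical sketch with invented names, not the authors' actual primitive library. Each target multiprocessor would supply a thin implementation file mapping these calls onto its own threads, locks, and barriers.

```c
/* Hypothetical machine-independent parallel primitives (illustrative names). */
#ifndef PAR_PRIMS_H
#define PAR_PRIMS_H

typedef struct par_lock    par_lock;
typedef struct par_barrier par_barrier;

int  par_nprocs(void);                     /* number of processors in use */
int  par_self(void);                       /* this processor's id, 0..n-1 */

par_lock *par_lock_create(void);
void      par_lock_acquire(par_lock *l);
void      par_lock_release(par_lock *l);

par_barrier *par_barrier_create(int nprocs);
void         par_barrier_wait(par_barrier *b);

/* Fork nprocs workers running fn(arg) and wait for all of them to finish. */
void par_run(int nprocs, void (*fn)(void *arg), void *arg);

#endif /* PAR_PRIMS_H */
```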