The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
For higher processing and computing power, chip multiprocessors (CMPs) have become the new mainstream architecture. This shift to CMPs has created many challenges for fully utilizing the power of multiple execution cores. One of these challenges is managing contention for shared resources. Most of the recent research address contention for shared resources by single-threaded applications. However,...
The instruction cache is a critical component in any microprocessor. It must have high performance to enable fetching of instructions on every cycle. However, current designs waste a large amount of energy on each access as tags and data banks from all cache ways are consulted in parallel to fetch the correct instructions as quickly as possible. Existing approaches to reduce this overhead remove unnecessary...
A collection of slides from the author's conference presentation is given. The following topics are discussed: VLSI design, automation & test; technology scaling; Moore's law; platform segment characteristics; dynamic platform control; dynamic adaptation & reconfiguration; resilient platforms; voltage-frequency range limiters; voltage-frequency margins; fine-grain power management; voltage...
This paper describes a micro-architecture for a custom programmable FPGA-based processor, with direct support for streaming and vector computations relying on custom cache memory storage. The processor combines a custom data-path with several parallel data ports for accessing operands in streaming mode thus efficiently supporting nested looping constructs found in high-level languages while mitigating...
Summary form only given. Database systems have long optimized for parallel execution; the research community has pursued parallel database machines since the early '80s, and several key ideas from that era underlie the design and success of commercial database engines today. Computer microarchitecture, however, has shifted drastically during the intervening decades. Until the end of the 20th century...
Systems based on many Chip Multi-Processor (CMP) modules interconnected by global networks constitute now a feasible solution, which brings back to life challenges of massively parallel systems. The paper presents new methods for data communication inside CMP modules and for inter-CMP-module data communication. Inside CMP modules data communication through shared variables is improved by the use of...
Multicore processors can deliver higher performance than single-core processors by exploiting thread level parallelism (TLP): applications are split into independent threads, each of which is mapped into a different core, reducing the execution time and potentially its worst-case execution time (WCET). Unfortunately, inter-thread interferences generated by simultaneous accesses to shared resources...
Static cache analysis for data allocated on the heap is practically impossible for standard data caches. We propose a distinct object cache for heap allocated data. The cache is highly associative to track symbolic object addresses in the static analysis. Cache lines are organized to hold single objects and individual fields are loaded on a miss. This cache organization is statically analyzable and...
When there are several application running on Chip-Multiprocessors (CMPs), it is a problem to allocate the on-chip cache capacities between these competing applications. Cache partitioning is commonly used to solve this problem. Existing cache partitioning schemes either dedicate to the shared design or partition the last level cache depending on limited memory information. This paper presents Private...
Transactional memory (TM) is a new shared resource synchronization mechanism which was proposed to ease the difficulty of parallel programming. Currently, most hardware transactional memory systems leverages the extended directory based cache coherence protocol to resolve transaction conflicts; seldom research has been conducted to extend a snoopy coherence based chip multi-processor with TM support...
Three-dimensional integration has the potential to increase integration density and to reduce communication latency of chip-multiprocessors (CMPs). However, high power density (i.e., power dissipation per unit volume) due to the high integration incurs temperature-related problems in reliability, power consumption, performance, and system cooling cost. In this paper, we propose a design-time solution...
On a CMP (Chip Multi-Processor) architecture, cache sharing impacts threads non-uniformly, where some threads may be slowed down significantly, while others are not. This may cause severe performance problems such as throughput decreasing, cache thrashing. This paper proposes an architectural support predicting method (ASPM) to predict inter-thread cache contention, and schedules threads based on...
3D integration enables building caches from different types of technologies such as SRAM, Magnetic RAM (MRAM), DRAM, and Phase-change RAM (PRAM). Hybrid cache architectures (HCAs) have been proposed to take advantage of the benefits offered by these types of technologies. Employing this novel cache architecture to build shared caches in chip multiprocessors (CMPs) can lead to significant performance...
Given the diverse range of application characteristics that chip multiprocessors (CMPs) need to cater to, a “one-cache-topology-fits-all” design philosophy will clearly be inadequate. In this paper, we propose MorphCache, a Reconfigurable Adaptive Multi-level Cache hierarchy. Mor-phCache dynamically tunes a multi-level cache topology in a CMP to allow significantly different cache topologies to exist...
As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of...
The number of cores in a single chip multiprocessor is expected to grow in coming years. Likewise, aggregate on-chip cache capacity is increasing fast and its effective utilization is becoming ever more important. Furthermore, available cores are expected to be underutilized due to the power wall and highly heterogeneous future workloads. This trend makes existing L2 cache management techniques less...
To date dynamic voltage/frequency scaling (DVFS) has been one of the most successful power-reduction techniques. However, ever-increasing process variability reduces the reliability of static random access memory (SRAM) at low voltages. This limits voltage scaling to a minimum operating voltage (VDDMIN). Larger SRAM cells, that are less sensitive to process variability, allow the use of lower VDDMIN...
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make...
The next-generation enterprise Xeon® processor consists of 10 Westmere 32nm cores and a shared inclusive L3 cache (LLC) integrated on a monolith ic die, with link-based l/Os. This paper focuses on the innovations and circuit optimizations over the predecessor targeting idle power reduction, robust high-speed I/O links, and performance per watt improvements. The processor is implemented in 32nm CMOS...
The next generation in the Intel® Itanium® processor family, code named Poulson, has eight multi-threaded 64 bit cores. Poulson is socket compatible with the current Intel® Itanium® Processor 9300 series (Tukwila) . The new design integrates a ring-based system interface derived from portions of previ ous Xeon® and Itanium® processors, and includes 32MB of Last Level Cache (LLC). The processor is...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.