The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
For higher processing and computing power, chip multiprocessors (CMPs) have become the new mainstream architecture. This shift to CMPs has created many challenges for fully utilizing the power of multiple execution cores. One of these challenges is managing contention for shared resources. Most of the recent research address contention for shared resources by single-threaded applications. However,...
The instruction cache is a critical component in any microprocessor. It must have high performance to enable fetching of instructions on every cycle. However, current designs waste a large amount of energy on each access as tags and data banks from all cache ways are consulted in parallel to fetch the correct instructions as quickly as possible. Existing approaches to reduce this overhead remove unnecessary...
When there are several application running on Chip-Multiprocessors (CMPs), it is a problem to allocate the on-chip cache capacities between these competing applications. Cache partitioning is commonly used to solve this problem. Existing cache partitioning schemes either dedicate to the shared design or partition the last level cache depending on limited memory information. This paper presents Private...
Transactional memory (TM) is a new shared resource synchronization mechanism which was proposed to ease the difficulty of parallel programming. Currently, most hardware transactional memory systems leverages the extended directory based cache coherence protocol to resolve transaction conflicts; seldom research has been conducted to extend a snoopy coherence based chip multi-processor with TM support...
As Chip Multiprocessors (CMPs) scale to tens or hundreds of nodes, the interconnect becomes a significant factor in cost, energy consumption and performance. Recent work has explored many design tradeoffs for networks-on-chip (NoCs) with novel router architectures to reduce hardware cost. In particular, recent work proposes bufferless deflection routing to eliminate router buffers. The high cost of...
Future CMPs will combine many simple cores with deep cache hierarchies. With more cores, cache resources per core are fewer, and must be shared carefully to avoid poor utilization due to conflicts and pollution. Explicit motion of data in these architectures, such as message passing, can provide hints about program behavior that can be used to hide latency and improve cache behavior. However, to make...
In current Chip-multiprocessors (CMPs), a significant portion of the die is consumed by the last-level cache. Until recently, the balance of cache and core space has been primarily guided by the needs of single applications. However, as multiple applications or virtual machines (VMs) are consolidated on such a platform, researchers have observed that not all VMs or applications require significant...
Large scale Chip-Multiprocessors (CMPs) generally employ Network-on-Chip (NoC) to connect the last level cache (LLC), which is generally organized as distributed NUCA (non-uniform cache access) arrays for scalability and efficiency. On the other hand, aggressive technology scaling induces severe reliability problems, causing on-chip components (e.g., cores, cache banks, routers) failure due to manufacture...
Energy efficiency is a primary concern for microprocessor designers. A very effective approach to improving the energy efficiency of a chip is to lower its supply voltage to very close to the transitor's threshold voltage, into what is called the near-thresold region. This reduces power consumption dramatically but also decreases reliability by orders of magnitude, especially for SRAM structures such...
Modern processor architectures sacrifice timing predictability to improve average performance. Branch prediction, out-of-order execution, and multi-level cache hierarchies complicate accurate execution time estimates. The timing demands of Cyber Physical Systems (CPS) have led some to propose new processor architectures, including Precision Timed (PRET) processors, which simplify analysis of execution...
Exploiting the locality of blocks in the same set, LRU replacement strategy has deficiencies to manage L2 cache resources as the temporal locality has filtered by L1 caches. Instead, reuse replacement strategy develops the reuse characteristics of blocks in entire cache scope being more potential to improve cache resources utilization. We use reuse replacement to manage L2 cache resources in chip...
Contention for shared cache resources has been recognized as a major bottleneck for multicores--especially for mixed workloads of independent applications. While most modern processors implement instructions to manage caches, these instructions are largely unused due to a lack of understanding of how to best leverage them. This paper introduces a classification of applications into four cache usage...
The last-level cache (LLC) mitigates the impact of long memory access latencies in today's microarchitectures. The insertion policy in the LLC has a significant impact on cache efficiency. A fixed insertion policy can allow useless blocks to remain in the cache longer than necessary, resulting in inefficiency. We introduce insertion policy selection using Decision Tree Analysis (DTA). The technique...
Exploiting the performance of today's processors requires intimate knowledge of the microarchitecture as well as an awareness of the ever-growing complexity in thread and cache topology. LIKWID is a set of command-line utilities that addresses four key problems: Probing the thread and cache topology of a shared-memory node, enforcing thread-core affinity on a program, measuring performance counter...
Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures. Also, the bigger shared level-2 (L2) caches are demanded for higher cache performance. One of the easiest ways to design cache memory for increased performance is to double the cache size. However, the big cache size is directly related to the area and power consumption. Especially in mobile processors,...
Fairness is a critical issue because of some serious problems, such as thread starvation and priority inversion, it can arise and render the Operating System (OS) scheduler ineffective if no fair cache sharing which provided by the hardware. In order to improve the fairness of shared cache between threads in a chip multiprocessor, a dynamic fair partitioning policy of shared cache is proposed in this...
Previous work has shown that processor affinity is one effective way to improve the performance of SMP systems. A detailed analysis of cache characteristics on network processor is carried out. The result shows that network packet processing can also use cache affinity to gain performance improvements by reduce the miss rate of instruction cache and data cache. A schedule algorithm for multi-core...
As multi-core trends are becoming dominant, cache structures are complicated and bigger shared level-2 caches are demanded. Also, in mobile processors, multi-core design is being applied. To achieve higher cache performance, lower power consumption and smaller chip area in multi-core mobile processors, cache configuration should be re-organized and re-analyzed. The MID (Mobile Internet Devices) which...
The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other side. However, to get high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying...
The optimal size of a large on-chip cache can be different for different programs: at some point, the reduction of cache misses achieved when increasing cache size hits diminishing returns, while the higher cache latency hurts performance. This paper presents the Amorphous Cache (AC), a reconfigurable L2 on-chip cache aimed at improving performance as well as reducing energy consumption. AC is composed...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.