Because stream applications place high bandwidth demands on the memory system, most stream processors use software-managed streaming memory. However, this memory compromises ease of programming, compatibility, and support for irregular stream access, which hinders the use of stream processors in broader application domains. Meanwhile, hardware-managed coherent caches overcome these shortcomings of software-managed...
A major challenge to the creation of chip multiprocessors is designing the on-chip memory and communication resources to efficiently support parallel workloads. A variety of cache organizations, data management techniques, and hardware optimizations that take advantage of specific data characteristics have been developed to improve application performance. The success of these approaches depends on...
Resource sharing can cause unfair and unpredictable performance of concurrently executing applications in Chip-Multiprocessors (CMP). The shared last-level cache is one of the most important shared resources because off-chip request latency may take a significant part of total execution cycles for data intensive applications. Instead of enforcing performance fairness directly, prior work addressing...
Chip-multiprocessor (CMP) architectures are becoming more and more popular as an alternative to traditional processors that only extract instruction-level parallelism from an application. CMPs introduce complexities when accounting for CPU utilization. This is because the progress made by an application during an interval of time depends heavily on the activity of the other applications...
An SMT processor is designed to execute multiple threads simultaneously in order to gain higher performance by sharing resources such as ALUs and cache memory among several threads. However, sharing cache memory may cause thread conflict misses, which degrade performance. In this paper, an effective replacement strategy in which the conflict miss ratio among threads is controlled by limiting the...
The main objective of this paper is to outline possible ways to substantially accelerate the calculation of the advection-diffusion equation (A-DE), which is commonly used to describe pollutant behavior in the atmosphere. The A-DE is a kind of partial differential equation (PDE), and in the general case it is usually solved by numerical integration due to its high complexity. These types...
The on-chip many-core architecture is an emerging and promising computation platform. High-speed on-chip communication and abundant on-chip resources are two outstanding advantages of this architecture, which provide an opportunity to implement efficient synchronization schemes. The practical execution efficiency of a synchronization scheme is critical to this platform. However, there has been little research...
Information on a particular behavioral aspect of a program can help identify performance bottlenecks and can be used further to improve the performance of the system. It is observed that contention for the shared L2 cache between programs running on a multi-core processor (MCP) is one of the performance bottlenecks. The utilization of the L2 cache by a program, while sharing it with...
In this paper, we present a low-power, variable-length fast Fourier transform (FFT) processor design for flexible MIMO-OFDM applications. In this work, a mixed-radix-2/4/8 algorithm and a new continuous-flow method are applied to achieve variable lengths of 1K/2K/4K/8K points and in-order output. Furthermore, a ping-pong cache memory architecture and an optimized data scaling strategy are also applied...
The best interface between CPUs and reconfigurable hardware in heterogeneous systems remains an open question. The trend in multi-core processors is to communicate through a shared memory hierarchy; but cache organizations that work best for general-purpose multi-core systems may not be best for heterogeneous systems. In this paper we explore a variety of cache topologies for connecting a CPU with...
The number of functional errors escaping design verification and being released into final silicon is growing, due to the increasing complexity and shrinking production schedules of modern processor designs. Recent trends towards chip multiprocessors (CMPs) are exacerbating the problem because of their complex and sometimes non-deterministic memory subsystems, prone to subtle but devastating bugs...
Heterogeneous multicores, such as Cell BE processors and GPGPUs, typically do not have caches for their accelerator cores because coherence traffic, cache misses, and latencies from different types of memory accesses add overhead and adversely affect instruction scheduling. Instead, the accelerator cores have internal local memory to place their code and data. Programmers of such heterogeneous multicore...
Post-silicon processor debugging is frequently carried out in a loop consisting of several iterations of the following two key steps: (i) processor execution for some duration, followed by (ii) dumping out of the processor's internal state into an external logic analyzer for further offline processing. Internal state of the processor is dominated by the L2 cache. During the process of dumping the...
Prior work on HW support for memory race recording piggybacks time stamps on coherence messages and logs the outcome of memory races using point-to-point or chunk-based approaches. These memory race recorder (MRR) techniques are effective, but they require modifications to the cache coherence protocol that can hurt performance. In addition, prior work has mostly focused on directory coherence and...
Memory encryption offers secure protection for the confidentiality of programs and data, but implementing an encryption design for an embedded processor is very difficult. Because the embedded processor is highly constrained by application requirements, designers cannot be concerned with security alone. This paper proposes a new lightweight memory encryption cache (MEC) to obtain a balance among the performance,...
Thread level parallelism (TLP) has become a popular trend to improve processor performance, overcoming the limitations of extracting instruction level parallelism. Each TLP paradigm, such as Simultaneous Multithreading or Chip-Multiprocessors, provides different benefits, which has motivated processor vendors to combine several TLP paradigms in each chip design. Even if most of these combined-TLP...
Embedded systems are developing rapidly: their functions are becoming more complicated, and multimedia applications are growing daily and consuming more electrical power. Improving standby time is therefore becoming a very important issue. Related research indicates that the processor cache accounts for a large proportion of power consumption. Way-prediction and LRU (least recently used) algorithms improve...
Memory latency is a significant bottleneck in modern computer architectures, especially for commercial and multimedia applications. Instruction cache misses can severely limit performance, given the advent of superscalar processors and multicore systems. Prefetching is one of the promising methods to bridge the performance gap between CPU and DRAM speeds. Although instruction prefetching is a promising...
Data caches in general-purpose microprocessors often contain mostly dead blocks and are thus used inefficiently. To improve cache efficiency, dead blocks should be identified and evicted early. Prior schemes predict the death of a block immediately after it is accessed; however, these schemes yield lower prediction accuracy and coverage. Instead, we find that predicting the death of a block when it...
Today's CMP platforms are designed to be symmetric in terms of platform resources such as shared caches. However, it is becoming increasingly important to understand the performance implications of asymmetric caches for two key reasons: (a) multi-workload scenarios such as server consolidation are a growing trend and contention for shared cache resources between workloads causes logical cache...