This paper proposes an energy-efficient, high-throughput DRAM architecture for GPUs and throughput processors. In these systems, requests from thousands of concurrent threads compete for a limited number of DRAM row buffers. As a result, only a fraction of the data fetched into a row buffer is used, leading to significant energy overheads. Our proposed DRAM architecture exploits the hierarchical organization...
As the amount of digital data the world generates explodes, data centers and HPC systems that process this big data will require main memory with both high bandwidth and high capacity. Unfortunately, conventional memory technologies provide either high capacity (e.g., DDRx memory) or high bandwidth (e.g., GDDRx memory), but not both. Memory networks, which provide both high bandwidth and high capacity memory...
The memory wall continues to be a major performance bottleneck. While small on-die caches have been effective so far in hiding this bottleneck, the ever-increasing footprint of modern applications renders such caches ineffective. Recent advances in memory technologies like embedded DRAM (eDRAM) and High Bandwidth Memory (HBM) have enabled the integration of large memories on the CPU package as an...
Owing to the increasing demand for faster and larger DRAM systems, DRAM accounts for a large portion of the total power consumption of computing systems. As memory traffic and DRAM bandwidth grow, row activation and I/O power consumption are becoming major contributors to total DRAM power consumption. Thus, reducing row activation and I/O power consumption has great potential for improving...
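The activation and I/O contributions described above can be sketched with a back-of-the-envelope power model; the function name, parameters, and all numbers below are illustrative assumptions, not figures from the paper:

```python
# Back-of-the-envelope DRAM power breakdown (all quantities are
# illustrative placeholders, not measurements from the paper).

def dram_power(act_rate_hz, e_act_nj, io_bw_gbps, e_io_pj_per_bit, p_background_mw):
    """Estimate total DRAM power in milliwatts from the row-activation
    rate, I/O traffic, and a fixed background component."""
    p_act_mw = act_rate_hz * e_act_nj * 1e-6             # nJ/s -> mW
    p_io_mw = io_bw_gbps * 1e9 * e_io_pj_per_bit * 1e-9  # pJ/s -> mW
    return p_act_mw + p_io_mw + p_background_mw
```

With, say, one million activations per second at 100 nJ each (100 mW), 10 Gb/s of I/O at 5 pJ/bit (50 mW), and 100 mW of background power, activation and I/O together dominate the total, which is the trend the abstract points at.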
With current DRAM technology reaching its limit, emerging heterogeneous memory systems have become attractive to keep the memory performance scaling. This paper argues for using a small, fast memory closer to the processor as part of a flat address space where the memory system is composed of two or more memory types. OS-transparent management of such memory has been proposed in prior works such as...
The performance of 3D rendering on a Graphics Processing Unit, which converts a 3D vector stream into a 2D frame with 3D image effects, significantly impacts users' gaming experience on modern computer systems. Due to its high texture throughput requirement, main memory bandwidth becomes a critical obstacle to improving overall rendering performance. 3D-stacked memory systems such as Hybrid Memory Cube provide...
Historically, improvements in GPU-based high performance computing have been tightly coupled to transistor scaling. As Moore's law slows down, and the number of transistors per die no longer grows at historical rates, the performance curve of single monolithic GPUs will ultimately plateau. However, the need for higher performing GPUs continues to exist in many domains. To address this need, in this...
The paper proposes a solution to a topical scientific problem related to load balancing and efficient utilization of the resources of a distributed system. The proposed method is based on calculating the CPU, memory, and bandwidth load imposed by flows of different service classes on each server and on the entire distributed system, taking into account the multifractal properties of the input data flows. Weighting...
Memory interference is a critical impediment to system performance in MPSoCs. To address this problem, we first propose a Locality-Aware Bank Partitioning (LABP), which partitions memory banks according to applications' memory access behavior. The key idea is to separate memory intensive applications with high row-buffer locality from the other applications. Moreover, we integrate LABP with a bandwidth...
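The key idea stated in this abstract, isolating memory-intensive, high-row-buffer-locality applications onto their own banks, could be sketched as a simple partitioning heuristic. The function, thresholds, and metric names (`mpki`, `row_hit_rate`) below are hypothetical illustrations, not the paper's actual LABP algorithm:

```python
# Hypothetical sketch of locality-aware bank partitioning.
# Thresholds and metric names are illustrative assumptions.

def partition_banks(apps, num_banks, mpki_threshold=10.0, locality_threshold=0.5):
    """Give dedicated banks to memory-intensive apps with high row-buffer
    locality; all remaining apps share the leftover banks."""
    isolated = [a for a in apps
                if a["mpki"] >= mpki_threshold
                and a["row_hit_rate"] >= locality_threshold]
    shared = [a for a in apps if a not in isolated]
    mapping = {}
    # Reserve one equal share per isolated app, one share for the shared pool.
    share = num_banks // (len(isolated) + 1) if isolated else 0
    next_bank = 0
    for a in isolated:
        mapping[a["name"]] = list(range(next_bank, next_bank + share))
        next_bank += share
    leftover = list(range(next_bank, num_banks))
    for a in shared:
        mapping[a["name"]] = leftover  # shared apps contend on these banks
    return mapping
```

The point of the separation is that a high-locality application keeps its rows open in its private banks, while bursty low-locality applications can no longer thrash those row buffers.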
A multi-granularity memory system provides multiple access granularities for applications with varying spatial locality. Under such multi-granularity access patterns, a NoC design with a single fixed link width cannot utilize bandwidth efficiently. We propose a novel NoC design, called BoDNoC, which can merge multiple narrow subnets to provide various bandwidths for accessed data. The new design also adopts...
Processing-in-Memory (PIM) has recently been revisited as one of the most promising solutions to the bandwidth and power walls between processor and memory. In this paper, we propose a light-weight PIM architecture, approxPIM, which leverages approximate computing techniques to enable in-memory processing in a realistic 3D-stacked DRAM, Micron's Hybrid Memory Cube (HMC). Using the...
Many important applications demand large amounts of on-chip memory both to fully utilize an FPGA's computational capacity and to minimize energy-consuming off-chip memory accesses, leading some recent commercial FPGAs to add higher-capacity on-chip block RAMs (BRAMs). While memory is becoming more important to FPGA designs, SRAM scaling is becoming more difficult because of increasing device variation...
RAM-based storage aggregates the RAM of thousands of commodity servers in data center networks (DCNs) to provide extremely low I/O latency and high I/O throughput. To achieve fast failure recovery, MemCube exploits network proximity to restrict failure detection and recovery to within a 1-hop range. However, the previous design is applicable only to the BCube network, which limits the usage of RAM-based...
Large off-die stacked DRAM caches have been proposed to provide higher effective bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache with a conventional block size requires a large tag array that is impractical to fit on-die. Placing the large directory off-die prolongs the latency, since a tag access is necessary before the data can be accessed. This additional trip...
With the development of Ultra-High-Definition video, the power consumed by accessing reference frames in the external DRAM has become the bottleneck for the portable video encoding system design. To reduce the dynamic power of DRAM, a lossy frame memory recompression algorithm is proposed. The compression algorithm is composed of a content-aware adaptive quantization, a multi-mode directional prediction,...
As more and more consumers access streaming video content over the internet, enterprises across the entire video distribution value chain experience tremendous pressure to deliver better performance and high quality of experience (QoE) to the end users. Enhanced video performance is highly desirable, starting from the Video Origin Servers through the core network and Content Delivery Networks (CDN)...
This paper discusses early efforts to integrate the RAN remote memory technology into the vl3 volume rendering framework. We successfully demonstrate this integration, achieving 73% of the theoretical hardware maximum with minimal variation.
Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. The performance of stencil calculations is often bounded by memory bandwidth. High-bandwidth memory (HBM) on devices such as those in the Intel® Xeon Phi™ x200 processor family (code-named Knights Landing) can thus provide additional performance. In a traditional sequential time-step...
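As a minimal illustration of why stencil sweeps are bandwidth-bound, consider a 3-point Jacobi stencil: each output point costs only a few arithmetic operations but three memory reads per time step, so arithmetic intensity is low and memory bandwidth dominates. The sketch below is illustrative only and is not the paper's kernel:

```python
# Minimal 1-D 3-point stencil sweep (illustrative, not the paper's kernel).
# Every time step re-reads the whole grid, so bandwidth, not compute,
# limits performance on large grids.

def jacobi_1d(u, steps):
    """Apply a 3-point averaging stencil for the given number of time steps;
    boundary points are carried over unchanged."""
    for _ in range(steps):
        nxt = u[:]
        for i in range(1, len(u) - 1):
            nxt[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
        u = nxt
    return u
```

On hardware with HBM, the same sweep simply streams the grid through a much wider memory pipe, which is why such kernels benefit directly from the higher bandwidth.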
This paper introduces HBM DRAM built with through-silicon via (TSV) technology. It covers general TSV features and techniques, including TSV architecture, TSV reliability, TSV open/short testing, and TSV repair. HBM DRAM, a representative DRAM product using TSVs, is then presented in detail, with particular focus on its uses and features.
Future systems dealing with big-data workloads will be severely constrained by the high performance and energy penalty imposed by data movement. This penalty can be reduced by storing datasets in DRAM or NVM main memory in compressed formats. Prior compressed memory systems have required significant changes to the operating system, thus limiting commercial viability. The first contribution of this...