Restrictions on memory performance have always had a great impact on soft-core processors. The limited number of ports on FPGA block RAMs can restrict the exploitation of parallelism in soft-core processors implemented on these devices. Multiple memory ports on FPGAs are cumbersome and do not scale well, carrying a high cost in area and power consumption when implemented. In order to...
Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. While CNNs involve enormous computational complexity, VLSI (ASIC and FPGA) chips that deliver high-density integration of computational resources are regarded as a promising platform for CNN implementation. With massive parallelism of computational units, however, the external memory bandwidth, which is constrained...
In this paper, we introduce memos, which integrates suitable memory management policies and schedules resources over the entire memory hierarchy in hybrid memory systems. Powered by an OS-kernel-level monitoring tool, memos captures memory access patterns online and then leverages them to guide memory page placement and data mapping. Experimental results show that, on average, memos can improve memory utilization,...
Data compression is the science of representing information in a more compact form by reducing its size to some extent. Reliable and efficient data compression entails minimal memory usage and low computational complexity. This work proposes a lossless compression technique for the efficient compression of both images and text files. LiBek II is an adaptive dictionary-based algorithm in which...
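The abstract is truncated before describing LiBek II's dictionary mechanism. As a hedged illustration of the general class it names — adaptive dictionary-based lossless compression — the following is a minimal sketch of classic LZW-style coding; the function name and details are illustrative, not the paper's algorithm:

```python
def lzw_compress(data: bytes) -> list[int]:
    """Adaptive dictionary compression sketch: start with all single-byte
    entries, then grow the dictionary as longer phrases are encountered."""
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    phrase = b""
    output = []
    for byte in data:
        candidate = phrase + bytes([byte])
        if candidate in dictionary:
            # Known phrase: keep extending before emitting a code.
            phrase = candidate
        else:
            # Emit the longest known phrase, add the new one to the dictionary.
            output.append(dictionary[phrase])
            dictionary[candidate] = next_code
            next_code += 1
            phrase = bytes([byte])
    if phrase:
        output.append(dictionary[phrase])
    return output
```

Because the dictionary adapts to the input, repetitive inputs (common in both text and image data, the two targets the abstract mentions) compress to progressively shorter code sequences.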
This paper presents a new encoding and corresponding decoding scheme to reduce crosstalk on a high-speed parallel bus. The scheme is based on a modified Fibonacci sequence and is introduced along with potential benefits in some upcoming memory interfaces. The scheme provides appreciable eye opening for interfaces dominated by crosstalk such as existing memory interfaces.
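The paper's "modified Fibonacci sequence" is not specified in this excerpt. The sketch below shows only the classic property such bus-encoding schemes build on: a greedy Fibonacci-weighted (Zeckendorf) representation yields codewords with no two adjacent 1s, so neighbouring bus wires never carry the worst-case aggressor pattern. All names here are illustrative assumptions:

```python
def fibonacci_encode(n: int, width: int) -> list[int]:
    """Encode n over Fibonacci weights 1, 2, 3, 5, 8, ... (Zeckendorf form).
    The greedy choice from the largest weight down guarantees no two
    adjacent bits are both 1, limiting coupling between adjacent wires."""
    weights = [1, 2]
    while len(weights) < width:
        weights.append(weights[-1] + weights[-2])
    bits = [0] * width
    for i in range(width - 1, -1, -1):
        if weights[i] <= n:
            bits[i] = 1
            n -= weights[i]
    return bits  # bits[i] carries weight weights[i]
```

Decoding is simply the weighted sum of the bits; the cost of the scheme is that representing the same value range needs more wires than plain binary, which the paper's eye-opening gains would have to justify.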
3D integration provides opportunities to design high-bandwidth, low-power CMOS image sensors (CIS) [1–4]. The 3D stacking of pixel tier, peripheral tier, memory tier(s), and compute tier(s) enables a high degree of parallel processing. Moreover, each tier can be designed in a different technology node (heterogeneous integration) to further improve power efficiency. This paper presents a case study of...
Collective I/O is a parallel I/O technique designed to deliver high performance data access to scientific applications running on high-end computing clusters. In collective I/O, write performance is highly dependent upon the storage system response time and limited by the slowest writer. The storage system response time in conjunction with the need for global synchronisation, required during every...
Sparse matrix vector multiplication (SpMV) is the workhorse for a wide range of linear algebra computations. In a serial setting, naive implementations for direct multiplication and transposed multiplication achieve very competitive performance. In parallel settings, especially on graphics hardware, it is widely believed that naive implementations cannot reach the performance of highly tuned parallel...
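As a concrete reference point for the "naive implementations" the abstract compares against, a minimal serial SpMV over the standard CSR layout, together with its transposed (scatter) variant, might look as follows; the function names are illustrative:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Naive serial y = A @ x with A in CSR form (gather per row)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

def spmv_csr_transposed(values, col_idx, row_ptr, x, n_cols):
    """Naive serial y = A.T @ x: same CSR arrays, scatter instead of gather."""
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[col_idx[k]] += values[k] * x[i]
    return y
```

The contrast the abstract draws is visible even in this sketch: the gather form writes each `y[i]` from one row and is trivially serial-friendly, while the scatter form updates `y[col_idx[k]]` at data-dependent positions, which is exactly what makes parallel (and especially GPU) versions nontrivial.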
Compute clusters, consisting of many uniformly built nodes, are used to run a broad spectrum of workloads, such as tightly coupled (MPI) jobs, MapReduce, or graph-processing data-analytics applications, each with its own resource requirements. Many studies consistently highlight two types of under-utilized cluster resources: memory (up to 50%) and network. In this work, we take...
The memory wall problem is one of the major obstacles to realizing extremely fast, large-scale simulations. Stencil computations, which are important kernels for CFD simulations, have achieved high speed on GPU clusters thanks to the high memory bandwidth and computation speed of accelerators. However, their problem scales have been limited by the small capacity of GPU device memory...
Upcoming high-performance computing (HPC) platforms will have more complex memory hierarchies with high-bandwidth on-package memory and in the future also non-volatile memory. How to use such deep memory hierarchies effectively remains an open research question. In this paper we evaluate the performance implications of a scheme based on a software-managed scratchpad with coarse-grained memory-copy...
3D memories are becoming viable solutions for the memory wall problem and for meeting the bandwidth requirements of memory-intensive applications. The high bandwidth provided by 3D memories does not translate to a proportional increase in performance for all applications. For an application such as 2D FFT with strided access patterns, the data layout of the memory has a significant impact on the total...
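To make the strided-access point concrete: with a row-major layout, the row pass of a 2D FFT touches unit-stride data, while the column pass strides by the row length, so consecutive accesses land far apart in memory unless the layout is adapted. A minimal sketch of why (illustrative, not the paper's layout scheme):

```python
def column_indices(n_rows: int, n_cols: int, j: int) -> list[int]:
    """Flat indices of column j in a row-major n_rows x n_cols array:
    successive elements are n_cols apart, i.e. a stride-n_cols access."""
    return [i * n_cols + j for i in range(n_rows)]
```

For a large FFT, that stride typically exceeds a DRAM row, so every column element can cost a fresh row activation — which is why data layout, not raw bandwidth, dominates here.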
In this paper, we propose an efficient motion estimation hardware architecture for High Efficiency Video Coding (HEVC) using a Modified Reference Data Access Skip (MRDAS) to reduce the required memory bandwidth. Memory bandwidth is responsible for the throughput limitations in motion estimation, especially when dealing with high-quality video with a large frame size and search range. This architecture...
TCP/IP is widely used both in the Internet and in data centers. The protocol makes very few assumptions about the underlying network and provides useful guarantees such as reliable transmission, in-order delivery, and flow control. The price for this functionality is complexity, latency, and computational overhead, which is especially pronounced in software implementations. While for Internet...
After a decade evolving in the High Performance Computing arena, GPU-equipped supercomputers have conquered the TOP500 and Green500 lists, providing unprecedented levels of computational power and memory bandwidth. This year, major vendors have introduced new accelerators based on 3D memory, such as Intel's Xeon Phi Knights Landing and Nvidia's Pascal architecture. This paper reviews hardware features...
■ Leverages Mali's scalable architecture
■ Scalable to 32 shader cores
■ Major shader core redesign
■ New scalar, clause-based ISA
■ New quad-based arithmetic units
■ New geometry data flow
■ Reduces memory bandwidth and footprint
■ Support for fine-grain buffer sharing with the CPU
Solid State Drives (SSDs) using flash memory storage technology present a promising storage solution for data-intensive applications due to their low latency, high bandwidth, and low power consumption compared to traditional hard disk drives. SSDs achieve these desirable characteristics using internal parallelism - parallel access to multiple internal flash memory chips - and a Flash Translation Layer...
The era of heterogeneous systems and Big Data computing is already here. Handling huge amounts of data poses new challenges in data management and in the effective usage of memory, caches, heterogeneous structures, and available bandwidth. In addition, the computing requirements of Big Data are unique; on many occasions the processing required per storage access is limited (i.e., low instructions/byte), which...
The contribution of memory latency to execution time keeps increasing in modern memory systems. Hierarchical memory based on locality is the classic design for alleviating this effect. However, modern memory systems are also supported by various concurrency-driven technologies, and the effect of leveraging locality in combination with concurrency becomes uncertain. We found that concurrency-driven technologies...