This paper explores the feasibility of memory entirely disaggregated from compute and storage for a particular, widely deployed workload: Spark SQL [9] analytics queries. We measure the empirical rate at which records are processed and calculate the effective memory bandwidth utilized based on the sizes of the columns accessed in the query. Our findings contradict conventional wisdom: not only is...
Sparse matrix-vector multiplication (SpMV) is an important computational kernel in many applications. To improve performance, software libraries dedicated to SpMV computation have been introduced, e.g., the MKL library for CPUs and the cuSPARSE library for GPUs. However, the computational throughput of these libraries is far below the peak floating-point performance offered by hardware platforms, because...
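For readers unfamiliar with the kernel, the computation these libraries optimize can be written in a few lines. Below is a minimal, illustrative sketch of SpMV over the common CSR (compressed sparse row) storage format; the function name and example matrix are ours, not from the paper:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x, with A stored as CSR arrays (values, col_idx, row_ptr)."""
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Nonzeros of row i occupy values[row_ptr[i]:row_ptr[i + 1]].
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# Example: A = [[1, 0, 2],
#               [0, 3, 0]], x = [1, 1, 1]  ->  y = [3, 3]
print(spmv_csr([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]))
```

The irregular, data-dependent access to `x` through `col_idx` is the main reason SpMV throughput falls far short of peak floating-point performance on both CPUs and GPUs.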
Wearable head-mounted display (HMD) smart devices are emerging as a smartphone substitute due to their ease of use and suitability for advanced applications, such as gaming and augmented reality (AR) [1–2]. Most current HMD systems suffer from: 1) a lack of rich user interfaces, 2) short battery life, and 3) heavy weight. Although current HMDs (e.g. Google Glass) use a touch panel and voice commands...
Deep learning using convolutional neural networks (CNNs) gives state-of-the-art accuracy on many computer vision tasks (e.g. object detection, recognition, segmentation). Convolutions account for over 90% of the processing in CNNs for both inference/testing and training, and fully convolutional networks are increasingly being used. Achieving state-of-the-art accuracy requires CNNs with not only a...
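As context for the claim that convolutions dominate CNN processing: the core operation is a sliding dot product between a filter and each image patch. A minimal sketch in plain Python (single channel, "valid" padding; names are illustrative, not from the paper):

```python
def conv2d(image, kernel):
    """Naive 2D cross-correlation ("convolution" in CNN usage), valid padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Dot product of the kernel with the patch anchored at (i, j).
            out[i][j] = sum(image[i + u][j + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out

# A 2x2 box filter over a 3x3 image sums each 2x2 patch.
print(conv2d([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1, 1], [1, 1]]))  # [[12, 16], [24, 28]]
```

Each output pixel costs kh x kw multiply-accumulates per input channel; multiplied across channels, filters, and spatial positions, this is where figures like the >90% share come from.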
As massive multi-threading in GPUs imposes tremendous pressure on memory subsystems, efficient bandwidth utilization becomes a key factor affecting GPU throughput. In this work, we propose thread batch enabled memory partitioning (TEMP), which improves GPU performance by improving memory bandwidth utilization. In particular, TEMP clusters multiple thread blocks sharing the same set of...
In heterogeneous MPSoCs, memory interference between the CPU and real-time cores is a critical impediment to system performance. Previous memory schedulers adopt the classic two-tier queuing system, but unfortunately two-tier queuing deteriorates the QoS of scheduling policies. In this paper, we propose the Single-Tier Virtual Queuing (STVQ) memory controller for efficacious QoS-aware scheduling...
The memory capacity of computer systems has not expanded as quickly as the memory requirements of large-memory applications have grown. Moreover, big-memory systems have been too expensive for many researchers and students. Therefore, utilizing remote memory has been considered a cost-effective way to run large-memory applications in cluster environments where...
Live migration of virtual machines has attracted significant attention in recent years. It facilitates online system maintenance, load balancing, fault tolerance, and power management. The existing pre-copy live migration approach has to iteratively copy redundant memory pages, which causes high network overhead and slow migration. Another approach, post-copy live migration, can provide quick migration with...
Recent advancements in the architecture of the Graphics Processing Unit (GPU) enable the acceleration of many general-purpose applications. Even with high memory bandwidth, GPUs still face the challenge of accelerating highly memory-intensive applications. To overcome this challenge, this paper investigates the impact of scaling up the memory partitions and also scaling the frequency of the...
The centrality of interleavers in interleave-division multiple-access (IDMA) cannot be over-emphasised, the interleaver being the only means of isolating signals for different users of the multiple-access system. This work gives a critical review of bit-error-rate (BER) performance of interleavers and IDMA systems. Existing literature shows that there are disagreements among results published by different...
Applications in modern data centers have a wide variety of resource requirements along the four main dimensions of computing, memory, storage, and networking. Data centers must manage these resources separately for each dimension, resulting in highly inefficient allocation of precious resources or even disastrous schemes that contribute to low utilization or over-provisioning of resources. However,...
This paper describes our experience with storage optimization that utilizes cost-effective PCIe solid-state drives (SSDs) to improve the overall performance of a Spark framework. A key problem we address is the limited memory system performance. In particular, we adopt high-performance SSDs to alleviate the saturated DRAM bandwidth and its limited capacity. We utilize SSDs to store shuffle data and...
Despite the ability of modern processors to execute a variety of algorithms efficiently through instructions based on registers with ever-increasing widths, some applications perform poorly due to the limited interconnect bandwidth between main memory and the processing units. Near-data processing has started to gain acceptance as an acceleration approach due to technology constraints and...
While processor caches cannot grow arbitrarily large due to area, power, and latency considerations, dataset sizes grow faster than Moore's Law and pressure caches to grow to accommodate the increasing working sets. Cache compression partially mitigates this problem by providing an effective cache capacity larger than the physical capacity of the cache, but the prevalent rule of thumb dictates that...
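To make "effective capacity larger than physical capacity" concrete, one widely cited family of cache compression schemes is base-delta compression, which stores a block as one base value plus narrow deltas. The toy sketch below is our own illustration under simplifying assumptions (a single base, fixed delta width), not the scheme this particular paper proposes:

```python
def base_delta_compress(block, delta_bits=8):
    """Return (base, deltas) if every word fits as a signed delta_bits-wide
    delta from the first word; otherwise None (block stays uncompressed)."""
    base = block[0]
    limit = 1 << (delta_bits - 1)
    deltas = [w - base for w in block]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None

# Nearby values compress well; widely spread values do not.
print(base_delta_compress([1000, 1001, 1005, 998]))  # (1000, [0, 1, 5, -2])
print(base_delta_compress([0, 1000000]))             # None
```

When compression succeeds, a block of wide words shrinks to one wide base plus narrow deltas, which is how a cache can hold more logical blocks than its physical capacity suggests.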
In high-speed real-time image processing systems, an arbitration module is used to resolve access conflicts when a single-port memory is shared among functional modules of an FPGA. In this paper, the shared-memory characteristics of each port module are briefly analyzed, and then the implementation mechanism and specific design steps of the arbiter logic are given. Finally, the logic is...
This paper presents a novel 2.5D multicore processor which consists of 3 distinct silicon dies: a processor die with 8 MIPS cores, a 16kB SRAM die, and an accelerator die for multimedia and communication applications. These dies are interconnected in multiple modes, such as core-core (up to 32 cores), core-memory (4x storage capacity), and core-accelerator (4.4x speedup on an H.264 decoder), to establish...
Image feature descriptors composed of a series of binary intensity comparisons yield substantial memory and runtime improvements over conventional descriptors, but are sensitive to viewpoint changes in ways that vary per feature. We propose a method to improve the matching performance of such descriptors by specifically reasoning about the reliability of test results on a feature-by-feature basis...
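For context on why binary descriptors bring such memory and runtime savings: matching reduces to Hamming distance between bit strings, computable with an XOR and a popcount. A minimal illustrative sketch (function names are ours; the paper's per-feature reliability reasoning is not shown):

```python
def hamming(a, b):
    """Hamming distance between two equal-length binary descriptors (as ints)."""
    return bin(a ^ b).count("1")

def best_match(query, database):
    """Index of the database descriptor nearest to query in Hamming distance."""
    return min(range(len(database)), key=lambda i: hamming(query, database[i]))

# 0b1011 differs from the query 0b1010 in a single bit, so index 1 wins.
print(best_match(0b1010, [0b0101, 0b1011, 0b1110]))  # 1
```

Because XOR and popcount are single instructions on modern hardware, comparing two 256-bit descriptors is far cheaper than the floating-point distance computations used by conventional descriptors.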
The recent proliferation of smartphones and tablets leads to considering such devices as a means for executing cyber-attacks. This scenario has rarely been considered before, since mobile devices have always represented a target for cyber-criminals rather than a vector to exploit. In this paper we introduce an innovative mobile botnet infrastructure, composed of mobile agents, for the execution of...
This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. UCX design provides the ability to tailor its APIs and network functionality...
The main memory system is facing increasingly high pressure from the advances of multi-core processors. The simplicity of the conventional memory architecture has helped minimize memory latency and reduce design cost. However, in the present multi-core era, it is increasingly attractive to adopt flexible and advanced memory organizations to further improve memory bandwidth utilization, power efficiency, and...