Commodity graphics processing units (GPUs) have rapidly evolved into high-performance accelerators for data-parallel computing, through a large array of processing cores and the CUDA programming model with a C-like interface. However, optimizing an application for maximum performance based on the GPU architecture is not a trivial task, owing to the tremendous change from conventional multi-core to the...
The delivery of data to computing resources in a short time is a crucial issue for the effectiveness of High Performance Computing. We meet this issue when, for example, designing drivers for virtual machines. We developed two tools to speed up data transfers between Xen virtual machines. The first one is a circular buffer shared in user memory space between the two communicating domains and allowing...
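The core of such a shared circular buffer is the single-producer/single-consumer head/tail protocol. The sketch below illustrates that protocol over a flat byte array; it is not the paper's tool, and the Xen-specific step of mapping the buffer between the two domains' user memory is omitted:

```python
# Illustrative single-producer/single-consumer circular buffer.
# One slot is always left empty so a full buffer can be told apart
# from an empty one without extra bookkeeping.
class RingBuffer:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # next write position (owned by the producer)
        self.tail = 0  # next read position (owned by the consumer)

    def _used(self):
        return (self.head - self.tail) % self.capacity

    def write(self, data):
        """Append bytes; refuse (returning False) if they do not fit."""
        if len(data) > self.capacity - 1 - self._used():
            return False
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.capacity
        return True

    def read(self, n):
        """Consume up to n bytes in FIFO order."""
        n = min(n, self._used())
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.capacity
        return bytes(out)
```

In the real inter-domain setting, `head` and `tail` would live in the shared region itself and each side would update only its own index, which is what makes the scheme safe without locks.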
This paper investigates the potential of flash as a large and slow memory behind dynamic random-access memory (DRAM) for stencil computation, which is one of the most common and important computation kernels in various scientific and engineering simulations. We evaluate the performance of a fastswap kernel, which was recently incorporated into Linux, in stencil computation using flash as a swap device...
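For readers unfamiliar with stencil kernels, a minimal one-dimensional three-point (Jacobi-style) sweep is shown below; this is only the shape of the computation class the abstract refers to, and the flash/DRAM tiering the paper evaluates is not modelled:

```python
# One Jacobi-style stencil sweep over a 1D grid: each interior point is
# replaced by the average of itself and its two neighbours.  Real
# simulations repeat such sweeps over large multi-dimensional grids,
# which is why memory capacity and bandwidth dominate performance.
def jacobi_sweep(grid):
    new = list(grid)                      # boundaries are left unchanged
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new
```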
Power outages and subsequent recovery are major causes of service downtime. This issue is amplified by the ongoing trend of steadily growing in-memory state of Internet-based services, which increases the risk of data loss and extends recovery time. Protective measures against power outages, such as uninterruptible power supplies, are expensive, maintenance-intensive, and often fragile. With the advent...
Moving toward exascale, the number of GPUs in HPC machines is bound to increase, and applications will spend increasing amounts of time running on those GPU devices. While GPU usage has already led to substantial speedup for HPC codes, their failure rates due to overheating are at least 10 times higher than those seen for the CPUs now commonly used on HPC machines. This makes it increasingly important...
Persistent Memory (PM) technologies, such as Phase Change Memory, STT-RAM, and memristors, are receiving increasingly high interest in academia and industry. PM provides many attractive features, such as DRAM-like speed and storage-like persistence. Yet, because it draws a blurry line between memory and storage, neither a memory- nor a storage-based model is a natural fit. Best integrating PM into existing...
Many emerging applications from various domains often exhibit heterogeneous memory characteristics. When running in combination on parallel platforms, these applications present a daunting variety of workload behaviors that challenge the effectiveness of any memory allocation strategy. Prior partitioning-based or random memory allocation schemes typically manage only one level of the memory hierarchy...
Recent mobile consumer devices suffer from limited memory and tight power budgets. Deduplication helps reduce the memory footprint by identifying memory pages with identical content. Linux adopts the Kernel Samepage Merging (KSM) scheme for memory page deduplication. However, current KSM can incur significant power consumption due to its inefficient scanning. In consumer...
As recently shown in 2013, Android-driven smartphones and tablet PCs are vulnerable to so-called cold boot attacks. With physical access to an Android device, forensic memory dumps can be acquired with tools like FROST that exploit the remanence effect of DRAM to read out what is left in memory after a short reboot. While FROST can in some configurations be deployed to break full disk encryption,...
Frequency table computation is a key step in decision tree learning algorithms. In this paper we present a novel implementation targeted at a dataflow architecture implemented on a field-programmable gate array (FPGA). Consistent with the dataflow model of computation, the kernel views the input dataset as synchronous streams of attribute and class values. The kernel was benchmarked using key functions from...
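The frequency-table step itself is simple to state in software, which is what makes it a natural streaming kernel: for one attribute, count how often each (attribute value, class label) pair occurs. A reference version (names illustrative, not taken from the paper) is:

```python
from collections import Counter

# For one attribute column, count co-occurrences of attribute value and
# class label.  Decision-tree learners derive split criteria (e.g.
# information gain) from exactly these counts.
def frequency_table(attribute_values, class_labels):
    return Counter(zip(attribute_values, class_labels))
```

The FPGA kernel described in the abstract consumes the two streams element-by-element, so this pairwise counting maps directly onto a dataflow pipeline.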
With energy efficiency and power consumption being the primary impediment in the path to exascale systems, low-power high performance embedded systems are of increasing interest. The Parallella System-on-module (SoM) created by Adapteva combines the Epiphany-IV 64-core coprocessor with a host ARM processor housed in a Zynq System-on-chip. The Epiphany integrates low-power RISC cores on a 2D mesh network...
The purpose of this study is to evaluate the performance of a two-dimensional multi-threaded linear filtering process on GPU and FPGA platforms. To obtain the implementation on the different platforms, the OpenCL API is used; OpenCL offers the advantage of platform-independent programming. The results on three different platforms are compared to each other within this scope. These platforms are CPU, GPU, and FPGA...
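The operation being benchmarked, a 2D linear filter, is a weighted sum over a sliding window. A naive reference implementation is sketched below (the 3x3 averaging kernel in the test is purely an illustrative choice, not the paper's filter):

```python
# Naive 2D linear filter: slide the kernel over the image and take the
# weighted sum at each valid (non-padded) position.  Each output pixel
# is independent, which is why the computation parallelises well on
# GPUs and FPGAs.
def filter2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += kernel[ky][kx] * image[y + ky][x + kx]
            out[y][x] = acc
    return out
```

An OpenCL version assigns one work-item per output pixel, replacing the two outer loops with the global work-item ID.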
DRAM consists of multiple resources called banks that can be accessed in parallel and independently maintain state information. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are typically shared among all cores, even though programs running on the cores do not share memory space. In this situation, memory performance is highly unpredictable due to contention in the shared banks.
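Why shared banks cause contention can be seen from the address mapping: a handful of physical address bits select the bank, so buffers belonging to different cores can land in the same bank purely by accident. The mapping below (bank index in physical address bits 13-15, 8 banks) is a made-up illustration; real memory controllers use platform-specific, often XOR-based, mappings:

```python
# Assumed mapping for illustration only: 8 banks selected by physical
# address bits 13-15.  Two addresses whose bank bits match hit the same
# bank and serialise behind each other's row activations.
BANK_SHIFT, BANK_MASK = 13, 0x7

def bank_of(phys_addr):
    return (phys_addr >> BANK_SHIFT) & BANK_MASK

def same_bank(a, b):
    return bank_of(a) == bank_of(b)
```

Bank-partitioning schemes give each core physical pages whose bank bits differ, so that this predicate is false across cores by construction.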
GPUs (Graphics Processing Units) are designed to solve large data-parallel problems encountered in the fields of image processing, scene rendering, video playback, and gaming. GPUs are therefore designed to handle a higher degree of parallelism as compared to conventional CPUs. GPGPU (General Purpose computing on Graphics Processing Units) enables users to do parallel computing on the graphics hardware...
Energy efficiency of financial computations is a performance criterion that can no longer be dismissed, and is as crucial as raw acceleration and accuracy of the solution. In order to reduce the energy consumption of financial accelerators, FPGAs offer a good compromise with low power consumption and high parallelism. However, designing and prototyping an application on an FPGA-based platform are...
Embedded and real-time software is often constrained by several temporal requirements. Therefore, it is important to design embedded software that meets the required performance goal. The inception of embedded graphics processing units (GPUs) brings fresh hope for developing high-performance embedded software that was previously not feasible on embedded platforms. Whereas GPUs use massive parallelism...
Modern server and desktop systems combine multiple computational cores and accelerator devices into a hybrid architecture. GPUs as one class of such devices provide dedicated processing power and memory capacities for data parallel computation of 2D and 3D graphics. Although these cards have demonstrated their applicability in a variety of areas, they are almost exclusively used by special purpose...
The smallest instance offered by Amazon EC2 comes with 615 MB of memory and a 7.9 GB disk image. While small by today's standards, embedded web servers with memory footprints well under 100 kB indicate that there is much to be saved. In this work we investigate how large VM populations the OpenStack hypervisor can be made to sustain, by tuning it for scalability and minimizing virtual machine images...
The kernel recursive least squares (KRLS) algorithm performs non-linear regression in an online manner, with computational requirements similar to linear techniques. In this paper, an implementation of the KRLS algorithm, utilising pipelining and vectorisation for performance and microcoding for reusability, is described. The design can be scaled to allow tradeoffs between capacity, performance and...
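For orientation, a simplified software KRLS is sketched below: it keeps an inverse regularised kernel matrix and grows it one sample at a time via a block-matrix (Schur-complement) update, which is the recursive structure a hardware pipeline can exploit. The Gaussian kernel, its width, and the regulariser are illustrative choices; the paper's sparsification and microcoded design are not reproduced:

```python
import numpy as np

def gauss(a, b, width=1.0):
    """Gaussian (RBF) kernel between two points."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * width ** 2))

class KRLS:
    """Online kernel ridge regression with a recursive inverse update."""
    def __init__(self, lam=1e-4):
        self.lam = lam              # regulariser added to the diagonal
        self.X, self.Kinv = [], None

    def update(self, x, y):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        if not self.X:
            self.Kinv = np.array([[1.0 / (gauss(x, x) + self.lam)]])
            self.X, self.y = [x], np.array([float(y)])
        else:
            b = np.array([gauss(xi, x) for xi in self.X])
            d = gauss(x, x) + self.lam
            a = self.Kinv @ b
            s = d - b @ a                      # Schur complement
            n = len(self.X)
            Kinv = np.empty((n + 1, n + 1))    # block-inverse extension
            Kinv[:n, :n] = self.Kinv + np.outer(a, a) / s
            Kinv[:n, n] = -a / s
            Kinv[n, :n] = -a / s
            Kinv[n, n] = 1.0 / s
            self.Kinv = Kinv
            self.X.append(x)
            self.y = np.append(self.y, float(y))
        self.alpha = self.Kinv @ self.y        # regression weights

    def predict(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return float(sum(a * gauss(xi, x) for a, xi in zip(self.alpha, self.X)))
```

Each update costs O(n^2) rather than the O(n^3) of refitting from scratch, which is what makes the algorithm attractive for streaming, pipelined implementations.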
Floating-point units are seldom present in highly constrained systems, due to their silicon and energy footprint; floating point is instead emulated in software with algorithms based on integer arithmetic. In this paper, we use runtime code generation to produce flexible, optimized floating-point routines that outperform the standard ones. On a Texas Instruments MSP430 fitted with only 512 bytes of RAM, we achieved mean speedups of 1032 % and 52 %, with tuning...
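To make the integer-arithmetic emulation concrete, here is a soft-float multiply for IEEE 754 single precision written with integer operations only. It is a teaching sketch, not the paper's generated code: it handles normal numbers only (no zeros, subnormals, infinities, or NaNs) and truncates instead of rounding to nearest:

```python
import struct

def f32_bits(f):
    """Raw 32-bit pattern of a Python float as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def f32_val(b):
    """Float value of a raw 32-bit IEEE 754 single pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_mul(a_bits, b_bits):
    """Multiply two singles (given as raw bits) using integers only."""
    sign = (a_bits ^ b_bits) & 0x80000000
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000    # restore implicit leading 1
    mb = (b_bits & 0x7FFFFF) | 0x800000
    m = ma * mb                            # 48-bit significand product
    e = ea + eb - 127                      # exponents are biased by 127
    if m & (1 << 47):                      # product in [2, 4): renormalise
        m >>= 24
        e += 1
    else:                                  # product in [1, 2)
        m >>= 23
    return sign | (e << 23) | (m & 0x7FFFFF)
```

On a 16-bit microcontroller the 24x24-bit significand product itself decomposes into 16-bit partial products, and that is where specialising the routine at runtime (e.g. when one operand is a known constant) pays off.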