Commodity graphics processing units (GPUs) have rapidly evolved into high-performance accelerators for data-parallel computing, through a large array of processing cores and the CUDA programming model with a C-like interface. However, optimizing an application for maximum performance based on the GPU architecture is not a trivial task, owing to the tremendous change from conventional multi-core to the...
The delivery of data to computing resources in a short time is a crucial issue for the effectiveness of High Performance Computing. We meet this issue when, for example, designing drivers for virtual machines. We developed two tools to speed up data transfers between Xen virtual machines. The first one is a circular buffer shared in user memory space between the two communicating domains and allowing...
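The core of such a shared circular buffer is the single-producer/single-consumer head/tail protocol. The sketch below illustrates that protocol over a flat byte array; it is not the paper's tool, and the Xen-specific step of mapping the buffer between the two domains' user memory is omitted:

```python
# Illustrative single-producer/single-consumer circular buffer.
# One slot is always left empty so a full buffer can be told apart
# from an empty one without extra bookkeeping.
class RingBuffer:
    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.capacity = capacity
        self.head = 0  # next write position (owned by the producer)
        self.tail = 0  # next read position (owned by the consumer)

    def _used(self):
        return (self.head - self.tail) % self.capacity

    def write(self, data):
        """Append bytes; refuse (returning False) if they do not fit."""
        if len(data) > self.capacity - 1 - self._used():
            return False
        for b in data:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.capacity
        return True

    def read(self, n):
        """Consume up to n bytes in FIFO order."""
        n = min(n, self._used())
        out = bytearray()
        for _ in range(n):
            out.append(self.buf[self.tail])
            self.tail = (self.tail + 1) % self.capacity
        return bytes(out)
```

In the real inter-domain setting, `head` and `tail` would live in the shared region itself and each side would update only its own index, which is what makes the scheme safe without locks.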
This paper investigates the potential of flash as a large and slow memory behind dynamic random-access memory (DRAM) for stencil computation, which is one of the most common and important computation kernels in various scientific and engineering simulations. We evaluate the performance of a fastswap kernel, which was recently incorporated into Linux, in stencil computation using flash as a swap device...
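For readers unfamiliar with stencil kernels, a minimal one-dimensional three-point (Jacobi-style) sweep is shown below; this is only the shape of the computation class the abstract refers to, and the flash/DRAM tiering the paper evaluates is not modelled:

```python
# One Jacobi-style stencil sweep over a 1D grid: each interior point is
# replaced by the average of itself and its two neighbours.  Real
# simulations repeat such sweeps over large multi-dimensional grids,
# which is why memory capacity and bandwidth dominate performance.
def jacobi_sweep(grid):
    new = list(grid)                      # boundaries are left unchanged
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new
```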
Power outages and subsequent recovery are major causes of service downtime. This issue is amplified by the ongoing trend of steadily growing in-memory state of Internet-based services, which increases the risk of data loss and extends recovery time. Protective measures against power outages, such as uninterruptible power supplies, are expensive, maintenance-intensive, and often fragile. With the advent...
Moving toward exascale, the number of GPUs in HPC machines is bound to increase, and applications will spend increasing amounts of time running on those GPU devices. While GPU usage has already led to substantial speedup for HPC codes, their failure rates due to overheating are at least 10 times higher than those seen for the CPUs now commonly used on HPC machines. This makes it increasingly important...
Persistent Memory (PM) technologies, such as Phase Change Memory, STT-RAM, and memristors, are receiving increasingly high interest in academia and industry. PM provides many attractive features, such as DRAM-like speed and storage-like persistence. Yet, because it draws a blurry line between memory and storage, neither a memory- nor a storage-based model is a natural fit. Best integrating PM into existing...
Many emerging applications from various domains often exhibit heterogeneous memory characteristics. When running in combination on parallel platforms, these applications present a daunting variety of workload behaviors that challenge the effectiveness of any memory allocation strategy. Prior partitioning-based or random memory allocation schemes typically manage only one level of the memory hierarchy...
Recent mobile consumer devices suffer from limited memory and tight power budgets. Deduplication helps reduce the memory footprint by identifying memory pages with identical content. Linux adopts the Kernel Samepage Merging (KSM) scheme for memory page deduplication. However, current KSM can incur significant power consumption due to its inefficient scanning. In consumer...
As recently shown in 2013, Android-driven smartphones and tablet PCs are vulnerable to so-called cold boot attacks. With physical access to an Android device, forensic memory dumps can be acquired with tools like FROST that exploit the remanence effect of DRAM to read out what is left in memory after a short reboot. While FROST can in some configurations be deployed to break full disk encryption,...
Frequency table computation is a key step in decision tree learning algorithms. In this paper we present a novel implementation targeted at a dataflow architecture implemented on a field-programmable gate array (FPGA). Consistent with the dataflow model of computation, the kernel views the input dataset as synchronous streams of attribute and class values. The kernel was benchmarked using key functions from...
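The frequency-table step itself is simple to state in software, which is what makes it a natural streaming kernel: for one attribute, count how often each (attribute value, class label) pair occurs. A reference version (names illustrative, not taken from the paper) is:

```python
from collections import Counter

# For one attribute column, count co-occurrences of attribute value and
# class label.  Decision-tree learners derive split criteria (e.g.
# information gain) from exactly these counts.
def frequency_table(attribute_values, class_labels):
    return Counter(zip(attribute_values, class_labels))
```

The FPGA kernel described in the abstract consumes the two streams element-by-element, so this pairwise counting maps directly onto a dataflow pipeline.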
With energy efficiency and power consumption being the primary impediment in the path to exascale systems, low-power high performance embedded systems are of increasing interest. The Parallella System-on-module (SoM) created by Adapteva combines the Epiphany-IV 64-core coprocessor with a host ARM processor housed in a Zynq System-on-chip. The Epiphany integrates low-power RISC cores on a 2D mesh network...
The purpose of this study is to evaluate the performance of a two-dimensional multi-threaded linear filtering process on GPU and FPGA platforms. To obtain the implementation on the different platforms, the OpenCL API is used; OpenCL offers the advantage of platform-independent programming. The results on three different platforms are compared to each other within this scope. These platforms are CPU, GPU, and FPGA...
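The operation being benchmarked, a 2D linear filter, is a weighted sum over a sliding window. A naive reference implementation is sketched below (the 3x3 averaging kernel in the test is purely an illustrative choice, not the paper's filter):

```python
# Naive 2D linear filter: slide the kernel over the image and take the
# weighted sum at each valid (non-padded) position.  Each output pixel
# is independent, which is why the computation parallelises well on
# GPUs and FPGAs.
def filter2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    acc += kernel[ky][kx] * image[y + ky][x + kx]
            out[y][x] = acc
    return out
```

An OpenCL version assigns one work-item per output pixel, replacing the two outer loops with the global work-item ID.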
DRAM consists of multiple resources called banks that can be accessed in parallel and independently maintain state information. In Commercial Off-The-Shelf (COTS) multicore platforms, banks are typically shared among all cores, even though programs running on the cores do not share memory space. In this situation, memory performance is highly unpredictable due to contention in the shared banks.
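Why shared banks cause contention can be seen from the address mapping: a handful of physical address bits select the bank, so buffers belonging to different cores can land in the same bank purely by accident. The mapping below (bank index in physical address bits 13-15, 8 banks) is a made-up illustration; real memory controllers use platform-specific, often XOR-based, mappings:

```python
# Assumed mapping for illustration only: 8 banks selected by physical
# address bits 13-15.  Two addresses whose bank bits match hit the same
# bank and serialise behind each other's row activations.
BANK_SHIFT, BANK_MASK = 13, 0x7

def bank_of(phys_addr):
    return (phys_addr >> BANK_SHIFT) & BANK_MASK

def same_bank(a, b):
    return bank_of(a) == bank_of(b)
```

Bank-partitioning schemes give each core physical pages whose bank bits differ, so that this predicate is false across cores by construction.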
GPUs (Graphics Processing Units) are designed to solve large data-parallel problems encountered in the fields of image processing, scene rendering, video playback, and gaming. GPUs are therefore designed to handle a higher degree of parallelism as compared to conventional CPUs. GPGPU (General Purpose computing on Graphics Processing Units) enables users to do parallel computing on the graphics hardware...
Energy efficiency of financial computations is a performance criterion that can no longer be dismissed, and is as crucial as raw acceleration and accuracy of the solution. In order to reduce the energy consumption of financial accelerators, FPGAs offer a good compromise with low power consumption and high parallelism. However, designing and prototyping an application on an FPGA-based platform are...
Embedded and real-time software is often constrained by several temporal requirements. Therefore, it is important to design embedded software that meets the required performance goal. The inception of embedded graphics processing units (GPUs) brings fresh hope for developing high-performance embedded software that was previously not feasible on embedded platforms. Whereas GPUs use massive parallelism...
Modern server and desktop systems combine multiple computational cores and accelerator devices into a hybrid architecture. GPUs as one class of such devices provide dedicated processing power and memory capacities for data parallel computation of 2D and 3D graphics. Although these cards have demonstrated their applicability in a variety of areas, they are almost exclusively used by special purpose...
The smallest instance offered by Amazon EC2 comes with 615 MB of memory and a 7.9 GB disk image. While small by today's standards, embedded web servers with memory footprints well under 100 kB indicate that there is much to be saved. In this work we investigate how large VM populations the OpenStack hypervisor can be made to sustain, by tuning it for scalability and minimizing virtual machine images...
The kernel recursive least squares (KRLS) algorithm performs non-linear regression in an online manner, with computational requirements similar to linear techniques. In this paper, an implementation of the KRLS algorithm, utilising pipelining and vectorisation for performance and microcoding for reusability, is described. The design can be scaled to allow tradeoffs between capacity, performance and...
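For orientation, a simplified software KRLS is sketched below: it keeps an inverse regularised kernel matrix and grows it one sample at a time via a block-matrix (Schur-complement) update, which is the recursive structure a hardware pipeline can exploit. The Gaussian kernel, its width, and the regulariser are illustrative choices; the paper's sparsification and microcoded design are not reproduced:

```python
import numpy as np

def gauss(a, b, width=1.0):
    """Gaussian (RBF) kernel between two points."""
    return np.exp(-np.sum((a - b) ** 2) / (2 * width ** 2))

class KRLS:
    """Online kernel ridge regression with a recursive inverse update."""
    def __init__(self, lam=1e-4):
        self.lam = lam              # regulariser added to the diagonal
        self.X, self.Kinv = [], None

    def update(self, x, y):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        if not self.X:
            self.Kinv = np.array([[1.0 / (gauss(x, x) + self.lam)]])
            self.X, self.y = [x], np.array([float(y)])
        else:
            b = np.array([gauss(xi, x) for xi in self.X])
            d = gauss(x, x) + self.lam
            a = self.Kinv @ b
            s = d - b @ a                      # Schur complement
            n = len(self.X)
            Kinv = np.empty((n + 1, n + 1))    # block-inverse extension
            Kinv[:n, :n] = self.Kinv + np.outer(a, a) / s
            Kinv[:n, n] = -a / s
            Kinv[n, :n] = -a / s
            Kinv[n, n] = 1.0 / s
            self.Kinv = Kinv
            self.X.append(x)
            self.y = np.append(self.y, float(y))
        self.alpha = self.Kinv @ self.y        # regression weights

    def predict(self, x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return float(sum(a * gauss(xi, x) for a, xi in zip(self.alpha, self.X)))
```

Each update costs O(n^2) rather than the O(n^3) of refitting from scratch, which is what makes the algorithm attractive for streaming, pipelined implementations.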
Floating-point units are seldom present in highly constrained systems, due to their silicon and energy footprint; floating point is instead emulated in software with algorithms based on integer arithmetic. In this paper, we use runtime code generation to produce flexible, optimized floating-point routines that outperform the standard ones. On a Texas Instruments MSP430 fitted with only 512 bytes of RAM, we achieved mean speedups of 1032 % and 52 %, with tuning...
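To make the integer-arithmetic emulation concrete, here is a soft-float multiply for IEEE 754 single precision written with integer operations only. It is a teaching sketch, not the paper's generated code: it handles normal numbers only (no zeros, subnormals, infinities, or NaNs) and truncates instead of rounding to nearest:

```python
import struct

def f32_bits(f):
    """Raw 32-bit pattern of a Python float as an IEEE 754 single."""
    return struct.unpack("<I", struct.pack("<f", f))[0]

def f32_val(b):
    """Float value of a raw 32-bit IEEE 754 single pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

def f32_mul(a_bits, b_bits):
    """Multiply two singles (given as raw bits) using integers only."""
    sign = (a_bits ^ b_bits) & 0x80000000
    ea = (a_bits >> 23) & 0xFF
    eb = (b_bits >> 23) & 0xFF
    ma = (a_bits & 0x7FFFFF) | 0x800000    # restore implicit leading 1
    mb = (b_bits & 0x7FFFFF) | 0x800000
    m = ma * mb                            # 48-bit significand product
    e = ea + eb - 127                      # exponents are biased by 127
    if m & (1 << 47):                      # product in [2, 4): renormalise
        m >>= 24
        e += 1
    else:                                  # product in [1, 2)
        m >>= 23
    return sign | (e << 23) | (m & 0x7FFFFF)
```

On a 16-bit microcontroller the 24x24-bit significand product itself decomposes into 16-bit partial products, and that is where specialising the routine at runtime (e.g. when one operand is a known constant) pays off.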