Embedded systems require even more flexibility. Several systems permit software updates in the field. However, these updates must be reliable; otherwise, the results can be catastrophic. Device drivers are updated often and are very vulnerable to this problem, requiring mechanisms that can capture errors arising from updates at runtime. This work proposes an approach for runtime errors...
Virtualization has emerged as a feasible technique for embedded systems, providing safer platforms, improving design quality and reducing manufacturing costs. However, its inherent overhead still prevents its wide adoption. Most current attempts use the para-virtualization technique, which imposes the cost of performing comprehensive changes in the guest OS. We propose the adoption of full-virtualization...
High-throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs) and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated/released when it is dispatched to / finishes in a streaming...
More and more graph algorithms are being GPU-enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive, with many irregular/data-dependent memory accesses. Due to these factors, graph algorithms on GPUs have low execution efficiency. In this work, we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access...
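The irregular, data-dependent accesses the abstract describes can be seen in a minimal level-synchronous BFS over a CSR graph (an illustrative sketch, not the paper's mechanism; the function and variable names are my own):

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    """Level-synchronous BFS over a graph in CSR form.

    The inner loop's loads from col_idx and dist are indirect and
    data-dependent -- the irregular access pattern that makes graph
    algorithms inefficient on GPUs."""
    n = len(row_ptr) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        # Neighbor range comes from row_ptr: on a GPU this is an
        # indirect load whose target depends on the graph structure.
        for e in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[e]
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist
```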
In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We use a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture...
For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GP-GPUs, nested loops represent an important source of parallelism. Existing solutions for mapping nested loops onto CGRAs, however, are either designed for perfectly nested loops only, or are expensive and inflexible. Efficient CGRA mapping of imperfectly nested loops with arbitrary nesting depth still remains a challenge. In this...
General purpose computing using graphics processing units (GPGPUs) is an attractive option to achieve power efficient throughput computing. But the power efficiency of GPGPUs can be significantly curtailed in the presence of divergence. This paper evaluates two important facets of this problem. First, we study the branch divergence behavior of various GPGPU workloads. We show that only a few branch...
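Branch divergence can be illustrated with a toy model of a warp: a group of lanes that must serialize both paths of a branch whenever their predicates disagree (a sketch of the concept only, with a shrunken warp size; the workloads and metrics in the paper are not reproduced here):

```python
def count_divergent_warps(predicates, warp_size=4):
    """Count warps whose lanes disagree on a branch predicate.

    On a GPU, a warp where some lanes take the branch and others
    do not must execute both paths under masks, losing throughput.
    warp_size is 4 here for readability (real GPUs use e.g. 32)."""
    divergent = 0
    for w in range(0, len(predicates), warp_size):
        lanes = predicates[w:w + warp_size]
        # A warp diverges only when the predicate is mixed.
        if any(lanes) and not all(lanes):
            divergent += 1
    return divergent
```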
Stencil computation is a performance-critical kernel used in scientific and engineering applications. We define the term locality of computation to guide stencil optimization by either the architecture or the compiler. Analogous to locality of reference, computational behavior is also classified into spatial locality and temporal locality. This paper develops equivalent computation elimination (ECE)...
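The reuse that stencil optimizations exploit shows up already in a one-dimensional 3-point Jacobi sweep: each point reads its neighbors (spatial locality) and the whole array is revisited every sweep (temporal locality). A minimal sketch, not the paper's ECE technique:

```python
def jacobi_1d(a, steps):
    """1D 3-point Jacobi stencil, boundary points held fixed.

    b[i] reads a[i-1], a[i], a[i+1]: adjacent iterations share two
    of their three inputs (spatial locality), and successive sweeps
    revisit the same array (temporal locality)."""
    for _ in range(steps):
        b = a[:]  # boundaries copied unchanged
        for i in range(1, len(a) - 1):
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a = b
    return a
```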
The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without...
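The core idea of a bitmapped block format can be sketched in a few lines: a set bit marks a stored entry, so a zero inside a block costs one bit instead of one full value. This is only an illustration of the bitmap idea; the mapped blocked row format's actual layout differs in detail:

```python
def pack_block(block):
    """Pack a small dense block into (bitmap, values).

    Bit `pos` of the bitmap is set iff block[pos] is nonzero;
    only the nonzero values are stored, in position order."""
    bitmap, values = 0, []
    for pos, x in enumerate(block):
        if x != 0:
            bitmap |= 1 << pos
            values.append(x)
    return bitmap, values

def unpack_block(bitmap, values, size):
    """Inverse of pack_block: expand back to a dense block."""
    out, k = [], 0
    for pos in range(size):
        if bitmap >> pos & 1:
            out.append(values[k])
            k += 1
        else:
            out.append(0)
    return out
```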
Although OpenCL programming provides full code portability between different hardware platforms, performance portability can be far from satisfactory. In this work, we use a set of representative 3D stencil computations to study OpenCL's performance portability between GPUs and CPUs. For each stencil computation, we have devised different implementations of the computational kernel function, all being...
Many recent data-intensive parallel systems are built with cost-effective hardware and combine compute and storage facilities. Since bandwidth-bisecting networks are the norm, distributing jobs near their data provides significant performance improvements. However, data locality information is not easily available to the programmer: it requires interaction with file system internals, or the adoption of...
Valgrind is a dynamic binary analysis tool used for debugging and profiling. It is mostly used to analyze the memory usage of software applications. Currently it supports the x86, AMD64, ARM, PPC and S390X architectures, and it has recently been ported to MIPS/Linux. This paper describes VG-MIPS, a port of Valgrind to Cavium Networks' Octeon processor for intelligent networking, which hosts a MIPS64...
A typical decoupled access/execute (DAE) processor consists of an Access Processor (AP) and an Execute Processor (EP). The memory-access overhead of the AP can be hidden by the computation of the EP. Based on this principle, this paper introduces a new optimization algorithm for general dense matrix multiplication (GEMM). The algorithm is divided into four levels; every level...
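Splitting GEMM into tile-level loops is the standard way to create such levels, so that the tile being computed on stays resident while the next tile's data is fetched. A minimal blocked multiply, illustrative only (the paper's four-level DAE scheme is not reproduced here):

```python
def gemm_blocked(A, B, bs=2):
    """Blocked dense matrix multiply C = A * B (row-major lists).

    The three outer loops walk bs-by-bs tiles; the three inner
    loops compute one tile's contribution. On a DAE machine, the
    access side streams the next tiles while the execute side
    works on the current ones."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += a * B[k][j]
    return C
```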
The system call is the interaction interface between the operating system and user programs; no program runs without system calls. This paper analyzes the implementation principle of Linux system calls on the ARM processor, describes the structure of a system call with reference to the four main kernel files involved, and uses a simple example to illustrate Linux system calls based...
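The layering can be seen from user space by invoking the same system call through libc and through Python's own wrapper; both ultimately trap into the kernel (on ARM Linux, via the svc instruction with the call number in a register). A POSIX-only sketch:

```python
import ctypes
import os

# Illustrative only: call getpid through libc and compare with
# Python's os.getpid(). Both are thin wrappers over the same
# kernel system call, so they must agree.
libc = ctypes.CDLL(None, use_errno=True)  # the running process's libc
pid_via_libc = libc.getpid()
pid_via_python = os.getpid()
assert pid_via_libc == pid_via_python
```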
This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector processing slave core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing...
Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency, as well as programmability, through the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedules must be fixed at compile time. This restriction hinders CGRAs from being efficient, particularly in accessing external memories...
Cloud computing brings a loosely coupled resource-integration paradigm with virtualized, elastic and cost-efficient resource management capabilities. Virtualization-based logging and replay technologies give users the ability to record the execution of whole virtual machines and recover them at any time in a peer-to-peer mode, which has become an important approach to analyzing system vulnerability,...
As a third-generation universal I/O interconnect technology succeeding the ISA and PCI buses, PCI Express features a lower pin count, higher reliability and a faster transfer rate, which make it a promising prospect. This paper studies the PCI Express interface technology of DSPs based on the KeyStone architecture, and proposes a driver design method aimed at PCI Express interconnection between a DSP and...
OpenCL is now available on a very large set of processors, which makes the language an attractive layer for addressing multiple targets with a single code base. How sensitive OpenCL code is to the underlying hardware in practice remains to be better understood.
Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications. However, as GPUs take on a much bigger role in high-performance computing, there is no effective checkpoint/restart scheme yet due to the GPU's batch-mode execution manner. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler...
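The shape of application-level checkpoint/restart can be sketched on the host side: periodically serialize the loop state, and on restart reload it and resume from the recorded iteration. A GPU scheme would additionally copy device buffers back to the host before serializing. All names here are illustrative, not the paper's API:

```python
import pickle

def run(steps, state=None, ckpt_path=None):
    """Iterative kernel with application-level checkpointing.

    Every two iterations the loop state is serialized to disk;
    passing a saved state resumes from the recorded iteration."""
    if state is None:
        state = {"i": 0, "acc": 0}
    while state["i"] < steps:
        state["acc"] += state["i"]   # the "computation"
        state["i"] += 1
        if ckpt_path and state["i"] % 2 == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)  # checkpoint
    return state["acc"]

def restart(steps, ckpt_path):
    """Resume from the latest checkpoint file."""
    with open(ckpt_path, "rb") as f:
        state = pickle.load(f)
    return run(steps, state=state, ckpt_path=ckpt_path)
```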