Embedded systems require even more flexibility. Several systems permit software updates in the field. However, these updates must be reliable; otherwise, the results can be catastrophic. Device drivers are updated often and are very vulnerable to this problem, requiring mechanisms that can capture errors arising from updates at runtime. This work proposes an approach for runtime errors...
Virtualization has emerged as a feasible technique for embedded systems, providing safer platforms, improving design quality and reducing manufacturing costs. However, its inherent overhead still prevents its wide adoption. Most current attempts use the para-virtualization technique, which imposes the cost of performing comprehensive changes in the guest OS. We propose the adoption of full-virtualization...
High-throughput architectures rely on high thread-level parallelism (TLP) to hide execution latencies. In state-of-the-art graphics processing units (GPUs), threads are organized in a grid of thread blocks (TBs) and each TB contains tens to hundreds of threads. With a TB-level resource management scheme, all the resources required by a TB are allocated/released when it is dispatched to / finishes in a streaming...
More and more graph algorithms are being GPU-enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive, with many irregular/data-dependent memory accesses. Due to these factors, graph algorithms on GPUs have low execution efficiency. In this work, we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access...
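The irregular, data-dependent accesses the abstract describes can be seen in a minimal level-synchronous BFS over a CSR graph (an illustrative sketch, not the paper's mechanism; the function and variable names are my own):

```python
from collections import deque

def bfs_csr(row_ptr, col_idx, source):
    """Level-synchronous BFS over a graph in CSR form.

    The inner loop's loads from col_idx and dist are indirect and
    data-dependent -- the irregular access pattern that makes graph
    algorithms inefficient on GPUs."""
    n = len(row_ptr) - 1
    dist = [-1] * n
    dist[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        # Neighbor range comes from row_ptr: on a GPU this is an
        # indirect load whose target depends on the graph structure.
        for e in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[e]
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return dist
```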
In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We use a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture...
For loop accelerators such as coarse-grained reconfigurable architectures (CGRAs) and GP-GPUs, nested loops represent an important source of parallelism. Existing solutions for mapping nested loops onto CGRAs, however, are either designed for perfectly nested loops only, or are expensive and inflexible. Efficient CGRA mapping of imperfectly nested loops with arbitrary nesting depth still remains a challenge. In this...
General purpose computing using graphics processing units (GPGPUs) is an attractive option to achieve power efficient throughput computing. But the power efficiency of GPGPUs can be significantly curtailed in the presence of divergence. This paper evaluates two important facets of this problem. First, we study the branch divergence behavior of various GPGPU workloads. We show that only a few branch...
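Branch divergence can be illustrated with a toy model of a warp: a group of lanes that must serialize both paths of a branch whenever their predicates disagree (a sketch of the concept only, with a shrunken warp size; the workloads and metrics in the paper are not reproduced here):

```python
def count_divergent_warps(predicates, warp_size=4):
    """Count warps whose lanes disagree on a branch predicate.

    On a GPU, a warp where some lanes take the branch and others
    do not must execute both paths under masks, losing throughput.
    warp_size is 4 here for readability (real GPUs use e.g. 32)."""
    divergent = 0
    for w in range(0, len(predicates), warp_size):
        lanes = predicates[w:w + warp_size]
        # A warp diverges only when the predicate is mixed.
        if any(lanes) and not all(lanes):
            divergent += 1
    return divergent
```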
Stencil computation is a performance-critical kernel used in scientific and engineering applications. We define the term locality of computation to guide stencil optimization by either the architecture or the compiler. Analogous to locality of reference, computational behavior is also classified into spatial locality and temporal locality. This paper develops equivalent computation elimination (ECE)...
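The reuse that stencil optimizations exploit shows up already in a one-dimensional 3-point Jacobi sweep: each point reads its neighbors (spatial locality) and the whole array is revisited every sweep (temporal locality). A minimal sketch, not the paper's ECE technique:

```python
def jacobi_1d(a, steps):
    """1D 3-point Jacobi stencil, boundary points held fixed.

    b[i] reads a[i-1], a[i], a[i+1]: adjacent iterations share two
    of their three inputs (spatial locality), and successive sweeps
    revisit the same array (temporal locality)."""
    for _ in range(steps):
        b = a[:]  # boundaries copied unchanged
        for i in range(1, len(a) - 1):
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0
        a = b
    return a
```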
The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without...
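The core idea of a bitmapped block format can be sketched in a few lines: a set bit marks a stored entry, so a zero inside a block costs one bit instead of one full value. This is only an illustration of the bitmap idea; the mapped blocked row format's actual layout differs in detail:

```python
def pack_block(block):
    """Pack a small dense block into (bitmap, values).

    Bit `pos` of the bitmap is set iff block[pos] is nonzero;
    only the nonzero values are stored, in position order."""
    bitmap, values = 0, []
    for pos, x in enumerate(block):
        if x != 0:
            bitmap |= 1 << pos
            values.append(x)
    return bitmap, values

def unpack_block(bitmap, values, size):
    """Inverse of pack_block: expand back to a dense block."""
    out, k = [], 0
    for pos in range(size):
        if bitmap >> pos & 1:
            out.append(values[k])
            k += 1
        else:
            out.append(0)
    return out
```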
Although OpenCL programming provides full code portability between different hardware platforms, performance portability can be far from satisfactory. In this work, we use a set of representative 3D stencil computations to study OpenCL's performance portability between GPUs and CPUs. For each stencil computation, we have devised different implementations of the computational kernel function, all being...
Many recent data-intensive parallel systems are built with cost-effective hardware and combine compute and storage facilities. Since bandwidth-bisecting networks are the norm, distributing jobs near their data provides significant performance improvements. However, data locality information is not easily available to the programmer: it requires interaction with file system internals, or the adoption of...
Valgrind is a dynamic binary analysis tool used for debugging and profiling. It is mostly used to analyze the memory usage of software applications. Currently it supports the x86, AMD64, ARM, PPC and S390X architectures, and it has recently been ported to MIPS/Linux. This paper describes VG-MIPS, a port of Valgrind to Cavium Networks' Octeon processor for intelligent networking, which hosts a MIPS64...
A typical decoupled access/execute (DAE) processor consists of an Access Processor (AP) and an Execute Processor (EP). The memory-access overhead of the AP can be hidden by the computation of the EP. Based on this principle, this paper introduces a new optimization algorithm for general dense matrix multiplication (GEMM). The algorithm is divided into four levels; every level...
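Splitting GEMM into tile-level loops is the standard way to create such levels, so that the tile being computed on stays resident while the next tile's data is fetched. A minimal blocked multiply, illustrative only (the paper's four-level DAE scheme is not reproduced here):

```python
def gemm_blocked(A, B, bs=2):
    """Blocked dense matrix multiply C = A * B (row-major lists).

    The three outer loops walk bs-by-bs tiles; the three inner
    loops compute one tile's contribution. On a DAE machine, the
    access side streams the next tiles while the execute side
    works on the current ones."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, m, bs):
            for jj in range(0, p, bs):
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, m)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, p)):
                            C[i][j] += a * B[k][j]
    return C
```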
The system call is the interaction interface between the operating system and user programs; no program runs without system calls. This paper analyzes the implementation principle of Linux system calls on the ARM processor, describes the structure of a system call with reference to the four main kernel files involved, and uses a simple example to illustrate Linux system calls based...
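The layering can be seen from user space by invoking the same system call through libc and through Python's own wrapper; both ultimately trap into the kernel (on ARM Linux, via the svc instruction with the call number in a register). A POSIX-only sketch:

```python
import ctypes
import os

# Illustrative only: call getpid through libc and compare with
# Python's os.getpid(). Both are thin wrappers over the same
# kernel system call, so they must agree.
libc = ctypes.CDLL(None, use_errno=True)  # the running process's libc
pid_via_libc = libc.getpid()
pid_via_python = os.getpid()
assert pid_via_libc == pid_via_python
```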
This paper presents ePUMA, a master-slave heterogeneous DSP processor for communications and multimedia. We introduce the ePUMA VPE, a vector processing slave core designed for heavy DSP workloads, and demonstrate how its features can be used to implement DSP kernels that efficiently overlap computing, data access and control to achieve maximum datapath utilization. The efficiency is evaluated by implementing...
Processors based on coarse-grained reconfigurable arrays (CGRAs) provide high performance and energy efficiency, as well as programmability, through the ability to reconfigure the datapath connecting the ALU arrays. A CGRA-based processor executes loop kernels whose schedules must be fixed at compile time. This restriction hinders CGRAs from being efficient, particularly in accessing external memories...
Cloud computing brings a loosely coupled resource-integration paradigm with virtualized, elastic and cost-efficient resource management capabilities. Virtualization-based logging and replay technologies give users the ability to record the execution of whole virtual machines and recover them at any time in a peer-to-peer mode, which has become an important approach to analyzing system vulnerability,...
As a third-generation universal I/O interconnect technology succeeding the ISA and PCI buses, PCI Express features a lower pin count, higher reliability and a faster transfer rate, which make it a promising prospect. This paper studies the PCI Express interface technology of DSPs based on the KeyStone architecture, and proposes a driver design method aimed at PCI Express interconnection between a DSP and...
OpenCL is now available on a very large set of processors, which makes the language an attractive layer for addressing multiple targets with a single code base. How sensitive OpenCL code is to the underlying hardware in practice remains to be better understood.
Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications. However, as GPUs take on a much bigger role in high-performance computing, there is no effective checkpoint/restart scheme yet due to the GPU's batch-mode execution manner. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler...
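The shape of application-level checkpoint/restart can be sketched on the host side: periodically serialize the loop state, and on restart reload it and resume from the recorded iteration. A GPU scheme would additionally copy device buffers back to the host before serializing. All names here are illustrative, not the paper's API:

```python
import pickle

def run(steps, state=None, ckpt_path=None):
    """Iterative kernel with application-level checkpointing.

    Every two iterations the loop state is serialized to disk;
    passing a saved state resumes from the recorded iteration."""
    if state is None:
        state = {"i": 0, "acc": 0}
    while state["i"] < steps:
        state["acc"] += state["i"]   # the "computation"
        state["i"] += 1
        if ckpt_path and state["i"] % 2 == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)  # checkpoint
    return state["acc"]

def restart(steps, ckpt_path):
    """Resume from the latest checkpoint file."""
    with open(ckpt_path, "rb") as f:
        state = pickle.load(f)
    return run(steps, state=state, ckpt_path=ckpt_path)
```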