Search results

chapter

Automatic Scan Parallelization in OpenMP

Maicol Zegarra, Marcio Pereira, Xavier Martorell, Guido Araujo

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 85 - 90

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)

Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of the preceding elements in the original vector up to the corresponding position. Scan is a key operation in many relevant problems like sorting, lexical analysis, string comparison, image filtering among others. Although there are libraries...

chapter

VLAG: A very fast locality approximation model for GPU kernels with regular access patterns

Mohsen Kiani, Amir Rajabzadeh

2017 7th International Conference on Computer and Knowledge Engineering (ICCKE) > 260 - 265

2017 7th International Conference on Computer and Knowledge Engineering (ICCKE)

Performance modeling plays an important role for optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source code-level information to estimate per memory-access instruction, per data array, and per kernel localities within GPU kernels. VLAG...

chapter

Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems

Yingchao Huang, Dong Li

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 166 - 177

2017 IEEE International Conference on Cluster Computing (CLUSTER)

A heterogeneous memory system (HMS) consists of multiple memory components with different properties. GPU is a representative architecture with HMS. It is challenging to decide optimal placement of data objects on HMS because of the large exploration space and complicated memory hierarchy on HMS. In this paper, we introduce performance modeling techniques to predict performance of various data placements...

chapter

A GPU-Friendly Skiplist Algorithm

Nurit Moscovici, Nachshon Cohen, Erez Petrank

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 246 - 259

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)

We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent...

chapter

Graphie: Large-Scale Asynchronous Graph Traversals on Just a GPU

Wei Han, Daniel Mawhirter, Bo Wu, Matthew Buland

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 233 - 245

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)

Most GPU-based graph systems cannot handle large-scale graphs that do not fit in the GPU memory. The ever-increasing graph size demands a scale-up graph system, which can run on a single GPU with optimized memory access efficiency and well-controlled data transfer overhead. However, existing systems either incur redundant data transfers or fail to use shared memory. In this paper we present Graphie,...

chapter

Resilience for Stencil Computations with Latent Errors

Aiman Fang, Aurelien Cavelan, Yves Robert, Andrew A. Chien

2017 46th International Conference on Parallel Processing (ICPP) > 581 - 590

2017 46th International Conference on Parallel Processing (ICPP)

Projections and measurements of error rates in near-exascale and exascale systems suggest a dramatic growth, due to extreme scale (10^9 cores), concurrency, software complexity, and deep submicron transistor scaling. Such a growth makes resilience a critical concern, and may increase the incidence of errors that "escape", silently corrupting application state. Such errors can often be revealed...

chapter

Performance Optimisation of Smoothed Particle Hydrodynamics Algorithms for Multi/Many-Core Architectures

Fabio Baruffa, Luigi Iapichino, Nicolay J. Hammer, Vasileios Karakasis

2017 International Conference on High Performance Computing & Simulation (HPCS) > 381 - 388

2017 International Conference on High Performance Computing & Simulation (HPCS)

We describe a strategy for code modernisation of Gadget, a widely used community code for computational astrophysics. The focus of this work is on node-level performance optimisation, targeting current multi/many-core Intel® architectures. We identify and isolate a sample code kernel, which is representative of a typical Smoothed Particle Hydrodynamics (SPH) algorithm. The code modifications include...

chapter

Cache Partitioning + Loop Tiling: A Methodology for Effective Shared Cache Management

Vasilios Kelefouras, Georgios Keramidas, Nikolaos Voros

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) > 477 - 482

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework for fine-tuning those two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by one order of magnitude keeping at...

chapter

Boosting Performance of Map Matching Algorithms by Parallelization on Graphics Processors

Markus Auer, Hubert Rehborn, Sven-Eric Molzahn, Klaus Bogenberger

2017 IEEE Intelligent Vehicles Symposium (IV) > 462 - 467

2017 IEEE Intelligent Vehicles Symposium (IV)

In this paper, existing map matching algorithms are combined and modified such that the resulting algorithm is suitable for implementation on the graphics processing unit (GPU). The map matching algorithm implemented on GPU consists of a geometrical and topological processing step, which provides high accuracy with high efficiency at the same time. An important building block of the implementation...

chapter

A 142MOPS/mW integrated programmable array accelerator for smart visual processing

Satyajit Das, Davide Rossi, Kevin J. M. Martin, Philippe Coussy, et al.

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

2017 IEEE International Symposium on Circuits and Systems (ISCAS)

Due to increasing demand of low power computing, and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient programmable accelerators. This paper proposes an Integrated Programmable-Array accelerator (IPA) architecture based on an innovative execution model, targeted to accelerate both data and control-flow parts of deeply embedded...

chapter

Memory partitioning-based modulo scheduling for high-level synthesis

Tianyi Lu, Shouyi Yin, Xianqing Yao, Zhicong Xie, et al.

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

2017 IEEE International Symposium on Circuits and Systems (ISCAS)

High-Level Synthesis (HLS) has been widely recognized as an efficient compilation process targeting FPGAs for algorithm evaluation and product prototyping. However, the massively parallel memory access demands and the extremely expensive cost of single-bank memory with multi-port have impeded loop pipelining performance. Thus, based on an alternative multi-bank memory architecture, a joint approach...

chapter

Power Efficient Sharing-Aware GPU Data Management

Abdulaziz Tabbakh, Murali Annavaram, Xuehai Qian

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 698 - 707

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

The power consumed by memory system in GPUs is a significant fraction of the total chip power. As thread level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share considerable amount of data. However, the default GPU scheduling policy...

chapter

Directive-Based Partitioning and Pipelining for Graphics Processing Units

Xuewen Cui, Thomas R. W. Scogland, Bronis R. de Supinski, Wu-chun Feng

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 575 - 584

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication...

chapter

Use of Synthetic Benchmarks for Machine-Learning-Based Performance Auto-Tuning

Tianyi David Han, Tarek S. Abdelrahman

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1350 - 1361

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We explore the use of synthetic benchmarks for the training phase of machine-learning-based automatic performance tuning. We focus on the problem of predicting if the use of local memory on a GPU is beneficial for caching a single target array in a GPU kernel. We show that the use of only 13 real benchmarks leads to poor prediction accuracy (about 58%) of the 13 leave-one-out models trained using...

chapter

TimerShield: Protecting High-Priority Tasks from Low-Priority Timer Interference (Outstanding Paper)

Pratyush Patel, Manohar Vanga, Bjorn B. Brandenburg

2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) > 3 - 12

2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)

Timer interference arises when a high-priority realtime task is delayed by a timer interrupt that is intended for a lower-priority task. We demonstrate that high-resolution timers, as exposed for instance by Linux's hrtimer API, can cause substantial timer interference, which manifests as significantly increased response times and lowered throughput. To eliminate this source of unpredictability, we...

chapter

An improved automatic MPI code generation algorithm for parallelizing compilation

Yangxia Xiang, Caisen Chen, Hongyan Wang, Zeyun Zhou

2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) > 1623 - 1626

2017 IEEE 2nd Advanced Information Technology, Electronic and Automation Control Conference (IAEAC)

Open64 is an open source compiler with powerful analysis and is widely used as a research and commercial development platform. However, it has not been designed and developed to realize MPI parallelization. There are many contributions in the paper. Firstly, the Open64 compiler infrastructure is shown. Secondly, the location of MPI code generation in the Open64 compiler architecture is analyzed. Thirdly,...

chapter

A static-placement, dynamic-issue framework for CGRA loop accelerator

Zhongyuan Zhao, Weiguang Sheng, Weifeng He, ZhiGang Mao, et al.

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 > 1348 - 1353

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE)

This paper presents a static-placement, dynamic-issue (SPDI) framework for the coarse-grained reconfigurable architecture (CGRA) in order to tackle the inefficiencies of the static-issue, static-placement (SISP) CGRA. This framework includes the compiler that statically places the operations and a hardware design, an SPDI CGRA, that automatically schedules the operations. We stress on introducing the...

chapter

To use or not to use: CPUs' cache optimization techniques on GPGPUs

D.R.V.L.B. Thambawita, Roshan G. Ragel, Dhammike Elkaduwe

2016 IEEE International Conference on Information and Automation for Sustainability (ICIAfS) > 1 - 6

2016 IEEE International Conference on Information and Automation for Sustainability (ICIAfS)

General Purpose Graphic Processing Unit (GPGPU) is used widely for achieving high performance or high throughput in parallel programming. This capability of GPGPUs is very famous in the new era and mostly used for scientific computing which requires more processing power than normal personal computers. Therefore, most of the programmers, researchers and industry use this new concept for their work...

chapter

Convolutional Self Organizing Map

Hiroshi Dozono, Gen Niina, Satoru Araki

2016 International Conference on Computational Science and Computational Intelligence (CSCI) > 767 - 771

2016 International Conference on Computational Science and Computational Intelligence (CSCI)

Recently, deep learning became very popular, and was applied to many fields. The convolutional neural networks are often used for representing the layers for deep learning. In this paper, we propose Convolutional Self Organizing Map, which can be applicable to deep learning. Conventional Self Organizing Map uses single layered architecture, and can visualize and classify the input data on 2 dimensional...

chapter

Tessellation-based multi-block memory mapping scheme for high-level synthesis with FPGA

Juan Escobedo, Mingjie Lin

2016 International Conference on Field-Programmable Technology (FPT) > 125 - 132

2016 International Conference on Field-Programmable Technology (FPT)

For many intensive computing tasks, simultaneous data access into multi-dimensional data arrays is highly restricted by its data mapping strategy and memory port constraint. As such, to increase memory accessing bandwidth, innovative memory partitioning and mapping algorithms have been proposed to simultaneously access multiple memory blocks through physically distributing data elements in the same...

INFONA - science communication portal
