Performance modeling plays an important role in optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source code-level information to estimate per memory-access instruction, per data array, and per kernel localities within GPU kernels. VLAG...
We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent...
A histogram is a popular analytic graphical representation of the data distribution resulting from processing a given numerical input. Although sequential histogram computation may be simple, it is no longer suitable for processing high volumes of data. With recent advances in high performance computing (HPC), aided by the accelerating growth of General-Purpose Graphics Processing Units (GPGPUs),...
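The standard strategy for parallelizing histogram computation on GPGPUs is privatization: each worker builds a private histogram over its chunk of the input, and the private copies are then reduced into the final result. The following is an illustrative sketch of that idea (the function and its parameters are assumptions for illustration, not code from the paper):

```python
def histogram_parallel(data, num_bins, num_workers=4):
    """Split `data` into chunks, build one private histogram per chunk,
    then merge the private copies by element-wise summation."""
    chunk = (len(data) + num_workers - 1) // num_workers
    privates = []
    for w in range(num_workers):
        h = [0] * num_bins
        for x in data[w * chunk:(w + 1) * chunk]:
            h[x] += 1  # assumes values are already mapped to bin indices
        privates.append(h)
    # Reduction step: on a GPU this would be a parallel reduction over
    # per-block shared-memory histograms rather than a sequential loop.
    return [sum(col) for col in zip(*privates)]
```

Privatization avoids the contention that arises when all workers update one shared histogram with atomic increments, which is why it is the common GPGPU formulation.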
Information retrieval is a technique used in search engines, advertisement placement, and cognitive databases. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of document retrieval becomes critical. To this end, we consider a Bloom filter, a simple randomized data structure that answers membership queries with no false negatives and a customizable...
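The Bloom filter the abstract refers to can be sketched in a few lines: k hash functions set bits in an m-bit array, so a lookup can never miss an inserted item (no false negatives), while the false-positive rate is tuned via m and k. A minimal illustrative sketch (not the paper's implementation; the class and parameter names are assumptions):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array."""
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [False] * m_bits

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        # All k bits set -> "probably present"; any bit clear -> definitely absent.
        return all(self.bits[p] for p in self._positions(item))
```

The bit array and hashing are embarrassingly parallel, which is what makes the structure attractive for GPU offload in a retrieval pipeline.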
GPUs can enable significant performance improvements for certain classes of data parallel applications and are widely used in recent computer systems. However, GPU execution currently requires explicit low-level operations such as 1) managing memory allocations and transfers between the host system and the GPU, 2) writing GPU kernels in a low-level programming model such as CUDA or OpenCL, and 3)...
We present JolokiaC++, a compiler framework that eases the coding of irregular data applications on GPUs. The effectiveness of the compiler and runtime systems of JolokiaC++ is tested using three kernels, IRREG, MOLDYN, and NBF, executed on NVIDIA GPUs. We developed extensions to the generic parallel constructs that allow portable and efficient programming of codes with irregular accesses on the GPU. We present...
Graphics Processing Units (GPUs) have been successfully used to accelerate scientific applications due to their computation power and the availability of programming languages that make writing scientific applications for GPUs more approachable. However, since the programming model of GPUs requires offloading all the data to the GPU memory, the memory footprint of the application is limited to the...
We implement a promising algorithm for sparse-matrix sparse-vector multiplication (SpMSpV) on the GPU. An efficient k-way merge lies at the heart of finding a fast parallel SpMSpV algorithm. We examine the scalability of three approaches -- no sorting, merge sorting, and radix sorting -- in solving this problem. For breadth-first search (BFS), we achieve a 1.26x speedup over state-of-the-art sparse-matrix...
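The k-way merge at the heart of SpMSpV can be illustrated with the "merge sorting" variant the abstract names: gather the scaled matrix columns selected by the nonzeros of x, sort the resulting (row, value) pairs by row, and reduce duplicates. A small sketch under assumed representations (column-major dict for the matrix, coordinate pairs for the vector; not the paper's code):

```python
def spmspv(cols, x):
    """y = A @ x for sparse A and sparse x.
    `cols` maps column index -> list of (row, value); `x` is a list of
    (index, value) pairs.  Returns y as sorted (row, value) pairs."""
    # 1. Gather: for each nonzero x_j, scale column j of A.
    products = [(r, v * xv) for (j, xv) in x for (r, v) in cols.get(j, [])]
    # 2. Merge step: sort pairs by row index -- on the GPU this is where
    #    the no-sort / merge-sort / radix-sort choices from the abstract apply.
    products.sort(key=lambda rv: rv[0])
    # 3. Segmented reduction: sum values that share a row index.
    y = []
    for r, v in products:
        if y and y[-1][0] == r:
            y[-1][1] += v
        else:
            y.append([r, v])
    return [(r, v) for r, v in y]
```

The gather and reduction steps parallelize naturally; the sorting strategy in step 2 is what determines scalability, which is the trade-off the paper examines.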
Graphics Processing Units (GPUs) have been used extensively to accelerate parallelizable applications in general, and scientific computations in particular. Stencil-based algorithms are used intensively in various research areas and are good candidates for GPU-based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil-based double-precision...
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance of non-contiguous data movement, handle only certain...
Sparse matrix-vector and multi-vector multiplications (SpMV and SpMM) are performance-bottleneck operations in numerous HPC applications. A variety of SpMV GPU kernels using different matrix storage formats have been developed to accelerate these applications. Unlike SpMV, where matrix elements are accessed only once, multiplying by k vectors requires accessing matrix elements k times. In this paper...
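The reuse opportunity the abstract describes, loading each matrix nonzero once and applying it across all k vectors, can be sketched for the common CSR format (an illustrative reference version, not a kernel from the paper):

```python
def csr_spmm(indptr, indices, data, X):
    """Y = A @ X with A in CSR form (indptr, indices, data) and X a dense
    n x k matrix stored as a list of rows.  Each nonzero A[i,j] is loaded
    once and reused across all k columns of X -- the reuse that SpMM
    kernels exploit, in contrast to running SpMV k times."""
    k = len(X[0])
    n_rows = len(indptr) - 1
    Y = [[0.0] * k for _ in range(n_rows)]
    for i in range(n_rows):
        for p in range(indptr[i], indptr[i + 1]):
            a, j = data[p], indices[p]
            for c in range(k):  # innermost loop reuses `a` k times
                Y[i][c] += a * X[j][c]
    return Y
```

On a GPU the inner loop maps naturally onto threads of a warp, so one global-memory load of `a` feeds k fused multiply-adds instead of one.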
Graphics processing units (GPUs) provide a low-cost platform for accelerating high-performance computations. New programming languages, such as CUDA and OpenCL, make GPU programming attractive to programmers. However, programming GPUs is still a cumbersome task for two reasons: tedious performance optimization and a lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming...
Outlier detection is a data mining task consisting of the discovery of observations that deviate substantially from the rest of the data, and it has many important practical applications. Outlier detection in very large data sets is, however, computationally very demanding, and the size limit of the data that can be processed is pushed considerably forward by combining three ingredients: efficient algorithms,...
In this work, we present a back-end for the Python library NumPy that utilizes the GPU seamlessly. We use dynamic code generation to generate kernels, and data is moved transparently to and from the GPU. For the integration into NumPy, we use the Bohrium runtime system. Bohrium hooks into NumPy through the implicit data parallelization of array operations; this approach requires no annotations or...
Changing requirements have driven a revolution in the field of parallel computing, and the emergence of parallel computing as a necessity has boosted the use of GPGPUs. With this emergence comes a drastic improvement in many real-world applications of GPGPUs as well. In this paper we discuss GPGPUs, their evolution, and their contribution to many...
Two Python modules are presented: pyOpenCL, a library that enables programmers to write Open Computing Language (OpenCL) code within Python programs; and ocl, a Python-to-C converter that lets developers write OpenCL kernels using Python syntax. Like CUDA, OpenCL is designed to run on multicore GPUs. OpenCL code can also run on other architectures, including ordinary CPUs and mobile devices, always...
Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGAs, GPUs, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more...
This paper first briefly introduces the principle of ortho-rectification of line-array images, then designs a GPU-based parallel processing method and proposes a shared-memory optimization strategy for POS data that avoids the performance bottleneck caused by frequent accesses to global memory. Finally, a system experiment on ADS40 imagery with a Tesla C2050 GPU validates the parallel processing...
The skeletons of the objects in 3D images can be extracted by using 3D image thinning. The application of 3D image thinning to image analysis is hampered by its considerable computation time. By employing the graphics processing unit (GPU), which has tremendous computing power at an incomparable performance-to-cost ratio, the calculation of 3D image thinning can be accelerated. In this paper,...
Breadth-First Search (BFS) is a basis for many graph traversal and analysis algorithms. In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms to fully exploit the computing power of both the multi-core CPU and GPU. For each level of the BFS algorithm, we dynamically choose the best implementation from: a sequential top-down execution on CPU, a parallel...
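The per-level choice that direction-optimizing BFS makes can be illustrated with the classic top-down/bottom-up switch: when the frontier is small, expand frontier vertices outward; when it is large, let each unvisited vertex scan its neighbors for a frontier member instead. A sketch under assumptions (undirected graph as adjacency lists, a simple frontier-size threshold `alpha`; the paper's CPU-GPU scheduler chooses among more implementations than this):

```python
def bfs_direction_optimizing(adj, source, alpha=0.25):
    """Level-synchronous BFS returning distances from `source` (-1 if
    unreachable).  Each level picks a sweep direction by frontier size."""
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        if len(frontier) < alpha * n:
            # Top-down: expand each frontier vertex's neighbors.
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level
                        nxt.append(v)
        else:
            # Bottom-up: each unvisited vertex checks whether any neighbor
            # is in the frontier (cheaper when the frontier is huge, since
            # the scan can stop at the first hit).
            in_frontier = set(frontier)
            nxt = []
            for v in range(n):
                if dist[v] == -1 and any(u in in_frontier for u in adj[v]):
                    dist[v] = level
                    nxt.append(v)
        frontier = nxt
    return dist
```

Both sweeps produce the same frontier, so the choice is purely a performance decision, which is what lets a heterogeneous scheduler also pick the device (CPU or GPU) per level.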