Search results

chapter

Optimizing memory efficiency for convolution kernels on Kepler GPUs

Xiaoming Chen, Jianxu Chen, Danny Z. Chen, Xiaobo Sharon Hu

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) > 1 - 6

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC)

Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, and image processing. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating...

chapter

Fixing Performance Bugs: An Empirical Study of Open-Source GPGPU Programs

Yi Yang, Ping Xiang, Mike Mantor, Huiyang Zhou

2012 41st International Conference on Parallel Processing > 329 - 339

2012 41st International Conference on Parallel Processing (ICPP)

Given the extraordinary computational power of modern graphics processing units (GPUs), general purpose computation on GPUs (GPGPU) has become an increasingly important platform for high performance computing. To better understand how well the GPU resource has been utilized by application developers and then to facilitate them to develop high performance GPGPU code, we conduct an empirical study on...

chapter

Optimizing Sparse Matrix Vector Multiplication Using Cache Blocking Method on Fermi GPU

Weizhi Xu, Hao Zhang, Shuai Jiao, Da Wang, et al.

2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing > 231 - 235

2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing (SNPD)

It is an important task to tune performance for sparse matrix vector multiplication (SpMV), but it is also a difficult task because of its irregularity. In this paper, we propose a cache blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, which are stored in CSR format. With the blocking method, the corresponding...

chapter

Automatic Optimization of In-Flight Memory Transactions for GPU Accelerators Based on a Domain-Specific Language for Medical Imaging

Richard Membarth, Frank Hannig, Jurgen Teich, Mario Korner, et al.

2012 11th International Symposium on Parallel and Distributed Computing > 211 - 218

2012 11th International Symposium on Parallel and Distributed Computing (ISPDC)

An efficient memory bandwidth utilization for GPU accelerators is crucial for memory bound applications. In medical imaging, the performance of many kernels is limited by the available memory bandwidth since only a few operations are performed per pixel. For such kernels only a fraction of the compute power provided by GPU accelerators can be exploited and performance is predetermined by memory bandwidth...

chapter

Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems

Shucai Xiao, Wu-chun Feng

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum > 2554 - 2557

2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Graphics Processing Units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging...

chapter

Energy Efficiency Analysis of GPUs

Juan M. Cebrián, Gines D. Guerrero, Jose M. Garcia

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum > 1014 - 1022

2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

In the last few years, Graphics Processing Units (GPUs) have become a great tool for massively parallel computing. GPUs are specifically designed for throughput and face several design challenges, especially what is known as the Power and Memory Walls. In these devices, available resources should be used to enhance performance and throughput, as the performance per watt is really high. For massively...

chapter

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Haicheng Wu, Gregory Diamos, Jin Wang, Srihari Cadambi, et al.

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum > 2433 - 2442

2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Data warehousing applications represent an emergent application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high core count architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement...

chapter

The case for GPGPU spatial multitasking

Jacob T. Adriaens, Katherine Compton, Nam Sung Kim, Michael J. Schulte

IEEE International Symposium on High-Performance Computer Architecture > 1 - 12

2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA)

The set-top and portable device market continues to grow, as does the demand for more performance under increasing cost, power, and thermal constraints. The integration of Graphics Processing Units (GPUs) into these devices and the emergence of general-purpose computations on graphics hardware enable a new set of highly parallel applications. In this paper, we propose and make the case for a GPU multitasking...

chapter

Real-time semi-global matching disparity estimation on the GPU

Christian Banz, Holger Blume, Peter Pirsch

2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops) > 514 - 521

2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops)

This paper presents the design, implementation and evaluation of new parallelization schemes for performing dense disparity estimation based on non-parametric rank transform and semi-global matching on Graphics Processing Units (GPUs). A detailed analysis of the performance-limiting factors (memory throughput, instruction throughput, etc.) for each part of the parallel implementation is performed...

chapter

Efficient Implementation of the Overlap Operator on Multi-GPUs

Andrei Alexandru, Michael Lujan, Craig Pelissier, Ben Gamari, et al.

2011 Symposium on Application Accelerators in High-Performance Computing > 123 - 130

2011 Symposium on Application Accelerators in High-Performance Computing (SAAHPC)

Lattice QCD calculations were one of the first applications to show the potential of GPUs in the area of high performance computing. Our interest is to find ways to effectively use GPUs for lattice calculations using the overlap operator. The large memory footprint of these codes requires the use of multiple GPUs in parallel. In this paper we show the methods we used to implement this operator efficiently...

chapter

Where is the data? Why you cannot debate CPU vs. GPU performance without the answer

C Gregg, K Hazelwood

IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) > 134 - 144

2011 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2011)

General purpose GPU Computing (GPGPU) has taken off in the past few years, with great promises for increased desktop processing power due to the large number of fast computing cores on high-end graphics cards. Many publications have demonstrated phenomenal performance and have reported speedups as much as 1000× over code running on multi-core CPUs. Other studies have claimed that well-tuned CPU code...

article

Exploiting SPMD Horizontal Locality

Chunyang Gou, G N Gaydadjiev

IEEE Computer Architecture Letters > 2011 > 10 > 1 > 20 - 23

In this paper, we analyze a particular spatial locality case (called horizontal locality) inherent to manycore accelerator architectures employing barrel execution of SPMD kernels, such as GPUs. We then propose an adaptive memory access granularity framework to exploit and enforce the horizontal locality in order to reduce the interferences among accelerator cores memory accesses and hence improve...

chapter

A Case Study of SWIM: Optimization of Memory Intensive Application on GPGPU

Wei Yi, Yuhua Tang, Guibin Wang, Xudong Fang

2010 3rd International Symposium on Parallel Architectures, Algorithms and Programming > 123 - 129

Third International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010)

Recently, GPGPU has been adopted well in the High Performance Computing (HPC) field. The limited global memory bandwidth poses a great challenge to many GPGPU programmers trying to exploit parallelism within the CPU-GPU heterogeneous platform. In this paper, we choose SWIM, a typical memory intensive application from the SPEC OMP 2001 benchmark suite, for case study. We attempt to optimize the performance...

chapter

GMH: A Message Passing Toolkit for GPU Clusters

Jie Chen, William Watson, Weizhen Mao

2010 IEEE 16th International Conference on Parallel and Distributed Systems > 35 - 42

2010 IEEE 16th International Conference on Parallel and Distributed Systems (ICPADS 2010)

Driven by the market demand for high-definition 3D graphics, commodity graphics processing units (GPUs) have evolved into highly parallel, multi-threaded, many-core processors, which are ideal for data parallel computing. Many applications have been ported to run on a single GPU with tremendous speedups using general C-style programming languages such as CUDA. However, large applications require multiple...

chapter

Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

Ronald Babich, Michael A Clark, Balint Joó

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 11

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced...

chapter

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, et al.

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 13

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis

Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density...

chapter

A Multi-GPU Spectrometer System for Real-Time Wide Bandwidth Radio Signal Analysis

Hirofumi Kondo, Eric Heien, Masao Okita, Dan Werthimer, et al.

International Symposium on Parallel and Distributed Processing with Applications > 594 - 604

2010 International Symposium on Parallel and Distributed Processing with Applications (ISPA 2010)

This paper describes the implementation of a large bandwidth multi-GPU signal processing system for radio astronomy observation. This system performs very large Fast Fourier Transform (FFT) and spectrum analysis to achieve real-time analysis of a large bandwidth spectrum. This is accomplished by implementing a four-step FFT algorithm in Compute Unified Device Architecture (CUDA). The key feature of...

chapter

SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-core Platforms

Tobias Brandvik, Graham Pullan

2010 10th IEEE International Conference on Computer and Information Technology > 1181 - 1188

2010 IEEE 10th International Conference on Computer and Information Technology (CIT)

We present a new software framework for the implementation of applications that use stencil computations on block-structured grids to solve partial differential equations. A key feature of the framework is the extensive use of automatic source code generation which is used to achieve high performance on a range of leading multi-core processors. Results are presented for a simple model stencil running...

chapter

Improving the Performance of the Sparse Matrix Vector Product with GPUs

F Vázquez, G Ortega, J J Fernández, E M Garzón

2010 10th IEEE International Conference on Computer and Information Technology > 1146 - 1151

2010 IEEE 10th International Conference on Computer and Information Technology (CIT)

Sparse matrices are involved in linear systems, eigensystems and partial differential equations from a wide spectrum of scientific and engineering disciplines. Hence, sparse matrix vector product (SpMV) is considered a key operation in engineering and scientific computing. For these applications the optimization of the sparse matrix vector product (SpMV) is very relevant. However, the irregular computation...

chapter

Graphics Card Computing for Cosmology: Cholesky Factorization

Steven Gratton

2010 10th IEEE International Conference on Computer and Information Technology > 1207 - 1212

2010 IEEE 10th International Conference on Computer and Information Technology (CIT)

Cosmological data sets are becoming so large as to make optimal statistical analyses of them impossible. Even with approximations made, the computational challenges can be severe. Cholesky Factorization of matrices is an essential tool. This paper reports on progress made by the author in implementing Cholesky Factorization on one or more graphics processing units (GPUs). Particular attention is paid...

INFONA - science communication portal
