The GPU's SIMD architecture is a double-edged sword when confronting parallel tasks with control-flow divergence. On the one hand, it provides a high-performance yet power-efficient platform that accelerates applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software...
GPUs have become prevalent and more general purpose, but GPU programming remains challenging and time-consuming for the majority of programmers. In addition, it is not always clear which codes will benefit from being ported to the GPU. Therefore, having a tool that estimates GPU performance for a piece of code before a GPU implementation is written is highly desirable. To this end, we propose Cross-Architecture...
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some...
Porting CUDA programs to other heterogeneous and many-core platforms, especially native processors, is valuable for extending the range of CUDA applications, exploiting the many cores of the target platform, and supporting national industries. Traditional binary translation techniques are not adequate for this task. From the standpoint of software reverse engineering, it is feasible to design a new migration...
Modern video cameras normally only capture a single color per pixel, commonly arranged in a Bayer pattern. This means that we must restore the missing color channels in the image or the video frame in post-processing, a process referred to as debayering. In a live video scenario, this operation must be performed efficiently in order to output each frame in real-time, while also yielding acceptable...
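For intuition, a naive debayering of an RGGB mosaic can reconstruct each 2x2 cell by taking that cell's R and B samples directly and averaging its two greens. This is a toy, half-resolution sketch for illustration only, far simpler than the quality- and throughput-focused methods such work targets:

```python
def debayer_rggb(raw):
    """Naive demosaic of an RGGB Bayer mosaic (2D list of samples):

        R G     Each 2x2 cell yields one RGB pixel, so the output
        G B     is half resolution in each dimension.
    """
    h, w = len(raw), len(raw[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            r = raw[y][x]
            g = (raw[y][x + 1] + raw[y + 1][x]) / 2  # average the two greens
            b = raw[y + 1][x + 1]
            row.append((r, g, b))
        out.append(row)
    return out
```

Real-time pipelines instead interpolate the missing channels at full resolution (e.g. bilinearly or edge-aware), which maps naturally onto one GPU thread per output pixel.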
In this paper, we propose a high-throughput parallel long-integer multiplication algorithm for parallel workstations. Among integer arithmetic operations, long-integer multiplication is the most time-consuming and most critical one. Public-key cryptosystems such as RSA and Diffie-Hellman require long-integer multiplication, and the operation is performed heavily for the...
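As a rough illustration of why this operation dominates, here is a minimal schoolbook multiplication over base-2^32 limbs (a hypothetical sequential sketch, not the paper's parallel algorithm): it performs O(n·m) limb products, and it is this product loop that parallel formulations distribute across threads.

```python
def limb_mul(a, b, base=2**32):
    """Schoolbook multiplication of two little-endian limb arrays.

    Computes len(a) * len(b) partial products with carry propagation;
    returns the product as a little-endian limb array.
    """
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            cur = out[i + j] + ai * bj + carry
            out[i + j] = cur % base
            carry = cur // base
        out[i + len(b)] += carry
    return out

def to_int(limbs, base=2**32):
    """Interpret a little-endian limb array as a Python integer."""
    return sum(d * base**i for i, d in enumerate(limbs))
```

For RSA-sized operands (thousands of bits, i.e. dozens of limbs), this quadratic inner loop is exactly where the bulk of the cycles go.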
HEVC (High Efficiency Video Coding) is the newest video compression standard. Compared with previous standards, its coding efficiency is greatly improved at the cost of much higher codec complexity. Consequently, many efforts improve the HEVC algorithm at both the hardware and software levels. For the IQ/IT (inverse quantization/inverse transform) part, HEVC processes TU blocks one by one. And...
In this paper, we design and optimize an ultrasound B-mode imaging pipeline, including a computationally demanding beamformer, on a commercial GPU. For performance optimization, we explore the design space spanned by different memory types, instruction-scheduling and thread-mapping strategies, etc. Then, with the developed B-mode imaging code, we conduct performance evaluations on various...
Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high-performance computing. Such platforms are usually programmed in OpenCL, which provides program portability by allowing the same program to execute on different types of device. As such systems become more mainstream, they will move from application-dedicated devices to platforms that need...
In this paper, we introduce a new Java-based parallel GPGPU simulator, GpuTejas. GpuTejas is a fast trace-driven simulator that uses relaxed synchronization and non-blocking data structures to derive its speedups. It also introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We...
Modern power systems analysis becomes more challenging due to increasing complexity, greater interconnectivity of elements and sub-systems, and the use of alternative generation systems, among other factors. To conduct these analyses efficiently, it is necessary to use modern computational and numerical techniques, such as parallel processing and fast periodic steady-state solution techniques...
Graphics Processing Units (GPU) have been used extensively for accelerating parallelizable applications in general, and scientific computations in particular. Stencil based algorithms are used intensively in various research areas and represent good candidates for GPU based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil based double precision...
In traditional link-level simulation, the multiple-input multiple-output (MIMO) channel model is one of the most time-consuming modules, and more realistic geometry-based channel models consume even more time. In this paper, we propose an efficient simulator implementation of the geometry-based spatial channel model (SCM) on a graphics processing unit (GPU). We first analyze the potential parallelism...
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance of non-contiguous data movement, handle only certain...
Cyber-physical systems (CPS) must perform complex algorithms at very high speed to monitor and control complex real-world phenomena. The GPU, with its large number of cores and extremely high degree of parallelism, promises better computation if the data parallelism often found in real-world CPS scenarios can be exploited. Nevertheless, its performance is limited by the latency incurred when data are...
There is a stage in the GPU computing pipeline where a grid of thread-blocks, or space of computation, is mapped to the problem domain. Normally, the space of computation is a k-dimensional bounding box (BB) that covers a k-dimensional problem. Threads that fall inside the problem domain perform computations and threads that fall outside are discarded, all happening at runtime. For problems with non-square...
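The bounding-box mapping described above can be modeled on the host side. The following hypothetical Python sketch (names and domain are illustrative, not from the abstract) flattens a linear thread id into (x, y) coordinates over a square bounding box and discards threads outside a triangular, non-square domain, mimicking the runtime branch a GPU kernel would take:

```python
def bb_map(tid, width):
    """Map a linear thread id to (x, y) inside a width x width bounding box."""
    return tid % width, tid // width

def run_grid(width, in_domain, work):
    """Simulate launching width*width threads over the bounding box;
    only threads that fall inside the problem domain do useful work."""
    results = {}
    for tid in range(width * width):
        x, y = bb_map(tid, width)
        if in_domain(x, y):          # threads outside the domain are discarded
            results[(x, y)] = work(x, y)
    return results

# Lower-triangular domain: nearly half the bounding-box threads are wasted.
tri = run_grid(4, lambda x, y: x <= y, lambda x, y: x + y)
```

For the 4x4 box above, only 10 of 16 simulated threads do work; alternative space-of-computation mappings aim to shrink that wasted fraction.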
We consider two recently introduced massively multi-core architectures designed for high-performance computing: the Xeon Phi coprocessor and the Kepler graphics processor. We discuss the OpenCL programming model as one that allows us to view the platforms in a unified way and to construct efficient algorithms for both of them. As an example application we investigate a typical algorithm employed in finite...
Sparse matrix-vector and multi-vector multiplications (SpMV and SpMM) are performance-bottleneck operations in numerous HPC applications. A variety of SpMV GPU kernels using different matrix storage formats have been developed to accelerate these applications. Unlike SpMV, where matrix elements are accessed only once, multiplying by k vectors requires accessing matrix elements k times. In this paper...
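The access pattern described above is easy to see in a CSR-format SpMV (a sequential Python sketch for clarity; GPU kernels parallelize the outer row loop). Implementing SpMM as k independent SpMV calls, as the naive `csr_spmm` below does, re-reads every matrix element once per vector:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR format: each matrix element is read once."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

def csr_spmm(row_ptr, col_idx, vals, X):
    """Y = A @ X for k column vectors (X given as a list of rows).
    Naively one SpMV per column, so each element of A is read k times."""
    return [csr_spmv(row_ptr, col_idx, vals, col) for col in zip(*X)]
```

Optimized SpMM kernels instead load each matrix element once and apply it to all k vector entries, trading the repeated reads for register or shared-memory reuse.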
Graphics processing units (GPUs) provide a low-cost platform for accelerating high-performance computations. New programming languages, such as CUDA and OpenCL, make GPU programming attractive to programmers. However, programming GPUs is still a cumbersome task for two reasons: tedious performance optimization and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming...
Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of few large linear systems. The size of each of these small linear systems depends on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently...
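For reference, the standard Gaussian elimination with partial pivoting that such batched small-system solvers parallelize looks as follows. This is a plain sequential sketch of the textbook algorithm, not the GPU kernel the abstract describes:

```python
def gauss_solve(A, b):
    """Solve Ax = b for a small dense system by Gaussian elimination
    with partial pivoting. A is a list of rows; b is the RHS vector."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):                 # eliminate below the pivot
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):                # back substitution
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x
```

With systems of only hundreds of unknowns, one such solve cannot saturate a GPU on its own, which is why batched formulations assign many independent small systems to the device at once.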