Search results

chapter

Characterization and analysis of dynamic parallelism in unstructured GPU applications

Jin Wang, Sudhakar Yalamanchili

2014 IEEE International Symposium on Workload Characterization (IISWC) > 51 - 60

2014 IEEE International Symposium on Workload Characterization (IISWC)

GPUs have been proven very effective for structured applications. However, emerging data intensive applications are increasingly unstructured — irregular in their memory and control flow behavior over massive data sets. While the irregularity in these applications can result in poor workload balance among fine-grained threads or coarse-grained blocks, one can still observe dynamically formed pockets...

chapter

Double precision stencil computations on Kepler GPUs

Anamaria Vizitiu, Lucian Itu, Laszlo Lazar, Constantin Suciu

2014 18th International Conference on System Theory, Control and Computing (ICSTCC) > 123 - 127

2014 18th International Conference on System Theory, Control and Computing (ICSTCC)

Graphics Processing Units (GPU) have been used extensively for accelerating parallelizable applications in general, and scientific computations in particular. Stencil based algorithms are used intensively in various research areas and represent good candidates for GPU based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil based double precision...

chapter

Efficient Scan Operator Methods on a GPU

Adrian P. Dieguez, Margarita Amor, Ramon Doallo

2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing > 190 - 197

2014 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Current GPUs (Graphics Processing Units) offer high computational power at relatively low cost; nonetheless, this enhanced performance often comes at the expense of flexibility and code complexity. Efficient GPU programming requires detailed knowledge of certain hardware aspects. The scan operator is an important building block for a wide range of algorithms. In this paper, we present a number of...

chapter

Fast oil paint image filter algorithm: Optimization at overlapped sub-pixel

Siddhartha Mukherjee, Haresh S Chudgar

2014 International Conference on Advances in Electronics Computers and Communications > 1 - 5

2014 International Conference on Advances in Electronics, Computers and Communications (ICAECC)

Fast Oil Paint Image filter is a performance optimized oil paint algorithm. Current oil paint algorithms are CPU intensive and take a long time to produce the output; further, the time taken increases exponentially with increasing quality. One of the main causes is re-computation. The proposed algorithm significantly reduces re-computation, reducing the processing time by approximately 86%. The research...

chapter

HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters

Rong Shi, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, more

2014 43rd International Conference on Parallel Processing > 221 - 230

2014 43rd International Conference on Parallel Processing (ICPP)

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance of non-contiguous data movement, handle only certain...

chapter

Sparse sampling methods for efficient spatial coherence estimation

Dongwoon Hyun, Gregg E. Trahey, Jeremy J. Dahl

2014 IEEE International Ultrasonics Symposium > 535 - 538

2014 IEEE International Ultrasonics Symposium (IUS)

Short-lag spatial coherence (SLSC) imaging, a coherence-based alternative beamforming technique, creates images related to the spatial covariance in backscatter. Because spatial covariance estimation is a computationally intensive process, efficient techniques are crucial to implementing SLSC imaging in real-time. Sparse sampling methods that take advantage of the statistical properties of spatial...

chapter

Semi-automatic Tool to Ease the Creation and Optimization of GPU Programs

Jacob Jepsen

2014 43rd International Conference on Parallel Processing Workshops > 196 - 205

2014 43rd International Conference on Parallel Processing Workshops (ICPPW)

We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form and apply optimizations thereto. Based on experiments, the code generated by the tool can be 3-256X faster than...

chapter

A Mapping Method for Application Customized Reconfigurable Pipeline

Guanwu Wang, Sikun Li

2014 9th IEEE International Conference on Networking, Architecture, and Storage > 123 - 127

2014 9th IEEE International Conference on Networking, Architecture, and Storage (NAS)

Application mapping algorithms for reconfigurable architectures are one of the major research directions in reconfigurable computing. In this paper, we analyze the data memory bank conflict problem of the ACRPs (Application Customized Reconfigurable Pipelines) when exploiting the data parallelism, and a conflict-free iteration-data mapping algorithm based on the operation mapping results is proposed to...

chapter

On Implementing Sparse Matrix Multi-vector Multiplication on GPUs

Walid Abu Sufah, Khalid Ahmad

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 1117 - 1124

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

Sparse matrix-vector and multi-vector multiplications (SpMV and SpMM) are performance-bottleneck operations in numerous HPC applications. A variety of SpMV GPU kernels using different matrix storage formats have been developed to accelerate these applications. Unlike SpMV, where matrix elements are accessed only once, multiplying by k vectors requires accessing matrix elements k times. In this paper...

chapter

A Compiler Translate Directive-Based Language to Optimized CUDA

Feng Li, Hong An, Weihao Liang, Xiaoqiang Li, more

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 982 - 989

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

Graphics processing units (GPUs) provide a low cost platform for accelerating high performance computations. New programming languages, such as CUDA and OpenCL, make GPU programming attractive to programmers. However, programming GPUs is still a cumbersome task for two reasons: tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming...

chapter

Exploiting the Inter-cluster Record Reuse for Stream Processors

Ying Zhang, Gen Li, Caixia Sun, Hongwei Zhou, more

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 916 - 921

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

Memory accesses limit the performance of stream processors. The stream compiler exploits the reuse of records distributed on different ALU clusters by introducing inter-cluster communications, which decreases the program performance. The paper presents the Stream Transpose (ST) approach to exploit such reuse. The approach, by reorganizing the data, puts data that have been distributed on neighboring...

chapter

JolokiaC++: An Annotation Based Compiler Framework for GPGPUs

Vibha Patel, Sanjeev Aggarwal, Amey Karkare

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 1134 - 1141

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

We present JolokiaC++, an annotation based compiler framework which generates high quality CUDA (Compute Unified Device Architecture) code for GPUs. Our contributions include: (1) developing explicit and implicit annotations with illustrations of their use in C++, (2) showing the utility of these annotations by providing comparison code snippets, which demonstrates the ease of programming and performance...

chapter

Accelerating outlier detection with intra- and inter-node parallelism

Fabrizio Angiulli, Stefano Basta, Stefano Lodi, Claudio Sartori

2014 International Conference on High Performance Computing & Simulation (HPCS) > 476 - 483

2014 International Conference on High Performance Computing & Simulation (HPCS)

Outlier detection is a data mining task consisting in the discovery of observations which deviate substantially from the rest of the data, and has many important practical applications. Outlier detection in very large data sets is, however, computationally very demanding, and the size limit of the data that can be processed is considerably pushed forward by mixing three ingredients: efficient algorithms,...

chapter

Evaluation of vectorization potential of Graph500 on Intel's Xeon Phi

Milan Stanic, Oscar Palomar, Ivan Ratkovic, Milovan Duric, more

2014 International Conference on High Performance Computing & Simulation (HPCS) > 47 - 54

2014 International Conference on High Performance Computing & Simulation (HPCS)

Graph500 is a data intensive application for high performance computing and it is an increasingly important workload because graphs are a core part of most analytic applications. So far there is no work that examines if Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released...

chapter

Burrows-Wheeler Transform based indexed exact search on a multi-GPU OpenCL platform

David Nogueira, Pedro Tomas, Nuno Roma

2014 International Conference on High Performance Computing & Simulation (HPCS) > 31 - 38

2014 International Conference on High Performance Computing & Simulation (HPCS)

A multi-GPU parallelization of exact string matching algorithms based on the backward-search procedure by using indexing techniques, such as the Burrows-Wheeler Transform and the FM-Index, is proposed in this paper. To attain an efficient execution on highly heterogeneous parallel platforms, the proposed parallelization adopted a unified OpenCL implementation that allows its execution either in CPUs...

chapter

A bias-scalable current-mode analog support vector machine based on margin propagation

Ming Gu, Shantanu Chakrabartty

2014 IEEE International Symposium on Circuits and Systems (ISCAS) > 273 - 276

2014 IEEE International Symposium on Circuits and Systems (ISCAS)

Bias-scalability in analog CMOS circuits refers to a current-mode design paradigm where the operation of the circuit remains invariant to the operating conditions (weak-inversion, moderate-inversion or strong-inversion) of the transistors. In this paper we present the design and implementation of a bias-scalable analog support vector machine (SVM) based on our previously reported margin propagation...

chapter

Fine-grain task aggregation and coordination on GPUs

Marc S. Orr, Bradford M. Beckmann, Steven K. Reinhardt, David A. Wood

2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA) > 181 - 192

2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)

In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction,...

chapter

A stochastic geometric approach to sensor array processing

Ba Ngu Vo, Ba Tuong Vo

2014 IEEE Workshop on Statistical Signal Processing (SSP) > 236 - 239

2014 IEEE Statistical Signal Processing Workshop (SSP)

A new unified mathematical framework for sensor array processing is proposed. The proposed framework combines Bayesian estimation with stochastic geometry to accommodate prior information, uncertainty in array parameters, and unknown and stochastically time-varying number of nonstationary sources. A system model for a common signal setting is constructed to demonstrate the proposed framework.

chapter

The design and optimization of Connect6 computer game system

Chang Liu, Bingke Wu, Sichen Wu

The 26th Chinese Control and Decision Conference (2014 CCDC) > 3936 - 3940

2014 26th Chinese Control And Decision Conference (CCDC)

Computer game, a new field of artificial intelligence, as the name suggests, is to make the computer learn to think and play chess games like human beings. As one of the important research fields of artificial intelligence, computer game, which is considered the touchstone of artificial intelligence, has brought many important methods and theories to the field. Connect6 is a newly introduced...

chapter

Simulation and verification of the virtual memory management system with MSVL

Meng Wang, Zhenhua Duan, Cong Tian

Proceedings of the 2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD) > 360 - 365

2014 IEEE 18th International Conference on Computer Supported Cooperative Work in Design (CSCWD)

The paging mechanism is widely used in most modern systems to handle the virtual memory. Many page replacement algorithms have been proposed. Therefore, the correctness and reliability of virtual memory management systems become very important. It is essential to formalize and verify the system in a formal way. In this paper, we model the virtual memory management system with MSVL, which is a parallel...

INFONA - science communication portal
