Search results

Items from 61 to 80 out of 843 results

chapter

Access pattern-aware cache management for improving data utilization in GPU

Gunjae Koo, Yunho Oh, Won Woo Ro, Murali Annavaram

2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) > 307 - 319

Long latency of memory operation is a prominent performance bottleneck in graphics processing units (GPUs). The small data cache that must be shared across dozens of warps (a collection of threads) creates significant cache contention and premature data eviction. Prior works have recognized this problem and proposed warp throttling which reduces the number of active warps contending for cache space...

chapter

GPU implementation of all pairs shortest path algorithm for graphs using triangular matrix method

S. Umamaheswari, G. Abisheik

2016 Eighth International Conference on Advanced Computing (ICoAC) > 218 - 223

In various applications where the problem domain can be modeled as a graph, shortest path computation is an indispensable challenge. In applications like online social networks and shortest route computation, the graphs are very large: the number of nodes approaches hundreds of billions. Shortest path graph algorithms like SSSP (Single Source Shortest Path)...

chapter

DFGenTool: A Dataflow Graph Generation Tool for Coarse Grain Reconfigurable Architectures

Manideepa Mukherjee, Alexander Fell, Apala Guha

2017 30th International Conference on VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID) > 67 - 72

In this paper, DFGenTool, a dataflow graph (DFG) generation tool, is presented, which converts loops in a sequential program given in a high-level language such as C, into a DFG. DFGenTool adapts DFGs for mapping to Coarse Grain Reconfigurable Architectures (CGRA) to enable a variety of CGRA implementations and compilers to be benchmarked against a standard set of DFGs. Several kernels have been converted...

chapter

A Memory Accessing Method for the Parallel Aho-Corasick Algorithm on GPU

JinMyung Yoon, Kang-Il Choi, HyunJin Kim

2016 International Conference on Information Science and Security (ICISS) > 1 - 3

In this paper, we propose a memory accessing method of Parallel Failureless Aho-Corasick (PFAC) algorithm considering Graphic Processing Unit (GPU) memory architecture for throughput improvement. Compared with Aho-Corasick (AC) Algorithm using Central Processing Unit (CPU) and Data-Parallel Aho-Corasick (DPAC) using Open Multi-Processing (OpenMP), PFAC using GPU achieves high performance advancement...

chapter

Emulating an Octeon MIPS64 based embedded system on X86 in QEMU

Muhammad Amir Mehmood, Qurrat Ul Ain, Ayaz Akram, Abdul Qadeer, et al.

2016 19th International Multi-Topic Conference (INMIC) > 1 - 7

Embedded systems are proliferating with their growing hardware capabilities. Their application areas include internet of things, cellular devices, network devices, etc. Application development and testing natively on such embedded hardware is expensive, time consuming, and challenging. In this case, system emulation is a cost-effective alternative. We have extended Quick Emulator (QEMU) to support...

chapter

Characterizing OS Behaviors of Datacenter and Big Data Workloads

Chen Zheng, Jianfeng Zhan, Zhen Jia, Lixin Zhang

2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS) > 1079 - 1086

As datacenters and big data workloads become dominant ones, the pressure of new system design achieving cost-effectiveness rises, for both architecture and operating system communities. Consistent efforts on benchmarks have been taken to characterize the micro-architectural characteristics of those workloads. Statistics show that datacenter and big data workloads suffer from more front-end pipeline...

chapter

A reuse distance based performance analysis on GPU L1 data cache

Dongwei Wang, Weijun Xiao

2016 IEEE 35th International Performance Computing and Communications Conference (IPCCC) > 1 - 8

Generally, a cache bridges the CPU and main memory in order to narrow the performance gap. As a throughput-oriented device, the Graphics Processing Unit (GPU) already integrates a cache, similar to CPU cores, in order to exploit the locality of memory accesses. However, applications in GPGPU computing exhibit distinct memory access patterns compared to their multi-core counterparts...

chapter

Histogram optimization with CUDA

Keh Kok Yong, Sheera Shaheera Othman Talib

2016 IEEE Industrial Electronics and Applications Conference (IEACon) > 312 - 318

A histogram is a popular graphical representation of the data distribution obtained by processing given numerical input data. Although sequential histogram computation may be simple, it is no longer suitable for processing high volumes of data. With the recent advancement of high performance computing (HPC), aided by the accelerating growth of the General Purpose Graphic Processing Unit (GPGPU),...

chapter

GPU implementation of multi-scale Retinex image enhancement algorithm

Hui Li, Weihao Xie, Xingang Wang, Shousheng Liu, et al.

2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA) > 1 - 5

Multi-scale Retinex algorithm is an image enhancement algorithm that aims at image reconstruction. The algorithm maintains the high fidelity and the dynamic range compression of the image, so the enhancement effect is obvious. The algorithm exploits a large number of convolution operations to achieve dynamic range compression and color/brightness rendition, and the calculation time increased significantly...

chapter

Enabling Efficient Preemption for SIMT Architectures with Lightweight Context Switching

Zhen Lin, Lars Nyland, Huiyang Zhou

SC16: International Conference for High Performance Computing, Networking, Storage and Analysis > 898 - 908

Context switching is a key technique enabling preemption and time-multiplexing for CPUs. However, for single-instruction multiple-thread (SIMT) processors such as high-end graphics processing units (GPUs), it is challenging to support context switching due to the massive number of threads, which leads to a huge amount of architectural states to be swapped during context switching. The architectural...

chapter

dCUDA: Hardware Supported Overlap of Computation and Communication

Tobias Gysi, Jeremia Bar, Torsten Hoefler

SC16: International Conference for High Performance Computing, Networking, Storage and Analysis > 609 - 620

Over the last decade, CUDA and the underlying GPU hardware architecture have continuously gained popularity in various high-performance computing application domains such as climate modeling, computational chemistry, or machine learning. Despite this popularity, we lack a single coherent programming model for GPU clusters. We therefore introduce the dCUDA programming model, which implements device-side...

chapter

A parallel multi-GPU Clonal Selection Algorithm for optimization using OpenCL and OpenMP

Igor L.S. Russo, Heder S. Bernardino, Helio J.C. Barbosa

2016 IEEE Latin American Conference on Computational Intelligence (LA-CCI) > 1 - 6

Artificial Immune Systems are nature inspired techniques which have been applied with success to several areas. The Clonal Selection Algorithm (CLONALG) is one of the most used immune inspired techniques for optimization. Similarly to other metaheuristics, CLONALG requires a large number of objective function evaluations making it impracticable when the objective function is computationally expensive...

chapter

Batched Generation of Incomplete Sparse Approximate Inverses on GPUs

Hartwig Anzt, Edmond Chow, Thomas Huckle, Jack Dongarra

2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) > 49 - 56

Incomplete Sparse Approximate Inverses (ISAI) have recently been shown to be an attractive alternative to exact sparse triangular solves in the context of incomplete factorization preconditioning. In this paper we propose a batched GPU-kernel for the efficient generation of ISAI matrices. Utilizing only thread-local memory allows for computing the ISAI matrix with very small memory footprint. We demonstrate...

chapter

Scalable and Modular Online Data Processing for Ultrafast Computed Tomography Using CUDA Pipelines

Tobias Frust, Guido Juckeland, Andre Bieberle

2016 Second Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV) > 7 - 11

For investigations of rapidly moving structures in opaque technical devices, ultrafast electron beam X-ray computed tomography (CT) scanners are available at the Helmholtz-Zentrum Dresden-Rossendorf (HZDR). Currently, measurement data must first be downloaded from the scanner to a data processing machine after each CT scan. Afterwards, cross-sectional images are reconstructed. This limits the application...

chapter

Characterizing Power and Performance of GPU Memory Access

Tyler Allen, Rong Ge

2016 4th International Workshop on Energy Efficient Supercomputing (E2SC) > 46 - 53

Power is a major limiting factor for the future of HPC and the realization of exascale computing under a power budget. GPUs have now become a mainstream parallel computation device in HPC, and optimizing power usage on GPUs is critical to achieving future goals. GPU memory is seldom studied, especially for power usage. Nevertheless, memory accesses draw significant power and are critical to understanding...

chapter

Towards Automating Multi-dimensional Data Decomposition for Executing a Single-GPU Code on a Multi-GPU System

Ryotaro Sakai, Fumihiko Ino, Kenichi Hagihara

2016 Fourth International Symposium on Computing and Networking (CANDAR) > 408 - 414

In this paper, we present a data decomposition method for multi-dimensional data, aiming at realizing multi graphics processing unit (GPU) acceleration of a compute unified device architecture (CUDA) code written for a single GPU. Our multi-dimensional method extends a previous method that deals with one-dimensional (1-D) data. The method performs a sample run of selected GPU threads to decompose...

chapter

A Fast Level-Set Segmentation Algorithm for Image Processing Designed For Parallel Architectures

Julian Gutierrez, Fanny Nina-Paravecino, David Kaeli

2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3) > 66 - 69

Among the many choices to perform image segmentation, Level-Set Methods have demonstrated great potential for unstructured images. However, the usefulness of Level-Set Methods has been limited by their irregular workload characteristics, such as a high degree of branch divergence and input dependencies, as well as the high computational costs required to solve partial differential equations (PDEs). In...

chapter

Exploring Compiler Optimization Opportunities for the OpenMP 4.x Accelerator Model on a POWER8+GPU Platform

Akihiro Hayashi, Jun Shirako, Ettore Tiotto, Robert Ho, et al.

2016 Third Workshop on Accelerator Programming Using Directives (WACCPD) > 68 - 78

While GPUs are increasingly popular for high-performance computing, optimizing the performance of GPU programs is a time-consuming and non-trivial process in general. This complexity stems from the low abstraction level of standard GPU programming models such as CUDA and OpenCL: programmers are required to orchestrate low-level operations in order to exploit the full capability of GPUs. In terms of...

chapter

Evaluating and Optimizing the NERSC Workload on Knights Landing

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, et al.

2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) > 43 - 53

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture...

chapter

How to Speed Up CUDA-WSat-PcL by 5x

Heng Liu, Arrvindh Shriraman, Evgenia Ternovska

2016 Fourth International Symposium on Computing and Networking (CANDAR) > 462 - 468

The Propositional Satisfiability Problem (SAT) is one of the most fundamental NP-complete problems, and is central to many domains of computer science. Utilizing a massively parallel architecture on a Graphics Processing Unit (GPU) together with a conventional CPU on NVIDIA's Compute Unified Device Architecture (CUDA) platform, this work proposes an efficient scheme to implement one parallel Stochastic...

Keywords: KERNEL, INSTRUCTION SETS
