The capability of GPUs to accelerate general-purpose applications that can be parallelized into a massive number of threads makes them promising for real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed...
In this article, we address the problem of reproducibility of the blocked LU factorization on GPUs due to cancellations and rounding errors when dealing with floating-point arithmetic. Thanks to the hierarchical structure of linear algebra libraries, the computations carried within this operation can be expressed in terms of the Level-3 BLAS routines as well as the unblocked variant of the factorization,...
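The blocked structure the abstract describes can be illustrated with a minimal pure-Python sketch of blocked LU factorization (no pivoting): an unblocked factorization of the diagonal block, two TRSM-like triangular solves, and a GEMM-like trailing update. The block size `nb` and all names here are illustrative, not taken from the paper.

```python
def lu_unblocked(A, k0, k1):
    """In-place Doolittle LU on the square diagonal block A[k0:k1, k0:k1]."""
    for k in range(k0, k1):
        for i in range(k + 1, k1):
            A[i][k] /= A[k][k]              # column of L (unit-diagonal L)
            for j in range(k + 1, k1):
                A[i][j] -= A[i][k] * A[k][j]

def lu_blocked(A, nb=2):
    """In-place blocked LU, expressed as unblocked + TRSM-like + GEMM-like steps."""
    n = len(A)
    for k in range(0, n, nb):
        e = min(k + nb, n)
        lu_unblocked(A, k, e)               # factor diagonal block (unblocked variant)
        # TRSM-like: solve L11 * U12 = A12 for the block row of U
        for j in range(e, n):
            for i in range(k + 1, e):
                for p in range(k, i):
                    A[i][j] -= A[i][p] * A[p][j]
        # TRSM-like: solve L21 * U11 = A21 for the block column of L
        for i in range(e, n):
            for j in range(k, e):
                for p in range(k, j):
                    A[i][j] -= A[i][p] * A[p][j]
                A[i][j] /= A[j][j]
        # GEMM-like trailing-submatrix update: A22 -= L21 * U12
        for i in range(e, n):
            for j in range(e, n):
                for p in range(k, e):
                    A[i][j] -= A[i][p] * A[p][j]
```

The floating-point sums in the GEMM-like update are exactly where evaluation order (and hence parallel reduction order on a GPU) changes the rounding, which is the reproducibility issue the article targets.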
We compare the performance of pipelined and s-step GMRES, respectively referred to as l-GMRES and s-GMRES, on distributed multicore CPUs. Compared to standard GMRES, s-GMRES requires fewer all-reduces, while l-GMRES overlaps the all-reduces with computation. To combine the best features of the two algorithms, we propose another variant, (l, t)-GMRES, that not only does fewer global all-reduces than standard...
We consider the problem of writing performance-portable sparse matrix-sparse matrix multiplication (SPGEMM) kernels for many-core architectures. We approach the SPGEMM kernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design a hierarchical, memory-efficient SPGEMM algorithm. We then design and implement thread-scalable data structures that enable us to...
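The baseline that hierarchical GPU SPGEMM algorithms typically build on is Gustavson's row-by-row scheme, which can be sketched in plain Python over CSR matrices; the per-row accumulator here is a hash map, the data structure whose thread-scalable replacement is the hard part on many-core hardware. The CSR triple `(ptr, idx, val)` and all names are illustrative, not from the paper.

```python
def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    """C = A * B in CSR, Gustavson-style: one accumulator per row of A."""
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}                                    # column -> partial sum
        for k in range(a_ptr[i], a_ptr[i + 1]):     # nonzeros in row i of A
            j, v = a_idx[k], a_val[k]
            for t in range(b_ptr[j], b_ptr[j + 1]): # scale and add row j of B
                acc[b_idx[t]] = acc.get(b_idx[t], 0.0) + v * b_val[t]
        for col in sorted(acc):                     # emit row i of C
            c_idx.append(col)
            c_val.append(acc[col])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val
```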
Background subtraction is a crucial step in many image processing applications and is widely applied in video surveillance. The key quality measures of this method are accuracy and processing time, so we mainly focus on these two challenges. We parallelized the Two-Layered CodeBook Model on a Graphics Processing Unit (GPU) to increase the processing speed and the accuracy of the...
K-means is a compute-intensive iterative algorithm; each iteration consists of two steps: data assignment and recalculation of the K centroids. In order to accelerate the compute-intensive portions of k-means, the data assignment and centroid recalculation steps are offloaded to the GPU in parallel. Only the initialization and convergence test steps are performed by the CPU. In addition, this new version...
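The two steps offloaded to the GPU can be sketched as a single CPU-side iteration in Python: assignment is embarrassingly parallel over points, and centroid recalculation is a per-cluster reduction. The 1-D data and function names are illustrative only; a GPU version would launch both steps as parallel kernels.

```python
def kmeans_step(points, centroids):
    """One k-means iteration: assign points, then recompute centroids."""
    k = len(centroids)
    # Step 1: data assignment (one independent task per point)
    labels = [min(range(k), key=lambda c: (p - centroids[c]) ** 2)
              for p in points]
    # Step 2: centroid recalculation (a reduction per cluster)
    new_centroids = []
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        new_centroids.append(sum(members) / len(members) if members
                             else centroids[c])
    return labels, new_centroids
```

The convergence test the CPU keeps would simply compare `new_centroids` against `centroids` between calls.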
Pre-silicon simulation is one of the key toolsets for computer architects to evaluate and optimize their future designs. As Graphics Processing Units (GPUs) have become the platform of choice in many computing communities due to their impressive processing capabilities, computer architecture researchers need a simulation framework that allows them to quantitatively consider design tradeoffs. In this...
GPUs are capable of running a variety of applications; however, their generic parallel architecture can lead to inefficient use of resources and reduced power efficiency, due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and re-distribution as a solution that can be leveraged by...
Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate research in this direction. While a few integrated CPU-GPU simulators are available, similar tools that support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features, are currently missing. In this paper, we...
Virtualization-based memory isolation has been widely used as a security primitive in many security systems. This paper first provides an in-depth analysis of its effectiveness in the multicore setting, a first in the literature. Our study reveals that memory isolation by itself is inadequate for security. Due to fundamental design choices in hardware, it faces several challenging issues including...
Finding the k nearest neighbors of a query point or a set of query points (KNN) is a fundamental problem in many application domains, and it is computationally expensive. Prior efforts in improving its speed have followed two directions with conflicting considerations: one tries to minimize the redundant distance computations but often introduces irregularities into computations, the other tries to exploit the...
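The baseline both optimization directions start from is brute-force KNN, sketched here in plain Python: every query computes its distance to every reference point and keeps the k smallest. This is the fully regular but redundancy-heavy formulation; `heapq.nsmallest` and all names are illustrative.

```python
import heapq

def knn(queries, points, k):
    """Return, for each query, the indices of its k nearest reference points."""
    result = []
    for q in queries:
        # Squared Euclidean distance to every reference point (no pruning)
        dists = [(sum((a - b) ** 2 for a, b in zip(q, p)), i)
                 for i, p in enumerate(points)]
        # Keep the k closest; ties break on index
        result.append([i for _, i in heapq.nsmallest(k, dists)])
    return result
```

Skipping distance computations (the first direction) makes the inner loop data-dependent and irregular, which is precisely the tension with GPU-friendly regular execution.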
General-Purpose Graphic Processing Units (GPUs) have become an integral part of heterogeneous system architectures. Ever increasing complexities have made rapid, early performance evaluation of GPU-based architectures and applications a primary design concern. Traditional cycle-accurate GPU simulators are too slow, while existing analytical or source-level estimation approaches are often inaccurate...
This paper presents a strategy to perform assembly of systems of equations arising in Finite Element Analysis (FEA) on Graphics Processing Units (GPUs), based on the principle of dividing the workload into separate kernels. Three different sparse formats are analyzed for efficient storage, along with two different implementations for handling the race condition arising in the traditional assembly (addto method)...
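The race condition in traditional addto assembly is easy to see in a sequential Python sketch: each element scatters its local matrix into the global matrix, and two elements sharing a node accumulate into the same global entry. Run one thread per element on a GPU and that `+=` becomes the racy update that needs, e.g., atomics or element coloring. The 1-D two-node elements and all names are illustrative, not the paper's formulation.

```python
def assemble(elements, local_matrices):
    """addto assembly: elements are (node_i, node_j) pairs, each with a 2x2 local matrix."""
    K = {}                                       # global matrix as a {(row, col): value} dict
    for (ni, nj), ke in zip(elements, local_matrices):
        for a, gi in enumerate((ni, nj)):
            for b, gj in enumerate((ni, nj)):
                # On a GPU, concurrent elements sharing node gi/gj race on
                # this read-modify-write; it must become an atomic add.
                K[(gi, gj)] = K.get((gi, gj), 0.0) + ke[a][b]
    return K
```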
As throughput-oriented accelerators, GPUs provide tremendous processing power by running a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, leaving the GPU's resources often under-utilized. Compared to compute resources, memory resources can tolerate considerably lower levels...
Graphics Processing Units (GPUs) are designed to exploit large amounts of parallelism. However, warp-level divergence, occurring due to different amounts of work, differing memory access latencies, etc., results in the warps of a thread block (TB) finishing kernel execution at different points in time. This, in effect, reduces utilization of SM resources and hence the performance of the GPU. We propose...
The integrated architecture that features both CPU and GPU on the same die is an emerging and promising architecture for fine-grained CPU-GPU collaboration. However, the integration also brings forward several programming and system optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization of...
Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware-limits. To address...
In response to the tremendous growth of the Internet, towards what we call the Internet of Things (IoT), there is a need to move from costly, high-time-to-market specific-purpose hardware to flexible, low-time-to-market general-purpose devices for packet processing. Among several such devices, GPUs have attracted attention in the past, mainly because the high computing demand of packet processing...
Recent research studies have shown that modern GPU performance is often limited by the memory system performance. Optimizing memory hierarchy performance requires GPU designers to draw design insights based on the cache & memory behavior of end-user applications. Unfortunately, it is often difficult to get access to end-user workloads due to the confidential or proprietary nature of the software/data...
Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, image processing, etc. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating...
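The operation in question can be sketched directly in plain Python: the naive formulation below computes one output pixel as a weighted sum over a kernel window, which is exactly the unit of work a GPU kernel maps to one thread. As in deep-learning frameworks, no kernel flip is applied (strictly, cross-correlation); "valid" padding and all names are illustrative.

```python
def conv2d(image, kernel):
    """Direct 2-D convolution ("valid" padding, no kernel flip)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1       # "valid" output size
    # One independent weighted sum per output pixel -> one GPU thread each
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]
```

Fast GPU implementations restructure this same computation (im2col + GEMM, FFT, or Winograd) to exploit the throughput and bandwidth the abstract mentions.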