Search results

Items from 101 to 120 out of 473 results

chapter

A multi-GPU approach for the exchange Monte Carlo method

Cristobal A. Navarro, Huang Wei, Youjin Deng

2015 34th International Conference of the Chilean Computer Science Society (SCCC) > 1 - 6

We present an efficient multi-GPU approach for the Exchange Monte Carlo method designed for the simulation of disordered spin systems. Parallel computation is organized using a two-level scheme, allowing the algorithm to scale its performance in the presence of faster GPUs as well as multiple GPUs. Performance results show that spin-level performance is between one and two orders of magnitude faster...

chapter

TRACO: An automatic loop nest parallelizer for numerical applications

Marek Palkowski, Tomasz Klimek, Wlodzimierz Bielecki

2015 Federated Conference on Computer Science and Information Systems (FedCSIS) > 681 - 686

We present the source-to-source TRACO compiler allowing for increasing program locality and parallelizing arbitrarily nested loop sequences in numerical applications. Algorithms for generation of tiled code and extracting synchronization-free slices composed of tiles are presented. Parallelism of arbitrary nested loops is obtained by creating a kernel of computations represented in the OpenMP standard...

chapter

Restructuring and implementations of 2D matrix transpose algorithm using SSE4 vector instructions

Ahmed S. Zekri

2015 International Conference on Applied Research in Computer Science and Engineering (ICAR) > 1 - 7

Current general-purpose processors are augmented with vector instructions that can process many elements of matrices and vectors in parallel. Transposing a matrix in-place is a main kernel operation required by many scientific and engineering applications to shuttle data before, during, or after processing. This operation increases the traffic on the memory bus and hence clever techniques such as...

chapter

Design and Verification of Heterogeneous Streaming Parallel Mechanisms on Kepler CUDA

Kailong Zhang, Shaoli Zhou, Liang Hu, Hang Su, et al.

2015 IEEE International Conference on Computer and Information Technology; Ubiquitous Computing and Communications; Dependable, Autonomic and Secure Computing; Pervasive Intelligence and Computing > 2256 - 2262

In the field of many-core parallel computing, how to optimally allocate and schedule computing cores according to the characteristics of parallel applications is a typical and fundamental problem that closely affects computing performance. After analyzing the features and mechanisms of the Kepler CUDA architecture, three heterogeneous streaming parallel computing modes and corresponding constraints,...

chapter

CPU+GPU Programming of Stencil Computations for Resource-Efficient Use of GPU Clusters

Mohammed Sourouri, Johannes Langguth, Filippo Spiga, Scott B. Baden, et al.

2015 IEEE 18th International Conference on Computational Science and Engineering > 17 - 26

2015 IEEE 18th International Conference on Computational Science and Engineering (CSE)

On modern GPU clusters, the role of the CPUs is often restricted to controlling the GPUs and handling MPI communication. The unused computing power of the CPUs, however, can be considerable for computations whose performance is bounded by memory traffic. This paper investigates the challenges of simultaneous usage of CPUs and GPUs for computation. Our emphasis is on deriving a heterogeneous CPU+GPU...

chapter

Accelerating Support Vector Machine Learning with GPU-Based MapReduce

Tianyao Sun, Hanli Wang, Yun Shen, Jun Wu

2015 IEEE International Conference on Systems, Man, and Cybernetics > 876 - 881

2015 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

With the exploding growth of data, the computational complexity required by learning Support Vector Machine (SVM) lays a heavy burden on real-world applications. To address this issue, parallel computational techniques can be employed such as the Graphics Processing Units (GPUs) and MapReduce model. As it is well known, GPUs are microprocessors on a multi-core architecture which reveal high performance...

chapter

NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures

Jie Zhang, David Donofrio, John Shalf, Mahmut T. Kandemir, et al.

2015 International Conference on Parallel Architecture and Compilation (PACT) > 13 - 24

Thanks to massive parallelism in modern Graphics Processing Units (GPUs), emerging data processing applications in GPU computing exhibit ten-fold speedups compared to CPU-only systems. However, this GPU-based acceleration is limited in many cases by the significant data movement overheads and inefficient memory management for host-side storage accesses. To address these shortcomings, this paper proposes...

chapter

An Efficient Vectorization Approach to Nested Thread-level Parallelism for CUDA GPUs

Shixiong Xu, David Gregg

2015 International Conference on Parallel Architecture and Compilation (PACT) > 488 - 489

Nested thread-level parallelism (TLP) is pervasive in real applications. For example, 75% (14 out of 19) of the applications in the Rodinia benchmark for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping the enclosed nested parallelism to the GPU threads in the C-to-CUDA compilation (OpenACC in this paper) is becoming more and more important. This...

chapter

Polyhedral Optimizations of Explicitly Parallel Programs

Prasanth Chatarasi, Jun Shirako, Vivek Sarkar

2015 International Conference on Parallel Architecture and Compilation (PACT) > 213 - 226

The polyhedral model is a powerful algebraic framework that has enabled significant advances to analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting...

chapter

Fast automatic fundus images registration and mosaic based on Compute Unified Device Architecture

Yuliang Wang, Zhaoying Chen, Xianxi Liu, Jianxin Shen

2015 8th International Congress on Image and Signal Processing (CISP) > 275 - 280

In order to overcome the characteristics of low contrast, non-uniform illumination, limited field of view (FOV), and the geometric distortion between different FOV of the fundus images, a fast automatic fundus image registration and mosaic algorithm based on Compute Unified Device Architecture (CUDA) is presented. Firstly fundus images are enhanced by homomorphism filtering, then the Scale Invariant...

chapter

Expressing Parallelism on Many-Core for Deterministic Discrete Ordinates Transport

Tom Deakin, Simon McIntosh-Smith, Wayne Gaudin

2015 IEEE International Conference on Cluster Computing > 729 - 737

2015 IEEE International Conference on Cluster Computing (CLUSTER)

In this paper we demonstrate techniques for increasing the node-level parallelism of a deterministic discrete ordinates neutral particle transport algorithm on a structured mesh to exploit many-core technologies. Transport calculations form a large part of the computational workload of physical simulations and so good performance is vital for the simulations to complete in reasonable time. We will...

chapter

A task-based linear algebra Building Blocks approach for scalable graph analytics

Michael M. Wolf, Jonathan W. Berry, Dylan T. Stark

2015 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

It is challenging to obtain scalable HPC performance on real applications, especially for data science applications with irregular memory access and computation patterns. To drive co-design efforts in architecture, system, and application design, we are developing miniapps representative of data science workloads. These in turn stress the state of the art in Graph BLAS-like Graph Algorithm Building...

chapter

Hierarchical clustering and k-means analysis of HPC application kernels performance characteristics

M.L. Grodowitz, Sarat Sreepathi

2015 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

In this work, we present the characterization of a set of scientific kernels which are representative of the behavior of fundamental and applied physics applications across a wide range of fields. We collect performance attributes in the form of micro-operation mix and off-chip memory bandwidth measurements for these kernels. Using these measurements, we use two clustering methodologies to show which...

chapter

Towards an Automatic Prediction of Image Processing Algorithms Performances on Embedded Heterogeneous Architectures

Romain Saussard, Boubker Bouzid, Marius Vasiliu, Roger Reynaud

2015 44th International Conference on Parallel Processing Workshops > 27 - 36

2015 44th International Conference on Parallel Processing Workshops (ICPPW)

Image processing algorithms are widely used in the automotive field for ADAS (Advanced Driver Assistance System) purposes. To embed these algorithms, semiconductor companies offer heterogeneous architectures which are composed of different processing units, often with massively parallel computing unit. However, embedding complex algorithms on these SoCs (System on Chip) remains a difficult task due...

chapter

Nested Parallelism on GPU: Exploring Parallelization Templates for Irregular Loops and Recursive Computations

Da Li, Hancheng Wu, Michela Becchi

2015 44th International Conference on Parallel Processing > 979 - 988

2015 44th International Conference on Parallel Processing (ICPP)

The effective deployment of applications exhibiting irregular nested parallelism on GPUs is still an open problem. A naïve mapping of irregular code onto the GPU hardware often leads to resource underutilization and, thereby, limited performance. In this work, we focus on two computational patterns exhibiting nested parallelism: irregular nested loops and parallel recursive computations. In particular,...

chapter

Automatic Performance Tuning of Stencil Computations on GPUs

Joseph D. Garvey, Tarek S. Abdelrahman

2015 44th International Conference on Parallel Processing > 300 - 309

2015 44th International Conference on Parallel Processing (ICPP)

We consider automatic performance tuning of stencil computations on Graphics Processing Units. We present a strategy that uses machine learning to determine the best way to use memory followed by a heuristic that divides the remaining optimizations into groups and exhaustively explores one group at a time. We evaluate our strategy using 102 synthetically generated OpenCL stencil kernels on an Nvidia...

chapter

Pattern-Driven Hybrid Multi- and Many-Core Acceleration in the MPAS Shallow-Water Model

Peng Zhang, Yulong Ao, Chao Yang, Yiqun Liu, et al.

2015 44th International Conference on Parallel Processing > 71 - 80

2015 44th International Conference on Parallel Processing (ICPP)

There is an urgent demand for studying efficient methodologies to enable hybrid multi- and many-core acceleration in global climate simulations. The Model for Prediction Across Scales (MPAS) is a family of earth-system component models that is receiving increasing attention. Like many other models, MPAS, though it features some emerging numerical algorithms, employs a pure MPI approach for parallel...

chapter

Optimizing Image Sharpening Algorithm on GPU

Mengran Fan, Haipeng Jia, Yunquan Zhang, Xiaojing An, et al.

2015 44th International Conference on Parallel Processing > 230 - 239

2015 44th International Conference on Parallel Processing (ICPP)

Sharpness is an algorithm used to sharpen images. With the increase in image size, resolution, and real-time processing requirements, the performance of sharpness needs to be improved greatly. The independent pixel calculation of sharpness offers a good opportunity to use the GPU to accelerate it substantially. However, to port it to the GPU, one challenge is that sharpness involves several...

chapter

BM3D image denoising using heterogeneous computing platforms

Sampsa Sarjanoja, Jani Boutellier, Jari Hannuksela

2015 Conference on Design and Architectures for Signal and Image Processing (DASIP) > 1 - 8

Noise reduction is often performed at an early stage of the image processing path. In order to keep the processing delays small in different computing platforms, it is important that the noise reduction is performed swiftly. In this paper, the block-matching and three-dimensional filtering (BM3D) denoising algorithm is implemented on heterogeneous computing platforms using OpenCL and CUDA frameworks...

chapter

Compiling HPC Kernels for the REDEFINE CGRA

Kavitha T. Madhu, Saptarsi Das, Nalesh S., S. K. Nandy, et al.

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems > 405 - 410

2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS) and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

In this paper, we present a compilation flow for HPC kernels on the REDEFINE coarse-grain reconfigurable architecture (CGRA). REDEFINE is a scalable macro-dataflow machine in which the compute elements (CEs) communicate through messages. REDEFINE offers the ability to exploit high degree of coarse-grain and pipeline parallelism. The CEs in REDEFINE are enhanced with reconfigurable macro data-paths...

Keywords: KERNEL, PARALLEL PROCESSING

INFONA - science communication portal
