The power consumed by the memory system in GPUs is a significant fraction of total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy...
We explore the use of synthetic benchmarks for the training phase of machine-learning-based automatic performance tuning. We focus on the problem of predicting whether the use of local memory on a GPU is beneficial for caching a single target array in a GPU kernel. We show that using only 13 real benchmarks leads to poor prediction accuracy (about 58%) for the 13 leave-one-out models trained using...
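A minimal sketch of the leave-one-out protocol mentioned above, assuming a one-feature representation and a trivial 1-NN stand-in classifier (both hypothetical; the paper's actual features and model are not shown in this excerpt):

```c
#include <stdio.h>

#define N 13   /* number of benchmarks, as in the abstract */

/* Hypothetical stand-in classifier: 1-nearest-neighbor in a
 * one-dimensional feature space, skipping the held-out benchmark. */
static int predict(const double *feat, const int *label,
                   int skip, double query) {
    int best = 0;
    double best_d = 1e300;
    for (int i = 0; i < N; i++) {
        if (i == skip) continue;           /* hold this benchmark out */
        double d = (feat[i] - query) * (feat[i] - query);
        if (d < best_d) { best_d = d; best = i; }
    }
    return label[best];
}

int main(void) {
    /* Illustrative data only: feature value and label per benchmark
     * (1 = local memory is beneficial). */
    double feat[N]  = {0.1,0.9,0.2,0.8,0.3,0.7,0.15,
                       0.85,0.25,0.75,0.4,0.6,0.5};
    int    label[N] = {0,1,0,1,0,1,0,1,0,1,0,1,1};

    int correct = 0;
    for (int i = 0; i < N; i++) {          /* leave-one-out loop */
        int p = predict(feat, label, i, feat[i]);
        correct += (p == label[i]);
    }
    printf("LOO accuracy: %.1f%%\n", 100.0 * correct / N);
    return 0;
}
```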
Open64 is an open-source compiler with powerful analysis capabilities that is widely used as a research and commercial development platform. However, it was not designed or developed to support MPI parallelization. This paper makes several contributions. First, the Open64 compiler infrastructure is presented. Second, the location of MPI code generation within the Open64 compiler architecture is analyzed. Third,...
Programming models such as CUDA, OpenMP, OpenACC, and OpenCL are designed to offload compute-intensive workloads to accelerators efficiently. However, the naive offload model, which copies and executes synchronously and in sequence, requires extensive hand-tuning with techniques such as pipelining to overlap computation and communication. We therefore propose an easy-to-use, directive-based pipelining extension...
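The paper's directive extension itself is not shown in this excerpt; the sketch below illustrates the manual double-buffered pipelining in standard OpenACC that such an extension would automate (array names and chunk sizes are illustrative):

```c
/* Manual pipelining with standard OpenACC async queues:
 * chunk k+1 is staged in while chunk k is being computed. */
#define NCHUNK 8
#define CHUNK  (1 << 20)

void process(float *a, float *b) {
    #pragma acc data create(a[0:NCHUNK*CHUNK], b[0:NCHUNK*CHUNK])
    {
        for (int k = 0; k < NCHUNK; k++) {
            /* stage chunk k to the device on queue k%2 */
            #pragma acc update device(a[k*CHUNK:CHUNK]) async(k % 2)
            /* compute on the same queue, so it runs after the copy */
            #pragma acc parallel loop async(k % 2)
            for (int i = k * CHUNK; i < (k + 1) * CHUNK; i++)
                b[i] = 2.0f * a[i];
            /* copy results back, still ordered on queue k%2 */
            #pragma acc update self(b[k*CHUNK:CHUNK]) async(k % 2)
        }
        #pragma acc wait
    }
}
```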
General-Purpose Graphics Processing Units (GPGPUs) exploit several levels of caches to hide memory latency and provide data for thousands of simultaneously executing threads. The L1 data cache and the L2 cache are critical to GPGPU performance, as each L1 data cache must provide data for all threads within its Streaming Multiprocessor (SM) while the L2 cache must service memory requests...
Compression is a promising technique for increasing the effective capacity of caches. Due to the latency overhead of decompression, most previous studies applied compression to lower-level caches. General-Purpose Graphics Processing Units (GPGPUs) are throughput-oriented computing platforms that execute hundreds to thousands of threads simultaneously. This massive number of threads makes GPGPUs less sensitive...
Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously-defined subset of GNU C99 -- enriched with additional language constructs -- that...
The polyhedral model is a powerful algebraic framework that has enabled significant advances in the analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting...
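For context, "affine" here means that loop bounds and array subscripts are affine functions of the enclosing loop indices and parameters; a constructed example (not from the paper):

```c
/* An affine loop nest: bounds and subscripts are affine in i, j, N.
 * Its iteration domain is the polyhedron
 *   { (i, j) : 0 <= i < N  and  0 <= j <= i },
 * which polyhedral frameworks can analyze and transform exactly. */
void triangular_update(int N, double A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j <= i; j++)
            A[i][j] = A[i][j] + A[j][i];
}
```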
OpenACC is gaining momentum as an implicit and portable interface for porting legacy CPU-based applications to heterogeneous, highly parallel computational environments involving many-core accelerators such as GPUs and the Intel Xeon Phi. OpenACC provides a set of loop directives, similar to OpenMP, for parallelization as well as for managing data movement, attaining functional portability across different...
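For reference, a generic OpenACC example (not taken from the paper) showing a loop directive together with the data clauses that manage host-device movement:

```c
/* The 'parallel loop' directive offloads the loop to the accelerator;
 * copyin/copy clauses state explicitly which arrays move and back. */
void saxpy(int n, float alpha, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```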
Graph500 is a data-intensive application for high-performance computing, and it is an increasingly important workload because graphs are a core part of most analytic applications. So far, no work has examined whether Graph500 is suitable for vectorization, mostly due to a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released...
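For illustration, irregular accesses of the form dst[i] = src[idx[i]] vectorize only if the hardware provides a gather instruction. Later Xeon Phi generations expose gathers through AVX-512 intrinsics as sketched below (the first-generation card used the similar IMCI instruction set; this is a minimal illustration, not the paper's code):

```c
#include <immintrin.h>

/* Gather 16 floats from irregular positions: dst[i] = src[idx[i]].
 * n is assumed to be a multiple of 16 to keep the sketch short. */
void gather_f32(float *dst, const float *src, const int *idx, int n) {
    for (int i = 0; i < n; i += 16) {
        __m512i v_idx = _mm512_loadu_si512(idx + i);        /* 16 indices */
        __m512  v     = _mm512_i32gather_ps(v_idx, src, 4); /* scale = 4  */
        _mm512_storeu_ps(dst + i, v);
    }
}
```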
Efficient memory sharing between CPU and GPU threads can greatly expand the effective set of GPGPU workloads. For increased programmability, this memory should be uniformly virtualized, necessitating compatible address-translation support for GPU memory references. However, even a modest GPU might need hundreds of translations per cycle (6 CUs * 64 lanes/CU = 384) with memory access patterns designed for throughput...
The performance of many parallel applications relies not on instruction-level parallelism but on loop-level parallelism. Unfortunately, automatic parallelization of loops is a fragile process; many different obstacles affect or prevent it in practice. To address this predicament, we developed an interactive compilation feedback system that guides programmers in iteratively modifying their application...
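A constructed example of the kind of obstacle such feedback typically surfaces: possible pointer aliasing, which the programmer can rule out once it is reported.

```c
/* Before: a and b may alias, so the compiler must assume iterations
 * are not independent and reports the loop as non-parallelizable. */
void scale_maybe_alias(int n, double *a, double *b) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

/* After feedback: 'restrict' promises no aliasing, removing the
 * reported obstacle and letting the loop be parallelized. */
void scale_no_alias(int n, double *restrict a,
                    const double *restrict b) {
    for (int i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}
```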
Graph500 is a new benchmark for supercomputers based on large-scale graph analysis, which is becoming an important form of analysis in many real-world applications. Graph algorithms run well on supercomputers with shared memory. For the Linpack-based supercomputer rankings, TOP500 reports that heterogeneous and distributed-memory supercomputers with large numbers of GPGPUs are becoming dominant....
Most of today's processors include vector units designed to speed up single-threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high-level languages is a time-consuming and error-prone task. The alternative is to automate the process of vectorization using vectorizing compilers. This paper evaluates...
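For reference, a sketch of the loop shape vectorizing compilers handle most reliably, with the properties that usually matter called out in comments (a generic example; the flag shown is GCC's vectorization-report option):

```c
/* Unit stride, no aliasing (restrict), countable trip count,
 * no cross-iteration dependences.
 * Check the compiler's decision with: gcc -O3 -fopt-info-vec axpy.c */
void axpy(int n, float a, const float *restrict x, float *restrict y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```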
Due to advances in computer hardware technologies, software developers are focusing more on developing embedded operating systems. GNU/Linux has become a common operating system widely used in embedded technologies. In this paper, we report performance results on a TS-7800 Single Board Computer with different kernel versions released by the hardware provider. We compare the performance between...
A simple, synthetic performance proxy for scientific applications is of great interest to the scientific computing community for the development of new products, for procurements, and for performance-related questions in general. To develop such a performance proxy, we enhance the capability of the memory performance benchmark Apex-MAP by adding new concepts to capture the effects of computational details...
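A loose sketch of the Apex-MAP idea, based on its published description (the parameter names and power-law form below are approximations, not the benchmark's code): spatial locality is modeled by a contiguous run length L, and temporal locality by a power-law bias ALPHA over block start addresses.

```c
#include <stdlib.h>
#include <math.h>

#define M     (1u << 24)  /* working-set size in elements            */
#define L     64          /* spatial locality: contiguous run length */
#define ALPHA 0.25        /* temporal locality: 1.0 = uniform,       */
                          /* smaller = heavier reuse of low indices  */

/* Generate an Apex-MAP-style access stream: pick a block start with
 * a power-law bias, then walk L consecutive elements. */
double apex_like(const double *data, long n_blocks) {
    double sum = 0.0;
    for (long b = 0; b < n_blocks; b++) {
        double r = (double)rand() / RAND_MAX;
        size_t start = (size_t)(pow(r, 1.0 / ALPHA) * (M - L));
        for (int i = 0; i < L; i++)
            sum += data[start + i];   /* unit-stride inner walk */
    }
    return sum;
}
```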
Nowadays, scratchpad memories (SPMs) are widely used as supplements to, or even alternatives for, cache memories in audio applications on cost-effective SoCs. However, traditional SPM architectures suffer from tight capacities and restricted methods of exchanging data with main memory. Such limitations significantly decrease the performance of the whole system, since most of the...
Modern compilers use machine learning to derive, from prior experience, useful heuristics for newly encountered programs in order to accelerate the optimization process. However, prior experience might not be applicable to outlier programs with unfamiliar code features. This paper presents a Reverse k-Nearest Neighbor (RKNN) based approach for outlier detection. The compiler can therefore...
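A minimal sketch of reverse k-NN outlier scoring in general (not the paper's implementation): a point that appears in few other points' k-nearest-neighbor lists receives a low reverse-neighbor count and is flagged as an outlier.

```c
#include <math.h>
#include <stdio.h>

#define N 8   /* programs (feature vectors) */
#define D 2   /* features per program       */
#define K 2   /* neighbors considered       */

/* Squared Euclidean distance between two feature vectors. */
static double dist2(const double *a, const double *b) {
    double s = 0.0;
    for (int d = 0; d < D; d++)
        s += (a[d] - b[d]) * (a[d] - b[d]);
    return s;
}

int main(void) {
    /* Illustrative data: seven clustered points and one far outlier. */
    double x[N][D] = {{0,0},{0,1},{1,0},{1,1},{0.5,0.5},
                      {0.2,0.8},{0.8,0.2},{9,9}};
    int rknn[N] = {0};

    for (int i = 0; i < N; i++) {
        int chosen[N] = {0};
        for (int k = 0; k < K; k++) {       /* pick i's K nearest */
            int best = -1;
            double bd = INFINITY;
            for (int j = 0; j < N; j++) {
                if (j == i || chosen[j]) continue;
                double d = dist2(x[i], x[j]);
                if (d < bd) { bd = d; best = j; }
            }
            chosen[best] = 1;
            rknn[best]++;   /* 'best' is a k-NN of i: credit it */
        }
    }
    for (int i = 0; i < N; i++)             /* count of 0 => outlier */
        printf("program %d: reverse-neighbor count %d%s\n",
               i, rknn[i], rknn[i] == 0 ? "  <- outlier" : "");
    return 0;
}
```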
This paper describes the use of CUDA to accelerate the Himeno benchmark on clusters with GPUs. The implementation is designed to optimize memory bandwidth utilization. Our approach achieves over 83% of the theoretical peak bandwidth on an NVIDIA Tesla C1060 GPU and performs at over 50 GFlops. A multi-GPU implementation that utilizes MPI alongside CUDA streams to overlap GPU execution with data transfers...
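A sketch of the overlap structure such a multi-GPU implementation typically uses (launch_interior_stencil is a hypothetical wrapper; the paper's actual kernels and decomposition are not reproduced here):

```c
#include <mpi.h>
#include <cuda_runtime.h>

/* Hypothetical wrapper around the stencil kernel launch; the real
 * Himeno kernel is not shown in this excerpt. */
void launch_interior_stencil(float *d_p, cudaStream_t s);

/* Interior computation runs on one stream while boundary planes are
 * staged through the host (h_bnd should be pinned via cudaHostAlloc
 * for truly asynchronous copies) and exchanged over MPI. */
void exchange_while_computing(float *d_p, float *d_bnd, float *h_bnd,
                              size_t bytes, int neighbor) {
    cudaStream_t s_comp, s_copy;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_copy);

    launch_interior_stencil(d_p, s_comp);     /* compute interior */

    cudaMemcpyAsync(h_bnd, d_bnd, bytes,      /* stage boundary D2H */
                    cudaMemcpyDeviceToHost, s_copy);
    cudaStreamSynchronize(s_copy);

    MPI_Sendrecv_replace(h_bnd, (int)(bytes / sizeof(float)), MPI_FLOAT,
                         neighbor, 0, neighbor, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaMemcpyAsync(d_bnd, h_bnd, bytes,      /* halo back H2D */
                    cudaMemcpyHostToDevice, s_copy);

    cudaStreamSynchronize(s_copy);            /* both streams done */
    cudaStreamSynchronize(s_comp);
    cudaStreamDestroy(s_copy);
    cudaStreamDestroy(s_comp);
}
```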
Modern GPUs open up a completely new field for optimizing embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable are: isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy, whose organization and performance implications...