Search results

Items from 1 to 12 out of 12 results

chapter

Cache Partitioning + Loop Tiling: A Methodology for Effective Shared Cache Management

Vasilios Kelefouras, Georgios Keramidas, Nikolaos Voros

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) > 477 - 482

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework for fine-tuning those two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by one order of magnitude keeping at...

chapter

An Overview of Performance Portability in the Uintah Runtime System through the Use of Kokkos

Daniel Sunderland, Brad Peterson, John Schmidt, Alan Humphrey, et al.

2016 Second International Workshop on Extreme Scale Programming Models and Middleware (ESPM2) > 44 - 47

2016 Second International Workshop on Extreme Scale Programming Models and Middleware (ESPM2)

The current diversity in nodal parallel computer architectures is seen in machines based upon multicore CPUs, GPUs and the Intel Xeon Phi's. A class of approaches for enabling scalability of complex applications on such architectures is based upon Asynchronous Many Task software architectures such as that in the Uintah framework used for the parallel solution of solid and fluid mechanics problems...

chapter

Adaptive digital scan variable pixels

Sherin Sugathan, Reshma Scaria, Alex Pappachen James

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) > 1185 - 1188

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI)

The square and rectangular shape of the pixels in the digital images for sensing and display purposes introduces several inaccuracies in the representation of digital images. The major disadvantage of square pixel shapes is the inability to accurately capture and display the details in the objects having variable orientations to edges, shapes and regions. This effect can be observed by the inaccurate...

chapter

An OpenACC Extension for Data Layout Transformation

Tetsuya Hoshino, Naoya Maruyama, Satoshi Matsuoka

2014 First Workshop on Accelerator Programming using Directives > 12 - 18

2014 First Workshop on Accelerator Programming using Directives (WACCPD)

OpenACC is gaining momentum as an implicit and portable interface in porting legacy CPU-based applications to heterogeneous, highly parallel computational environment involving many-core accelerators such as GPUs and Intel Xeon Phi. OpenACC provides a set of loop directives similar to OpenMP for the parallelization and also to manage data movement, attaining functional portability across different...

chapter

Dymaxion++: A Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems

Shuai Che, Jiayuan Meng, Kevin Skadron

2014 IEEE International Parallel & Distributed Processing Symposium Workshops > 916 - 924

2014 IEEE International Parallel & Distributed Processing Symposium Workshops (IPDPSW)

There has been a growing trend in using heterogeneous systems with CPUs and GPUs to solve diverse compute problems. However, high application performance on these platforms relies on efficient memory accesses. For many applications, CPUs and GPUs prefer different memory mappings and data structure layouts. This in turn requires developers to use device-specific strategies for memory access optimizations...

chapter

Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs

Daichi Mukunoki, Daisuke Takahashi

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum > 1378 - 1386

2012 26th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We implemented and evaluated the triple precision Basic Linear Algebra Subprograms (BLAS) subroutines, AXPY, GEMV and GEMM on a Tesla C2050. In this paper, we present a Double Single (D+S) type triple precision floating-point value format and operations. They are based on techniques similar to Double-Double (DD) type quadruple precision operations. On the GPU, the D+S-type operations are more costly...

article

Fast Construction of SAH BVHs on the Intel Many Integrated Core (MIC) Architecture

Ingo Wald

IEEE Transactions on Visualization and Computer Graphics > 2012 > 18 > 1 > 47 - 57

We investigate how to efficiently build bounding volume hierarchies (BVHs) with surface area heuristic (SAH) on the Intel Many Integrated Core (MIC) Architecture. To achieve maximum performance, we use four key concepts: progressive 10-bit quantization to reduce cache footprint with negligible loss in BVH quality; an AoSoA data layout that allows efficient streaming and SIMD processing; high-performance...

chapter

An Evaluation of Vectorizing Compilers

Saeed Maleki, Yaoqing Gao, Maria J. Garzarán, Tommy Wong, et al.

2011 International Conference on Parallel Architectures and Compilation Techniques > 372 - 382

2011 International Conference on Parallel Architectures and Compilation Techniques (PACT)

Most of today's processors include vector units that have been designed to speed up single-threaded programs. Although vector instructions can deliver high performance, writing vector code in assembly language or using intrinsics in high-level languages is a time-consuming and error-prone task. The alternative is to automate the process of vectorization by using vectorizing compilers. This paper evaluates...

chapter

Design of MILC Lattice QCD Application for GPU Clusters

Guochun Shi, Steven Gottlieb, Aaron Torok, Volodymyr Kindratenko

2011 IEEE International Parallel & Distributed Processing Symposium > 363 - 371

2011 IEEE International Parallel & Distributed Processing Symposium (IPDPS)

We present an implementation of the improved staggered quark action lattice QCD computation designed for execution on a GPU cluster. The parallelization strategy is based on dividing the space-time lattice along the time dimension and distributing the sub-lattices among the GPU cluster nodes. We provide a mixed-precision floating-point GPU implementation of the multi-mass conjugate gradient solver...

chapter

Dymaxion: Optimizing memory access patterns for heterogeneous systems

Shuai Che, Jeremy W. Sheaffer, Kevin Skadron

2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) > 1 - 11

2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Graphics processors (GPUs) have emerged as an important platform for general purpose computing. GPUs offer a large number of parallel cores and have access to high memory bandwidth; however, data structure layouts in GPU memory often lead to sub-optimal performance for programs designed with a CPU memory interface — or no particular memory interface at all! — in mind. This implies that application...

chapter

VAD: A Detail Investigation into Process's Memory

Xudong Li, Chunxia Zhang, Xing Lin, Shuguang Lin

2009 International Conference on Computational Intelligence and Security > 1 > 531 - 536

2009 International Conference on Computational Intelligence and Security (CIS 2009)

This paper discusses a process's memory layout on Windows. We describe the structures of the Virtual Address Descriptor (VAD) and the AVL tree of VADs in the Windows Research Kernel (WRK), how to extract useful information from these structures, and how to locate process heaps using the PEB. We recommend a way to get a process's memory layout in user space using VAD tree traversal on WRK, which is illustrated...

chapter

CUDA Memory Optimizations for Large Data-Structures in the Gravit Simulator

J. Siegel, J. Ributzka, Xiaoming Li

2009 International Conference on Parallel Processing Workshops > 174 - 181

2009 38th International Conference on Parallel Processing Workshops (ICPPW 2009)

Modern GPUs open a completely new field to optimize embarrassingly parallel algorithms. Implementing an algorithm on a GPU confronts the programmer with a new set of challenges for program optimization. Some of the most notable ones are isolating the part of the algorithm that can be optimized to run on the GPU; tuning the program for the GPU memory hierarchy whose organization and performance implications...

Filter options

Data set:
ieee
Keywords:
KERNEL
ARRAYS
LAYOUT
