2014 IEEE International Parallel & Distributed Processing Symposium (IPDPS)

chapter

Reconstructing Householder Vectors from Tall-Skinny QR

Grey Ballard, James Demmel, Laura Grigori, Mathias Jacquelin, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1159 - 1170

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing...

chapter

A Framework for Lattice QCD Calculations on GPUs

F.T. Winter, M.A. Clark, R.G. Edwards, B. Joo

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1073 - 1082

Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high performance code. As a result porting of applications to GPUs is typically...

chapter

UPC++: A PGAS Extension for C++

Yili Zheng, Amir Kamil, Michael B. Driscoll, Hongzhang Shan, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1105 - 1114

Partitioned Global Address Space (PGAS) languages are convenient for expressing algorithms with large, random-access data, and they have proven to provide high performance and scalability through lightweight one-sided communication and locality control. While very convenient for moving data around the system, PGAS languages have taken different views on the model of computation, with the static Single...

chapter

An Evaluation of One-Sided and Two-Sided Communication Paradigms on Relaxed-Ordering Interconnect

Khaled Z. Ibrahim, Paul H. Hargrove, Costin Iancu, Katherine Yelick

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1115 - 1125

The Cray Gemini interconnect hardware provides multiple transfer mechanisms and out-of-order message delivery to improve communication throughput. In this paper we quantify the performance of one-sided and two-sided communication paradigms with respect to: 1) the optimal available hardware transfer mechanism, 2) message ordering constraints, 3) per node and per core message concurrency. In addition...

chapter

Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell

Bhavishya Goel, Ruben Titos-Gil, Anurag Negi, Sally A. McKee, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 615 - 624

Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory...

chapter

A Spatio-temporal Coupling Method to Reduce the Time-to-Solution of Cardiovascular Simulations

Amanda Randles, Efthimios Kaxiras

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 593 - 602

We present a new parallel-in-time method designed to reduce the overall time-to-solution of a patient-specific cardiovascular flow simulation. Using a modified Parareal algorithm, our approach extends strong scalability beyond spatial parallelism with fully controllable accuracy and no decrease in stability. We discuss the coupling of spatial and temporal domain decompositions used in our implementation,...

chapter

Locating Parallelization Potential in Object-Oriented Data Structures

Korbinian Molitorisz, Thomas Karcher, Alexander Biele, Walter F. Tichy

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1005 - 1015

The free lunch of ever increasing single-processor performance is over. Software engineers have to parallelize software to gain performance improvements. But not every software engineer is a parallel expert and with millions of lines of code that have not been developed with multicore in mind, we have to find ways to assist in identifying parallelization potential. This paper makes three contributions:...

chapter

Reading the Tea-Leaves: How Architecture Has Evolved at the High End

Peter Kogge

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 515

chapter

Using Multiple Threads to Accelerate Single Thread Performance

Zehra Sura, Kevin O'Brien, Jose Brunheroto

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 985 - 994

Computing systems are being designed with an increasing number of hardware cores. To effectively use these cores, applications need to maximize the amount of parallel processing and minimize the time spent in sequential execution. In this work, we aim to exploit fine-grained parallelism beyond the parallelism already encoded in an application. We define an execution model using a primary core and...

chapter

Active Measurement of Memory Resource Consumption

Marc Casas, Greg Bronevetsky

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 995 - 1004

Hierarchical memory is a cornerstone of modern hardware design because it provides high memory performance and capacity at a low cost. However, the use of multiple levels of memory and complex cache management policies makes it very difficult to optimize the performance of applications running on hierarchical memories. As the number of compute cores per chip continues to rise faster than the total...

chapter

An Accelerated Recursive Doubling Algorithm for Block Tridiagonal Systems

Sudip K. Seal

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1019 - 1028

Block tridiagonal systems of linear equations arise in a wide variety of scientific and engineering applications. The recursive doubling algorithm is a well-known prefix computation-based numerical algorithm that requires O(M^3 (N/P + log P)) work to compute the solution of a block tridiagonal system with N block rows and block size M on P processors. In real-world applications, solutions of tridiagonal...

chapter

Nitro: A Framework for Adaptive Code Variant Tuning

Saurav Muralidharan, Manu Shantharam, Mary Hall, Michael Garland, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 501 - 512

Autotuning systems intelligently navigate a search space of possible implementations of a computation to find the implementation(s) that best meet a specific optimization criterion, usually performance. This paper describes Nitro, a programmer-directed autotuning framework that facilitates tuning of code variants, or alternative implementations of the same computation. Nitro provides a library interface...

chapter

Active Measurement of the Impact of Network Switch Utilization on Application Performance

Marc Casas, Greg Bronevetsky

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 165 - 174

Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make network performance a common performance bottleneck. To achieve high performance...

chapter

CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination

Matthieu Dorier, Gabriel Antoniu, Rob Ross, Dries Kimpe, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 155 - 164

Unmatched computation and storage performance in new HPC systems have led to a plethora of I/O optimizations ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interference produced by multiple applications accessing a shared parallel file system in a concurrent manner becomes a major problem...

chapter

Author index

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 1255 - 1259

Presents an index of the authors whose articles are published in the conference proceedings record.

chapter

BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications

Reza Mokhtari, Michael Stumm

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 819 - 828

GPUs offer an order of magnitude higher compute power and memory bandwidth than CPUs. GPUs therefore might appear to be well suited to accelerate computations that operate on voluminous data sets in independent ways, e.g., for transformations, filtering, aggregation, partitioning or other "Big Data" style processing. Yet experience indicates that it is difficult, and often error-prone, to...

chapter

Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2

Wei Xue, Chao Yang, Haohuan Fu, Xinliang Wang, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 745 - 754

This paper presents a hybrid algorithm for the petascale global simulation of atmospheric dynamics on Tianhe-2, the world's current top-ranked supercomputer developed by China's National University of Defense Technology (NUDT). Tianhe-2 is equipped with both Intel Xeon CPUs and Intel Xeon Phi accelerators. A key idea of the hybrid algorithm is to enable flexible domain partition between an arbitrary...

chapter

An Efficient Method for Stream Semantics over RDMA

Patrick MacArthur, Robert D. Russell

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 841 - 851

Most network applications today are written to use TCP/IP via sockets. Remote Direct Memory Access (RDMA) is gaining popularity because its zero-copy, kernel-bypass features provide a high throughput, low latency reliable transport. Unlike TCP, which is a stream-oriented protocol, RDMA is a message-oriented protocol, and the OFA verbs library for writing RDMA application programs is more complex than...

chapter

DataMPI: Extending MPI to Hadoop-Like Big Data Computing

Xiaoyi Lu, Fan Liang, Bing Wang, Li Zha, et al.

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 829 - 838

MPI has been widely used in High Performance Computing. In contrast, such efficient communication support is lacking in the field of Big Data Computing, where communication is realized by time consuming techniques such as HTTP/RPC. This paper takes a step in bridging these two fields by extending MPI to support Hadoop-like Big Data Computing jobs, where processing and communication of a large number...

chapter

Characterization and Optimization of Memory-Resident MapReduce on HPC Systems

Yandong Wang, Robin Goldstone, Weikuan Yu, Teng Wang

2014 IEEE 28th International Parallel and Distributed Processing Symposium > 799 - 808

MapReduce is a widely accepted framework for addressing big data challenges. Recently, it has also gained broad attention from scientists at the U.S. leadership computing facilities as a promising solution to process gigantic simulation results. However, conventional high-end computing systems are constructed based on the compute-centric paradigm while big data analytics applications prefer a data-centric...
