Wesley Bland

chapter

Memory Compression Techniques for Network Address Management in MPI

Yanfei Guo, Charles J. Archer, Michael Blocksome, Scott Parker, more

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 1008 - 1017

MPI allows applications to treat processes as a logical collection of integer ranks for each MPI communicator, while internally translating these logical ranks into actual network addresses. In current MPI implementations the management and lookup of such network addresses use memory sizes that are proportional to the number of processes in each communicator. In this paper, we propose a new mechanism,...

chapter

Flexible Error Recovery Using Versions in Global View Resilience

Nan Dun, Hajime Fujita, Aiman Fang, Yan Liu, more

2015 IEEE International Conference on Cluster Computing > 512 - 513

2015 IEEE International Conference on Cluster Computing (CLUSTER)

We present the Global View Resilience (GVR) system, a library that enables applications to add resilience in a portable, application-controlled fashion using versioned distributed arrays. We briefly describe GVR's interfaces for distributed arrays, versioning, and cross-layer error recovery. We illustrate how GVR can be used for rollback recovery and a wide range of additional error recovery techniques...

chapter

Lessons Learned Implementing User-Level Failure Mitigation in MPICH

Wesley Bland, Huiwei Lu, Sangmin Seo, Pavan Balaji

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing > 1123 - 1126

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and libraries and is being considered by the MPI Forum for future inclusion into MPI itself. In this paper, we introduce an implementation of ULFM in MPICH, a high-performance and widely portable implementation...

chapter

Fault tolerant MapReduce-MPI for HPC clusters

Yanfei Guo, Wesley Bland, Pavan Balaji, Xiaobo Zhou

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 12

Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, because MPI lacks native fault tolerance support and the MapReduce fault tolerance model is incompatible with HPC schedulers, it is very hard to provide a fault-tolerant MapReduce runtime for HPC clusters. We propose...

chapter

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

Antonio J. Peña, Wesley Bland, Pavan Balaji

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 12

Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks for an efficient...

chapter

Simplifying the Recovery Model of User-Level Failure Mitigation

Wesley Bland, Kenneth Raffenetti, Pavan Balaji

2014 Workshop on Exascale MPI at Supercomputing Conference > 20 - 25

2014 Workshop on Exascale MPI at Supercomputing Conference (ExaMPI)

As resilience research in high-performance computing has matured, so too have the tools, libraries, and languages that result from it. The Message Passing Interface (MPI) Forum is considering the addition of fault tolerance to a future version of the MPI standard, and a new chapter called User-Level Failure Mitigation (ULFM) has been proposed to fill this need. However, as ULFM usage has become more...

article

Extending the scope of the Checkpoint‐on‐Failure protocol for forward recovery in standard MPI

Wesley Bland, Peng Du, Aurelien Bouteiller, Thomas Herault, more

Concurrency and Computation: Practice and Experience > 25 > 17 > 2381 - 2393

Most predictions of exascale machines picture billion-way parallelism, encompassing not only millions of cores but also tens of thousands of nodes. Even considering extremely optimistic advances in hardware reliability, probabilistic amplification entails that failures will be unavoidable. Consequently, software fault tolerance is paramount to maintain future scientific productivity. Two major problems...

article

An evaluation of User-Level Failure Mitigation support in MPI

Wesley Bland, Aurelien Bouteiller, Thomas Herault, Joshua Hursey, more

Computing > 2013 > 95 > 12 > 1171-1184

As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead...

chapter

Enabling Application Resilience with and without the MPI Standard

Wesley Bland

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (ccgrid 2012) > 746 - 751

2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

As recent research has demonstrated, it is becoming a necessity for large-scale applications to have the ability to tolerate process failure during execution. As the number of processes increases, checkpoint/restart fault tolerance approaches requiring large concurrent state checkpointing become untenable, and radically new methods to address fault tolerance are needed. This work addresses these...

INFONA - science communication portal

Search results for: Wesley Bland

