Search results for: Kathryn Mohror

Items from 1 to 5 out of 5 results

chapter

A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers

Kento Sato, Kathryn Mohror, Adam Moody, Todd Gamblin, et al.

2014 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pp. 21-30

Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage...

chapter

Design and modeling of a non-blocking checkpointing system

Kento Sato, Naoya Maruyama, Kathryn Mohror, Adam Moody, et al.

2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-10

As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to...

chapter

MCREngine: A scalable checkpointing system using data-aware aggregation and compression

Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, et al.

2012 SC - International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11

High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost...

chapter

Asynchronous checkpoint migration with MRNet in the Scalable Checkpoint / Restart Library

Kathryn Mohror, Adam Moody, Bronis R. de Supinski

2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W), DSN 2012, pp. 1-6

Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint / Restart Library...

chapter

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Adam Moody, Greg Bronevetsky, Kathryn Mohror, Bronis R de Supinski

2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-11

High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level...


INFONA - science communication portal
