High-performance computing (HPC) systems are increasingly being used for data-intensive, or "Big Data", workloads. However, since traditional HPC workloads are compute-intensive, the HPC-Big Data convergence has created many challenges in optimizing data movement and processing on modern supercomputers. Our collaborative work addresses these challenges using a three-pronged approach: (i)...
Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across...
Burst buffers are becoming an indispensable hardware resource on large-scale supercomputers to buffer the bursty I/O from scientific applications. However, there is a lack of software support for burst buffers to be efficiently shared by applications within a batch-submitted job and recycled across different batch jobs. In addition, burst buffers need to cope with a variety of challenging I/O patterns...
In the quest to build exascale supercomputers, designers are increasing the number of hierarchical levels that exist among system components. Software developed for these systems must account for the various hierarchies to achieve maximum efficiency. The first step in this work is to identify groups of processes that share common resources. We develop, analyze, and test several algorithms that can...
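The abstract's own algorithms are truncated above; as a minimal baseline for the problem it poses, standard MPI-3 can already group processes that share a node's memory via MPI_Comm_split_type. The sketch below illustrates that problem setting, not the paper's method:

    #include <mpi.h>
    #include <stdio.h>

    /* Baseline illustration: group processes by shared-memory domain (node)
     * using standard MPI-3. The paper's own grouping algorithms are not
     * reproduced here. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Split COMM_WORLD into one communicator per shared-memory node. */
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
                            0 /* key: keep world rank order */, MPI_INFO_NULL,
                            &node_comm);

        int node_rank, node_size;
        MPI_Comm_rank(node_comm, &node_rank);
        MPI_Comm_size(node_comm, &node_size);

        printf("world rank %d is rank %d of %d on its node\n",
               world_rank, node_rank, node_size);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }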
Large community clusters are becoming increasingly common in universities and other organizations due to the benefits they provide to researchers in terms of operational costs and resource availability. However, efficient administration, failure diagnosis, and performance debugging on community clusters are challenging tasks due to the sheer diversity of workloads and users. These clusters are...
We use supervised machine learning algorithms (i.e., Decision Trees, Random Forests, and K-Nearest Neighbors) to predict performance characteristics such as runtime and I/O traffic of batch jobs on high-end clusters, using only user job scripts as input. We show that decision trees outperform the other algorithms and accurately predict the runtime of 73% of jobs within an error tolerance of 10 minutes, which...
In this work, we investigate the problem of inter-application interference in a shared Burst Buffer (BB) system. A BB is a new storage technology for HPC architectures that acts as an intermediate layer between performance-hungry HPC applications and the slow parallel file system. While the BB is meant to alleviate the problem of slow I/O in HPC systems, it is itself prone to performance degradation...
Independent validation of experimental results in the field of parallel and distributed systems research is a challenging task, mainly due to changes and differences in software and hardware in computational environments. In particular, when an experiment runs on hardware different from that on which it originally executed, predicting the differences in results is difficult. In this paper, we introduce...
An efficient implementation of the Process Management Interface (PMI) is crucial to enable fast start-up of MPI jobs. We propose three extensions to the PMI specification: 1) a blocking allgather collective (PMIX_Allgather), 2) a non-blocking allgather collective (PMIX_Iallgather), and 3) a non-blocking fence (PMIX_KVS_Ifence). We design and evaluate several PMI implementations to demonstrate how...
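Only the three extension names appear in the abstract; the prototypes below are a hypothetical sketch of what such an interface could look like, with signatures loosely modeled on the PMI2 C API. The PMIX_Request handle type and every parameter list are assumptions, not the authors' actual specification.

    /* Hypothetical C prototypes for the three proposed PMI extensions.
     * Only the names come from the abstract; the signatures below,
     * including the PMIX_Request handle type, are illustrative
     * assumptions in the style of the PMI2 C interface. */

    typedef struct PMIX_Request_s *PMIX_Request;  /* assumed opaque handle */

    /* Blocking allgather: each process contributes sendlen bytes and, on
     * return, recvbuf holds the concatenated contributions of all ranks. */
    int PMIX_Allgather(const void *sendbuf, int sendlen, void *recvbuf);

    /* Non-blocking allgather: returns immediately; the caller later waits
     * on the request handle before reading recvbuf. */
    int PMIX_Iallgather(const void *sendbuf, int sendlen, void *recvbuf,
                        PMIX_Request *request);

    /* Non-blocking fence: initiates the key-value space synchronization
     * that a blocking PMI fence performs, without stalling the caller. */
    int PMIX_KVS_Ifence(PMIX_Request *request);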
Evaluating experimental results in the field of computer systems is a challenging task, mainly due to the many changes in software and hardware that computational environments go through. In this position paper, we analyze salient features of container technology that, if leveraged correctly, can help reduce the complexity of reproducing experiments in systems research. We present a use case in the...
Large HPC centers spend considerable time supporting software for thousands of users, but the complexity of HPC software is quickly outpacing the capabilities of existing software management tools. Scientific applications require specific versions of compilers, MPI, and other dependency libraries, so using a single, standard software stack is infeasible. However, managing many configurations is difficult...
A parallel file system (PFS) is often used to store intermediate results and checkpoint/restart files in a high performance computing (HPC) system. Multiple applications running on an HPC system often access PFSs concurrently, resulting in degraded and variable I/O performance. By managing PFS accesses, these sharing-induced inefficiencies can be controlled and reduced. To this end, we are exploring...
Checkpoint/Restart is an indispensable fault tolerance technique commonly used by high-performance computing applications that run continuously for hours or days at a time. However, even with state-of-the-art checkpoint/restart techniques, high failure rates at large scale will limit application efficiency. To alleviate the problem, we consider using burst buffers. Burst buffers are dedicated storage...
Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future, extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs...
High-performance computing (HPC) systems are growing more powerful by utilizing more components. As the system mean time before failure correspondingly drops, applications must checkpoint frequently to make progress. However, at scale, the cost of checkpointing becomes prohibitive. A solution to this problem is multilevel checkpointing, which employs multiple types of checkpoints in a single run....
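As a concrete illustration of the general technique (not the paper's system), a multilevel policy might take cheap node-local checkpoints frequently and escalate every k-th one to the parallel file system. A minimal sketch in C, where the PFS_PERIOD ratio is an assumed parameter:

    #include <stdio.h>

    /* Minimal sketch of a multilevel checkpoint policy: most checkpoints
     * go to fast node-local storage, and every PFS_PERIOD-th checkpoint
     * is additionally written to the slower but more resilient parallel
     * file system. The ratio is an assumed illustrative parameter. */
    #define PFS_PERIOD 10

    enum level { LEVEL_LOCAL, LEVEL_PFS };

    static enum level choose_level(int checkpoint_id)
    {
        /* Stay node-local by default; periodically escalate to the PFS. */
        return (checkpoint_id % PFS_PERIOD == 0) ? LEVEL_PFS : LEVEL_LOCAL;
    }

    int main(void)
    {
        for (int id = 1; id <= 25; id++) {
            enum level l = choose_level(id);
            printf("checkpoint %2d -> %s\n", id,
                   l == LEVEL_PFS ? "parallel file system"
                                  : "node-local storage");
        }
        return 0;
    }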
Large-scale systems typically mount many different file systems with distinct performance characteristics and capacities. Applications must efficiently use this storage in order to realize their full performance potential. Users must take into account potential file replication throughout the storage hierarchy as well as contention in lower levels of the I/O system, and must consider communicating the...
As the capability and component count of systems increase, the MTBF decreases. Typically, applications tolerate failures with checkpoint/restart to a parallel file system (PFS). While simple, this approach can suffer from contention for PFS resources. Multi-level checkpointing is a promising solution. However, while multi-level checkpointing is successful on today's machines, it is not expected to...
High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost...
Applications running on today's supercomputers tolerate failures by periodically saving their state in checkpoint files on stable storage, such as a parallel file system. Although this approach is simple, the overhead of writing the checkpoints can be prohibitive, especially for large-scale jobs. In this paper, we present initial results of an enhancement to our Scalable Checkpoint/Restart Library...
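For context on where such an enhancement plugs in, the baseline checkpoint loop of the SCR C API looks roughly like the sketch below. It uses only SCR's long-documented core calls; the file names and loop structure are illustrative, and the paper's enhancement itself is not shown.

    #include <stdio.h>
    #include <mpi.h>
    #include "scr.h"

    /* Minimal sketch of the classic SCR checkpoint loop. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        SCR_Init();

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int step = 0; step < 100; step++) {
            /* ... computation for this timestep ... */

            int need_ckpt = 0;
            SCR_Need_checkpoint(&need_ckpt);  /* ask SCR if it is time */
            if (need_ckpt) {
                SCR_Start_checkpoint();

                /* Ask SCR where to write this rank's checkpoint file;
                 * SCR redirects it to fast storage and manages copies. */
                char name[256], path[SCR_MAX_FILENAME];
                snprintf(name, sizeof(name),
                         "ckpt_step%d_rank%d.dat", step, rank);
                SCR_Route_file(name, path);

                FILE *fp = fopen(path, "w");
                int valid = (fp != NULL);
                if (fp) {
                    /* ... write this rank's application state ... */
                    fclose(fp);
                }
                SCR_Complete_checkpoint(valid);
            }
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }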
High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level...
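The trade-off sketched above can be made quantitative with Young's classic first-order approximation, which is standard background rather than part of this abstract: writing $\delta$ for the time to write one checkpoint and $M$ for the system mean time between failures,

    \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M},
    \qquad
    \text{overhead fraction} \approx \frac{\delta}{\tau_{\mathrm{opt}}}
        = \sqrt{\frac{\delta}{2M}} .

As memory sizes grow faster than parallel file system bandwidth, $\delta$ grows; as component counts grow, $M$ shrinks; both push the overhead fraction up, which is why checkpoint cost begins to dominate run times.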