MPI has been the de facto standard for parallel programming for decades. There has been increasing concern about the reliability of MPI applications in recent years, due in part to the inefficiency of parallel checkpointing. MapReduce is a newer programming model originally introduced to handle massive data processing. There have been numerous recent efforts to transform classical MPI-based scientific...
Checkpointing is widely used in technical computing. However, the overhead of checkpointing has become a subject of increasing concern in recent years, especially for large-scale parallel computer systems. In these systems, checkpointing generates a huge number of concurrent I/O writes. This burst of writes, combined with the worsening I/O-wall problem, often leads to network and I/O congestion, and makes the overall...
The MapReduce programming paradigm has gained increasing popularity in recent years due to its ability to support easy programming, data distribution, and fault tolerance. Failure is an unwanted but inevitable fact that all large-scale parallel computing systems have to face. MapReduce introduces a novel data replication and task re-execution strategy for fault tolerance. This study...
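The abstract above does not spell out the re-execution strategy; as a rough illustration of the general idea (not the paper's specific design), a failed task in a MapReduce-style system is simply rerun from its replicated input rather than rolling the whole job back to a checkpoint. A minimal sketch, with `run_with_reexecution` and `flaky_map_task` as hypothetical names:

```python
def run_with_reexecution(task, max_attempts=3):
    """Sketch of MapReduce-style fault tolerance: a failed task is
    re-executed from its (replicated) input instead of rolling the
    whole job back to a checkpoint."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except RuntimeError as err:
            # A real scheduler would reschedule the task on a healthy
            # worker; this toy version simply retries locally.
            last_error = err
    raise last_error

# Toy task that fails on its first two attempts, then succeeds.
attempts = {"count": 0}
def flaky_map_task():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated worker failure")
    return "map output"

print(run_with_reexecution(flaky_map_task))  # prints "map output"
```

The key contrast with checkpoint/restart is scope: only the failed task's work is lost and redone, which is what makes this strategy cheap when failures are frequent but individual tasks are small.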
Checkpointing is a widely used mechanism for supporting fault tolerance, but it is notorious for its high-cost disk access. The idea of memory-based checkpointing has been extensively studied in research but has seen little success in practice due to its complexity and potential reliability concerns. In this study we present the design and implementation of REMEM, a Remote Memory checkpointing system to extend...
The increasingly large ensemble size of modern High-Performance Computing (HPC) systems has drastically increased the likelihood of failures. Performance under failures, and its optimization, have become timely and important issues facing the HPC community. In this study, we propose an analytical model to predict application performance. The model characterizes the impact of coordinated checkpointing and...
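The abstract is truncated before the model itself; as a reference point for this line of work (not the paper's own model), a classic first-order result for coordinated checkpointing is Young's approximation, which balances checkpoint overhead against lost work: T_opt ≈ sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint and MTBF is the system's mean time between failures. A minimal sketch:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation for the optimal interval
    between coordinated checkpoints: T_opt = sqrt(2 * C * MTBF)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Example: a 5-minute checkpoint cost on a system with a 24-hour MTBF
# gives an interval of sqrt(2 * 300 * 86400) = 7200 s, i.e. 2 hours.
interval = optimal_checkpoint_interval(300, 24 * 3600)
print(f"Checkpoint roughly every {interval / 3600:.1f} hours")
```

The intuition matches the abstract's motivation: as ensemble size grows, the system MTBF shrinks, the optimal interval shortens, and checkpoint overhead consumes a growing fraction of runtime, which is why more refined analytical models are needed.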
This paper discusses the application of existing workflow management systems to a real-world science application (LQCD). Typical workflows and the execution environment used in production are described, and the requirements for the LQCD production system are discussed. The workflow management systems Askalon and Swift were tested by implementing the LQCD workflows and were evaluated against these requirements. We report...
Modern high-end computers are unprecedentedly complex. The occurrence of faults is an inevitable fact when solving large-scale applications on future Petaflop machines. Many methods have been proposed in recent years to mask faults. These methods, however, impose various performance and production costs. A better understanding of faults' influence on application performance is necessary to use existing...