Search results for: C. Leangsuksun

Items from 1 to 13 out of 13 results

chapter

Benefits of Software Rejuvenation on HPC Systems

N Naksinehaboon, N Taerat, C Leangsuksun, C F Chandler, et al.

International Symposium on Parallel and Distributed Processing with Applications > 499 - 506

2010 International Symposium on Parallel and Distributed Processing with Applications (ISPA 2010)

Rejuvenation is a technique expected to mitigate failures in HPC systems by replacing, repairing, or resetting system components. Because of the small overhead required by software rejuvenation, we primarily focus on OS/kernel rejuvenation. In this paper, we propose three rejuvenation scheduling techniques. Moreover, we investigate the claim that software rejuvenation prolongs failure times in HPC...

chapter

Proficiency Metrics for Failure Prediction in High Performance Computing

N Taerat, C Leangsuksun, C Chandler, N Naksinehaboon

International Symposium on Parallel and Distributed Processing with Applications > 491 - 498

2010 International Symposium on Parallel and Distributed Processing with Applications (ISPA 2010)

The number of failures occurring in large-scale high performance computing (HPC) systems is significantly increasing due to the large number of physical components found on the system. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms requires additional overhead. As such, failure prediction is needed in order to smartly utilize...

chapter

Blue Gene/L Log Analysis and Time to Interrupt Estimation

N. Taerat, N. Naksinehaboon, C. Chandler, J. Elliott, et al.

2009 International Conference on Availability, Reliability and Security > 173 - 180

2009 International Conference on Availability, Reliability and Security. ARES 2009

System- and application-level failures could be characterized by analyzing relevant log files. The resulting data might then be used in numerous studies on and future developments for the mission-critical and large scale computational architecture, including fields such as failure prediction, reliability modeling, performance modeling and power awareness. In this paper, system logs covering a six...

chapter

Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments

N. Naksinehaboon, Yudan Liu, C. Leangsuksun, R. Nassar, et al.

2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) > 783 - 788

2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08)

For a full checkpoint on a large-scale HPC system, huge memory contexts must potentially be transferred through the network and saved in reliable storage. As such, the time taken to checkpoint becomes a critical issue which directly impacts the total execution time. Therefore, incremental checkpointing, as a less intrusive method to reduce the wasted time, has been gaining significant attention in the...

chapter

Symmetric Active/Active High Availability for High-Performance Computing System Services: Accomplishments and Limitations

C. Engelmann, S.L. Scott, C. Leangsuksun, X. He

2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) > 813 - 818

2008 8th International Symposium on Cluster Computing and the Grid (CCGRID '08)

This paper summarizes our efforts over the last 3-4 years in providing symmetric active/active high availability for high-performance computing (HPC) system services. This work paves the way for high-level reliability, availability and serviceability in extreme-scale HPC systems by focusing on the most critical components, head and service nodes, and by reinforcing them with appropriate high availability...

chapter

An optimal checkpoint/restart model for a large scale high performance computing system

Yudan Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, et al.

2008 IEEE International Symposium on Parallel and Distributed Processing > 1 - 9

2008 IEEE International Parallel & Distributed Processing Symposium

The increase in the physical size of high performance computing (HPC) platform makes system reliability more challenging. In order to minimize the performance loss (rollback and checkpoint overheads) due to unexpected failures or unnecessary overhead of fault tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims at addressing fault tolerance...

chapter

Symmetric Active/Active Replication for Dependent Services

C. Engelmann, S.L. Scott, C. Leangsuksun, X. He

2008 Third International Conference on Availability, Reliability and Security > 260 - 267

2008 3rd International Conference on Availability, Reliability and Security (ARES '08)

During the last several years, we have established the symmetric active/active replication model for service-level high availability and implemented several proof-of-concept prototypes. One major deficiency of our model is its inability to deal with dependent services, since its original architecture is based on the client-service model. This paper extends our model to dependent services using its...

chapter

A Framework for Proactive Fault Tolerance

G. Vallee, C. Engelmann, A. Tikotekar, T. Naughton, et al.

2008 Third International Conference on Availability, Reliability and Security > 659 - 664

2008 3rd International Conference on Availability, Reliability and Security (ARES '08)

Fault tolerance is a major concern to guarantee availability of critical services as well as application execution. Traditional approaches for fault tolerance include checkpoint/restart or duplication. However, it is also possible to anticipate failures and proactively take action before failures occur in order to minimize failure impact on the system and application execution. This document presents...

chapter

A reliability-aware approach for an optimal checkpoint/restart model in HPC environments

Yudan Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, et al.

2007 IEEE International Conference on Cluster Computing > 452 - 457

2007 IEEE International Conference on Cluster Computing (CLUSTER)

The increase in the physical size of high performance computing (HPC) platform makes system reliability more challenging. In order to minimize the performance loss due to unexpected failures or unnecessary overhead of fault tolerant mechanisms, we present a reliability-aware method for an optimal checkpoint/restart strategy towards minimizing rollback and checkpoint overheads. Our scheme aims to address...

chapter

Evaluation of fault-tolerant policies using simulation

A. Tikotekar, G. Vallee, T. Naughton, S.L. Scott, et al.

2007 IEEE International Conference on Cluster Computing > 303 - 311

2007 IEEE International Conference on Cluster Computing (CLUSTER)

Various mechanisms for fault tolerance (FT) are used today in order to reduce the impact of failures on application execution. In the case of system failure, standard FT mechanisms are checkpoint/restart (for reactive FT) and migration (for proactive FT). However, each of these mechanisms creates an overhead on application execution, an overhead that, for instance, becomes critical on large-scale systems...

chapter

Transparent Symmetric Active/Active Replication for Service-Level High Availability

C. Engelmann, S.L. Scott, C. Leangsuksun, X. He

Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07) > 755 - 760

Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07)

As service-oriented architectures become more important in parallel and distributed computing systems, individual service instance reliability as well as appropriate service redundancy becomes an essential necessity in order to increase overall system availability. This paper focuses on providing redundancy strategies using service-level replication techniques. Based on previous research using symmetric...

chapter

On Programming Models for Service-Level High Availability

C. Engelmann, S. L. Scott, C. Leangsuksun, X. He

The Second International Conference on Availability, Reliability and Security (ARES'07) > 999 - 1008

Second International Conference on Availability, Reliability and Security (ARES'07)

This paper provides an overview of existing programming models for service-level high availability and investigates their differences, similarities, advantages, and disadvantages. Its goal is to help to improve reuse of code and to allow adaptation to quality of service requirements by using a uniform programming model description. It further aims at encouraging a discussion about these programming...

chapter

IPMI-based Efficient Notification Framework for Large Scale Cluster Computing

C. Leangsuksun, T. Rao, A. Tikotekar, S.L. Scott, et al.

Sixth IEEE International Symposium on Cluster Computing and the Grid Workshops (CCGRIDW'06) > 2 > 23

Sixth IEEE International Symposium on Cluster Computing and the Grid

The demand for an efficient fault tolerance system has led to the development of complex monitoring infrastructure, which in turn has created an overwhelming task of data and event management. The increasing level of detail at the hardware and software layers clearly affects the scalability and performance of monitoring and management tools. In this paper, we propose a problem notification framework...

Keywords

CHECKPOINTING (6)
COMPUTATIONAL MODELING (3)
FAULT TOLERANCE (3)
HIGH PERFORMANCE COMPUTING (3)
PARALLEL PROCESSING (3)
SOFTWARE FAULT TOLERANCE (3)
SOFTWARE RELIABILITY (3)
CLIENT-SERVER SYSTEMS (2)
FAULT TOLERANT COMPUTING (2)
FAULT TOLERANT SYSTEMS (2)
HARDWARE (2)
KERNEL (2)
LARGE-SCALE HPC SYSTEM (2)
MAINTENANCE ENGINEERING (2)
RELIABILITY (2)
SERVICE-LEVEL HIGH AVAILABILITY (2)
SERVICEABILITY (2)
SYSTEM RECOVERY (2)
SYSTEM RELIABILITY (2)
ACCURACY (1)
ACTIVE HIGH AVAILABILITY (1)
ADAPTATION (1)
APPLICATION-LEVEL FAILURE (1)
AVAILABILITY (1)
BLUE GENE/L LOG ANALYSIS (1)
BLUE GENE/L SUPERCOMPUTER (1)
CHECKPOINT INTERVAL (1)
CHECKPOINT-RESTART MECHANISM (1)
CHECKPOINT/RESTART APPROACH (1)
CHECKPOINT/RESTART MODEL (1)
CLIENT-SERVICE MODEL (1)
CLIENT-SIDE INTERCEPTOR (1)
CLUSTERING (1)
CODE REUSE (1)
COMPLEX MONITORING INFRASTRUCTURE (1)
COMPUTER NETWORK MANAGEMENT (1)
COMPUTERISED MONITORING (1)
DATA MANAGEMENT (1)
DATA MINING (1)
DEPENDENT SERVICE-ORIENTED ARCHITECTURE (1)
DIGITAL SIMULATION (1)
DISTRIBUTED COMPUTING SYSTEM (1)
DISTRIBUTED GRID COMPUTING (1)
DUPLICATE LOG MESSAGE (1)
EVENT MANAGEMENT (1)
FAILURE DISTRIBUTIONS (1)
FAILURE IMPACT MINIMIZATION (1)
FAILURE MITIGATION (1)
FAILURE PREDICTION (1)
FAULT TOLERANCE SYSTEM (1)
FAULT TOLERANCE MECHANISM (1)
FAULT TOLERANT MECHANISM (1)
FAULT TOLERANT MECHANISMS (1)
FAULT-TOLERANCE (1)
FAULT-TOLERANT POLICY EVALUATION (1)
FILTERING (1)
FORMAL SPECIFICATION (1)
GRID COMPUTING (1)
HARDWARE CONTROLS (1)
HIGH-AVAILABILITY (1)
HIGH-LEVEL ABSTRACTION (1)
HIGH-LEVEL RELIABILITY (1)
HIGH-PERFORMANCE COMPUTING SYSTEM (1)
HIGH-PERFORMANCE COMPUTING SYSTEM SERVICES (1)
HPC (1)
HPC SYSTEM (1)
INCREMENTAL CHECKPOINT (1)
INCREMENTAL RESTART (1)
INDIVIDUAL SERVICE INSTANCE RELIABILITY (1)
IPM. (1)
IPMI-BASED EFFICIENT NOTIFICATION FRAMEWORK (1)
LARGE SCALE CLUSTER COMPUTING (1)
LARGE SCALE HIGH PERFORMANCE COMPUTING SYSTEM (1)
LARGE SCALE HPC SYSTEM (1)
LARGE-SCALE DISTRIBUTED SYSTEM EVENTS LOG ANALYSIS (1)
LARGE-SCALE SYSTEM (1)
LARGE-SCALE SYSTEMS (1)
LOG FILE ANALYSIS (1)
LOSS MEASUREMENT (1)
MANAGEMENT TOOLS (1)
MEASUREMENT UNCERTAINTY (1)
MIGRATION MECHANISM (1)
MODULAR ARCHITECTURE (1)
MONITOR SCALABILITY (1)
MONITORING (1)
MONITORING TOOLS (1)
MULTIPROCESSING SYSTEMS (1)
NODE STATISTICS (1)
NUMERICAL MODELS (1)
OBJECT-ORIENTED PROGRAMMING (1)
OPERATING SYSTEM KERNELS (1)
OPTIMAL CHECKPOINT-RESTART MODEL (1)
OPTIMAL CHECKPOINT-RESTART STRATEGY (1)
OS-KERNEL REJUVENATION (1)
PARALLEL COMPUTING (1)
PARALLEL MACHINES (1)
PERFORMANCE LOSS (1)
POISSON FAILURE (1)
PREDICTIVE MODELS (1)
PROACTIVE FAULT TOLERANCE (1)