The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
As the size of large-scale computer systems increases, their mean-time-between-failures are becoming significantly shorter than the execution time of many current scientific applications. To complete the execution of scientific applications, they must tolerate hardware failures. Conventional rollback-recovery protocols redo the computation of the crashed process since the last checkpoint on a single...
Fault tolerance is a critical issue in the arena of large-scale computing. The fault-tolerant parallel algorithm (FTPA) is an application-level technique for tolerating hardware failures. FTPA achieves fast failure recovery making use of parallel recomputing. However, it complicates the coding of the application program. This paper uses compiler technology to automate the design of FTPA, and introduces...
Application-level checkpointing can decrease the overhead of fault tolerance by minimizing the amount of checkpoint data. However this technique requires the programmer to manually choose the critical data that should be saved. In this paper, we firstly propose a live-variable analysis method for MPI programs. Then, we provide an optimization method of data saving for application-level checkpointing...
As the size of today's high performance computers continue to grow, node failures in these computers are becoming frequent events. Although checkpoint is the typical technique to tolerate such failures, it often introduces a considerable overhead and has shown poor scalability on today's large scale systems. In this paper we defined a new term called fault tolerant parallel algorithm which means that...
This paper addresses the issue of fault tolerance in parallel computing, and proposes a new method named parallel recomputing. Such method achieves fault recovery automatically by using surviving processes to recompute the workload of failed processes in parallel. The paper firstly defines the fault tolerant parallel algorithm (FTPA) as the parallel algorithm which tolerates failures by parallel recomputing...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.