Reliability-aware performance model for optimal GPU-enabled cluster environment

Supada Laosooksathit; Raja Nassar; Chokchai Leangsuksun; Mihaela Paun

doi:10.1007/s11227-014-1128-7

Source

The Journal of Supercomputing, 2014, Volume 68, Issue 3, pp. 1630-1651

Abstract

Given that the reliability of a very large-scale system is inversely related to the number of computing elements, fault tolerance has become a major concern in high performance computing, including the most recent deployments with graphics processing units (GPUs). Many fault tolerance strategies, such as the checkpoint/restart mechanism, have been studied to mitigate failures within such systems. However, fault tolerance mechanisms generate additional costs, and these may cause a significant performance drop if the mechanisms are not used carefully. This paper presents a novel fault tolerance scheduling model that explores the interplay between GPGPU application performance and the reliability of a large GPU system. This work focuses on a checkpoint scheduling model that aims to minimize fault tolerance costs. Additionally, a GPU performance analysis is conducted. Furthermore, the effect of a checkpoint/restart mechanism on application performance is thoroughly studied and discussed.
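
The trade-off described in the abstract can be illustrated with a minimal sketch. It does not reproduce the paper's scheduling model. It assumes independent, exponentially distributed node failures (so the system MTBF is the node MTBF divided by the node count) and uses Young's classical first-order approximation for the checkpoint interval. The function names and the numeric values are hypothetical and chosen only for illustration.

```python
import math


def system_mtbf(node_mtbf_hours: float, num_nodes: int) -> float:
    """MTBF of a system that fails as soon as any one node fails.

    Assumption: node failures are independent and exponentially
    distributed, so the failure rates simply add up.
    """
    return node_mtbf_hours / num_nodes


def young_interval(checkpoint_cost_hours: float, mtbf_hours: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    This is a classical textbook estimate, not the model from the paper.
    """
    return math.sqrt(2.0 * checkpoint_cost_hours * mtbf_hours)


if __name__ == "__main__":
    # Hypothetical values: 1,000 nodes, each with a 50,000-hour MTBF,
    # and a checkpoint that takes 0.1 hours (6 minutes) to write.
    mtbf = system_mtbf(node_mtbf_hours=50_000.0, num_nodes=1_000)
    interval = young_interval(checkpoint_cost_hours=0.1, mtbf_hours=mtbf)
    print(f"system MTBF: {mtbf:.1f} h, checkpoint interval: {interval:.2f} h")
```

Under these assumptions, adding nodes shortens the system MTBF, which in turn shortens the suggested checkpoint interval and raises the share of run time spent on checkpointing.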

Identifiers

Journal ISSN: 0920-8542
Journal e-ISSN: 1573-0484
DOI: 10.1007/s11227-014-1128-7

Authors

Supada Laosooksathit

Louisiana Tech University, Department of Computer Science, Ruston, USA

Raja Nassar

Louisiana Tech University, Department of Mathematics and Statistics, Ruston, USA

Chokchai Leangsuksun

Louisiana Tech University, Department of Computer Science, Ruston, USA

Mihaela Paun

Louisiana Tech University, Department of Mathematics and Statistics, Ruston, USA
National Institute for Research and Development for Biological Sciences, Bucharest, Romania

Keywords

GPUs; Reliability; Fault tolerance; Checkpoint scheduling

Additional information

Publication languages: English

Data set: Springer

Publisher

Springer US
