A Checkpoint/Restart Scheme for CUDA Applications with Complex Memory Hierarchy

Xinyuan Guo; Hai Jiang; Kuan-Ching Li

doi:10.1109/SNPD.2013.5

Source

2013 14th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, pp. 247-252

Abstract

Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many scientific applications. However, as GPUs take on a much bigger role in high-performance computing, there is not yet an effective checkpoint/restart scheme for them because of the GPU's batch-mode execution. This paper proposes an application-level checkpoint/restart scheme to save and restore GPU computation states. A precompiler and a run-time support module are developed to construct and save states in CPU system memory dynamically. Secondary storage can be used for scalability and long-term fault tolerance. CUDA applications with complicated memory use are supported as well. Experimental results demonstrate the effectiveness of the proposed scheme.

Identifiers

Book e-ISBN: 978-0-7695-5005-3
DOI: 10.1109/SNPD.2013.5

Keywords

Graphics processing units; Kernel; Libraries; Arrays; Radiation detectors; Registers; checkpoint/restart; GPU; CUDA

Additional information

Data set: ieee

Publisher

IEEE
