Date of Award

3-2011

Type

Thesis

Major

Computer Science - Applied Computing Track

Department

TSYS School of Computer Science

First Advisor

Angkul Kongmunvattana

Abstract

As the number of CPU cores in high-performance computing platforms continues to grow, the availability and reliability of these systems become a primary concern. Some solutions are physical (e.g., backup power) and some are software-driven. Lawrence Berkeley National Laboratory has created a system-level, fault-tolerant checkpoint/restart implementation for Linux clusters. It allows processes to restart computations from the last known checkpoint if the system crashes. Creating checkpoint data depends heavily on system input/output (I/O) operations. This paper proposes (i) a technique to improve the efficiency of these I/O operations and (ii) an alternative checkpoint creation method to increase the availability and reliability of checkpoint data.
