Date of Award
Computer Science - Applied Computing Track
TSYS School of Computer Science
As the number of CPU cores in high-performance computing platforms continues to grow, the availability and reliability of these systems become a primary concern. Some solutions are physical (e.g., backup power) and some are software-driven. Lawrence Berkeley National Laboratory has created a system-level, fault-tolerant checkpoint/restart implementation for Linux clusters. It allows processes to restart computations from the last known checkpoint if the system crashes. Creating checkpoint data depends heavily on system input/output (I/O) operations. This paper proposes (i) a technique to improve the efficiency of these I/O operations and (ii) an alternative checkpoint creation method to increase the availability and reliability of checkpoint data.
Cornwell, Jason, "Advanced I/O Techniques for Efficient and Highly Available Process Crash Recovery Protocols" (2011). Theses and Dissertations. 8.