Fault-Frequency Agnostic Checkpointing Strategies


Student Name: Joseph Vinduska
Defense Date:
Location: Eaton Hall, Room 2001B
Chair: Hongyang Sun

Arvin Agah

Drew Davidson

Abstract:

Checkpointing strategies in high-performance computing traditionally employ the Young-Daly for-

mula to determine the (first-order) optimal duration between checkpoints, which assumes a known

mean time between faults (MTBF). However, in practice, the MTBF may not be known accurately

or may vary, causing Young-Daly checkpointing to perform sub-optimally. In 2021, Sigdel et al.

introduced the CHORE (CHeckpointing Overhead and Rework Equated) checkpointing strategy,

which is MTBF-agnostic yet demonstrates a bounded increase in overhead compared to the op-

timal strategy. This thesis analyzed and extends the CHORE framework in several ways. First,

it verifies Sigdel et al.’s claims about the relative overhead of the CHORE strategy through both

event-driven simulations and expected runtimes derived from the underlying probablistic model.

Second, it extends the CHORE strategy to silent errors, which must be deliberately checked for to

be detected. In this scenario, the overhead compared to optimal checkpointing is once more ana-

lyzed through simulations and expected runtimes. Third, a heuristic is proposed to offer improved

performance of the CHORE algorithm under typical runtime scenarios by interpreting CHORE as

an additive-increase multiplicative-decrease model and tuning the parameters.

Degree: MS Thesis Defense (CS)
Degree Type: MS Thesis Defense
Degree Field: Computer Science