Fault-Frequency Agnostic Checkpointing Strategies
Arvin Agah
Drew Davidson
Checkpointing strategies in high-performance computing traditionally employ the Young-Daly for-
mula to determine the (first-order) optimal duration between checkpoints, which assumes a known
mean time between faults (MTBF). However, in practice, the MTBF may not be known accurately
or may vary, causing Young-Daly checkpointing to perform sub-optimally. In 2021, Sigdel et al.
introduced the CHORE (CHeckpointing Overhead and Rework Equated) checkpointing strategy,
which is MTBF-agnostic yet demonstrates a bounded increase in overhead compared to the op-
timal strategy. This thesis analyzed and extends the CHORE framework in several ways. First,
it verifies Sigdel et al.’s claims about the relative overhead of the CHORE strategy through both
event-driven simulations and expected runtimes derived from the underlying probablistic model.
Second, it extends the CHORE strategy to silent errors, which must be deliberately checked for to
be detected. In this scenario, the overhead compared to optimal checkpointing is once more ana-
lyzed through simulations and expected runtimes. Third, a heuristic is proposed to offer improved
performance of the CHORE algorithm under typical runtime scenarios by interpreting CHORE as
an additive-increase multiplicative-decrease model and tuning the parameters.