Dynamic Selective Protection for Sparse Iterative Solvers
Sumaiya Shomaji
Suzanne Shontz
Soft errors occur frequently on large-scale computing platforms, primarily because of the growing size and complexity of high-performance computing (HPC) systems. To protect scientific applications against such errors, various resilience techniques have been proposed, including checkpointing, Algorithm-Based Fault Tolerance (ABFT), and replication, each operating at a different level of protection. System-level replication in particular often requires duplicating or triplicating the entire computation, which incurs substantial resilience overhead. This project introduces a method for dynamic selective protection of sparse iterative solvers, with a focus on the Preconditioned Conjugate Gradient (PCG) solver, that aims to reduce system-level resilience overhead. In this method, we use machine learning (ML) to predict the impact of soft errors that strike different elements of a key computation (i.e., sparse matrix-vector multiplication) at different iterations of the solver. Based on the prediction, we design a dynamic strategy that selectively protects those elements that would cause a large performance degradation if struck by soft errors. Experimental evaluation shows that our dynamic protection strategy reduces resilience overhead compared with existing algorithms.
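To make the idea of selective protection concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a Boolean mask `protect` that marks which output rows of the matrix-vector product are worth guarding; in the method described above, such a mask would come from the ML prediction and could change from one solver iteration to the next, whereas here it is simply given. The function name `selective_spmv`, the `fault` parameter used to simulate a soft error, and the use of a small dense NumPy array in place of a sparse matrix are all assumptions made for illustration.

```python
import numpy as np

def selective_spmv(A, x, protect, fault=None):
    """Compute y = A @ x and redundantly re-check only the protected rows.

    A       : 2-D array (dense here for simplicity; a sparse format would
              be used in practice)
    x       : 1-D input vector
    protect : Boolean mask, True for rows to guard (hypothetical; stands in
              for the output of an impact predictor)
    fault   : optional (row, delta) pair that simulates a soft error by
              perturbing one entry of the first computation
    Returns the result vector and the number of corrected entries.
    """
    y = A @ x
    if fault is not None:
        row, delta = fault
        y[row] += delta  # simulated transient error in the first pass

    # Redundant computation restricted to the protected rows.
    rows = np.flatnonzero(protect)
    y_check = A[rows] @ x

    # Assumption: the recomputed value is fault-free, so any disagreement
    # beyond floating-point tolerance is repaired from the second pass.
    mismatch = ~np.isclose(y[rows], y_check)
    y[rows[mismatch]] = y_check[mismatch]
    return y, int(mismatch.sum())


if __name__ == "__main__":
    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
    x = np.array([1.0, 2.0, 3.0])
    protect = np.array([True, False, False])  # guard row 0 only

    # Error in a protected row: detected and repaired.
    y, fixed = selective_spmv(A, x, protect, fault=(0, 100.0))
    print(y, fixed)

    # Error in an unprotected row: passes through unnoticed.
    y, fixed = selective_spmv(A, x, protect, fault=(1, 100.0))
    print(y, fixed)
```

The extra work is proportional to the number of protected rows rather than to the whole product, which is the source of the savings over full replication. The sketch only detects perturbations larger than the default tolerance of `np.isclose`, and it trusts the second computation; a real scheme would need to decide how to handle a fault in the redundant pass.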