Paper Presentation at SBAC-PAD 2022 (Bordeaux, France)
Vanderlei Munhoz presented his work, entitled "Strategies for Fault-Tolerant Tightly-coupled HPC Workloads Running on Low-Budget Spot Cloud Infrastructures", at the IEEE International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2022), held in Bordeaux, France. The work was coauthored by Prof. Márcio Castro and Prof. Odorico Mendizabal.
In this work, Vanderlei evaluated the viability of budget-constrained cloud environments for tightly-coupled MPI applications, exploring both spot and traditional low-budget infrastructures on real public cloud platforms. Two fault tolerance strategies tailored to unreliable spot cloud environments were proposed: system-level rollback restart with Berkeley Lab Checkpoint/Restart (BLCR) and in-memory rollback restart with User-Level Failure Mitigation (ULFM). A provider-agnostic empirical method for testing and predicting MPI workload execution times and cloud infrastructure costs was also proposed. The results showed that: (i) adequate cluster sizing plays an important role in overall job execution performance and cost-effectiveness, regardless of the instance type selected; (ii) fault tolerance strategies based on BLCR may perform worse than those based on ULFM, but can still be cost-effective once software migration costs are considered; (iii) the use of spot infrastructure does not guarantee cost savings, which depend on the chosen machine flavors and discounts, as experiments with persistent low-budget options attained better cost-effectiveness under some conditions.
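Finding (iii) can be illustrated with a rough back-of-the-envelope sketch. The cost model below is not the empirical method from the paper; it is a simplified assumption in which each spot revocation adds a fixed restart overhead plus some recomputed work. All function names, prices, and failure counts are hypothetical and chosen only to show how a spot discount can be cancelled out by rollback time.

```python
def job_cost(hourly_price, nodes, runtime_hours):
    """Total cost of renting `nodes` instances for `runtime_hours`."""
    return hourly_price * nodes * runtime_hours


def runtime_with_failures(base_hours, failures, restart_hours, lost_work_hours):
    """Runtime when each failure costs a restart plus recomputed work.

    Assumption: every failure has the same overhead (a simplification).
    """
    return base_hours + failures * (restart_hours + lost_work_hours)


# Hypothetical numbers, not taken from the paper.
NODES = 4
BASE_HOURS = 10.0
PERSISTENT_PRICE = 0.10   # price per node-hour, persistent low-budget instance
SPOT_PRICE = 0.07         # same flavor with an assumed 30% spot discount

persistent_cost = job_cost(PERSISTENT_PRICE, NODES, BASE_HOURS)

# Spot run with no revocations: the discount translates directly into savings.
spot_cost_no_failures = job_cost(SPOT_PRICE, NODES, BASE_HOURS)

# Spot run with three revocations, each costing 0.5 h to restart
# and 1.0 h of work recomputed since the last checkpoint.
spot_hours = runtime_with_failures(BASE_HOURS, failures=3,
                                   restart_hours=0.5, lost_work_hours=1.0)
spot_cost_with_failures = job_cost(SPOT_PRICE, NODES, spot_hours)

print(f"persistent:            {persistent_cost:.2f}")
print(f"spot, no failures:     {spot_cost_no_failures:.2f}")
print(f"spot, three failures:  {spot_cost_with_failures:.2f}")
```

Under these assumed numbers the failure-free spot run is cheaper than the persistent one, while the run with three revocations ends up slightly more expensive, which is the kind of trade-off the experiments observed under some conditions.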