OCFTL: an implementation-independent fault tolerance library for MPI
Fault tolerance (FT) is a common concern in HPC environments. One would expect FT to be a solved problem for the Message Passing Interface (MPI), an HPC tool of paramount importance. In practice, the situation is intricate: while FT is a reality in these environments, it is usually implemented by hand, and the few exceptions available tie users to specific MPI implementations. This work proposes OCFTL, an implementation-independent FT library for MPI to be used in OmpCluster. OCFTL can detect failures with a delay of only 50 ms and low CPU overhead. It also identifies false-positive failure detections, repairs MPI communicators, and isolates users from the unspecified behavior of MPI operations in the presence of failures. This work also discusses the relationship between FT and scheduling, which are normally treated separately, and proposes a model that integrates the two by taking into account the characteristics of the tasks and of the target computing nodes. Preliminary experimental results indicate good potential to improve system reliability and the execution makespan of scientific workflows.