OCFTL: an implementation-independent fault tolerance library for MPI
Fault tolerance (FT) is a common concern in HPC environments. One would expect that for the Message Passing Interface (MPI), an HPC tool of paramount importance, FT would be a solved problem. In practice, the relationship between FT and MPI is intricate. Although FT is a reality in these environments, it is usually implemented by hand, and the few exceptions available tie users to specific MPI implementations. This work proposes OCFTL, an implementation-independent FT library for MPI, intended for use in OmpCluster. OCFTL detects failures with a delay of only 50 ms and low CPU overhead. It also detects false positives in failure detection, repairs MPI communicators, and isolates users from the unspecified behavior of MPI operations in the presence of failures. Experimental results indicate good potential to improve system reliability during the execution of scientific workflows.