Extreme-scale High Performance Computing (HPC) environments with millions of cores are required to run scientific simulations. At that scale, a fault tolerance mechanism becomes much more important, because both the correctness of the application's results and an acceptable utilization rate are vital. One of the main faults in such environments is data corruption that goes unnoticed by the fault tolerance mechanisms implemented at the hardware level. Errors of this kind are collectively called Silent Data Corruption (SDC).
MACORD is an efficient framework for detecting SDC in HPC applications. It is built on a supervised machine learning algorithm. It is also an online framework that requires very little memory and adds little execution overhead.
Various SDC detection approaches already exist, but all of them use a pre-set learning algorithm. MACORD, in contrast, selects the most suitable algorithm from a set of five at runtime by studying the dynamically observed data. MACORD also adjusts automatically to the user's detection conditions, such as how much prediction error can be tolerated. The detection method in the MACORD framework involves two main steps: predicting the value of each data point using the adaptive learning framework, and evaluating the observed value of each data point to see whether it falls within the confidence range.
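The select-then-check flow described above can be illustrated with a minimal Python sketch. It is not MACORD's implementation: the source does not name the five algorithms, so the three predictors below (`last_value`, `linear_extrapolation`, `window_mean`) are hypothetical stand-ins, and the selection rule (smallest error on the most recent known value) and the `tolerance` parameter are assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical stand-in predictors; the source does not say which
# five algorithms MACORD chooses from.
def last_value(history: Sequence[float]) -> float:
    return history[-1]

def linear_extrapolation(history: Sequence[float]) -> float:
    return 2 * history[-1] - history[-2]

def window_mean(history: Sequence[float]) -> float:
    window = history[-3:]
    return sum(window) / len(window)

PREDICTORS = {
    "last_value": last_value,
    "linear_extrapolation": linear_extrapolation,
    "window_mean": window_mean,
}

def select_predictor(history: Sequence[float]) -> str:
    """Assumed rule: pick the predictor that would have best predicted
    the most recent known value from the values before it."""
    past, target = history[:-1], history[-1]
    return min(PREDICTORS, key=lambda name: abs(PREDICTORS[name](past) - target))

def check_point(history: Sequence[float], observed: float,
                tolerance: float) -> Tuple[str, float, bool]:
    """Step 1: predict the next value. Step 2: flag the observed value
    if it falls outside predicted +/- tolerance."""
    if len(history) < 3:
        raise ValueError("need at least three past values")
    name = select_predictor(history)
    predicted = PREDICTORS[name](history)
    suspicious = abs(observed - predicted) > tolerance
    return name, predicted, suspicious

# Usage: a steadily increasing series favors linear extrapolation.
history = [1.0, 2.0, 3.0, 4.0]
print(check_point(history, observed=5.1, tolerance=0.5))
print(check_point(history, observed=9.0, tolerance=0.5))
```

Here `tolerance` plays the role of the user-supplied detection condition: a larger value tolerates more prediction error before a data point is flagged.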
The MACORD framework is built in two steps. The first step identifies five SDC learning algorithms that can be applied online at runtime. It starts from 11 state-of-the-art SDC detection learning algorithms, and a performance analysis is then conducted on them to select the five. The second step creates the main algorithm of MACORD, which dynamically selects the best algorithm. For this, MACORD uses a spatial regression technique that relies on neighboring data points. The framework is also easily extensible: more online-applicable algorithms or more detection conditions can be added to adapt to new user demands. Detection requires formalizing a detection range. The range is formalized so that it can be enlarged when a false positive occurs, which helps minimize subsequent false positives. The impact error bound signifies the amount of error that can be safely ignored.
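As a rough illustration of prediction from neighboring data points and of a detection range that grows after a false positive, consider the Python sketch below. It makes several assumptions the source does not confirm: a plain average of the two adjacent points on a 1-D grid stands in for the spatial regression, the `DetectionRange` class and its method names are hypothetical, and the enlargement rule (widen the radius just enough to cover the falsely flagged value) is only one possible choice.

```python
from typing import Sequence

def predict_from_neighbors(grid: Sequence[float], i: int) -> float:
    """Stand-in for spatial regression: average the adjacent points
    of a 1-D grid (assumes the grid has at least two points)."""
    neighbors = [grid[j] for j in (i - 1, i + 1) if 0 <= j < len(grid)]
    return sum(neighbors) / len(neighbors)

class DetectionRange:
    """Symmetric range around the predicted value (hypothetical helper)."""

    def __init__(self, radius: float) -> None:
        # Assumption: the initial radius could be seeded from the
        # impact error bound, i.e. the error that can be safely ignored.
        self.radius = radius

    def contains(self, predicted: float, observed: float) -> bool:
        return abs(observed - predicted) <= self.radius

    def enlarge_after_false_positive(self, predicted: float,
                                     observed: float) -> None:
        # Assumed rule: widen the range so the value that was wrongly
        # flagged would now be accepted.
        self.radius = max(self.radius, abs(observed - predicted))

# Usage: the point at index 2 is flagged, then judged a false positive.
grid = [1.0, 2.0, 3.5, 4.0]
predicted = predict_from_neighbors(grid, 2)        # (2.0 + 4.0) / 2 = 3.0
detection_range = DetectionRange(radius=0.25)
flagged = not detection_range.contains(predicted, grid[2])
if flagged:
    detection_range.enlarge_after_false_positive(predicted, grid[2])
```

After the enlargement, a deviation of the same size no longer triggers a detection, which is the effect the text describes as minimizing subsequent false positives.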