Wednesday, 23 October, 2013
SPEAKER: Miroslav Stoyanov, ORNL
TITLE: Algorithm Based Solver Resilience with Respect to Silent Hardware Faults
ABSTRACT: The exponential increase of computing power over the past few decades has led to a corresponding decrease in hardware reliability. At the same time, numerical algorithms have been developed under the assumption that all computations are performed accurately. As a result, modern solvers are very susceptible to a particular class of silent errors that may modify the result of a numerical operation without any external indication. The ever-increasing rate of silent faults has presented a new and unanticipated challenge.
We extend the current framework of numerical analysis by removing the assumption that all arithmetic operations can be computed accurately to machine precision. We introduce the concept of "hardware error": an error that potentially unreliable hardware adds to our numerical approximations. Using rigorous analysis, we develop new algorithms that minimize the propagation of hardware error and therefore guarantee convergence even when faults are encountered.