
High Performance Computing Resilience

The high-performance computing community faces many challenges today. One is the difficulty of programming these machines, whose programming models often differ significantly from the traditional models taught to undergraduates. Additionally, capital and operational costs, including power and cooling, play important roles in the procurement, deployment, and ultimate use of these supercomputers. Finally, the reliability of these systems has become a focus in recent years due to a number of factors.

As the number of components in these massive computers has increased, and as transistor size has decreased, the number of faults or errors that occur at runtime has increased dramatically over the last twenty years. These faults can originate almost anywhere in the system and can happen at almost any time. Worse, some of these errors go unnoticed and can lead to erroneous answers that are then used by decision makers. This is especially true when it is difficult to compare the computed values to ground truth.

Our work has centered on three focus areas: 1) identifying and characterizing inherent resilience in operators and algorithms, 2) evaluating existing error detection and correction strategies using a soft-error fault injector, and 3) identifying, designing, and testing new resilience strategies.
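
To make the first two focus areas concrete, the following is a minimal sketch of a single-bit soft-error injection into one operand of an integer multiplication. It is an illustration only and is not the fault injector used in this project; the function names `flip_bit` and `masked_fraction`, the 64-bit word width, and the choice to compare only the low 64 bits of the product are all assumptions made for this example.

```python
def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return `value` with one bit inverted, treated as an unsigned `width`-bit word."""
    if not 0 <= bit < width:
        raise ValueError("bit index out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def masked_fraction(a: int, b: int, width: int = 64) -> float:
    """Fraction of single-bit flips in operand `a` that leave the
    low `width` bits of a * b unchanged (hypothetical metric)."""
    mask = (1 << width) - 1
    golden = (a * b) & mask  # fault-free result
    masked = sum(
        1
        for bit in range(width)
        if ((flip_bit(a, bit, width) * b) & mask) == golden
    )
    return masked / width


if __name__ == "__main__":
    print(masked_fraction(12345, 0))        # 1.0: multiplying by zero hides every flip
    print(masked_fraction(12345, 3))        # 0.0: an odd multiplier exposes every flip
    print(masked_fraction(12345, 1 << 32))  # 0.5: flips in the upper 32 bits overflow away
```

In this toy model, how often a flipped bit is masked depends entirely on the other operand, which is one simple way an operator can show inherent resilience to some faults and none to others.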

This work is funded under contract with the New Mexico Consortium / Los Alamos National Laboratory.

Publications

Laura Monroe, William M. Jones, Scott R. Lavigne, Claude H. Davis IV, Qiang Guan, and Nathan DeBardeleben, “On the Inherent Resilience of Integer Operations”, Euro-Par 2016: 22nd International European Conference on Parallel and Distributed Computing / 9th Workshop on Resiliency in High Performance Computing (Resilience) in Clusters, Clouds, and Grids, August 22-26, 2016 (accepted for publication, in press).

Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Song Fu, Claude H. Davis IV, and William M. Jones, “Analyzing the Robustness of HPC Applications Using a Fine-Grained Soft Error Fault Injection Tool”, Innovative Research and Applications in Next-Generation High Performance Computing, IGI Global, DOI: 10.4018/978-1-5225-0287-6.ch011, June, 2016 (book chapter, in print).

Guan Q, DeBardeleben N, Atkinson B, Robey R, Jones W. 2015. Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with CLAMR Hydrodynamics Mini-App. IEEE Cluster 2015.

Atkinson B, DeBardeleben N, Guan Q, Robey R, Jones WM. 2014. Fault Injection Experiments with the CLAMR Hydrodynamics Mini-App. Software Reliability Engineering Workshops (ISSREW), 2014 IEEE International Symposium: 6-9.

Jones WM, Daly JT, DeBardeleben N. 2012. Application monitoring and checkpointing in HPC: looking towards exascale systems. Proceedings of the 50th Annual Southeast Regional Conference: 262-267.

Posters

Brian Atkinson, Walter Ligon III, Nathan DeBardeleben, Qiang Guan, Sean Blanchard, Bob Robey, William M. Jones, "Fault Injection, Detection, and Correction in CLAMR Using F-SEFI", SC14 - International Conference for High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November 17-21, 2014.

Laura Monroe, William M. Jones, Scott R. Lavigne, Claude H. Davis IV, Qiang Guan, and Nathan DeBardeleben, “On the Resilience of Integer Operators”, SELSE 2016 – 12th Workshop on Silicon Errors in Logic - System Effects, Austin, TX, USA, March 29-30, 2016 (paper and poster; article copyright retained, informal proceedings only on USB distribution, poster published with conference).

Brian Atkinson, Walter Ligon III, Nathan DeBardeleben, Qiang Guan, Sean Blanchard, Bob Robey, William M. Jones, “Fault Injection, Detection, and Correction in CLAMR Using F-SEFI”, Salishan 2015 – Conference on High Speed Computing, Gleneden Beach, Oregon, April 27, 2015 (poster, by invitation only at annual US DOE tri-lab conference, no official dissemination).

Talks

"Multiplicative Resilience: A Soft Error Fault Injection Study", Clemson University Booth, International Conference for High Performance Computing, Networking, Storage, and Analysis, November 19, 2015, Austin, TX.

"Multiplicative Resilience: A Double-Edged Sword", New Mexico Consortium / Ultra Scale Research Center, Los Alamos National Laboratory, June 4, 2015, Los Alamos, NM.

"Evaluating the Fault-Tolerance of the CLAMR Hydrodynamics Mini-App with the F-SEFI Fault Injector", Clemson University Booth, International Conference for High Performance Computing, Networking, Storage, and Analysis, November 19, 2014, New Orleans, LA.

"ABFT Matrix Multiplication: Theory, Practice, ... & more Practice", New Mexico Consortium / Ultra Scale Research Center, Los Alamos National Laboratory, July 27, 2014, Los Alamos, NM.

Personnel

Local Team:
Dr. William M. Jones, Coastal Carolina University, Professor

Mr. Scott Lavigne, Coastal Carolina University, Student

Mr. Terence Grové, Coastal Carolina University, Student

Ms. Alexandra Poulos, Coastal Carolina University, Student

NMC / LANL Team:
Dr. Nathan A. DeBardeleben, Los Alamos National Laboratory, Team Leader

Dr. Laura Monroe, Los Alamos National Laboratory, Mathematician

Dr. Qiang Guan, Los Alamos National Laboratory, Scientist

Former Personnel:

Mr. Rusty Davis, Clemson University, Student

Links

New Mexico Consortium:
http://newmexicoconsortium.org

Ultra-Scale Research Center:
https://newmexicoconsortium.org/research/advanced-computing/usrc

Los Alamos National Laboratory:
http://www.lanl.gov