The high-performance computing community faces many challenges today. One is programmability: the programming models for these machines often differ significantly from the traditional models taught to undergraduates. Additionally, capital and operational costs, including power and cooling, play important roles in the procurement, deployment, and ultimate use of these supercomputers. Finally, the reliability of these systems has become a focus in recent years due to a number of factors.
As the number of components in these massive computers has increased, and as transistor size has decreased, the number of faults or errors that occur at runtime has dramatically increased over the last twenty years. These faults can originate at almost any place and can happen at almost any time. Worse, these errors sometimes go unnoticed and can lead to erroneous answers that are then used by decision makers. This is especially the case in situations where it is difficult to compare the computed values to ground truth.
Our work has been centered on three focus areas: 1) identifying and characterizing inherent resilience in operators and algorithms, 2) evaluating existing error detection and correction strategies using a soft-error fault injector, and 3) identifying, designing, and testing new resilience strategies. This work is funded under contract with the New Mexico Consortium / Los Alamos National Laboratory.
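To make the idea of soft-error fault injection concrete, the following is a minimal sketch, not the injector used in this project. It assumes a single bit flip in one operand of a 32-bit unsigned integer addition and reports the relative error of the result. The function names (`flip_bit`, `relative_error`, `inject_into_add`) and the 32-bit word width are illustrative assumptions.

```python
def flip_bit(value: int, bit: int, width: int = 32) -> int:
    """Return `value` with one bit inverted, treated as an unsigned `width`-bit word."""
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask


def relative_error(expected: int, observed: int) -> float:
    """Relative error of `observed`; falls back to absolute error when expected is 0."""
    if expected == 0:
        return float(abs(observed))
    return abs(observed - expected) / abs(expected)


def inject_into_add(a: int, b: int, bit: int, width: int = 32) -> float:
    """Flip one bit of operand `a` before computing a + b and measure the damage."""
    mask = (1 << width) - 1
    expected = (a + b) & mask                      # fault-free result
    observed = (flip_bit(a, bit, width) + b) & mask  # result with the injected fault
    return relative_error(expected, observed)


if __name__ == "__main__":
    # Hypothetical operands chosen only for illustration.
    a, b = 1000, 2345
    for bit in (0, 8, 16, 31):
        print(f"bit {bit:2d}: relative error = {inject_into_add(a, b, bit):.6g}")
```

In this sketch a flip in a low-order bit perturbs the sum only slightly, while a flip in a high-order bit changes it by many orders of magnitude. Differences of this kind are one simple way to think about how much resilience an operator has to a given fault.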
Computer Science professor Dr. William Jones has a long-standing collaboration with the US Department of Energy’s Los Alamos National Laboratory (LANL) in Los Alamos, NM. This collaboration has resulted in many opportunities for CCU students, including internships on-site at CCU as well as at LANL, full-time job placements, and research publications. Additionally, several of Dr. Jones’ students have gone on to pursue advanced training in graduate school.
- News: USRC Provides Opportunities to Coastal Carolina University Students
- New Mexico Consortium Facebook
- Bob Robey Visits CCU
CCU-LANL Student Bios
Coastal Carolina University and Los Alamos National Laboratory (LANL), in Los Alamos, New Mexico, have a strong relationship that has involved many CCU students over the past few years and has resulted in several year-long research assistantships at CCU, summer internships and permanent job placements at the laboratory. Let’s take a look at the students that have been involved in the LANL-CCU research project.
(CCU temporary staff employee, BS and MS in Computer Science, Clemson University)
Rusty, a (then) Clemson University undergraduate student in Computer Science, approached Dr. Jones in April 2014 to see if there were any interesting projects going on during the summer of 2014 that he could work on to get additional experience. Dr. Jones had just been contacted by LANL about some possible work in application-based fault tolerance and fault injection, and he offered to work with Rusty on this (then) unfunded LANL research project. At the end of the summer, Dr. Jones and Rusty traveled to Los Alamos to present their initial results (photo below), and this resulted in an offer to fund the work in the following school year. Once funded, Dr. Jones hired Rusty as a part-time temporary staff employee at CCU (since he was a student at Clemson, not CCU). This initial collaboration would ultimately result in over 5 years of continuous funding at CCU to work with LANL. In November 2014, Rusty attended the 2014 Supercomputing Conference and presented his work at the Clemson booth (photo below). This collaboration resulted in three publications together with Dr. Jones: 1) Euro-par 2016: 22nd International European Conference on Parallel and Distributed Computing / 9th Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids, 2) the 2016 Workshop on Silicon Errors in Logic - System Effects, and 3) a book chapter in the 2016 Innovative Research and Applications in Next-Generation High-Performance Computing. The collaboration also resulted in a fully-funded research assistantship for Rusty to pursue the MS in Computer Science at Clemson University. He completed two internships at LANL during the summers of 2016 and 2017 as a graduate student. After graduating from Clemson with the MS in May 2018, he was hired full time as a staff scientist at LANL, where he has been working since then.
(BS in Computer Science from CCU, May 2014, MS in Computer Engineering from Clemson, May 2018)
Our students first got involved with LANL in the late spring of 2014. Brian Atkinson, a CCU CS graduate (May 2014), was invited for a 2014 summer internship at LANL to work on improving the resilience and reliability of a scientific application. The invitation came after he had met LANL staff at several Supercomputing Conferences and had worked for Dr. William Jones on an externally funded parallel filesystem project (summer 2011 and 2012, through a Clemson University subcontract). Brian later attended Clemson University, where he earned a Master of Science in Computer Engineering with a fully-funded research assistantship. In the summer of 2018 he started full time at LANL as a contract employee through the New Mexico Consortium, as a researcher and programmer working on large-scale storage and file systems, an extension of the work that he did for the MS at Clemson. More recently (effective May 2019), Brian was promoted to staff scientist, working directly for LANL. Brian’s work with Dr. Jones at CCU and with colleagues at LANL has resulted in several scholarly publications and a promising career.
(BS in Computer Science from CCU, May 2017)
Terry Grove began working with Dr. William Jones as a research assistant during his senior year (Fall 2016 / Spring 2017) on a project that was funded by LANL. Terry presented his initial work at the 2016 Supercomputing Conference (see photo), and continued this work after graduation as a “post-bachelor” full-time employee in the summer of 2017. Terry’s work ethic, combined with his technical skills, caught the eye of several managers at LANL, and Terry was promoted to staff scientist in early Fall 2017.
(BS in Computer Science and BS in Applied Math from CCU, December 2018)
Alex was hired as a research assistant in August 2016 to work with Dr. William Jones on his externally funded contract with LANL. Alex has amazing talents in computer science and programming, and her background in mathematics was an important contributing factor to her success with research at CCU and at LANL. Alex attended the 2016, 2017, and 2018 Supercomputing Conferences with Dr. Jones, and was invited for summer internships at LANL during each of these summers (see 2016 photo). Alex’s work with CCU and LANL ultimately led to a publication that was presented at the 2018 FTXS (Fault Tolerance at eXtreme Scale) workshop, co-located with the SC18 conference (see photos). Alex’s presentation at the 2018 FTXS workshop caught the attention of the US Department of Defense, which invited Alex for an internship during the late spring and summer of 2019. Alex has been accepted to the Ph.D. program in Computer Engineering at Clemson University with a fully-funded fellowship, where she intends to start in August 2019.
Megan Hickman Fulp
(BS in Computer Science from CCU, May 2019)
Megan was hired as a research assistant in August 2017 to work on Dr. Jones’ research with LANL, focusing primarily on data analytics for HPC resilience. She was invited for a summer internship at LANL in 2018, and this work resulted in a scholarly publication at the 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). Although she was unable to attend that conference due to a scheduling conflict, she was able to attend the 2017 Supercomputing Conference. It was there that she first met Dr. Jon Calhoun, a professor of Computer Engineering at Clemson University. Her work at CCU and LANL, and the associated publication, caught his attention and resulted in a fully-funded research assistantship to pursue the Master of Science in Computer Engineering at Clemson University. After interning for a second time at LANL during the summer of 2019, Megan will be starting at Clemson in August 2019 to begin her MS with Dr. Jon Calhoun as her thesis advisor.
(BS Applied Physics, May 2015, BS in Computer Science, May 2017, from CCU)
After Dakota graduated from CCU in May 2017 with the BS in Computer Science, Dr. Jones hired him as a part-time temporary staff employee at CCU to conduct research on the LANL contract. Dakota was still living and working in the area at Wetstone Technologies as a cyber-security analyst while his (then) fiancée, Megan Hickman Fulp, was finishing her BS in Computer Science. This gave Dakota an excellent opportunity to expand his skills and obtain additional experience in the computing sciences. Dakota attended the 2017 Supercomputing Conference with Dr. Jones, and met Dr. Jon Calhoun (photo at CCU below), an ECE professor at Clemson University who would later become Dakota’s MS thesis advisor. The initial work with Dr. Jones at CCU resulted in an invitation to intern at LANL during the summer of 2018, and this resulted in a scholarly publication (along with Megan Hickman Fulp) at the 2018 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW). Dakota will be interning for a second time at LANL during the summer of 2019. He was offered a fully-funded research assistantship at Clemson University to pursue the MS in Computer Engineering with Dr. Jon Calhoun as his thesis advisor, where he will be starting in August 2019.
(BS in Computer Science at CCU, expected December 2019)
Dr. Jones hired Dylan in February 2018 to work with him on his research with LANL, specifically to help with visualizing and classifying the data produced from work by Dr. Jones and Alex Poulos. Dylan was later invited to complete an internship at LANL during the summer of 2018, and this resulted in the inclusion of some of Dylan’s work in a workshop (FTXS) publication at the 2018 Supercomputing Conference, which he also attended (see photo above and below). Dylan has been invited to return to LANL for another summer internship in May 2019. Dylan would like to pursue a position at LANL after graduating from CCU, tentatively in December 2019.
(BS in Applied Mathematics, minor in Computer Science at CCU expected 2020)
Dr. Jones first met Cannon while the two were taking MATH 407 together during the Spring 2018 semester, taught by Dr. Ogul Arslan (math department). Dr. Jones was immediately impressed by Cannon’s knowledge and apparent intelligence, as well as his inquisitive nature, and thus hired Cannon in August 2018 to work with him on his LANL research contract. As an applied mathematics major and computer science minor, Cannon was ideally positioned to extend the error detection and correction work being done by Dr. Jones, Alex Poulos, et al. Cannon’s research is much more mathematical in focus, with a specialization in error coding designs, and as such, Dr. Tom Hoffman and Dr. Ogul Arslan of CCU’s Department of Applied Mathematics have been serving as Cannon’s research mentors while at CCU. Cannon attended the 2018 Supercomputing Conference with Dr. Jones (photo above) and has been invited to complete a summer internship at LANL during summer 2019, where he will be working directly with Dr. Laura Monroe (LANL mathematician), pictured below when visiting CCU in March 2018.
(BS in Computer Science and BS in Applied Mathematics CCU, May 2017)
LANL hired Stephen Penton in a “post-bachelor” position as a contract employee through the New Mexico Consortium (NMC) starting in January 2019. While at CCU, Stephen worked on two different research projects, both with Dr. Erin Hackett in the Department of Coastal and Marine Systems Science. Stephen graduated from CCU in May 2017 with two BS degrees, one in Computer Science and one in Applied Mathematics. He then worked as a licensing analyst at Artech Consulting LLC from October 2017 to January 2019 before starting at LANL.
Publications Involving Students
(9 publications involving 7 different students)
Alexandra Poulos, Dylan Wallace, Robert Robey, Laura Monroe, Vanessa Job, Sean Blanchard, William M. Jones, Nathan DeBardeleben, “Improving Application Resilience by Extending Error Correction with Contextual Information”, Fault Tolerance for HPC at eXtreme Scale (FTXS) Workshop at The International Conference for High-Performance Computing, Networking, Storage, and Analysis (SC18), November 16, 2018.
Megan Hickman, Dakota Fulp, Elisabeth Baseman, Sean Blanchard, Hugh Greenburg, William M. Jones, Nathan DeBardeleben, “Enhancing HPC System Log Analysis by Identifying Message Origin in Source Code”, The 29th IEEE International Symposium on Software Reliability Engineering (ISSRE 2018): Industry Track, October 15-18, 2018.
Laura Monroe, William M. Jones, Scott R. Lavigne, Claude H. Davis IV, Qiang Guan, and Nathan DeBardeleben, “On the Inherent Resilience of Integer Operations”, Euro-par 2016: 22nd International European Conference on Parallel and Distributed Computing / 9th Workshop on Resiliency in High-Performance Computing (Resilience) in Clusters, Clouds, and Grids, August 22-26, 2016.
Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Song Fu, Claude H. Davis IV, and William M. Jones, “Analyzing the Robustness of HPC Applications Using a Fine-Grained Soft Error Fault Injection Tool”, Innovative Research and Applications in Next-Generation High-Performance Computing, IGI Global, DOI: 10.4018/978-1-5225-0287-6.ch011, June 2016 (book chapter, in print).
Laura Monroe, William M. Jones, Scott R. Lavigne, Claude H. Davis IV, Qiang Guan and Nathan DeBardeleben, “On the Resilience of Integer Operators”, SELSE 2016 – 12th Workshop on Silicon Errors in Logic - System Effects, Austin, TX, USA, March 29-30, 2016 (paper and poster; article copyright retained, informal proceedings only on USB distribution, poster published with conference).
Qiang Guan, Nathan DeBardeleben, Brian Atkinson, Robert Robey, William M. Jones, “Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App”, CLUSTER 2015 – 2015 IEEE International Conference on Cluster Computing, Chicago, IL, USA, September 8-11, 2015 (paper and presentation).
Brian Atkinson, Walter Ligon III, Nathan DeBardeleben, Qiang Guan, Sean Blanchard, Bob Robey, William M. Jones, “Fault Injection, Detection, and Correction in CLAMR Using F-SEFI”, Salishan 2015 – Conference on High-Speed Computing, Gleneden Beach, Oregon, April 27, 2015, (poster, by invitation only at annual US DOE tri-lab conference, no official dissemination).
Brian Atkinson, Walter Ligon III, Nathan DeBardeleben, Qiang Guan, Sean Blanchard, Bob Robey, William M. Jones, “Fault Injection, Detection, and Correction in CLAMR Using F-SEFI”, SC14 – International Conference for High-Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November 17-21, 2014, (two-page abstract and poster).
Brian Atkinson, Nathan DeBardeleben, Qiang Guan, Bob Robey, William M. Jones, “Fault Injection Experiments With the CLAMR Hydrodynamics Mini-App”, Proceedings of the 25th IEEE International Symposium on Software Reliability Engineering (ISSRE), November 3-6, 2014, Naples, Italy, doi:10.1109/ISSREW.2014.51 (paper and presentation).