Power-Aware Resilience for Extreme Scale Computing
|The increase in processing power provided by successive generations of high performance computing platforms has made it possible to tackle a diverse range of large problems in many different fields that would not have been feasible otherwise. Exascale computing is on the horizon, and it brings with it unique opportunities and challenges. Applications running on exascale systems will encounter many errors because of the vast number of components in these systems. Traditional recovery methods such as checkpointing alone will not be sufficient to allow these applications to finish execution in a reasonable amount of time, and in some instances they will not be able to finish execution at all. This is because errors will occur so often that they are expected to occur during the recovery process itself. Two primary issues that need research to make running applications on exascale systems viable are providing scalable resilience and managing energy consumption. Managing the energy usage of a resilience method is vital because these systems will be a huge energy draw, and energy usage is expected to be the largest cost of running them. The research path we have taken to introduce an energy efficient resilience method for exascale systems is as follows. First, we introduced slicing as an energy efficient resilience method. In this phase of the research we presented a model of program structure and of the propagation of data corruption errors, and showed how slicing can be used to detect these errors. Slicing is a technique that can be used to generate, as an executable, all the parts of a program that influence the computation of a given variable. It is traditionally used for analysis in debugging, maintenance, and testing of software. Using this model we derive properties that show how slicing can be used to provide high confidence that a program has run without corruption errors.
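As a rough illustration of the slicing idea described above, the following sketch computes a backward slice for a straight-line program. It is not the model or tooling used in the dissertation: the program representation (a list of `(defined_variable, used_variables)` pairs) and the function name `backward_slice` are assumptions made only for this example.

```python
def backward_slice(statements, target):
    """Return indices of the statements that can influence `target`.

    `statements` is a straight-line program given as a list of
    (defined_variable, [used_variables]) pairs, a hypothetical
    representation chosen for this sketch.
    """
    needed = {target}          # variables whose values still matter
    kept = []
    # Walk the program backwards, keeping statements that define a needed variable.
    for idx in range(len(statements) - 1, -1, -1):
        defined, used = statements[idx]
        if defined in needed:
            kept.append(idx)
            needed.discard(defined)   # this definition satisfies the need
            needed.update(used)       # but its inputs are now needed
    return sorted(kept)


program = [
    ("a", []),           # 0: a = input
    ("b", []),           # 1: b = input
    ("c", ["a"]),        # 2: c = a * 2
    ("d", ["b"]),        # 3: d = b + 1
    ("e", ["c", "a"]),   # 4: e = c + a
]

print(backward_slice(program, "e"))
```

Here the slice for `e` omits statements 1 and 3, so re-running only the slice to cross-check `e` does less work than re-running the whole program.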
The results show that a high error detection rate can be obtained with only a small increase in power usage. Second, we introduced Slice Swarms for high performance computing (HPC) application resilience. In this phase of the research we scaled slicing to HPC environments. We developed a model that allows us to reason about the use of multiple slices in an HPC environment. We showed that using multiple slices makes it possible to detect more errors in applications that do not have extreme inter-dependencies among their variables, while requiring only a nominal amount of extra energy to run. We also showed the best way to distribute these slices across the variables of the application, given specific energy constraints. Finally, we tackled the challenge of providing energy efficient resilience to HPC applications with regular structure. The largest computing systems routinely run into silent data corruption (SDC) as part of their normal operation. The number of SDCs will increase drastically as computing systems approach the exascale mark, forcing a reconsideration of the resilience approach taken to counteract the effects of unmitigated data corruption errors. Yet any resilience method must be sensitive to both resource and energy requirements. HPC applications often have a regular structure that can be exploited to provide resilience more efficiently. We explore the propagation of data corruption errors in stencil computation, an iterative kernel with a structured communication pattern that is found in a wide variety of scientific and engineering problems. The key insight is that SDCs, and the data corruption they cause, have a localized impact in these types of applications, so recovery does not require every process to recompute the application state. We present a resilience mechanism, mimic replication, that protects against SDC errors through dynamic reexecution of select processes.
We then provide an analytical model that allows resource and energy consumption to be traded off against resilience.
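The locality claim for stencil computations can be illustrated with a small sketch. It is not the mimic replication mechanism itself: it assumes a one-dimensional 3-point averaging stencil with fixed boundary cells, and the function names and the injected error magnitude are hypothetical choices made for this example.

```python
def stencil_step(grid):
    """One sweep of a 3-point averaging stencil; boundary cells stay fixed."""
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def corrupted_cells(n, fault_index, steps):
    """Return the indices whose values differ from a fault-free run.

    A single corrupted value is injected at `fault_index` before the
    first sweep (assumed fault model for this sketch).
    """
    clean = [float(i) for i in range(n)]
    faulty = list(clean)
    faulty[fault_index] += 1000.0   # injected silent data corruption
    for _ in range(steps):
        clean = stencil_step(clean)
        faulty = stencil_step(faulty)
    return [i for i in range(n) if clean[i] != faulty[i]]


# After 3 sweeps the corruption has reached only cells within distance 3.
print(corrupted_cells(n=32, fault_index=16, steps=3))
```

Because each sweep reads only immediate neighbours, the affected region grows by at most one cell per side per iteration. In a distributed run this suggests that only the processes owning cells near the fault would need to re-execute, which is the intuition behind re-running select processes rather than all of them.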
|Temple University. Libraries
|Theses and Dissertations
|IN COPYRIGHT
|Shi, Justin Y.
|Tan, Chiu C.
|Computer and Information Science