Power-Aware Resilience for Extreme Scale Computing

Alazzawe, Anis

Genre

Thesis/Dissertation

Date

2019

Advisor

Kant, Krishna

Committee member

Shi, Justin Y.
Tan, Chiu C.
Kim, Albert

Department

Computer and Information Science

DOI

http://dx.doi.org/10.34944/dspace/591

Abstract

The increase in processing power provided by successive generations of high performance computing platforms has made it possible to tackle a diverse range of large problems in many different fields that would not have been feasible otherwise. Exascale computing is on the horizon, and it brings with it unique opportunities and challenges. Applications running on exascale systems will run into many errors due to the vast number of components in these systems. Traditional recovery methods such as checkpointing alone will not be sufficient to allow these applications to finish execution in a reasonable amount of time, and in some instances they will not be able to finish execution at all. This is because errors will occur so often that they are expected to occur during the recovery process itself. Two primary issues that need research to make running applications on exascale systems viable are providing scalable resilience and managing energy consumption. Managing the energy usage of a resilience method is vital because these systems will be a huge energy draw, and energy usage is expected to be the largest cost of running them. The research path we have taken to introduce an energy efficient resilience method for exascale systems is as follows. First, we introduced slicing as an energy efficient resilience method. In this phase of the research we presented a model of program structure and of the propagation of data corruption errors, and showed how slicing can be used to detect these errors. Slicing is a technique that can be used to generate, as an executable, all the parts of a program that influence the computation of a given variable. It is traditionally used for analysis in the debugging, maintenance, and testing of software. Using this model we derive properties that show how slicing can be used to provide high confidence that a program has run without corruption errors.
The results show that a high rate of error detection can be obtained with only a small increase in power usage. Second, we introduced Slice Swarms for high performance computing (HPC) application resilience. In this phase of the research we scaled slicing to HPC environments. We developed a model that allows us to reason about the use of multiple slices in an HPC environment. We showed that using multiple slices makes it possible to detect more errors in applications that do not have extreme inter-dependencies among their variables, while requiring only a nominal amount of extra energy to run. We also showed the best way to distribute these slices across the variables of the application, given specific energy constraints. Finally, we tackled the challenge of providing energy efficient resilience to HPC applications with regular structure. The largest computing systems routinely run into silent data corruption (SDC) as part of their normal operation. The number of SDCs will increase drastically as computing systems approach the exascale mark, forcing a reconsideration of the resilience approach taken to counteract the effects of unmitigated data corruption errors. Yet any resilience method must be sensitive to both resource and energy requirements. HPC applications often have a regular structure that can be exploited to provide resilience more efficiently. We explore the propagation of data corruption errors in stencil computation, an iterative kernel with a structured communication pattern that is found in a wide variety of scientific and engineering problems. The key insight is that SDCs, and the data corruption they cause, have a localized impact in these types of applications, so recovery does not require every process to recompute the application state. We present a resilience mechanism, mimic replication, that protects against SDC errors through dynamic reexecution of select processes.
We then provide an analytical model that allows resource and energy consumption to be traded off against resilience.
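
As a rough illustration of the locality claim in the abstract (this is not the thesis's code or its mimic replication mechanism), the sketch below assumes a hypothetical 1-D, 3-point averaging stencil with fixed boundary cells. The names stencil_step and affected_cells, the grid size, and the initial values are all invented for the example. It injects one corrupted value and lists which cells differ from a clean run after a few sweeps.

```python
def stencil_step(grid):
    """One sweep of a 3-point averaging stencil; the two boundary cells stay fixed."""
    new = list(grid)
    for i in range(1, len(grid) - 1):
        new[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
    return new


def affected_cells(n, corrupt_index, steps, delta=1.0):
    """Indices where a run with one corrupted cell differs from a clean run."""
    clean = [float(i % 7) for i in range(n)]  # arbitrary initial state (assumption)
    dirty = list(clean)
    dirty[corrupt_index] += delta             # inject a single silent corruption
    for _ in range(steps):
        clean = stencil_step(clean)
        dirty = stencil_step(dirty)
    return [i for i in range(n) if clean[i] != dirty[i]]


# Hypothetical sizes: 64 cells, corruption at cell 32, 5 sweeps.
cells = affected_cells(n=64, corrupt_index=32, steps=5)
print(len(cells), min(cells), max(cells))
```

In this toy setting the corruption can spread by at most one cell per sweep in each direction, so after k sweeps no more than 2k + 1 cells can differ from the clean run. Only the neighbourhood of the fault would need to be recomputed, which is the kind of locality the abstract refers to.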

ADA compliance

For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu
