• OPERATIONAL RESILIENCE IN LARGE-SCALE DATA CENTERS

      Kant, Krishna; Ji, Bo, 1982-; Tan, Chiu C.; Biswas, Saroj K. (Temple University. Libraries, 2018)
      Dependence on highly available data center services is now critical. Unfortunately, downtimes are frequent. They often arise from inadequate or flawed operational procedures, or from reliance on ad-hoc procedures and fixes. The complexity and size of data centers make such ad-hoc approaches ineffective, so systematic analysis is required to minimize downtime and its performance impact. The main research objective is to examine a core set of operational issues that arise in data centers and to study mechanisms for improving availability and resilience. Specifically, the following critical issues were examined: fast restoration of data center services after large-scale failures or downtimes, through a progressive restoration planning optimization framework that increases service uptime and maximizes the satisfaction of user requests during the recovery process; minimizing the impact on data center services and security when organizations split or merge, by identifying different misconfiguration scenarios and considering ways of resolving the resulting conflicts so as to minimize manual changes and routing table sizes and to eliminate routing anomalies; and efficient detection and localization of failures in enterprise networks, through a systematic failure diagnosis framework comprising intelligent probing station selection, failure detection, and diagnosis across network components. The significance of the research lies in solving real-world issues that arise in enterprise networks and data centers. Extensive evaluations showed that progressive recovery improves uptime during large-scale failures; that resolving configuration issues systematically and efficiently minimizes error-prone manual changes; and that a systematic approach to failure diagnosis helps reduce troubleshooting time. Together, these improve availability and resiliency.
The limitations are that the research did not cover operation planning for data center services under limited resources during partial failures, and that the diagnosis framework is limited to pass/fail cases. The study contributes to knowledge, the literature, and practice. It opens up space for further studies of various aspects of misconfiguration in large-scale cyber and cyber-physical infrastructures.