Configuration Modeling and Diagnosis in Data Centers

DOI
http://dx.doi.org/10.34944/dspace/260
Abstract
The behavior of every cyber-system in a data center or enterprise system depends largely on its configuration, which describes the resource allocations needed to achieve a desired goal under given constraints. Poorly configured systems become a bottleneck to meeting that goal and introduce unnecessary overheads such as under-utilization, loss of functionality, poor performance, economic burden, and excess energy consumption. The ill effects of system misconfiguration are well documented, with quantifiable metrics showing their impact on the economy, security incidents, service recovery time, loss of confidence, and society. However, configuration modeling and diagnosis of data center systems is challenging because of the complexity of subsystem interactions and the many (known and unknown) parameters that influence system behavior. Further, a configuration is not a static object but a dynamically evolving entity that must be changed, automatically or manually, to track the evolving state of the system. We believe that a well-defined approach to configuration modeling is important because it paves a path to keeping systems functioning properly in spite of dynamic configuration changes. Proper configuration of large systems is difficult because the interdependencies between configuration parameters, and their impact on performance or other system attributes, are generally poorly understood. Consequently, properly configuring a system, or a subsystem or device within it, depends largely on expert knowledge developed over time. In this work, we attempt to formalize some approaches to configuration management, particularly in the area of network devices and Cloud/Edge storage solutions.
In particular, we address the following aspects in this study: (i) the impact of resource allocation on the energy-performance trade-off, with a network topology as an example, (ii) prediction of the performance of a complex IT system (such as a Cloud Storage Gateway or an Edge Storage Infrastructure) under given conditions, (iii) development of a data-driven method to efficiently configure a system (that is, allocate resources) to satisfy required QoS levels under constrained conditions, and (iv) a model to express configuration health as a quantifiable metric. With increasing stress on data center networks and correspondingly increasing energy consumption, we propose a method to simultaneously configure routing and energy-management parameters so that the network can both avoid congestion and maximize opportunities for putting network ports into a lower power mode. We also study the problem of choosing hardware and resource settings to minimize cost while achieving a given level of performance. Because of the complexity of the problem, we explored machine learning (ML) based techniques. For concreteness, we studied the problem in the context of configuring a cloud storage gateway (CSG), which involves parameters such as the speed and number of CPU cores, memory size and bandwidth, I/O size and bandwidth, and data and metadata cache sizes. It turns out to be very difficult to obtain a reliable ML model that maps requirements directly to a configuration. Our approach instead uses a model for the opposite problem (predicting optimal cost or performance for a given configuration) together with a meta-heuristic such as a genetic algorithm or simulated annealing. We show that an intelligent grouping of configuration parameters, based on the expected relationships between parameters and the relative importance of the groups, substantially outperforms the standard meta-heuristic exploration of the state space.
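The idea of pairing a forward model with a meta-heuristic can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the parameter names and value ranges, the toy `predicted_performance` function standing in for a trained ML model, the linear `cost` function, and the penalty weight. It uses plain simulated annealing and does not implement the parameter grouping described in the abstract.

```python
import math
import random

# Hypothetical discrete configuration space; names and values are illustrative only.
SPACE = {
    "cpu_cores": [2, 4, 8, 16],
    "memory_gb": [8, 16, 32, 64],
    "cache_gb": [1, 4, 16, 64],
}

def predicted_performance(cfg):
    """Toy stand-in for a trained forward model (configuration -> performance)."""
    return (10 * math.log2(cfg["cpu_cores"])
            + 5 * math.log2(cfg["memory_gb"])
            + 8 * math.log2(cfg["cache_gb"] + 1))

def cost(cfg):
    """Made-up linear cost of a configuration."""
    return 20 * cfg["cpu_cores"] + 2 * cfg["memory_gb"] + cfg["cache_gb"]

def objective(cfg, target):
    """Cost plus a penalty for falling short of the target performance."""
    shortfall = max(0.0, target - predicted_performance(cfg))
    return cost(cfg) + 1000.0 * shortfall

def neighbor(cfg, rng):
    """Re-draw the value of one randomly chosen parameter."""
    new = dict(cfg)
    key = rng.choice(sorted(SPACE))
    new[key] = rng.choice(SPACE[key])
    return new

def anneal(target, steps=5000, t0=100.0, seed=0):
    """Simulated annealing over SPACE with a linearly decreasing temperature."""
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in SPACE.items()}
    best = current
    for i in range(steps):
        temp = t0 * (1 - i / steps) + 1e-9
        cand = neighbor(current, rng)
        delta = objective(cand, target) - objective(current, target)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = cand
        if objective(current, target) < objective(best, target):
            best = current
    return best

if __name__ == "__main__":
    print(anneal(target=60.0))
```

The search only ever calls the forward model, so a learned predictor could replace `predicted_performance` without changing the loop.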
Our work in the configuration space revealed a significant gap: there is no common vocabulary or quantifiable metric for clearly and unambiguously expressing the quality of a configuration. In our diagnosis work, we designed a model that defines a simple, reproducible, and verifiable metric allowing users to express the quality of a device configuration as a health score. Our configuration diagnosis model expresses the strength (or weakness) of a configuration as a ‘Health Index’, a vector of dimensions such as performance, availability, and security. This health index helps users and administrators identify weak configuration objects and take remedial action to rectify them. Our work on Configuration Modeling and Diagnosis addresses an important topic in this vast and chaotic space. Using industry-driven problems and empirical data, we bring some structure to this complex problem. Although our research and experiments involved specific devices (network topology, Cloud Gateway, Edge Storage, network routers, etc.), we show that the proposed solution is generic and can be applied to other domains. We hope that this work will encourage other communities to explore new 'configuration' challenges in a rapidly changing IT landscape.
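As a rough illustration of a vector-valued health score, the sketch below represents a Health Index with the three dimensions named in the abstract and flags configuration objects whose weakest dimension falls below a threshold. The [0, 1] score range, the threshold, the object names, and the helper `weak_objects` are all hypothetical; the abstract does not specify how the scores are computed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HealthIndex:
    # Assumed convention: each score lies in [0, 1], higher is healthier.
    performance: float
    availability: float
    security: float

    def weakest(self):
        """Return the name of the lowest-scoring dimension."""
        scores = {
            "performance": self.performance,
            "availability": self.availability,
            "security": self.security,
        }
        return min(scores, key=scores.get)

def weak_objects(indices, threshold=0.5):
    """Map each object with a dimension below threshold to its weakest dimension."""
    return {
        name: h.weakest()
        for name, h in indices.items()
        if min(h.performance, h.availability, h.security) < threshold
    }

if __name__ == "__main__":
    # Hypothetical configuration objects and scores.
    indices = {
        "router-acl": HealthIndex(0.9, 0.8, 0.3),
        "cache-policy": HealthIndex(0.7, 0.9, 0.8),
    }
    print(weak_objects(indices))
```

Keeping the index as a vector rather than a single number preserves which dimension needs remediation.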