
Modeling Infrequent Events by Integrating Noisy Data From Multiple Sources And Applying Machine Learning to Classify and Predict Disruptions

Otudi, Hussain
Genre
Thesis/Dissertation
Date
2025-08
Advisor
Obradovic, Zoran
MacNeil, Stephen, 1987-
Committee member
Gao, Hongchang
MacNeil, Stephen, 1987-
Muto, Atsuhiro
Obradovic, Zoran
Department
Computer and Information Science
DOI
https://doi.org/10.34944/3zfe-2g91
Abstract
Anticipating rare yet consequential events, whether sudden line faults rippling across an electric grid or severe storms crippling communities, demands learning systems that function effectively despite missing data, imperfect labels, and extreme class imbalance. In this dissertation, five studies address this challenge by augmenting limited primary signals with supplementary information sources, fusing the heterogeneous evidence within architectures that preserve space-time structure, and validating their effectiveness through careful sensitivity testing. Two of the papers focus on power-system reliability, where phasor measurement units (PMUs) sample the grid at sub-second rates but are deployed too sparsely to capture every disturbance and are accompanied by incomplete event logs. The remaining three papers target severe-weather prediction, a domain where automatic weather stations often fail during the very conditions they are designed to monitor, and historical archives contain only a handful of the most disruptive events. Across both domains, the work demonstrates that simulation, social or expert text, and numerical forecasts can each serve as a “second lens” through which the learning algorithm detects patterns that would otherwise remain hidden.

In the power-system studies, the central idea is to use physically realistic simulations to supply the ground-truth diversity missing from field data. Real PMU recordings exhibit two intertwined limitations: many fault waveforms are captured far from their origin, making them difficult to distinguish, and the most destructive fault types appear so rarely that they contribute only a small fraction of the training examples. By generating synthetic voltage and current traces for all fault categories on a benchmark grid model, the first paper constructs a balanced, perfectly labeled companion dataset that is then blended with two years of Western Interconnection PMU observations.
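The blending idea can be sketched in a few lines. Everything below — the feature dimension, class counts, Gaussian stand-ins for waveform features, and the random-forest classifier — is an illustrative assumption, not the dissertation's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
N_FEATURES = 12  # stand-in for per-phase voltage/current statistics in a PMU window

def synthetic_faults(n_per_class, n_classes=4):
    """Balanced, perfectly labeled simulated examples: one cluster per fault type."""
    X = [rng.normal(loc=c, scale=1.0, size=(n_per_class, N_FEATURES)) for c in range(n_classes)]
    y = [np.full(n_per_class, c) for c in range(n_classes)]
    return np.vstack(X), np.concatenate(y)

def field_faults(counts):
    """Imbalanced 'real' examples: the rare multi-phase classes are sparse."""
    X = [rng.normal(loc=c, scale=1.5, size=(n, N_FEATURES)) for c, n in enumerate(counts)]
    y = [np.full(n, c) for c, n in enumerate(counts)]
    return np.vstack(X), np.concatenate(y)

X_syn, y_syn = synthetic_faults(n_per_class=500)
X_real, y_real = field_faults(counts=[400, 300, 30, 10])  # skewed toward classes 0 and 1

# Hold out part of the field data for evaluation, then blend the rest with
# the balanced synthetic corpus before training a conventional classifier.
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(
    X_real, y_real, test_size=0.3, stratify=y_real, random_state=0)
X_train = np.vstack([X_syn, Xr_tr])
y_train = np.concatenate([y_syn, yr_tr])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(classification_report(yr_te, clf.predict(Xr_te)))
```

With field classes this skewed, the blended corpus is what gives the classifier enough examples of the rare categories to learn them at all.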
This merged corpus feeds conventional machine-learning classifiers that, once trained, recognize multi-phase faults with a level of confidence previously reserved for the simpler single-phase cases, while overall misclassification rates drop sharply. The follow-up paper examines how to deploy simulation most efficiently. A systematic sensitivity analysis varies the number, placement, and voltage level of virtual PMUs, revealing that a small, strategically selected subset of high-value sensors located near likely fault sites provides most of the predictive benefit. Even when models are trained entirely on synthetic data, they generalize effectively to a distant regional grid with poorly labeled field data, confirming that simulation can substitute for an expensive and error-prone labeling campaign. Together, these two contributions establish a transferable strategy: generate fault scenarios in silico, select placements guided by grid physics, blend them with whatever real data are available, and train a classifier that remains robust even when confronted with unfamiliar topologies or mislabeled events.

Severe-weather prediction involves many of the same challenges, including sensor outages, label sparsity, and skewed data distributions. Three papers explore how human- and model-generated information can help fill the observational gaps. The first weather study treats social media users as impromptu weather observers. Geotagged messages posted during storms are filtered for meteorological language, encoded using a transformer language model, and temporally aligned with automated surface observing system (ASOS) readings. A bidirectional sequence network trained on this combined representation predicts warning signatures of oncoming hazards that remain largely invisible to sensor data alone, especially in densely populated areas where social signals are strongest.
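The temporal-alignment step behind such text–sensor fusion can be sketched as follows. The hourly bucketing, toy embedding dimension, and sensor fields are illustrative assumptions, not the study's actual preprocessing:

```python
import numpy as np
from datetime import datetime, timedelta

EMB_DIM = 8  # stand-in for a transformer sentence-embedding dimension

def hour_bucket(ts):
    """Truncate a timestamp to the hour it falls in."""
    return ts.replace(minute=0, second=0, microsecond=0)

# Toy inputs: (timestamp, embedding) messages and hourly (timestamp, sensors) readings.
t0 = datetime(2024, 1, 1, 0, 0)
messages = [(t0 + timedelta(minutes=m), np.full(EMB_DIM, m / 60.0))
            for m in (5, 40, 70, 200)]
asos = [(t0 + timedelta(hours=h), np.array([10.0 + h, 0.5 * h]))  # e.g., wind, precip
        for h in range(4)]

# Average all message embeddings that fall in each sensor hour; hours with
# no messages get a zero vector, so the sensor stream is never dropped.
by_hour = {}
for ts, emb in messages:
    by_hour.setdefault(hour_bucket(ts), []).append(emb)

fused = []
for ts, sensors in asos:
    text = np.mean(by_hour[ts], axis=0) if ts in by_hour else np.zeros(EMB_DIM)
    fused.append(np.concatenate([sensors, text]))

fused = np.stack(fused)  # shape: (hours, sensor_dim + EMB_DIM)
print(fused.shape)       # (4, 10)
```

The fused sequence can then be fed to any sequence model; the zero-fill for quiet hours is one simple way to keep the two streams aligned when social coverage is patchy.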
Recognizing that social media coverage is uneven, the next study focuses on expert storm narratives maintained by national weather agencies. These concise textual descriptions document the timing, location, and impacts of every officially recorded disaster. Converted to embeddings and late-fused with ASOS measurements, these narratives inject human context that remains available even in remote regions with sparse online activity, consistently improving event-detection accuracy. The final paper expands the fusion framework by incorporating short-range numerical weather predictions as a third modality. Forecast fields provide a physics-based description of the evolving atmosphere; textual sources contribute situational awareness of ground impacts; and sensors offer real-time verification. A trimodal architecture, reinforced by spatiotemporal feature engineering, integrates these perspectives without requiring perfect alignment, creating a credible multi-hazard early-warning tool that maintains strong performance even in harsh, high-latitude environments where conventional methods often fail.

Although the five studies span two application areas, they converge on several common insights. First, supplementary data, whether synthetically generated, crowd-sourced, written by experts, or forecast by numerical models, are indispensable for addressing the blind spots, noise, and imbalance inherent in primary measurements. Second, late-fusion architectures that allow each modality to develop its own latent representation before merging decisions provide the resilience needed when one stream is delayed, missing, or sampled at a different rate. Third, explicitly modeling temporal and spatial relationships through sequence models and physics-informed feature design is essential, as both grid faults and storms are dynamic processes whose signatures unfold over time and propagate across space.
Finally, domain-specific sensitivity analyses translate methodological advances into actionable guidance, such as identifying the grid locations where installing an additional PMU yields the greatest diagnostic gain.

This dissertation offers an end-to-end framework for learning under scarcity in safety-critical settings. In electric grids, it delivers a fault-detection pipeline that remains effective even when faced with rare multi-phase disturbances or mislabeled event logs. In meteorology, it transforms partially missing observations into robust predictions of storms that previously escaped detection. More broadly, the work demonstrates that when primary data are lacking, the most effective approach is to enlist the supplementary source that most directly fills the gap and allow a spatiotemporally aware fusion model to integrate the multiple modalities.
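As one concrete illustration of the late-fusion principle emphasized across these studies, the sketch below gives each modality its own encoder and averages per-modality scores over whichever streams are actually present, so a delayed or missing input degrades the prediction rather than breaking it. The dimensions and the random-projection "encoders" are stand-in assumptions, not the dissertation's architecture:

```python
import numpy as np

rng = np.random.default_rng(1)
DIMS = {"forecast": 16, "text": 8, "sensors": 4}  # assumed per-modality feature sizes
N_CLASSES = 3

# Stand-in "encoders": one fixed random projection per modality. A real system
# would learn a separate latent representation for each stream.
encoders = {m: rng.normal(size=(d, N_CLASSES)) for m, d in DIMS.items()}

def late_fuse(inputs):
    """inputs: dict modality -> feature vector; missing modalities are skipped."""
    scores = [inputs[m] @ encoders[m] for m in inputs if m in encoders]
    logits = np.mean(scores, axis=0)        # merge decisions, not raw features
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                  # class probabilities

full = {m: rng.normal(size=d) for m, d in DIMS.items()}
probs_all = late_fuse(full)                                       # all three streams
probs_no_text = late_fuse({k: v for k, v in full.items() if k != "text"})  # text missing
print(probs_all, probs_no_text)
```

Because fusion happens at the score level, dropping the text stream changes the prediction's inputs but never its shape, which is the resilience property the abstract's second insight describes.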