Loading...
Thumbnail Image
Item

MAXIMIZING LEARNING EFFICIENCY WITH LIMITED LABELED DATA: APPLICATIONS TO HEALTHCARE AND EDUCATION

Research Projects
Organizational Units
Journal Issue
DOI
https://doi.org/10.34944/sc51-wf26
Abstract
Document classification is essential in domains such as healthcare and education, encompassing three major steps: annotation, training accurate models, and evaluation. Each of these steps is labor-intensive and time-consuming, requiring substantial amounts of labeled data, which is both costly and resource-demanding. This dissertation addresses these challenges by presenting innovative methodologies to enhance annotation efficiency, model training in low-resource settings, and automated scoring. In the healthcare domain, we tackle the challenges of annotation and training with limited resources. First, we develop a visualization approach for rapid labeling of clinical notes for smoking status extraction. The annotation process is labor-intensive and time-consuming; thus, we introduce a tool that accelerates annotation by clustering similar sentences and highlighting important keywords. This reduces the cognitive load on annotators, resulting in faster and more efficient labeling. Next, we address the problem of training accurate classifiers in low-resource settings with limited labeled data. In our first approach, MERIT (Minimal Supervision Through Label Augmentation for Biomedical Relation Extraction), we propose using shortest dependency path (SDP) representation and specific distance thresholds to propagate labels and augment high-quality labeled data. This method improves classifier accuracy compared to using limited labeled data alone. We extend this in our second approach by developing an iterative algorithm to learn automatic thresholds for label propagation. This method is tested in various scenarios, including semi-supervised learning, supervised learning, and in-context learning, demonstrating significant improvements in model performance. In the education domain, we focus on the problem of assessing narratives generated by school-aged children, a task that is both expensive and time-consuming for teachers. We leverage large language models (LLMs) to learn the scoring patterns of teachers accurately, offering a reliable tool for automated narrative scoring. This approach reduces the subjectivity and resource requirements of manual scoring, providing a scalable and consistent alternative. Experimental results across these methodologies demonstrate their effectiveness in improving annotation speed, data utilization, and model accuracy. This dissertation contributes to advancing document classification in low-resource settings, offering practical solutions for critical tasks in healthcare and education.
Description
Citation
Citation to related work
Has part
ADA compliance
For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu
Embedded videos