REPRESENTATION LEARNING FOR VISUAL TASKS: A STUDY OF ATTENTION AND INFORMATION SELECTION
Genre
Thesis/Dissertation
Date
2025-08
Department
Computer and Information Science
DOI
https://doi.org/10.34944/0209-5b31
Abstract
This dissertation investigates methods for improving visual representation learning by optimizing attention mechanisms and information selection strategies within deep learning models. Standard approaches often process images independently and compress them into single global descriptors, which limits performance on tasks requiring contextual understanding or fine-grained detail and leaves models susceptible to shortcut learning. This work proposes and evaluates techniques that address these limitations by leveraging inter-example context, developing efficient multi-vector representations, and explicitly controlling attention. The research utilizes Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Graph Neural Networks (GNNs), and targets improvements in image classification (single- and multi-label) and fine-grained image retrieval. Four primary contributions are detailed: (1) CNN2Graph, a hybrid CNN-GNN framework using cross-attention over a bipartite graph connecting image batches to learnable proxies and fixed anchors, designed to integrate dataset-level context into image classification efficiently and inductively. (2) DMCAC, a self-supervised image retrieval method that aligns representation learning with the retrieval task by conditioning training on database interactions, employing distributional divergence minimization between augmented query views relative to the database, together with a cross-attention classification mechanism. (3) Register Tokens, an efficient multi-vector image representation for fine-grained retrieval that supplements the ViT `[CLS]` token with specialized register tokens, allowing Region-of-Interest (ROI) tokens to be discovered internally from attention patterns; performance versus computational cost is optimized within a late-interaction framework. (4) Object-Focused Attention (OFA), a training technique for ViTs that adds an auxiliary loss based on semantic segmentation masks to penalize attention to non-object regions, aiming to reduce shortcut learning, improve out-of-distribution robustness, and enhance object shape representation without increasing inference complexity. The results demonstrate that managing attention and information flow (through context integration, multi-vector feature selection, and explicit object focus) yields visual representations with improved performance, robustness, and efficiency. This research provides methodologies and principles for advancing visual representation learning, particularly for complex models and tasks.
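The late-interaction framework mentioned in contribution (3) typically scores a query against a database item by letting each query vector pick out its best-matching item vector (the MaxSim rule popularized by ColBERT). A minimal sketch with plain Python lists, assuming dot-product similarity; the function name and inputs are illustrative, not the dissertation's actual implementation:

```python
def late_interaction_score(query_vecs, item_vecs):
    """MaxSim late-interaction relevance score.

    Each query vector (e.g. a [CLS] or register token embedding)
    contributes the similarity of its best-matching item vector; the
    final score is the sum of these per-vector maxima.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in item_vecs) for q in query_vecs)


# Two query token vectors scored against two item token vectors.
score = late_interaction_score(
    query_vecs=[[1.0, 0.0], [0.0, 1.0]],
    item_vecs=[[1.0, 0.0], [0.5, 0.5]],
)
```

Because each query vector matches independently, adding register-token vectors refines the score without forcing all information through one global descriptor, at the cost of storing several vectors per image.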
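The OFA auxiliary loss described in contribution (4) can be sketched as a penalty on the attention mass that falls outside a segmentation mask. The sketch below assumes per-patch attention weights that sum to 1 and a binary object mask; the function name is illustrative, and the actual dissertation objective may differ in detail:

```python
def object_focused_attention_loss(attention, object_mask):
    """Auxiliary penalty on attention assigned to non-object patches.

    attention:   per-patch attention weights (e.g. from the [CLS] token),
                 assumed normalized to sum to 1.
    object_mask: 1 for patches inside the semantic segmentation mask
                 of the object, 0 for background patches.
    """
    assert len(attention) == len(object_mask)
    # Sum the attention that lands on background patches; minimizing this
    # term alongside the task loss discourages reliance on background
    # (shortcut) cues without adding any cost at inference time.
    return sum(a for a, m in zip(attention, object_mask) if m == 0)


# Half the attention on the object, half on background patches.
loss = object_focused_attention_loss(
    attention=[0.5, 0.2, 0.2, 0.1],
    object_mask=[1, 1, 0, 0],
)
```

Since the penalty only touches the training objective, inference runs the unmodified ViT, consistent with the abstract's claim of no added inference complexity.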
ADA compliance
For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu
