Impact of epistasis on phylogenetic trees
Mahboubi, Mina
Genre
Thesis/Dissertation
Date
2025-12
Department
Biology
Abstract
The reconstruction of phylogenetic trees from molecular sequences is a central task in evolutionary biology. These trees are typically inferred from pairwise sequence distances, under the assumption that sequence similarity reflects shared evolutionary history. However, this task often relies on models that treat sites in a sequence as evolving independently. While this simplification enables tractable inference, it does not account for the effects of epistasis (interactions among sites), which, by constraining the accessible sequence space, distort the distribution of pairwise distances. This model misspecification can introduce systematic biases in tree topology, such as artificial hierarchies or compressed branches, even under neutral evolution and in the absence of phylogenetic relatedness.
This thesis investigates the impact of epistasis on phylogenetic tree structure through a multi-scale approach that combines theoretical modeling, numerical simulations, and analysis of natural protein sequences. We introduce a two-state model of protein evolution based on an Ising-like model with tunable pairwise correlations, allowing precise control over epistatic strength and structure. We show that epistatic constraints restrict the dynamics to a subset of sequence space. By analyzing the distribution of distances between evolved sequences, we find that sequences lie on a low-dimensional hypersurface, which we call the Neutral Evolution Manifold (NEM). We then demonstrate that the dimensionality of this manifold is controlled by the strength of epistasis and significantly affects the shape of the reconstructed phylogenetic trees. Along the way, we derive several analytical results and validate our approximations with extensive numerical simulations.
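To make the two-state setup concrete, the following is a minimal sketch of Ising-like sequence evolution, not the thesis's actual implementation. It assumes sequences in {-1, +1}, an energy E(s) = -(1/2) sᵀJs, random Gaussian couplings scaled by a hypothetical `strength` parameter, and single-site Metropolis dynamics at unit temperature. All function names are illustrative. Sequences are evolved independently from a common start, so any structure in the distance distribution comes from the couplings and not from shared ancestry.

```python
import numpy as np

def random_couplings(length, strength, rng):
    """Symmetric Gaussian couplings with zero diagonal (an assumed, illustrative choice)."""
    j = rng.normal(size=(length, length)) / np.sqrt(length)
    j = (j + j.T) / 2.0
    np.fill_diagonal(j, 0.0)
    return strength * j

def evolve(seq, couplings, n_steps, rng):
    """Single-site Metropolis dynamics for E(s) = -0.5 * s^T J s at unit temperature."""
    s = seq.copy()
    for _ in range(n_steps):
        i = rng.integers(len(s))
        # Energy change from flipping site i: dE = 2 * s_i * sum_j J_ij s_j
        delta_e = 2.0 * s[i] * (couplings[i] @ s)
        if delta_e <= 0 or rng.random() < np.exp(-delta_e):
            s[i] = -s[i]
    return s

def pairwise_hamming(seqs):
    """Hamming distances between rows of a +/-1 integer array: d = (L - s.t) / 2."""
    length = seqs.shape[1]
    return (length - seqs @ seqs.T) // 2

# Compare the pairwise-distance distribution without and with couplings.
rng = np.random.default_rng(0)
length, n_seqs = 50, 20
start = rng.choice([-1, 1], size=length)
for strength in (0.0, 2.0):
    J = random_couplings(length, strength, rng)
    seqs = np.array([evolve(start, J, 2000, rng) for _ in range(n_seqs)])
    dists = pairwise_hamming(seqs)[np.triu_indices(n_seqs, k=1)]
    print(f"strength={strength}: mean={dists.mean():.1f}, std={dists.std():.1f}")
```

With `strength = 0.0` every proposed flip is accepted and sites evolve independently. Increasing `strength` introduces pairwise constraints, and the printed mean and spread of the distances can then be compared across the two settings.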
We then extend this framework to real protein sequence data. Using multiple methods, including linear autoencoders, geodesic graph analysis, and discrete metric-based estimators, we show that real proteins exhibit significantly reduced intrinsic dimensionality compared to shuffled controls. This supports the hypothesis that epistatic constraints are a dominant factor shaping the observed sequence landscape. To isolate the impact of epistasis on tree topology, we generate synthetic MSAs from variational autoencoders trained on real protein data and assess tree structure using lineage-through-time plots and cherry proportion metrics. Finally, we apply statistical tests, including a likelihood ratio test, to quantify the dependence of tree shape on epistasis strength.
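The comparison against shuffled controls can be illustrated with a small sketch. This is not the thesis's estimator: it substitutes a PCA participation ratio for the linear autoencoder (a linear autoencoder trained with squared error recovers the PCA subspace), and it uses toy binary data generated from a few latent factors in place of a real MSA. The names `participation_ratio` and `shuffle_columns`, and all parameter values, are assumptions made for illustration.

```python
import numpy as np

def participation_ratio(msa):
    """Effective dimension (sum of eigenvalues)^2 / (sum of squared eigenvalues)
    of the site-site covariance matrix."""
    eigvals = np.linalg.eigvalsh(np.cov(msa, rowvar=False))
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative round-off
    return eigvals.sum() ** 2 / (eigvals ** 2).sum()

def shuffle_columns(msa, rng):
    """Permute each column independently: per-site frequencies are preserved,
    correlations between sites are destroyed."""
    return np.column_stack([rng.permutation(col) for col in msa.T])

# Toy "alignment": 40 binary sites driven by 2 latent factors plus noise.
rng = np.random.default_rng(0)
n_seqs, length, n_latent = 2000, 40, 2
z = rng.normal(size=(n_seqs, n_latent))
w = rng.normal(size=(n_latent, length))
msa = np.sign(z @ w + 0.3 * rng.normal(size=(n_seqs, length)))

shuffled = shuffle_columns(msa, rng)
print("correlated data:", participation_ratio(msa))
print("shuffled control:", participation_ratio(shuffled))
```

Because the shuffle keeps each site's composition fixed, any drop in effective dimension relative to the control is attributable to correlations between sites. For real protein alignments the amino-acid alphabet would first need a numeric encoding such as one-hot, which this sketch omits.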
Our results reveal that even modest epistatic interactions can bias phylogenetic inference, leading to trees that suggest evolutionary structure where none exists. We conclude that a more accurate understanding of sequence evolution is essential for reliable phylogenetic reconstruction, especially in the presence of site dependencies. This work lays the foundation for dimensionality-aware models of sequence evolution and offers a geometric perspective on the relationship between sequence space constraints and tree topology.
ADA compliance
For Americans with Disabilities Act (ADA) accommodation, including help with reading this content, please contact scholarshare@temple.edu
License
IN COPYRIGHT- This Rights Statement can be used for an Item that is in copyright. Using this statement implies that the organization making this Item available has determined that the Item is in copyright and either is the rights-holder, has obtained permission from the rights-holder(s) to make their Work(s) available, or makes the Item available under an exception or limitation to copyright (including Fair Use) that entitles it to make the Item available.
