Population Genomics

Topological Data Analysis in Population Genomics

Dimensionality reduction techniques such as PCA are simple to apply but discard much of the structure in genomic data. We use algorithms from topology to analyze population structure more completely.

technologies

Python (ripser, BATS.py)

methods

Hausdorff landmarking, birth-death plots of the Rips complex

results

Discovered multiple structures not picked up by a two-dimensional PCA projection

Project overview

A major challenge in computational biology is describing data in a high-dimensional space. Much progress has been made with clustering and with dimensionality reduction techniques such as Principal Component Analysis (PCA), t-SNE, and UMAP. These project the data onto two dimensions, producing visualizations that explain much of the structure of the space. However, a lot of information is lost in the process, and it is difficult to know what other interesting relationships remain unexplained. A more robust method of analysis would therefore yield significant benefits.

Execution

To preserve as much structure as possible, we used persistent homology, which tracks how topological invariants of the high-dimensional point cloud (such as connected components and loops) appear and disappear as increasingly large neighborhoods around each point are considered, and which summarizes the result visually. Because these methods are computationally demanding, an initial subsetting procedure was needed to reduce the number of input samples.
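The two steps described above can be illustrated with a minimal sketch. It is not the project's actual pipeline and does not call ripser or BATS.py. It is pure standard-library Python, the function names `greedy_landmarks` and `h0_persistence` are hypothetical, and the toy points are made up. It assumes that the subsetting step is a greedy farthest-point selection (one common way to choose landmarks with a small Hausdorff distance to the full data set). It computes birth-death pairs only for connected components (dimension 0) of the Rips filtration. Loops and higher-dimensional features need a dedicated library.

```python
import math
from itertools import combinations


def greedy_landmarks(points, k, start=0):
    """Greedy maxmin (farthest-point) selection of k landmark indices."""
    landmarks = [start]
    # Distance from every point to its nearest landmark chosen so far.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(landmarks) < k:
        # Pick the point currently farthest from the landmark set.
        nxt = max(range(len(points)), key=nearest.__getitem__)
        landmarks.append(nxt)
        nearest = [min(d, math.dist(p, points[nxt])) for d, p in zip(nearest, points)]
    # max(nearest) is the directed Hausdorff distance from the data to the landmarks.
    return landmarks, max(nearest)


def h0_persistence(points):
    """Birth-death pairs of connected components (H0) in the Rips filtration."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process all pairwise edges in order of increasing length.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j) for i, j in combinations(range(n), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            # Every component is born at scale 0; one dies when two merge.
            pairs.append((0.0, length))
    pairs.append((0.0, math.inf))  # one component never dies
    return pairs


# Toy data (hypothetical): two well-separated clusters standing in for samples.
cloud = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.2, 5.0), (5.0, 5.1)]
idx, cover_radius = greedy_landmarks(cloud, k=2)
diagram = h0_persistence([cloud[i] for i in idx])
full_diagram = h0_persistence(cloud)
```

In `full_diagram`, the single long-lived finite pair reflects the gap between the two clusters, while the short-lived pairs reflect merges within each cluster. Plotting death against birth for such pairs gives a birth-death plot.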

Results

We found evidence that the more robust tools of Topological Data Analysis can reveal patterns in genomic data, and that they appear to describe the geometry of the point cloud better than a two-dimensional projection does.

This sets the foundation for further research in the field.