Cell In Silico


Molecules - Sequence - Disease

Linkage Equillibrium and Perfect phylogeny

March 13, 2021 | 2 Minute Read

Compare HWE to Linkage Equillibrum

| Hardy Weinberg Equillibrium | Linkage Equillibrium | |——————————–|—————————| | 1 locus | Many loci | | Allele frequency stay constant | Two locus are independent |

DNA without recombination

  • mtDNA
  • Y chromosome

Perfect Phylogeny

Assumptions:

  1. No recombination
  2. Infinite site: 1 position in the genome will NOT mutate 2 times.

Input: A SNP matrix

Each row is an individual, and each column is a locus. Denote the minor allele as 1, the major allele as 0.

Output: A tree (perfect phylogeny)

The tree leaves are individuals. Edges are annotated with locus mutation (column). For every subtree, each locus can only be $all 0$ or $all 1$.

Property: For any pair of column (loci), either of the 3 condition must be true.

Two columns are column $c$ and column $d$. denote $c_{1}$ as the rows containing 1 in column $c$.

For example: | | c | |—|—| | A | | | B | 1 | | C | 1 | | D | 1 | | E | | | F | | | G | | | H | |

$c_{1} = {B,C,D}$

  1. $c_{1} \subset d_{1}$ this means locus $c$ mutated later than locus $d$
  2. $d_{1} \subset c_{1}$ this means locus $d$ mutated later than locus $c$
  3. $c_{1}$ and $d_{1}$ are disjoint, meaning they are on two branches of the tree.

The algorithm to construct perfect phylogeny

  1. Make all column 0-major.
  2. Sort $n \times m$ matrix column so that all column later is either a subset, or disjoint to any previous column.
    • treat all column as a binary number, lexicographically sort them.
    • radix sort of $m$ numbers take $O(m)$ time. But now the number is $n$ bits long. Total time $O(nm)$
  3. Split matrix rows into subtrees column by column
    • to add 1 column to the tree, we need to compare at most $m$ columns, each column takes at least $O(m)$ times.
    • Therefore to add all $m$ columns, total time is $O(nm^{2})$
    • We don’t need to compare to all previous columns. Once we find a subset-superset relationship, we can stop comparing. This algorithm trick makes total time $O(nm)$

What will happen if we reorder rows?

Nothing.

We only care about the subset/disjoint relationship. Sorting rows won’t change it.