Linkage Equillibrium and Perfect phylogeny
Compare HWE to Linkage Equillibrum
| Hardy Weinberg Equillibrium | Linkage Equillibrium | |——————————–|—————————| | 1 locus | Many loci | | Allele frequency stay constant | Two locus are independent |
DNA without recombination
- mtDNA
- Y chromosome
Perfect Phylogeny
Assumptions:
- No recombination
- Infinite site: 1 position in the genome will NOT mutate 2 times.
Input: A SNP matrix
Each row is an individual, and each column is a locus. Denote the minor allele as 1, the major allele as 0.
Output: A tree (perfect phylogeny)
The tree leaves are individuals. Edges are annotated with locus mutation (column). For every subtree, each locus can only be $all 0$ or $all 1$.
Property: For any pair of column (loci), either of the 3 condition must be true.
Two columns are column $c$ and column $d$. denote $c_{1}$ as the rows containing 1 in column $c$.
For example: | | c | |—|—| | A | | | B | 1 | | C | 1 | | D | 1 | | E | | | F | | | G | | | H | |
$c_{1} = {B,C,D}$
- $c_{1} \subset d_{1}$ this means locus $c$ mutated later than locus $d$
- $d_{1} \subset c_{1}$ this means locus $d$ mutated later than locus $c$
- $c_{1}$ and $d_{1}$ are disjoint, meaning they are on two branches of the tree.
The algorithm to construct perfect phylogeny
- Make all column 0-major.
- Sort $n \times m$ matrix column so that all column later is either a subset, or disjoint to any previous column.
- treat all column as a binary number, lexicographically sort them.
- radix sort of $m$ numbers take $O(m)$ time. But now the number is $n$ bits long. Total time $O(nm)$
- Split matrix rows into subtrees column by column
- to add 1 column to the tree, we need to compare at most $m$ columns, each column takes at least $O(m)$ times.
- Therefore to add all $m$ columns, total time is $O(nm^{2})$
- We don’t need to compare to all previous columns. Once we find a subset-superset relationship, we can stop comparing. This algorithm trick makes total time $O(nm)$
What will happen if we reorder rows?
Nothing.
We only care about the subset/disjoint relationship. Sorting rows won’t change it.