Why people are obsessed with long range information?

May 09, 2020 | 2 Minute Read

Comparison of sequencing technologies (oversimplified)

	Sanger	Illumina	PacBio SMRT (Single Molecule, real-time sequencing)	Oxford Nanopore	10X	Optical mapping
	1st generation sequencing	short reads	long reads	long reads	linked reads = short reads + parallel barcoding
principle	Sequencing by synthesis	Sequencing by synthesis	Sequence by synthesis	Ion current.	Microfluidics, barcoding, sequencing by synthesis.	Microfluidics
	The added ddNTP will stop the reaction.	DNA is sheared to short pieces; while synthesizing, the incorpertaed nucleotide has dye attached to it. “clones of molecule” is visualize by the dye.	“single molecule” is visualize by the dye.	Measure the ion current when DNA bases crosses the membrane.	Microfluidis sort long fragments into beads. All large fragments are in the same bead has the same barcode. Although the reads are short, from the barcode we know which reads are from the same bead, inferring they might come from the same large fragment.	Linearized DNA molecule are cut by flourescent labelled restriction enzyme. The location of flourescence can infer long range information. But not the exact sequence in between.
read length		hundreds of b.p	thousands b.p.	thousands of b.p.	thousands of b.p.	100 k.b. to 1Mb
error rate		0.1 %	~10%	10%	similar to short reads

How PacBio SMRT (Single Molecule, Real Time) sequencing works?

How Oxford Nanopore works

How 10x Linked Reads work

Link to video

People are obsessed with “long range information (tens of thousands of b.p.)”, why is that?

The purpose of long reads, linked reads and optical mapping are to get information across large regions of DNA? Why is this information important? Here are the reasons:

The genome has a lot of repetitive regions. And short reads coming from those repeats all look very similar. Therefore it is hard to put back where exactly these short reads are from. If we can have longer reads that crosses the repetitive region, then it will be much easier.
The genome has many large variation: Structural Variation. Large pieces (thousands of b.p.) of DNA might be inverted, duplicated, or simply having different number of copies. With short reads, although we might get reads crossing breakpoints, or have more reads aligning to the duplicated regions, these methods have limited sensitivity. Long range information seems to be a more direct way to detect these.
Haplotype phasing: The human genome is diploid, meaning we have two copies of DNA that are almost the same. Say we have 2 positions, 5000 b.p. apart that are different, one being AT, the other being CG. To figure out which variants are on the same DNA is called haplotype phasing. Short reads will not be able to tell whether A and C or A and G is on the same copy of DNA.

Reference

Structural Variation in the sequencing era

« Bloom filters: Digging large sequence database How to calculate isotope profile? »

Cell In Silico

Molecules - Sequence - Disease

Why people are obsessed with long range information?

Comparison of sequencing technologies (oversimplified)

How PacBio SMRT (Single Molecule, Real Time) sequencing works?

How Oxford Nanopore works

How 10x Linked Reads work

People are obsessed with “long range information (tens of thousands of b.p.)”, why is that?

Reference