Pipeline scripts, markdown source code and data for assembly, analysis and presentation available at https://github.com/h3kker/assemblyTalk
Illumina Error Correction: K-mer Spectrum (SGA), Suffix Tree/Array, ...
Error Correction of Pacbio reads with Illumina (details later)
Adapter trimming
Quality trimming
Deduplication
Some assemblers depend on other, existing tools to perform these steps or do one or more as part of their pipeline. If so, don't use other tools - see Assemblathon.
They stared at the drinks were gone
They stared at the drinks went were gone
They stared at the drinks the drinks were gone
...
Look for paths that
(THEY ST)ARED AT THE DRINKS. THE DRINKS WENT WARM. THEY DRANK.
Overlap Consensus Layout, eg. SGA
More or less as shown. Minimum length of overlap k is the parameter that determines the graph complexity. Should ideally be as large as the dataset allows (sequencing errors, polyploidity). The ideal assembly visits all nodes exactly once (Hamilton-Path).
String Graphs are a special variant where all transitive edges ((X, Y), (X, Z), (Y, Z))
are reduced to ((X, Z))
, irreducible edges.
K-mer based, eg. Abyss, SOAPdenovo
Nodes represent all kmers in the reads. Two kmers are connected if there is a k-1
overlap between the nodes (de Bruijn graph). The Euler path that visits each edge exactly once corresponds to a chromosome in an ideal assembly.
K-mer sized (parameter k) should be chosen large enough to reduce the number of wrong connections between contigs, but small enough to allow for errors.
Hybrid strategies proposed: Combine contig and graph output from two types of assemblers.
Graph structure is very complex due to
transitive edges like ((1,2), (1,3), (2,3))
consecutive nodes like ((1,2), (2,3), (3,4))
error reads (branches that converge again later)
spurious branch points on repeat edges
dead ends (tips)
Collapse nodes that connect unambiguously (without branching) into one node representing the merged sequence.
Collapse nodes that connect unambiguously (without branching) into one node representing the merged sequence.
Sometimes also: tip erosion. Remove all nodes with connections only in one direction. These can be caused by low coverage regions and read errors. Can also shorten valid contigs!
Bubbles due to sequencing errors or polyploid genomes, heterozygosity. Selection of branch based on different criteria like coverage, quality.
Formed in repeated regions, were many reconstructions are possible. Resolved by forming parallel paths. Paired-End constraints can be used to discard invalid edges (too short, too long reconstruction).
Contigs: Build contiguous stretches of sequence, filter and correct (consensus)
Scaffolds: Either with built-in scaffolder or external program. Most assemblers come with their own scaffolder for PE or mate pair library information. Using Pacbio CLRs not yet popular.
Missing sequence information is filled with N (assembly gaps)
Use paired end information to join and orient contigs. Can also detect and filter misjoined contigs.
From the SGA paper:
[...] We then perform the standard assembly graph post-processing step of removing tips from the graph where a vertex only has a connection in one direction [...]
we have developed an algorithm [...] similar to the ‘‘bubble-popping’’ ap- proaches taken by de Bruijn graph assemblers [...]
Similar to other approaches to scaffolding (Pop et al. 2004), our method is based on constructing a graph of the relationships between contigs.
They all follow the same principles! Main "unique selling points" seem to be algorithms and data structures. The strategies and heuristics employed in graph simplification and postprocessing make the difference in results.
Ustilago bromivora, a fungus with a nice compact genome of about 20Mb.
17.4Mb in 28 contigs.
http://www.broadinstitute.org/annotation/genome/ustilago_maydis.2/Info.html
81% of est. 26.1 Mb in 71 supercontigs
25fold coverage with genomic and 10kb paired end library on 454, end-sequencing of a tiled BAC library assigned to 23 chromosomes with optical mapping
Linning et al., 2004. Genetics 166: 99-111
http://mips.helmholtz-muenchen.de/genre/proj/MUHDB/About/overview.html
mean length of 3910 bp (median 2903, max 20254), see report.html
Library might be problematic. Average insert size estimated at 211bp (+/- 52bp). According to scientist it should be 300-500bp, see report.html
Calculates various metrics to compare with test datasets from Assemblathon2:
Compensate for high error rate and indel in Pacbio reads by error correcting using relatively accurate Illumina short reads.
pacBioToCA (based on Celera assembler)
PreAssembly pipeline (from PacBio SMRTanalysis)
comes with SMRTanalysis software package, but must be run from command line.
Run
pacBioToCA -length 500 -partitions 200 -l ec_pacbio -t 16
-s pacbio.spec -fastq filtered_subreads.fastq $illuminaFrg
receive a 250MB file called ec_pacbio.fastq
and nothing else.
Pipeline steps:
see report.html
(no information about mapping to original reads from pipeline)
Computationally very intense (good for keeping clusters busy)
Reduction in Depth makes assembly seem infeasible
could in theory be run from the web interface, but only with PacBio input (error correcting CLRs with circular consensus reads (CCR). Needs .bas.h5 (primary analysis result from sequencer).
settings.xml
and input.xml
from job directoryrun
smrtpipe.py --distribute --output=result/
--params=settings.xml xml:input.xml
wait 1 - ? days depending on alignment parameters
see Pipeline output
Alignment with blasr.
see report.html
Fewer, even shorter reads
Bad results, but minimal relaxation of alignment criteria produced ~200GB of alignment files which then could not be read
Very sensitive to parameters for alignment between PacBio and Illumina Reads
Mapping information between corrected and original reads, better diagnostics
Compensate for short read length by assembling high-fidelity Illumina reads (with high coverage) and resolve repeats and gaps using long Pacbio reads.
Relatively new, few assemblers have native support for including Pacbio CLRs (in contrast to Mate Pair and Sanger reads)
Version 1.3.7 from Dec 11 2013 can use Pacbio CLRs internally with BWA version 0.7.5a+ (with bwa mem support).
abyss-pe
with parameter k=64
and library pathsbwa mem
manuallySimpson, JT et al. Genome Res 2009
set | # >2kb | N50 | max |
---|---|---|---|
scaffolds | 698 | 52136 | 200210 |
longscaff | 475 | 81601 | 435667 |
see report.html
abyss-pe k=64
(without long read library)${name}-contigs.dot
Deshpande V, et al. Algorithms Bioinformatics 2013
set | # >2kb | N50 | max |
---|---|---|---|
scaffolds | 698 | 52136 | 200210 |
longscaff | 475 | 81601 | 435667 |
cerulean | 310 | 106883 | 366413 |
see report.html
String Graph Assembler promises to be more memory-efficient with equally good results. Same first author as abyss.
discarded ~ 5M reads
Simpson JT. Genome Res. 2012
set | # >2kb | N50 | max |
---|---|---|---|
SOAP | 521 | 78347 | 280862 |
SGA | 467 | 57237 | 199401 |
Abyss | 698 | 51236 | 200210 |
see report.html
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).
set | # >2kb | N50 | max |
---|---|---|---|
SOAP | 521 | 78347 | 280862 |
SGA | 467 | 57237 | 199401 |
Abyss | 698 | 51236 | 200210 |
see report.html
Created for filling scaffold gaps.
jelly.out.fasta
(do NOT run more than one PBJelly per directory!)English AC, et al. PLoS One. 2012
blasr
Gap statistics
set | gapped.contigs | overall | overall.width | width.mean |
---|---|---|---|---|
cerulean | 316 | 799 | 529462 | 1675.51 |
pbj.cerulean | 152 | 224 | 64066 | 421.49 |
sga | 337 | 612 | 17250 | 51.19 |
pbj.sga | 26 | 31 | 927 | 35.65 |
soap | 514 | 3084 | 33891 | 65.94 |
pbj.soap | 246 | 2705 | 19088 | 77.59 |
set | # >2kb | N50 | max |
---|---|---|---|
sga | 183 | 234931 | 767671 |
SOAP | 174 | 201830 | 541843 |
Cerulean | 238 | 159023 | 489237 |
But we do not have the luxury of Assemblathon or GAGE to have a reference to compare to!
Aligned with bwa mem -a -T 60 -k 16 -A 2 -L 4 -t 8 -S -P -k 32
A number of contigs with very high depth (>300) were found - A random BLAST produced rDNA.
Parra G, et al. CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics. 2007;23(9):1061–7
Could be used to examine tentative gene structure!
Hunt M, et al. Genome Biol. 2013