5 Essential Concepts In Genome Assembly From NGS Data

The main goal for researchers, clinicians, and students who perform Next Generation Sequencing (NGS) on human samples is to identify biomarkers or variants that support a diagnosis, and to deduce the genetic anomalies that could be responsible for the disease under study. Most projects (academic or otherwise) are built on the same underlying idea: deciphering the “unknown.” Labs across the world have formulated well-established computational protocols and pipelines for determining what these “unknown” variants are. The fact that we have the “reference” human genome available – thanks to the Human Genome Project – plays a vital role in discerning unknown variants.

The Human Genome Project started in 1990 as an international scientific research project to comprehend the genetic make-up and base pairs that compose the human genome, from both a physical and a functional standpoint. The final sequence map of the human genome was published on April 14th, 2003. Since then, remarkable progress has been achieved in the field of sequencing, with multiple sequencing methodologies (such as Illumina sequencing-by-synthesis and pyrosequencing) available on the market. Moreover, the cost per human genome has fallen even faster than Moore’s law would predict: with advances in sequencing techniques, prices have dropped roughly 100,000-fold, to around $1,000 per genome, leading to the massive production of sequenced genomic data.

This tremendous amount of sequenced data has increased the demand for competent computational analysis to extract the underlying information that could be valuable for research and medical diagnosis. However exhaustive (manually and computationally) the effort to reach a diagnosis might be, the availability of the reference human genome has made identifying genetic anomalies in sequence data far more tractable. Having a reference genome is a great convenience for performing efficient variant calling.

Imagine if the human genome had never been sequenced; where would we stand today? Having that reference genome sequence available is indeed a convenience. However, not all species have their genomes sequenced. This is especially true for prokaryotes, mainly because of the high genomic diversity among strains – strains of the same bacterial or viral species can differ substantially at the genomic level. Therefore, research concerning prokaryotes often starts with genome assembly. This step allows researchers to understand the genetic constitution of the prokaryotic strain under study and eventually helps determine its deleteriousness – for example, studying the pathogenic and non-pathogenic characteristics of bacterial communities in water samples and their influence on human health.

It is worth mentioning that assembling prokaryotic genomes is relatively easy compared to the human genome because of their small genome size. Bacterial genomes range from roughly 100 kbp to 14 Mbp, and viral genomes from about 2 kb to 1 Mb. Genomes of this size require far fewer manual and computational resources than a human genome of about 3 billion bases.

Producing a high-quality genome assembly in prokaryotes starts with the basics: assessing the quality of the NGS sequence data. Before getting to the quality check, it is important to understand which sequencing platforms you can use to generate quality sequence data. Illumina is at the forefront of this race because of its ability to produce paired-end reads. In paired-end sequencing, each DNA fragment is read from both ends: the fragment is sequenced from one end, and once that read is complete, the process starts again from the other end of the same fragment (Figure 1).

With paired-end sequencing, researchers can sequence both ends of a DNA fragment and generate high-quality, alignable sequence data. Moreover, paired-end sequencing facilitates the detection of genomic structural rearrangements and repetitive sequence elements (repeats). Hence, it is recommended to use Illumina to generate paired-end sequence data for genome assembly. As shown in Figure 1, you can generate 2x150 bp data, meaning two reads (forward “Read 1” and reverse “Read 2”) of 150 bases each.

Figure 1: Paired-end read and its significance in determining repeats [Image courtesy of Illumina, Inc.]

Once the sequence read data are generated, you can execute the downstream steps of the genome assembly.  In this article, we will review some basic but essential assembly steps and concepts.

1. Quality Control

FastQC is a widely used and robust program for performing quality control and spotting potential problems such as low-quality bases in the sequence reads. The program generates a set of graphs to examine data quality; the FastQC project page provides example reports of good Illumina data and bad Illumina data. Among the generated graphs, one of the most important for estimating data quality is the “Per base sequence quality” graph. If the quality scores of the bases are consistently high across the read (a Phred score of 20–30 or above is generally considered a good base score), the NGS library preparation was done reasonably well and the data are uniform. However, a divergence or drop in the scores could indicate a poor-quality NGS library. Depending on the available data and their quality statistics in the graphs, you will have to determine a quality-score threshold that removes bad-quality data and trim the reads at that threshold. Alternatively, you can re-sequence to generate better-quality data.
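To make the “Per base sequence quality” idea concrete, here is a minimal Python sketch (not FastQC itself) that computes the mean Phred quality at each read position from a FASTQ file. The file name reads.fastq is a placeholder, and Phred+33 quality encoding (standard for current Illumina data) is assumed.

```python
# Minimal sketch: mean Phred quality per base position from a FASTQ file.
# Assumes Phred+33 quality encoding; "reads.fastq" is a placeholder name.

def per_base_mean_quality(fastq_path):
    sums, counts = [], []
    with open(fastq_path) as fh:
        for i, line in enumerate(fh):
            if i % 4 == 3:  # every 4th line of a FASTQ record is the quality string
                for pos, char in enumerate(line.rstrip("\n")):
                    if pos == len(sums):
                        sums.append(0)
                        counts.append(0)
                    sums[pos] += ord(char) - 33  # Phred+33 character -> quality score
                    counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]

if __name__ == "__main__":
    means = per_base_mean_quality("reads.fastq")
    for pos, q in enumerate(means, start=1):
        flag = "" if q >= 20 else "  <-- below Q20, a candidate trimming point"
        print(f"position {pos}: mean Q = {q:.1f}{flag}")
```

A drop in these per-position means toward the 3-prime end is the same pattern you would see in the FastQC graph, and it tells you where a trimming threshold should fall.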

After this refinement, contamination is removed from the sequence data. This step is often neglected, but it can play a significant role in minimizing inaccuracies in the resulting assembly. Contaminating reads are eliminated by mapping the reads against a reference containing the likely contaminants, such as the human genome, and discarding the reads that map. Common tools are BWA (for short reads) and BLAST (for long reads).
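As an illustration of this mapping-based decontamination, here is a hedged Python sketch that calls BWA-MEM and samtools and keeps only read pairs in which neither mate maps to the contaminant reference. All file names are placeholders, and the flag combination (-f 12 -F 256, i.e. read and mate both unmapped, secondary alignments excluded) is the common recipe; verify it against your samtools version.

```python
# Sketch: remove host/contaminant reads by mapping against a contaminant
# reference (e.g., the human genome) and keeping only pairs where both
# mates are unmapped. File names are placeholders; requires bwa and samtools.
import subprocess

contaminant_ref = "human_genome.fa"          # placeholder contaminant reference
r1, r2 = "raw_1.fastq", "raw_2.fastq"        # placeholder input read files

subprocess.run(["bwa", "index", contaminant_ref], check=True)

# Map the reads against the contaminant reference.
with open("contaminant_aln.sam", "w") as sam:
    subprocess.run(["bwa", "mem", contaminant_ref, r1, r2], stdout=sam, check=True)

# Keep pairs where read and mate are both unmapped (-f 12), skip secondary
# alignments (-F 256); these are the reads that are NOT contamination.
subprocess.run(["samtools", "view", "-b", "-f", "12", "-F", "256",
                "-o", "both_unmapped.bam", "contaminant_aln.sam"], check=True)

# Sort by read name and convert the surviving pairs back to FASTQ.
subprocess.run(["samtools", "sort", "-n", "-o", "both_unmapped_sorted.bam",
                "both_unmapped.bam"], check=True)
subprocess.run(["samtools", "fastq", "-1", "clean_1.fastq", "-2", "clean_2.fastq",
                "both_unmapped_sorted.bam"], check=True)
```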

Further, non-biological elements are removed from the sequence data by trimming the reads. You can trim the sequence reads using Trimmomatic, which offers a good range of options for read trimming, such as removing non-biological elements like adapters and primers.
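Here is a minimal sketch of a paired-end Trimmomatic run invoked from Python. The jar path, input/output file names, adapter FASTA, and thresholds are placeholders to adapt to your own data; ILLUMINACLIP, SLIDINGWINDOW, and MINLEN are standard Trimmomatic trimming steps.

```python
# Sketch: adapter and quality trimming of paired-end reads with Trimmomatic.
# The jar path, adapter FASTA, and thresholds below are placeholders.
import subprocess

cmd = [
    "java", "-jar", "trimmomatic-0.39.jar", "PE",
    "clean_1.fastq", "clean_2.fastq",               # input read pairs
    "trimmed_1.fastq", "unpaired_1.fastq",          # surviving / orphaned R1
    "trimmed_2.fastq", "unpaired_2.fastq",          # surviving / orphaned R2
    "ILLUMINACLIP:adapters.fa:2:30:10",             # remove adapter sequences
    "SLIDINGWINDOW:4:20",                           # cut when a 4-base window mean Q < 20
    "MINLEN:36",                                    # drop reads shorter than 36 bp
]
subprocess.run(cmd, check=True)
```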

2. Genome Assembly Intricacies

Following the trimming step, genome assembly is performed, most commonly using the De Bruijn graph (DBG) algorithm. Robust assemblers implementing the DBG algorithm are widely used for efficient genome assembly; one such program is SPAdes.

DBG uses a k-mer approach to generate the consensus sequence. In this approach, the reads are broken into k-mer substrings, and consecutive k-mers share an overlap of (k − 1) bases (Figure 2). The algorithm then looks for overlaps in the pool of k-mers generated from the sequence reads, and the consensus sequence or assembly is formed as shown in Figure 3. Note that the 4-base k-mers depicted in Figure 3 are for illustration purposes only; in practice, k-mers are larger – SPAdes, for example, uses multiple k-mer sizes of up to 127 bp, chosen according to the read length.

Figure 2: A k-mer representation of a sequence read of 25 bases

You can calculate the number of k-mers from the length of the read and the k-mer size using the formula: N_kmers = L_read − Kmer_size + 1. For instance, for the read shown in Figure 2, the read length (L_read) is 25 bases and the k-mer size (Kmer_size) is 20, so the number of k-mers (N_kmers) is 25 − 20 + 1 = 6.

Figure 3: Consensus sequence representation from 2 reads using k-mers of 4 bases.
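To illustrate the k-mer idea in code, here is a minimal Python sketch that breaks reads into k-mers and builds a toy De Bruijn graph whose nodes are (k − 1)-mers. It is a drastic simplification of what SPAdes actually does (no error correction, no paired-end information, a single k-mer size), and the reads shown are made up for the example.

```python
# Toy De Bruijn graph: nodes are (k-1)-mers, edges correspond to k-mers.
# A simplification of a real assembler like SPAdes.
from collections import defaultdict

def kmers(read, k):
    """Yield all overlapping k-mers of a read (there are len(read) - k + 1 of them)."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]

def build_debruijn(reads, k):
    graph = defaultdict(list)
    for read in reads:
        for km in kmers(read, k):
            prefix, suffix = km[:-1], km[1:]   # consecutive k-mers overlap by k-1 bases
            graph[prefix].append(suffix)
    return graph

if __name__ == "__main__":
    reads = ["ATGGCGTGCA", "GCGTGCAATG"]       # illustrative reads only
    graph = build_debruijn(reads, k=4)
    for node, neighbours in sorted(graph.items()):
        print(node, "->", ", ".join(neighbours))
```

Walking through this graph and following the overlapping (k − 1)-mers is, in essence, how the consensus sequence in Figure 3 is reconstructed.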

The k-mer approach of the DBG algorithm helps resolve the alignment issues encountered with repetitive regions in the genome, which makes it a strong choice for the assembly process. Since the distance between the two reads of an Illumina pair is known, DBG-based assemblers use this information to place reads over repetitive regions more precisely than is possible with single-end reads, where the DNA fragment is read from only one direction (Figure 1). Furthermore, with its k-mer approach, the DBG algorithm is a natural fit for the short reads generated by NGS platforms like Illumina, finding overlaps between reads and generating consensus sequences or assemblies.

Here you might be asking yourself: how do de Bruijn graphs take long reads into account? The simple answer is that de Bruijn graphs are agnostic to read length. The construction process chooses a k-mer size and parses all reads (long or short) into k-mers of that size. After the graph is constructed, the long-range information available from long reads is lost. Such long-range information is most useful for resolving repetitive regions of the genome; intuitively, choosing a larger k-mer helps resolve repeat regions during assembly.

When using DBG with short-read data, you should always choose a k-mer size below the read length: if the k-mer size exceeds the maximum read length of a short-read library, those reads cannot be included in the construction of the de Bruijn graph. For example, when applying the DBG algorithm to an Illumina paired-end library of 2×150 bp, you should use k-mers shorter than 150 bases. In contrast to the DBG approach, a simple alignment-based (overlap) approach cannot reliably align and overlap such voluminous short-read data to generate a consensus sequence, because computing all pairwise overlaps is computationally complex and time-consuming.

“Coverage” (or “depth”), in the context of mapping sequence reads to a reference genome (if one is available), is a salient metric in assembly assessment. Coverage is defined as the average number of times a nucleotide position is covered by sequencing reads. For instance, in Figure 4, the coverage of base “T” at the indicated position is 4.

The coverage of an entire genome can be calculated with the formula C = (N × L)/G, where N is the number of reads, L is the average read length, and G is the genome size. Similarly, since k-mers are short stretches of DNA taken from the sequence reads, their coverage can be calculated as Ck = C × (R − K + 1)/R, where R is the read length, K is the k-mer length, and C is the base coverage. The k-mer coverage of an assembly reflects the number of k-mers that map with perfect identity to that assembly; higher k-mer coverage generally corresponds to a higher-quality assembly.

Figure 4: An overlap of sequenced reads representing the coverage of bases.
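As a quick worked example of the two formulas above, here is a short Python sketch; the read count, read length, k-mer size, and genome size are made-up numbers chosen only to illustrate the arithmetic.

```python
# Worked example of the coverage formulas; all numbers are illustrative only.

def genome_coverage(n_reads, read_len, genome_size):
    """C = (N x L) / G"""
    return (n_reads * read_len) / genome_size

def kmer_coverage(base_coverage, read_len, k):
    """Ck = C * (R - K + 1) / R"""
    return base_coverage * (read_len - k + 1) / read_len

if __name__ == "__main__":
    C = genome_coverage(n_reads=2_000_000, read_len=150, genome_size=5_000_000)
    Ck = kmer_coverage(C, read_len=150, k=55)
    print(f"base coverage  C  = {C:.1f}x")    # 60.0x for these numbers
    print(f"k-mer coverage Ck = {Ck:.1f}x")   # 38.4x, lower than C since each read yields R-K+1 k-mers
```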

3. Subsampling And Assembling

Now that we have seen how the DBG algorithm in SPAdes works, it is the right time to introduce the concept of “subsampling” in genome assembly, or consensus sequence generation.

DBG is a robust algorithm; however, with millions of reads to assemble at once, it can struggle with very high coverage and produce inaccurate assemblies. Therefore, it is recommended to subsample the sequenced data. Subsampling means dividing the sequenced reads in the FASTQ file into separate datasets of reads. For example, a FASTQ file containing 1 million reads can be split into 20 datasets of 50,000 reads each. Datasets of 50,000 to 100,000 reads (especially for prokaryotes and phages) normally give better assembly results, but you should experiment with your data to determine which dataset size works best. Once the datasets are ready, SPAdes is run on each one, producing a sequence assembly per dataset. It is important to note that the assemblies generated from the different datasets are candidate results from which the best assembly has to be chosen: when assessing the assembled sequences for accuracy, all of them should be considered, and the best one picked based on the parameters described in the “Assembly Quality Assessment” section.
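Below is a hedged Python sketch of this subsample-and-assemble loop: it randomly samples a fixed number of read pairs from a paired FASTQ library and launches SPAdes on each subsample. File names, the subsample size, and the number of subsamples are placeholders; in practice, dedicated tools such as seqtk are commonly used for the sampling step.

```python
# Sketch: subsample paired FASTQ files and run SPAdes on each subsample.
# File names and sizes are placeholders; loads whole files into memory,
# which is fine only for modestly sized libraries.
import random
import subprocess

def read_fastq(path):
    """Return a list of 4-line FASTQ records."""
    with open(path) as fh:
        lines = fh.read().splitlines()
    return [lines[i:i + 4] for i in range(0, len(lines), 4)]

def write_fastq(records, path):
    with open(path, "w") as fh:
        for rec in records:
            fh.write("\n".join(rec) + "\n")

r1 = read_fastq("trimmed_1.fastq")
r2 = read_fastq("trimmed_2.fastq")
n_per_subsample = 50_000
n_subsamples = 3

for i in range(n_subsamples):
    random.seed(i)                                        # reproducible subsamples
    idx = random.sample(range(len(r1)), n_per_subsample)  # sample pair indices
    write_fastq([r1[j] for j in idx], f"sub{i}_1.fastq")
    write_fastq([r2[j] for j in idx], f"sub{i}_2.fastq")  # same indices keep pairs together
    subprocess.run(["spades.py",
                    "-1", f"sub{i}_1.fastq",
                    "-2", f"sub{i}_2.fastq",
                    "-o", f"spades_sub{i}"], check=True)
```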

The ideal result is a single assembly (one consensus sequence) per dataset, with significant coverage of 50x–100x and an assembly size consistent with a related strain of the same species. Getting only one assembly means all the sequenced reads in that dataset were used to produce one consensus sequence, which is the best outcome. Be careful about running the assembler on all reads at once: it will still run and produce assemblies (possibly even a single consensus sequence), which could wrongly suggest a perfect assembly. Hence, as mentioned above, it is advisable to run assemblers on subsampled datasets to get better results.

4. Assembly Quality Assessment

Like any experimental or simulation result, de novo assembled genomes need to be rigorously validated. QUAST is a widely used tool for assessing the quality and accuracy of an assembled genome. QUAST takes the FASTQ reads and the assembled sequences in FASTA format as input, then computes coverage, assembly size, the N50 value, and the percentage of reads that map back to the assembled genome. The number of sequenced reads that map back to the assembly is considered an important validation of its accuracy; generally, if about 95% of the reads map back to the consensus sequence, you have a good genome assembly.
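Here is a minimal sketch of a QUAST run invoked from Python. The assembly, reference, and read file names are placeholders, and the read-mapping options (--pe1/--pe2) and the reference option (-r) should be checked against the QUAST version you have installed.

```python
# Sketch: assess an assembly with QUAST. File names are placeholders;
# --pe1/--pe2 enable mapping the reads back to the assembly in recent
# QUAST versions (verify the options for your installed version).
import subprocess

subprocess.run([
    "quast.py", "spades_sub0/contigs.fasta",
    "-r", "related_strain.fasta",      # optional reference genome of a related strain
    "--pe1", "sub0_1.fastq",
    "--pe2", "sub0_2.fastq",
    "-o", "quast_report",
], check=True)
```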

Coverage

Genome coverage is the percentage of the genome that is contained in the assembly relative to the estimated genome size. Likewise, gene coverage refers to the percentage of genes that are covered by the assembled genome. Ideally, the genome coverage should be about 90–95% for the best assemblies in the QUAST assessment output. However, repetitive sequences may lower this number because they are much more challenging to cover in a genome assembly.

N50

N50 is a measure that describes the contiguity of a genome assembly. To understand the concept, assume your assembly consists of 7 contigs. The contigs are put in order of sequence length, i.e. the longest comes first, followed by the second-longest, with the shortest at the end. Now add the lengths of the contigs starting from the beginning (the longest plus the second-longest, and so on) until the running total reaches 50% of the total assembly length. The length of the contig at which you stopped counting is the N50 value. In other words, N50 is the length of the shortest contig in the set that, together with all longer contigs, covers 50% of the assembly. Generally, the larger the N50 value, the better the assembly. If your N50 is too short, you should perform additional sequencing to improve the assembly before moving on to gene prediction and annotation.
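The N50 calculation is easy to express directly. Here is a minimal Python sketch that computes it from a list of contig lengths; the lengths shown are illustrative only.

```python
# Compute N50 from contig lengths: sort contigs longest-first, accumulate
# lengths, and report the length of the contig that pushes the running
# total past 50% of the total assembly size.
def n50(contig_lengths):
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length

if __name__ == "__main__":
    lengths = [900_000, 600_000, 400_000, 250_000, 100_000, 50_000, 20_000]  # illustrative
    print("N50 =", n50(lengths))   # 600000 for this example
```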

Assembly size and number

Ideally, a single assembly from a dataset of sequenced reads is the best result; however, the other quality-assessment factors should also be taken into consideration rather than relying only on the number of assemblies the assembler produces. Additionally, if the assembled sequence is of a size similar to that of a related strain of the same species, that is a good indication of an efficient assembly. Nevertheless, you should check the reliability of the result by mapping the reads back to the assembled sequence (for example with QUAST); if about 90–95% of the reads map, the result is viable.

Gaps

Gaps are uncovered regions within the assembly. Generally, the fewer gaps, the better: a higher number or percentage of gaps reduces assembly quality.

5. Cross-validation

Although SPAdes does a significant job of generating quality assemblies, it is highly recommended to cross-validate the result with other robust DBG assemblers such as ABySS, SOAPdenovo, and Velvet. A strong consensus among the assemblies obtained from different assemblers – in terms of assembly size, coverage, and N50 value – lends legitimacy to the result and allows it to be used confidently in downstream analyses such as gene prediction and annotation (let’s keep that topic for the next blog!).

In this article, we have covered basic yet crucial concepts of genome assembly, although the assembly steps may vary to some extent depending on the organism under study. The stated steps and concepts are not specific to prokaryotic genome assemblies and can also be used for eukaryotic genome assemblies. However, due to genomic complexities such as larger genome size, a higher percentage of repeats, and greater heterozygosity, eukaryotic genome assembly poses significant challenges. Obtaining an accurate assembly requires diligence in the approach, as described in this article. Nonetheless, with the right resources, concepts, and validation, high-quality assemblies are certainly possible.

To learn more about genome assembly from NGS data, and to get access to all of our advanced materials including 20 training videos, presentations, workbooks, and private group membership, get on the Expert Sequencing wait list.


ABOUT DEEPAK KUMAR, PHD

GENOMICS SOFTWARE APPLICATION ENGINEER

Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field - its intricate concepts, and insights on sequenced data computational analyses. He holds diverse professional experience in Bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.

