Structural Variant Calling From NGS Data

Single Nucleotide Variants (SNVs) have long been considered the main source of genetic variation, so precisely identifying them is a critical part of the Next Generation Sequencing (NGS) workflow. However, in a report from 2004, the authors identified another form of variation called Structural Variants (SVs), which are genetic alterations of 50 or more base pairs that result in duplications, deletions, insertions, inversions, and translocations in the genome. The changes in DNA organization resulting from these SVs have been shown to be responsible for both phenotypic variation and a variety of pathological conditions. While the average variation between two individuals, as measured by SNVs, is about 0.1%, it rises to approximately 1.5% when SVs are included. Therefore, understanding how to identify and characterize SVs using NGS data is critical.

Detection techniques for SVs have been based on array comparative genomic hybridization (aCGH), which is very useful for detecting Copy Number Variations (CNVs), and on single-nucleotide polymorphism arrays. NGS has enabled methods that precisely define the breakpoints of SVs of different sizes and types. However, due to the volume of data and the short sequence reads obtained from high-throughput sequencing, the classification of SVs presents significant challenges.

Deletions, duplications, inversions, and translocations are the different types of SVs that occur due to chromosomal rearrangements. Copy number variations (CNVs) are a particular subtype of SVs, mainly represented by deletions and duplications. Using the coverage method, deletion and duplication CNVs can be easily detected from NGS sequence data. Inversions, however, are copy number-neutral, so the coverage method cannot be used to identify them. A translocation is the exchange of genetic material between two non-homologous chromosomes. There are two main types of translocations: reciprocal and Robertsonian. In a reciprocal translocation, two different chromosomes exchange segments with each other, whereas in a Robertsonian translocation, an entire chromosome becomes attached to another at the centromere.
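To make the coverage method concrete, here is a minimal Python sketch, assuming per-base read depths are already available as a list. The function name, the window size, and the 0.5/1.5 ratio thresholds are illustrative assumptions, not values taken from any particular tool.

```python
from statistics import median

def call_cnv_windows(depths, window=100, del_ratio=0.5, dup_ratio=1.5):
    """Flag windows whose mean depth deviates from the median window depth.

    depths: per-base read depth along one chromosome.
    Returns (start, end, label) tuples, 0-based and end-exclusive.
    """
    # Mean depth per non-overlapping window (a trailing partial window is ignored).
    means = [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]
    if not means:
        return []
    baseline = median(means)
    if baseline == 0:
        return []
    calls = []
    for idx, mean_depth in enumerate(means):
        ratio = mean_depth / baseline
        if ratio <= del_ratio:
            label = "DEL"
        elif ratio >= dup_ratio:
            label = "DUP"
        else:
            continue
        start = idx * window
        # Merge with the previous call when it is adjacent and has the same label.
        if calls and calls[-1][2] == label and calls[-1][1] == start:
            calls[-1] = (calls[-1][0], start + window, label)
        else:
            calls.append((start, start + window, label))
    return calls

# Toy depth profile: a half-depth stretch and a double-depth stretch.
depths = [30] * 300 + [15] * 200 + [30] * 300 + [60] * 100 + [30] * 100
calls = call_cnv_windows(depths)
print(calls)  # [(300, 500, 'DEL'), (800, 900, 'DUP')]
```

An inversion would leave this depth profile unchanged, which is why the coverage method cannot see it.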

NGS produces millions of reads in a single sequencing run by parallelizing the sequencing process. Consequently, it has contributed significantly to the detection of Single Nucleotide Polymorphisms (SNPs). However, since NGS produces short reads (50 to 400 base pairs), the longer variants (SVs) remain poorly characterized. Because SVs affect more bases than SNPs, they are considered an important class of genetic variation for interpreting phenotypic variability and disease. As a result, efforts have been made to develop algorithms that address the issue of structural variant calling from NGS data.

With recent technological and methodological developments, long-read sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) are capable of producing reads of several thousand base pairs, even reaching up to 2 Mbp for Oxford Nanopore. Although these methods can generate reads of thousands of base pairs, they have two major disadvantages. First, the cost of sequencing is higher than for short-read sequencing at the same coverage. Second, the sequencing error rate is high (about 8 to 20%), which has to be considered in both the alignment and SV calling steps. In spite of these drawbacks, long reads are advantageous for SV calling because they can span repetitive regions, improve the mapping, and capture large SVs better than short reads.

As we can see, both short-read and long-read sequence data come with their own advantages and disadvantages for SV detection. Hence, bioinformaticians and researchers have developed algorithms suited to each type of sequence read so that SVs can be efficiently determined and characterized. Here, I mention a few of the important algorithms/tools, and their methodologies, that are widely used for SV and CNV detection.

Algorithms for SV detection depend on various properties of the underlying sequence data and differ in accuracy and sensitivity. These algorithms follow one or more methods that can be divided into categories: 

(1) read depth (RD)

(2) paired-end (PE)

(3) split reads (SR)

(4) de novo assembly (AS) 

The most effective approach for the identification of SV depends on the size and the type of the variant as well as the sequencing data characteristics. It is recommended to use multiple variant callers in the SV detection phase.
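One simple way to act on the recommendation to use multiple callers is to keep only calls that at least two callers agree on. The sketch below is a hypothetical illustration: the caller names, the tuple layout, and the 50% reciprocal-overlap rule are assumptions for the example, not the behavior of any specific merging tool.

```python
def reciprocal_overlap(a, b):
    """Overlap length divided by the longer of two (start, end) intervals."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    if overlap <= 0:
        return 0.0
    return overlap / max(a[1] - a[0], b[1] - b[0])

def same_event(x, y, min_overlap):
    """Calls are (chrom, start, end, svtype); match on chrom, type and overlap."""
    return (x[0] == y[0] and x[3] == y[3]
            and reciprocal_overlap(x[1:3], y[1:3]) >= min_overlap)

def consensus_calls(callsets, min_overlap=0.5, min_support=2):
    """Keep one representative per event seen by at least min_support callers."""
    all_calls = [(name, call) for name, calls in callsets.items() for call in calls]
    kept = []
    for caller, call in all_calls:
        # Skip calls that duplicate an event we have already reported.
        if any(same_event(call, rep, min_overlap) for rep, _ in kept):
            continue
        supporters = {caller}
        for other_caller, other in all_calls:
            if other_caller != caller and same_event(call, other, min_overlap):
                supporters.add(other_caller)
        if len(supporters) >= min_support:
            kept.append((call, sorted(supporters)))
    return kept

# Hypothetical call sets from three callers.
callsets = {
    "caller_a": [("chr1", 1000, 2000, "DEL"), ("chr2", 500, 900, "DUP")],
    "caller_b": [("chr1", 1100, 2050, "DEL")],
    "caller_c": [("chr1", 5000, 5200, "INV"), ("chr1", 990, 1990, "DEL")],
}
consensus = consensus_calls(callsets)
print(consensus)
```

Here only the chr1 deletion survives, because it is the only event reported by more than one caller.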

Variant callers take in a BAM/FASTQ file and return a list of variants. Some of the widely used tools for SVs, SNPs, and indels include:

Annotation of the identified variants can be done using tools such as:

There are other comprehensive pipelines to systematically handle a large number of variants, which include:

With this information, let’s turn our attention to some of the significant and widely implemented algorithms currently being used for SV calling.

1. De novo assembly-based algorithm

I have explained the essential concepts in genome assembly in detail in one of my previous blogs. Here, we will concentrate on understanding how de novo assembled sequences can be used to detect SVs/CNVs. In this method, the assembled sequences are aligned to the reference genome or to other assemblies, and the differences between the two are determined. By comparing genomic positions in the reference and the assembled sequence, different types of variations and discontinuity patterns are identified. It is to be noted that the alignment of such large sequences is not a computationally trivial task. Conventional matrix-based alignment approaches can lead to out-of-memory errors, because the alignment matrix grows with the product of the two sequence lengths. This kind of growth in resource use as the input size increases is what Big O notation describes. To circumvent the issue, a few aligners have been developed, such as:

Aligners such as MUMmer use a suffix tree data structure, which makes them among the fastest and most efficient systems available for this task and allows them to be applied to very long sequences. Some efficient applications using these aligners are:

These applications are used on previously assembled contigs or scaffolds, i.e., the sequence data has already been assembled, and only the alignment needs to be performed to detect the SVs.
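A rough back-of-the-envelope calculation shows why a full matrix-based alignment is impractical for assembled sequences. The sketch below assumes a plain dynamic-programming matrix with one 4-byte score per cell; both the function name and the cell size are illustrative assumptions.

```python
def dp_matrix_bytes(len_a, len_b, bytes_per_cell=4):
    """Memory for a full (len_a + 1) x (len_b + 1) alignment score matrix."""
    return (len_a + 1) * (len_b + 1) * bytes_per_cell

for label, length in [("10 kbp vs 10 kbp", 10_000),
                      ("1 Mbp vs 1 Mbp", 1_000_000),
                      ("100 Mbp vs 100 Mbp", 100_000_000)]:
    gib = dp_matrix_bytes(length, length) / 2**30
    print(f"{label}: about {gib:,.1f} GiB")
```

Two 10 kbp sequences already need roughly 400 MB, and two 1 Mbp contigs need about 4 TB, so memory grows quadratically while the sequences grow only linearly.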

When SVs have to be detected from unassembled sequence data, de Bruijn graph (refer to blog) and string graph-based approaches are widely used, as they leverage the sequence read information directly. Cortex is one such method: it uses short-read sequencing data to assemble several genomes using the de Bruijn graph approach and to infer SVs and complex combinations of indels, SNVs, and rearrangements.
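The idea of finding variation directly in a de Bruijn graph can be sketched in a few lines of Python. This is a toy illustration and not Cortex's actual implementation: it builds a graph from two hypothetical reads that differ at one base and reports the nodes where the graph branches, which is where the two alleles diverge.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def branch_nodes(graph):
    """Nodes with more than one successor mark the start of a divergence (bubble)."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)

# Two hypothetical reads carrying different alleles at the fifth base.
reads = ["ACGTTGCA", "ACGTAGCA"]
graph = build_de_bruijn(reads, k=4)
branches = branch_nodes(graph)
print(branches)            # ['CGT']
print(sorted(graph["CGT"]))  # ['GTA', 'GTT']
```

Both branches rejoin at the node GCA, forming the "bubble" shape that graph-based callers look for. Real tools also have to handle sequencing errors, repeats, and much larger k.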

SGVar is another method that uses the string graph-based approach to assemble short-read sequence data. SGVar executes stringent assembly processing by taking read length and read quality into account. Additionally, it requires a perfect match of the sequence reads to be merged, which improves the assembly quality. Due to these reasons, SGVar has been shown to outperform Cortex for the identification of insertions and deletions. 

Compared with other methods, the key strength of the de novo assembly-based approach lies in detecting larger insertions (up to 3+ kbp). However, the absence of haplotype representation is one significant problem. Consequently, heterozygous SVs are often missed because a de novo assembly represents one haplotype. Nevertheless, there are methods to counter this problem, such as:

Having said that, the de novo assembly approach should be used for a small number of samples and for studying organisms that do not have a reference genome available. 

2. Paired-end alignment approach

Paired-end sequence reads typically map in the opposite direction (refer to blog) and within a certain distance of each other, for example, 500 base pairs. This very characteristic of paired-end reads allows the detection of different types of structural variants. In this approach, variations are detected by comparing the expected insert size between the paired-end reads and the observed insert size in the alignment. In the presence of SVs, these reads are abnormally oriented and spaced; reads map at inconsistent distances in the alignment denoting insertions, deletions, or inversions. Discordantly mapped paired reads can be:

(a) further apart than expected 

(b) closer together than expected 

(c) in incorrect order 

(d) in inverse orientation 

(e) on different chromosomes.
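The five categories above can be expressed as a small classification function. This is a simplified sketch under stated assumptions: pairs are expected in forward/reverse orientation, the distance is measured between the leftmost mapping positions of the two reads, and the 500 bp expected distance and 100 bp tolerance are illustrative values. The function and label names are hypothetical.

```python
def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2,
                  expected=500, tolerance=100):
    """Label a read pair as concordant or as one of the discordant categories."""
    if chrom1 != chrom2:
        return "different_chromosomes"   # (e) commonly read as a translocation hint
    if strand1 == strand2:
        return "inverse_orientation"     # (d) commonly read as an inversion hint
    # Order the two reads by position so the leftmost one comes first.
    (left_pos, left_strand), (right_pos, _) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand == "-":
        return "incorrect_order"         # (c) reads point away from each other
    distance = right_pos - left_pos
    if distance > expected + tolerance:
        return "further_apart"           # (a) commonly read as a deletion hint
    if distance < expected - tolerance:
        return "closer_together"         # (b) commonly read as an insertion hint
    return "concordant"

print(classify_pair("chr1", 1000, "+", "chr1", 1480, "-"))  # concordant
print(classify_pair("chr1", 1000, "+", "chr1", 3000, "-"))  # further_apart
print(classify_pair("chr1", 1000, "+", "chr5", 1500, "-"))  # different_chromosomes
```

In practice, a caller looks for clusters of pairs with the same label at the same locus, since a single discordant pair is often just a mapping artifact.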

Over the years, quite a few short-read-based methods have been introduced. The methods for detecting SVs from short reads differ in the type of information they exploit. BreakDancer is one of the important short-read methods; it classifies read pairs as normal or SV-supporting depending on the orientation and mapping distance between a read and its mate.

The paired-end approach can detect breakpoints (Figure 1), but the resolution of those breakpoints depends on the insert size and the read coverage. Where finer resolution is needed, split reads can be used to detect SVs with single-base-pair resolution. A split read contains the breakpoint of the structural variant, so its alignment to the reference genome is split into two parts, as shown in Figure 1. Because the parts of a split read are aligned to the reference genome independently, the reads should be long enough for each part to align uniquely.

DELLY is a robust method that integrates split-read analysis with a search for abnormal distances and orientations among read pairs. This increases the accuracy of breakpoint prediction and enables the detection of smaller deletions (20+ bp); however, larger events remain hard to distinguish from mapping artifacts. To overcome this issue, coverage information can be taken into consideration to further improve SV detection.
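How a split read pins a deletion down to the base pair can be illustrated with a toy example. In this sketch, exact string matching stands in for a real aligner, the sequences are made up, and the function name and `min_part` parameter are hypothetical. It is not how DELLY or any other tool is implemented.

```python
def split_read_deletion(reference, read, min_part=5):
    """Find a deletion by aligning the two parts of a read independently.

    Returns (start, end) of the deleted reference interval, 0-based and
    end-exclusive, or None if no split supports a deletion.
    """
    for split in range(min_part, len(read) - min_part + 1):
        left, right = read[:split], read[split:]
        left_pos = reference.find(left)
        if left_pos == -1:
            continue
        left_end = left_pos + split
        # The right part must map downstream of where the left part ends.
        right_pos = reference.find(right, left_end)
        if right_pos > left_end:
            return (left_end, right_pos)
    return None

left_flank = "ACCATGCAGG"
deleted = "TTTTTCCCCC"
right_flank = "GATTACGGAG"
reference = left_flank + deleted + right_flank

# A read from a sample in which the middle segment is missing.
read = left_flank[-8:] + right_flank[:8]
breakpoints = split_read_deletion(reference, read)
print(breakpoints)                                   # (10, 20)
print(reference[breakpoints[0]:breakpoints[1]])      # TTTTTCCCCC
```

The `min_part` check reflects the point made above: each part must be long enough to map on its own, otherwise the split cannot be placed reliably.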

LUMPY and Manta use all three signals (read depth/coverage, paired-end read discordance, and split reads) in their analysis. Furthermore, to detect more complex events, such as a tandem duplication where the second copy is inverted, methods such as TARDIS can be used.

Figure 1: A deletion relative to the reference genome, detected by a split read.

3. Hybrid/Multi-method structural variant calling approach

The de novo assembly and paired-end approaches each specialize in specific types of SVs and do not identify all types and sizes. They rely on different properties of the underlying sequence data and vary in sensitivity and accuracy. Metamethods, or combinations of methods, can be used to detect all SV types. Higher efficacy in variant calling has been reported using metamethods, such as:

Additionally, in the hybrid approach, usually, the SV calling process utilizes multiple variant callers to overcome the limitations of individual approaches. Knowing the advantages and drawbacks of various tools is imperative to make proper decisions when designing NGS SV calling data analysis pipelines. Some of the robust pipelines that integrate different structural variant callers are:

Concluding Remarks

In conclusion, SVs are gaining attention in evolutionary, population, and clinical genomics. This class of variants is increasingly being recognized as a significant source of genotypic and phenotypic variation. However, before SV calling becomes routine in clinical diagnosis, challenges such as the efficient detection of variants and correct genotyping from the results of computational methods/algorithms must be overcome. Another factor that complicates SV detection is the very nature of these variants: their differences in size and type make it much more difficult to assess their frequency using reference databases such as gnomAD/ExAC, whereas frequency assessment for SNVs is easier. As a result, it is arduous to determine whether an SV occurs at a low frequency (< 0.5%) in the population and could be considered a candidate for pathogenicity. Furthermore, the lack of format standardization and metadata information has proved to be an issue in clinical applications, where quality assurance and diagnostic certification are a must.

Overall, SV calling is on the verge of gaining importance thanks to rapidly evolving calling methods; however, the lack of benchmarks and reference databases calls for careful interpretation of these variants. With the fiercely competitive market of sequencing methodologies and their providers, and the increasing need for SV characterization, it will not take long for structural variant analysis to become a standard part of the routine in clinical labs.

ABOUT DEEPAK KUMAR, PHD

GENOMICS SOFTWARE APPLICATION ENGINEER

Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field - its intricate concepts, and insights on sequenced data computational analyses. He holds diverse professional experience in Bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.
