How To Do Variant Calling From RNASeq NGS Data

Developing variant calling and analysis pipelines for NGS-sequenced data has become the norm in clinical labs. These pipelines strategically integrate several tools and techniques to identify molecular and structural variants, which eventually enables apt variant annotation and interpretation. This blog will delve into the concepts and intricacies of developing a “variant calling” pipeline using GATK. “Variant calling” can also be performed using tools other than GATK, such as FreeBayes and SAMtools.

In this blog, I will walk you through variant calling methods on Illumina germline RNASeq data. Wherever the handling of DNASeq and RNASeq data differs, I will point out the difference in the tool or method used. The sequenced data used in this example is random paired-end (R1, R2) germline RNASeq data.

Data Preparation

First, assign the values (fastq files, reference fasta file, SNP, and INDELS known sites) to the variables in the bash script (pipeline). The fastq files contain sequenced reads of the sample/patient under study – obtained as a result of the sequencing run. It is imperative to ascertain that the quality of the extracted and fragmented DNA is high, and the prepared DNA library is uniform and contaminant-free. You can find the detailed dos and don’ts of NGS sequencing and DNA library preparation in one of my blogs.
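As a sketch, the variable block at the top of the pipeline script could look like the following (all file names are placeholders to be replaced with your own paths):

#!/bin/bash
REF=ref.fa                        # reference genome FASTA
FQ1=ABB_40_R1.fastq.gz            # forward reads (R1)
FQ2=ABB_40_R2.fastq.gz            # reverse reads (R2)
KNOWN_SNPS=known_snps.vcf.gz      # known SNP sites (e.g., from dbSNP)
KNOWN_INDELS=known_indels.vcf.gz  # known INDEL sites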

Quality Control

In this step, a quality check of the sequenced data in the FASTQ files is performed. “FastQC” reads a set of FASTQ files and produces quality control reports consisting of different modules representing the quality of the sequenced reads. These reports are useful for identifying and eliminating low-quality data.
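A minimal FastQC invocation could look like this (the report directory name is arbitrary):

mkdir -p fastqc_reports
fastqc -o fastqc_reports ABB_40_R1.fastq.gz ABB_40_R2.fastq.gz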

Trimmomatic can be used to trim low-quality reads and non-biological sequences from the sequenced read data. This is an important step as the quality of the sequence data used for downstream processing significantly affects the variant calling process.
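A hedged example of a paired-end Trimmomatic run, using the adapter file and threshold values from the Trimmomatic manual (the jar version and all cutoffs are illustrative and should be tuned to your own data):

java -jar trimmomatic-0.39.jar PE -phred33 \
  ABB_40_R1.fastq.gz ABB_40_R2.fastq.gz \
  ABB_40_R1_paired.fastq.gz ABB_40_R1_unpaired.fastq.gz \
  ABB_40_R2_paired.fastq.gz ABB_40_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36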

Alignment

For RNASeq reads, it is recommended to use the “STAR” aligner. An important consideration during this step is the choice between the 1-pass and 2-pass mapping methods. In this example, I have used 1-pass mapping alone and did not continue with the 2-pass. The 2-pass mapping method provides the most sensitive detection of novel junctions: it does not increase the number of detected novel junctions, but it allows the detection of more spliced reads mapping to those novel junctions. When using the 2-pass method, the junctions obtained from the 1-pass method are used as “annotated” junctions for the 2-pass mapping method. To get a thorough understanding of 2-pass mapping, it is advised to go through the STAR manual. Moreover, you may go through this discussion forum to formulate an appropriate strategy for the 1-pass and 2-pass mapping methods.

From the STAR alignment, an “SJ.out.tab” file for each sample (in this case paired-end R1 and R2 sample “ABB_40”) will be generated. For instance, in the provided example, after the STAR alignment steps, an “ABB_40SJ.out.tab” will be generated. This tab file consists of all the junctions obtained from the 1-pass mapping step.
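As a sketch, the index-then-align commands for the 1-pass run could look like this (the annotation GTF and the --sjdbOverhang value of read length minus 1 are assumptions to adapt to your data):

# Build the STAR genome index (one-time step)
STAR --runMode genomeGenerate --genomeDir star_index \
  --genomeFastaFiles ref.fa --sjdbGTFfile annotation.gtf --sjdbOverhang 100

# 1-pass alignment; the "ABB_40" prefix yields ABB_40Aligned.out.sam and ABB_40SJ.out.tab
STAR --genomeDir star_index \
  --readFilesIn ABB_40_R1.fastq.gz ABB_40_R2.fastq.gz --readFilesCommand zcat \
  --outFileNamePrefix ABB_40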

For DNASeq data, the alignment step can be performed using the “BWA” tool; as a sketch (file names are placeholders), the indexing and alignment steps would be:
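# Hypothetical file names; -t sets the thread count
bwa index ref.fa
bwa mem -t 8 ref.fa ABB_40_R1.fastq.gz ABB_40_R2.fastq.gz > ABB_40_dna.sam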

Sorting

Sort the Sequence Alignment/Map (SAM) file obtained from the STAR alignment step by genomic coordinates. In this step, we can also convert the SAM file to its Binary Alignment Map (BAM) equivalent. BAM files are binary and require less computational storage than SAM files.
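One way to do this is with Picard’s SortSam (the input name follows from the STAR step above; output names are arbitrary):

java -jar picard.jar SortSam \
  INPUT=ABB_40Aligned.out.sam OUTPUT=sorted_reads.bam SORT_ORDER=coordinate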

An important note to mention here is that SAM/BAM files can be significantly large depending on the number of sequenced reads being aligned. Hence, “sorting” and “indexing” of alignment and reference sequence files are vital steps, since these methods allow faster processing of computationally complex operations on these files.

The indexing of a genome or alignment can be explained by analogy to the index of a book. A book’s index is an alphabetical list of names, subjects, etc., along with the pages on which they are mentioned. Thus, if you want to know on which page a word appears or a chapter begins, you can look it up in the pre-built index rather than going through the whole book. Similarly, in alignments, indices allow the aligners to identify the potential origin of query sequences or reads within the reference genome, which saves time and computational memory.

Alignment Summary

This step details the quality of the read alignments and the proportion of reads passing the machine’s signal-to-noise threshold quality filters. It is to be noted that these quality filters are specific to Illumina data.

In the next step, the “CollectInsertSizeMetrics” tool (available in Picard and GATK) provides useful metrics for validating library construction, including the insert size distribution and the read orientation of paired-end libraries. It produces the percentages of read pairs in three orientations (forward-reverse, reverse-forward, and tandem) as a histogram.
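A sketch of both metrics steps using Picard-style syntax (output names are arbitrary; CollectInsertSizeMetrics also requires a histogram file):

java -jar picard.jar CollectAlignmentSummaryMetrics \
  R=ref.fa I=sorted_reads.bam O=alignment_metrics.txt

java -jar picard.jar CollectInsertSizeMetrics \
  I=sorted_reads.bam O=insert_size_metrics.txt H=insert_size_histogram.pdf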

Mark Duplicate Reads

Duplicate reads are chiefly the result of Polymerase Chain Reaction (PCR) amplification during library preparation. The duplication events arise at different rates as the sequencing experiment proceeds. Since duplicate reads come from the same input DNA template, they can be assumed to share the same start position on the reference in an alignment.

Also, it should be noted that any sequencing error introduced during the PCR reaction will be carried over to all the duplicate reads. Hence, even if we eliminate the duplicate reads by marking or flagging them, the retained representative read will still carry the error. Therefore, it is imperative to make sure that little to no error is present in the DNA library preparation step, as these errors will be propagated to all PCR duplicates.

Figure 1: Sequencing error propagating in duplicates.
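Marking duplicates is typically done with Picard’s MarkDuplicates; a minimal sketch (the metrics file name is arbitrary):

java -jar picard.jar MarkDuplicates \
  I=sorted_reads.bam O=dedup_reads.bam M=duplicate_metrics.txt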

Adding Read Group (RG) tags

In this step, all the reads in the BAM file are assigned to a single new Read Group (RG). Picard and GATK require the presence of at least one RG tag – to which each read can be assigned. In this example, after assigning the reads to a Read Group with arguments “RGID=1 RGLB=lib2 RGPL=illumina RGPU=unit1 RGSM=3”, the “dedup_reads_RG1.bam” will have an entry stating:

@RG     ID:1    LB:lib2 PL:illumina     SM:3    PU:unit1

Where “RGID” refers to the “Read-Group ID”, “RGLB” to the “Read-Group library”, “RGPL” to the “Read-Group platform” (in this example, Illumina), “RGPU” to the “Read-Group platform unit” (e.g., run barcode), and “RGSM” to the “Read-Group sample name”.
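The Picard command producing the header entry above could look like this (input/output names follow the earlier steps):

java -jar picard.jar AddOrReplaceReadGroups \
  I=dedup_reads.bam O=dedup_reads_RG1.bam \
  RGID=1 RGLB=lib2 RGPL=illumina RGPU=unit1 RGSM=3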

Building BAM Index

With reference to the importance of “indexing” stated in the “Sorting” section, in this step the “dedup” or duplicate-read-marked BAM file is indexed to improve the efficiency of the downstream variant identification processes.
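A one-line sketch with Picard’s BuildBamIndex (the .bai index is written next to the BAM by default):

java -jar picard.jar BuildBamIndex I=dedup_reads_RG1.bam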

Create Reference Sequence Dictionary

In this step, a reference genome sequence dictionary is created with a “.dict” extension – required for the downstream processing and analysis tools. The output “.dict” file contains only a header with sequence records but no SAM records.
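For example, with Picard’s CreateSequenceDictionary (the .dict file should sit alongside the reference FASTA):

java -jar picard.jar CreateSequenceDictionary R=ref.fa O=ref.dict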

Reference Sequence Index

For efficient pattern matching and downstream variant call processing, the reference genome sequence is indexed.
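This is a one-liner with samtools, producing ref.fa.fai:

samtools faidx ref.fa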

Knownsites Indexing

“Knownsites” are the already annotated or known variants present in public or commercial variant repositories. Indexing these variant files using “tabix” improves the computational speed and performance of downstream tools and analyses.
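A sketch for one known-sites file (tabix requires the VCF to be bgzip-compressed first; file names are placeholders):

bgzip known_snps.vcf              # produces known_snps.vcf.gz
tabix -p vcf known_snps.vcf.gz    # produces known_snps.vcf.gz.tbi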

Handling Splicing Events in RNASeq Data

Since RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns before handing them to HaplotypeCaller. This is done with GATK’s “SplitNCigarReads” tool, which splits reads with N in the CIGAR string into multiple supplementary alignments and hard-clips mismatching overhangs. By default, this step also reassigns mapping qualities for good alignments to match DNA conventions.

Furthermore, the algorithm identifies all N CIGAR elements and creates k+1 new reads (where k is the number of N CIGAR elements). The first read includes the bases to the left of the first N element, while the part to the right of the N (including the Ns) is hard-clipped.
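A minimal GATK4-style sketch of this step (the input follows from the read-group step above):

gatk SplitNCigarReads -R ref.fa -I dedup_reads_RG1.bam -O split_reads.bam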

Base Quality Recalibration

This is an important step. As part of data pre-processing, it detects systematic errors made by the sequencing machine when it estimates the accuracy of each base call.

The two key steps are:

  1. The BaseRecalibrator is used to build a recalibration model (“recal_data.table” in the provided example) by using all the reads in the input BAM file and the set of known variants (“knownsites”).
  2. In the next step, the ApplyBQSR adjusts the base quality scores in the data based on the recalibration model and produces a new BAM file (“recal_reads.bam” in the provided example).
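A GATK4-style sketch of both steps, reusing the file names from the example (known-sites files must be indexed, as in the tabix step above):

gatk BaseRecalibrator -R ref.fa -I split_reads.bam \
  --known-sites known_snps.vcf.gz --known-sites known_indels.vcf.gz \
  -O recal_data.table

gatk ApplyBQSR -R ref.fa -I split_reads.bam \
  --bqsr-recal-file recal_data.table -O recal_reads.bam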

It is recommended to refer here to get further details on the recalibration step.

Variant Calling

In this example, we are performing variant calls on germline RNASeq data; hence, the “HaplotypeCaller” tool is used. Please note that for somatic (cancer) variant discovery, “Mutect2” can be used.

In this step, the program first determines the active regions based on the presence of evidence for variation. Next, for each active region, it builds a De Bruijn-like graph to reassemble the region and identify the haplotypes (a group of alleles in an organism that is inherited together from a single parent) in the data. Afterwards, realignment of each haplotype is performed against the reference haplotype to determine potentially variant sites. Further, for each active region, a pairwise alignment of each read against each haplotype is performed using the PairHMM algorithm – to produce a matrix of likelihoods of haplotypes. These likelihoods are then marginalized to acquire likelihoods of alleles for each potential variant site given the read data. In the end, Bayes’ rule is applied for each potential variant site using the likelihoods of alleles to calculate the likelihoods of each genotype per sample given the read data observed for that sample; eventually, the most likely genotype is assigned to the sample. Finally, the called variants with their genotype are output as a Variant Call Format (VCF) file (“raw_variants.vcf” in this example).
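A minimal germline call on the recalibrated RNASeq BAM could look like this (the soft-clip flag follows GATK’s RNASeq guidance; adjust parameters to your data):

gatk HaplotypeCaller -R ref.fa -I recal_reads.bam \
  --dont-use-soft-clipped-bases \
  -O raw_variants.vcf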

Variant Selection

Selection of called variants is performed using the “SelectVariants” tool of GATK. Different options can be used for selecting subsets of variants from a larger variant callset. To get a complete understanding of these options, it is recommended to refer here.
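For instance, splitting the raw callset into SNPs and indels (a common subset selection; file names follow the previous step):

gatk SelectVariants -R ref.fa -V raw_variants.vcf \
  --select-type-to-include SNP -O raw_snps.vcf

gatk SelectVariants -R ref.fa -V raw_variants.vcf \
  --select-type-to-include INDEL -O raw_indels.vcf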

Variant Filtration

In every VCF file, there are 8 fixed, mandatory columns (#CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). When genotype data are present, these are followed by a FORMAT column and one column per patient/sample. It is important to understand the standards of the VCF format before using the VCF filtering tools.

The variants called in the previous step are hard-filtered based on the INFO and/or FORMAT annotations of the VCF file. In the example, the values chosen for filters are case-specific. The filter values may vary depending on the type of experiment and the sequence data. 
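A hard-filtering sketch on the SNP subset; the QD and FS cutoffs below are illustrative only, per the note above:

gatk VariantFiltration -R ref.fa -V raw_snps.vcf \
  --filter-name "QD2" --filter-expression "QD < 2.0" \
  --filter-name "FS60" --filter-expression "FS > 60.0" \
  -O filtered_snps.vcf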

Building Database For Variant Annotation

Once the variants are filtered, the next step is to annotate them. For this step, a local database is created from the UCSC RefSeq table. More information on database creation for snpEff can be found here: https://pcingola.github.io/SnpEff/se_buildingdb/
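As a sketch, assuming snpEff.config already contains an entry for the custom genome (here called hg38_refseq, a placeholder name) and the RefSeq table dump plus genome FASTA are placed under data/hg38_refseq/ as described in the link above:

java -jar snpEff.jar build -refSeq -v hg38_refseq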

Variant Annotation

Once the local database is created, we annotate the variants using the “snpEff” tool.
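Annotation against the local database built above (hg38_refseq is the placeholder database name from the previous step; snpEff also writes an HTML summary report by default):

java -Xmx8g -jar snpEff.jar hg38_refseq filtered_snps.vcf > annotated_variants.vcf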

Variant Callset Quality Control

Usually, after the variants are called and annotated, we move on to downstream analyses; however, it is recommended to first assess the quality of the called variants using evaluation criteria such as the number of SNPs and indels, the indel ratio, and the Ti/Tv ratio.

The typical methods for variant evaluation are Sanger sequencing of regions surrounding putative variants, and evaluating concordance against results obtained from a genotyping chip run on the same samples. However, both of these methods have drawbacks. Sanger sequencing is the least scalable, as it could prove costly and time-consuming to apply to an entire variant callset; and although concordance evaluation is much more scalable than Sanger sequencing, it only covers the subset of known variants that the chip was designed for.

Therefore, the third method mentioned in this step could be used to evaluate how the variants stack up against the gold standards or the already annotated truth set of variants in repositories such as dbSNP. To get a better and comprehensive understanding of how these evaluation methods work, I recommend you refer here.
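One way to compute these metrics is Picard’s CollectVariantCallingMetrics against a dbSNP truth set (a sketch; both the input and dbSNP VCFs should be indexed, and the OUTPUT value is a base name for the metrics files):

java -jar picard.jar CollectVariantCallingMetrics \
  INPUT=annotated_variants.vcf OUTPUT=callset_metrics DBSNP=known_snps.vcf.gz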

Extracting Variants Of Interest

The final step after variant quality evaluation could be to filter large genomic datasets in order to find the most significant variants for your experiment. This can be performed using “SnpSift”. In the example below, we extract all the “missense variants” from the annotated VCF file.
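A sketch of the SnpSift filter expression for missense calls (the ANN field assumes snpEff-style annotations from the previous step):

cat annotated_variants.vcf | java -jar SnpSift.jar filter \
  "ANN[*].EFFECT has 'missense_variant'" > missense_variants.vcf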

And, if you want to extract a set of chromosomal regions from the VCF file to be analyzed further, you can use “tabix”. In this example, I am extracting all SNPs belonging to a region between genomic coordinates 30000000 and 40000000 of chromosome 1. Please note that to perform this step, the VCF file should be “bgzipped” and “tabixed” as shown in “Step 20”.
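A sketch of the compress-index-extract sequence (the -h flag keeps the VCF header in the output; the contig name, chr1 versus 1, must match your reference):

bgzip annotated_variants.vcf
tabix -p vcf annotated_variants.vcf.gz
tabix -h annotated_variants.vcf.gz chr1:30000000-40000000 > chr1_region_variants.vcf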

Concluding Remarks

In this blog, I have covered SNP and Indel variant calling and its related concepts. For Copy Number Variant (CNV) calling, I recommend you refer to one of my previous blogs. Furthermore, I would like to emphasize the importance of manual curation of identified and annotated variants using genomics viewers such as IGV – to detect and eliminate false positives in the germline and somatic data.

Although I have covered the vital steps in variant calling, it is to be noted that additional steps may be required depending on the experimental data used and the ultimate objective of the sequencing project. The end goal of most of these sequencing projects is apt and accurate variant identification and interpretation that can support clinical diagnoses.

To learn more about gene prediction and how NGS can assist you, and to get access to all of our advanced materials including 20 training videos, presentations, workbooks, and private group membership, get on the Expert Sequencing wait list.

Deepak Kumar, PhD – Genomics Software Application Engineer

Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field - its intricate concepts, and insights on sequenced data computational analyses. He holds diverse professional experience in Bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.
