How To Do Variant Calling From RNASeq NGS Data

Developing variant calling and analysis pipelines for NGS sequenced data have become a norm in clinical labs. These pipelines include a strategic integration of several tools and techniques to identify molecular and structural variants. That eventually helps in the apt variant annotation and interpretation. This blog will delve into the concepts and intricacies of developing a “variant calling” pipeline using GATK. “Variant calling” can also be performed using tools other than GATK, such as FREEBAYES and SAMTOOLS

In this blog, I will walk you through variant calling methods on Illumina germline RNASeq data. In the steps, wherever required, I will mention the difference in the tool used or the method for handling  DNASeq and RNASeq data. The sequenced data used in this example is a random paired-end (R1, R2) germline RNASeq data. 

Data Preparation

First, assign the values (fastq files, reference fasta file, SNP, and INDELS known sites) to the variables in the bash script (pipeline). The fastq files contain sequenced reads of the sample/patient under study – obtained as a result of the sequencing run. It is imperative to ascertain that the quality of the extracted and fragmented DNA is high, and the prepared DNA library is uniform and contaminant-free. You can find the detailed dos and don’ts of NGS sequencing and DNA library preparation in one of my blogs.

Quality Control

In this step, a quality check of the sequenced data in the Fastq files is performed. “FastQC” reads a set of Fastq files and produces quality control reports consisting of different modules – representing the sequence reads’ quality. These reports are useful in determining and eliminating the low-quality data.

Trimmomatic can be used to trim low-quality reads and non-biological sequences from the sequenced read data. This is an important step as the quality of the sequence data used for downstream processing significantly affects the variant calling process.

Alignment

For RNASeq reads, it is recommended to use the “STAR” aligner. An important consideration during this step is the use of the 2-pass mapping method. In this example, I have used the 1-pass mapping alone and did not continue with the 2-pass. The two-pass mapping method is useful in detecting the most sensitive novel junctions. This method does not increase the number of detected novel junctions but allows the detection of splices reads mapping to novel junctions. When using the 2-pass method, the junctions obtained from the 1-pass method are used as “annotated” junctions for the 2-pass mapping method. To get a thorough understanding of 2-pass mapping, it is advised to go through the STAR manual. Moreover, you may go through this discussion forum to formulate an appropriate strategy for 1-pass and 2-pass mapping methods.

From the STAR alignment, an “SJ.out.tab” file for each sample (in this case paired-end R1 and R2 sample “ABB_40”) will be generated. For instance, in the provided example, after the STAR alignment steps, an “ABB_40SJ.out.tab” will be generated. This tab file consists of all the junctions obtained from the 1-pass mapping step.

For DNAseq data, the alignment step can be performed by using the “BWA” tool, and the indexing and alignment steps would be:

Sorting

Sort the Sequence Alignment/Map format (SAM) file obtained as an output from the STAR alignment step based on genomic coordinates. In this step, we can also convert the SAM file to a Binary Alignment Map format (BAM) file. BAM files are binary files and require less computational storage compared to the SAM files.

An important note to mention here is that SAM/BAM files can be significantly large depending on the number of sequenced reads being aligned; hence, “sorting” and “indexing” of alignment and reference sequence files are vital steps since these methods allow for the faster processing of computationally complex operations on these files. 

The indexing of a genome or alignment can be explained similarly to indexing a book. Indexing in a book refers to an alphabetical list of names, subjects, etc. that belong to the pages on which they are mentioned. Thus, if one wants to know on which page a word appears or a chapter begins, then he/she can look up in the pre-built index rather than going through the whole book to find that word or chapter. Similarly, in alignments, indices allow the aligners to identify the potential origin of the query sequences or reads within the reference genome, which helps to save time and computational memory.

Alignment Summary

This step is useful in detailing the quality of the read alignments and the proportion of the machine signal-to-noise threshold quality filters. It is to be noted that these quality filters are specific to Illumina data.

In the next step, the “CollectInsertSizeMetrics” parameter of GATK provides useful metrics for validating library construction including the insert size distribution and read orientation of paired-end libraries. It produces the percentages of read pairs in three orientations (Forward Reverse, Reverse Forward, and TANDEM) as a histogram. 

Mark Duplicate Reads

Duplicate reads are the result of Polymerase Chain Reaction (PCR) in sequencers. The duplication events arise at different rates as the sequencing experiment proceeds. The duplicate reads might come from the same input DNA template; thus, it can be assumed that they have the same start position on the reference in an alignment.

Also, it should  be noted that any sequencing error during PCR reaction will be carried to all the duplicate reads. Hence, even if we eliminate the duplicate reads by marking or flagging them, the left representative read (after removing the duplicates) will still carry the sequencing error. Therefore, it is imperative to make sure that little to no errors are present in the DNA library preparation step, as these errors will be propagated to all PCR duplicates.

Figure 1: Sequencing error propagating in duplicates.

Adding Read Group (RG) tags

In this step, all the reads in the BAM file are assigned to a single new Read Group (RG). Picard and GATK require the presence of at least one RG tag – to which each read can be assigned. In this example, after assigning the reads to a Read Group with arguments “RGID=1 RGLB=lib2 RGPL=illumina RGPU=unit1 RGSM=3”, the “dedup_reads_RG1.bam” will have an entry stating:

@RG     ID:1    LB:lib2 PL:illumina     SM:3    PU:unit1

Where “RGID” refers to the “Read Group ID”, “RGLB” refers to “Read-Group library”, “RGPL” refers to “Read-Group platform” (in this example the platform is Illumina), “RGPU” refers to “Read-Group platform unit” (eg. run barcode), and “RGSM” refers to “Read-Group sample name”. 

Building BAM Index

With reference to the importance of “indexing” stated in the “Sorting” section – in this step – the “dedup” or “duplicate-read-marked” BAM file is indexed to improve the efficiency of downstream variant identification processes.  

Create Reference Sequence Dictionary

In this step, a reference genome sequence dictionary is created with a “.dict” extension – required for the downstream processing and analysis tools. The output “.dict” file contains only a header with sequence records but no SAM records.

Reference Sequence Index

For efficient pattern matching and downstream variant call processing, the reference genome sequence is indexed.

Knownsites Indexing

“Knownsites” are the already annotated or known variants present in the public or commercial variant repositories. Indexing these variants using “tabix“ improves the computational speed and performance for downstream tools and analysis. 

Handling Splicing Events in RNASeq Data

Since RNA aligners have different conventions than DNA aligners, we need to reformat some of the alignments that span introns for HaplotypeCaller. This method splits reads with N in the cigar into multiple supplementary alignments and hard clips mismatching overhangs. By default, this step also reassigns mapping qualities for good alignments to match DNA conventions.

Furthermore, the algorithm Identifies all N cigars and creates k+1 new reads (where k is the number of N cigar elements). The first read includes the bases to the left of the first N element and the part to the right of the N (including the Ns) is hard clipped.

Base Quality Recalibration

This is an important step. As a part of the data pre-processing step, it detects systematic errors in the estimation of base call accuracy carried out by the sequencing machine.

The two keys steps are:

  1. The BaseRecalibrator is used to build a recalibration model (“recal_data.table” in the provided example) by using all the reads in the input BAM file and the set of known variants (“knownsites”).
  2. In the next step, the ApplyBQSR adjusts the base quality scores in the data based on the recalibration model and produces a new BAM file (“recal_reads.bam” in the provided example).

 It is recommended to refer here to get further details on the re-calibration step.

Variant Calling

In this example, we are performing variant calls on germline RNASeq data, hence the “HaplotypeCaller” argument is used. Please note that for somatic (cancer) variant discovery “Mutect2” can be used. 

In this step, the program first determines the active regions based on the presence of evidence for variation. Next, for each active region, it builds a De Bruijn-like graph to reassemble the region and identify the haplotypes (a group of alleles in an organism that is inherited together from a single parent) in the data. Afterwards, realignment of each haplotype is performed against the reference haplotype to determine potentially variant sites. Further, for each active region, a pairwise alignment of each read against each haplotype is performed using the PairHMM algorithm – to produce a matrix of likelihoods of haplotypes. These likelihoods are then marginalized to acquire likelihoods of alleles for each potential variant site given the read data. In the end, Bayes’ rule is applied for each potential variant site using the likelihoods of alleles to calculate the likelihoods of each genotype per sample given the read data observed for that sample; eventually, the most likely genotype is assigned to the sample. Finally, the called variants with their genotype are output as a Variant Call Format (VCF) file (“raw_variants.vcf” in this example).

Variant Selection

Selection of called variants is performed using the “SelectVariants” argument of GATK. Different options can be used for selecting subsets of variants from a larger variant callset. To get a complete understanding of these options, it is recommended to refer here.

Variant Filteration

In every VCF file, there are 9 mandatory columns (#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT) followed by the patient/sample column. It is important to understand the standards of VCF format before using the VCF filtering tools. 

The variants called in the previous step are hard-filtered based on the INFO and/or FORMAT annotations of the VCF file. In the example, the values chosen for filters are case-specific. The filter values may vary depending on the type of experiment and the sequence data. 

Building Database For Variant Annotation

Once the variants are filtered, the next step is to annotate them. For this step, a local database is created from the UCSC’s RefSeq table. More information on database creation for snpEFF can be found here: https://pcingola.github.io/SnpEff/se_buildingdb/ 

Variant Annotation

Once the local database is created, we annotate the variants using the “snpEff” tool.

Variant Callset Quality Control

Usually, after the variants are called and annotated, we move to downstream analyses; however, it is recommended to assess the quality of the called variants using the evaluation criteria of Number of Indels & SNPs, Indel Ratio, and TiTv Ratio.

The typical methods for variant evaluation are sanger sequencing of regions surrounding putative variants and evaluating concordance against results obtained from a genotyping chip run on the same samples. However, both of these methods have drawbacks. Sanger sequencing is the least scalable as it could prove to be costly and time-consuming to apply to an entire variant callset; and, although concordance evaluation is much more scalable than Sanger sequencing, it only covers the subset of known variants that the chip was designed for. 

Therefore, the third method mentioned in this step could be used to evaluate how the variants stack up against the gold standards or the already annotated truth set of variants in repositories such as dbSNP. To get a better and comprehensive understanding of how these evaluation methods work, I recommend you refer here.

Extracting variants of Interest

The final step after variant quality evaluation could be to filter large genomic datasets in order to find the most significant variants for your experiment. This can be performed using “snpSift”. In the example below, we are extracting all the “missense variants” from the annotated VCF file.

And, if you want to extract a set of chromosomal regions from the VCF file to be analyzed further, you can use “tabix”. In this example, I am extracting all SNPs belonging to a region between genomic coordinates 30000000 and 40000000 of chromosome 1. Please note that to perform this step, the VCF file should be “bgzipped” and “tabixed” as shown in “Step 20”.

Concluding Remarks

In this blog, I have covered SNP and Indel variant calling and its related concepts. For Copy Number Variant (CNV) calling, I recommend you refer to one of my previous blogs. Furthermore, I would like to emphasize the importance of manual curation of identified and annotated variants using genomics viewers such as IGV – to detect and eliminate false positives in the germline and somatic data.

Although I have covered the vital steps in variant calling, it is to be noted that there may be additional steps required depending on the experimental data used and the ultimate objective of the sequencing project. The end goal of these sequencing projects is mostly apt and accurate variant identification and interpretation that could help in clinical diagnoses.

To learn more about gene prediction and how NGS can assist you, and to get access to all of our advanced materials including 20 training videos, presentations, workbooks, and private group membership, get on the Expert Sequencing wait list.

Join Expert Cytometry's Mastery Class

ABOUT DEEPAK KUMAR, PHD

GENOMICS SOFTWARE APPLICATION ENGINEER

Deepak Kumar is a Genomics Software Application Engineer (Bioinformatics) at Agilent Technologies. He is the founder of the Expert Sequencing Program (ExSeq) at Cheeky Scientist. The ExSeq program provides a holistic understanding of the Next Generation Sequencing (NGS) field - its intricate concepts, and insights on sequenced data computational analyses. He holds diverse professional experience in Bioinformatics and computational biology and is always keen on formulating computational solutions to biological problems.

Deepak Kumar, PhD

Similar Articles

The Power Of Spectral Viewers And Their Use In Full Spectrum Flow Cytometry

The Power Of Spectral Viewers And Their Use In Full Spectrum Flow Cytometry

By: Tim Bushnell, PhD

What photon from yonder fluorochrome breaks?  It is … umm… hmmm. Let me see. Excitation off a 561 nm laser, with an emission maximum of 692 nm. I’m sure if Shakespeare was a flow cytometrist, he might have written that very scene. But the play is lost in time. However, since the protagonist had difficulty determining what fluorochrome was emitting photons, let’s consider how this could be figured out. In my opinion, one of the handiest flow cytometry tools is the spectral viewer. This tool helps visualize the excitation and emission profile of different fluorochromes, as well as allowing you…

Fickle Markers: Solutions For Antibody Binding Specificity Challenges

Fickle Markers: Solutions For Antibody Binding Specificity Challenges

By: Tim Bushnell, PhD

Reproducibility has been an ongoing, and important, concept in the sciences for years.  In the area of biomedical research, the alarm was sounded by several papers published in the early 2010’s.  Authors like Begley and Ellis, Prinz and coworkers, and Vasilevsky and colleagues, among others reported an alarming trend in the reproducibility of pre-clinical data.  These reports indicated between 50% to almost 90% of published pre-clinical data were not reproducible.  This was further highlighted in the article by Freedman and coworkers, who tried to identify and quantify the different sources of error that could be causing this crisis.  Figure 1,…

5 Common Flow Cytometry Questions, Answered

5 Common Flow Cytometry Questions, Answered

By: Tim Bushnell, PhD

I want to thank all of you who send us your questions about flow cytometry, so I thought I would dip into the old email bag and answer a few of the common ones here.  If your question isn’t answered this time, look for it to be answered in a future blog post.  Of course, if you want us to cover a specific topic, drop us a line.  1. How Fast Can I Go? This is  a common question. The allure of the ‘hi’ button is hard to resist.  The faster you go, the sooner you are finished with data…

Top Industry Career eBooks

Get the Advanced Microscopy eBook

Get the Advanced Microscopy eBook

Heather Brown-Harding, PhD

Learn the best practices and advanced techniques across the diverse fields of microscopy, including instrumentation, experimental setup, image analysis, figure preparation, and more.

Get The Free Modern Flow Cytometry eBook

Get The Free Modern Flow Cytometry eBook

Tim Bushnell, PhD

Learn the best practices of flow cytometry experimentation, data analysis, figure preparation, antibody panel design, instrumentation and more.

Get The Free 4-10 Compensation eBook

Get The Free 4-10 Compensation eBook

Tim Bushnell, PhD

Advanced 4-10 Color Compensation, Learn strategies for designing advanced antibody compensation panels and how to use your compensation matrix to analyze your experimental data.