Statistical Challenges Of Rare Event Measurements In Flow Cytometry

To conclude our series on rare event analysis, it is time to discuss the statistics behind it. The first 2 parts of this series covered the hardware aspects of measuring rare events and some specific recommendations for gating and analysis of rare events.

It is necessary to sort through hundreds of thousands or millions of cells to find the few events of interest.

With such low event numbers, we move away from the comfortable domain of the Gaussian distribution and into the realm of Poisson statistics.

There are 3 questions to consider to build confidence that the events being counted are truly events of interest and not random events that just happen to fall into the gates of interest.

1. How do you know if an event is real?

How do you know that your rare event is real? When subsetting the population, you might have an occurrence rate of 0.1% or lower. This means that for every 100,000 cells, 100 cells or fewer will be in the final gate of interest.

How can you confirm and be comfortable they are real?

In Poisson statistics, the number of positive events is the important factor, not the total number of events.

In Poisson statistics, the mean and variance of the distribution are both equal to the number of positive events. The standard deviation is the square root of the variance, so the CV (SD/mean) for n positive events is 1/√n.

So, if you have 2 events in a region, the CV of that data is roughly 71%, whereas with 100 events, the CV drops to about 10%.
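
A minimal Python sketch of this calculation, assuming pure Poisson counting error with no assay variation. The function name `poisson_cv` is illustrative and does not come from any cytometry package.

```python
import math

def poisson_cv(positive_events: int) -> float:
    """Return the counting CV (SD / mean) for a Poisson count.

    For a Poisson distribution the variance equals the mean, so
    SD = sqrt(n) and CV = sqrt(n) / n = 1 / sqrt(n).
    """
    if positive_events <= 0:
        raise ValueError("need at least one positive event")
    return 1.0 / math.sqrt(positive_events)

# Counting CV for a few positive-event counts
for n in (2, 12, 100):
    print(f"{n:>4} positive events -> CV = {poisson_cv(n):.1%}")
```

Note that the total number of events collected never enters the calculation; only the number of positive events does.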

But, what does this really mean?

In this paper by Maecker and coworkers, the authors looked at inter-lab CV in flow cytometry experiments, and estimated it was as high as 40%, some of which could be reduced by centralizing analysis. More importantly, the inter-lab CVs were highest (57-82%) on samples where the average percentage of cells was below 0.1%.

By comparison, with as few as 12 positive events, Poisson counting alone gives a CV of about 29%.

The precision of the measured frequency is therefore dominated by assay errors, not by the counting statistics of the rare events being analyzed.

With rare event analysis, you demonstrate significance through assay reproducibility. The number of samples that should be measured can be determined using the power calculation, which is discussed in more detail here.

2. How many total events do you need?

Statistically, assay variation can be a major source of error for this analysis. That leads to the question, “How many events do you need?”

The short answer is that whenever possible, collect as many events as possible.

As discussed previously, there may be limitations imposed by the hardware and software which limit how much data can be collected in a single file. This means you may have to collect multiple files from the same tube.

In third-party software, it is possible to do some preliminary gating to reduce file size, and concatenate multiple files after this preliminary analysis to make the final gating more complete.

Turning back to how many events is enough, there is more than one n. To show the significance of the data, the analysis must be repeated multiple times (i.e. power the experiment appropriately) and must include the appropriate negative controls.

The data above shows both of these concepts. On the left is the gating strategy and the control (normal patient control), while on the right are the results of several analysis runs on 2 patients to show the differences between the 2 populations.

The statistical analysis comparing these 2 shows that there is a significant difference, as denoted by the asterisk.

Returning to the question of how many events is enough, the question to ask is, “What is the CV required for analysis — what spread of the data is acceptable?”

Is 10,000 events enough?

The chart below shows the coefficient of variation (CV) value for a given frequency of cells. The CV is related to the number of positive events, and is defined as the SD/mean.

In general, a lower CV is better. The CV is another way to express the precision and repeatability of an experiment.

Using this table, if a broad CV is acceptable, 10,000 events is enough at a cell frequency of 0.1%, where the 10 expected positive events give a CV of roughly 32%. However, a 10% CV requires 100 positive events, so 10,000 total events is only sufficient when the population is in the 1% range.
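
A grid of this kind can be generated in a few lines of Python. This is a sketch that assumes the only source of variation is Poisson counting, and the frequencies and event totals chosen here are illustrative rather than taken from the original chart.

```python
import math

def counting_cv(frequency: float, total_events: int) -> float:
    """CV from Poisson counting alone: 1 / sqrt(expected positive events)."""
    expected_positives = frequency * total_events
    if expected_positives <= 0:
        raise ValueError("expected positive events must be > 0")
    return 1.0 / math.sqrt(expected_positives)

# Illustrative grid: rows are cell frequencies, columns are total events collected
frequencies = [0.01, 0.001, 0.0001, 0.00001]       # 1%, 0.1%, 0.01%, 0.001%
totals = [10_000, 100_000, 1_000_000, 10_000_000]

print("frequency " + "".join(f"{t:>13,}" for t in totals))
for f in frequencies:
    row = "".join(f"{counting_cv(f, t):>13.1%}" for t in totals)
    print(f"{f:>9.3%} {row}")
```

Reading along a row shows how the CV tightens as more total events are collected for a population of fixed frequency.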

For very rare populations, a 10% CV requires collection of a million, or even 10 million, total events. Collecting at a rate of 10,000 events per second, it would take 1,000 seconds, or about 17 minutes, to collect 10 million events.
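
The same arithmetic can be run in reverse: start from the CV you can accept and work out how many total events, and how much acquisition time, that implies. The sketch below assumes Poisson counting error only and a constant acquisition rate; the function names are hypothetical.

```python
import math

def positives_needed(target_cv: float) -> int:
    """Positive events needed so that 1 / sqrt(n) <= target_cv."""
    # round() guards against floating-point noise such as 99.99999999999999
    return math.ceil(round(1.0 / target_cv ** 2, 9))

def total_events_needed(frequency: float, target_cv: float) -> int:
    """Total events to collect to expect enough positives at this frequency."""
    return math.ceil(round(positives_needed(target_cv) / frequency, 6))

def acquisition_minutes(total_events: int, events_per_second: float) -> float:
    """Acquisition time in minutes at a constant event rate."""
    return total_events / events_per_second / 60.0

# Requirements for a 10% CV at several cell frequencies
for freq in (0.01, 0.001, 0.0001, 0.00001):
    total = total_events_needed(freq, target_cv=0.10)
    minutes = acquisition_minutes(total, events_per_second=10_000)
    print(f"{freq:.3%}: {total:,} events, about {minutes:.1f} min at 10,000 events/s")
```

Because the required number of positive events scales as 1/CV², halving the acceptable CV quadruples the collection time.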

The CV relates to the ability to identify a difference between 2 populations. This, in turn, is related to the power of the experiment. Since we have the standard deviation of the population, the calculations are easier. However, the size of the difference between the control and experimental samples will drive this.

In this paper, Mario Roederer discusses the issue of how many events you need to know if something is real. According to this paper, one of the important things to do is compare your positive sample to a set of controls so that you can interpret the data correctly.

There’s no arbitrary number of events that is the “right” number.

Even 12-14 positive events may be accurate, based upon your knowledge and the data generated by your controls.

3. How do you sort rare events?

The ability to sort cells for downstream applications is one of the most powerful applications of flow cytometry. Poisson statistics again play a role in determining an appropriate event rate.

If the drop drive frequency is 80 kHz, or 80,000 droplets being generated per second, how many events per second should you run? Remember that a cell sorter sorts droplets, not cells per se, but the cells are contained within the drops.

Depending on the sort envelope, the sort decision can include 1 or 2 droplets. So, what is a reasonable event rate?

When the event rate is equal to the drop drive frequency, Poisson statistics predict that a little under 40% of the drops will have no cells, a little under 40% will have 1, almost 20% will have 2, and about 8% will have 3 or more cells.

If the event rate is ½ the drop drive frequency, 7.5% of the droplets will have 2 cells. When the event rate is ¼ the drop drive frequency, about 80% of the droplets are empty and about 2% of the drops will have 2 events. Going to ⅙ the drop drive frequency, the improvements are minimal.

So, if the drop drive frequency is 40 kilohertz or 40,000 droplets per second, the event rate should be no more than 10,000 events per second.
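
The droplet occupancy figures come straight from the Poisson distribution, with the average number of cells per droplet equal to the event rate divided by the drop drive frequency. Here is a minimal Python sketch, assuming cells arrive randomly and independently and ignoring the sort envelope; the function names are illustrative.

```python
import math

def poisson_pmf(k: int, mean_cells_per_drop: float) -> float:
    """Probability that a droplet holds exactly k cells."""
    lam = mean_cells_per_drop
    return math.exp(-lam) * lam ** k / math.factorial(k)

def occupancy(event_rate: float, drop_frequency: float) -> dict:
    """Fractions of droplets holding 0, 1, 2, and 3 or more cells."""
    lam = event_rate / drop_frequency  # average cells per droplet
    p0, p1, p2 = (poisson_pmf(k, lam) for k in range(3))
    return {"0": p0, "1": p1, "2": p2, "3+": 1.0 - p0 - p1 - p2}

drop_frequency = 40_000  # droplets per second
for divisor in (1, 2, 4, 6):
    fractions = occupancy(drop_frequency / divisor, drop_frequency)
    summary = ", ".join(f"{k} cells: {v:.1%}" for k, v in fractions.items())
    print(f"event rate = drop frequency / {divisor}: {summary}")
```

Only the ratio of event rate to drop drive frequency matters, so the same fractions apply at 80 kHz or any other drop drive frequency.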

What does this mean practically?

This chart can help you determine how long it will take to sort, based on the drop drive frequency and the frequency of the population, assuming 100,000 cells are needed for a downstream application.

Sort operators are often asked if there is a way to reduce the time it takes to sort, especially with a rare event population. Since there is a ceiling on event rates, our only option is to enrich the sample to increase the proportion of desired events.

This can be done using a depletion assay with magnetic beads from Miltenyi Biotec, the IMAG system, Dynabeads, and others.

In these systems, cells are tagged with an antibody conjugated to a magnetic bead and then exposed to a magnet. The cells that are labeled are held by the magnet, while the cells that are not labeled stay in suspension or pass through the column for collection and downstream sorting.

Let’s look at when it might be helpful to incorporate this pre-sort enrichment step.

Starting with 100 million cells and a desired population at 0.01%, if you took those 100 million cells and sorted them at 20,000 events per second, it would take about 83 minutes to do the whole sort.

If we take those same 100 million cells and perform a magnetic bead enrichment, which will take about 45 minutes using one of the various magnetic isolation kits, the untouched cells will be about 10 million cells and the population of rare events is enriched to 0.1%.

Sorting those 10 million cells at 20,000 events per second will only take about 8 minutes, compared to 83 minutes without magnetic bead enrichment. Even with the 45-minute enrichment included, the total is about 53 minutes, and the shorter time on the sorter means that your cells are going to be healthier because you can get them back into culture, or into whatever buffer they need, more quickly.
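
The timing comparison is simple division, sketched here in Python with the numbers from this example. It assumes a constant event rate and that enrichment loses none of the target cells, which real kits will not fully achieve.

```python
def sort_minutes(total_cells: float, events_per_second: float) -> float:
    """Minutes needed to run every cell through the sorter at a fixed event rate."""
    return total_cells / events_per_second / 60.0

EVENT_RATE = 20_000        # events per second, from the example
ENRICHMENT_MINUTES = 45    # approximate bead enrichment time, from the example

direct = sort_minutes(100e6, EVENT_RATE)        # 100 million cells, no enrichment
enriched_sort = sort_minutes(10e6, EVENT_RATE)  # 10 million cells after enrichment

print(f"direct sort:           {direct:.1f} min")
print(f"sort after enrichment: {enriched_sort:.1f} min")
print(f"enrichment + sort:     {ENRICHMENT_MINUTES + enriched_sort:.1f} min")

# Target cells available before and after enrichment (assumes no target-cell loss)
print(f"target cells: {100e6 * 0.0001:,.0f} before vs {10e6 * 0.001:,.0f} after")
```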

After all the tweaking of the hardware and optimizing the data analysis, the statistics must be considered. Poisson statistics dominate rare event analysis. From determining how many cells to collect, to how fast to sort cells, the number of positive events is critical for determining the statistics involved. The charts and data in this blog can help design your next rare event analysis experiment, and help provide the basis for improving reproducibility and consistency of the experiments.

ABOUT TIM BUSHNELL, PHD

Tim Bushnell holds a PhD in Biology from the Rensselaer Polytechnic Institute. He is a co-founder of—and didactic mind behind—ExCyte, the world’s leading flow cytometry training company, which organization boasts a veritable library of in-the-lab resources on sequencing, microscopy, and related topics in the life sciences.
