The UMAP Plugin in FlowJo: A User's Review

Eric and I have been very eager to upgrade to UMAP (as opposed to tSNE) as our go-to dimensionality reduction tool for single-cell data. UMAP is only about a year old, but it has become increasingly popular in the field. UMAP is very similar to tSNE; however, it allows the analysis of many more events in a shorter amount of time (for a detailed comparison of UMAP and tSNE, check out this publication: biorxiv/Nature Biotechnology). However, finding a platform to use UMAP has been a challenge. Traditionally, UMAP has been run through Python. Since most of the analysis algorithms we use are based in R, Java, or MATLAB, I’ve never tried Python, and it would require some dedication to become comfortable with it. tSNE and PhenoGraph have been sufficient for our analysis needs, so we’ve been waiting for a version of UMAP that would be more accessible to us and other bench scientists.

Recently, Cytofkit announced they were including UMAP in the new version of the pipeline (Cytofkit2). Since we are avid users, we explored this avenue first. However, as mentioned in a previous post, the new program has a bug in the data transformation step which currently makes it impossible to use. I, and other users, have reached out to the developers and to the community in an attempt to rectify this issue, but have had limited success. While the developers work to resolve this, we thought we’d try out the new UMAP FlowJo plugin.

The UMAP plugin has very similar issues to the tSNE plugin, which I discussed recently in the following post. The main issue is that there is no expedient way to run an analysis on all of the samples at once. For instance, for viSNE (used in Cytobank) or tSNE in Cytosplore, you upload each FCS file (e.g. 9 FCS files total), select all of the files for clustering (e.g. B6 #1-5; IL10KO #1-4), choose an input method (equal events, all events, etc.), and then select markers for clustering. After clustering is finished you can visualize all of the input events on the tSNE plot, or select each individual sample. This is essential for comparison between samples, as the geography of each tSNE plot will be identical (e.g. the CD4 T cells are at the 2 o’clock position), but the abundance of events in each island, and the expression of various markers, will differ between samples. Like so:

 
Kimball, AK … ET Clambey, J Immunol 2018

 

I determined a way to work around this issue in the previous post, but I will be contacting FlowJo to suggest they resolve this issue.

In order to assess UMAP’s ability to process a lot of events, I analyzed all of the events from each replicate used in Kimball et al 2018. The total number of events per replicate ranged from 9,141 to 266,644.

Data from: Kimball, AK … ET Clambey, J Immunol 2018

I clocked the analysis times: 9,141 events x 35 parameters took less than 1 minute to complete, and 266,644 events x 35 parameters took about 10 minutes. For comparison, a tSNE analysis of 10,000 events in FlowJo took about the same amount of time; however, 266,644 events crashed the program, indicating that UMAP truly does perform better than tSNE with higher event numbers. However, when I concatenated all of the events together and attempted a UMAP run (~1.2 million events x 35 parameters), it analyzed the data for about 2 hours and ended up crashing. I hypothesize that this failure may be due to FlowJo and not UMAP, as FlowJo is prone to crashing in my personal experience, even when analyzing simple flow cytometry data. I anticipate that UMAP performs much better in Python or R.

Here is a comparison of a B6 replicate analyzed by tSNE and UMAP in FlowJo. Although UMAP allows the rapid analysis of more events, the overall expression and organization appear highly similar between the two methods:

Data from: Kimball, AK … ET Clambey, J Immunol 2018

I also tested UMAP’s reproducibility. Like tSNE, it is not reproducible between analysis runs:

 
Screen Shot 2019-05-23 at 11.29.09 AM.png
 

For the UMAP analyses above I utilized the automatic settings. However, I wanted to see how changing the distance function, the nearest neighbor #, and the minimum distance value settings would influence the analysis results.

I selected B6 #1 (9,141 events total) and 35 markers for UMAP analysis. I did not change the automatic settings for neighbor # (15) and minimum distance value (0.5), but I varied the distance function (17 options total) between runs. Eleven of the distance functions resulted in an error message and the analysis failed, but 6 of the functions resulted in successful analyses. The Euclidean, Manhattan, Chebyshev, and Minkowski distance functions resulted in very similar UMAP plots. These are all “Minkowski style metrics” and are appropriate to use on mass cytometry data. A simple explanation of the differences between Euclidean, Manhattan, and Chebyshev calculations can be found here, and a more complex explanation can be found here. The only other two distance metrics that worked were the Hamming and Sokalsneath functions, but these resulted in discordant UMAP plots. This makes sense, as they are typically used for binary data, not continuous variables like those found in mass cytometry data, and are thus not appropriate for this analysis.
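To make the relationship between these Minkowski-style metrics concrete, here is a small R sketch using base R's dist() function. The two "events" and their marker values are made up purely for illustration, and the minkowski() helper is a hypothetical name written out by hand. It is not how the FlowJo plugin computes distances internally, only the textbook definitions.

```r
# Two hypothetical events measured on three markers (made-up values).
event_a <- c(CD45 = 0, CD3 = 0, CD4 = 0)
event_b <- c(CD45 = 3, CD3 = 4, CD4 = 1)
events <- rbind(event_a, event_b)

# stats::dist() covers the Minkowski family directly.
d_euclidean <- as.numeric(dist(events, method = "euclidean"))        # sqrt(sum(diff^2))
d_manhattan <- as.numeric(dist(events, method = "manhattan"))        # sum(|diff|)
d_chebyshev <- as.numeric(dist(events, method = "maximum"))          # max(|diff|)
d_minkowski <- as.numeric(dist(events, method = "minkowski", p = 3)) # general form, p = 3

# The general Minkowski distance written out by hand (hypothetical helper).
# p = 1 gives Manhattan, p = 2 gives Euclidean, and Chebyshev is the limit
# as p grows very large.
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

print(c(euclidean = d_euclidean, manhattan = d_manhattan,
        chebyshev = d_chebyshev, minkowski_p3 = d_minkowski))
```

All four are built from the same per-marker differences, which fits with them producing similar-looking plots on continuous data.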

Testing distance functions-1.png

The next variable I examined was the “nearest neighbor” function. I selected B6 #1 (9,141 events total) and 35 markers for UMAP analysis. I used the “Euclidean” distance function, a minimum distance value of 0.5 (the automatic setting), and varied the number of nearest neighbors from run to run. You can select any value from 0-99, so I tried 0, 1, 5, 15 (the automatic value), 50, and 99. The runs utilizing 0, 1, and 99 all failed, but 5, 15, and 50 resulted in similar-looking plots. The main observation appears to be a compression of the UMAP plot shape as the neighbor number increases.

Screen Shot 2019-05-23 at 11.31.16 AM.png

The last variable I examined was the minimum distance value. I selected B6 #1 (9,141 events total) and 35 markers for UMAP analysis. I used the “Euclidean” distance function, a nearest neighbor value of 15 (the automatic setting), and varied the minimum distance value from run to run. You can select any value from 0.1-0.99, so I tried 0.1, 0.5 (the automatic value), and 0.99. The run utilizing 0.99 failed, but 0.1 and 0.5 were successful. Although 0.5 was the automatic value, I think 0.1 looks less compressed and offers more resolution.

Screen Shot 2019-05-28 at 3.23.21 PM.png

If you are interested in looking at the impact of these variables in more depth, check out this resource. Overall, it appears that any of the Minkowski distance measurements (Euclidean, Manhattan, Chebyshev, and Minkowski) are appropriate and produce similar results. It is impossible to tell if they produce identical results, as UMAP plots are not reproducible between runs (see above). I would recommend using Euclidean, as it appears to be the default option the programmers chose, and it is the distance measurement I have frequently seen in other algorithms (X-shift, PhenoGraph, etc.). I would also recommend sticking with the automatic values for the other variables, though small variations seem to have an impact on the compactness of UMAP plots, which can improve resolution or visual appeal. For example, in the biorxiv version of Becht et al 2019, they noted that they used “15 nn, a min_dist of 0.2 and euclidean distance”, indicating that it is appropriate to slightly adjust the minimum distance value. However, I STRONGLY recommend always reporting the settings you used when publishing mass cytometry data.
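For anyone who would rather run UMAP from R, here is a rough sketch of how the three settings discussed above could be kept together and reported. It assumes the uwot package (one R implementation of UMAP), whose umap() function takes n_neighbors, min_dist, and metric arguments. The data are simulated stand-ins, and the seed value and the run_umap() helper are hypothetical choices for the example. In uwot, calling set.seed() before a run is the usual route to a repeatable layout with the default settings. Whether the FlowJo plugin offers anything equivalent is not something I can confirm.

```r
# Settings discussed in the post, stored in one place so they are easy to report.
# The seed value is an arbitrary, hypothetical choice.
umap_settings <- list(n_neighbors = 15, min_dist = 0.5,
                      metric = "euclidean", seed = 42)

# Stand-in data: 300 simulated events x 35 markers (NOT real cytometry data).
set.seed(1)
events <- matrix(rnorm(300 * 35), nrow = 300, ncol = 35)

# Hypothetical helper: fix the RNG, then run UMAP with the stored settings.
run_umap <- function(x, settings) {
  set.seed(settings$seed)
  uwot::umap(x,
             n_neighbors = settings$n_neighbors,
             min_dist = settings$min_dist,
             metric = settings$metric)
}

# Only run if uwot is installed (assumption: it may not be on every machine).
if (requireNamespace("uwot", quietly = TRUE)) {
  embedding <- run_umap(events, umap_settings)
  print(dim(embedding))  # one row per event, two UMAP coordinates
} else {
  message("uwot is not installed; skipping the UMAP run")
}

# Print the settings so they can be copied into a methods section.
str(umap_settings)
```

Keeping the settings in a single list makes it harder to forget which values produced which plot.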

My final thoughts on UMAP: for this dataset I didn’t notice a significant improvement in the “meaningful organization of cell clusters” that was reported in the Becht et al 2019 publication (see my comparison of tSNE vs UMAP above). However, I did find it to be the fastest and most robust dimensionality reduction tool I’ve used to date. The only complaints I have concern the use of the algorithm in FlowJo, which I hope will be addressed, or resolved when Cytofkit2 is ready for use. Overall, I anticipate replacing tSNE with UMAP in my analysis pipeline going forward.

For more detail and higher resolution images, here is the lab notebook entry for this project.

-Abby



The tSNE Plugin in FlowJo: A User's Review

I mentioned in a recent post that I tried to use the FlowJo plugin for tSNE analysis, but wasn’t satisfied with the results. I will highlight some of the strengths and weaknesses of this tool using the data from Kimball et al 2018.

The biggest issue I have with this tool is that there is no expedient way to run an analysis on all of the samples at once. For instance, in viSNE or Cytosplore you can upload each FCS file (e.g. 9 FCS files total), select all of the files for clustering (e.g. B6 #1-5; IL10KO #1-4), choose an input method (equal events, all events, etc.), and then select markers for clustering. After clustering is finished you can visualize all of the input events on the tSNE plot, or select each individual sample. This is essential for comparison between samples, as the geography of each tSNE plot will be identical (e.g. the CD4 T cells are at the 2 o’clock position), but the abundance of events in each island, and the expression of various markers, will differ between samples.

However, the tSNE plugin in FlowJo doesn’t allow you to select all of the samples for clustering. Instead, you have to either: 1) analyze each sample separately, 2) concatenate all of your replicates together in each condition and then analyze, or 3) concatenate your replicates together regardless of condition, include sample ID when clustering, and then identify each sample via gating after clustering.

First, I tried to analyze each sample separately. To do this, I had to use the “downsample” plugin to downsample the events (for example, one sample had 247,945 events). This plugin randomly selects a user-defined number of events and creates a new version of the FCS file. Downsampling is essential to this process for two reasons: 1) the more events you analyze, the longer the analysis takes, and if you try to analyze too many events FlowJo will freeze/crash, and 2) I want to be able to compare my replicates, so I want to equally sample the available events for each.
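For readers working in R, the equal-sampling idea can be sketched roughly as follows. This is only an illustration with simulated data, not the FlowJo plugin's actual implementation. The sample names, the event counts, and the make_sample() and downsample() helpers are all made up for the example.

```r
# Simulated stand-ins for per-sample event matrices
# (rows = events, columns = markers). Counts are arbitrary.
set.seed(2019)
make_sample <- function(n_events, n_markers = 3) {
  matrix(rnorm(n_events * n_markers), nrow = n_events,
         dimnames = list(NULL, paste0("marker", seq_len(n_markers))))
}
samples <- list(B6_1 = make_sample(500),
                B6_2 = make_sample(1200),
                IL10KO_1 = make_sample(800))

# Target = number of events in the smallest sample
# (the same logic that gave 9,141 events in the post's dataset).
target_n <- min(vapply(samples, nrow, integer(1)))

# Randomly keep target_n events from a sample, without replacement.
downsample <- function(mat, n) mat[sample.int(nrow(mat), n), , drop = FALSE]
downsampled <- lapply(samples, downsample, n = target_n)

# Every sample now contributes the same number of events.
print(vapply(downsampled, nrow, integer(1)))
```

Sampling down to the smallest sample keeps every replicate equally represented, at the cost of discarding events from the larger files.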

An example of downsampling: the original FCS file had 247,945 events, and I used the downsampling tool to randomly select 9,141 events for analysis. 9,141 was chosen because that is the number of events in the smallest sample.

After downsampling all of the samples, I ran a tSNE analysis on each individually using identical settings. As you can see, even though the overall expression and shape of the tSNE plot look similar between replicates, it’s very hard to interrogate visually.

All of these plots are colored according to CD45 expression. Note that the maximum value for CD45 expression varies between plots due to the different events used for each.

The next approach I took for analysis was to concatenate all of the replicates together (so that I had 2 concatenated files, one for each experimental condition) and analyze them using identical algorithm settings. Here are the two tSNE plots that were generated, colored by a variety of lineage markers. The biggest issue with this approach is that you don’t know which events came from which sample. Furthermore, you still cannot visually detect major differences in cellular populations due to their differing geography.

You could determine gates for major cellular populations for each tSNE plot and then easily compare cellular abundance and expression (e.g. % of CD4 T cells, MFI of Tbet expression in B6 vs IL10KO).

The last method I tried was concatenating the files and clustering on all relevant markers and sample ID. I tried this using the concatenated IL10KO replicates (n=4), but you could concatenate all 9 files together across conditions (I limited it to the IL10KO replicates as this was the option with the lowest # of events, which resulted in a fast tSNE run). After the tSNE analysis I visualized the events by the variable “Sample ID” and gated the 4 distinct populations, one for each sample (please note that it is unclear which replicate corresponds to which population, so I have just labelled them as “sample” rather than a specific replicate ID). Thus, with a lot of work, you can then visualize events from specific replicates on the tSNE plot, much like in viSNE or Cytosplore.
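The bookkeeping behind this approach (tag each event with its sample, concatenate, then pull individual samples back out) can be sketched in R as below. The data are simulated, and the replicate names, event counts, and column names are hypothetical. Note one difference from the FlowJo workaround described above: in this sketch the sample ID is kept only as a label and is not fed to the algorithm as a clustering parameter.

```r
set.seed(10)

# Simulated stand-ins for four replicates (rows = events, columns = markers).
# Replicate names and event counts are made up for the example.
replicates <- lapply(c(rep1 = 40, rep2 = 60, rep3 = 50, rep4 = 30), function(n) {
  data.frame(CD4 = rnorm(n), Tbet = rnorm(n))
})

# Tag every event with the replicate it came from, then stack the replicates.
tagged <- Map(function(df, id) cbind(df, sample_id = id),
              replicates, names(replicates))
concatenated <- do.call(rbind, tagged)
rownames(concatenated) <- NULL

# After dimensionality reduction, the sample_id column lets you split
# the events back out by replicate (the equivalent of gating on Sample ID).
events_per_sample <- table(concatenated$sample_id)
rep2_events <- concatenated[concatenated$sample_id == "rep2", ]

print(events_per_sample)
```

Because the label travels with each event, there is no ambiguity afterwards about which replicate a given population came from.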

samples-1.png

In addition to the major issues with the plugin discussed above, FlowJo also only offers one color palette you can use, unlike viSNE (~6 options) and Cytosplore (~12 options).

B6 #1 colored by lineage markers-1.png

In addition to coloring the data by heatmap statistics, you can choose from a variety of dot and density plots (contour, density, zebra, and pseudocolor) just like you would for traditional flow cytometry data.

Asset 3.png

However, there are some advantages to the tSNE plugin in FlowJo. For instance, if you’re familiar with the various tSNE algorithm settings (this is a great resource for understanding them), you can easily customize settings for your analysis, such as iterations, perplexity, and the learning rate. A major advantage is that you can control the boundaries of the tSNE axes. This means you can standardize the axis values across different samples, something you can’t do in any other analysis program and which has presented a significant issue for us in the past (see below for an example).

It’s also convenient that you can gate populations directly on the tSNE plot much like traditional flow data. You can then color these user-defined populations in the layout editor to make a plot like this:

colored by phenotype-1.png

You can also easily examine specific populations of interest. For instance, here is a histogram overlay comparing the expression of Tbet in CD4 T cells in Sample 1 vs Sample 2.

histogram-1.png

You can even quickly run an additional tSNE analysis on the CD4 T cell island:

tSNE on CD4 T cells-1 2.png

Overall, much like Cytosplore, I think the tSNE plugin for FlowJo is a great free and accessible tool for users who have recently started analyzing mass cytometry data. This is especially true if they are long-term users of FlowJo, as the learning curve will be very low. Depending on what type of questions you’re asking, the issues I’ve discussed may not impact your analysis. Personally, I think the inability to readily identify particular samples is a major flaw in the program that the developers should fix before I will adopt it into my analysis workflow. It’s also important to keep in mind that this tool can be used for any type of dataset, not just mass cytometry. I have personally used it for flow cytometry and IHC datasets and found it to be very useful.

Useful resources: live demo using the tSNE plugin, a guide to using the downsampling plugin in FlowJo, and a guide to tSNE settings.

-Abby

Cytosplore: A User's Review

Today I’m trying out the software Cytosplore! I tried getting this program to work about a year ago, but at that time they only had a version of the software for PC and our ancient Dell couldn’t handle it. But now that they have a version for Mac, I thought I’d try it out on the dataset from our JI paper (you can find the fcs files here).

I think Cytosplore could be a great free alternative to Cytobank for using the SPADE and tSNE algorithms. There is an R script for SPADE, but I haven’t tried it as we don’t have much use for SPADE. As for tSNE, FlowJo recently released a plugin for tSNE analysis, but I’m not a huge fan of it (I’ll explain why in detail in another post). In addition to SPADE and tSNE, Cytosplore also allows you to analyze your data using the Hierarchical Stochastic Neighbor Embedding (HSNE) algorithm. I’ve wanted to play with HSNE for a while, as it allows the analysis of very large high-dimensional data sets. A major issue with tSNE and PhenoGraph is that they limit the number of events you can analyze due to computational power and/or crowding in the visualization (I discussed this issue here). However, with the HSNE and UMAP algorithms you can analyze much more data… so let’s test it out!

This is a great video explaining how to use Cytosplore. Seriously, it usually takes me weeks to figure out new software/code because the developers don’t have accessible how-to guides, but this one has a fantastic video demo. I had no issues downloading or opening the software, and overall it was very intuitive to use.


Using the tSNE Algorithm in Cytosplore

As I mentioned above, I used the preprocessed data from our JI paper (normalized and gated from live singlet events). I selected the relevant markers (35 total), used the default settings for perplexity, and tried to analyze all events (~1.2 million events) using the tSNE algorithm. This froze my computer (which I expected to happen), so I had to force quit and try again. After reopening the program I used identical settings, but this time I downsampled the data so that all samples were equally represented (9,141 events from each, 82,269 events total). It took about 20 minutes to finish analyzing my data (and it was super cool to watch as it did!).

When it was done computing I played around with the coloration. Cytosplore has a ton of different color palettes:

I then colored the events according to a variety of markers:

Cytosplore allows you to rapidly save tSNE plots colored by any parameter as high resolution .png files to your desktop:

Screen Shot 2019-04-29 at 1.17.46 PM.png

You can also color the tSNE plot by sample ID! This is something that has been on my wishlist for a long time, as viSNE and PhenoGraph don’t have this feature. Much like PhenoGraph, you can also check certain samples on and off, so you can visualize particular events on the tSNE plot.

Screen Shot 2019-04-29 at 1.20.07 PM.png

I wanted to compare how these plots look to those generated by viSNE (Cytobank) and PhenoGraph. Overall they look very similar in expression and cluster organization. Cytosplore allows you to customize your expression scale, so you can change the minimum and maximum values.

ezgif-5-b1765d35f0.pdf-1.png

You can also select specific events, look at abundance and expression data, and export this data as a .csv file. This is very useful and intuitive. However, a major complaint I have is that you cannot create a customizable shape for gating events of interest. Instead, you must select them using a rectangular/square shape, which limits the user’s ability to pick specific events of interest.


Using the HSNE Algorithm in Cytosplore

Next, I analyzed the same downsampled events (9,141 events from each sample) with the HSNE algorithm. It was very fast; it took only 2-3 minutes! Below are the HSNE plots colored by a variety of markers; here they are individually.

Finally, I analyzed all of the events (1,263,690 events total) with the HSNE algorithm and it only took 20 minutes! Below are the HSNE plots colored by a variety of markers; here they are individually.


Overall, I think Cytosplore is a very promising tool and is especially great for beginners, as it is very intuitive computationally and free to use. Personally, I will use Cytosplore in the future for tSNE analysis instead of FlowJo or Cytobank. However, I still need to read up on how to interpret HSNE plots... For instance, there are differently sized circles; is this like a SPADE tree, where the size of the circle denotes the number of cells in each node? If this is the case, HSNE allows the analysis of millions of events, but it also reduces the resolution of the data by grouping events together (similarly to SPADE, X-shift, and Citrus). For most datasets we analyze, scalability is not a major issue. However, for researchers with millions of events to analyze, or those interested in rare populations, I think HSNE is a great solution!

-Abby

Getting "Cytofkit2" to work

After a couple of hours I have finally resolved the issues I was experiencing trying to get the shinyapp to work for the Cytofkit2 package.

Here was the code that finally worked for me:

## try http:// if https:// URLs are not supported
## (note: Bioconductor 3.8 and later replace biocLite() with BiocManager::install())
source("https://bioconductor.org/biocLite.R")
biocLite("cytofkit")
# Install devtools if it is not already installed
if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("JinmiaoChenLab/cytofkit2")
# Install the "reticulate" package to run Python through R
install.packages("reticulate")
# Install the additional packages I needed to get the Shiny app running
install.packages("plotly")
install.packages("DT")
install.packages("shinydashboard")
install.packages("shinyalert")
install.packages("shinyWidgets")
# Load the package and launch the Shiny app
library("cytofkit2")
cytofkit_shiny_dashboard()

It looks like some other users were experiencing a variety of issues.

Now I am facing a different issue when trying to analyze the data by PhenoGraph and UMAP, see it here. It looks like the inability of the program to recognize the selection of "cytofAsinh" as a transformation method is a bug that other users have experienced and needs to be resolved by developers.

-Abby