Uniform Manifold Approximation and Projection (UMAP)

An algorithm for manifold learning and dimension reduction.

Given a set of high-dimensional data, run_umap.m produces a lower-dimensional representation of the data for purposes of data visualization and exploration. See the comments at the top of the file run_umap.m for documentation and many examples of how to use this code.
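As a rough illustration, a minimal call looks like the sketch below; the toy data and the plotting lines are ours, and the full signature with its optional outputs is documented in the run_umap.m header comments:

data = randn(1000, 10);                      % toy data: 1,000 observations in 10 dimensions
reduction = run_umap(data);                  % returns an n-by-2 embedding by default
plot(reduction(:,1), reduction(:,2), '.');   % quick look at the 2-D embedding
title('UMAP embedding of toy data');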
The UMAP algorithm is the invention of Leland McInnes, John Healy, and James Melville. See their original paper for a long-form description (https://arxiv.org/pdf/1802.03426.pdf). Also see the documentation for the original Python implementation (https://umap-learn.readthedocs.io/en/latest/index.html).
This MATLAB implementation follows a very similar structure to the Python implementation from 2019, and many of the function descriptions are nearly identical.
Here are some additional tools we have added to our implementation:
1) The ability to detect clusters in the low-dimensional output of UMAP. As the clustering method, we invoke either DBM (described at https://www.hindawi.com/journals/abi/2009/686759/) or DBSCAN (built into MATLAB R2019a and later); see the sketch after this list.
2) Visual and computational tools for data group comparisons. Data groups can be defined either by running clustering on the data islands resulting from UMAP's reduction or by external classification labels. We use a change quantification metric (QFMatch), which detects similarity in both mass and distance (described at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5818510/), as well as an F-score for measuring overlap when the groups are different classifications of the same data. For visualizing data groups, we provide a dendrogram (described as QF-tree at https://www.nature.com/articles/s42003-019-0467-6), a multidimensional scaling view, and sortable tables that show each data group's similarity, overlap, false positive rate, and false negative rate. The documentation in run_umap.m and UMAP_extra_results.m describes these and other related tools.
3) A PredictionAdjudicator feature that helps determine how well one classification’s subsets predict another’s.
4) A complementary independent classifier named “exhaustive projection pursuit” (EPP) that generates labels both for supervising UMAP as well as for classification comparison research. EPP is described at https://onedrive.live.com/?authkey=%21ALyGEpe8AqP2sMQ&cid=FFEEA79AC523CD46&id=FFEEA79AC523CD46%21209192&parId=FFEEA79AC523CD46%21204865&o=OneUp.
5) The ability to use neural networks, built either with MATLAB's "fitcnet" function or with the Python package TensorFlow, to learn from a training data set and classify new data, which can then be compared against or merged with the UMAP classification.
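To illustrate item 1, here is a minimal sketch that clusters the reduced data with MATLAB's built-in dbscan function (R2019a and later); run_umap also offers its own clustering arguments, documented in run_umap.m, and the epsilon and minpts values here are illustrative only:

reduction = run_umap(data);                         % 2-D embedding of the data
labels = dbscan(reduction, 0.5, 15);                % cluster the embedding; tune epsilon and minpts for your data
gscatter(reduction(:,1), reduction(:,2), labels);   % color each cluster (noise points are labeled -1)
title('DBSCAN clusters in the UMAP embedding');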
Without the aid of any compression, this MATLAB UMAP implementation tends to be faster than the current Python implementation (version 0.5.2 of umap-learn). Due to File Exchange requirements, we supply only the C++ source code for the MEX modules we use to accelerate the computations. Running the command "run_umap" (without arguments) lets you either download these MEX files immediately or build them from the C++ source code and build script that we provide. See the fast_approximation argument comments in the run_umap.m file for further speedups. As examples 13 to 15 show, you can test the speed difference between the implementations for yourself on your computer by setting the 'python' argument to true.
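For instance, a quick timing comparison on your own data might look like the following sketch; the second call assumes Python and umap-learn are installed and reachable from MATLAB, and the variable data is any numeric matrix of observations:

tic; run_umap(data); toc                    % MATLAB implementation
tic; run_umap(data, 'python', true); toc    % delegate the reduction to umap-learn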
The optional 'qf_tree' argument requires the Bioinformatics Toolbox.
This implementation is a work in progress. It has been looked over by Leland McInnes, who in 2019 described it as "a fairly faithful direct translation of the original Python code". We hope to continue improving it in the future.
Provided by the Herzenberg Lab at Stanford University.
This submission interoperates with FlowJo v10.x, a widely used analysis app for flow cytometry distributed by BD Life Sciences. You can supervise UMAP with population definitions made in a FlowJo workspace and export UMAP regions of interest back into FlowJo workspaces.
We appreciate any and all help in finding bugs. Our priority has been determining the suitability of our concepts, UMAP supervised templates and exhaustive projection pursuit, for research publications in flow cytometry.

Cite As

Connor Meehan, Jonathan Ebrahimian, Wayne Moore, and Stephen Meehan (2022). Uniform Manifold Approximation and Projection (UMAP) (https://www.mathworks.com/matlabcentral/fileexchange/71902), MATLAB Central File Exchange.

MATLAB Release Compatibility
Created with R2023a
Compatible with R2017a to R2023a
Platform Compatibility
Windows macOS Linux
Acknowledgements

Inspired: CytoMAP



Version History
4.4

Fixes and improvements based on feedback from CYTO 2023 conference.
Testing with R2023a release

4.2.1

Corrected documentation in run_umap for examples 4 & 5 which use FlowJo.

4.2

1. Integration with FlowJo
- Import data and supervision labels from workspaces
- Export results to workspaces
2. Multidimensional scaling view of supervised template reductions.
3. HeatMap improvements.
4. Many bug fixes and other improvements

4.1

1) Improved documentation and examples for using MLP train/predict independently of UMAP
2) MlpPython.Predict function
-Is faster on R2019b or later
-Allows a test set with all (or more) of the training set's columns, in any order

4.0

-mlp_train combines neural network and supervised template classification
-job_folder allows batching runs of run_umap from external software without MATLAB reloads
Example 34 in run_umap.m illustrates these new arguments and others

3.01

1. Fast approximation now accelerates both matching and reduction processing.

2. Prediction table now:
a) Displays dimensions for true+, false+ and false- stacked together.
b) Highlights selections in yellow on UMAP and EPP plots.

3.0

V3.0 improves speed, classification assessment and ROI functionality. For details, see the last section of the File Exchange description and/or search the run_umap.m file for fast_approximation, run_epp and match_predictions.

2.2

-New table showing density distribution & KLD of unreduced data associated with groupings of the reduced data
-New run_umap arguments for supervised templates and accessing prior UMAP features
-New examples with larger data sets

2.1.3

Fixed an edge case where running a template fails if the metric is a user-defined function.

2.1.2

-Added parameters to run_umap "wrapper" that reach more capabilities within the UMAP.m core; search "v2.1.2" in run_umap.m to see these additions.
-Fixed bugs for edge cases involving minimal data and user-defined metrics.

2.1.01

-Maximized UMAP parallelism speed by using all MATLAB’s assigned logical CPU cores
-Added NN-descent support for 'SEuclidean'
-New slider for shading UMAP supervisor colors
-Stochastic gradient descent halts gracefully if user closes progress window

2.1.0

-Stochastic gradient descent (SGD) is now parallelized by default with our MEX method. See 'sgd_tasks' in the documentation.
-'Randomize' is now true by default in order to use parallelism to accelerate both NN-descent and SGD
-Other minor bug fixes

2.0.0

-Improved documentation for some arguments and removed all popups when "verbose" is false
-run_umap now accepts all knnsearch arguments (except for 'SortIndices')
-Nearest neighbour computations are significantly accelerated for certain data inputs

1.5.2

-Removed .exe and .MEX files to comply with File Exchange requirements. Users are now encouraged to download these from our Google Drive if they wish to significantly speed up run_umap.
-Added examples 17 to 19 in run_umap header comment.

1.3.4

-Fixed a bug in SGD in Java where data was unintentionally stored as two distinct objects
-Added QF trees and dissimilarity plots
-Added an experimental joined_transform method that outperforms transform() when training data is missing populations

1.3.3

-Fixed some minor cosmetic issues such as suboptimal plot scaling

1.3.2

-If a UMAP template is applied to data that appears to have new populations, a warning is shown and the option is offered to perform a re-supervised reduction
-Fixed an indexing error occurring in smooth_knn_dist.m if data had too many identical points

1.3.1

-Fixed a GUI bug that would occur for users with MATLAB R2018b or earlier

1.3.0

-Data can now be reduced to any number of dimensions by changing the 'n_components' parameter; if reducing to more than 2 dimensions, a 3D plot is shown (see the sketch after this list)
-DBSCAN can be used to cluster UMAP output
-The 'n_epochs' parameter can now be manually changed
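An illustrative sketch of the 'n_components' argument mentioned above (the variable data is any numeric matrix of observations):

reduction3 = run_umap(data, 'n_components', 3);   % reduce to 3 dimensions; a 3D plot is shown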

1.2.1

-Added precomputed parameter values for users without the Curve Fitting Toolbox
-Fixed an issue when using transform() on new data sets of the same size as the previous embedding, and improved the adjacency matrix for transform()
-Improved progress bars

1.2.0

-Added 2 examples (run_umap.m) showing how to perform supervised dimension reduction with UMAP
-Improved labelling of plots; for supervised UMAP, the plot includes a legend with labels from the categorical data
-Explained proper MATLAB path settings

1.1.0