t-Distributed Stochastic Neighbor Embedding
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique has a natural landmark version that can readily be applied to large real-world datasets (we applied it to, e.g., all 60,000 MNIST digit images). The technique is introduced in the following papers:

• L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. [ PDF ] [ Supplemental Material (24MB) ]

• L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 5:384-391, 2009. [ PDF ]

• L.J.P. van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013. [ arXiv ] [ Code ] [ Talk ]
Implementations
Below, six implementations of t-SNE are available for download. The first implementation is a simple Matlab implementation of t-SNE that allows you to understand the technique. The code requires Matlab R2007a or higher. An implementation that takes a pairwise probability matrix or a pairwise distance matrix as input is also included.
The second implementation is in CUDA. Depending on the available hardware and the problem size, this implementation can be up to 20x faster than naive t-SNE implementations. It does require a working installation of CUDA 3.x, as well as a working Matlab installation.
The third implementation is a binary that reads a data file and writes the results to a file. Scripts that allow you to use this implementation from Matlab and Python are also available. The binary implementation is faster than the Matlab implementation and supports the landmark version of t-SNE. It was implemented using Intel IPP.
The fourth and fifth implementations are in Python and R, respectively. These are much slower, but they work just as well.
The sixth implementation is a Barnes-Hut implementation of t-SNE, which is the fastest t-SNE implementation so far and which scales much better to large datasets. There is also an implementation of parametric t-SNE (in Matlab) available.
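For readers who want to try a Barnes-Hut t-SNE from Python without compiling anything, a minimal sketch using scikit-learn is shown below. This uses scikit-learn's own implementation of the Barnes-Hut approximation, not the binary distributed here, so parameter names follow scikit-learn's API:

```python
# A minimal sketch of running Barnes-Hut t-SNE via scikit-learn
# (not the binary distributed on this page).
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(200, 50)            # 200 points in 50 dimensions, as a stand-in

# method='barnes_hut' enables the O(N log N) approximation; 'angle' is the
# Barnes-Hut trade-off parameter theta (smaller = more exact, larger = faster).
embedding = TSNE(n_components=2, perplexity=30, method='barnes_hut',
                 angle=0.5, random_state=0).fit_transform(X)
print(embedding.shape)            # (200, 2)
```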
You are free to use, modify, or redistribute this software in any way you want, but only for non-commercial purposes. Use of the software is at your own risk; the authors are not responsible for any damage resulting from errors in the software.
Simple Matlab implementation
CUDA implementation (with contributions by Alex)
Binary implementation
Matlab script that runs binary implementation
Python script that runs binary implementation (by Philippe)
Simple Python implementation
Simple Julia implementation (by Leif Jonsson)
Simple R implementation (by Justin)
User guide
MNIST Dataset
Parametric t-SNE (Matlab; see here)
Alternative t-SNE Optimizer (Matlab; see here and here)
(Note: this code sometimes has numerical problems)
Barnes-Hut-SNE (C++, Matlab, and Python; see here for algorithm details)
Examples
Some results of our experiments with t-SNE are available for download below. In the plots of the Netflix dataset and the words dataset, the third dimension is encoded by means of a color encoding (similar words/movies are close together and have the same color). Most of the ‘errors’ in the embeddings (such as in the 20 newsgroups dataset) are actually due to ‘errors’ in the features t-SNE was applied to. In general, the embeddings have a 1-NN error that is comparable to that of the original high-dimensional features!
MNIST dataset (in 2D)
MNIST dataset (in 3D)
Olivetti faces dataset (in 2D)
COIL-20 dataset (in 2D)
Netflix dataset (in 3D) on Russ’ RBM features
Words dataset (in 3D) on Andriy’s semantic features
20 Newsgroups dataset (in 2D) on Simon’s discLDA features
Reuters dataset (in 2D) landmark t-SNE using semantic hashing
NIPS dataset (in 2D) on co-authorship data (1988-2003)
NORB dataset (in 2D) by Vinod
Words (in 2D) by Joseph on features learned by Ronan and Jason
CalTech-101 on SIFT bag-of-words features
S&P 500 by Steve @ Opera; on information about daily returns on company stock
Interactive map of scientific journals on data by Nees Jan and Ludo, using VOSviewer
Relation between World Economic Forum councils
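The 1-NN comparison mentioned above can be sketched as follows. Here scikit-learn's digits dataset stands in for the original features and a 2-D PCA projection stands in for a t-SNE embedding (both are stand-ins for illustration, not the setup used in the experiments):

```python
# Sketch of the 1-NN check: compare nearest-neighbor classification error
# in the original space against a 2-D embedding of the same data.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X2 = PCA(n_components=2).fit_transform(X)   # stand-in for a t-SNE embedding

def one_nn_error(Z, labels):
    # cross-validated 1-NN classification error
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=1), Z, labels, cv=5)
    return 1.0 - scores.mean()

err_high = one_nn_error(X, y)    # error on the original features
err_low = one_nn_error(X2, y)    # error on the 2-D embedding
```

A good embedding keeps `err_low` close to `err_high`.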
FAQ
The binary implementation of t-SNE seems to have messed up the ordering of my data?
It sure did! The fast implementation of t-SNE is a landmark version that randomly picks its landmarks, even if you set the ratio of landmarks to 1.0. You can get the indices of the selected landmarks from the result file (or from the Matlab script that runs it, for that matter). The format of the result file is described in the User’s guide.
I can’t figure out the file format for the binary version of t-SNE?
The format is described in the User’s guide. You also might want to have a look at the Matlab or Python wrapper code: it contains code that writes the data file and reads the results file, and this code can be ported fairly easily to other languages. Please note that the file format is binary (so don’t try to write or read text!), and that it does not contain any spaces, separators, or newlines.
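For the actual record layout, consult the User’s guide; the snippet below only illustrates the general idea of writing such a raw binary file (no text, no separators) with Python’s struct module. The header fields and their order here are hypothetical, not the real format:

```python
# Illustration only: raw binary I/O with struct. The actual field layout of the
# t-SNE data file is documented in the User's guide; this header is made up.
import struct

n, d = 3, 2                                  # hypothetical: point count, dimensionality
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

with open('data.dat', 'wb') as f:
    f.write(struct.pack('ii', n, d))         # two 32-bit ints, back to back
    f.write(struct.pack(f'{n * d}d', *data)) # row-major 64-bit doubles, no separators

# Reading works the same way in reverse:
with open('data.dat', 'rb') as f:
    n, d = struct.unpack('ii', f.read(8))
    values = struct.unpack(f'{n * d}d', f.read(8 * n * d))
```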
How should I specify the landmarks to the binary version of t-SNE?
You can either specify a ratio of points to use as landmarks (between 0 and 1), or you can specify a vector with the indices of the points to use as landmarks. In both cases, the fast version of t-SNE returns a vector with the indices of the landmark points that were used, so you can check what happened.
How can I assess the quality of the visualizations that t-SNE constructed?
Preferably, just look at them! Notice that t-SNE does not retain distances but probabilities, so measuring an error between the Euclidean distances in high-D and low-D is not meaningful. However, if you use the same data and perplexity, you can compare the Kullback-Leibler divergences that t-SNE reports. It is perfectly fine to run t-SNE ten times and select the solution with the lowest KL divergence.
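This restart strategy can be sketched with scikit-learn’s TSNE, which exposes the final KL divergence of a fitted model as the `kl_divergence_` attribute (scikit-learn’s implementation, not the code distributed here):

```python
# Run t-SNE several times with different random initializations and keep the
# solution with the lowest reported KL divergence.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
X = rng.randn(100, 10)            # stand-in dataset

best_embedding, best_kl = None, np.inf
for seed in range(3):             # the text suggests ten runs; three for brevity
    tsne = TSNE(n_components=2, perplexity=20, init='random', random_state=seed)
    Y = tsne.fit_transform(X)
    if tsne.kl_divergence_ < best_kl:
        best_embedding, best_kl = Y, tsne.kl_divergence_
```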
How should I set the perplexity in t-SNE?
The performance of t-SNE is fairly robust to different settings of the perplexity. The most appropriate value depends on the density of your data: loosely speaking, a larger or denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50.
What is perplexity anyway?
Perplexity is a measure of information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable to the number of nearest neighbors k that is employed in many manifold learners.
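The definition is short enough to write out directly; for a fair k-sided die, every outcome has probability 1/k and the perplexity comes out to k:

```python
# Perplexity = 2 ** (Shannon entropy in bits) of a probability distribution.
import numpy as np

def perplexity(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # by convention, 0 * log(0) = 0
    entropy = -np.sum(p * np.log2(p))     # Shannon entropy in bits
    return 2.0 ** entropy

print(perplexity([1 / 6] * 6))            # fair six-sided die -> 6.0 (up to rounding)
```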
Every time I run t-SNE, I get a (slightly) different result?
In contrast to, e.g., PCA, t-SNE has a non-convex objective function. The objective function is minimized using a gradient descent optimization that is initialized randomly. As a result, different runs can give you different solutions. Notice that it is perfectly fine to run t-SNE a number of times (with the same data and parameters) and to select the visualization with the lowest value of the objective function as your final visualization.
When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?
This usually indicates that you set your perplexity way too high: all points now want to be equidistant, and the result you got is the closest you can get to equidistant points in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly non-metric similarities as input.
When I run t-SNE, it reports a very low error but the results look crappy?
Presumably, your data contains some very large numbers, causing the binary search for the correct perplexity to fail. At the beginning of the optimization, t-SNE then reports a minimum, mean, and maximum value for sigma of 1. This is a sign that something went wrong! Just divide your data or distances by a big number and try again.
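A minimal sketch of the rescaling suggested above: divide by the largest absolute value so that all entries lie in [-1, 1] before handing the data to t-SNE (the values below are made up):

```python
# Rescale data with very large entries before running t-SNE.
import numpy as np

X = np.array([[1e6, -2e6], [3e6, 4e6]])   # hypothetical data with huge values
X_scaled = X / np.abs(X).max()            # all entries now lie in [-1, 1]
```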
I tried everything you said, but t-SNE still doesn’t seem to work very well?
Maybe there is something weird in your data. As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong. Just check your code again until you find the bug! If nothing works, feel free to drop me a line.
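The PCA sanity check can be sketched with scikit-learn (any PCA implementation will do):

```python
# Project the data to two dimensions with PCA as a sanity check before t-SNE.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.randn(500, 30)                      # stand-in for your dataset

X_pca = PCA(n_components=2).fit_transform(X)
# Plot X_pca: if even this linear 2-D projection shows no structure, t-SNE is
# unlikely to reveal much either; if PCA looks fine but t-SNE does not, check
# your t-SNE pipeline for bugs.
```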
Can I use a pairwise Euclidean distance matrix as input into t-SNE?
Yes, you can! Download the simple Matlab implementation, and use your pairwise Euclidean distance matrix as input into the TSNE_D function.
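The distance matrix itself is language-agnostic; a sketch of building one with SciPy (TSNE_D is the Matlab function mentioned above):

```python
# Build a symmetric pairwise Euclidean distance matrix with zeros on the
# diagonal, the kind of input TSNE_D expects.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.RandomState(0)
X = rng.randn(50, 8)                           # 50 points in 8 dimensions

D = squareform(pdist(X, metric='euclidean'))   # 50 x 50 distance matrix
```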
Can I use a pairwise similarity matrix as input into t-SNE?
Yes, you can! For instance, we successfully applied t-SNE to a dataset of word association data. Download the simple Matlab implementation, make sure the diagonal of the pairwise similarity matrix contains only zeros, symmetrize the matrix, and normalize it to sum up to one. You can now use the result as input into the TSNE_P function.
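The three preprocessing steps above, written out in Python (the matrix here is random, standing in for real similarity data):

```python
# Prepare a pairwise similarity matrix for TSNE_P: zero the diagonal,
# symmetrize, and normalize so all entries sum to one.
import numpy as np

rng = np.random.RandomState(0)
S = rng.rand(5, 5)                 # stand-in for a pairwise similarity matrix

np.fill_diagonal(S, 0.0)           # a point is not its own neighbor
P = (S + S.T) / 2.0                # symmetrize
P /= P.sum()                       # normalize to sum to one
```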
Can I use t-SNE to embed data in more than two dimensions?
Well, yes you can, but there is a catch. The key characteristic of t-SNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic dimensionality of the data and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the t-distribution when embedding into thirty dimensions (or when you try to embed intrinsically very low-dimensional data such as the Swiss roll). More details about this are described in the AISTATS paper.
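Concretely, the low-dimensional similarities use a Student t kernel, and following the AISTATS paper the kernel with alpha degrees of freedom is (1 + d^2 / alpha)^(-(alpha + 1) / 2); alpha = 1 recovers standard t-SNE, and larger alpha gives lighter tails, appropriate for higher embedding dimensionalities. A sketch:

```python
# Low-dimensional similarity matrix with an adjustable-degrees-of-freedom
# Student t kernel; alpha = 1 is the standard t-SNE kernel.
import numpy as np

def student_t_similarities(Y, alpha=1.0):
    # squared pairwise Euclidean distances between embedding points
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = (1.0 + sq / alpha) ** (-(alpha + 1.0) / 2.0)
    np.fill_diagonal(W, 0.0)       # exclude self-similarities
    return W / W.sum()             # normalize to a probability distribution
```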
Why doesn’t t-SNE work as well as LLE or Isomap on the Swiss roll data?
When embedding the Swiss roll data, the crowding problem does not apply. So you may have to use a lighter-tailed t-distribution to embed the Swiss roll successfully (see above). But frankly... who cares about Swiss rolls when you can embed complex real-world data nicely?
Contact
If you encounter problems with the implementations or have questions about t-SNE, make sure you read the paper, the User guide, and the FAQ first! If your question is not answered there, feel free to send me an email. Of course, we are also interested in the (pretty) results you obtained when using t-SNE on your data!