tDistributed Stochastic Neighbor Embedding
tDistributed Stochastic Neighbor Embedding (tSNE) is a (prizewinning) technique for dimensionality reduction that is particularly well suited for the visualization of highdimensional datasets. The technique can be implemented via BarnesHut approximations, allowing it to be applied on large realworld datasets (we applied it on data sets with up to 30 million examples). The technique is introduced in the following papers:

• L.J.P. van der Maaten and G.E. Hinton. Visualizing HighDimensional Data Using tSNE. Journal of Machine Learning Research 9(Nov):25792605, 2008. [ PDF ] [ Supplemental Material (24MB) ]

• L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), JMLR W&CP 5:384391, 2009. [ PDF ]

• L.J.P. van der Maaten. BarnesHutSNE. In Proceedings of the International Conference on Learning Representations, 2013. [ Arxiv ] [ Talk ]
Implementations
Below, implementations of tSNE in various languages are available for download. Some of these implementations were developed by me, and some by other contributors. For the standard tSNE method, implementations in Matlab, C++, CUDA, Python, R, Julia, and JavaScript are available. In addition, we provide a Matlab implementation of parametric tSNE (described here).
Finally, we provide a BarnesHut implementation of tSNE (described here), which is the fastest tSNE implementation to date, and which scales much better to big data sets.
You are free to use, modify, or redistribute this software in any way you want, but only for noncommercial purposes. The use of the software is at your own risk; the authors are not responsible for any damage as a result from errors in the software.
Matlab implementation (user guide)
CUDA implementation (with contributions by Alex)
Binary implementation (with wrappers for Matlab and Python; user guide)
Python implementation
Julia implementation (by Leif Jonsson)
R implementation (by Justin)
JavaScript implementation (by Andrej)
MNIST Dataset
Parametric tSNE (Matlab; see here)
Examples
Some results of our experiments with tSNE are available for download below. In the plots of the Netflix dataset and the words dataset, the third dimension is encoded by means of a color encoding (similar words/movies are close together and have the same color). Most of the ‘errors’ in the embeddings (such as in the 20 newsgroups) are actually due to ‘errors’ in the features tSNE was applied on. In general, the embeddings have a 1NN error that is comparable to that of the original highdimensional features.
MNIST dataset (in 2D)
MNIST dataset (in 3D)
Olivetti faces dataset (in 2D)
COIL20 dataset (in 2D)
Netflix dataset (in 3D) on Russ’ RBM features
Words dataset (in 3D) on Andriy’s semantic features
20 Newsgroups dataset (in 2D) on Simon’s discLDA features
Reuters dataset (in 2D) landmark tSNE using semantic hashing
NIPS dataset (in 2D) on coauthorship data (19882003)
NORB dataset (in 2D) by Vinod
Words (in 2D) by Joseph on features learned by Ronan and Jason
CalTech101 on SIFT bagofwords features
S&P 500 by Steve @ Opera; on information about daily returns on company stock
Interactive map of scientific journals on data by NeesJan and Ludo, using VOSviewer
Relation between World Economic Forum councils
ImageNet by Andrej on Caffe convolutional net features
Multiple maps visualizations
You may also be interested in these blog posts describing applications of tSNE by Andrej Karpathy, Paul Mineiro, Alexander Fabisch, and Henry Tan.
FAQ
The binary implementation of tSNE seems to have messed up the ordering of my data?
It sure did! The fast implementation of tSNE is a landmark version that randomly picks it landmarks, even if you set the ratio of landmarks to 1.0. You can get the indices of the selected landmarks from the resultfile (or from the Matlab script that runs it, for that matter). The format of the result file is described in the User’s guide.
I can’t figure out the file format for the binary version of tSNE?
The format is described in the User’s guide. You also might want to have a look at the Matlab or Python wrapper code: it has code that writes the datafile and reads the resultsfile that can be ported fairly easily to other languages. Please note that the file format is binary (so don’t try to write or read text!), and that it does not contain any spaces, separators, newlines or whatsoever.
How should I specify the landmarks to the binary version of tSNE?
You can either specify a ratio of points to use as landmark points (between 0 and 1), or you can specify a vector with the indices of the points to use as landmark points. In both cases, the fast version of tSNE will return a vector with the indices of the used landmark points, so you can check what happened.
How can I asses the quality of the visualizations that tSNE constructed?
Preferably, just look at them! Notice that tSNE does not retain distances but probabilities, so measuring some error between the Euclidean distances in highD and lowD is useless. However, if you use the same data and perplexity, you can compare the KullbackLeibler divergences that tSNE reports. It is perfectly fine to run tSNE ten times, and select the solution with the lowest KL divergence.
How should I set the perplexity in tSNE?
The performance of tSNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of your data. Loosely speaking, one could say that a larger / denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50.
What is perplexity anyway?
Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In tSNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.
Every time I run tSNE, I get a (slightly) different result?
In contrast to, e.g., PCA, tSNE has a nonconvex objective function. The objective function is minimized using a gradient descent optimization that is initiated randomly. As a result, it is possible that different runs give you different solutions. Notice that it is perfectly fine to run tSNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.
When I run tSNE, I get a strange ‘ball’ with uniformly distributed points?
This usually indicates you set your perplexity way too high. All points now want to be equidistant. The result you got is the closest you can get to equidistant points as is possible in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly nonmetric similarities as input.
When I run tSNE, it reports a very low error but the results look crappy?
Presumably, your data contains some very large numbers, causing the binary search for the correct perplexity to fail. In the beginning of the optimization, tSNE then reports a minimum, mean, and maximum value for sigma of 1. This is a sign that something went wrong! Just divide your data or distances by a big number, and try again.
I tried everything you said, but tSNE still doesn’t seem to work very well?
Maybe there is something weird in your data. As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but tSNE doesn’t, I am fairly sure you did something wrong. Just check your code again until you found the bug! If nothing works, feel free to drop me a line.
Can I use a pairwise Euclidean distance matrix as input into tSNE?
Yes you can! Download the Matlab implementation, and use your pairwise Euclidean distance matrix as input into the TSNE_D function.
Can I use a pairwise similarity matrix as input into tSNE?
Yes you can! For instance, we successfully applied tSNE on a dataset of word association data. Download the Matlab implementation, make sure the diagonal of the pairwise similarity matrix contains only zeros, symmetrize the pairwise similarity matrix, and normalize it to sum up to one. You can now use the result as input into the TSNE_P function.
Can I use tSNE to embed data in more than two dimensions?
Well, yes you can, but there is a catch. The key characteristic of tSNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic data dimensionality and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the tdistribution when embedding into thirty dimensions (or if you try to embed intrinsically very lowdimensional data such as the Swiss roll). More details about this are described in the AISTATS paper.
Why doesn’t tSNE work as well as LLE or Isomap on the Swiss roll data?
When embedding the Swiss roll data, the crowding problem does not apply. So you may have to use a lightertailed tdistribution to embed the Swiss toll successfully (see above). But frankly... who cares about Swiss rolls when you can embed complex realworld data nicely?
Contact
If you encounter problems with the implementations or have questions about tSNE, make sure you read the paper, the User guide, and the FAQ first! If your question is not answered afterwards, feel free to send me an email. Of course we are also interested in the (pretty) results you obtained when you were using tSNE on your data!