t-Distributed Stochastic Neighbor Embedding

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a (prize-winning) technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique can be implemented via Barnes-Hut approximations, allowing it to be applied to large real-world datasets (we applied it to datasets with up to 30 million examples). The technique is introduced in the following papers:


  1. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. [ PDF ] [ Supplemental Material (24MB) ]

  2. L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS), JMLR W&CP 5:384-391, 2009. [ PDF ]

  3. L.J.P. van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013. [ Arxiv ] [ Talk ]


Implementations

Below, implementations of t-SNE in various languages are available for download. Some of these implementations were developed by me, and some by other contributors. For the standard t-SNE method, implementations in Matlab, C++, CUDA, Python, R, Julia, and JavaScript are available. In addition, we provide a Matlab implementation of parametric t-SNE (described in paper 2 above).


Finally, we provide a Barnes-Hut implementation of t-SNE (described in paper 3 above), which is the fastest t-SNE implementation to date, and which scales much better to large datasets.


You are free to use, modify, or redistribute this software in any way you want, but only for non-commercial purposes. The use of the software is at your own risk; the authors are not responsible for any damage resulting from errors in the software.


Matlab implementation (user guide)


CUDA implementation (with contributions by Alex)


Binary implementation (with wrappers for Matlab and Python; user guide)




Python implementation


Julia implementation (by Leif Jonsson)


R implementation (by Justin)


JavaScript implementation (by Andrej)


MNIST Dataset


Parametric t-SNE (Matlab; see here)


Barnes-Hut-SNE (with C++, Matlab, Python, and R wrappers)

Examples

Some results of our experiments with t-SNE are available for download below. In the plots of the Netflix dataset and the words dataset, the third dimension is encoded by means of a color encoding (similar words/movies are close together and have the same color). Most of the ‘errors’ in the embeddings (such as in the 20 newsgroups) are actually due to ‘errors’ in the features t-SNE was applied on. In general, the embeddings have a 1-NN error that is comparable to that of the original high-dimensional features.


MNIST dataset (in 2D)

MNIST dataset (in 3D)


Olivetti faces dataset (in 2D)


COIL-20 dataset (in 2D)


Netflix dataset (in 3D) on RussRBM features


Words dataset (in 3D) on Andriy’s semantic features


20 Newsgroups dataset (in 2D) on Simon’s discLDA features


Reuters dataset (in 2D) landmark t-SNE using semantic hashing


NIPS dataset (in 2D) on co-authorship data (1988-2003)


NORB dataset (in 2D) by Vinod


Words (in 2D) by Joseph on features learned by Ronan and Jason


CalTech-101 on SIFT bag-of-words features


S&P 500 by Steve @ Opera; on information about daily returns on company stock


Interactive map of scientific journals on data by Nees-Jan and Ludo, using VOSviewer


Relation between World Economic Forum councils


ImageNet by Andrej on Caffe convolutional net features


Multiple maps visualizations


You may also be interested in these blog posts describing applications of t-SNE by Andrej Karpathy, Paul Mineiro, Alexander Fabisch, and Henry Tan.



FAQ

The binary implementation of t-SNE seems to have messed up the ordering of my data?
It sure did! The fast implementation of t-SNE is a landmark version that randomly picks its landmarks, even if you set the ratio of landmarks to 1.0. You can get the indices of the selected landmarks from the result file (or from the Matlab script that runs it, for that matter). The format of the result file is described in the User’s guide.


I can’t figure out the file format for the binary version of t-SNE?

The format is described in the User’s guide. You also might want to have a look at the Matlab or Python wrapper code: it has code that writes the data file and reads the result file, and that code can be ported fairly easily to other languages. Please note that the file format is binary (so don’t try to write or read text!), and that it does not contain any spaces, separators, newlines, or anything of the kind.


How should I specify the landmarks to the binary version of t-SNE?
You can either specify a ratio of points to use as landmark points (between 0 and 1), or you can specify a vector with the indices of the points to use as landmark points. In both cases, the fast version of t-SNE will return a vector with the indices of the used landmark points, so you can check what happened.


How can I assess the quality of the visualizations that t-SNE constructed?

Preferably, just look at them! Notice that t-SNE does not retain distances but probabilities, so measuring some error between the Euclidean distances in high-D and low-D is useless. However, if you use the same data and perplexity, you can compare the Kullback-Leibler divergences that t-SNE reports. It is perfectly fine to run t-SNE ten times, and select the solution with the lowest KL divergence.
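As a rough illustration of what is being compared, the sketch below computes the Kullback-Leibler divergence between a high-dimensional similarity matrix P and the low-dimensional similarities Q of two candidate embeddings, and keeps the run with the lower value. The t-SNE implementations report this number themselves; the function names (`kl_divergence`, `best_run`) and the tiny 3-point matrices are made up for the example and are not part of any of the implementations above.

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) = sum_ij p_ij * log(p_ij / q_ij) for matrices given as nested lists."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0:  # terms with p_ij = 0 contribute nothing
                total += p * math.log(p / max(q, eps))
    return total

def best_run(runs):
    """Pick the (label, kl) pair with the lowest KL divergence.

    Only meaningful when all runs used the same data and the same perplexity.
    """
    return min(runs, key=lambda run: run[1])

# Toy input similarities for 3 points (symmetric, zero diagonal, sums to 1).
P = [[0.00, 0.35, 0.10],
     [0.35, 0.00, 0.05],
     [0.10, 0.05, 0.00]]

# Toy low-dimensional similarities from two hypothetical runs.
Q_a = [[0.0, 0.3, 0.1],
       [0.3, 0.0, 0.1],
       [0.1, 0.1, 0.0]]
Q_b = [[0.0, 0.1, 0.2],
       [0.1, 0.0, 0.2],
       [0.2, 0.2, 0.0]]

runs = [("run_a", kl_divergence(P, Q_a)), ("run_b", kl_divergence(P, Q_b))]
label, kl = best_run(runs)
print(label, kl)  # run_a has the lower divergence for these toy matrices
```

Comparing these values across different datasets or perplexities is not meaningful, because P itself changes.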


How should I set the perplexity in t-SNE?

The performance of t-SNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of your data. Loosely speaking, one could say that a larger / denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50.


What is perplexity anyway?

Perplexity is a measure of information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.
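The definition can be checked with a few lines of Python. This is only a minimal sketch of the formula; the function name `perplexity` is chosen for the example.

```python
import math

def perplexity(probs):
    """Return 2 ** H(probs), where H is the Shannon entropy in bits."""
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

# A fair die with k sides has perplexity k (up to floating-point rounding).
k = 6
fair_die = [1.0 / k] * k
print(perplexity(fair_die))

# A skewed distribution over 4 outcomes has fewer than 4 "effective" outcomes.
print(perplexity([0.7, 0.1, 0.1, 0.1]))
```

In t-SNE, the distribution in question is the conditional distribution over neighbors of a single point, which is why the perplexity reads as an effective neighbor count.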


Every time I run t-SNE, I get a (slightly) different result?

In contrast to, e.g., PCA, t-SNE has a non-convex objective function. The objective function is minimized using a gradient descent optimization that is initialized randomly. As a result, it is possible that different runs give you different solutions. Notice that it is perfectly fine to run t-SNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.


When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?

This usually indicates you set your perplexity way too high. All points now want to be equidistant. The result you got is as close to equidistant points as is possible in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly non-metric similarities as input.


When I run t-SNE, it reports a very low error but the results look crappy?

Presumably, your data contains some very large numbers, causing the binary search for the correct perplexity to fail. In the beginning of the optimization, t-SNE then reports a minimum, mean, and maximum value for sigma of 1. This is a sign that something went wrong! Just divide your data or distances by a big number, and try again.
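To see why rescaling is harmless, here is a minimal sketch of the per-point binary search: it looks for the Gaussian bandwidth sigma at which the conditional distribution over neighbors has the requested perplexity. It is written for this page and is not the code of any of the implementations above. In particular, subtracting the smallest distance before exponentiating is a numerical safeguard added in this sketch; without something like it, very large distances make every exponential underflow to zero, which is the kind of failure described above. All function names are illustrative.

```python
import math

def conditional_probs(sq_dists, beta):
    """Gaussian probabilities p_{j|i} for one point i.

    sq_dists: squared distances from point i to the other points.
    beta: precision, beta = 1 / (2 * sigma ** 2).
    """
    d_min = min(sq_dists)  # shift cancels after normalisation, avoids underflow
    weights = [math.exp(-beta * (d - d_min)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]

def perplexity(probs):
    entropy_bits = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy_bits

def find_sigma(sq_dists, target_perplexity, tol=1e-5, max_iter=200):
    """Bisection on beta until the perplexity of p_{.|i} matches the target."""
    lo, hi = 0.0, math.inf
    beta = 1.0
    for _ in range(max_iter):
        perp = perplexity(conditional_probs(sq_dists, beta))
        if abs(perp - target_perplexity) < tol:
            break
        if perp > target_perplexity:
            # Distribution too flat: raise the precision (smaller sigma).
            lo = beta
            beta = beta * 2 if hi == math.inf else (beta + hi) / 2
        else:
            # Distribution too peaked: lower the precision (larger sigma).
            hi = beta
            beta = (beta + lo) / 2
    return math.sqrt(1.0 / (2.0 * beta))

# Squared distances from one point to six others, with very large values.
sq_dists_large = [1e8, 4e8, 9e8, 16e8, 25e8, 36e8]

# Dividing the data by `scale` divides squared distances by scale ** 2.
scale = 1e4
sq_dists_scaled = [d / scale ** 2 for d in sq_dists_large]

sigma_large = find_sigma(sq_dists_large, target_perplexity=3.0)
sigma_scaled = find_sigma(sq_dists_scaled, target_perplexity=3.0)
print(sigma_large, sigma_scaled, sigma_large / sigma_scaled)
```

The two sigmas differ by (approximately) the factor `scale`, and the resulting neighbor probabilities are the same, so dividing the data by a constant changes nothing about the similarities t-SNE tries to preserve.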


I tried everything you said, but t-SNE still doesn’t seem to work very well?

Maybe there is something weird in your data. As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong. Just check your code again until you find the bug! If nothing works, feel free to drop me a line.


Can I use a pairwise Euclidean distance matrix as input into t-SNE?

Yes you can! Download the Matlab implementation, and use your pairwise Euclidean distance matrix as input into the TSNE_D function.


Can I use a pairwise similarity matrix as input into t-SNE?

Yes you can! For instance, we successfully applied t-SNE on a dataset of word association data. Download the Matlab implementation, make sure the diagonal of the pairwise similarity matrix contains only zeros, symmetrize the pairwise similarity matrix, and normalize it to sum up to one. You can now use the result as input into the TSNE_P function.
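The three preprocessing steps can be sketched as follows. This is a plain-Python illustration, not the Matlab code; `prepare_similarities` is a name made up for the example, and symmetrizing by averaging the matrix with its transpose is an assumption (the text above only says to symmetrize).

```python
def prepare_similarities(S):
    """Zero the diagonal, symmetrize, and normalize a square similarity matrix.

    S is a list of rows with non-negative entries.
    """
    n = len(S)
    # 1. Make sure the diagonal contains only zeros.
    P = [[0.0 if i == j else float(S[i][j]) for j in range(n)] for i in range(n)]
    # 2. Symmetrize (assumption: average with the transpose).
    P = [[(P[i][j] + P[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    # 3. Normalize so that all entries sum to one.
    total = sum(sum(row) for row in P)
    if total <= 0:
        raise ValueError("similarity matrix has no positive off-diagonal mass")
    return [[value / total for value in row] for row in P]

# Toy, asymmetric similarity matrix with a non-zero diagonal.
S = [[5.0, 2.0, 0.0],
     [1.0, 3.0, 4.0],
     [0.0, 2.0, 7.0]]

P = prepare_similarities(S)
for row in P:
    print(row)
```

The resulting matrix plays the role of the joint probabilities P that t-SNE normally computes from the data itself.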


Can I use t-SNE to embed data in more than two dimensions?
Well, yes you can, but there is a catch. The key characteristic of t-SNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic data dimensionality and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the t-distribution when embedding into thirty dimensions (or if you try to embed intrinsically very low-dimensional data such as the Swiss roll). More details about this are described in the AI-STATS paper.
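The role of the degrees of freedom can be illustrated with the Student-t similarity used in the low-dimensional map. The sketch below follows the form given in the AI-STATS paper, with q_ij proportional to (1 + d²/α)^(-(α+1)/2), where α = 1 recovers the standard t-SNE kernel. The function names are chosen for the example and do not come from any of the implementations above.

```python
def t_kernel(sq_dist, alpha=1.0):
    """Unnormalised Student-t similarity with alpha degrees of freedom.

    alpha = 1 gives the standard t-SNE kernel 1 / (1 + d^2).
    """
    return (1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0)

def low_dim_affinities(Y, alpha=1.0):
    """Normalised similarities q_ij for map points Y (a list of coordinate lists)."""
    n = len(Y)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                W[i][j] = t_kernel(sq_dist, alpha)
    total = sum(sum(row) for row in W)
    return [[w / total for w in row] for row in W]

# Increasing alpha makes the tails lighter: far-apart points get much less
# similarity than under the standard kernel.
far = 100.0  # squared distance between two distant map points
print(t_kernel(far, alpha=1.0), t_kernel(far, alpha=10.0))

# Small map with three 2D points, one of them far from the other two.
Y = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
Q = low_dim_affinities(Y, alpha=1.0)
```

With heavier tails (small α), dissimilar points can be placed far apart at little cost, which is what counteracts crowding in two dimensions; when crowding is less of an issue, a larger α is often the better fit.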


Why doesn’t t-SNE work as well as LLE or Isomap on the Swiss roll data?
When embedding the Swiss roll data, the crowding problem does not apply. So you may have to use a lighter-tailed t-distribution to embed the Swiss roll successfully (see above). But frankly... who cares about Swiss rolls when you can embed complex real-world data nicely?


Contact

If you encounter problems with the implementations or have questions about t-SNE, make sure you read the paper, the User guide, and the FAQ first! If your question is not answered afterwards, feel free to send me an email. Of course we are also interested in the (pretty) results you obtained when you were using t-SNE on your data!