t-Distributed Stochastic Neighbor Embedding

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a new technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets. The technique has a natural landmark version that can readily be applied to large real-world datasets (we applied it on, e.g., all 60,000 MNIST digit images). The technique is introduced in the following papers:


  1. L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008. [ PDF ] [ Supplemental Material (24MB) ]

  2. L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS), JMLR W&CP 5:384-391, 2009. [ PDF ]

  3. L.J.P. van der Maaten. Barnes-Hut-SNE. In Proceedings of the International Conference on Learning Representations, 2013. [ Arxiv ] [ Code ] [ Talk ]


Implementations

Below, six implementations of t-SNE are available for download. The first implementation is a simple Matlab implementation of t-SNE, that allows you to understand the technique. The code requires Matlab R2007a or higher. An implementation that takes as input a pairwise probability or a pairwise distance matrix is also included.

The second implementation is in CUDA. Depending on the available hardware and problem size, this implementation can be up to 20x faster than other naive t-SNE implementations. The implementation does require a working installation of CUDA 3.x, as well as a working Matlab installation.
The third implementation is a binary that reads a datafile and writes the results to a file. Scripts allowing you to use this implementation in Matlab and Python are also available. The binary implementation is faster than the Matlab implementation, and allows you to use the landmark version of t-SNE. The binary implementation was implemented using
Intel IPP.

The fourth and fifth implementation are in Python and R, respectively. These are much slower, but they work just as well.

The sixth implementation is a Barnes-Hut implementation of t-SNE, which is the fastest t-SNE implementation so far, and whic scales much better to big data sets. There is also an implementation of parametric t-SNE (in Matlab) available.

You are free to use, modify, or redistribute this software in any way you want, but only for non-commercial purposes. The use of the software is at your own risk; the authors are not responsible for any damage as a result from errors in the software.


Simple Matlab implementation


CUDA implementation (with contributions by Alex)


Binary implementation




Matlab script that runs binary implementation


Python script that runs binary implementation (by Philippe)


Simple Python implementation


Simple Julia implementation (by Leif Jonsson)


Simple R implementation (by Justin)


User guide


MNIST Dataset


Parametric t-SNE (Matlab; see here)


Alternative t-SNE Optimizer (Matlab; see here and here)

(Note: this code sometimes has numerical problems)


Barnes-Hut-SNE (C++, Matlab, and Python; see here for algorithm details)


Examples

Some results of our experiments with t-SNE are available for download below. In the plots of the Netflix dataset and the words dataset, the third dimension is encoded by means of a color encoding (similar words/movies are close together and have the same color). Most of the ‘errors’ in the embeddings (such as in the 20 newsgroups) are actually due to ‘errors’ in the features t-SNE was applied on. In general, the embeddings have a 1-NN error that is comparable to that of the original high-dimensional features!


MNIST dataset (in 2D)

MNIST dataset (in 3D)


Olivetti faces dataset (in 2D)


COIL-20 dataset (in 2D)


Netflix dataset (in 3D) on RussRBM features


Words dataset (in 3D) on Andriy’s semantic features


20 Newsgroups dataset (in 2D) on Simon’s discLDA features


Reuters dataset (in 2D) landmark t-SNE using semantic hashing


NIPS dataset (in 2D) on co-authorship data (1988-2003)


NORB dataset (in 2D) by Vinod


Words (in 2D) by Joseph on features learned by Ronan and Jason


CalTech-101 on SIFT bag-of-words features


S&P 500 by Steve @ Opera; on information about daily returns on company stock


Interactive map of scientific journals on data by Nees-Jan and Ludo, using VOSviewer


Relation between World Economic Forum councils



FAQ

The binary implementation of t-SNE seems to have messed up the ordering of my data?
It sure did! The fast implementation of t-SNE is a landmark version that randomly picks it landmarks, even if you set the ratio of landmarks to 1.0. You can get the indices of the selected landmarks from the result-file (or from the Matlab script that runs it, for that matter). The format of the result file is described in the User’s guide.


I can’t figure out the file format for the binary version of t-SNE?

The format is described in the User’s guide. You also might want to have a look at the Matlab or Python wrapper code: it has code that writes the data-file and reads the results-file that can be ported fairly easily to other languages. Please note that the file format is binary (so don’t try to write or read text!), and that it does not contain any spaces, separators, newlines or whatsoever.


How should I specify the landmarks to the binary version of t-SNE?
You can either specify a ratio of points to use as landmark points (between 0 and 1), or you can specify a vector with the indices of the points to use as landmark points. In both cases, the fast version of t-SNE will return a vector with the indices of the used landmark points, so you can check what happened.


How can I asses the quality of the visualizations that t-SNE constructed?

Preferably, just look at them! Notice that t-SNE does not retain distances but probabilities, so measuring some error between the Euclidean distances in high-D and low-D is useless. However, if you use the same data and perplexity, you can compare the Kullback-Leibler divergences that t-SNE reports. It is perfectly fine to run t-SNE ten times, and select the solution with the lowest KL divergence.


How should I set the perplexity in t-SNE?

The performance of t-SNE is fairly robust under different settings of the perplexity. The most appropriate value depends on the density of your data. Loosely speaking, one could say that a larger / denser dataset requires a larger perplexity. Typical values for the perplexity range between 5 and 50.


What is perplexity anyway?

Perplexity is a measure for information that is defined as 2 to the power of the Shannon entropy. The perplexity of a fair die with k sides is equal to k. In t-SNE, the perplexity may be viewed as a knob that sets the number of effective nearest neighbors. It is comparable with the number of nearest neighbors k that is employed in many manifold learners.


Every time I run t-SNE, I get a (slightly) different result?

In contrast to, e.g., PCA, t-SNE has a non-convex objective function. The objective function is minimized using a gradient descent optimization that is initiated randomly. As a result, it is possible that different runs give you different solutions. Notice that it is perfectly fine to run t-SNE a number of times (with the same data and parameters), and to select the visualization with the lowest value of the objective function as your final visualization.


When I run t-SNE, I get a strange ‘ball’ with uniformly distributed points?

This usually indicates you set your perplexity way too high. All points now want to be equidistant. The result you got is the closest you can get to equidistant points as is possible in two dimensions. If lowering the perplexity doesn’t help, you might have run into the problem described in the next question. Similar effects may also occur when you use highly non-metric similarities as input.


When I run t-SNE, it reports a very low error but the results look crappy?

Presumably, your data contains some very large numbers, causing the binary search for the correct perplexity to fail. In the beginning of the optimization, t-SNE then reports a minimum, mean, and maximum value for sigma of 1. This is a sign that something went wrong! Just divide your data or distances by a big number, and try again.


I tried everything you said, but t-SNE still doesn’t seem to work very well?

Maybe there is something weird in your data. As a sanity check, try running PCA on your data to reduce it to two dimensions. If this also gives bad results, then maybe there is not very much nice structure in your data in the first place. If PCA works well but t-SNE doesn’t, I am fairly sure you did something wrong. Just check your code again until you found the bug! If nothing works, feel free to drop me a line.


Can I use a pairwise Euclidean distance matrix as input into t-SNE?

Yes you can! Download the simple Matlab implementation, and use your pairwise Euclidean distance matrix as input into the TSNE_D function.


Can I use a pairwise similarity matrix as input into t-SNE?

Yes you can! For instance, we successfully applied t-SNE on a dataset of word association data. Download the simple Matlab implementation, make sure the diagonal of the pairwise similarity matrix contains only zeros, symmetrize the pairwise similarity matrix, and normalize it to sum up to one. You can now use the result as input into the TSNE_P function.


Can I use t-SNE to embed data in more than two dimensions?
Well, yes you can, but there is a catch. The key characteristic of t-SNE is that it solves a problem known as the crowding problem. The extent to which this problem occurs depends on the ratio between the intrinsic data dimensionality and the embedding dimensionality. So, if you embed in, say, thirty dimensions, the crowding problem is less severe than when you embed in two dimensions. As a result, it often works better if you increase the degrees of freedom of the t-distribution when embedding into thirty dimensions (or if you try to embed intrinsically very low-dimensional data such as the Swiss roll). More details about this are described in the AI-STATS paper.


Why doesn’t t-SNE work as well as LLE or Isomap on the Swiss roll data?
When embedding the Swiss roll data, the crowding problem does not apply. So you may have to use a lighter-tailed t-distribution to embed the Swiss toll successfully (see above). But frankly... who cares about Swiss rolls when you can embed complex real-world data nicely?


Contact

If you encounter problems with the implementations or have questions about t-SNE, make sure you read the paper, the User guide, and the FAQ first! If your question is not answered afterwards, feel free to send me an email. Of course we are also interested in the (pretty) results you obtained when you were using t-SNE on your data!