Every time we build a machine learning model or any predictive model, the first thing we ask is how to evaluate it? What’s the best metric for each model? For supervised machine learning problem, there are usually pre-set or well-known metrics. But for unsupervised learning, what should we do?

Let’s first look at what’s the typical unsupervised learning algorithms and its corresponding application scenes.

Typical unsupervised learning includes:

- Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
- k-means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
- Gaussian mixture models: models clusters as a mixture of multivariate normal density components
- Self-organizing maps: uses neural network that learns the topology and distribution of the data
- Hidden Markov models: uses observed data to recover the sequence of states
- Generative model such as Boltzmann machine to generate the distribution of outputs similar to input

Unsupervised learning methods are sued in bioinformatics for sequence analysis and genetic clustering; in data mining for sequence and pattern mining; in medical imaging for image segmentation; and in computer vision for object recognition, dimensionality reduction techniques for reducing dimensions.

Let’s go back to our original question: how to evaluate unsupervised learning?

Obviously, the answer depends on the class of unsupervised algorithms you use.

- Dimensionality reduction algorithms

For this type of algorithms, we can use methods similar to supervised learning by looking at its reconstructing error with test dataset or by applying a k-fold cross-validation procedure.

2. Clustering algorithms

It is difficult to evaluate a clustering if you don’t have labeled test data. Typically there are two types of metrics: I. internal metrics, use only information on the computed clusters to evaluate if clusters are compact and well-separated[3]; II. external metrics that perform a statistical testing on the structure of your data [1].

For external indices, we evaluate the results of a clustering algorithm based on a known cluster structure of a data set (or cluster labels).

For internal indices, we evaluate the results using quantities and features inherent in the data set. The optimal number of clusters is usually determined based on an internal validity index.

A very good resource for clustering evaluation is from sklearn’s documentation page where it listed methods like adjusted rand index, mutual information based scores, homogeneity,, completeness and V-measure, Fowlkes-Mallows scores and etc. With one method not covered: the Silhouette Coefficient which assumes ground truth labels are available.

Sometimes, an extrinsic performance function can be defined to evaluate it. For instance, if clustering is used to create meaningful classes (e.g. documents classification), it is possible to create an external dataset by hand-labeling and test the accuracy (gold standard). Another way of evaluating clustering is using high-dimension visualization tools like t-sne to visually check. For example, for feature learning in images, visualization of the learned features can be useful.

3. Generative models

This type of method is stochastic, the actual value achieved after a given amount of training may depend on random seeds. So we can vary these and compare several runs to see if there is significant different performance. Also, visualizing the constructed output along with input can be a good metric too. For example, reconstructed hand-written digits with RBM can be compared with training samples.

Ref: