How to evaluate unsupervised learning

Every time we build a machine learning model, or any predictive model, the first question we ask is how to evaluate it: what is the best metric for this model? For supervised machine learning problems, there are usually well-known metrics. But for unsupervised learning, what should we do?

Let’s first look at the typical unsupervised learning algorithms and their corresponding application areas.

Typical unsupervised learning includes:

  • Hierarchical clustering: builds a multilevel hierarchy of clusters by creating a cluster tree
  • k-means clustering: partitions data into k distinct clusters based on distance to the centroid of a cluster
  • Gaussian mixture models: models clusters as a mixture of multivariate normal density components
  • Self-organizing maps: uses a neural network that learns the topology and distribution of the data
  • Hidden Markov models: uses observed data to recover the underlying sequence of states
  • Generative models, such as the Boltzmann machine, which learn to generate outputs whose distribution is similar to that of the input

Unsupervised learning methods are used in bioinformatics for sequence analysis and genetic clustering; in data mining for sequence and pattern mining; in medical imaging for image segmentation; and in computer vision for object recognition. Dimensionality reduction techniques are used across all of these to reduce the number of features.

Let’s go back to our original question: how to evaluate unsupervised learning?

Obviously, the answer depends on the class of unsupervised algorithms you use.

  1. Dimensionality reduction algorithms

For this type of algorithm, we can borrow from supervised learning: look at the reconstruction error on a held-out test dataset, or apply a k-fold cross-validation procedure.
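As a minimal sketch of the reconstruction-error approach (using scikit-learn’s PCA on placeholder data; the variable names are my own):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 20)          # placeholder data; substitute your own
X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

pca = PCA(n_components=5).fit(X_train)                # fit on training data only
X_rec = pca.inverse_transform(pca.transform(X_test))  # project down, then back up
mse = np.mean((X_test - X_rec) ** 2)                  # held-out reconstruction error
print("held-out reconstruction MSE:", mse)

The same loop can be wrapped in a k-fold cross-validation procedure to compare different numbers of components.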

  2. Clustering algorithms

It is difficult to evaluate a clustering if you don’t have labeled test data. Typically there are two types of metrics: I. internal metrics, which use only the computed clusters and the data itself to assess whether clusters are compact and well-separated [3]; II. external metrics, which compare the clustering against an externally known structure, such as class labels [1].

For external indices, we evaluate the results of a clustering algorithm based on a known cluster structure of a data set (or cluster labels).

For internal indices, we evaluate the results using quantities and features inherent in the data set. The optimal number of clusters is usually determined based on an internal validity index.

A very good resource for clustering evaluation is sklearn’s documentation page, which lists methods like the adjusted Rand index, mutual-information-based scores, homogeneity, completeness and V-measure, and the Fowlkes-Mallows score. All of these require ground truth labels; the one that does not is the Silhouette Coefficient, an internal index computed from the data alone.
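As a minimal sketch of the two families of metrics (scikit-learn on toy data; the variable names are my own):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

print("adjusted Rand index (external, needs ground truth):", adjusted_rand_score(y_true, labels))
print("silhouette coefficient (internal, data only):", silhouette_score(X, labels))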

Sometimes, an extrinsic performance function can be defined. For instance, if clustering is used to create meaningful classes (e.g. document classification), it is possible to create an external dataset by hand-labeling and test the accuracy against this gold standard. Another way of evaluating a clustering is to use a high-dimensional visualization tool like t-SNE for a visual check. For example, for feature learning on images, visualizing the learned features can be useful.
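As a quick sketch of such a visual check (scikit-learn’s t-SNE and matplotlib on toy data; names are my own):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=300, centers=4, n_features=10, random_state=0)
labels = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(X)

# project to 2-D and color each point by its assigned cluster
X_2d = TSNE(n_components=2, random_state=0).fit_transform(X)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10)
plt.title("t-SNE projection colored by cluster label")
plt.show()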


  3. Generative models

This type of method is stochastic: the actual value achieved after a given amount of training may depend on the random seed. So we can vary the seed and compare several runs to see whether performance differs significantly. Visualizing the reconstructed output alongside the input is also a useful check; for example, handwritten digits reconstructed by an RBM can be compared with the training samples.

 

Ref:

[1] Halkidi, Maria, Yannis Batistakis, and Michalis Vazirgiannis. “On clustering validation techniques.” Journal of Intelligent Information Systems 17.2-3 (2001): 107-145.
[2] Hall, Peter, Jeff Racine, and Qi Li. “Cross-validation and the estimation of conditional probability densities.” Journal of the American Statistical Association 99.468 (2004).
[3] Yanchi Liu, Zhongmou Li, and Hui Xiong. “Understanding of Internal Clustering Validation Measures.” IEEE International Conference on Data Mining (2010).
[4] https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html


How to choose the number of hidden layers and nodes in a feedforward neural network?

While building neural networks, it takes a lot of time to fine-tune the hyperparameters: the number of layers, the number of nodes per layer, the learning rate, momentum, etc. How do we best choose the most important starting parameters, especially the number of layers and the number of nodes? Is there any rule of thumb?

Before jumping to conclusions, let’s review the key parts of a feedforward neural network and its intrinsic properties. A typical feedforward neural network has three parts: the input layer, the hidden layer(s), and the output layer.

The input layer:  

Simple – every NN has exactly one input layer (also referred to as layer zero), and the number of neurons equals the number of features in the input (columns in the input dataset).

The output layer:

Like the input layer, each neural network has exactly one output layer. Its size is determined entirely by the task: one neuron for regression output, and one per class for classification (e.g. 10 for handwritten digit recognition).

The hidden layer:

This is what we need to figure out. How many layers do we need?

First, look at your problem: if the data is linearly separable, you don’t need any hidden layers at all.

If it’s not linearly separable, there are numerous opinions on this question. One issue on which there is a consensus is the performance difference from adding additional hidden layers: the situations in which performance improves with a second (or third, etc.) hidden layer are very few. One hidden layer is sufficient for the large majority of problems. (ref: ftp://ftp.sas.com/pub/neural/FAQ.html)

Here is an excerpt from Jeff Heaton’s book:

 There is currently no theoretical reason to use neural networks with any more than two hidden layers. In fact, for many practical problems, there is no reason to use any more than one hidden layer. Table 5.1 summarizes the capabilities of neural network architectures with various hidden layers.

Table 5.1: Determining the Number of Hidden Layers

| Number of Hidden Layers | Result |
| --- | --- |
| 0 | Only capable of representing linearly separable functions or decisions. |
| 1 | Can approximate any function that contains a continuous mapping from one finite space to another. |
| 2 | Can represent an arbitrary decision boundary to arbitrary accuracy with rational activation functions and can approximate any smooth mapping to any accuracy. |

How about the number of neurons?

According to Jeff Heaton, “the optimal size of the hidden layer is usually between the size of the input and size of the output layers.” — “Introduction to Neural Networks in Java”

From Jeff Heaton’s book:

Using too few neurons in the hidden layers will result in something called underfitting. Underfitting occurs when there are too few neurons in the hidden layers to adequately detect the signals in a complicated data set.

Using too many neurons in the hidden layers can result in several problems. First, too many neurons in the hidden layers may result in overfitting. Overfitting occurs when the neural network has so much information processing capacity that the limited amount of information contained in the training set is not enough to train all of the neurons in the hidden layers. A second problem can occur even when the training data is sufficient. An inordinately large number of neurons in the hidden layers can increase the time it takes to train the network. The amount of training time can increase to the point that it is impossible to adequately train the neural network. Obviously, some compromise must be reached between too many and too few neurons in the hidden layers.

There are many rule-of-thumb methods for determining the correct number of neurons to use in the hidden layers, such as the following:

  • The number of hidden neurons should be between the size of the input layer and the size of the output layer.
  • The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
  • The number of hidden neurons should be less than twice the size of the input layer.

These three rules provide a starting point for you to consider. Ultimately, the selection of an architecture for your neural network will come down to trial and error. But what exactly is meant by trial and error? You do not want to start throwing random numbers of layers and neurons at your network. To do so would be very time consuming. Chapter 8, “Pruning a Neural Network” will explore various ways to determine an optimal structure for a neural network.
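As a toy sketch of these three rules of thumb (the helper function and example sizes are my own):

# Compute the three rule-of-thumb starting sizes for the hidden layer.
def hidden_size_candidates(n_in, n_out):
    return {
        'between input and output': (n_in + n_out) // 2,  # midpoint of the suggested range
        '2/3 input + output': (2 * n_in) // 3 + n_out,
        'upper bound (< 2x input)': 2 * n_in - 1,
    }

print(hidden_size_candidates(n_in=30, n_out=10))
# {'between input and output': 20, '2/3 input + output': 30, 'upper bound (< 2x input)': 59}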

Empirical equations for determining number of neurons:

A. For supervised learning problems, there are empirical formulas to determine the number of hidden neurons:

N_h = N_s / (α × (N_i + N_o)), where N_h is the number of hidden neurons, N_s the number of training samples, N_i the number of input neurons, and N_o the number of output neurons.

This concept is explained in the excellent Neural Network Design book [ref. 4]: you want to limit the number of free parameters in your model (its degrees of freedom, or number of nonzero weights) to a small portion of the degrees of freedom in your data. The data’s degrees of freedom are Ns * (Ni + No), assuming all samples are independent. Alpha in this equation indicates how strongly you want to guard against overfitting.

B. A rough approximation can be obtained by the geometric pyramid rule proposed by Masters (1993): for a three-layer network with n input and m output neurons, the hidden layer would have sqrt(n * m) neurons. — Masters, Timothy. Practical Neural Network Recipes in C++. Morgan Kaufmann, 1993.
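As a minimal sketch of both empirical formulas (the function names are my own, and alpha is a free parameter you choose):

import math

# Formula A: cap the hidden size by the data's degrees of freedom.
def n_hidden_dof(n_samples, n_in, n_out, alpha=5.0):
    return max(1, round(n_samples / (alpha * (n_in + n_out))))

# Formula B: Masters' geometric pyramid rule, sqrt(n * m).
def n_hidden_pyramid(n_in, n_out):
    return max(1, round(math.sqrt(n_in * n_out)))

print(n_hidden_dof(n_samples=10000, n_in=30, n_out=10))  # 50 with alpha = 5
print(n_hidden_pyramid(n_in=30, n_out=10))               # 17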

In the end, it comes down to trial and error. There are many optimization methods to address this problem after NN initialization; the most popular are pruning and up-front approaches like genetic algorithms. With pruning, you can look at the weights the network learned: when a neuron’s weights are close to zero, that neuron is not important. You can also build automatic pruning algorithms by iteratively reducing the number of neurons and comparing model performance.

References:

  1. https://web.archive.org/web/20140721050413/http://www.heatonresearch.com/node/707
  2. http://www.heatonresearch.com/book/programming-neural-networks-java-2.html
  3. http://www.heatonresearch.com/book/
  4. http://hagan.okstate.edu/NNDesign.pdf#page=469
  5. http://www.faqs.org/faqs/ai-faq/neural-nets/part1/preamble.html


Social Network Analysis: where do you start?

Since the debut of Facebook, Twitter, Snapchat and WeChat, social networks have taken center stage for innovation and new business models, and have transformed into a new medium that deeply impacts our lives. Some even claim they can change election results. Social network relationships are usually represented by a directed or undirected graph. What can we do with such a graph, and what value can we get from it?

Before diving into the things we can do, I’ll first go over some graph terminology using the following illustrative example:

(Figure: example social network graph)

Vertex: the fundamental unit of a graph; in this example, the people (nodes or objects).

Edge: a connection between a pair of nodes; it can be directed (also called an arc) or undirected.

Indegree, outdegree, hub and authority scores: measures of the centrality (“importance”) of a node in a network.

Path: A sequence of vertices such that from each vertex there is an edge to the next vertex in the sequence.

Now that we know the terminology for a network graph, what can we do with a given graph? Here are some starting points, with R code to illustrate them:

  • Measure connectedness of points

Connectedness measures how strongly a vertex is connected to other vertices. The number of edges connecting to a vertex is called the vertex’s degree.

Example: node 1 has degree 3, and node 2 has the highest degree, 6.

> library(igraph)   # note: loading sna afterwards would mask igraph's degree() and betweenness()
> graph1 <- sample_pa(15, power = 1, directed = FALSE)
> plot(graph1)
> degree(graph1)
[1] 3 6 3 1 1 5 1 1 1 1 1 1 1 1 1

(Figure: plot of graph1 with vertex degrees)

  • Measure betweenness of points

This metric measures how much an individual node serves as a bridge between other groups or individuals. Generally, the higher the betweenness score, the more important the node is. As seen here, node 2 bridges the most of all the nodes.

> betweenness(graph1)
[1] 25 75 25 0 0 46 0 0 0 0 0 0 0 0 0

  • Network density

Network density is defined as the number of connections divided by the number of all possible connections. A completely connected network has density 1. In this graph, it is 0.133.

> density <- edge_density(graph1, loops = FALSE)
> density
[1] 0.1333333

  • Identify cliques 

A clique is a group of nodes that are all directly connected to one another (a fully connected subgraph). This is useful for identifying tightly knit groups in the graph. The graph above has no clique of interesting size, so I’ll create another one.

(Figure: plot of graph2)

> graph2 <- sample_gnp(20, 0.3)
> plot(graph2)
> cliques(graph2, min = 4)   # minimum of 4 members

[[1]]
+ 4/20 vertices, from da697b6:
[1] 1 3 19 20

[[2]]
+ 4/20 vertices, from da697b6:
[1] 6 13 17 19

[[3]]
+ 4/20 vertices, from da697b6:
[1] 6 13 16 17

[[4]]
+ 4/20 vertices, from da697b6:
[1] 2 3 16 17

[[5]]
+ 4/20 vertices, from da697b6:
[1] 3 11 15 16

[[6]]
+ 4/20 vertices, from da697b6:
[1] 3 11 14 16

[[7]]
+ 4/20 vertices, from da697b6:
[1] 3 11 16 17

[[8]]
+ 4/20 vertices, from da697b6:
[1] 5 7 13 15

  • Find components of a graph

In a graph, it is possible that some nodes are not connected to any other node, so the graph can have multiple components that are not interconnected. Here is how to identify the components. First, create a sparse graph:

(Figure: plot of comp_graph showing its components)

> comp_graph <- sample_gnp(30, 0.05, directed = FALSE, loops = FALSE)
> plot(comp_graph)
> components(comp_graph)

$membership
[1] 1 2 3 4 1 5 6 5 5 1 7 8 1 9 1 6 1 1 5 10 1 1 1 1 1 2 1 11
[29] 1 1

$csize
[1] 15 2 1 1 4 2 1 1 1 1 1

$no
[1] 11

So there are 11 components ($no); each node’s component assignment is in $membership, and the component sizes are in $csize.

  • Take a random walk on a graph

Some graphs represent processes or paths in which the active node changes over time. In a random walk, each outgoing edge is assigned an equal probability at each step. The walk proceeds from a starting node for a given number of steps and records which nodes are visited. Let’s look at the code (start at node 29, 8 steps):

> random_walk(comp_graph, 29, 8, stuck = "return")
+ 8/30 vertices, from 9a4a115:
[1] 29 13 29 1 5 1 29 13

The path is: 29, 13, 29, 1, 5, 1, 29, 13.



Illustration of the Hadoop ecosystem and the landscape of data pipelining possibilities

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs. Instead of using one large computer to process and store the data, Hadoop allows clustering commodity hardware together to analyze massive data sets in parallel. Apache Hadoop is the Hadoop distribution from the Apache community.

The Hadoop ecosystem has grown almost exponentially since its launch in 2006 (Figure 1). The number of libraries (shown in orange) grows every year, as do the commercial services (shown in black at the top). The most prominent family is the Apache streaming stack (Spark, Kafka, Storm, etc.), which has been widely adopted by Amazon AWS, Databricks, Cloudera and others.

Figure 1. Hadoop Ecosystem Growth (Ref. LinkedIn)

It is overwhelming to look at all the libraries for data pipelining, but it is important to know their optimal usage. The dependencies between these libraries are shown in Figure 2. But when should each library be used?

Figure 2. Hadoop stack (ref. http://blog.newtechways.com)

Amazon has created a great image to illustrate the space of data pipelining architectures (Figure 3). It is a little busy, but very helpful. The horizontal color bar shows the data temperature from ingest, and the vertical color bar shows the processing speed of each architecture. At the upper left, Apache Kafka (similar to Google’s Cloud Pub/Sub) and Amazon’s in-house Kinesis are the fastest options for the ingest pipeline. For real-time processing, Spark Streaming handles micro-batches, while Apache Storm processes one event at a time. ‘Hadoop’ itself is not shown here, since the core libraries are assumed to be part of every stack.

Going to the right, Amazon DynamoDB covers NoSQL workloads and Redshift the relational ones, sending data to Hive, which sits on top of Hadoop. Moving along the interactive row, Amazon S3 serves as a Hadoop storage alternative and saves the data into Redshift, with Presto for interactive applications and Hive for batch applications.

Figure 3. Data pipeline possibility illustration (Ref. Amazon)

For different big data applications, the architecture needed will differ depending on the size of the data and the application built around it. It is helpful to look at the use cases published by the cloud vendors.

Here is an example batch ETL architecture (Figure 4) and an event-driven batch analytics architecture on AWS (Figure 5), both from Amazon.

Figure 4. Batch ETL Architecture (Ref. Amazon)
Figure 5. Event Driven Batch Analytics Architecture (Ref. Amazon)


Pandas – ValueError: If using all scalar values, you must pass an index

For Python users, we all know that it is very convenient to create a data frame from a dictionary. For example:

df = pd.DataFrame({'Key': ['a', 'b', 'c', 'd'], 'Value': [1, 2, 3, 4]})

It works beautifully when the values are lists with multiple entries. However, you may run into the error “ValueError: If using all scalar values, you must pass an index” when you try to convert the following dictionary to a data frame.

dict_test = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

df = pd.DataFrame.from_dict(dict_test)   # raises the ValueError

Why is that?

When pandas creates a data frame from a dictionary, it expects each value to be a list or other array-like. If you give it scalars, you must also supply an index. In this example, the values are 'pig' instead of ['pig'].

How to fix it:

  1. Change the data so each value is a list:

dict_test = {
    'bacon': ['pig'],
    'pulled pork': ['pig'],
    'pastrami': ['cow'],
    'honey ham': ['pig'],
    'nova lox': ['salmon']
}

  2. Build the data frame from the dictionary’s items (wrap items() in list() on Python 3.x, where it returns a view):

pd.DataFrame(list(dict_test.items()), columns=['food', 'animal'])

  3. Specify the orientation with 'index', so the keys become the row index:

pd.DataFrame.from_dict(dict_test, orient='index')

  4. Use the Series constructor instead:

s = pd.Series(dict_test, name='animal')
s.index.name = 'Food'
df = pd.DataFrame(s)

A quick hack for creating a series of template feature names in Python

Let’s say that I want to create a bunch of feature names in the format ‘model_pow_1’, ‘model_pow_2’, ‘model_pow_3’ and so on up to the Nth variable. It’s very easy to do in R with just the paste and seq functions (e.g. paste0('model_pow_', seq(1, 5))). In Python, we can use a list comprehension to accomplish the same thing, but pay attention to the format tweak.

ind = ['model_pow_%d' % i for i in range(1, 10)]

Another way to do it is with string concatenation:

ind = ['model_pow_' + str(i) for i in range(1, 10)]
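On Python 3.6+, an f-string accomplishes the same thing with the format inline:

ind = [f'model_pow_{i}' for i in range(1, 10)]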


Numerical instability in deep learning with softmax

One of the most frequently used activation functions in the output layer of a multi-class classification neural network is softmax. Softmax is defined as f(x)_i = exp(x_i) / Σ_j exp(x_j); it returns a probability for each individual class, and the probabilities sum to one. For a two-class problem, sigmoid returns the same probability as softmax.

While translating softmax into program code, there are some little things to watch out for due to numerical instability: the exponentials can overflow for large inputs or underflow for small ones. Let’s look at two examples:


When the weights increase 1000×, the probabilities become useless: either 0 or 1, and NaN once exp() overflows.


The opposite happens when the weights vanish: as the weights approach zero, all classes end up sharing nearly the same probability.
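Here is a minimal NumPy sketch reproducing both failure modes (the example values are my own):

import numpy as np

def softmax_naive(x):
    # direct translation of the formula; unstable for extreme inputs
    e = np.exp(x)
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(softmax_naive(x))          # [0.09003057 0.24472847 0.66524096]
print(softmax_naive(x * 100))    # saturates to ~[0, 0, 1]
print(softmax_naive(x * 1000))   # exp() overflows to inf -> [nan nan nan]
print(softmax_naive(x / 1000))   # near-uniform: ~[0.333 0.333 0.334]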

However, there is an easy fix: modify the computation to softmax(x + c). The most common choice is c = -max(x), which makes every entry of the shifted vector non-positive. This rules out overflow, and the denominator cannot vanish because at least one element is exp(0) = 1. Underflow of some, but not all, entries is harmless.

Let’s see the impact.
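A minimal sketch of the stabilized version (continuing the example above):

import numpy as np

x = np.array([1.0, 2.0, 3.0])

def softmax_stable(x):
    # shift so the largest exponent is exp(0) = 1:
    # no overflow, and the denominator is at least 1
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

print(softmax_stable(x * 1000))   # [0. 0. 1.] -- finite, no NaN
print(softmax_stable(x / 1000))   # ~[0.333 0.333 0.334], unchanged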


As seen from the two examples, the stable version keeps the probabilities finite and meaningful, rescuing the weights that the naive version drove to overflow or underflow.