Social Network Analysis: where do you start?

Since the debut of Facebook, Twitter, Snapchat and WeChat, social networks have taken center stage for innovation and new business models, and have turned into a new medium that deeply impacts our lives. Some even claim they can change election results. Social network relationships are usually represented as a directed or undirected graph. Given such a network graph, what can we do with it and what value can we get from it?

Before diving into the things we can do, I’ll first go over some graph terminology using this illustrative example:

GraphExample

Vertex: The fundamental unit of a graph; in this example, the vertices are the people nodes (objects).

Edge (or arc): A connection between a pair of nodes. It can be either directed (an arc, with direction) or undirected.

Indegree, outdegree, hub and authority scores: These are measures of the centrality (“importance”) of a node in a network.

Path: A sequence of vertices such that from each vertex there is an edge to the next vertex in the sequence.

Now that we know the terminology for a network graph, what can we do with a given graph? Here are some starting points, with R code to illustrate each:

  • Measure connectedness of points

Connectedness measures how well vertices are connected to other vertices. The number of edges attached to a vertex is called the degree of that vertex.

Example: node 1 has degree 3, and node 2 has the highest degree, 6.

> library(igraph)   # only igraph is needed here; loading sna as well would mask igraph's degree() and betweenness()
> graph1 <- sample_pa(15, power = 1, directed = FALSE)
> plot(graph1)
> degree(graph1)
[1] 3 6 3 1 1 5 1 1 1 1 1 1 1 1 1

graph1_degree.png

  • Measure betweenness of points

This metric measures how much an individual node serves as a bridge between groups or individuals. Generally, the higher the betweenness score, the more important that individual is. As seen here, node 2 does the most bridging of all the nodes.

> betweenness(graph1)
[1] 25 75 25 0 0 46 0 0 0 0 0 0 0 0 0

  • Network density

Network density is defined as the number of edges divided by the number of possible edges, so a completely connected network has a density of 1. This graph has 14 edges out of 15 × 14 / 2 = 105 possible edges, giving a density of 14/105 ≈ 0.133.

> density <- edge_density(graph1, loops = FALSE)
> density
[1] 0.1333333

  • Identify cliques 

A clique is a group of nodes in which every pair of nodes is connected. Identifying cliques is useful for finding tightly interconnected groups in a graph. The example above has no sizeable clique, so I’ll create another graph.

clique.png

> graph2 <- sample_gnp(20,0.3)
> plot(graph2)
> cliques(graph2, min = 4)   # minimum of 4 members

[[1]]
+ 4/20 vertices, from da697b6:
[1] 1 3 19 20

[[2]]
+ 4/20 vertices, from da697b6:
[1] 6 13 17 19

[[3]]
+ 4/20 vertices, from da697b6:
[1] 6 13 16 17

[[4]]
+ 4/20 vertices, from da697b6:
[1] 2 3 16 17

[[5]]
+ 4/20 vertices, from da697b6:
[1] 3 11 15 16

[[6]]
+ 4/20 vertices, from da697b6:
[1] 3 11 14 16

[[7]]
+ 4/20 vertices, from da697b6:
[1] 3 11 16 17

[[8]]
+ 4/20 vertices, from da697b6:
[1] 5 7 13 15

  • Find components of a graph

In a graph, some nodes may not be connected to any other node, so the graph can consist of multiple components that are not interconnected. Here is how to identify the components. First, create a sparse graph.

components.png

> comp_graph <- sample_gnp(30, 0.05, directed = FALSE, loops = FALSE)
> plot(comp_graph)
> components(comp_graph)

$membership
[1] 1 2 3 4 1 5 6 5 5 1 7 8 1 9 1 6 1 1 5 10 1 1 1 1 1 2 1 11
[29] 1 1

$csize
[1] 15 2 1 1 4 2 1 1 1 1 1

$no
[1] 11

So $no shows there are 11 components, $membership gives the component each vertex belongs to, and $csize gives the size of each component.

  • Take a random walk on a graph

Some graphs represent processes or paths in which the active node changes over time. In a random walk, every edge is given an equal weight, and at each step the walk moves to a randomly chosen neighbor. The walk runs from beginning to end and records which nodes are visited. Let’s look at the code (start at node 29, 8 steps):

> random_walk(comp_graph, 29, 8, stuck = "return")
+ 8/30 vertices, from 9a4a115:
[1] 29 13 29 1 5 1 29 13

The path is: 29, 13, 29, 1, 5, 1, 29, 13.


Ref: Comparison of Translational Patterns in Two Nutrient-Disease Associations: Nutritional Research Series, Vol. 5.


Illustration of the Hadoop ecosystem and the landscape of data pipelining possibilities

Hadoop is an open-source software framework for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs. Instead of using one large computer to process and store the data, Hadoop clusters commodity hardware together to analyze massive data sets in parallel. Apache Hadoop is the Hadoop distribution from the Apache community.

The Hadoop ecosystem has grown almost exponentially since its launch in 2006 (Figure 1). The number of libraries (shown in orange) grows every year, as do the commercial services (shown in black at the top). The most prominent family comes from the Apache streaming and processing projects (Spark, Kafka, Storm, etc.), which have been widely adopted by Amazon AWS, Databricks, Cloudera and others.

Hadoop Ecosystem Growth
Figure 1. Hadoop Ecosystem Growth (Ref. LinkedIn)

It can be overwhelming to look at all the libraries available for data pipelining, but it is important to know where each one fits best. The dependencies among these libraries are shown in Figure 2. But when should you use each library?

HadoopStack
Figure 2. Hadoop stack (ref. http://blog.newtechways.com)

Amazon has created a great image to illustrate the possible data pipelining architectures (Figure 3). It is a little busy, but it is very helpful. The horizontal color bar shows the data temperature at ingest and the vertical color bar shows the processing speed of each architecture. At the upper left, Apache Kafka (similar to Google’s Cloud Pub/Sub) and Amazon’s in-house Kinesis are the fastest options for the ingest pipeline. For real-time processing, the choices are Spark Streaming (micro-batches) or Apache Storm (one event at a time). ‘Hadoop’ itself is not shown here, since the core libraries are assumed to be understood as part of Hadoop.

Going to the right, there is Amazon DynamoDB for NoSQL and Redshift for relational data, feeding data to Hive, which sits on top of Hadoop. Moving down to the interactive row, Amazon S3 serves as a Hadoop storage alternative and can load the data into Redshift; Presto, a distributed SQL engine, handles the interactive applications, while Hive handles the batch applications.

Data pipeline possibility illustration
Figure 3. Data pipeline possibility illustration (Ref. Amazon)

For different big data applications, the architecture needed will differ depending on the size of the data and the application being built around it. It is helpful to look at the use cases published by these cloud companies.

Here are an example batch ETL architecture (Figure 4) and an event-driven batch analytics architecture on AWS (Figure 5), both from Amazon.

Batch ETL Architecture
Figure 4. Batch ETL Architecture (Ref. Amazon)
Event Driven Batch Analytics
Figure 5. Event Driven Batch Analytics Architecture (Ref. Amazon)

 

Three different ways of initializing a deep neural network yield surprising results

When training a deep neural network, there are many parameters to initialize and then train through forward and backward propagation. We often spend a lot of time trying different activation functions, tuning the depth of the network, the number of units per layer and other hyperparameters, but we may forget the importance of how the weights and biases are initialized. In this article, I’ll compare three initialization methods (1. zero initialization, 2. random initialization, 3. He initialization) and show their corresponding impact.

In this example, we use a three-layer neural network with the following setup: Linear -> ReLU -> Linear -> ReLU -> Linear -> Sigmoid. The first two hidden layers are (linear + ReLU) and the last layer (L) is (linear + sigmoid), as illustrated in the figure below.

Deep Learning Neural Network: cascading the neurons (Layer 1, Layer 2, ...)

The dataset is created using the following code.

Screen Shot 2017-12-14 at 11.31.13 PM

The data looks like this:

download
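The screenshot above contains the original dataset code. As a rough stand-in, the following snippet generates a similar two-class, non-linearly-separable dataset with scikit-learn; the function choice and parameters (make_circles, 300 samples, noise 0.05) are my assumptions, not necessarily what the screenshot uses.

import numpy as np
import sklearn.datasets
import matplotlib.pyplot as plt

# Hypothetical stand-in dataset: two intertwined "circle" classes.
np.random.seed(1)
train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=0.05)
train_X = train_X.T                # shape (2, 300): features x examples
train_Y = train_Y.reshape(1, -1)   # shape (1, 300)

plt.scatter(train_X[0, :], train_X[1, :], c=train_Y.ravel(), s=40, cmap=plt.cm.Spectral)
plt.show()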

  1. Zero initialization

In this case, all the weights and biases are simply set to zero using np.zeros().

Screen Shot 2017-12-14 at 11.44.50 PM
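For reference, here is a minimal sketch of zero initialization (the function name and layer sizes are placeholders, not necessarily those in the screenshot):

import numpy as np

def initialize_parameters_zeros(layer_dims):
    # Set every weight matrix and bias vector to zero, so all neurons in a
    # layer start out identical (and stay identical through training).
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters['W' + str(l)] = np.zeros((layer_dims[l], layer_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters

parameters = initialize_parameters_zeros([2, 10, 5, 1])   # e.g. 2 inputs, two hidden layers, 1 output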

The plots below show that none of the points are correctly separated and the log-loss cost function stays flat, because all the neurons in a layer compute exactly the same thing.

2. Random initialization

In this case, the weights are randomly initialized and multiplied by a large factor of 10, with the biases set to zero. We can see the neural network starts to learn correctly.

Screen Shot 2017-12-14 at 11.45.03 PM
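A corresponding sketch of this large random initialization (again, the names and the seed are placeholders):

import numpy as np

def initialize_parameters_random(layer_dims, scale=10):
    # Gaussian random weights multiplied by a large factor (10 here); zero biases.
    np.random.seed(3)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * scale
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters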

3. He initialization

Last, we’ll see how He initialization works. He et al. (2015) proposed scaling the random weights of layer l by sqrt(2 / layer_dims[l-1]). We can see that this separates the two classes very well.

Screen Shot 2017-12-14 at 11.45.17 PM
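A sketch of He initialization along the same lines (placeholder names; the only change from the random version is the scaling factor):

import numpy as np

def initialize_parameters_he(layer_dims):
    # Gaussian random weights scaled by sqrt(2 / fan_in), as proposed in He et al., 2015.
    np.random.seed(3)
    parameters = {}
    for l in range(1, len(layer_dims)):
        parameters['W' + str(l)] = (np.random.randn(layer_dims[l], layer_dims[l - 1])
                                    * np.sqrt(2.0 / layer_dims[l - 1]))
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
    return parameters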

As we can see, initialization is very important when training a deep neural network. It must break the symmetry between the neurons in the same layer, and proper scaling can make training much faster.

The forward propagation and backward propagation are shown below:

Screen Shot 2017-12-14 at 11.36.09 PMScreen Shot 2017-12-14 at 11.36.19 PM
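As a companion to the screenshots, here is a hedged sketch of the forward pass for this Linear -> ReLU -> Linear -> ReLU -> Linear -> Sigmoid network (the helper names and parameter-dictionary layout are assumptions):

import numpy as np

def relu(Z):
    return np.maximum(0, Z)

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def forward_propagation(X, parameters):
    # X has shape (n_features, n_examples); parameters holds W1..W3 and b1..b3.
    A1 = relu(np.dot(parameters['W1'], X) + parameters['b1'])
    A2 = relu(np.dot(parameters['W2'], A1) + parameters['b2'])
    A3 = sigmoid(np.dot(parameters['W3'], A2) + parameters['b3'])   # predicted probabilities
    return A3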

Ref: deeplearning.ai

Reconstructing a unix timestamp to readable date in Python and Java

We live in a world of multiple dimensions, at least four. The most critical dimension is time, and it is recorded alongside nearly every digital dataset. One of the most common but least readable ways to record time is the Unix timestamp. It shows a date and time as something like ‘1513219212’ (ISO 8601: 2017-12-14T02:40:12Z), and at first glance you might have no idea what it means.

First, what is a Unix timestamp? It is the time in seconds from January 1st, 1970 to the very moment you ask for the stamp itself. Simply put, the Unix timestamp is a way to track time as a running total of seconds. The count starts at the Unix Epoch on January 1st, 1970 at UTC, so a Unix timestamp is merely the number of seconds between a particular date and the Unix Epoch. One reason Unix timestamps are popular with webmasters is that they can represent all time zones at once. (Wikipedia)

Almost every programming language has a library to handle this.

In Python, you can do the following:

import datetime

datetime.datetime.fromtimestamp(int('1513219212')).strftime("%Y-%m-%d %H:%M:%S")

# It will print out '2017-12-13 21:40:12' (local time)

The above method gives you a locale-dependent time and is error prone, so the following method is better:

datetime.datetime.utcfromtimestamp(int('1513219212')).strftime('%Y-%m-%dT%H:%M:%SZ')

# This will print out '2017-12-14T02:40:12Z'

‘Z’ stands for Zulu time, which is also GMT and UTC.

You can also use ‘ctime’ from the time module to get a human-readable string from a Unix timestamp.
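For example (the exact output depends on your local timezone; the result shown assumes US Eastern):

import time

print(time.ctime(1513219212))
# Wed Dec 13 21:40:12 2017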

Here is how to do it in Java:

TimestampToDateJava
Timestamp to Date in Java

The output will be: The date is: Wed Dec 13 21:40:12 EST 2017.

It is important to note the difference here: Java expects milliseconds, so the seconds value has to be cast to long before converting, otherwise multiplying by 1000 will overflow an int.

If you want to set the timezone in Java, you can simply add this line:

TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

It will return: The date is: Thu Dec 14 02:40:12 UTC 2017.

Dealing with legacy code contains ‘xrange’ in Python 2.7

Python 3.x has been around since 2008, but 2.7.x is still around and continues to be used in current development. In machine learning code, one of the most heavily used functions in loops is ‘xrange’, but ‘xrange’ has been replaced by ‘range’ in 3.x. Here is a good practice for writing code that is compatible with both Python 2 and 3.x.

try:
    xrange
except NameError:
    xrange = range

For Python 2.7 die-hard fans switching to 3.x, you can define ‘xrange’ as follows:

def xrange(x):
    return iter(range(x))

Cheers.

Understand blockchain with Simply Python Code

Everybody knows about Bitcoin now, but not everyone knows how the underlying blockchain technology works. A blockchain is like a distributed ledger: a consensus of replicated, shared and synchronized digital data spread geographically across multiple sites, with no centralized data storage. This is different from both centralized and decentralized storage. See the illustration here:

blockchain

In other words, a blockchain is public data storage in which each new piece of data is stored in a ‘block’ container and appended to an immutable chain of past data. For Bitcoin and other coins, that data is a series of transaction records, but the data stored can be anything. Blockchain technology is considered more secure and tamper-resistant because the computational resources required to rewrite the chain are enormous.

Here, I’ll show a simple Python code to demonstrate how blockchain works:

The code structure is like this shown in Eclipse:

Screen Shot 2017-12-10 at 12.34.35 PM

The Python code is shown below:

Screen Shot 2017-12-10 at 12.35.57 PM
Screen Shot 2017-12-10 at 12.35.40 PM
Screen Shot 2017-12-10 at 12.35.49 PM

Let’s look at the blockchain created:

Screen Shot 2017-12-10 at 12.39.43 PM.png

As seen in the code, each block contains the hash of the previous block, and this makes it hard to modify the blockchain. In practice, there are additional restrictions that make each new block harder to generate. For example, you can require every new block’s hash to start with n leading zeros; the more leading zeros required, the harder it is to generate a valid block. And because the ledger is distributed, a new block also has to be accepted as valid by a majority (at least 51%) of the participants.
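To make the idea concrete, here is a minimal, self-contained sketch of a chain of hashed blocks. The class and function names are my own, not necessarily those in the screenshots, and there is no proof-of-work or networking here.

import hashlib
import time

class Block(object):
    def __init__(self, index, timestamp, data, previous_hash):
        self.index = index
        self.timestamp = timestamp
        self.data = data
        self.previous_hash = previous_hash
        self.hash = self.hash_block()

    def hash_block(self):
        # The hash covers the block's contents plus the previous block's hash,
        # which is what links the blocks into a tamper-evident chain.
        record = str(self.index) + str(self.timestamp) + str(self.data) + str(self.previous_hash)
        return hashlib.sha256(record.encode()).hexdigest()

def next_block(last_block, data):
    return Block(last_block.index + 1, time.time(), data, last_block.hash)

# Build a tiny chain: a genesis block followed by a few data blocks.
chain = [Block(0, time.time(), "Genesis Block", "0")]
for i in range(1, 4):
    chain.append(next_block(chain[-1], "Transaction data #%d" % i))

for block in chain:
    print(block.index, block.hash, "prev:", block.previous_hash)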


How to clear all in python Spyder workspace

While doing data analysis, we sometimes want to clear everything in the current workspace to get a fresh environment, similar to MATLAB’s ‘clear all’ command. Here is what such a function looks like (clear_all.py):

 
def clear_all():
    """Clears all the variables from the workspace of the Spyder application."""
    gl = globals().copy()
    for var in gl:
        if var[0] == '_': continue                     # skip private/underscore names
        if 'func' in str(globals()[var]): continue     # keep functions
        if 'module' in str(globals()[var]): continue   # keep imported modules

        del globals()[var]                             # delete everything else

if __name__ == "__main__":
    clear_all()