## Best practice machine learning pipeline and flowchart

Here are two top references from previous Kaggle Masters:

## Lorenz Attractor: A demo for butterfly effect and super computational efficiency of implementing C code in R

The Lorenz attractor(by Cliffor Alan Pickover) is an attractor that arises in a simplified system of equations describing the two-dimensional flow of fluid of uniform depth, with an imposed temperature difference, under gravity, with buoyancy, thermal diffusivity, and kinematic viscosity. The full equation are:

where ψ is a stream function, defined such that the velocity component U=(u,w)U=(u,w)

In the early 1960s, Lorenz accidentally discovered the chaotic behavior of this system. One of his chaotic attractors is defined:

grew for Rayleigh numbers larger than the critical value,. Furthermore, vastly different results were obtained for very small changes in the initial values, representing one of the earliest discoveries of the so-called butterfly effect.

In the iterative calculation, the n+1 th position depends on n_th position and the four parameters (a,b,c,d). Let’s do a simulation of 10 million iterations with (x0,y0)=(0,0) and a = -1.24, b=-1.25, c=-1.81, d=-1.91.

R Code:

“{r lorenzor}
require(Rcpp)
require(ggplot2)
require(dplyr)
#define the theme
my.theme = theme(legend.position = ‘none’,
panel.background = element_rect(fill=’black’),
axis.ticks = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()

)

# define cpp function
cppFunction(‘DataFrame createTrajectory(int n, double x0, double y0,double a, double b, double c, double d){
// create the columns
NumericVector x(n);
NumericVector y(n);
x[0]=x0;
y[0]=y0;
for(int i=1; i<n; ++i){
x[i]=sin(a*y[i-1]) + c*cos(a*x[i-1]);
y[i]=sin(b*x[i-1]) + d*cos(b*y[i-1]);
}
// return a data frame
return DataFrame::create(_[“x”]=x,_[“y”]=y);
}’)

createTrajectoryR <- function(n,x0,y0,a,b,c,d){

#implementation with R

x = rep(0,n+1)
y = rep(0,n+1)
x[1] = x0
y[1] = y0
for (i in seq(2,n+1)){
x[i] <- sin(a*y[i-1])+ c*cos(a*x[i-1])
y[i] <- sin(b*x[i-1]) +d*cos(d*y[i-1])
}

return(data.frame(x=x,y=y))
}

#Initial parameters for dynamic system

a = -1.24
b = -1.25
c = 1.81
d = 1.91

system.time(df_C <- createTrajectory(10000000,0,0,a,b,c,d))

system.time(df_R <- createTrajectoryR(10000000,0,0,a,b,c,d))

#png(“./lorenzor_attractor.png”,units =’px’,width = 1600, height = 1600, res = 300)
# plot results from c
ggplot(df_C, aes(x,y)) + geom_point(color=’white’,shape=46,alpha=0.1) + my.theme
# plot results from R
ggplot(df_R, aes(x,y)) + geom_point(color=’white’,shape=46,alpha=0.1) + my.theme
#dev.off()
“

End of R code

Runtime comparison

The R runtime is more than 5 times of C code.

How does Lorenz Attractor System look like?

Butterfly effect:

By slightly changing the parameters, you’ll get a vastly different solution.

## Kernel Density Estimation–Optimal bandwidth

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable which is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. (Wikipedia).

Let (x1,x2,…, Xn) be a uni-variate independent and identically distributed sample drawn from some distribution with an unknown density f.  We’re interested in estimating the shape of this function f. The kernel density estimator is defined as:

where K is the kernel and h >0 is a smoothing parameter called the bandwidth. The most common kernel are:: uniform, triangular, biweight, triweight, Epanechnikov, normal and others.

Intuitively, we want to choose bandwidth (h) as small as the data will allow. However, there is always a trade-off between the bias of the estimator and its variance.

The most common optimality criterion used to select the bandwidth is the mean integrated squared error:

A rule-of-thumb bandwidth estimator.

is used to approximate univariate data and the underlying density being Gaussian. However, this can yield inaccurate estimates when density is not close to being normal.

, a.k.a Freedman-Diaconis rule, is a practical way to get the optimal binwidth for histogram.

Another better estimator is the so-called: solve-the-equation bandwidth. (Botev, Z.I.; Grotowski, J.F.; Kroese, D.P. (2010). “Kernel density estimation via diffusion”. Annals of Statistics. 38 (5): 2916–2957. doi:10.1214/10-AOS799.)

In R, the function recommended to get the optimal bandwidth is MASS::bandwidth.nrd(), dpih().

Another interesting thing about KDE is that: it can be shown that both kNN (k nearest neighbor) and KDE converge to the true probability density as 𝑁 → ∞, provided that 𝑉 (volume for each bin) shrinks with 𝑁, and that 𝑘 (number of data falls in each bin) grows with 𝑁 appropriately.

ref: https://en.wikipedia.org/wiki/Kernel_density_estimation#cite_note-bo10-    https://en.wikipedia.org/wiki/Probability_density_function  https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule

## How to configure Aliyun DNS for Aliyun enterprise mail server

For people living in the United States, we’ve all heard about Amazon EC2 which is the king of cloud computing. While outside of USA, Alibaba’s Aliyun cloud service is gaining tremendous ground especially in China. Given the fact that Alibaba is handling the largest e-commerce transactions, its infrastructure and technology are at least on par with its counterparts around the world. In 2016, Alibaba smashes world’s online transaction record by 175,000 per second.  For business or services that are oriented toward China’s market, it makes great sense to set up a server in China for faster access with minimum transocean data transmission.

Recently, I have started to use Aliyun’s Elastic Cloud Instance and the experience has been great. The price is also cheaper than EC2 which is important for small business or amateur users. After finishing the front-end development of my machine-learning based ocean data platform, I realized that a mail server is needed for a registered user to retrieve the password and also for me to send batch messages. The first hurdle for this is to configure the DNS for the mail service from Aliyun. After some research on the internet, mostly in the Chinese language, I have successfully configured the DNS for it.

I think it will be helpful to share the basic step for setting up this service for whoever wants to show their presence in China. The pre-requisite is to register an Aliyun ECS service.

Here are the simple steps:

1. Register an Aliyun enterprise mailbox from: https://wanwang.aliyun.com/mail/freemail/?spm=5176.8071678.875975.topareabtn0.1bcf708bW5hGG5    You can choose free mail service with maximum 50 accounts, 5 GB space.
2. After register the mail service, log onto your Aliyun console
3. From Home –> Direct Mail –> Email Settings –> Email domains , Click ‘New Domain’ on the upper right. Add new domain with something like: mail.google.com (assume google.com is your domain)
4. Then you’ll have to verify that you own that domain.
5. Here is the most important part, you’ll need to finish DNS configuration before verifying. Click ‘Configure’ from the email domain you just created at Step 4.
6. Keep the above page open, we’re going to set up the DNS. Go to Aliyun control console, home –> Domain and Websites –> Alibaba Cloud DNS –> Click ‘全部域名’ (All domains) –> ‘解析‘ (configure), this will lead to DNS Settings.
7. Click ‘Add Record’, and we’re going to add four Record using the information from Step 5.
8.  After adding two TXT record, one MX record and one CNAME record, you have done configured DNS.
9. Now go back to the mail domain we created in Step 3, Click ‘verify’. It will probably take up to 2 minutes for you to be able to verify.
10. After verification, you can set up the ‘Sender Addresses’ and SMTP password etc, by going back to the Email Domain from Home.

## Delete thousands of spam emails without subject in gmail

Machine learning has been successfully used to automatically detect and flag spam email in Gmail and other services, but it still fails to do so in many cases. One biggest missing feature in Gmail is that it allows spam without subject lines to be kept in the inbox. This is very annoying since these spam are all automatically generated with unique email addresses. It is difficult to create universal filters using sender’s email address and not feasible to delete it manually. This happened to my Gmail couple of weeks ago when it is flooded with spams.

Finally, I found a way to automatically remove all the junks using Google Lab’s filter scripts.

First, create a new filter and ‘export’ it to a XML file.

Then Edit the file and update the section enclosed with entry.

The XML script to do the trick is:

<entry><entry> <category term=’filter’></category> <title>Mail Filter</title> <id>tag:mail.google.com,2008:filter:1434203171999</id> <updated>2017-09-30T14:47:33Z</updated> <content></content>    <apps:property name=’subject’ value=”/> <apps:property name=’hasAttachment’ value=’true’/>    <apps:property name=’shouldTrash’ value=’true’/> <apps:property name=’sizeOperator’ value=’s_sl’/> <apps:property name=’sizeUnit’ value=’s_smb’/> </entry>

Then, import the xml file. Now the script will do the job for you.

The <id></id> tag should be different than this sample.

## Common challenges while aggregating data with multiple group IDs and functions in R

While analyzing a dataset, one of the most common tasks will be looking at the data features in an aggregated way. For example, aggregate the dataset by its year, month, day, or IDs, etc. Then you might want to look at the aggregated effects using the aggregate functions, not only one but multiple (say Min, Max, count etc).

There are a couple of ways to do it in R:

• Aggregate each function separately and merge them.

agg.sum <- aggregate(. ~ id1 + id2, data = x, FUN = sum)

agg.min <- aggregate(. ~id1 + id2, data = x, FUN = min)

merge(agg.sum, agg.min, by  = c(“id1”, “id2”)

• Aggregate all at a once using ‘dplyr’

# inclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(var1, var2))

# exclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(-var1, -var2))

These are very handy for quick analysis, especially for people prefer simpler coding.