Machine Learning Algorithms in one map

There are now so many machine learning algorithms that it is hard to remember them all. I came across a handy reference map that organizes the most popular ML algorithms into a network diagram. It makes a very useful reference.

Machine Learning Algorithms


Delete thousands of spam emails without a subject in Gmail

Machine learning has been used successfully to detect and flag spam automatically in Gmail and other services, but it still fails in many cases. One of the biggest gaps in Gmail is that spam without a subject line is allowed to stay in the inbox. This is very annoying because these messages are all automatically generated from unique email addresses, so it is difficult to create a universal filter on the sender’s address and not feasible to delete them manually. This happened to my Gmail account a couple of weeks ago, when it was flooded with spam.


Finally, I found a way to remove all the junk automatically using Google Labs’ filter scripts.

First, create a new filter and ‘export’ it to an XML file.


Then edit the file and update the section enclosed in the <entry> tags.


The XML script to do the trick is:

<entry>
  <category term='filter'></category>
  <title>Mail Filter</title>
  <id>,2008:filter:1434203171999</id>
  <updated>2017-09-30T14:47:33Z</updated>
  <content></content>
  <apps:property name='subject' value=''/>
  <apps:property name='hasAttachment' value='true'/>
  <apps:property name='shouldTrash' value='true'/>
  <apps:property name='sizeOperator' value='s_sl'/>
  <apps:property name='sizeUnit' value='s_smb'/>
</entry>

Then import the XML file, and the filter will do the job for you.

Note that the value inside your <id></id> tag will be different from the one in this sample.


Common challenges while aggregating data with multiple group IDs and functions in R

While analyzing a dataset, one of the most common tasks is to look at its features in aggregate, for example by year, month, day, or ID. You often want to apply not just one aggregate function but several (say min, max, and count).

There are a couple of ways to do it in R:

  • Aggregate each function separately and merge them.

agg.sum <- aggregate(. ~ id1 + id2, data = x, FUN = sum)

agg.min <- aggregate(. ~ id1 + id2, data = x, FUN = min)

merge(agg.sum, agg.min, by = c("id1", "id2"))

  • Aggregate all at once using ‘dplyr’

# inclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(var1, var2))

# exclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(-var1, -var2))

These are very handy for quick analysis, especially for people who prefer simpler code.
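Both approaches can be tried side by side on a small data frame. The sketch below uses made-up data and column names (`id1`, `id2`, `var1`, `var2` are illustrative only). Because `funs()` has been deprecated since dplyr 0.8.0, the dplyr half uses `across()` instead, which assumes dplyr 1.0.0 or later; treat it as one possible modern equivalent rather than the only way to write it.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and .groups

# Hypothetical toy data; the column names are illustrative only.
df <- data.frame(
  id1  = c("a", "a", "b", "b"),
  id2  = c(1, 1, 2, 2),
  var1 = c(1, 3, 5, 7),
  var2 = c(2, 4, 6, 8)
)

# Approach 1: one aggregate() call per function, then merge on the group IDs.
agg.sum <- aggregate(. ~ id1 + id2, data = df, FUN = sum)
agg.min <- aggregate(. ~ id1 + id2, data = df, FUN = min)
# suffixes keeps the sum and min columns apart after the merge.
merged <- merge(agg.sum, agg.min, by = c("id1", "id2"),
                suffixes = c(".sum", ".min"))

# Approach 2: apply several functions in a single grouped summarise().
summary_tbl <- df %>%
  group_by(id1, id2) %>%
  summarise(
    across(c(var1, var2), list(sum = sum, min = min)),
    n = n(),
    .groups = "drop"
  )

print(merged)
print(summary_tbl)
```

The merged result names its columns `var1.sum`, `var1.min`, and so on, while `across()` produces `var1_sum`, `var1_min`, plus the row count `n`.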

Cast multiple value.var columns simultaneously for reshaping data from long to wide in R

In R, reshaping a data frame from wide format to long format is relatively easier than the opposite, especially when you want to cast to wide format with multiple value.var columns in dcast. Let’s look at an example dataset:


From v1.9.6 of data.table, we can cast multiple value.var by this syntax:

testWide <- dcast(setDT(test), formula = time ~ country, value.var = c("feature1", "feature2"))

All you need to do is wrap the data frame in ‘setDT’ and pass a vector of column names as value.var.
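Here is a minimal runnable sketch of that call. The `test` data frame below is invented for illustration (the values and country codes are assumptions, not the original example dataset), and it assumes data.table 1.9.6 or later is installed.

```r
library(data.table)

# Hypothetical long-format data: one row per (time, country) pair.
test <- data.frame(
  time     = c(1, 1, 2, 2),
  country  = c("US", "CN", "US", "CN"),
  feature1 = c(10, 20, 30, 40),
  feature2 = c(0.1, 0.2, 0.3, 0.4)
)

# setDT() converts the data frame to a data.table by reference, so that
# data.table's dcast method (which accepts several value.var columns) is used.
testWide <- dcast(setDT(test), formula = time ~ country,
                  value.var = c("feature1", "feature2"))

print(testWide)
```

Each value.var column is spread across the countries, giving one row per `time` and columns named like `feature1_US` and `feature2_CN`.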


Adding an existing folder to GitHub on Mac

  1. Create a new repository on GitHub. Here is the important part: don’t initialize the new repository with README, license or gitignore files.
  2. Open Terminal
  3. cd [project folder-root]
  4. Initialize the local directory as a Git repository. $ git init
  5. Add the files to your new repository. This stages them for the first commit. $ git add .
  6. Commit the staged files.  $ git commit -m 'message'
  7. Copy the repository URL from its GitHub web page, something like;.hello_world.git
  8. Add the URL as the remote repository. $ git remote add origin URL_just_copied
  9. Check the remote. $ git remote -v
  10. Push the changes to GitHub. $ git push -u origin master

Now you have successfully pushed an existing project to GitHub. For any new changes after this step, simply do the following to update the repository.

  1. git add .
  2. git commit -m 'new changes'
  3. git push

Some of the most frequently used commands:

git status  – check git status

git log  – show the commit log for the branch

git pull   – pull the updates from GitHub to local

git config --global credential.helper "cache --timeout=3600"     – cache the password in memory for one hour (3600 seconds)

git config credential.helper store  – store it permanently in a clear text file (.git-credentials)

git config --unset credential.helper  – reset the credential helper so Git asks each time

Are we ready for the Aug 21, 2017 solar eclipse?

The 2017 total solar eclipse is fast approaching, and hordes of sky gazers are scrambling to find a spot where they can see the shadow of the moon completely obscure the sun for a few moments on Aug. 21. Here is an illustration of the science behind it:


Image Credit: Rick Fienberg, TravelQuest International, and Wilderness Travel

Who can see it?


Image Credit: NASA’s Scientific Visualization Studio

For those living in the United States, this animated GIF shows when you should look out for this rare event. For people who live in the D.C. area, the prime time is 2:40 PM local time.

ecl2017 has made a very cool web widget to check the prime time for a given zip code. Check here.

Be sure to wear certified eclipse glasses to protect your eyes; ordinary sunglasses are not safe for looking at the sun.


How to direct system output to a variable in R

For people familiar with the Linux/Unix/Mac command line, we all know that there are many system commands that can save our day. One of the most frequently encountered problems is getting the number of lines or words in a large file. Here I’m talking about tens of millions of records and above.

There are many ways to do it. The easiest is to use ‘readLines’ to read all the lines and count them, but this is impossible if the file does not fit in memory. On a Linux platform, you can easily do it by calling ‘wc -l filename.txt’.

In the R environment, you can execute any system command by calling ‘system()’. In this example, system("wc -l filename.txt") shows the number of lines. Here is the original question: how do I assign the output to a variable?

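It won’t work if you just do the following, because system() then returns the command’s exit status rather than its output:

varName <- system("wc -l filename.txt")

But here is the trick: set intern = TRUE so that the output is captured as a character vector.

varName <- system("wc -l filename.txt", intern = TRUE)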
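The captured value is still text (the count followed by the file name), so it needs a little parsing before it can be used as a number. A minimal sketch, assuming a Unix-like system where `wc` is available; the temporary file and the name `n_lines` are made up for illustration:

```r
# Write a small throwaway file so the example is self-contained.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first", "second", "third"), tmp)

# intern = TRUE returns the command's output as a character vector,
# here a single string holding the count followed by the file path.
out <- system(paste("wc -l", shQuote(tmp)), intern = TRUE)

# Some platforms pad the count with leading spaces, so trim first,
# then take the first whitespace-separated field and convert it.
n_lines <- as.integer(strsplit(trimws(out), "\\s+")[[1]][1])

print(n_lines)
unlink(tmp)
```

`shQuote()` is used so that file paths containing spaces are passed to the shell safely.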


For more information on the most frequently used Linux commands, refer to 50 Most Commonly Used Linux Command with Example.