Convert a columnar URL text file to an HTML bookmark file for import

When you have a list of URLs stored in a columnar text file and want to import them into a browser like Chrome, adding them one by one is painful. So I created an R script to automate the process by converting a CSV (or any separator-delimited file) into an HTML file suitable for bookmark import.

Here is the code:


add_link <- function(name, urllink){
  current_time <- as.integer(as.POSIXct(Sys.time()))
  modified_time <- as.integer(as.POSIXct(Sys.time()))
  tag_list <- c('<DT><A HREF="', trimws(as.character(urllink)),
                '" ADD_DATE="', current_time,
                '" LAST_MODIFIED="', modified_time, '">',
                trimws(as.character(name)), '</A>')
  return(paste(tag_list, sep='', collapse=''))
}

add_bookmark_folder <- function(foldername){
  current_time <- as.integer(as.POSIXct(Sys.time()))
  tag_list <- c('<DT><H3 ADD_DATE="', current_time, '">',
                trimws(as.character(foldername)), '</H3>')
  return(paste(tag_list, sep='', collapse=''))
}

## change the following two lines for your use.

bookmark_textfile <- "bookmark.txt"      # input file
fileOutput <- "bookmark_import.html"     # output file

bookmark_text <- read.table(bookmark_textfile, sep = '|')
colnames(bookmark_text) <- c("name", "url")

header_string <- paste('<!DOCTYPE NETSCAPE-Bookmark-file-1>',
                       '<!-- This is an automatically generated file.',
                       '     It will be read and overwritten. -->',
                       '<TITLE>Bookmarks</TITLE>',
                       '<H1>Bookmarks</H1>',
                       '<DL>', add_bookmark_folder('DSAT'),
                       '<DL>', sep='\n', collapse='')

end_string <- '</DL>\n</DL>'

# write strings to html file

write(header_string, fileOutput)
# DSAT folder: all rows except the last
for (i in 1:(dim(bookmark_text)[1] - 1)){
  name <- bookmark_text$name[i]
  urllink <- bookmark_text$url[i]
  write(add_link(name, urllink), fileOutput, append=TRUE)
}
write('</DL>', fileOutput, append=TRUE)

# DEV folder: the remaining rows
dev_folder <- paste(c(add_bookmark_folder('DEV'), '<DL>'), sep = '\n', collapse = '\n')
write(dev_folder, fileOutput, append=TRUE)
devIndex <- dim(bookmark_text)[1]
for (i in devIndex:dim(bookmark_text)[1]){
  name <- bookmark_text$name[i]
  urllink <- bookmark_text$url[i]
  write(add_link(name, urllink), fileOutput, append=TRUE)
}

write(end_string, fileOutput, append=TRUE)

print("bookmark import file successfully completed.")
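For comparison, the same conversion can be sketched in Python using only the standard library. The file name `bookmark.txt`, the pipe separator, and the folder name are the same assumptions as in the R script above; the exact bookmark attributes browsers accept may vary.

```python
import time

def add_link(name, urllink):
    """Build one <DT><A> bookmark entry."""
    ts = int(time.time())
    return '<DT><A HREF="%s" ADD_DATE="%d">%s</A>' % (urllink.strip(), ts, name.strip())

def add_bookmark_folder(foldername):
    """Build one <DT><H3> folder header."""
    ts = int(time.time())
    return '<DT><H3 ADD_DATE="%d">%s</H3>' % (ts, foldername.strip())

def convert(in_path, out_path, folder="DSAT"):
    # read pipe-separated "name|url" rows, skipping blank lines
    with open(in_path) as f:
        rows = [line.strip().split("|") for line in f if line.strip()]
    lines = ["<!DOCTYPE NETSCAPE-Bookmark-file-1>", "<DL>",
             add_bookmark_folder(folder), "<DL>"]
    lines += [add_link(name, url) for name, url in rows]
    lines += ["</DL>", "</DL>"]
    with open(out_path, "w") as f:
        f.write("\n".join(lines))
```

Calling `convert("bookmark.txt", "bookmark_import.html")` then produces a file Chrome's bookmark import dialog will accept.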


Implementing R's paste function in Python

The paste function is very handy when it comes to concatenating vectors/strings. A very common use case is creating SQL queries in R:

vec <- letters[1:5]

paste("SELECT * FROM db WHERE col IN ('",paste(vec,collapse = "','"),"')",sep = "")

[1] "SELECT * FROM db WHERE col IN ('a','b','c','d','e')"

How do we create a similar function in Python using map and reduce?

import functools

def reduce_concat(x, sep=""):
    return functools.reduce(lambda a, b: str(a) + sep + str(b), x)

def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return reduce_concat(result, sep=collapse)
    return list(result)

print(paste([1,2,3], [11,12,13], sep=','))
print(paste([1,2,3], [11,12,13], sep=',', collapse=";"))

# ['1,11', '2,12', '3,13']
# '1,11;2,12;3,13'
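For the SQL use case in particular, idiomatic Python gets the same result with `str.join`, without the general-purpose paste helper:

```python
vec = ['a', 'b', 'c', 'd', 'e']
# join the values with "','" and wrap the whole list in quotes
query = "SELECT * FROM db WHERE col IN ('" + "','".join(vec) + "')"
print(query)
# SELECT * FROM db WHERE col IN ('a','b','c','d','e')
```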


Two Common Catches in R Programming

Sometimes the script you wrote gives you a big surprise due to subtle differences between commands. Here are two common, hard-to-catch traps in R programming.

1. which vs. %in% when subsetting a dataset


df <- data.frame(a = runif(5), d = runif(5), animals = c('dog','cat','snake','lion','rat'), z = 1:5)
results1 <- df[, -which(names(df) %in% c("a","d"))]  # works as expected
# how about this one
results2 <- df[, -which(names(df) %in% c("b","c"))]  # surprise! all columns are gone: which() returns integer(0), and df[, integer(0)] selects zero columns


A safer approach is to negate the logical vector from %in% directly:

df <- data.frame(a = runif(5), d = runif(5), animals = c('dog','cat','snake','lion','rat'), z = 1:5)
results1 <- df[, !names(df) %in% c("a","d")]  # works as expected
# how about this one
results2 <- df[, !names(df) %in% c("b","c")]  # returns the un-altered data.frame

Another fast way to drop columns is to assign them NULL:

dropVec <- c('a','d')
df[dropVec] <- list(NULL)
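As a rough Python analogue (assuming the data frame is represented as a plain dict of column lists, which is an illustration, not a pandas API), filtering columns by name naturally ignores names that are absent, just like the `!names(df) %in% ...` form:

```python
df = {'a': [0.1, 0.2], 'd': [0.3, 0.4],
      'animals': ['dog', 'cat'], 'z': [1, 2]}

def drop_cols(df, names):
    # keep every column whose name is not in the drop list;
    # unknown names are silently ignored, so nothing surprising happens
    return {k: v for k, v in df.items() if k not in names}

print(sorted(drop_cols(df, ['a', 'd'])))   # ['animals', 'z']
print(sorted(drop_cols(df, ['b', 'c'])))   # ['a', 'animals', 'd', 'z']
```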

2. Missing parentheses ()

Look at the following example. You would expect it to print 1 through 9, right? Instead, it prints 0 through 9, because 1:n-1 is parsed as (1:n) - 1.

n <- 10
for (i in 1:n-1) {
  print(i)
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
n <- 10
for (i in 1:(n-1)){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9


Common challenges while aggregating data with multiple group IDs and functions in R

While analyzing a dataset, one of the most common tasks is looking at the data features in an aggregated way: for example, aggregating the dataset by year, month, day, or IDs. Then you might want to look at the aggregated effects using aggregate functions, not just one but several (say min, max, count, etc.).

There are a couple of ways to do it in R:

  • Aggregate each function separately and merge them.

agg.sum <- aggregate(. ~ id1 + id2, data = x, FUN = sum)

agg.min <- aggregate(. ~ id1 + id2, data = x, FUN = min)

merge(agg.sum, agg.min, by = c("id1", "id2"))

  • Aggregate all at once using ‘dplyr’

# inclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(var1, var2))

# exclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(-var1, -var2))

These are very handy for quick analysis, especially for people who prefer simpler code.
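The same multi-function aggregation can be sketched in plain Python with itertools.groupby. The column names id1, id2, and var1 are assumptions mirroring the dplyr snippet above, and the tiny inline dataset is only for illustration:

```python
from itertools import groupby

rows = [
    {'id1': 'x', 'id2': 1, 'var1': 10},
    {'id1': 'x', 'id2': 1, 'var1': 20},
    {'id1': 'y', 'id2': 2, 'var1': 5},
]

key = lambda r: (r['id1'], r['id2'])
agg = []
# groupby requires the rows to be sorted by the grouping key
for k, grp in groupby(sorted(rows, key=key), key=key):
    vals = [r['var1'] for r in grp]
    # apply several aggregate functions at once, like summarise_at(.funs = ...)
    agg.append({'id1': k[0], 'id2': k[1],
                'mean': sum(vals) / len(vals),
                'min': min(vals), 'n': len(vals)})

print(agg[0])  # {'id1': 'x', 'id2': 1, 'mean': 15.0, 'min': 10, 'n': 2}
```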

Cast multiple value.var columns simultaneously when reshaping data from long to wide in R

While working with R, reshaping a dataframe from wide format to long format is relatively easier than the opposite, especially when you want to reshape to a wide format with multiple value.var columns in dcast. Let’s look at an example dataset:


From v1.9.6 of data.table, we can cast multiple value.var by this syntax:

testWide <- dcast(setDT(test), formula = time ~ country, value.var = c('feature1','feature2'))

All you need is to wrap the dataframe with setDT and pass the vector of value.var columns to it.
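The equivalent long-to-wide cast with two value columns can be sketched in plain Python dictionaries. The columns time, country, feature1, and feature2 mirror the dcast formula above, and the small inline dataset is only illustrative:

```python
long_rows = [
    {'time': 1, 'country': 'US', 'feature1': 10, 'feature2': 0.1},
    {'time': 1, 'country': 'UK', 'feature1': 20, 'feature2': 0.2},
    {'time': 2, 'country': 'US', 'feature1': 30, 'feature2': 0.3},
    {'time': 2, 'country': 'UK', 'feature1': 40, 'feature2': 0.4},
]

wide = {}
for r in long_rows:
    # one output row per time value
    row = wide.setdefault(r['time'], {'time': r['time']})
    # one output column per value.var x country combination,
    # named like data.table's feature1_US, feature2_UK, ...
    row['feature1_' + r['country']] = r['feature1']
    row['feature2_' + r['country']] = r['feature2']

print(wide[1])
# {'time': 1, 'feature1_US': 10, 'feature2_US': 0.1, 'feature1_UK': 20, 'feature2_UK': 0.2}
```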


How to direct system output to a variable in R

For people familiar with the Linux/Unix/Mac command line, we all know there are many system commands that can save our day. One of the most common problems is getting the number of lines/words in a large file; here I’m talking about tens of millions of records and above.

There are many ways to do it: the easiest is to use ‘readLines’ to get all the lines and count them. But this becomes impossible if your memory won’t allow it. On a Linux platform, you can easily do it by calling ‘wc -l filename.txt’.

In the R environment, you can execute any system command by calling ‘system()’. In this example, system("wc -l filename.txt") shows the number of lines. Here is the original question: how do I assign the output to a variable?

It won’t work if you just do:

varName <- system("wc -l filename.txt")

But here is the trick: with intern = TRUE, system() returns the command’s standard output as a character vector instead of printing it.

varName <- system("wc -l filename.txt", intern = TRUE)
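A Python counterpart of intern = TRUE is capturing the command’s stdout with subprocess (this assumes a POSIX system where `wc` is available; the helper name count_lines is my own):

```python
import subprocess

def count_lines(path):
    # capture stdout instead of letting it print, like intern = TRUE in R
    out = subprocess.run(['wc', '-l', path],
                         capture_output=True, text=True, check=True)
    # wc prints "<count> <filename>"; take the first field
    return int(out.stdout.split()[0])
```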


For more information on the most frequently used Linux command, refer to 50 Most Commonly Used Linux Command with Example.


Automate tabular financial datatable into vectorized sequential data

A lot of times, we receive time-related data in a table format and want to convert it into a simple format with one column for the datetime and another for the value. See this sample table:

Now we want to convert this dataset into another format that is easier to visualize and convert to other data structures like xts or timeSeries objects. The converted data will look like:


Let’s look at a sample unemployment-rate dataset from the Department of Labor.

sampleData <- read.csv('table_date.csv')


Methods in R: there are two common ways to do it. First:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData), 2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'), as.Date('2017-12-01'), 'month')
newTS <- data.frame(dates = dates, value = tableDataFlat)


The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData),2:ncol(sampleData)])))
newTS <- data.frame(dates=dates,value=tableDataFlat)

Now we can do visualization and analysis more conveniently.



Method in Python:

In Python, it is even simpler: flatten the data matrix using numpy:

import numpy as np
import pandas as pd

df = pd.read_csv('table_date.csv')
data = df.values[:, 1:]          # skip the first (label) column, as in the R version
data_flat = data.flatten()
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')  # 'MS' = month start
new_df = pd.DataFrame({'date': dates, 'value': data_flat})