Two Common Catches in R Programming

Sometimes a script gives you a big surprise because of a subtle difference in a command. Here are two common, hard-to-catch traps in R programming.

1. which vs. %in% when subsetting a data frame

which

df <- data.frame(a = runif(5), d = runif(5), animals = c('dog','cat','snake','lion','rat'), z = 1:5)
results1 <- df[, -which(names(df) %in% c("a","d"))]  # works as expected
# how about this one
results2 <- df[, -which(names(df) %in% c("b","c"))]  # surprise! All data are gone: which() matches nothing and returns integer(0), and df[, -integer(0)] selects zero columns

%in%

df <- data.frame(a = runif(5), d = runif(5), animals = c('dog','cat','snake','lion','rat'), z = 1:5)
results1 <- df[, !names(df) %in% c("a","d")]  # works as expected
# how about this one
results2 <- df[, !names(df) %in% c("b","c")]  # returns the un-altered data.frame

Another fast way to drop columns is to assign NULL to them:

dropVec <- c('a','d')
df[dropVec] <- list(NULL)
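A defensive alternative (a sketch, not from the original post) is to compute the keep-set with base R's setdiff, which silently ignores names that aren't present, so a typo in the drop list can't wipe out the whole data frame:

```r
df <- data.frame(a = runif(5), d = runif(5),
                 animals = c('dog','cat','snake','lion','rat'), z = 1:5)
# "b" and "c" don't exist, so setdiff keeps every column
keep <- setdiff(names(df), c("b", "c"))
df2 <- df[, keep]
ncol(df2)  # 4 -- the data frame is unchanged
```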

2. Missing parentheses ()

Look at the following example. You would expect it to print 1:9, right? Instead, it prints 0:9, because `:` binds tighter than `-`, so 1:n-1 means (1:n) - 1.

n <- 10
for (i in 1:n-1) {
  print(i)
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
n <- 10
for (i in 1:(n-1)){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
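The precedence behavior can be checked directly; seq_len is a sketch of a more defensive idiom that sidesteps the trap entirely:

```r
n <- 10
# ':' has higher precedence than '-', so 1:n-1 is (1:n) - 1, i.e. 0..9
all(1:n - 1 == 0:9)             # TRUE
# seq_len(n - 1) makes the intent explicit and avoids the precedence issue
identical(seq_len(n - 1), 1:9)  # TRUE
```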

Or check out my Rpub: http://rpubs.com/euler-tech/303265

Common challenges while aggregating data with multiple group IDs and functions in R

While analyzing a dataset, one of the most common tasks is looking at its features in an aggregated way: for example, aggregating by year, month, day, or IDs. Often you then want to look at the aggregated effects using not just one aggregation function but several (say min, max, and count).

There are a couple of ways to do it in R:

  • Aggregate each function separately and merge them.

agg.sum <- aggregate(. ~ id1 + id2, data = x, FUN = sum)

agg.min <- aggregate(. ~ id1 + id2, data = x, FUN = min)

merge(agg.sum, agg.min, by = c("id1", "id2"))

  • Aggregate all at once using ‘dplyr’

# inclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(var1, var2))

# exclusion

df %>% group_by(id1, id2) %>% summarise_at(.funs = funs(mean, min, n()), .vars = vars(-var1, -var2))
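Note that in newer versions of dplyr (1.0 and later), funs() is deprecated; the following sketch (with a hypothetical toy data frame, since the post's data isn't reproduced) shows the equivalent call using across() with a named list of functions:

```r
library(dplyr)

# toy stand-in for the real data (hypothetical columns id1, id2, var1, var2)
df <- data.frame(id1 = c(1, 1, 2, 2),
                 id2 = c("a", "a", "b", "b"),
                 var1 = 1:4, var2 = 5:8)

# across() replaces the deprecated funs()/summarise_at() pattern
out <- df %>%
  group_by(id1, id2) %>%
  summarise(across(c(var1, var2), list(mean = mean, min = min)),
            .groups = "drop")
out  # columns: id1, id2, var1_mean, var1_min, var2_mean, var2_min
```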

These are very handy for quick analysis, especially for people who prefer simpler code.

Cast multiple value.var columns simultaneously for reshaping data from Long to Wide in R

While working with R, reshaping a dataframe from wide format to long format is relatively easier than the opposite direction, especially when you want to reshape to a wide format with multiple columns for value.var in dcast. Let’s look at an example dataset:


From v1.9.6 of data.table, we can cast multiple value.var by this syntax:

testWide <- dcast(setDT(test), formula = time ~ country, value.var = c('feature1', 'feature2'))

All you need to do is wrap the dataframe in setDT and pass the vector of column names to value.var.
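Since the post's screenshot data isn't reproduced here, the following is a self-contained sketch with a hypothetical toy long-format table:

```r
library(data.table)

# toy long-format data: one row per time/country with two measured features
test <- data.frame(time = rep(1:2, each = 2),
                   country = rep(c("US", "UK"), times = 2),
                   feature1 = 1:4, feature2 = 5:8)

# one call casts both features; the wide table gets one column per
# feature/country combination (e.g. feature1_UK, feature1_US, ...)
testWide <- dcast(setDT(test), time ~ country,
                  value.var = c("feature1", "feature2"))
```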


How to direct system output to a variable in R

For people familiar with the Linux/Unix/Mac command line, we all know that there are many system commands that can save our day. One of the most frequently encountered problems is getting the number of lines/words in a large file; here I’m talking about tens of millions of records and above. The easiest way is to use ‘readLines’ to read all the lines and count them, but that becomes impossible if your memory won’t allow it. On a Linux platform, you can easily do it by calling ‘wc -l filename.txt’.

In the R environment, you can execute any system command by calling ‘system()’. In this example, system("wc -l filename.txt") shows the number of lines. Here is the original question: how do I assign the output to a variable?

It won’t work if you just do:

varName <- system("wc -l filename.txt")

But here is the trick:

varName <- system("wc -l filename.txt", intern = TRUE)

Bingo.
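To actually use the count as a number, the intern = TRUE output still needs parsing, since wc -l prints the count followed by the file name. A sketch (assuming a Unix-like system with wc on the PATH; the file is a throwaway temp file so the example is self-contained):

```r
# write a throwaway 42-line file so the example can run anywhere
tmp <- tempfile()
writeLines(as.character(1:42), tmp)

# wc -l prints "<count> <filename>", so split off the first field
raw <- system(paste("wc -l", tmp), intern = TRUE)
nLines <- as.integer(strsplit(trimws(raw), "\\s+")[[1]][1])
nLines  # 42
```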

For more information on the most frequently used Linux command, refer to 50 Most Commonly Used Linux Command with Example.

 

Automate converting a tabular financial datatable into vectorized sequential data

A lot of times, we receive time-related data in a table format and we want to convert it into a simple format with one column of datetime and another of values. See this sample table:

Now we want to convert this dataset into another format that is easier to visualize and to convert to other data structures like xts or timeSeries objects. The converted data will look like:


Let’s look at a sample unemployment rate from the Department of Labor.

sampleData <- read.csv('table_date.csv')
sampleData


Method in R: there are two common ways to do it. First:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData),2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'), as.Date('2017-12-01'), 'month')
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)


The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData),2:ncol(sampleData)])))
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)
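Since the original CSV isn't reproduced here, a hypothetical toy table shows that both methods produce the same row-by-row flattening:

```r
# toy stand-in for the unemployment table: rows = years, columns = months
tab <- data.frame(year = 2005:2006,
                  Jan = c(5.3, 4.7), Feb = c(5.4, 4.8), Mar = c(5.2, 4.7))
m <- as.matrix(tab[, -1])    # drop the year column, as in the post
flat1 <- as.vector(t(m))     # method 1
flat2 <- c(t(m))             # method 2
identical(flat1, flat2)      # TRUE: both walk the table row by row
flat1                        # 5.3 5.4 5.2 4.7 4.8 4.7
```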

Now we can do visualization and analysis more conveniently.

plot(newTS)


Method in Python:

In Python, it is even simpler: flatten the data matrix by using:

import numpy as np
import pandas as pd

df = pd.read_csv('table_date.csv')
data = df.values[:, 1:]  # drop the first (date label) column, as in the R version
data_flat = data.flatten()
# 'MS' (month start) matches the first-of-month dates used above
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')
new_df = pd.DataFrame({'date': dates, 'value': data_flat})
new_df.set_index('date')

All about the *apply family in R

R has many *apply functions that are very helpful for simplifying code. Much of their functionality is also covered by the dplyr package, but it is still good to know the differences and how to use them. They are just too convenient to ignore.

First, the following mnemonics give you an overview of what each *apply function does in general.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
  • vapply is a verified apply (allows the return object type to be prespecified)
  • rapply is a recursive apply for nested lists, i.e. lists within lists
  • tapply is a tagged apply where the tags identify the subsets
  • apply is generic: applies a function to a matrix’s rows or columns (or, more generally, to dimensions of an array)

Example:

apply
For the sum/mean of each row/column, there are more optimized functions: colMeans, rowMeans, colSums, rowSums. When you use apply on a dataframe, it will automatically be coerced to a matrix.

# Two-dimensional matrix
myMetric <- matrix(floor(runif(15,0,100)),5,3)
myMetric
# apply min to rows
apply(myMetric,1,min)
# apply min to columns
apply(myMetric,2,min)

     [,1] [,2] [,3]
[1,]   28   22    6
[2,]   31   75   80
[3,]    7   88   96
[4,]   15   70   27
[5,]   74   84   12
[1]  6 31  7 15 12
[1]  7 22  6

lapply
For a list or vector, it applies the function to each element. lapply is the workhorse underneath all the *apply functions, the most fundamental one.

x <- list(a = runif(5,0,1), b = seq(1:10), c = seq(10:100))
lapply(x, FUN = mean)

# Result

$a
[1] 0.4850281

$b
[1] 5.5

$c
[1] 46

sapply
sapply does the same as lapply; only the output differs. It simplifies the output to a vector (or matrix) rather than a list.

x <- list(a = runif(5,0,1), b = seq(1:10), c = seq(10:100))
sapply(x, FUN = mean)

a                 b                  c
0.2520706 5.5000000 46.0000000

vapply – similar to sapply, but you prespecify the return type with FUN.VALUE, which makes it safer and often slightly faster.
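A short sketch of what "verified" means in practice: FUN.VALUE declares the expected shape of each result, and vapply errors out if a function returns anything else (where sapply would silently fall back to a list).

```r
x <- list(a = runif(5, 0, 1), b = 1:10, c = 1:91)
# FUN.VALUE = numeric(1) declares that each call must return one numeric
res <- vapply(x, mean, FUN.VALUE = numeric(1))
res  # a named numeric vector, like sapply's output but type-checked
```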

rapply
This is a recursive apply, especially useful for a nested list structure. For example:

# Append "!" to strings, otherwise increment
myFun <- function(x){
  if (is.character(x)){
    return(paste(x, "!", sep = ""))
  } else {
    return(x + 1)
  }
}

# A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Result is a named vector, coerced to character
rapply(l, myFun)

# Result is a nested list like l, with values altered
rapply(l, myFun, how = "replace")

   a.a1    a.b1    a.c1       b       c    d.a2 d.b2.a3 d.b2.b3
 "Boo!"     "3" "Eeek!"     "4" "Yikes!"    "2"  "Hey!"     "6"

$a
$a$a1
[1] “Boo!”

$a$b1
[1] 3

$a$c1
[1] “Eeek!”

 

$b
[1] 4

$c
[1] “Yikes!”

$d
$d$a2
[1] 2

$d$b2
$d$b2$a3
[1] “Hey!”

$d$b2$b3
[1] 6

tapply
For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.).

x <- 1:20
y = factor(rep(letters[1:5], each = 4))
tapply(x,y,sum)

a b c d e
10 26 42 58 74

mapply and Map
For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

Map is a wrapper around mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

mapply(sum, 1:5, 1:10,1:20)
mapply(rep, 1:4, 4:1)

[1] 3 6 9 12 15 13 16 19 22 25 13 16 19 22 25 23 26 29 32 35
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4
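A minimal Map example to contrast with the simplified mapply output above:

```r
# Map never simplifies, so the result type is always a list
res <- Map(sum, 1:3, 4:6)
res  # list(5, 7, 9)
# compare: mapply(sum, 1:3, 4:6) simplifies the same result to c(5, 7, 9)
```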

 

This post is compiled from stackoverflow’s top answers.

A better view of this is to look at the R Notebook I’ve created: https://rpubs.com/euler-tech/292794

How to use a customized function with the pipe operator %>% in R

For advanced R programmers or Python (Spark) machine learning engineers, you have probably heard of or used a pipeline at least once in your data or model workflow. The concept of the pipeline is derived from the Unix/Linux shell. A pipeline is a sequence of processes chained together by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one, for example: ls -l | grep key | less. Since the debut of one of the greatest R packages, ‘magrittr‘, the pipeline has been one of my favorite things in data engineering.

As we know, a pipeline requires you to pass the whole output of the previous command (process) to the next one. A problem arises when you want to use some basic/simple R command on just one particular column of the data object. For example, if I have the dataset ‘babynames’ and I want to round the ‘prop’ column to 3 digits, what will happen if I do this:

library(babynames)
library(dplyr)
library(magrittr)

babynames %>%
  round('prop') %>%
  head

It gives me an error:

babynames %>%
+   round('prop') %>%
+   head
Error in Math.data.frame(list(year = c(1880, 1880, 1880, 1880, 1880, 1880, :
  non-numeric variable in data frame: sexname

How am I going to fix it? The solution is simple: write a customized wrapper function to let it go with the flow. Here is the solution:

library(babynames)
library(dplyr)
library(magrittr)

myRound <- function(df, colname){
  df[[colname]] <- round(df[[colname]], 3)
  return(df)
}

babynames %>%
  myRound('prop') %>%
  head

Now it works. Hooray!

year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.072
2 1880 F Anna 2604 0.027
3 1880 F Emma 2003 0.021
4 1880 F Elizabeth 1939 0.020
5 1880 F Minnie 1746 0.018
6 1880 F Margaret 1578 0.016

Why does it work?

A pipeline works like a multi-stage filter for a signal: each stage can only take the whole object as input, not part of it. The wrapper function therefore operates as a buffer within the pipeline.
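For this particular task, dplyr's own mutate already acts as such a buffer, modifying one column while passing the whole data frame along the pipe. A sketch with a toy data frame (the values are made up; the real babynames data isn't needed here):

```r
library(dplyr)

# toy data frame standing in for babynames (hypothetical values)
df <- data.frame(name = c("Mary", "Anna"),
                 prop = c(0.072381, 0.026678))

# mutate() rounds one column and returns the whole frame to the pipe
out <- df %>% mutate(prop = round(prop, 3))
out$prop  # 0.072 0.027
```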