A efficient way to preprocess large text file in Linux file system

Linux commands like sed, tr,  works efficiently on text processing, here is a common way to strip text in a large text file.

cat somefile.txt | sed -e “s/\([.\!?,’/()]\)/ \1 /g” | tr “[:upper:]” “[:lower:]” > reprocessed.txt


Three ways of rename column with groupby, agg operation in pySpark

Group and aggregation operations are very common in any data manipulation and analysis, but pySpark change the column name to a format of aggFunc(colname). This usually not the column name you’d like to use. The most intuitive way would be something like this:

group_df = df.groupby(‘colname’).max(‘value_column’).alias(‘max_column’)

However, this won’t change anything, neither did it give your an error. The reason is that we’re aliasing the whole data frame instead a column.

Here are three ways to just alias the column you just created from groupby.agg operation:

import pyspark.sql.functions as F

group_df = df.groupBy(‘colname’).max(‘value_column’)\


Second method is sue agg instead of calling the ‘max’ method:

from pyspark.sql.functions import max


Another method is to use ‘withColumnRenamed’:

df .groupBy(‘colname’).max(‘value_column’)\



Setting up Python for connecting to Netezza database in Unix/Linux server

Here is a quick instruction for setting up connection to Netezza database from a typical Unix/Linux server:

  1. Setup the $PATH for JAVA_HOME and NZ_HOME
export JAVA_HOME=/somepath/jdk/jdk1.7.0-x64/jre/
export NZ_HOME=/somepath/nzcli/7.0.4
export LD_LIBRARY_PATH=.:/usr/lib:/usr/ucblib:${NZ_HOME}/lib
export PATH=/usr/bin::${NZ_HOME}/bin:${JAVA_HOME}/bin

Sample code

dsn_database = "db_name"
dsn_hostname = "awarehouse-unz01"
dsn_port = "5480"
dsn_uid = "username"
dsn_pwd = "password"
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/some path/nzcli/7.0.4/lib/nzjdbc3.jar"
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

conn = jaydebeapi.connect("org.netezza.Driver", connection_string, {'user': dsn_uid, 'password': dsn_pwd},
jars = "/export/apps/nzcli/7.0.4/lib/nzjdbc3.jar")
curs = conn.cursor()

testQuery = "select * from LN_SRVG_ACTV_CASH_EOD_SPST_VW2 limit 1"
result = curs.fetchall()
print("Total records: " + str(len(result)))

for i in range(len(result)):


Changing python notebook display to custom format

Have you encountered times when you do df.describe() or other displaying of data results, it shows you a table in scientific notation like following which is not convenient at all.


finance 1.192433e+08
auto 1.293066e+08
airline 1.077142e+08

How do we fix it?

One way is to set the display options to the following:

pd.set_option('display.float_format', lambda x: '%.3f' % x)


pd['department'].apply(lambda x: ':%.3f'%x)


Lorenz Attractor: A demo for butterfly effect and super computational efficiency of implementing C code in R

The Lorenz attractor(by Cliffor Alan Pickover) is an attractor that arises in a simplified system of equations describing the two-dimensional flow of fluid of uniform depth, with an imposed temperature difference, under gravity, with buoyancy, thermal diffusivity, and kinematic viscosity. The full equation are:


where ψ is a stream function, defined such that the velocity component U=(u,w)U=(u,w)

In the early 1960s, Lorenz accidentally discovered the chaotic behavior of this system. One of his chaotic attractors is defined:


grew for Rayleigh numbers larger than the critical value,. Furthermore, vastly different results were obtained for very small changes in the initial values, representing one of the earliest discoveries of the so-called butterfly effect.

In the iterative calculation, the n+1 th position depends on n_th position and the four parameters (a,b,c,d). Let’s do a simulation of 10 million iterations with (x0,y0)=(0,0) and a = -1.24, b=-1.25, c=-1.81, d=-1.91.

R Code:

“`{r lorenzor}
#define the theme
my.theme = theme(legend.position = ‘none’,
panel.background = element_rect(fill=’black’),
axis.ticks = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank(),
axis.text = element_blank()


# define cpp function
cppFunction(‘DataFrame createTrajectory(int n, double x0, double y0,double a, double b, double c, double d){
// create the columns
NumericVector x(n);
NumericVector y(n);
for(int i=1; i<n; ++i){
x[i]=sin(a*y[i-1]) + c*cos(a*x[i-1]);
y[i]=sin(b*x[i-1]) + d*cos(b*y[i-1]);
// return a data frame
return DataFrame::create(_[“x”]=x,_[“y”]=y);

createTrajectoryR <- function(n,x0,y0,a,b,c,d){

#implementation with R

x = rep(0,n+1)
y = rep(0,n+1)
x[1] = x0
y[1] = y0
for (i in seq(2,n+1)){
x[i] <- sin(a*y[i-1])+ c*cos(a*x[i-1])
y[i] <- sin(b*x[i-1]) +d*cos(d*y[i-1])


#Initial parameters for dynamic system

a = -1.24
b = -1.25
c = 1.81
d = 1.91

system.time(df_C <- createTrajectory(10000000,0,0,a,b,c,d))

system.time(df_R <- createTrajectoryR(10000000,0,0,a,b,c,d))

#png(“./lorenzor_attractor.png”,units =’px’,width = 1600, height = 1600, res = 300)
# plot results from c
ggplot(df_C, aes(x,y)) + geom_point(color=’white’,shape=46,alpha=0.1) + my.theme
# plot results from R
ggplot(df_R, aes(x,y)) + geom_point(color=’white’,shape=46,alpha=0.1) + my.theme

End of R code

Runtime comparison


The R runtime is more than 5 times of C code.

How does Lorenz Attractor System look like?


Butterfly effect:

By slightly changing the parameters, you’ll get a vastly different solution.


ref: http://mathworld.wolfram.com/LorenzAttractor.html

Kernel Density Estimation–Optimal bandwidth

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable which is used to specify the probability of the random variable falling within a particular range of values, as opposed to taking on any one value. (Wikipedia).


Let (x1,x2,…, Xn) be a uni-variate independent and identically distributed sample drawn from some distribution with an unknown density f.  We’re interested in estimating the shape of this function f. The kernel density estimator is defined as:

 where K is the kernel and h >0 is a smoothing parameter called the bandwidth. The most common kernel are:: uniform, triangular, biweight, triweight, Epanechnikov, normal and others.

Intuitively, we want to choose bandwidth (h) as small as the data will allow. However, there is always a trade-off between the bias of the estimator and its variance.

The most common optimality criterion used to select the bandwidth is the mean integrated squared error:

A rule-of-thumb bandwidth estimator.

 is used to approximate univariate data and the underlying density being Gaussian. However, this can yield inaccurate estimates when density is not close to being normal.

{\displaystyle {\text{Bin size}}=2\,{{\text{IQR}}(x) \over {n^{1/3}}}}, a.k.a Freedman-Diaconis rule, is a practical way to get the optimal binwidth for histogram.

Another better estimator is the so-called: solve-the-equation bandwidth. (Botev, Z.I.; Grotowski, J.F.; Kroese, D.P. (2010). “Kernel density estimation via diffusion”. Annals of Statistics. 38 (5): 2916–2957. doi:10.1214/10-AOS799.)

In R, the function recommended to get the optimal bandwidth is MASS::bandwidth.nrd(), dpih().

Another interesting thing about KDE is that: it can be shown that both kNN (k nearest neighbor) and KDE converge to the true probability density as 𝑁 → ∞, provided that 𝑉 (volume for each bin) shrinks with 𝑁, and that 𝑘 (number of data falls in each bin) grows with 𝑁 appropriately.

ref: https://en.wikipedia.org/wiki/Kernel_density_estimation#cite_note-bo10-    https://en.wikipedia.org/wiki/Probability_density_function  https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule