Convert a columnar URL text file to an HTML bookmark file for import

When you have a list of URLs stored in a columnar text file and want to import them into a browser like Chrome, adding them one by one is painful. So I wrote an R script to automate the process: it converts a CSV (or any separator-delimited file) into an HTML file suitable for bookmark import.
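The script reads a pipe-separated file with one bookmark name and URL per line. For example, a hypothetical bookmark.txt:

Google|https://www.google.com
GitHub|https://github.com
Stack Overflow|https://stackoverflow.com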

Here is the code:

library(gdata)

add_link <- function(name, urllink){
  # build a <DT><A> bookmark entry with the URL and add/modified timestamps
  current_time <- as.integer(as.POSIXct(Sys.time()))
  modified_time <- as.integer(as.POSIXct(Sys.time()))
  tag_list <- c('<DT><A HREF="', trimws(as.character(urllink)),
                '" ADD_DATE="', current_time,
                '" LAST_MODIFIED="', modified_time, '">',
                trimws(as.character(name)), '</A>')
  A_tag <- paste(tag_list, sep='', collapse='')
  return(A_tag)
}

add_bookmark_folder <- function(foldername){
  # build a <DT><H3> folder header
  tag_list <- c('<DT><H3>', foldername, '</H3>')
  return(paste(tag_list, sep='', collapse=''))
}

## change the following two lines for your use.

bookmark_textfile <- "bookmark.txt"   # input file
fileOutput <- "bookmark_import.html"  # output file

bookmark_text <- read.table(bookmark_textfile, sep = '|')
colnames(bookmark_text) <- c("name", "url")

header_string <- paste('<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL>', add_bookmark_folder('DSAT'),
'<DL>', sep='\n', collapse='')

end_string <- '
</DL>
</DL>
'

# write strings to html file
write(header_string, fileOutput)

# DSAT folder: every row except the last two
for (i in 1:(dim(bookmark_text)[1] - 2)){
  name <- bookmark_text$name[i]
  urllink <- bookmark_text$url[i]
  write(add_link(name, urllink), fileOutput, append=TRUE)
}
write('</DL>', fileOutput, append=TRUE)
# DEV folder: the last two rows
dev_folder <- paste(c(add_bookmark_folder('DEV'), '<DL>'), sep = '\n', collapse = '\n')
write(dev_folder, fileOutput, append=TRUE)
devIndex <- dim(bookmark_text)[1] - 1
for (i in devIndex:dim(bookmark_text)[1]){
  name <- bookmark_text$name[i]
  urllink <- bookmark_text$url[i]
  write(add_link(name, urllink), fileOutput, append=TRUE)
}

write(end_string, fileOutput, append=TRUE)

print("bookmark import file successfully completed.")
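With the hypothetical three-line input above, the generated import file looks roughly like this (header and timestamps abbreviated):

<!DOCTYPE NETSCAPE-Bookmark-file-1>
...
<H1>Bookmarks</H1>
<DL>
<DT><H3>DSAT</H3>
<DL>
<DT><A HREF="https://www.google.com" ADD_DATE="..." LAST_MODIFIED="...">Google</A>
</DL>
<DT><H3>DEV</H3>
<DL>
<DT><A HREF="https://github.com" ADD_DATE="..." LAST_MODIFIED="...">GitHub</A>
<DT><A HREF="https://stackoverflow.com" ADD_DATE="..." LAST_MODIFIED="...">Stack Overflow</A>
</DL>
</DL>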


Setting up Python for connecting to a Netezza database on a Unix/Linux server

Here are quick instructions for setting up a connection to a Netezza database from a typical Unix/Linux server:

  1. Set up the PATH for JAVA_HOME and NZ_HOME
export JAVA_HOME=/somepath/jdk/jdk1.7.0-x64/jre/
export NZ_HOME=/somepath/nzcli/7.0.4
export LD_LIBRARY_PATH=.:/usr/lib:/usr/ucblib:${NZ_HOME}/lib
export PATH=/usr/bin:${NZ_HOME}/bin:${JAVA_HOME}/bin

2. Sample code using jaydebeapi:


import jaydebeapi

dsn_database = "db_name"
dsn_hostname = "awarehouse-unz01"
dsn_port = "5480"
dsn_uid = "username"
dsn_pwd = "password"
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/some path/nzcli/7.0.4/lib/nzjdbc3.jar"

connection_string = 'jdbc:netezza://' + dsn_hostname + ':' + dsn_port + '/' + dsn_database
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

conn = jaydebeapi.connect(jdbc_driver_name, connection_string,
                          {'user': dsn_uid, 'password': dsn_pwd},
                          jars=jdbc_driver_loc)
curs = conn.cursor()

testQuery = "select * from LN_SRVG_ACTV_CASH_EOD_SPST_VW2 limit 1"
curs.execute(testQuery)
result = curs.fetchall()
print("Total records: " + str(len(result)))
print(result[0])

for i in range(len(result)):
    print(result[i])

curs.close()
conn.close()

Connect to an Oracle database with cx_Oracle using bind variables

Connecting to an Oracle database requires the Oracle client library, so some environment settings are needed to make it work, especially on a Unix server.

A typical shell script to set up the environment for Oracle DB to interface with Python:


export ORACLE_HOME=/[your path to oracle client]/oracli/12.1.0.1
export PATH=$PATH\:$ORACLE_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH\:$ORACLE_HOME/lib
alias conda=/some path/ANACONDA/anaconda3/bin/python 

After setting up the Unix environment, let's look at an example of accessing an Oracle database:

import cx_Oracle
import pandas as pd

host = 'your hostname'
sid = 'your sid'
port = 1521
user = 'username'
password = 'password'

dsn_tns_adw = cx_Oracle.makedsn(host, port, sid)
db = cx_Oracle.connect(user, password, dsn_tns_adw)
cursor = db.cursor()

testQuery = "select count(*) from test_db"
result = cursor.execute(testQuery)
df = pd.DataFrame(result.fetchall())
df.columns = [col[0] for col in cursor.description]  # column names from cursor metadata
print(df)
A lot of times, you may need to run similar queries many times, changing only the WHERE clause; cx_Oracle lets you do this easily with bind variables.
con = cx_Oracle.connect(user, password, dsn_tns_adw)
cur = con.cursor()
cur.prepare('select * from department where department_id = :id')
cur.execute(None, {'id': 210})  # passing None reuses the prepared statement
res = cur.fetchall()
print(res)
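Since the prepared statement stays attached to the cursor, you can rerun it with different bind values in a loop. A minimal sketch, with hypothetical department ids:

# rerun the same prepared statement with different bind values
for dept_id in (210, 220, 230):  # hypothetical ids
    cur.execute(None, {'id': dept_id})
    print(dept_id, cur.fetchall())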


R's paste function implemented in Python

The paste function is very handy when it comes to concatenating vectors/strings. A very common use case is creating SQL queries in R:


vec <- letters[1:5]

paste("SELECT * FROM db WHERE col IN ('",paste(vec,collapse = "','"),"')",sep = "")

[1] "SELECT * FROM db WHERE col IN ('a','b','c','d','e')"

How do we create a similar function in Python using map and reduce?


import functools

def reduce_concat(x, sep=""):
    return functools.reduce(lambda x, y: str(x) + sep + str(y), x)

def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return reduce_concat(result, sep=collapse)
    return list(result)

print(paste([1,2,3], [11,12,13], sep=','))
print(paste([1,2,3], [11,12,13], sep=',', collapse=";"))

# ['1,11', '2,12', '3,13']
# '1,11;2,12;3,13'
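As a quick check, the SQL IN-clause example from R above can be reproduced with this Python paste:

vec = list("abcde")
query = "SELECT * FROM db WHERE col IN ('" + paste(vec, collapse="','") + "')"
print(query)
# SELECT * FROM db WHERE col IN ('a','b','c','d','e')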


Use multi-threading for hyper-parameter tuning in pyspark

Using threads allows a program to run multiple operations concurrently in the same process space. Python has a threading library for this; here is a recap of how it is used:

import threading

def f(id):
    print("Thread function %s" % (id))
    return

for i in range(3):
    t = threading.Thread(target=f, args=(i,))
    t.start()


output:

Thread function 0
Thread function 1
Thread function 2

Now that we know how to invoke multi-threading in Python, how about PySpark for machine learning? Let's learn it with an example:

Say I want to find the best number of clusters k for k-means clustering. I can use multi-threading to evaluate several candidate values of k in parallel.

from pyspark.mllib.clustering import KMeans
import numpy as np

def error(point, clusters):
    # distance from a point to its assigned cluster center
    center = clusters.centers[clusters.predict(point)]
    return np.linalg.norm(point - center)

def calc_wssse(i):
    # c_points is an RDD of feature vectors, assumed to be defined earlier
    clusters = KMeans.train(c_points, i, maxIterations=20, runs=20, initializationMode='random')
    WSSSE = c_points.map(lambda point: error(point, clusters)).reduce(lambda x, y: x + y)
    return (i, WSSSE)

Run it with a ThreadPool from multiprocessing:

from multiprocessing.pool import ThreadPool

tpool = ThreadPool(processes=4)
wssse_points = tpool.map(calc_wssse, range(1, 10))
wssse_points

which gives (output truncated):

195318509740785.66
77539612257334.33
78254073754531.1
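Since WSSSE decreases as k grows, the best k is usually read off the elbow of the curve rather than taken as the minimum. A minimal sketch of that heuristic (the 10% cutoff is an arbitrary assumption):

# pick the smallest k where the relative WSSSE improvement levels off
pairs = sorted(wssse_points)  # [(k, WSSSE), ...]
threshold = 0.1               # hypothetical "diminishing returns" cutoff
best_k = pairs[-1][0]
for (k0, w0), (k1, w1) in zip(pairs, pairs[1:]):
    if (w0 - w1) / w0 < threshold:
        best_k = k0
        break
print("elbow at k =", best_k)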

Set class weight for unbalanced classification model in Keras

When training a neural network on unbalanced data in Keras, model.fit() has the option to specify class weights, but you need to compute them manually. Actually, there is an automatic way to get the dictionary to pass to the 'class_weight' argument of model.fit().

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

class_weight_list = compute_class_weight(class_weight='balanced',
                                         classes=np.unique(y_train_labels),
                                         y=y_train_labels)
class_weight = dict(zip(np.unique(y_train_labels), class_weight_list))
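Then pass the dictionary to fit. A minimal sketch, assuming model, X_train, and y_train_labels are already defined (epochs and batch_size are placeholder values):

# class_weight re-weights each class's contribution to the loss during training
model.fit(X_train, y_train_labels, epochs=10, batch_size=32, class_weight=class_weight)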

Using categorical features in Python

While doing machine learning, we often get features that are categorical, or that can be transformed into categorical features to make building models easier. If you know sklearn, there are tutorials covering LabelEncoder and OneHotEncoder, but there is still confusion about how to code this up in real scenarios.

Assume here we're only going to look at the object columns, which are typically categorical features.

  1. First get the column names with dtype of ‘object’
import pandas as pd
catColumns = df.select_dtypes(['object']).columns

2. Convert binary features using LabelEncoder and, for N > 2 levels, use get_dummies


from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if (n > 2):
        # one-hot encode, dropping the first level to avoid collinearity
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical variable (optional)
    else:
        # binary feature: encode in place as 0/1
        le.fit(df[col])
        df[col] = le.transform(df[col])
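A quick end-to-end check on a toy frame (column names and values are hypothetical):

import pandas as pd
from sklearn import preprocessing

# toy frame: 'color' has 3 levels (one-hot encoded), 'flag' has 2 (0/1)
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red'],
                   'flag': ['yes', 'no', 'yes', 'no'],
                   'value': [1.0, 2.0, 3.0, 4.0]})

catColumns = df.select_dtypes(['object']).columns
le = preprocessing.LabelEncoder()
for col in catColumns:
    if len(df[col].unique()) > 2:
        X = pd.get_dummies(df[col]).iloc[:, 1:]  # drop the first level
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)
    else:
        df[col] = le.fit_transform(df[col])

print(df)  # 'flag' is now 0/1; 'color' became 'green'/'red' indicator columns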