The pitfalls of the eval function and its safe alternative in Python

Some of you have used eval and exec in Python as a handy quick-and-dirty way to take dynamic source code, munge it a little, and then execute it. eval evaluates a string as though it were an expression and returns the result. In Python, the input can also be a structured representation of code, such as a code object compiled from an abstract syntax tree (similar to Lisp forms). Its key uses are:

  • Evaluating a mathematical expression
  • Compiler bootstrapping
  • Scripting (dynamic code)
  • Language tutors

For example:


x = 1

eval('x + 1') # returns 2

eval('x') # returns 1

# The most general form for evaluating statements is using code objects
x = 1
y = 2
eval(compile("print('x+y=', x+y)", "compile-sample.py", "single"))
# prints
# x+y= 3
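As a small sketch of how code objects behave, the block below contrasts the "eval" and "exec" compile modes. The variable names and the "<string>" filename are only illustrative, and the namespaces are passed explicitly so the example does not depend on where it is run.

```python
# "eval" mode compiles a single expression; eval() returns its value.
expr_code = compile("x + 1", "<string>", "eval")
result = eval(expr_code, {"x": 1})

# "exec" mode compiles one or more statements; exec() runs them for their
# side effects (here, binding y in the namespace) and returns None.
namespace = {"x": 1}
stmt_code = compile("y = x + 1", "<string>", "exec")
exec(stmt_code, namespace)

print(result, namespace["y"])  # prints: 2 2
```

The "single" mode used above is the one the interactive interpreter uses: it accepts one statement and echoes the value of a bare expression.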
 

However, eval has a big safety flaw: it evaluates whatever code is in the expression without considering whether it is safe. See a good article on ‘The danger of eval’. A good example is passing user input straight to eval:


# Typing __import__('os').system('rm -rf /root/important_data') at this prompt would execute it
eval(input("Enter an expression: "))

The cure for this is to use ast.literal_eval, which raises an exception if the input isn’t a valid Python literal, so unsafe code is never executed.
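A minimal sketch of that behaviour follows. The helper name parse_literal is hypothetical. It only catches ValueError and SyntaxError, which are the usual failures for non-literal input; other exceptions are possible for pathological inputs.

```python
import ast


def parse_literal(text):
    """Return the Python literal in text, or raise ValueError for anything else."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"not a plain Python literal: {text!r}") from exc


# A dict of lists is a literal, so it is parsed and returned.
print(parse_literal("{'a': [1, 2, 3]}"))

# A function call is not a literal, so it is rejected without being run.
try:
    parse_literal("__import__('os').system('echo unsafe')")
except ValueError as err:
    print("rejected:", err)
```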

The rule of thumb is to use ast.literal_eval whenever you need eval. But there are some notable differences: ast.literal_eval won’t work for bitwise operators. For example:


import ast

ast.literal_eval("1 & 1") # raises ValueError

eval("1 & 1") # returns 1

ast.literal_eval() only considers a small subset of Python’s syntax to be valid:

The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
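If you do need operators such as & on untrusted input, one possible approach is to parse the string with ast and evaluate only node types you have explicitly allowed. The sketch below is illustrative, not a hardened implementation: the function name safe_arith_eval and the choice of allowed operators are assumptions, and exponentiation is left out deliberately because it can produce very large numbers.

```python
import ast
import operator

# Whitelisted operators; any operator not listed here is rejected.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.BitAnd: operator.and_,
    ast.BitOr: operator.or_,
    ast.BitXor: operator.xor,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arith_eval(expression):
    """Evaluate a numeric expression built only from whitelisted operators."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Only plain int and float constants are accepted (not bool or str).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        # Names, calls, attribute access and everything else end up here.
        raise ValueError("unsupported expression: " + ast.dump(node))

    return _eval(ast.parse(expression, mode="eval"))


print(safe_arith_eval("1 & 1"))        # prints: 1
print(safe_arith_eval("2 * (3 + 4)"))  # prints: 14
```

Function calls such as __import__('os') never reach an operator: they fall through to the final raise.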


Secret ingredient for tuning Random Forest Classifier and XGBoost Tree

Tuning a machine learning model can be time consuming and may still not get you where you want. Here are two commonly used settings for Random Forest Classifier and XGBoost Tree in Kaggle competitions. They can be a good starting point.

Random Forest:


from sklearn.ensemble import RandomForestClassifier

# min_impurity_split was removed in scikit-learn 1.0, so it is omitted here
clf = RandomForestClassifier(bootstrap=True, class_weight='balanced',
                             criterion='gini', max_depth=9, max_features=0.6,
                             max_leaf_nodes=None, min_impurity_decrease=0.0,
                             min_samples_leaf=1,
                             min_samples_split=2, min_weight_fraction_leaf=0.0,
                             n_estimators=500, n_jobs=-1, oob_score=False, random_state=0,
                             verbose=0, warm_start=False)

XGBoost Tree:

params = {'objective': 'binary:logistic',
          'eta': 0.3,  # analogous to learning rate in GBM
          'tree_method': "hist",
          'grow_policy': "lossguide",
          'max_leaves': 256,  # =2^n depth
          'max_depth': 5,  # typical 3-10, 5 chosen based on testing
          'subsample': 0.8,  # lower value prevents overfitting
          'colsample_bytree': 0.7,
          'colsample_bylevel': 0.7,
          'min_child_weight': 0,  # different than 'min_child_leaf' in GBM
          'alpha': 4,
          'lambda': 10,  # add L2 regularization to prevent overfitting
          'scale_pos_weight': 27,
          'eval_metric': 'auc',
          'nthread': 8,
          'random_state': 99,
          'silent': True}  # newer XGBoost releases use 'verbosity' instead

Setting up Python to connect to a Netezza database on a Unix/Linux server

Here are quick instructions for setting up a connection to a Netezza database from a typical Unix/Linux server:

  1. Set JAVA_HOME and NZ_HOME and add them to $PATH
export JAVA_HOME=/somepath/jdk/jdk1.7.0-x64/jre/
export NZ_HOME=/somepath/nzcli/7.0.4
export LD_LIBRARY_PATH=.:/usr/lib:/usr/ucblib:${NZ_HOME}/lib
export PATH=/usr/bin:${NZ_HOME}/bin:${JAVA_HOME}/bin

Sample code


import jaydebeapi

dsn_database = "db_name"
dsn_hostname = "awarehouse-unz01"
dsn_port = "5480"
dsn_uid = "username"
dsn_pwd = "password"
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/some path/nzcli/7.0.4/lib/nzjdbc3.jar"
connection_string = 'jdbc:netezza://' + dsn_hostname + ':' + dsn_port + '/' + dsn_database
# url embeds the password, so avoid printing it outside of debugging
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

# Reuse the driver name and jar location defined above
conn = jaydebeapi.connect(jdbc_driver_name, connection_string,
                          {'user': dsn_uid, 'password': dsn_pwd},
                          jars=jdbc_driver_loc)
curs = conn.cursor()

testQuery = "select * from LN_SRVG_ACTV_CASH_EOD_SPST_VW2 limit 1"
curs.execute(testQuery)
result = curs.fetchall()
print("Total records: " + str(len(result)))
print(result[0])

for row in result:
    print(row)

curs.close()
conn.close()

Connect to Oracle Database with cx_Oracle and use bind variables

Connecting to an Oracle database requires the Oracle client library, so some environment settings are needed to make it work, especially on a Unix server.

A typical shell script to set up the environment so that Python can talk to an Oracle DB:


export ORACLE_HOME=/[your path to oracle client]/oracli/12.1.0.1
export PATH=$PATH:$ORACLE_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$ORACLE_HOME/lib
alias conda="/some path/ANACONDA/anaconda3/bin/python"

After setting up the Unix environment, let’s look at an example that accesses an Oracle database:
import cx_Oracle
import pandas as pd
host = 'your hostname'
sid = 'your sid'
port = 1521
user = 'username'
password = 'password'
dsn_tns_adw = cx_Oracle.makedsn(host, port, sid)
db = cx_Oracle.connect(user, password, dsn_tns_adw)
cursor = db.cursor()
testQuery = "select count(*) from test_db"
cursor.execute(testQuery)
# Column names come from cursor.description; cx_Oracle cursors have no keys() method
columns = [d[0] for d in cursor.description]
df = pd.DataFrame(cursor.fetchall(), columns=columns)
print(df)
Often you need to run similar queries many times, changing only a value in the WHERE clause. cx_Oracle lets you do this easily with bind variables.
con = cx_Oracle.connect(user, password, dsn_tns_adw)
cur = con.cursor()
cur.prepare('select * from department where department_id = :id')
cur.execute(None, {'id': 210})
res = cur.fetchall()
print(res)
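To try the same named bind variable pattern without an Oracle server, here is a rough sketch that uses the standard-library sqlite3 module as a stand-in. sqlite3 is not cx_Oracle, and the table contents are made up for illustration, but it also accepts :id style placeholders with a dict of values.

```python
import sqlite3

# In-memory database with a hypothetical department table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table department (department_id integer, name text)")
cur.executemany(
    "insert into department values (:id, :name)",
    [{"id": 210, "name": "IT Support"}, {"id": 220, "name": "NOC"}],
)

# The SQL text stays the same; only the bound value changes between runs.
query = "select * from department where department_id = :id"
rows_by_id = {}
for dept_id in (210, 220):
    cur.execute(query, {"id": dept_id})
    rows_by_id[dept_id] = cur.fetchall()

print(rows_by_id)
con.close()
```

Binding values this way also keeps user-supplied input out of the SQL text itself.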

 

Implementing R's paste function in Python

The paste function is very handy for concatenating vectors and strings. A very common use case is creating SQL queries in R:


vec <- letters[1:5]

paste("SELECT * FROM db WHERE col IN ('",paste(vec,collapse = "','"),"')",sep = "")

[1] "SELECT * FROM db WHERE col IN ('a','b','c','d','e')"

How do we create a similar function in Python using map and reduce?


import functools

def reduce_concat(x, sep=""):
    return functools.reduce(lambda x, y: str(x) + sep + str(y), x)

def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return reduce_concat(result, sep=collapse)
    # Without collapse, return one joined string per position
    return list(result)

print(paste([1,2,3], [11,12,13], sep=','))
print(paste([1,2,3], [11,12,13], sep=',', collapse=";"))

# ['1,11', '2,12', '3,13']
# 1,11;2,12;3,13
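As an alternative sketch, the same behaviour can be written with str.join instead of reduce. The name paste_join is hypothetical. Like the version above, it stops at the shortest input because it relies on zip, whereas R's paste recycles shorter vectors.

```python
def paste_join(*lists, sep=" ", collapse=None):
    """Join items position by position, then optionally collapse into one string."""
    rows = [sep.join(str(item) for item in group) for group in zip(*lists)]
    return rows if collapse is None else collapse.join(rows)


# Rebuild the SQL query from the R example.
vec = ["a", "b", "c", "d", "e"]
query = "SELECT * FROM db WHERE col IN ('" + paste_join(vec, collapse="','") + "')"
print(query)
# SELECT * FROM db WHERE col IN ('a','b','c','d','e')
```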

 

Using categorical features in Python

In machine learning we often get features that are categorical, or that can be transformed into categorical features to make model building easier. If you know sklearn, you have probably seen tutorials covering LabelEncoder and OneHotEncoder. But there is still confusion about how to code it up in real scenarios.

Assume here that we only look at the object columns, which are typically categorical features.

  1. First get the column names with dtype of ‘object’
import pandas as pd
catColumns = df.select_dtypes(['object']).columns

2. Encode two-level features as 0/1 using LabelEncoder and, for features with more than two levels (N>2), use get_dummies


from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if (n > 2):
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True) # drop the original categorical variable (optional)
    else:
        le.fit(df[col])
        df[col] = le.transform(df[col])

How to execute a Jupyter notebook from a terminal without it being killed after the terminal goes offline

Have you run a big Jupyter notebook job on a remote server (say AWS/Azure) and had it killed because your local connection dropped or your web browser crashed? Then you have to start the process all over again. What a waste of time, especially if the process takes a long time to finish.

If you connect to your remote server using PuTTY or SSH from Linux or macOS, you can run your notebook in the background and don’t need to keep the terminal and web browser open all the time.

There are two ways of doing it:

A.

  1. First, convert the notebook to a *.py file either from the Jupyter GUI or from a command line with this command:

jupyter nbconvert --to python your_notebook.ipynb  # replace your_notebook.ipynb with your notebook's file name

sudo pip install mistune #you may need this

 

2. Run the .py script as a background process:

 python test.py & disown 

B.

The second and easier way is to run the notebook directly:

nbconvert allows you to run notebooks with the --execute flag:


jupyter nbconvert --execute

 

If you want to run a notebook and produce a new notebook, you can add --to notebook:


jupyter nbconvert --execute --to notebook

Or if you want to replace the existing notebook with the new output:


jupyter nbconvert --execute --to notebook --inplace

Since that’s a really long command, you can use an alias:


alias nbx="jupyter nbconvert --execute --to notebook"
nbx [--inplace]
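If you would rather launch the job from Python than from a shell alias, the following sketch builds the same nbconvert command and starts it detached from the terminal. The helper names and the notebook file name analysis.ipynb are hypothetical, and start_new_session is a POSIX-only way of moving the child into its own session.

```python
import subprocess


def nbconvert_command(notebook, inplace=False):
    """Build the nbconvert command line shown above as an argument list."""
    cmd = ["jupyter", "nbconvert", "--execute", "--to", "notebook"]
    if inplace:
        cmd.append("--inplace")
    cmd.append(notebook)
    return cmd


def run_in_background(cmd):
    """Start cmd in its own session so that closing the terminal does not stop it."""
    return subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


# Only the command is printed here; call run_in_background(cmd) to start it.
cmd = nbconvert_command("analysis.ipynb", inplace=True)
print(cmd)
```

run_in_background returns the Popen object, so its pid attribute gives you the process ID to use if you need to stop the job later.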

 

How to kill the background process:

  1. pgrep jupyter -> find the PID, then run kill followed by that PID
  2. ps auxf -> check all the processes

Another catch: when you run the code this way, you might need to switch matplotlib to a backend that needs no display.

 


import matplotlib
matplotlib.use("Agg")

import matplotlib.pyplot as plt

print(matplotlib.get_backend())
plt.plot([1,2,3,4])
plt.savefig('test.png')
plt.close()
# or plt.clf() when saving more than one figure.