Secret ingredient for tuning Random Forest Classifier and XGBoost Tree

Tuning a machine learning model can be time consuming and may still not get you where you want. Here are two widely used settings for the Random Forest classifier and XGBoost trees in Kaggle competitions. They can be a good starting point.

Random Forest:

RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=9, max_features=0.6,
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=500, n_jobs=-1, oob_score=False,
                       random_state=0, verbose=0, warm_start=False)

XGBoost Tree:

params = {'objective': 'binary:logistic',
          'eta': 0.3,  # analogous to the learning rate in GBM
          'tree_method': 'hist',
          'grow_policy': 'lossguide',
          'max_leaves': 256,  # used with grow_policy='lossguide'; roughly 2^depth
          'max_depth': 5,  # typically 3-10; 5 chosen based on testing
          'subsample': 0.8,  # a lower value helps prevent overfitting
          'colsample_bytree': 0.7,
          'min_child_weight': 0,  # different from 'min_samples_leaf' in GBM
          'lambda': 10,  # L2 regularization to prevent overfitting
          'scale_pos_weight': 27,
          'eval_metric': 'auc',
          'random_state': 99,
          'silent': True}
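The scale_pos_weight=27 above encodes the class imbalance; a common heuristic is the ratio of negative to positive examples. A minimal sketch of computing it, with hypothetical labels:

```python
# Hypothetical binary labels with a 27:1 class imbalance
y = [0] * 270 + [1] * 10

neg = sum(1 for label in y if label == 0)  # majority-class count
pos = sum(1 for label in y if label == 1)  # minority-class count

# Common heuristic: scale_pos_weight = negatives / positives
scale_pos_weight = neg / pos
print(scale_pos_weight)  # 27.0
```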

How to select a particular row with a condition in PySpark?

Spark DataFrames are distributed, which means you can't access rows positionally the way you would with pandas df.loc[]. One workaround is to attach an index to each row and filter on it. Here is an example:

import pyspark
from pyspark.sql import SQLContext

df = sqlContext.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["letter", "name"])
myIndex = 1
values = (df.rdd.zipWithIndex()          # pairs each row with its index: (row, i)
          .filter(lambda pair: pair[1] == myIndex)
          .map(lambda pair: pair[0])
          .collect())
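If the goal is really a condition on column values rather than a positional index, a plain DataFrame filter is usually simpler. A sketch, assuming the df created above:

```python
# Filter rows by a column condition instead of a positional index
rows = df.filter(df.letter == "b").collect()
```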


Connect to an Oracle database with cx_Oracle, using bind variables

Connecting to an Oracle database requires the Oracle client library, so some environment settings are needed to make it work, especially on a Unix server.

A typical shell script to set up the environment for Oracle DB to interface with Python:

export ORACLE_HOME=/[your path to oracle client]/oracli/
alias conda=/some path/ANACONDA/anaconda3/bin/python 

After setting up the Unix environment, let's look at an example of accessing an Oracle database:

import cx_Oracle
import pandas as pd

host = 'your hostname'
sid = 'your sid'
port = 1521
user = 'username'
password = 'password'

dsn_tns = cx_Oracle.makedsn(host, port, sid)
db = cx_Oracle.connect(user, password, dsn_tns)
cursor = db.cursor()
testQuery = "select count(*) from test_db"
cursor.execute(testQuery)
df = pd.DataFrame(cursor.fetchall())
df.columns = [col[0] for col in cursor.description]  # column names come from cursor.description
A lot of times you may need to run similar queries repeatedly, changing only a value in the WHERE clause. cx_Oracle has bind variables to make this easy:
con = cx_Oracle.connect(user, password, dsn_tns)
cur = con.cursor()
cur.prepare('select * from department where department_id = :id')
cur.execute(None, {'id': 210})  # passing None re-uses the prepared statement
res = cur.fetchall()
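The bind-variable pattern is standard DB-API behavior, so the same idea can be illustrated with the stdlib sqlite3 driver. A runnable sketch; the department table and its values are hypothetical:

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE department (department_id INTEGER, name TEXT)')
cur.executemany('INSERT INTO department VALUES (:id, :name)',
                [{'id': 210, 'name': 'HR'}, {'id': 220, 'name': 'IT'}])

# Re-run the same statement with different bind values
cur.execute('SELECT name FROM department WHERE department_id = :id', {'id': 210})
print(cur.fetchall())  # [('HR',)]
```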


R's paste function implemented in Python

The paste function is very handy when it comes to concatenating vectors/strings. A very common use case is building SQL queries in R:

vec <- letters[1:5]

paste("SELECT * FROM db WHERE col IN ('",paste(vec,collapse = "','"),"')",sep = "")

[1] "SELECT * FROM db WHERE col IN ('a','b','c','d','e')"

How do we create a similar function in Python using map and reduce?

import functools
def reduce_concat(x, sep=""):
    return functools.reduce(lambda x, y: str(x) + sep + str(y), x)

def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return reduce_concat(result, sep=collapse)
    return list(result)

print(paste([1,2,3], [11,12,13], sep=','))
print(paste([1,2,3], [11,12,13], sep=',', collapse=";"))

# ['1,11', '2,12', '3,13']
# '1,11;2,12;3,13'


Use multi-threading for hyper-parameter tuning in PySpark

Threads allow a program to run multiple operations concurrently in the same process space. Python has a threading library for this; here is a quick recap of how it is used:

import threading

def f(id):
    print("Thread function %s" % (id))

for i in range(3):
    t = threading.Thread(target=f, args=(i,))
    t.start()

Thread function 0
Thread function 1
Thread function 2

Now that we know how to invoke multi-threading in Python, what about PySpark for machine learning? Let's learn it with an example.

Say I want to find the best k for k-means clustering. I can use multi-threading to evaluate several candidate values of k in parallel.

from pyspark.mllib.clustering import KMeans
import numpy as np

def error(point, clusters):
    center = clusters.centers[clusters.predict(point)]
    return np.linalg.norm(point - center)

def calc_wssse(i):
    # c_points is the RDD of feature vectors to cluster
    clusters = KMeans.train(c_points, i, maxIterations=20,
                            initializationMode='random')
    wssse = point: error(point, clusters)) \
                    .reduce(lambda x, y: x + y)
    return (i, wssse)

Run it with a thread pool:

from multiprocessing.pool import ThreadPool

tpool = ThreadPool(processes=4)
wssse_points =, range(1, 10))


Set class weights for an unbalanced classification model in Keras

When training an unbalanced neural network in Keras, accepts a class_weight argument, but you need to compute the weights yourself. Actually, there is an automatic way to build the dictionary to pass to class_weight in

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# keyword arguments are required in recent scikit-learn versions
class_weight_list = compute_class_weight(class_weight='balanced',
                                         classes=np.unique(y_train_labels),
                                         y=y_train_labels)
class_weight = dict(zip(np.unique(y_train_labels), class_weight_list))
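Under the hood, 'balanced' weights each class by n_samples / (n_classes * class_count). A quick numpy check of that formula, with hypothetical labels:

```python
import numpy as np

# Hypothetical imbalanced labels: four 0s, one 1
y = np.array([0, 0, 0, 0, 1])
classes, counts = np.unique(y, return_counts=True)

# 'balanced' formula: n_samples / (n_classes * count per class)
weights = len(y) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)  # {0: 0.625, 1: 2.5}
```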


Using categorical features in Python

In machine learning we constantly run into features that are categorical, or that can be transformed into categorical features to make model building easier. If you know sklearn, there are tutorials covering LabelEncoder and OneHotEncoder, but there is still confusion about how to code this up in real scenarios.

Assume here we're only going to look at the object columns, which are typically categorical features.

1. First, get the column names with dtype 'object':

import pandas as pd

catColumns = df.select_dtypes(['object']).columns
2. Convert binary features with LabelEncoder; for N > 2 levels, use get_dummies:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for col in catColumns:
    n = len(df[col].unique())
    if n > 2:
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)  # drop one dummy level to avoid collinearity
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)  # drop the original categorical column (optional)
    else:
        df[col] = le.fit_transform(df[col])  # fit_transform, since le was not fitted yet
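Putting both steps together on a toy frame (hypothetical data; assumes pandas and sklearn are available):

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical data: 'sex' is binary, 'city' has three levels
df = pd.DataFrame({'sex': ['M', 'F', 'M'],
                   'city': ['NY', 'LA', 'SF'],
                   'age': [25, 32, 40]})

le = preprocessing.LabelEncoder()
for col in df.select_dtypes(['object']).columns:
    if df[col].nunique() > 2:
        X = pd.get_dummies(df[col])
        X = X.drop(X.columns[0], axis=1)  # drop one dummy level
        df[X.columns] = X
        df.drop(col, axis=1, inplace=True)
    else:
        df[col] = le.fit_transform(df[col])

print(list(df.columns))  # ['sex', 'age', 'NY', 'SF']
```

Note that 'city' is replaced by the 'NY' and 'SF' dummy columns ('LA' was dropped as the baseline level), while the binary 'sex' column is label-encoded in place.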