The pitfall of the eval function and its safe alternative in Python

Some of you may have used eval and exec in Python as a handy quick-and-dirty way to take dynamic source code as a string, munge it a little, and then execute it. The core functionality of eval is to evaluate a string as though it were an expression and return the result. In Python, the argument can also be a structured representation of code such as an abstract syntax tree (like Lisp forms). Its key usages are:

  • Evaluating a mathematical expression
  • Compiler bootstrapping
  • Scripting (dynamic code)
  • Language tutors

For example:


x = 1

eval('x + 1')  # returns 2

eval('x')  # returns 1

# The most general form for evaluating statements is using code objects
x = 1
y = 2
eval(compile("print('x+y=', x + y)", "compile-sample.py", "single"))
# prints
x+y= 3
 

However, there is a big safety flaw in eval: it evaluates whatever code is in the expression without considering whether it is safe or not (see a good article on this, "The danger of eval"). A classic example:


eval(input("Enter an expression: "))
# a malicious user can type: __import__('os').system('rm -rf /root/important_data')

The cure for this is to use ast.literal_eval, which raises an exception if the input isn't a valid Python literal, so unsafe code will never be executed.

The rule of thumb is to use ast.literal_eval whenever you need eval. But there are some notable differences: ast.literal_eval won’t work for bitwise operators. For example:


ast.literal_eval("1 & 1") # raise an error

eval("1 & 1") # will return 1

ast.literal_eval() only considers a small subset of Python’s syntax to be valid:

The string or node provided may only consist of the following Python literal structures: strings, numbers, tuples, lists, dicts, booleans, and None.
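As a quick illustration (a minimal sketch; the payload string is made up for the example), literal_eval happily parses plain data literals but refuses anything that would execute code:

import ast

ast.literal_eval("{'a': [1, 2, 3], 'b': (True, None)}")  # returns the dict, plain literals are fine

try:
    ast.literal_eval("__import__('os').system('echo pwned')")
except (ValueError, SyntaxError) as exc:
    print("rejected:", exc)  # the os.system call is never executed; literal_eval raises instead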


Setting up Python to connect to a Netezza database on a Unix/Linux server

Here are quick instructions for setting up a connection to a Netezza database from a typical Unix/Linux server:

  1. Set up the environment variables for JAVA_HOME, NZ_HOME, PATH, and LD_LIBRARY_PATH:
export JAVA_HOME=/somepath/jdk/jdk1.7.0-x64/jre/
export NZ_HOME=/somepath/nzcli/7.0.4
export LD_LIBRARY_PATH=.:/usr/lib:/usr/ucblib:${NZ_HOME}/lib
export PATH=/usr/bin:${NZ_HOME}/bin:${JAVA_HOME}/bin

  2. Sample code


dsn_database = "db_name"
dsn_hostname = "awarehouse-unz01"
dsn_port = "5480"
dsn_uid = "username"
dsn_pwd = "password"
jdbc_driver_name = "org.netezza.Driver"
jdbc_driver_loc = "/some path/nzcli/7.0.4/lib/nzjdbc3.jar"
connection_string='jdbc:netezza://'+dsn_hostname+':'+dsn_port+'/'+dsn_database
url = '{0}:user={1};password={2}'.format(connection_string, dsn_uid, dsn_pwd)
print("URL: " + url)
print("Connection String: " + connection_string)

conn = jaydebeapi.connect("org.netezza.Driver", connection_string, {'user': dsn_uid, 'password': dsn_pwd},
jars = "/export/apps/nzcli/7.0.4/lib/nzjdbc3.jar")
curs = conn.cursor()

testQuery = "select * from LN_SRVG_ACTV_CASH_EOD_SPST_VW2 limit 1"
curs.execute(testQuery)
result = curs.fetchall()
print("Total records: " + str(len(result)))
print(result[0])

for i in range(len(result)):
print(result[i])

curs.close()
conn.close()

Implementing R's paste function in Python

R's paste function is very handy when it comes to concatenating vectors/strings. A very common use case is building SQL queries in R:


vec <- letters[1:5]

paste("SELECT * FROM db WHERE col IN ('",paste(vec,collapse = "','"),"')",sep = "")

[1] "SELECT * FROM db WHERE col IN ('a','b','c','d','e')"

How do we create a similar function in Python using map and reduce?


import functools

def reduce_concat(x, sep=""):
    return functools.reduce(lambda a, b: str(a) + sep + str(b), x)

def paste(*lists, sep=" ", collapse=None):
    result = map(lambda x: reduce_concat(x, sep=sep), zip(*lists))
    if collapse is not None:
        return reduce_concat(result, sep=collapse)
    return list(result)

print(paste([1,2,3], [11,12,13], sep=','))
print(paste([1,2,3], [11,12,13], sep=',', collapse=";"))

# ['1,11', '2,12', '3,13']
# '1,11;2,12;3,13'
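With paste defined, the R SQL-query example from above can be reproduced almost verbatim (a quick sketch; db and col are the same placeholder names used in the R snippet):

vec = ['a', 'b', 'c', 'd', 'e']  # the Python counterpart of letters[1:5]
query = "SELECT * FROM db WHERE col IN ('" + paste(vec, collapse="','") + "')"
print(query)
# SELECT * FROM db WHERE col IN ('a','b','c','d','e')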

 

Changing Python notebook display to a custom format

Have you ever run df.describe() or some other data display and gotten a table in scientific notation like the following, which is not convenient at all?

df.groupby('department')['data1'].sum()

department
finance 1.192433e+08
auto 1.293066e+08
airline 1.077142e+08

How do we fix it?

One way is to set the display options to the following:

pd.set_option('display.float_format', lambda x: '%.3f' % x)

or

df.groupby('department')['data1'].sum().apply(lambda x: '%.3f' % x)
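Here is a small self-contained sketch of the set_option approach (the department names and numbers are made up, just to show the effect of display.float_format):

import pandas as pd

df = pd.DataFrame({
    'department': ['finance', 'auto', 'airline'],
    'data1': [1.192433e+08, 1.293066e+08, 1.077142e+08],
})

# Depending on the magnitude of the values, the default display may use
# scientific notation (as in the groupby output above)
print(df.groupby('department')['data1'].sum())

# Force fixed-point output with three decimals for every float pandas displays
pd.set_option('display.float_format', lambda x: '%.3f' % x)
print(df.groupby('department')['data1'].sum())  # e.g. 119243300.000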

 

Analyzing data from a Microsoft Access file

While working as a data scientist, sometimes you'll receive an old data file from some system in Microsoft Access format. The first instinct is "Oh, do I need to install Microsoft Access?" You're out of luck if you use Linux or Mac, and it is cumbersome to install a big piece of software that you won't use often. The good news is that we can use Python to do the task easily.

Microsoft only provides its ODBC drivers for Windows. If you're using Windows, one is probably already installed; the newest 'ACE' ODBC driver is included in the Office installation. To check whether you have the driver on your Windows system, run the following code:

import pyodbc

[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]

 

If you see an empty list then you are running 64-bit Python and you need to install the 64-bit version of the “ACE” driver. If you only see [‘Microsoft Access Driver (*.mdb)’] and you need to work with a .accdb file then you need to install the 32-bit version of the “ACE” driver.

To connect the file and start the query, here is the code:


import pandas as pd

conn_str = (r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
            r'DBQ=C:\path.accdb;')
connection = pyodbc.connect(conn_str)
cursor = connection.cursor()
# Run a SQL query (Access SQL uses TOP rather than LIMIT)
results = cursor.execute("select top 5 * from TableName;")
rows = results.fetchall()
# Turn it into a dataframe
labels = list('ABCDE')  # if it's a five-column table
pd.DataFrame.from_records(rows, columns=labels)  # you must provide column names

 

For Ubuntu users, you can install pyodbc and the MDB ODBC driver with the following commands:

$ sudo apt-get install python-pyodbc

$ sudo apt-get install mdbtools libmdbodbc1

On Mac, you can install pyodbc simply by: pip install pyodbc

Accessing the file is then the same as on Windows: make the connection, set up the cursor, and start executing SQL commands, as sketched below.
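A minimal sketch for Linux, assuming the mdbtools ODBC driver registered itself under the name 'MDBTools' in odbcinst.ini (the name varies, so check pyodbc.drivers() on your system) and using a made-up file path:

import pyodbc

# Driver name and path are assumptions: verify the name with pyodbc.drivers()
conn_str = r'DRIVER={MDBTools};DBQ=/home/user/data/legacy.mdb;'
connection = pyodbc.connect(conn_str)
cursor = connection.cursor()
for row in cursor.execute("select * from TableName;"):
    print(row)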

The rest will be on you for creating interesting results.

 

Converting a Unix timestamp to a readable date in Python and Java

We're living in a multi-dimensional world, with at least four dimensions. The most critical dimension is time, and it is recorded alongside virtually every dataset of digital observations. One of the most common but least readable ways to record time is the Unix timestamp. It shows you a date-time as something like '1513219212' (ISO 8601: 2017-12-14T02:40:12Z), and you might have no idea what it means.

First, what is a Unix timestamp? It is the time in seconds from January 1st, 1970 to the very moment you ask for the stamp itself. Simply put, the Unix timestamp is a way to track time as a running total of seconds. This count starts at the Unix Epoch, January 1st, 1970 at UTC. Therefore, the Unix timestamp is merely the number of seconds between a particular date and the Unix Epoch. The reason Unix timestamps are used by many webmasters is that they can represent all time zones at once. (Wikipedia)
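For example, you can get the current Unix timestamp in Python from time.time() (a trivial sketch; the value is whatever "now" is when you run it):

import time

now = int(time.time())  # whole seconds elapsed since 1970-01-01T00:00:00Z
print(now)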

All the programming languages have a library to handle this.

In python, you can do the following:

import datetime

datetime.datetime.fromtimestamp(int('1513219212')).strftime("%Y-%m-%d %H:%M:%S")

# It will print out '2017-12-13 21:40:12' (local time, US Eastern in this example)

The above method gives you a local-time-dependent result and is error prone; this method is better:

datetime.datetime.utcfromtimestamp(int('1513219212')).strftime('%Y-%m-%dT%H:%M:%SZ')

# This will print out '2017-12-14T02:40:12Z'

‘Z’ stands for Zulu time, which is also GMT and UTC.

You can also use ctime from the time module to get a human-readable string from a Unix timestamp.
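A one-line sketch (the exact output depends on your local time zone; US Eastern is assumed here to match the examples above):

import time

print(time.ctime(1513219212))  # e.g. 'Wed Dec 13 21:40:12 2017' in US Eastern local time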

Here is how to do it in Java:

[Screenshot: Timestamp to Date in Java code sample]

The output will be: The date is: Wed Dec 13 21:40:12 EST 2017.

It is important to note the difference here: Java expects milliseconds, so you'll need to multiply the seconds by 1000 and cast the value to long, otherwise it will overflow as an int.

If you want to set the timezone in Java, you can simply add this line:

TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

It will then return: The date is: Thu Dec 14 02:40:12 UTC 2017.

Buffered and unbuffered IO in Python

Sometimes you may wonder why the print statements in your Python script produce no output until the very end of the program, especially when the program is computationally heavy and takes a minute or longer to run. This is due to the way the system handles I/O. By default, I/O in programs is buffered, which means output is not delivered immediately byte by byte; instead it is served from temporary storage to the requesting program. Buffering improves I/O performance by reducing the total number of system calls. Standard output is buffered because it is assumed there will be far more data going through it. With buffering, a whole block is read into the buffer at once, and the individual bytes are then delivered to you from the (fast, in-memory) buffer area.

The counterpart of buffered output is unbuffered output, which is used when you want to ensure the output has been written immediately, without delay, before continuing. For example, standard error under a C runtime library is usually unbuffered by default. There are mainly two reasons: 1. errors are supposedly infrequent, and 2. you want to know about them immediately.

The following is a detailed explanation of when buffered vs. unbuffered output should be used:

You want unbuffered output when you already have a large sequence of bytes ready to write to disk and want to avoid an extra copy into a second buffer in the middle.

Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to short memcpy), the buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send — copying to an intermediate buffer will not reduce the number of OS calls further and introduces additional work.

Unbuffered output has nothing to do with ensuring your data reaches the disk; that functionality is provided by flush(), and works on both buffered and unbuffered streams. Unbuffered IO writes don’t guarantee the data has reached the physical disk — the OS file system is free to hold on to a copy of your data indefinitely, never writing it to disk, if it wants. It is only required to commit it to disk when you invoke flush(). (Note that close() will call flush() on your behalf).  — Quote from stackoverflow community wiki.

Here is an example of buffered output:

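A minimal sketch that reproduces the effect (assuming the script's stdout is piped or redirected, e.g. python demo.py | cat, so it is block-buffered rather than line-buffered):

import time

for i in range(5):
    print("working on step", i)  # sits in the stdout buffer; nothing shows up yet
    time.sleep(1)
print("done")  # all the lines typically appear together when the buffer is flushed at exit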

Now we get an idea of how buffered IO works. How do we force Python’s print function to output to the screen?

If you’re using Python 3.3+, it has added a flush option. By setting flush = True, the stream is forcibly flushed immediately.

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

Another general way is to use sys.stdout.flush().

import sys

print("This will be output immediately")

sys.stdout.flush()

When running from the command line, add the -u flag:

python -u mypython.py

You can also use an unbuffered (or line-buffered) file object:

f = open('file.txt', 'a', 1)  # buffering: 1 = line buffered, >1 = buffer size; 0 (fully unbuffered) is only allowed in binary mode in Python 3

# or

sys.stdout = open('file.txt', 'a', 1)

You can also change the default through an environment variable in your shell.

In Linux or OSX:

$ export PYTHONUNBUFFERED=TRUE

or Windows:

C:\> SET PYTHONUNBUFFERED=TRUE