Analyzing data from a Microsoft Access file

While working as a data scientist, sometimes you'll receive an old data file from a system in Microsoft Access format. The first instinct is: "Oh, do I need to install Microsoft Access?" You're out of luck if you use Linux or a Mac, and it is cumbersome to install a big piece of software that you won't use often. The good news is that we can use Python to do the task easily.

Microsoft only creates the ODBC driver for Windows. If you're using Windows, it has probably already been installed on the system; the newest 'ACE' ODBC driver is included in the Office installation. To check whether you have the driver on your Windows system, run the following code:

import pyodbc

[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]

If you see an empty list, then you are running 64-bit Python and need to install the 64-bit version of the "ACE" driver. If you only see ['Microsoft Access Driver (*.mdb)'] and you need to work with a .accdb file, then you need to install the 32-bit version of the "ACE" driver.

To connect to the file and start querying, here is the code:

conn_str = (
    r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
    r'DBQ=C:\path\to\mydb.accdb;'
)
connection = pyodbc.connect(conn_str)
cursor = connection.cursor()
# Run a SQL query (Access SQL uses TOP rather than LIMIT)
results = cursor.execute('SELECT TOP 5 * FROM TableName;')
rows = results.fetchall()
# Turn it into a dataframe
import pandas as pd
labels = list('ABCDE')  # if it's a five-column table
pd.DataFrame.from_records(rows, columns=labels)  # you must provide column names

For Ubuntu users, you can install pyodbc and the Access ODBC driver with the following commands:

$ sudo apt-get install python-pyodbc

$ sudo apt-get install mdbtools libmdbodbc1

On a Mac, you can install pyodbc simply by: pip install pyodbc

Accessing the file is the same as on Windows: make the connection, set up the cursor to the file, then start executing the SQL commands.
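For example, here is a minimal sketch; the driver name and the database path are assumptions, and it presumes the MDBTools driver has been registered (e.g. as 'MDBTools' in /etc/odbcinst.ini):

import pyodbc

conn_str = r'DRIVER={MDBTools};DBQ=/path/to/mydb.mdb;'  # assumed driver name and path
connection = pyodbc.connect(conn_str)
cursor = connection.cursor()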

The rest is up to you for creating interesting results.

 


Converting a Unix timestamp to a readable date in Python and Java

We live in a world of multiple dimensions, at least four. The most critical dimension is time, and it is recorded alongside almost every dataset of digital observations. One of the most common but unreadable ways to record time is the Unix timestamp. It shows a datetime as something like '1513219212' (ISO 8601: 2017-12-14T02:40:12Z), and you might have no idea what it means.

First, what is a Unix timestamp? Simply put, it is a way to track time as a running total of seconds. This count starts at the Unix Epoch, January 1st, 1970 at UTC, so a Unix timestamp is merely the number of seconds between a particular date and the Unix Epoch. One reason Unix timestamps are used by so many webmasters is that they can represent all time zones at once. (Wikipedia)

Almost every programming language has a library to handle this.

In Python, you can do the following:

import datetime

datetime.datetime.fromtimestamp(int('1513219212')).strftime('%Y-%m-%d %H:%M:%S')

# It will print out '2017-12-13 21:40:12' (local time)

Since the above method gives you locale-dependent time and is error-prone, this method is better:

datetime.datetime.utcfromtimestamp(int('1513219212')).strftime('%Y-%m-%dT%H:%M:%SZ')

# This will print out '2017-12-14T02:40:12Z'

‘Z’ stands for Zulu time, which is also GMT and UTC.

You can also use ctime from the time module to get a human-readable timestamp from a Unix timestamp.
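For instance (note that the output depends on your local time zone):

import time

time.ctime(1513219212)  # e.g. 'Wed Dec 13 21:40:12 2017' on a machine set to US Eastern time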

Here is how to do it in Java:

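A minimal sketch (the class name is illustrative):

import java.util.Date;

public class TimestampToDate {
    public static void main(String[] args) {
        int unixTimestamp = 1513219212;
        // Java expects milliseconds, so widen to long before multiplying.
        Date date = new Date((long) unixTimestamp * 1000);
        System.out.println("The date is: " + date);
    }
}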

The output will be: The date is: Wed Dec 13 21:40:12 EST 2017.

It is important to note the difference here: Java expects milliseconds, and you'll need to cast the value to long, otherwise it will overflow an int.

If you want to set the timezone in Java, you can simply add this line:

TimeZone.setDefault(TimeZone.getTimeZone("UTC"));

It will return: The date is: Thu Dec 14 02:40:12 UTC 2017.

Buffered and unbuffered IO in Python

Sometimes you may wonder why the print statements in your Python script don't show anything until the end of the program, especially when you have a fairly heavy computational program that takes a minute or longer to run. This is due to the way the system handles I/O. By default, I/O in programs is buffered, which means the I/O is not served immediately byte by byte, but rather from a temporary storage to the requesting program. Buffering improves I/O performance by reducing the total number of calls. Standard output is buffered because it is assumed there will be far more data going through it. By buffering, a whole block is read into the buffer at once, and then the individual bytes are delivered to you from the (fast, in-memory) buffer area.

The counterpart of buffered output is unbuffered output, which is used when you want to ensure the output has been written immediately, without delay, before continuing. For example, standard error under a C runtime library is usually unbuffered by default. There are mainly two reasons: errors are supposedly infrequent, and you want to know about them immediately.

The following is a detailed explanation of when buffered vs. unbuffered output should be used:

You want unbuffered output when you already have a large sequence of bytes ready to write to disk and want to avoid an extra copy into a second buffer in the middle.

Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to a short memcpy), buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send; copying to an intermediate buffer will not reduce the number of OS calls further and introduces additional work.

Unbuffered output has nothing to do with ensuring your data reaches the disk; that functionality is provided by flush(), and works on both buffered and unbuffered streams. Unbuffered IO writes don’t guarantee the data has reached the physical disk — the OS file system is free to hold on to a copy of your data indefinitely, never writing it to disk, if it wants. It is only required to commit it to disk when you invoke flush(). (Note that close() will call flush() on your behalf).  — Quote from stackoverflow community wiki.

Here is an example of buffered output:

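Here is a minimal sketch that reproduces the effect in a terminal:

import time

# Without a newline or an explicit flush, the digits typically sit in the
# buffer and appear all at once when the program finishes.
for i in range(5):
    print(i, end=' ')
    time.sleep(1)
print()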

Now we get an idea of how buffered IO works. How do we force Python’s print function to output to the screen?

If you're using Python 3.3+, the print function has a flush option. By setting flush=True, the stream is forcibly flushed immediately.

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
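For example, in a long-running loop:

for i in range(5):
    print('step', i, flush=True)  # each line appears immediately, even when piped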

Another general way is to use sys.stdout.flush().

import sys

print('This will be output immediately')

sys.stdout.flush()

When running from the command line, add -u:

python -u mypython.py

You can also use an unbuffered file:

f = open('file.txt', 'ab', 0)  # 0 = unbuffered (binary mode only in Python 3), 1 = line-buffered, larger values set the buffer size

#or

sys.stdout = open('file.txt', 'a', buffering=1)  # a line-buffered text stream

You can also change the default for the shell operating environment

in Linux or OSX:

$ export PYTHONUNBUFFERED=TRUE

or Windows:

C:\> SET PYTHONUNBUFFERED=TRUE

 

Display all the data columns in Jupyter Notebook

During data exploration, there are often too many columns in the dataframe. By default, Jupyter Notebook only displays a handful of them for simplicity.

Here are a couple of ways to display all the columns:

import pandas as pd

from IPython.display import display

data = pd.read_csv('mydata.csv')

# Directly set the options

pd.options.display.max_columns = None

display(data)

Or, use the set_option method from pandas.

pd.set_option('display.max_columns', None)

display(data)

To change the setting locally for only a specific cell, do the following:

with pd.option_context('display.max_columns', None):
    display(data)

You can also do:

from IPython.display import HTML

HTML(data.to_html())

Converting a tabular financial data table into vectorized sequential data

A lot of times, we receive time-related data in a table format that we want to convert into a simple format with one column of datetimes and one column of values. Consider a sample table of monthly values, with one row per year and one column per month.

Now we want to convert this dataset into another format that is easier to visualize and to convert to other data structures, like xts or timeSeries objects. The converted data will have a single dates column alongside a single value column.

Let's look at a sample unemployment rate dataset from the Department of Labor.

sampleData <- read.csv('table_date.csv')
sampleData


In R, there are two common ways to do it. The first:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData), 2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'), as.Date('2017-12-01'), 'month')
newTS <- data.frame(dates=dates, value=tableDataFlat)
head(newTS)


The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData), 2:ncol(sampleData)])))
newTS <- data.frame(dates=dates, value=tableDataFlat)
head(newTS)

Now we can do visualization and analysis more conveniently.

plot(newTS)


Method in Python:

In Python, it is even simpler. Flatten the data matrix by using:

import numpy as np
import pandas as pd
df = pd.read_csv('table_date.csv', index_col=0)  # assuming the first column holds the year labels
data = df.values
data_flat = data.flatten()
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')  # 'MS' = month start
new_df = pd.DataFrame({'date': dates, 'value': data_flat})
new_df = new_df.set_index('date')

Python traps you should know

Like every language, Python has some easy-to-overlook traps. Some of them are hidden and can cause big problems or errors for your program. Here are some of the most common traps a good Python programmer should be aware of:

  • 1. A mutable object used as a default parameter

Like many other languages, Python provides default parameters for functions, which are great for making things easier. However, things can become unpleasant if you put a mutable object into the function as the default value for a parameter. Let's look at an example:

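A minimal sketch of the classic trap (the function name is illustrative):

def append_to(element, to=[]):
    to.append(element)
    return to

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2], not [2]: the same default list is reused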

A surprise?! The root cause is that everything is an object in Python; even a function is an object, and a default parameter is just an attribute of the function. Default parameter values are evaluated when the function definition is executed.

Another more obvious example:

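A sketch with a timestamp default:

import time

def report(when=time.time()):
    return when

print(report())  # some timestamp
time.sleep(2)
print(report())  # exactly the same value: the default was computed once, at definition time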

How to fix it?

According to the Python documentation, a way around this is to use None as the default, and to explicitly test for it in the body of the function.

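A sketch of the fix:

def append_to(element, to=None):
    if to is None:  # create a fresh list on every call
        to = []
    to.append(element)
    return to

print(append_to(1))  # [1]
print(append_to(2))  # [2]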

  • 2. x += y vs. x = x + y

Generally speaking, the two look equivalent, but they behave differently for mutable objects. Let's look at an example:

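A sketch in the interpreter (the id values shown are illustrative):

x = [1, 2]
id(x)        # e.g. 140348731135752
x = x + [3]
id(x)        # a different id: a new list object was created
x += [4]
id(x)        # the same id as the previous call: the list was modified in place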

As we can see, x = x + y rebinds x to a brand-new object (the id changes), while x += y modifies the value at the current location (the id stays the same).

  • 3. Magic parentheses ()

In Python, () can represent a tuple, a data structure which is immutable. For example:

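t = (1, 2, 3)  # the variable name is illustrative
type(t)        # <class 'tuple'>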

What if you only have one element in the tuple?

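t = (1)
type(t)  # <class 'int'>: the parentheses are just grouping here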

Magic: it becomes an integer instead of a tuple. The right thing to do is this:

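t = (1,)
type(t)  # <class 'tuple'>: the trailing comma makes it a tuple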

  • 4. Generating a list of lists

This is like a 2-D array, or generating a list with mutable elements in it. It sounds very easy:

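board = [[0] * 3] * 3    # looks like a 3 x 3 grid (the variable name is illustrative)
board[0][0] = 10
print(board)             # [[10, 0, 0], [10, 0, 0], [10, 0, 0]]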

By assigning the value 10 to the first element of the first row, we populated all the rows with the same value. Interesting, hmmm? That's not what I want!!

The reason is still the same: the mutable objects within the list all point to the same object. The right syntax is:
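board = [[0] * 3 for _ in range(3)]   # each row is a distinct list
board[0][0] = 10
print(board)                          # [[10, 0, 0], [0, 0, 0], [0, 0, 0]]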

As seen above, there are many traps in Python, and you should definitely be aware of them.

 

The convenience of subplots=True in dataframe.plot

When it comes to data analysis, there is always a saying: "a picture is worth a thousand words." Visualization is an essential and effective way of data exploration, and it is usually our first step in understanding raw data. In Python, there are a lot of visualization libraries. A pandas dataframe has plenty of built-in plotting methods: line, bar, barh, hist, box, kde, density, area, pie, scatter and hexbin.

The quickest way to visualize all the column data in a dataframe is to simply call df.plot(). For example:

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(1, 10), 'B': 2 * np.arange(1, 10)})
df.plot(title='plot all columns in one chart')


But a lot of times we want each feature plotted on a separate chart due to the complexity of the data. It helps us disentangle the dataset.

It turns out that there is a simple trick to play with in df.plot: the 'subplots=True' argument.

df.plot(figsize=(8, 4), subplots=True, layout=(2, 1), title='plot all columns in separate charts');


That's it. Simple but effective. You can change the layout by playing with the layout tuple input.

Hope you find it helpful too.