Buffered and unbuffered IO in Python

Sometimes you may wonder why the print statements in your Python script produce no output until the end of the program, especially when the program is computationally heavy and can take a minute or longer to run. This is due to the way the system handles I/O. By default, I/O in programs is buffered, which means the I/O is not served immediately byte by byte, but rather from a temporary storage area to the requesting program. Buffering improves I/O performance by reducing the total number of calls. Standard output is buffered because it is assumed there will be far more data going through it. By buffering, a whole block is read into the buffer at once, and then the individual bytes are delivered to you from the (fast, in-memory) buffer area.

The counterpart of buffered output is unbuffered output, which is used when you want to ensure the output has been written immediately, without delay, before the program continues. For example, standard error under a C runtime library is usually unbuffered by default. There are mainly two reasons: errors are supposedly infrequent, and you want to see them immediately.

The following is a detailed explanation of when buffered vs. unbuffered output should be used:

You want unbuffered output when you already have a large sequence of bytes ready to write to disk and want to avoid an extra copy into a second buffer in the middle.

Buffered output streams accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to a short memcpy), buffered output is a net win when performing a large number of small writes. Unbuffered output is generally better when you already have large buffers to send: copying into an intermediate buffer will not reduce the number of OS calls further and introduces additional work.

Unbuffered output has nothing to do with ensuring your data reaches the disk; that functionality is provided by flush(), and works on both buffered and unbuffered streams. "Unbuffered IO writes don't guarantee the data has reached the physical disk; the OS file system is free to hold on to a copy of your data indefinitely, never writing it to disk, if it wants. It is only required to commit it to disk when you invoke flush(). (Note that close() will call flush() on your behalf.)" (Quoted from a Stack Overflow community wiki.)

Here is an example of buffered output:

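Something like this minimal sketch: when stdout is connected to a pipe or redirected to a file, it is block-buffered, so the lines below typically appear all at once when the program exits, not one per second.

import time

for i in range(5):
    print('step', i)
    time.sleep(1)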

Now we get an idea of how buffered IO works. How do we force Python’s print function to output to the screen?

If you're using Python 3.3+, the print function has a flush option. By setting flush=True, the stream is forcibly flushed immediately.

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
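For example, a minimal sketch:

print('working...', flush=True)   # pushed out of the buffer right away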

Another general way is to use sys.stdout.flush().

import sys

print("This will be output immediately")

sys.stdout.flush()

When running from the command line, add the -u flag:

python -u mypython.py

You can also use an unbuffered file:

f = open('file.txt', 'a', 0)   # 0 = unbuffered, 1 = line-buffered, larger values set the buffer size

# or

sys.stdout = open('file.txt', 'a', 0)

(Note that in Python 3, buffering=0 is only allowed in binary mode, e.g. open('file.txt', 'ab', 0); text-mode files can at most be line-buffered.)

You can also change the default in your shell environment,

in Linux or macOS:

$ export PYTHONUNBUFFERED=TRUE

or Windows:

C:\> SET PYTHONUNBUFFERED=TRUE

 


Display all the data columns in Jupyter Notebook

During data exploration, there are often too many columns in the dataframe to see at once. By default, Jupyter Notebook only displays a handful of them for simplicity.

Here are a couple of ways to display all the columns:

import pandas as pd

from IPython.display import display

data = pd.read_csv('mydata.csv')

# Directly set the options

pd.options.display.max_columns = None

display(data)

Or, use the set_option method from pandas.

pd.set_option('display.max_columns', None)

display(data)

To change the setting locally for only a specific cell, do the following:

with pd.option_context('display.max_columns', None):
    display(data)

You can also do:

from IPython.display import HTML

HTML(data.to_html())

Automate converting a tabular financial data table into vectorized sequential data

A lot of times, we receive time-related data in a table format, and we want to convert it into a simple format with one datetime column and one value column. See this sample table, which has one row per year and one column per month.

Now we want to convert this dataset into another format which is easier to visualize and to convert to other data structures like xts or timeSeries objects. The converted data will have one dates column and one value column, as shown by head(newTS) below.

Let's look at a sample unemployment rate series from the Department of Labor.

sampleData <- read.csv('table_date.csv')
sampleData


In R, there are two common ways to do it. First:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData), 2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'), as.Date('2017-12-01'), 'month')
newTS <- data.frame(dates = dates, value = tableDataFlat)
head(newTS)


The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData), 2:ncol(sampleData)])))
newTS <- data.frame(dates = dates, value = tableDataFlat)
head(newTS)

Now we can do visualization and analysis more conveniently.

plot(newTS)


Method in Python:

In Python, it is even simpler. Flatten the data matrix by using:

import pandas as pd

df = pd.read_csv('table_date.csv')
data = df.iloc[:, 1:].values   # drop the first (year) column, keep the monthly values
data_flat = data.flatten()     # flattens row by row: year by year, month by month
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')   # month-start dates
new_df = pd.DataFrame({'date': dates, 'value': data_flat})
new_df = new_df.set_index('date')

Python traps you should know

Like every language, Python has some easy-to-overlook traps that can bite you when writing programs. Some of these traps are hidden and can cause big problems or errors. Here are some of the most common traps a good Python programmer should be aware of:

    •  1. A mutable object used as a default parameter

Like many other languages, Python provides default parameters for functions, which are great for making things easier. However, things can become unpleasant if you put a mutable object in the function as the default value for a parameter. Let's look at an example:

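Something like this minimal sketch (append_item is just an illustrative name):

def append_item(value, items=[]):   # the default list is created once, at definition time
    items.append(value)
    return items

print(append_item(1))   # [1]
print(append_item(2))   # [1, 2], the same list is reused across calls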

A surprise?! The root cause is that everything is an object in Python; even a function is an object, and a default parameter value is just an attribute of that function. Default parameter values are evaluated only once, when the function definition is executed.

Another more obvious example:

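For instance, a sketch with a timestamp as the default (log is an illustrative name):

import time

def log(message, when=time.time()):   # when is evaluated once, when def runs
    print(when, message)

log('first call')
time.sleep(1)
log('second call')   # prints the same timestamp as the first call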

How to fix it?

According to the Python documentation: "A way around this is to use None as the default, and explicitly test for it in the body of the function."

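Applying that fix to the sketch above:

def append_item(value, items=None):
    if items is None:   # explicitly test for None and create a fresh list each call
        items = []
    items.append(value)
    return items

print(append_item(1))   # [1]
print(append_item(2))   # [2]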

  • 2. x += y vs. x = x + y

Generally speaking, the two look equivalent, but for mutable objects such as lists they behave differently. Let's look at an example:

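A minimal sketch with lists:

x = [1, 2]
print(id(x))
x += [3]         # in-place: the same list object is modified
print(id(x))     # same id as before

y = [1, 2]
print(id(y))
y = y + [3]      # a brand-new list object is created and bound to y
print(id(y))     # a different id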

As we can see, += keeps the same id: x is modified in place at its current location. With y = y + [3], the name is rebound to a brand-new object with a different id.

  • 3. Magic parentheses ()

In Python, () can represent a tuple, a data structure which is immutable.

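A quick check:

t = (1, 2, 3)
print(type(t))   # <class 'tuple'>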

What if you only have one element in the tuple?

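Try it:

t = (1)
print(type(t))   # <class 'int'>, the parentheses are just grouping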

Magic: it becomes an integer instead of a tuple. The right thing to do is this:

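A trailing comma is what makes the tuple:

t = (1,)
print(type(t))   # <class 'tuple'>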

  • 4. A generated list of lists

This is like a 2-D array: a list with mutable elements in it. It sounds very easy:

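A minimal sketch of the trap:

grid = [[0] * 3] * 3   # three references to the same inner list
grid[0][0] = 10
print(grid)            # [[10, 0, 0], [10, 0, 0], [10, 0, 0]]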

By assigning the value 10 to the first element of the first inner list, we changed the first element of every inner list. Interesting, hmmm? That's not what I want!

The reason is still the same: the mutable objects within the list all point to the same object. The right syntax is:
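A list comprehension creates a fresh inner list on each iteration:

grid = [[0] * 3 for _ in range(3)]
grid[0][0] = 10
print(grid)   # [[10, 0, 0], [0, 0, 0], [0, 0, 0]]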

As seen above, there are many traps when using Python, and you should definitely be aware of them.

 

The convenience of subplots = True in dataframe.plot

When it comes to data analysis, there is a saying: "a picture is worth a thousand words." Visualization is an essential and effective way of exploring data, and usually our first step in understanding the raw data. In Python, there are a lot of visualization libraries. A pandas dataframe has plenty of built-in plotting methods: line, bar, barh, hist, box, kde, density, area, pie, scatter and hexbin.

The quickest way to visualize all the columns of data in a dataframe is to simply call df.plot(). For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1, 10), 'B': 2 * np.arange(1, 10)})
df.plot(title='plot all columns in one chart')


But a lot of times we want each feature plotted on a separate chart due to the complexity of the data. It will help us disentangle the dataset.

It turns out that there is a simple trick to play with in df.plot: using subplots=True.

df.plot(figsize=(8, 4), subplots=True, layout=(2, 1), title='plot all columns in separate charts');


That's it. Simple but effective. You can change the arrangement by playing with the layout tuple input.

Hope you find it helpful too.

Access Amazon Redshift Database from Python

Amazon has definitely made significant gains from the cloud movement in the past decade, as more and more companies are ditching their own data servers in favor of Amazon's. There are very good reasons to do that: cheaper, faster, and easily accessible from anywhere.

Now, how do we retrieve data from Redshift and do data analysis in Python? It is very simple. The information you'll need ahead of time: username, password, the Redshift URL, and the port number (the default is 5439).

I'll show you how to connect to Amazon Redshift using the psycopg2 library. First, install it with: pip install psycopg2.

Then use the following Python code to define your connections.

def create_conn(*args, **kwargs):
    import psycopg2
    config = kwargs['config']
    try:
        con = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                               port=config['port'], user=config['user'],
                               password=config['pwd'])
        return con
    except Exception as err:
        print(err)

from getpass import getpass
keyword = getpass('password')   # type in the password, or read it from a saved json file

config = {'dbname': 'lake_one',
          'user': 'username',
          'pwd': keyword,
          'host': '[your host url].redshift.amazonaws.com',
          'port': 5439}

How to use this and do fun stuff:

import pandas as pd

con = create_conn(config=config)
data = pd.read_sql("select * from mydatabase.tablename;", con, index_col='date')
print(data.columns)
data.plot(title='My data', grid=True, figsize=(15, 5))

con.close()   # close the connection

Simple as that. Now it’s your turn to create fantastic analysis.

Can’t start Jupyter Notebook in macOS Sierra 10.12.5

Many people have experienced an annoying issue after updating to macOS Sierra 10.12.5 while trying to fire up a Jupyter notebook. There are two sequential fixes for the issue, depending on your Mac environment.

The first easy fix is to copy and paste http://localhost:8888/tree directly into your browser. If it doesn't work in Chrome, try Safari.

The more annoying issue I found is that sometimes the first fix simply won't do it, and the notebook prompts you for a password after you paste the link into the browser. This can be fixed by changing the Jupyter notebook server settings. Simply follow these steps:

  1. Open Mac Terminal
  2. jupyter notebook --generate-config

    This command generates a config file under

    /Users/[username]/.jupyter/

  3. jupyter notebook password

Type in your password twice and it will save the hashed password into /Users/[username]/.jupyter/jupyter_notebook_config.json

After this setup, fire up your notebook, type in your previous password (not the hashed value), and save it. It will not ask for the password anymore.

Some helpful commands to use:

jupyter --version

jupyter notebook list   # list all running notebook sessions

If you still have problems, combine the two steps together and it should work.