Display all the data columns in Jupyter Notebook

During data exploration, there are often too many columns in the dataframe. By default, Jupyter Notebook displays only a handful of them for simplicity.

Here are a couple of ways to display all the columns:

import pandas as pd

from IPython.display import display

data = pd.read_csv('mydata.csv')

# Directly set the option

pd.options.display.max_columns = None

display(data)

Or, use the set_option method from pandas:

pd.set_option('display.max_columns', None)

display(data)

To change the setting locally, for only a specific cell, do the following:

with pd.option_context('display.max_columns', None):
    display(data)

You can also do:

from IPython.display import HTML

HTML(data.to_html())


Converting a tabular financial data table into vectorized sequential data

A lot of times, we receive time-related data in a table format, and we want to convert it into a simple format with one column of datetime values and another of values.

Now we want to convert this dataset into another format that is easier to visualize and to convert to other data structures such as xts or timeSeries objects: one column of dates and one column of values.

Let’s look at a sample unemployment-rate table from the Department of Labor.

sampleData <- read.csv('table_date.csv')
sampleData


Method in R: there are two common ways to do it. First:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData),2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'),as.Date('2017-12-01'),'month')
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)


The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData),2:ncol(sampleData)])))
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)

Now we can do visualization and analysis more conveniently.

plot(newTS)


Method in Python:

In Python, it is even simpler. Flatten the data matrix by using:

import pandas as pd
df = pd.read_csv('table_date.csv')
data = df.iloc[:, 1:].values      # drop the label column, keep only the monthly values
data_flat = data.flatten()
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')  # 'MS' = month start
new_df = pd.DataFrame({'date': dates, 'value': data_flat})
new_df = new_df.set_index('date')
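Since the Department of Labor file isn't included here, a self-contained sketch with made-up numbers (the table shape, column names, and values are my assumptions) shows the same flatten-and-reindex idea:

```python
import pandas as pd

# Hypothetical wide table: one row per year, one column per month
df = pd.DataFrame({'year': [2005],
                   'Jan': [5.3], 'Feb': [5.4], 'Mar': [5.2]})

# Drop the label column, then flatten row by row
data_flat = df.drop(columns='year').values.flatten()

# One month-start date per value ('MS' = month-start frequency)
dates = pd.date_range(start='2005-01-01', periods=len(data_flat), freq='MS')

new_df = pd.DataFrame({'date': dates, 'value': data_flat}).set_index('date')
```

The resulting frame has a datetime index and a single value column, ready for plotting or resampling.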

Python traps you should know

Like every language, Python has some easy-to-overlook traps. Some of them are hidden and can cause big problems or errors in your program. Here are some of the most common traps a good Python programmer should be aware of:

  • 1. A mutable object used as a default parameter

Like many other languages, Python provides default parameter values for functions, which are great for making things easier. However, things can become unpleasant if you use a mutable object as the default value for a parameter. Let’s look at an example:

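A minimal sketch of the trap (the function and variable names here are my own):

```python
def append_item(item, bucket=[]):   # the default list is created ONCE, at definition time
    bucket.append(item)
    return bucket

first = append_item(1)
second = append_item(2)    # silently reuses the same list as the first call
print(first, second)       # [1, 2] [1, 2]
```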

A surprise?! The root cause is that everything in Python is an object: a function is an object, and its default parameter values are attributes of that function object. Default parameter values are evaluated once, when the function definition is executed, not on every call.


How to fix it?

According to the Python documentation, a way around this is to use None as the default, and explicitly test for it in the body of the function.

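A minimal sketch of that fix (function and variable names are my own):

```python
def append_item(item, bucket=None):
    if bucket is None:     # explicitly test for None in the body
        bucket = []        # a fresh list is created on every call
    bucket.append(item)
    return bucket

print(append_item(1), append_item(2))   # [1] [2]
```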

  • 2. x += y vs x = x + y

Generally speaking, the two are equivalent for immutable types such as numbers, but not for mutable ones. Let’s look at an example:

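A small demonstration with a list (the variable names are mine):

```python
x = [1, 2]
y = [3]
before = id(x)

x += y                 # in-place: list.__iadd__ extends the same list
same_after_iadd = (id(x) == before)

x = x + y              # builds a brand-new list and rebinds x
same_after_add = (id(x) == before)

print(same_after_iadd, same_after_add)   # True False
```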

As we can see, when x is a mutable object such as a list, x += y keeps the same id: the object is modified in place. With x = x + y, a new object is created and x is rebound to it.

  • 3. Magic parentheses ()

In Python, () can represent a tuple data structure, which is immutable.

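For instance (the variable name is mine):

```python
point = (3, 4)
print(type(point))    # <class 'tuple'>
# Tuples are immutable: point[0] = 5 would raise a TypeError
```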

What if you only have one element in the tuple:

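For example:

```python
single = (1)          # the parentheses here are just grouping
print(type(single))   # <class 'int'>
```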

Magic: it becomes an integer instead of a tuple. The right thing to do is this:

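For example:

```python
single = (1,)         # the trailing comma is what makes the tuple
print(type(single))   # <class 'tuple'>
```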

  • 4. Generating a list of lists

This is like building a 2-D array, or generating a list with mutable elements in it. Sounds very easy:

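A minimal sketch of the trap (the variable name is my own):

```python
grid = [[]] * 3       # three references to the SAME inner list
grid[0].append(10)
print(grid)           # [[10], [10], [10]]
```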

By appending the value 10 to the first element of the list, we populated all the elements with the same value. Interesting, hmmm? That’s not what I want!!

The reason is still the same: the mutable objects inside the list all point to the same object. The right syntax is:
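A sketch of that fix (variable name is mine):

```python
grid = [[] for _ in range(3)]   # a fresh inner list on each iteration
grid[0].append(10)
print(grid)                     # [[10], [], []]
```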

As seen above, there are many traps in Python, and you should definitely be aware of them.

 

The convenience of subplots=True in DataFrame.plot

When it comes to data analysis, there is a saying: “one picture is worth a thousand words.” Visualization is an essential and effective way of exploring data, and usually our first step in understanding raw data. Python has many visualization libraries, and a pandas DataFrame comes with plenty of built-in plotting methods: line, bar, barh, hist, box, kde, density, area, pie, scatter and hexbin.

The quickest way to visualize all the columns of a dataframe is simply to call df.plot(). For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': np.arange(1,10), 'B': 2*np.arange(1,10)})
df.plot(title='plot all columns in one chart')

[dfplot1.png: both columns plotted on a single chart]

But a lot of times we want each feature plotted on a separate chart because of the complexity of the data; it helps us disentangle the dataset.

It turns out that there is a simple trick to play with in df.plot: the subplots=True argument.

df.plot(figsize=(8,4), subplots=True, layout=(2,1), title='plot all columns in separate charts');

[dfplot2.png: each column in its own subplot]

That’s it. Simple but effective. You can change the arrangement by playing with the layout tuple.

Hope you find it helpful too.

Access Amazon Redshift Database from Python

Amazon has definitely made significant gains from the cloud movement over the past decade, as more and more companies ditch their own data servers in favor of Amazon’s. There are very good reasons to do so: it’s cheaper, faster, and easily accessible from anywhere.

Now, how do we retrieve data from Redshift and do data analysis in Python? It is very simple. The information you’ll need ahead of time: username, password, the Redshift URL, and the port number (default is 5439).

I’ll show you how to connect to Amazon Redshift using the psycopg2 library. First install it with: pip install psycopg2.

Then use the following Python code to define your connections.

def create_conn(*args, **kwargs):
    import psycopg2
    config = kwargs['config']
    try:
        con = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                               port=config['port'], user=config['user'],
                               password=config['pwd'])
        return con
    except Exception as err:
        print(err)

from getpass import getpass
keyword = getpass('password')   # type in the password, or read it from a saved json file

config = {'dbname': 'lake_one',
          'user': 'username',
          'pwd': keyword,       # the variable above, not a literal string
          'host': '[your host url].redshift.amazonaws.com',
          'port': 5439}

How to use this and do fun stuff:

import pandas as pd

con = create_conn(config=config)
data = pd.read_sql("select * from mydatabase.tablename;", con, index_col='date')
print(data.columns)
data.plot(title='My data', grid=True, figsize=(15,5))

con.close()  # close the connection

Simple as that. Now it’s your turn to create fantastic analysis.

Can’t start Jupyter Notebook in macOS Sierra 10.12.5

Many people have experienced an annoying issue when trying to fire up a Jupyter notebook after updating to macOS Sierra 10.12.5. There are two sequential fixes, depending on your Mac environment.

The first easy fix is to copy and paste http://localhost:8888/tree directly into your browser. If it doesn’t work in Chrome, use Safari.

The more annoying case I found is when the first fix simply won’t do it, and the notebook prompts you for a password after you paste the link into the browser. This can be fixed by changing the Jupyter notebook server settings. Simply follow these steps:

  1. Open Mac Terminal
  2. jupyter notebook --generate-config

    This command generates a config file under

    /Users/[username]/.jupyter/

  3. jupyter notebook password

Type in your password twice and it will save the hashed password into /Users/[username]/.jupyter/jupyter_notebook_config.json

After this, fire up your notebook, type in your previous password (not the hashed value), and save it. It will not ask for the password anymore.

Some helpful commands to use:

jupyter --version

jupyter notebook list  : lists all running notebook sessions

If you still have problems, combine the two fixes and it should work.

 

 

Converting week numbers to dates


While working with a time series dataset, sometimes you’ll only get the date as the week number of its year. This article presents an easy way to convert it to a datetime tuple and a datetime object in Python.

An easy way to do it is using strptime from the datetime module. Example:

import datetime

week = 12

year = 2017

atime = datetime.datetime.strptime('{} {} 1'.format(year, week), '%Y %W %w').timetuple()

This will return: time.struct_time(tm_year=2017, tm_mon=3, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=79, tm_isdst=-1)

In this command, the symbols used to parse the date string are %Y, %W and %w. The whole symbol table can be found at: strftime symbol table
%Y: four-digit year
%W: week number of the year (Monday as the first day of the week) as a decimal number; all days in a new year preceding the first Monday are considered to be in week 0
%w: weekday as a decimal number, where 0 is Sunday and 6 is Saturday

To convert this to a datetime.date object (this uses the time module, so import it first):

import time

datetime.datetime.fromtimestamp(time.mktime(atime)).date()
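An alternative worth knowing: strptime returns a datetime object, so calling .date() on it directly avoids the struct_time round-trip altogether:

```python
import datetime

week, year = 12, 2017
adate = datetime.datetime.strptime('{} {} 1'.format(year, week), '%Y %W %w').date()
print(adate)   # 2017-03-20
```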