Display all the data columns in Jupyter Notebook

During data exploration, there are often too many columns in the dataframe. By default, Jupyter Notebook displays only a handful of them for simplicity.

Here are a couple of ways to display all the columns:

import pandas as pd

from IPython.display import display

data = pd.read_csv('mydave.csv')

# Directly set the option

pd.options.display.max_columns = None

display(data)

Or, use the set_option method from pandas:

pd.set_option('display.max_columns', None)

display(data)

To change the setting locally, for one specific cell only, do the following:

with pd.option_context('display.max_columns', None):
    display(data)

You can also do:

from IPython.display import HTML

HTML(data.to_html())
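For a quick end-to-end check, here is a self-contained sketch using a made-up wide DataFrame in place of the CSV:

```python
import pandas as pd

# A made-up wide DataFrame standing in for the CSV file
data = pd.DataFrame({f"col{i}": range(3) for i in range(30)})

pd.set_option("display.max_columns", None)   # show every column
print(data.head())                           # no '...' truncation now

pd.reset_option("display.max_columns")       # restore the default afterwards
```

Resetting the option at the end is optional; `option_context` does the set-and-restore automatically.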

Are we ready for the Aug 21, 2017 solar eclipse?

The 2017 total solar eclipse is fast approaching, and hordes of sky gazers are scrambling to find a spot where they can see the shadow of the moon completely obscure the sun for a few moments on Aug. 21. Here is an illustration of the science behind it:

[illustration: stages of the eclipse]

Image Credit: Rick Fienberg, TravelQuest International, and Wilderness Travel

Who can see it?

[image: eclipse visibility map]

Image Credit: NASA’s Scientific Visualization Studio

For those living in the United States, you might want to look at this GIF animation to check when you should look out for this rare event. For people living in the D.C. area, the prime time is 2:40 PM local time.

[animation: eclipse timing across the U.S.]

TIME.com has made a very cool web widget to check the prime time for a given zip code. Check here.

Be sure to wear proper eclipse glasses to protect your eyes; ordinary sunglasses are not safe for looking at the sun.

 

How to direct system output to a variable in R

For people familiar with the Linux/Unix/Mac command line, we all know there are many system commands that can save our day. One of the most frequently encountered problems is getting the number of lines/words in a large file; here I'm talking about tens of millions of records and above. There are many ways to do it: the easiest is to use 'readLines' to read all the lines and count them, but that becomes impossible if your memory won't allow it. On a Linux platform, however, you can easily do it by calling 'wc -l filename.txt'.

In the R environment, you can execute any system command by calling 'system()'. In this example, system("wc -l filename.txt") shows the number of lines. Here is the original question: how do I assign the output to a variable?

It won’t work if you just do:

varName <- system("wc -l filename.txt")

But here is the trick:

varName <- system("wc -l filename.txt", intern = TRUE)

Bingo.
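For comparison, Python users capture a command's output in much the same way with subprocess (a sketch, assuming a Unix `wc` is available; the throwaway file stands in for filename.txt):

```python
import os
import subprocess
import tempfile

# Make a small throwaway file to count
path = os.path.join(tempfile.mkdtemp(), "filename.txt")
with open(path, "w") as f:
    f.write("line 1\nline 2\nline 3\n")

# capture_output=True plays the role of intern = TRUE: the command's
# stdout is returned to the program instead of printed to the terminal
result = subprocess.run(["wc", "-l", path], capture_output=True, text=True)
n_lines = int(result.stdout.split()[0])
print(n_lines)
```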

For more information on the most frequently used Linux command, refer to 50 Most Commonly Used Linux Command with Example.

 

Using R in Jupyter Notebook

R has been gaining momentum in data science thanks to its ease of use and its wealth of statistical packages. As a longtime Python user, I want to run some R commands within Jupyter for practical reasons: some collaborators use R for certain tasks, or it is simply more convenient. This article will show you how to do it.

  • Setup environment

Install R essentials in your current environment:

conda install -c r r-essentials

These ‘essentials’ include the packages dplyr, shiny, ggplot2, tidyr, caret and nnet. 

You can also create a new environment just for the R essentials:

conda create -n my-r-env -c r r-essentials

Now you’re all set to work with R in Jupyter.

What about installing new R packages for use in Jupyter?

There are two ways of doing it. First, build a Conda R package by running:

conda skeleton cran xxx
conda build r-xxx/

Second, you can install the package from inside R via install.packages() or devtools::install_github(), with one change: point the install destination to the conda R library.

install.packages("xxx", "/home/user/anaconda3/lib/R/library")

  • Into good hands

The interactivity comes mainly from the so-called "magic commands," which allow you to switch from Python to command-line instructions (like ls, cat, etc.) or to write code in other languages such as R, Scala, or Julia.

After opening a Jupyter notebook, you should be able to see R in the console:

[screenshot: Jupyter console showing the R kernel]

To switch from Python to R, first load the rpy2 extension:

%load_ext rpy2.ipython

After that, you can start to use R with the %R (line) and %%R (cell) magic commands.

# Hide warnings if there are any
import warnings
warnings.filterwarnings('ignore')
# Load in the R magic
%load_ext rpy2.ipython
# We need ggplot2
%R require(ggplot2)
# Load in the pandas library
import pandas as pd
# Make a pandas DataFrame
df = pd.DataFrame({'Alphabet': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
                   'A': [4, 3, 5, 2, 1, 7, 7, 5, 9],
                   'B': [0, 4, 3, 6, 7, 10, 11, 9, 13],
                   'C': [1, 2, 3, 1, 2, 3, 1, 2, 3]})

Then, in a new cell (a cell magic such as %%R must be the first line of its own cell), take the input variable df, assign it to an R variable of the same name, and plot it:

%%R -i df
# Plot the DataFrame df
ggplot(data=df) + geom_point(aes(x=A, y=B, color=C))

Automate tabular financial datatable into vectorized sequential data

A lot of times, we receive time-related data in a table format, and we want to convert it into a simpler format with one column of datetimes and another of values. See this sample table:

[screenshot: sample wide-format table]

Now we want to convert this dataset into another format that is easier to visualize and to convert into other data structures, such as xts or timeSeries objects. The converted data will look like:

[screenshot: converted long-format data]

Let's look at a sample unemployment-rate table from the Department of Labor.

sampleData <- read.csv('table_date.csv')
sampleData

[screenshot: sampleData]

In R, there are two common ways to do it. The first:

tableDataFlat <- as.vector(t(sampleData[1:nrow(sampleData), 2:ncol(sampleData)]))
dates <- seq.Date(as.Date('2005-01-01'), as.Date('2017-12-01'), 'month')
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)

[screenshot: head(newTS)]

The second way in R:

tableDataFlat <- c(t(as.matrix(sampleData[1:nrow(sampleData),2:ncol(sampleData)])))
newTS <- data.frame(dates=dates,value=tableDataFlat)
head(newTS)

Now we can visualize and analyze the data more conveniently.

plot(newTS)

[screenshot: plot of newTS]

Method in Python:

In Python, it is even simpler. Flatten the data matrix by using:

import numpy as np
import pandas as pd
df = pd.read_csv('table_date.csv')
# Drop the first (date) column before flattening, as in the R version
data = df.iloc[:, 1:].values
data_flat = data.flatten()
# 'MS' (month start) matches first-of-month dates; 'M' would give month ends
dates = pd.date_range(start='2005-01-01', end='2017-12-01', freq='MS')
new_df = pd.DataFrame({'date': dates, 'value': data_flat})
new_df = new_df.set_index('date')
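Since the CSV itself isn't included here, a self-contained sketch of the same reshaping with made-up numbers shows the idea end to end:

```python
import numpy as np
import pandas as pd

# Made-up wide table: one row per year, one column per month
wide = pd.DataFrame(
    np.arange(24, dtype=float).reshape(2, 12),
    index=[2005, 2006],
    columns=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
             "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
)

# Flatten row by row, so values stay in chronological order
values = wide.values.flatten()

# Month-start dates matching the flattened values ('MS' = month start)
dates = pd.date_range(start="2005-01-01", periods=len(values), freq="MS")

long_df = pd.DataFrame({"date": dates, "value": values}).set_index("date")
print(long_df.head())
```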

Python traps you should know

Like every language, Python has some easy-to-overlook traps. Some of them are hidden and can cause big problems or errors in your program. Here are some of the most common traps a good Python programmer should be aware of:

    • 1. A mutable object used as a default parameter

Like many other languages, Python provides default parameters for functions, which are great for making things easier. However, things can become unpleasant if you put a mutable object in the function as the default value for a parameter. Let's look at an example:

[screenshot: code example]
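The original screenshot is missing; the classic form of this trap, which it presumably showed, goes like this (the function and parameter names are made up):

```python
def append_item(item, bucket=[]):  # the default list is created once, at definition time
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [1, 2] -- the same default list is reused across calls
```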

A surprise?! The root cause is that everything in Python is an object, even a function, and a default parameter value is just an attribute of that function. Default parameter values are evaluated only once, when the function definition is executed.

Another more obvious example:

[screenshot: code example]
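One common variant of such an example is a timestamp default, which is evaluated only once (a sketch, not necessarily the original screenshot; the function name is made up):

```python
import time

def log(message, timestamp=time.time()):  # time.time() runs once, at definition
    return (timestamp, message)

first = log("first call")
time.sleep(0.01)
second = log("second call")
print(first[0] == second[0])  # True -- both calls share the definition-time timestamp
```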

How to fix it?

According to the Python documentation, a way around this is to use None as the default, and explicitly test for it in the body of the function.

[screenshot: code example]
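Following that advice, a sketch of the fix (names made up):

```python
def append_item(item, bucket=None):
    if bucket is None:   # create a fresh list on every call
        bucket = []
    bucket.append(item)
    return bucket

print(append_item(1))  # [1]
print(append_item(2))  # [2] -- each call gets its own list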

  • 2. x += y vs x = x + y

Generally speaking, the two look equivalent, but they do not always behave the same. Let's look at an example:

[screenshot: code example]
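The missing screenshot can be reproduced with a list; id() shows whether we still have the same object:

```python
x = [1, 2]
before = id(x)
x = x + [3]             # builds a brand-new list and rebinds x
print(id(x) == before)  # False -- x now points to a new object

x = [1, 2]
before = id(x)
x += [3]                # list.__iadd__ mutates the list in place
print(id(x) == before)  # True -- same object, modified in place
```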

As we can see, x = x + y creates a new object and rebinds x to it (the id changes), while x += y modifies the list in place at its current location (the id stays the same).

  • 3. Magic parentheses ()

In Python, () can represent a tuple, an immutable data structure.

[screenshot: code example]

What if you have only one element in the tuple?

[screenshot: code example]

Magic: it becomes an integer instead of a tuple. The right thing to do is to add a trailing comma:

[screenshot: code example]
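The three missing screenshots presumably showed something like:

```python
t = (1, 2, 3)
print(type(t))        # <class 'tuple'>

n = (1)               # the parentheses here are just grouping
print(type(n))        # <class 'int'>

s = (1,)              # the trailing comma is what makes a one-element tuple
print(type(s))        # <class 'tuple'>
```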

  • 4. Generating a list of lists

This is like a 2-D array, or generating a list with mutable elements in it. Sounds very easy:

[screenshot: code example]

By adding the value 10 to the first element of the list, we set every element to the same value. Interesting, hmmm? That's not what I want!

The reason is still the same: the mutable objects within the list all point to the same underlying object. The right syntax is:
[screenshot: code example]
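A reconstruction of both the trap and the fix, using a hypothetical 3x3 grid:

```python
# The trap: the outer * copies the reference, so every row is the same list
grid = [[0] * 3] * 3
grid[0][0] = 10
print(grid)  # [[10, 0, 0], [10, 0, 0], [10, 0, 0]]

# The fix: build a new inner list for each row
grid = [[0] * 3 for _ in range(3)]
grid[0][0] = 10
print(grid)  # [[10, 0, 0], [0, 0, 0], [0, 0, 0]]
```

The inner `[0] * 3` is safe because integers are immutable; only the outer `* 3` that copies list references causes the problem.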

As seen above, there are many traps when using Python, and you should definitely be aware of them.

 

Deep Learning with GPU: How do we start? A quick setup guide on Amazon EC2

Deep learning is one of the hottest buzzwords in tech and is impacting everything from health care to transportation to manufacturing, and more. Companies are turning to deep learning to solve hard problems, like speech recognition, object recognition, and machine translation.

Every new breakthrough comes with challenges. The biggest challenge for deep learning is that it requires intensive training of the model: massive amounts of matrix multiplications and other operations. A single CPU usually has no more than 12 cores, and it becomes a bottleneck for deep-learning development. The good thing is that all the matrix computation can be parallelized, and that's where the GPU comes to the rescue. A single GPU might have thousands of cores, a perfect fit for deep learning's massive matrix operations. GPUs are much faster than CPUs for deep learning because they have orders of magnitude more resources dedicated to floating-point operations, running specialized algorithms that keep their deep pipelines filled.


Now we know why a GPU is necessary for deep learning. Probably you're interested in deep learning and can't wait to do something with it, but you don't have big GPUs on your computer. The good news is that there are public GPU servers for you to start with: Google, Amazon, and OVH all rent out GPU servers, and the cost is very reasonable.

In this article, I'll show you how to set up a deep learning server on Amazon EC2, a p2.xlarge GPU instance in this case. To set up the Amazon instance, here is the prerequisite software you'll need:

  1. Python 2.7 (recommend anaconda)
  2. Cygwin with wget, vim (if on windows)
  3. Install Amazon AWS Command Line Interface (AWS CLI), for Mac

Here is the fun part:

  1. Register an Amazon ec2 account at: https://aws.amazon.com/console/
  2. Go to Support –> Support Center –> Create case (only for new EC2 users). Type in the information in the form and 'submit' at the end. Wait up to 24-48 hours for it to be activated. If you are already an EC2 user, you can skip this step.
  3. Create new user group. From console, Services –> Security, Identity & Compliance –> IAM –> Users –> Add user
  4. After creating the new user, add permissions to the user by clicking the user you just created.
  5. Obtain access keys: Users –> Access Keys –> Create access key. Save the information.
  6. Now we're done with the Amazon EC2 account; go to the Mac Terminal or Cygwin on Windows.
  7. Download the set-up files from fast.ai: setup_p2.sh and setup_instance.sh. Change the extension to .sh, since WordPress doesn't support bash file uploads.
  8. Save the two shell script to your current working directory
  9. In the terminal, type: aws configure. Then type in the Access Key ID and Secret Access Key saved in step 5.
  10. bash setup_p2.sh
  11. Save the generated text (on terminal) for connecting to the server
  12. Connect to your instance: ssh -i /Users/lxxxx/.ssh/aws-key-fast-ai.pem ubuntu@ec2-34-231-172-2xx.compute-1.amazonaws.com
  13. Check your instance by typing: nvidia-smi
  14. Open the Chrome browser with the URL: ec2-34-231-172-2xx.compute-1.amazonaws.com:8888. Password: dl_course
  15. Now you can start to write your deep learning code in the Python Notebook.
  16. Shut down your instance in the console when you're done, or you'll pay a lot of money.

For a complete tutorial video, please check Jeremy Howard’s video here.

Tips:

The settings and passwords are all saved under ~/.aws and ~/.ipython.