Dealing with non-ASCII characters in R and Python

When working with online datasets, especially global ones, we encounter non-ASCII characters very frequently. In this post, I’ll use Chinese characters as an example to show how to handle them in R and Python.

In R:

Sys.getlocale()  # First, check the current system locale
Sys.setlocale(category = "LC_ALL", locale = "chs")  # Windows locale name; use "cht" for Traditional Chinese
print("测试")  # means "test" in Chinese

# When reading a file that contains non-ASCII characters
data <- read.csv('file.csv', encoding = 'UTF-8')

Publish Shiny App with foreign languages.

Here are some tips for a Shiny app that contains non-ASCII characters:

  1. Save the scripts with ‘Save with Encoding’ –> choose ‘UTF-8’
  2. For any I/O, add the option encoding = 'UTF-8'
  3. To publish a Shiny app with, say, Chinese characters, do the following:
  • temp.enc <- options()$encoding
  • options(encoding = 'UTF-8')
  • deployApp(appDir = "your app's root directory")
  • options(encoding = temp.enc)  # restore the previous setting

Try not to call options(encoding = 'UTF-8') inside the app script itself, because it forces all I/O to be UTF-8.

 

In Python

In Python, add the following encoding declaration at the top of your script and save the file as UTF-8.

# -*- coding: utf-8 -*-
print(u"你好".encode('utf-8'))  # Python 2; in Python 3, print("你好") is enough
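The same idea applies to file I/O. Here is a minimal Python 3 sketch, assuming a throwaway file name made up for this illustration, that writes and reads Chinese text with an explicit encoding instead of relying on the system locale.

```python
# -*- coding: utf-8 -*-
import os
import tempfile

text = u"你好, 测试"

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample_utf8.txt")

# Name the encoding explicitly; the default depends on the system locale.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# In UTF-8, these characters take more bytes than code points.
n_chars = len(text)
n_bytes = len(text.encode("utf-8"))
print(round_trip == text, n_chars, n_bytes)
```

Passing encoding="utf-8" on both the write and the read is what keeps the round trip independent of the machine the script runs on.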

Quick way to extract coastline data from NOAA shapefile with R

Time and space are the two most common factors in the data we deal with every day. When analyzing a dataset with spatial information, it is often interesting to show the analysis on a map. For data on oceans, rivers and other water bodies, the coastline is an important feature for delineating the target area and its geological features. For the most complete and accurate coastline data, NOAA is the best source. However, the coastline is usually distributed as a shapefile, a format many people are not familiar with. In this post, I’ll show a quick way of extracting polygon features (including coastline polygons) from the NOAA global coastline shapefile. The file in this example can be downloaded from here.

Before going to the code, one has to understand the structure of a standard shapefile. Typically, a shapefile is a set of files with the extensions shp, prj, dbf, xml and shx. The raw geometry of all features is in the file ending in shp. The prj file specifies the projection used, which is critical when you’re combining data that use different projections. The file ending in dbf stores the attributes of all shapes.
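To make the shp file a little less of a black box, here is a rough Python sketch that reads the fixed 100-byte header at the start of a .shp file. The function name and the synthetic demo file are my own for illustration; the byte offsets follow the published ESRI shapefile layout, and nothing here is needed for the R workflow in this post.

```python
import os
import struct
import tempfile

# Only a few common shape type codes are listed here.
SHAPE_TYPES = {0: "Null", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def read_shp_header(path):
    """Read the fixed 100-byte header at the start of a .shp file."""
    with open(path, "rb") as f:
        header = f.read(100)
    if len(header) < 100:
        raise ValueError("file is too short to be a .shp file")
    file_code = struct.unpack(">i", header[0:4])[0]       # big-endian, always 9994
    if file_code != 9994:
        raise ValueError("not a shapefile")
    length_words = struct.unpack(">i", header[24:28])[0]  # length in 16-bit words
    version, shape_code = struct.unpack("<ii", header[28:36])
    bbox = struct.unpack("<4d", header[36:68])            # xmin, ymin, xmax, ymax
    return {
        "file_bytes": length_words * 2,
        "version": version,
        "shape_type": SHAPE_TYPES.get(shape_code, "other"),
        "bbox": bbox,
    }


# Synthetic header-only file so the sketch runs without downloading anything.
demo_header = (
    struct.pack(">i", 9994) + b"\x00" * 20 + struct.pack(">i", 50)
    + struct.pack("<ii", 1000, 5)
    + struct.pack("<4d", -180.0, -90.0, 180.0, 90.0)
    + b"\x00" * 32
)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.shp")
with open(demo_path, "wb") as f:
    f.write(demo_header)

info = read_shp_header(demo_path)
print(info)
```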

Here is how we extract the WGS84 coastline from NOAA dataset.

library(maptools)
library(shapefiles)
library(rgdal)
library(ggplot2)
library(rgeos)

# Projection and datum setup
wgs84 <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs")
# Attach the projection on read so it matches the points created later
GlobalCoastline <- readShapePoly('./ne_50m_ocean.shp', proj4string = wgs84)

class(GlobalCoastline)

## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"

sp object
The object has five slots: data, polygons, plotOrder, bbox and proj4string. data contains the attribute information about the polygons, polygons contains the actual polygon coordinates, plotOrder is the order in which the polygons are drawn, bbox is the bounding box drawn around the boundaries of the shapefile and proj4string is the projection.

View the global coastline

Let’s take a look at the global 50m resolution coastline and check one location for accuracy.

plot(GlobalCoastline)
points(-73.4, 37, col = 'red')

[Figure: plot of the global coastline with the test location marked in red]

Extract coordinates from a particular polygon

# for example, get the polygon with ID of 1
poly1_coord <- GlobalCoastline@polygons[[1]]@Polygons
# it will return a list of polygons since each polygon may be nested with other polygons
poly1_coord[1]

# Example output:

## [[1]]
## An object of class "Polygon"
## Slot "labpt":
## [1] 50.79407 41.77107
##
## Slot "area":
## [1] 42.96339
##
## Slot "hole":
## [1] FALSE
##
## Slot "ringDir":
## [1] 1
##
## Slot "coords":
## [,1] [,2]
## [1,] 51.74453 46.93374
## [2,] 51.94512 46.89487
## [3,] 52.01113 46.90190
## [4,] 52.08555 46.83960
## [5,] 52.13828 46.82861
## [6,] 52.18877 46.83950
## [7,] 52.34033 46.89478
## [8,] 52.38486 46.92212

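Check whether a location is inside a polygon

Note that the polygons in ne_50m_ocean.shp describe the ocean, so TRUE here means the point falls in water rather than on land.

lng <- -70.4
lat <- 37
p <- SpatialPoints(cbind(lng, lat), proj4string = wgs84)
gContains(GlobalCoastline, p)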

## [1] TRUE

Now you see how easy it is to extract coastlines using the R ‘maptools’ package.

If you want to do it in Python, check the ‘ogr’ module, which is part of the GDAL Python bindings.
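If you only need the point-in-polygon test and already have the ring coordinates, a plain-Python sketch is enough. This is not the ogr API: it is a simple ray-casting check written for illustration, and the rectangle below is a made-up ring rather than real coastline data. It ignores holes and treats longitude/latitude as flat x/y.

```python
def point_in_ring(lng, lat, ring):
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point matter.
        if (y1 > lat) != (y2 > lat):
            # Longitude at which this edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangular ring (lng, lat), for illustration only.
box = [(-75.0, 30.0), (-65.0, 30.0), (-65.0, 40.0), (-75.0, 40.0)]

in_box = point_in_ring(-70.4, 37.0, box)
out_of_box = point_in_ring(-80.0, 37.0, box)
print(in_box, out_of_box)
```

An odd number of crossings means the point is inside the ring, an even number means it is outside.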

 

 

Kernel Density Estimation–Optimal bandwidth

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function (PDF) of a random variable. The PDF specifies the probability of the random variable falling within a particular range of values, as opposed to taking on any one value (Wikipedia).


Let (x1, x2, …, xn) be a univariate independent and identically distributed sample drawn from some distribution with an unknown density f. We’re interested in estimating the shape of this function f. The kernel density estimator is defined as:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

where K is the kernel and h > 0 is a smoothing parameter called the bandwidth. The most common kernels are uniform, triangular, biweight, triweight, Epanechnikov, normal and others.

Intuitively, we want to choose bandwidth (h) as small as the data will allow. However, there is always a trade-off between the bias of the estimator and its variance.
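To see the bias-variance trade-off in action, here is a small pure-Python sketch of the estimator with a Gaussian kernel: one kernel bump per data point, each scaled by h, then averaged. The sample values are made up, and the function names are mine.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h):
    """Kernel density estimate at x: (1 / (n * h)) * sum of K((x - x_i) / h)."""
    n = len(sample)
    return sum(gaussian_kernel((x - xi) / h) for xi in sample) / (n * h)


sample = [-1.2, -0.4, 0.0, 0.3, 1.1]  # made-up data

# A small h gives a spiky estimate (high variance), a large h a flat one (high bias).
narrow = kde(0.0, sample, h=0.1)
wide = kde(0.0, sample, h=1.0)

# Sanity check: the estimate should integrate to roughly 1 (Riemann sum on [-10, 10]).
step = 0.01
area = sum(kde(-10.0 + step * i, sample, 0.5) for i in range(2001)) * step
print(narrow, wide, area)
```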

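The most common optimality criterion used to select the bandwidth is the mean integrated squared error:

$$\mathrm{MISE}(h) = \mathrm{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right]$$

A rule-of-thumb bandwidth estimator,

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5},$$

where \hat{\sigma} is the sample standard deviation, is the optimal choice when a Gaussian kernel is used to approximate univariate data and the underlying density is Gaussian. However, this can yield inaccurate estimates when the density is not close to being normal.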

$\text{Bin size} = 2\,\mathrm{IQR}(x) / n^{1/3}$, a.k.a. the Freedman-Diaconis rule, is a practical way to get the optimal bin width for a histogram.
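Both rules are one-liners in code. The sketch below assumes the usual normal-reference form h ≈ 1.06 · σ · n^(-1/5) for the rule of thumb and uses the quartiles from Python’s statistics module (Python 3.8+) for the IQR. Different packages compute quartiles slightly differently, so expect small differences from R.

```python
import statistics


def normal_reference_bandwidth(sample):
    """Rule-of-thumb KDE bandwidth: 1.06 * sigma * n^(-1/5)."""
    n = len(sample)
    sigma = statistics.stdev(sample)  # sample standard deviation
    return 1.06 * sigma * n ** (-1.0 / 5.0)


def freedman_diaconis_width(sample):
    """Histogram bin width: 2 * IQR / n^(1/3)."""
    n = len(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartiles, 'exclusive' method
    return 2.0 * (q3 - q1) / n ** (1.0 / 3.0)


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up data
h = normal_reference_bandwidth(data)
width = freedman_diaconis_width(data)
print(h, width)
```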

A better estimator is the so-called solve-the-equation bandwidth. (Botev, Z.I.; Grotowski, J.F.; Kroese, D.P. (2010). “Kernel density estimation via diffusion”. Annals of Statistics. 38 (5): 2916–2957. doi:10.1214/10-AOS799.)

In R, the functions recommended for getting the optimal bandwidth are MASS::bandwidth.nrd() (a normal-reference rule of thumb for kernel density estimates) and KernSmooth::dpih() (a plug-in bin width for histograms).

Another interesting thing about KDE: it can be shown that both kNN (k nearest neighbors) and KDE converge to the true probability density as 𝑁 → ∞, provided that 𝑉 (the volume of each bin) shrinks with 𝑁 and that 𝑘 (the number of data points falling in each bin) grows with 𝑁 appropriately.
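For comparison, the kNN density estimate fixes 𝑘 and lets the volume adapt, while KDE fixes the volume (through h) and lets the count adapt. Here is a rough one-dimensional sketch, with a made-up evenly spaced sample whose true density on [0, 1] is 1. The function name is mine.

```python
def knn_density(x, sample, k):
    """1-D kNN density estimate: k / (n * V), V = width of the interval holding k points."""
    n = len(sample)
    distances = sorted(abs(x - xi) for xi in sample)
    radius = distances[k - 1]  # distance to the k-th nearest neighbour
    volume = 2.0 * radius      # length of [x - radius, x + radius]
    # Note: radius is 0 if k or more sample points coincide with x.
    return k / (n * volume)


# Made-up sample: 101 evenly spaced points on [0, 1].
sample = [i / 100.0 for i in range(101)]
estimate = knn_density(0.5, sample, k=10)
print(estimate)
```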

ref:
https://en.wikipedia.org/wiki/Kernel_density_estimation#cite_note-bo10-
https://en.wikipedia.org/wiki/Probability_density_function
https://en.wikipedia.org/wiki/Freedman%E2%80%93Diaconis_rule

 

A quick hack for creating a series of template feature names in Python

Let’s say that I want to create a bunch of feature names with the format ‘model_pow_1’, ‘model_pow_2’, ‘model_pow_3’ and so on up to the Nth variable. It’s very easy to do in R with just the paste0 and seq functions (e.g. paste0('model_pow_', seq(1, 5))). In Python, we can use a list comprehension to accomplish the same thing, but pay attention to the string formatting.

ind = ['model_pow_%d' % i for i in range(1, 10)]

Another way is to use string concatenation.

ind = ['model_pow_' + str(i) for i in range(1, 10)]
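One thing that is easy to miss: range(1, 10) stops at 9, not 10. If you do this often, a tiny helper (the name is mine, just for illustration) makes the upper bound explicit.

```python
def feature_names(prefix, n):
    """Return [prefix_1, ..., prefix_n]; range() excludes its end, hence n + 1."""
    return ["{}_{}".format(prefix, i) for i in range(1, n + 1)]


names = feature_names("model_pow", 3)
print(names)
```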

 

Download existing file from Shiny

Shiny is particularly convenient and powerful when it comes to sharing interactive and complex data analysis results. It can do pretty much anything a web-based application can, depending on your knowledge of JavaScript, HTML and CSS. In this short article, I’ll address something that comes up quite often. Sometimes you want to show the user what the raw sample data looks like, and the best way is to give them the data. Doing so avoids confusion about the expected format. How do we add a download utility that lets the user download a pre-uploaded sample file?

library(shiny)

ui <- fluidPage(

  titlePanel("Download Sample"),

  sidebarLayout(
    sidebarPanel(
      downloadButton("downloadData", label = "Download")
    ),

    mainPanel(
      h6("Sample download", align = "center")
    )
  )
)

server <- function(input, output) {
  output$downloadData <- downloadHandler(
    # File name suggested to the browser for the download
    filename = function() {
      paste("sample", "csv", sep = ".")
    },

    # Copy the pre-uploaded sample file to the temporary download path
    content = function(file) {
      file.copy("sample.csv", file)
    },
    contentType = "text/csv"
    # contentType = "application/zip"  # for zip files
  )
}

# Run the application
shinyApp(ui = ui, server = server)

One last tip: if you run this code in RStudio, it won’t show you the right file name. Instead, run it in a browser and it works like a charm.

 

Buffered and unbuffered IO in Python

Sometimes you may wonder why the print statements in your Python script don’t show anything until the end of the program, especially when the program is computationally heavy and takes a minute or longer to run. This is due to the way the system handles I/O. By default, I/O in programs is buffered, which means data is not transferred immediately byte by byte but is instead collected in temporary storage (the buffer) and transferred in blocks. Buffering improves I/O performance by reducing the total number of system calls. Standard output is buffered because it is assumed that far more data will go through it. On the read side, the whole block is read into the buffer at once and the individual bytes are then delivered to you from the (fast, in-memory) buffer.

The counterpart of buffered output is unbuffered output, which is used when you want to ensure the output is written immediately, without delay, before continuing. For example, standard error under a C runtime library is usually unbuffered by default. There are two main reasons: 1. errors are supposedly infrequent, and 2. you want to know about them immediately.

The following is a detailed explanation of when buffered vs. unbuffered output should be used:

You want unbuffered output when you already have a large sequence of bytes ready to write to disk and want to avoid an extra copy into a second buffer in the middle.

Buffered output streams will accumulate write results into an intermediate buffer, sending it to the OS file system only when enough data has accumulated (or flush() is requested). This reduces the number of file system calls. Since file system calls can be expensive on most platforms (compared to short memcpy), the buffered output is a net win when performing a large number of small writes. A unbuffered output is generally better when you already have large buffers to send — copying to an intermediate buffer will not reduce the number of OS calls further and introduces additional work.

Unbuffered output has nothing to do with ensuring your data reaches the disk; that functionality is provided by flush(), and works on both buffered and unbuffered streams. Unbuffered IO writes don’t guarantee the data has reached the physical disk — the OS file system is free to hold on to a copy of your data indefinitely, never writing it to disk, if it wants. It is only required to commit it to disk when you invoke flush(). (Note that close() will call flush() on your behalf).  — Quote from stackoverflow community wiki.

Here is an example of buffered output:

[Image: screenshot of a buffered-output example]
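You can also make buffering visible with a file instead of the screen. The sketch below assumes CPython’s default buffered text files and a throwaway file name: a small write stays in the in-memory buffer, so the file on disk is still empty until flush() is called.

```python
import os
import tempfile

# Hypothetical file name, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")

f = open(path, "w")  # default: buffered
f.write("hello")     # small write, held in the in-memory buffer
size_before_flush = os.path.getsize(path)

f.flush()            # hand the buffered data to the operating system
size_after_flush = os.path.getsize(path)
f.close()

print(size_before_flush, size_after_flush)
```

The same thing happens to print when standard output is buffered: the text exists, it just has not been handed over yet.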

Now we get an idea of how buffered IO works. How do we force Python’s print function to output to the screen?

If you’re using Python 3.3+, print has a flush option. By setting flush=True, the stream is forcibly flushed immediately.

print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

Another general way is to use sys.stdout.flush().

import sys

print("This will be output immediately")

sys.stdout.flush()
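For a long-running job, the usual pattern is a progress message per step with flush=True. This is a minimal Python 3 sketch: the function name is mine and time.sleep() stands in for the heavy computation.

```python
import time


def run_with_progress(steps, out=None, delay=0.0):
    """Print one progress line per step and flush it right away."""
    for i in range(1, steps + 1):
        time.sleep(delay)  # stand-in for the real computation
        # file=None means sys.stdout; flush=True pushes the line out immediately.
        print("finished step {} of {}".format(i, steps), file=out, flush=True)


run_with_progress(3)
```

The out argument is only there so the messages can be redirected, for example into a log file or an io.StringIO object.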

When running from the command line, add -u:

python -u mypython.py

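You can also use an unbuffered file:

f = open('file.txt', 'a', 0)  # Python 2: 0 is no buffer, 1 is line buffered, other values are the buffer size

# or

sys.stdout = open('file.txt', 'a', 0)

In Python 3, a buffer size of 0 is only allowed for files opened in binary mode (e.g. 'ab'); for a text file, use 1 to get line buffering.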
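Here is how the two options look in Python 3, as a rough sketch with throwaway file names: an unbuffered binary file, where every write reaches the file right away, and a line-buffered text file, where data is pushed out each time a newline is written.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
raw_path = os.path.join(folder, "raw.log")    # hypothetical file names
line_path = os.path.join(folder, "line.log")

# buffering=0: unbuffered, binary mode only in Python 3.
raw = open(raw_path, "ab", buffering=0)
raw.write(b"partial")
raw_size = os.path.getsize(raw_path)

# buffering=1: line buffered, text mode.
line = open(line_path, "a", buffering=1)
line.write("no newline yet")
size_without_newline = os.path.getsize(line_path)
line.write("\n")  # the newline triggers a flush
size_with_newline = os.path.getsize(line_path)

raw.close()
line.close()
print(raw_size, size_without_newline, size_with_newline)
```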

You can also change the default through an environment variable in the shell.

On Linux or OSX:

$ export PYTHONUNBUFFERED=TRUE

or on Windows:

C:\> SET PYTHONUNBUFFERED=TRUE

 

How to configure Aliyun DNS for Aliyun enterprise mail server

For people living in the United States, we’ve all heard of Amazon EC2, the king of cloud computing. Outside of the USA, Alibaba’s Aliyun cloud service is gaining tremendous ground, especially in China. Given that Alibaba handles the largest e-commerce transaction volumes, its infrastructure and technology are at least on par with its counterparts around the world. In 2016, Alibaba smashed the world record for online transactions with 175,000 per second. For businesses or services oriented toward the Chinese market, it makes great sense to set up a server in China for faster access with minimal transoceanic data transmission.

Recently, I started using Aliyun’s Elastic Compute Service (ECS) and the experience has been great. The price is also lower than EC2, which is important for small businesses or amateur users. After finishing the front-end development of my machine-learning based ocean data platform, I realized that I needed a mail server so that registered users can retrieve their passwords and I can send batch messages. The first hurdle is configuring the DNS for Aliyun’s mail service. After some research on the internet, mostly in Chinese, I configured it successfully.

I think it will be helpful to share the basic steps for setting up this service for anyone who wants to establish a presence in China. The prerequisite is to register for the Aliyun ECS service.

Here are the simple steps:

  1. Register an Aliyun enterprise mailbox from: https://wanwang.aliyun.com/mail/freemail/?spm=5176.8071678.875975.topareabtn0.1bcf708bW5hGG5    You can choose the free mail service with a maximum of 50 accounts and 5 GB of space.
  2. After registering the mail service, log on to your Aliyun console.
  3. From Home –> Direct Mail –> Email Settings –> Email domains, click ‘New Domain’ on the upper right. Add a new domain with something like mail.google.com (assuming google.com is your domain).
  4. Then you’ll have to verify that you own that domain.
  5. Here is the most important part: you’ll need to finish the DNS configuration before verifying. Click ‘Configure’ on the email domain you just created in Step 3.
  6. Keep the above page open while we set up the DNS. Go to the Aliyun control console, Home –> Domain and Websites –> Alibaba Cloud DNS –> click ‘全部域名’ (All domains) –> ‘解析’ (configure). This leads to the DNS Settings.
  7. Click ‘Add Record’. We’re going to add four records using the information from Step 5.
  8. After adding two TXT records, one MX record and one CNAME record, you have finished configuring the DNS.
  9. Now go back to the mail domain we created in Step 3 and click ‘verify’. It will probably take up to 2 minutes before you are able to verify.
  10. After verification, you can set up the ‘Sender Addresses’, the SMTP password and so on by going back to the Email Domain page from Home.
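Since verification fails if any of the four records is missing, it can help to keep a small checklist next to the console. Here is a sketch in Python: the function name and the record layout are hypothetical, and the placeholder values stand for whatever the ‘Configure’ page shows for your own domain.

```python
from collections import Counter

# Record types required by the steps above: two TXT, one MX, one CNAME.
REQUIRED = {"TXT": 2, "MX": 1, "CNAME": 1}


def missing_records(records):
    """Return how many records of each required type are still missing."""
    have = Counter(r["type"] for r in records)
    return {t: need - have[t] for t, need in REQUIRED.items() if have[t] < need}


# Hypothetical entries; copy the real hosts and values from the console.
added_so_far = [
    {"type": "TXT", "host": "<host from console>", "value": "<value from console>"},
    {"type": "MX", "host": "<host from console>", "value": "<value from console>"},
]

todo = missing_records(added_so_far)
print(todo)
```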