Data handling and visualization using Python

Oleksandr (Sasha) Huziy and Jonathan Doyle

CMOS, June 2014

The presentation is available here: http://tinyurl.com/cmos2014

Introduction and history

What is Python??

"Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability." wiki)

"Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open." python.org

What is Python??

Readable and interpreted

Not compiled (computer executes in order)
Easy to develop, debug, and share

Easy to learn, powerful, and fast

High-level and dynamically typed
Interfaces with compiled languages (e.g., C, C++, Fortran)
Versitile libraries "under the hood"

Open and free

Community support and development
Skills transfer without overhead (no site lisence)

Example uses for Python

General programming (e.g., clock applet in Ubuntu)
Video games (e.g., Ubisoft in Montreal)
Web development (e.g., Pinterest, Instagram)
Data analysis (e.g., visulalization, multi dimensional data, statisitics)
Scripting (e.g,. automating system tasks, smart file manipulation)

Talk on Python history by Python developer Guido Van Rossum here.

Some things I've done with python: lake level observations

More cool examples:

Outline part I

Python basics
- IPython and basic python syntax
- Builtin data types
- Operations on file system, strings and dates
Libraries for scientific computing
- NumPy/SciPy
Handling NetCDF4 files
- Netcdf4-python

Getting started

Open a new terminal: Ctrl-Alt-T (Ubuntu) or search for 'terminal' in the dash

Make a working directory

    $ mkdir python_workshop

Change into new directory

    $ cd python_workshop

First create new python script (for later)

    $ gedit example01.py &

List the files in directory

    $ ls

Launch the ipython terminal

    $ ipython

Using python as a calculator

In [6]:

# Try addition, subtraction, multiplcation, and divison
2 + 2
3 - 1
2 * 2
3 / 2;

In [7]:

# Careful with division!
# older python versions treat 2 as an integer and 2. as a float
3 / 2.;

In [8]:

# Try exponentials
2 ^ 3 
# this is actually an XOR NOT exponential!
2 ** 3
9 ** .5

Out[8]:

3.0

Using python as a calculator (scientific)

In [9]:

import math

# Create a variable for pi
pi = math.pi
print pi

math.cos(pi)

math.exp(1)

3.14159265359

Out[9]:

2.718281828459045

Use the 'Tab' auto-complete in iPython to list other math functions.

Note: math only works on scalars, not arrays (with arrays we will use numpy)

Quick overview: native types

Basic data types

string (str): 'This is a string', "'This' too"
int: 0, 1, 20, 1000
float: 0.0, 1.0, 20., 1000.0000001
bool: True, False

Data containers

list: [1,2,4, 5, "I am a fifth element"]
tuple: (1, 2, 3)
dictionary (dict): {'key':'value', 'key2':['list','of','values'], ...}

Extended types (requires an import)

datetime: datetime.datetime.now(), datetime.datetime.utcnow()
array: numpy.array([1,2,3,4])

Quick overview: Flow control

Python uses a four-space indent for loops, case statments, and function definition

Loops:

    for item in item_list:
        # Do something, for example:
        print item

     while condition==True:
         # Do something

Quick overview: Flow control

Case statements:

    
    if condition == True:
        # Do one thing, for example:
        print condition

    elif condition == False:
        # Do something else
        print 'This condition failed to pass'

    else:
        # If neither condition is satisfied
        print 'Warning, condition was not satisfied!'

Quick overview: Flow control

Function definition:

In [2]:

def my_func(input_a, input_b):
    """This is my function documentation"""
        
    output = input_a + input_b
        
    return output

Quick overview: Syntax

In [10]:

#Variable decalaration <-This is a comment
a = 10; b = "climate"; x = None;

x = [1,2,4, 5, "I am a fifth element"]

#looping
for item in x:
    
    #Checking conditions
    if type(item) == int:

        print(item ** 2 % 10),
        
    else:
        
        msg = ", but my index is {0}."
        msg = msg.format(x.index(item))
        
        print item + msg,

#now we are outside of the loop since no identation
print "\nSomething ..."

1 4 6 5 I am a fifth element, but my index is 4. 
Something ...

Builtin data types

Strings

In []:

s = "climate"
#reverse (also works for lists)
print s[-1::-1]

In []:

#Dynamically changing parts of a string (for left right justification use '>', '<')
tpl = "Formatting floats: {0:.3f};\nformatting integers: {1:05d};\nformatting dates: {2: %Y-%m-%d %H:%M}"
print tpl.format(789.3342353554, 32, datetime.now())

Strings

In []:

#Splitting
s = "This,is,a,sentence"
s_list = s.split(",")
print s_list

Strings

In []:

#joining a list of strings
list_of_fruits = ["apple", "banana", "cherry"]
x =   """
        I would like to eat 
        {0}. 
        And by the way I am a multiline string.
      """
print x.format(" or ".join(list_of_fruits))

Strings: regular expressions

In []:

#regular expressions module
import re
msg = "Find 192, numbers 278: and -7 and do smth w 89"
groups = re.findall(r"-?\d+", msg)
print groups 
print [float(el) for el in groups] #convert strings to floats

Dates

In []:

#What time is it
from datetime import datetime
d = datetime.now(); print d

In []:

#hours and minutes
d.strftime("%H:%M"), d.strftime("%Hh%Mmin")

Dates: How long is the CMOS 2014?

In []:

start_date = datetime(2014, 6, 1, 8, 30)
end_date = datetime(2014, 6, 5, 17)
print end_date - start_date

In []:

#you can mutiply the interval by an integer
print (end_date - start_date) * 2

You can generate dates for timeseries with arbitrary steps

In []:

from datetime import timedelta
dt = timedelta(days = 5, hours = 6, minutes = 25)
d0 = datetime(2005, 4, 23)
[str(d0 + i * dt) for i in range(10)]

You can use the calendar module for different convenience functions with dates

Example: how many days in a given month in 3001?

In []:

import calendar
for month in range(1, 13):
    weekday_of_last_day, ndays = calendar.monthrange(3001, month)
    print ndays,

Follow an example

Data from the IML-4 buoy outside Rimouski provided by the St. Laurence Global Observatory

Download data here save it in the working directory python_workshop we created earlier.

Follow an example

Extract the and open the data to prepare for parsing.

$ tar -xzf IML-4_example.tar.gz

$ gedit IML-4_salinite.txt

    IML-4 — Water salinity —
    Date     Time (UTC)     Water salinity (PSU)
    2013-05-28     10:18:22 PM     24.7712
    2013-05-28     10:33:22 PM     24.7736
    2013-05-28     10:48:22 PM     24.7924
    ...
    2013-10-30     04:33:22 PM     25.7566
    2013-10-30     04:48:22 PM     25.7388

Follow along with the example, the completed examples are available for reference in the complete_examlpes directory.

Exercises: Operations with strings, dates and file system

Check what this statement prints

    print 5 * "123"

Count the number of files and number of folders in your home or in the current directory.
Get the sum of digits (not numbers) in the sentence below using module re.

This is a 78 sentence very confusing (1) which might change232 so you need to 346 automate the procedure of extracting the /21e893/ useful information from here10.21?

Figure out on which day of week you were born (see the calendar module), you can have fun by determining on which day of week you'll have your next birthday.
Assume you have a timeseries of length 10000. You know that it starts on 2009-01-02 at 18:00 and the time step is 17 hours 25 minutes, what is the date and time corresponding to the last point of the timeseries.

List comprehensions and sequences

In []:

#square eleaments of a list
#list comprehension
print [the_el ** 2 for the_el in the_list]

In []:

#Generating list or iterable of integers
print range(1,20)

Builtin data containers (lists)¶

In []:

the_list = [3,6,78,9,0,32,13]
##There are some utility functions that can be applied to lists
the_list.sort(reverse = True)
print the_list

In []:

#loop through several lists at the same time
for el1, el2 in zip(the_list, other_list):
    print(el1+el2),

Builtin data containers (dictionary)

In []:

#dictionary
author_to_books = { 
"Stephen King": 
    ["Carrie","On writing","Green Mile"],

"Richard Feynman": 
    ["Lectures on computation", 
     "The pleasure of finding things out"]
}
#add elements to a dictionary
author_to_books["Andrey Kolmogorov"] = \
        ["Foundations Of The Theory Of Prob..."]

Builtin data containers (dictionary)

In []:

#print the list of authors
print author_to_books.keys()

In []:

#Iterate over keys and values
for author, book_list in author_to_books.iteritems():
    print author, " wrote ", book_list

Exercises: builtin containers

Find a sum of squares of all odd integers that are smaller than 100 in one line of code (Hint: use list comprehensions)
Check what this statement prints

    print 10 * [18,]

NumPy

The library for fast manipulations with big arrays that fit into memory.

In []:

import numpy as np
#Creating a numpy array from list
np_arr = np.asarray([253,1.9,23, 4, 3.5,6,7,86, 18.9, 10, 89, 99])
print np_arr
print np_arr + 1
print np_arr ** 2
print np_arr + 5 * np_arr

NumPy

In []:

#Reshape
np_arr.shape = (3,4)
print np_arr

In []:

#sum along a specified dimension, here 
print np_arr.sum(axis = 1)

Numpy

In []:

#create prefilled arrays
the_zeros = np.zeros((3,9))
the_ones = np.ones((3,9))

print the_zeros
print 20 * "-"
print the_ones

Numpy provides many vectorized functions to efficiently operate on arrays

In []:

print np.sin(np_arr)

In []:

print np.where(np_arr == np_arr.min()) # <--- indices of the minimum value

You can mask arrays easily (and the plotting library will recognize your masking)

In []:

masked_np_arr = np.ma.masked_where(np_arr > np.percentile(np_arr, 75), np_arr)
print masked_np_arr

Numpy: fancy indexing

In []:

arr = np.random.randn(3,5)
print "Sum of positive numbers: ", arr[arr > 0].sum()

print "Sum over (-0.1 <= arr <= 0.1): ", \
       arr[(arr >= -0.1) & (arr <= 0.1)].sum()

print "Sum over (-0.1 > arr) or (arr > 0.1): ", \
       arr[(arr < -0.1) | (arr > 0.1)].sum()

Exercises: Numpy

Generate a 10 x 3 array of random numbers. Count the number of positive and negative elements.
Checkout numpy for MATLAB users transition table.

SciPy packages

scipy.constants - Physical and mathematical constants
scipy.fftpack - Fourier transform
scipy.integrate - Integration routines
scipy.interpolate - Interpolation
scipy.io - Data input and output
scipy.linalg - Linear algebra routines
scipy.ndimage - n-dimensional image package
scipy.optimize - Optimization
scipy.signal - Signal processing
scipy.sparse - Sparse matrices
scipy.spatial - Spatial data structures and algorithms
scipy.special - Any special mathematical functions
scipy.stats - Statistics

And more ...

SciPy: ttest example

In []:

from scipy import stats
#generating random variable samples for different distributions
x1 = stats.norm.rvs(loc = 50, scale = 20, size=100)
x2 = stats.norm.rvs(loc=5, scale = 20, size=100)
print stats.ttest_ind(x1, x2)

In []:

x1 = stats.norm.rvs(loc = 50, scale = 20, size=100)
x2 = stats.norm.rvs(loc=45, scale = 20, size=100)
print stats.ttest_ind(x1, x2)

SciPy: integrate a function

$$ \int\limits_{-\infty}^{+\infty} e^{-x^2}dx \;=\; ? $$

In []:

from scipy import integrate

In []:

def the_func(x):
    return np.exp(-x ** 2)
print integrate.quad(the_func, -np.Inf, np.inf)

In []:

print "Exact Value sqrt(pi) = {0} ...".format(np.pi ** 0.5)

Modules and classes to operate on

File system (os, shutil, sys)
- create folder, list folder contents, check if file or folder exists
Strings (+, *, join, split, regular expressions)
Dates (datetime, timedelta, )

File system

In [12]:

import os
#print current directory
print os.getcwd()

/home/san/Python/CMOS-python-tutorial

File system

In [13]:

#get list of files in the current directory
flist = os.listdir(".")
print flist[:7]

['test.nc', 'custom.css', 'norm.daily.nao.index.b500101.current.ascii', 'images', 'default_transition.tpl', 'countries', 'monthly_mean_qc_0.1_1979_05_PR.rpn.nc']

File system

In [14]:

#Check if file exists
fname = flist[0]
print fname,":", os.path.isfile(fname), \
      os.path.isdir(fname),  \
      os.path.islink(fname)

test.nc : True False False

You might also find useful the following modules: sys, shutil, path

Netcdf4-python

The python module that for reading and writing NetCDF files in python, created by Jeff Whitaker.

Requires installation of C libraries:

NetCDF4
HDF5

Netcd4-python

Below is the example of creating a netcdf file using the netcdf4-python library.

In [15]:

from netCDF4 import Dataset
file_name = "test.nc"
if os.path.isfile(file_name):
    os.remove(file_name)
    
#open the file for writing, you can Also specify format="NETCDF4_CLASSIC" or "NETCDF3_CLASSIC"
#The format is NETCDF4 by default
ds = Dataset(file_name, mode="w")

Netcdf4-python: create dimensions

In [16]:

ds.createDimension("x", 20)
ds.createDimension("y", 20)
ds.createDimension("time", None)

Out[16]:

<netCDF4.Dimension at 0x3cb45a0>

Netcdf4-python: create variables

In [17]:

var1 = ds.createVariable("field1", "f4", ("time", "x", "y"))
var2 = ds.createVariable("field2", "f4", ("time", "x", "y"))

#add netcdf attributes
var1.units = "m/s"
var1.long_name = "zonal wind speed"

Write data to the file

In [18]:

#generate random data and tell the program where it should go
data = np.random.randn(10, 20, 20)
var1[:] = data
var2[:] = 10 * data + 10
#actually write data to the disk
ds.close();

In [19]:

%%bash
#check the file header:
ncdump -h test.nc

netcdf test {
dimensions:
	x = 20 ;
	y = 20 ;
	time = UNLIMITED ; // (10 currently)
variables:
	float field1(time, x, y) ;
		field1:units = "m/s" ;
		field1:long_name = "zonal wind speed" ;
	float field2(time, x, y) ;
}

Netcdf4-python: reading a netcdf file

Open the netcdf file for reading:

In [20]:

from netCDF4 import Dataset
ds = Dataset("test.nc")

Select variables of interest: no data loading happens at this point

In [21]:

#what variables are in the file
print ds.variables.keys()

#now data is a netcdf4 Variable object, which contain only links to the data
data1_var = ds.variables["field1"]
data2_var = ds.variables["field2"]
#You can query dimensions and shapes of the variables
print data1_var.dimensions, data1_var.shape

[u'field1', u'field2']
(u'time', u'x', u'y') (10, 20, 20)

Read data from the netcdf file for corresponding variables and time steps

In [22]:

#now we ask to really read the data into the memory
all_data = data1_var[:]
#print all_data.shape
data1 = data1_var[1,:,:]
data2 = data2_var[2,:,:]
print data1.shape, all_data.shape, all_data.mean(axis = 0).mean(axis = 0).mean(axis = 0)

(20, 20) (10, 20, 20) 0.0175713

Outline part II

Plotting libraries
- Matplotlib
- Basemap
Grouping and subsetting temporal data with pandas
Interpolation using cKDTree (KDTree) class
Speeding up your code

Matplotlib

The module for creating publication quality plots (mainly 2D), created by John Hunter.

Matplotlib gallery

An alternative is PyNGL - a wrapper around NCL developed at NCAR.

Let us draw the data we just saved to the netcdf file.

Do necessary imports and create configuration objects.

In [23]:

import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
from matplotlib import cm

levels = [-30,-10, -3,-1,0,1,3,10,30,40]
bn = BoundaryNorm(levels, len(levels) - 1)
cmap = cm.get_cmap("jet", len(levels) - 1);

Matplotlib: Preparations and options

In [24]:

#
font_size = 20
params = {
        'xtick.labelsize': font_size,
        'ytick.labelsize': font_size,
        'figure.figsize' : (9, 3)
}
plt.rcParams.update(params) #set back font size to the default value

def apply_some_formatting(axes, im):
    axes[1].set_yticks([])
    #format colorbar axes
    cax = axes[2]
    cax.set_aspect(20) 
    cax.set_anchor("W")
    cb = plt.colorbar(im2, cax = cax);

Matplotlib: plotting 2 fields with nonlinear and shared colorbar

In [25]:

fig, axes = plt.subplots(nrows=1, ncols=3)
im1 = axes[0].contourf(data1.transpose(), 
                       levels = levels, 
                       norm = bn, cmap = cmap)
im2 = axes[1].contourf(data2.transpose(), 
                       levels = levels, 
                       norm = bn, cmap = cmap)
apply_some_formatting(axes, im2)

Execises: Matplotlib

Reproduce the panel plot given in the example. (Try different colormaps)
Plot a timeseries of daily random data. (Show only month names as Jan, Feb, .. along the x-axis, make sure the labels do not overlap)
Plot CRU temperatures using contourf and pcolormesh. The file can be downloaded from here.

Pandas

Initially designed to process and analyse long timeseries

Author: Wes McKinney
Home page: pandas.pydata.org

Download and analyse teleconnection indexes

For pandas 1D examples we will load, plot and correlate different teleconnection indices (source).

In [26]:

#Set name to link mapping
name_to_link = {
"AAO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.aao.index.b790101.current.ascii",
"AO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.ao.index.b500101.current.ascii",
"NAO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.nao.index.b500101.current.ascii",
"PNA" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.pna.index.b500101.current.ascii"
}

Define a helper download function

In [27]:

def download_link(url, local_path):
    if os.path.isfile(local_path):
        return
    
    import urllib2
    s = urllib2.urlopen(url)
    with open(local_path, "wb") as local_file:
        local_file.write(s.read())

In [28]:

#Download data locally if needed
for name, link in name_to_link.iteritems():
    fname = os.path.basename(link)
    if not os.path.isfile(fname):
        download_link(link, fname)

In [29]:

#Read data into memory
import pandas as pd
df_list = []
for name, link in name_to_link.iteritems():
    fname = os.path.basename(link)
    dfi = pd.DataFrame.from_csv(fname, sep = r"\s+", index_col=[0,1,2], parse_dates=False)
    dfi.columns = [name]
    dfi = dfi.select(lambda t: "*" not in t[2]) # because some data is like: 2003  4 30*******
    dfi.index = dfi.index.map(lambda t: [int(ti) for ti in t])
    dfi.index = dfi.index.map(lambda t: datetime(*t))
    df_list.append(dfi)
df = pd.concat(df_list, axis = 1)
df = df.dropna()
df.head(3) #The periods

Out[29]:

	PNA	AAO	NAO	AO
1979-01-02	-0.319	-0.888	-0.838	-1.623
1979-01-03	-0.561	0.255	-0.538	-1.922
1979-01-04	-0.479	0.862	-0.225	-1.180

Plot the timeseries

In [30]:

df.plot();

On separate panels

In [31]:

axes = df.plot(subplots=True, figsize=(10,6), yticks=[-4, 0, 4], lw = 0.5, 
               style = ["g", "r", "b", "k"]);

What about correlations

In [32]:

df.corr()

Out[32]:

	PNA	AAO	NAO	AO
PNA	1.000000	-0.007499	0.036017	-0.159892
AAO	-0.007499	1.000000	0.049362	0.048930
NAO	0.036017	0.049362	1.000000	0.552575
AO	-0.159892	0.048930	0.552575	1.000000

Plot monthly mean indices

In [33]:

df_monthly = df.groupby(by = lambda d: datetime(d.year, d.month, 15)).mean()
df_monthly.hist(figsize=(10, 6), sharex = True, sharey = True);

Exercises: Pandas

Using the data in the examples calculate monthly climatologies of the indices and plot them. See if the correlations are different for the climatological values.
Further reading: usually we want do time selecting, grouping, merging on 3D or 4D fields. It is also possible with pandas (Checkout Panel data container).

Basemap

The author is Jeff Whitaker.

Basemap code is available on github.

Example gallery.

PyNGL - can be used as an alternative.

The usual workflow with the library:

Create a basemap object

basemap = Basemap(projection="...", ...)

Draw your field using:

basemap.contour(), basemap.contourf(), 
basemap.pcolormesh()

It provides utility functions to draw coastlines, meridians and shapefiles (only those that contain lat/lon coordinates), mask ocean points.
Drawing the colorbar is as easy as:
```
basemap.colorbar(img)
```

Basemap

Can be used for nice map backgrounds

In [35]:

from mpl_toolkits.basemap import Basemap
b = Basemap()
fig, (ax1, ax2) = plt.subplots(1,2)

b.warpimage(ax = ax1, scale = 0.1); 
im = b.etopo(ax = ax2, scale = 0.1);

Basemap: display model results in rotated lat/lon projection

In [36]:

f_name = "monthly_mean_qc_0.1_1979_05_PR.rpn.nc"
base_url = "http://scaweb.sca.uqam.ca/~huziy/example_data"
#Fetch the file if it is not there
url = os.path.join(base_url, f_name)
download_link(url, f_name)

#read the file and check what is inside
ds_pr = Dataset(f_name)
print ds_pr.variables.keys()

[u'rlon', u'rlat', u'level', u'time', u'lon', u'lat', u'rotated_pole', u'preacc']

Read data from the NetCDF file

In [37]:

rotpole = ds_pr.variables["rotated_pole"]
lon = ds_pr.variables["lon"][:]
lat = ds_pr.variables["lat"][:]
data = ds_pr.variables["preacc"][:].squeeze()
#rotpole.ncattrs(); - returns a list of netcdf attributes of a variable
print rotpole.grid_north_pole_latitude, \
      rotpole.grid_north_pole_longitude, \
      rotpole.north_pole_grid_longitude

37.8787 106.65 176.709

Creating a basemap object for the data

In [38]:

lon_0 = rotpole.grid_north_pole_longitude - 180
o_lon_p = rotpole.north_pole_grid_longitude
o_lat_p = rotpole.grid_north_pole_latitude
b = Basemap(projection="rotpole", 
            lon_0=lon_0, 
            o_lon_p = o_lon_p, 
            o_lat_p = o_lat_p,
            llcrnrlon = lon[0, 0], 
            llcrnrlat = lat[0, 0],
            urcrnrlon = lon[-1, -1], 
            urcrnrlat = lat[-1, -1], 
            resolution="l")

Converting to the projection coordinates and back to geographic coordinates

In [39]:

#(lat, lon) -> (x, y)
x, y = b(lon, lat)

#(x,y) -> (lat, lon)
loninv, latinv = b(x, y, inverse=True)

Define plotting function to reuse

In [40]:

def plot_data(x, y, data, ax):
    im = b.contourf(x, y, data, ax = ax)
    b.colorbar(im, ax = ax)
    b.drawcoastlines(linewidth=0.5);

Plot the data

In [41]:

fig, (ax1, ax2) = plt.subplots(1,2)
#plot data as is
plot_data(x, y, data, ax1)
#mask small values
to_plot = np.ma.masked_where(data <= 2, data) 
plot_data(x, y, to_plot, ax2)

Masking oceans is as easy as

In [42]:

from mpl_toolkits.basemap import maskoceans
lon1 = lon.copy()
lon1[lon1 > 180] = lon1[lon1 > 180] - 360
data_no_ocean = maskoceans(lon1, lat, data)

Plot masked field

In [43]:

fig, ax = plt.subplots(1,1)
plot_data(x, y, data_no_ocean, ax)

Basemap: plotting wind field (read and convert units)

In [44]:

download_link("http://scaweb.sca.uqam.ca/~huziy/example_data/wind.nc", "wind.nc")
ds_wind = Dataset("wind.nc")
u = ds_wind.variables["UU"][:]
v = ds_wind.variables["VV"][:]
coef_wind = 0.51444444444 #m/s in one knot
ds_wind.close()

Basemap: plotting wind field

In [45]:

fig = plt.figure()
fig.set_size_inches(6,8)
Q = b.quiver(x[::8, ::8], y[::8, ::8], 
             u[::8, ::8] * coef_wind, v[::8, ::8] * coef_wind)
# make quiver key.
qk = plt.quiverkey(Q, 0.35, 0.1, 5, '5 m/s', labelpos='N',
                   coordinates = "figure", 
                   fontproperties = dict(weight="bold"))
b.drawcoastlines(linewidth = 0.1);
plt.savefig("wind.png");

Basemap: plotting wind field (saved image, the legend is not cropped)

Exercises: Basemap

Plot the CRU temperatures (here) map using basemap.
Plot fields preacc and air from here using basemap. Add meridians and parallels as well as rivers to the map. Use subplots to put the plots side by side.
Plot field preacc from here.

Basemap: reading and visualizing shape files (Download data)

In [47]:

#download the shape files directory
local_folder = "countries"
remote_folder = 'http://scaweb.sca.uqam.ca/~huziy/example_data/countries/'
if not os.path.isfile(local_folder+"/cntry00.shp"):
    urlpath = urlopen(remote_folder)
    string = urlpath.read().decode('utf-8')
    pattern = re.compile(r'cntry00\...."')
    filelist = pattern.findall(string)
    filelist = [s[:-1] for s in filelist if not s.endswith('zip"')]
    
    if not os.path.isdir(local_folder):
        os.mkdir(local_folder)
    for fname in filelist:
        f_path = os.path.join(local_folder, fname)
        remote_f_path = os.path.join(remote_folder, fname)
        #download selected files
        download_link(remote_f_path, f_path)

Basemap: reading and visualizing shape files

In [48]:

bworld = Basemap()
shp_file = os.path.join(local_folder, "cntry00")
ncountries, _, _, _, linecollection = bworld.readshapefile(shp_file, "country")

Reading country polygons (and attributes) from the shape file, necessary imports

In [49]:

import fiona
from matplotlib.collections import PatchCollection
from shapely.geometry import MultiPolygon, shape
from matplotlib.patches import Polygon
import shapely
import matplotlib.cm as cm

Reading country polygons (and attributes) from the shape file, preparations

In [50]:

cmap_shape = cm.get_cmap("Greens", 10)
bounds = np.arange(0, 1.65, 0.15) * 1e9
bn_shape = BoundaryNorm(bounds, len(bounds) - 1)

bworld = Basemap()
populations = []; patches = []

def to_mpl_poly(shp_poly, apatches): 
    a = np.asarray(shp_poly.exterior)
    x, y = bworld(a[:, 0], a[:, 1])
    apatches.append(Polygon(zip(x, y)))

Reading country polygons (and attributes) from the shape file, preparations

In [51]:

with fiona.open('countries/cntry00.shp', 'r') as inp:
    for f in inp:
        the_population = f["properties"]["POP_CNTRY"]
        sh = shape(f['geometry'])
        
        if isinstance(sh, shapely.geometry.Polygon):
            to_mpl_poly(sh, patches)
            populations.append(the_population)
        elif isinstance(sh, shapely.geometry.MultiPolygon):
            for sh1 in sh.geoms:
                to_mpl_poly(sh1, patches)
                populations.append(the_population)

Coloring the countries based on the attribute value

In [52]:

from matplotlib.ticker import ScalarFormatter
sf = ScalarFormatter(useMathText=True)
fig = plt.figure(); ax = fig.add_subplot(111)
pcol = PatchCollection(patches, cmap = cmap_shape, norm = bn_shape)
pcol.set_array(np.array(populations))
ax.add_collection(pcol)
cb = bworld.colorbar(pcol, ticks = bounds, format = sf)
cb.ax.yaxis.get_offset_text().set_position((-2,0))
bworld.drawcoastlines(ax = ax, linewidth = 0);

Exercise: shape files

There is a script which is supposed to plot world population but it contains a bug. Copy it to your working directory and try to fix the bug. Try launching it and see if the produced image is correct.

cKDTree

Is a class defined in scipy.spatial package.
Alternatives: pyresample and basemap.interp

cKDTree - workflow

1. Convert all lat/lon coordinates to Cartesian coordinates

    (xs, ys, zs) #correspond to coordinates of the source grid
    (xt, yt, zt) #coordinates of the target grid
    #All x,y,z - are flattened 1d arrays

2. Create a cKDTree object representing source grid

    tree = cKDTree(data = zip(xs, ys, zs))

cKDTree - workflow

3. Query indices of the nearest target cells and distances to the corresponding source cells

dists, inds = tree.query(zip(xt, yt, zt), k = 1)
#k = 1, means find 1 nearest point

4. Get interpolated data (and reshape it to 2D)

data_target = data_source[inds].reshape(lon_target.shape)

cKDTree

For an example usage of cKDTree, please read my post at earthpy.org

There is a folowup to that post about pyresample here.

cKDTree: exercise

Interpolate CRU temperatures to the model grid, use inverse distance weighting for intepolation from 20 nearest points.

$$ T_t = \frac{\sum_{i=1}^{20}w_i T_{s,i}}{\sum_{i=1}^{20}w_i} $$

where $w_i = 1/d_i^2$, $d_i$ - is the distance between corresponding source and target grid points. Grid longitudes and latitudes are defined in this file. The observation temperature field is here.

Speeding up your code

Avoid loops and try to make use of Numpy/Scipy/Pandas functionality as much as possible
If you have a lot of computations the speedup can be achieved by delegating the computations to the NumExpr module.
You can use Cython to compile libraries to get C-like speed.
Use multiprocessing module if the tasks are fairly independent and cannot be solved using the above methods

Multiprocessing example

In [53]:

from multiprocessing import Pool

def func(x):
    return x ** 2

p1 = Pool(processes=1)
p2 = Pool(processes=2)

nums = np.arange(100000)

Tests of performance gain by using 2 processes instead of 1

In [54]:

%%timeit
p1.map(func, nums)

1 loops, best of 3: 1.04 s per loop

In [55]:

%%timeit
p2.map(func, nums)

1 loops, best of 3: 986 ms per loop

Squaring example using Numpy

In [56]:

%%timeit
sq = nums ** 2

10000 loops, best of 3: 120 µs per loop

Squaring example using NumExpr

In [57]:

import numexpr as ne
print ne.detect_number_of_cores()
print ne.evaluate("nums ** 2")[:5]

2
[ 0  1  4  9 16]

In [58]:

%%timeit
ne.evaluate("nums ** 2")

10000 loops, best of 3: 152 µs per loop

Squaring example using cythonized function

In [61]:

%load_ext cythonmagic

In [62]:

%%cython
def square_list(the_list):
    cdef list result = []
    cdef int i
    app_meth = result.append
    for i in the_list:
        app_meth(i ** 2)
    return result

In [63]:

%%timeit
square_list(nums)

100 loops, best of 3: 15.3 ms per loop

In [64]:

%%timeit
[i ** 2 for i in nums]

10 loops, best of 3: 128 ms per loop

Resources

Python: http://www.python.org/
IPython: http://ipython.org/
NumPy: http://www.numpy.org/ - there is a page NumPy for MATLAB users, might be useful for those familiar with MATLAB.
SciPy: http://www.scipy.org/
Scipy lectures
Scientific python lectures: http://scipy-lectures.github.io/
Matplotlib: http://matplotlib.org/
Scitools (iris and cartopy): http://scitools.org.uk/index.html

Resources

NetCDF4: http://netcdf4-python.googlecode.com/svn/trunk/docs/netCDF4-module.html
Basemap: http://matplotlib.org/basemap/
Notebook on shape files
My favourite python IDE: Pycharm, you might also checkout many other like Eclipse, Netbeans, Spyder...
On creating presentations using IPython and a lot of other things: Damian Avilla's blog.

In [65]:

#from IPython.display import HTML

#s = open("custom.css").read()
#s = "<style>" + s + "</style>"
#HTML(s)

In [65]:

Data handling and visualization using Python

Introduction and history

What is Python??

What is Python??

What is Python??

Example uses for Python

Talk on Python history by Python developer Guido Van Rossum here.

Some things I've done with python: lake level observations

More cool examples:

Outline part I

Getting started

Using python as a calculator

Using python as a calculator (scientific)

Quick overview: native types

Quick overview: Flow control

Quick overview: Flow control

Quick overview: Flow control

Quick overview: Syntax

Builtin data types

Strings

Strings

Strings

Strings: regular expressions

Dates

Dates: How long is the CMOS 2014?

You can generate dates for timeseries with arbitrary steps

You can use the calendar module for different convenience functions with dates

Example: how many days in a given month in 3001?

Follow an example

Follow an example

Exercises: Operations with strings, dates and file system

List comprehensions and sequences

Builtin data containers (lists)¶

Builtin data containers (dictionary)

Builtin data containers (dictionary)

Exercises: builtin containers

NumPy

NumPy

Numpy

Numpy provides many vectorized functions to efficiently operate on arrays

You can mask arrays easily (and the plotting library will recognize your masking)

Numpy: fancy indexing

Exercises: Numpy

SciPy packages

SciPy: ttest example

SciPy: integrate a function

Modules and classes to operate on

File system

File system

File system

Netcdf4-python

The python module that for reading and writing NetCDF files in python, created by Jeff Whitaker.

Requires installation of C libraries:

Netcd4-python

Below is the example of creating a netcdf file using the netcdf4-python library.

Netcdf4-python: create dimensions

Netcdf4-python: create variables

Write data to the file

Netcdf4-python: reading a netcdf file

Select variables of interest: no data loading happens at this point

Read data from the netcdf file for corresponding variables and time steps

Outline part II

Matplotlib

The module for creating publication quality plots (mainly 2D), created by John Hunter.

Matplotlib gallery

Let us draw the data we just saved to the netcdf file.

Matplotlib: Preparations and options

Matplotlib: plotting 2 fields with nonlinear and shared colorbar

Execises: Matplotlib

Pandas

Initially designed to process and analyse long timeseries

Author: Wes McKinney

Home page: pandas.pydata.org

Download and analyse teleconnection indexes

Define a helper download function

Plot the timeseries

On separate panels

What about correlations

Plot monthly mean indices

Exercises: Pandas