In [5]:
#Necessary imports
import os
import re
from datetime import datetime, timedelta
from netCDF4 import Dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Data handling and visualization using Python

Oleksandr (Sasha) Huziy and Jonathan Doyle

CMOS, June 2014

The presentation is available here: http://tinyurl.com/cmos2014

Introduction and history

What is Python??

"Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability." wiki)

"Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open." python.org

What is Python??

What is Python??

Readable and interpreted

  • Not compiled (computer executes in order)
  • Easy to develop, debug, and share

Easy to learn, powerful, and fast

  • High-level and dynamically typed
  • Interfaces with compiled languages (e.g., C, C++, Fortran)
  • Versitile libraries "under the hood"

Open and free

  • Community support and development
  • Skills transfer without overhead (no site lisence)

Example uses for Python

  • General programming (e.g., clock applet in Ubuntu)
  • Video games (e.g., Ubisoft in Montreal)
  • Web development (e.g., Pinterest, Instagram)
  • Data analysis (e.g., visulalization, multi dimensional data, statisitics)
  • Scripting (e.g,. automating system tasks, smart file manipulation)

Talk on Python history by Python developer Guido Van Rossum here.

Some things I've done with python: lake level observations

More cool examples:

Outline part I

  • Python basics

    • IPython and basic python syntax

    • Builtin data types

    • Operations on file system, strings and dates

  • Libraries for scientific computing

    • NumPy/SciPy
  • Handling NetCDF4 files

    • Netcdf4-python

Getting started

Open a new terminal: Ctrl-Alt-T (Ubuntu) or search for 'terminal' in the dash

Make a working directory

    $ mkdir python_workshop

Change into new directory

    $ cd python_workshop 

First create new python script (for later)

    $ gedit example01.py & 

List the files in directory

    $ ls

Launch the ipython terminal

    $ ipython

Using python as a calculator

In [6]:
# Try addition, subtraction, multiplcation, and divison
2 + 2
3 - 1
2 * 2
3 / 2;
In [7]:
# Careful with division!
# older python versions treat 2 as an integer and 2. as a float
3 / 2.;
In [8]:
# Try exponentials
2 ^ 3 
# this is actually an XOR NOT exponential!
2 ** 3
9 ** .5
Out[8]:
3.0

Using python as a calculator (scientific)

In [9]:
import math

# Create a variable for pi
pi = math.pi
print pi

math.cos(pi)

math.exp(1)
3.14159265359

Out[9]:
2.718281828459045

Use the 'Tab' auto-complete in iPython to list other math functions.

Note: math only works on scalars, not arrays (with arrays we will use numpy)

Quick overview: native types

Basic data types

string (str): 'This is a string', "'This' too"
int: 0, 1, 20, 1000
float: 0.0, 1.0, 20., 1000.0000001
bool: True, False

Data containers

list: [1,2,4, 5, "I am a fifth element"]
tuple: (1, 2, 3)
dictionary (dict): {'key':'value', 'key2':['list','of','values'], ...}

Extended types (requires an import)

datetime: datetime.datetime.now(), datetime.datetime.utcnow()
array: numpy.array([1,2,3,4])

Quick overview: Flow control

Python uses a four-space indent for loops, case statments, and function definition

Loops:

    for item in item_list:
        # Do something, for example:
        print item

     while condition==True:
         # Do something

Quick overview: Flow control

Case statements:

    
    if condition == True:
        # Do one thing, for example:
        print condition

    elif condition == False:
        # Do something else
        print 'This condition failed to pass'

    else:
        # If neither condition is satisfied
        print 'Warning, condition was not satisfied!'

Quick overview: Flow control

Function definition:

In [2]:
def my_func(input_a, input_b):
    """This is my function documentation"""
        
    output = input_a + input_b
        
    return output

Quick overview: Syntax

In [10]:
#Variable decalaration <-This is a comment
a = 10; b = "climate"; x = None;

x = [1,2,4, 5, "I am a fifth element"]

#looping
for item in x:
    
    #Checking conditions
    if type(item) == int:

        print(item ** 2 % 10),
        
    else:
        
        msg = ", but my index is {0}."
        msg = msg.format(x.index(item))
        
        print item + msg,

#now we are outside of the loop since no identation
print "\nSomething ..."
1 4 6 5 I am a fifth element, but my index is 4. 
Something ...

Builtin data types

Strings

In []:
s = "climate"
#reverse (also works for lists)
print s[-1::-1]
In []:
#Dynamically changing parts of a string (for left right justification use '>', '<')
tpl = "Formatting floats: {0:.3f};\nformatting integers: {1:05d};\nformatting dates: {2: %Y-%m-%d %H:%M}"
print tpl.format(789.3342353554, 32, datetime.now())

Strings

In []:
#Splitting
s = "This,is,a,sentence"
s_list = s.split(",")
print s_list

Strings

In []:
#joining a list of strings
list_of_fruits = ["apple", "banana", "cherry"]
x =   """
        I would like to eat 
        {0}. 
        And by the way I am a multiline string.
      """
print x.format(" or ".join(list_of_fruits))

Strings: regular expressions

In []:
#regular expressions module
import re
msg = "Find 192, numbers 278: and -7 and do smth w 89"
groups = re.findall(r"-?\d+", msg)
print groups 
print [float(el) for el in groups] #convert strings to floats
In []:
#regular expressions module
groups = re.findall(r"-?\d+/\d+|-?\d+\.\d+|-?\.?\d+|-?\d+\.?", "Find 192.28940, -2/3 numbers 278: and -7 and .005 w 89,fh.5 -354.")
print groups

Dates

In []:
#What time is it
from datetime import datetime
d = datetime.now(); print d
In []:
#hours and minutes
d.strftime("%H:%M"), d.strftime("%Hh%Mmin")

Dates: How long is the CMOS 2014?

In []:
start_date = datetime(2014, 6, 1, 8, 30)
end_date = datetime(2014, 6, 5, 17)
print end_date - start_date
In []:
#you can mutiply the interval by an integer
print (end_date - start_date) * 2

You can generate dates for timeseries with arbitrary steps

In []:
from datetime import timedelta
dt = timedelta(days = 5, hours = 6, minutes = 25)
d0 = datetime(2005, 4, 23)
[str(d0 + i * dt) for i in range(10)]

You can use the calendar module for different convenience functions with dates

Example: how many days in a given month in 3001?

In []:
import calendar
for month in range(1, 13):
    weekday_of_last_day, ndays = calendar.monthrange(3001, month)
    print ndays,

Follow an example

Data from the IML-4 buoy outside Rimouski provided by the St. Laurence Global Observatory

Download data here save it in the working directory python_workshop we created earlier.

Follow an example

Extract the and open the data to prepare for parsing.

$ tar -xzf IML-4_example.tar.gz

$ gedit IML-4_salinite.txt
    IML-4 — Water salinity —
    Date     Time (UTC)     Water salinity (PSU)
    2013-05-28     10:18:22 PM     24.7712
    2013-05-28     10:33:22 PM     24.7736
    2013-05-28     10:48:22 PM     24.7924
    ...
    2013-10-30     04:33:22 PM     25.7566
    2013-10-30     04:48:22 PM     25.7388

Follow along with the example, the completed examples are available for reference in the complete_examlpes directory.

Exercises: Operations with strings, dates and file system

  • Check what this statement prints
    print 5 * "123"
  • Count the number of files and number of folders in your home or in the current directory.

  • Get the sum of digits (not numbers) in the sentence below using module re.

    This is a 78 sentence very confusing (1) which might change232 so you need to 346 automate the procedure of extracting the /21e893/ useful information from here10.21?

  • Figure out on which day of week you were born (see the calendar module), you can have fun by determining on which day of week you'll have your next birthday.

  • Assume you have a timeseries of length 10000. You know that it starts on 2009-01-02 at 18:00 and the time step is 17 hours 25 minutes, what is the date and time corresponding to the last point of the timeseries.

List comprehensions and sequences

In []:
#square eleaments of a list
#list comprehension
print [the_el ** 2 for the_el in the_list] 
In []:
#Generating list or iterable of integers
print range(1,20)

Builtin data containers (lists)

In []:
the_list = [3,6,78,9,0,32,13]
##There are some utility functions that can be applied to lists
the_list.sort(reverse = True)
print the_list
In []:
#loop through several lists at the same time
for el1, el2 in zip(the_list, other_list):
    print(el1+el2),

Builtin data containers (dictionary)

In []:
#dictionary
author_to_books = { 
"Stephen King": 
    ["Carrie","On writing","Green Mile"],

"Richard Feynman": 
    ["Lectures on computation", 
     "The pleasure of finding things out"]
}
#add elements to a dictionary
author_to_books["Andrey Kolmogorov"] = \
        ["Foundations Of The Theory Of Prob..."]

Builtin data containers (dictionary)

In []:
#print the list of authors
print author_to_books.keys()
In []:
#Iterate over keys and values
for author, book_list in author_to_books.iteritems():
    print author, " wrote ", book_list

Exercises: builtin containers

  • Find a sum of squares of all odd integers that are smaller than 100 in one line of code (Hint: use list comprehensions)

  • Check what this statement prints

    print 10 * [18,]

NumPy

The library for fast manipulations with big arrays that fit into memory.

In []:
import numpy as np
#Creating a numpy array from list
np_arr = np.asarray([253,1.9,23, 4, 3.5,6,7,86, 18.9, 10, 89, 99])
print np_arr
print np_arr + 1
print np_arr ** 2
print np_arr + 5 * np_arr

NumPy

In []:
#Reshape
np_arr.shape = (3,4)
print np_arr
In []:
#sum along a specified dimension, here 
print np_arr.sum(axis = 1)

Numpy

In []:
#create prefilled arrays
the_zeros = np.zeros((3,9))
the_ones = np.ones((3,9))

print the_zeros
print 20 * "-"
print the_ones

Numpy provides many vectorized functions to efficiently operate on arrays

In []:
print np.sin(np_arr)
In []:
print np.where(np_arr == np_arr.min()) # <--- indices of the minimum value

You can mask arrays easily (and the plotting library will recognize your masking)

In []:
masked_np_arr = np.ma.masked_where(np_arr > np.percentile(np_arr, 75), np_arr)
print masked_np_arr

Numpy: fancy indexing

In []:
arr = np.random.randn(3,5)
print "Sum of positive numbers: ", arr[arr > 0].sum()

print "Sum over (-0.1 <= arr <= 0.1): ", \
       arr[(arr >= -0.1) & (arr <= 0.1)].sum()

print "Sum over (-0.1 > arr) or (arr > 0.1): ", \
       arr[(arr < -0.1) | (arr > 0.1)].sum()

Exercises: Numpy

  • Generate a 10 x 3 array of random numbers. Count the number of positive and negative elements.
  • Checkout numpy for MATLAB users transition table.

SciPy packages

  • scipy.constants - Physical and mathematical constants
  • scipy.fftpack - Fourier transform
  • scipy.integrate - Integration routines
  • scipy.interpolate - Interpolation
  • scipy.io - Data input and output
  • scipy.linalg - Linear algebra routines
  • scipy.ndimage - n-dimensional image package
  • scipy.optimize - Optimization
  • scipy.signal - Signal processing
  • scipy.sparse - Sparse matrices
  • scipy.spatial - Spatial data structures and algorithms
  • scipy.special - Any special mathematical functions
  • scipy.stats - Statistics

And more ...

SciPy: ttest example

In []:
from scipy import stats
#generating random variable samples for different distributions
x1 = stats.norm.rvs(loc = 50, scale = 20, size=100)
x2 = stats.norm.rvs(loc=5, scale = 20, size=100)
print stats.ttest_ind(x1, x2)
In []:
x1 = stats.norm.rvs(loc = 50, scale = 20, size=100)
x2 = stats.norm.rvs(loc=45, scale = 20, size=100)
print stats.ttest_ind(x1, x2)

SciPy: integrate a function

$$ \int\limits_{-\infty}^{+\infty} e^{-x^2}dx \;=\; ? $$

In []:
from scipy import integrate
In []:
def the_func(x):
    return np.exp(-x ** 2)
print integrate.quad(the_func, -np.Inf, np.inf)
In []:
print "Exact Value sqrt(pi) = {0} ...".format(np.pi ** 0.5)

Modules and classes to operate on

  • File system (os, shutil, sys)

    • create folder, list folder contents, check if file or folder exists
  • Strings (+, *, join, split, regular expressions)

  • Dates (datetime, timedelta, )

File system

In [12]:
import os
#print current directory
print os.getcwd()
/home/san/Python/CMOS-python-tutorial

File system

In [13]:
#get list of files in the current directory
flist = os.listdir(".")
print flist[:7]
['test.nc', 'custom.css', 'norm.daily.nao.index.b500101.current.ascii', 'images', 'default_transition.tpl', 'countries', 'monthly_mean_qc_0.1_1979_05_PR.rpn.nc']

File system

In [14]:
#Check if file exists
fname = flist[0]
print fname,":", os.path.isfile(fname), \
      os.path.isdir(fname),  \
      os.path.islink(fname)
test.nc : True False False

You might also find useful the following modules: sys, shutil, path

Netcdf4-python

The python module that for reading and writing NetCDF files in python, created by Jeff Whitaker.

Requires installation of C libraries:

  • NetCDF4

  • HDF5

Netcd4-python

Below is the example of creating a netcdf file using the netcdf4-python library.

In [15]:
from netCDF4 import Dataset
file_name = "test.nc"
if os.path.isfile(file_name):
    os.remove(file_name)
    
#open the file for writing, you can Also specify format="NETCDF4_CLASSIC" or "NETCDF3_CLASSIC"
#The format is NETCDF4 by default
ds = Dataset(file_name, mode="w")

Netcdf4-python: create dimensions

In [16]:
ds.createDimension("x", 20)
ds.createDimension("y", 20)
ds.createDimension("time", None)
Out[16]:
<netCDF4.Dimension at 0x3cb45a0>

Netcdf4-python: create variables

In [17]:
var1 = ds.createVariable("field1", "f4", ("time", "x", "y"))
var2 = ds.createVariable("field2", "f4", ("time", "x", "y"))

#add netcdf attributes
var1.units = "m/s"
var1.long_name = "zonal wind speed"

Write data to the file

In [18]:
#generate random data and tell the program where it should go
data = np.random.randn(10, 20, 20)
var1[:] = data
var2[:] = 10 * data + 10
#actually write data to the disk
ds.close();
In [19]:
%%bash
#check the file header:
ncdump -h test.nc
netcdf test {
dimensions:
	x = 20 ;
	y = 20 ;
	time = UNLIMITED ; // (10 currently)
variables:
	float field1(time, x, y) ;
		field1:units = "m/s" ;
		field1:long_name = "zonal wind speed" ;
	float field2(time, x, y) ;
}

Netcdf4-python: reading a netcdf file

Open the netcdf file for reading:

In [20]:
from netCDF4 import Dataset
ds = Dataset("test.nc")

Select variables of interest: no data loading happens at this point

In [21]:
#what variables are in the file
print ds.variables.keys()

#now data is a netcdf4 Variable object, which contain only links to the data
data1_var = ds.variables["field1"]
data2_var = ds.variables["field2"]
#You can query dimensions and shapes of the variables
print data1_var.dimensions, data1_var.shape
[u'field1', u'field2']
(u'time', u'x', u'y') (10, 20, 20)

Read data from the netcdf file for corresponding variables and time steps

In [22]:
#now we ask to really read the data into the memory
all_data = data1_var[:]
#print all_data.shape
data1 = data1_var[1,:,:]
data2 = data2_var[2,:,:]
print data1.shape, all_data.shape, all_data.mean(axis = 0).mean(axis = 0).mean(axis = 0)
(20, 20) (10, 20, 20) 0.0175713

Outline part II

  • Plotting libraries

    • Matplotlib

    • Basemap

  • Grouping and subsetting temporal data with pandas

  • Interpolation using cKDTree (KDTree) class

  • Speeding up your code

Matplotlib

The module for creating publication quality plots (mainly 2D), created by John Hunter.

An alternative is PyNGL - a wrapper around NCL developed at NCAR.

Let us draw the data we just saved to the netcdf file.

Do necessary imports and create configuration objects.

In [23]:
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm
from matplotlib import cm

levels = [-30,-10, -3,-1,0,1,3,10,30,40]
bn = BoundaryNorm(levels, len(levels) - 1)
cmap = cm.get_cmap("jet", len(levels) - 1);

Matplotlib: Preparations and options

In [24]:
#
font_size = 20
params = {
        'xtick.labelsize': font_size,
        'ytick.labelsize': font_size,
        'figure.figsize' : (9, 3)
}
plt.rcParams.update(params) #set back font size to the default value

def apply_some_formatting(axes, im):
    axes[1].set_yticks([])
    #format colorbar axes
    cax = axes[2]
    cax.set_aspect(20) 
    cax.set_anchor("W")
    cb = plt.colorbar(im2, cax = cax);

Matplotlib: plotting 2 fields with nonlinear and shared colorbar

In [25]:
fig, axes = plt.subplots(nrows=1, ncols=3)
im1 = axes[0].contourf(data1.transpose(), 
                       levels = levels, 
                       norm = bn, cmap = cmap)
im2 = axes[1].contourf(data2.transpose(), 
                       levels = levels, 
                       norm = bn, cmap = cmap)
apply_some_formatting(axes, im2)

Execises: Matplotlib

  • Reproduce the panel plot given in the example. (Try different colormaps)

  • Plot a timeseries of daily random data. (Show only month names as Jan, Feb, .. along the x-axis, make sure the labels do not overlap)

  • Plot CRU temperatures using contourf and pcolormesh. The file can be downloaded from here.

Pandas

  • Initially designed to process and analyse long timeseries

Download and analyse teleconnection indexes

For pandas 1D examples we will load, plot and correlate different teleconnection indices (source).

In [26]:
#Set name to link mapping
name_to_link = {
"AAO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.aao.index.b790101.current.ascii",
"AO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.ao.index.b500101.current.ascii",
"NAO" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.nao.index.b500101.current.ascii",
"PNA" : "ftp://ftp.cpc.ncep.noaa.gov/cwlinks/norm.daily.pna.index.b500101.current.ascii"
}

Define a helper download function

In [27]:
def download_link(url, local_path):
    if os.path.isfile(local_path):
        return
    
    import urllib2
    s = urllib2.urlopen(url)
    with open(local_path, "wb") as local_file:
        local_file.write(s.read())   
In [28]:
#Download data locally if needed
for name, link in name_to_link.iteritems():
    fname = os.path.basename(link)
    if not os.path.isfile(fname):
        download_link(link, fname)
In [29]:
#Read data into memory
import pandas as pd
df_list = []
for name, link in name_to_link.iteritems():
    fname = os.path.basename(link)
    dfi = pd.DataFrame.from_csv(fname, sep = r"\s+", index_col=[0,1,2], parse_dates=False)
    dfi.columns = [name]
    dfi = dfi.select(lambda t: "*" not in t[2]) # because some data is like: 2003  4 30*******
    dfi.index = dfi.index.map(lambda t: [int(ti) for ti in t])
    dfi.index = dfi.index.map(lambda t: datetime(*t))
    df_list.append(dfi)
df = pd.concat(df_list, axis = 1)
df = df.dropna()
df.head(3) #The periods 
Out[29]:
PNA AAO NAO AO
1979-01-02 -0.319 -0.888 -0.838 -1.623
1979-01-03 -0.561 0.255 -0.538 -1.922
1979-01-04 -0.479 0.862 -0.225 -1.180

Plot the timeseries

In [30]:
df.plot();

On separate panels

In [31]:
axes = df.plot(subplots=True, figsize=(10,6), yticks=[-4, 0, 4], lw = 0.5, 
               style = ["g", "r", "b", "k"]);

What about correlations

In [32]:
df.corr()
Out[32]:
PNA AAO NAO AO
PNA 1.000000 -0.007499 0.036017 -0.159892
AAO -0.007499 1.000000 0.049362 0.048930
NAO 0.036017 0.049362 1.000000 0.552575
AO -0.159892 0.048930 0.552575 1.000000

Plot monthly mean indices

In [33]:
df_monthly = df.groupby(by = lambda d: datetime(d.year, d.month, 15)).mean()
df_monthly.hist(figsize=(10, 6), sharex = True, sharey = True);

Exercises: Pandas

  • Using the data in the examples calculate monthly climatologies of the indices and plot them. See if the correlations are different for the climatological values.

  • Further reading: usually we want do time selecting, grouping, merging on 3D or 4D fields. It is also possible with pandas (Checkout Panel data container).

In [34]:
import pandas as pd
from matplotlib.dates import DateFormatter
df_monthclim = df.groupby(by = lambda d: datetime(2001, d.month, 15)).mean()
df_monthclim.plot(lw=2)
ax = plt.gca()
ax.xaxis.set_major_formatter(DateFormatter("%b"))
print df_monthclim.corr()
          PNA       AAO       NAO        AO
PNA  1.000000  0.392216  0.836360 -0.485964
AAO  0.392216  1.000000  0.124929 -0.604202
NAO  0.836360  0.124929  1.000000 -0.348045
AO  -0.485964 -0.604202 -0.348045  1.000000

Basemap

The author is Jeff Whitaker.

Basemap code is available on github.

PyNGL - can be used as an alternative.

The usual workflow with the library:

  • Create a basemap object
    basemap = Basemap(projection="...", ...)
    
  • Draw your field using:

    basemap.contour(), basemap.contourf(), 
    basemap.pcolormesh()
    
  • It provides utility functions to draw coastlines, meridians and shapefiles (only those that contain lat/lon coordinates), mask ocean points.

  • Drawing the colorbar is as easy as:

    basemap.colorbar(img)
    

Basemap

Can be used for nice map backgrounds

In [35]:
from mpl_toolkits.basemap import Basemap
b = Basemap()
fig, (ax1, ax2) = plt.subplots(1,2)

b.warpimage(ax = ax1, scale = 0.1); 
im = b.etopo(ax = ax2, scale = 0.1); 

Basemap: display model results in rotated lat/lon projection

In [36]:
f_name = "monthly_mean_qc_0.1_1979_05_PR.rpn.nc"
base_url = "http://scaweb.sca.uqam.ca/~huziy/example_data"
#Fetch the file if it is not there
url = os.path.join(base_url, f_name)
download_link(url, f_name)

#read the file and check what is inside
ds_pr = Dataset(f_name)
print ds_pr.variables.keys()
[u'rlon', u'rlat', u'level', u'time', u'lon', u'lat', u'rotated_pole', u'preacc']

Read data from the NetCDF file

In [37]:
rotpole = ds_pr.variables["rotated_pole"]
lon = ds_pr.variables["lon"][:]
lat = ds_pr.variables["lat"][:]
data = ds_pr.variables["preacc"][:].squeeze()
#rotpole.ncattrs(); - returns a list of netcdf attributes of a variable
print rotpole.grid_north_pole_latitude, \
      rotpole.grid_north_pole_longitude, \
      rotpole.north_pole_grid_longitude
37.8787 106.65 176.709

Creating a basemap object for the data

In [38]:
lon_0 = rotpole.grid_north_pole_longitude - 180
o_lon_p = rotpole.north_pole_grid_longitude
o_lat_p = rotpole.grid_north_pole_latitude
b = Basemap(projection="rotpole", 
            lon_0=lon_0, 
            o_lon_p = o_lon_p, 
            o_lat_p = o_lat_p,
            llcrnrlon = lon[0, 0], 
            llcrnrlat = lat[0, 0],
            urcrnrlon = lon[-1, -1], 
            urcrnrlat = lat[-1, -1], 
            resolution="l")

Converting to the projection coordinates and back to geographic coordinates

In [39]:
#(lat, lon) -> (x, y)
x, y = b(lon, lat)

#(x,y) -> (lat, lon)
loninv, latinv = b(x, y, inverse=True)

Define plotting function to reuse

In [40]:
def plot_data(x, y, data, ax):
    im = b.contourf(x, y, data, ax = ax)
    b.colorbar(im, ax = ax)
    b.drawcoastlines(linewidth=0.5);

Plot the data

In [41]:
fig, (ax1, ax2) = plt.subplots(1,2)
#plot data as is
plot_data(x, y, data, ax1)
#mask small values
to_plot = np.ma.masked_where(data <= 2, data) 
plot_data(x, y, to_plot, ax2)

Masking oceans is as easy as

In [42]:
from mpl_toolkits.basemap import maskoceans
lon1 = lon.copy()
lon1[lon1 > 180] = lon1[lon1 > 180] - 360
data_no_ocean = maskoceans(lon1, lat, data)

Plot masked field

In [43]:
fig, ax = plt.subplots(1,1)
plot_data(x, y, data_no_ocean, ax)

Basemap: plotting wind field (read and convert units)

In [44]:
download_link("http://scaweb.sca.uqam.ca/~huziy/example_data/wind.nc", "wind.nc")
ds_wind = Dataset("wind.nc")
u = ds_wind.variables["UU"][:]
v = ds_wind.variables["VV"][:]
coef_wind = 0.51444444444 #m/s in one knot
ds_wind.close()

Basemap: plotting wind field

In [45]:
fig = plt.figure()
fig.set_size_inches(6,8)
Q = b.quiver(x[::8, ::8], y[::8, ::8], 
             u[::8, ::8] * coef_wind, v[::8, ::8] * coef_wind)
# make quiver key.
qk = plt.quiverkey(Q, 0.35, 0.1, 5, '5 m/s', labelpos='N',
                   coordinates = "figure", 
                   fontproperties = dict(weight="bold"))
b.drawcoastlines(linewidth = 0.1);
plt.savefig("wind.png");

Basemap: plotting wind field (saved image, the legend is not cropped)

Exercises: Basemap

  • Plot the CRU temperatures (here) map using basemap.

  • Plot fields preacc and air from here using basemap. Add meridians and parallels as well as rivers to the map. Use subplots to put the plots side by side.

  • Plot field preacc from here.

Basemap: reading and visualizing shape files (Download data)

In [46]:
from urllib2 import urlopen
import re, os
In [47]:
#download the shape files directory
local_folder = "countries"
remote_folder = 'http://scaweb.sca.uqam.ca/~huziy/example_data/countries/'
if not os.path.isfile(local_folder+"/cntry00.shp"):
    urlpath = urlopen(remote_folder)
    string = urlpath.read().decode('utf-8')
    pattern = re.compile(r'cntry00\...."')
    filelist = pattern.findall(string)
    filelist = [s[:-1] for s in filelist if not s.endswith('zip"')]
    
    if not os.path.isdir(local_folder):
        os.mkdir(local_folder)
    for fname in filelist:
        f_path = os.path.join(local_folder, fname)
        remote_f_path = os.path.join(remote_folder, fname)
        #download selected files
        download_link(remote_f_path, f_path)

Basemap: reading and visualizing shape files

In [48]:
bworld = Basemap()
shp_file = os.path.join(local_folder, "cntry00")
ncountries, _, _, _, linecollection = bworld.readshapefile(shp_file, "country")

Reading country polygons (and attributes) from the shape file, necessary imports

In [49]:
import fiona
from matplotlib.collections import PatchCollection
from shapely.geometry import MultiPolygon, shape
from matplotlib.patches import Polygon
import shapely
import matplotlib.cm as cm

Reading country polygons (and attributes) from the shape file, preparations

In [50]:
cmap_shape = cm.get_cmap("Greens", 10)
bounds = np.arange(0, 1.65, 0.15) * 1e9
bn_shape = BoundaryNorm(bounds, len(bounds) - 1)

bworld = Basemap()
populations = []; patches = []

def to_mpl_poly(shp_poly, apatches): 
    a = np.asarray(shp_poly.exterior)
    x, y = bworld(a[:, 0], a[:, 1])
    apatches.append(Polygon(zip(x, y)))

Reading country polygons (and attributes) from the shape file, preparations

In [51]:
with fiona.open('countries/cntry00.shp', 'r') as inp:
    for f in inp:
        the_population = f["properties"]["POP_CNTRY"]
        sh = shape(f['geometry'])
        
        if isinstance(sh, shapely.geometry.Polygon):
            to_mpl_poly(sh, patches)
            populations.append(the_population)
        elif isinstance(sh, shapely.geometry.MultiPolygon):
            for sh1 in sh.geoms:
                to_mpl_poly(sh1, patches)
                populations.append(the_population)

Coloring the countries based on the attribute value

In [52]:
from matplotlib.ticker import ScalarFormatter
sf = ScalarFormatter(useMathText=True)
fig = plt.figure(); ax = fig.add_subplot(111)
pcol = PatchCollection(patches, cmap = cmap_shape, norm = bn_shape)
pcol.set_array(np.array(populations))
ax.add_collection(pcol)
cb = bworld.colorbar(pcol, ticks = bounds, format = sf)
cb.ax.yaxis.get_offset_text().set_position((-2,0))
bworld.drawcoastlines(ax = ax, linewidth = 0);

Exercise: shape files

  • There is a script which is supposed to plot world population but it contains a bug. Copy it to your working directory and try to fix the bug. Try launching it and see if the produced image is correct.

cKDTree

  • Is a class defined in scipy.spatial package.

  • Alternatives: pyresample and basemap.interp

cKDTree - workflow

1. Convert all lat/lon coordinates to Cartesian coordinates

    (xs, ys, zs) #correspond to coordinates of the source grid
    (xt, yt, zt) #coordinates of the target grid
    #All x,y,z - are flattened 1d arrays  

2. Create a cKDTree object representing source grid

    tree = cKDTree(data = zip(xs, ys, zs))

cKDTree - workflow

3. Query indices of the nearest target cells and distances to the corresponding source cells

dists, inds = tree.query(zip(xt, yt, zt), k = 1)
#k = 1, means find 1 nearest point

4. Get interpolated data (and reshape it to 2D)

data_target = data_source[inds].reshape(lon_target.shape)

cKDTree

  • For an example usage of cKDTree, please read my post at earthpy.org
  • There is a folowup to that post about pyresample here.

cKDTree: exercise

  • Interpolate CRU temperatures to the model grid, use inverse distance weighting for intepolation from 20 nearest points.

    $$ T_t = \frac{\sum_{i=1}^{20}w_i T_{s,i}}{\sum_{i=1}^{20}w_i} $$

    where $w_i = 1/d_i^2$, $d_i$ - is the distance between corresponding source and target grid points. Grid longitudes and latitudes are defined in this file. The observation temperature field is here.

Speeding up your code

  • Avoid loops and try to make use of Numpy/Scipy/Pandas functionality as much as possible
  • If you have a lot of computations the speedup can be achieved by delegating the computations to the NumExpr module.
  • You can use Cython to compile libraries to get C-like speed.
  • Use multiprocessing module if the tasks are fairly independent and cannot be solved using the above methods

Multiprocessing example

In [53]:
from multiprocessing import Pool

def func(x):
    return x ** 2

p1 = Pool(processes=1)
p2 = Pool(processes=2)

nums = np.arange(100000)

Tests of performance gain by using 2 processes instead of 1

In [54]:
%%timeit
p1.map(func, nums)
1 loops, best of 3: 1.04 s per loop

In [55]:
%%timeit
p2.map(func, nums)
1 loops, best of 3: 986 ms per loop

Squaring example using Numpy

In [56]:
%%timeit
sq = nums ** 2
10000 loops, best of 3: 120 µs per loop

Squaring example using NumExpr

In [57]:
import numexpr as ne
print ne.detect_number_of_cores()
print ne.evaluate("nums ** 2")[:5]
2
[ 0  1  4  9 16]

In [58]:
%%timeit
ne.evaluate("nums ** 2")
10000 loops, best of 3: 152 µs per loop

In [59]:
ne.set_num_threads(1)
Out[59]:
2
In [60]:
%%timeit
ne.evaluate("nums**2")
10000 loops, best of 3: 151 µs per loop

Squaring example using cythonized function

In [61]:
%load_ext cythonmagic
In [62]:
%%cython
def square_list(the_list):
    cdef list result = []
    cdef int i
    app_meth = result.append
    for i in the_list:
        app_meth(i ** 2)
    return result
In [63]:
%%timeit
square_list(nums)
100 loops, best of 3: 15.3 ms per loop

In [64]:
%%timeit
[i ** 2 for i in nums]
10 loops, best of 3: 128 ms per loop

Resources

Resources

In [65]:
#from IPython.display import HTML

#s = open("custom.css").read()
#s = "<style>" + s + "</style>"
#HTML(s)
In [65]: