Go to top, next, previous, or johnstachurski.net

SciPy

SciPy provides functions and other objects that work with NumPy arrays:

Very impressive package

Currently being actively developed

SciPy versus NumPy

When we import SciPy, we also get NumPy

Here's the relevant lines of the SciPy initialization file:

# Import numpy symbols to scipy name space
from numpy import *
from numpy.random import rand, randn
from numpy.fft import fft, ifft
from numpy.lib.scimath import *
# Remove the linalg imported from numpy so that the scipy.linalg package can be
# imported.
del linalg

Most of SciPy resides in submodules

In the current set up these are not imported into the scipy namespace

Means that you have to import them separately

import scipy.optimize
from scipy.integrate import quad

Which to use, NumPy or SciPy?

SciPy includes NumPy, so you can just import SciPy

But personally I like NumPy because it is small and stable

I usually start my programs with

import numpy as np

Then I import bits and pieces from SciPy as needed

from scipy.integrate import quad
from scipy.optimize import brentq
# etc

Statistics

The scipy.stats package supplies

Random Variables and Distributions

Recall that numpy.random provides functions for generating random variables

>>> a = 5
>>> b = 5
>>> np.random.beta(a, b, size=3)
array([ 0.23607443,  0.90941746,  0.49067643])

This generates a draw from the distribution (density)



However, sometimes we need access to the density itself, or the cdf, the quantiles, etc.

For this we can use scipy.stats

Here's an example

import numpy as np
from scipy.stats import beta
from matplotlib.pyplot import hist, plot, show
q = beta(5, 5)      # Beta(a, b), with a = b = 5
obs = q.rvs(2000)   # 2000 observations
hist(obs, bins=40, normed=True)
grid = np.linspace(0.01, 0.99, 100)
plot(grid, q.pdf(grid), 'k-', linewidth=2)
show()

The following plot is produced


In this code we created an rv_frozen object by the call

q = beta(5, 5)

The general syntax is

identifier = scipy.stats.distribution_name(shape_parameters)

where distribution_name is one of the distribution names in dir(scipy.stats)

The shape parameters depend on the specific distribution---use the online help

After that we called the methods rvs and pdf, which give random variates and the density respectively

Other methods include cdf, moments, etc. See online help

In fact there are also two keyword arguments, loc and scale:

identifier = scipy.stats.distribution_name(shape_parameters, `loc=c`, `scale=d`)

These transform the original random variable X into



The methods rvs, pdf, cdf, etc. are transformed accordingly

Note that there is an alternative way of calling the methods described above

For example, the previous code can be replaced by

import numpy as np
from scipy.stats import beta
from matplotlib.pyplot import hist, plot, show
obs = beta.rvs(5, 5, size=2000)   # 2000 observations
hist(obs, bins=40, normed=True)
grid = np.linspace(0.01, 0.99, 100)
plot(grid, beta.pdf(grid, 5, 5), 'k-', linewidth=2)
show()

Other Goodies in scipy.stats

There are lots of other functions and routines in scipy.stats

A useful one is scipy.stats.linregress

from scipy import stats
x = # some array like object (list, tuple, NumPy array, etc)
y = # some array like object (list, tuple, NumPy array, etc)
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)

However, probably the best way to do statistics in Python is to use rpy2

Roots and Fixed Points

Some background on rates of convergence

A sequence (xn) converges to x linearly if



For example, if λ = 1/2 then the distance between the current value and the limit is at least halved at each step

The sequence converges to x with order q



If q = 2, then called quadratic convergence

The higher is q the better the rate

Case Study: Root finding

A root of f on [a, b] is an x such that f(x) = 0.

How to find one?

To simplify matters, suppose that we have a function f on [a, b] which

Bisection

A simple algorithm is bisection

You know the game where

This is bisection. I leave further explanation to Wikipedia

Here's an implementation of bisection for the function described above

## Filename: bisection.py
## Author: John Stachurski

def bisect(f, a, b, tol=10e-5):
    lower, upper = a, b
    while upper - lower > tol:
        middle = 0.5 * (upper + lower)
        if f(middle) > 0: # root is between lower and middle
            lower, upper = lower, middle
        else:  # Root is between middle and upper
            lower, upper = middle, upper
    print 'Root:', lower

Problem:

Recall the concept of recusive function calls from this lecture

Write a recursive implementation of the bisection function

Solution:

## Filename: bisection2.py
## Author: John Stachurski

def bisect(f, a, b, tol=10e-5):
    lower, upper = a, b
    if upper - lower < tol:
        print 'Root:', lower
    else:
        middle = 0.5 * (upper + lower)
        print 'lower = %.2f, middle = %.2f, upper = %.2f' \
                % (lower, middle, upper)
        if f(middle) > 0: # root is between lower and middle
            bisect(f, lower, middle)
        else:  # Root is between middle and upper
            bisect(f, middle, upper)

Let's test this implementation

import bisection2
from scipy.stats import beta
from random import uniform

c = uniform(0, 1)
f = lambda x: beta.cdf(x, 5, 5) - c
bisection2.bisect(f, 0, 1)

Here's the output

lower = 0.00, middle = 0.50, upper = 1.00
lower = 0.00, middle = 0.25, upper = 0.50
lower = 0.25, middle = 0.38, upper = 0.50
lower = 0.38, middle = 0.44, upper = 0.50
lower = 0.44, middle = 0.47, upper = 0.50
## some lines omitted
Root: 0.448425292969

Analysis

This bisection algorithm will always converge to the root in our setting

However, the rate of convergence is only linear

An alternative would be to use the Newton-Raphson method

Under certain conditions on the second derivative, convergence is quadratic

But the algorithm is less robust

For some functions it does not converge (cycles, etc.)

This illustrates a general principle

In practice, most default algorithms for root finding, optimization and fixed points use hybrid methods

Roots and Fixed Points in Scipy

Regarding the scalar (i.e., univariate) case:

In scipy.optimize we find bisect, newton, brentq and some others

Here brentq is a hybrid method, and a good default

Let's test them

from scipy.stats import beta
from random import uniform
import bisection2
from scipy.optimize import bisect, newton, brentq

print 'Our implementation:'
c = uniform(0, 1)
f = lambda x: beta.cdf(x, 5, 5) - c
bisection2.bisect(f, 0, 1)
print 'Scipy bisect:', bisect(f, 0, 1)
print 'Scipy newton:', newton(f, 0.5)  # 0.5 = initial condition
print 'Scipy brentq:', brentq(f, 0, 1)

Here's the output

Our implementation:
lower = 0.00, middle = 0.50, upper = 1.00
## Some lines omitted
Root: 0.177001953125
Scipy bisect: 0.177014592494
Scipy newton: 0.177014592494
Scipy brentq: 0.177014592494

Note that bisect, newton and brentq all give the same value

Using IPython's timeit we can get an idea of how long each one takes

In [11]: timeit bisect(f, 0, 1)
100 loops, best of 3: 8.91 ms per loop
In [12]: timeit newton(f, 0.5)
1000 loops, best of 3: 1.33 ms per loop
In [13]: timeit brentq(f, 0, 1)
100 loops, best of 3: 2.11 ms per loop

Here newton (using Newton-Raphson) is best because the function is well behaved

But brentq is not far off, and would work well for a less well behaved function

Fixed Points

SciPy has a function for finding (scalar) fixed points too:

>>> from scipy.optimize import fixed_point
>>> fixed_point(lambda x: x**2, 10.0)  # 10.0 is an initial guess
1.0

However, current algorithm is not very robust

Might be better to use a root finder such as brentq to find the fixed point of f

>>> from scipy.optimize import brentq
>>> brentq(lambda x: x - x**2, 0.5, 10)
1.0

Multivariate Root Finding

Use scipy.optimize.fsolve, a wrapper for a hybrid method in MINPACK

See the docstring for details

Optimization

Most numerical packages provide only functions for minimization

To find the maximizer of f on D, find the minimizer of -f on D

Minimization is closely related to root finding

As in root finding, nice properties of the function permit use of faster methods

But often we don't know much about the function

In this case it's best to use hybrid methods

Univariate Optimization

For constrained, univariate (i.e., scalar) minimization, a good option is fminbound

>>> from scipy.optimize import fminbound
>>> fminbound(lambda x: x**2, -1, 2)  # Search in [-1, 2]
0.0

Uses Brent's hybrid method

Finds local optima, not global

Multivariate Optimization

Multivariate local optimizers: fmin, fmin_powell, fmin_cg, fmin_bfgs, fmin_ncg

Constrained multivariate local optimizers: fmin_l_bfgs_b, fmin_tnc, fmin_cobyla

See the docstrings for details

Integration

Most numerical methods compute integral of approximating polynomial

Error depends on quality of fit (of polynomial to integrand)

The relevant module is scipy.integrate

A good default for univariate integration is quad

>>> from scipy.integrate import quad
>>> quad(lambda x: x**2, 0, 1)   # Integrate f(x) = x**2 over [0, 1]
(0.33333333333333331, 3.7007434154171879e-15)  # Integral, estimate of error

Interface to function in Fortran library QUADPACK

Uses Clenshaw-Curtis quadrature

Other univariate options:

Multivariate options

In high dimensions, use Monte Carlo if possible

Linear Algebra

We saw that NumPy provides a module for linear algebra called linalg

SciPy also provides a module for linear algebra with the same name

Personally I just use numpy.linalg