Linear Regression Models with Python
Author: Matti Pastell
Tags:
Python, Pweave
Apr 19 2013
I have been looking into using Python for basic statistical analyses lately, and I decided to write a short example about fitting linear regression models using the statsmodels library.
Requirements
This example uses statsmodels version 0.5 from GitHub, and we’ll use the new formula API, which makes fitting the models very familiar for R users. You’ll also need NumPy, pandas and matplotlib.
The analysis has been published using Pweave development version. See my other post.
Import libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
We’ll use the whiteside dataset from the R package MASS. You can read the description of the dataset from the link, but in short it contains:
The weekly gas consumption and average external temperature at a house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed.
Read the data from the pydatasets repo using pandas:
url = 'https://raw.github.com/cpcloud/pydatasets/master/datasets/MASS/whiteside.csv'
whiteside = pd.read_csv(url)
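After loading, it’s worth a quick sanity check that the frame has the expected columns and both heating seasons. A minimal sketch, using a small inline stand-in for the downloaded CSV (the real data has 56 rows: 26 "Before" and 30 "After"):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the CSV fetched from the URL above; same column layout.
csv = "Insul,Temp,Gas\nBefore,-0.8,7.2\nBefore,3.1,5.6\nAfter,4.7,3.9\n"
whiteside = pd.read_csv(StringIO(csv))

print(whiteside.shape)             # rows x columns of the loaded frame
print(whiteside['Insul'].unique()) # should show both 'Before' and 'After'
```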
Fitting the model
Let’s see what the relationship between gas consumption and external temperature looks like before the insulation was installed. See the statsmodels documentation for more information about the formula syntax.
model = sm.ols(formula='Gas ~ Temp', data=whiteside,
               subset=whiteside['Insul'] == "Before")
fitted = model.fit()
print(fitted.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gas R-squared: 0.944
Model: OLS Adj. R-squared: 0.941
Method: Least Squares F-statistic: 403.1
Date: Fri, 19 Apr 2013 Prob (F-statistic): 1.64e-16
Time: 16:07:56 Log-Likelihood: -2.8783
No. Observations: 26 AIC: 9.757
Df Residuals: 24 BIC: 12.27
Df Model: 1
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 6.8538 0.118 57.876 0.000 6.609 7.098
Temp -0.3932 0.020 -20.078 0.000 -0.434 -0.353
==============================================================================
Omnibus: 0.296 Durbin-Watson: 2.420
Prob(Omnibus): 0.862 Jarque-Bera (JB): 0.164
Skew: -0.177 Prob(JB): 0.921
Kurtosis: 2.839 Cond. No. 13.3
==============================================================================
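Beyond the printed summary, the fit object exposes the estimates programmatically. A minimal sketch on synthetic data (illustrative numbers, not the real whiteside measurements) showing how to pull out coefficients, confidence intervals and R-squared:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in for the "Before" subset: Gas falls roughly linearly
# with Temp, plus noise.
rng = np.random.RandomState(0)
temp = np.linspace(-1, 10, 26)
gas = 6.85 - 0.39 * temp + rng.normal(0, 0.25, temp.size)
df = pd.DataFrame({'Temp': temp, 'Gas': gas})

fit = sm.ols(formula='Gas ~ Temp', data=df).fit()
print(fit.params['Intercept'], fit.params['Temp'])  # point estimates
print(fit.conf_int())                               # 95% confidence intervals
print(fit.rsquared)
```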
Plot the data and fit
Before = whiteside[whiteside["Insul"] == "Before"]
plt.plot(Before["Temp"], Before["Gas"], 'ro')
plt.plot(Before["Temp"], fitted.fittedvalues, 'b')
plt.legend(['Data', 'Fitted model'])
plt.ylim(0, 10)
plt.xlim(-2, 12)
plt.xlabel('Temperature')
plt.ylabel('Gas')
plt.title('Before Insulation')
Fit diagnostics
Statsmodels OLSResults objects contain the usual diagnostic information about the model, and you can use the get_influence()
method to get more diagnostic information (such as Cook’s distance).
A look at the residuals
Histogram of normalized residuals
plt.hist(fitted.norm_resid())
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
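A Q-Q plot is another common way to eyeball whether the residuals look normal. A minimal sketch using statsmodels’ qqplot, with illustrative stand-in residuals in place of the fitted model’s (in practice you’d pass the normalized residuals from above):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import numpy as np
from statsmodels.graphics.gofplots import qqplot

# Illustrative stand-in residuals; substitute the model's residuals here.
rng = np.random.RandomState(1)
resid = rng.normal(size=26)

# Plot sample quantiles against a standardized normal reference line.
fig = qqplot(resid, line='s')
```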
Cook’s distance
OLSInfluence objects contain more diagnostic information:
influence = fitted.get_influence()
# c is the Cook's distance and p is the corresponding p-value
(c, p) = influence.cooks_distance
plt.stem(np.arange(len(c)), c, markerfmt=",")
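Rather than just plotting the distances, you can screen for influential observations with the common rule of thumb that flags points whose Cook’s distance exceeds 4/n. A minimal sketch with illustrative distances standing in for `influence.cooks_distance[0]`:

```python
import numpy as np

# Illustrative Cook's distances; in practice use the array from cooks_distance.
c = np.array([0.01, 0.03, 0.90, 0.02, 0.05])
n = len(c)

# Flag observations above the 4/n rule-of-thumb threshold.
flagged = np.where(c > 4.0 / n)[0]
print(flagged)  # indices of potentially influential observations
```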
Statsmodels builtin plots
Statsmodels includes some built-in functions for plotting residuals against leverage:
from statsmodels.graphics.regressionplots import plot_leverage_resid2, influence_plot
plot_leverage_resid2(fitted)
influence_plot(fitted)