Linear Regression Models with Python
Author: Matti Pastell
Tags:
Python, Pweave
Apr 19 2013
I have been looking into using Python for basic statistical analyses lately, and I decided to write a short example about fitting linear regression models using the statsmodels library.
Requirements
This example uses statsmodels version 0.5 from GitHub, and we’ll use the new formula API, which makes fitting the models very familiar for R users. You’ll also need NumPy, pandas and matplotlib.
The analysis has been published using Pweave development version. See my other post.
Import libraries
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm
import matplotlib.pyplot as plt
We’ll use the whiteside dataset from the R package MASS. You can read the description of the dataset from the link, but in short it contains:
The weekly gas consumption and average external temperature at a house in south-east England for two heating seasons, one of 26 weeks before, and one of 30 weeks after cavity-wall insulation was installed.
Read the data from the pydatasets repo using pandas:
url = 'https://raw.github.com/cpcloud/pydatasets/master/datasets/MASS/whiteside.csv'
whiteside = pd.read_csv(url)
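After loading, it’s worth a quick sanity check that the frame has the expected columns and both heating seasons. A minimal sketch, using a small inline stand-in for the downloaded CSV (the real data has 56 rows: 26 "Before" and 30 "After"):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the CSV fetched from the URL above; same column layout.
csv = "Insul,Temp,Gas\nBefore,-0.8,7.2\nBefore,3.1,5.6\nAfter,4.7,3.9\n"
whiteside = pd.read_csv(StringIO(csv))

print(whiteside.shape)             # rows x columns of the loaded frame
print(whiteside['Insul'].unique()) # should show both 'Before' and 'After'
```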
Fitting the model
Let’s see what the relationship between gas consumption and external temperature looks like before the insulation was installed. See the statsmodels documentation for more information about the formula syntax.
model = sm.ols(formula='Gas ~ Temp', data=whiteside,
               subset=whiteside['Insul'] == "Before")
fitted = model.fit()
print(fitted.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Gas R-squared: 0.944
Model: OLS Adj. R-squared: 0.941
Method: Least Squares F-statistic: 403.1
Date: Fri, 19 Apr 2013 Prob (F-statistic): 1.64e-16
Time: 16:07:56 Log-Likelihood: -2.8783
No. Observations: 26 AIC: 9.757
Df Residuals: 24 BIC: 12.27
Df Model: 1
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept 6.8538 0.118 57.876 0.000 6.609 7.098
Temp -0.3932 0.020 -20.078 0.000 -0.434 -0.353
==============================================================================
Omnibus: 0.296 Durbin-Watson: 2.420
Prob(Omnibus): 0.862 Jarque-Bera (JB): 0.164
Skew: -0.177 Prob(JB): 0.921
Kurtosis: 2.839 Cond. No. 13.3
==============================================================================
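Beyond the printed summary, the fit object exposes the estimates programmatically. A minimal sketch on synthetic data (illustrative numbers, not the real whiteside measurements) showing how to pull out coefficients, confidence intervals and R-squared:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic stand-in for the "Before" subset: Gas falls roughly linearly
# with Temp, plus noise.
rng = np.random.RandomState(0)
temp = np.linspace(-1, 10, 26)
gas = 6.85 - 0.39 * temp + rng.normal(0, 0.25, temp.size)
df = pd.DataFrame({'Temp': temp, 'Gas': gas})

fit = sm.ols(formula='Gas ~ Temp', data=df).fit()
print(fit.params['Intercept'], fit.params['Temp'])  # point estimates
print(fit.conf_int())                               # 95% confidence intervals
print(fit.rsquared)
```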
Plot the data and fit
Before = whiteside[whiteside["Insul"] == "Before"]
plt.plot(Before["Temp"], Before["Gas"], 'ro')
plt.plot(Before["Temp"], fitted.fittedvalues, 'b')
plt.legend(['Data', 'Fitted model'])
plt.ylim(0, 10)
plt.xlim(-2, 12)
plt.xlabel('Temperature')
plt.ylabel('Gas')
plt.title('Before Insulation')
Fit diagnostics
Statsmodels OLSResults objects contain the usual diagnostic information about the model, and you can use the get_influence()
method to get more diagnostic information (such as Cook’s distance).
A look at the residuals
Histogram of normalized residuals
plt.hist(fitted.norm_resid())
plt.ylabel('Count')
plt.xlabel('Normalized residuals')
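A Q-Q plot is another common way to eyeball whether the residuals look normal. A minimal sketch using statsmodels’ qqplot, with illustrative stand-in residuals in place of the fitted model’s (in practice you’d pass the normalized residuals from above):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this also runs headless
import numpy as np
from statsmodels.graphics.gofplots import qqplot

# Illustrative stand-in residuals; substitute the model's residuals here.
rng = np.random.RandomState(1)
resid = rng.normal(size=26)

# Plot sample quantiles against a standardized normal reference line.
fig = qqplot(resid, line='s')
```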
Cook’s distance
OLSInfluence objects contain more diagnostic information:
influence = fitted.get_influence()
# c is the Cook's distance and p is the corresponding p-value
(c, p) = influence.cooks_distance
plt.stem(np.arange(len(c)), c, markerfmt=",")
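Rather than just plotting the distances, you can screen for influential observations with the common rule of thumb that flags points whose Cook’s distance exceeds 4/n. A minimal sketch with illustrative distances standing in for `influence.cooks_distance[0]`:

```python
import numpy as np

# Illustrative Cook's distances; in practice use the array from cooks_distance.
c = np.array([0.01, 0.03, 0.90, 0.02, 0.05])
n = len(c)

# Flag observations above the 4/n rule-of-thumb threshold.
flagged = np.where(c > 4.0 / n)[0]
print(flagged)  # indices of potentially influential observations
```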
Statsmodels builtin plots
Statsmodels includes some built-in functions for plotting residuals against leverage:
from statsmodels.graphics.regressionplots import plot_leverage_resid2, influence_plot
plot_leverage_resid2(fitted)
influence_plot(fitted)