
Generalized Linear Models

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: from scipy import stats

In [4]: from matplotlib import pyplot as plt

This example fits a GLM to binomial response data; the response vector in this case is N by 2 (successes, failures). The data are taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.

The response variable is (# of students above the math national median, # of students below).

The explanatory variables are (in column order):

- the proportion of low income families, "LOWINC"
- the proportions of minority students, "PERASIAN", "PERBLACK", "PERHISP"
- the percentage of minority teachers, "PERMINTE"
- the median teacher salary including benefits in 1000s, "AVSALK"
- the mean teacher experience in years, "AVYRSEXP"
- the per-pupil expenditures in thousands, "PERSPENK"
- the pupil-teacher ratio, "PTRATIO"
- the percent of students taking college credit courses, "PCTAF"
- the percentage of charter schools in the district, "PCTCHRT"
- the percent of schools in the district operating year round, "PCTYRRND"
- interaction terms: "PERMINTE_AVYRSEXP", "PERMINTE_AVSAL", "AVYRSEXP_AVSAL", "PERSPEN_PTRATIO", "PERSPEN_PCTAF", "PTRATIO_PCTAF", "PERMINTE_AVYRSEXP_AVSAL", "PERSPEN_PTRATIO_PCTAF"
In [5]: data = sm.datasets.star98.load()

In [6]: data.exog = sm.add_constant(data.exog)

The response variable is (successes, failures); e.g., the first observation is:

In [7]: print data.endog[0]
[ 452.  355.]

The total number of trials for this observation is data.endog[0].sum() = 807.

In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

In [9]: binom_results = glm_binom.fit()
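
For reference, the Binomial family uses the logit link by default, so the fitted mean for observation i is the success probability

    \mu_i = \frac{\exp(x_i'\beta)}{1 + \exp(x_i'\beta)},

estimated from n_i = successes_i + failures_i trials.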

The estimated parameters are

In [10]: print binom_results.params
[-0.01681504  0.00992548 -0.01872421 -0.01423856  0.25448717  0.24069366
  0.08040867 -1.9521605  -0.33408647 -0.16902217  0.0049167  -0.00357996
 -0.01407656 -0.00400499 -0.0039064   0.0917143   0.04898984  0.00804074
  0.00022201 -0.00224925  2.95887793]

The corresponding t-values are

In [11]: print binom_results.tvalues
[-38.74908321  16.50473627 -25.1821894  -32.81791308   8.49827113
   4.21247925   5.7749976   -6.16191078  -5.45321673  -5.16865445
   3.92119964 -15.87825999  -7.39093058  -8.44963886  -4.05916246
   6.3210987    6.57434662   5.36229044   7.42806363  -6.44513698
   1.91301155]

It is common in GLMs with interactions to compare first differences: the change in the expected response when one explanatory variable changes while the others are held fixed. This example uses the interquartile difference in the percentage of low income households, holding the other variables constant at their means.
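
In symbols, writing the inverse logit as g^{-1}(\eta) = 1/(1 + e^{-\eta}), the interquartile first difference computed below is

    \Delta = g^{-1}(x_{75}'\hat\beta) - g^{-1}(x_{25}'\hat\beta),

where x_{25} and x_{75} equal the column means except that the LOWINC entry is set to its 25th and 75th percentile, respectively.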

In [12]: means = data.exog.mean(axis=0)

In [13]: means25 = means.copy()

In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)

In [15]: means75 = means.copy()

In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)

In [17]: resp_25 = binom_results.predict(means25)

In [18]: resp_75 = binom_results.predict(means75)

In [19]: diff = resp_75 - resp_25

The interquartile first difference for the percentage of low income households in a school district is

In [20]: print diff*100
-11.8752509918

In [21]: means0 = means.copy()

In [22]: means100 = means.copy()

In [23]: means0[0] = data.exog[:,0].min()

In [24]: means100[0] = data.exog[:,0].max()

In [25]: resp_0 = binom_results.predict(means0)

In [26]: resp_100 = binom_results.predict(means100)

In [27]: diff_full = resp_100 - resp_0

In [28]: print """The full range difference is %2.4f %%""" % (diff_full*100)
The full range difference is -36.1636 %

In [29]: nobs = binom_results.nobs

In [30]: y = data.endog[:,0]/data.endog.sum(1)

In [31]: yhat = binom_results.mu
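
The attribute mu holds the fitted means. The same values should also be available through the results' predict method; a brief sketch, not part of the original session:

yhat_alt = binom_results.predict(data.exog)  # fitted success probabilities
assert np.allclose(yhat, yhat_alt)           # should match binom_results.mu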

Plot of yhat vs. y

In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0xf2d8aaac>

In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.PathCollection at 0xf19679ec>

In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params

In [35]: fit = lambda x: line_fit[1] + line_fit[0]*x # line_fit is [slope, intercept] since add_constant appended the constant; better way in scipy?

In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0xf2d104ec>]

In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0xf195196c>

In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0xf1dc6dec>

In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0xf1e9806c>
[Figure: Model Fit Plot (glm_fitted.png)]

Plot of yhat vs. Pearson residuals
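
For reference, the Pearson residual standardizes the raw residual by the model's variance function; for a binomial observation with success proportion y_i out of n_i trials,

    r_{P,i} = \frac{y_i - \hat\mu_i}{\sqrt{\hat\mu_i(1 - \hat\mu_i)/n_i}}.

Under a well-specified model these scatter around zero with no trend in the fitted values.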

In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0xf1b82acc>

In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.PathCollection at 0xf19679ac>

In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0xf1ba598c>]

In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0xf1e61c0c>

In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0xf1963b6c>

In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0xf1bb2ccc>
[Figure: Residual Dependence Plot (glm_resids.png)]

Histogram of standardized deviance residuals

In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0xf17c032c>

In [47]: res = binom_results.resid_deviance.copy()

In [48]: stdres = (res - res.mean())/res.std()
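
For reference, the deviance residual is the signed square root of each observation's contribution d_i to the model deviance D,

    r_{D,i} = \operatorname{sign}(y_i - \hat\mu_i)\sqrt{d_i}, \qquad \sum_i d_i = D,

and the line above standardizes these to zero mean and unit variance before plotting.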

In [49]: plt.hist(stdres, bins=25)
Out[49]: 
(array([  3.,   2.,   4.,  11.,  11.,  20.,  27.,  31.,  33.,  41.,  28.,
         26.,  23.,  17.,  11.,   4.,   2.,   3.,   1.,   2.,   1.,   0.,
          1.,   0.,   1.]),
 array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
        -1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
         0.18069979,  0.45305454,  0.72540929,  0.99776403,  1.27011878,
         1.54247353,  1.81482827,  2.08718302,  2.35953777,  2.63189252,
         2.90424726,  3.17660201,  3.44895676,  3.7213115 ,  3.99366625,
         4.266021  ]),
 <a list of 25 Patch objects>)

In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0xf1b82b0c>
[Figure: Histogram of standardized deviance residuals (glm_hist_res.png)]

QQ Plot of Deviance Residuals

In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0xf1759eac>

In [52]: res.sort()

In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)

In [54]: quants = np.zeros_like(res)

In [55]: for i in range(nobs):
   ....:     quants[i] = stats.scoreatpercentile(res, p[i]*100)
   ....: 
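
As an aside, the loop can be replaced by one vectorized call; a sketch, assuming a numpy version whose percentile function accepts an array of percentiles:

quants_vec = np.percentile(res, p * 100)  # vectorized equivalent; should match quants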

In [56]: mu = res.mean()

In [57]: sigma = res.std()

In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)

In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.PathCollection at 0xf1ba52cc>

In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0xf1c8cfcc>]

In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0xf1cbbdec>

In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0xf1b9ffac>

In [63]: plt.xlabel('Quantiles of N(0,1)')
Out[63]: <matplotlib.text.Text at 0xf17c42ec>

In [64]: from statsmodels import graphics

In [65]: img = graphics.gofplots.qqplot(res, line='r')
[Figure: Normal quantile plot of deviance residuals (glm_qqplot.png)]

Example of using a Gamma GLM for a proportional count response. A brief description of the data and design:

In [66]: print sm.datasets.scotland.DESCRLONG

This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts.  This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.

The original source files and variable information are included in
/scotland/src/


In [67]: data2 = sm.datasets.scotland.load()

In [68]: data2.exog = sm.add_constant(data2.exog)

In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())

In [70]: glm_results = glm_gamma.fit()
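
The Gamma family uses the inverse power link by default, i.e. \mu_i = 1/(x_i'\beta). The fit can be inspected the same way as before; a sketch, not part of the original session:

print(glm_results.params)    # estimated coefficients for the Gamma model
print(glm_results.summary()) # full regression table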

Example for a Gaussian distribution with a noncanonical (log) link

In [71]: nobs2 = 100

In [72]: x = np.arange(nobs2)

In [73]: np.random.seed(54321)

In [74]: X = np.column_stack((x,x**2))

In [75]: X = sm.add_constant(X)

In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)

In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))

In [78]: gauss_log_results = gauss_log.fit()
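
Since lny was generated with log-mean 1.0 - 0.03*x - 0.0001*x**2 plus small uniform noise, the fit should roughly recover those coefficients; a sketch (the position of the constant in params depends on how add_constant arranged the columns):

print(gauss_log_results.params)  # true data-generating values: -0.03 (x), -0.0001 (x**2), 1.0 (const)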

Check the summary of the binomial model fit:

In [79]: print binom_results.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           ['y1', 'y2']   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2998.6
Date:                Sat, 29 Nov 2014   Deviance:                       4078.8
Time:                        15:57:50   Pearson chi2:                 4.05e+03
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0168      0.000    -38.749      0.000        -0.018    -0.016
x2             0.0099      0.001     16.505      0.000         0.009     0.011
x3            -0.0187      0.001    -25.182      0.000        -0.020    -0.017
x4            -0.0142      0.000    -32.818      0.000        -0.015    -0.013
x5             0.2545      0.030      8.498      0.000         0.196     0.313
x6             0.2407      0.057      4.212      0.000         0.129     0.353
x7             0.0804      0.014      5.775      0.000         0.053     0.108
x8            -1.9522      0.317     -6.162      0.000        -2.573    -1.331
x9            -0.3341      0.061     -5.453      0.000        -0.454    -0.214
x10           -0.1690      0.033     -5.169      0.000        -0.233    -0.105
x11            0.0049      0.001      3.921      0.000         0.002     0.007
x12           -0.0036      0.000    -15.878      0.000        -0.004    -0.003
x13           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
x14           -0.0040      0.000     -8.450      0.000        -0.005    -0.003
x15           -0.0039      0.001     -4.059      0.000        -0.006    -0.002
x16            0.0917      0.015      6.321      0.000         0.063     0.120
x17            0.0490      0.007      6.574      0.000         0.034     0.064
x18            0.0080      0.001      5.362      0.000         0.005     0.011
x19            0.0002   2.99e-05      7.428      0.000         0.000     0.000
x20           -0.0022      0.000     -6.445      0.000        -0.003    -0.002
const          2.9589      1.547      1.913      0.057        -0.073     5.990
==============================================================================
