The Datasets Package

Original Proposal

The idea for a datasets package was originally proposed by David Cournapeau and can be found here with updates by me (Skipper Seabold).

Main Usage

To load a dataset, do the following:

In [1]: import statsmodels.api as sm

In [2]: data = sm.datasets.longley.load()

The Dataset object follows the bunch pattern as explained in the proposal.
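As a rough illustration of the bunch pattern, the sketch below defines a dictionary whose keys can also be read as attributes. The `Bunch` class and the `toy` object are hypothetical and written only for this example; the actual Dataset class in statsmodels may differ in its details. The values are the first three Longley observations, restricted to two regressors.

```python
class Bunch(dict):
    """Dictionary whose keys are also readable and writable as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Use the dict itself as the instance namespace, so that
        # bunch.key and bunch["key"] refer to the same storage.
        self.__dict__ = self


# Hypothetical toy dataset (not the statsmodels Dataset class).
toy = Bunch(
    endog=[60323.0, 61122.0, 60171.0],
    exog=[[83.0, 234289.0], [88.5, 259426.0], [88.2, 258054.0]],
    endog_name="TOTEMP",
    exog_name=["GNPDEFL", "GNP"],
)

print(toy.endog_name)      # attribute-style access
print(toy["exog_name"])    # dictionary-style access to the same data

# New attributes show up as keys as well.
toy.names = [toy.endog_name] + toy.exog_name
print(sorted(toy.keys()))
```

The appeal of this design is that a dataset can carry an arbitrary set of named fields without a fixed class definition, while users still get convenient attribute access.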

Most datasets have two attributes that are of particular interest when working through examples, endog and exog:

In [3]: data.endog
Out[3]: 
array([ 60323.,  61122.,  60171.,  61187.,  63221.,  63639.,  64989.,
        63761.,  66019.,  67857.,  68169.,  66513.,  68655.,  69564.,
        69331.,  70551.])

In [4]: data.exog
Out[4]: 
array([[     83. ,  234289. ,    2356. ,    1590. ,  107608. ,    1947. ],
       [     88.5,  259426. ,    2325. ,    1456. ,  108632. ,    1948. ],
       [     88.2,  258054. ,    3682. ,    1616. ,  109773. ,    1949. ],
       [     89.5,  284599. ,    3351. ,    1650. ,  110929. ,    1950. ],
       [     96.2,  328975. ,    2099. ,    3099. ,  112075. ,    1951. ],
       [     98.1,  346999. ,    1932. ,    3594. ,  113270. ,    1952. ],
       [     99. ,  365385. ,    1870. ,    3547. ,  115094. ,    1953. ],
       [    100. ,  363112. ,    3578. ,    3350. ,  116219. ,    1954. ],
       [    101.2,  397469. ,    2904. ,    3048. ,  117388. ,    1955. ],
       [    104.6,  419180. ,    2822. ,    2857. ,  118734. ,    1956. ],
       [    108.4,  442769. ,    2936. ,    2798. ,  120445. ,    1957. ],
       [    110.8,  444546. ,    4681. ,    2637. ,  121950. ,    1958. ],
       [    112.6,  482704. ,    3813. ,    2552. ,  123366. ,    1959. ],
       [    114.2,  502601. ,    3931. ,    2514. ,  125368. ,    1960. ],
       [    115.7,  518173. ,    4806. ,    2572. ,  127852. ,    1961. ],
       [    116.9,  554894. ,    4007. ,    2827. ,  130081. ,    1962. ]])

Univariate datasets, however, do not have an exog attribute. You can look up the variable names with the endog_name and exog_name attributes:

In [5]: data.endog_name
Out[5]: 'TOTEMP'

In [6]: data.exog_name
Out[6]: ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

If the dataset does not have a clear interpretation of what should be an endog and exog, then you can always access the data or raw_data attributes. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset with a specific example in mind. The data attribute contains a record array of the full dataset and the raw_data attribute contains an ndarray with the names of the columns given by the names attribute.

In [7]: type(data.data)
Out[7]: numpy.core.records.recarray

In [8]: type(data.raw_data)
Out[8]: numpy.ndarray

In [9]: data.names
Out[9]: ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

Loading data as pandas objects

For many users it may be preferable to get the datasets as a pandas DataFrame or Series object. Each of the dataset modules is equipped with a load_pandas method which returns a Dataset instance with the data as pandas objects:

In [10]: data = sm.datasets.longley.load_pandas()

In [11]: data.exog
Out[11]: 
    GNPDEFL     GNP  UNEMP  ARMED     POP  YEAR
0      83.0  234289   2356   1590  107608  1947
1      88.5  259426   2325   1456  108632  1948
2      88.2  258054   3682   1616  109773  1949
3      89.5  284599   3351   1650  110929  1950
4      96.2  328975   2099   3099  112075  1951
5      98.1  346999   1932   3594  113270  1952
6      99.0  365385   1870   3547  115094  1953
7     100.0  363112   3578   3350  116219  1954
8     101.2  397469   2904   3048  117388  1955
9     104.6  419180   2822   2857  118734  1956
10    108.4  442769   2936   2798  120445  1957
11    110.8  444546   4681   2637  121950  1958
12    112.6  482704   3813   2552  123366  1959
13    114.2  502601   3931   2514  125368  1960
14    115.7  518173   4806   2572  127852  1961
15    116.9  554894   4007   2827  130081  1962

In [12]: data.endog
Out[12]: 
0     60323
1     61122
2     60171
3     61187
4     63221
5     63639
6     64989
7     63761
8     66019
9     67857
10    68169
11    66513
12    68655
13    69564
14    69331
15    70551
Name: TOTEMP, dtype: float64

With pandas integration in the estimation classes, the metadata will be attached to model results:

In [13]: y, x = data.endog, data.exog

In [14]: res = sm.OLS(y, x).fit()

In [15]: res.params
Out[15]: 
GNPDEFL   -52.993570
GNP         0.071073
UNEMP      -0.423466
ARMED      -0.572569
POP        -0.414204
YEAR       48.417866
dtype: float64

In [16]: res.summary()
Out[16]: 
<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 TOTEMP   R-squared:                       0.988
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     161.9
Date:                Sat, 29 Nov 2014   Prob (F-statistic):           3.13e-09
Time:                        15:57:13   Log-Likelihood:                -117.56
No. Observations:                  16   AIC:                             247.1
Df Residuals:                      10   BIC:                             251.8
Df Model:                           5                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
GNPDEFL      -52.9936    129.545     -0.409      0.691      -341.638   235.650
GNP            0.0711      0.030      2.356      0.040         0.004     0.138
UNEMP         -0.4235      0.418     -1.014      0.335        -1.354     0.507
ARMED         -0.5726      0.279     -2.052      0.067        -1.194     0.049
POP           -0.4142      0.321     -1.289      0.226        -1.130     0.302
YEAR          48.4179     17.689      2.737      0.021         9.003    87.832
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   1.277
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                0.605
Skew:                           0.476   Prob(JB):                        0.739
Kurtosis:                       3.031   Cond. No.                     4.56e+05
==============================================================================

The condition number is large, 4.56e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
"""

Extra Information

If you want to know more about the dataset itself, you can access the following module-level attributes, again using the Longley dataset as an example:

>>> dir(sm.datasets.longley)[:6]
['COPYRIGHT', 'DESCRLONG', 'DESCRSHORT', 'NOTE', 'SOURCE', 'TITLE']

How to Add a Dataset

See the notes on adding a dataset.