pingouin.linear_regression

pingouin.linear_regression(X, y, add_intercept=True, weights=None, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, relimp=False)[source]

(Multiple) Linear regression.

Parameters
Xarray_like

Predictor(s), of shape (n_samples, n_features) or (n_samples).

yarray_like

Dependent variable, of shape (n_samples).

add_interceptbool

If False, assume that the data are already centered. If True, add a constant term to the model. In this case, the first value in the output dict is the intercept of the model.

Note

It is generally recommended to include a constant term (intercept) to the model to limit the bias and force the residual mean to equal zero. The intercept coefficient and p-values are however rarely meaningful.

weightsarray_like

An optional vector of sample weights to be used in the fitting process, of shape (n_samples). Missing or negative weights are not allowed. If not null, a weighted least squares is calculated.

New in version 0.3.5.

coef_onlybool

If True, return only the regression coefficients.

alphafloat

Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)

as_dataframebool

If True, returns a pandas DataFrame. If False, returns a dictionnary.

remove_nabool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.

relimpbool

If True, returns the relative importance (= contribution) of predictors. This is irrelevant when the predictors are uncorrelated: the total \(R^2\) of the model is simply the sum of each univariate regression \(R^2\)-values. However, this does not apply when predictors are correlated. Instead, the total \(R^2\) of the model is partitioned by averaging over all combinations of predictors, as done in the relaimpo R package (calc.relimp(type="lmg")).

Warning

The computation time roughly doubles for each additional predictor and therefore this can be extremely slow for models with more than 12-15 predictors.

New in version 0.3.0.

Returns
statspandas.DataFrame or dict

Linear regression summary:

  • 'names': name of variable(s) in the model (e.g. x1, x2…)

  • 'coef': regression coefficients

  • 'se': standard errors

  • 'T': T-values

  • 'pval': p-values

  • 'r2': coefficient of determination (\(R^2\))

  • 'adj_r2': adjusted \(R^2\)

  • 'CI[2.5%]': lower confidence intervals

  • 'CI[97.5%]': upper confidence intervals

  • 'relimp': relative contribution of each predictor to the final \(R^2\) (only if relimp=True).

  • 'relimp_perc': percent relative contribution

In addition, the output dataframe comes with hidden attributes such as the residuals, and degrees of freedom of the model and residuals, which can be accessed as follow, respectively:

>>> lm = pg.linear_regression() 
>>> lm.residuals_, lm.df_model_, lm.df_resid_ 

Note that to follow scikit-learn convention, these hidden atributes end with an “_”. When as_dataframe=False however, these attributes are no longer hidden and can be accessed as any other keys in the output dictionary.

>>> lm = pg.linear_regression() 
>>> lm['residuals'], lm['df_model'], lm['df_resid'] 

When as_dataframe=False the dictionary also contains the processed X and y arrays (i.e, with NaNs removed if remove_na=True) and the model’s predicted values pred.

>>> lm['X'], lm['y'], lm['pred'] 

For a weighted least squares fit, the weighted Xw and yw arrays are included in the dictionary.

>>> lm['Xw'], lm['yw'] 

Notes

The \(\beta\) coefficients are estimated using an ordinary least squares (OLS) regression, as implemented in the scipy.linalg.lstsq() function. The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated \(\beta\):

\[\hat{\beta} = (X^TX)^{-1} X^Ty\]

It is generally recommended to include a constant term (intercept) to the model to limit the bias and force the residual mean to equal zero. Note that intercept coefficient and p-values are however rarely meaningful.

The standard error of the estimates is a measure of the accuracy of the prediction defined as:

\[\sigma = \sqrt{\text{MSE} \cdot (X^TX)^{-1}}\]

where \(\text{MSE}\) is the mean squared error,

\[\text{MSE} = \frac{SS_{\text{resid}}}{n - p - 1} = \frac{\sum{(\text{true} - \text{pred})^2}}{n - p - 1}\]

\(p\) is the total number of predictor variables in the model (excluding the intercept) and \(n\) is the sample size.

Using the \(\beta\) coefficients and the standard errors, the T-values can be obtained:

\[T = \frac{\beta}{\sigma}\]

and the p-values approximated using a T-distribution with \(n - p - 1\) degrees of freedom.

The coefficient of determination (\(R^2\)) is defined as:

\[R^2 = 1 - (\frac{SS_{\text{resid}}}{SS_{\text{total}}})\]

The adjusted \(R^2\) is defined as:

\[\overline{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]

The relative importance (relimp) column is a partitioning of the total \(R^2\) of the model into individual \(R^2\) contribution. This is calculated by taking the average over average contributions in models of different sizes. For more details, please refer to Groemping et al. 2006 and the R package relaimpo.

Note that Pingouin will automatically remove any duplicate columns from \(X\), as well as any column with only one unique value (constant), excluding the intercept.

Results have been compared against sklearn, R, statsmodels and JASP.

Examples

  1. Simple linear regression using columns of a pandas dataframe

In this first example, we’ll use the tips dataset to see how well we can predict the waiter’s tip (in dollars) based on the total bill (also in dollars).

>>> import numpy as np
>>> import pingouin as pg
>>> df = pg.read_dataset('tips')
>>> # Let's predict the tip ($) based on the total bill (also in $)
>>> lm = pg.linear_regression(df['total_bill'], df['tip'])
>>> lm.round(2)
        names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0   Intercept  0.92  0.16   5.76   0.0  0.46    0.45      0.61       1.23
1  total_bill  0.11  0.01  14.26   0.0  0.46    0.45      0.09       0.12

It comes as no surprise that total bill is indeed a significant predictor of the waiter’s tip (T=14.26, p<0.05). The \(R^2\) of the model is 0.46 and the adjusted \(R^2\) is 0.45, which means that our model roughly explains ~45% of the total variance in the tip amount.

  1. Multiple linear regression

We can also have more than one predictor and run a multiple linear regression. Below, we add the party size as a second predictor of tip.

>>> # We'll add a second predictor: the party size
>>> lm = pg.linear_regression(df[['total_bill', 'size']], df['tip'])
>>> lm.round(2)
        names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0   Intercept  0.67  0.19   3.46  0.00  0.47    0.46      0.29       1.05
1  total_bill  0.09  0.01  10.17  0.00  0.47    0.46      0.07       0.11
2        size  0.19  0.09   2.26  0.02  0.47    0.46      0.02       0.36

The party size is also a significant predictor of tip (T=2.26, p=0.02). Note that adding this new predictor however only improved the \(R^2\) of our model by ~1%.

This function also works with numpy arrays:

>>> X = df[['total_bill', 'size']].to_numpy()
>>> y = df['tip'].to_numpy()
>>> pg.linear_regression(X, y).round(2)
       names  coef    se      T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  0.67  0.19   3.46  0.00  0.47    0.46      0.29       1.05
1         x1  0.09  0.01  10.17  0.00  0.47    0.46      0.07       0.11
2         x2  0.19  0.09   2.26  0.02  0.47    0.46      0.02       0.36
  1. Get the residuals

>>> # For clarity, only display the first 9 values
>>> np.round(lm.residuals_, 2)[:9]
array([-1.62, -0.55,  0.31,  0.06, -0.11,  0.93,  0.13, -0.81, -0.49])

Using pandas, we can show a summary of the distribution of the residuals:

>>> import pandas as pd
>>> pd.Series(lm.residuals_).describe().round(2)
count    244.00
mean      -0.00
std        1.01
min       -2.93
25%       -0.55
50%       -0.09
75%        0.51
max        4.04
dtype: float64
  1. No intercept and return only the regression coefficients

Sometimes it may be useful to remove the constant term from the regression, or to only return the regression coefficients without calculating the standard errors or p-values. This latter can potentially save you a lot of time if you need to calculate hundreds of regression and only care about the coefficients!

>>> pg.linear_regression(X, y, add_intercept=False, coef_only=True)
array([0.1007119 , 0.36209717])
  1. Return a dictionnary instead of a dataframe

>>> lm_dict = pg.linear_regression(X, y, as_dataframe=False)
>>> lm_dict.keys()
dict_keys(['names', 'coef', 'se', 'T', 'pval', 'r2', 'adj_r2', 'CI[2.5%]',
           'CI[97.5%]', 'df_model', 'df_resid', 'residuals', 'X', 'y',
           'pred'])
  1. Remove missing values

>>> X[4, 1] = np.nan
>>> y[7] = np.nan
>>> pg.linear_regression(X, y, remove_na=True, coef_only=True)
array([0.65749955, 0.09262059, 0.19927529])
  1. Get the relative importance of predictors

>>> lm = pg.linear_regression(X, y, remove_na=True, relimp=True)
>>> lm[['names', 'relimp', 'relimp_perc']]
       names    relimp  relimp_perc
0  Intercept       NaN          NaN
1         x1  0.342503    73.045583
2         x2  0.126386    26.954417

The relimp column is a partitioning of the total \(R^2\) of the model into individual contribution. Therefore, it sums to the \(R^2\) of the full model. The relimp_perc is normalized to sum to 100%. See Groemping 2006 for more details.

>>> lm[['relimp', 'relimp_perc']].sum()
relimp           0.468889
relimp_perc    100.000000
dtype: float64
  1. Weighted linear regression

>>> X = [1, 2, 3, 4, 5, 6]
>>> y = [10, 22, 11, 13, 13, 16]
>>> w = [1, 0.1, 1, 1, 0.5, 1]  # Array of weights. Must be >= 0.
>>> lm = pg.linear_regression(X, y, weights=w)
>>> lm.round(2)
       names  coef    se     T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  9.00  2.03  4.42  0.01  0.51    0.39      3.35      14.64
1         x1  1.04  0.50  2.06  0.11  0.51    0.39     -0.36       2.44