pingouin.linear_regression

pingouin.linear_regression(X, y, add_intercept=True, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, relimp=False)[source]

(Multiple) Linear regression.

Parameters
X : pd.DataFrame, np.array or list

Predictor(s). Shape = (n_samples, n_features) or (n_samples,).

y : pd.Series, np.array or list

Dependent variable. Shape = (n_samples,).

add_intercept : bool

If False, assume that the data are already centered. If True, add a constant term to the model. In this case, the first value in the output dict is the intercept of the model.

Note

It is generally recommended to include a constant term (intercept) in the model to limit the bias and force the residual mean to equal zero. Note, however, that the intercept coefficient and its p-value are rarely meaningful.

coef_only : bool

If True, return only the regression coefficients.

alpha : float

Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)

as_dataframe : bool

If True, returns a pandas DataFrame. If False, returns a dictionary.

remove_na : bool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.

relimp : bool

If True, returns the relative importance (= contribution) of each predictor. This is trivial when the predictors are uncorrelated: the total \(R^2\) of the model is then simply the sum of the univariate regression \(R^2\)-values. When predictors are correlated, however, the total \(R^2\) of the model is instead partitioned by averaging over all combinations of predictors, as done in the relaimpo R package (calc.relimp(type="lmg")).

Warning

The computation time roughly doubles with each additional predictor, so this can be extremely slow for models with more than 12-15 predictors.

New in version 0.3.0.

Returns
stats : pandas.DataFrame or dict

Linear regression summary:

  • 'names': name of variable(s) in the model (e.g. x1, x2…)

  • 'coef': regression coefficients

  • 'se': standard error of the estimate

  • 'T': T-values

  • 'pval': p-values

  • 'r2': coefficient of determination (\(R^2\))

  • 'adj_r2': adjusted \(R^2\)

  • 'CI[2.5%]': lower confidence interval

  • 'CI[97.5%]': upper confidence interval

  • 'residuals': residuals (only if as_dataframe=False)

  • 'relimp': relative contribution of each predictor to the final \(R^2\) (only if relimp=True).

  • 'relimp_perc': percent relative contribution

Notes

The \(\beta\) coefficients are estimated using an ordinary least squares (OLS) regression, as implemented in the numpy.linalg.lstsq() function. The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated \(\beta\):

\[\hat{\beta} = (X^TX)^{-1} X^Ty\]
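As an illustration (a minimal NumPy sketch on made-up toy data, not pingouin's internal code), the normal-equations solution can be checked against numpy.linalg.lstsq(), which is numerically more stable than explicitly inverting \(X^TX\):

>>> import numpy as np
>>> rng = np.random.RandomState(123)
>>> Xd = np.column_stack((np.ones(20), rng.normal(size=20)))  # toy design matrix with intercept
>>> yd = 2 + 3 * Xd[:, 1] + rng.normal(size=20)               # toy dependent variable
>>> beta_normal = np.linalg.pinv(Xd.T @ Xd) @ Xd.T @ yd       # closed-form normal equations
>>> beta_lstsq = np.linalg.lstsq(Xd, yd, rcond=None)[0]       # least-squares solver
>>> np.allclose(beta_normal, beta_lstsq)
True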

It is generally recommended to include a constant term (intercept) in the model to limit the bias and force the residual mean to equal zero. Note, however, that the intercept coefficient and its p-value are rarely meaningful.

The standard error of the estimates is a measure of the accuracy of the prediction defined as:

\[\sigma = \sqrt{\text{MSE} \cdot \text{diag}\left[(X^TX)^{-1}\right]}\]

where \(\text{MSE}\) is the mean squared error,

\[\text{MSE} = \frac{SS_{\text{resid}}}{n - p - 1} = \frac{\sum{(\text{true} - \text{pred})^2}}{n - p - 1}\]

\(p\) is the total number of predictor variables in the model (excluding the intercept) and \(n\) is the sample size.

Using the \(\beta\) coefficients and the standard errors, the T-values can be obtained:

\[T = \frac{\beta}{\sigma}\]

and the p-values approximated using a T-distribution with \(n - p - 1\) degrees of freedom.
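Continuing the toy sketch above, these quantities can be reproduced in a few lines with NumPy and scipy.stats (again an illustrative sketch, not pingouin's internal code); the same critical value also yields the confidence bounds reported in the CI columns:

>>> from scipy import stats
>>> n, p = Xd.shape[0], Xd.shape[1] - 1                       # p excludes the intercept
>>> resid = yd - Xd @ beta_lstsq
>>> mse = (resid ** 2).sum() / (n - p - 1)                    # mean squared error
>>> se = np.sqrt(mse * np.diag(np.linalg.pinv(Xd.T @ Xd)))    # standard errors
>>> tvals = beta_lstsq / se                                   # T-values
>>> pvals = 2 * stats.t.sf(np.abs(tvals), n - p - 1)          # two-sided p-values
>>> crit = stats.t.ppf(1 - 0.05 / 2, n - p - 1)               # critical value for 95% CI
>>> ci = np.column_stack((beta_lstsq - crit * se, beta_lstsq + crit * se))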

The coefficient of determination (\(R^2\)) is defined as:

\[R^2 = 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}}\]

The adjusted \(R^2\) is defined as:

\[\overline{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]
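Both follow directly from the residuals of the illustrative sketch above:

>>> ss_resid = (resid ** 2).sum()
>>> ss_total = ((yd - yd.mean()) ** 2).sum()
>>> r2 = 1 - ss_resid / ss_total                              # coefficient of determination
>>> adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)             # adjusted R2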

The residuals can be accessed via stats.residuals_ if stats is a pandas DataFrame or stats['residuals'] if stats is a dict.

The relative importance (relimp) column is a partitioning of the total \(R^2\) of the model into individual \(R^2\) contributions, calculated by averaging over the average contributions in models of all different sizes. For more details, please refer to Grömping 2006 and the R package relaimpo.
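With only two predictors, the LMG averaging reduces to the mean over the two possible orders of entry. Below is a minimal sketch reusing the toy data from above (the r2_of helper is made up for illustration and is not part of pingouin's API):

>>> def r2_of(*cols):
...     """R-squared of an OLS fit of yd on the given predictors (plus intercept)."""
...     Z = np.column_stack((np.ones(len(yd)),) + cols)
...     res = yd - Z @ np.linalg.lstsq(Z, yd, rcond=None)[0]
...     return 1 - (res ** 2).sum() / ((yd - yd.mean()) ** 2).sum()
>>> a = Xd[:, 1]
>>> b = a + rng.normal(scale=0.5, size=len(yd))       # second predictor, correlated with a
>>> full = r2_of(a, b)
>>> relimp_a = 0.5 * (r2_of(a) + (full - r2_of(b)))   # average marginal R2 gain of a
>>> relimp_b = 0.5 * (r2_of(b) + (full - r2_of(a)))
>>> bool(np.isclose(relimp_a + relimp_b, full))       # the contributions sum to the full R2
True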

Note that Pingouin will automatically remove any duplicate columns from \(X\), as well as any column with only one unique value (constant), excluding the intercept.

Results have been compared against sklearn, statsmodels and JASP.

Examples

  1. Simple linear regression

>>> import numpy as np
>>> from pingouin import linear_regression
>>> np.random.seed(123)
>>> mean, cov, n = [4, 6], [[1, 0.5], [0.5, 1]], 30
>>> x, y = np.random.multivariate_normal(mean, cov, n).T
>>> lm = linear_regression(x, y)
>>> lm.round(2)
       names  coef    se     T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  4.40  0.54  8.16  0.00  0.24    0.21      3.29       5.50
1         x1  0.39  0.13  2.99  0.01  0.24    0.21      0.12       0.67
  2. Multiple linear regression

>>> np.random.seed(42)
>>> z = np.random.normal(size=n)
>>> X = np.column_stack((x, z))
>>> lm = linear_regression(X, y)
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
  3. Get the residuals

>>> np.round(lm.residuals_, 2)
array([ 1.18, -1.17,  1.32,  0.76, -1.25,  0.34, -1.54, -0.2 ,  0.36,
       -0.39,  0.69,  1.39,  0.2 , -1.14, -0.21, -1.68,  0.67, -0.69,
        0.62,  0.92, -1.  ,  0.64, -0.21, -0.78,  1.08, -0.03, -1.3 ,
        0.64,  0.81, -0.04])
  4. Using a Pandas DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
>>> lm = linear_regression(df[['x', 'z']], df['y'])
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
  5. No intercept and return coef only

>>> linear_regression(X, y, add_intercept=False, coef_only=True)
array([ 1.40935593, -0.2916508 ])
  6. Return a dictionary instead of a DataFrame

>>> lm_dict = linear_regression(X, y, as_dataframe=False)
  7. Remove missing values

>>> X[4, 1] = np.nan
>>> y[7] = np.nan
>>> linear_regression(X, y, remove_na=True, coef_only=True)
array([4.64069731, 0.35455398, 0.1888135 ])
  8. Get the relative importance of predictors

>>> lm = linear_regression(X, y, remove_na=True, relimp=True)
>>> lm[['names', 'relimp', 'relimp_perc']]
       names    relimp  relimp_perc
0  Intercept       NaN          NaN
1         x1  0.217265    82.202201
2         x2  0.047041    17.797799

The relimp column is a partitioning of the total \(R^2\) of the model into individual contributions, so it sums to the \(R^2\) of the full model. The relimp_perc column is normalized to sum to 100%. See Grömping 2006 for more details.

>>> lm[['relimp', 'relimp_perc']].sum()
relimp           0.264305
relimp_perc    100.000000
dtype: float64