pingouin.linear_regression

pingouin.linear_regression(X, y, add_intercept=True, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False)

(Multiple) Linear regression.

Parameters
X : pd.DataFrame, np.array or list

Predictor(s). Shape = (n_samples, n_features) or (n_samples,).

y : pd.Series, np.array or list

Dependent variable. Shape = (n_samples,).

add_intercept : bool

If False, assume that the data are already centered. If True, add a constant term to the model. In this case, the first value in the output dict is the intercept of the model.

Note

It is generally recommended to include a constant term (intercept) in the model to limit bias and force the residual mean to equal zero. Note, however, that the intercept coefficient and its p-value are rarely meaningful.

coef_only : bool

If True, return only the regression coefficients.

alpha : float

Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)

as_dataframe : bool

If True, returns a pandas DataFrame. If False, returns a dictionary.

remove_na : bool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.

Returns
stats : pd.DataFrame or dict

Linear regression summary:

'names' : name of variable(s) in the model (e.g. x1, x2...)
'coef' : regression coefficients
'se' : standard error of the estimate
'T' : T-values
'pval' : p-values
'r2' : coefficient of determination (R2)
'adj_r2' : adjusted R2
'CI[2.5%]' : lower confidence interval
'CI[97.5%]' : upper confidence interval
'residuals' : residuals (only if as_dataframe is False)

Notes

The \(\beta\) coefficients are estimated using an ordinary least squares (OLS) regression, as implemented in the numpy.linalg.lstsq() function. The OLS method minimizes the sum of squared residuals, and leads to a closed-form expression for the estimated \(\beta\):

\[\hat{\beta} = (X^TX)^{-1} X^Ty\]
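
For illustration, here is a minimal sketch (with hypothetical variable names, not part of the pingouin API) checking the closed-form expression against numpy.linalg.lstsq() on synthetic full-rank data:

>>> import numpy as np
>>> rng = np.random.RandomState(0)
>>> Xd = np.column_stack((np.ones(30), rng.normal(size=(30, 2))))  # design matrix with intercept
>>> yd = Xd @ [1., 0.5, -0.3] + rng.normal(size=30)
>>> betas = np.linalg.pinv(Xd.T @ Xd) @ Xd.T @ yd  # closed-form OLS estimate
>>> np.allclose(betas, np.linalg.lstsq(Xd, yd, rcond=None)[0])
True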

It is generally recommended to include a constant term (intercept) in the model to limit bias and force the residual mean to equal zero. Note, however, that the intercept coefficient and its p-value are rarely meaningful.

The standard error of the estimates is a measure of the accuracy of the prediction defined as:

\[\sigma = \sqrt{\text{MSE} \cdot (X^TX)^{-1}}\]

where \(\text{MSE}\) is the mean squared error,

\[\text{MSE} = \frac{SS_{\text{resid}}}{n - p - 1} = \frac{\sum{(\text{true} - \text{pred})^2}}{n - p - 1}\]

\(p\) is the total number of predictor variables in the model (excluding the intercept) and \(n\) is the sample size.
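
Continuing the sketch above, each standard error is the square root of the corresponding diagonal element of \(\text{MSE} \cdot (X^TX)^{-1}\):

>>> resid = yd - Xd @ betas
>>> mse = (resid ** 2).sum() / (len(yd) - Xd.shape[1])  # Xd.shape[1] equals p + 1 here
>>> se = np.sqrt(mse * np.diag(np.linalg.pinv(Xd.T @ Xd)))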

Using the \(\beta\) coefficients and the standard errors, the T-values can be obtained:

\[T = \frac{\beta}{\sigma}\]

and the p-values approximated using a T-distribution with \(n - p - 1\) degrees of freedom.
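
Continuing the same sketch, the T-values, two-sided p-values and the \(1 - \alpha\) confidence intervals can be computed with scipy.stats.t:

>>> from scipy import stats
>>> dof = len(yd) - Xd.shape[1]  # n - p - 1 degrees of freedom
>>> tvals = betas / se
>>> pvals = 2 * stats.t.sf(np.abs(tvals), dof)  # two-sided p-values
>>> crit = stats.t.ppf(1 - 0.05 / 2, dof)  # critical value for alpha = 0.05
>>> ci = np.column_stack((betas - crit * se, betas + crit * se))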

The coefficient of determination (\(R^2\)) is defined as:

\[R^2 = 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}}\]

The adjusted \(R^2\) is defined as:

\[\overline{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]
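
Both quantities follow directly from the residuals of the sketch above:

>>> ss_resid = (resid ** 2).sum()
>>> ss_total = ((yd - yd.mean()) ** 2).sum()
>>> r2 = 1 - ss_resid / ss_total
>>> n, p = len(yd), Xd.shape[1] - 1  # p excludes the intercept
>>> adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)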

The residuals can be accessed via stats.residuals_ if stats is a pandas DataFrame or stats['residuals'] if stats is a dict.

Note that Pingouin will automatically remove any duplicate columns from \(X\), as well as any column with only one unique value (constant), excluding the intercept.

Results have been compared against sklearn, statsmodels and JASP.

Examples

1. Simple linear regression

>>> import numpy as np
>>> from pingouin import linear_regression
>>> np.random.seed(123)
>>> mean, cov, n = [4, 6], [[1, 0.5], [0.5, 1]], 30
>>> x, y = np.random.multivariate_normal(mean, cov, n).T
>>> lm = linear_regression(x, y)
>>> lm.round(2)
       names  coef    se     T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  4.40  0.54  8.16  0.00  0.24    0.21      3.29       5.50
1         x1  0.39  0.13  2.99  0.01  0.24    0.21      0.12       0.67
2. Multiple linear regression

>>> np.random.seed(42)
>>> z = np.random.normal(size=n)
>>> X = np.column_stack((x, z))
>>> lm = linear_regression(X, y)
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
3. Get the residuals

>>> np.round(lm.residuals_, 2)
array([ 1.18, -1.17,  1.32,  0.76, -1.25,  0.34, -1.54, -0.2 ,  0.36,
       -0.39,  0.69,  1.39,  0.2 , -1.14, -0.21, -1.68,  0.67, -0.69,
        0.62,  0.92, -1.  ,  0.64, -0.21, -0.78,  1.08, -0.03, -1.3 ,
        0.64,  0.81, -0.04])
4. Using a Pandas DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
>>> lm = linear_regression(df[['x', 'z']], df['y'])
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
5. No intercept and return coef only

>>> linear_regression(X, y, add_intercept=False, coef_only=True)
array([ 1.40935593, -0.2916508 ])
6. Return a dictionary instead of a DataFrame

>>> lm_dict = linear_regression(X, y, as_dataframe=False)
7. Remove missing values

>>> X[4, 1] = np.nan
>>> y[7] = np.nan
>>> linear_regression(X, y, remove_na=True, coef_only=True)
array([4.64069731, 0.35455398, 0.1888135 ])