pingouin.linear_regression

pingouin.linear_regression(X, y, add_intercept=True, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False)[source]

(Multiple) Linear regression.

Parameters
Xnp.array or list

Predictor(s). Shape = (n_samples, n_features) or (n_samples,).

ynp.array or list

Dependent variable. Shape = (n_samples).

add_interceptbool

If False, assume that the data are already centered. If True, add a constant term to the model. In this case, the first value in the output dict is the intercept of the model.

coef_onlybool

If True, return only the regression coefficients.

alphafloat

Alpha value used for the confidence intervals. CI = [alpha / 2 ; 1 - alpha / 2]

as_dataframebool

If True, returns a pandas DataFrame. If False, returns a dictionnary.

remove_nabool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed).

Returns
statsdataframe or dict

Linear regression summary:

'names' : name of variable(s) in the model (e.g. x1, x2...)
'coef' : regression coefficients
'se' : standard error of the estimate
'T' : T-values
'pval' : p-values
'r2' : coefficient of determination (R2)
'adj_r2' : adjusted R2
'CI[2.5%]' : lower confidence interval
'CI[97.5%]' : upper confidence interval
'residuals' : residuals (only if as_dataframe is False)

Notes

The beta coefficients of the regression are estimated using the numpy.linalg.lstsq() function.

It is generally recommanded to include a constant term (intercept) to the model to limit the bias and force the residual mean to equal zero. Note that intercept coefficient and p-values are however rarely meaningful.

The standard error of the estimates is a measure of the accuracy of the prediction defined as:

\[se = \sqrt{MSE \cdot (X^TX)^{-1}}\]

where \(MSE\) is the mean squared error,

\[MSE = \frac{\sum{(true - pred)^2}}{n - p - 1}\]

\(p\) is the total number of explanatory variables in the model (excluding the intercept) and \(n\) is the sample size.

Using the coefficients and the standard errors, the T-values can be obtained:

\[T = \frac{coef}{se}\]

and the p-values can then be approximated using a T-distribution with \(n - p - 1\) degrees of freedom.

The coefficient of determination (\(R^2\)) is defined as:

\[R^2 = 1 - (\frac{SS_{resid}}{SS_{total}})\]

The adjusted \(R^2\) is defined as:

\[\overline{R}^2 = 1 - (1 - R^2) \frac{n - 1}{n - p - 1}\]

The residuals can be accessed via stats.residuals_ if stats is a pandas DataFrame or stats['residuals'] if stats is a dict.

Results have been compared against sklearn, statsmodels and JASP.

This function will return an error if missing values are present in either the target or predictor(s) variables. Please remove them before runing the function.

Examples

  1. Simple linear regression

>>> import numpy as np
>>> from pingouin import linear_regression
>>> np.random.seed(123)
>>> mean, cov, n = [4, 6], [[1, 0.5], [0.5, 1]], 30
>>> x, y = np.random.multivariate_normal(mean, cov, n).T
>>> lm = linear_regression(x, y)
>>> lm.round(2)
       names  coef    se     T  pval    r2  adj_r2  CI[2.5%]  CI[97.5%]
0  Intercept  4.40  0.54  8.16  0.00  0.24    0.21      3.29       5.50
1         x1  0.39  0.13  2.99  0.01  0.24    0.21      0.12       0.67
  1. Multiple linear regression

>>> np.random.seed(42)
>>> z = np.random.normal(size=n)
>>> X = np.column_stack((x, z))
>>> lm = linear_regression(X, y)
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
  1. Get the residuals

>>> np.round(lm.residuals_, 2)
array([ 1.18, -1.17,  1.32,  0.76, -1.25,  0.34, -1.54, -0.2 ,  0.36,
       -0.39,  0.69,  1.39,  0.2 , -1.14, -0.21, -1.68,  0.67, -0.69,
        0.62,  0.92, -1.  ,  0.64, -0.21, -0.78,  1.08, -0.03, -1.3 ,
        0.64,  0.81, -0.04])
  1. Using a Pandas DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
>>> lm = linear_regression(df[['x', 'z']], df['y'])
>>> print(lm['coef'].values)
[4.54123324 0.36628301 0.17709451]
  1. No intercept and return coef only

>>> linear_regression(X, y, add_intercept=False, coef_only=True)
array([ 1.40935593, -0.2916508 ])
  1. Return a dictionnary instead of a DataFrame

>>> lm_dict = linear_regression(X, y, as_dataframe=False)
  1. Remove missing values

>>> X[4, 1] = np.nan
>>> y[7] = np.nan
>>> linear_regression(X, y, remove_na=True, coef_only=True)
array([4.64069731, 0.35455398, 0.1888135 ])