pingouin.logistic_regression

pingouin.logistic_regression(X, y, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, **kwargs)[source]

(Multiple) Binary logistic regression.

Parameters
X : np.array or list

Predictor(s). Shape = (n_samples, n_features) or (n_samples,).

y : np.array or list

Dependent variable. Shape = (n_samples,). Must be binary.

coef_only : bool

If True, return only the regression coefficients.

alpha : float

Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)

as_dataframe : bool

If True, returns a pandas DataFrame. If False, returns a dictionary.

remove_na : bool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.

**kwargs : optional

Optional arguments passed to sklearn.linear_model.LogisticRegression.
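
For illustration, the listwise deletion performed when remove_na=True amounts to something like the following sketch (toy data; the masking shown here is an assumption about the behavior, not the library's actual code):

```python
import numpy as np

# Toy data with a missing value in the second row of the predictors
X = np.array([[1.0, 2.0], [np.nan, 0.5], [3.0, 1.5]])
y = np.array([0, 1, 1])

# Listwise deletion: drop any row with a NaN in either X or y
mask = ~(np.isnan(X).any(axis=1) | np.isnan(y.astype(float)))
X_clean, y_clean = X[mask], y[mask]
```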

Returns
stats : dataframe or dict

Logistic regression summary:

'names' : name of variable(s) in the model (e.g. x1, x2...)
'coef' : regression coefficients
'se' : standard error
'z' : z-scores
'pval' : two-tailed p-values
'CI[2.5%]' : lower confidence interval
'CI[97.5%]' : upper confidence interval

Notes

This is a wrapper around the sklearn.linear_model.LogisticRegression class. Note that Pingouin automatically disables the l2 regularization applied by scikit-learn. This can be modified by changing the penalty argument.

The calculation of the p-values and confidence intervals is adapted from code found at https://gist.github.com/rspeare/77061e6e317896be29c6de9a85db301d
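
The Wald approach behind these statistics can be sketched as follows; the coefficient and standard-error values below are hypothetical placeholders, not taken from a fitted model:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical coefficient and its standard error (not from a real fit)
coef, se = 0.07, 0.32
alpha = 0.05

# Wald z-score and two-tailed p-value
z = coef / se
pval = 2 * norm.sf(abs(z))

# Confidence interval bounds at the given alpha
crit = norm.ppf(1 - alpha / 2)
ci = (coef - crit * se, coef + crit * se)
```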

Note that the first coefficient is always the constant term (intercept) of the model. Scikit-learn will automatically add the intercept to your predictor(s) matrix; therefore, \(X\) should not include a constant term. Pingouin will remove any constant term (e.g. a column with only one unique value) or duplicate columns from \(X\).

Results have been compared against statsmodels, R, and JASP.

Examples

  1. Simple binary logistic regression

>>> import numpy as np
>>> from pingouin import logistic_regression
>>> np.random.seed(123)
>>> x = np.random.normal(size=30)
>>> y = np.random.randint(0, 2, size=30)
>>> lom = logistic_regression(x, y)
>>> lom.round(2)
       names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0  Intercept -0.27  0.37 -0.74  0.46     -1.00       0.45
1         x1  0.07  0.32  0.21  0.84     -0.55       0.68
  2. Multiple binary logistic regression

>>> np.random.seed(42)
>>> z = np.random.normal(size=30)
>>> X = np.column_stack((x, z))
>>> lom = logistic_regression(X, y)
>>> print(lom['coef'].values)
[-0.36736745 -0.04374684 -0.47829392]
  3. Using a Pandas DataFrame

>>> import pandas as pd
>>> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
>>> lom = logistic_regression(df[['x', 'z']], df['y'])
>>> print(lom['coef'].values)
[-0.36736745 -0.04374684 -0.47829392]
  4. Return only the coefficients

>>> logistic_regression(X, y, coef_only=True)
array([-0.36736745, -0.04374684, -0.47829392])
  5. Passing custom parameters to sklearn

>>> lom = logistic_regression(X, y, solver='sag', max_iter=10000,
...                           random_state=42)
>>> print(lom['coef'].values)
[-0.36751796 -0.04367056 -0.47841908]
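
Because the model is fit on the log-odds scale, exponentiating a coefficient gives an odds ratio, which is often easier to interpret. A small sketch using the coefficients printed in the multiple-regression example above:

```python
import numpy as np

# Coefficients as printed in the multiple-regression example above
# (intercept first)
coefs = np.array([-0.36736745, -0.04374684, -0.47829392])

# Each odds ratio is the multiplicative change in the odds of y = 1
# for a one-unit increase in the corresponding predictor
odds_ratios = np.exp(coefs)
```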