pingouin.logistic_regression

pingouin.logistic_regression(X, y, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, **kwargs)[source]

(Multiple) Binary logistic regression.

Parameters
Xarray_like

Predictor(s), of shape (n_samples, n_features) or (n_samples).

yarray_like

Dependent variable, of shape (n_samples). y must be binary, i.e. only contains 0 or 1. Multinomial logistic regression is not supported.

coef_onlybool

If True, return only the regression coefficients.

alphafloat

Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)

as_dataframebool

If True, returns a pandas DataFrame. If False, returns a dictionary.

remove_nabool

If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.

**kwargsoptional

Optional arguments passed to sklearn.linear_model.LogisticRegression (see Notes).

Returns
statspandas.DataFrame or dict

Logistic regression summary:

  • 'names': name of variable(s) in the model (e.g. x1, x2…)

  • 'coef': regression coefficients (log-odds)

  • 'se': standard error

  • 'z': z-scores

  • 'pval': two-tailed p-values

  • 'CI[2.5%]': lower confidence interval

  • 'CI[97.5%]': upper confidence interval

Notes

Caution

This function is a wrapper around the sklearn.linear_model.LogisticRegression class. However, Pingouin internally disables the L2 regularization and changes the default solver in order to get results that are similar to R and statsmodels.

The logistic regression assumes that the log-odds (the logarithm of the odds) for the value labeled “1” in the response variable is a linear combination of the predictor variables. The log-odds are given by the logit function, which maps a probability \(p\) of the response variable being “1” from \((0, 1)\) to \((-\infty, +\infty)\).

\[\text{logit}(p) = \ln \frac{p}{1 - p} = \beta_0 + \beta X\]

The odds of the response variable being “1” can be obtained by exponentiating the log-odds:

\[\frac{p}{1 - p} = e^{\beta_0 + \beta X}\]

and the probability of the response variable being “1” is given by the logistic function:

\[p = \frac{1}{1 + e^{-(\beta_0 + \beta X)}}\]

The first coefficient is always the constant term (intercept) of the model. Pingouin automatically adds the intercept to your predictor(s) matrix; therefore, \(X\) should not include a constant term. Pingouin will remove any constant term (e.g. a column with only one unique value) or duplicate columns from \(X\).
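The cleaning step described above can be sketched with NumPy (a toy illustration of the idea, not Pingouin's actual code):

```python
import numpy as np

# Toy design matrix: column 0 is constant and column 2 duplicates
# column 1, so both would be dropped before the intercept is added.
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 4.0, 4.0]])

# Drop constant columns (only one unique value)
keep = np.array([np.unique(col).size > 1 for col in X.T])
X = X[:, keep]

# Drop duplicate columns, preserving the original order
_, idx = np.unique(X, axis=1, return_index=True)
X = X[:, np.sort(idx)]  # a single column remains
```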

The calculation of the p-values and confidence intervals is adapted from code by Rob Speare. Results have been compared against statsmodels, R, and JASP.
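For reference, summary statistics of this kind can be derived for an unpenalized fit along the following lines: fit by Newton-Raphson, then take Wald standard errors from the inverse of the observed Fisher information. This is a minimal sketch of the standard approach, not Pingouin's actual implementation:

```python
import numpy as np
from scipy.stats import norm

def logistic_summary(X, y, n_iter=25):
    """Unpenalized logistic fit with Wald-type inference (sketch)."""
    X = np.column_stack((np.ones(len(X)), X))  # prepend the intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):  # Newton-Raphson iterations
        p = 1 / (1 + np.exp(-X @ beta))
        W = p * (1 - p)                     # Bernoulli variances
        hessian = (X * W[:, None]).T @ X    # observed Fisher information
        beta += np.linalg.solve(hessian, X.T @ (y - p))
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    z = beta / se
    pval = 2 * norm.sf(np.abs(z))           # two-tailed p-values
    return beta, se, z, pval
```

On the hours-of-study example given below, this reproduces the reported coefficients and standard errors.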

Examples

  1. Simple binary logistic regression.

In this first example, we’ll use the penguins dataset to see how well we can predict the sex of penguins based on their body mass.

>>> import numpy as np
>>> import pandas as pd
>>> import pingouin as pg
>>> df = pg.read_dataset('penguins')
>>> # Let's first convert the target variable from string to boolean:
>>> df['male'] = (df['sex'] == 'male').astype(int)  # male: 1, female: 0
>>> # Since there are missing values in our outcome variable, we need to
>>> # set `remove_na=True`, otherwise the regression will fail.
>>> lom = pg.logistic_regression(df['body_mass_g'], df['male'],
...                              remove_na=True)
>>> lom.round(2)
         names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0    Intercept -5.16  0.71 -7.24   0.0     -6.56      -3.77
1  body_mass_g  0.00  0.00  7.24   0.0      0.00       0.00

Body mass is a significant predictor of sex (p < 0.001). Here, it could be useful to rescale our predictor variable from g to kg (e.g. divide by 1000) in order to get more intuitive coefficients and confidence intervals:

>>> df['body_mass_kg'] = df['body_mass_g'] / 1000
>>> lom = pg.logistic_regression(df['body_mass_kg'], df['male'],
...                              remove_na=True)
>>> lom.round(2)
          names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0     Intercept -5.16  0.71 -7.24   0.0     -6.56      -3.77
1  body_mass_kg  1.23  0.17  7.24   0.0      0.89       1.56

  2. Multiple binary logistic regression

We’ll now add the species as a categorical predictor in our model. To do so, we first need to dummy-code our categorical variable, dropping the first level of our categorical variable (species = Adelie) which will be used as the reference level:

>>> df = pd.get_dummies(df, columns=['species'], drop_first=True)
>>> X = df[['body_mass_kg', 'species_Chinstrap', 'species_Gentoo']]
>>> y = df['male']
>>> lom = pg.logistic_regression(X, y, remove_na=True)
>>> lom.round(2)
               names   coef    se     z  pval  CI[2.5%]  CI[97.5%]
0          Intercept -26.24  2.84 -9.24  0.00    -31.81     -20.67
1       body_mass_kg   7.10  0.77  9.23  0.00      5.59       8.61
2  species_Chinstrap  -0.13  0.42 -0.31  0.75     -0.96       0.69
3     species_Gentoo  -9.72  1.12 -8.65  0.00    -11.92      -7.52
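Since the coefficients are log-odds, exponentiating them yields odds ratios. Using the rounded values printed above (a quick back-of-the-envelope check, not additional Pingouin output):

```python
import numpy as np

# Rounded log-odds coefficients from the summary above
# (body_mass_kg, species_Chinstrap, species_Gentoo)
coefs = np.array([7.10, -0.13, -9.72])
odds_ratios = np.exp(coefs)
# Holding species constant, each extra kg of body mass multiplies
# the odds of being male by roughly 1200.
```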

  3. Using a NumPy array and returning only the coefficients

>>> pg.logistic_regression(X.to_numpy(), y.to_numpy(), coef_only=True,
...                        remove_na=True)
array([-26.23906892,   7.09826571,  -0.13180626,  -9.71718529])

  4. Passing custom parameters to sklearn

>>> lom = pg.logistic_regression(X, y, solver='sag', max_iter=10000,
...                              random_state=42, remove_na=True)
>>> print(lom['coef'].to_numpy())
[-25.98248153   7.02881472  -0.13119779  -9.62247569]

How to interpret the log-odds coefficients?

We’ll use the Wikipedia example of the probability of passing an exam versus the hours of study:

A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

>>> # First, let's create the dataframe
>>> Hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
...          2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
>>> Pass = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
>>> df = pd.DataFrame({'HoursStudy': Hours, 'PassExam': Pass})
>>> # And then run the logistic regression
>>> lr = pg.logistic_regression(df['HoursStudy'], df['PassExam']).round(3)
>>> lr
        names   coef     se      z   pval  CI[2.5%]  CI[97.5%]
0   Intercept -4.078  1.761 -2.316  0.021    -7.529     -0.626
1  HoursStudy  1.505  0.629  2.393  0.017     0.272      2.737

The Intercept coefficient (-4.078) is the log-odds of PassExam=1 when HoursStudy=0. The odds ratio can be obtained by exponentiating the log-odds:

>>> np.exp(-4.078)
0.016941314421496552

i.e. \(0.017:1\). Conversely the odds of failing the exam are \((1/0.017) \approx 59:1\).

The probability can then be obtained with the following equation:

\[p = \frac{1}{1 + e^{-(-4.078 + 0 * 1.505)}}\]
>>> 1 / (1 + np.exp(-(-4.078)))
0.016659087580814722

The HoursStudy coefficient (1.505) means that for each additional hour of study, the log-odds of passing the exam increase by 1.505, and the odds are multiplied by \(e^{1.505} \approx 4.50\).

For example, a student who studies 2 hours has a probability of passing the exam of 25%:

>>> 1 / (1 + np.exp(-(-4.078 + 2 * 1.505)))
0.2557836148964987

The table below shows the probability of passing the exam for several values of HoursStudy:

Hours of Study   Log-odds   Odds            Probability
0                −4.08      0.017 ≈ 1:59    0.017
1                −2.57      0.076 ≈ 1:13    0.07
2                −1.07      0.34 ≈ 1:3      0.26
3                0.44       1.55            0.61
4                1.94       6.96            0.87
5                3.45       31.4            0.97
6                4.96       141.4           0.99
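The rows of the table can be recomputed from the fitted coefficients (using the rounded values −4.078 and 1.505 from the summary above, so the numbers match only approximately):

```python
import numpy as np

# Log-odds, odds and probability of passing for 0 to 6 hours of study,
# using the rounded coefficients (-4.078, 1.505) reported above.
hours = np.arange(7)
log_odds = -4.078 + 1.505 * hours
odds = np.exp(log_odds)
prob = 1 / (1 + np.exp(-log_odds))
```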