pingouin.logistic_regression

pingouin.logistic_regression(X, y, coef_only=False, alpha=0.05, as_dataframe=True, remove_na=False, **kwargs)

(Multiple) Binary logistic regression.
 Parameters
 X : np.array or list
    Predictor(s). Shape = (n_samples, n_features) or (n_samples,).
 y : np.array or list
    Dependent variable. Shape = (n_samples,). y must be binary, i.e. contain only 0 or 1. Multinomial logistic regression is not supported.
 coef_only : bool
    If True, return only the regression coefficients.
 alpha : float
    Alpha value used for the confidence intervals. \(\text{CI} = [\alpha / 2 ; 1 - \alpha / 2]\)
 as_dataframe : bool
    If True, returns a pandas DataFrame. If False, returns a dictionary.
 remove_na : bool
    If True, apply a listwise deletion of missing values (i.e. the entire row is removed). Default is False, which will raise an error if missing values are present in either the predictor(s) or dependent variable.
 **kwargs : optional
    Optional arguments passed to sklearn.linear_model.LogisticRegression (see Notes).
 Returns
 stats : dataframe or dict
    Logistic regression summary:

    'names' : name of variable(s) in the model (e.g. x1, x2...)
    'coef' : regression coefficients (log-odds)
    'se' : standard error
    'z' : z-scores
    'pval' : two-tailed p-values
    'CI[2.5%]' : lower confidence interval
    'CI[97.5%]' : upper confidence interval
Notes
This is a wrapper around the sklearn.linear_model.LogisticRegression class. Importantly, Pingouin automatically disables the L2 regularization applied by scikit-learn. This can be modified by changing the penalty argument.

The logistic regression assumes that the log-odds (the logarithm of the odds) for the value labeled “1” in the response variable is a linear combination of the predictor variables. The log-odds are given by the logit function, which maps a probability \(p\) of the response variable being “1” from \((0, 1)\) to \((-\infty, +\infty)\).

\[\text{logit}(p) = \ln \frac{p}{1 - p} = \beta_0 + \beta X\]

The odds of the response variable being “1” can be obtained by exponentiating the log-odds:

\[\frac{p}{1 - p} = e^{\beta_0 + \beta X}\]

and the probability of the response variable being “1” is given by:

\[p = \frac{1}{1 + e^{-(\beta_0 + \beta X)}}\]

Note that the above function that converts log-odds to probability is called the logistic function.
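The relationship between probability, odds and log-odds can be checked numerically. The following is a minimal sketch using only the standard library; the helper names logit and logistic are illustrative and are not part of Pingouin's API.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to log-odds."""
    return math.log(p / (1 - p))

def logistic(log_odds):
    """Map log-odds back to a probability (inverse of logit)."""
    return 1 / (1 + math.exp(-log_odds))

p = 0.8
log_odds = logit(p)          # ln(0.8 / 0.2) = ln(4)
odds = math.exp(log_odds)    # exponentiating the log-odds gives p / (1 - p)
p_back = logistic(log_odds)  # the logistic function recovers p
```

Here odds is 4 (up to floating-point error), i.e. odds of 4:1, and p_back equals the starting probability.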
The first coefficient is always the constant term (intercept) of the model. Scikit-learn automatically adds the intercept to your predictor(s) matrix; therefore, \(X\) should not include a constant term. Pingouin will remove any constant term (e.g. a column with only one unique value) and any duplicate columns from \(X\).
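To illustrate what removing constant and duplicate columns means in practice, here is a rough sketch with NumPy. The helper drop_constant_and_duplicate_columns is hypothetical and written for illustration only; it is not Pingouin's internal implementation, which may differ in how it detects such columns.

```python
import numpy as np

def drop_constant_and_duplicate_columns(X):
    """Hypothetical helper: keep the first copy of each non-constant column."""
    X = np.asarray(X, dtype=float)
    kept = []
    for j in range(X.shape[1]):
        col = X[:, j]
        if np.all(col == col[0]):
            continue  # constant column: redundant with the intercept
        if any(np.array_equal(col, X[:, k]) for k in kept):
            continue  # exact duplicate of a column already kept
        kept.append(j)
    return X[:, kept], kept

# Column 0 is constant, column 2 duplicates column 1
X_raw = np.array([
    [1.0, 0.5, 0.5, 2.0],
    [1.0, 1.5, 1.5, 0.0],
    [1.0, 2.5, 2.5, 1.0],
])
X_clean, kept = drop_constant_and_duplicate_columns(X_raw)
```

In this toy matrix only columns 1 and 3 survive, so the fitted model would have an intercept plus two predictors.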
The calculation of the p-values and confidence intervals is adapted from code found at https://gist.github.com/rspeare/77061e6e317896be29c6de9a85db301d
Results have been compared against statsmodels, R, and JASP.
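As a rough guide to how the 'z', 'pval' and confidence interval columns relate to 'coef' and 'se', the sketch below applies the standard Wald formulas using only the standard library. The function wald_summary is a hypothetical name, and this is an assumption about the general approach rather than a copy of Pingouin's code. The inputs are the rounded HoursStudy coefficient and standard error from the exam example in the Examples section.

```python
from statistics import NormalDist

def wald_summary(coef, se, alpha=0.05):
    """z-score, two-tailed p-value and confidence interval for one coefficient."""
    norm = NormalDist()  # standard normal distribution
    z = coef / se
    pval = 2 * (1 - norm.cdf(abs(z)))   # two-tailed p-value
    crit = norm.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z, pval, (coef - crit * se, coef + crit * se)

# Rounded values taken from the HoursStudy row of the exam example
z, pval, (ci_low, ci_high) = wald_summary(coef=1.505, se=0.629)
```

Because the inputs are already rounded to three decimals, the results agree with the reported row (z = 2.393, pval = 0.017, CI = [0.272, 2.737]) only up to small rounding differences.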
Examples
Simple binary logistic regression
>>> import numpy as np
>>> from pingouin import logistic_regression
>>> np.random.seed(123)
>>> x = np.random.normal(size=30)
>>> y = np.random.randint(0, 2, size=30)
>>> lom = logistic_regression(x, y)
>>> lom.round(2)
       names  coef    se     z  pval  CI[2.5%]  CI[97.5%]
0  Intercept -0.27  0.37 -0.74  0.46     -1.00       0.45
1         x1  0.07  0.32  0.21  0.84     -0.55       0.68
Multiple binary logistic regression
>>> np.random.seed(42)
>>> z = np.random.normal(size=30)
>>> X = np.column_stack((x, z))
>>> lom = logistic_regression(X, y)
>>> print(lom['coef'].values)
[-0.36736745 -0.04374684 -0.47829392]
Using a Pandas DataFrame
>>> import pandas as pd
>>> df = pd.DataFrame({'x': x, 'y': y, 'z': z})
>>> lom = logistic_regression(df[['x', 'z']], df['y'])
>>> print(lom['coef'].values)
[-0.36736745 -0.04374684 -0.47829392]
Return only the coefficients
>>> logistic_regression(X, y, coef_only=True)
array([-0.36736745, -0.04374684, -0.47829392])
Passing custom parameters to sklearn
>>> lom = logistic_regression(X, y, solver='sag', max_iter=10000,
...                           random_state=42)
>>> print(lom['coef'].values)
[-0.36751796 -0.04367056 -0.47841908]
How to interpret the log-odds coefficients?
We’ll use the Wikipedia example of the probability of passing an exam versus the hours of study:
A group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?
>>> # First, let's create the dataframe
>>> Hours = [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
...          2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
>>> Pass = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1]
>>> df = pd.DataFrame({'HoursStudy': Hours, 'PassExam': Pass})
>>> # And then run the logistic regression
>>> lr = logistic_regression(df['HoursStudy'], df['PassExam']).round(3)
>>> lr
        names   coef     se      z   pval  CI[2.5%]  CI[97.5%]
0   Intercept -4.078  1.761 -2.316  0.021    -7.529     -0.626
1  HoursStudy  1.505  0.629  2.393  0.017     0.272      2.737
The Intercept coefficient (-4.078) is the log-odds of PassExam=1 when HoursStudy=0. The odds can be obtained by exponentiating the log-odds:

>>> np.exp(-4.078)
0.016941314421496552
i.e. \(0.017:1\). Conversely the odds of failing the exam are \((1/0.017) \approx 59:1\).
The probability can then be obtained with the following equation:

\[p = \frac{1}{1 + e^{-(-4.078 + 0 * 1.505)}}\]

>>> 1 / (1 + np.exp(-(-4.078)))
0.016659087580814722
The HoursStudy coefficient (1.505) means that for each additional hour of study, the log-odds of passing the exam increase by 1.505, and the odds are multiplied by \(e^{1.505} \approx 4.50\).

For example, a student who studies 2 hours has a probability of passing the exam of about 26%:

>>> 1 / (1 + np.exp(-(-4.078 + 2 * 1.505)))
0.2557836148964987
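The table below shows the probability of passing the exam for several values of HoursStudy:

==============  ========  ============  ===========
Hours of Study  Log-odds  Odds          Probability
==============  ========  ============  ===========
0               −4.08     0.017 ≈ 1:59  0.017
1               −2.57     0.076 ≈ 1:13  0.07
2               −1.07     0.34 ≈ 1:3    0.26
3               0.44      1.55          0.61
4               1.94      6.96          0.87
5               3.45      31.4          0.97
6               4.96      141.4         0.99
==============  ========  ============  ===========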
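The rows of this table can be recomputed from the two fitted coefficients. The sketch below uses only the standard library and plugs in the rounded values -4.078 (intercept) and 1.505 (HoursStudy); the function name pass_probability is illustrative and not part of Pingouin.

```python
import math

def pass_probability(hours, intercept=-4.078, slope=1.505):
    """Return (log-odds, odds, probability) of passing for a given study time."""
    log_odds = intercept + slope * hours
    odds = math.exp(log_odds)
    prob = odds / (1 + odds)  # equivalent to 1 / (1 + exp(-log_odds))
    return log_odds, odds, prob

for hours in range(7):
    log_odds, odds, prob = pass_probability(hours)
    print(f"{hours}  {log_odds:6.2f}  {odds:8.3f}  {prob:5.2f}")
```

Since the coefficients used here are rounded to three decimals, the last digit of some entries may differ slightly from the table.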