Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. For a full list of available functions, please refer to the API documentation.
ANOVAs: one- and two-way, repeated measures, mixed, ANCOVA
Pairwise post-hoc tests (parametric and non-parametric) and pairwise correlations
Robust, partial, distance and repeated measures correlations
Linear/logistic regression and mediation analysis
Bayes factor of T-test and Pearson correlation
Multivariate tests
Reliability and consistency
Effect sizes and power analysis
Parametric/bootstrapped confidence intervals around an effect size or a correlation coefficient
Circular statistics
Plotting: Bland-Altman plot, Q-Q plot, paired plot, robust correlation…
Pingouin is designed for users who want simple yet exhaustive statistical functions.
For example, the ttest_ind function of SciPy returns only the T-value and the p-value. By contrast, the ttest function of Pingouin returns the T-value, p-value, degrees of freedom, effect size (Cohen's d), 95% confidence intervals, statistical power and Bayes factor (BF10) of the test.
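To make the extra quantities concrete, here is a minimal sketch that computes the T-value, degrees of freedom and Cohen's d by hand for two independent samples. It assumes equal variances (pooled standard deviation), uses only the standard library, and runs on hypothetical toy data; it is an illustration of the textbook formulas, not Pingouin's implementation.

```python
import math
import statistics


def ttest_summary(x, y):
    """Return (T-value, degrees of freedom, Cohen's d) for two independent samples.

    Assumes equal variances (pooled standard deviation). Illustrative only.
    """
    nx, ny = len(x), len(y)
    dof = nx + ny - 2
    # Pooled variance weights each sample variance by its degrees of freedom
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / dof
    diff = statistics.mean(x) - statistics.mean(y)
    t_value = diff / math.sqrt(pooled_var * (1 / nx + 1 / ny))
    cohen_d = diff / math.sqrt(pooled_var)
    return t_value, dof, cohen_d


# Hypothetical toy data
x = [4.1, 3.8, 5.0, 4.4, 3.6]
y = [5.2, 4.9, 5.8, 5.5, 4.7]
t_value, dof, cohen_d = ttest_summary(x, y)
print(f"T = {t_value:.3f}, dof = {dof}, Cohen's d = {cohen_d:.3f}")
```

The p-value, power and Bayes factor need distribution functions that are not in the standard library, which is part of what a dedicated statistics package provides.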
Installation
Dependencies
The main dependencies of Pingouin are:
NumPy (>= 1.15)
SciPy (>= 1.1.0)
Pandas (>= 0.23)
Matplotlib (>= 3.0.2)
Seaborn (>= 0.9.0)
In addition, some functions require:
Statsmodels
Scikit-learn
Pingouin is a Python 3 package and is currently tested for Python 3.5, 3.6 and 3.7. Pingouin does not work with Python 2.7.
User installation
Pingouin can be easily installed using pip
pip install pingouin
or conda
conda install -c conda-forge pingouin
New releases are frequent so always make sure that you have the latest version:
pip install --upgrade pingouin
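To see which version is currently installed before upgrading, a small check along these lines can help. It is a sketch that relies on importlib.metadata from the standard library, which requires Python 3.8 or later (newer than the interpreter versions listed above); the helper name installed_version is hypothetical.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_version(package="pingouin"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(version if version is not None else "pingouin is not installed")
```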
GitHub repository
Link to the GitHub repository.
Quick start
Click on the link below and navigate to the notebooks/ folder to run a collection of interactive Jupyter notebooks showing the main functionalities of Pingouin. There is no need to install Pingouin beforehand: the notebooks run in a Binder environment.
10 minutes to Pingouin
1. T-test
import numpy as np
import pingouin as pg
np.random.seed(123)
mean, cov, n = [4, 5], [(1, .6), (.6, 1)], 30
x, y = np.random.multivariate_normal(mean, cov, n).T
# T-test
pg.ttest(x, y)
T       dof  tail       p-val  CI95%          cohen-d  BF10    power
-3.401  58   two-sided  0.001  [-1.68 -0.43]  0.878    26.155  0.917
2. Pearson’s correlation
pg.corr(x, y)
n   r      CI95%       r2     adj_r2  p-val  BF10    power
30  0.595  [0.3 0.79]  0.354  0.306   0.001  54.222  0.95
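The CI95% column can be approximately reproduced from r and n alone with the Fisher z-transformation, a standard way to build a confidence interval around a Pearson correlation. The sketch below uses only the standard library; whether Pingouin computes its interval in exactly this way is an assumption, and the function name pearson_ci is hypothetical.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.95):
    """Parametric confidence interval for Pearson's r via Fisher's z."""
    z = math.atanh(r)            # Fisher z-transform of r
    se = 1 / math.sqrt(n - 3)    # Standard error of z
    crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Back-transform the interval bounds to the correlation scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


lower, upper = pearson_ci(r=0.595, n=30)
print(f"[{lower:.2f} {upper:.2f}]")
```

With r = 0.595 and n = 30, the bounds round to 0.30 and 0.79, in line with the table above.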
3. Robust correlation
# Introduce an outlier
x[5] = 18
# Use the robust Shepherd's pi correlation
pg.corr(x, y, method="shepherd")
n   r      CI95%        r2     adj_r2  p-val  power
30  0.561  [0.25 0.77]  0.315  0.264   0.002  0.917
4. Test the normality of the data
# Return a boolean (True if normal) and the associated p-value
print(pg.normality(x, y)) # Univariate normality
print(pg.multivariate_normality(np.column_stack((x, y)))) # Multivariate normality
(array([False, True]), array([0., 0.552]))
(False, 0.00018)
5. Q-Q plot
import numpy as np
import pingouin as pg
np.random.seed(123)
x = np.random.normal(size=50)
ax = pg.qqplot(x, dist='norm')
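For readers curious about what a Q-Q plot displays, the points can be sketched without any plotting library: each ordered sample value is paired with the matching quantile of the reference distribution. The plotting positions (i - 0.5) / n used below are one common convention and an assumption here, not necessarily what pg.qqplot uses; the helper qq_points is hypothetical.

```python
import random
from statistics import NormalDist


def qq_points(sample):
    """Pair theoretical standard-normal quantiles with the ordered sample values."""
    ordered = sorted(sample)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: one common convention (an assumption)
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


random.seed(123)
sample = [random.gauss(0, 1) for _ in range(50)]
points = qq_points(sample)
print(points[:3])
```

If the sample is close to normal, these points fall near a straight line.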
6. One-way ANOVA using a pandas DataFrame
# Read an example dataset
df = pg.read_dataset('mixed_anova')
# Run the ANOVA
aov = pg.anova(data=df, dv='Scores', between='Group', detailed=True)
print(aov)
Source  SS       DF   MS     F      p-unc    np2
Group   5.460    1    5.460  5.244  0.02320  0.029
Within  185.343  178  1.041
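The MS, F and np2 columns follow directly from the sums of squares and degrees of freedom. As a rough check, the sketch below recomputes them from the SS and DF values printed above using the textbook formulas (mean square = SS / DF, F = MS effect / MS error, partial eta-squared = SS effect / (SS effect + SS error)). The function name anova_summary is hypothetical, and small differences from the table can arise because the inputs are already rounded.

```python
def anova_summary(ss_effect, df_effect, ss_error, df_error):
    """Return (MS effect, MS error, F, partial eta-squared) from SS and DF."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    f_value = ms_effect / ms_error
    np2 = ss_effect / (ss_effect + ss_error)
    return ms_effect, ms_error, f_value, np2


# SS and DF values taken from the Group and Within rows of the table above
ms_group, ms_within, f_value, np2 = anova_summary(5.460, 1, 185.343, 178)
print(f"MS = {ms_group:.3f} / {ms_within:.3f}, F = {f_value:.3f}, np2 = {np2:.3f}")
```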
7. Repeated measures ANOVA
pg.rm_anova(data=df, dv='Scores', within='Time', subject='Subject', detailed=True)
Source  SS       DF   MS     F      p-unc     np2    eps
Time    7.628    2    3.814  3.913  0.022629  0.062  0.999
Error   115.027  118  0.975
8. Post-hoc tests corrected for multiple comparisons
# FDR-corrected post hocs with Hedges' g effect size
posthoc = pg.pairwise_ttests(data=df, dv='Scores', within='Time', subject='Subject',
                             parametric=True, padjust='fdr_bh', effsize='hedges')
# Pretty printing of table
pg.print_table(posthoc, floatfmt='.3f')
Contrast  A        B        Paired  Parametric  T      dof     tail       p-unc  p-corr  p-adjust  BF10   CLES   hedges
Time      August   January  True    True        1.740  59.000  two-sided  0.087  0.131   fdr_bh    0.582  0.585  0.328
Time      August   June     True    True        2.743  59.000  two-sided  0.008  0.024   fdr_bh    4.232  0.644  0.485
Time      January  June     True    True        1.024  59.000  two-sided  0.310  0.310   fdr_bh    0.232  0.571  0.170
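The corrected p-values can be traced back to the uncorrected ones with the Benjamini-Hochberg procedure, which is what the 'fdr_bh' label refers to. The following is a minimal standard-library sketch of that procedure applied to the three uncorrected p-values above; it is not Pingouin's code, and the function name fdr_bh is chosen here only for illustration.

```python
def fdr_bh(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


p_unc = [0.087, 0.008, 0.310]  # Uncorrected p-values from the table above
p_corr = fdr_bh(p_unc)
print([f"{p:.4f}" for p in p_corr])
```

Each p-value is multiplied by the number of tests divided by its rank, so the smallest (0.008) becomes 0.024 and the largest stays at 0.310.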
9. Two-way mixed ANOVA
# Compute the two-way mixed ANOVA and export to a .csv file
aov = pg.mixed_anova(data=df, dv='Scores', between='Group', within='Time',
                     subject='Subject', correction=False,
                     export_filename='mixed_anova.csv')
pg.print_table(aov)
Source       SS     DF1  DF2  MS     F      p-unc  np2    eps
Group        5.460  1    58   5.460  5.052  0.028  0.080
Time         7.628  2    116  3.814  4.027  0.020  0.065  0.999
Interaction  5.168  2    116  2.584  2.728  0.070  0.045
10. Pairwise correlations between columns of a dataframe
import pandas as pd
np.random.seed(123)
z = np.random.normal(5, 1, 30)
data = pd.DataFrame({'X': x, 'Y': y, 'Z': z})
pg.pairwise_corr(data, columns=['X', 'Y', 'Z'])
X  Y  method   tail       n   r      CI95%         r2     adj_r2  z      p-unc  BF10   power
X  Y  pearson  two-sided  30  0.366  [0.01 0.64]   0.134  0.070   0.384  0.047  1.006  0.525
X  Z  pearson  two-sided  30  0.251  [-0.12 0.56]  0.063  -0.006  0.256  0.181  0.344  0.272
Y  Z  pearson  two-sided  30  0.020  [-0.34 0.38]  0.000  -0.074  0.020  0.916  0.142  0.051
11. Convert between effect sizes
# Convert from Cohen's d to Hedges' g
pg.convert_effsize(0.4, 'cohen', 'hedges', nx=10, ny=12)
0.384
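For intuition, the conversion from Cohen's d to Hedges' g is a small-sample bias correction. The sketch below applies one commonly used approximation, g = d * (1 - 3 / (4 * (nx + ny) - 9)); treating this as the formula behind pg.convert_effsize is an assumption, and the last digit may differ from the output above depending on the exact correction and rounding used.

```python
def cohen_to_hedges(d, nx, ny):
    """Approximate small-sample bias correction turning Cohen's d into Hedges' g."""
    correction = 1 - 3 / (4 * (nx + ny) - 9)
    return d * correction


g = cohen_to_hedges(0.4, nx=10, ny=12)
print(f"{g:.4f}")
```

The correction factor approaches 1 as the sample sizes grow, so d and g are nearly identical for large samples.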
12. Multiple linear regression
pg.linear_regression(data[['X', 'Z']], data['Y'])
names      coef    se     T       p-val  r2     adj_r2  CI[2.5%]  CI[97.5%]
Intercept  4.650   0.841  5.530   0.000  0.139  0.076   2.925     6.376
X          0.143   0.068  2.089   0.046  0.139  0.076   0.003     0.283
Z          -0.069  0.167  -0.416  0.681  0.139  0.076   -0.412    0.273
13. Mediation analysis
pg.mediation_analysis(data=data, x='X', m='Z', y='Y', seed=42, n_boot=1000)
path      coef    se     p-val  CI[2.5%]  CI[97.5%]  sig
Z ~ X     0.103   0.075  0.181  -0.051    0.256      No
Y ~ Z     0.018   0.171  0.916  -0.332    0.369      No
Total     0.136   0.065  0.047  0.002     0.269      Yes
Direct    0.143   0.068  0.046  0.003     0.283      Yes
Indirect  -0.007  0.025  0.898  -0.070    0.029      No
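The relationship between the rows can be illustrated with ordinary least squares: the total effect of X on Y equals the direct effect plus the indirect effect, where the indirect effect is the product of the X-to-mediator slope and the mediator coefficient in the full model. The sketch below checks this identity on hypothetical simulated data using closed-form OLS for one and two predictors. It omits the bootstrapped confidence intervals, and all names in it are illustrative rather than part of Pingouin.

```python
import random


def _centered_sum(u, v):
    """Sum of cross-products of u and v after centering each on its mean."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))


def mediation_paths(x, m, y):
    """Return OLS path coefficients for a single-mediator model."""
    sxx, smm = _centered_sum(x, x), _centered_sum(m, m)
    sxm, sxy, smy = _centered_sum(x, m), _centered_sum(x, y), _centered_sum(m, y)
    a = sxm / sxx                           # Slope of M ~ X
    total = sxy / sxx                       # Slope of Y ~ X
    det = sxx * smm - sxm ** 2
    direct = (smm * sxy - sxm * smy) / det  # X coefficient in Y ~ X + M
    b = (sxx * smy - sxm * sxy) / det       # M coefficient in Y ~ X + M
    return {"a": a, "b": b, "total": total, "direct": direct, "indirect": a * b}


# Hypothetical simulated data
random.seed(42)
x = [random.gauss(0, 1) for _ in range(30)]
m = [0.5 * xi + random.gauss(0, 1) for xi in x]
y = [0.3 * xi + 0.4 * mi + random.gauss(0, 1) for xi, mi in zip(x, m)]

paths = mediation_paths(x, m, y)
print({name: round(value, 3) for name, value in paths.items()})
```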
14. Bland-Altman plot
import numpy as np
import pingouin as pg
np.random.seed(123)
mean, cov = [10, 11], [[1, 0.8], [0.8, 1]]
x, y = np.random.multivariate_normal(mean, cov, 30).T
ax = pg.plot_blandaltman(x, y)
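The quantities usually drawn on a Bland-Altman plot are the mean difference between the two measurements (the bias) and the limits of agreement, conventionally the bias plus or minus 1.96 standard deviations of the differences. Here is a minimal standard-library sketch on hypothetical data; the defaults and styling of pg.plot_blandaltman may differ, and the helper name is made up for this example.

```python
import statistics


def bland_altman_limits(x, y, agreement=1.96):
    """Return (bias, lower limit, upper limit) of agreement between x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # Sample standard deviation of the differences
    return bias, bias - agreement * sd, bias + agreement * sd


# Hypothetical paired measurements from two methods
x = [10.2, 9.8, 11.1, 10.5, 9.4, 10.9]
y = [11.0, 10.9, 11.8, 11.9, 10.1, 12.0]
bias, lower, upper = bland_altman_limits(x, y)
print(f"bias = {bias:.3f}, limits of agreement = [{lower:.3f}, {upper:.3f}]")
```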
15. Plot achieved power of a paired T-test
Plot the curve of achieved power given the effect size (Cohen's d) and the sample size of a paired T-test.
import matplotlib.pyplot as plt
import seaborn as sns
import pingouin as pg
import numpy as np
sns.set(style='ticks', context='notebook', font_scale=1.2)
d = 0.5 # Fixed effect size
n = np.arange(5, 80, 5) # Incrementing sample size
# Compute the achieved power
pwr = pg.power_ttest(d=d, n=n, contrast='paired', tail='two-sided')
# Start the plot
plt.plot(n, pwr, 'ko-.')
plt.axhline(0.8, color='r', ls=':')
plt.xlabel('Sample size')
plt.ylabel('Power (1 - type II error)')
plt.title('Achieved power of a paired T-test')
sns.despine()
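To get a feel for the shape of this curve without any dependency, the power of a two-sided paired test can be roughly approximated with the normal distribution instead of the noncentral t distribution. This is a deliberate simplification that slightly overestimates power at small sample sizes; it is not how pg.power_ttest is implemented, and approx_paired_power is a hypothetical name.

```python
import math
from statistics import NormalDist


def approx_paired_power(d, n, alpha=0.05):
    """Normal approximation to the power of a two-sided paired test."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * math.sqrt(n)  # Approximate noncentrality for paired samples
    # Probability of exceeding the critical value in either tail
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


sizes = list(range(5, 80, 5))
powers = [approx_paired_power(0.5, n) for n in sizes]
for n, power in zip(sizes, powers):
    print(f"n = {n:2d}  power ~ {power:.3f}")
```

Under this approximation, an effect size of 0.5 crosses the conventional 0.8 power threshold somewhere between 30 and 35 pairs.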
16. Paired plot
import pingouin as pg
import numpy as np
df = pg.read_dataset('mixed_anova').query("Group == 'Meditation' and Time != 'January'")
ax = pg.plot_paired(data=df, dv='Scores', within='Time', subject='Subject', dpi=150)
ax.set_title("Effect of meditation on school performance")
Integration with Pandas
Several functions of Pingouin can be used directly as pandas.DataFrame
methods. Try for yourself with the code below:
import pingouin as pg
# Example 1 - ANOVA
df = pg.read_dataset('mixed_anova')
df.anova(dv='Scores', between='Group', detailed=True)
# Example 2 - Pairwise correlations
data = pg.read_dataset('mediation')
data.pairwise_corr(columns=['X', 'M', 'Y'], covar=['Mbin'])
# Example 3 - Partial correlation matrix
data.pcorr()
The functions that are currently supported as pandas methods are:
Development
Pingouin was created and is maintained by Raphael Vallat. Contributions are more than welcome so feel free to contact me, open an issue or submit a pull request!
To see the code or report a bug, please visit the GitHub repository.
Note that this program is provided with NO WARRANTY OF ANY KIND. If you can, always double-check the results with another statistical package.
Contributors
Nicolas Legrand
How to cite Pingouin?
If you want to cite Pingouin, please use the publication in JOSS:
Vallat, R. (2018). Pingouin: statistics in Python. Journal of Open Source Software, 3(31), 1026, https://doi.org/10.21105/joss.01026
@ARTICLE{Vallat2018,
  title = "Pingouin: statistics in Python",
  author = "Vallat, Raphael",
  journal = "The Journal of Open Source Software",
  volume = 3,
  number = 31,
  pages = "1026",
  month = nov,
  year = 2018
}
Acknowledgement
Several functions of Pingouin were inspired by R or Matlab toolboxes, including:
circular statistics (Matlab) (Berens, 2009)
robust correlations (Matlab) (Pernet, Wilcox & Rousselet, 2012)
repeated-measure correlation (R) (Bakdash & Marusich, 2017)
I am also grateful to Charles Zaiontz and his website www.real-statistics.com, which has been useful for understanding the practical implementation of several functions.