FAQ

Python

To install Python on your computer, we recommend Anaconda, a Python distribution that ships with the most important scientific packages. Then, open the newly installed Anaconda Prompt and type:

conda install pip

This will install pip, the most widely used package manager for Python. Once pip is installed, you can install Pingouin. Still in the Anaconda Prompt, run the following command:

pip install pingouin

You are almost ready to use Pingouin. First, you need to open an interactive Python console (either IPython or Jupyter). To do so, type the following command:

ipython

Now, let’s do a simple paired T-test using Pingouin:

import pingouin as pg
# Create two variables
x = [4, 6, 5, 7, 6]
y = [2, 2, 3, 1, 2]
# Run a T-test
pg.ttest(x, y, paired=True)

There are two ways to import Pingouin:

# 1) Import the full package
# --> Best if you are planning to use several Pingouin functions.
import pingouin as pg
pg.ttest(x, y)

# 2) Import specific functions
# --> Best if you are planning to use only this specific function.
from pingouin import ttest
ttest(x, y)

Statsmodels is a great statistical Python package that provides several advanced functions (regression, GLM, time-series analysis) as well as an R-like syntax for fitting models. However, statsmodels can be quite hard to grasp and use for Python beginners and/or users who just want to perform simple statistical tests. The goal of Pingouin is not to replace statsmodels but rather to provide some easy-to-use functions to perform the most widely used statistical tests. Pingouin also provides some novel functions (to cite but a few: effect sizes, pairwise T-tests and correlations, ICC, repeated measures correlation, circular statistics…).

The scipy.stats module provides several low-level statistical functions. However, most of these functions do not return a very detailed output (e.g. only the T- and p-values for a T-test). Most Pingouin functions use the low-level SciPy functions to provide a richer, more exhaustive output. See for yourself:

import pingouin as pg
from scipy.stats import ttest_ind

x = [4, 6, 5, 7, 6]
y = [2, 2, 3, 1, 2]

print(pg.ttest(x, y))   # Pingouin: returns a DataFrame with T-value, p-value, degrees of freedom, tail, Cohen d, power and Bayes Factor
print(ttest_ind(x, y))  # SciPy: returns only the T- and p-values

Data

To load a .csv or Excel file, use the pandas.read_csv() or pandas.read_excel() functions:

import pandas as pd
pd.read_csv('myfile.csv')     # Load a .csv file
pd.read_excel('myfile.xlsx')  # Load an Excel file
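
As a minimal sketch of what pandas.read_csv() returns, the snippet below parses a small CSV held in memory (via io.StringIO), so it runs without a file on disk. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would live in 'myfile.csv'.
csv_text = """Subject,Pre,Post
1,2.5,3.1
2,4.2,4.8
3,2.5,2.9
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # First rows of the DataFrame
print(df.dtypes)   # Column types inferred by pandas
```

Checking the first rows and the inferred column types right after loading is a cheap way to catch a wrong separator or a numeric column that was read as text.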

Pingouin hates missing values almost as much as you do!

Most functions of Pingouin will automatically remove the missing values. In the case of paired measurements (e.g. paired T-test, correlation, or repeated measures ANOVA), a listwise deletion of missing values is performed, meaning that the entire row is removed. This is generally the best strategy if you have a large sample size and only a few missing values. However, this can be quite drastic if there are a lot of missing values in your data. In that case, it might be useful to look at imputation methods (see Pandas documentation).
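
To make the trade-off concrete, here is a small pandas sketch that contrasts listwise deletion with a very simple mean imputation. The data are hypothetical, and mean imputation is shown only as one basic option, not as a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical paired measurements with two missing values.
df = pd.DataFrame({
    "Pre": [2.5, 4.2, np.nan, 3.0, 3.6],
    "Post": [3.1, 4.8, 2.9, np.nan, 4.0],
})

# Listwise deletion: drop every row that has at least one missing value.
complete = df.dropna()

# Simple mean imputation: replace each missing value by its column mean.
imputed = df.fillna(df.mean())

print(len(df), len(complete))  # Number of rows before and after deletion
print(imputed)
```

Here listwise deletion discards two of the five rows, which illustrates how quickly the sample shrinks when missing values are spread across different rows.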

If you prefer to know what’s going on under the hood, you can also remove the missing values a priori using the pingouin.remove_na() and pingouin.remove_rm_na() functions. The first one is a convenient and flexible function to remove rows or columns with missing values in 1D or 2D array(s). The second one is specifically geared toward long-format repeated measures dataframes, such as the ones required by the pingouin.rm_anova() function.
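
For long-format data, listwise deletion means dropping every row of any subject who has at least one missing score. The snippet below is a rough pandas sketch of that idea, not Pingouin's implementation; the column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: subject 2 has a missing "Post" score.
df = pd.DataFrame({
    "Subject": [1, 1, 2, 2, 3, 3],
    "Time": ["Pre", "Post", "Pre", "Post", "Pre", "Post"],
    "Scores": [2.5, 3.1, 4.2, np.nan, 2.5, 2.9],
})

# Identify subjects with at least one missing score...
incomplete = df.loc[df["Scores"].isna(), "Subject"].unique()

# ...and drop all of their rows, not just the row with the missing value.
clean = df[~df["Subject"].isin(incomplete)].reset_index(drop=True)

print(clean)
```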

In wide format, each row represents a subject, and each column a measurement (e.g. “Pre”, “Post”). This is the most convenient way for humans to look at repeated measurements. It typically results in a spreadsheet with more columns than rows. An example of a wide-format dataframe is shown below:

Subject   Pre   Post   Gender   Age
1         2.5   3.1    M        24
2         4.2   4.8    F        32
3         2.5   2.9    F        38

In long format, each row is one time point per subject and each column is a variable (e.g. one column with the “Subject” identifier, another with the “Scores” and another with the “Time” grouping factor). In long format, there are usually many more rows than columns. While this is harder for humans to read, it is much easier for computers to process. For this reason, all the repeated measures functions in Pingouin work only with long-format dataframes. In the example below, the wide-format dataframe from above was converted into a long-format dataframe:

Subject   Gender   Age   Time   Scores
1         M        24    Pre    2.5
1         M        24    Post   3.1
2         F        32    Pre    4.2
2         F        32    Post   4.8
3         F        38    Pre    2.5
3         F        38    Post   2.9

The Pandas package provides some convenient functions to convert from one format to the other:

  • From wide-format to long-format (easier to read for computers), use the pandas.melt() function.

  • From long-format to wide-format, use the pandas.pivot_table() function.
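
The two conversions can be sketched as follows, using the example values from the tables above. The variable names (wide, long_df, wide_again) are illustrative choices, and this is only one of several ways to do the round trip in pandas.

```python
import pandas as pd

# Wide format: one row per subject.
wide = pd.DataFrame({
    "Subject": [1, 2, 3],
    "Pre": [2.5, 4.2, 2.5],
    "Post": [3.1, 4.8, 2.9],
    "Gender": ["M", "F", "F"],
    "Age": [24, 32, 38],
})

# Wide -> long: "Pre" and "Post" are stacked into a "Time" / "Scores" pair.
long_df = wide.melt(
    id_vars=["Subject", "Gender", "Age"],
    value_vars=["Pre", "Post"],
    var_name="Time",
    value_name="Scores",
)
# Stable sort so that each subject's "Pre" row stays before its "Post" row.
long_df = long_df.sort_values("Subject", kind="stable").reset_index(drop=True)

# Long -> wide: one column per level of "Time".
wide_again = long_df.pivot_table(
    index=["Subject", "Gender", "Age"], columns="Time", values="Scores"
).reset_index()
wide_again.columns.name = None  # Drop the leftover "Time" label on the columns

print(long_df)
print(wide_again)
```

Note that pandas.pivot_table() aggregates duplicated entries (with the mean by default), so it is worth checking that each subject has only one score per time point before converting back.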

Pingouin does not compute descriptive statistics itself: the central idea behind Pingouin is that all data manipulations and descriptive statistics should first be performed in Pandas (or NumPy). For example, to compute the mean, standard deviation, and quartiles of all the numeric columns of a pandas DataFrame, one can use the pandas.DataFrame.describe() method:

data.describe()
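
As a self-contained sketch, the snippet below builds a small hypothetical DataFrame named data and summarizes it, first across all numeric columns and then separately for each level of a grouping column.

```python
import pandas as pd

# Hypothetical data with one categorical and two numeric columns.
data = pd.DataFrame({
    "Gender": ["M", "F", "F"],
    "Age": [24, 32, 38],
    "Scores": [2.5, 4.2, 2.5],
})

# By default, describe() summarizes the numeric columns only.
summary = data.describe()

# The same summary of "Scores", computed separately for each gender.
by_group = data.groupby("Gender")["Scores"].describe()

print(summary)
print(by_group)
```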

Others

Pingouin is licensed under the GNU General Public License v3.0 (GPL-3), which is less permissive than the BSD or MIT licenses. The reason for this is that Pingouin borrows extensively from R packages that are themselves licensed under the GPL-3. To read more about what you can and cannot do with a GPL-3 license, please visit tldrlegal.com or choosealicense.com.

To be notified whenever a new release of Pingouin is available, click on “Watch releases” on the Pingouin GitHub page.

When a new release is out, you can upgrade your version by typing the following line in a terminal window:

pip install --upgrade pingouin
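
To check which version is installed before or after upgrading, one option is the standard-library importlib.metadata module (Python 3.8+). The helper installed_version() below is a hypothetical convenience function written for this example; it is not part of Pingouin.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print(installed_version("pingouin"))
```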

There are many ways to contribute to Pingouin, even if you are not a programmer: for example, reporting bugs or results that are inconsistent with other statistical software, improving the documentation and examples, or even buying the developers a coffee!

To cite Pingouin, please use the publication in JOSS:

Vallat, R. (2018). Pingouin: statistics in Python. Journal of Open Source Software, 3(31), 1026, https://doi.org/10.21105/joss.01026

BibTeX:

@ARTICLE{Vallat2018,
  title    = "Pingouin: statistics in Python",
  author   = "Vallat, Raphael",
  journal  = "The Journal of Open Source Software",
  volume   =  3,
  number   =  31,
  pages    = "1026",
  month    =  nov,
  year     =  2018
}