pingouin.normality

pingouin.normality(data, dv=None, group=None, method='shapiro', alpha=0.05)

Univariate normality test.

Parameters
data : dataframe, series, list or 1D np.array

Iterable. Can be either a single list, 1D numpy array, or a wide- or long-format pandas dataframe.

dv : str

Dependent variable (only when data is a long-format dataframe).

group : str

Grouping variable (only when data is a long-format dataframe).

method : str

Normality test. ‘shapiro’ (default) performs the Shapiro-Wilk test using scipy.stats.shapiro(), and ‘normaltest’ performs the omnibus test of normality using scipy.stats.normaltest(). The latter is more appropriate for large samples.

alpha : float

Significance level.

Returns
stats : dataframe

Pandas DataFrame with columns:

  • 'W': test statistic

  • 'pval': p-value

  • 'normal': True if the null hypothesis of normality is not rejected at the chosen significance level (pval > alpha).

See also

homoscedasticity

Test equality of variance.

sphericity

Mauchly’s test for sphericity.

Notes

The Shapiro-Wilk test calculates a \(W\) statistic that tests whether a random sample \(x_1, x_2, ..., x_n\) comes from a normal distribution.

The \(W\) statistic is calculated as follows:

\[W = \frac{(\sum_{i=1}^n a_i x_{i})^2} {\sum_{i=1}^n (x_i - \overline{x})^2}\]

where the \(x_i\) are the ordered sample values (in ascending order) and the \(a_i\) are constants generated from the means, variances and covariances of the order statistics of a sample of size \(n\) from a standard normal distribution. Specifically:

\[(a_1, ..., a_n) = \frac{m^TV^{-1}}{(m^TV^{-1}V^{-1}m)^{1/2}}\]

where \(m = (m_1, ..., m_n)^T\) is the vector of expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and \(V\) is the covariance matrix of those order statistics.

The null-hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
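The decision rule above can be sketched by calling scipy.stats.shapiro() directly and comparing the p-value with alpha. This is a minimal illustration, not Pingouin's internal code; the sample names, the seed, and the `results` dictionary are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)            # arbitrary seed, for reproducibility
samples = {
    "normal": rng.normal(size=100),        # drawn from a normal distribution
    "exponential": rng.exponential(size=100),  # strongly right-skewed
}

alpha = 0.05
results = {}
for name, sample in samples.items():
    w, pval = stats.shapiro(sample)
    # Normality is rejected when the p-value falls below alpha.
    results[name] = {"W": float(w), "pval": float(pval),
                     "normal": bool(pval > alpha)}
    print(name, results[name])
```

A skewed sample such as the exponential one should yield a very small p-value, so its 'normal' flag is False.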

The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes. Indeed, quoting from Wikipedia:

“Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case.”
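As a complement to the p-value, the points of a normal Q–Q plot can be computed with scipy.stats.probplot(). The sketch below is one possible way to do this and only computes the quantiles and the fitted line; the sample, its mean and standard deviation, and the seed are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)             # arbitrary seed
x = rng.normal(loc=10, scale=2, size=500)  # hypothetical sample

# probplot returns the theoretical quantiles (osm), the ordered sample
# values (osr), and a least-squares line fitted through the Q-Q points.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

# For approximately normal data the points lie close to a straight line,
# so r is close to 1, the slope estimates the standard deviation and the
# intercept estimates the mean.
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.4f}")
```

Passing `plot=matplotlib.pyplot` to probplot() draws the plot instead of only returning the values.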

Note that missing values are automatically removed (casewise deletion).
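The effect of removing missing values can be reproduced by hand before calling the underlying SciPy test. The sketch below assumes that removal amounts to dropping the NaN entries of the tested sample; it uses scipy.stats.normaltest() directly and is not Pingouin's internal code. The array, the seed, and the positions of the missing values are hypothetical.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)     # arbitrary seed
x = rng.normal(size=50)
x[[3, 17]] = np.nan                # introduce two missing values

x_clean = x[~np.isnan(x)]          # drop missing values before testing
stat, pval = stats.normaltest(x_clean)
n_used = x_clean.size              # number of observations actually tested

print(f"n_used={n_used}, statistic={stat:.3f}, pval={pval:.3f}")
```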

References

[1] Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.

[2] https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm

[3] https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test

Examples

  1. Shapiro-Wilk test on a 1D array.

>>> import numpy as np
>>> import pingouin as pg
>>> np.random.seed(123)
>>> x = np.random.normal(size=100)
>>> pg.normality(x)
         W      pval  normal
0  0.98414  0.274886    True

  2. Omnibus test on a wide-format dataframe with missing values

>>> data = pg.read_dataset('mediation')
>>> data.loc[1, 'X'] = np.nan
>>> pg.normality(data, method='normaltest')
               W           pval  normal
X       1.791839   4.082320e-01    True
M       0.492349   7.817859e-01    True
Y       0.348676   8.400129e-01    True
Mbin  839.716156  4.549393e-183   False
Ybin  814.468158  1.381932e-177   False

  3. Pandas Series

>>> pg.normality(data['X'], method='normaltest')
          W      pval  normal
X  1.791839  0.408232    True

  4. Long-format dataframe

>>> data = pg.read_dataset('rm_anova2')
>>> pg.normality(data, dv='Performance', group='Time')
             W      pval  normal
Pre   0.967718  0.478773    True
Post  0.940728  0.095157    True