pingouin.normality

pingouin.normality(data, dv=None, group=None, method='shapiro', alpha=0.05)
Univariate normality test.
Parameters
data : dataframe, series, list or 1D np.array
    Iterable. Can be either a single list, a 1D numpy array, or a wide- or long-format pandas dataframe.
dv : str
    Dependent variable (only when data is a long-format dataframe).
group : str
    Grouping variable (only when data is a long-format dataframe).
method : str
    Normality test. 'shapiro' (default) performs the Shapiro-Wilk test using scipy.stats.shapiro(), and 'normaltest' performs the omnibus test of normality using scipy.stats.normaltest(). The latter is more appropriate for large samples.
alpha : float
    Significance level.
Returns
stats : dataframe
    Pandas DataFrame with columns:
    'W' : test statistic
    'pval' : p-value
    'normal' : True if normality of data is not rejected at the alpha level.
See also
homoscedasticity
Test equality of variance.
sphericity
Mauchly’s test for sphericity.
Notes
The Shapiro-Wilk test calculates a \(W\) statistic that tests whether a random sample \(x_1, x_2, ..., x_n\) comes from a normal distribution.
The \(W\) statistic is calculated as follows:
\[W = \frac{(\sum_{i=1}^n a_i x_{i})^2}{\sum_{i=1}^n (x_i - \overline{x})^2}\]
where the \(x_i\) are the ordered sample values (in ascending order) and the \(a_i\) are constants generated from the means, variances and covariances of the order statistics of a sample of size \(n\) from a standard normal distribution. Specifically:
\[(a_1, ..., a_n) = \frac{m^TV^{-1}}{(m^TV^{-1}V^{-1}m)^{1/2}}\]
with \(m = (m_1, ..., m_n)^T\), where \((m_1, ..., m_n)\) are the expected values of the order statistics of independent and identically distributed random variables sampled from the standard normal distribution, and \(V\) is the covariance matrix of those order statistics.
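To make the structure of the statistic concrete, the sketch below computes a simplified variant in plain Python. It is not the computation performed by scipy.stats.shapiro(): it assumes that the covariance matrix \(V\) can be replaced by the identity, so that the weights reduce to \(a = m / \|m\|\) (the Shapiro-Francia statistic, often written \(W'\)), and it approximates the expected order statistics \(m_i\) with Blom's formula. The function names blom_scores and shapiro_francia_w are illustrative only.

```python
from statistics import NormalDist, fmean


def blom_scores(n):
    """Approximate expected normal order statistics m_1..m_n (Blom's formula)."""
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def shapiro_francia_w(sample):
    """Simplified W: weights a = m / ||m||, i.e. V is ASSUMED to be the identity."""
    x = sorted(sample)                      # ordered sample values
    m = blom_scores(len(x))
    norm_m = sum(mi * mi for mi in m) ** 0.5
    a = [mi / norm_m for mi in m]           # unit-length weight vector
    numerator = sum(ai * xi for ai, xi in zip(a, x)) ** 2
    mean_x = fmean(x)
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    return numerator / denominator


# A sample exactly proportional to the normal scores gives a value close to 1,
# while a strongly skewed sample gives a clearly smaller value.
w_normal_like = shapiro_francia_w(blom_scores(20))
w_skewed = shapiro_francia_w([1, 2, 4, 8, 16, 32, 64, 128])
```

Values close to 1 are compatible with normality; the further the statistic falls below 1, the stronger the evidence against it. The sketch does not compute a p-value and does not guard against a constant sample, for which the denominator is zero.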
The null hypothesis of this test is that the population is normally distributed. Thus, if the p-value is less than the chosen alpha level (typically set at 0.05), then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
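The decision rule can be sketched in a few lines of plain Python. The helper name decide is hypothetical, the first two p-values are taken from the examples on this page, the third is a made-up value for illustration, and the strict comparison pval > alpha is an assumption about how the boundary case is handled.

```python
def decide(pval, alpha=0.05):
    """Return True when the null hypothesis of normality is not rejected."""
    # Assumption: a p-value exactly equal to alpha counts as a rejection.
    return pval > alpha


pvals = {
    "x": 0.274886,         # 1D array example on this page
    "Post": 0.095157,      # long-format example on this page
    "made_up": 0.003,      # hypothetical value, for illustration only
}
decisions = {name: decide(p) for name, p in pvals.items()}

# With a larger alpha the same p-value can lead to a rejection.
post_at_10_percent = decide(0.095157, alpha=0.10)
```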
The result of the Shapiro-Wilk test should be interpreted with caution in the case of large sample sizes. Indeed, quoting from Wikipedia:
“Like most statistical significance tests, if the sample size is sufficiently large this test may detect even trivial departures from the null hypothesis (i.e., although there may be some statistically significant effect, it may be too small to be of any practical significance); thus, additional investigation of the effect size is typically advisable, e.g., a Q–Q plot in this case.”
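The Q-Q plot mentioned in the quote pairs each ordered sample value with the corresponding quantile of a standard normal distribution. Below is a minimal sketch of how those pairs could be computed in plain Python, assuming the plotting positions (i - 0.5) / n, which is one common convention among several. The function name qq_points is illustrative, and the plotting itself is left out.

```python
from statistics import NormalDist


def qq_points(sample):
    """Return (theoretical quantile, ordered sample value) pairs."""
    x = sorted(sample)
    n = len(x)
    inv_cdf = NormalDist().inv_cdf
    # Plotting positions (i - 0.5) / n are an assumed convention.
    return [(inv_cdf((i - 0.5) / n), xi) for i, xi in enumerate(x, start=1)]


points = qq_points([2.1, -0.3, 0.8, 1.5, -1.2])
```

Points that lie close to a straight line indicate that any departure from normality is small, even when the test rejects the null hypothesis.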
Note that missing values are automatically removed (casewise deletion).
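For a single 1D sample, removing missing values amounts to dropping the NaN entries before the test is computed. A minimal sketch in plain Python follows; drop_missing is a hypothetical helper and not the routine pingouin uses internally.

```python
import math


def drop_missing(sample):
    """Return the sample without None or NaN entries."""
    return [
        v for v in sample
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]


raw = [0.4, float("nan"), -1.3, None, 2.2]
clean = drop_missing(raw)   # [0.4, -1.3, 2.2]
```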
References
 1
Shapiro, S. S., & Wilk, M. B. (1965). An analysis of variance test for normality (complete samples). Biometrika, 52(3/4), 591-611.
 2
https://www.itl.nist.gov/div898/handbook/prc/section2/prc213.htm
 3
Examples
Shapiro-Wilk test on a 1D array.
>>> import numpy as np
>>> import pingouin as pg
>>> np.random.seed(123)
>>> x = np.random.normal(size=100)
>>> pg.normality(x)
         W      pval  normal
0  0.98414  0.274886    True
Omnibus test on a wide-format dataframe with missing values.
>>> data = pg.read_dataset('mediation')
>>> data.loc[1, 'X'] = np.nan
>>> pg.normality(data, method='normaltest')
               W           pval  normal
X       1.791839   4.082320e-01    True
M       0.492349   7.817859e-01    True
Y       0.348676   8.400129e-01    True
Mbin  839.716156  4.549393e-183   False
Ybin  814.468158  1.381932e-177   False
Pandas Series.
>>> pg.normality(data['X'], method='normaltest')
          W      pval  normal
X  1.791839  0.408232    True
Long-format dataframe.
>>> data = pg.read_dataset('rm_anova2')
>>> pg.normality(data, dv='Performance', group='Time')
             W      pval  normal
Pre   0.967718  0.478773    True
Post  0.940728  0.095157    True