pingouin.chi2_independence

pingouin.chi2_independence(data, x, y, correction=True)
Chi-squared independence tests between two categorical variables.
The test is computed for different values of \(\lambda\): 1, 2/3, 0, -1/2, -1 and -2 (Cressie and Read, 1984).
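For orientation, the power divergence statistic of Cressie and Read (1984) is commonly written as below, where \(O_i\) and \(E_i\) denote the observed and expected frequencies of cell \(i\) (this notation is introduced here for illustration and is not part of the original docstring). The cases \(\lambda = 0\) and \(\lambda = -1\) are understood as limits, and \(\lambda = 1\) reduces to Pearson's \(\chi^2\).

\[
\frac{2}{\lambda(\lambda + 1)} \sum_i O_i \left[ \left( \frac{O_i}{E_i} \right)^{\lambda} - 1 \right]
\]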
 Parameters
 data
pandas.DataFrame
The dataframe containing the occurrences for the test.
 x, y
string
The names of the variables for the chi-squared test. Must be names of columns in data.
 correction
bool
Whether to apply Yates’ correction when the observed contingency table has one degree of freedom (Yates, 1934).
 Returns
 expected
pandas.DataFrame
The expected contingency table of frequencies.
 observed
pandas.DataFrame
The (corrected or not) observed contingency table of frequencies.
 stats
pandas.DataFrame
The test summary, containing the following columns:
'test': The statistic name
'lambda': The \(\lambda\) value used for the power divergence statistic
'chi2': The test statistic
'dof': The degrees of freedom
'pval': The p-value of the test
'cramer': The Cramér’s V effect size
'power': The statistical power of the test
Notes
From Wikipedia:
The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
As application examples, this test can be used to i) evaluate the quality of a categorical variable in a classification problem or to ii) check the similarity between two categorical variables. In the first example, a good categorical predictor and the class column should present a high \(\chi^2\) and a low p-value. In the second example, similar categorical variables should present a low \(\chi^2\) and a high p-value.
This function is a wrapper around the scipy.stats.power_divergence() function.
Warning
As a general guideline for the consistency of this test, the observed and the expected contingency tables should not have cells with frequencies lower than 5.
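As a rough illustration of the wrapper relationship and of the minimum-frequency guideline, the sketch below calls scipy.stats.power_divergence() directly on a small contingency table for each \(\lambda\) value. The table values and variable names are hypothetical, and this is a sketch of the idea rather than pingouin's actual implementation.

```python
import numpy as np
from scipy.stats import power_divergence

# Hypothetical 2x3 contingency table (rows: levels of x, columns: levels of y).
observed = np.array([[20.0, 30.0, 50.0],
                     [30.0, 30.0, 40.0]])

# Expected frequencies under independence: outer product of the margins / n.
n = observed.sum()
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

# Guideline check: no observed or expected cell with a frequency below 5.
cells_ok = bool((observed >= 5).all() and (expected >= 5).all())

# Degrees of freedom of the table, and the matching ddof for
# power_divergence, whose default dof is (number of cells - 1).
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
ddof = observed.size - 1 - dof

# Statistic names and their lambda values.
lambdas = {
    "pearson": 1.0,
    "cressie-read": 2.0 / 3.0,
    "log-likelihood": 0.0,
    "freeman-tukey": -0.5,
    "mod-log-likelihood": -1.0,
    "neyman": -2.0,
}

results = {}
for name, lambda_ in lambdas.items():
    res = power_divergence(observed.ravel(), expected.ravel(),
                           ddof=ddof, lambda_=lambda_)
    results[name] = (res.statistic, res.pvalue)

for name, (stat, pval) in results.items():
    print(f"{name:>18s}  chi2={stat:.3f}  p={pval:.3f}")
```

Yates’ correction is deliberately left out here because this hypothetical table has two degrees of freedom.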
References
Cressie, N., & Read, T. R. (1984). Multinomial goodness‐of‐fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440–464.
Yates, F. (1934). Contingency Tables Involving Small Numbers and the \(\chi^2\) Test. Supplement to the Journal of the Royal Statistical Society, 1, 217–235.
Examples
Let’s see if gender is a good categorical predictor for the presence of heart disease.
>>> import pingouin as pg
>>> data = pg.read_dataset('chi2_independence')
>>> data['sex'].value_counts(ascending=True)
0     96
1    207
Name: sex, dtype: int64
If gender is not a good predictor for heart disease, we should expect the same 96:207 ratio across the target classes.
>>> expected, observed, stats = pg.chi2_independence(data, x='sex',
...                                                  y='target')
>>> expected
target          0           1
sex
0       43.722772   52.277228
1       94.277228  112.722772
Let’s see what the data tells us.
>>> observed
target      0     1
sex
0        24.5  71.5
1       113.5  93.5
For sex 0, the observed frequency is lower than expected in class 0 and higher than expected in class 1. The tests should be sensitive to this difference.
>>> stats.round(3)
                 test  lambda    chi2  dof  pval  cramer  power
0             pearson   1.000  22.717  1.0   0.0   0.274  0.997
1        cressie-read   0.667  22.931  1.0   0.0   0.275  0.998
2      log-likelihood   0.000  23.557  1.0   0.0   0.279  0.998
3       freeman-tukey  -0.500  24.220  1.0   0.0   0.283  0.998
4  mod-log-likelihood  -1.000  25.071  1.0   0.0   0.288  0.999
5              neyman  -2.000  27.458  1.0   0.0   0.301  0.999
Very low p-values indeed. Gender qualifies as a good predictor for the presence of heart disease in this dataset.
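To make the Pearson row of the summary more concrete, the sketch below recomputes the expected table, a Yates-corrected observed table, the Pearson statistic and Cramér’s V with NumPy alone. The raw counts are not printed in the example; they are inferred here by undoing the 0.5 shift in the observed table, so treat them as an assumption. The variable names are illustrative and this is not pingouin's own code.

```python
import numpy as np

# Raw counts (rows: sex 0/1, columns: target 0/1).
# Assumption: inferred from the corrected observed table shown in the example.
raw = np.array([[24.0, 72.0],
                [114.0, 93.0]])

# Expected frequencies under independence: outer product of the margins / n.
n = raw.sum()
expected = np.outer(raw.sum(axis=1), raw.sum(axis=0)) / n

# Simple form of Yates' correction, applied only when dof == 1:
# move each observed cell 0.5 toward its expected value.
dof = (raw.shape[0] - 1) * (raw.shape[1] - 1)
if dof == 1:
    observed = raw + 0.5 * np.sign(expected - raw)
else:
    observed = raw.copy()

# Pearson chi-squared statistic (lambda = 1) and Cramér's V effect size.
chi2 = ((observed - expected) ** 2 / expected).sum()
cramer_v = np.sqrt(chi2 / (n * min(raw.shape[0] - 1, raw.shape[1] - 1)))

print(expected.round(6))
print(observed)
print(round(float(chi2), 3), round(float(cramer_v), 3))
```

Under these assumed counts, the printed values can be compared against the expected and observed tables and the pearson row of the summary.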