pingouin.chi2_independence

pingouin.chi2_independence(data, x, y, correction=True)[source]

Chi-squared independence tests between two categorical variables.
The test is computed for different values of \(\lambda\): 1, 2/3, 0, -1/2, -1 and -2 (Cressie and Read, 1984).
Parameters

data : pd.DataFrame
    The dataframe containing the occurrences for the test.
x, y : string
    The variable names for the Chi-squared test. Must be names of columns in data.
correction : bool
    Whether to apply Yates' correction when the degree of freedom of the observed contingency table is 1 (Yates 1934).
Returns

expected : pd.DataFrame
    The expected contingency table of frequencies.
observed : pd.DataFrame
    The (corrected or not) observed contingency table of frequencies.
stats : pd.DataFrame
    The tests summary, containing the following columns:

    'test' : The statistic name
    'lambda' : The \(\lambda\) value used for the power divergence statistic
    'chi2' : The test statistic
    'dof' : The degrees of freedom
    'p' : The p-value of the test
    'cramer' : The Cramer's V effect size
    'power' : The statistical power of the test
Notes
From Wikipedia:
The chi-squared test is used to determine whether there is a significant difference between the expected frequencies and the observed frequencies in one or more categories.
As application examples, this test can be used to i) evaluate the quality of a categorical predictor in a classification problem or to ii) check the similarity between two categorical variables. In the first example, a good categorical predictor and the class column should present a high \(\chi^2\) and a low p-value. In the second example, similar categorical variables should present a low \(\chi^2\) and a high p-value.
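The Cressie-Read power-divergence family behind the \(\lambda\) values listed above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not pingouin's implementation; `power_divergence_stat` is a hypothetical helper, and the observed/expected counts are taken from the worked example on this page.

```python
import math

def power_divergence_stat(observed, expected, lam):
    """Cressie-Read power-divergence statistic for flat lists of counts.

    General form: 2 / (lam * (lam + 1)) * sum(O * ((O / E)**lam - 1)).
    lam=1 reduces to Pearson's chi-squared; lam=0 and lam=-1 are the
    log-likelihood-ratio and modified log-likelihood limit cases.
    """
    if lam == 0:    # G-test (limit as lam -> 0)
        return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected))
    if lam == -1:   # modified log-likelihood (limit as lam -> -1)
        return 2 * sum(e * math.log(e / o) for o, e in zip(observed, expected))
    return (2 / (lam * (lam + 1))) * sum(
        o * ((o / e) ** lam - 1) for o, e in zip(observed, expected)
    )

# Yates-corrected observed and expected 2x2 tables from the example below,
# flattened row-wise
obs = [24.5, 71.5, 113.5, 93.5]
exp = [43.722772, 52.277228, 94.277228, 112.722772]

pearson = power_divergence_stat(obs, exp, lam=1)
classic = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
print(round(pearson, 3), round(classic, 3))  # → 22.717 22.717
```

For \(\lambda = 1\) the two computations agree exactly, which is why the 'pearson' row of the stats output matches the textbook \(\sum (O - E)^2 / E\) formula.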
This function is a wrapper around the scipy.stats.power_divergence() function.

Warning
As a general guideline for the consistency of this test, the observed and the expected contingency tables should not have cells with frequencies lower than 5.
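The frequency guideline above is easy to verify programmatically. The helper below is a sketch (not part of pingouin); it simply scans every cell of a nested-list contingency table against the threshold.

```python
def check_min_frequency(table, threshold=5):
    """Return True if every cell of a contingency table (list of rows)
    meets the minimum-frequency guideline of `threshold` counts."""
    return all(cell >= threshold for row in table for cell in row)

# Expected table from the example below: all cells comfortably above 5
ok = check_min_frequency([[43.72, 52.28], [94.28, 112.72]])

# A hypothetical sparse table with one cell below 5
sparse = check_min_frequency([[2.1, 9.9], [7.9, 30.1]])

print(ok, sparse)  # → True False
```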
References
[1] Cressie, N., & Read, T. R. (1984). Multinomial goodness-of-fit tests. Journal of the Royal Statistical Society: Series B (Methodological), 46(3), 440-464.
[2] Yates, F. (1934). Contingency Tables Involving Small Numbers and the \(\chi^2\) Test. Supplement to the Journal of the Royal Statistical Society, 1, 217-235.
Examples
Let’s see if gender is a good categorical predictor for the presence of heart disease.
>>> import pingouin as pg
>>> data = pg.read_dataset('chi2_independence')
>>> data['sex'].value_counts(ascending=True)
0     96
1    207
Name: sex, dtype: int64
If gender is not a good predictor for heart disease, we should expect the same 96:207 ratio across the target classes.
>>> expected, observed, stats = pg.chi2_independence(data, x='sex',
...                                                  y='target')
>>> expected
target          0           1
sex
0       43.722772   52.277228
1       94.277228  112.722772
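The expected table follows directly from the marginal counts, via the standard formula \(E_{ij} = (\text{row}_i \times \text{col}_j) / n\). A minimal sketch, assuming the sex counts (96, 207) shown above and the target column totals (138, 165) recovered from the expected table:

```python
# Expected cell frequencies from the marginals: E_ij = row_i * col_j / n
sex_counts = [96, 207]       # rows: sex = 0, 1 (from value_counts above)
target_counts = [138, 165]   # columns: target = 0, 1 (column sums of `expected`)
n = sum(sex_counts)          # 303 subjects in total

expected_manual = [[r * c / n for c in target_counts] for r in sex_counts]
print(expected_manual)
```

Each cell matches the `expected` DataFrame returned by the function (e.g. 96 * 138 / 303 ≈ 43.722772).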
Let’s see what the data tells us.
>>> observed
target      0     1
sex
0        24.5  71.5
1       113.5  93.5
The proportion is lower in class 0 and higher in class 1. The tests should be sensitive to this difference.
>>> stats.round(3)
                 test  lambda    chi2  dof    p  cramer  power
0             pearson   1.000  22.717  1.0  0.0   0.274  0.997
1        cressie-read   0.667  22.931  1.0  0.0   0.275  0.998
2      log-likelihood   0.000  23.557  1.0  0.0   0.279  0.998
3       freeman-tukey  -0.500  24.220  1.0  0.0   0.283  0.998
4  mod-log-likelihood  -1.000  25.071  1.0  0.0   0.288  0.999
5              neyman  -2.000  27.458  1.0  0.0   0.301  0.999
Very low p-values indeed. Gender therefore qualifies as a good predictor for the presence of heart disease in this dataset.
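The 'cramer' column can also be recomputed by hand from the chi-squared statistic. A minimal sketch using the standard Cramer's V formula \(V = \sqrt{\chi^2 / (n \,(\min(r, c) - 1))}\), with the Pearson statistic and sample size from the example above (`cramers_v` is an illustrative helper, not a pingouin function):

```python
import math

def cramers_v(chi2, n, n_rows, n_cols):
    """Cramer's V effect size: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    return math.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))

# Pearson chi2 = 22.717 on a 2x2 table with n = 96 + 207 = 303 subjects
v = cramers_v(22.717, n=303, n_rows=2, n_cols=2)
print(round(v, 3))  # → 0.274
```

This matches the 0.274 reported in the 'pearson' row of the stats table.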