pingouin.intraclass_corr
pingouin.intraclass_corr(data=None, targets=None, raters=None, ratings=None, nan_policy='raise')
Compute intraclass correlation (ICC) coefficients to assess measurement reliability.
This function provides six variants of the ICC to evaluate how consistently targets (e.g., patients, samples) are rated across different measurements (e.g., raters, days). It follows the practical guidance of Liljequist et al. (2019) [2], which suggests calculating all ICC types together rather than picking a single statistical model upfront.
- Parameters:
- data : pandas.DataFrame
Long-format dataframe containing the targets, raters, and scores.
- targets : string
Name of the column containing the subjects or items being measured.
- raters : string
Name of the column containing the raters, sessions, or conditions.
- ratings : string
Name of the column containing the numerical scores or values.
- nan_policy : str
Defines how missing values (NaN) in the input are handled. ‘raise’ (default) throws an error; ‘omit’ performs the calculations after deleting any target with one or more missing values (listwise deletion).
Added in version 0.3.0.
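For intuition, ‘omit’ amounts to dropping every target that has at least one missing rating before the analysis is run. A minimal pandas sketch of that listwise deletion, on hypothetical toy data (this is not Pingouin's internal code):

```python
import numpy as np
import pandas as pd

# Toy long-format data: target 2 has a missing rating from judge A.
df = pd.DataFrame({
    "Wine": [1, 1, 2, 2, 3, 3],
    "Judge": ["A", "B", "A", "B", "A", "B"],
    "Scores": [6.0, 5.0, np.nan, 4.0, 7.0, 8.0],
})

# Listwise deletion: remove every target with one or more missing values.
bad_targets = df.loc[df["Scores"].isna(), "Wine"].unique()
clean = df[~df["Wine"].isin(bad_targets)]
print(clean["Wine"].unique().tolist())  # target 2 dropped entirely
```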
- Returns:
- stats : pandas.DataFrame
Summary table with one row per ICC variant, containing:
'Type': the ICC variant (e.g., ICC(1,1), ICC(A,1), ICC(C,1)).
'ICC': the intraclass correlation coefficient.
'F', 'pval': F-test results for detecting systematic differences (bias) between raters. A significant p-value suggests that rater means differ, indicating non-negligible bias.
'df1', 'df2': degrees of freedom for the F-test.
'CI95': 95% confidence interval for the ICC.
Notes
The ICC measures the ratio of between-target variance to total variance [1]. It reflects how consistently targets (e.g., patients, samples) are measured across raters or sessions, with values typically ranging from 0 (no reliability) to 1 (perfect reliability).
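To make this variance-ratio definition concrete, the one-way ICC(1,1) can be computed by hand from a one-way ANOVA. The sketch below uses NumPy and made-up ratings (an n-targets × k-raters matrix); it illustrates the formula and is not Pingouin's actual implementation:

```python
import numpy as np

# Hypothetical ratings: 4 targets (rows) scored by 4 raters (columns).
ratings = np.array([
    [9.0, 2.0, 5.0, 8.0],
    [6.0, 1.0, 3.0, 2.0],
    [8.0, 4.0, 6.0, 8.0],
    [7.0, 1.0, 2.0, 6.0],
])
n, k = ratings.shape

grand_mean = ratings.mean()
target_means = ratings.mean(axis=1)

# One-way ANOVA mean squares: between targets and within targets.
msb = k * np.sum((target_means - grand_mean) ** 2) / (n - 1)
msw = np.sum((ratings - target_means[:, None]) ** 2) / (n * (k - 1))

# ICC(1,1): between-target variance relative to total variance.
icc11 = (msb - msw) / (msb + (k - 1) * msw)
print(icc11)
```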
Pingouin follows the notation of Liljequist et al. (2019) [2], based on McGraw and Wong (1996) [3]. Six ICC variants are returned, organized along two dimensions.
How bias is handled (first index):
ICC(1,): Assumes raters are interchangeable with no systematic bias. Only valid when rater means are roughly equal.
ICC(A,): Absolute agreement. Penalises systematic differences between raters: “do raters give the same scores?”
ICC(C,): Consistency. Ignores systematic differences between raters: “do raters rank targets in the same order?” Equivalent to the reliability expected if rater biases were removed.
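The agreement/consistency distinction can be made concrete with a small simulation. The sketch below computes ICC(A,1) and ICC(C,1) directly from two-way ANOVA mean squares (formulas following McGraw & Wong, 1996; this is not Pingouin's internal code) and shows that adding a constant bias to one rater leaves consistency unchanged while lowering agreement:

```python
import numpy as np

def icc_a1_c1(x):
    """ICC(A,1) and ICC(C,1) from two-way ANOVA mean squares.

    Illustrative formulas from McGraw & Wong (1996); not Pingouin's code.
    """
    n, k = x.shape
    gm = x.mean()
    row_m = x.mean(axis=1)  # per-target means
    col_m = x.mean(axis=0)  # per-rater means
    msr = k * np.sum((row_m - gm) ** 2) / (n - 1)  # between-target mean square
    msc = n * np.sum((col_m - gm) ** 2) / (k - 1)  # between-rater mean square
    mse = (np.sum((x - row_m[:, None] - col_m[None, :] + gm) ** 2)
           / ((n - 1) * (k - 1)))
    c1 = (msr - mse) / (msr + (k - 1) * mse)                        # consistency
    a1 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)  # agreement
    return a1, c1

rng = np.random.default_rng(42)
target_effect = rng.normal(scale=2.0, size=(20, 1))   # true target differences
base = target_effect + rng.normal(size=(20, 3))       # three unbiased raters
biased = base.copy()
biased[:, 2] += 3.0                                   # rater 3 scores 3 points high

a1_base, c1_base = icc_a1_c1(base)
a1_bias, c1_bias = icc_a1_c1(biased)
# A constant rater bias leaves consistency untouched but lowers agreement.
print(c1_base - c1_bias, a1_bias < a1_base)
```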
Single or averaged scores (second index):
1: Reliability of one rating by a single rater.
k: Reliability of the mean across all \(k\) raters.
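The single-score and average-score variants are linked by the Spearman-Brown formula, \(\mathrm{ICC}(\cdot,k) = k\,\mathrm{ICC}(\cdot,1) / (1 + (k-1)\,\mathrm{ICC}(\cdot,1))\). A quick check using the rounded values from the Examples section:

```python
# Spearman-Brown prediction of the average-score ICC from the single-score
# ICC, using ICC(1,1) = 0.728 and k = 4 judges from the Examples section.
k = 4
icc_single = 0.728
icc_average = k * icc_single / (1 + (k - 1) * icc_single)
# Close to the tabulated ICC(1,k) = 0.914; the small residual comes from
# feeding in the rounded ICC(1,1).
print(icc_average)
```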
Practical guidance:
Liljequist et al. (2019) recommend computing all three single-score ICCs together and comparing them, rather than selecting a single model upfront.
1. Detecting bias (systematic errors):
Start by comparing ICC(1,1), ICC(A,1), and ICC(C,1). When they are approximately equal, systematic bias between raters is likely negligible. When ICC(C,1) is notably larger than ICC(A,1), non-negligible bias is likely present. In that case, ICC(1,1) is invalid and should not be reported. The F statistic and p-value in the output provide a formal test of whether rater means differ significantly.
2. Agreement vs. consistency:
When bias is present, both ICC(A,1) and ICC(C,1) should be reported together with their confidence intervals. ICC(A,1) reflects absolute agreement (do raters assign the same values?), while ICC(C,1) reflects consistency (do raters rank targets in the same order?).
3. Single vs. average ratings:
Use the single-score variants (ICC(1,1), ICC(A,1), ICC(C,1)) when reporting the reliability of one rating. Use the average-score variants (ICC(1,k), ICC(A,k), ICC(C,k)) when the final measurement will be the mean of \(k\) ratings.
Interpretation guidelines:
General benchmarks for ICC values:
< 0.50: Poor
0.50 - 0.75: Moderate
0.75 - 0.90: Good
> 0.90: Excellent
Whether a given ICC is acceptable depends on the intended clinical or practical context, not on these thresholds alone.
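For reporting, the benchmarks above can be encoded as a small helper; `icc_label` is a hypothetical convenience function, not part of Pingouin:

```python
def icc_label(icc):
    """Map an ICC value to the qualitative benchmarks listed above.

    Hypothetical helper, not part of Pingouin.
    """
    if icc < 0.50:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"

print(icc_label(0.728))  # single-rater wine example -> "moderate"
```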
This function has been validated against the ICC function of the R psych package. The current implementation uses ANOVA rather than linear mixed effects models and requires complete, balanced data.
References
[2] Liljequist, D., Elfving, B., & Skavberg Roaldsen, K. (2019). Intraclass correlation - A discussion and demonstration of basic features. PLOS ONE, 14(7), e0219854.
[3] McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46.
Examples
ICCs of wine quality assessed by 4 judges.
>>> import pingouin as pg
>>> data = pg.read_dataset("icc")
>>> icc = pg.intraclass_corr(data=data, targets="Wine", raters="Judge",
...                          ratings="Scores").round(3)
>>> icc.set_index("Type")
            ICC       F  df1  df2  pval          CI95
Type
ICC(1,1)  0.728  11.680    7   24   0.0  [0.43, 0.93]
ICC(A,1)  0.728  11.787    7   21   0.0  [0.43, 0.93]
ICC(C,1)  0.729  11.787    7   21   0.0  [0.43, 0.93]
ICC(1,k)  0.914  11.680    7   24   0.0  [0.75, 0.98]
ICC(A,k)  0.914  11.787    7   21   0.0  [0.75, 0.98]
ICC(C,k)  0.915  11.787    7   21   0.0  [0.75, 0.98]