pingouin.corr

pingouin.corr(x, y, tail='two-sided', method='pearson')[source]

(Robust) correlation between two variables.

Parameters
x, y : array_like

First and second set of observations. x and y must be independent.

tail : string

Specify whether to return a 'one-sided' or 'two-sided' p-value. Note that the former is simply half the latter.

method : string

Correlation type:

  • 'pearson': Pearson \(r\) product-moment correlation

  • 'spearman': Spearman \(\rho\) rank-order correlation

  • 'kendall': Kendall’s \(\tau\) correlation (for ordinal data)

  • 'bicor': Biweight midcorrelation (robust)

  • 'percbend': Percentage bend correlation (robust)

  • 'shepherd': Shepherd’s pi correlation (robust)

  • 'skipped': Skipped correlation (robust)

Returns
stats : pandas.DataFrame
  • 'n': Sample size (after removal of missing values)

  • 'outliers': Number of outliers (only reported if a robust method was used)

  • 'r': Correlation coefficient

  • 'CI95%': 95% parametric confidence intervals around \(r\)

  • 'r2': R-squared (\(= r^2\))

  • 'adj_r2': Adjusted R-squared

  • 'p-val': One- or two-tailed p-value, depending on tail

  • 'BF10': Bayes Factor of the alternative hypothesis (only for Pearson correlation)

  • 'power': Achieved power of the test (= 1 - type II error)

See also

pairwise_corr

Pairwise correlation between columns of a pandas DataFrame

partial_corr

Partial correlation

rm_corr

Repeated measures correlation

Notes

The Pearson correlation coefficient measures the linear relationship between two datasets. Strictly speaking, Pearson’s correlation requires that each dataset be normally distributed. Correlations of -1 or +1 imply a perfect negative and positive linear relationship, respectively, with 0 indicating the absence of association.

\[r_{xy} = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})} {\sqrt{\sum_i(x_i - \bar{x})^2} \sqrt{\sum_i(y_i - \bar{y})^2}} = \frac{\text{cov}(x, y)}{\sigma_x \sigma_y}\]

where \(\text{cov}\) is the sample covariance and \(\sigma\) is the sample standard deviation.
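As a quick check of this formula, the coefficient can be reproduced directly with NumPy (a minimal sketch; the generated data and variable names are arbitrary):

>>> import numpy as np
>>> rng = np.random.RandomState(42)
>>> x = rng.normal(size=50)
>>> y = x + rng.normal(size=50)
>>> # Sample covariance divided by the product of the sample standard deviations
>>> r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
>>> bool(np.isclose(r, np.corrcoef(x, y)[0, 1]))
True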

If method='pearson', the Bayes Factor is calculated using the pingouin.bayesfactor_pearson() function.
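For reference, the same Bayes Factor can be obtained directly from a correlation coefficient and a sample size (a short sketch, assuming bayesfactor_pearson accepts the coefficient followed by the sample size; the values below are taken from the first example in the Examples section):

>>> import pingouin as pg
>>> bf10 = pg.bayesfactor_pearson(0.491, 30)  # coefficient and sample size from the first example
>>> bool(bf10 > 1)  # values above 1 favour the alternative hypothesis
True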

The Spearman correlation coefficient is a non-parametric measure of the monotonicity of the relationship between two datasets. Unlike the Pearson correlation, the Spearman correlation does not assume that both datasets are normally distributed. Correlations of -1 or +1 imply an exact negative and positive monotonic relationship, respectively. Mathematically, the Spearman correlation coefficient is defined as the Pearson correlation coefficient between the rank variables.
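This rank-based definition can be verified directly (a minimal sketch; scipy.stats.rankdata assigns average ranks to ties, matching scipy.stats.spearmanr):

>>> import numpy as np
>>> from scipy.stats import rankdata, spearmanr
>>> rng = np.random.RandomState(42)
>>> x = rng.normal(size=50)
>>> y = x ** 3 + rng.normal(size=50)
>>> # Spearman's rho is the Pearson correlation of the rank-transformed data
>>> rho = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
>>> bool(np.isclose(rho, spearmanr(x, y)[0]))
True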

The Kendall correlation coefficient is a measure of the correspondence between two rankings. Values also range from -1 (perfect disagreement) to 1 (perfect agreement), with 0 indicating the absence of association. Consistent with scipy.stats.kendalltau(), Pingouin returns the Tau-b coefficient, which adjusts for ties:

\[\tau_B = \frac{(P - Q)}{\sqrt{(P + Q + T) (P + Q + U)}}\]

where \(P\) is the number of concordant pairs, \(Q\) the number of discordant pairs, \(T\) the number of ties in x, and \(U\) the number of ties in y.
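The formula can be checked by brute force on a small sample with ties (a minimal sketch; scipy.stats.kendalltau returns Tau-b by default, and pairs tied in both variables are counted in neither \(T\) nor \(U\)):

>>> import numpy as np
>>> from itertools import combinations
>>> from scipy.stats import kendalltau
>>> x = [1, 2, 2, 3, 4]
>>> y = [1, 3, 2, 2, 4]
>>> P = Q = T = U = 0
>>> for i, j in combinations(range(len(x)), 2):
...     if x[i] == x[j] and y[i] == y[j]:
...         continue  # tied in both variables: ignored
...     elif x[i] == x[j]:
...         T += 1  # tie in x only
...     elif y[i] == y[j]:
...         U += 1  # tie in y only
...     elif np.sign(x[i] - x[j]) == np.sign(y[i] - y[j]):
...         P += 1  # concordant pair
...     else:
...         Q += 1  # discordant pair
>>> tau_b = (P - Q) / np.sqrt((P + Q + T) * (P + Q + U))
>>> bool(np.isclose(tau_b, kendalltau(x, y)[0]))
True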

The biweight midcorrelation and percentage bend correlation [1] are both robust methods that protect against univariate outliers by down-weighting observations that deviate too much from the median.
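The down-weighting idea can be illustrated with a rough sketch of the biweight midcorrelation, following the usual definition based on the median and the median absolute deviation (this is an illustration only, not Pingouin's internal implementation):

>>> import numpy as np
>>> def biweight_weights(a, c=9.0):
...     # Distance from the median, scaled by c times the median absolute deviation
...     u = (a - np.median(a)) / (c * np.median(np.abs(a - np.median(a))))
...     w = (1 - u ** 2) ** 2
...     w[np.abs(u) >= 1] = 0  # points too far from the median get zero weight
...     return w
>>> def bicor_sketch(x, y):
...     x, y = np.asarray(x, float), np.asarray(y, float)
...     xc = (x - np.median(x)) * biweight_weights(x)
...     yc = (y - np.median(y)) * biweight_weights(y)
...     return np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
>>> rng = np.random.RandomState(123)
>>> x = rng.normal(size=50)
>>> y = x + rng.normal(size=50)
>>> x[0] = 20  # inject a gross univariate outlier
>>> r_robust = bicor_sketch(x, y)  # largely unaffected by the outlier

The percentage bend correlation uses a different weighting scheme but follows the same principle of limiting the influence of extreme observations.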

The Shepherd pi [2] correlation and skipped [3], [4] correlation are both robust methods that return the Spearman correlation coefficient after removing bivariate outliers. Briefly, the Shepherd pi uses bootstrapping of the Mahalanobis distance to identify outliers, while the skipped correlation is based on the minimum covariance determinant (which requires scikit-learn). Note that these two methods are significantly slower than the previous ones.
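A simplified, non-bootstrapped sketch of the underlying idea is shown below: flag bivariate outliers with a distance criterion, then compute the Spearman correlation on the remaining points. Pingouin's actual implementations rely on bootstrapped Mahalanobis distances (Shepherd's pi) and the minimum covariance determinant (skipped correlation), so this is only meant to convey the principle:

>>> import numpy as np
>>> from scipy.stats import spearmanr
>>> rng = np.random.RandomState(123)
>>> x = rng.normal(size=50)
>>> y = x + rng.normal(size=50)
>>> x[:2], y[:2] = 10, -10  # two bivariate outliers
>>> X = np.column_stack((x, y))
>>> # Squared Mahalanobis distance of every point from the centroid
>>> diff = X - X.mean(axis=0)
>>> d2 = np.sum(diff @ np.linalg.inv(np.cov(X, rowvar=False)) * diff, axis=1)
>>> keep = d2 < 6  # fixed cutoff, for illustration only
>>> rho, pval = spearmanr(x[keep], y[keep])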

Important

Please note that rows with missing values (NaN) are automatically removed.
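For instance (a minimal sketch), a pair of observations containing a NaN is dropped before the correlation is computed, which is reflected in the reported sample size:

>>> import numpy as np
>>> import pingouin as pg
>>> a = np.array([1.0, 2.0, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0])
>>> b = np.array([2.0, 1.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0])
>>> n_used = pg.corr(a, b).at['pearson', 'n']  # equals 7: the NaN pair was removed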

References

1

Wilcox, R.R., 1994. The percentage bend correlation coefficient. Psychometrika 59, 601–616. https://doi.org/10.1007/BF02294395

2

Schwarzkopf, D.S., De Haas, B., Rees, G., 2012. Better ways to improve standards in brain-behavior correlation analysis. Front. Hum. Neurosci. 6, 200. https://doi.org/10.3389/fnhum.2012.00200

3

Rousselet, G.A., Pernet, C.R., 2012. Improving standards in brain-behavior correlation analyses. Front. Hum. Neurosci. 6, 119. https://doi.org/10.3389/fnhum.2012.00119

4

Pernet, C.R., Wilcox, R., Rousselet, G.A., 2012. Robust correlation analyses: false positive and power validation using a new open source matlab toolbox. Front. Psychol. 3, 606. https://doi.org/10.3389/fpsyg.2012.00606

Examples

  1. Pearson correlation

>>> import numpy as np
>>> import pingouin as pg
>>> # Generate random correlated samples
>>> np.random.seed(123)
>>> mean, cov = [4, 6], [(1, .5), (.5, 1)]
>>> x, y = np.random.multivariate_normal(mean, cov, 30).T
>>> # Compute Pearson correlation
>>> pg.corr(x, y).round(3)
          n      r         CI95%     r2  adj_r2  p-val  BF10  power
pearson  30  0.491  [0.16, 0.72]  0.242   0.185  0.006  8.55  0.809
  2. Pearson correlation with two outliers

>>> x[3], y[5] = 12, -8
>>> pg.corr(x, y).round(3)
          n      r          CI95%     r2  adj_r2  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.022  -0.051  0.439  0.302  0.121
  3. Spearman correlation (robust to outliers)

>>> pg.corr(x, y, method="spearman").round(3)
           n      r         CI95%     r2  adj_r2  p-val  power
spearman  30  0.401  [0.05, 0.67]  0.161   0.099  0.028   0.61
  4. Biweight midcorrelation (robust)

>>> pg.corr(x, y, method="bicor").round(3)
        n      r         CI95%     r2  adj_r2  p-val  power
bicor  30  0.393  [0.04, 0.66]  0.155   0.092  0.031  0.592
  5. Percentage bend correlation (robust)

>>> pg.corr(x, y, method='percbend').round(3)
           n      r         CI95%     r2  adj_r2  p-val  power
percbend  30  0.389  [0.03, 0.66]  0.151   0.089  0.034  0.581
  6. Shepherd’s pi correlation (robust)

>>> pg.corr(x, y, method='shepherd').round(3)
           n  outliers      r         CI95%     r2  adj_r2  p-val  power
shepherd  30         2  0.437  [0.09, 0.69]  0.191   0.131   0.02  0.694
  7. Skipped Spearman correlation (robust)

>>> pg.corr(x, y, method='skipped').round(3)
          n  outliers      r         CI95%     r2  adj_r2  p-val  power
skipped  30         2  0.437  [0.09, 0.69]  0.191   0.131   0.02  0.694
  8. One-tailed Pearson correlation

>>> pg.corr(x, y, tail="one-sided", method='pearson').round(3)
          n      r          CI95%     r2  adj_r2  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.022  -0.051   0.22  0.467  0.194
  9. Using columns of a pandas DataFrame

>>> import pandas as pd
>>> data = pd.DataFrame({'x': x, 'y': y})
>>> pg.corr(data['x'], data['y']).round(3)
          n      r          CI95%     r2  adj_r2  p-val   BF10  power
pearson  30  0.147  [-0.23, 0.48]  0.022  -0.051  0.439  0.302  0.121