Correlation Analysis
Given two n-element sample populations, X and Y, it is possible to quantify the degree of fit to a linear model using the correlation coefficient. The correlation coefficient, r, is a scalar quantity in the interval [-1.0, 1.0], and is defined as the ratio of the covariance of the sample populations to the product of their standard deviations.
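$$ r = \frac{\operatorname{cov}(X, Y)}{s_X \, s_Y} $$

or

$$ r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the sample means of X and Y, and $s_X$ and $s_Y$ are their sample standard deviations.
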
The correlation coefficient is a direct measure of how well two sample populations vary jointly. A value of r = +1 or r = –1 indicates a perfect fit to a positive or negative linear model, respectively. A value of r close to +1 or –1 indicates a high degree of correlation and a good fit to a linear model. A value of r close to 0 indicates a poor fit to a linear model.
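The definition can be checked numerically. The following is a minimal sketch, not a reference implementation: the data values are hypothetical, and it assumes only the standard routines MEAN, TOTAL, SQRT, and STDDEV together with the COVARIANCE keyword of CORRELATE. It computes r from the deviation sums, then as the ratio of covariance to the product of standard deviations, and compares both with the built-in result.

```idl
; Hypothetical sample populations (roughly Y = 2X with small scatter)
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.2, 7.8, 10.1]

; Deviations from the sample means
dx = X - MEAN(X)
dy = Y - MEAN(Y)

; r from the deviation sums; the 1/(n-1) factors cancel
r_sums = TOTAL(dx*dy) / SQRT(TOTAL(dx^2) * TOTAL(dy^2))

; r as covariance divided by the product of the standard deviations
r_ratio = CORRELATE(X, Y, /COVARIANCE) / (STDDEV(X) * STDDEV(Y))

; All three values should agree to within rounding
PRINT, r_sums, r_ratio, CORRELATE(X, Y)
```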
Correlation Example
The following sample populations represent a perfect positive linear correlation.
    X = [-8.1, 1.0, -14.3, 4.2, -10.1, 4.3, 6.3, 5.0, 15.1, -2.2]
    Y = [-9.8, -0.7, -16.0, 2.5, -11.8, 2.6, 4.6, 3.3, 13.4, -3.9]
    ; Compute the correlation coefficient of X and Y.
    PRINT, CORRELATE(X, Y)
IDL prints:
    1.00000
The following sample populations represent a high negative linear correlation.
    X = [ 1.8, -2.7,  0.7, -0.5, -1.3, -0.9,  0.6, -1.5,  2.5,  3.0]
    Y = [-4.7,  9.8, -3.7,  2.8,  5.1,  3.9, -3.6,  5.8, -7.3, -7.4]
    ; Compute the correlation coefficient of X and Y.
    PRINT, CORRELATE(X, Y)
IDL prints:
    -0.979907
The following sample populations represent a poor linear correlation.
    X = [-1.8,  0.1, -0.1,  1.9,  0.5,  1.1,  1.9,  0.3, -0.2, -1.0]
    Y = [ 1.5, -1.0, -0.6,  1.1,  0.7, -0.7,  1.1, -0.1,  0.6, -0.1]
    ; Compute the correlation coefficient of X and Y.
    PRINT, CORRELATE(X, Y)
IDL prints:
    0.0322859
Notes on Interpreting the Correlation Coefficient
When interpreting the value of the correlation coefficient, it is important to remember the following two caveats:
- Although a high degree of correlation (a value close to +1 or –1) indicates a good mathematical fit to a linear model, its applied interpretation may be completely nonsensical. For example, there may be a high degree of correlation between the number of scientists using IDL to study atmospheric phenomena and the consumption of alcohol in Russia, but the two events are clearly unrelated.
- Although a correlation coefficient close to 0 indicates a poor fit to a linear model, it does not mean that the two sample populations are unrelated. It is possible that the relationship between X and Y is accurately described by a nonlinear model. See Curve and Surface Fitting for further details on fitting data to linear and nonlinear models.
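The second caveat can be illustrated with a short sketch. The data here are hypothetical: Y is an exact quadratic function of X, and X is symmetric about zero, so the linear correlation vanishes even though Y is completely determined by X.

```idl
; X is symmetric about zero; Y depends on X exactly, but not linearly
X = FINDGEN(11) - 5.0     ; -5, -4, ..., 5
Y = X^2

; The linear correlation of X and Y is zero to within rounding
PRINT, CORRELATE(X, Y)

; Correlating Y with X^2 instead exposes the exact relationship
PRINT, CORRELATE(X^2, Y)
```

A near-zero coefficient is therefore a reason to plot the data or try a nonlinear model, not evidence that no relationship exists.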
Multiple Linear Models
The fundamental principles of correlation that apply to the linear model of two sample populations may be extended to the multiple-linear model. The degree of relationship between three or more sample populations may be quantified using the multiple correlation coefficient. The degree of relationship between two sample populations when the effects of all other sample populations are removed may be quantified using the partial correlation coefficient. Both coefficients are scalar quantities: the multiple correlation coefficient lies in the interval [0.0, 1.0], whereas the partial correlation coefficient, like r, lies in the interval [-1.0, 1.0] and carries the sign of the relationship. A magnitude of 1 indicates a perfect linear relationship between populations. A magnitude close to 1 indicates a high degree of linear relationship between populations, whereas a value close to 0 indicates a poor linear relationship between populations. (Although a value of 0 indicates no linear relationship between populations, remember that there may be a nonlinear relationship.)
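For the simplest case of a single controlling population Z, the partial correlation of X and Y can be written in terms of the three pairwise coefficients. The sketch below evaluates this standard first-order formula using CORRELATE only; the variable names and data values are hypothetical and chosen so that X and Y both follow Z.

```idl
; Hypothetical sample populations: X and Y both increase with Z
Z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
X = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
Y = [0.8, 2.3, 2.9, 4.4, 4.7, 6.1]

; Pairwise (simple) correlation coefficients
rxy = CORRELATE(X, Y)
rxz = CORRELATE(X, Z)
ryz = CORRELATE(Y, Z)

; First-order partial correlation of X and Y with the effect of Z removed:
;   rxy_z = (rxy - rxz*ryz) / sqrt((1 - rxz^2)*(1 - ryz^2))
rxy_z = (rxy - rxz*ryz) / SQRT((1.0 - rxz^2) * (1.0 - ryz^2))

PRINT, 'Simple correlation of X and Y: ', rxy
PRINT, 'Partial correlation given Z:   ', rxy_z
```

Comparing the two printed values shows how much of the apparent relationship between X and Y is attributable to their shared dependence on Z.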
Multiple and Partial Correlation Examples
Define the independent (X) and dependent (Y) data.
    X = [[0.477121, 2.0, 13.0], $
         [0.477121, 5.0,  6.0], $
         [0.301030, 5.0,  9.0], $
         [0.000000, 7.0,  5.5], $
         [0.602060, 3.0,  7.0], $
         [0.698970, 2.0,  9.5], $
         [0.301030, 2.0, 17.0], $
         [0.477121, 5.0, 12.5], $
         [0.698970, 2.0, 13.5], $
         [0.000000, 3.0, 12.5], $
         [0.602060, 4.0, 13.0], $
         [0.301030, 6.0,  7.5], $
         [0.301030, 2.0,  7.5], $
         [0.698970, 3.0, 12.0], $
         [0.000000, 4.0, 14.0], $
         [0.698970, 6.0, 11.5], $
         [0.301030, 2.0, 15.0], $
         [0.602060, 6.0,  8.5], $
         [0.477121, 7.0, 14.5], $
         [0.000000, 5.0,  9.5]]
    Y = [ 97.682,  98.424, 101.435, 102.266,  97.067,  97.397, $
          99.481,  99.613,  96.901, 100.152,  98.797, 100.796, $
          98.750,  97.991, 100.007,  98.615, 100.225,  98.388, $
          98.937, 100.617]
Compute the multiple correlation of Y on the first column of X. The result should be 0.798816.
    PRINT, M_CORRELATE(X[0,*], Y)
IDL prints:
    0.798816
Compute the multiple correlation of Y on the first two columns of X. The result should be 0.875872.
    PRINT, M_CORRELATE(X[0:1,*], Y)
IDL prints:
    0.875872
Compute the multiple correlation of Y on all columns of X. The result should be 0.877197.
    PRINT, M_CORRELATE(X, Y)
IDL prints:
    0.877197
Define the five sample populations for the partial correlation example.
    X0 = [30, 26, 28, 33, 35, 29]
    X1 = [0.29, 0.33, 0.34, 0.30, 0.30, 0.35]
    X2 = [65, 60, 65, 70, 70, 60]
    X3 = [2700, 2850, 2800, 3100, 2750, 3050]
    Y = [37, 33, 32, 37, 36, 33]
Compute the partial correlation of X1 and Y with the effects of X0, X2 and X3 removed.
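One way this computation might be written is sketched below. It assumes that P_CORRELATE takes the populations to be removed as a p-by-n array (here 3 columns by 6 rows); the variable name C and the conversions to floating point are choices made for this sketch. The data are repeated so that the block runs on its own.

```idl
; Sample populations from the example above
X0 = [30, 26, 28, 33, 35, 29]
X1 = [0.29, 0.33, 0.34, 0.30, 0.30, 0.35]
X2 = [65, 60, 65, 70, 70, 60]
X3 = [2700, 2850, 2800, 3100, 2750, 3050]
Y  = [37, 33, 32, 37, 36, 33]

; Assumption: the populations whose effects are removed are passed as a
; p-by-n array, one column per population and one row per observation
C = FLOAT(TRANSPOSE([[X0], [X2], [X3]]))   ; 3 columns by 6 rows

; Partial correlation of X1 and Y with X0, X2 and X3 removed
PRINT, P_CORRELATE(X1, FLOAT(Y), C)
```

Building C with TRANSPOSE keeps each population in its own column; reshaping the concatenated vector [X0, X2, X3] with REFORM would instead interleave values from different populations.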
IDL prints:
Routines for Computing Correlations
See Correlation Analysis (in the functional category "Mathematics" (IDL Quick Reference)) for a brief description of IDL routines for computing correlations. Detailed information is available in the IDL Reference Guide.