IMSL_DISCR_ANALYSIS

Syntax | Arguments | Keywords | Discussion | Example | Errors | Version History

The IMSL_DISCR_ANALYSIS procedure performs a linear or a quadratic discriminant function analysis among several known groups.

Note
This routine requires an IDL Advanced Math and Stats license. For more information, contact your ITT Visual Information Solutions sales or technical support representative.

Syntax

IMSL_DISCR_ANALYSIS, x, n_groups [, CLASS_MEMBER=variable] [, CLASS_TABLE=variable] [, COEFFICIENTS=variable] [, COVARIANCES=variable] [, /DOUBLE] [, GROUP_COUNTS=variable] [, IDX_COLS=array] [, IDX_VARS=array] [, METHOD=value] [, /PRIOR_EQUAL] [, PRIOR_INPUT=array] [, PRIOR_OUTPUT=variable] [, /PRIOR_PROP] [, MAHALANOBIS=variable] [, MEANS=variable] [, NMISSING=variable] [, PROB=variable] [, STATS=variable]

Arguments

n_groups

Number of groups in the data.

x

Two-dimensional array of size n_rows by n_variables + 1 containing the data where n_rows = N_ELEMENTS(x(*,0)), the number of rows to be processed and n_variables = number of variables to be used in the discrimination. The first n_variables columns correspond to the variables, and the last column contains the group numbers. The groups must be numbered 1, 2, ..., n_groups.
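As a minimal sketch of the layout described above, the following builds x from separate vectors so that the variables occupy the first columns and the group numbers the last column. The data values and the names v1, v2, and grp are hypothetical.

```idl
; Hypothetical data: two variables measured on five observations.
v1  = [5.1, 4.9, 6.3, 6.5, 5.8]
v2  = [3.5, 3.0, 3.3, 2.8, 2.7]
grp = [1.0, 1.0, 2.0, 2.0, 2.0]   ; group numbers run 1, ..., n_groups

; Stack the vectors so that x(i, *) holds observation i:
; columns 0 and 1 are the variables, column 2 is the group number.
x = [[v1], [v2], [grp]]

n_rows      = N_ELEMENTS(x(*, 0))       ; number of observations (5 here)
n_variables = N_ELEMENTS(x(0, *)) - 1   ; columns before the group column (2 here)
PRINT, n_rows, n_variables
```

With this arrangement the group numbers already sit in the last column, which is where the routine looks for them by default.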

Keywords

CLASS_MEMBER

Named variable into which a one-dimensional integer array of length n_rows is stored. Element i contains the group to which observation i was classified.

If an observation has an invalid group number, frequency, or weight when the leaving-out-one method has been specified, then the observation is not classified and the corresponding elements of Class_Member (and Prob, see Prob below) are set to zero.

CLASS_TABLE

Named variable into which a two-dimensional array of size n_groups by n_groups containing the classification table is stored. Each observation that is classified and has a group number 1.0, 2.0, ..., n_groups is entered into the table. The rows of the table correspond to the known group membership. The columns refer to the group to which the observation was classified.

COEFFICIENTS

Named variable into which a two-dimensional array of size n_groups by (n_variables + 1) containing the linear discriminant coefficients is stored. The first column of Coefficients contains the constant term, and the remaining columns contain the variable coefficients. Row i – 1 of Coefficients corresponds to group i, for i = 1, 2, ..., n_groups. The Coefficients array always holds the linear discriminant function coefficients, even when quadratic discrimination is specified.

COVARIANCES

Named variable into which a three-dimensional array of size g by n_variables by n_variables containing covariance results is stored. The within-group covariance matrices (Methods 1, 2, 4, and 5 only) are the first g – 1 matrices, and the pooled covariance matrix is the g-th matrix. It follows that g = n_groups + 1 when the within-group matrices are computed and g = 1 otherwise.

DOUBLE

If present and nonzero, double precision is used.

GROUP_COUNTS

Named variable into which a one-dimensional integer array of length n_groups containing the number of observations in each group is stored.

IDX_COLS

One-dimensional array containing the indices of the variables to be used in the analysis.

IDX_VARS

Three-element array indicating the column numbers of x in which particular types of data are stored. Columns are numbered 0 ... N_ELEMENTS(Idx_Cols) - 1.

Idx_Vars(0) contains the index for the column of x in which the group numbers are stored.

Idx_Vars(1) and Idx_Vars(2) contain the column numbers of x in which the frequencies and weights, respectively, are stored. Set Idx_Vars(1) = -1 if there will be no column for frequencies. Set Idx_Vars(2) = -1 if there will be no column for weights. Weights are rounded to the nearest integer. Negative weights are not allowed.

Defaults: Idx_Cols = 0, 1, ..., n_variables – 1,

Idx_Vars(0) = n_variables,

Idx_Vars(1) = -1, and

Idx_Vars(2) = -1

METHOD

Method of discrimination. The method chosen determines whether linear or quadratic discrimination is used, whether the group covariance matrices are computed (the pooled covariance matrix is always computed), and whether the leaving-out-one or the reclassification method is used to classify each observation. The Method values are shown in Table 21-1.

Table 21-1: Method Values

Method   Discrimination method   Covariances computed   Classification method
1        linear                  pooled, group          reclassification
2        quadratic               pooled, group          reclassification
3        linear                  pooled                 reclassification
4        linear                  pooled, group          leaving-out-one
5        quadratic               pooled, group          leaving-out-one
6        linear                  pooled                 leaving-out-one

In the leaving-out-one method of classification, the posterior probabilities are adjusted so as to eliminate the effect of the observation from the sample statistics prior to its classification. In the reclassification method, the effect of the observation is not eliminated from the classification function. Default: Method = 1
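A minimal sketch contrasting the two classification methods on the Fisher iris data. It reuses the IMSL_STATDATA call and the Idx_Vars and Idx_Cols settings from the Example section of this page; the output variable names table_reclass and table_loo are hypothetical, and no particular output is claimed.

```idl
; Column setup copied from the Example section of this page.
idxv = [1, 2, 3, 4]
idxc = [0, -1, -1]
x = IMSL_STATDATA(3)   ; Fisher iris data

; Method 3: linear discrimination, reclassification.
IMSL_DISCR_ANALYSIS, x, 3, Idx_Vars = idxv, Idx_Cols = idxc, $
   Method = 3, Class_Table = table_reclass

; Method 6: linear discrimination, leaving-out-one.
IMSL_DISCR_ANALYSIS, x, 3, Idx_Vars = idxv, Idx_Cols = idxc, $
   Method = 6, Class_Table = table_loo

PRINT, table_reclass
PRINT, table_loo
```

Because leaving-out-one removes each observation from the sample statistics before classifying it, its table generally gives a less optimistic picture of the error rate than reclassification does.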

PRIOR_EQUAL

By default (or if Prior_Equal is set), equal prior probabilities are used, calculated as 1.0/n_groups. Keywords Prior_Equal, Prior_Prop, and Prior_Input must not be used together.

PRIOR_INPUT

If present, an array of length n_groups containing the prior probabilities for each group, such that the sum of all prior probabilities is equal to 1.0. Keywords Prior_Input, Prior_Equal, and Prior_Prop must not be used together.

PRIOR_OUTPUT

Named variable into which a one-dimensional array of length n_groups containing the most recently calculated or input prior probabilities is stored.

PRIOR_PROP

If present, prior probabilities are calculated to be proportional to the sample size in each group. Keywords Prior_Prop, Prior_Equal, and Prior_Input must not be used together.

MAHALANOBIS

Named variable into which a two-dimensional array of size n_groups by n_groups containing the Mahalanobis distances $D_{ij}^2$ between the group means is stored.

For linear discrimination, the Mahalanobis distance is computed using the pooled covariance matrix. Otherwise, the Mahalanobis distance $D_{ij}^2$ between group means i and j is computed using the within covariance matrix for group i in place of the pooled covariance matrix.

MEANS

Named variable into which a two-dimensional array of size n_groups by n_variables containing the variable means is stored. The i-th row of Means contains the group i variable means.

NMISSING

Named variable into which the number of rows of data encountered containing missing values (NaN) for the classification, group, weight, and/or frequency variables is stored. If a row of data contains a missing value (NaN) for any of these variables, that row is excluded from the computations.

PROB

Named variable into which a two-dimensional array of size n_rows by n_groups containing the posterior probabilities for each observation is stored.

STATS

Named variable into which a one-dimensional array of length 4 + 2 * (n_groups + 1) containing various statistics of interest is stored. The first element of Stats is the sum of the degrees of freedom for the within-covariance matrices. The second, third, and fourth elements of Stats correspond to the chi-squared statistic, its degrees of freedom, and the probability of a greater chi-squared, respectively, of a test of the homogeneity of the within-covariance matrices (not computed if Method is equal to 3 or 6). The fifth through 5 + n_groups elements of Stats contain the log of the determinants of each group's covariance matrix (not computed if Method is equal to 3 or 6) and of the pooled covariance matrix (element 4 + n_groups). Finally, the last n_groups + 1 elements of Stats contain the sum of the weights within each group, and in the last position, the sum of the weights in all groups.
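The layout described above can be unpacked with zero-based subscripts as in the following sketch. The stats vector is filled by hand with the values printed in the Example section of this page (Method = 3, three groups), and the names on the left-hand side are hypothetical.

```idl
n_groups = 3
; Values as printed in the Example section; NaN marks entries not computed.
stats = [147.0, REPLICATE(!VALUES.F_NAN, 6), -9.95854, $
         50.0, 50.0, 50.0, 150.0]

df_within    = stats(0)                 ; sum of within-covariance degrees of freedom
chi_squared  = stats(1)                 ; homogeneity test statistic
chi_df       = stats(2)                 ; its degrees of freedom
chi_prob     = stats(3)                 ; probability of a greater chi-squared
log_det_grp  = stats(4:3 + n_groups)    ; log determinants, one per group
log_det_pool = stats(4 + n_groups)      ; log determinant, pooled matrix
weight_sums  = stats(5 + n_groups:5 + 2 * n_groups)   ; per group, then overall

PRINT, N_ELEMENTS(stats) EQ 4 + 2 * (n_groups + 1)    ; length check
PRINT, log_det_pool, weight_sums
```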

Comments

  1. Common choices for the Bayesian prior probabilities are given by:
    Prior_Input(i) = 1.0/n_groups (equal priors)
    Prior_Input(i) = Group_Counts(i)/n_rows (proportional priors)
    Prior_Input(i) = Past history or subjective judgment.
    In all cases, the priors should sum to 1.0.
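The first two choices can be sketched in a few lines of IDL. The group sizes in counts are hypothetical, and the variable names are illustrative only.

```idl
counts   = [50, 30, 20]            ; hypothetical group sizes
n_groups = N_ELEMENTS(counts)

; Equal priors: 1.0/n_groups for every group.
p_equal = REPLICATE(1.0 / n_groups, n_groups)

; Proportional priors: group size over total sample size.
; TOTAL returns a floating-point value, so the division is not truncated.
p_prop = counts / TOTAL(counts)

PRINT, p_equal
PRINT, p_prop
PRINT, TOTAL(p_equal), TOTAL(p_prop)   ; each should be 1.0 up to rounding
```

Either vector could then be supplied through the Prior_Input keyword, although the Prior_Equal and Prior_Prop keywords request the same priors directly.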

Discussion

IMSL_DISCR_ANALYSIS performs discriminant function analysis using either linear or quadratic discrimination. The output includes a measure of distance between the groups, a table summarizing the classification results, a matrix containing the posterior probabilities of group membership for each observation, and the within-sample means and covariance matrices. Linear discriminant function coefficients are also computed.

Covariance matrices are defined as follows: Let Ni denote the sum of frequencies of observations in group i and Mi denote the number of observations in group i. Then, if Si denotes the within-group i covariance matrix:

$$S_i = \frac{1}{N_i - 1} \sum_{j=1}^{M_i} w_j f_j \left( x_j - \bar{x}_i \right) \left( x_j - \bar{x}_i \right)^T$$

where wj is the weight of the j-th observation in group i, fj is the frequency, xj is the j-th observation column vector (in group i), and $\bar{x}_i$ denotes the mean vector of the observations in group i. The mean vectors are computed as:

$$\bar{x}_i = \frac{1}{\sum_{j=1}^{M_i} w_j f_j} \sum_{j=1}^{M_i} w_j f_j x_j$$

Given the means and the covariance matrices, the linear discriminant function for group i is computed as:

$$z_i = \ln(p_i) - \tfrac{1}{2} \bar{x}_i^T S_p^{-1} \bar{x}_i + x^T S_p^{-1} \bar{x}_i$$

where ln(pi) is the natural log of the prior probability for the i-th group, x is the observation to be classified, and Sp denotes the pooled covariance matrix.

Let S denote either the pooled covariance matrix or one of the within-group covariance matrices Si. (S will be the pooled covariance matrix in linear discrimination, and Si otherwise.) The Mahalanobis distance between group i and group j is computed as:

$$D_{ij}^2 = \left( \bar{x}_i - \bar{x}_j \right)^T S^{-1} \left( \bar{x}_i - \bar{x}_j \right)$$
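As an illustration of the Mahalanobis distance just described, the following sketch evaluates the quadratic form in the difference of two group means, taken with respect to the inverse of S. The mean vectors and the matrix S are hypothetical values chosen so the result can be checked by hand; this is not the routine's internal code.

```idl
; Hypothetical two-variable example; S stands in for the covariance matrix.
mean_i = [5.0, 3.0]
mean_j = [6.0, 1.0]
S      = [[2.0, 0.0], [0.0, 0.5]]

d     = mean_i - mean_j
S_inv = INVERT(S)
p     = N_ELEMENTS(d)

; Quadratic form d' S^(-1) d, written as an explicit double sum.
dist2 = 0.0
FOR a = 0, p - 1 DO $
   FOR b = 0, p - 1 DO $
      dist2 = dist2 + d(a) * S_inv(a, b) * d(b)

PRINT, dist2   ; by hand: (-1)^2 / 2.0 + 2^2 / 0.5 = 8.5
```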

Finally, the asymptotic chi-squared test for the equality of covariance matrices is computed as follows (Morrison 1976, p. 252):

$$\gamma = C^{-1} \sum_{i=1}^{k} n_i \left( \ln \left| S_p \right| - \ln \left| S_i \right| \right)$$

where ni is the number of degrees of freedom in the i-th sample covariance matrix, k is the number of groups, and:

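$$C^{-1} = 1 - \frac{2p^2 + 3p - 1}{6 (p + 1)(k - 1)} \left( \sum_{i=1}^{k} \frac{1}{n_i} - \frac{1}{\sum_{j=1}^{k} n_j} \right)$$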

where p is the number of variables.

The estimated posterior probability of each observation x belonging to each group is computed using the prior probabilities and the sample mean vectors and estimated covariance matrices under a multivariate normal assumption. Under quadratic discrimination, the within-group covariance matrices are used to compute the estimated posterior probabilities. The estimated posterior probability of an observation x belonging to group i is:

$$q_i(x) = \frac{e^{-\frac{1}{2} D_i^2(x)}}{\sum_{j=1}^{k} e^{-\frac{1}{2} D_j^2(x)}}$$

where:

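$$D_i^2(x) = \begin{cases} \left( x - \bar{x}_i \right)^T S_i^{-1} \left( x - \bar{x}_i \right) + \ln \left| S_i \right| - 2 \ln(p_i) & \text{quadratic discrimination} \\ \left( x - \bar{x}_i \right)^T S_p^{-1} \left( x - \bar{x}_i \right) - 2 \ln(p_i) & \text{linear discrimination} \end{cases}$$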
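A minimal numerical sketch of the posterior computation, assuming the usual normal-theory form in which the posterior probability for group i is proportional to exp(-0.5 * Di^2), with the prior and determinant terms already folded into Di^2. The distance values are hypothetical, and subtracting the minimum before exponentiating is a numerical convenience added here, not something this page attributes to the routine.

```idl
; Hypothetical generalized squared distances for one observation, three groups.
dsq = [2.0, 6.5, 11.0]

; Subtract the minimum before exponentiating to guard against underflow;
; the common factor cancels in the ratio.
w    = EXP(-0.5 * (dsq - MIN(dsq)))
post = w / TOTAL(w)

PRINT, post          ; the group with the smallest distance gets the largest value
PRINT, TOTAL(post)   ; sums to 1.0 up to rounding
```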

For the leaving-out-one method of classification (Method equal to 4, 5, or 6), the sample mean vector and sample covariance matrices in the formula for $D_i^2$ are adjusted so as to remove the observation x from their computation. For linear discrimination (Method equal to 1, 3, 4, or 6), the linear discriminant function coefficients are actually used to compute the same posterior probabilities.

Using the posterior probabilities, each observation in x is classified into a group; the result is tabulated in the array Class_Table and saved in the array Class_Member. Array Class_Table is not altered at this stage if x(i, Idx_Vars(0)) contains a group number that is out of range. If the reclassification method is specified, then all observations with no missing values in the n_variables classification variables are classified. When the leaving-out-one method is used, observations with invalid group numbers, weights, frequencies, or classification variables are not classified. Regardless of the frequency, a 1 is added to (or subtracted from) Class_Table for each row of x that is classified and contains a valid group number.

When Method > 3, an adjustment is made to the posterior probabilities to remove the effect of the observation in the classification rule. In this adjustment, each observation is presumed to have a weight of x(i, Idx_Vars(2)) if Idx_Vars(2) > -1 (and a weight of 1.0 if Idx_Vars(2) = -1), and a frequency of 1.0. See Lachenbruch (1975, p. 36) for the required adjustment.

The covariance matrices are computed from their LU factorizations.

Example

The following example uses linear discrimination with equal prior probabilities on Fisher's (1936) iris data.

.RUN 
PRO print_results, counts, table, d2, prior_out, coef, means, $ 
   cov, stats, nrmiss 
   num  =  INDGEN(3) 
   PRINT, '      Counts' 
   PRINT, num + 1, FORMAT = '(3I5)' 
   PRINT, counts, FORMAT = '(3I5)' 
   PRINT 
   PRINT, '        Table' 
   PRINT, num + 1, FORMAT = '(2X, 3I5)' 
   FOR i  =  0, 2 DO $ 
      PRINT, num(i) + 1, table(i, *), FORMAT = '(I2, 3I5)' 
   PRINT 
   PRINT, '           D2' 
   PRINT, num + 1, FORMAT = '(3I7)' 
   FOR i  =  0, 2 DO $ 
      PRINT, num(i) + 1, d2(i, *), FORMAT = '(I2, 3F7.1)' 
   PRINT 
   PRINT, '          Prior OUT' 
   PRINT, num + 1, FORMAT = '(3I10)' 
   PRINT, prior_out, FORMAT = '(3F10.4)' 
   PRINT 
   num  =  INDGEN(5) 
   PRINT, '                         Coef' 
   PRINT, num + 1, FORMAT = '(1X, 5I10)'
   FOR i  =  0, 2 DO $ 
      PRINT, num(i) + 1, coef(i, *), FORMAT = '(I2, 5F10.1)' 
   PRINT 
   num  =  INDGEN(4) 
   PRINT, '                  Means' 
   PRINT, num + 1, FORMAT = '(4I10)' 
   FOR i  =  0, 2 DO $ 
      PRINT, num(i) + 1, means(i, *), FORMAT = '(I2, 4F10.3)' 
   PRINT 
   PRINT, '             Covariance' 
   PRINT, num + 1, FORMAT = '(4I10)' 
   FOR i  =  0, 3 DO $ 
      PRINT, num(i) + 1, cov(0, *, i), FORMAT = '(I2, 4F10.4)' 
   PRINT 
   num  =  INDGEN(12) 
   PRINT, '           Stats' 
   FOR i  =  0, 11 DO $ 
      PRINT, num(i) + 1, stats(i) 
   PRINT 
   PRINT, 'nrmiss = ', nrmiss 
END 
 
idxv  =  [1, 2, 3, 4] 
idxc  =  [0, -1, -1] 
n_groups  =  3 
method  =  3 
; Retrieve the Fisher Iris Data Set 
x  =  IMSL_STATDATA(3) 
IMSL_DISCR_ANALYSIS, x, n_groups, Idx_Vars = idxv, $ 
   Idx_cols = idxc, Method = method, /Prior_Equal, $ 
   Prior_Output = prior_out, Group_Counts = counts, $ 
   Means = means, Covariances = cov, $ 
   Coefficients = coef, Class_Member = cm, $ 
   Class_Table = table, Prob = prob, $ 
   Mahalanobis = d2, Stats = stats, Nmissing = nrmiss 
print_results, counts, table, d2, prior_out, coef, means, $ 
   cov, stats, nrmiss 
 
      Counts
    1    2    3
   50   50   50

        Table
      1    2    3
 1   50    0    0
 2    0   48    2
 3    0    1   49

           D2
      1      2      3
 1    0.0   89.9  179.4
 2   89.9    0.0   17.2
 3  179.4   17.2    0.0

          Prior OUT
         1         2         3
    0.3333    0.3333    0.3333

                         Coef
          1         2         3         4         5
 1     -86.3      23.5      23.6     -16.4     -17.4
 2     -72.9      15.7       7.1       5.2       6.4
 3    -104.4      12.4       3.7      12.8      21.1

                  Means
         1         2         3         4
 1     5.006     3.428     1.462     0.246
 2     5.936     2.770     4.260     1.326
 3     6.588     2.974     5.552     2.026

             Covariance
         1         2         3         4
 1    0.2650    0.0927    0.1675    0.0384
 2    0.0927    0.1154    0.0552    0.0327
 3    0.1675    0.0552    0.1852    0.0427
 4    0.0384    0.0327    0.0427    0.0419

           Stats
       1      147.000
       2          NaN
       3          NaN
       4          NaN
       5          NaN
       6          NaN
       7          NaN
       8     -9.95854
       9      50.0000
      10      50.0000
      11      50.0000
      12      150.000

nrmiss =            0

Errors

Warning Errors

STAT_BAD_OBS_1
In call #, row # of the data matrix, "x", has group number = #. The group number must be an integer between 1.0 and "n_groups" = #, inclusively. This observation will be ignored.

STAT_BAD_OBS_2
The leaving-out-one method is specified but this observation does not have a valid group number (its group number is #). This observation (row #) is ignored.

STAT_BAD_OBS_3
The leaving-out-one method is specified but this observation does not have a valid weight or it does not have a valid frequency. This observation (row #) is ignored.

STAT_COV_SINGULAR_3
The group # covariance matrix is singular. "Stats(1)" cannot be computed. "Stats(1)" and "Stats(3)" are set to the missing value code (NaN).

Fatal Errors

STAT_COV_SINGULAR_1
The variance-covariance matrix for population number # is singular. The computations cannot continue.

STAT_COV_SINGULAR_2
The pooled variance-covariance matrix is singular. The computations cannot continue.

STAT_COV_SINGULAR_4
A variance-covariance matrix is singular. The index of the first zero element is equal to #.

Version History

6.4

Introduced