IMSL_STEPWISE


The IMSL_STEPWISE procedure builds multiple linear regression models using forward, backward, or stepwise selection.

Note
This routine requires an IDL Advanced Math and Stats license. For more information, contact your ITT Visual Information Solutions sales or technical support representative.

Syntax

IMSL_STEPWISE, x, y [, /ALL_STEPS] [, ANOVA_TABLE=variable]
[, /BACKWARD] [, COV_NOBS=value] [, COV_INPUT=array] [, COEF_T_TESTS=variable] [, COEF_VIF=variable] [, COV_SWEPT=variable] [, /DOUBLE] [, /FIRST_STEP] [, FORCE=value]
[, /FORWARD] [, FREQUENCIES=array] [, HISTORY=variable]
[, /INTER_STEP] [, /LAST_STEP] [, IEND=variable] [, LEVEL=array] [, N_STEPS=value] [, P_IN=value] [, P_OUT=value]
[, /STEPWISE] [, SWEPT=variable] [, TOLERANCE=value] [, WEIGHTS=array]

Arguments

x

Two-dimensional array containing the data for the candidate variables.

y

Array of length N_ELEMENTS(x(*, 0)) containing the responses for the dependent variable.

Keywords

ALL_STEPS

This is the only invocation. Initialization, stepping, and wrap-up computations are performed.

Note
One or none of these options — First_Step, Inter_Step, Last_Step, and All_Steps — can be specified. If none of these is specified, the action defaults to All_Steps.

ANOVA_TABLE

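Named variable into which the one-dimensional array containing the analysis of variance table is stored. Elements 0 through 11 of the analysis of variance statistics, in the order printed in the Example section, are as follows:

Element    Statistic
0          Degrees of freedom for regression
1          Degrees of freedom for error
2          Total degrees of freedom
3          Sum of squares for regression
4          Sum of squares for error
5          Total sum of squares
6          Mean square for regression
7          Mean square error
8          F-statistic
9          p-value
10         R-squared (in percent)
11         Adjusted R-squared (in percent)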

BACKWARD

An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds P_Out. During initialization, all candidate independent variables enter the model.

Note
One or none of these options (Forward, Backward, Stepwise) can be specified. If none is specified, the action defaults to Backward.

COV_NOBS

The number of observations associated with array Cov_Input.

Note
Keywords Cov_Input and Cov_Nobs must be used together.

COV_INPUT

Two-dimensional square array of size (N_ELEMENTS(x(0,*)) + 1) x (N_ELEMENTS(x(0,*)) + 1) containing a variance-covariance or sum-of-squares and crossproducts matrix, in which the last column must correspond to the dependent variable.

Array Cov_Input can be computed using IMSL_COVARIANCES. Parameters x and y, and keywords Frequencies and Weights, are not accessed when this option is specified. Normally, IMSL_STEPWISE computes Cov_Input from the input data matrices x and y. However, there may be cases when you want to calculate the covariance matrix and manipulate it before calling IMSL_STEPWISE. See the Discussion section for such cases.

Note
Keywords Cov_Input and Cov_Nobs must be used together.
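
The following is a minimal sketch of supplying a precomputed matrix through Cov_Input. It assumes that IMSL_COVARIANCES returns the variance-covariance matrix by default and that x and y have the layout used in the Example section. The procedure name stepwise_from_cov is hypothetical and introduced only for illustration.

```idl
PRO stepwise_from_cov, x, y
  ; Append the dependent variable as the last column, as
  ; Cov_Input requires.
  xy = [[x], [y]]
  ; Assumption: the default result of IMSL_COVARIANCES is the
  ; variance-covariance matrix of the columns of xy.
  cov = IMSL_COVARIANCES(xy)
  nobs = N_ELEMENTS(y)
  ; x and y are still passed, but per the keyword description
  ; they are not accessed when Cov_Input is supplied.
  IMSL_STEPWISE, x, y, Cov_Input = cov, Cov_Nobs = nobs, $
    Anova_Table = anova_table, Swept = swept
  PRINT, swept
END
```

Any manipulation of the matrix would take place between the IMSL_COVARIANCES call and the IMSL_STEPWISE call.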

COEF_T_TESTS

Named variable into which the two-dimensional array containing statistics relating to the regression coefficients for the final model in this invocation is stored. The rows correspond to the N_ELEMENTS(x(0, *)) independent variables. The rows are in the same order as the variables in x (or, if Cov_Input is specified, the rows are in the same order as the variables in Cov_Input). Each row corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the row in question.

COEF_VIF

Named variable into which the one-dimensional array containing variance inflation factors for the final model in this invocation is stored. The elements correspond to the N_ELEMENTS(x(0, *)) independent variables. The elements are in the same order as the variables in x (or, if Cov_Input is specified, the elements are in the same order as the variables in Cov_Input). Each element corresponding to a variable not in the model contains statistics for a model which includes the variables of the final model and the variable corresponding to the element in question.

The square of the multiple correlation coefficient of the i-th regressor on all the others can be obtained from VIF = Coef_Vif(i) by the following formula:

1.0 – (1.0/VIF)
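
As a small illustration of this formula, the sketch below converts variance inflation factors to squared multiple correlation coefficients. The values in vif are hypothetical placeholders, not output from IMSL_STEPWISE; in practice they would come from the Coef_Vif keyword.

```idl
; Hypothetical variance inflation factors standing in for the
; array returned through the Coef_Vif keyword.
vif = [1.25, 2.0, 10.0]
; Squared multiple correlation of each regressor on the others.
r2 = 1.0 - (1.0 / vif)
PRINT, r2, FORMAT = '(3F8.4)'
```

For instance, a variance inflation factor of 10 corresponds to a squared multiple correlation of 0.9.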

COV_SWEPT

Named variable into which the two-dimensional array of size (N_ELEMENTS(x(0, *)) + 1) x (N_ELEMENTS(x(0, *)) + 1) that results after the covariance matrix has been swept on the columns corresponding to the variables in the model is stored. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of Cov_Swept corresponding to the independent variables in the final model and multiplying the elements of this matrix by Anova_Table(7).
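
The extraction described above might be sketched as follows. This is an illustration under stated assumptions, not a definitive recipe: it assumes that a positive value in Swept marks a variable that is in the final model (the Example section prints 1. for such variables and -1. for the others), and the procedure name coef_covariance_sketch is hypothetical.

```idl
PRO coef_covariance_sketch, x, y
  IMSL_STEPWISE, x, y, Anova_Table = anova_table, $
    Cov_Swept = cov_swept, Swept = swept
  nvar = N_ELEMENTS(x[0, *])
  ; Assumption: a positive Swept value marks a variable that is
  ; in the final model.
  in_model = WHERE(swept[0:nvar-1] GT 0, count)
  IF count EQ 0 THEN RETURN
  ; Extract the rows, then the columns, of the model variables.
  sub = cov_swept[in_model, *]
  sub = sub[*, in_model]
  ; Scale by the mean square error, Anova_Table(7).
  coef_cov = sub * anova_table[7]
  PRINT, coef_cov
END
```

The rows and columns are extracted in two steps because subscripting with two index arrays at once selects paired elements rather than a submatrix.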

DOUBLE

If present and nonzero, double precision is used.

FIRST_STEP

This is the first invocation; additional calls will be made. Initialization and stepping are performed.

Note
One or none of these options — First_Step, Inter_Step, Last_Step, and All_Steps — can be specified. If none of these is specified, the action defaults to All_Steps.

FORCE

Scalar integer specifying how variables are forced into the model as independent variables. Variables with levels 1, 2, ..., Force are forced into the model as independent variables. See Level.

FORWARD

An attempt is made to add a variable to the model. A variable is added if its p-value is less than P_In. During initialization, only the forced variables enter the model.

Note
One or none of these options (Forward, Backward, Stepwise) can be specified. If none is specified, the action defaults to Backward.

FREQUENCIES

One-dimensional array containing the frequency for each row of x. Default: Frequencies (*) = 1

HISTORY

Named variable into which the one-dimensional array of length N_ELEMENTS (x(0, *)) + 1 containing the recent history of the independent variables is stored.

Element History(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see Level). The possible values are listed in Table 14-6.

Table 14-6: History Variable

History(i)    Status of i-th Variable
0.0           Variable has never been added to model.
0.5           Variable was added into the model during initialization.
k > 0.0       Variable was added to the model during the k-th step.
k < 0.0       Variable was deleted from model during the k-th step.
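
The encoding in Table 14-6 can be decoded with a short helper such as the sketch below. The procedure name describe_history is hypothetical, and the sketch assumes that the last element of the array belongs to the dependent variable and can be skipped.

```idl
PRO describe_history, history
  ; Assumption: the last element corresponds to the dependent
  ; variable, so only the preceding elements are reported.
  FOR i = 0, N_ELEMENTS(history) - 2 DO BEGIN
    h = history[i]
    ; Test the special values 0.0 and 0.5 before the sign tests.
    CASE 1 OF
      h EQ 0.0: status = 'never added to the model'
      h EQ 0.5: status = 'added during initialization'
      h GT 0.0: status = 'added during step ' + STRTRIM(FIX(h), 2)
      ELSE: status = 'deleted during step ' + STRTRIM(FIX(-h), 2)
    ENDCASE
    PRINT, i + 1, status, FORMAT = '("variable ", I0, ": ", A)'
  ENDFOR
END
```

It could be called with the array returned through the History keyword, for example describe_history, history.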

INTER_STEP

This is an intermediate invocation. Stepping is performed.

Note
One or none of these options — First_Step, Inter_Step, Last_Step, and All_Steps — can be specified. If none of these is specified, the action defaults to All_Steps.

LAST_STEP

This is the final invocation. Stepping and wrap-up computations are performed.

Note
One or none of these options — First_Step, Inter_Step, Last_Step, and All_Steps — can be specified. If none of these is specified, the action defaults to All_Steps.

IEND

Named variable into which an integer which indicates whether additional steps are possible is stored.

LEVEL

Array of length N_ELEMENTS(x(0, *)) + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value that indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. Level(i) = 0 means the i-th variable is never to enter the model. Level(i) = –1 means the i-th variable is the dependent variable. Level (N_ELEMENTS(x(0, *))) must correspond to the dependent variable, except when Cov_Input is specified. Default: 1, 1, ..., 1, –1, where –1 corresponds to Level (N_ELEMENTS(x(0, *)))

N_STEPS

For nonnegative N_Steps, N_Steps steps are taken. If N_Steps = –1, stepping continues until completion. Default: N_Steps = 1

Note
Keyword N_Steps is not referenced if All_Steps is used.

P_IN

Largest p-value for variable entering the model. Variables with p-values less than P_In may enter the model. Default: P_In = 0.05

P_OUT

Smallest p-value for removing variables. Variables with p-values greater than P_Out may leave the model. Keyword P_Out must be greater than or equal to P_In. A common choice for P_Out is 2*P_In. Default: P_Out = 0.10

STEPWISE

A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization.

Note
One or none of these options (Forward, Backward, Stepwise) can be specified. If none is specified, the action defaults to Backward.
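
A single-call stepwise selection with explicit entry and removal thresholds might look like the sketch below. It uses only keywords described on this page; the procedure name stepwise_selection_sketch is hypothetical, and x and y are assumed to have the layout used in the Example section.

```idl
PRO stepwise_selection_sketch, x, y
  ; P_Out is set to 2*P_In, the common choice noted under P_OUT.
  p_in = 0.05
  IMSL_STEPWISE, x, y, /Stepwise, P_In = p_in, $
    P_Out = 2 * p_in, History = history, Swept = swept
  PRINT, 'History:', history
  PRINT, 'Swept:  ', swept
END
```

Replacing /Stepwise with /Forward would request forward selection with the same thresholds.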

SWEPT

Named variable into which the one-dimensional array of length N_ELEMENTS(x(0, *)) + 1 indicating which independent variables are in the model is stored. Element Swept(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see Level).

TOLERANCE

Tolerance used in determining linear dependence. Default: Tolerance = 100*ε, where ε is machine precision.

WEIGHTS

One-dimensional array containing the weight for each row of x. Default: Weights (*) = 1

Discussion

The IMSL_STEPWISE procedure builds a multiple linear regression model using forward selection, backward selection, or forward stepwise (with a backward glance) selection. The procedure is designed so that you can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, multiple calls to IMSL_STEPWISE are made (using keywords First_Step, Inter_Step, or Last_Step). Alternatively, IMSL_STEPWISE can be invoked once (the default, or with keyword All_Steps) to perform the stepping until a final model is selected.

Levels of priority can be assigned to the candidate independent variables (use keyword Level). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, etc. Variables also can be forced into the model (see keyword Force). Note that specifying keyword Force without also specifying keyword Level results in all variables being forced into the model.
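
The interaction of Level and Force can be sketched as follows. The example forces the first candidate variable into the model and lets the remaining candidates compete at priority level 2. The procedure name stepwise_forced_sketch is hypothetical, and at least two candidate variables are assumed.

```idl
PRO stepwise_forced_sketch, x, y
  nvar = N_ELEMENTS(x[0, *])
  ; Level 1 for the first candidate, level 2 for the rest, and
  ; -1 in the last position for the dependent variable.
  level = [1, REPLICATE(2, nvar - 1), -1]
  ; Force = 1 forces every variable with level 1 into the model.
  IMSL_STEPWISE, x, y, Level = level, Force = 1, $
    Coef_T_Tests = t, Swept = swept
  PRINT, swept
END
```

Supplying Level explicitly matters here: with the default levels, every candidate has level 1, so Force = 1 would force all of them into the model.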

Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is used. Other possibilities are as follows:

The stepwise regression algorithm is due to Efroymson (1960). The IMSL_STEPWISE procedure uses sweeps of the covariance matrix (input using keyword Cov_Input, if specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm also is given by Kennedy and Gentle (1980, pp. 335–340). The advantage of stepwise model building over all possible regressions (see IMSL_ALLBEST) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R-squared) for any subset size of independent variables.

Example

This example uses a data set from Draper and Smith (1981, pp. 629-630). Backwards stepping is performed by default. First, a procedure to output the results is defined.

PRO print_results, anova_table, t, s 
labels = ['df for regression              ', $ 
   'df for error                   ', $ 
   'total df                       ', $ 
   'ss for regression              ', $ 
   'ss for error                   ', $ 
   'total ss                       ', $ 
   'mean square for regression     ', $ 
   'mean square error              ', $ 
   'F-statistic                    ', $ 
   'p-value                        ', $ 
   'R-squared (in percent)         ', $ 
   'adjusted R-squared (in percent)'] 
PRINT  
PRINT, '       * * Analysis of Variance * *'  
; Print the analysis of variance table.  
FOR i = 0, 11 DO PRINT, labels(i), $ 
   anova_table(i), FORMAT = '(a32,f8.2)'  
PRINT  
PRINT, '* * Inference on Coefficients * *'  
PRINT, '            Estimate    s.e.       t' + $ 
   '        prob>t     swept'  
; Print one row per candidate variable: the four t-test 
; statistics followed by the swept indicator.  
FOR i = 0, 3 DO PRINT, 'variable ' + STRTRIM(i + 1, 2), $ 
   t(i, *), s(i), FORMAT = '(a, 4f10.4, f10.0)'  
END  
x = MAKE_ARRAY(13, 4) 
; Define the data.  
x(0, *) = [7., 26., 6., 60.]  
x(1, *) = [1., 29., 15., 52.]  
x(2, *) = [11., 56., 8., 20.]  
x(3, *) = [11., 31., 8., 47.]  
x(4, *) = [7., 52., 6., 33.]  
x(5, *) = [11., 55., 9., 22.]  
x(6, *) = [3., 71., 17., 6.]  
x(7, *) = [1., 31., 22., 44.]  
x(8, *) = [2., 54., 18., 22.]  
x(9, *) = [21., 47., 4., 26.]  
x(10, *) = [1., 40., 23., 34.]  
x(11, *) = [11., 66., 9., 12.]  
x(12, *) = [10., 68., 8., 12.]  
y = [78.5, 74.3, 104.3, 87.6, 95.9, $ 
   109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]  
; Backward stepwise regression (the default).  
IMSL_STEPWISE, x, y, Anova_Table = anova_table, $ 
   Coef_T_Tests = t, Swept = s 
print_results, anova_table, t, s 
        * * Analysis of Variance * *  
 df for regression                  2.00  
df for error                      10.00  
total df                          12.00  
ss for regression               2657.86  
ss for error                      57.90  
total ss                        2715.76  
mean square for regression      1328.93  
mean square error                  5.79  
F-statistic                      229.50  
p-value                            0.00  
R-squared (in percent)            97.87  
adjusted R-squared (in percent)   97.44  
* * Inference on Coefficients * * 
               Estimate    s.e.       t        prob>t     swept 
   variable 1    1.4683    0.1213   12.1046    0.0000        1. 
   variable 2    0.6623    0.0459   14.4423    0.0000        1. 
   variable 3    0.2500    0.1847    1.3536    0.2089       -1. 
   variable 4   -0.2365    0.1733   -1.3650    0.2054       -1. 

Errors

Warning Errors

STAT_LINEAR_DEPENDENCE_1—Based on Tolerance = #, there are linear dependencies among the variables to be forced.

Fatal Errors

STAT_NO_VARIABLES_ENTERED—No variables entered the model. All elements of Anova_Table are set to NaN.

Version History

6.4

Introduced