The IMSL_STEPWISE procedure builds multiple linear regression models using forward, backward, or stepwise selection.
This routine requires an IDL Advanced Math and Stats license. For more information, contact your sales or technical support representative.
The IMSL_STEPWISE procedure builds a multiple linear regression model using forward selection, backward selection, or forward stepwise selection (with a backward glance). It is designed so that you can monitor, and perhaps change, the variables added to or deleted from the model after each step. In this case, make multiple calls to IMSL_STEPWISE using keywords FIRST_STEP, INTER_STEP, or LAST_STEP. Alternatively, invoke IMSL_STEPWISE once (the default, or specify keyword ALL_STEPS) to perform the stepping until a final model is selected.
Levels of priority can be assigned to the candidate independent variables (use keyword LEVEL). All variables with a priority level of 1 must enter the model before variables with a priority level of 2. Similarly, variables with a level of 2 must enter before variables with a level of 3, and so on. Variables can also be forced into the model (see keyword FORCE). Note that specifying keyword FORCE without also specifying keyword LEVEL results in all variables being forced into the model.
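As a hedged sketch of how LEVEL and FORCE might be combined, the following call reuses the Draper and Smith data from the Example section, forces the level-1 variables into the model, and lets the level-2 variables compete for entry under forward selection. The particular level assignment is an illustrative assumption, not a recommendation.

```idl
; Draper and Smith data (same values as in the Example section).
x = TRANSPOSE([[ 7., 26.,  6., 60.], [ 1., 29., 15., 52.], $
               [11., 56.,  8., 20.], [11., 31.,  8., 47.], $
               [ 7., 52.,  6., 33.], [11., 55.,  9., 22.], $
               [ 3., 71., 17.,  6.], [ 1., 31., 22., 44.], $
               [ 2., 54., 18., 22.], [21., 47.,  4., 26.], $
               [ 1., 40., 23., 34.], [11., 66.,  9., 12.], $
               [10., 68.,  8., 12.]])
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, $
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]

; Illustrative priorities (an assumption): variables 1 and 2 at level 1,
; variables 3 and 4 at level 2. The last element, -1, marks the
; dependent variable.
level = [1, 1, 2, 2, -1]

; FORCE = 1 forces every variable with level 1 into the model.
IMSL_STEPWISE, x, y, /FORWARD, LEVEL = level, FORCE = 1, SWEPT = s
PRINT, s
END
```

With this setup, variables 3 and 4 are the only ones subject to the P_IN test, because variables 1 and 2 are forced in.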
Typically, the intercept is forced into all models and is not a candidate variable. In this case, a sum-of-squares and crossproducts matrix for the independent and dependent variables corrected for the mean is used. Other possibilities are as follows:
- The intercept is not in the model. A raw (uncorrected) sum-of-squares and crossproducts matrix for the independent and dependent variables is required as input in COV_INPUT. Keyword COV_NOBS must be set to 1 greater than the number of observations.
- An intercept is to be a candidate variable. A raw (uncorrected) sum-of-squares and crossproducts matrix for the constant regressor (=1), independent variables, and dependent variable is required for COV_INPUT. In this case, COV_INPUT contains one additional row and column corresponding to the constant regressor. This row/column contains the sum-of-squares and crossproducts of the constant regressor with the independent and dependent variables. The remaining elements in COV_INPUT are the same as in the previous case. Keyword COV_NOBS must be set to 1 greater than the number of observations.
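The first case above (no intercept in the model) can be sketched as follows. This is a minimal illustration under stated assumptions: the raw matrix is built with the native # operator rather than with IMSL_COVARIANCES, and the data are those of the Example section. Whether any further keywords are needed for a no-intercept fit is not stated here, so treat the call as a starting point.

```idl
; Draper and Smith data (same values as in the Example section).
x = TRANSPOSE([[ 7., 26.,  6., 60.], [ 1., 29., 15., 52.], $
               [11., 56.,  8., 20.], [11., 31.,  8., 47.], $
               [ 7., 52.,  6., 33.], [11., 55.,  9., 22.], $
               [ 3., 71., 17.,  6.], [ 1., 31., 22., 44.], $
               [ 2., 54., 18., 22.], [21., 47.,  4., 26.], $
               [ 1., 40., 23., 34.], [11., 66.,  9., 12.], $
               [10., 68.,  8., 12.]])
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, $
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]
n = N_ELEMENTS(y)

; Append y as the last column so that the last row/column of the
; crossproducts matrix corresponds to the dependent variable.
z = [[x], [y]]

; Raw (uncorrected) sum-of-squares and crossproducts matrix, 5 x 5.
sscp = TRANSPOSE(z) # z

; Per the text above, COV_NOBS is 1 greater than the number of
; observations when a raw matrix is supplied.
IMSL_STEPWISE, x, y, COV_INPUT = sscp, COV_NOBS = n + 1, SWEPT = s
PRINT, s
END
```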
The stepwise regression algorithm is due to Efroymson (1960). The IMSL_STEPWISE procedure uses sweeps of the covariance matrix (input using keyword COV_INPUT, if specified, or generated internally by default) to move variables in and out of the model (Hemmerle 1967, Chapter 3). The SWEEP operator discussed in Goodnight (1979) is used. A description of the stepwise algorithm is also given by Kennedy and Gentle (1980, pp. 335–340). The advantage of stepwise model building over all possible regressions (see IMSL_ALLBEST) is that it is less demanding computationally when the number of candidate independent variables is very large. However, there is no guarantee that the model selected will be the best model (highest R²) for any subset size of independent variables.
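To make the idea of sweeping concrete, here is a small sketch of one common textbook form of the sweep operator, written in plain IDL. The function name sweep_sketch is hypothetical, and this is not claimed to be the exact form or sign convention used internally by IMSL_STEPWISE.

```idl
; Hypothetical helper: sweep the symmetric matrix a on pivot k.
FUNCTION sweep_sketch, a, k
   s = DOUBLE(a)
   d = s(k, k)                       ; pivot element
   s(k, *) = s(k, *) / d             ; scale the pivot row
   n = N_ELEMENTS(s(0, *))
   FOR i = 0, n - 1 DO BEGIN
      IF i NE k THEN BEGIN
         b = s(i, k)
         s(i, *) = s(i, *) - b * s(k, *)   ; eliminate the pivot column
         s(i, k) = -b / d
      ENDIF
   ENDFOR
   s(k, k) = 1.0d / d
   RETURN, s
END

; Toy crossproducts matrix for one regressor and one response:
; x'x = 4, x'y = 2, y'y = 3.
a = [[4.0d, 2.0d], [2.0d, 3.0d]]
r = sweep_sketch(a, 0)
; After sweeping on the regressor, r(0, 1) holds the regression
; coefficient (2/4 = 0.5) and r(1, 1) the residual sum of squares
; (3 - 2*2/4 = 2).
PRINT, r
END
```

Sweeping on a column brings that variable into the model in this formulation; applying the inverse operation removes it, which is why stepping in and out of the model avoids refitting from scratch.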
Example
This example uses a data set from Draper and Smith (1981, pp. 629-630). Backwards stepping is performed by default. First, a procedure to output the results is defined.
PRO print_results, anova_table, t, s
labels = ['df for regression ', $
'df for error ', $
'total df ', $
'ss for regression ', $
'ss for error ', $
'total ss ', $
'mean square for regression ', $
'mean square error ', $
'F-statistic ', $
'p-value ', $
'R-squared (in percent) ', $
'adjusted R-squared (in percent)']
PRINT
PRINT, ' * * Analysis of Variance * *'
FOR i = 0, 11 DO PRINT, labels(i), $
anova_table(i), FORMAT = '(a32,f8.2)'
PRINT
PRINT, '* * Inference on Coefficients * *'
PRINT, ' Estimate s.e. t' + $
' prob>t swept'
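; Each row prints a label, the four coefficient statistics, and the swept status.
PRINT, 'variable 1', t(0,*), s(0), FORMAT = '(a, 4f10.4, f5.0)'
PRINT, 'variable 2', t(1,*), s(1), FORMAT = '(a, 4f10.4, f5.0)'
PRINT, 'variable 3', t(2,*), s(2), FORMAT = '(a, 4f10.4, f5.0)'
PRINT, 'variable 4', t(3,*), s(3), FORMAT = '(a, 4f10.4, f5.0)'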
END
x = MAKE_ARRAY(13, 4)
x(0, *) = [7., 26., 6., 60.]
x(1, *) = [1., 29., 15., 52.]
x(2, *) = [11., 56., 8., 20.]
x(3, *) = [11., 31., 8., 47.]
x(4, *) = [7., 52., 6., 33.]
x(5, *) = [11., 55., 9., 22.]
x(6, *) = [3., 71., 17., 6.]
x(7, *) = [1., 31., 22., 44.]
x(8, *) = [2., 54., 18., 22.]
x(9, *) = [21., 47., 4., 26.]
x(10, *) = [1., 40., 23., 34.]
x(11, *) = [11., 66., 9., 12.]
x(12, *) = [10., 68., 8., 12.]
y = [78.5, 74.3, 104.3, 87.6, 95.9, $
109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]
IMSL_STEPWISE, x, y, Anova_Table = anova_table, $
Coef_T_Tests = t, swept = s
print_results, anova_table, t, s
END
IDL prints:
* * Analysis of Variance * *
df for regression 2.00
df for error 10.00
total df 12.00
ss for regression 2657.86
ss for error 57.90
total ss 2715.76
mean square for regression 1328.93
mean square error 5.79
F-statistic 229.50
p-value 0.00
R-squared (in percent) 97.87
adjusted R-squared (in percent) 97.44
* * Inference on Coefficients * *
Estimate s.e. t prob>t swept
variable 1 1.4683 0.1213 12.1046 0.0000 1.
variable 2 0.6623 0.0459 14.4423 0.0000 1.
variable 3 0.2500 0.1847 1.3536 0.2089 -1.
variable 4 -0.2365 0.1733 -1.3650 0.2054 -1.
Errors
Warning Errors
STAT_LINEAR_DEPENDENCE_1: Based on Tolerance = #, there are linear dependencies among the variables to be forced.
Fatal Errors
STAT_NO_VARIABLES_ENTERED: No variables entered the model. All elements of Anova_Table are set to NaN.
Syntax
IMSL_STEPWISE, x, y [, /ALL_STEPS] [, ANOVA_TABLE=variable] [, /BACKWARD] [, COV_NOBS=value] [, COV_INPUT=array] [, COEF_T_TESTS=variable] [, COEF_VIF=variable] [, COV_SWEPT=variable] [, /DOUBLE] [, /FIRST_STEP] [, FORCE=value] [, /FORWARD] [, FREQUENCIES=array] [, HISTORY=variable] [, /INTER_STEP] [, /LAST_STEP] [, IEND=variable] [, LEVEL=array] [, N_STEPS=value] [, P_IN=value] [, P_OUT=value] [, /STEPWISE] [, SWEPT=variable] [, TOLERANCE=value] [, WEIGHTS=array]
Arguments
X
Two-dimensional array containing the data for the candidate variables.
Y
Array of length N_ELEMENTS(x(*, 0)) containing the responses for the dependent variable.
Keywords
ANOVA_TABLE (optional)
Named variable into which the one-dimensional array containing the analysis of variance table is stored. The analysis of variance statistics are as follows:
- 0: Degrees of freedom for regression
- 1: Degrees of freedom for error
- 2: Total degrees of freedom
- 3: Sum of squares for regression
- 4: Sum of squares for error
- 5: Total sum of squares
- 6: Regression mean square
- 7: Error mean square
- 8: F-statistic
- 9: p-value
- 10: R² (in percent)
- 11: Adjusted R² (in percent)
- 12: Estimate of the standard deviation
BACKWARD (optional)
An attempt is made to remove a variable from the model. A variable is removed if its p-value exceeds P_OUT. During initialization, all candidate independent variables enter the model. One or none of these options can be specified: BACKWARD, FORWARD, STEPWISE. If none are specified, the default is BACKWARD.
COV_NOBS (optional)
The number of observations associated with array COV_INPUT. COV_INPUT and COV_NOBS must be used together.
COV_INPUT (optional)
Two-dimensional square array of size (N_ELEMENTS(x(0,*)) + 1) x (N_ELEMENTS(x(0,*)) + 1) containing a variance-covariance or sum-of-squares and crossproducts matrix, in which the last column must correspond to the dependent variable.
Array COV_INPUT can be computed using IMSL_COVARIANCES. Arguments X and Y, and keywords FREQUENCIES and WEIGHTS are not accessed when this option is specified. Normally, IMSL_STEPWISE computes COV_INPUT from the input data matrices x and y. However, there may be cases when you want to calculate the covariance matrix and manipulate it before calling IMSL_STEPWISE. See the description at the beginning of this topic for a discussion of such cases. COV_INPUT and COV_NOBS must be used together.
COEF_T_TESTS (optional)
Named variable into which the two-dimensional array containing statistics relating to the regression coefficients for the final model in this invocation is stored. The rows correspond to the N_ELEMENTS(x(0, *)) independent variables and are in the same order as the variables in x (or, if COV_INPUT is specified, in the same order as the variables in COV_INPUT). Each row corresponding to a variable not in the model contains statistics for a model that includes the variables of the final model and the variable corresponding to the row in question. The columns are as follows:
- 0: Coefficient estimate
- 1: Estimated standard error of the coefficient estimate
- 2: t-statistic for the test that the coefficient is zero
- 3: p-value for the two-sided t test
COEF_VIF (optional)
Named variable into which the one-dimensional array containing variance inflation factors for the final model in this invocation is stored. The elements correspond to the N_ELEMENTS(x(0, *)) independent variables and are in the same order as the variables in x (or, if COV_INPUT is specified, in the same order as the variables in COV_INPUT). Each element corresponding to a variable not in the model contains the statistic for a model that includes the variables of the final model and the variable corresponding to the element in question.
The square of the multiple correlation coefficient for the i-th regressor after all others can be obtained from VIF = COEF_VIF(i) by the following formula:
1.0 – (1.0/VIF)
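The formula above applies element-wise, so it can be evaluated on the whole COEF_VIF array at once. The values below are hypothetical and chosen only to illustrate the arithmetic; in practice vif would come from the COEF_VIF keyword.

```idl
; Hypothetical variance inflation factors, for illustration only.
vif = [1.0, 2.0, 10.0]

; Squared multiple correlation of each regressor with the others.
; A VIF of 1 gives 0.0, a VIF of 2 gives 0.5, and a VIF of 10 gives 0.9.
r2 = 1.0 - (1.0 / vif)
PRINT, r2
END
```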
COV_SWEPT (optional)
Named variable into which the two-dimensional array of size (N_ELEMENTS(x(0, *)) + 1) x (N_ELEMENTS(x(0, *)) + 1) is stored. This array is the covariance matrix that results after sweeping on the columns corresponding to the variables in the model. The estimated variance-covariance matrix of the estimated regression coefficients in the final model can be obtained by extracting the rows and columns of COV_SWEPT corresponding to the independent variables in the final model and multiplying the elements of this matrix by the error mean square, ANOVA_TABLE(7).
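The extraction described above might look like the following sketch. It assumes the scaling factor is the error mean square, element 7 of ANOVA_TABLE, and it uses SWEPT to find the variables in the final model; the variable names are illustrative.

```idl
; Draper and Smith data (same values as in the Example section).
x = TRANSPOSE([[ 7., 26.,  6., 60.], [ 1., 29., 15., 52.], $
               [11., 56.,  8., 20.], [11., 31.,  8., 47.], $
               [ 7., 52.,  6., 33.], [11., 55.,  9., 22.], $
               [ 3., 71., 17.,  6.], [ 1., 31., 22., 44.], $
               [ 2., 54., 18., 22.], [21., 47.,  4., 26.], $
               [ 1., 40., 23., 34.], [11., 66.,  9., 12.], $
               [10., 68.,  8., 12.]])
y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, $
     72.5, 93.1, 115.9, 83.8, 113.3, 109.4]

IMSL_STEPWISE, x, y, ANOVA_TABLE = a, COV_SWEPT = cs, SWEPT = s

; Indices of the independent variables that are in the final model.
nvar = N_ELEMENTS(x(0, *))
idx = WHERE(s(0:nvar - 1) EQ 1, count)

IF count GT 0 THEN BEGIN
   ; Keep only the rows and columns of the in-model variables.
   sub = cs(idx, *)
   sub = sub(*, idx)
   ; Scale by the error mean square (assumed to be element 7).
   PRINT, a(7) * sub
ENDIF
END
```

The square roots of the diagonal of the scaled matrix should be comparable to the standard errors reported in column 1 of COEF_T_TESTS, which gives a quick consistency check.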
DOUBLE (optional)
If present and nonzero, double precision is used.
FIRST_STEP (optional)
This is the first invocation; additional calls will be made. Initialization and stepping are performed. One or none of the following options can be specified: FIRST_STEP, INTER_STEP, LAST_STEP, and ALL_STEPS. If none are specified, the default is ALL_STEPS.
FORCE (optional)
Scalar integer specifying how variables are forced into the model as independent variables. Variables with levels 1, 2, ..., FORCE are forced into the model as independent variables. See the keyword LEVEL.
FORWARD (optional)
An attempt is made to add a variable to the model. A variable is added if its p-value is less than P_IN. During initialization, only the forced variables enter the model. One or none of these options can be specified: BACKWARD, FORWARD, STEPWISE. If none are specified, the default is BACKWARD.
FREQUENCIES (optional)
One-dimensional array containing the frequency for each row of x. Default: FREQUENCIES(*) = 1
HISTORY (optional)
Named variable into which the one-dimensional array of length N_ELEMENTS(x(0, *)) + 1 containing the recent history of the independent variables is stored.
Element HISTORY(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see LEVEL). The value of HISTORY(i) gives the status of the i-th variable as follows:
- 0.0: Variable has never been added to model.
- 0.5: Variable was added into the model during initialization.
- k > 0.0: Variable was added to the model during the k-th step.
- k < 0.0: Variable was deleted from model during the k-th step.
INTER_STEP (optional)
This is an intermediate invocation. Stepping is performed. One or none of the following options can be specified: FIRST_STEP, INTER_STEP, LAST_STEP, and ALL_STEPS. If none are specified, the default is ALL_STEPS.
LAST_STEP (optional)
This is the final invocation. Stepping and wrap-up computations are performed. One or none of the following options can be specified: FIRST_STEP, INTER_STEP, LAST_STEP, and ALL_STEPS. If none are specified, the default is ALL_STEPS.
IEND (optional)
Named variable into which an integer indicating whether additional steps are possible is stored.
- 0: Additional steps may be possible
- 1: No additional steps are possible
LEVEL (optional)
Array of length N_ELEMENTS(x(0, *)) + 1 containing levels of priority for variables entering and leaving the regression. Each variable is assigned a positive value that indicates its level of entry into the model. A variable can enter the model only after all variables with smaller nonzero levels of entry have entered. Similarly, a variable can only leave the model after all variables with higher levels of entry have left. Variables with the same level of entry compete for entry (deletion) at each step. LEVEL(i) = 0 means the i-th variable is never to enter the model. LEVEL(i) = –1 means the i-th variable is the dependent variable. LEVEL (N_ELEMENTS(x(0, *))) must correspond to the dependent variable, except when COV_INPUT is specified. Default: 1, 1, ..., 1, –1, where –1 corresponds to LEVEL (N_ELEMENTS(x(0, *))).
N_STEPS (optional)
For nonnegative N_STEPS, N_STEPS steps are taken. If N_STEPS = –1, stepping continues until completion. N_STEPS is not referenced if ALL_STEPS is used. Default: 1
P_IN (optional)
Largest p-value for variable entering the model. Variables with p-values less than P_IN may enter the model. Default: 0.05
P_OUT (optional)
Smallest p-value for removing variables. Variables with p-values greater than P_OUT may leave the model. Keyword P_OUT must be greater than or equal to P_IN. A common choice for P_OUT is 2*P_IN. Default: 0.10
STEPWISE (optional)
A backward step is attempted. If a variable is not removed, a forward step is attempted. This is a stepwise step. Only the forced variables enter the model during initialization. One or none of these options can be specified: BACKWARD, FORWARD, STEPWISE. If none are specified, the default is BACKWARD.
SWEPT (optional)
Named variable into which the one-dimensional array of length (N_ELEMENTS(x(0, *)) + 1) with information to indicate the independent variables in the model is stored. Element SWEPT(N_ELEMENTS(x(0, *))) usually corresponds to the dependent variable (see LEVEL). The value of SWEPT(i) is interpreted as follows:
- –1: Variable i is not in model.
- 1: Variable i is in model.
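A short sketch of how the SWEPT array might be read follows. The first four values match the swept column of the Example output; the last element, which corresponds to the dependent variable, is a placeholder assumption here.

```idl
; Swept status for four candidate variables plus the dependent variable.
; The first four values follow the Example output; the last value is a
; placeholder (an assumption) for the dependent variable.
s = [1., 1., -1., -1., -1.]

; Ignore the last element when looking for in-model variables.
nvar = N_ELEMENTS(s) - 1
in_model = WHERE(s(0:nvar - 1) EQ 1, count)

; Report 1-based variable numbers, as in the Example output.
IF count GT 0 THEN PRINT, 'Variables in model:', in_model + 1 $
ELSE PRINT, 'No candidate variables are in the model.'
END
```

For these values the in-model variables are 1 and 2, which agrees with the two regression degrees of freedom shown in the Example.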
TOLERANCE (optional)
Tolerance used in determining linear dependence. Default: 100*ε, where ε is machine precision.
WEIGHTS (optional)
One-dimensional array containing the weight for each row of x. Default: WEIGHTS(*) = 1
Version History
See Also
IMSL_ALLBEST