The IMSL_CHISQTEST function performs a chi-squared goodness-of-fit test.
This routine requires an IDL Advanced Math and Stats license. For more information, contact your sales or technical support representative.
The IMSL_CHISQTEST function tests the hypothesis that a random sample of observations is distributed according to a specified theoretical cumulative distribution. The theoretical distribution, which may be continuous, discrete, or a mixture of discrete and continuous distributions, is specified by the user-defined function f. If you give a range for the observations, the test performed is conditional on that range.
The argument N_categories gives the number of intervals into which the observations are to be divided. By default, equi-probable intervals are computed by IMSL_CHISQTEST, but intervals that are not equi-probable can be specified (through the use of keyword CUTPOINTS).
Regardless of the method used to obtain the cutpoints, the intervals are such that the lower endpoint is not included in the interval, while the upper endpoint is always included. If the cumulative distribution function has discrete elements, then user-provided cutpoints should always be used since IMSL_CHISQTEST cannot determine the discrete elements in discrete distributions.
By default, the lower and upper endpoints of the first and last intervals are –infinity and +infinity. The endpoints can be specified by using the keywords LOWER_BOUND and UPPER_BOUND.
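As a minimal sketch of the options above, the following tallies a standard normal sample into four user-defined intervals restricted to the range (-2, 2]. The function name sketch_cdf and the cutpoint values are illustrative, and the sketch assumes that CUTPOINTS holds the N_Categories - 1 interior cutpoints, which this section does not state explicitly.

```idl
; Hypothetical wrapper around the standard normal CDF.
FUNCTION sketch_cdf, y
  RETURN, IMSL_NORMALCDF(y)
END

; Main-level program.
IMSL_RANDOMOPT, Set = 123457
x = IMSL_RANDOM(1000, /Normal)

; Assumption: 4 cells need 3 interior cutpoints.
cuts = [-1.0, 0.0, 1.0]

; Observations outside the bounds are ignored, so the test is
; conditional on the range (-2, 2].
p_value = IMSL_CHISQTEST('sketch_cdf', 4, x, CUTPOINTS = cuts, $
  LOWER_BOUND = -2.0, UPPER_BOUND = 2.0)
PRINT, p_value
END
```

Under the endpoint convention described above, the four cells in this sketch are (-2, -1], (-1, 0], (0, 1], and (1, 2].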
A tally of counts is maintained for the observations in x as follows:
- If the cutpoints are specified, the tally is made in the interval to which x_i belongs using the endpoints specified.
- If the cutpoints are determined by IMSL_CHISQTEST, then the cumulative probability at x_i, F(x_i), is computed by the function f.
The tally for x_i is made in interval number $\lfloor m F(x_i) + 1 \rfloor$, where m = N_Categories and $\lfloor \cdot \rfloor$ is the function that takes the greatest integer that is no larger than its argument. Because this requires an evaluation of f at every observation, user-specified cutpoints may be preferred when the cumulative distribution function is expensive to compute, in order to reduce the total computing time.
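The tally rule for automatically determined cutpoints can be sketched in plain IDL. The sketch assumes the interval number is the greatest integer not exceeding m*F(x_i) + 1, which is what m equi-probable intervals imply; the cumulative probabilities below are made-up values, not output from any CDF routine.

```idl
; Number of equi-probable intervals (N_Categories).
m = 10

; Hypothetical values of F(x_i) for four observations.
cdf_vals = [0.03, 0.25, 0.50, 0.999]

; Interval number for each observation: floor(m * F(x_i) + 1).
interval = FLOOR(m * cdf_vals + 1)
PRINT, interval
```

For these values the observations fall in intervals 1, 3, 6, and 10.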
If the expected count in any cell is less than 1, then, as a rule of thumb, the chi-squared approximation may be suspect. A warning message to this effect is issued in this case, and also when an expected value is less than 5.
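The expected counts behind these warnings are easy to check by hand. The sketch below uses made-up cell counts for five equi-probable cells, in which case every cell has expected count n/m, and computes the usual Pearson statistic. It is assumed, not confirmed by this section, that this is the form IMSL_CHISQTEST reports through CELL_CHISQ and CHI_SQUARED.

```idl
; Hypothetical observed counts for m = 5 equi-probable cells.
counts = [95, 108, 99, 102, 96]
n = TOTAL(counts)                 ; number of observations (500)
m = N_ELEMENTS(counts)
expected = REPLICATE(n / m, m)    ; n/m = 100 in every cell

; Check the two thresholds that trigger a warning.
IF MIN(expected) LT 1 THEN PRINT, 'An expected value is less than 1.' $
ELSE IF MIN(expected) LT 5 THEN PRINT, 'An expected value is less than 5.'

; Pearson cell contributions and their sum.
cell_chisq = (counts - expected)^2 / expected
chi_squared = TOTAL(cell_chisq)
PRINT, chi_squared
```

For these counts no threshold is crossed, and the statistic is (25 + 64 + 1 + 4 + 16)/100 = 1.1.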
Programming Notes
You must supply a function f with calling sequence F(y) that returns the value of the cumulative distribution function at any point y in the (optionally) specified range.
Many of the cumulative distribution functions in this reference manual can be used for f. However, you must write a user-defined IDL function that calls the CDF, and then pass the name of this user-defined function for f.
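Because f receives only the point y, any distribution parameters have to reach the wrapper some other way. One possible approach, sketched here, shares them through a COMMON block; the names sketch_normal_cdf and sketch_cdf_params, and the values of mu and sigma, are hypothetical.

```idl
; Hypothetical wrapper: CDF of a normal distribution with mean mu
; and standard deviation sigma, obtained by standardizing y.
FUNCTION sketch_normal_cdf, y
  COMMON sketch_cdf_params, mu, sigma
  RETURN, IMSL_NORMALCDF((y - mu) / sigma)
END

; Main-level program.
COMMON sketch_cdf_params, mu, sigma
mu = 5.0
sigma = 2.0

; Shift and scale standard normal draws to match the hypothesis.
IMSL_RANDOMOPT, Set = 123457
x = mu + sigma * IMSL_RANDOM(1000, /Normal)

; Pass the wrapper by name, as a scalar string.
p_value = IMSL_CHISQTEST('sketch_normal_cdf', 10, x)
PRINT, p_value
END
```

A COMMON block keeps the wrapper's calling sequence at F(y), as required, while still letting the main program set the parameters.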
Example
This example illustrates the use of IMSL_CHISQTEST on a randomly generated sample from the normal distribution. One thousand randomly generated observations are tallied into 10 equi-probable intervals. In this example, the p-value (about 0.155) is larger than common significance levels such as 0.05, so the null hypothesis is not rejected.
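.RUN
; Wrapper that returns the standard normal CDF at k.
FUNCTION user_cdf, k
  RETURN, IMSL_NORMALCDF(k)
END
; Seed the generator so the sample is reproducible.
IMSL_RANDOMOPT, Set = 123457
; Draw 1000 standard normal observations.
x = IMSL_RANDOM(1000, /Normal)
; Test against the normal CDF using 10 equi-probable intervals.
p_value = IMSL_CHISQTEST('user_cdf', 10, x)
PM, p_value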
IDL prints:
0.154603
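The optional output keywords can be added to the same call to retrieve the test statistic, its degrees of freedom, and the per-cell tallies. The sketch below repeats the example with a hypothetical wrapper named sketch_cdf and arbitrary variable names; only keywords listed in the Syntax section are used.

```idl
; Hypothetical wrapper around the standard normal CDF.
FUNCTION sketch_cdf, y
  RETURN, IMSL_NORMALCDF(y)
END

; Main-level program.
IMSL_RANDOMOPT, Set = 123457
x = IMSL_RANDOM(1000, /Normal)

; Each keyword names a variable that receives one output.
p_value = IMSL_CHISQTEST('sketch_cdf', 10, x, $
  CHI_SQUARED = chisq, DF = df, $
  CELL_COUNTS = counts, CELL_EXPECTED = expected, $
  USED_CUTPOINTS = cuts)

PRINT, 'p-value:         ', p_value
PRINT, 'chi-squared:     ', chisq
PRINT, 'df:              ', df
PRINT, 'observed counts: ', counts
PRINT, 'expected counts: ', expected
PRINT, 'cutpoints used:  ', cuts
END
```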
Errors
Warning Errors
STAT_EXPECTED_VAL_LESS_THAN_1: An expected value is less than 1.
STAT_EXPECTED_VAL_LESS_THAN_5: An expected value is less than 5.
Fatal Errors
STAT_ALL_OBSERVATIONS_MISSING: All observations contain missing values.
STAT_INCORRECT_CDF_1: Function f is not a cumulative distribution function. The value at the lower bound must be nonnegative, and the value at the upper bound must not be greater than 1.
STAT_INCORRECT_CDF_2: Function f is not a cumulative distribution function. The probability of the range of the distribution is not positive.
STAT_INCORRECT_CDF_3: Function f is not a cumulative distribution function. Its evaluation at an element in x is inconsistent with either the evaluation at the lower or upper bound.
STAT_INCORRECT_CDF_4: Function f is not a cumulative distribution function. Its evaluation at a cutpoint is inconsistent with either the evaluation at the lower or upper bound.
STAT_INCORRECT_CDF_5: An error has occurred when inverting the cumulative distribution function. This function must be continuous and defined over the whole real line.
Syntax
Result = IMSL_CHISQTEST(F, N_Categories, X [, CELL_COUNTS=variable] [, CELL_EXPECTED=variable] [, CELL_CHISQ=variable] [, CHI_SQUARED=variable] [, CUTPOINTS=variable] [, DF=variable] [, /DOUBLE] [, /EQUAL_CUTPOINTS] [, FREQUENCIES=variable] [, LOWER_BOUND=value] [, N_PARAMS_ESTIMATED=value] [, UPPER_BOUND=value] [, USED_CUTPOINTS=variable])
Return Value
The p-value for the goodness-of-fit chi-squared statistic.
Arguments
F
Scalar string specifying the name of a user-supplied function. The function accepts one scalar parameter and returns the value of the hypothesized cumulative distribution function at that point.
N_Categories
Number of cells into which the observations are to be tallied.
X
One-dimensional array containing the vector of data elements for this test.
Keywords
CELL_COUNTS (optional)
Named variable into which the cell counts are stored. The cell counts are the observed frequencies in each of the N_Categories cells.
CELL_EXPECTED (optional)
Named variable into which the cell expected values are stored. The expected value of a cell is the expected count in the cell given that the hypothesized distribution is correct.
CELL_CHISQ (optional)
Named variable into which an array of length N_Categories containing the cell contributions to chi-squared is stored.
CHI_SQUARED (optional)
Named variable into which the chi-squared test statistic is stored.
CUTPOINTS (optional)
Specifies the named variable containing user-defined cutpoints to be used by IMSL_CHISQTEST. The keywords CUTPOINTS and EQUAL_CUTPOINTS cannot be used together.
DF (optional)
Named variable into which the degrees of freedom for the chi-squared goodness-of-fit test are stored.
DOUBLE (optional)
If present and nonzero, double precision is used.
EQUAL_CUTPOINTS (optional)
If present and nonzero, equal probability cutpoints are used. The keywords CUTPOINTS and EQUAL_CUTPOINTS cannot be used together.
FREQUENCIES (optional)
Named variable into which the array containing the vector frequencies for the observations stored in x is stored.
LOWER_BOUND (optional)
Lower bound of the range of the distribution. If LOWER_BOUND = UPPER_BOUND, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside of the range of these bounds are ignored. Distributions conditional on a range can be specified when LOWER_BOUND and UPPER_BOUND are used. If LOWER_BOUND is specified, then UPPER_BOUND also must be specified. By convention, LOWER_BOUND is excluded from the first interval, but UPPER_BOUND is included in the last interval.
N_PARAMS_ESTIMATED (optional)
Number of parameters estimated in computing the cumulative distribution function.
UPPER_BOUND (optional)
Upper bound of the range of the distribution. If LOWER_BOUND = UPPER_BOUND, a range on the whole real line is used (the default). If the lower and upper endpoints are different, points outside of the range of these bounds are ignored. Distributions conditional on a range can be specified when LOWER_BOUND and UPPER_BOUND are used. If UPPER_BOUND is specified, then LOWER_BOUND also must be specified. By convention, LOWER_BOUND is excluded from the first interval, but UPPER_BOUND is included in the last interval.
USED_CUTPOINTS (optional)
Specifies the named variable into which the cutpoints used by IMSL_CHISQTEST are stored.
Version History
See Also
IMSL_NORMALITY