Beware the behavior of SMOOTH
Suppose I have an array that has an outlier. A really big outlier:
IDL> a = [1.0, 1.0, 2.0, 3.0, 4.0, 1.0d18, 4.0, 3.0, 2.0, 1.0, 1.0]
I'd like to smooth this double-precision array with a running mean filter (also called a boxcar or tophat filter, depending on where you learned the technique), such as the one provided by the IDL SMOOTH function. For efficiency, SMOOTH doesn't sum every window from scratch. It keeps a running total of the differences between neighboring values and divides that total by the kernel width. Here's the result of applying SMOOTH to A with a filter width of 3:
IDL> print, smooth(a, 3)
1.0000000 1.3333333 2.0000000 3.0000000 3.3333333e+017
3.3333333e+017 3.3333333e+017 0.00000000 -1.0000000 -1.6666667
1.0000000
Whoa. The input array is symmetric, so why isn't the output? Also—and this is worrisome—the input array is composed of positive numbers, so how can the mean of any subset of these numbers be negative? The answer lies in the way floating point numbers are represented on computers. To see why, we can use the information returned from the MACHAR function:
IDL> m = machar(/double)
IDL> help, m
** Structure DMACHAR, 13 tags, length=72, data length=68:
   IBETA     LONG      2
   IT        LONG      53
   IRND      LONG      5
   NGRD      LONG      0
   MACHEP    LONG      -52
   NEGEP     LONG      -53
   IEXP      LONG      11
   MINEXP    LONG      -1022
   MAXEXP    LONG      1024
   EPS       DOUBLE    2.2204460e-016
   EPSNEG    DOUBLE    1.1102230e-016
   XMIN      DOUBLE    2.2250739e-308
   XMAX      DOUBLE    1.7976931e+308
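The IBETA field gives the base used to construct numbers (2, no surprise), and the IT field gives the number of base-2 digits in the mantissa of a number. Above IBETA^IT, the spacing between adjacent double-precision numbers exceeds 1, so a value of order 1 added to or subtracted from a larger number can be lost to rounding. The maximum resolvable distance between two double-precision numbers in an array like A must therefore be given by: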
IDL> mrd = double(m.ibeta)^m.it
IDL> print, mrd
9.0071993e+015
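To see what this threshold means in practice, here is a small sketch in Python rather than IDL. It assumes only that Python floats are IEEE 754 doubles, as IDL doubles are; `sys.float_info.radix` and `sys.float_info.mant_dig` play the roles of IBETA and IT. The variable names are mine, chosen for illustration.

```python
import sys

# Python floats are IEEE 754 doubles, so these mirror MACHAR's IBETA and IT.
radix = sys.float_info.radix        # base of the representation
digits = sys.float_info.mant_dig    # base-2 digits in the mantissa

mrd = float(radix) ** digits        # same quantity as MRD above

# Half the threshold: adding 1 and subtracting the big value recovers 1.
below_ok = (mrd / 2 + 1.0) - mrd / 2 == 1.0

# At the threshold: adding 1 changes nothing, so the 1 is lost.
lost = (mrd + 1.0) == mrd

print(mrd, below_ok, lost)
```

Both comparisons come out true: a value of order 1 survives next to half of MRD, but disappears next to MRD itself.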
So, when SMOOTH tries to difference two numbers whose distance is greater than MRD, bad things can happen because of loss of precision. Let's apply this information to the example above. Define values a factor of two below and above the threshold set by MRD:
IDL> below = double(m.ibeta)^(m.it-1)
IDL> above = double(m.ibeta)^(m.it+1)
IDL> print, below, above
4.5035996e+015 1.8014399e+016
and substitute them into the array used above:
IDL> b = [1.0, 1.0, 2.0, 3.0, 4.0, below, 4.0, 3.0, 2.0, 1.0, 1.0]
IDL> c = [1.0, 1.0, 2.0, 3.0, 4.0, above, 4.0, 3.0, 2.0, 1.0, 1.0]
Now apply SMOOTH to these arrays and evaluate the results:
IDL> print, smooth(b, 3)
1.0000000 1.3333333 2.0000000 3.0000000 1.5011999e+015
1.5011999e+015 1.5011999e+015 3.0000000 2.0000000 1.3333333
1.0000000
IDL> print, smooth(c, 3)
1.0000000 1.3333333 2.0000000 3.0000000 6.0047995e+015
6.0047995e+015 6.0047995e+015 3.3333333 2.3333333 1.6666667
1.0000000
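Note that SMOOTH works well when applied to B, but not to C: the results aren't symmetric. The values after the spike (3.3333333, 2.3333333, 1.6666667) should mirror those before it (1.3333333, 2.0000000, 3.0000000), but each is too large by a third.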
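The asymmetry can be traced with a minimal running-total sketch, written in Python rather than IDL. This is my guess at the arithmetic, based only on the description of SMOOTH given earlier, and not SMOOTH's actual source; `running_smooth` and `with_spike` are hypothetical names. It assumes the end points are copied unchanged, as in the printed results.

```python
def running_smooth(x, width):
    """Boxcar mean via a running total; end points are copied unchanged."""
    n, half = len(x), width // 2
    out = list(x)
    total = sum(x[:width])              # sum of the first full window
    out[half] = total / width
    for i in range(half + 1, n - half):
        # Slide the window: add (entering value - leaving value).
        total += x[i + half] - x[i - half - 1]
        out[i] = total / width
    return out


def with_spike(spike):
    """The example array with a chosen value at the center."""
    return [1.0, 1.0, 2.0, 3.0, 4.0, spike, 4.0, 3.0, 2.0, 1.0, 1.0]


for name, spike in [("a", 1.0e18), ("b", 2.0 ** 52), ("c", 2.0 ** 54)]:
    result = running_smooth(with_spike(spike), 3)
    symmetric = result == result[::-1]
    # Show whether the output mirrors itself, and the three values
    # that follow the spike.
    print(name, symmetric, result[7:10])
```

Under these assumptions only B comes out symmetric. Once the spike enters the running total, the small values added alongside it are rounded away, and subtracting the spike later does not bring them back.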
A better discussion of this behavior is given in the section "Note on Smoothing Over Large Data Ranges" in the IDL Help page for SMOOTH, along with a workaround for this situation.
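For contrast, here is a sketch of the slower, direct approach: summing each window from scratch, so a rounding error in one window can't carry into the next. It is in Python rather than IDL, `direct_smooth` is a hypothetical name, and I'm not claiming this is the workaround described in the Help. It only illustrates why the running total is the culprit.

```python
def direct_smooth(x, width):
    """Boxcar mean with each window summed from scratch; end points copied."""
    n, half = len(x), width // 2
    out = list(x)
    for i in range(half, n - half):
        # No state is carried between windows, so errors do not accumulate.
        out[i] = sum(x[i - half:i + half + 1]) / width
    return out


a = [1.0, 1.0, 2.0, 3.0, 4.0, 1.0e18, 4.0, 3.0, 2.0, 1.0, 1.0]
result = direct_smooth(a, 3)

print(result == result[::-1])   # is the output symmetric?
print(result[7:10])             # the three values after the spike
```

For this input the output is symmetric and every value is positive. The cost is more arithmetic: each output value needs a full window sum instead of a single update.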
Update (2013-07-01): I neglected to mention that the seed for this post, as well as the note in the Help, came from discussions a few years ago with Carmen Lucas at DRDC Atlantic. Thanks, Carmen, for pointing out this unexpected behavior to me.