Beware the behavior of SMOOTH

Suppose I have an array that has an outlier. A really big outlier:

IDL> a = [1.0, 1.0, 2.0, 3.0, 4.0, 1.0d18, 4.0, 3.0, 2.0, 1.0, 1.0]

I'd like to smooth this double-precision array with a running mean (or boxcar or tophat, depending on where you learned this technique) filter, such as the one provided by the IDL SMOOTH function. For efficiency, SMOOTH doesn't re-sum every window. Instead, it keeps a running total, updates it with the difference between the value entering the window and the value leaving it, and divides that total by the kernel width. Here's the result of applying SMOOTH to A with a filter width of 3:

IDL> print, smooth(a, 3)
       1.0000000       1.3333333       2.0000000       3.0000000  3.3333333e+017
  3.3333333e+017  3.3333333e+017      0.00000000      -1.0000000      -1.6666667
       1.0000000

Whoa. The input array is symmetric, so why isn't the output? Also—and this is worrisome—the input array is composed of positive numbers, so how can the mean of any subset of these numbers be negative? The answer lies in the way floating point numbers are represented on computers. To see why, we can use the information returned from the MACHAR function:

IDL> m = machar(/double)
IDL> help, m
** Structure DMACHAR, 13 tags, length=72, data length=68:
IBETA           LONG                 2
IT              LONG                53
IRND            LONG                 5
NGRD            LONG                 0
MACHEP          LONG               -52
NEGEP           LONG               -53
IEXP            LONG                11
MINEXP          LONG             -1022
MAXEXP          LONG              1024
EPS             DOUBLE      2.2204460e-016
EPSNEG          DOUBLE      1.1102230e-016
XMIN            DOUBLE      2.2250739e-308
XMAX            DOUBLE      1.7976931e+308

Using the IBETA field, which gives the base used to construct numbers (2, no surprise), and the IT field, which gives the number of base-2 digits (bits) used in the mantissa of a number, the maximum resolvable distance between two double-precision numbers is given by:

IDL> mrd = double(m.ibeta)^m.it
IDL> print, mrd
  9.0071993e+015
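
A quick way to see what this limit means in practice (a minimal check, assuming IEEE 754 doubles, as the MACHAR output above suggests) is to add a small value to MRD and then subtract MRD again. At this magnitude the spacing between adjacent doubles is 2, so an added 1 should be rounded away while an added 2 should survive:

```idl
IDL> mrd = 2d^53
IDL> print, (mrd + 1d) - mrd   ; expect zero: the 1 is lost to rounding
IDL> print, (mrd + 2d) - mrd   ; expect 2: this sum is still representable
```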

So, when SMOOTH tries to difference two numbers whose distance is greater than MRD, bad things can happen because of loss of precision. Let's apply this information to the example above. Define values a factor of two below and above the threshold set by MRD:

IDL> below = double(m.ibeta)^(m.it-1)
IDL> above = double(m.ibeta)^(m.it+1)
IDL> print, below, above
  4.5035996e+015  1.8014399e+016

and substitute them into the array used above:

IDL> b = [1.0, 1.0, 2.0, 3.0, 4.0, below, 4.0, 3.0, 2.0, 1.0, 1.0]
IDL> c = [1.0, 1.0, 2.0, 3.0, 4.0, above, 4.0, 3.0, 2.0, 1.0, 1.0]

Now apply SMOOTH to these arrays and evaluate the results:

IDL> print, smooth(b, 3)
       1.0000000       1.3333333       2.0000000       3.0000000  1.5011999e+015
  1.5011999e+015  1.5011999e+015       3.0000000       2.0000000       1.3333333
       1.0000000
IDL> print, smooth(c, 3)
       1.0000000       1.3333333       2.0000000       3.0000000  6.0047995e+015
  6.0047995e+015  6.0047995e+015       3.3333333       2.3333333       1.6666667
       1.0000000

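Note that SMOOTH works well when applied to B, but not to C: there, the values following the outlier (3.3333333, 2.3333333, 1.6666667) don't mirror the values preceding it (3.0000000, 2.0000000, 1.3333333).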
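
To see how a running total can produce these results, here is a minimal sketch of a running-mean filter. RUNNING_MEAN is a hypothetical helper (my name, not an IDL routine), and it is an assumption about the mechanism, not the actual SMOOTH source. It assumes an odd width and leaves the end points untouched, as SMOOTH does by default.

```idl
; running_mean.pro: hypothetical illustration, not the SMOOTH implementation
function running_mean, x, w
  compile_opt idl2
  n = n_elements(x)
  h = w / 2                       ; half-width (w is assumed to be odd)
  r = double(x)                   ; end points keep their input values
  s = total(x[0:w-1], /double)    ; sum of the first full window
  r[h] = s / w
  for i = h + 1, n - h - 1 do begin
    ; slide the window: add the entering value, subtract the leaving one
    s += x[i+h] - x[i-h-1]
    r[i] = s / w
  endfor
  return, r
end
```

Once this is compiled, running_mean(c, 3) should show the same asymmetric tail that SMOOTH gave for C. Working through the arithmetic by hand: when the outlier enters the window, the exact total would be 2^54 + 7, which isn't representable as a double and rounds to 2^54 + 8. When the outlier leaves again, the total is 10 rather than 9, so the next mean is 3.3333333 instead of 3.0, and the error carries through the rest of the array.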

A better discussion of this behavior is given in the section Note on Smoothing Over Large Data Ranges in the IDL Help page for SMOOTH, along with a workaround for this situation.
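
One way to sidestep the running total, sketched here as my own assumption and not necessarily the workaround described in the Help, is to compute each window's mean independently with CONVOL and a normalized boxcar kernel. Each output element then depends only on its own three inputs, so an outlier can't contaminate points farther along the array:

```idl
IDL> a = [1.0, 1.0, 2.0, 3.0, 4.0, 1.0d18, 4.0, 3.0, 2.0, 1.0, 1.0]
IDL> k = replicate(1d/3, 3)               ; normalized boxcar kernel of width 3
IDL> print, convol(a, k, /edge_truncate)  ; each window is summed on its own
```

The three points whose windows contain the outlier are still dominated by it, as they should be; only the spurious values after it go away. Note that /EDGE_TRUNCATE handles the ends by repeating the edge values, which differs from SMOOTH's default of copying the end points unchanged, although for this particular array the two give the same end values.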

Update (2013-07-01): I neglected to mention that the seed for this post, as well as the note in the Help, came from discussions a few years ago with Carmen Lucas at DRDC Atlantic. Thanks, Carmen, for pointing out this unexpected behavior to me.
