Last Post 18 May 2021 04:29 PM by  Ben Castellani
Question about 'MEDIAN' function
 3 Replies

James Lane



New Member


Posts:1


--
10 May 2021 09:11 AM
    Hi all,

    I had a quick question about the MEDIAN function in IDL that I can't seem to find answers to (https://www.l3harrisgeospatial.com/docs/median.html).

    I've noticed that you must specify the /EVEN keyword when using MEDIAN; otherwise, for a dataset with an even number of elements, MEDIAN won't take the average of the middle two numbers.

    E.g. x = [1,2,3,4]

    MEDIAN(x) = 3
    MEDIAN(x,/EVEN) = 2.5

    As far as I'm aware, if you take the median of a set with an even number of elements, you're supposed to average (mean) the middle two. Yet for some reason, the default in IDL (unless you specify /EVEN) is to always give you what I'm calling the 'upper median'. So MEDIAN([5,6,7,8]) = 7 and MEDIAN([10,11,12,13]) = 12, i.e. the value just past halfway through the sorted array.
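To make the two conventions concrete, here is a minimal Python sketch (not IDL, and not IDL's actual implementation). The function names `upper_median` and `averaged_median` are made up for illustration; they only reproduce the results quoted in this post.

```python
def upper_median(values):
    """Return the element just past the halfway point of the sorted data.

    For an even count this is the higher of the two middle values, which
    matches the default results quoted in the post.
    """
    ordered = sorted(values)
    return float(ordered[len(ordered) // 2])


def averaged_median(values):
    """Return the textbook median, averaging the two middle values if needed."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return float(ordered[mid])
    return (ordered[mid - 1] + ordered[mid]) / 2.0


x = [1, 2, 3, 4]
print(upper_median(x))     # 3.0
print(averaged_median(x))  # 2.5
```

For an odd number of elements both functions return the same value, so the two conventions only disagree when the count is even.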

    My question is simple: does anybody know why IDL does this by default? Is it 'wrong' to not average the middle two numbers if you're taking the median of a dataset? I processed all the data I was working with using MEDIAN without the /EVEN keyword and I'm trying to figure out whether it's worth reprocessing or not...

    Thanks

    James





    Ben Castellani



    Basic Member


    Posts:130


    --
    13 May 2021 10:01 AM
    IDL was created in the 1980s. Believe it or not, back then the current convention of averaging the two middle values to obtain the median of a dataset with an even number of elements was not standard practice. This idea seems to have emerged in the early 1990s.

    The Numerical Recipes 2nd Edition book was used to formulate much of core IDL back in the 1980's and 1990's. This book was cutting-edge for its time and described how to use computer code for science! A funny quote about median calculations from this book follows...

    "One often wants to know the median element in an array, or the top and bottom quartile elements. When N is odd, the median is the kth element, with k = (N + 1)/2. When N is even, statistics books define the median as the arithmetic mean of the elements k = N/2 and k = N/2+1 (that is, N/2 from the bottom and N/2 from the top). If you accept such pedantry, you must perform two separate selections to find these elements. For N > 100 we usually define k = N/2 to be the median element, pedants be damned."

    See Section 8.5 on Page 333 for more information: https://websites.pmc.ucsc.edu/~fnimmo/eart290c_17/NumericalRecipesinF77.pdf
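The "two separate selections" point in the quote can be sketched in Python as follows. This is an illustration only: `select_kth` is a simple quickselect written for this example, not the routine from the book and not what IDL does internally, and all function names are hypothetical. Indices are 1-based, as in the quote.

```python
import random


def select_kth(values, k):
    """Return the k-th smallest element (1-based) using quickselect."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and len(values)")
    items = list(values)
    target = k - 1  # 0-based position in sorted order
    while True:
        pivot = random.choice(items)
        lower = [v for v in items if v < pivot]
        equal = [v for v in items if v == pivot]
        if target < len(lower):
            items = lower
        elif target < len(lower) + len(equal):
            return pivot
        else:
            target -= len(lower) + len(equal)
            items = [v for v in items if v > pivot]


def median_single_selection(values):
    """One selection: k = (N + 1) // 2, which equals N/2 when N is even."""
    n = len(values)
    return select_kth(values, (n + 1) // 2)


def median_two_selections(values):
    """Textbook median: for even N, select k = N/2 and k = N/2 + 1, then average."""
    n = len(values)
    if n % 2 == 1:
        return float(select_kth(values, (n + 1) // 2))
    low = select_kth(values, n // 2)
    high = select_kth(values, n // 2 + 1)
    return (low + high) / 2.0


print(median_single_selection([4, 1, 3, 2]))  # 2
print(median_two_selections([4, 1, 3, 2]))    # 2.5
```

Note that the shortcut in the quote, k = N/2, picks the lower of the two middle values, whereas the results in the first post (MEDIAN([1,2,3,4]) = 3) correspond to k = N/2 + 1. Either way it is a single selection rather than two.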

    So in summary: yes, averaging the two middle elements is now the accepted, technically correct method, and has been for the last ~30 years. However, to maintain backwards compatibility, this "new" method was only added to IDL through the optional EVEN keyword.

    Hope this helps!

    James Lane



    New Member


    Posts:1


    --
    17 May 2021 02:47 AM
    Hi Ben,

    Thanks for the information on this - I never knew that was the commonly accepted way of doing it back in the 80's/90's.

    Regarding my data - do you think it's something I need to re-process using /EVEN, or do you think it's perfectly fine to say I've used MEDIAN without the EVEN keyword? I'm happy to give more information if you'd like.

    Cheers

    Ben Castellani



    Basic Member


    Posts:130


    --
    18 May 2021 04:29 PM
    The "lazy" approach to computing the median 30+ years ago was rooted in the computational expense of locating two numbers and then averaging them, rather than locating just one.

    I am not a statistics expert, but the answer to your re-processing question will depend on the range of your data and, more importantly, on how many data values are involved. If you only have 6 data points, it would likely be important to re-process to get a more accurate median. However, if you have 4 million data points, you won't see a significant difference after reprocessing and I wouldn't bother. Of course, your number of data values is most likely somewhere between these two extremes. As mentioned in the quote from my last comment, the authors suggest that averaging the two middle values is only worthwhile if you have fewer than 100 data points. If you have more, simply choosing the higher of the two middle values is largely sufficient.
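One way to check this on real data before deciding is to measure how far the result would move. The Python sketch below is only an illustration: the function name `even_keyword_shift` and the sample arrays are made up, and it assumes the default result is the higher of the two middle values, as described earlier in this thread.

```python
def even_keyword_shift(values):
    """Return (upper median) - (averaged median) for one dataset.

    The shift is zero when the count is odd, and half the gap between
    the two middle values when the count is even.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    if n % 2 == 1:
        return 0.0
    return (ordered[n // 2] - ordered[n // 2 - 1]) / 2.0


# Made-up sample data, for illustration only.
small = [1.0, 2.0, 8.0, 9.0, 10.0, 11.0]   # 6 points
large = [i / 1000.0 for i in range(4000)]  # 4000 evenly spaced points

print(even_keyword_shift(small))  # 0.5
print(even_keyword_shift(large))  # much smaller, since the middle values are close together
```

The change is half the gap between the two middle values, so it tends to shrink as more points crowd the middle of the distribution. Comparing that shift against the precision you actually need is probably the quickest way to judge whether reprocessing is worth it.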