Testing Arrays for Invalid Values
Anonym
I was recently writing a function where I needed to validate that an array passed into it contained no negative values. I quickly wrote a test using IDL's WHERE() function, but then got thinking about how efficient that was and if there was a faster and/or less memory intensive way to do this. After giving the matter some thought, I came up with a few different ways to implement this test, and was a little surprised how the times varied.
Here is some test code I put together that tests the different implementations I came up with. I use the TIC/TOC routines to perform the timing of a number of iterations. Each iteration generates a new array using RANDOMN() to generate a new array each time, to avoid any data caching that might taint my results. I put a variable increment inside the IF test, to make sure it wouldn't be optimized away.
pro test_find_negatives
compile_opt idl2
arraySize = 1000000
iterations = 100
fooWhere = 0
fooMin = 0
fooTotal = 0
fooHasValue = 0
print, 'where'
TIC
FOR i = 1, iterations DO BEGIN
seed = i
data = RANDOMN(seed, arraySize)
w = WHERE(data LT 0, count)
IF (count GT 0) THEN BEGIN
++fooWhere
ENDIF
ENDFOR
TOC
print, 'min'
TIC
FOR i = 1, iterations DO BEGIN
seed = i
data = RANDOMN(seed, arraySize)
IF (MIN(data) LT 0) THEN BEGIN
++fooMin
ENDIF
ENDFOR
TOC
print, 'total'
TIC
FOR i = 1, iterations DO BEGIN
seed = i
data = RANDOMN(seed, arraySize)
IF (TOTAL(data LT 0) GT 0) THEN BEGIN
++fooTotal
ENDIF
ENDFOR
TOC
print, 'hasValue'
TIC
FOR i = 1, iterations DO BEGIN
seed = i
data = RANDOMN(seed, arraySize)
IF ((data LT 0).HasValue(1)) THEN BEGIN
++fooHasValue
ENDIF
ENDFOR
TOC
print, fooWhere, fooMin, fooTotal, fooHasValue
end
As you can see, I came up with four implementations. The first is the class WHERE() function, which returns an array of all the bad values. I could have called N_ELEMENTS() on that returned array, but used the slightly more efficient positional parameter which returns the number of elements that are returned. The second implementation uses MIN(), and then just compares that value against 0. This approach may have limited utility for other types of tests, but for finding negative numbers it works. The third implementation uses TOTAL() to sum up the contents of a transient array of Boolean values. This one is a little expensive since it has to create that transient array, as we'll see in the timing results. Observant readers will notice the fourth implementation and wonder what it is. This is a preview of a new feature in IDL 8.3.1, where we have static methods available on variables. In this case the HasValue() method is testing the transient array of Boolean variables to see if any of them are true. This has a slight optimization over the TOTAL() approach, in that the method has short circuit logic to return as soon as it finds the desired value, instead of having to process the entire array like TOTAL() does.
Now on the to the timing results, which can be seen in the table below. I ran the code with many different values of iterations and arraySize, to see how the array size and other system differences like the memory manager would affect the outcome. The columns are the different array sizes, and the rows are for different numbers of iterations of the four approaches. The times are listed in milliseconds, which may be more precision than can be accurately measured by TIC/TOC.
WHERE iterations
|
100
|
1000
|
10000
|
100000
|
1000000
|
100
|
0
|
5
|
44
|
371
|
3340
|
1000
|
9
|
47
|
427
|
3780
|
35000
|
10000
|
90
|
466
|
4330
|
37800
|
335000
|
|
|
|
|
|
|
MIN iterations
|
|
|
|
|
|
100
|
1
|
3
|
32
|
339
|
3070
|
1000
|
7
|
36
|
311
|
3510
|
30900
|
10000
|
79
|
344
|
3320
|
34800
|
306000
|
|
|
|
|
|
|
TOTAL iterations
|
|
|
|
|
|
100
|
1
|
4
|
36
|
371
|
3210
|
1000
|
8
|
39
|
353
|
3740
|
33100
|
10000
|
83
|
387
|
3780
|
36300
|
321000
|
|
|
|
|
|
|
HasValue iterations
|
|
|
|
|
|
100
|
51
|
5
|
36
|
375
|
3280
|
1000
|
10
|
41
|
360
|
3760
|
33300
|
10000
|
105
|
410
|
3760
|
37000
|
327000
|
At first glance the timing looks relatively close, but upon inspection of the Profiler View we can see that these tests are dominated by the RANDOMN() function more than anything else. So I added another FOR loop that only called RANDOMN() and timed that, then subtracted those times and got much more interesting results that compare the real array test logic. You can't use the Profiler View to do this, because in cases like TOTAL(), the creation of that transient array of Boolean values doesn't get measured by the profiler.
WHERE iterations
|
100
|
1000
|
10000
|
100000
|
1000000
|
100
|
0
|
2
|
14
|
77
|
246
|
1000
|
3
|
13
|
126
|
493
|
4550
|
10000
|
17
|
136
|
1340
|
8070
|
31700
|
|
|
|
|
|
|
MIN iterations
|
|
|
|
|
|
100
|
1
|
0
|
2
|
45
|
64
|
1000
|
1
|
2
|
10
|
224
|
443
|
10000
|
6
|
14
|
322
|
5040
|
2260
|
|
|
|
|
|
|
TOTAL iterations
|
|
|
|
|
|
100
|
1
|
1
|
6
|
77
|
207
|
1000
|
2
|
5
|
52
|
447
|
2700
|
10000
|
10
|
57
|
785
|
6580
|
17200
|
|
|
|
|
|
|
HasValue iterations
|
|
|
|
|
|
100
|
51
|
2
|
6
|
81
|
282
|
1000
|
4
|
7
|
59
|
467
|
2900
|
10000
|
32
|
80
|
766
|
7290
|
23200
|
These numbers are much more interesting, and show how much the different approaches vary in performance. We can see how increasing the number of iterations by a factor of 10 doesn't always increase the time by a factor of 10, but for small arrays I think we are more influenced by other system load and timer accuracy more than anything else. As the number of iterations increases, we are also more likely to fragment the process memory and see that influence in the numbers. This is most notable in the 10,000 iteration results for MIN(), that the 100,000 element array took longer than the 1,000,000 element array. I think this is definitely a case where the memory manager was having trouble trying to find contiguous blocks of memory to use for all the iterations.
But the important part of this analysis is the more qualitative result that MIN() is far superior to everything else. This is most likely due to the fact that it does not have to deal with the memory manager to create transient input arrays like TOTAL() or HasValue(), nor a dynamically resized result array like WHERE(). The performance improvement is more pronounced as the array size increases, since it is avoiding allocations of larger and larger arrays. But MIN() has less generality than the other options, so if you were checking that you had only odd numbers, for example, then you wouldn't be able to use MIN().
The TOTAL() function comes in second place, but only by a little over HasValue(). Both of these approaches have to create that transient input array, which takes time to allocate. I think TOTAL() is faster because it is a free function, whereas HasValue is a class method so there's a small amount of overhead in the dereferencing of the class to get the memory address of the method to invoke.
The WHERE() function was far and away the slowest. It avoids creating the transient input array, but since it has no a priori knowledge of how many output elements it will have, it has to use array appending for each element, which slams the memory manager with a new allocation for each found element. Of course, if you wanted to know which elements were invalid, you would have to use WHERE() and pay the cost in time in space, but if you just want to know whether there are any bad elements, this is not your best option.
All in all, the numbers aren't that different for a single call to any of these tests, but they are a good demonstration of the thought process that should go into how you design you code to be time and space efficient.