X
11980 Rate this article:
No rating

Testing Arrays for Invalid Values

Anonym

I was recently writing a function where I needed to validate that an array passed into it contained no negative values. I quickly wrote a test using IDL's WHERE() function, but then got thinking about how efficient that was and if there was a faster and/or less memory intensive way to do this. After giving the matter some thought, I came up with a few different ways to implement this test, and was a little surprised how the times varied.

Here is some test code I put together that tests the different implementations I came up with. I use the TIC/TOC routines to perform the timing of a number of iterations. Each iteration generates a new array using RANDOMN() to generate a new array each time, to avoid any data caching that might taint my results. I put a variable increment inside the IF test, to make sure it wouldn't be optimized away.

pro test_find_negatives

  compile_opt idl2

 

  arraySize = 1000000

  iterations = 100

 

  fooWhere = 0

  fooMin = 0

  fooTotal = 0

  fooHasValue = 0

 

  print, 'where'

  TIC

  FOR i = 1, iterations DO BEGIN

    seed = i

    data = RANDOMN(seed, arraySize)

    w = WHERE(data LT 0, count)

    IF (count GT 0) THEN BEGIN

      ++fooWhere

    ENDIF

  ENDFOR

  TOC

 

  print, 'min'

  TIC

  FOR i = 1, iterations DO BEGIN

    seed = i

    data = RANDOMN(seed, arraySize)

    IF (MIN(data) LT 0) THEN BEGIN

      ++fooMin

    ENDIF

  ENDFOR

  TOC

 

  print, 'total'

  TIC

  FOR i = 1, iterations DO BEGIN

    seed = i

    data = RANDOMN(seed, arraySize)

    IF (TOTAL(data LT 0) GT 0) THEN BEGIN

      ++fooTotal

    ENDIF

  ENDFOR

  TOC

 

  print, 'hasValue'

  TIC

  FOR i = 1, iterations DO BEGIN

    seed = i

    data = RANDOMN(seed, arraySize)

    IF ((data LT 0).HasValue(1)) THEN BEGIN

      ++fooHasValue

    ENDIF

  ENDFOR

  TOC

 

  print, fooWhere, fooMin, fooTotal, fooHasValue

end

 

As you can see, I came up with four implementations. The first is the class WHERE() function, which returns an array of all the bad values. I could have called N_ELEMENTS() on that returned array, but used the slightly more efficient positional parameter which returns the number of elements that are returned. The second implementation uses MIN(), and then just compares that value against 0. This approach may have limited utility for other types of tests, but for finding negative numbers it works. The third implementation uses TOTAL() to sum up the contents of a transient array of Boolean values. This one is a little expensive since it has to create that transient array, as we'll see in the timing results. Observant readers will notice the fourth implementation and wonder what it is. This is a preview of a new feature in IDL 8.3.1, where we have static methods available on variables. In this case the HasValue() method is testing the transient array of Boolean variables to see if any of them are true. This has a slight optimization over the TOTAL() approach, in that the method has short circuit logic to return as soon as it finds the desired value, instead of having to process the entire array like TOTAL() does.

Now on the to the timing results, which can be seen in the table below. I ran the code with many different values of iterations and arraySize, to see how the array size and other system differences like the memory manager would affect the outcome. The columns are the different array sizes, and the rows are for different numbers of iterations of the four approaches. The times are listed in milliseconds, which may be more precision than can be accurately measured by TIC/TOC.

WHERE iterations

100

1000

10000

100000

1000000

100

0

5

44

371

3340

1000

9

47

427

3780

35000

10000

90

466

4330

37800

335000

 

 

 

 

 

 

MIN iterations

 

 

 

 

 

100

1

3

32

339

3070

1000

7

36

311

3510

30900

10000

79

344

3320

34800

306000

 

 

 

 

 

 

TOTAL iterations

 

 

 

 

 

100

1

4

36

371

3210

1000

8

39

353

3740

33100

10000

83

387

3780

36300

321000

 

 

 

 

 

 

HasValue iterations

 

 

 

 

 

100

51

5

36

375

3280

1000

10

41

360

3760

33300

10000

105

410

3760

37000

327000

 

At first glance the timing looks relatively close, but upon inspection of the Profiler View we can see that these tests are dominated by the RANDOMN() function more than anything else. So I added another FOR loop that only called RANDOMN() and timed that, then subtracted those times and got much more interesting results that compare the real array test logic. You can't use the Profiler View to do this, because in cases like TOTAL(), the creation of that transient array of Boolean values doesn't get measured by the profiler.

WHERE iterations

100

1000

10000

100000

1000000

100

0

2

14

77

246

1000

3

13

126

493

4550

10000

17

136

1340

8070

31700

 

 

 

 

 

 

MIN iterations

 

 

 

 

 

100

1

0

2

45

64

1000

1

2

10

224

443

10000

6

14

322

5040

2260

 

 

 

 

 

 

TOTAL iterations

 

 

 

 

 

100

1

1

6

77

207

1000

2

5

52

447

2700

10000

10

57

785

6580

17200

 

 

 

 

 

 

HasValue iterations

 

 

 

 

 

100

51

2

6

81

282

1000

4

7

59

467

2900

10000

32

80

766

7290

23200

 

These numbers are much more interesting, and show how much the different approaches vary in performance. We can see how increasing the number of iterations by a factor of 10 doesn't always increase the time by a factor of 10, but for small arrays I think we are more influenced by other system load and timer accuracy more than anything else. As the number of iterations increases, we are also more likely to fragment the process memory and see that influence in the numbers. This is most notable in the 10,000 iteration results for MIN(), that the 100,000 element array took longer than the 1,000,000 element array. I think this is definitely a case where the memory manager was having trouble trying to find contiguous blocks of memory to use for all the iterations.

But the important part of this analysis is the more qualitative result that MIN() is far superior to everything else. This is most likely due to the fact that it does not have to deal with the memory manager to create transient input arrays like TOTAL() or HasValue(), nor a dynamically resized result array like WHERE(). The performance improvement is more pronounced as the array size increases, since it is avoiding allocations of larger and larger arrays. But MIN() has less generality than the other options, so if you were checking that you had only odd numbers, for example, then you wouldn't be able to use MIN().

The TOTAL() function comes in second place, but only by a little over HasValue(). Both of these approaches have to create that transient input array, which takes time to allocate. I think TOTAL() is faster because it is a free function, whereas HasValue is a class method so there's a small amount of overhead in the dereferencing of the class to get the memory address of the method to invoke.

The WHERE() function was far and away the slowest. It avoids creating the transient input array, but since it has no a priori knowledge of how many output elements it will have, it has to use array appending for each element, which slams the memory manager with a new allocation for each found element. Of course, if you wanted to know which elements were invalid, you would have to use WHERE() and pay the cost in time in space, but if you just want to know whether there are any bad elements, this is not your best option.

All in all, the numbers aren't that different for a single call to any of these tests, but they are a good demonstration of the thought process that should go into how you design you code to be time and space efficient.