String processing performance in IDL
Anonym
IDL performs array-based operations very efficiently, but most processing tasks also require some amount of string parsing and manipulation. I have selected three common string processing tasks to analyze in more depth, in order to find the best string processing strategy for each case. The first example is to find all the strings that start with a given substring. IDL 8.4 has many new intrinsic methods for string variables, and one of them is "StartsWith". Here is the code I used to compare four different approaches for finding which strings in a string array start with the word "end".
pro StrTest_StartsWith 
 compile_opt idl2,logical_predicate 
 
 f = file_which('amoeba.pro') 
 str = strarr(file_lines(f)) 
 openr, lun, f, /get_lun 
 readf, lun, str 
 free_lun, lun 
 
 first = str.StartsWith('end') 
 n = 50000 
 times = dblarr(4) 
 methods = ['StartsWith','STRCMP','STREGEX','STRPOS'] 
 for method=0,3 do begin 
   t0 = tic() 
   case method of 
   0: for i=0, n-1 do x = str.StartsWith('end') 
   1: for i=0, n-1 do x = strcmp(str,'end',3) 
   2: for i=0, n-1 do x = stregex(str,'^end',/boolean) 
   3: for i=0, n-1 do x = strpos(str,'end') eq 0 
   endcase 
   times[method] = toc(t0) 
   print, array_equal(x,first) ? 'Same answer' : 'Different answer' 
 endfor 
 print, string(methods[sort(times)] + ':', format='(a-15)') + $ 
   string(times[sort(times)], format='(g0)'), $ 
   format='(a)' 
end 
The first method uses the new intrinsic "StartsWith" method, and the second uses STRCMP with a third argument specifying how many characters to compare. The third method uses a regular expression with STREGEX, and the final method uses STRPOS and compares the result to 0, meaning the pattern was found starting at position 0. Each method is run 50000 times over the lines of amoeba.pro. The result I get when I run this code in IDL 8.4 is (elapsed times in seconds):
Same answer 
Same answer 
Same answer 
Same answer 
STRCMP:        0.128 
StartsWith:    0.147 
STRPOS:        0.91 
STREGEX:      1.497
All methods return a byte array of zeros and ones indicating where the matches are. STRCMP with three arguments ended up being the fastest, with the new "StartsWith" method a close second. STRPOS is about seven times slower than STRCMP here, and STREGEX more than eleven times slower, so STREGEX should be avoided unless it is really needed for a more complex expression.
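As a minimal sketch of how the resulting mask might be used, the snippet below runs STRCMP and "StartsWith" on a small made-up array (the sample strings are hypothetical, not taken from amoeba.pro) and picks out the matching lines with WHERE. It also shows the FOLD_CASE keyword of STRCMP for a case-insensitive match, which the benchmark above does not cover.

```idl
; Hypothetical sample data for illustration only
lines = ['endfor', 'x = 1', 'end', '  endif', 'ENDCASE']

; Byte mask: 1 where the first 3 characters are exactly 'end'
mask = strcmp(lines, 'end', 3)

; The StartsWith method (IDL 8.4 or later) should give the same mask
mask2 = lines.StartsWith('end')
print, array_equal(mask, mask2) ? 'Same answer' : 'Different answer'

; Use the mask to pick out the matching lines
w = where(mask, count)
if count gt 0 then print, lines[w], format='(a)'

; Case-insensitive comparison, so 'ENDCASE' matches as well
mask_ci = strcmp(lines, 'end', 3, /fold_case)
```

Note that '  endif' does not match, because the comparison starts at the first character and leading whitespace is not skipped.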
In this second example, the goal is to replace the first occurrence of an equal sign (=) with a colon (:) on every line that contains at least one equal sign. If there are additional equal signs, they should remain unchanged. This is mostly useful for converting the format of name/value pairs stored in a text file. I used four different methods to achieve the same result:
pro StrTest_Substring 
 compile_opt idl2,logical_predicate 
 
 f = file_which('amoeba.pro') 
 str = strarr(file_lines(f)) 
 openr, lun, f, /get_lun 
 readf, lun, str 
 free_lun, lun 
 
 n = 2000 
 index = str.IndexOf('=') 
 w = where(index ne -1) 
 index = index[w] 
 first = str 
 first[w] = str[w].Substring(0,index-1)+':'+str[w].Substring(index+1) 
 methods = ['Substring','STRPUT','Split/Join','BYTARR'] 
 times = dblarr(4) 
 for method=0,3 do begin 
   t0 = tic() 
   case method of 
     0: for i=0, n-1 do begin 
       index = str.IndexOf('=') 
       w = where(index ne -1) 
       index = index[w] 
       y = str[w] 
       x = str 
       x[w] = y.SubString(0,index-1)+':'+y.SubString(index+1) 
     endfor 
     1: for i=0, n-1 do begin 
       x = str 
       pos = strpos(str,'=') 
       foreach xx, x, j do begin 
          if pos[j] ne -1 then begin 
            strput, xx, ':', pos[j] 
            x[j] = xx 
          endif 
       endforeach 
     endfor 
     2: for i=0, n-1 do begin 
       x = str 
       foreach xx, x, j do begin 
          parts = xx.Split('=') 
          if parts.length gt 1 then x[j] = ([parts[0],parts[1:*].join('=')]).join(':') 
       endforeach 
     endfor 
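     3: for i=0, n-1 do begin 
       ; One row of bytes per line; 61b is '=' and 58b is ':' 
       b = byte(str) 
       ; maxInd holds the index of the first '=' in each row; WHERE keeps only the rows that contain one 
       b[maxInd[where(max(b eq 61b, dimension=1, maxInd))]] = 58b 
       x = string(b) 
     endfor 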
   endcase 
   times[method] = toc(t0) 
   print, array_equal(x,first) ? 'Same answer' : 'Different answer' 
 endfor 
 print, string(methods[sort(times)] + ':', format='(a-15)') + $ 
   string(times[sort(times)], format='(g0)'), $ 
   format='(a)' 
  
end
The result I get when I run this code is (elapsed times in seconds):
Same answer 
Same answer 
Same answer 
Same answer 
BYTARR:        0.148 
STRPUT:        0.187 
Substring:     0.188 
Split/Join:   1.456
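Since the BYTARR variant packs everything into a single statement, here is a step-by-step sketch of the same idea on a small made-up array. The sample strings and the intermediate variable names (hit, hasEq, firstInd) are hypothetical and only there for illustration.

```idl
; Hypothetical sample data for illustration only
lines = ['a=b=c', 'none', 'key = value']

; Convert to a 2-D byte array: one row per string,
; with shorter strings padded with 0b
b = byte(lines)

; 1b wherever a character is '=' (ASCII 61)
hit = b eq 61b

; Maximum along the character dimension: 1 if the row contains '='.
; firstInd receives the one-dimensional subscript of that maximum.
hasEq = max(hit, firstInd, dimension=1)

; Only touch the rows that actually contain '='
w = where(hasEq, count)
if count gt 0 then b[firstInd[w]] = 58b   ; 58b is ':'

; Convert the byte rows back to strings
x = string(b)
print, x, format='(a)'
```

This relies on MAX reporting the first occurrence when several characters in a row tie for the maximum, which is consistent with the benchmark printing "Same answer" for this method. Unlike the one-line version, the sketch also checks count before indexing, so an input without any = is left unchanged.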
The cryptic byte array method ended up being the fastest, even though it performs a lot of copying and does not contain any obvious string processing functions. This is because IDL runs operations on arrays very efficiently. For example, the internal array indexing gives good, predictable memory access patterns. However, I would not really recommend this approach here, since the code is very hard to understand and to modify if needed. I would also avoid the Split/Join approach, as it is very inefficient (almost ten times slower than the byte array method).
Using "IndexOf" and "Substring" is nice here. Note in particular that the "Substring" method is similar to STRMID, but it can handle an array of different positions matching the size of the string array. This is a significant improvement over the old STRMID. For example, to extract the beginning of every string up to and including the first "e", you could use:
IDL> a=['!Hello!', 'test','this one!'] 
IDL> a.Substring(0,a.IndexOf('e')) 
!He 
te 
this one 
Or, to extract everything from the first colon onward:
IDL> x = ((orderedhash(!cpu))._overloadPrint()) 
IDL> x 
HW_VECTOR:            0 
VECTOR_ENABLE:            0 
HW_NCPU:            6 
TPOOL_NTHREADS:            6 
TPOOL_MIN_ELTS:                 100000 
TPOOL_MAX_ELTS:                      0 
IDL> x.Substring(x.IndexOf(':')) 
:            0 
:            0 
:            6 
:            6 
:                 100000 
:                     0
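The "IndexOf"/"Substring" approach from the second benchmark could be wrapped in a small reusable function. The following is only a sketch: the function name replace_first_equal is hypothetical, it assumes the input is a string array, and it adds a check on the WHERE count that the benchmark code does not have.

```idl
; Hypothetical helper (the name is illustrative): replace the first '='
; on each line with ':' and leave lines without '=' unchanged.
; Assumes IDL 8.4 or later and a string array as input.
function replace_first_equal, str
  compile_opt idl2, logical_predicate

  result = str
  index = str.IndexOf('=')               ; -1 where no '=' is present
  w = where(index ne -1, count)
  if count eq 0 then return, result      ; nothing to replace

  index = index[w]
  y = str[w]
  ; Text before the first '=', then ':', then the rest of the line
  result[w] = y.Substring(0, index-1) + ':' + y.Substring(index+1)
  return, result
end
```

Saved in a file named replace_first_equal.pro on the IDL path, it could be called as x = replace_first_equal(str). Lines that begin with = (index 0) are not handled specially in this sketch and may need a separate check.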
The final example replaces every occurrence of = with =>. I used two different methods for this: the new "Replace" method on string types, and STRSPLIT/STRJOIN. The results show that the new Replace method is much more efficient, about five times faster in this test.
pro StrTest_Replace 
 compile_opt idl2,logical_predicate 
 
 f = file_which('amoeba.pro') 
 str = strarr(file_lines(f)) 
 openr, lun, f, /get_lun 
 readf, lun, str 
 free_lun, lun 
 
 n = 5000 
 first = str.Replace('=', '=>') 
 methods = ['Replace','STRSPLIT'] 
 times = dblarr(2) 
 for method=0,1 do begin 
   t0 = tic() 
   case method of 
     0: for i=0, n-1 do begin 
       x = str.Replace('=','=>') 
     endfor 
     1: for i=0, n-1 do begin 
       x = str 
       foreach xx, x, j do x[j] = strjoin(strsplit(xx,'=',/extract),'=>') 
     endfor 
   endcase 
   times[method] = toc(t0) 
   print, array_equal(x,first) ? 'Same answer' : 'Different answer' 
 endfor 
 print, string(methods[sort(times)] + ':', format='(a-15)') + $ 
   string(times[sort(times)], format='(g0)'), $ 
   format='(a)' 
end
The result I get when I run this code is (elapsed times in seconds):
Same answer 
Same answer 
Replace:       0.545 
STRSPLIT:     2.778
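One thing to keep in mind with the STRSPLIT/STRJOIN variant is that STRSPLIT does not return empty fields unless the PRESERVE_NULL keyword is set, so a line that ends in = would lose its equal sign. The following minimal sketch uses made-up sample strings (not from amoeba.pro) and adds PRESERVE_NULL as a precaution. It is meant as an illustration, not as a replacement for the benchmark code.

```idl
; Hypothetical sample data for illustration only
lines = ['a=1', 'b = c = d', 'no equals here', 'x=']

; The Replace method (IDL 8.4 or later) handles the whole array at once
expected = lines.Replace('=', '=>')
print, expected, format='(a)'

; STRSPLIT/STRJOIN variant, one string at a time.
; /PRESERVE_NULL keeps empty fields, so the trailing '=' in 'x=' survives.
x = lines
foreach line, lines, j do $
  x[j] = strjoin(strsplit(line, '=', /extract, /preserve_null), '=>')

print, array_equal(x, expected) ? 'Same answer' : 'Different answer'
```

Even with this precaution, the Replace method is shorter to write and avoids the explicit loop over the strings.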