Using Spawn and Timers to Manage Concurrent Asynchronous Jobs
Anonymous
A common complaint about IDL is its single-threaded nature, and how you can’t peg all your CPUs by running a single PRO routine. I’m not here to announce any new multi-threaded version of IDL, but instead to discuss a very cool implementation a colleague showed me that allows you to fork off multiple processes and manage their lifecycles from IDL. You might argue that I could do this in Python, or even a simple shell script to a degree, but without IDL and ENVI’s rich support for data formats, it would be difficult to implement a generic solution that can inspect your input data and determine how to partition the problem, never mind reassemble the results into the final output product.
The main component is IDL’s Spawn procedure, though the use is a little nonstandard. The most common use of Spawn() is to build a command string and then pass it in along with positional parameters for stdout and maybe stderr. The big problems here are that this is a blocking call and there is no parameter for stdin, so you end up writing a file to disk and adding redirection or a pipe to the command string. The former may be overcome with the NOWAIT keyword, but that only works on Windows and I’m not entirely sure how you would capture the stdout or stderr output. Instead we will use the UNIT keyword, which makes the call non-blocking and wraps a bidirectional pipe in a LUN, so you can easily send data to stdin and read the results from stdout. Of course, if we follow up the Spawn() call with a “while not EOF(lun)” loop then we gain nothing, as our code still effectively blocks until the external process has terminated and we’re done reading from its stdout. Enter timers, which allow us to spawn multiple processes and revisit them until they are complete. So my colleague created the class AsyncSpawn, which does just that, allowing you to enqueue as many Spawn requests as you want, with a callback that will be invoked as the process spits stuff out on stdout. We use a class to maintain a catalog of all the job requests that have been made and the state of the running processes.
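To make the pattern concrete before diving into the class, here is a minimal sketch of Spawn with the UNIT keyword. The child command is just an illustrative “idl -e” invocation like the ones used later in this post; adjust it for whatever is on your path.

; Minimal sketch of Spawn with the UNIT keyword: the call returns
; immediately and hands back a LUN wrapped around a bidirectional
; pipe to the child process, plus its process id.
Spawn, 'idl -e "print, 2 + 2"', UNIT=lun, PID=pid
print, 'spawned child with PID ', pid
; Reading from the LUN reads the child's stdout. Note that a tight
; EOF loop like this still blocks until the child exits, which is
; exactly why the class below uses timers to revisit the pipe.
while (~EOF(lun)) do begin
  line = ''
  readf, lun, line
  print, line
endwhile
Free_Lun, lun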
The AsyncSpawn class has been trimmed down for this post; you would normally want a lot more error checking and configurability. The “started” Boolean member indicates whether it should be running any timers to process events, and is controlled by the appropriately named Start and Stop methods. There is a configurable CONCURRENCY keyword passed into Init() that controls how many processes can run simultaneously. The “queue” member is a List which stores the job requests in the order they are received. There are two Hash members: one for the running jobs, keyed on their shell’s process id, and one for the timers used to query the processes’ stdout pipes, keyed on the timer id.
A client uses Enqueue to request a process to be spawned. The first argument is the exact string you want passed to Spawn(), and the STDIN keyword is set to whatever string or string array you want written to the process’s stdin pipe. The CALLBACK keyword specifies either a procedure name or an object reference to a class with a ‘JobCallback’ procedure. In either case the callback procedure will be passed two arguments: a string message and a Dictionary containing the current job state. The values of the input parameters are all stored in a Dictionary object (for case insensitivity and dot-notation access to members), which is then added to the “queue” List member.
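Since the test driver later in this post only demonstrates the object-reference form of CALLBACK, here is a sketch of what the plain procedure form might look like. MyJobCallback is a hypothetical name; only the two-argument signature described above is required.

; Hypothetical procedure-based callback: a string message and the
; job state Dictionary, as described above.
pro MyJobCallback, msg, job
  compile_opt idl2
  case msg.ToLower() of
    'spawned': print, 'job started, PID ', job.pid
    'updated': print, 'job ', job.pid, ' wrote: ', job.stdout[-1]
    'done': print, 'job ', job.pid, ' finished after ', $
      SysTime(/SECONDS) - job.start, ' seconds'
  endcase
end

A request using it would then pass the procedure name as a string, e.g. spawner.Enqueue, cmd, STDIN='', CALLBACK='MyJobCallback'.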
The main method is ExecutiveLoop, which is initially invoked by Start and then subsequently by a timer using the special userdata value of -1. If the “started” state has been changed to False by the Stop method, then it won’t do anything. When “started” is True it will dequeue as many job requests as it can and spawn them, until it hits the concurrency limit. It will then set a new timer to call itself in 0.1 seconds. When it dequeues a job request, it calls Spawn with the UNIT and PID keywords so that it can write to stdin and read from stdout, respectively, as well as have the shell process id as a unique identifier. Once any stdin data has been written and flushed to the pipe, a Dictionary is created to store the state of the job, which in turn is added to the “running” member Hash, keyed on the process id. The job state includes the full command string requested, the stdin contents, the callback procedure name/object reference, the LUN of the process, the process id, a List to hold all the stdout contents, and the start time of the process. It is up to clients not to change any of this information when they implement the callback procedure, as that could break things. After storing the job state, the callback procedure is invoked with the “spawned” message, and then a timer is started, using the job’s process id as the userdata.
The AsyncSpawn class implements a HandleTimerEvent method, which is what the timers invoke when they go off. This method takes two arguments: the timer id and the userdata, which is either the process id returned from Spawn or -1 to indicate ExecutiveLoop should be invoked again. While it would be cleaner to have one timer and select on all the job LUNs, the FILE_POLL_INPUT() function doesn’t seem to work the way I’d like, at least on Windows. So we resorted to separate timers for each job LUN, as well as one for ExecutiveLoop. When a positive value is passed in for userdata, we invoke PollJobStatus with that process id.
The PollJobStatus method does just what it sounds like it should. It gets the job state Dictionary from the “running” member and checks if that job’s stdout LUN has reached its EOF marker. If so, we free the LUN, remove the job from “running”, and invoke the callback with the “done” message and the final job state. If we haven’t reached EOF, then we check whether there is activity on the pipe using FILE_POLL_INPUT(), and if so read a string from stdout. That string is added to the job’s stdout state, and then the callback is invoked with an “updated” message and the current job state.
That pretty much describes the minimalist implementation of AsyncSpawn; here is the code:
function AsyncSpawn::Init, CONCURRENCY=concurrency
compile_opt idl2
self.started = !False
self.concurrency = (N_Elements(concurrency) eq 1) ? (1>concurrency<20) : 1
self.timers = Hash()
self.running = Hash()
self.queue = List()
return, 1
end
pro AsyncSpawn::Cleanup
compile_opt idl2
self.Stop
end
pro AsyncSpawn::Start
compile_opt idl2
Timer.Block
if (~self.started) then begin
self.started = !True
self.ExecutiveLoop
endif
Timer.Unblock
end
pro AsyncSpawn::Stop
compile_opt idl2
Timer.Block
self.started = !False
; kill existing timers
foreach t, self.timers.Keys() do !null = Timer.Cancel(t)
self.timers = Hash()
; free luns for pipes to active jobs
foreach job, self.running, pid do Free_Lun, job.unit
Timer.Unblock
end
pro AsyncSpawn::Enqueue, cmd, STDIN=stdin, CALLBACK=callback
compile_opt idl2
if (~ISA(callback, /SCALAR, /STRING) && ~Obj_Valid(callback)) $
then Message, 'CALLBACK must be scalar string or object'
if (~ISA(stdin, /STRING)) $
then Message, 'STDIN must be string or string array'
self.queue.Add, Dictionary('cmd', cmd, 'stdin', stdin, $
'callback', callback)
end
pro AsyncSpawn::ExecutiveLoop
compile_opt idl2
; this is the main loop, it runs queued jobs and manages the
; polling timers that watch for results on the jobs
while (self.started && ~self.queue.IsEmpty() && $
(self.running.Count() lt self.concurrency)) do begin
req = self.queue.Remove(0)
Spawn, req.cmd, UNIT=u, PID=pid, /NOSHELL, /HIDE
; print to the pipe, it will be read by stdin in the spawned process
printf, u, req.stdin
flush, u
; add job to running catalog
job = Dictionary('cmd', req.cmd, 'stdin', req.stdin, 'unit', u, $
'callback', req.callback, 'pid', pid, $
'stdout', List(), 'start', SysTime(/Seconds))
self.running[pid] = job
if (Obj_Valid(req.callback)) then begin
Call_Method, 'JobCallback', req.callback, 'spawned', job
endif else begin
Call_Procedure, req.callback, 'spawned', job
endelse
; if not stopped, set a timer to inspect this new job
if (self.started) then self.timers[Timer.Set(0.1, self, pid)] = !True
endwhile
; if not stopped, set a timer to call myself
if (self.started) then self.timers[Timer.Set(0.1, self, -1)] = !True
end
pro AsyncSpawn::PollJobStatus, pid
compile_opt idl2
job = self.running[pid]
if (EOF(job.unit)) then begin
Free_Lun, job.unit
self.running.Remove, pid
if (Obj_Valid(job.callback)) then begin
Call_Method, 'JobCallback', job.callback, 'done', job
endif else begin
Call_Procedure, job.callback, 'done', job
endelse
return
endif
; check if there is content on this job's stdout
if (File_Poll_Input(job.unit, TIMEOUT=0) gt 0) then begin
s = ''
readf, job.unit, s
job.stdout.Add, s
if (Obj_Valid(job.callback)) then begin
Call_Method, 'JobCallback', job.callback, 'updated', job
endif else begin
Call_Procedure, job.callback, 'updated', job
endelse
endif
; if not stopped, set a timer to reinspect this job
if (self.started) then self.timers[Timer.Set(0.1, self, pid)] = !True
end
pro AsyncSpawn::HandleTimerEvent, id, pid
compile_opt idl2
self.timers.Remove, id
if (pid eq -1) then begin
self.ExecutiveLoop
endif else begin
self.PollJobStatus, pid
endelse
end
pro AsyncSpawn__define
compile_opt idl2
!null = {AsyncSpawn, $
started: !False, $
concurrency: 0L, $
timers: Hash(), $
running: Hash(), $
queue: List() $
}
end
The next step is to show this class in action. I created a simple IDL procedure which can be run from the command line using the “idl -e” syntax (https://www.nv5geospatialsoftware.com/docs/command_line_options_for.html#-e). It’s called ProcessRasterTile and it has three input keywords: URI is a raster filename to load, SUB_RECT is a spatial subset to process, and INDEX is the spectral index name to calculate. It will launch ENVI, load the raster, spatially subset it, and calculate the requested spectral index. It then gets a temporary filename, exports the spectral index raster to it, and prints the filename with a special “OUTPUT=” prefix that can be identified by the spawn callbacks. This isn’t a process you really need to parallelize, but it is a short example of what is possible with this approach. I did add some error handling, where any error message from any API call is printed to stdout before returning.
pro ProcessRasterTile, URI=uri, SUB_RECT=subRect, INDEX=index
compile_opt idl2
nv = ENVI(/HEADLESS)
oRaster = nv.OpenRaster(uri, ERROR=error)
if (StrLen(error) gt 0) then begin
print, 'Error opening raster: ' + error
return
endif
oSubRaster = ENVISubsetRaster(oRaster, SUB_RECT=subRect, ERROR=error)
if (StrLen(error) gt 0) then begin
print, 'Error subsetting raster: ' + error
return
endif
oIndexRaster = ENVISpectralIndexRaster(oSubRaster, INDEX=index, ERROR=error)
if (StrLen(error) gt 0) then begin
print, 'Error calculating spectral index: ' + error
return
endif
outFilename = nv.GetTemporaryFilename('dat')
oIndexRaster.Export, outFilename, 'ENVI'
print, 'OUTPUT=' + outFilename
nv.close
end
The next part is the callback, which I chose to implement as a class since I need to maintain the state of multiple processes until they are all complete, and I didn’t want to use common block variables to do that. I called this class SpawnCallbackHandler, since that is its job. This class has a Boolean “done” state, which was needed to avoid garbage collection in my simple test driver. It also has two Hash members: “jobs” is for active jobs and “finishedJobs” is for jobs that have completed. The main purpose of this class is its JobCallback method, which is invoked by the AsyncSpawn class. When it gets a “spawned” message it adds the new job to its “jobs” hash, using the PID as the key. When it gets an “updated” message it updates the job Dictionary in the “jobs” hash, and when it gets a “done” message it moves the job Dictionary from the “jobs” hash to the “finishedJobs” hash. There is some error checking to make sure the PIDs are unique and only in the correct hash. It then checks if there are no running jobs, only finished jobs, and in that case it calls its AssembleFinalProduct method; a more robust version would make this a callback procedure that takes the “finishedJobs” hash as input.
The AssembleFinalProduct method iterates over all the finished jobs, looking for the line in their stdout lists that has that special “OUTPUT=” prefix. Each of these is catalogued, so they can then be passed into ENVI::OpenRaster() to load all the NDVI tiles. I then create an ENVIMosaicRaster, which is exported to another temporary filename that is printed out so you know what it is.
function SpawnCallbackHandler::Init
compile_opt idl2
self.done = !False
self.jobs = Hash()
self.finishedJobs = Hash()
return, 1
end
pro SpawnCallbackHandler::Cleanup
compile_opt idl2
; nothing to clean up
end
function SpawnCallbackHandler::Done
compile_opt idl2
return, self.done
end
pro SpawnCallbackHandler::JobCallback, msg, job
compile_opt idl2
case msg.ToLower() of
'spawned': begin
if (self.jobs.HasKey(job.pid) || self.finishedJobs.HasKey(job.pid)) then begin
Message, 'Already know about job with PID ' + StrTrim(job.pid, 2)
endif
self.jobs[job.pid] = job
end
'updated': begin
if (~self.jobs.HasKey(job.pid) || self.finishedJobs.HasKey(job.pid)) then begin
Message, 'Update to unknown job with PID ' + StrTrim(job.pid, 2)
endif
self.jobs[job.pid] = job
end
'done': begin
if (~self.jobs.HasKey(job.pid) || self.finishedJobs.HasKey(job.pid)) then begin
Message, 'Completion of unknown job with PID ' + StrTrim(job.pid, 2)
endif
self.jobs.Remove, job.pid
self.finishedJobs[job.pid] = job
end
endcase
; are they all done yet?
if (self.jobs.IsEmpty() && ~self.finishedJobs.IsEmpty()) then begin
self.AssembleFinalProduct
endif
end
pro SpawnCallbackHandler::AssembleFinalProduct
compile_opt idl2
jobUrls = List()
foreach job, self.finishedJobs do begin
foreach line, job.stdout do begin
if (line.StartsWith('OUTPUT=')) then begin
jobUrls.Add, (line.Substring(7)).Trim()
endif
endforeach
endforeach
; launch ENVI and load all the output rasters
nv = ENVI(/HEADLESS)
tiles = List()
foreach url, jobUrls do begin
tiles.Add, nv.OpenRaster(url)
endforeach
; mosaic the rasters and export the result
mosaic = ENVIMosaicRaster(tiles.ToArray())
outFilename = nv.GetTemporaryFilename('dat')
mosaic.Export, outFilename, 'ENVI'
print, outFilename
nv.close
self.done = !True
end
pro SpawnCallbackHandler__define
compile_opt idl2
!null = {SpawnCallbackHandler, $
done: !False, $
jobs: Hash(), $
finishedJobs: Hash() $
}
end
My test driver code for this is as follows. Since this example doesn’t have any persistence, I had to use the SpawnCallbackHandler::Done() method to keep it and the AsyncSpawn object alive until everything was complete. If you used an approach like this in a widget-based application you wouldn’t need that method.
pro test_parallelRasterTile
compile_opt idl2
filename = 'C:\Program Files\Exelis\ENVI53\data\qb_boulder_msi'
subRect = [[0, 0, 511, 511], [512, 0, 1023, 511], $
[0, 512, 511, 1023], [512, 512, 1023, 1023]]
index = 'NDVI'
spawner = AsyncSpawn(CONCURRENCY=4)
handler = SpawnCallbackHandler()
for i = 0, 3 do begin
cmd = 'C:\Program Files\Exelis\IDL85\bin\bin.x86_64\idl.exe -e ' + $
'"ProcessRasterTile, URI=''' + filename + ''', SUB_RECT=[' + $
StrJoin(StrTrim(Reform(subRect[*,i]), 2), ',') + '], INDEX=''' + index + '''"'
spawner.Enqueue, cmd, STDIN='', CALLBACK=handler
endfor
spawner.Start
while (~handler.Done()) do begin
Wait, 0.5
endwhile
end