Internal: Does IDL launch a separate thread when using call_external? External communication loses packets.

Anonym Monday, September 20, 2010

Topic
[This is an internal tech tip only]

A customer is writing some networking code (in a shared library under CentOS Linux) to receive UDP packets from an x-ray camera. The library interfaces to IDL using the call_external routine. Everything seems to be working fine, except that packets are being dropped. This doesn't happen when the code runs outside of IDL, and I suspect the problem has to do with process priorities.

Does IDL launch a separate thread or process when using call_external? Does IDL do anything with the process priority of external functions called with call_external, or do anything that would prevent me from raising the priority of a thread that I spawn from within my shared library called from call_external?

Discussion

Jim Pendleton explains:

The simple answer is that CALL_EXTERNAL doesn't launch a separate process or thread. It simply loads the .so file and executes the specified entry point in a blocking fashion until it completes. It runs in the same process space as the main thread.

However, the entry point in the .so file may start a thread or process of its own. A new process would be outside of IDL's process space; a new thread shares IDL's process but runs outside of IDL's control. Either way, control over *this* state with respect to priorities and so forth would be managed by the external library, and IDL wouldn't know anything about it. I think that's the state the customer is referencing in their last sentence.
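
Since priorities are the library's responsibility, here is a minimal C sketch of how the shared library might request a higher priority for a capture thread it spawns. The helper name start_capture_thread, the stub capture_stub, and the choice of SCHED_FIFO are assumptions for illustration and are not taken from the customer's code. Real-time policies on Linux normally require root or CAP_SYS_NICE, so the sketch falls back to default scheduling when the request is refused.

```c
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Stand-in for the real UDP receive loop (hypothetical): it only
 * sets a flag so the caller can see that the thread ran. */
void *capture_stub(void *arg)
{
    *(int *)arg = 1;
    return NULL;
}

/* Hypothetical helper: start fn on a new thread, asking for SCHED_FIFO
 * at the given priority. If the real-time request is refused (typically
 * EPERM without privileges), retry with default scheduling.
 * Returns 0 on success; *is_realtime reports which path was taken. */
int start_capture_thread(pthread_t *tid, void *(*fn)(void *), void *arg,
                         int priority, int *is_realtime)
{
    pthread_attr_t attr;
    struct sched_param sp = { .sched_priority = priority };
    int rc;

    *is_realtime = 0;
    rc = pthread_attr_init(&attr);
    if (rc != 0)
        return rc;

    /* Without PTHREAD_EXPLICIT_SCHED the new thread would simply
     * inherit the scheduling of the calling (IDL) thread. */
    if (pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED) == 0 &&
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO) == 0 &&
        pthread_attr_setschedparam(&attr, &sp) == 0 &&
        pthread_create(tid, &attr, fn, arg) == 0) {
        *is_realtime = 1;
        pthread_attr_destroy(&attr);
        return 0;
    }

    pthread_attr_destroy(&attr);
    /* Fallback: default attributes, normal time-sharing priority. */
    return pthread_create(tid, NULL, fn, arg);
}
```

The entry point called through call_external would invoke start_capture_thread once and return. Checking is_realtime afterwards shows whether the priority request actually took effect. Older toolchains may need -pthread when building the library.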

My suspicion is that the external routine is reading the packets and populating a buffer which IDL would then parse after the call_external completes. Depending on how this buffer is constructed and read, there may be an issue with overwrites.

If call_external is being called in a loop to gather the packet info and the device producing the packets isn't buffering the data, packets could be lost between loop iterations. If this is the case, the external routine would need to be modified to keep buffering the data between calls and to hand over, then clear, the accumulated packets on each call.
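
One way to buffer between calls is a fixed-size ring that a capture thread fills and the call_external entry point drains. The following is a sketch under stated assumptions, not the customer's implementation: the names ring_push, ring_drain and ring_dropped, the capacity RING_SLOTS, and the slot size SLOT_BYTES are all hypothetical. A mutex ensures that only one side touches the ring at a time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 256    /* hypothetical capacity, in packets */
#define SLOT_BYTES 1500   /* hypothetical maximum payload per packet */

static struct {
    unsigned char data[RING_SLOTS][SLOT_BYTES];
    size_t len[RING_SLOTS];
    size_t head;            /* next slot to write */
    size_t tail;            /* next slot to read */
    size_t count;           /* packets currently stored */
    unsigned long dropped;  /* packets rejected because the ring was full */
} ring;

static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called by the capture thread for every received packet.
 * Returns 0 if stored, -1 if the packet is too large or the ring is full. */
int ring_push(const void *pkt, size_t n)
{
    int rc = -1;

    if (n > SLOT_BYTES)
        return -1;
    pthread_mutex_lock(&ring_lock);
    if (ring.count < RING_SLOTS) {
        memcpy(ring.data[ring.head], pkt, n);
        ring.len[ring.head] = n;
        ring.head = (ring.head + 1) % RING_SLOTS;
        ring.count++;
        rc = 0;
    } else {
        ring.dropped++;  /* record the loss instead of overwriting data */
    }
    assert(ring.count <= RING_SLOTS);
    pthread_mutex_unlock(&ring_lock);
    return rc;
}

/* Called from the call_external entry point. Copies up to max_pkts
 * packets, oldest first, into out (packet i starts at i * SLOT_BYTES)
 * and stores each length in lens. Returns the number of packets copied. */
size_t ring_drain(unsigned char *out, size_t *lens, size_t max_pkts)
{
    size_t n = 0;

    pthread_mutex_lock(&ring_lock);
    while (n < max_pkts && ring.count > 0) {
        memcpy(out + n * SLOT_BYTES, ring.data[ring.tail], ring.len[ring.tail]);
        lens[n] = ring.len[ring.tail];
        ring.tail = (ring.tail + 1) % RING_SLOTS;
        ring.count--;
        n++;
    }
    pthread_mutex_unlock(&ring_lock);
    return n;
}

/* Number of packets dropped so far because the ring was full. */
unsigned long ring_dropped(void)
{
    unsigned long d;

    pthread_mutex_lock(&ring_lock);
    d = ring.dropped;
    pthread_mutex_unlock(&ring_lock);
    return d;
}
```

The capture thread would call ring_push for each datagram it receives, and each call_external invocation would call ring_drain to copy out whatever accumulated since the previous call. A nonzero ring_dropped() suggests that IDL is not draining fast enough for the chosen capacity, which separates losses inside the library from losses on the network.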

One might also consider using a semaphore (sem_create, sem_lock, etc.) to ensure that in a non-blocking environment only one process at a time is writing or reading from a buffer. Whether this would apply depends on the architecture of the data capture process.

If, on the other hand, the call_external library routine blocks until all expected data is accumulated before returning to IDL, that indicates there's a problem in the external library's construction of the return buffer, a problem that probably can't be debugged from IDL directly.

The customer might also consider an external utility like Wireshark (http://www.wireshark.org/) to monitor the packets directly. I've used this, but not extensively. Exelis VIS doesn't endorse this as a solution and advises our customers that their mileage may vary.

Possible reasons as to why the call_external communication drops packets?

It's hard to say where the problem is coming from, but here are some ideas to work around it.

First, if your customer is running from the Workbench, try running from the IDL command line instead, especially if this is 8.0, where the Workbench and main IDL process live in the same process space. The Workbench uses extra event handling loops that the command line does not. The command line does have some background processing (for example, it checks for a valid license at intervals), but the overhead should be quite low.

Second, there's the option of using an IDL_IDLBridge to execute the camera capture. This will create a new process space. The execution of the data capture could be non-blocking with the main process checking the return value of ->Status() at intervals. That may be the cleanest solution. Even if the main process is the Workbench, this separate process shouldn't have any thread conflicts.

Third, adjust the polling time between calls to the call_external that checks the semaphore. The polling loop itself may be part of the problem. Also, it should be possible to check the semaphore directly from IDL without the need for an extra call_external. It's possible that reentering the .so while the capture is still executing is not thread-safe.

On a device with which I have a lot of personal familiarity, the capture was highly dependent on the polling interval. If I polled too frequently, the thread that was capturing the data didn't have time to collect the data between queries. If I polled too infrequently, I'd get a buffer overflow on the device or a deadlock. It would have been far preferable for the device to have an on-board memory buffer, but the manufacturer only implemented a streaming solution, assuming the acquisition computer to which it was attached had nothing better to do with its time than service the device.

I don't know too much about what tools may be available on CentOS Linux, but on Windows it was easy to spot transfer bottlenecks by running the Task Manager. By monitoring both the network traffic and the CPU use during acquisition, I was able to optimize the polling rates so both the CPU and network traffic remained pretty constant. If CPU drops off and, say, paging increases, then "something" needs to be tweaked.