
INTERNAL: Efficiently manage memory when processing large datasets in IDL


This help article summarizes hints and tips on how to efficiently manage memory when processing large datasets in IDL:

 

Observe memory allocation

Use the command "HELP, /MEMORY" (http://www.harrisgeospatial.com/docs/memory.html) to obtain the amount of dynamic memory (in bytes) currently in use by the IDL session. This command also reports the maximum amount of dynamic memory allocated since the last call, as well as the number of times dynamic memory has been allocated and deallocated. The following help articles provide more information on how dynamic memory is allocated and released in IDL: http://harrisgeospatial.com/Support/SelfHelpTools/HelpArticles/HelpArticles-Detail/TabId/2718/ArtMID/10220/ArticleID/19801/533.aspx and http://www.harrisgeospatial.com/docs/DynamicMemory.html
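
As a small illustration, the same information can also be retrieved programmatically with the "MEMORY()" function; the pattern below for measuring the memory used by a single operation follows the IDL documentation:

    ; Print current and high-water dynamic memory usage of this IDL session
    HELP, /MEMORY

    ; MEMORY() returns [current usage, number of allocations,
    ; number of deallocations, high-water mark] in bytes
    mem = MEMORY()
    PRINT, 'Dynamic memory in use (bytes): ', mem[0]
    PRINT, 'High-water mark (bytes):       ', mem[3]

    ; Estimate the memory required by a single operation
    start_mem = MEMORY(/CURRENT)
    ; ... run the memory-critical operation here ...
    PRINT, 'Memory used (bytes): ', MEMORY(/HIGHWATER) - start_mem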

 

Consider memory fragmentation

As IDL is written in the C language, it uses the C routines "malloc()" and "free()" to allocate and release memory. When a large dataset is imported into IDL, a contiguous memory block of at least the size of the file is needed to store the data. If you receive an error message such as "Unable to allocate memory" even though enough virtual memory (RAM) is available on your system, this may indicate that a contiguous block of sufficient free memory could not be allocated due to fragmentation of the address space.

To work around this problem, you can try to allocate a large memory chunk right at the beginning of your IDL routine and free it again afterwards. This can be done, for instance, with a command such as "BYTARR()" and sufficiently large dimensions (e.g. "foo = BYTARR(2048, 3000, /NOZERO)"). Moreover, don't forget to close unnecessary applications and routines that might consume a considerable amount of memory. A sketch of this workaround follows below.
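
A minimal sketch of this workaround (the routine name, array sizes and file format are placeholders):

    PRO PROCESS_LARGE_FILE, filename
      COMPILE_OPT idl2

      ; Reserve a large contiguous block early, before the address space fragments
      reserve = BYTARR(2048, 3000, /NOZERO)

      ; ... set up smaller helper variables, open files, etc. ...

      ; Release the reserved block just before the large dataset is read,
      ; so the contiguous region becomes available for the actual data
      reserve = 0B

      data = READ_BINARY(filename, DATA_TYPE=4)   ; adjust the read to your file format
      ; ... process 'data' ...
    END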

Memory fragmentation was a particular issue on 32-bit operating systems, such as Microsoft Windows XP. To assess the associated limitations, have a look at our evaluation program (http://harrisgeospatial.com/Support/SelfHelpTools/HelpArticles/HelpArticles-Detail/TabId/2718/ArtMID/10220/ArticleID/19227/3441.aspx) and at the following help article: http://harrisgeospatial.com/Support/SelfHelpTools/HelpArticles/HelpArticles-Detail/TabId/2718/ArtMID/10220/ArticleID/19259/3346.aspx

 

Work with small data chunks instead of entire datasets

Try to split your data into smaller chunks and read only the portions that are needed for the current operation. Serial import and processing of smaller fragments saves memory and also reduces the risk of fragmentation. For the processing of large raster datasets, also consider taking advantage of the "Tile Iterator" functionality of the ENVI API in IDL (http://www.exelisvis.com/docs/ProgrammingGuideTileIterators.html); a plain-IDL chunking sketch follows below.
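
The following sketch illustrates the general chunking pattern in plain IDL by reading a large flat binary file of floating-point values in blocks of rows (the file name, dimensions and block size are assumptions):

    ncols  = 10000L            ; assumed raster width
    nrows  = 20000L            ; assumed raster height
    nblock = 500L              ; number of rows processed per chunk

    OPENR, lun, 'large_raster.dat', /GET_LUN
    FOR row0 = 0L, nrows - 1, nblock DO BEGIN
      nr = nblock < (nrows - row0)             ; last block may be smaller
      chunk = FLTARR(ncols, nr, /NOZERO)       ; buffer for one chunk only
      POINT_LUN, lun, 4LL * ncols * row0       ; seek to this block (4 bytes per float)
      READU, lun, chunk
      ; ... process 'chunk' and keep only the results ...
    ENDFOR
    FREE_LUN, lun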

 

Pass parameters by reference

Pass parameters by reference ("call-by-reference") instead of by value ("call-by-value") whenever possible. By passing the address (reference) of a data object as an argument, you avoid generating a copy of the dataset (http://www.harrisgeospatial.com/docs/Parameter_Passing_Mechan.html). If applicable, define the parameter as a pointer (e.g. http://www.harrisgeospatial.com/docs/PassingParameterArraysByReference.html). An array can be referenced through a pointer without a memory copy, e.g. "ptr = PTR_NEW(BINDGEN(256, 800), /NO_COPY)", and dereferenced with "*" (e.g. "TV, *ptr"). Keyword inheritance by reference is realized with the "_REF_EXTRA" keyword (instead of "_EXTRA", which passes keywords by value: http://www.harrisgeospatial.com/docs/Keyword_Inheritance.html).
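
A short sketch combining these mechanisms (the routine names are made up for illustration):

    ; 'data' is a named variable, so it is passed by reference and
    ; modified in place without creating a copy
    PRO SCALE_IN_PLACE, data, factor
      COMPILE_OPT idl2
      data *= factor
    END

    ; Forward unknown keywords by reference instead of copying them
    PRO SHOW_IMAGE, ptr, _REF_EXTRA=extra
      COMPILE_OPT idl2
      TV, *ptr, _EXTRA=extra      ; dereference the pointer; no array copy is made here
    END

    ; Usage:
    img = BINDGEN(256, 800)
    SCALE_IN_PLACE, img, 2B             ; img itself is modified
    ptr = PTR_NEW(img, /NO_COPY)        ; 'img' becomes undefined; the heap variable takes over
    SHOW_IMAGE, ptr, ORDER=1
    PTR_FREE, ptr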

Prevent the generation of local data copies

Large datasets should be read directly from the external files by the IDL routine that processes them, instead of first copying the data into a newly created variable. Suppose you would like to process large HDF5 files: it may then be beneficial to place the read call, e.g. "H5D_READ(dataset_id1)", directly inside the argument list of your processing function. This avoids the memory allocation that results from an intermediate local copy.
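
For illustration, the pattern could look like the following (the file name, dataset path and the PROCESS_DATA routine are placeholders):

    file_id    = H5F_OPEN('measurements.h5')             ; hypothetical file
    dataset_id = H5D_OPEN(file_id, '/group1/raw_data')   ; hypothetical dataset path

    ; Pass the HDF5 read directly into the processing routine instead of
    ; storing the data in an intermediate variable first
    result = PROCESS_DATA(H5D_READ(dataset_id))          ; PROCESS_DATA is a placeholder

    H5D_CLOSE, dataset_id
    H5F_CLOSE, file_id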

If available, use the "/NO_COPY" keyword in a function call to prevent duplication of the data being processed. Also consider the "TEMPORARY()" function to save memory when performing operations on large arrays. It improves performance and avoids allocating a new copy for results that are only intermediate (http://www.harrisgeospatial.com/docs/temporary.html).
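
A short sketch of the "TEMPORARY()" idiom (the array size is arbitrary):

    arr = FINDGEN(8000, 8000)        ; roughly 256 MB of floats

    ; Without TEMPORARY: the right-hand side allocates a second ~256 MB array
    ; before the result is assigned back to 'arr'
    arr = arr + 1.0

    ; With TEMPORARY: 'arr' is released while the expression is evaluated,
    ; so no additional copy of the array is required
    arr = TEMPORARY(arr) + 1.0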

 

Speed up array creation

Array creation is faster with initialized values, e.g. by using "REPLICATE" (http://www.harrisgeospatial.com/docs/replicate.html) or "MAKE_ARRAY" (http://www.harrisgeospatial.com/docs/make_array.html). For example, "foo = MAKE_ARRAY(ncols, nrows, VALUE=5, /LONG)" executes faster than "foo = LONARR(ncols, nrows) & foo[*] = 5". When using the latter kind of commands (such as "INTARR", "FLTARR", ...), you can use the "/NOZERO" keyword to speed up array creation if the array contents will be completely overwritten anyway.
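
A minimal timing sketch for this comparison (dimensions are arbitrary; the exact speed-up depends on your machine):

    ncols = 10000L & nrows = 10000L

    t0 = SYSTIME(/SECONDS)
    foo = MAKE_ARRAY(ncols, nrows, VALUE=5L, /LONG)   ; create and initialize in one step
    t1 = SYSTIME(/SECONDS)
    bar = LONARR(ncols, nrows) & bar[*] = 5L          ; create zeroed array, then overwrite
    t2 = SYSTIME(/SECONDS)
    buf = LONARR(ncols, nrows, /NOZERO)               ; no initialization at all
    t3 = SYSTIME(/SECONDS)

    PRINT, 'MAKE_ARRAY with VALUE: ', t1 - t0, ' seconds'
    PRINT, 'LONARR + assignment:   ', t2 - t1, ' seconds'
    PRINT, 'LONARR, /NOZERO:       ', t3 - t2, ' seconds'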


NH, 2015-07-10; reviewed by ???