
Pixel Interleave – Why You Should Care and How To Handle It

Anonym

One seemingly innocuous piece of raster metadata is the pixel interleave, but it can have a dramatic impact on the performance of your algorithms if you don’t adjust your data access patterns to account for it.

I’ll start with a quick review of what pixel interleave is, then show how it can affect performance, and finish up with how you can update your code to optimize your data access patterns and minimize that impact.

There are three forms of pixel interleave – Band Interleaved by Pixel (BIP), Band Interleaved by Line (BIL), and Band Sequential (BSQ).  The help docs for ENVIRaster::GetData() discuss this, as you can request the data be returned to you in a particular interleave.  If you call GetData() with different values for the INTERLEAVE keyword, you will get back the same data, but with different dimensions.  The BIP interleave is best for spectral processing, as the values for all the bands of a given pixel are adjacent.  BSQ interleave is best for spatial processing, as adjacent pixels in the same band are close together.  BIL interleave is a compromise between the other two, where all the pixels in a given scanline are adjacent, one band at a time.
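To make this concrete, here is a minimal sketch using the qb_boulder_msi sample file (described in more detail below); the file path follows the usual ENVI sample-data convention and may differ on your install:

; Sketch: the same pixels come back with different dimension ordering
; depending on the INTERLEAVE keyword.
e = ENVI()
file = Filepath('qb_boulder_msi', Subdir=['data'], Root_Dir=e.Root_Dir)
oRaster = e.OpenRaster(file)
bsq = oRaster.GetData(INTERLEAVE='bsq')
bil = oRaster.GetData(INTERLEAVE='bil')
bip = oRaster.GetData(INTERLEAVE='bip')
Print, 'BSQ dims: ', Size(bsq, /DIMENSIONS)   ; [nColumns, nRows, nBands] = 1024 1024 4
Print, 'BIL dims: ', Size(bil, /DIMENSIONS)   ; [nColumns, nBands, nRows] = 1024 4 1024
Print, 'BIP dims: ', Size(bip, /DIMENSIONS)   ; [nBands, nColumns, nRows] = 4 1024 1024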

Knowing how the data is stored on disk matters, because requesting the same collection of pixels in the wrong interleave can make the read much slower.  Let’s look at the qb_boulder_msi file that is included with ENVI.  It has 1024 columns, 1024 rows, and 4 bands, and is stored in BSQ interleave.  This means that it is stored as a [1024, 1024, 4] array on disk, and the stream of pixels is stored as [0, 0, 0], [1, 0, 0], [2, 0, 0], … , [1023, 0, 0], [0, 1, 0], [1, 1, 0], … , [1022, 1023, 0], [1023, 1023, 0], [0, 0, 1], [1, 0, 1], … , [1022, 1023, 3], [1023, 1023, 3].  If you request the data in BSQ interleave, then we can just read the data in one subrect per band and assemble the data cube slice by slice.  If you request the data in BIL or BIP interleave, we can read it in a similar fashion, but need to reorder the data before copying it into the data cube.
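Another way to see the layouts is through the linear offset of a single pixel value in the file. The following function is purely illustrative (it is not part of the ENVI API, and the name and parameters are made up for this example); multiply the returned element index by the bytes per element to get a file offset:

; Illustrative only: linear element offset of pixel (col, row, band)
; for each interleave, given nColumns, nRows, nBands.
function interleaveOffset, interleave, col, row, band, nColumns, nRows, nBands
  compile_opt idl2
  case interleave of
    'bsq' : return, (band * nRows    + row)  * nColumns + col   ; whole-band planes
    'bil' : return, (row  * nBands   + band) * nColumns + col   ; band scanlines per row
    'bip' : return, (row  * nColumns + col)  * nBands   + band  ; full spectrum per pixel
    else  : return, -1   ; unknown interleave
  endcase
end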

When you want to write your own algorithms using the ENVI API (hopefully to be wrapped in an ENVITaskFromProcedure), you have to make some decisions about how you want to read the data and process it.  If you’re lucky, you can read all the pixels into one array and iterate over it however you want, but if you want to make your code more robust you should plan on accessing the data in tiles, so it can handle rasters of any size.  To help you out in this endeavor, the ENVI API includes the ENVIRasterIterator class, which can be built for you by the ENVIRaster::CreateTileIterator() function.  This function has the important keyword MODE, which controls how the resulting tiles are shaped and which way they traverse the raster.

The default value for MODE is ‘spatial’, which returns tiles containing pixels from a single band at a time.  Iterating over this type of iterator returns tiles of data that traverse each band in row-major order: tiles start in the upper-left corner of the raster (or the requested SUB_RECT), move left to right until they reach the right edge, then drop down a row of tiles and repeat until all pixels in the first band are covered, then follow the same pattern for each successive band.  The default tile size used by the iterator is 1024x1024, but to emphasize the point let’s consider a 512x512 pixel iterator on qb_boulder_msi.  The first tile will cover the subrect from [0,0] to [511,511] in band 0, followed by [512,0] to [1023,511] in band 0, then [0,512] to [511,1023] in band 0, then [512,512] to [1023,1023] in band 0, followed by the same sequence of four SUB_RECTs in band 1, then band 2, and lastly band 3.
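A quick way to watch this traversal order yourself is to print each tile’s subrect as you iterate. This sketch assumes oRaster is an open ENVIRaster (such as qb_boulder_msi) and that the TILE_SIZE keyword is available to force 512x512 tiles:

; Sketch: print the subrect of each spatial-mode tile to see the traversal order.
iter = oRaster.CreateTileIterator(MODE='spatial', TILE_SIZE=[512, 512])
foreach tile, iter do begin
  Print, 'Tile dims: ', Size(tile, /DIMENSIONS), '  Subrect: ', iter.Current_Subrect
endforeach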

The alternate value for MODE is ‘spectral’, which returns pixels in a BIL-like scanline.  Each tile has the same number of columns as the raster, and the number of rows in the tile equals the number of bands in the raster.  Iterating over the raster this way produces as many tiles as there are rows of pixels in the raster.
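For qb_boulder_msi this means 1024 tiles, each a [1024, 4] array holding one scanline across all bands; a minimal sketch (again assuming oRaster is already open):

; Sketch: spectral-mode tiles are [nColumns, nBands] scanlines, one per raster row.
iter = oRaster.CreateTileIterator(MODE='spectral')
foreach tile, iter do begin
  Print, Size(tile, /DIMENSIONS), iter.Current_Subrect[1]   ; [1024, 4] and the row index
endforeach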

I put together the following simple procedure to illustrate the effect interleave can have on performance.  It performs calculations similar to the Spectral Angle Mapper (SAM): you pass in an ENVIRaster and a reference spectrum and get back a 2-D array of values.  The first thing it does is normalize the input spectrum, by dividing each element by the square root of the array dotted with itself.  I then create the output array, which has the same spatial dimensions as the input raster, and a helper array called ones, which is used to build a matrix outer product so we have a 2-D array the same size as the tile, where each row is identical.  I then create my iterator, in spectral mode since I’m doing spectral dot products, and iterate over it with foreach.  As the ENVIRasterIterator docs show, using foreach on this object sets the loop variable to the pixels in the current tile.  I take the nColumns x nBands tile and square it, call Total() on the result to sum each column, and then take the square root.  The result is a 1-D array with the magnitude of each pixel across its bands.  I need to convert this 1-D array back into a 2-D array, which is done by taking the outer product (# operator) with that ones array, and then I divide the original tile by it.  The result, normTile, is a normalized version of the tile, where each pixel’s squared values sum to unity.  I then take the dot product of this normalized tile with the normalized spectrum, which yields a 1-D array of values, essentially the dot product of each pixel’s spectrum with the reference spectrum.  Finally, I copy this row of values into the appropriate location in the output array.

pro spectralDivergence, RASTER=oRaster, SPECTRUM=spectrumIn, RESULT=result
  compile_opt idl2
 
  ; Normalize the reference spectrum to unit magnitude
  refSpectrum = spectrumIn / Sqrt(Total(spectrumIn*spectrumIn))
  ; Output has the same spatial dimensions as the raster
  result = DblArr(oRaster.nColumns, oRaster.nRows)
  ; Helper vector used to replicate the per-pixel norms across bands
  ones = Make_Array(oRaster.nBands, /DOUBLE, Value=1)
  ; Spectral mode returns one [nColumns, nBands] scanline tile per row
  iter = oRaster.CreateTileIterator(MODE='spectral')
  foreach tile, iter do begin
    tileSq = Double(tile) * tile
    ; Per-pixel magnitude across bands
    norms = Sqrt(Total(tileSq, 2))
    ; Outer product with ones expands norms to the tile's dimensions
    normTile = tile / (norms # ones)
    ; Dot each normalized pixel spectrum with the reference spectrum
    tileDot = normTile # refSpectrum
    result[*, iter.Current_Subrect[1]] = tileDot
  endforeach
end
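
For reference, the timing harness can be as simple as the following sketch; the choice of pixel [512, 512] for the reference spectrum is arbitrary and purely for illustration:

; Sketch of how the procedure might be called and timed with tic/toc.
e = ENVI()
file = Filepath('qb_boulder_msi', Subdir=['data'], Root_Dir=e.Root_Dir)
oRaster = e.OpenRaster(file)
spectrum = Reform(oRaster.GetData(SUB_RECT=[512, 512, 512, 512]))   ; spectrum of one pixel
Tic
spectralDivergence, RASTER=oRaster, SPECTRUM=spectrum, RESULT=result
Toc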

I took qb_boulder_msi and converted it to BIL and BIP interleaves using the “Convert Interleave” toolbox tool, and then ran this code against all 3 files.  The timing results using tic/toc are in this table:

Interleave   Time (s) local   Time (s) network
BIP          0.750            1.417
BIL          0.706            0.730
BSQ          0.738            1.671

I ran the timing tests using the files on my local hard drive, as well as on a network share, which could more accurately represent the situation with large repositories of data.  The results don’t look that different for the local file, on the order of 5%, but for the network file it’s a factor of 2 or more.

Let’s take a closer look at how the data is stored on disk and how it is accessed to understand why.  In the BIP interleave, the data is stored by band first, then by column across each row, then by row.  So when we request a single scanline of pixels for the tile, we can read one chunk of data from the disk.  But this chunk of data has dimensions of nBands * nColumns when we want nColumns * nBands, so the data needs to be transposed.  In the BSQ interleave, which I detailed above, the data is stored by column first, then by row, then by band.  So when we request a single scanline of pixels for the tile, we have to read a scanline from one band, then seek to the next band and read its scanline, ending up with nBands reads and seeks instead of one of each.  This is why we see such a dramatic performance decrease for the networked files: all those separate seeks and reads add up, and they can’t be cached as effectively as they are for the local file.

So the question is how to adjust your code to take the pixel interleave into account and reduce the time-consuming parts of the data access.

For BIP interleave, the changes aren’t that significant; we just can’t use the ENVIRasterIterator, since it will force BIL requests.  Instead, we manually keep track of the SUB_RECT for the current row of pixels and pass that into ENVIRaster::GetData().  This returns a tile that is transposed from the original version, so we swap the # matrix operator for the ## operator to perform the calculations along the other dimension.  Since we’re still producing a 1-D array of output for each scanline of pixels, we never have to transpose any of the variables.

For BSQ interleave, we need to do a bit more work.  Since we are trying to avoid the nBands seeks and reads per scanline, we read a number of scanlines at once into a 3-D array of data, then process one scanline at a time from that cube until it is time to read the next cube.  The big question then is how many scanlines to read at a time, which becomes a tradeoff of time and space – how much memory can you afford to spend on this cube to avoid the overhead of all those extra seeks and reads?  For my example, I chose a number I knew was small enough to force multiple cubes to be read.  I picked a maximum number of pixels to read (half a million), then divided that by the number of columns and bands to get a row count (128 for qb_boulder_msi).  I start by reading one cube of data, then iterate over all the scanlines.  At the beginning of each iteration, I check if my current row is past the end of the current SUB_RECT, and if it is I translate the SUB_RECT down, clamp it against the total number of rows, and read the next cube of data.  I then copy out one scanline slice of data, which needs to be reformed from a [1024, 1, 4] array down to a [1024, 4] 2-D array, and then proceed with the calculations as in the BIL case.

Here is the updated version of the code:

pro spectralDivergence2, RASTER=oRaster, SPECTRUM=spectrumIn, RESULT=result
  compile_opt idl2
 
  ; Normalize the reference spectrum to unit magnitude
  refSpectrum = spectrumIn / Sqrt(Total(spectrumIn*spectrumIn))
  result = DblArr(oRaster.nColumns, oRaster.nRows)
  ones = Make_Array(oRaster.nBands, /DOUBLE, Value=1)
  switch oRaster.Interleave of
    'bip' : begin
      ; BIP: request one scanline at a time; the tile comes back transposed
      ; ([nBands, nColumns]), so use ## instead of #
      for row = 0, oRaster.nRows-1 do begin
        subRect = [0, row, oRaster.nColumns-1, row]
        tile = oRaster.GetData(SUB_RECT=subRect)
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 1))
        normTile = tile / (norms ## ones)
        tileDot = normTile ## refSpectrum
        result[*, row] = Reform(tileDot)
      endfor
      break
    end
    'bil' : begin
      ; BIL: the spectral tile iterator already reads one scanline per seek
      iter = oRaster.CreateTileIterator(MODE='spectral')
      foreach tile, iter do begin
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 2))
        normTile = tile / (norms # ones)
        tileDot = normTile # refSpectrum
        result[*, iter.Current_Subrect[1]] = tileDot
      endforeach
      break
    end
    'bsq' : begin
      ; BSQ: read a cube of many scanlines at once, then process it one
      ; scanline at a time to amortize the per-band seeks and reads
      MaxTileSize = 512 * 1024L ; half a million pixels
      numRows = Floor(MaxTileSize / Double(oRaster.nColumns * oRaster.nBands))
      subRect = [0, 0, oRaster.nColumns-1, numRows-1]
      bsqCube = oRaster.GetData(SUB_RECT=subRect)
      for row = 0, oRaster.nRows-1 do begin
        if (row gt subRect[-1]) then begin
          ; Slide the SUB_RECT down, clamp it to the raster, read the next cube
          subRect += [0, numRows, 0, numRows]
          subRect[3] <= oRaster.nRows-1
          bsqCube = oRaster.GetData(SUB_RECT=subRect)
        endif
        ; Extract one [nColumns, nBands] scanline from the cube
        tile = Reform(bsqCube[*, row - subRect[1], *])
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 2))
        normTile = tile / (norms # ones)
        tileDot = normTile # refSpectrum
        result[*, row] = Reform(tileDot)
      endfor
      break
    end
  endswitch
end

And more importantly, the new performance:

Interleave   Time (s) local   Time (s) network
BIP          0.648            1.277
BIL          0.720            0.740
BSQ          0.146            1.008

These numbers may not make much sense at first glance, but there are a number of things to take into account.  The BIL numbers are not significantly different – they fall within the standard deviation of my measurements, since I ran the code multiple times for each data point.  The BIP numbers show a little improvement, most likely due to the simpler logic of manually updating the SUB_RECT instead of going through the convenience of the ENVIRasterIterator foreach logic.  The BSQ numbers are the most intriguing, however.  The local file is way faster, by a factor of 5, but the networked file is only 40% faster.  If we analyze the number of seeks and reads for the entire algorithm, we see the tile iterator performing nBands seeks and reads for each scanline, for a total of 4 * 1024 = 4096 seeks and reads, while the cached-cube version performs nBands seeks and reads for each cube, for a total of 4 * 8 = 32 seeks and reads.  On the local file, this makes a huge difference due to the disk caching that can occur, while the network share can’t cache as effectively because of the file requests it gets from all the other systems on the LAN.

Your mileage will vary depending on the hardware involved and the load on it, but the important takeaway is that with some forethought and a little extra code you can significantly speed up your algorithms, simply by changing how you get the data from the storage medium into your algorithm.