Pixel Interleave – Why You Should Care and How To Handle It
One seemingly innocuous piece of raster metadata is the pixel
interleave, but it can have a dramatic impact on the performance of your
algorithms if you don’t adjust your data access patterns to account for it.
I’ll start with a quick review of what pixel interleave is,
then show how it can affect performance, and finish up with how you can update
your code to optimize your data access patterns and minimize that impact.
There are 3 forms of pixel interleave – Band Interleaved by
Pixel (BIP), Band Interleaved by Line (BIL), and Band Sequential (BSQ). The
help docs for ENVIRaster::GetData()
discuss this, as you can request the data be returned to you in a particular
interleave. If you call GetData() with the different values for INTERLEAVE,
you will get back the same data, but with different dimensions. The BIP
interleave is best for spectral processing, as the values for all the bands of
a given pixel are adjacent. BSQ interleave is best for spatial processing, as
adjacent pixels in the same band are close together. BIL interleave is a
compromise between the other two, where all the pixels in a given scanline are
adjacent, one band at a time.
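To make that concrete, here is a minimal sketch (assuming a raster is already open in a variable named oRaster, with qb_boulder_msi’s dimensions shown in the comments):
; Request the same pixels in each interleave; the values are identical,
; but the array dimensions differ (shown for a 1024 x 1024 x 4 raster).
dataBsq = oRaster.GetData(INTERLEAVE='bsq') ; [1024, 1024, 4] = [columns, rows, bands]
dataBil = oRaster.GetData(INTERLEAVE='bil') ; [1024, 4, 1024] = [columns, bands, rows]
dataBip = oRaster.GetData(INTERLEAVE='bip') ; [4, 1024, 1024] = [bands, columns, rows]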
Knowing how the data is stored on disk matters: requesting
the same collection of pixels can be much slower if you ask for them
in the wrong interleave. Let’s look at the qb_boulder_msi file that is
included with ENVI. It has 1024 columns, 1024 rows, and 4 bands, and is stored
in BSQ interleave. This means that it is stored as a [1024, 1024, 4] array on
disk, and the stream of pixels is stored as [0, 0, 0], [1, 0, 0], [2, 0, 0], … ,
[1023, 0, 0], [0, 1, 0], [1, 1, 0], … , [1022, 1023, 0], [1023, 1023, 0], [0,
0, 1], [1, 0, 1], … , [1022, 1023, 3], [1023, 1023, 3]. If you request the
data in BSQ interleave, then we can just read the data in one subrect per band
and assemble the data cube slice by slice. If you request the data in BIL or
BIP interleave, we can read it in a similar fashion, but need to rotate the
data before copying it into the data cube.
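The storage order also dictates where each pixel lives in the file. As a sketch (the variable names here are mine, not part of the API), the linear element offset of pixel [col, row, band] works out to:
; BSQ: column varies fastest, then row, then band
offsetBsq = col + row * nColumns + band * nColumns * nRows
; BIL: column varies fastest, then band within a row, then row
offsetBil = col + band * nColumns + row * nColumns * nBands
; BIP: band varies fastest, then column, then row
offsetBip = band + col * nBands + row * nBands * nColumns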
When you want to write your own algorithms using the ENVI
API (hopefully to be wrapped in an ENVITaskFromProcedure),
you have to make some decisions about how you want to read the data and process
it. If you’re lucky, you can read all the pixels into one array and iterate
over it however you want, but if you want to make your code more robust you
should plan for accessing the data in tiles, so it can handle rasters of any
dimension. To help you out in this endeavor, the ENVI API includes the
ENVIRasterIterator
class, which is built for you by the ENVIRaster::CreateTileIterator()
method. This method has the important keyword MODE, which is used to
control how the resulting tiles are shaped, and which way they traverse the
raster.
The default value for MODE is ‘spatial’, which returns the
pixels of only one band at a time. Iterating over this type of iterator returns
tiles of data that span the first band in row-major order: tiles start in the
upper left corner of the raster (or the requested SUB_RECT), move left to right
until they reach the right edge, then drop down a row and repeat until all pixels
in the first band are covered, then follow the same pattern for each successive
band. The default tile size used
by the iterator is 1024x1024, but to emphasize the point let’s consider a
512x512 pixel iterator on qb_boulder_msi. The first tile will cover the
subrect from [0,0] to [511,511] in band 0, followed by [512,0] to [1023,511] in
band 0, then [0,512] to [511,1023] in band 0, then [512,512] to [1023,1023] in
band 0, followed by the same sequence of 4 SUB_RECTs in band 1, then band
2, and lastly band 3.
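Here is a small sketch of that traversal, assuming qb_boulder_msi is open as oRaster (TILE_SIZE is how you request the 512 x 512 tiles):
; Walk a 512 x 512 spatial iterator and print each tile's extent;
; this should print 16 subrects, 4 per band, in the order described above.
iter = oRaster.CreateTileIterator(MODE='spatial', TILE_SIZE=[512, 512])
foreach tile, iter do begin
  print, iter.Current_Subrect ; [x1, y1, x2, y2] of the current tile
endforeach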
The alternate value for MODE is ‘spectral’, which returns
pixels in BIL-like scanlines. Each tile has the same number of columns as
the raster, and its number of rows equals the number of bands in the raster.
Iterating over the raster this way results in as many tiles as there are rows
of pixels in the raster.
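A matching sketch for spectral mode, again assuming qb_boulder_msi is open as oRaster, so each tile should come back as a [1024, 4] array:
; In spectral mode each tile is one scanline across all bands,
; so the loop runs once per raster row (1024 times here).
iter = oRaster.CreateTileIterator(MODE='spectral')
count = 0
foreach tile, iter do begin
  if (count eq 0) then help, tile ; e.g. Array[1024, 4]
  count++
endforeach
print, count ; should equal oRaster.nRows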
I put together the following simple procedure to illustrate
the effect interleave can have on performance. It performs calculations
similar to Spectral Angle Mapper (SAM): you pass in an ENVIRaster and a
reference spectrum and get back a 2-D array of values. The first thing it does
is normalize the input spectrum, by dividing each element by the square root of
the array dotted with itself. I then create the output array, which has the
same spatial dimensions as the input raster, and a helper array called ones,
which is used to build a matrix outer product so we have a 2-D array the same
size as the tile, where each row is identical. I then create my iterator, in
spectral mode since I’m doing spectral dot products, and iterate over it with
foreach. As the ENVIRasterIterator docs show, using foreach on this object
sets the loop variable to the pixels in the current tile. I
take the nColumns * nBands tile and square it, call Total() on the result
to sum each column, and then take the square root. The result is a 1-D array with
the magnitude of each pixel across its bands. I need to convert this 1-D array
back into a 2-D array, which is done by taking the outer product (# operator)
with that ones array, and then divide that into the original tile. The result,
normTile, is a normalized version of the tile, where each pixel’s spectrum
has unit magnitude. I then take the dot product of this normalized tile with
the normalized spectrum, which yields a 1-D array of values, essentially the
dot product of each pixel’s spectrum with the reference spectrum. Finally, I
copy this row of pixels into the appropriate location in the output array.
pro spectralDivergence, RASTER=oRaster, SPECTRUM=spectrumIn, RESULT=result
  compile_opt idl2
  ; normalize the reference spectrum to unit magnitude
  refSpectrum = spectrumIn / Sqrt(Total(spectrumIn*spectrumIn))
  result = DblArr(oRaster.nColumns, oRaster.nRows)
  ; helper vector used to replicate the norms across the band dimension
  ones = Make_Array(oRaster.nBands, /DOUBLE, Value=1)
  iter = oRaster.CreateTileIterator(MODE='spectral')
  foreach tile, iter do begin
    ; tile is [nColumns, nBands]; compute each pixel's magnitude
    tileSq = Double(tile) * tile
    norms = Sqrt(Total(tileSq, 2))
    ; normalize every pixel's spectrum to unit magnitude
    normTile = tile / (norms # ones)
    ; dot each normalized pixel spectrum with the reference spectrum
    tileDot = normTile # refSpectrum
    result[*, iter.Current_Subrect[1]] = tileDot
  endforeach
end
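To try it out, here is a hedged usage sketch; it assumes the standard ENVI install layout for the sample data, and it just grabs an arbitrary pixel’s spectrum to use as the reference, which you would replace with a real target spectrum:
; Open the sample raster, pull one pixel's spectrum as the reference,
; and time the procedure with tic/toc.
e = ENVI()
file = Filepath('qb_boulder_msi', ROOT_DIR=e.Root_Dir, SUBDIRECTORY=['data'])
oRaster = e.OpenRaster(file)
spectrum = Reform(oRaster.GetData(SUB_RECT=[512, 512, 512, 512], INTERLEAVE='bip'))
tic
spectralDivergence, RASTER=oRaster, SPECTRUM=spectrum, RESULT=result
toc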
I took qb_boulder_msi and converted it to BIL and BIP
interleaves using the “Convert Interleave” toolbox tool, and then ran this code
against all 3 files. The timing results using tic/toc are in this table:
Interleave | Time (s) local | Time (s) network
-----------|----------------|------------------
BIP        | 0.750          | 1.417
BIL        | 0.706          | 0.730
BSQ        | 0.738          | 1.671
I ran the timing tests using the files on my local hard
drive, as well as on a network share, which could more accurately represent the
situation with large repositories of data. The results don’t differ much
for the local file, on the order of 5%, but for the network file the spread
is a factor of 2 or more.
Let’s take a closer look at how the data is stored on disk
and how it is accessed to learn why this is. In the BIP interleave, the data
is stored by band first, then by column across each row, then by row. So when
we request a single scanline of pixels for the tile, we can read one chunk of
data from the disk. But this chunk of data has dimensions of nBands * nColumns,
when we want nColumns * nBands, so the data needs to be transposed. In the BSQ
interleave, which I detailed above, the data is stored by column first, then by
row, then by band. So when we request a single scanline of pixels for the
tile, we have to read a scanline from the first band, seek to the same
scanline in the next band, read that, and so on, ending up with nBands seeks
and reads as opposed to one of each.
This is why we see such a dramatic performance decrease for the networked
files: all those separate seeks and reads add up, and they can’t be
cached as effectively as they are for the local file.
So the question is how to adjust your code to take the pixel
interleave into account and reduce the time-consuming parts of the data
access.
For BIP interleave, the changes aren’t that significant: we
just can’t use the ENVIRasterIterator, since it forces BIL requests. Instead,
we manually keep track of the SUB_RECT for the current row of pixels and pass
that into ENVIRaster::GetData(). This returns a tile that is transposed
relative to the original version, so we swap the # matrix operator for the ##
operator to perform the calculations along the other dimension. Since we’re
still producing a 1-D array of output for each scanline of pixels, we never
have to transpose any of the variables.
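If the operator swap seems mysterious, this tiny sketch with made-up numbers shows both operators computing the same per-pixel dot products on a tile and its transpose:
; A [5, 3] BIL-style tile: 5 columns (pixels) by 3 rows (bands).
tile = RandomU(seed, 5, 3)
spec = [1d, 2d, 3d]
print, tile # spec ; 5 values, one dot product per pixel
; The same tile transposed to BIP-style [3, 5]: bands vary fastest.
print, Transpose(tile) ## spec ; the same 5 values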
For BSQ interleave, we need to do a bit more work. Since we
are trying to avoid the nBands seeks and reads per scanline, we will read a
number of scanlines at once into a 3-D array of data, then process one scanline
at a time from that cube until it is time to read the next cube. The big
question then is how many scanlines to read at a time, which becomes a
tradeoff of time and space: how much memory can you afford to use for this
cube to avoid the overhead of all those extra seeks and reads? For my example,
I chose a number I knew was small enough to force multiple cubes to be read. I
picked a maximum number of pixels to read (half a million), then
divided that by the number of columns and bands to get a row count (128 for
qb_boulder_msi). I start by reading one cube of data, then iterate over all the
scanlines. At the beginning of each iteration, I check whether my current row is
past the last SUB_RECT, and if it is I translate the SUB_RECT down, clamp it
against the total number of rows, and read the next cube of data. I then copy
out one scanline slice of data, which needs to be reformed from a [1024, 1, 4]
array down to a [1024, 4] 2-D array, and then proceed with the calculations
like the BIL case.
Here is the updated version of the code:
pro spectralDivergence2, RASTER=oRaster, SPECTRUM=spectrumIn, RESULT=result
  compile_opt idl2
  ; normalize the reference spectrum to unit magnitude
  refSpectrum = spectrumIn / Sqrt(Total(spectrumIn*spectrumIn))
  result = DblArr(oRaster.nColumns, oRaster.nRows)
  ; helper vector used to replicate the norms across the band dimension
  ones = Make_Array(oRaster.nBands, /DOUBLE, Value=1)
  switch oRaster.Interleave of
    'bip' : begin
      ; read one scanline at a time; each tile comes back [nBands, nColumns],
      ; so use ## instead of # for the matrix operations
      for row = 0, oRaster.nRows-1 do begin
        subRect = [0, row, oRaster.nColumns-1, row]
        tile = oRaster.GetData(SUB_RECT=subRect)
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 1))
        normTile = tile / (norms ## ones)
        tileDot = normTile ## refSpectrum
        result[*, row] = Reform(tileDot)
      endfor
      break
    end
    'bil' : begin
      ; the spectral tile iterator already matches BIL storage
      iter = oRaster.CreateTileIterator(MODE='spectral')
      foreach tile, iter do begin
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 2))
        normTile = tile / (norms # ones)
        tileDot = normTile # refSpectrum
        result[*, iter.Current_Subrect[1]] = tileDot
      endforeach
      break
    end
    'bsq' : begin
      ; read multiple scanlines per request to amortize the per-band seeks
      MaxTileSize = 512 * 1024L ; half a million pixels
      numRows = Floor(MaxTileSize / Double(oRaster.nColumns * oRaster.nBands))
      subRect = [0, 0, oRaster.nColumns-1, numRows-1]
      bsqCube = oRaster.GetData(SUB_RECT=subRect)
      for row = 0, oRaster.nRows-1 do begin
        if (row gt subRect[-1]) then begin
          ; slide the subrect down and clamp it to the raster extent
          ; (<= is IDL's minimum compound assignment operator)
          subRect += [0, numRows, 0, numRows]
          subRect[3] <= oRaster.nRows-1
          bsqCube = oRaster.GetData(SUB_RECT=subRect)
        endif
        ; pull one scanline out of the cube as a [nColumns, nBands] tile
        tile = Reform(bsqCube[*, row - subRect[1], *])
        tileSq = Double(tile) * tile
        norms = Sqrt(Total(tileSq, 2))
        normTile = tile / (norms # ones)
        tileDot = normTile # refSpectrum
        result[*, row] = Reform(tileDot)
      endfor
      break
    end
  endswitch
end
And more importantly, the new performance:
Interleave | Time (s) local | Time (s) network
-----------|----------------|------------------
BIP        | 0.648          | 1.277
BIL        | 0.720          | 0.740
BSQ        | 0.146          | 1.008
These numbers may not make much sense at first glance, but
there are a number of things to take into account. The BIL numbers are not
meaningfully different; they fall within the standard deviation of my
measurements (I ran the code multiple times for each data point). The BIP
numbers show a little improvement, most likely due to the simpler logic of
manually updating the SUB_RECT instead of going through the convenience of the
ENVIRasterIterator foreach logic. The BSQ numbers are the most intriguing,
however. The local file is way faster, by a factor of 5, but the networked
file is only 40% faster. If we analyze the number of seeks and reads for the
entire algorithm, we see the tile iterator performing nBands seeks and reads
for each scanline, for a total of 4 * 1024 = 4096 seeks and reads, while the
cached cube reading version performs nBands seeks and reads for each cube, for
a total of 4 * 8 = 32 seeks and reads. On the local file this makes a huge
difference due to the disk caching that can occur, while the network share
can’t cache as effectively because of the file requests it gets from all the
other systems on the LAN.
Your mileage will vary depending on the type of hardware involved, and the load on it, but the
important thing to take away is that with some forethought and a
little extra code you can speed up your algorithms a bunch, just by changing how
you get the data from the storage medium into your algorithm.