Training involves repeatedly exposing label rasters to a model. Over time, the model learns to translate the spectral and spatial information in the label rasters into a class activation raster that highlights the features it was shown during training. In the first training pass, the model makes an initial guess and generates an essentially random class activation raster, which is compared to the mask band of the label raster. Through a goodness-of-fit function, also called the loss function, the model learns where its guess was wrong. The internal parameters, or weights, of the model are adjusted to make it more correct, and the label rasters are exposed to the model again.
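The guess-compare-adjust cycle described above can be sketched in a few lines of Python. This is purely illustrative and not ENVI's implementation: the "model" here is a single weight applied to one input value, whereas ENVINet5 has millions of parameters, but the update loop has the same shape.

```python
import numpy as np

# Minimal sketch of the training cycle described above (illustrative only).
# The "model" is a single weight w; the real model has millions of weights.

def train(inputs, mask, epochs=100, lr=0.1):
    w = np.random.default_rng(0).normal()   # random initial guess
    loss = None
    for _ in range(epochs):
        activation = w * inputs             # forward pass: class activation
        error = activation - mask           # compare against the mask band
        loss = np.mean(error ** 2)          # goodness-of-fit (loss) function
        grad = 2 * np.mean(error * inputs)  # where the guess was wrong
        w -= lr * grad                      # adjust the weight to be more correct
    return w, loss

inputs = np.array([0.0, 0.5, 1.0])
mask = np.array([0.0, 1.0, 2.0])            # target relationship: w == 2
w, loss = train(inputs, mask)
```

After repeated passes the weight converges toward the value that reproduces the mask, and the loss approaches zero.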

In practice, however, data is not passed through training all at once. Instead, square patches of a given size are extracted from the label rasters and are passed to the training process a few at a time.
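Patch extraction can be sketched as follows. The grid-based extraction here is a simplification (ENVI chooses patch locations itself, as described later under Training Parameters), but it shows what a "patch" is: a square window cut from all bands of a raster.

```python
import numpy as np

# Sketch of extracting square patches from a (bands, rows, cols) raster.
# The regular grid here is illustrative; ENVI selects patch locations itself.

def extract_patches(raster, patch_size, stride):
    _, rows, cols = raster.shape
    patches = []
    for r in range(0, rows - patch_size + 1, stride):
        for c in range(0, cols - patch_size + 1, stride):
            patches.append(raster[:, r:r + patch_size, c:c + patch_size])
    return patches

raster = np.zeros((3, 928, 928))             # three bands, as in the example
patches = extract_patches(raster, 464, 464)  # 2 x 2 non-overlapping patches
```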

The following diagram of the ENVINet5 architecture shows how a model processes a single patch. The ENVINet5 architecture is designed for a single-class training workflow. The architecture has 5 "levels" and 27 convolutional layers. Each level represents a different pixel resolution in the model. This example uses a patch size of 464 x 464 and three bands. One of the outputs is a class activation raster, which is converted into a mask and compared to the mask band of the label raster.

In the ENVI Deep Learning 1.1 release, a variation of this architecture (called ENVINet5Multi) was introduced to handle training of multiple features/classes.

The contextual field of view of an architecture indicates how much of the surrounding area contributes to each pixel during training. For the ENVINet5 architecture, the contextual field of view is 185 x 185 pixels. A patch size larger than the contextual field of view allows more training to occur at once and speeds up classification. For the model to learn shapes larger than 185 x 185 pixels, the training rasters must be downsampled.
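The downsampling requirement is simple arithmetic: the feature, after downsampling, must fit inside the 185 x 185 field of view. A back-of-envelope helper (the function name is illustrative, not an ENVI API):

```python
import math

# Back-of-envelope check: how much to downsample a training raster so a
# feature fits inside ENVINet5's 185 x 185 contextual field of view.
FIELD_OF_VIEW = 185

def downsample_factor(feature_size_px):
    """Smallest integer factor that shrinks the feature to <= 185 pixels."""
    return max(1, math.ceil(feature_size_px / FIELD_OF_VIEW))

# A feature ~370 pixels across needs a factor-of-2 downsample (370 / 2 = 185);
# a 100-pixel feature needs no downsampling at all.
```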

You can indicate how much training to perform by specifying the number of epochs, the number of patches per epoch, and the number of patches per batch. These are described next.

Data Augmentation

Augmentation is a technique commonly used with deep learning to supplement the original training data. It involves creating modified versions of the training images, usually through geometric transformations such as scaling, flipping, rotating, and/or translating the data. ENVI allows you to choose whether to apply scaling and rotation, while it performs translations automatically. By having more information to extract from the training data, the trainer and classifier can more effectively learn what features of interest look like. Augmentation can also improve the ability of the models to generalize what they have learned to new images, and it can reduce the amount of labeling you have to do, particularly for capturing various rotations and sizes of features.

During each epoch, ENVI creates a new training dataset with a randomly assigned angle per training example, if you choose to apply Augment Rotation during training. Likewise, it creates a new training dataset with a randomly assigned scale factor per training example, if you choose to apply Augment Scale during training.
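The per-epoch, per-example randomization can be sketched like this. For a self-contained example the rotation is restricted to 90-degree steps; ENVI samples arbitrary angles (and, for Augment Scale, arbitrary scale factors).

```python
import numpy as np

# Sketch of per-epoch rotation augmentation: each training example gets a
# randomly assigned rotation every epoch. Limited here to 90-degree steps
# for simplicity; ENVI assigns arbitrary angles and scale factors.

def augment_epoch(patches, rng):
    augmented = []
    for patch in patches:                       # patch shape: (bands, rows, cols)
        k = rng.integers(0, 4)                  # random rotation per example
        augmented.append(np.rot90(patch, k, axes=(1, 2)))
    return augmented

rng = np.random.default_rng(1)
patches = [np.arange(12, dtype=float).reshape(1, 3, 4)]
augmented = augment_epoch(patches, rng)         # new dataset for this epoch
```

Because the rotation is re-drawn each epoch, the model sees the same labeled features in many orientations over the course of training.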

Epochs and Batches

In traditional deep learning, an epoch refers to passing an entire dataset through training once. In ENVI Deep Learning, however, patches are extracted from label rasters in an intelligent manner so that, at the beginning of training, areas with a high density of feature pixels are seen more often than areas with a low density. At the end of training, all areas are seen more equally. Because of this biased determination of how the patches are extracted, an epoch in ENVI Deep Learning instead refers to how many patches are trained before the bias is adjusted.

Multiple epochs are needed to adequately train a model. The number of epochs and number of patches per epoch depend on the diversity of the set of features being learned; there is no correct number. In general, there should be enough epochs for adjustment of weighting to occur smoothly; suggested values are 10 to 30. Once you specify the number of epochs, the number of patches per epoch determines how much training occurs. This number should be lower for small datasets and higher for larger datasets. Values are typically between 200 and 1000 but could be much higher for large datasets.

The training process does not usually train on a single patch at a time; multiple patches are used together in one iteration. A batch refers to the set of training patches used in one iteration of training. Batches are run in an epoch until the specified number of patches per epoch is met or exceeded. Typically, you should specify as many patches per batch as will fit into graphics processing unit (GPU) memory. The following table provides some rough guidelines for choosing this value, depending on the specified Patch Size and the memory size of your graphics card. Training tends to be more stable with at least 4 patches per batch. Values can vary based on how much graphics memory is consumed by other processes and by how many bands are chosen.

Graphics Memory    Patch Size    Patches per Batch
16 GB              704           4
16 GB              464           9
12 GB              608           4
12 GB              464           7
8 GB               672           2
8 GB               464           4
4 GB               464           2
4 GB               320           4
2 GB               464           1
2 GB               224           2
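The three knobs interact simply: because batches run until the patches-per-epoch count is met or exceeded, the total number of patches trained rounds up to a whole number of batches. A small helper (names are illustrative) makes the arithmetic concrete:

```python
import math

# How epochs, patches per epoch, and patches per batch interact: batches run
# until the patches-per-epoch target is met or exceeded, so totals round up
# to a whole batch.

def training_totals(epochs, patches_per_epoch, patches_per_batch):
    batches_per_epoch = math.ceil(patches_per_epoch / patches_per_batch)
    patches_trained = epochs * batches_per_epoch * patches_per_batch
    return batches_per_epoch, patches_trained

# e.g. 20 epochs, 400 patches per epoch, 7 patches per batch (12 GB, 464):
# 400 / 7 rounds up to 58 batches, so each epoch actually trains 58 * 7 = 406
# patches, and the whole run trains 20 * 406 = 8120.
```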

Training Parameters

ENVI uses a proprietary technique for training deep learning models that is based on a biased selection of patches. Normally, training patches are chosen with equal probability when training a TensorFlow model. If the pixels representing the feature of interest are sparse in an image, selecting patches equally throughout the image can cause the model to learn to produce a mask that consists entirely of background pixels.

To avoid this, ENVI introduces a bias so that the model will see patches with a higher density of feature pixels more often. The approach is based on a statistical technique called inverse transform sampling, where the examples shown to the model are in proportion to their contribution to a probability density function. This bias is controlled using the Class Weight parameter. You can set minimum and maximum values for Class Weight. The maximum value is used to bias patch selection when training begins. This value decreases to the minimum value when training ends.
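A sketch of this idea follows. It is illustrative, not ENVI's proprietary algorithm: candidate patches are weighted by their feature-pixel density raised to a bias exponent that decays from the Class Weight maximum to the minimum over training, and a patch is then drawn by inverse transform sampling on the cumulative weights.

```python
import numpy as np

# Illustrative sketch of density-biased patch selection via inverse transform
# sampling (not ENVI's exact algorithm). The bias decays from class_weight_max
# at the start of training to class_weight_min at the end.

def pick_patch(densities, epoch, epochs, w_min, w_max, rng):
    frac = epoch / max(1, epochs - 1)
    bias = w_max + frac * (w_min - w_max)   # decays from max to min
    weights = (densities + 1e-6) ** bias    # favor feature-dense patches
    cdf = np.cumsum(weights / weights.sum())
    return int(np.searchsorted(cdf, rng.uniform()))  # inverse transform sample

densities = np.array([0.0, 0.01, 0.5])      # feature-pixel fraction per patch
rng = np.random.default_rng(0)
early = [pick_patch(densities, 0, 25, 0.0, 3.0, rng) for _ in range(1000)]
late = [pick_patch(densities, 24, 25, 0.0, 3.0, rng) for _ in range(1000)]
```

Early in training the dense patch dominates the draws; by the final epoch the bias exponent has reached 0 and all patches are drawn with equal probability, matching the recommendation below that the minimum usually be 0.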

In most cases, the minimum value should be 0 so that the model finishes the last epoch while seeing the actual ratio of feature to background pixels. To help determine a suitable value for the maximum, keep in mind that machine- and deep-learning applications often yield better results when the ratio of positive to negative examples is approximately 1:100.

An additional Loss Weight parameter can be used to bias the loss function to place more emphasis on correctly identifying feature pixels than identifying background pixels. This is useful when features are sparse or if not all of the features are labeled. A value of 0 means the model should treat feature and background pixels equally. Increasing the Loss Weight biases the loss function toward finding feature pixels. The useful range of values is between 0 and 1.0 with a default value of 0.8.
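One way to realize such a bias is a weighted binary cross-entropy; this is an assumed formulation for illustration, since ENVI's exact loss is not published. At a Loss Weight of 0 the two pixel classes are penalized equally, and raising it makes a missed feature pixel cost more than a false alarm on background.

```python
import numpy as np

# Illustrative weighted binary cross-entropy (assumed form, not ENVI's exact
# loss). loss_weight in [0, 1]: 0 treats feature and background equally;
# larger values emphasize correctly identifying feature pixels.

def weighted_bce(pred, mask, loss_weight=0.8):
    eps = 1e-7
    pred = np.clip(pred, eps, 1 - eps)
    w_feature = 1.0 + loss_weight           # feature pixels count more
    w_background = 1.0 - loss_weight        # background pixels count less
    per_pixel = -(w_feature * mask * np.log(pred)
                  + w_background * (1 - mask) * np.log(1 - pred))
    return per_pixel.mean()

# Missing a feature pixel (predicting 0.1 where mask is 1) costs far more
# than a false alarm (predicting 0.9 where mask is 0) at the default 0.8.
miss = weighted_bce(np.array([0.1]), np.array([1.0]))
false_alarm = weighted_bce(np.array([0.9]), np.array([0.0]))
```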

You can also set the Patch Sampling Rate parameter to indicate the density of sampling that should occur. This is the average number of patches that each pixel will belong to in the training and validation rasters. Increasing this value can be helpful when features are sparse, as there is a greater likelihood that enough patches are chosen that include the features. For smaller patch sizes, increasing the sampling value could make the model more general by oversampling feature pixels in slightly different positions. The only reason to decrease the value would be when features are dense and you do not need to cover every pixel multiple times with training patches. This can help to speed up the training time.
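Since the sampling rate is the average number of patches each pixel belongs to, the implied patch count scales with rate times pixels divided by patch area. A rough estimate (the function name and example numbers are illustrative):

```python
import math

# Rough patch-count estimate implied by the Patch Sampling Rate: if each
# pixel should belong to `sampling_rate` patches on average, the number of
# patches scales with rate * pixels / patch_area.

def estimated_patches(rows, cols, patch_size, sampling_rate):
    return math.ceil(sampling_rate * rows * cols / patch_size ** 2)

# A 2000 x 2000 raster with 464-pixel patches and a sampling rate of 16
# implies roughly 16 * 4,000,000 / 215,296, or about 298 patches.
```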

In addition to weighting feature and background pixels, the training process must also consider the sizes and edges of features. This is described next.

Solid Distance

When labeling features for training, it can be tedious to draw polygons around features of interest. If you are more concerned with counting features, as opposed to accurately capturing their shapes (or masking them), you can label the features with polylines or points. A Solid Distance parameter is provided to expand the size of linear and point features so that they fully represent their associated real-world objects. Take painted road centerlines as an example. When using the Region of Interest (ROI) Tool to collect samples of road centerlines, you would use polyline ROIs. Polylines have a width of one pixel, but in reality, the associated road centerlines have a finite width (approximately 10 inches) that the TensorFlow model needs to learn.

The Solid Distance value is the number of pixels surrounding the labels, in all directions, that are also part of the target feature. You can use Solid Distance to expand the size of polygon features, but its use is limited. It is most commonly used with point and polyline features. Defining a Solid Distance value tends to work well for linear features with a fairly consistent width (like roads, road centerlines, and shipping containers) or compact features with a fairly consistent size (cars and stop signs). You can use the Mensuration tool in the ENVI toolbar to measure the length from an ROI point or polyline to the edge of the feature. In the Cursor Value dialog that appears, select the Pixels option under the Units drop-down list.

For example, if you add a point label to the center of a car that is about 34 pixels wide, a 17-pixel radius around the label would encompass most of the car. So the Solid Distance value would be 17.
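The car example can be sketched as a mask expansion. This is illustrative (using Chebyshev distance, i.e. a square neighborhood, for simplicity; ENVI's exact expansion may differ): every pixel within the Solid Distance of a label becomes part of the feature mask.

```python
import numpy as np

# Sketch of how a Solid Distance of 17 expands a point label: every pixel
# within 17 pixels of the label (Chebyshev distance here, for simplicity)
# becomes part of the feature mask.

def apply_solid_distance(labels, distance):
    rows, cols = labels.shape
    mask = np.zeros_like(labels)
    for y, x in zip(*np.nonzero(labels)):
        r0, r1 = max(0, y - distance), min(rows, y + distance + 1)
        c0, c1 = max(0, x - distance), min(cols, x + distance + 1)
        mask[r0:r1, c0:c1] = 1
    return mask

labels = np.zeros((100, 100), dtype=int)
labels[50, 50] = 1                        # point label at the car's center
mask = apply_solid_distance(labels, 17)   # expands to a 35 x 35 block
```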

Blur Distance

Deep learning algorithms can have difficulty learning the sharp edges of masks in features such as buildings. Blurring the edges and decreasing the blur during training can help the model gradually focus on the feature. To control this, set the minimum and maximum Blur Distance. At the beginning of training, features are expanded with a decaying gradient from the edge of a feature (including the Solid Distance, if defined) to the maximum Blur Distance. As training progresses, the distance is gradually reduced to the minimum value.

In general, set the maximum Blur Distance so that blurring stays within the contextual field of view of the model; a reasonable maximum ranges from a few pixels up to about 70. Set the minimum Blur Distance anywhere from 0 for well-defined borders to a few pixels when feature boundaries are indistinct.
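The decaying gradient can be sketched as follows. This is an assumed linear falloff for illustration (ENVI's exact kernel is not published): mask values drop from 1 at the feature edge to 0 at the blur distance, and that distance shrinks from the maximum to the minimum as training progresses.

```python
import numpy as np

# Illustrative decaying-gradient blur (assumed linear falloff, not ENVI's
# exact kernel). The blur distance decays from blur_max at the start of
# training to blur_min at the end.

def blur_mask(dist_to_feature, epoch, epochs, blur_min, blur_max):
    frac = epoch / max(1, epochs - 1)
    blur = blur_max + frac * (blur_min - blur_max)   # decays max -> min
    if blur <= 0:
        return np.where(dist_to_feature <= 0, 1.0, 0.0)  # hard edge
    return np.clip(1.0 - dist_to_feature / blur, 0.0, 1.0)

dist = np.array([0.0, 5.0, 10.0, 20.0])   # pixel distance from feature edge
early = blur_mask(dist, 0, 25, 0, 10)     # wide gradient at training start
late = blur_mask(dist, 24, 25, 0, 10)     # sharp edge by the final epoch
```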

All of these parameters are used to guide the TensorFlow model to learn the features of interest. Once the TensorFlow model has been trained, it can be used to find the same features in other images.