Training jobs are configured by a set of hyperparameters. All of them have
default values, but you can adjust them for fine-grained control over the
model architecture and the training process.
It helps to understand the meta-architecture of our models. They are
composed of two or three parts:
Backbone
Contains most of the weights. It generates abstract representations of
the inputs for the head(s).
Neck
An optional layer that enhances features across scales.
Head(s)
Solves a specific ML task. Some heads attach to a single level, others
to multiple levels (with shared weights).
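As a rough sketch of how these parts compose (the function and argument
names are illustrative assumptions, not the actual API):

    # Sketch of the meta-architecture: backbone -> optional neck -> head(s).
    def forward(image, backbone, neck=None, heads=()):
        # The backbone turns the input into per-level feature maps.
        features = backbone(image)
        if neck is not None:
            # The optional neck enhances the features across scales.
            features = neck(features)
        # Each head solves one task; a shared head runs on several
        # levels with the same weights.
        return [head(features) for head in heads]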
Neural networks are not scale-invariant; they typically capture large
spatial contexts by processing the input in "levels", each of which
downscales the feature maps by a factor of 2. If the top level of a
backbone (i.e. its "depth") is i, then its feature maps are downscaled by
2^i (the "stride") compared to the input image.
Here's an example model with a backbone (levels 1 to 5), a neck (levels 3 to
7), and a shared head attached to levels 3 to 7.
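Using the formula above (and assuming, for illustration, a 640x640 input),
the stride and feature-map size at each of those levels work out as follows:

    # Stride and feature-map size per level for a 640x640 input
    # (the resolution is an illustrative assumption).
    INPUT_SIZE = 640
    for level in range(1, 8):      # backbone: levels 1-5, neck: levels 3-7
        stride = 2 ** level        # downscaling factor at this level
        size = INPUT_SIZE // stride
        print(f"level {level}: stride {stride}, feature map {size}x{size}")

So level 5 has stride 32 (20x20 feature maps), and level 7 has stride 128
(5x5 feature maps).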
Learning rate schedule
Multi-step
Multiply the learning rate by the given factor at each milestone.
Parameters: milestones, learning rate factor.
One-cycle
Start with a reduced learning rate, increase it to its reference value
during the warm-up period, then decrease it again for the rest of the
schedule. Both schedules are sketched in code below.
Parameters: initial learning rate factor, final learning rate factor,
warm-up period.
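A minimal sketch of both schedules, expressed as multiplicative factors
applied to the reference learning rate (the linear warm-up and cosine decay
shapes in one_cycle_factor are assumptions, not necessarily the exact
curves used):

    import math

    def multistep_factor(epoch, milestones, factor):
        # Apply `factor` once for every milestone already passed.
        return factor ** sum(1 for m in milestones if epoch >= m)

    def one_cycle_factor(progress, initial_factor, final_factor, warmup):
        # `progress` and `warmup` are fractions of the schedule (0..1).
        if progress < warmup:
            # Linear warm-up from the reduced rate to the reference rate.
            return initial_factor + (progress / warmup) * (1.0 - initial_factor)
        # Cosine decay from the reference rate down to the final rate.
        t = (progress - warmup) / (1.0 - warmup)
        return final_factor + 0.5 * (1.0 - final_factor) * (1.0 + math.cos(math.pi * t))

For example, with a reference learning rate of 0.01, milestones at epochs
20 and 40, and a factor of 0.1, epoch 30 trains at
0.01 * multistep_factor(30, [20, 40], 0.1) = 0.001.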
Data
In order to batch samples together, they need to have the same shape.
We automatically scale and pad images to the desired resolution
("letterbox resizing"), thus avoiding any distortion or cropping.
If this option is enabled, arbitrarily sized inputs are scaled to the
training resolution during inference, which makes latency more
predictable. If it is disabled, inputs are processed at their actual
resolution.