Getting started

Hyperparameters

Training jobs are configured by a set of hyperparameters. Every hyperparameter has a default value, but you can adjust any of them to finely control the model architecture and the training process.

It helps to understand the meta-architecture of our models. They are composed of 2 or 3 parts:

  1. Backbone: contains most of the weights. It generates abstract representations of the inputs for the head(s).
  2. Neck: an optional component that enhances features across scales.
  3. Head(s): each head solves a specific ML task. Some heads attach to a single level, others to multiple levels (with shared weights).

Neural networks are not scale-invariant; they typically capture large spatial contexts by processing the input in "levels", each of which downscales the feature maps by a factor of 2. If the top level of a backbone (its "depth") is i, then its feature maps are downscaled by 2^i (the "stride") compared to the input image.
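The relationship between level, stride, and feature-map size can be sketched as follows (illustrative only; the input size of 640 and the function name are assumptions, not part of the product):

```python
# Illustrative only: feature-map sizes at each backbone level for a
# hypothetical 640x640 input. Level i downscales by a stride of 2**i.
def feature_map_size(input_size: int, level: int) -> int:
    """Spatial size of the feature map at the given backbone level."""
    return input_size // (2 ** level)

sizes = {level: feature_map_size(640, level) for level in range(1, 6)}
# level 1 -> 320, level 2 -> 160, level 3 -> 80, level 4 -> 40, level 5 -> 20
```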

Here's an example model with a backbone (levels 1 to 5), a neck (levels 3 to 7), and a shared head (levels 3 to 7).

Trainer

  • Initialization

    Random, ImageNet-pretrained, or the weights of any model from any project you're in
  • Maximum training iterations

  • Maximum training hours

  • Number of validations

  • Batch size

  • Gradient clipping

    Limits gradient norms to stabilize training (optional)
  • Optimizer

    (choose one)
    • SGD
      • learning rate
      • weight decay (optional)
      • momentum (optional)
    • AdamW
      • learning rate
      • weight decay (optional)
  • Scheduler

    (optional, choose one)
    • Multi-step: multiplies the learning rate by the given factor at each milestone
      • milestones
      • learning rate factor
    • One-cycle: starts with a reduced learning rate, increases it to its reference value during the warm-up period, then decreases it again for the rest of the schedule.
      • initial learning rate factor
      • final learning rate factor
      • warm-up period
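The one-cycle schedule can be sketched as a function of the training step. This is a minimal sketch under stated assumptions: the parameter names (`initial_factor`, `final_factor`, `warmup`) are illustrative rather than the product's actual parameter names, and a linear warm-up followed by cosine decay is one plausible shape; the real implementation may differ.

```python
import math

def one_cycle_lr(step: int, total_steps: int, base_lr: float,
                 initial_factor: float = 0.1, final_factor: float = 0.01,
                 warmup: float = 0.3) -> float:
    """Illustrative one-cycle schedule: linear warm-up, then cosine decay."""
    warmup_steps = int(total_steps * warmup)
    if step < warmup_steps:
        # Ramp linearly from initial_factor * base_lr up to base_lr.
        frac = step / max(1, warmup_steps)
        return base_lr * (initial_factor + (1 - initial_factor) * frac)
    # Cosine decay from base_lr down to final_factor * base_lr.
    frac = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * (final_factor + (1 - final_factor)
                      * 0.5 * (1 + math.cos(math.pi * frac)))
```

For example, with `base_lr=0.1` over 100 steps, the rate starts at 0.01, reaches 0.1 at the end of the warm-up, and decays to 0.001 by the last step.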

Data

In order to batch samples together, they need to have the same shape. We automatically scale and pad images to the desired resolution ("letterbox resizing"), thus avoiding any distortion or cropping.
  • Image height

    in pixels, between 32 and 2048
  • Image width

    in pixels, between 32 and 2048
  • Image channels

    RGB = 3, grayscale = 1, multispectral = many
  • Data augmentations

    (optional)
    A single augmentation is randomly chosen and applied to each training sample.
    A "no-op" is always added to the set of augmentations to choose from, so some samples pass through unchanged.
    • Horizontal flip
    • Vertical flip
    • Rotate (0 to 360°)
    • Rotate (90°, 180°, 270°)
    • Zoom in
    • Zoom out
    • JPEG compression
    • Gaussian noise
    • Low resolution
    • Gaussian blur
    • Affine
    • Perspective
    • Elastic (local)
    • Elastic (global)
    • Hue
    • Saturation
    • Brightness
    • Contrast
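The "letterbox resizing" described above (scale to fit the target resolution without distortion, then pad the remainder) can be sketched as a pure geometry computation. The function and parameter names are illustrative assumptions, not the product's API:

```python
# Sketch of letterbox resizing: scale the image to fit the target canvas
# while preserving aspect ratio, then pad the leftover space symmetrically.
def letterbox_geometry(src_w: int, src_h: int, dst_w: int, dst_h: int):
    """Return the scaled size and per-side padding for a letterbox resize."""
    scale = min(dst_w / src_w, dst_h / src_h)  # preserve aspect ratio
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = dst_w - new_w, dst_h - new_h  # filled with padding pixels
    return (new_w, new_h), (pad_x // 2, pad_y // 2)

# A 1920x1080 image letterboxed onto a 640x640 training canvas:
size, pad = letterbox_geometry(1920, 1080, 640, 640)
# size == (640, 360): scaled without distortion
# pad == (0, 140): 140 padding pixels above and below
```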

Backbone

  • Architecture

    (choose one)
    • ResNet 18, 34, 50, 101, 152
    • EfficientNet B0, B1, B2, B3, B4, B5, B6, B7
    • EfficientNet V2 small, medium, large
    • MobileNet V2, V3 small, V3 large
    • [more coming soon...]
  • Resize input?

    If enabled, arbitrarily sized inputs will be scaled to the training resolution during inference (which makes latency more predictable). If disabled, inputs are processed at their actual resolution.
  • Top level ("depth")

    Level i has stride 2^i

Neck

  • Architecture

    (optional, choose one)
    • FPN
      • Levels to fuse
      • Output channels: output feature maps from levels fused by the neck will all have this number of channels
    • BiFPN
      • Levels to fuse
      • Output channels: output feature maps from levels fused by the neck will all have this number of channels
      • Number of layers: how many repetitions of this bi-directional neck to stack
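The FPN-style fusion described above can be sketched in NumPy. This is illustrative only: it assumes the per-level 1x1 projections to a common channel count have already been applied, and uses nearest-neighbour upsampling for the top-down pathway; the actual neck implementation may differ.

```python
import numpy as np

def fpn_fuse(features: dict[int, np.ndarray]) -> dict[int, np.ndarray]:
    """Top-down FPN fusion. `features` maps level -> (C, H, W) array,
    where every level has the same channel count C."""
    levels = sorted(features)                    # e.g. [3, 4, 5]
    fused = {levels[-1]: features[levels[-1]]}   # coarsest level passes through
    for lvl in reversed(levels[:-1]):
        # Nearest-neighbour 2x upsample of the coarser fused map, then add.
        up = fused[lvl + 1].repeat(2, axis=1).repeat(2, axis=2)
        fused[lvl] = features[lvl] + up
    return fused
```

Because each level's output accumulates context from all coarser levels, fine levels gain large-scale context while keeping their spatial resolution.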

Head

(see each task's docs)