Getting started

Hyperparameters

Training jobs are configured by a set of hyperparameters. All of them have default values, but you can adjust them to finely control the model architecture and the training process.

It helps to understand the meta-architecture of our models. They are composed of 2 or 3 parts:

  1. Backbone: contains most of the weights. It generates abstract representations of the inputs for the head(s).
  2. Neck: an optional layer that enhances features across scales.
  3. Head(s): each head solves a specific ML task. Some heads attach to a single level, others to multiple (with shared weights).

Neural networks are not scale-invariant; they typically capture large spatial contexts by processing the input in "levels", each of which downscales the feature maps by a factor of 2. If the top level of a backbone (i.e. its "depth") is i, then its feature maps are downscaled by 2^i (the "stride") compared to the input image.
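The depth-to-stride relationship can be sketched in a couple of lines (the function names here are illustrative, not part of the platform):

```python
def level_stride(level: int) -> int:
    """Stride of level i relative to the input image: 2**i."""
    return 2 ** level

def feature_map_size(input_px: int, level: int) -> int:
    """Spatial size of a feature map at a given level,
    assuming an exact halving at each level."""
    return input_px // level_stride(level)

# A 512x512 input at backbone depth 5 yields 16x16 feature maps (stride 32).
print(level_stride(5))           # 32
print(feature_map_size(512, 5))  # 16
```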

Here's an example model with a backbone (levels 1 to 5), a neck (levels 3 to 7), and a shared head (levels 3 to 7):

    block-beta
    columns 7
  
    classDef title fill:#0000,stroke:#0000;
  
    block:level_group
      columns 1
      Level
      l7["7"] space l6["6"] space l5["5"] space l4["4"] space l3["3"] space l2["2"] space l1["1"] space l0["0"]
  
      class Level,l7,l6,l5,l4,l3,l2,l1,l0 title
    end
    class level_group title
  
    block:stride_group
      columns 1
      Stride
      s7["128"] space s6["64"] space s5["32"] space s4["16"] space s3["8"] space s2["4"] space s1["2"] space s0["1"]
  
      class Stride,s7,s6,s5,s4,s3,s2,s1,s0 title
    end
    class stride_group title
  
  
    block:input_group
      columns 1
      space:15
      image
    end
    class input_group title

    block:backbone_group:1
      columns 1
      Backbone
      space:4 B5 space B4 space B3 space B2 space B1 space:2
      class Backbone title
    end

    image --> B1
    B1 --> B2
    B2 --> B3
    B3 --> B4
    B4 --> B5

    block:neck_group:1
      columns 1
      Neck
      N7 space N6 space N5 space N4 space N3 space:6
      class Neck title
    end
    
    B5 --> N5
    B4 --> N4
    B3 --> N3

    N4 --> N3
    N5 --> N4
    N5 --> N6
    N6 --> N7

    block:head_group:1
      columns 1
      Head
      h7["H"] space h6["H"] space h5["H"] space h4["H"] space h3["H"] space:6
      class Head title
    end
    N7 --> h7
    N6 --> h6
    N5 --> h5
    N4 --> h4
    N3 --> h3
  

Trainer

  • Initialization

    random, ImageNet-pretrained, or the weights of any model from any project you're in
  • Maximum training iterations

  • Maximum training hours

  • Number of validations

  • Batch size

  • Gradient clipping

    limit gradient norms to stabilize training (optional)
  • Optimizer

    (choose one)
    • SGD
      • learning rate
      • weight decay (optional)
      • momentum (optional)
    • AdamW
      • learning rate
      • weight decay (optional)
  • Scheduler

    (optional, choose one)
    • Multi-step: multiply the learning rate by the given factor at each milestone
      • milestones
      • learning rate factor
    • One-cycle: start with a reduced learning rate, increase it to its reference value during the warm-up period, then decrease it again for the rest of the schedule.
      • initial learning rate factor
      • final learning rate factor
      • warm-up period
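The two scheduler shapes above can be sketched as plain functions. This is a hedged illustration: the linear warm-up and cosine decay used for one-cycle here are assumptions about the curve shape, not a statement of the platform's exact implementation:

```python
import math

def multi_step_lr(base_lr, factor, milestones, step):
    """Multiply base_lr by `factor` once for every milestone already passed."""
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** passed

def one_cycle_lr(ref_lr, initial_factor, final_factor, warmup_steps, total_steps, step):
    """Linear warm-up from ref_lr * initial_factor to ref_lr,
    then cosine decay down to ref_lr * final_factor."""
    if step < warmup_steps:
        t = step / warmup_steps
        return ref_lr * (initial_factor + (1 - initial_factor) * t)
    t = (step - warmup_steps) / (total_steps - warmup_steps)
    return ref_lr * (final_factor + (1 - final_factor) * 0.5 * (1 + math.cos(math.pi * t)))

# Multi-step: drop by 10x at steps 30 and 60.
print(multi_step_lr(0.1, 0.1, [30, 60], 45))          # one milestone passed
# One-cycle: peak learning rate is reached at the end of warm-up.
print(one_cycle_lr(0.1, 0.1, 0.01, 100, 1000, 100))
```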

Data

In order to batch samples together, they need to have the same shape. We automatically scale and pad images to the desired resolution ("letterbox resizing"), thus avoiding any distortion or cropping.
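Letterbox resizing can be sketched as follows (a minimal illustration; the function and parameter names are ours, not the platform's):

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Return (scaled_w, scaled_h, pad_left, pad_top) that fit the source
    image into the destination while preserving its aspect ratio;
    the remaining space is filled with padding."""
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top

# A 1920x1080 image letterboxed into 640x640: scaled to 640x360,
# with 140px of padding above (and below).
print(letterbox_params(1920, 1080, 640, 640))  # (640, 360, 0, 140)
```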
  • Image height

    in pixels, between 32 and 2048
  • Image width

    in pixels, between 32 and 2048
  • Image channels

    RGB = 3, grayscale = 1, multispectral = many
  • Data augmentations

    (optional)
    A single augmentation is randomly chosen and applied to each training sample.
    A "no-op" is always included in the set of augmentations to choose from, so some samples pass through unchanged.
    • Horizontal flip
    • Vertical flip
    • Rotate (0 to 360°)
    • Rotate (90°, 180°, 270°)
    • Zoom in
    • Zoom out
    • JPEG compression
    • Gaussian noise
    • Low resolution
    • Gaussian blur
    • Affine
    • Perspective
    • Elastic (local)
    • Elastic (global)
    • Hue
    • Saturation
    • Brightness
    • Contrast
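The per-sample selection rule above can be sketched like this (a hypothetical helper; the stand-in list operations merely mimic real image transforms):

```python
import random

def make_chooser(augmentations, seed=None):
    """Return a function that applies ONE randomly chosen augmentation
    (including the mandatory no-op) to each sample it receives."""
    rng = random.Random(seed)
    ops = {"no-op": lambda img: img, **augmentations}  # no-op always in the set
    names = list(ops)
    def apply(img):
        name = rng.choice(names)
        return name, ops[name](img)
    return apply

# Stand-in ops on a nested-list "image"; real ops would act on pixel arrays.
augs = {
    "horizontal flip": lambda img: [row[::-1] for row in img],
    "vertical flip": lambda img: img[::-1],
}
chooser = make_chooser(augs, seed=0)
name, out = chooser([[1, 2], [3, 4]])
```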

Backbone

  • Architecture

    (choose one)
    • ResNet 18, 34, 50, 101, 152
    • EfficientNet B0, B1, B2, B3, B4, B5, B6, B7
    • EfficientNet V2 small, medium, large
    • MobileNet V2, V3 small, V3 large
    • [more coming soon...]
  • Resize input?

    If enabled, arbitrarily sized inputs will be scaled to the training resolution during inference (which makes latency more predictable). If disabled, inputs are processed at their actual resolution.
  • Top level ("depth")

    Level i has stride 2^i

Neck

  • Architecture

    (optional, choose one)
    • FPN
      • Levels to fuse
      • Output channels: output feature maps from the levels fused by the neck will all have this number of channels
    • BiFPN
      • Levels to fuse
      • Output channels: output feature maps from the levels fused by the neck will all have this number of channels
      • Number of layers: how many repetitions of this bi-directional neck to stack

Head

(see each task's docs)