Getting started

Datasets

Datasets are the most important part of a deep learning project. A model can only be as good as the data it trains on. In other words, garbage in, garbage out.
This motivates the need to carefully curate datasets, and to iterate on them often to improve their quality, size, and diversity.

Our dataset formats aim to be easily understood by humans; they therefore put more emphasis on clarity than on efficiency, compared to most other formats used in research.
If you have very large datasets, for which efficiency is necessary, please contact us at support@sihl.ai.

Datasets usually follow a directory structure similar to this:

.
├── images/
├── metadata.yaml
└── annotations.yaml

Sihl AI expects datasets to be stored in S3-compatible object storage. To access such storage, you typically need the following pieces of information: an endpoint URL, a bucket name, an access key ID, and a secret access key.

To interact with such object storage services, we recommend using graphical apps like Cyberduck or command-line tools like s5cmd.
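
As a purely illustrative checklist (the field names below are ours, not a Sihl configuration file; how you actually provide these values depends on the tool or SDK you use), the required pieces of information look like this:

endpoint_url: https://s3.example.com   # URL of the S3-compatible service
bucket: my-dataset-bucket              # bucket containing the dataset
access_key_id: EXAMPLEKEYID            # credential; keep private
secret_access_key: EXAMPLESECRET       # credential; keep private
region: us-east-1                      # sometimes optional for S3-compatible services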

The images/ folder is always required. Supported formats are JPEG (".jpeg" or ".jpg"), PNG (".png") and TIFF (".tiff" or ".tif").

For self-supervised tasks (view-invariance learning, autoencoding and anomaly detection), only the images/ folder is needed.

The metadata.yaml file is required for supervised tasks, and provides information about the dataset as a whole. It must be named "metadata", but can be formatted as YAML (".yaml" or ".yml") or JSON (".json").

The metadata file might specify one or more annotation file(s) or folder(s), depending on the task(s). For example, a scene text recognition dataset might have an annotations.yaml file containing image-name-to-text-string mappings (sketched below), while a panoptic segmentation dataset might have an annotations/ folder containing PNG segmentation maps.
When annotations specify pixel coordinates, the origin is always at the top-left corner of the image, with x increasing to the right and y increasing downward.
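
For instance, a scene text recognition annotations file could be as simple as a mapping from image name to text string (the file contents below are purely illustrative):

annotations.yaml
image_001.jpg: "STOP"
image_002.jpg: "Main Street"
image_003.jpg: "EXIT 12"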

Refer to each task's docs to get more details about the corresponding annotation format.
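
As a hypothetical illustration of the coordinate convention (the actual schema is task-specific, so treat the field names below as placeholders), a keypoint located 120 pixels from the left edge and 45 pixels from the top edge would be written as:

image_001.jpg:
  keypoints:
    - [120, 45]  # [x, y]: x measured from the left edge, y from the top edge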

Some datasets have multiple types of annotations per image. A very popular example is the COCO dataset, which contains object detection, keypoint detection, instance segmentation (and more) annotations per image!

To indicate that a dataset is multitask annotated, simply provide a tasks array in the metadata file, like so:

metadata.yaml
tasks:
  - task: object detection
    annotations: object_detection_annotations.yaml
    categories: [cat1, cat2, cat3]
  - task: keypoint detection
    annotations: keypoint_detection_annotations.yaml
    keypoints: [kpt1, kpt2, kpt3]
  - task: instance segmentation
    annotations: instance_segmentation_annotations/
    categories: [cat1, cat2, cat3]
    colors:
      cat1: [255, 0, 0]  # red
      cat2: [0, 255, 0]  # green
      cat3: [0, 0, 255]  # blue

Notice that the tasks are ordered. This is important: training a multitask model on this dataset will produce a multi-headed model, where the index of each head corresponds to the index of the corresponding task in the metadata file.
When performing multitask training, the losses of all tasks are simply summed and minimized together.
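
For the three-task example above, the optimized objective is therefore (written informally):

total_loss = object_detection_loss + keypoint_detection_loss + instance_segmentation_loss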

The annotation files/folders of each task must be independent from the others, even though they refer to the same images/. This keeps tasks loosely coupled, so they can easily be added, removed, or modified.
Images do not need to have an annotation for every task: simply omit an image from a task's annotations (or set its annotation to null) to mark it as unannotated for that task. During training, the corresponding head will ignore the loss contribution from these samples.
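
For example, in a hypothetical per-task annotations file, an unannotated image can simply be set to null (or omitted entirely):

annotations.yaml
image_001.jpg: "STOP"
image_004.jpg: null  # unannotated for this task; excluded from this head's loss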