Datasets are the most important part of a deep learning project. A model can
only be as good as the data it trains on. In other words, garbage in, garbage out.
This motivates the need to carefully curate datasets, and to iterate on them
often to improve their quality, size, and diversity.
Our dataset formats aim to be easily understood by humans; they therefore put
more emphasis on clarity than on efficiency, compared to most other formats
used in research.
If you have very large datasets, for which efficiency is necessary, please contact
us at support@sihl.ai.
Dataset formats usually follow a structure similar to this:
.
├── images/
├── metadata.yaml
└── annotations.yaml
Sihl AI expects datasets to be stored in S3-compatible object storage, which must be accessible with the appropriate connection details and credentials.
The images/ folder is always required. Supported formats are JPEG (".jpeg" or ".jpg"), PNG (".png") and TIFF (".tiff" or ".tif").
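For example (the file names here are hypothetical), an images/ folder can mix any of the supported formats:
images/
├── image_001.jpg
├── image_002.png
└── image_003.tif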
For self-supervised tasks (view-invariance learning, autoencoding and anomaly detection), only the images/ folder is needed.
The metadata.yaml file is required for supervised tasks, and provides information about the dataset as a whole. It must be named "metadata", and can be YAML (".yaml" or ".yml") or JSON (".json") formatted.
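As a sketch only, assuming a single-task metadata file uses the same keys as the per-task entries in the multitask example further down (the authoritative schema is given in each task's docs), a metadata.yaml for an object detection dataset could look like:
task: object detection
annotations: annotations.yaml
categories: [cat1, cat2, cat3]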
The metadata file may specify one or more annotation files or folders,
depending on the task(s). For example, a scene text recognition dataset might have an annotations.yaml file
mapping image names to text strings, while a panoptic segmentation dataset might have an annotations/ folder containing
PNG segmentation maps.
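For instance, an annotations.yaml for the scene text recognition case above might look like the following sketch (the file names and strings are made up; the exact layout is defined in the task's docs):
image_001.jpg: "OPEN 24 HOURS"
image_002.png: "MAIN STREET"
image_003.tif: "EXIT 12"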
When annotations specify pixel coordinates, the origin is always at the
top-left of the image, with x increasing to the right and y increasing
downward.
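As a quick illustration, for a hypothetical 640x480 (width x height) image, the corner pixels sit at the following [x, y] coordinates:
top_left: [0, 0]         # the origin
top_right: [639, 0]      # x grows to the right
bottom_left: [0, 479]    # y grows downward
bottom_right: [639, 479]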
Refer to each task's docs to get more details about the corresponding annotation format.
Some datasets have multiple types of annotations per image. A very popular example is the COCO dataset, which contains object detection, keypoint detection, instance segmentation (and more) annotations per image!
To indicate that a dataset is multitask annotated, simply provide a tasks array in the metadata file, like so:
tasks:
  - task: object detection
    annotations: object_detection_annotations.yaml
    categories: [cat1, cat2, cat3]
  - task: keypoint detection
    annotations: keypoint_detection_annotations.yaml
    keypoints: [kpt1, kpt2, kpt3]
  - task: instance segmentation
    annotations: instance_segmentation_annotations/
    categories: [cat1, cat2, cat3]
    colors:
      cat1: [255, 0, 0] # red
      cat2: [0, 255, 0] # green
      cat3: [0, 0, 255] # blue
Notice that the tasks are ordered. This is important: training a
multitask model on this dataset produces a multi-headed model, where the
index of each head matches the index of the corresponding task in the
metadata file. In the example above, the first head handles object detection,
the second keypoint detection, and the third instance segmentation.
When performing multitask training, the losses of the individual tasks are
simply summed (L_total = L_1 + L_2 + ... + L_T) and minimized together.
The annotation files/folders of each task must be independent from the others,
even though they refer to the same images/.
This keeps tasks loosely coupled, so they can easily be added, removed, or
modified.
Images do not need to have an annotation for every task: simply skip certain
annotations (or set them to null) to mark an image as unannotated for that
task. During training, the corresponding head will ignore the loss
contribution from these samples.
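As a sketch only (file names are hypothetical, and annotated entries are left out because their layout is task-specific), one task's annotation file could mark unannotated images like this:
image_001.jpg: null   # explicitly unannotated: this head skips the sample's loss contribution
# image_002.png does not appear in this file at all, which also marks it as unannotated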