Getting started

Datasets

Each project has access to a single dataset (or a single pair of training and validation datasets).

Project admins can change the dataset credentials whenever they need to.
Of course, the dataset itself can change too. In fact, we encourage you to regularly improve the quality, quantity, and diversity of your data; freeing up resources to do so is one of the main points of automated ML training, after all.

Dataset format

Our dataset format aims to be easily understood by humans; it therefore puts more emphasis on clarity than on efficiency.
If you have very large datasets, for which efficiency is necessary, please contact us at support@sihl.ai.

Dataset formats usually follow a structure similar to this:

endpoint_url/bucket
├── prefix/images/
├── prefix/metadata.yaml
└── prefix/annotations.yaml

Datasets are accessed with an access key and a secret key. This key pair must have read permissions on the bucket's objects (and optionally write permissions too).

Your S3-compatible object storage provider supplies the endpoint URL (of the form "https://[...].com").
The prefix/ is optional. It is useful if you want to store multiple datasets in the same bucket.
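
For example, here is a minimal sketch of connecting to such a bucket with Python and the boto3 library; the endpoint URL, bucket name, prefix, and keys below are placeholders for your own project's values:

  import boto3

  # Placeholder credentials and locations: substitute your project's values.
  s3 = boto3.client(
      "s3",
      endpoint_url="https://example-endpoint.com",
      aws_access_key_id="YOUR_ACCESS_KEY",
      aws_secret_access_key="YOUR_SECRET_KEY",
  )

  # List the dataset's objects under the (optional) prefix.
  response = s3.list_objects_v2(Bucket="your-bucket", Prefix="prefix/")
  for obj in response.get("Contents", []):
      print(obj["Key"])  # e.g. prefix/images/0001.jpg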

The images/ folder is always required. Supported formats are JPEG (".jpeg" or ".jpg"), PNG (".png") and TIFF (".tiff" or ".tif").
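
If you want to verify that every image uses one of these formats before uploading, a quick local sketch (assuming your dataset mirrors the layout above on disk; the path is a placeholder) could look like this:

  from pathlib import Path

  SUPPORTED = {".jpeg", ".jpg", ".png", ".tiff", ".tif"}

  # Flag any file in images/ whose extension is not supported.
  for path in Path("prefix/images").iterdir():  # placeholder local path
      if path.is_file() and path.suffix.lower() not in SUPPORTED:
          print(f"unsupported format: {path.name}")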

For self-supervised tasks (view-invariance learning, autoencoding and anomaly detection), only the images/ folder is needed.

The metadata.yaml file is required for supervised tasks, and provides information about the dataset as a whole. It must be named "metadata", but can be YAML (".yaml" or ".yml") or JSON (".json").
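
Since the metadata file may be either YAML or JSON, a small loader sketch (the path is a placeholder, and the contents depend on your task) could be:

  import json
  from pathlib import Path

  import yaml  # PyYAML

  def load_metadata(path):
      # Dispatch on the extension; the schema is task-dependent.
      path = Path(path)
      with open(path) as f:
          if path.suffix == ".json":
              return json.load(f)
          return yaml.safe_load(f)  # handles ".yaml" and ".yml"

  metadata = load_metadata("prefix/metadata.yaml")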

The metadata file might specify some annotation file(s) or folder(s) depending on the task(s). For example, a scene text recognition dataset might have an annotations.yaml file containing image-name-to-text-string mappings, while a panoptic segmentation dataset might have an annotations/ folder containing PNG segmentation maps.
When annotations specify pixel coordinates, we always consider the origin at the top-left of the image (x increasing to the right and y increasing to the bottom).
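
If your annotation tool uses a bottom-left origin instead, flipping a point into our convention is a one-liner; this is a generic sketch (assuming integer pixel indices), not part of our format:

  def to_top_left_origin(x, y, image_height):
      # x is unchanged; y is flipped so that it increases downward.
      return x, image_height - 1 - y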

Interacting with your dataset

This applies to (S3-compatible) object storage datasets in general.
There are three main approaches:

  • Mounting the bucket with products like CloudMounter or Mountain Duck and browsing it through your operating system's file system.
  • Command-line interface tools like aws-cli, rclone, or s5cmd.
  • Writing your own scripts using the AWS SDK in your preferred language (e.g. Python), as sketched below.
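
As an example of the third approach, here is a minimal sketch that downloads every object under the prefix with boto3 (bucket, prefix, endpoint, and keys are placeholders, as before):

  import os
  import boto3

  s3 = boto3.client(
      "s3",
      endpoint_url="https://example-endpoint.com",
      aws_access_key_id="YOUR_ACCESS_KEY",
      aws_secret_access_key="YOUR_SECRET_KEY",
  )

  # Paginate, in case the dataset holds more than 1000 objects.
  paginator = s3.get_paginator("list_objects_v2")
  for page in paginator.paginate(Bucket="your-bucket", Prefix="prefix/"):
      for obj in page.get("Contents", []):
          key = obj["Key"]
          if key.endswith("/"):  # skip folder marker objects
              continue
          local_path = os.path.join("dataset", key)
          os.makedirs(os.path.dirname(local_path), exist_ok=True)
          s3.download_file("your-bucket", key, local_path)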

As for annotating, there are many open-source tools available (e.g. CVAT or labelme), although none of them export in our data format yet. This means that you (or your annotators) might need to write scripts to convert between data formats, as sketched below.
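
As a purely hypothetical example, suppose your annotators export a CSV of image-name/text pairs for the scene text recognition task mentioned above; converting it into the annotations.yaml mapping could look like this (the CSV layout and file names are assumptions, not something CVAT or labelme actually produces):

  import csv
  import yaml

  # Hypothetical input: one "image_name,text" row per annotated image.
  mapping = {}
  with open("exported_annotations.csv", newline="") as f:
      for image_name, text in csv.reader(f):
          mapping[image_name] = text

  # Write the image-name-to-text mapping described above.
  with open("annotations.yaml", "w") as f:
      yaml.safe_dump(mapping, f)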