Each project has access to a single dataset (or a single pair of training and validation datasets).
Project admins can change the dataset credentials at any time.
Of course, the dataset itself can change too. In fact, we encourage you to regularly
improve the quality, quantity, and diversity of your data; freeing up resources
to do exactly that is one of the main points of automated ML training, after all.
Our dataset formats aim to be easily understood by humans; they therefore put
more emphasis on clarity than on efficiency.
If you have very large datasets for which efficiency is necessary, please contact
us at support@sihl.ai.
Dataset formats usually follow a structure similar to this:
endpoint_url/bucket
├── prefix/images/
├── prefix/metadata.yaml
└── prefix/annotations.yaml
The bucket is accessed with an access key and a secret key. This pair of keys must have read permissions on the bucket's objects (and optionally write permissions too).
S3-compatible object storage providers always supply an endpoint
URL (which looks like "https://[...].com").
The prefix/ is optional. It is useful if you
want to store multiple datasets in the same bucket.
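As a quick sanity check of your credentials, here is a minimal sketch (in Python, using boto3) that lists a dataset's objects; the endpoint URL, bucket name, prefix, and keys below are placeholders to replace with your own:

import boto3

# Placeholder endpoint and credentials; substitute your provider's values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

# List every object under the (optional) prefix.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="prefix/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])  # e.g. prefix/images/img_0001.jpg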
The images/ folder is always required. Supported formats are JPEG (".jpeg" or ".jpg"), PNG (".png") and TIFF (".tiff" or ".tif").
For self-supervised tasks (view-invariance learning, autoencoding and anomaly detection), only the images/ folder is needed.
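If you populate the bucket programmatically, a sketch like the following (again with placeholder names and credentials) uploads a local folder while keeping only the supported image formats:

import boto3
from pathlib import Path

# Placeholder client; configure it as in the previous sketch.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.example.com",
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

SUPPORTED_EXTENSIONS = {".jpeg", ".jpg", ".png", ".tiff", ".tif"}

# Upload a local folder, skipping any unsupported file format.
for path in Path("local_dataset/images").iterdir():
    if path.suffix.lower() in SUPPORTED_EXTENSIONS:
        s3.upload_file(str(path), "my-bucket", f"prefix/images/{path.name}")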
The metadata.yaml file is required for supervised tasks and provides information about the dataset as a whole. It must be named "metadata" and can be YAML (".yaml" or ".yml") or JSON (".json").
The metadata file might specify some annotation file(s) or folder(s),
depending on the task(s). For example, a scene text recognition dataset might have an annotations.yaml file
containing image-name-to-text-string mappings, while a panoptic segmentation dataset might have an annotations/ folder containing
PNG segmentation maps.
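To illustrate the scene text recognition case, such a mapping could look like the following (the file content here is purely hypothetical, not a schema reference):

img_0001.jpg: STOP
img_0002.jpg: EXIT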
When annotations specify pixel coordinates, we always consider the origin to be
at the top-left of the image, with x increasing to the right and y increasing
downward.
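This is the same convention Pillow uses, so a box annotated as (x_min, y_min, x_max, y_max) can be cropped directly; the image path and box values below are placeholders:

from PIL import Image

# The origin (0, 0) is the top-left pixel; y grows toward the bottom row.
image = Image.open("images/img_0001.jpg")
x_min, y_min, x_max, y_max = 10, 20, 110, 220
crop = image.crop((x_min, y_min, x_max, y_max))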
This applies to dealing with (S3-compatible) object storage datasets in
general.
There are three main approaches:
As for annotating, there are many open-source tools to choose from (such as CVAT or labelme), although none of them currently export to our data format. This means that you (or your annotators) might need to write scripts to convert between data formats, as in the sketch below.
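For example, here is a sketch that converts a folder of labelme JSON exports into the image-name-to-text-string mapping shown earlier; the folder name is a placeholder, and it assumes each image has a single labeled shape whose label holds the text string:

import json
from pathlib import Path

import yaml

annotations = {}
for json_path in Path("labelme_exports").glob("*.json"):
    data = json.loads(json_path.read_text())
    # Assumes the first labeled shape's label is the image's text string.
    annotations[data["imagePath"]] = data["shapes"][0]["label"]

# Write the image-name-to-text-string mapping as annotations.yaml.
with open("annotations.yaml", "w") as f:
    yaml.safe_dump(annotations, f)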