Tasks

Scene text recognition

Scene text recognition models predict a variable-length sequence of tokens (the "text") associated with the input image.
Some call this task OCR (optical character recognition), but it is important to highlight that a token isn't necessarily a single character, and might not even represent printable characters; it is simply an atomic sequence element.
Scene text recognition differs from document text recognition in that it is not well suited to long sequences of text (e.g. paragraphs). Scene text recognition models are, however, faster to run and sufficiently accurate for short sequences embedded in natural scenes (e.g. license plate numbers).

Dataset format

Datasets follow this structure:

endpoint_url/bucket
├── prefix/images/
├── prefix/annotations.yaml
└── prefix/metadata.yaml

Dataset images are placed directly inside images/ (subdirectories are ignored).
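As an illustration of the "directly inside images/" rule, here is a minimal sketch that filters a list of bucket object keys. The function name direct_image_names and the idea of working on a plain list of keys are assumptions made for this example, not part of the system's actual API.

```python
def direct_image_names(object_keys, prefix):
    """Return file names stored directly under <prefix>/images/.

    Hypothetical helper: `object_keys` is assumed to be a flat list of
    bucket keys such as "prefix/images/000.jpg".
    """
    images_prefix = prefix.rstrip("/") + "/images/"
    names = []
    for key in object_keys:
        if not key.startswith(images_prefix):
            continue  # not under images/ at all
        relative = key[len(images_prefix):]
        # Keep only direct children: a "/" means the key is in a subdirectory.
        if relative and "/" not in relative:
            names.append(relative)
    return names


keys = [
    "prefix/images/000.jpg",
    "prefix/images/extra/001.jpg",  # inside a subdirectory, so skipped
    "prefix/annotations.yaml",
]
print(direct_image_names(keys, "prefix"))
```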
The metadata file looks something like this:

metadata.yaml
task: scene text recognition
annotations: annotations.yaml
tokens: [A, B, C, "1", "2", "3", "<SPACE>"]

Caution: tokens cannot contain whitespace characters (like the space character " "). We recommend replacing them with special tokens, like ⎵ (Unicode U+23B5). In this example, we're using <SPACE> instead.
Our system automatically casts all tokens to strings, but it is nevertheless good practice to explicitly quote expressions that naive YAML parsers could interpret as other types (e.g. numbers and booleans).
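When annotations are produced from raw strings, the whitespace replacement can be done before writing the YAML file. The following is a minimal sketch under the assumption that only the plain space character needs a substitute; text_to_tokens and its space_token parameter are hypothetical names introduced for this example.

```python
def text_to_tokens(text, space_token="<SPACE>"):
    """Split raw text into single-character tokens, replacing spaces.

    Hypothetical helper: other whitespace (tabs, newlines) is rejected
    rather than silently mapped, since tokens cannot contain whitespace.
    """
    tokens = []
    for ch in text:
        if ch == " ":
            tokens.append(space_token)
        elif ch.isspace():
            raise ValueError(f"unsupported whitespace character: {ch!r}")
        else:
            tokens.append(ch)
    return tokens


print(text_to_tokens("AA 2"))
```

The resulting list can be written as-is to the annotations file, provided every element (including the special token) is declared in the tokens field of the metadata.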

Note: If the "tokens" field is a string rather than an array of strings, it will be interpreted as an array of single-character tokens. So tokens: abcdef is considered equivalent to tokens: [a, b, c, d, e, f].
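The rules above for the tokens field can be sketched as a small normalization step. This is only an illustration of the described behavior, not the system's actual code; normalize_tokens is a hypothetical name, and its input is assumed to be the value a YAML parser returned for the tokens field.

```python
def normalize_tokens(tokens):
    """Return the token vocabulary as a list of strings."""
    if isinstance(tokens, str):
        # A plain string is shorthand for a list of single-character tokens.
        tokens = list(tokens)
    # A YAML parser may return ints or booleans for unquoted entries.
    # Casting with str() may not reproduce the original spelling
    # (e.g. YAML `true` becomes "True"), which is why quoting is safer.
    normalized = [str(token) for token in tokens]
    for token in normalized:
        if any(ch.isspace() for ch in token):
            raise ValueError(f"token contains whitespace: {token!r}")
    return normalized


print(normalize_tokens("abcdef"))
print(normalize_tokens(["A", "B", "C", 1, 2, 3, "<SPACE>"]))
```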

The annotations field specifies the name of the file containing the ground truth annotations.
Here's an example annotations file:

annotations.yaml
000.jpg: [A, A, "<SPACE>", "2"]
001.jpg: AB32  # interpreted as [A, B, "3", "2"]
002.jpg: []  # no text (could also be "")
# ...

The ground-truth text annotations should only use tokens specified in the metadata.
If an image has no text in it, it has to be explicitly assigned [] or "".
Images assigned to null are ignored!
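The annotation rules can be checked before uploading a dataset. The sketch below assumes the annotations have already been loaded into a dictionary (for example with a YAML parser) and that null values arrive as None; parse_annotations is a hypothetical helper, not the system's own loader.

```python
def parse_annotations(raw_annotations, tokens):
    """Map each image name to a list of string tokens.

    Hypothetical helper: images whose annotation is None are skipped,
    and tokens missing from the vocabulary raise a ValueError.
    """
    vocabulary = set(tokens)
    parsed = {}
    for image_name, label in raw_annotations.items():
        if label is None:
            continue  # null annotation: the image is ignored
        if not isinstance(label, (list, tuple)):
            # A scalar such as AB32 is read as single-character tokens.
            # Casting non-string scalars with str() is an assumption here.
            label = list(str(label))
        sequence = [str(token) for token in label]
        unknown = [token for token in sequence if token not in vocabulary]
        if unknown:
            raise ValueError(f"{image_name}: unknown tokens {unknown}")
        parsed[image_name] = sequence
    return parsed


tokens = ["A", "B", "C", "1", "2", "3", "<SPACE>"]
raw = {
    "000.jpg": ["A", "A", "<SPACE>", "2"],
    "001.jpg": "AB32",
    "002.jpg": [],    # image with no text
    "003.jpg": None,  # ignored
}
print(parse_annotations(raw, tokens))
```

Note that an empty list and an empty string both yield an empty sequence, while None removes the image from the result entirely.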