Tasks

Autoregressive text recognition

Autoregressive text recognition models predict a variable-length sequence of tokens (the "text") associated with the input image.
Some call this task OCR (optical character recognition), but it is important to highlight that a token isn't necessarily a single character, and might not even represent printable characters; it is simply an atomic sequence element.
Autoregressive text recognition differs from "normal" scene text recognition in that it is inherently sequential: it predicts text token by token, in a loop, which is slower but often more accurate than the parallel, convolutional approach.
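
To make the "token by token, in a loop" idea concrete, here is a minimal sketch of greedy autoregressive decoding. It is not this system's implementation: the next_token callable, the <EOS> end marker, and the max_len cap are all hypothetical names introduced for illustration, and toy_model simply replays a fixed sequence in place of a real image-conditioned model.

```python
from typing import Callable, List, Sequence

EOS = "<EOS>"  # hypothetical end-of-sequence marker (an assumption, not from the docs)


def greedy_decode(
    next_token: Callable[[Sequence[str]], str],
    max_len: int = 32,
) -> List[str]:
    """Build the output one token at a time, feeding the prefix back in."""
    tokens: List[str] = []
    for _ in range(max_len):
        tok = next_token(tokens)  # each step depends on all previous tokens
        if tok == EOS:
            break
        tokens.append(tok)
    return tokens


def toy_model(prefix: Sequence[str]) -> str:
    """Stand-in for a real model: replays a fixed target, then emits EOS."""
    target = ["A", "B", "<SPACE>", "1"]
    return target[len(prefix)] if len(prefix) < len(target) else EOS


decoded = greedy_decode(toy_model)
print(decoded)
```

Because every step waits for the previous one, the loop cannot be parallelized across output positions, which is where the extra latency comes from.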

Dataset format

Datasets follow this structure:

endpoint_url/bucket
├── prefix/images/
├── prefix/annotations.yaml
└── prefix/metadata.yaml

Dataset images are placed directly inside images/ (subdirectories are ignored).
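
As an illustration of the "directly inside images/" rule, the sketch below filters a list of object keys down to the image files that would be picked up. The function name direct_images and the sample keys are hypothetical; how the keys are listed from the bucket is left out on purpose.

```python
from typing import Iterable, List


def direct_images(keys: Iterable[str], prefix: str) -> List[str]:
    """Return file names located directly under <prefix>/images/."""
    base = prefix.rstrip("/") + "/images/"
    names: List[str] = []
    for key in keys:
        if not key.startswith(base):
            continue  # not under images/ at all
        rest = key[len(base):]
        # Skip the folder placeholder itself and anything in a subdirectory.
        if rest and "/" not in rest:
            names.append(rest)
    return names


sample_keys = [
    "prefix/images/",
    "prefix/images/000.jpg",
    "prefix/images/extra/001.jpg",  # nested, so it would be ignored
    "prefix/annotations.yaml",
]
print(direct_images(sample_keys, "prefix"))
```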
The metadata file looks something like this:

metadata.yaml

task: scene text recognition
annotations: annotations.yaml
tokens: [A, B, C, "1", "2", "3", "<SPACE>"]

Caution: tokens cannot contain whitespace characters (like the space character " "). We recommend replacing them with special tokens, like ⎵ (Unicode U+23B5). In this example, we're using <SPACE> instead.
Our system automatically casts all tokens to strings, but it is nevertheless good practice to explicitly quote expressions that naive YAML parsers could interpret as other types (e.g. numbers and booleans).

Note: If the "tokens" field is a string rather than an array of strings, it will be interpreted as an array of single-character tokens. So tokens: abcdef is considered equivalent to tokens: [a, b, c, d, e, f].
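
The token rules above (string shorthand, casting to strings, no whitespace) can be sketched as a small normalization step applied to the already-parsed value of the "tokens" field. The helper name normalize_tokens is hypothetical, and the casting uses Python's str(), which is an assumption and may not match the system's behavior for every type.

```python
from typing import Any, List


def normalize_tokens(tokens: Any) -> List[str]:
    """Turn the parsed "tokens" field into a list of string tokens."""
    if isinstance(tokens, str):
        # String shorthand: every character becomes its own token.
        items = list(tokens)
    else:
        # Cast each entry to a string (assumption: plain str() casting).
        items = [str(t) for t in tokens]
    for tok in items:
        if any(ch.isspace() for ch in tok):
            raise ValueError(f"token {tok!r} contains whitespace")
    return items


print(normalize_tokens("abcdef"))
print(normalize_tokens(["A", "B", "C", 1, 2, 3, "<SPACE>"]))
```

The second call shows what happens when a YAML parser has already turned unquoted 1, 2, 3 into integers: they end up as the strings "1", "2", "3", which is why quoting them explicitly in the metadata file avoids any ambiguity.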

The annotations field specifies the name of the file containing the ground truth annotations.
Here's an example of an annotations file:

annotations.yaml

000.jpg: [A, A, "<SPACE>", "2"]
001.jpg: AB32  # interpreted as [A, B, "3", "2"]
002.jpg: []  # no text (could also be "")
# ...

The ground-truth text annotations should only use tokens specified in the metadata.
If an image has no text in it, it has to be explicitly assigned [] or "".
Images assigned to null are ignored!
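
The annotation rules can be checked before uploading a dataset. The following is a minimal sketch operating on already-parsed YAML values (plain Python dicts and lists); the function names and the 003.jpg entry are hypothetical, and only string and list annotations are handled.

```python
from typing import Any, Dict, List, Optional


def normalize_annotation(value: Any) -> Optional[List[str]]:
    """Return the token list for one image, or None if it should be ignored."""
    if value is None:
        return None  # null annotation: the image is ignored
    if isinstance(value, str):
        return list(value)  # "AB32" -> ["A", "B", "3", "2"]; "" -> []
    return [str(t) for t in value]


def check_annotations(
    annotations: Dict[str, Any], tokens: List[str]
) -> Dict[str, List[str]]:
    """Drop ignored images and reject tokens missing from the metadata."""
    vocab = set(tokens)
    cleaned: Dict[str, List[str]] = {}
    for name, value in annotations.items():
        seq = normalize_annotation(value)
        if seq is None:
            continue
        unknown = [t for t in seq if t not in vocab]
        if unknown:
            raise ValueError(f"{name}: tokens not in metadata: {unknown}")
        cleaned[name] = seq
    return cleaned


tokens = ["A", "B", "C", "1", "2", "3", "<SPACE>"]
annotations = {
    "000.jpg": ["A", "A", "<SPACE>", "2"],
    "001.jpg": "AB32",
    "002.jpg": [],
    "003.jpg": None,  # hypothetical entry: ignored rather than treated as "no text"
}
cleaned = check_annotations(annotations, tokens)
print(cleaned)
```

Note the difference between 002.jpg and 003.jpg: the first is kept as an image with no text, while the second is dropped entirely.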