Autoregressive text recognition models predict a variable-length sequence
of tokens (the "text") associated with the input image.
Some call this task OCR (optical character recognition), but it is important
to highlight that a token isn't necessarily a single character, and might not
even represent printable characters; it is simply an atomic sequence element.
Autoregressive text recognition differs from "normal" scene text recognition in
that it is inherently sequential: it predicts text token by token, in a loop,
which is slower but often more accurate than the parallel, convolutional approach.
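To make the token-by-token loop concrete, here is a minimal sketch of greedy decoding. It is only an illustration: the `next_token` callable, the `<END>` token, and the `max_length` limit are assumptions made for this example and are not part of the dataset format described here.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token: Callable[[Sequence[str]], str],
    end_token: str = "<END>",  # hypothetical stop token, assumed for this sketch
    max_length: int = 32,      # hypothetical safety limit on sequence length
) -> List[str]:
    """Build a token sequence one step at a time."""
    tokens: List[str] = []
    for _ in range(max_length):
        # Each step is conditioned on every token predicted so far,
        # which is why the steps cannot run in parallel.
        token = next_token(tokens)
        if token == end_token:
            break
        tokens.append(token)
    return tokens


def toy_model(prefix: Sequence[str]) -> str:
    """Stand-in for a real model: spells out a fixed answer, then stops."""
    answer = ["A", "B", "<SPACE>", "1"]
    return answer[len(prefix)] if len(prefix) < len(answer) else "<END>"


decoded = greedy_decode(toy_model)
```

A real model would also take the input image into account at every step; the toy function above ignores it to keep the loop itself visible.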
Datasets follow this structure:
├── prefix/images/
├── prefix/annotations.yaml
└── prefix/metadata.yaml
Dataset images are placed directly inside images/ (subdirectories are ignored).
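As a rough sketch of the rule above, the following helper collects only the files sitting directly inside images/ and skips anything in subdirectories. The function name `list_dataset_images` is hypothetical, and the sketch returns every regular file, since which image extensions are accepted is not specified here.

```python
from pathlib import Path
from typing import List


def list_dataset_images(prefix: Path) -> List[Path]:
    """Return the files directly inside <prefix>/images, sorted by name."""
    images_dir = prefix / "images"
    # iterdir() is not recursive, so files in subdirectories are never visited;
    # is_file() drops the subdirectory entries themselves.
    return sorted(p for p in images_dir.iterdir() if p.is_file())
```

Calling `list_dataset_images(Path("prefix"))` on a dataset laid out as shown would return the image paths, leaving out any file nested one level deeper.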
The metadata file looks something like this:
task: scene text recognition
annotations: annotations.yaml
tokens: [A, B, C, "1", "2", "3", "<SPACE>"]
Note: If the "tokens" field is a string rather than an array of strings, it will be interpreted as an array of single-character tokens. So tokens: abcdef is considered equivalent to tokens: [a, b, c, d, e, f].
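The string shorthand can be illustrated with a small sketch. The helper name `normalize_tokens` is hypothetical; it only mirrors the rule stated in the note and is not the tool's actual loader.

```python
from typing import List, Sequence, Union


def normalize_tokens(tokens: Union[str, Sequence[str]]) -> List[str]:
    """Expand the "tokens" field into an explicit list of tokens."""
    if isinstance(tokens, str):
        # A plain string is shorthand for one token per character.
        return list(tokens)
    # An array is kept as-is, so multi-character tokens such as "<SPACE>" survive.
    return list(tokens)
```

With this helper, `normalize_tokens("abcdef")` and `normalize_tokens(["a", "b", "c", "d", "e", "f"])` give the same list, while a multi-character token can only be expressed with the array form.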
The annotations field specifies the name of
the file containing the ground truth annotations.
Here's an example of an
annotations file:
000.jpg: [A, A, "<SPACE>", "2"]
001.jpg: AB32 # interpreted as [A, B, "3", "2"]
002.jpg: [] # no text (could also be "")
# ...
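The sketch below shows how annotations in this form could be expanded into token lists and checked against the tokens declared in the metadata file. It works on already-parsed data (plain Python dicts and lists). The name `check_annotations` and the decision to reject unknown tokens are assumptions made for illustration; whether the real loader validates labels this way is not stated here.

```python
from typing import Dict, List, Sequence, Union

Label = Union[str, Sequence[str]]


def check_annotations(
    annotations: Dict[str, Label], tokens: Sequence[str]
) -> Dict[str, List[str]]:
    """Expand every label to a token list and verify it against the vocabulary."""
    vocabulary = set(tokens)
    normalized: Dict[str, List[str]] = {}
    for filename, label in annotations.items():
        # list() splits a string into single characters and copies an array,
        # so "AB32" becomes ["A", "B", "3", "2"] and both "" and [] become [].
        sequence = list(label)
        unknown = [t for t in sequence if t not in vocabulary]
        if unknown:
            raise ValueError(f"{filename}: tokens not in metadata: {unknown}")
        normalized[filename] = sequence
    return normalized


tokens = ["A", "B", "C", "1", "2", "3", "<SPACE>"]
annotations = {
    "000.jpg": ["A", "A", "<SPACE>", "2"],
    "001.jpg": "AB32",
    "002.jpg": [],
}
normalized = check_annotations(annotations, tokens)
```

Running a check like this before training can catch labels that use a character missing from the tokens list.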