Scene text recognition models predict a variable-length sequence of tokens
(the "text") associated with the input image.
Some call this task OCR (optical character recognition), but it is important
to highlight that a token is not necessarily a single character and may not
even represent a printable character; it is simply an atomic sequence element.
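To make this concrete, here is a tiny illustrative sketch in Python; the
prediction and tokens shown are made up, not taken from any actual model:

# Illustrative only: a prediction is a variable-length list of atomic tokens.
prediction = ["A", "B", "<SPACE>", "3", "2"]  # one list element per token
# "<SPACE>" is a single token even though it spans several characters:
text = "".join(" " if t == "<SPACE>" else t for t in prediction)  # "AB 32"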
Scene text recognition differs from document text recognition in that it is
not well suited to long sequences of text (e.g. paragraphs). Scene text
recognition models are, however, faster to run and sufficiently accurate for
the short sequences embedded in natural scenes (e.g. license plate numbers).
Datasets follow this structure:
endpoint_url/bucket
├── prefix/images/
├── prefix/annotations.yaml
└── prefix/metadata.yaml
Dataset images are placed directly inside images/ (subdirectories are ignored).
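As an illustration, here is a minimal sketch of listing those images from an
S3-compatible store with boto3. The endpoint, bucket, and prefix values are
placeholders, and the dataset spec does not mandate any particular client
library:

import boto3

# Placeholder endpoint; substitute the actual endpoint_url of the store.
s3 = boto3.client("s3", endpoint_url="https://endpoint_url")

def list_image_keys(bucket: str, prefix: str) -> list[str]:
    """Return object keys directly under prefix/images/, skipping subdirectories."""
    keys: list[str] = []
    paginator = s3.get_paginator("list_objects_v2")
    # Delimiter="/" stops the listing at the first "/" past the prefix,
    # so objects inside subdirectories of images/ are not returned.
    for page in paginator.paginate(
        Bucket=bucket, Prefix=f"{prefix}/images/", Delimiter="/"
    ):
        keys.extend(
            obj["Key"]
            for obj in page.get("Contents", [])
            if not obj["Key"].endswith("/")  # skip the directory marker, if any
        )
    return keys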
The metadata file looks something like this:
task: scene text recognition
annotations: annotations.yaml
tokens: [A, B, C, "1", "2", "3", "<SPACE>"]
Note: If the "tokens" field is a string rather than an array of strings, it will be interpreted as an array of single-character tokens. So tokens: abcdef is considered equivalent to tokens: [a, b, c, d, e, f].
The annotations field specifies the name of
the file containing the ground truth annotations.
Here's an example of an annotations file:
000.jpg: [A, A, "<SPACE>", "2"]
001.jpg: AB32 # interpreted as [A, B, "3", "2"]
002.jpg: [] # no text (could also be "")
# ...
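Putting these rules together, a minimal sketch of parsing such a file with
PyYAML might look like this; again, the file path is an assumption:

import yaml  # PyYAML

with open("annotations.yaml") as f:  # path is an assumption
    raw = yaml.safe_load(f)

annotations = {}
for image, text in raw.items():
    if not text:  # covers [], "" and a bare empty value
        annotations[image] = []
    elif isinstance(text, str):
        annotations[image] = list(text)  # "AB32" -> ["A", "B", "3", "2"]
    else:
        annotations[image] = [str(t) for t in text]  # already a token list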