Text Recognition


The structure of the text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt

Dataset                 Images                          Annotation file (training)                             Annotation file (test)
coco_text               homepage                        train_label.txt                                        -
icdar_2011              homepage                        train_label.txt                                        -
icdar_2013              homepage                        train_label.txt                                        test_label_1015.txt
icdar_2015              homepage                        train_label.txt                                        test_label.txt
IIIT5K                  homepage                        train_label.txt                                        test_label.txt
ct80                    homepage                        -                                                      test_label.txt
svt                     homepage                        -                                                      test_label.txt
svtp                    unofficial homepage (*)         -                                                      test_label.txt
MJSynth (Syn90k)        homepage                        shuffle_labels.txt | label.txt                         -
SynthText (Synth800k)   homepage                        shuffle_labels.txt | instances_train.txt | label.txt   -
SynthAdd                SynthText_Add.zip (code:627x)   label.txt                                              -
TextOCR                 homepage                        -                                                      -
Totaltext               homepage                        -                                                      -

(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
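
Once the datasets are in place, the layout shown at the top of this page can be checked with a small script. The following is a minimal sketch, not part of MMOCR: the helper name `missing_entries` and the `EXPECTED` mapping are hypothetical, and `EXPECTED` lists only a few of the directories from the tree for illustration.

```python
from pathlib import Path

# Hypothetical subset of the directory tree: dataset directory -> entries expected inside it.
EXPECTED = {
    "icdar_2015": ["train_label.txt", "test_label.txt",
                   "ch4_training_word_images_gt", "ch4_test_word_images_gt"],
    "svt": ["test_label.txt", "image"],
    "Syn90k": ["shuffle_labels.txt", "label.txt", "mnt"],
}


def missing_entries(mixture_root, expected=EXPECTED):
    """Return the relative paths from `expected` that do not exist under mixture_root."""
    root = Path(mixture_root)
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing
```

Calling `missing_entries("data/mixture")` from the repository root returns an empty list when every listed file and directory is present; extend `EXPECTED` to cover the datasets you actually use.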

Preparation Steps

ICDAR 2013

SVT

python tools/data/textrecog/svt_converter.py <download_svt_dir_path>

MJSynth (Syn90k)

  • Step1: Download mjsynth.tar.gz from homepage

  • Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step3:

mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .

tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/Syn90k Syn90k
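
Because Step2 quotes the size of each annotation file, a quick line count can confirm that the download is complete and that the intended file is in use. This is a minimal sketch that assumes each annotation occupies one line of the label file (an assumption, not stated above); `count_annotations` and `EXPECTED_COUNTS` are hypothetical names.

```python
def count_annotations(label_path):
    """Count non-empty lines in a label file, assuming one annotation per line."""
    with open(label_path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())


# Annotation counts quoted in Step2 for MJSynth (Syn90k).
EXPECTED_COUNTS = {"label.txt": 8_919_273, "shuffle_labels.txt": 2_400_000}
```

For example, comparing `count_annotations("data/mixture/Syn90k/label.txt")` against `EXPECTED_COUNTS["label.txt"]` flags a truncated download before training starts.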

SynthText (Synth800k)

  • Step1: Download SynthText.zip from homepage

  • Step2: Download label.txt (7,266,686 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations). Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step3:

mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext

mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
  • Step4: Generate cropped images and labels:

cd /path/to/mmocr

python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8


SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code:627x)

  • Step2: Download label.txt

  • Step3:

mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .

unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthAdd SynthAdd
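
The "create soft link" step recurs for each synthetic dataset, so it can be scripted. The sketch below mirrors `ln -s /path/to/<dataset> <dataset>` using the standard library; the helper name `link_dataset` is hypothetical, and creating the mixture directory when it is missing is an added convenience rather than something the steps above do.

```python
from pathlib import Path


def link_dataset(dataset_dir, mixture_dir):
    """Create mixture_dir/<name> as a symlink to dataset_dir, similar to `ln -s`."""
    src = Path(dataset_dir).resolve()
    link = Path(mixture_dir) / src.name
    # Added convenience: make sure the mixture directory exists.
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() and link.resolve() == src:
        return link  # already linked to the same target, nothing to do
    # Raises FileExistsError if something else already occupies the link path.
    link.symlink_to(src, target_is_directory=True)
    return link
```

For example, `link_dataset("/path/to/SynthAdd", "/path/to/mmocr/data/mixture")` is safe to call repeatedly, whereas a second `ln -s` would fail or nest a link inside the existing directory.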

Note: To convert a label file from txt format to lmdb format, run:

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
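
The directory tree lists a label.lmdb next to label.txt for Syn90k, SynthText and SynthAdd, so the conversion can be batched. The following sketch only assembles the command shown above with its `-i` and `-o` flags; the helpers `txt2lmdb_command` and `convert_all` are hypothetical, and the default output path (same name with an .lmdb suffix) is an assumption that matches the example.

```python
import subprocess
from pathlib import Path


def txt2lmdb_command(txt_label_path, lmdb_label_path=None):
    """Build the argument list for tools/data/utils/txt2lmdb.py."""
    txt = Path(txt_label_path)
    # Assumed default: write label.lmdb next to label.txt.
    lmdb = Path(lmdb_label_path) if lmdb_label_path else txt.with_suffix(".lmdb")
    return ["python", "tools/data/utils/txt2lmdb.py",
            "-i", txt.as_posix(), "-o", lmdb.as_posix()]


def convert_all(datasets=("Syn90k", "SynthText", "SynthAdd"),
                root="data/mixture", dry_run=True):
    """Return the commands for each dataset; run them only when dry_run is False."""
    commands = [txt2lmdb_command(Path(root) / name / "label.txt") for name in datasets]
    if not dry_run:
        for cmd in commands:
            # Must be run from the mmocr repository root so the script path resolves.
            subprocess.run(cmd, check=True)
    return commands
```

With the default `dry_run=True` nothing is executed, which makes it easy to print and inspect the commands before running them.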


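TextOCR

  • Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/:

mkdir textocr && cd textocr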

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
  • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:

python tools/data/textrecog/textocr_converter.py /path/to/textocr 4


Totaltext

  • Step1: Download totaltext.zip from the GitHub dataset page and groundtruth_text.zip from the GitHub Groundtruth page (totaltext_converter.py supports ground truth in both .mat and .txt formats).

mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For annotations
unzip groundtruth_text.zip
cd Groundtruth
mv Polygon/Train ../annotations/training
mv Polygon/Test ../annotations/test
  • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):

python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test