Datasets Preparation

This page lists the datasets commonly used in text detection, text recognition, key information extraction, and named entity recognition, together with their download links.

Text Detection

The structure of the text detection dataset directory is organized as follows.

├── ctw1500
│   ├── annotations
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb
├── textocr
│   ├── train
│   ├── instances_training.json
│   └── instances_val.json
├── totaltext
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
| Dataset   | Images   | Annotation files (training) | Annotation files (validation) | Annotation files (testing) |
|-----------|----------|-----------------------------|-------------------------------|----------------------------|
| CTW1500   | homepage | -                           | -                             | -                          |
| ICDAR2015 | homepage | instances_training.json     | -                             | instances_test.json        |
| ICDAR2017 | homepage | instances_training.json     | instances_val.json            | -                          |
| Synthtext | homepage | instances_training.lmdb     | -                             | -                          |
| TextOCR   | homepage | -                           | -                             | -                          |
| Totaltext | homepage | -                           | -                             | -                          |

Note: Some images in CTW1500, ICDAR 2015/2017, and Totaltext carry orientation info in their EXIF data. The default OpenCV backend used in MMCV reads this info and rotates the images on loading, but the gold annotations were made on the raw pixels, so the mismatch produces false examples in the training set. To avoid this, use dict(type='LoadImageFromFile', color_type='color_ignore_orientation') in your pipelines to override MMCV's default loading behaviour (see DBNet's config for an example, and the sketch below).
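
A minimal sketch of where this setting lives in a config; only the color_type setting is prescribed by the note above, while the dataset type, annotation paths, and the other pipeline steps are illustrative, model-dependent placeholders rather than the exact DBNet config:

    # Illustrative MMOCR-style config fragment. Only color_type is
    # prescribed by the note above; the rest is an assumed skeleton.
    train_pipeline = [
        # Ignore EXIF orientation so images match the raw-pixel annotations.
        dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
        dict(type='LoadTextAnnotations', with_bbox=True, with_mask=True),
        # ... model-specific transforms and formatting steps ...
    ]

    data = dict(
        train=dict(
            type='IcdarDataset',  # COCO-style detection dataset (assumed)
            ann_file='data/icdar2015/instances_training.json',
            img_prefix='data/icdar2015/imgs',
            pipeline=train_pipeline))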

  • For icdar2015:

    • Step1: Download ch4_training_images.zip, ch4_test_images.zip, ch4_training_localization_transcription_gt.zip, Challenge4_Test_Task1_GT.zip from homepage

    • Step2:

    mkdir icdar2015 && cd icdar2015
    mkdir imgs && mkdir annotations
    # For images,
    mv ch4_training_images imgs/training
    mv ch4_test_images imgs/test
    # For annotations,
    mv ch4_training_localization_transcription_gt annotations/training
    mv Challenge4_Test_Task1_GT annotations/test
    
    • Step3: Generate instances_training.json and instances_test.json with the following command:

    python tools/data/textdet/icdar_converter.py /path/to/icdar2015 -o /path/to/icdar2015 -d icdar2015 --split-list training test
    
  • For icdar2017:

    • Follow the same steps as for icdar2015 above; see the command sketch below.

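    The conversion command below is a sketch by analogy with icdar2015; it assumes icdar_converter.py accepts -d icdar2017 and that the splits are training and val, matching the directory tree above:

    python tools/data/textdet/icdar_converter.py /path/to/icdar2017 -o /path/to/icdar2017 -d icdar2017 --split-list training val
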
  • For ctw1500:

    • Step1: Download train_images.zip, test_images.zip, train_labels.zip and test_labels.zip from GitHub:

    mkdir ctw1500 && cd ctw1500
    mkdir imgs && mkdir annotations
    
    # For annotations
    cd annotations
    wget -O train_labels.zip https://universityofadelaide.box.com/shared/static/jikuazluzyj4lq6umzei7m2ppmt3afyw.zip
    wget -O test_labels.zip https://cloudstor.aarnet.edu.au/plus/s/uoeFl0pCN9BOCN5/download
    unzip train_labels.zip && mv ctw1500_train_labels training
    unzip test_labels.zip -d test
    cd ..
    # For images
    cd imgs
    wget -O train_images.zip https://universityofadelaide.box.com/shared/static/py5uwlfyyytbb2pxzq9czvu6fuqbjdh8.zip
    wget -O test_images.zip https://universityofadelaide.box.com/shared/static/t4w48ofnqkdw7jyc4t11nsukoeqk9c3d.zip
    unzip train_images.zip && mv train_images training
    unzip test_images.zip && mv test_images test
    
    • Step2: Generate instances_training.json and instances_test.json with the following command:

    python tools/data/textdet/ctw1500_converter.py /path/to/ctw1500 -o /path/to/ctw1500 --split-list training test
    
  • For TextOCR:

    • Step1: Download the images and annotations:

    mkdir textocr && cd textocr
    
    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
    
    # For images
    unzip -q train_val_images.zip
    mv train_images train
    
    • Step2: Generate instances_training.json and instances_val.json with the following command:

    python tools/data/textdet/textocr_converter.py /path/to/textocr
    
  • For Totaltext:

    • Step1: Download totaltext.zip from github dataset and groundtruth_text.zip from github Groundtruth (our totaltext_converter.py supports ground truth in both .mat and .txt formats).

    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations
    
    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test
    
    # For annotations
    unzip groundtruth_text.zip
    cd Groundtruth
    mv Polygon/Train ../annotations/training
    mv Polygon/Test ../annotations/test
    
    • Step2: Generate instances_training.json and instances_test.json with the following command:

    python tools/data/textdet/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
    

Text Recognition

The structure of the text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add
│   ├── TextOCR
│   │   ├── image
│   │   ├── train_label.txt
│   │   ├── val_label.txt
│   ├── Totaltext
│   │   ├── imgs
│   │   ├── annotations
│   │   ├── train_label.txt
│   │   ├── test_label.txt
| Dataset    | Images                        | Annotation file (training)                         | Annotation file (test) |
|------------|-------------------------------|----------------------------------------------------|------------------------|
| coco_text  | homepage                      | train_label.txt                                    | -                      |
| icdar_2011 | homepage                      | train_label.txt                                    | -                      |
| icdar_2013 | homepage                      | train_label.txt                                    | test_label_1015.txt    |
| icdar_2015 | homepage                      | train_label.txt                                    | test_label.txt         |
| IIIT5K     | homepage                      | train_label.txt                                    | test_label.txt         |
| ct80       | -                             | -                                                  | test_label.txt         |
| svt        | homepage                      | -                                                  | test_label.txt         |
| svtp       | -                             | -                                                  | test_label.txt         |
| Syn90k     | homepage                      | shuffle_labels.txt, label.txt                      | -                      |
| SynthText  | homepage                      | shuffle_labels.txt, instances_train.txt, label.txt | -                      |
| SynthAdd   | SynthText_Add.zip (code:627x) | label.txt                                          | -                      |
| TextOCR    | homepage                      | -                                                  | -                      |
| Totaltext  | homepage                      | -                                                  | -                      |
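
For reference, the txt label files above follow a plain one-sample-per-line convention: an image path (relative to the dataset directory) followed by its transcription. The file names and words below are made up for illustration:

    train_words/1001.jpg Awesome
    train_words/1002.jpg openmmlab
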
  • For icdar_2013:

  • For icdar_2015:

  • For IIIT5K:

  • For svt:

    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    
  • For ct80:

  • For svtp:

  • For coco_text:

  • For Syn90k:

    • Step1: Download mjsynth.tar.gz from the homepage and the shuffle_labels.txt annotation file, then:

    mkdir Syn90k && cd Syn90k
    
    mv /path/to/mjsynth.tar.gz .
    
    tar -xzf mjsynth.tar.gz
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/Syn90k Syn90k
    
  • For SynthText:

    • Step1: Download SynthText.zip from homepage

    • Step2:

    mkdir SynthText && cd SynthText
    mv /path/to/SynthText.zip .
    unzip SynthText.zip
    mv SynthText synthtext
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthText SynthText
    
    • Step3: Generate cropped images and labels:

    cd /path/to/mmocr
    
    python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
    
  • For SynthAdd:

    • Step1: Download SynthText_Add.zip from SynthAdd (code:627x)

    • Step2: Download label.txt

    • Step3:

    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthAdd SynthAdd
    

    Note: To convert a label file from txt format to lmdb format, run:

    python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

    For example:

    python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
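
    To sanity-check a converted file, the lmdb Python package can iterate over the stored entries directly; this optional sketch makes no assumption about the internal key layout:

    import lmdb

    # Print the first few key/value pairs of the converted label file.
    env = lmdb.open('data/mixture/Syn90k/label.lmdb', readonly=True, lock=False)
    with env.begin() as txn:
        for i, (key, value) in enumerate(txn.cursor()):
            print(key, value[:80])
            if i >= 4:
                break
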
  • For TextOCR:

    • Step1: Download the images and annotations:

    mkdir textocr && cd textocr
    
    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
    
    # For images
    unzip -q train_val_images.zip
    mv train_images train
    
    • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:

    python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
    
  • For Totaltext:

    • Step1: Download totaltext.zip from github dataset and groundtruth_text.zip from github Groundtruth (our totaltext_converter.py supports ground truth in both .mat and .txt formats).

    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations
    
    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test
    
    # For annotations
    unzip groundtruth_text.zip
    cd Groundtruth
    mv Polygon/Train ../annotations/training
    mv Polygon/Test ../annotations/test
    
    • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/.):

    python tools/data/textrecog/totaltext_converter.py /path/to/totaltext -o /path/to/totaltext --split-list training test
    

Key Information Extraction

The structure of the key information extraction dataset directory is organized as follows.

└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── test.txt
  └── train.txt
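
Each line of train.txt and test.txt is a standalone JSON string describing one image: its file name and size plus a list of annotated boxes, where box holds the four corner coordinates, text the transcription, and label a class index into class_list.txt. The values below are made up for illustration:

    {"file_name": "image_files/Image_1/0/receipt_0.jpeg", "height": 1200, "width": 1600, "annotations": [{"box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0], "text": "SAFEWAY", "label": 1}]}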

Named Entity Recognition

The structure of the named entity recognition dataset directory is organized as follows.

└── cluener2020
  ├── cluener_predict.json
  ├── dev.json
  ├── README.md
  ├── test.json
  ├── train.json
  └── vocab.txt
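
Each line of train.json and dev.json holds one JSON sample: the raw text plus a label dict mapping entity types to surface strings and their character spans. The sample below is made up for illustration and assumes CLUENER2020's inclusive start/end character offsets:

    {"text": "MMOCR is developed by OpenMMLab.", "label": {"organization": {"OpenMMLab": [[22, 30]]}}}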