Dataset Preparation

This page lists datasets commonly used for text detection, text recognition, key information extraction and named entity recognition, along with their download links.

Text Detection

The text detection dataset directory is organized as follows.

├── ctw1500
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb

Dataset     Images                   Annotation files
                                     training                  validation           testing
CTW1500     homepage                 instances_training.json   -                    instances_test.json
ICDAR2015   homepage                 instances_training.json   -                    instances_test.json
ICDAR2017   homepage, renamed_imgs   instances_training.json   instances_val.json   -
Synthtext   homepage                 instances_training.lmdb   -                    -

  • For icdar2015:

    mkdir icdar2015 && cd icdar2015
    mv /path/to/instances_training.json .
    mv /path/to/instances_test.json .
    
    mkdir imgs && cd imgs
    ln -s /path/to/ch4_training_images training
    ln -s /path/to/ch4_test_images test
    
  • For icdar2017:

    • To avoid rotation artifacts when loading JPEG files with OpenCV, we provide images re-saved in PNG format in renamed_images. Copy these images to imgs.
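
Before training, it can help to confirm that the detection datasets match the directory tree at the top of this section. The following is a minimal sketch, not part of MMOCR: the helper name `missing_entries` and the dataset root `data` are assumptions chosen for illustration, and the expected entries are copied from the tree.

```python
from pathlib import Path

# Expected entries per dataset, copied from the directory tree in this section.
EXPECTED = {
    "ctw1500": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2015": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2017": ["imgs", "instances_training.json", "instances_val.json"],
    "synthtext": ["imgs", "instances_training.lmdb"],
}


def missing_entries(root, expected=EXPECTED):
    """Return the relative paths under `root` that are expected but absent."""
    root = Path(root)
    missing = []
    for dataset, entries in expected.items():
        for entry in entries:
            # exists() follows symlinks, so a broken link counts as missing.
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing


if __name__ == "__main__":
    # "data" is an assumed dataset root; adjust it to your setup.
    for rel_path in missing_entries("data"):
        print(f"missing: {rel_path}")
```

An empty result means every listed file and directory is present. The script does not validate the contents of the annotation files.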

Text Recognition

The text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add

Dataset      Images                          Annotation file (training)                             Annotation file (test)
coco_text    homepage                        train_label.txt                                        -
icdar_2011   homepage                        train_label.txt                                        -
icdar_2013   homepage                        train_label.txt                                        test_label_1015.txt
icdar_2015   homepage                        train_label.txt                                        test_label.txt
IIIT5K       homepage                        train_label.txt                                        test_label.txt
ct80         -                               -                                                      test_label.txt
svt          homepage                        -                                                      test_label.txt
svtp         -                               -                                                      test_label.txt
Syn90k       homepage                        shuffle_labels.txt | label.txt                         -
SynthText    homepage                        shuffle_labels.txt | instances_train.txt | label.txt   -
SynthAdd     SynthText_Add.zip (code:627x)   label.txt                                              -

  • For icdar_2013:

  • For icdar_2015:

  • For IIIT5K:

  • For svt:

    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    
  • For ct80:

  • For svtp:

  • For coco_text:

  • For Syn90k:

    mkdir Syn90k && cd Syn90k
    
    mv /path/to/mjsynth.tar.gz .
    
    tar -xzf mjsynth.tar.gz
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/Syn90k Syn90k
    
  • For SynthText:

    unzip SynthText.zip
    
    cd SynthText
    
    mv /path/to/shuffle_labels.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthText SynthText
    
  • For SynthAdd:

    • Step1: Download SynthText_Add.zip from SynthAdd (code:627x)

    • Step2: Download label.txt

    • Step3:

    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthAdd SynthAdd
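
The "create soft link" step used above for Syn90k, SynthText and SynthAdd can also be scripted. The sketch below is an illustration rather than an MMOCR utility: the function name `link_dataset` is hypothetical, and it assumes the same behaviour as `ln -s`, except that it leaves an existing entry untouched instead of failing.

```python
import os
from pathlib import Path


def link_dataset(source_dir, mixture_dir, name=None):
    """Create mixture_dir/<name> as a symlink to source_dir, like `ln -s`."""
    source = Path(source_dir).resolve()
    if not source.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {source}")

    mixture = Path(mixture_dir)
    mixture.mkdir(parents=True, exist_ok=True)

    # Default to the source directory's own name, e.g. "SynthAdd".
    link = mixture / (name or source.name)
    if link.is_symlink() or link.exists():
        # Leave existing entries alone rather than overwrite them.
        return link

    os.symlink(source, link, target_is_directory=True)
    return link
```

For example, `link_dataset("/path/to/SynthAdd", "/path/to/mmocr/data/mixture")` mirrors the last two shell commands of the SynthAdd steps.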
    

Note: To convert a label file from txt format to lmdb format, run:

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
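
When several datasets need converting, the same command can be generated for each one. The sketch below is an assumption-laden illustration: the helper name `txt2lmdb_commands` is hypothetical, it assumes each dataset keeps its labels in `<dataset>/label.txt` as in the directory tree of this section, and it uses only the `-i` and `-o` options shown above. It builds the commands without running them.

```python
import sys
from pathlib import Path


def txt2lmdb_commands(mixture_dir, script="tools/data/utils/txt2lmdb.py"):
    """Build one txt2lmdb command per <dataset>/label.txt under mixture_dir."""
    commands = []
    for label_txt in sorted(Path(mixture_dir).glob("*/label.txt")):
        # label.txt -> label.lmdb, written next to the txt file.
        label_lmdb = label_txt.with_suffix(".lmdb")
        commands.append(
            [sys.executable, script, "-i", str(label_txt), "-o", str(label_lmdb)]
        )
    return commands


if __name__ == "__main__":
    # "data/mixture" is the dataset root used in the example above.
    for command in txt2lmdb_commands("data/mixture"):
        print(" ".join(command))
```

To execute the conversions, each command list can be passed to `subprocess.run(command, check=True)` from the MMOCR repository root.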

Key Information Extraction

The key information extraction dataset directory is organized as follows.

└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── test.txt
  └── train.txt

Named Entity Recognition

CLUENER2020

The named entity recognition dataset directory is organized as follows.

└── cluener2020
  ├── cluener_predict.json
  ├── dev.json
  ├── README.md
  ├── test.json
  ├── train.json
  └── vocab.txt