
Note

You are reading the documentation for MMOCR 0.x, which will soon be deprecated by the end of 2022. We recommend you upgrade to MMOCR 1.0 to enjoy fruitful new features and better performance brought by OpenMMLab 2.0. Check out the maintenance plan, changelog, code and documentation of MMOCR 1.0 for more details.

Text Recognition

Overview

| Dataset | Images | Annotation file (training) | Annotation file (test) |
| --- | --- | --- | --- |
| coco_text | homepage | train_label.txt | - |
| ICDAR2011 | homepage | - | - |
| ICDAR2013 | homepage | - | - |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | homepage | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | unofficial homepage [1] | - | test_label.txt |
| MJSynth (Syn90k) | homepage | shuffle_labels.txt \| label.txt | - |
| SynthText (Synth800k) | homepage | alphanumeric_labels.txt \| shuffle_labels.txt \| instances_train.txt \| label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |
| TextOCR | homepage | - | - |
| Totaltext | homepage | - | - |
| OpenVINO | Open Images | annotations | annotations |
| FUNSD | homepage | - | - |
| DeText | homepage | - | - |
| NAF | homepage | - | - |
| SROIE | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| COCO Text v2 | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |

[1] Since the official homepage is unavailable now, we provide an alternative link for quick reference. However, we do not guarantee the correctness of the dataset.

Install AWS CLI (optional)

  • Some of the datasets below require the AWS CLI to be installed in advance, so we provide a quick installation guide here:

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      sudo ./aws/install
      # Or, to install the AWS CLI to a custom location without sudo:
      # ./aws/install -i /usr/local/aws-cli -b /usr/local/bin

      aws configure
      # this command will prompt you for credentials; you can skip every field except
      # the Default region name
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]:
    

ICDAR 2011 (Born-Digital Images)

  • Step1: Download Challenge1_Training_Task3_Images_GT.zip, Challenge1_Test_Task3_Images.zip, and Challenge1_Test_Task3_GT.txt from homepage Task 1.3: Word Recognition (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir annotations
    
    # Download ICDAR 2011
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
    
    # For images
    mkdir crops
    unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
    unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
    
    # For annotations
    mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
    
  • Step2: Convert the original annotations to train_label.jsonl and test_label.jsonl with the following command:

    python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
    
  • After running the above commands, the directory structure should be as follows:

    ├── icdar2011
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

ICDAR 2013 (Focused Scene Text)

  • Step1: Download Challenge2_Training_Task3_Images_GT.zip, Challenge2_Test_Task3_Images.zip, and Challenge2_Test_Task3_GT.txt from homepage Task 2.3: Word Recognition (2013 edition).

    mkdir icdar2013 && cd icdar2013
    mkdir annotations
    
    # Download ICDAR 2013
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate
    
    # For images
    mkdir crops
    unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
    unzip -q Challenge2_Test_Task3_Images.zip -d crops/test
    # For annotations
    mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt
    
    rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
    
  • Step2: Generate train_label.jsonl and test_label.jsonl with the following command:

    python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
    
  • After running the above commands, the directory structure should be as follows:

    ├── icdar2013
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

ICDAR 2013 [Deprecated]

  • Step1: Download Challenge2_Test_Task3_Images.zip and Challenge2_Training_Task3_Images_GT.zip from homepage

  • Step2: Download test_label_1015.txt and train_label.txt
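
  • A rough sketch of the two steps above (the /path/to placeholders and the archive layout are assumptions; adjust them to wherever you saved the downloads):

    mkdir icdar_2013 && cd icdar_2013

    # Extract the image archives into folders matching their names
    unzip -q /path/to/Challenge2_Training_Task3_Images_GT.zip -d Challenge2_Training_Task3_Images_GT
    unzip -q /path/to/Challenge2_Test_Task3_Images.zip -d Challenge2_Test_Task3_Images

    # Place the downloaded label files in the dataset root
    mv /path/to/train_label.txt /path/to/test_label_1015.txt .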

  • After running the above commands, the directory structure should be as follows:

    ├── icdar_2013
    │   ├── train_label.txt
    │   ├── test_label_1015.txt
    │   ├── test_label_1095.txt
    │   ├── Challenge2_Training_Task3_Images_GT
    │   └──  Challenge2_Test_Task3_Images
    

ICDAR 2015

  • Step1: Download ch4_training_word_images_gt.zip and ch4_test_word_images_gt.zip from homepage

  • Step2: Download train_label.txt and test_label.txt
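
  • A rough sketch of the two steps above (the /path/to placeholders stand for wherever you saved the downloads):

    mkdir icdar_2015 && cd icdar_2015

    # Extract the word-crop archives
    unzip -q /path/to/ch4_training_word_images_gt.zip -d ch4_training_word_images_gt
    unzip -q /path/to/ch4_test_word_images_gt.zip -d ch4_test_word_images_gt

    # Place the downloaded label files in the dataset root
    mv /path/to/train_label.txt /path/to/test_label.txt .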

  • After running the above commands, the directory structure should be as follows:

    ├── icdar_2015
    │   ├── train_label.txt
    │   ├── test_label.txt
    │   ├── ch4_training_word_images_gt
    │   └── ch4_test_word_images_gt
    

IIIT5K

  • Step1: Download IIIT5K-Word_V3.0.tar.gz from homepage

  • Step2: Download train_label.txt and test_label.txt
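
  • A rough sketch of the two steps above (the /path/to placeholders and the exact layout inside the archive are assumptions):

    mkdir IIIT5K && cd IIIT5K

    # Extract the images so that train/ and test/ end up directly under IIIT5K/
    tar -xzf /path/to/IIIT5K-Word_V3.0.tar.gz

    # Place the downloaded label files in the dataset root
    mv /path/to/train_label.txt /path/to/test_label.txt .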

  • After running the above commands, the directory structure should be as follows:

    ├── IIIT5K
    │   ├── train_label.txt
    │   ├── test_label.txt
    │   ├── train
    │   └── test
    

svt

  • Step1: Download svt.zip from homepage

  • Step2: Download test_label.txt
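
  • A rough sketch of the two download steps above (the /path/to placeholders are assumptions); the resulting directory can then be passed to svt_converter.py in Step3 as <download_svt_dir_path>:

    mkdir svt && cd svt

    # Extract the dataset archive
    unzip -q /path/to/svt.zip

    # Place the downloaded label file in the dataset root
    mv /path/to/test_label.txt .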

  • Step3:

    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    
  • After running the above commands, the directory structure should be as follows:

    ├── svt
    │   ├── test_label.txt
    │   └── image
    

ct80

  • Step1: Download test_label.txt

  • Step2: Download timage.tar.gz

  • Step3:

    mkdir ct80 && cd ct80
    mv /path/to/test_label.txt .
    mv /path/to/timage.tar.gz .
    tar -xvf timage.tar.gz
    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/ct80 ct80
    
  • After running the above commands, the directory structure should be as follows:

    ├── ct80
    │   ├── test_label.txt
    │   └── timage
    

svtp

  • Step1: Download test_label.txt
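
  • A rough sketch of how the files can be arranged (the /path/to placeholders are assumptions; the cropped word images come from the unofficial homepage listed in the overview table):

    mkdir svtp && cd svtp

    # Place the downloaded label file in the dataset root
    mv /path/to/test_label.txt .

    # Place the cropped word images under ./image
    mv /path/to/svtp_images image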

  • After running the above commands, the directory structure should be as follows:

    ├── svtp
    │   ├── test_label.txt
    │   └── image
    

coco_text

  • Step1: Download from homepage

  • Step2: Download train_label.txt
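
  • A rough sketch of how the files can be arranged (the /path/to placeholders are assumptions):

    mkdir coco_text && cd coco_text

    # Place the downloaded label file in the dataset root
    mv /path/to/train_label.txt .

    # Place the cropped word images from the homepage download under ./train_words
    mv /path/to/train_words .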

  • After running the above commands, the directory structure should be as follows:

    ├── coco_text
    │   ├── train_label.txt
    │   └── train_words
    

MJSynth (Syn90k)

  • Step1: Download mjsynth.tar.gz from homepage

  • Step2: Download shuffle_labels.txt and label.txt

Note

Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step3:

    mkdir Syn90k && cd Syn90k
    
    mv /path/to/mjsynth.tar.gz .
    
    tar -xzf mjsynth.tar.gz
    
    mv /path/to/shuffle_labels.txt .
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/Syn90k Syn90k
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
    
  • After running the above commands, the directory structure should be as follows:

    ├── Syn90k
    │   ├── shuffle_labels.txt
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── mnt
    

SynthText (Synth800k)

  • Step1: Download SynthText.zip from homepage

  • Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations).

Warning

Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step3:

    mkdir SynthText && cd SynthText
    mv /path/to/SynthText.zip .
    unzip SynthText.zip
    mv SynthText synthtext
    
    mv /path/to/shuffle_labels.txt .
    mv /path/to/label.txt .
    mv /path/to/alphanumeric_labels.txt .
    mv /path/to/instances_train.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthText SynthText
    
  • Step4: Generate cropped images and labels:

    cd /path/to/mmocr
    
    python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/SynthText/label.txt data/mixture/SynthText/label.lmdb --label-only
    
  • After running the above commands, the directory structure should be as follows:

    ├── SynthText
    │   ├── alphanumeric_labels.txt
    │   ├── shuffle_labels.txt
    │   ├── instances_train.txt
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── synthtext
    

SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code:627x)

  • Step2: Download label.txt

  • Step3:

    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthAdd SynthAdd
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/SynthAdd/label.txt data/mixture/SynthAdd/label.lmdb --label-only
    
  • After running the above commands, the directory structure should be as follows:

    ├── SynthAdd
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── SynthText_Add
    

Tip

To convert a label file from txt format to lmdb format, run:

python tools/data/utils/lmdb_converter.py <txt_label_path> <lmdb_label_path> --label-only

For example,

python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only

TextOCR

  • Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/.

    mkdir textocr && cd textocr
    
    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
    
    # For images
    unzip -q train_val_images.zip
    mv train_images train
    
  • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:

    python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── textocr
    │   ├── image
    │   ├── train_label.txt
    │   └── val_label.txt
    

Totaltext

  • Step1: Download totaltext.zip from github dataset and groundtruth_text.zip or TT_new_train_GT.zip (if you prefer to use the latest version of training annotations) from github Groundtruth (Our totaltext_converter.py supports groundtruth with both .mat and .txt format).

    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations
    
    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test
    
    # For legacy training and test annotations
    unzip groundtruth_text.zip
    mv Groundtruth/Polygon/Train annotations/training
    mv Groundtruth/Polygon/Test annotations/test
    
    # Using the latest training annotations
    # WARNING: Delete legacy train annotations before running the following command.
    unzip TT_new_train_GT.zip
    mv Train annotations/training
    
  • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):

    python tools/data/textrecog/totaltext_converter.py /path/to/totaltext
    
  • After running the above commands, the directory structure should be as follows:

    ├── totaltext
    │   ├── dst_imgs
    │   ├── train_label.txt
    │   └── test_label.txt
    

OpenVINO

  • Step1 (optional): Install AWS CLI.

  • Step2: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.

    mkdir openvino && cd openvino
    
    # Download Open Images subsets
    for s in 1 2 5 f; do
      aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
    done
    aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .
    
    # Download annotations
    for s in 1 2 5 f; do
      wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
    done
    wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json
    
    # Extract images
    mkdir -p openimages_v5/val
    for s in 1 2 5 f; do
      tar zxf train_${s}.tar.gz -C openimages_v5
    done
    tar zxf validation.tar.gz -C openimages_v5/val
    
  • Step3: Generate train_{1,2,5,f}_label.txt, val_label.txt and crop images using 4 processes with the following command:

    python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── openvino
    │   ├── image_1
    │   ├── image_2
    │   ├── image_5
    │   ├── image_f
    │   ├── image_val
    │   ├── train_1_label.txt
    │   ├── train_2_label.txt
    │   ├── train_5_label.txt
    │   ├── train_f_label.txt
    │   └── val_label.txt
    

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate train_label.jsonl and test_label.jsonl with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/detext/ignores
    python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── detext
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

NAF

  • Step1: Download labeled_images.tar.gz to naf/.

    mkdir naf && cd naf
    
    # Download NAF dataset
    wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
    tar -zxf labeled_images.tar.gz
    
    # For images
    mkdir annotations && mv labeled_images imgs
    
    # For annotations
    git clone https://github.com/herobd/NAF_dataset.git
    mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
    
    rm -rf NAF_dataset && rm labeled_images.tar.gz
    
  • Step2: Generate train_label.txt, val_label.txt, and test_label.txt with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/naf/ignores
    python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── naf
    │   ├── crops
    │   ├── train_label.txt
    │   ├── val_label.txt
    │   └── test_label.txt
    

SROIE

  • Step1: Download 0325updated.task1train(626p).zip, task1&2_test(361p).zip, and text.task1&2-test(361p).zip from homepage to sroie/

  • Step2:

    mkdir sroie && cd sroie
    mkdir imgs && mkdir annotations && mkdir imgs/training
    
    # Warning: The zip files downloaded from Google Drive and Baidu Cloud may
    # differ. If you encounter errors while extracting or moving files, revise
    # the following commands to match the actual file names.
    unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

    # For images
    mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

    # For annotations
    mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

    rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
    
  • Step3: Generate train_label.jsonl and test_label.jsonl and crop images using 4 processes with the following command:

    python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── sroie
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

Lecture Video DB

Note

The LV dataset already provides cropped images and the corresponding annotations.

  • Step1: Download IIIT-CVid.zip to lv/.

    mkdir lv && cd lv
    
    # Download LV dataset
    wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
    unzip -q IIIT-CVid.zip
    
    # For image
    mv IIIT-CVid/Crops ./
    
    # For annotation
    mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt
    
    rm IIIT-CVid.zip
    
  • Step2: Generate train_label.jsonl, val_label.jsonl, and test_label.jsonl with the following command:

    python tools/data/textrecog/lv_converter.py PATH/TO/lv
    
  • After running the above commands, the directory structure should be as follows:

    ├── lv
    │   ├── Crops
    │   ├── train_label.jsonl
    │   ├── val_label.jsonl
    │   └── test_label.jsonl
    

LSVT

  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    
    # Download LSVT dataset
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
    
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # The annotations of the LSVT test split are not publicly available; split off a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/lsvt/ignores
    python tools/data/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── lsvt
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

FUNSD

  • Step1: Download dataset.zip to funsd/.

    mkdir funsd && cd funsd
    
    # Download FUNSD dataset
    wget https://guillaumejaume.github.io/FUNSD/dataset.zip
    unzip -q dataset.zip
    
    # For images
    mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
    
    # For annotations
    mkdir annotations
    mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
    
    rm dataset.zip && rm -rf dataset
    
  • Step2: Generate train_label.txt and test_label.txt and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts):

    python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── funsd
    │   ├── imgs
    │   ├── dst_imgs
    │   ├── annotations
    │   ├── train_label.txt
    │   └── test_label.txt
    

IMGUR

  • Step1: Run download_imgur5k.py to download the images. You can merge PR#5 into your local repository to enable much faster, parallel downloading.

    mkdir imgur && cd imgur
    
    git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
    
    # Download images from imgur.com. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    
    rm -rf IMGUR5K-Handwriting-Dataset
    
  • Step2: Generate train_label.jsonl, val_label.jsonl, and test_label.jsonl and crop images with the following command:

    python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
    
  • After running the above commands, the directory structure should be as follows:

    ├── imgur
    │   ├── crops
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    │   └── val_label.jsonl
    

KAIST

  • Step1: Download KAIST_all.zip to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    
    # Download KAIST dataset
    wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
    unzip -q KAIST_all.zip
    
    rm KAIST_all.zip
    
  • Step2: Extract zips:

    python tools/data/common/extract_kaist.py PATH/TO/kaist
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/kaist/ignores
    python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── kaist
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

MTWI

  • Step1: Download mtwi_2018_train.zip from homepage.

    mkdir mtwi && cd mtwi
    
    unzip -q mtwi_2018_train.zip
    mv image_train imgs && mv txt_train annotations
    
    rm mtwi_2018_train.zip
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # The annotations of the MTWI test split are not publicly available; split off a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/mtwi/ignores
    python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── mtwi
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

COCO Text v2

  • Step1: Download image train2014.zip and annotation cocotext.v2.zip to coco_textv2/.

    mkdir coco_textv2 && cd coco_textv2
    mkdir annotations
    
    # Download COCO Text v2 dataset
    wget http://images.cocodataset.org/zips/train2014.zip
    wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
    unzip -q train2014.zip && unzip -q cocotext.v2.zip
    
    mv train2014 imgs && mv cocotext.v2.json annotations/
    
    rm train2014.zip && rm -rf cocotext.v2.zip
    
  • Step2: Generate train_label.jsonl and val_label.jsonl with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
    python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── coco_textv2
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl
    

ReCTS

  • Step1: Download ReCTS.zip to rects/ from the homepage.

    mkdir rects && cd rects
    
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
    unzip -q ReCTS.zip
    
    mv img imgs && mv gt_unicode annotations
    
    rm ReCTS.zip -f && rm -rf gt
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # The annotations of the ReCTS test split are not publicly available; split off a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/rects/ignores
    python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── rects
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

ILST

  • Step1: Download IIIT-ILST.zip from the OneDrive link

  • Step2: Run the following commands

    unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
    cd IIIT-ILST
    
    # rename files
    cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
    cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
    cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
    
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., with --val-ratio 0.2, 20% of the data are held out as the validation set.

    python tools/data/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── IIIT-ILST
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

VinText

  • Step1: Download vintext.zip to vintext

    mkdir vintext && cd vintext
    
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
    
    # Extract images and annotations
    unzip -q vintext.zip && rm vintext.zip
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    
    # Rename files
    mv labels annotations && mv test_image test && mv train_images  training && mv unseen_test_images  unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
    
  • Step2: Generate train_label.jsonl, test_label.jsonl, unseen_test_label.jsonl, and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts).

    python tools/data/textrecog/vintext_converter.py PATH/TO/vintext --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── vintext
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    │   └── unseen_test_label.jsonl
    

BID

  • Step1: Download BID Dataset.zip

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\ Dataset.zip BID_Dataset.zip
    
    # Unzip and Rename
    unzip -q BID_Dataset.zip && rm BID_Dataset.zip
    mv BID\ Dataset BID
    
    # The extracted BID dataset may have restrictive permissions; you may need to
    # grant access to its files first
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    
    # For images and annotations
    mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
    mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
    mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
    mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
    mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
    mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
    mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
    mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
    
    # Remove unnecessary files
    rm -rf desktop.ini
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., with --val-ratio 0.2, 20% of the data are held out as the validation set.

    python tools/data/textrecog/bid_converter.py PATH/TO/BID --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── BID
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

RCTW

  • Step1: Download train_images.zip.001, train_images.zip.002, and train_gts.zip from the homepage, extract the zips to rctw/imgs and rctw/annotations, respectively.
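
  • A rough sketch of Step1 (how the split train_images archive must be joined or extracted is an assumption; adjust to your tooling):

    mkdir -p rctw/imgs rctw/annotations

    # train_images.zip.001/.002 form a multi-part archive; 7-Zip can usually
    # extract such archives directly from the first part
    7z x /path/to/train_images.zip.001 -orctw/imgs

    # Extract the annotations
    unzip -q /path/to/train_gts.zip -d rctw/annotations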

  • Step2: Generate train_label.jsonl and val_label.jsonl (optional). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., with --val-ratio 0.2, 20% of the data are held out as the validation set.

    # The annotations of the RCTW test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
    python tools/data/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── rctw
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

HierText

  • Step1 (optional): Install AWS CLI.

  • Step2: Clone the HierText repo to get the annotations

    mkdir HierText && cd HierText
    git clone https://github.com/google-research-datasets/hiertext.git
    
  • Step3: Download train.tgz, validation.tgz from aws

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
    
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.jsonl.gz
    gzip -d annotations/validation.jsonl.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz
    tar -xzvf imgs/validation.tgz
    
  • Step5: Generate train_label.jsonl and val_label.jsonl. HierText provides three levels of annotation: paragraph, line, and word; check the original paper for details. E.g., set --level paragraph for paragraph-level annotations, --level line for line-level annotations, or --level word for word-level annotations.

    # Collect word-level annotations from HierText with --level word
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
    python tools/data/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── HierText
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl
    

ArT

  • Step1: Download train_task2_images.tar.gz and train_task2_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    
    # Download ArT dataset
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json
    
    # Extract
    tar -xf train_task2_images.tar.gz
    mv train_task2_images crops
    mv train_task2_labels.json annotations/
    
    # Remove unnecessary files
    rm train_task2_images.tar.gz
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., with --val-ratio 0.2, 20% of the data are held out as the validation set.

    # The annotations of the ArT test split are not publicly available; split off a validation set by adding --val-ratio 0.2
    python tools/data/textrecog/art_converter.py PATH/TO/art
    
  • After running the above commands, the directory structure should be as follows:

    ├── art
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    