
Text Recognition

Overview

Dataset                   images                          annotation file (training)                                                       annotation file (test)
coco_text                 homepage                        train_label.txt                                                                  -
ICDAR2011                 homepage                        -                                                                                -
ICDAR2013                 homepage                        -                                                                                -
icdar_2015                homepage                        train_label.txt                                                                  test_label.txt
IIIT5K                    homepage                        train_label.txt                                                                  test_label.txt
ct80                      homepage                        -                                                                                test_label.txt
svt                       homepage                        -                                                                                test_label.txt
svtp                      unofficial homepage (*)         -                                                                                test_label.txt
MJSynth (Syn90k)          homepage                        shuffle_labels.txt | label.txt                                                   -
SynthText (Synth800k)     homepage                        alphanumeric_labels.txt | shuffle_labels.txt | instances_train.txt | label.txt   -
SynthAdd                  SynthText_Add.zip (code:627x)   label.txt                                                                        -
TextOCR                   homepage                        -                                                                                -
Totaltext                 homepage                        -                                                                                -
OpenVINO                  Open Images                     annotations                                                                      annotations
FUNSD                     homepage                        -                                                                                -
DeText                    homepage                        -                                                                                -
NAF                       homepage                        -                                                                                -
SROIE                     homepage                        -                                                                                -
Lecture Video DB          homepage                        -                                                                                -
LSVT                      homepage                        -                                                                                -
IMGUR                     homepage                        -                                                                                -
KAIST                     homepage                        -                                                                                -
MTWI                      homepage                        -                                                                                -
COCO Text v2              homepage                        -                                                                                -
ReCTS                     homepage                        -                                                                                -
IIIT-ILST                 homepage                        -                                                                                -
VinText                   homepage                        -                                                                                -
BID                       homepage                        -                                                                                -
RCTW                      homepage                        -                                                                                -
HierText                  homepage                        -                                                                                -
ArT                       homepage                        -                                                                                -

(*) Since the official homepage is currently unavailable, we provide an alternative link for quick reference. However, we do not guarantee the correctness of the dataset.

Install AWS CLI (optional)

  • Some datasets require the AWS CLI to be installed in advance, so we provide a quick installation guide here:

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      sudo ./aws/install
      # or specify the install and binary directories explicitly:
      # ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
      aws configure
      # this command will prompt you for the following fields; you can skip all of
      # them except the Default region name
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]
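      # (optional) verify the installation
      aws --version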
    

ICDAR 2011 (Born-Digital Images)

  • Step1: Download Challenge1_Training_Task3_Images_GT.zip, Challenge1_Test_Task3_Images.zip, and Challenge1_Test_Task3_GT.txt from homepage Task 1.3: Word Recognition (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir annotations
    
    # Download ICDAR 2011
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
    
    # For images
    mkdir crops
    unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
    unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
    
    # For annotations
    mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
    
  • Step2: Convert the original annotations to train_label.jsonl and test_label.jsonl with the following command:

    python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
    
  • After running the above codes, the directory structure should be as follows:

    ├── icdar2011
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

ICDAR 2013 (Focused Scene Text)

  • Step1: Download Challenge2_Training_Task3_Images_GT.zip, Challenge2_Test_Task3_Images.zip, and Challenge2_Test_Task3_GT.txt from homepage Task 2.3: Word Recognition (2013 edition).

    mkdir icdar2013 && cd icdar2013
    mkdir annotations
    
    # Download ICDAR 2013
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate
    
    # For images
    mkdir crops
    unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
    unzip -q Challenge2_Test_Task3_Images.zip -d crops/test
    # For annotations
    mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt
    
    rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
    
  • Step2: Generate train_label.jsonl and test_label.jsonl with the following command:

    python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
    
  • After running the above codes, the directory structure should be as follows:

    ├── icdar2013
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

ICDAR 2013 [Deprecated]

  • Step1: Download Challenge2_Test_Task3_Images.zip and Challenge2_Training_Task3_Images_GT.zip from homepage

  • Step2: Download test_label_1015.txt and train_label.txt
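
  • Step3: Organize the downloads into the expected layout. This is a minimal sketch; it assumes the zips and label files from Step1 and Step2 are stored under /path/to/:

    mkdir icdar_2013 && cd icdar_2013
    mv /path/to/train_label.txt . && mv /path/to/test_label_1015.txt .
    unzip -q /path/to/Challenge2_Training_Task3_Images_GT.zip -d Challenge2_Training_Task3_Images_GT
    unzip -q /path/to/Challenge2_Test_Task3_Images.zip -d Challenge2_Test_Task3_Images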

  • After running the above codes, the directory structure should be as follows:

    ├── icdar_2013
    │   ├── train_label.txt
    │   ├── test_label_1015.txt
    │   ├── test_label_1095.txt
    │   ├── Challenge2_Training_Task3_Images_GT
    │   └──  Challenge2_Test_Task3_Images
    

ICDAR 2015

  • Step1: Download ch4_training_word_images_gt.zip and ch4_test_word_images_gt.zip from homepage

  • Step2: Download train_label.txt and test_label.txt
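
  • Step3: Organize the downloads into the expected layout. This is a minimal sketch; it assumes the zips and label files from Step1 and Step2 are stored under /path/to/:

    mkdir icdar_2015 && cd icdar_2015
    mv /path/to/train_label.txt . && mv /path/to/test_label.txt .
    unzip -q /path/to/ch4_training_word_images_gt.zip -d ch4_training_word_images_gt
    unzip -q /path/to/ch4_test_word_images_gt.zip -d ch4_test_word_images_gt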

  • After running the above codes, the directory structure should be as follows:

    ├── icdar_2015
    │   ├── train_label.txt
    │   ├── test_label.txt
    │   ├── ch4_training_word_images_gt
    │   └── ch4_test_word_images_gt
    

IIIT5K

  • Step1: Download IIIT5K-Word_V3.0.tar.gz from homepage

  • Step2: Download train_label.txt and test_label.txt
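
  • Step3: Extract the archive and place the label files. This is a minimal sketch; it assumes the downloads are stored under /path/to/, and you may need to adjust it if the archive's internal layout differs:

    mkdir IIIT5K && cd IIIT5K
    mv /path/to/train_label.txt . && mv /path/to/test_label.txt .
    tar -xzf /path/to/IIIT5K-Word_V3.0.tar.gz
    # if the archive extracts into a top-level folder, move its train/ and test/
    # directories up so that they sit directly under IIIT5K/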

  • After running the above codes, the directory structure should be as follows:

    ├── IIIT5K
    │   ├── train_label.txt
    │   ├── test_label.txt
    │   ├── train
    │   └── test
    

svt

  • Step1: Download svt.zip from homepage

  • Step2: Download test_label.txt

  • Step3:

    python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
    
  • After running the above codes, the directory structure should be as follows:

    ├── svt
    │   ├── test_label.txt
    │   └── image
    

ct80

  • Step1: Download test_label.txt

  • Step2: Download timage.tar.gz

  • Step3:

    mkdir ct80 && cd ct80
    mv /path/to/test_label.txt .
    mv /path/to/timage.tar.gz .
    tar -xvf timage.tar.gz
    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/ct80 ct80
    
  • After running the above codes, the directory structure should be as follows:

    ├── ct80
    │   ├── test_label.txt
    │   └── timage
    

svtp

  • Step1: Download test_label.txt
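
  • Step2: Organize the files as shown in the directory tree below. This is a minimal sketch; it assumes the cropped word images obtained from the unofficial homepage are stored in /path/to/image:

    mkdir svtp && cd svtp
    mv /path/to/test_label.txt .
    mv /path/to/image ./image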

  • After running the above codes, the directory structure should be as follows:

    ├── svtp
    │   ├── test_label.txt
    │   └── image
    

coco_text

  • Step1: Download from homepage

  • Step2: Download train_label.txt
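
  • Step3: Organize the files as shown in the directory tree below. This is a minimal sketch; it assumes the word images downloaded from the homepage are stored in /path/to/train_words:

    mkdir coco_text && cd coco_text
    mv /path/to/train_label.txt .
    mv /path/to/train_words ./train_words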

  • After running the above codes, the directory structure should be as follows:

    ├── coco_text
    │   ├── train_label.txt
    │   └── train_words
    

MJSynth (Syn90k)

Note

Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step1: Download mjsynth.tar.gz from the homepage

  • Step2: Download shuffle_labels.txt and label.txt

  • Step3:

    mkdir Syn90k && cd Syn90k
    
    mv /path/to/mjsynth.tar.gz .
    
    tar -xzf mjsynth.tar.gz
    
    mv /path/to/shuffle_labels.txt .
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/Syn90k Syn90k
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
    
  • After running the above codes, the directory structure should be as follows:

    ├── Syn90k
    │   ├── shuffle_labels.txt
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── mnt
    

SynthText (Synth800k)

  • Step1: Download SynthText.zip from homepage

  • Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only) and instances_train.txt (7,266,686 character-level annotations).

Warning

Please make sure you’re using the right annotation to train the model by checking its dataset specs in Model Zoo.

  • Step3:

    mkdir SynthText && cd SynthText
    mv /path/to/SynthText.zip .
    unzip SynthText.zip
    mv SynthText synthtext
    
    mv /path/to/shuffle_labels.txt .
    mv /path/to/label.txt .
    mv /path/to/alphanumeric_labels.txt .
    mv /path/to/instances_train.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    ln -s /path/to/SynthText SynthText
    
  • Step4: Generate cropped images and labels:

    cd /path/to/mmocr
    
    python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/SynthText/label.txt data/mixture/SynthText/label.lmdb --label-only
    
  • After running the above codes, the directory structure should be as follows:

    ├── SynthText
    │   ├── alphanumeric_labels.txt
    │   ├── shuffle_labels.txt
    │   ├── instances_train.txt
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── synthtext
    

SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code:627x)

  • Step2: Download label.txt

  • Step3:

    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/label.txt .
    
    # create soft link
    cd /path/to/mmocr/data/mixture
    
    ln -s /path/to/SynthAdd SynthAdd
    
    # Convert 'txt' format annos to 'lmdb' (optional)
    cd /path/to/mmocr
    python tools/data/utils/lmdb_converter.py data/mixture/SynthAdd/label.txt data/mixture/SynthAdd/label.lmdb --label-only
    
  • After running the above codes, the directory structure should be as follows:

    ├── SynthAdd
    │   ├── label.txt
    │   ├── label.lmdb (optional)
    │   └── SynthText_Add
    

Tip

To convert label file from txt format to lmdb format,

python tools/data/utils/lmdb_converter.py <txt_label_path> <lmdb_label_path> --label-only

For example,

python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
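
To quickly sanity-check the result, you can print the number of entries in the generated lmdb file (this assumes the lmdb Python package used by the converter is installed):

python -c "import lmdb; print(lmdb.open('data/mixture/Syn90k/label.lmdb', readonly=True, lock=False).stat()['entries'])"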

TextOCR

  • Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to textocr/.

    mkdir textocr && cd textocr
    
    # Download TextOCR dataset
    wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
    wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json
    
    # For images
    unzip -q train_val_images.zip
    mv train_images train
    
  • Step2: Generate train_label.txt, val_label.txt and crop images using 4 processes with the following command:

    python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── TextOCR
    │   ├── image
    │   ├── train_label.txt
    │   └── val_label.txt
    

Totaltext

  • Step1: Download totaltext.zip from github dataset and groundtruth_text.zip or TT_new_train_GT.zip (if you prefer to use the latest version of training annotations) from github Groundtruth (our totaltext_converter.py supports groundtruth in both .mat and .txt formats).

    mkdir totaltext && cd totaltext
    mkdir imgs && mkdir annotations
    
    # For images
    # in ./totaltext
    unzip totaltext.zip
    mv Images/Train imgs/training
    mv Images/Test imgs/test
    
    # For legacy training and test annotations
    unzip groundtruth_text.zip
    mv Groundtruth/Polygon/Train annotations/training
    mv Groundtruth/Polygon/Test annotations/test
    
    # Using the latest training annotations
    # WARNING: Delete legacy train annotations before running the following command.
    unzip TT_new_train_GT.zip
    mv Train annotations/training
    
  • Step2: Generate cropped images, train_label.txt and test_label.txt with the following command (the cropped images will be saved to data/totaltext/dst_imgs/):

    python tools/data/textrecog/totaltext_converter.py /path/to/totaltext
    
  • After running the above codes, the directory structure should be as follows:

    ├── totaltext
    │   ├── dst_imgs
    │   ├── train_label.txt
    │   └── test_label.txt
    

OpenVINO

  • Step1 (optional): Install AWS CLI.

  • Step2: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.

    mkdir openvino && cd openvino
    
    # Download Open Images subsets
    for s in 1 2 5 f; do
      aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
    done
    aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .
    
    # Download annotations
    for s in 1 2 5 f; do
      wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
    done
    wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json
    
    # Extract images
    mkdir -p openimages_v5/val
    for s in 1 2 5 f; do
      tar zxf train_${s}.tar.gz -C openimages_v5
    done
    tar zxf validation.tar.gz -C openimages_v5/val
    
  • Step3: Generate train_{1,2,5,f}_label.txt, val_label.txt and crop images using 4 processes with the following command:

    python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── OpenVINO
    │   ├── image_1
    │   ├── image_2
    │   ├── image_5
    │   ├── image_f
    │   ├── image_val
    │   ├── train_1_label.txt
    │   ├── train_2_label.txt
    │   ├── train_5_label.txt
    │   ├── train_f_label.txt
    │   └── val_label.txt
    

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate train_label.jsonl and test_label.jsonl with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/detext/ignores
    python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── detext
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

NAF

  • Step1: Download labeled_images.tar.gz to naf/.

    mkdir naf && cd naf
    
    # Download NAF dataset
    wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
    tar -zxf labeled_images.tar.gz
    
    # For images
    mkdir annotations && mv labeled_images imgs
    
    # For annotations
    git clone https://github.com/herobd/NAF_dataset.git
    mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
    
    rm -rf NAF_dataset && rm labeled_images.tar.gz
    
  • Step2: Generate train_label.txt, val_label.txt, and test_label.txt with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/naf/ignores
    python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── naf
    │   ├── crops
    │   ├── train_label.txt
    │   ├── val_label.txt
    │   └── test_label.txt
    

SROIE

  • Step1: Download 0325updated.task1train(626p).zip, task1&2_test(361p).zip, and text.task1&2-test(361p).zip from homepage to sroie/

  • Step2:

    mkdir sroie && cd sroie
    mkdir imgs && mkdir annotations && mkdir imgs/training
    
    # Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
    # differ. If you encounter errors while extracting or moving files, adjust the
    # file names in the following commands accordingly.
    unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip
    
    # For images
    mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test
    
    # For annotations
    mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test
    
    rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
    
  • Step3: Generate train_label.jsonl and test_label.jsonl and crop images using 4 processes with the following command:

    python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── sroie
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

Lecture Video DB

Note

The LV dataset already provides cropped images and the corresponding annotations.

  • Step1: Download IIIT-CVid.zip to lv/.

    mkdir lv && cd lv
    
    # Download LV dataset
    wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
    unzip -q IIIT-CVid.zip
    
    # For image
    mv IIIT-CVid/Crops ./
    
    # For annotation
    mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt
    
    rm IIIT-CVid.zip
    
  • Step2: Generate train_label.jsonl, val_label.jsonl, and test_label.jsonl with the following command:

    python tools/data/textrecog/lv_converter.py PATH/TO/lv
    
  • After running the above codes, the directory structure should be as follows:

    ├── lv
    │   ├── Crops
    │   ├── train_label.jsonl
    │   └── test_label.jsonl
    

LSVT

  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    
    # Download LSVT dataset
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
    
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # Annotations of the LSVT test split are not publicly available; split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/lsvt/ignores
    python tools/data/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── lsvt
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

FUNSD

  • Step1: Download dataset.zip to funsd/.

    mkdir funsd && cd funsd
    
    # Download FUNSD dataset
    wget https://guillaumejaume.github.io/FUNSD/dataset.zip
    unzip -q dataset.zip
    
    # For images
    mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/
    
    # For annotations
    mkdir annotations
    mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test
    
    rm dataset.zip && rm -rf dataset
    
  • Step2: Generate train_label.txt and test_label.txt and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts):

    python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── funsd
    │   ├── imgs
    │   ├── dst_imgs
    │   ├── annotations
    │   ├── train_label.txt
    │   └── test_label.txt
    

IMGUR

  • Step1: Run download_imgur5k.py to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.

    mkdir imgur && cd imgur
    
    git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
    
    # Download images from imgur.com. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    
    rm -rf IMGUR5K-Handwriting-Dataset
    
  • Step2: Generate train_label.jsonl, val_label.jsonl and test_label.jsonl and crop images with the following command:

    python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
    
  • After running the above codes, the directory structure should be as follows:

    ├── imgur
    │   ├── crops
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    │   └── val_label.jsonl
    

KAIST

  • Step1: Download KAIST_all.zip to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    
    # Download KAIST dataset
    wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
    unzip -q KAIST_all.zip
    
    rm KAIST_all.zip
    
  • Step2: Extract zips:

    python tools/data/common/extract_kaist.py PATH/TO/kaist
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/kaist/ignores
    python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── kaist
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

MTWI

  • Step1: Download mtwi_2018_train.zip from homepage.

    mkdir mtwi && cd mtwi
    
    unzip -q mtwi_2018_train.zip
    mv image_train imgs && mv txt_train annotations
    
    rm mtwi_2018_train.zip
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # Annotations of MTWI test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/mtwi/ignores
    python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── mtwi
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

COCO Text v2

  • Step1: Download image train2014.zip and annotation cocotext.v2.zip to coco_textv2/.

    mkdir coco_textv2 && cd coco_textv2
    mkdir annotations
    
    # Download COCO Text v2 dataset
    wget http://images.cocodataset.org/zips/train2014.zip
    wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
    unzip -q train2014.zip && unzip -q cocotext.v2.zip
    
    mv train2014 imgs && mv cocotext.v2.json annotations/
    
    rm train2014.zip && rm -rf cocotext.v2.zip
    
  • Step2: Generate train_label.jsonl and val_label.jsonl with the following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
    python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── coco_textv2
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl
    

ReCTS

  • Step1: Download ReCTS.zip to rects/ from the homepage.

    mkdir rects && cd rects
    
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
    unzip -q ReCTS.zip
    
    mv img imgs && mv gt_unicode annotations
    
    rm ReCTS.zip -f && rm -rf gt
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional) with the following command:

    # Annotations of ReCTS test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/rects/ignores
    python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── rects
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

ILST

  • Step1: Download IIIT-ILST.zip from onedrive link

  • Step2: Run the following commands

    unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
    cd IIIT-ILST
    
    # rename files
    cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
    cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
    cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
    
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    python tools/data/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── IIIT-ILST
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

VinText

  • Step1: Download vintext.zip to vintext

    mkdir vintext && cd vintext
    
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
    
    # Extract images and annotations
    unzip -q vintext.zip && rm vintext.zip
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    
    # Rename files
    mv labels annotations && mv test_image test && mv train_images  training && mv unseen_test_images  unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
    
  • Step2: Generate train_label.jsonl, test_label.jsonl, unseen_test_label.jsonl, and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts).

    python tools/data/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── vintext
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   ├── test_label.jsonl
    │   └── unseen_test_label.jsonl
    

BID

  • Step1: Download BID Dataset.zip

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\ Dataset.zip BID_Dataset.zip
    
    # Unzip and Rename
    unzip -q BID_Dataset.zip && rm BID_Dataset.zip
    mv BID\ Dataset BID
    
    # The BID dataset has a permission issue; you may need to
    # add permissions to these files
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    
    # For images and annotations
    mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
    mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
    mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
    mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
    mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
    mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
    mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
    mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
    
    # Remove unnecessary files
    rm -rf desktop.ini
    
  • Step3: Generate train_label.jsonl and val_label.jsonl (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

    python tools/data/textrecog/bid_converter.py PATH/TO/BID --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── BID
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

RCTW

  • Step1: Download train_images.zip.001, train_images.zip.002, and train_gts.zip from the homepage, extract the zips to rctw/imgs and rctw/annotations, respectively.
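
For example, Step1 could be scripted as follows. This is an untested sketch; depending on how the split archive was created, you may need 7z x train_images.zip.001 instead of cat + unzip:

    mkdir rctw && mkdir rctw/imgs && mkdir rctw/annotations
    cat /path/to/train_images.zip.001 /path/to/train_images.zip.002 > /path/to/train_images.zip
    unzip -q /path/to/train_images.zip -d rctw/imgs
    unzip -q /path/to/train_gts.zip -d rctw/annotations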

  • Step2: Generate train_label.jsonl and val_label.jsonl (optional). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
    python tools/data/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── rctw
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    

HierText

  • Step1 (optional): Install AWS CLI.

  • Step2: Clone HierText repo to get annotations

    mkdir HierText
    git clone https://github.com/google-research-datasets/hiertext.git
    
  • Step3: Download train.tgz, validation.tgz from aws

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
    
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.jsonl.gz
    gzip -d annotations/validation.jsonl.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz
    tar -xzvf imgs/validation.tgz
    
  • Step5: Generate train_label.jsonl and val_label.jsonl. HierText includes different levels of annotation, including paragraph, line, and word. Check the original paper for details. E.g., set --level paragraph for paragraph-level annotations, --level line for line-level annotations, or --level word for word-level annotations.

    # Collect word-level annotations from HierText by setting --level word
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
    python tools/data/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── HierText
    │   ├── crops
    │   ├── ignores
    │   ├── train_label.jsonl
    │   └── val_label.jsonl
    

ArT

  • Step1: Download train_task2_images.tar.gz and train_task2_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    
    # Download ArT dataset
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json
    
    # Extract
    tar -xf train_task2_images.tar.gz
    mv train_task2_images crops
    mv train_task2_labels.json annotations/
    
    # Remove unnecessary files
    rm train_task2_images.tar.gz
    
  • Step2: Generate train_label.jsonl and val_label.jsonl (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
    python tools/data/textrecog/art_converter.py PATH/TO/art
    
  • After running the above codes, the directory structure should be as follows:

    │── art
    │   ├── crops
    │   ├── train_label.jsonl
    │   └── val_label.jsonl (optional)
    