# Text Recognition

## Overview
| Dataset | images | annotation file (training) | annotation file (test) |
|---|---|---|---|
| coco_text | homepage | train_label.txt | - |
| ICDAR2011 | homepage | - | - |
| ICDAR2013 | homepage | - | - |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | homepage | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | unofficial homepage[1] | - | test_label.txt |
| MJSynth (Syn90k) | homepage | shuffle_labels.txt, label.txt | - |
| SynthText (Synth800k) | homepage | alphanumeric_labels.txt, shuffle_labels.txt, instances_train.txt, label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |
| TextOCR | homepage | - | - |
| Totaltext | homepage | - | - |
| OpenVINO | Open Images | annotations | annotations |
| FUNSD | homepage | - | - |
| DeText | homepage | - | - |
| NAF | homepage | - | - |
| SROIE | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| COCO Text v2 | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |

[1] Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
## Install AWS CLI (optional)

Since some datasets require the AWS CLI to be installed in advance, we provide a quick installation guide here:

```shell
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# or install to a custom location:
# ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
aws configure
# this command will require you to input keys, you can skip them except
# for the Default region name
# AWS Access Key ID [None]:
# AWS Secret Access Key [None]:
# Default region name [None]: us-east-1
# Default output format [None]:
```
## ICDAR 2011 (Born-Digital Images)

Step1: Download `Challenge1_Training_Task3_Images_GT.zip`, `Challenge1_Test_Task3_Images.zip`, and `Challenge1_Test_Task3_GT.txt` from homepage, Task 1.3: Word Recognition (2013 edition).

```shell
mkdir icdar2011 && cd icdar2011
mkdir annotations

# Download ICDAR 2011
wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge1_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
```

Step2: Convert the original annotations to `train_label.jsonl` and `test_label.jsonl` with the following command:

```shell
python tools/data/textrecog/ic11_converter.py PATH/TO/icdar2011
```

After running the above commands, the directory structure should be as follows:

```text
├── icdar2011
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
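The generated `.jsonl` label files hold one JSON object per line. A minimal reader sketch (the exact keys of each record, e.g. a `filename` and a `text` field, are an assumption here; check a generated file for the actual layout):

```python
import json

def load_jsonl_labels(path):
    """Read a .jsonl label file: one JSON object per non-empty line."""
    samples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    return samples
```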
## ICDAR 2013 (Focused Scene Text)

Step1: Download `Challenge2_Training_Task3_Images_GT.zip`, `Challenge2_Test_Task3_Images.zip`, and `Challenge2_Test_Task3_GT.txt` from homepage, Task 2.3: Word Recognition (2013 edition).

```shell
mkdir icdar2013 && cd icdar2013
mkdir annotations

# Download ICDAR 2013
wget https://rrc.cvc.uab.es/downloads/Challenge2_Training_Task3_Images_GT.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_Images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/Challenge2_Test_Task3_GT.txt --no-check-certificate

# For images
mkdir crops
unzip -q Challenge2_Training_Task3_Images_GT.zip -d crops/train
unzip -q Challenge2_Test_Task3_Images.zip -d crops/test

# For annotations
mv Challenge2_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge2_Train_Task3_GT.txt

rm Challenge2_Training_Task3_Images_GT.zip && rm Challenge2_Test_Task3_Images.zip
```

Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:

```shell
python tools/data/textrecog/ic13_converter.py PATH/TO/icdar2013
```

After running the above commands, the directory structure should be as follows:

```text
├── icdar2013
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
## ICDAR 2013 [Deprecated]

Step1: Download `Challenge2_Test_Task3_Images.zip` and `Challenge2_Training_Task3_Images_GT.zip` from homepage.

Step2: Download test_label_1015.txt and train_label.txt.

After downloading, the directory structure should be as follows:

```text
├── icdar_2013
│   ├── train_label.txt
│   ├── test_label_1015.txt
│   ├── test_label_1095.txt
│   ├── Challenge2_Training_Task3_Images_GT
│   └── Challenge2_Test_Task3_Images
```
## ICDAR 2015

Step1: Download `ch4_training_word_images_gt.zip` and `ch4_test_word_images_gt.zip` from homepage.

Step2: Download train_label.txt and test_label.txt.

After downloading, the directory structure should be as follows:

```text
├── icdar_2015
│   ├── train_label.txt
│   ├── test_label.txt
│   ├── ch4_training_word_images_gt
│   └── ch4_test_word_images_gt
```
## IIIT5K

Step1: Download `IIIT5K-Word_V3.0.tar.gz` from homepage.

Step2: Download train_label.txt and test_label.txt.

After downloading, the directory structure should be as follows:

```text
├── IIIT5K
│   ├── train_label.txt
│   ├── test_label.txt
│   ├── train
│   └── test
```
## svt

Step1: Download `svt.zip` from homepage.

Step2: Download test_label.txt.

Step3:

```shell
python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
```

After running the above commands, the directory structure should be as follows:

```text
├── svt
│   ├── test_label.txt
│   └── image
```
## ct80

Step1: Download test_label.txt.

Step2: Download timage.tar.gz.

Step3:

```shell
mkdir ct80 && cd ct80
mv /path/to/test_label.txt .
mv /path/to/timage.tar.gz .
tar -xvf timage.tar.gz

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/ct80 ct80
```

After running the above commands, the directory structure should be as follows:

```text
├── ct80
│   ├── test_label.txt
│   └── timage
```
## svtp

Step1: Download test_label.txt.

After downloading, the directory structure should be as follows:

```text
├── svtp
│   ├── test_label.txt
│   └── image
```
## coco_text

Step1: Download from homepage.

Step2: Download train_label.txt.

After downloading, the directory structure should be as follows:

```text
├── coco_text
│   ├── train_label.txt
│   └── train_words
```
## MJSynth (Syn90k)

Step1: Download `mjsynth.tar.gz` from homepage.

Step2: Download label.txt (8,919,273 annotations) and shuffle_labels.txt (2,400,000 randomly sampled annotations).

Note

Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.

Step3:

```shell
mkdir Syn90k && cd Syn90k
mv /path/to/mjsynth.tar.gz .
tar -xzf mjsynth.tar.gz
mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/Syn90k Syn90k

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
```

After running the above commands, the directory structure should be as follows:

```text
├── Syn90k
│   ├── shuffle_labels.txt
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── mnt
```
## SynthText (Synth800k)

Step1: Download `SynthText.zip` from homepage.

Step2: According to your actual needs, download the most appropriate one from the following options: label.txt (7,266,686 annotations), shuffle_labels.txt (2,400,000 randomly sampled annotations), alphanumeric_labels.txt (7,239,272 annotations with alphanumeric characters only), and instances_train.txt (7,266,686 character-level annotations).

Warning

Please make sure you're using the right annotation to train the model by checking its dataset specs in Model Zoo.

Step3:

```shell
mkdir SynthText && cd SynthText
mv /path/to/SynthText.zip .
unzip SynthText.zip
mv SynthText synthtext
mv /path/to/shuffle_labels.txt .
mv /path/to/label.txt .
mv /path/to/alphanumeric_labels.txt .
mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthText SynthText
```

Step4: Generate cropped images and labels:

```shell
cd /path/to/mmocr
python tools/data/textrecog/synthtext_converter.py data/mixture/SynthText/gt.mat data/mixture/SynthText/ data/mixture/SynthText/synthtext/SynthText_patch_horizontal --n_proc 8

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/SynthText/label.txt data/mixture/SynthText/label.lmdb --label-only
```

After running the above commands, the directory structure should be as follows:

```text
├── SynthText
│   ├── alphanumeric_labels.txt
│   ├── shuffle_labels.txt
│   ├── instances_train.txt
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── synthtext
```
## SynthAdd

Step1: Download `SynthText_Add.zip` from SynthAdd (code:627x).

Step2: Download label.txt.

Step3:

```shell
mkdir SynthAdd && cd SynthAdd
mv /path/to/SynthText_Add.zip .
unzip SynthText_Add.zip
mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture
ln -s /path/to/SynthAdd SynthAdd

# Convert 'txt' format annos to 'lmdb' (optional)
cd /path/to/mmocr
python tools/data/utils/lmdb_converter.py data/mixture/SynthAdd/label.txt data/mixture/SynthAdd/label.lmdb --label-only
```

After running the above commands, the directory structure should be as follows:

```text
├── SynthAdd
│   ├── label.txt
│   ├── label.lmdb (optional)
│   └── SynthText_Add
```
Tip

To convert a label file from `txt` format to `lmdb` format:

```shell
python tools/data/utils/lmdb_converter.py <txt_label_path> <lmdb_label_path> --label-only
```

For example:

```shell
python tools/data/utils/lmdb_converter.py data/mixture/Syn90k/label.txt data/mixture/Syn90k/label.lmdb --label-only
```
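For reference, a minimal sketch of reading such a `txt` label file, assuming each line holds an image path followed by its transcription separated by the first space (the exact layout can differ between datasets, so check the file first):

```python
def parse_txt_labels(path):
    """Parse lines of the form '<image_path> <transcription>'."""
    samples = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # split only on the first space so transcriptions may contain spaces
            img_path, _, text = line.partition(' ')
            samples.append((img_path, text))
    return samples
```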
## TextOCR

Step1: Download train_val_images.zip, TextOCR_0.1_train.json and TextOCR_0.1_val.json to `textocr/`.

```shell
mkdir textocr && cd textocr

# Download TextOCR dataset
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_train.json
wget https://dl.fbaipublicfiles.com/textvqa/data/textocr/TextOCR_0.1_val.json

# For images
unzip -q train_val_images.zip
mv train_images train
```

Step2: Generate `train_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```shell
python tools/data/textrecog/textocr_converter.py /path/to/textocr 4
```

After running the above commands, the directory structure should be as follows:

```text
├── textocr
│   ├── image
│   ├── train_label.txt
│   └── val_label.txt
```
## Totaltext

Step1: Download `totaltext.zip` from github dataset, and `groundtruth_text.zip` or `TT_new_train_GT.zip` (if you prefer to use the latest version of training annotations) from github Groundtruth (our totaltext_converter.py supports groundtruth in both .mat and .txt format).

```shell
mkdir totaltext && cd totaltext
mkdir imgs && mkdir annotations

# For images
# in ./totaltext
unzip totaltext.zip
mv Images/Train imgs/training
mv Images/Test imgs/test

# For legacy training and test annotations
unzip groundtruth_text.zip
mv Groundtruth/Polygon/Train annotations/training
mv Groundtruth/Polygon/Test annotations/test

# Using the latest training annotations
# WARNING: Delete legacy train annotations before running the following command.
unzip TT_new_train_GT.zip
mv Train annotations/training
```

Step2: Generate cropped images, `train_label.txt` and `test_label.txt` with the following command (the cropped images will be saved to `data/totaltext/dst_imgs/`):

```shell
python tools/data/textrecog/totaltext_converter.py /path/to/totaltext
```

After running the above commands, the directory structure should be as follows:

```text
├── totaltext
│   ├── dst_imgs
│   ├── train_label.txt
│   └── test_label.txt
```
## OpenVINO

Step1 (optional): Install AWS CLI.

Step2: Download Open Images subsets `train_1`, `train_2`, `train_5`, `train_f`, and `validation` to `openvino/`.

```shell
mkdir openvino && cd openvino

# Download Open Images subsets
for s in 1 2 5 f; do
  aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
done
aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .

# Download annotations
for s in 1 2 5 f; do
  wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
done
wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json

# Extract images
mkdir -p openimages_v5/val
for s in 1 2 5 f; do
  tar zxf train_${s}.tar.gz -C openimages_v5
done
tar zxf validation.tar.gz -C openimages_v5/val
```

Step3: Generate `train_{1,2,5,f}_label.txt`, `val_label.txt` and crop images using 4 processes with the following command:

```shell
python tools/data/textrecog/openvino_converter.py /path/to/openvino 4
```

After running the above commands, the directory structure should be as follows:

```text
├── openvino
│   ├── image_1
│   ├── image_2
│   ├── image_5
│   ├── image_f
│   ├── image_val
│   ├── train_1_label.txt
│   ├── train_2_label.txt
│   ├── train_5_label.txt
│   ├── train_f_label.txt
│   └── val_label.txt
```
## DeText

Step1: Download `ch9_training_images.zip`, `ch9_training_localization_transcription_gt.zip`, `ch9_validation_images.zip`, and `ch9_validation_localization_transcription_gt.zip` from Task 3: End to End on the homepage.

```shell
mkdir detext && cd detext
mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val

# Download DeText
wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate

# Extract images and annotations
unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val

# Remove zips
rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
```

Step2: Generate `train_label.jsonl` and `test_label.jsonl` with the following command:

```shell
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/detext/ignores
python tools/data/textrecog/detext_converter.py PATH/TO/detext --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── detext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
## NAF

Step1: Download labeled_images.tar.gz to `naf/`.

```shell
mkdir naf && cd naf

# Download NAF dataset
wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
tar -zxf labeled_images.tar.gz

# For images
mkdir annotations && mv labeled_images imgs

# For annotations
git clone https://github.com/herobd/NAF_dataset.git
mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/

rm -rf NAF_dataset && rm labeled_images.tar.gz
```

Step2: Generate `train_label.txt`, `val_label.txt`, and `test_label.txt` with the following command:

```shell
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/naf/ignores
python tools/data/textrecog/naf_converter.py PATH/TO/naf --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── naf
│   ├── crops
│   ├── train_label.txt
│   ├── val_label.txt
│   └── test_label.txt
```
## SROIE

Step1: Download `0325updated.task1train(626p).zip`, `task1&2_test(361p).zip`, and `text.task1&2-test(361p).zip` from homepage to `sroie/`.

Step2:

```shell
mkdir sroie && cd sroie
mkdir imgs && mkdir annotations && mkdir imgs/training

# Warning: The zip files downloaded from Google Drive and BaiduYun Cloud may
# be different; revise the following commands to use the correct file names
# if you encounter errors while extracting and moving the files.
unzip -q 0325updated.task1train\(626p\).zip && unzip -q task1\&2_test\(361p\).zip && unzip -q text.task1\&2-test\(361p\).zip

# For images
mv 0325updated.task1train\(626p\)/*.jpg imgs/training && mv fulltext_test\(361p\) imgs/test

# For annotations
mv 0325updated.task1train\(626p\) annotations/training && mv text.task1\&2-test\(361p\)/ annotations/test

rm 0325updated.task1train\(626p\).zip && rm task1\&2_test\(361p\).zip && rm text.task1\&2-test\(361p\).zip
```

Step3: Generate `train_label.jsonl` and `test_label.jsonl` and crop images using 4 processes with the following command:

```shell
python tools/data/textrecog/sroie_converter.py PATH/TO/sroie --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── sroie
│   ├── crops
│   ├── train_label.jsonl
│   └── test_label.jsonl
```
## Lecture Video DB

Note

The LV dataset already provides cropped images and the corresponding annotations.

Step1: Download IIIT-CVid.zip to `lv/`.

```shell
mkdir lv && cd lv

# Download LV dataset
wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
unzip -q IIIT-CVid.zip

# For images
mv IIIT-CVid/Crops ./

# For annotations
mv IIIT-CVid/train.txt train_label.txt && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_label.txt

rm IIIT-CVid.zip
```

Step2: Generate `train_label.jsonl`, `val_label.jsonl`, and `test_label.jsonl` with the following command:

```shell
python tools/data/textrecog/lv_converter.py PATH/TO/lv
```

After running the above commands, the directory structure should be as follows:

```text
├── lv
│   ├── Crops
│   ├── train_label.jsonl
│   ├── val_label.jsonl
│   └── test_label.jsonl
```
## LSVT

Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to `lsvt/`.

```shell
mkdir lsvt && cd lsvt

# Download LSVT dataset
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json

mkdir annotations
tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
mv train_full_images_0 imgs

rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
```

Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```shell
# Annotations of the LSVT test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
python tools/data/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── lsvt
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## FUNSD

Step1: Download dataset.zip to `funsd/`.

```shell
mkdir funsd && cd funsd

# Download FUNSD dataset
wget https://guillaumejaume.github.io/FUNSD/dataset.zip
unzip -q dataset.zip

# For images
mv dataset/training_data/images imgs && mv dataset/testing_data/images/* imgs/

# For annotations
mkdir annotations
mv dataset/training_data/annotations annotations/training && mv dataset/testing_data/annotations annotations/test

rm dataset.zip && rm -rf dataset
```

Step2: Generate `train_label.txt` and `test_label.txt` and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):

```shell
python tools/data/textrecog/funsd_converter.py PATH/TO/funsd --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── funsd
│   ├── imgs
│   ├── dst_imgs
│   ├── annotations
│   ├── train_label.txt
│   └── test_label.txt
```
## IMGUR

Step1: Run `download_imgur5k.py` to download images. You can merge PR#5 into your local repository to enable a much faster parallel download of the images.

```shell
mkdir imgur && cd imgur

git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git

# Download images from imgur.com. This may take SEVERAL HOURS!
python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs

# For annotations
mkdir annotations
mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations

rm -rf IMGUR5K-Handwriting-Dataset
```

Step2: Generate `train_label.jsonl`, `val_label.jsonl` and `test_label.jsonl` and crop images with the following command:

```shell
python tools/data/textrecog/imgur_converter.py PATH/TO/imgur
```

After running the above commands, the directory structure should be as follows:

```text
├── imgur
│   ├── crops
│   ├── train_label.jsonl
│   ├── test_label.jsonl
│   └── val_label.jsonl
```
## KAIST

Step1: Download KAIST_all.zip to `kaist/`.

```shell
mkdir kaist && cd kaist
mkdir imgs && mkdir annotations

# Download KAIST dataset
wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
unzip -q KAIST_all.zip

rm KAIST_all.zip
```

Step2: Extract zips:

```shell
python tools/data/common/extract_kaist.py PATH/TO/kaist
```

Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```shell
# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/kaist/ignores
python tools/data/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── kaist
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## MTWI

Step1: Download `mtwi_2018_train.zip` from homepage.

```shell
mkdir mtwi && cd mtwi

unzip -q mtwi_2018_train.zip
mv image_train imgs && mv txt_train annotations

rm mtwi_2018_train.zip
```

Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```shell
# Annotations of the MTWI test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/mtwi/ignores
python tools/data/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── mtwi
│   ├── crops
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## COCO Text v2

Step1: Download image train2014.zip and annotation cocotext.v2.zip to `coco_textv2/`.

```shell
mkdir coco_textv2 && cd coco_textv2
mkdir annotations

# Download COCO Text v2 dataset
wget http://images.cocodataset.org/zips/train2014.zip
wget https://github.com/bgshih/cocotext/releases/download/dl/cocotext.v2.zip
unzip -q train2014.zip && unzip -q cocotext.v2.zip
mv train2014 imgs && mv cocotext.v2.json annotations/

rm train2014.zip && rm -rf cocotext.v2.zip
```

Step2: Generate `train_label.jsonl` and `val_label.jsonl` with the following command:

```shell
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/coco_textv2/ignores
python tools/data/textrecog/cocotext_converter.py PATH/TO/coco_textv2 --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── coco_textv2
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl
```
## ReCTS

Step1: Download ReCTS.zip to `rects/` from the homepage.

```shell
mkdir rects && cd rects

# Download ReCTS dataset
# You can also find a Google Drive link on the dataset homepage
wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
unzip -q ReCTS.zip
mv img imgs && mv gt_unicode annotations

rm ReCTS.zip -f && rm -rf gt
```

Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional) with the following command:

```shell
# Annotations of the ReCTS test split are not publicly available; split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/rects/ignores
python tools/data/textrecog/rects_converter.py PATH/TO/rects --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── rects
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## ILST

Step1: Download `IIIT-ILST.zip` from onedrive link.

Step2: Run the following commands:

```shell
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
cd IIIT-ILST

# rename files
cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..

# transfer image path
mkdir imgs && mkdir annotations
mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/

# remove unnecessary files
rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
```

Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```shell
python tools/data/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── IIIT-ILST
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
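The `--val-ratio` split described above can be sketched as follows. This is a stand-alone illustration of the idea (shuffle, then hold out a fraction as validation), not the converter's actual code; the function name and seed handling are assumptions.

```python
import random

def split_train_val(samples, val_ratio=0.2, seed=0):
    """Shuffle samples deterministically and hold out `val_ratio` of them."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_val = int(len(samples) * val_ratio)
    # n_val == 0 means everything stays in the training split
    return samples[:len(samples) - n_val], samples[len(samples) - n_val:]
```

For example, with `val_ratio=0.2` and 1000 samples, 800 go to training and 200 to validation.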
## VinText

Step1: Download vintext.zip to `vintext/`.

```shell
mkdir vintext && cd vintext

# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt

# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese

# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
```

Step2: Generate `train_label.jsonl`, `test_label.jsonl`, and `unseen_test_label.jsonl`, and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts):

```shell
python tools/data/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── vintext
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   ├── test_label.jsonl
│   └── unseen_test_label.jsonl
```
## BID

Step1: Download BID Dataset.zip.

Step2: Run the following commands to preprocess the dataset:

```shell
# Rename
mv BID\ Dataset.zip BID_Dataset.zip

# Unzip and rename
unzip -q BID_Dataset.zip && rm BID_Dataset.zip
mv BID\ Dataset BID

# The BID dataset has permission issues; you may need to
# add permissions for these files
chmod -R 777 BID
cd BID
mkdir imgs && mkdir annotations

# For images and annotations
mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso

# Remove unnecessary files
rm -rf desktop.ini
```

Step3: Generate `train_label.jsonl` and `val_label.jsonl` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```shell
python tools/data/textrecog/bid_converter.py PATH/TO/BID --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── BID
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## RCTW

Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the homepage, and extract the zips to `rctw/imgs` and `rctw/annotations`, respectively.

Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```shell
# Annotations of the RCTW test split are not publicly available; split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
python tools/data/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── rctw
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```
## HierText

Step1 (optional): Install AWS CLI.

Step2: Clone the HierText repo to get the annotations:

```shell
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
```

Step3: Download `train.tgz` and `validation.tgz` from aws:

```shell
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
```

Step4: Process the raw data:

```shell
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.jsonl.gz
gzip -d annotations/validation.jsonl.gz

# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
```

Step5: Generate `train_label.jsonl` and `val_label.jsonl`. HierText includes different levels of annotation: `paragraph`, `line`, and `word`. Check the original paper for details. E.g., set `--level paragraph` to get paragraph-level annotation, `--level line` to get line-level annotation, or `--level word` to get word-level annotation.

```shell
# Collect word annotation from HierText with --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/data/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
```

After running the above commands, the directory structure should be as follows:

```text
├── HierText
│   ├── crops
│   ├── ignores
│   ├── train_label.jsonl
│   └── val_label.jsonl
```
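The three levels nest: a paragraph contains lines, and a line contains words. A minimal sketch of flattening one image record down to word-level transcriptions (the key names `paragraphs`, `lines`, `words`, and `text` follow the format described in the HierText paper, but treat them as assumptions and check the downloaded ground truth):

```python
def collect_words(image_record):
    """Flatten a nested paragraphs -> lines -> words record into word texts."""
    words = []
    for paragraph in image_record.get('paragraphs', []):
        for line in paragraph.get('lines', []):
            for word in line.get('words', []):
                words.append(word['text'])
    return words
```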
## ArT

Step1: Download `train_task2_images.tar.gz` and `train_task2_labels.json` from the homepage to `art/`.

```shell
mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json

# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/

# Remove unnecessary files
rm train_task2_images.tar.gz
```

Step2: Generate `train_label.jsonl` and `val_label.jsonl` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set.

```shell
# Annotations of the ArT test split are not publicly available; split a validation set by adding --val-ratio 0.2
python tools/data/textrecog/art_converter.py PATH/TO/art
```

After running the above commands, the directory structure should be as follows:

```text
├── art
│   ├── crops
│   ├── train_label.jsonl
│   └── val_label.jsonl (optional)
```