Datasets Preparation

This page lists datasets commonly used for text detection, text recognition, key information extraction, and named entity recognition, together with their download links.


Text Detection

The text detection dataset directory is organized as follows.

├── ctw1500
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2015
│   ├── imgs
│   ├── instances_test.json
│   └── instances_training.json
├── icdar2017
│   ├── imgs
│   ├── instances_training.json
│   └── instances_val.json
├── synthtext
│   ├── imgs
│   └── instances_training.lmdb

| Dataset | Images | Annotation file (training) | Annotation file (validation) | Annotation file (testing) |
| :-- | :-- | :-- | :-- | :-- |
| CTW1500 | homepage | instances_training.json | - | instances_test.json |
| ICDAR2015 | homepage | instances_training.json | - | instances_test.json |
| ICDAR2017 | homepage, renamed_imgs | instances_training.json | instances_val.json | - |
| Synthtext | homepage | instances_training.lmdb | - | - |
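
Before training, it can help to confirm that the local tree matches the layout above. The following is a minimal sketch, not an MMOCR tool: the helper name `missing_entries` and the `data` root are assumptions made for illustration, and the expected entries are copied from the directory tree on this page.

```python
from pathlib import Path

# Expected entries per dataset, copied from the directory tree above.
DET_LAYOUT = {
    "ctw1500": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2015": ["imgs", "instances_test.json", "instances_training.json"],
    "icdar2017": ["imgs", "instances_training.json", "instances_val.json"],
    "synthtext": ["imgs", "instances_training.lmdb"],
}


def missing_entries(root, layout=DET_LAYOUT):
    """Return the relative paths from `layout` that do not exist under `root`."""
    root = Path(root)
    missing = []
    for dataset, entries in layout.items():
        for entry in entries:
            if not (root / dataset / entry).exists():
                missing.append(f"{dataset}/{entry}")
    return missing


if __name__ == "__main__":
    # "data" is a hypothetical root; point it at your own dataset directory.
    for path in missing_entries("data"):
        print("missing:", path)
```

Passing a different dictionary as `layout` lets the same check cover the other dataset trees on this page.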

For icdar2015:

Step1: Download ch4_training_images.zip and ch4_test_images.zip from homepage
Step2: Download instances_training.json and instances_test.json
Step3:

mkdir icdar2015 && cd icdar2015
mv /path/to/instances_training.json .
mv /path/to/instances_test.json .

mkdir imgs && cd imgs
ln -s /path/to/ch4_training_images training
ln -s /path/to/ch4_test_images test
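
For scripted setups, Step3 can also be expressed in Python. This is a sketch that mirrors the commands above and takes the same `/path/to/...` locations as arguments; the function name `prepare_icdar2015` and its parameters are hypothetical and not part of MMOCR.

```python
import os
import shutil
from pathlib import Path


def prepare_icdar2015(dest, train_json, test_json, train_imgs, test_imgs):
    """Mirror the shell steps: move the annotations and link the image folders."""
    root = Path(dest) / "icdar2015"
    imgs = root / "imgs"
    imgs.mkdir(parents=True, exist_ok=True)

    # Equivalent of: mv /path/to/instances_*.json .
    shutil.move(str(train_json), str(root / "instances_training.json"))
    shutil.move(str(test_json), str(root / "instances_test.json"))

    # Equivalent of: ln -s /path/to/ch4_training_images training (and test)
    os.symlink(os.path.abspath(train_imgs), imgs / "training",
               target_is_directory=True)
    os.symlink(os.path.abspath(test_imgs), imgs / "test",
               target_is_directory=True)
    return root
```

The links are created with absolute targets so that they stay valid regardless of the directory the script is run from.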

For icdar2017:
- To avoid unwanted rotation when jpg images are loaded with OpenCV, we provide the images re-saved in png format in renamed_images. You can copy these images to imgs.

Text Recognition

The text recognition dataset directory is organized as follows.

├── mixture
│   ├── coco_text
│   │   ├── train_label.txt
│   │   ├── train_words
│   ├── icdar_2011
│   │   ├── training_label.txt
│   │   ├── Challenge1_Training_Task3_Images_GT
│   ├── icdar_2013
│   │   ├── train_label.txt
│   │   ├── test_label_1015.txt
│   │   ├── test_label_1095.txt
│   │   ├── Challenge2_Training_Task3_Images_GT
│   │   ├── Challenge2_Test_Task3_Images
│   ├── icdar_2015
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── ch4_training_word_images_gt
│   │   ├── ch4_test_word_images_gt
│   ├── IIIT5K
│   │   ├── train_label.txt
│   │   ├── test_label.txt
│   │   ├── train
│   │   ├── test
│   ├── ct80
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svt
│   │   ├── test_label.txt
│   │   ├── image
│   ├── svtp
│   │   ├── test_label.txt
│   │   ├── image
│   ├── Syn90k
│   │   ├── shuffle_labels.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── mnt
│   ├── SynthText
│   │   ├── shuffle_labels.txt
│   │   ├── instances_train.txt
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── synthtext
│   ├── SynthAdd
│   │   ├── label.txt
│   │   ├── label.lmdb
│   │   ├── SynthText_Add

| Dataset | Images | Annotation file (training) | Annotation file (test) |
| :-- | :-- | :-- | :-- |
| coco_text | homepage | train_label.txt | - |
| icdar_2011 | homepage | train_label.txt | - |
| icdar_2013 | homepage | train_label.txt | test_label_1015.txt |
| icdar_2015 | homepage | train_label.txt | test_label.txt |
| IIIT5K | homepage | train_label.txt | test_label.txt |
| ct80 | - | - | test_label.txt |
| svt | homepage | - | test_label.txt |
| svtp | - | - | test_label.txt |
| Syn90k | homepage | shuffle_labels.txt \| label.txt | - |
| SynthText | homepage | shuffle_labels.txt \| instances_train.txt \| label.txt | - |
| SynthAdd | SynthText_Add.zip (code:627x) | label.txt | - |

For icdar_2013:
- Step1: Download Challenge2_Test_Task3_Images.zip and Challenge2_Training_Task3_Images_GT.zip from homepage
- Step2: Download test_label_1015.txt and train_label.txt
For icdar_2015:
- Step1: Download ch4_training_word_images_gt.zip and ch4_test_word_images_gt.zip from homepage
- Step2: Download train_label.txt and test_label.txt
For IIIT5K:
- Step1: Download IIIT5K-Word_V3.0.tar.gz from homepage
- Step2: Download train_label.txt and test_label.txt
For svt:
- Step1: Download svt.zip from homepage
- Step2: Download test_label.txt
- Step3:
```
python tools/data/textrecog/svt_converter.py <download_svt_dir_path>
```
For ct80:
- Step1: Download test_label.txt
For svtp:
- Step1: Download test_label.txt
For coco_text:
- Step1: Download from homepage
- Step2: Download train_label.txt
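
After a label file and its images are in place, a quick consistency check can catch misplaced folders. The sketch below assumes that each non-empty line of a label file has the form `<relative image path> <text>`; this page does not state the format, so verify it against your files first. `missing_images` is a hypothetical helper, not an MMOCR tool.

```python
from pathlib import Path


def missing_images(label_file, img_root):
    """Return image paths listed in `label_file` that are absent under `img_root`.

    Assumption: each non-empty line is "<relative image path> <text>".
    """
    img_root = Path(img_root)
    missing = []
    with open(label_file, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # The first whitespace-separated token is taken as the image path.
            rel_path = line.split(maxsplit=1)[0]
            if not (img_root / rel_path).is_file():
                missing.append(rel_path)
    return missing
```

An empty result means every image named in the label file was found under `img_root`.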

For Syn90k:

Step1: Download mjsynth.tar.gz from homepage
Step2: Download shuffle_labels.txt
Step3:

mkdir Syn90k && cd Syn90k

mv /path/to/mjsynth.tar.gz .

tar -xzf mjsynth.tar.gz

mv /path/to/shuffle_labels.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/Syn90k Syn90k

For SynthText:

Step1: Download SynthText.zip from homepage
Step2: Download shuffle_labels.txt
Step3: Download instances_train.txt
Step4:

unzip SynthText.zip

cd SynthText

mv /path/to/shuffle_labels.txt .

mv /path/to/instances_train.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthText SynthText

For SynthAdd:

Step1: Download SynthText_Add.zip from SynthAdd (code:627x)
Step2: Download label.txt
Step3:

mkdir SynthAdd && cd SynthAdd

mv /path/to/SynthText_Add.zip .

unzip SynthText_Add.zip

mv /path/to/label.txt .

# create soft link
cd /path/to/mmocr/data/mixture

ln -s /path/to/SynthAdd SynthAdd
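
The "create soft link" steps for Syn90k, SynthText and SynthAdd follow the same pattern. A minimal sketch of that pattern in Python, written so that it can be re-run safely, is given below; `link_dataset` and its `mixture_dir` argument are hypothetical names and not part of MMOCR.

```python
import os
from pathlib import Path


def link_dataset(dataset_dir, mixture_dir):
    """Create mixture_dir/<name> as a symlink to dataset_dir, like `ln -s`.

    An existing link that already points at the same target is left alone;
    any other existing entry raises FileExistsError.
    """
    target = Path(dataset_dir).resolve()
    if not target.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {target}")

    mixture = Path(mixture_dir)
    mixture.mkdir(parents=True, exist_ok=True)
    link = mixture / target.name

    if link.is_symlink() and link.resolve() == target:
        return link  # already linked, nothing to do
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} exists and is not a link to {target}")

    os.symlink(target, link, target_is_directory=True)
    return link
```

For example, `link_dataset("/path/to/Syn90k", "/path/to/mmocr/data/mixture")` corresponds to the Syn90k commands above.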

Note: To convert a label file from txt format to lmdb format, run:

python tools/data/utils/txt2lmdb.py -i <txt_label_path> -o <lmdb_label_path>

For example,

python tools/data/utils/txt2lmdb.py -i data/mixture/Syn90k/label.txt -o data/mixture/Syn90k/label.lmdb
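
When several label files need converting, the command above can be generated once per dataset. The sketch below only builds the command lines, using the `-i` and `-o` options shown above, and leaves running them to the caller. The function name `txt2lmdb_commands` is hypothetical, and the dataset list is taken from the directory tree on this page.

```python
import sys
from pathlib import Path

# Datasets whose directory tree on this page shows both label.txt and label.lmdb.
DATASETS = ["Syn90k", "SynthText", "SynthAdd"]


def txt2lmdb_commands(mixture_dir="data/mixture", datasets=DATASETS):
    """Build one txt2lmdb.py command per dataset, without running it."""
    commands = []
    for name in datasets:
        txt = Path(mixture_dir) / name / "label.txt"
        commands.append([
            sys.executable, "tools/data/utils/txt2lmdb.py",
            "-i", txt.as_posix(),
            "-o", txt.with_suffix(".lmdb").as_posix(),
        ])
    return commands


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in txt2lmdb_commands():
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the MMOCR repository root, so that the relative script path resolves.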

Key Information Extraction

The key information extraction dataset directory is organized as follows.

└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── test.txt
  └── train.txt

Download wildreceipt.tar

Named Entity Recognition

CLUENER2020

The named entity recognition dataset directory is organized as follows.

└── cluener2020
  ├── cluener_predict.json
  ├── dev.json
  ├── README.md
  ├── test.json
  ├── train.json
  └── vocab.txt

Download cluener_public.zip
Download vocab.txt and move vocab.txt to cluener2020
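
The two steps above can be sketched in Python as follows. This is an illustration only: `prepare_cluener` is a hypothetical helper, and it assumes that the files inside cluener_public.zip sit at the top level of the archive, which this page does not confirm.

```python
import shutil
import zipfile
from pathlib import Path


def prepare_cluener(zip_path, vocab_path, dest="cluener2020"):
    """Extract the archive into `dest` and move vocab.txt next to the JSON files."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # Assumption: the archive has no enclosing folder.
        zf.extractall(dest)
    shutil.move(str(vocab_path), str(dest / "vocab.txt"))
    # Return the resulting file names so they can be compared with the tree above.
    return sorted(p.name for p in dest.iterdir())
```

If the archive does contain an enclosing folder, move its contents up one level so that the result matches the directory tree above.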