Dataset Preparer (Beta)¶

Note

Dataset Preparer is still in beta version and might not be stable enough. You are welcome to try it out and report any issues to us.

One-click data preparation script¶

MMOCR provides a unified one-stop data preparation script prepare_dataset.py.

Only one line of command is needed to complete the data download, decompression, and format conversion.

python tools/dataset_converters/prepare_dataset.py [$DATASET_NAME] --task [$TASK] --nproc [$NPROC]

ARGS	Type	Description
dataset_name	str	(required) dataset name.
--nproc	int	Number of processes to be used. Defaults to 4.
--task	str	Convert the dataset to the format of a specified task supported by MMOCR. options are: 'textdet', 'textrecog', 'textspotting', and 'kie'.
--splits	str	Splits of the dataset to be prepared. Multiple splits can be accepted. Defaults to `train val test`.
--lmdb	str	Store the data in LMDB format. Only valid when the task is `textrecog`.
--overwrite-cfg	str	Whether to overwrite the dataset config file if it already exists in `configs/{task}/_base_/datasets`.
--dataset-zoo-path	str	Path to the dataset config file. If not specified, the default path is `./dataset_zoo`.

For example, the following command shows how to use the script to prepare the ICDAR2015 dataset for text detection task.

python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet

Also, the script supports preparing multiple datasets at the same time. For example, the following command shows how to prepare the ICDAR2015 and TotalText datasets for text recognition task.

python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog

To check the supported datasets of Dataset Preparer, please refer to Dataset Zoo. Some of other datasets that need to be prepared manually are listed in Text Detection and Text Recognition.

For users in China, more datasets can be downloaded from the opensource dataset platform: OpenDataLab. After downloading the data, you can place the files listed in data_obtainer.save_name in data/cache and rerun the script.

Advanced Usage¶

LMDB Format¶

In text recognition tasks, we usually use LMDB format to store data to speed up data loading. When using the prepare_dataset.py script to prepare data, you can store data to the LMDB format by the --lmdb parameter. For example:

python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb

As soon as the dataset is prepared, Dataset Preparer will generate icdar2015_lmdb.py in the configs/textrecog/_base_/datasets/ directory. You can inherit this file and point the dataloader to the LMDB dataset. Moreover, the LMDB dataset needs to be loaded by LoadImageFromNDArray, thus you also need to modify pipeline.

For example, if we want to change the training set of configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py to icdar2015 generated before, we need to perform the following modifications:

Modify configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py:

_base_ = [
     '../_base_/datasets/icdar2015_lmdb.py',  # point to icdar2015 lmdb dataset
      ...
 ]

 train_list = [_base_.icdar2015_lmdb_textrecog_train]
 ...

Modify train_pipeline in configs/textrecog/crnn/_base_crnn_mini-vgg.py, change LoadImageFromFile to LoadImageFromNDArray:

train_pipeline = [
 dict(
     type='LoadImageFromNDArray',
     color_type='grayscale',
     file_client_args=file_client_args,
     ignore_empty=True,
     min_size=2),
 ...
]

Configuration of Dataset Preparer¶

Dataset preparer uses a modular design to enhance extensibility, which allows users to extend it to other public or private datasets easily. The configuration files of the dataset preparers are stored in the dataset_zoo/, where all the configs of currently supported datasets can be found here. The directory structure is as follows:

dataset_zoo/
├── icdar2015
│   ├── metafile.yml
│   ├── textdet.py
│   ├── textrecog.py
│   └── textspotting.py
└── wildreceipt
    ├── metafile.yml
    ├── kie.py
    ├── textdet.py
    ├── textrecog.py
    └── textspotting.py

metafile.yml is the metafile of the dataset, which contains the basic information of the dataset, including the year of publication, the author of the paper, and other information such as license. The other files named by the task are the configuration files of the dataset preparer, which are used to configure the download, decompression, format conversion, etc. of the dataset. These configs are in Python format, and their usage is completely consistent with the configuration files in MMOCR repo. See Configuration File Documentation for detailed usage.

Metafile¶

Take the ICDAR2015 dataset as an example, the metafile.yml stores the basic information of the dataset:

Name: 'Incidental Scene Text IC15'
Paper:
  Title: ICDAR 2015 Competition on Robust Reading
  URL: https://rrc.cvc.uab.es/files/short_rrc_2015.pdf
  Venue: ICDAR
  Year: '2015'
  BibTeX: '@inproceedings{karatzas2015icdar,
  title={ICDAR 2015 competition on robust reading},
  author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others},
  booktitle={2015 13th international conference on document analysis and recognition (ICDAR)},
  pages={1156--1160},
  year={2015},
  organization={IEEE}}'
Data:
  Website: https://rrc.cvc.uab.es/?ch=4
  Language:
    - English
  Scene:
    - Natural Scene
  Granularity:
    - Word
  Tasks:
    - textdet
    - textrecog
    - textspotting
  License:
    Type: CC BY 4.0
    Link: https://creativecommons.org/licenses/by/4.0/

It is not mandatory to use the metafile in the dataset preparation process (so users can ignore this file when preparing private datasets), but in order to better understand the information of each public dataset, we recommend that users read the metafile before preparing the dataset, which will help to understand whether the datasets meet their needs.

Warning

The following section is outdated as of MMOCR 1.0.0rc6.

Config of Dataset Preparer¶

Next, we will introduce the conventional fields and usage of the dataset preparer configuration files.

In the configuration files, there are two fields data_root and cache_path, which are used to store the converted dataset and the temporary files such as the archived files downloaded during the data preparation process.

data_root = './data/icdar2015'
cache_path = './data/cache'

Data preparation usually contains two steps: “raw data preparation” and “format conversion and saving”. Therefore, we use the data_obtainer and data_converter to configure the behavior of these two steps. In some cases, users can also ignore data_converter to only download and decompress the raw data, without performing format conversion and saving. Or, for the local stored dataset, use ignore data_obtainer to only perform format conversion and saving.

Take the text detection task of the ICDAR2015 dataset (dataset_zoo/icdar2015/textdet.py) as an example:

data_obtainer = dict(
    type='NaiveDataObtainer',
    cache_path=cache_path,
    data_root=data_root,
    files=[
        dict(
            url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
            save_name='ic15_textdet_train_img.zip',
            md5='c51cbace155dcc4d98c8dd19d378f30d',
            split=['train'],
            content=['image'],
            mapping=[['ic15_textdet_train_img', 'imgs/train']]),
        dict(
            url='https://rrc.cvc.uab.es/downloads/ch4_test_images.zip',
            save_name='ic15_textdet_test_img.zip',
            md5='97e4c1ddcf074ffcc75feff2b63c35dd',
            split=['test'],
            content=['image'],
            mapping=[['ic15_textdet_test_img', 'imgs/test']]),
    ])

The default type of data_obtainer is NaiveDataObtainer, which mainly downloads and decompresses the original files to the specified directory. Here, we configure the URL, save name, MD5 value, etc. of the original dataset files through the files parameter. The mapping parameter is used to specify the path where the data is decompressed or moved. In addition, the two optional parameters split and content respectively indicate the content type stored in the compressed file and the corresponding dataset.

data_converter = dict(
    type='TextDetDataConverter',
    splits=['train', 'test'],
    data_root=data_root,
    gatherer=dict(
        type='pair_gather',
        suffixes=['.jpg', '.JPG'],
        rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
    parser=dict(type='ICDARTxtTextDetAnnParser'),
    dumper=dict(type='JsonDumper'),
    delete=['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img'])

Warning

This section is outdated and not yet synchronized with its Chinese version, please switch the language for the latest information.

data_converter is responsible for loading and converting the original to the format supported by MMOCR. We provide a number of built-in data converters for different tasks, such as TextDetDataConverter, TextRecogDataConverter, TextSpottingDataConverter, and WildReceiptConverter (Since we only support WildReceipt dataset for KIE task at present, we only provide this converter for now).

Take the text detection task as an example, TextDetDataConverter mainly completes the following work:

Collect and match the images and original annotation files, such as the image img_1.jpg and the annotation gt_img_1.txt
Load and parse the original annotations to obtain necessary information such as the bounding box and text
Convert the parsed data to the format supported by MMOCR
Dump the converted data to the specified path and format

The above steps can be configured separately through gatherer, parser, dumper.

Specifically, the gatherer is used to collect and match the images and annotations in the original dataset. Typically, there are two relations between images and annotations, one is many-to-many, the other is many-to-one.

many-to-many
├── img_1.jpg
├── gt_img_1.txt
├── img_2.jpg
├── gt_img_2.txt
├── img_3.JPG
├── gt_img_3.txt

one-to-many
├── img_1.jpg
├── img_2.jpg
├── img_3.JPG
├── gt.txt

Therefore, we provide two built-in gatherers, pair_gather and mono_gather, to handle the two cases. pair_gather is used for the case of many-to-many, and mono_gather is used for the case of one-to-many. pair_gather needs to specify the suffixes parameter to indicate the suffix of the image, such as suffixes=[.jpg,.JPG] in the above example. In addition, we need to specify the corresponding relationship between the image and the annotation file through the regular expression, such as rule=[r'img_(\d+)\.([jJ][pP][gG])'，r'gt_img_\1.txt'] in the above example. Where \d+ is used to match the serial number of the image, ([jJ][pP][gG]) is used to match the suffix of the image, and \_1 matches the serial number of the image and the serial number of the annotation file.

When the image and annotation file are matched, the original annotations will be parsed. Since the annotation format is usually varied from dataset to dataset, the parsers are usually dataset related. Then, the parser will pack the required data into the MMOCR format.

Finally, we can specify the dumpers to decide the data format. Currently, we support JsonDumper, WildreceiptOpensetDumper, and TextRecogLMDBDumper. They are used to save the data in the standard MMOCR Json format, Wildreceipt format, and the LMDB format commonly used in academia in the field of text recognition, respectively.

Use DataPreparer to prepare customized dataset¶

[Coming Soon]