
Welcome to MMOCR’s documentation!

You can switch between English and Chinese in the lower-left corner of the layout.

Overview

MMOCR is an open source toolkit based on PyTorch and MMDetection, supporting numerous OCR-related models, including text detection, text recognition, and key information extraction. In addition, it supports widely-used academic datasets and provides many useful tools, assisting users in exploring various aspects of models and datasets and implementing high-quality algorithms. Generally, it has the following features.

  • One-stop, Multi-model: MMOCR supports various OCR-related tasks and implements the latest models for text detection, recognition, and key information extraction.

  • Modular Design: MMOCR’s modular design allows users to define and reuse modules in the model on demand.

  • Various Useful Tools: MMOCR provides a number of analysis tools, including visualizers, validation scripts, evaluators, etc., to help users troubleshoot, finetune or compare models.

  • Powered by OpenMMLab: Like other algorithm libraries in OpenMMLab family, MMOCR follows OpenMMLab’s rigorous development guidelines and interface conventions, significantly reducing the learning cost of users familiar with other projects in OpenMMLab family. In addition, benefiting from the unified interfaces among OpenMMLab, you can easily call the models implemented in other OpenMMLab projects (e.g. MMDetection) in MMOCR, facilitating cross-domain research and real-world applications.

Together with the release of OpenMMLab 2.0, MMOCR now also comes to its 1.0.0 version, which has made significant BC-breaking changes, resulting in less code redundancy, higher code efficiency and an overall more systematic and consistent design.

Considering that there are some backward incompatible changes in this version compared to 0.x, we have prepared a detailed migration guide. It lists all the changes made in the new version and the steps required to migrate. We hope this guide can help users familiar with the old framework to complete the upgrade as quickly as possible. Though this may take some time, we believe that the new features brought by MMOCR and the OpenMMLab ecosystem will make it all worthwhile. 😊

Next, please read the section according to your actual needs.

  • We recommend that beginners go through Quick Run to get familiar with MMOCR and master the usage of MMOCR by reading the examples in User Guides.

  • Intermediate and advanced developers are suggested to learn the background, conventions, and recommended implementations of each component from Basic Concepts.

  • Read our FAQ to find answers to frequently asked questions.

  • If you can’t find the answers you need in the documentation, feel free to raise an issue.

  • Everyone is welcome to be a contributor! Read the contribution guide to learn how to contribute to MMOCR!

Installation

Prerequisites

  • Linux | Windows | macOS

  • Python 3.7

  • PyTorch 1.6 or higher

  • torchvision 0.7.0

  • CUDA 10.1

  • NCCL 2

  • GCC 5.4.0 or higher

Environment Setup

Note

If you are experienced with PyTorch and have already installed it, just skip this part and jump to the next section. Otherwise, you can follow these steps for the preparation.

Step 0. Download and install Miniconda from the official website.

Step 1. Create a conda environment and activate it.

conda create --name openmmlab python=3.8 -y
conda activate openmmlab

Step 2. Install PyTorch following official instructions, e.g.

conda install pytorch torchvision -c pytorch

Installation Steps

We recommend that users follow our best practices to install MMOCR. However, the whole process is highly customizable. See Customize Installation section for more information.

Best Practices

Step 0. Install MMEngine, MMCV and MMDetection using MIM.

pip install -U openmim
mim install mmengine
mim install mmcv
mim install mmdet

Step 1. Install MMOCR.

If you wish to run and develop MMOCR directly, install it from source (recommended).

If you use MMOCR as a dependency or third-party package, install it via MIM.

git clone https://github.com/open-mmlab/mmocr.git
cd mmocr
pip install -v -e .
# "-v" increases pip's verbosity.
# "-e" means installing the project in editable mode,
# That is, any local modifications on the code will take effect immediately.

Step 2. (Optional) If you wish to use any transform involving albumentations (For example, Albu in ABINet’s pipeline), or any dependency for building documentation or running unit tests, please install the dependency using the following command:

# install albu
pip install -r requirements/albu.txt
# install the dependencies for building documentation and running unit tests
pip install -r requirements.txt

Note

We recommend checking the environment after installing albumentations to ensure that opencv-python and opencv-python-headless are not installed together, otherwise it might cause unexpected issues. If that’s unfortunately the case, please uninstall opencv-python-headless to make sure MMOCR’s visualization utilities can work.

Refer to albumentations’s official documentation for more details.
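If you want to check this programmatically, the following minimal sketch (assuming Python 3.8 or later) lists the installed OpenCV packages and warns when both variants are present:

# A minimal environment check (a sketch) for the opencv-python /
# opencv-python-headless conflict described above.
from importlib.metadata import distributions

installed = {(dist.metadata["Name"] or "").lower() for dist in distributions()}
conflict = {"opencv-python", "opencv-python-headless"}
if conflict <= installed:
    print("Both OpenCV variants are installed; "
          "consider `pip uninstall opencv-python-headless`.")
else:
    print("Installed OpenCV packages:", installed & conflict)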

Verify the installation

You may verify the installation via this inference demo.

Run the following code in a Python interpreter:

>>> from mmocr.apis import MMOCRInferencer
>>> ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
>>> ocr('demo/demo_text_ocr.jpg', show=True, print_result=True)

You should be able to see a pop-up image and the inference result printed out in the console upon successful verification.


# Inference result
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}

Note

If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, you may not see the pop-up window.

Customize Installation

CUDA versions

When installing PyTorch, you need to specify the version of CUDA. If you are not clear on which to choose, follow our recommendations:

  • For Ampere-based NVIDIA GPUs, such as GeForce 30 series and NVIDIA A100, CUDA 11 is a must.

  • For older NVIDIA GPUs, CUDA 11 is backward compatible, but CUDA 10.2 offers better compatibility and is more lightweight.

Please make sure the GPU driver satisfies the minimum version requirements. See this table for more information.

Note

Installing CUDA runtime libraries is enough if you follow our best practices, because no CUDA code will be compiled locally. However if you hope to compile MMCV from source or develop other CUDA operators, you need to install the complete CUDA toolkit from NVIDIA’s website, and its version should match the CUDA version of PyTorch. i.e., the specified version of cudatoolkit in conda install command.

Install MMCV without MIM

MMCV contains C++ and CUDA extensions, thus depending on PyTorch in a complex way. MIM solves such dependencies automatically and makes the installation easier. However, it is not a must.

To install MMCV with pip instead of MIM, please follow MMCV installation guides. This requires manually specifying a find-url based on PyTorch version and its CUDA version.

For example, the following command installs mmcv built for PyTorch 1.10.x and CUDA 11.3.

pip install 'mmcv>=2.0.0rc1' -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html

Install on CPU-only platforms

MMOCR can be built for a CPU-only environment. In CPU mode you can train (requires MMCV >= 1.4.4), test, or run inference with a model.

However, some functionalities are missing in this mode:

  • Deformable Convolution

  • Modulated Deformable Convolution

  • ROI pooling

  • SyncBatchNorm

If you try to train, test, or run inference with a model containing the above ops, an error will be raised. The following table lists the affected algorithms.

Operator | Model
Deformable Convolution/Modulated Deformable Convolution | DBNet (r50dcnv2), DBNet++ (r50dcnv2), FCENet (r50dcnv2)
SyncBatchNorm | PANet, PSENet

Using MMOCR with Docker

We provide a Dockerfile to build an image.

# build an image with PyTorch 1.6, CUDA 10.1
docker build -t mmocr docker/

Run it with

docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmocr/data mmocr

Dependency on MMEngine, MMCV & MMDetection

MMOCR has different version requirements on MMEngine, MMCV and MMDetection at each release to guarantee the implementation correctness. Please refer to the table below and ensure the package versions fit the requirement.

MMOCR | MMEngine | MMCV | MMDetection
dev-1.x | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0
1.0.1 | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0
1.0.0 | 0.7.1 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0
1.0.0rc6 | 0.6.0 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0
1.0.0rc[4-5] | 0.1.0 <= mmengine < 1.0.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0
1.0.0rc[0-3] | 0.0.0 <= mmengine < 0.2.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0
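If you are not sure which versions are currently installed, a quick sketch like the following prints them for comparison against the table above:

# Print the installed versions for comparison with the compatibility table.
import mmengine
import mmcv
import mmdet

print('mmengine:', mmengine.__version__)  # e.g. expect 0.7.1 <= version < 1.1.0 for MMOCR 1.0.1
print('mmcv:', mmcv.__version__)          # e.g. expect 2.0.0rc4 <= version < 2.1.0
print('mmdet:', mmdet.__version__)        # e.g. expect 3.0.0rc5 <= version < 3.2.0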

Quick Run

This chapter will take you through the basic functions of MMOCR, and we assume you have installed MMOCR from source. You may also check out the tutorial notebook for how to perform inference, training and testing interactively.

Inference

Run the following in MMOCR’s root directory:

python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec CRNN --show --print-result

You should be able to see a pop-up image and the inference result printed out in the console.


# Inference result
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}

Note

If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, you may not see the pop-up window.

A detailed description of MMOCR’s inference interface can be found here.

In addition to using our well-provided pre-trained models, you can also train models on your own datasets. In the next section, we will take you through the basic functions of MMOCR by training DBNet on the mini ICDAR 2015 dataset as an example.

Prepare a Dataset

Since the variety of OCR dataset formats is not conducive to either switching between or jointly training on multiple datasets, MMOCR proposes a uniform data format, and provides a dataset preparer for commonly used OCR datasets. Usually, to use those datasets in MMOCR, you just need to follow the steps to get them ready for use.

Note

For this demo, though, we want to keep things fast, so we skip the full dataset preparation and use a small pre-packaged subset instead.

Here, we have prepared a lite version of the ICDAR 2015 dataset for demonstration purposes. Download our pre-prepared archive and extract it to the data/ directory under mmocr to get our prepared images and annotation file.

wget https://download.openmmlab.com/mmocr/data/icdar2015/mini_icdar2015.tar.gz
mkdir -p data/
tar xzvf mini_icdar2015.tar.gz -C data/

Modify the Config

Once the dataset is prepared, we will then specify the location of the training set and the training parameters by modifying the config file.

In this example, we will train a DBNet using resnet18 as its backbone. Since MMOCR already has a config file for the full ICDAR 2015 dataset (configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py), we just need to make some modifications on top of it.

We first need to modify the path to the dataset. In this config, most of the key config files are imported through _base_, such as the dataset configuration from configs/textdet/_base_/datasets/icdar2015.py. Open that file and replace the path pointed to by icdar2015_textdet_data_root in the first line with:

icdar2015_textdet_data_root = 'data/mini_icdar2015'

Also, because of the reduced dataset size, we have to reduce the number of training epochs to 400 accordingly, shorten the validation interval as well as the checkpoint saving interval to 10 epochs, and drop the learning rate decay schedule. The following lines of configuration can be directly put into configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py to take effect.

# Save checkpoints every 10 epochs, and only keep the latest checkpoint
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=10,
        max_keep_ckpts=1,
    ))
# Set the maximum number of epochs to 400, and validate the model every 10 epochs
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400, val_interval=10)
# Fix learning rate as a constant
param_scheduler = [
    dict(type='ConstantLR', factor=1.0),
]

Here, we have rewritten the corresponding parameters in the base configuration directly through the inheritance (MMEngine: Config) mechanism of the config. The original fields are distributed in configs/textdet/_base_/schedules/schedule_sgd_1200e.py and configs/textdet/_base_/default_runtime.py.
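To double-check that these overrides actually took effect, you can load and inspect the merged config; a minimal sketch:

# Load the merged config and verify the overridden fields.
from mmengine import Config

cfg = Config.fromfile('configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py')
print(cfg.train_cfg)                 # expect max_epochs=400, val_interval=10
print(cfg.default_hooks.checkpoint)  # expect interval=10, max_keep_ckpts=1
print(cfg.param_scheduler)           # expect a single ConstantLR entry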

Note

For a more detailed description of config, please refer to here.

Browse the Dataset

Before we start the training, we can also visualize the image processed by training-time data transforms. It’s quite simple: pass the config file we need to visualize into the browse_dataset.py script.

python tools/analysis_tools/browse_dataset.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py

The transformed images and annotations will be displayed one by one in a pop-up window.

Note

For details on the parameters and usage of this script, please refer to here.

Tip

In addition to satisfying our curiosity, visualization can also help us check the parts that may affect the model’s performance before training, such as problems in configs, datasets and data transforms.

Training

Start the training by running the following command:

python tools/train.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py

Depending on the system environment, MMOCR will automatically use the best device for training. If a GPU is available, a single GPU training will be started by default. When you start to see the output of the losses, you have successfully started the training.

2022/08/22 18:42:22 - mmengine - INFO - Epoch(train) [1][5/7]  lr: 7.0000e-03  memory: 7730  data_time: 0.4496  loss_prob: 14.6061  loss_thr: 2.2904  loss_db: 0.9879  loss: 17.8843  time: 1.8666
2022/08/22 18:42:24 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:28 - mmengine - INFO - Epoch(train) [2][5/7]  lr: 7.0000e-03  memory: 6695  data_time: 0.2052  loss_prob: 6.7840  loss_thr: 1.4114  loss_db: 0.9855  loss: 9.1809  time: 0.7506
2022/08/22 18:42:29 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:33 - mmengine - INFO - Epoch(train) [3][5/7]  lr: 7.0000e-03  memory: 6690  data_time: 0.2101  loss_prob: 3.0700  loss_thr: 1.1800  loss_db: 0.9967  loss: 5.2468  time: 0.6244
2022/08/22 18:42:33 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015

Without extra configurations, model weights will be saved to work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/, while the logs will be stored in work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/TIMESTAMP/. Next, we just need to wait with some patience for training to finish.

Note

For advanced usage of training, such as CPU training, multi-GPU training, and cluster training, please refer to Training and Testing.

Testing

After 400 epochs, we observe that DBNet performs best in the last epoch, with hmean reaching 0.6086 (you may see a different result):

08/22 19:24:52 - mmengine - INFO - Epoch(val) [400][100/100]  icdar/precision: 0.7285  icdar/recall: 0.5226  icdar/hmean: 0.6086

Note

It may not have been trained to be optimal, but it is sufficient for a demo.

However, this value only reflects the performance of DBNet on the mini ICDAR 2015 dataset. For a comprehensive evaluation, we also need to see how it performs on out-of-distribution datasets. For example, tests/data/det_toy_dataset is a very small real dataset that we can use to verify the actual performance of DBNet.

Before testing, we also need to make some changes to the location of the dataset. Open configs/textdet/_base_/datasets/icdar2015.py and change data_root of icdar2015_textdet_test to tests/data/det_toy_dataset:

# ...
icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root='tests/data/det_toy_dataset',
    #  ...
    )

Start testing:

python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth

You should get outputs like:

08/21 21:45:59 - mmengine - INFO - Epoch(test) [5/10]    memory: 8562
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10]    eta: 0:00:00  time: 0.4893  data_time: 0.0191  memory: 283
08/21 21:45:59 - mmengine - INFO - Evaluating hmean-iou...
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.30, recall: 0.6190, precision: 0.4815, hmean: 0.5417
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.40, recall: 0.6190, precision: 0.5909, hmean: 0.6047
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.50, recall: 0.6190, precision: 0.6842, hmean: 0.6500
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.60, recall: 0.6190, precision: 0.7222, hmean: 0.6667
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.70, recall: 0.3810, precision: 0.8889, hmean: 0.5333
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.80, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.90, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10]  icdar/precision: 0.7222  icdar/recall: 0.6190  icdar/hmean: 0.6667

The model achieves an hmean of 0.6667 on this dataset.

Note

For advanced usage of testing, such as CPU testing, multi-GPU testing, and cluster testing, please refer to Training and Testing.

Visualize the Outputs

We can also visualize the prediction output with test.py. You can open a pop-up visualization window with the --show parameter, and you can also specify the directory to which the prediction result images are exported with the --show-dir parameter.

python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/

The true labels and predicted values are displayed in a tiled fashion in the visualization results. The green boxes in the left panel indicate the true labels and the red boxes in the right panel indicate the predicted values.


Note

For a description of more visualization features, see here.

FAQ

General

Q1 I’m getting a warning like unexpected key in source state_dict: fc.weight, fc.bias. Is there something wrong?

A It’s not an error. It occurs because the backbone network is pretrained on image classification tasks, where the last fc layer is required to generate the classification output. However, the fc layer is no longer needed when the backbone network is used to extract features in downstream tasks, and therefore these weights can be safely skipped when loading the checkpoint.

Q2 MMOCR terminates with an error: shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry. How could I fix it?

A This error occurs because of some invalid polygons (e.g., polygons with self-intersections) existing in the dataset or generated by some non-rigorous data transforms. These polygons can be fixed by adding the FixInvalidPolygon transform after the transform that is likely to introduce invalid polygons. For example, a common practice is to append it after LoadOCRAnnotations in both the train and test pipelines. The resulting pipeline should look like:

train_pipeline = [
    ...
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    dict(type='FixInvalidPolygon', min_poly_points=4),
    ...
]

In practice, we find that Totaltext contains some invalid polygons and using FixInvalidPolygon is a must. Here is an example config.

Q3 Getting libpng warning: iCCP: known incorrect sRGB profile when loading images with cv2 backend.

A This is a warning from libpng and it is safe to ignore. It is caused by the ICC profile embedded in the image. You can use the pillow backend to avoid it:

train_pipeline = [
    dict(
        type='LoadImageFromFile',
        imdecode_backend='pillow'),
    ...
]

Text Recognition

Q1 What are the steps to train text recognition models with my own dictionary?

A In MMOCR 1.0, you only need to modify the config and point Dictionary to your custom dict file. For example, if you want to train the SAR model (https://github.com/open-mmlab/mmocr/blob/75c06d34bbc01d3d11dfd7afc098b6cdeee82579/configs/textrecog/sar/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real.py) with your own dictionary placed at /my/dict.txt, you can modify the dictionary.dict_file field in the base config to:

dictionary = dict(
    type='Dictionary',
    dict_file='/my/dict.txt',
    with_start=True,
    with_end=True,
    same_start_end=True,
    with_padding=True,
    with_unknown=True)

Now you are good to go. You can also find more information in Dictionary API.

Q2 How to properly visualize non-English characters?

A You can customize font_families or font_properties in visualizer. For example, to visualize Korean:

configs/textrecog/_base_/default_runtime.py:

visualizer = dict(
    type='TextRecogLocalVisualizer',
    name='visualizer',
    font_families='NanumGothic', # new feature
    vis_backends=vis_backends)

It’s also fine to pass the font path to visualizer:

visualizer = dict(
    type='TextRecogLocalVisualizer',
    name='visualizer',
    font_properties='path/to/font_file',
    vis_backends=vis_backends)

Inference

In OpenMMLab, all the inference operations are unified into a new interface - Inferencer. Inferencer is designed to expose a neat and simple API to users, and shares a very similar interface across different OpenMMLab libraries.

In MMOCR, Inferencers are constructed in different levels of task abstraction.

  • Standard Inferencer: Following OpenMMLab’s convention, each fundamental task in MMOCR has a standard Inferencer, namely TextDetInferencer (text detection), TextRecInferencer (text recognition), TextSpottingInferencer (end-to-end OCR), and KIEInferencer (key information extraction). They are designed to perform inference on a single task, and can be chained together to perform inference on a series of tasks. They also share a very similar interface, have a standard input/output protocol, and overall follow the OpenMMLab design.

  • MMOCRInferencer: We also provide MMOCRInferencer, a convenient inference interface designed only for MMOCR. It encapsulates and chains all the Inferencers in MMOCR, so users can use this Inferencer to perform a series of tasks on an image and directly get the final result in an end-to-end manner. However, it has a relatively different interface from the other standard Inferencers, and some of the standard Inferencer functionalities might be sacrificed for the sake of simplicity.

For new users, we recommend using MMOCRInferencer to test out different combinations of models.

If you are a developer and wish to integrate the models into your own project, we recommend using standard Inferencers, as they are more flexible and standardized, equipped with full functionalities.
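For instance, a detection-only workflow with a standard Inferencer might look like the sketch below. The model keyword is assumed to accept the same model names as MMOCRInferencer, and the exact prediction field names may differ between tasks.

>>> from mmocr.apis import TextDetInferencer
>>> # Load a text detection model only (a sketch)
>>> det_inferencer = TextDetInferencer(model='DBNet')
>>> result = det_inferencer('demo/demo_text_ocr.jpg')
>>> print(result.keys())  # typically 'predictions' and 'visualization'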

Basic Usage

As of now, MMOCRInferencer can perform inference on the following tasks:

  • Text detection

  • Text recognition

  • OCR (text detection + text recognition)

  • Key information extraction (text detection + text recognition + key information extraction)

  • OCR (text spotting) (coming soon)

For convenience, MMOCRInferencer provides both Python and command line interfaces. For example, if you want to perform OCR inference on demo/demo_text_ocr.jpg with DBNet as the text detection model and SAR as the text recognition model, you can simply run the following code:

>>> from mmocr.apis import MMOCRInferencer
>>> # Load models into memory
>>> ocr = MMOCRInferencer(det='DBNet', rec='SAR')
>>> # Perform inference
>>> ocr('demo/demo_text_ocr.jpg', show=True)

The resulting OCR output will be displayed in a new window:

Note

If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, the show option will not work. However, you can still save visualizations to files by setting out_dir and save_vis=True arguments. Read Dumping Results for details.

Depending on the initialization arguments, MMOCRInferencer can run in different modes. For example, it can run in KIE mode if it is initialized with det, rec and kie specified.

>>> kie = MMOCRInferencer(det='DBNet', rec='SAR', kie='SDMGR')
>>> kie('demo/demo_kie.jpeg', show=True)

The output image should look like this:


You may have found that the Python interface and the command line interface of MMOCRInferencer are very similar. The following sections will use the Python interface as an example to introduce the usage of MMOCRInferencer. For more information about the command line interface, please refer to Command Line Interface.

Initialization

Each Inferencer must be initialized with a model. You can also choose the inference device during initialization.

Model Initialization

For each task, MMOCRInferencer takes two arguments in the form of xxx and xxx_weights (e.g. det and det_weights) for initialization, and there are many ways to initialize a model for inference. We will take det and det_weights as an example to illustrate some typical ways to initialize a model.

  • To infer with MMOCR’s pre-trained model, passing its name to the argument det can work. The weights will be automatically downloaded and loaded from OpenMMLab’s model zoo. Check Weights for available model names.

    >>> MMOCRInferencer(det='DBNet')
    
  • To load a custom config and weights, you can pass the path to the config file to det and the path to the weights to det_weights.

    >>> MMOCRInferencer(det='path/to/dbnet_config.py', det_weights='path/to/dbnet.pth')
    

You may click on the “Standard Inferencer” tab to find more initialization methods.

Device

Each Inferencer instance is bound to a device. By default, the best device is automatically decided by MMEngine. You can also alter the device by specifying the device argument. For example, you can use the following code to create an Inferencer on GPU 1.

>>> inferencer = MMOCRInferencer(det='DBNet', device='cuda:1')

To create an Inferencer on CPU:

>>> inferencer = MMOCRInferencer(det='DBNet', device='cpu')

Refer to torch.device for all the supported forms.

Inference

Once the Inferencer is initialized, you can directly pass in the raw data to be inferred and get the inference results from return values.

Input

Input can be either of these types:

  • str: Path/URL to the image.

    >>> inferencer('demo/demo_text_ocr.jpg')
    
  • array: Image in numpy array. It should be in BGR order.

    >>> import mmcv
    >>> array = mmcv.imread('demo/demo_text_ocr.jpg')
    >>> inferencer(array)
    
  • list: A list of basic types above. Each element in the list will be processed separately.

    >>> inferencer(['img_1.jpg', 'img_2.jpg'])
    >>> # You can even mix the types
    >>> inferencer(['img_1.jpg', array])
    
  • str: Path to the directory. All images in the directory will be processed.

    >>> inferencer('tests/data/det_toy_dataset/imgs/test/')
    

Output

By default, each Inferencer returns the prediction results in a dictionary format.

  • visualization contains the visualized predictions. But it’s an empty list by default unless return_vis=True.

  • predictions contains the prediction results in a json-serializable format. As presented below, the contents are slightly different depending on the task type.

    {
        'predictions' : [
          # Each instance corresponds to an input image
          {
            'det_polygons': [...],  # 2d list of length (N,), format: [x1, y1, x2, y2, ...]
            'det_scores': [...],  # float list of length (N,)
            'det_bboxes': [...],   # 2d list of shape (N, 4), format: [min_x, min_y, max_x, max_y]
            'rec_texts': [...],  # str list of length (N,)
            'rec_scores': [...],  # float list of length (N,)
            'kie_labels': [...],  # node labels, length (N, )
            'kie_scores': [...],  # node scores, length (N, )
            'kie_edge_scores': [...],  # edge scores, shape (N, N)
            'kie_edge_labels': [...]  # edge labels, shape (N, N)
          },
          ...
        ],
        'visualization' : [
          array(..., dtype=uint8),
        ]
    }
    

If you wish to get the raw outputs from the model, you can set return_datasamples to True to get the original DataSample, which will be stored in predictions.
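For example, a minimal sketch reusing the ocr inferencer created in Basic Usage:

>>> result = ocr('demo/demo_text_ocr.jpg', return_datasamples=True)
>>> sample = result['predictions'][0]
>>> print(type(sample))  # a DataSample rather than a plain dict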

Dumping Results

Apart from obtaining predictions from the return value, you can also export the predictions/visualizations to files by setting out_dir and save_pred/save_vis arguments.

>>> inferencer('img_1.jpg', out_dir='outputs/', save_pred=True, save_vis=True)

This results in a directory structure like:

outputs
├── preds
│   └── img_1.json
└── vis
    └── img_1.jpg

The filename of each file is the same as the corresponding input image filename. If the input image is an array, the filename will be a number starting from 0.
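The dumped predictions can be read back like any other JSON file; a small sketch following the directory layout above (the exact keys depend on the tasks that were run):

>>> import json
>>> with open('outputs/preds/img_1.json') as f:
...     pred = json.load(f)
>>> print(pred.keys())  # e.g. rec_texts, rec_scores, det_polygons, det_scores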

Batch Inference

You can customize the batch size by setting batch_size. The default batch size is 1.
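For example, a minimal sketch that runs batched inference over the directory input shown earlier:

>>> results = inferencer('tests/data/det_toy_dataset/imgs/test/', batch_size=4)
>>> print(len(results['predictions']))  # one entry per image in the directory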

API

Here are extensive lists of parameters that you can use.

MMOCRInferencer.__init__():

Arguments | Type | Default | Description
det | str or Weights, optional | None | Pretrained text detection algorithm. It's the path to the config file or the model name defined in metafile.
det_weights | str, optional | None | Path to the custom checkpoint file of the selected det model. If it is not specified and "det" is a model name of metafile, the weights will be loaded from metafile.
rec | str or Weights, optional | None | Pretrained text recognition algorithm. It's the path to the config file or the model name defined in metafile.
rec_weights | str, optional | None | Path to the custom checkpoint file of the selected rec model. If it is not specified and "rec" is a model name of metafile, the weights will be loaded from metafile.
kie [1] | str or Weights, optional | None | Pretrained key information extraction algorithm. It's the path to the config file or the model name defined in metafile.
kie_weights | str, optional | None | Path to the custom checkpoint file of the selected kie model. If it is not specified and "kie" is a model name of metafile, the weights will be loaded from metafile.
device | str, optional | None | Device used for inference, accepting all strings allowed by torch.device, e.g. 'cuda:0' or 'cpu'. If None, the available device will be automatically used. Defaults to None.

[1]: kie is only effective when both text detection and recognition models are specified.

MMOCRInferencer.__call__()

Arguments | Type | Default | Description
inputs | str/list/tuple/np.array | required | It can be a path to an image/a folder, an np array or a list/tuple (with img paths or np arrays).
return_datasamples | bool | False | Whether to return results as DataSamples. If False, the results will be packed into a dict.
batch_size | int | 1 | Inference batch size.
det_batch_size | int, optional | None | Inference batch size for the text detection model. Overrides batch_size if it is not None.
rec_batch_size | int, optional | None | Inference batch size for the text recognition model. Overrides batch_size if it is not None.
kie_batch_size | int, optional | None | Inference batch size for the KIE model. Overrides batch_size if it is not None.
return_vis | bool | False | Whether to return the visualization result.
print_result | bool | False | Whether to print the inference result to the console.
show | bool | False | Whether to display the visualization results in a popup window.
wait_time | float | 0 | The interval of show, in seconds.
out_dir | str | results/ | Output directory of results.
save_vis | bool | False | Whether to save the visualization results to out_dir.
save_pred | bool | False | Whether to save the inference results to out_dir.

Command Line Interface

Note

This section is only applicable to MMOCRInferencer.

You can use tools/infer.py to perform inference through MMOCRInferencer. Its general usage is as follows:

python tools/infer.py INPUT_PATH [--det DET] [--det-weights ...] ...

where INPUT_PATH is a required field, which should be a path to an image or a folder. Command-line parameters map to the Python interface parameters as follows:

  • To convert the Python interface parameters to the command line ones, you need to add two -- in front of the Python interface parameters, and replace the underscore _ with the hyphen -. For example, out_dir becomes --out-dir.

  • For boolean type parameters, putting the parameter in the command is equivalent to specifying it as True. For example, --show will specify the show parameter as True.

In addition, the command line will not display the inference result by default. You can use the --print-result parameter to view the inference result.

Here is an example:

python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec SAR --show --print-result

Running this command will give the following result:

{'predictions': [{'rec_texts': ['CBank', 'Docbcba', 'GROUP', 'MAUN', 'CROBINSONS', 'AOCOC', '916M3', 'BOO9', 'Oven', 'BRANDS', 'ARETAIL', '14', '70<UKN>S', 'ROUND', 'SALE', 'YEAR', 'ALLY', 'SALE', 'SALE'],
'rec_scores': [0.9753464579582214, ...], 'det_polygons': [[551.9930285844646, 411.9138765335083, 553.6153911653112,
383.53195309638977, 620.2410061195247, 387.33785033226013, 618.6186435386782, 415.71977376937866], ...], 'det_scores': [0.8230461478233337, ...]}]}

Config

MMOCR mainly uses Python files as configuration files. The design of its configuration file system integrates the ideas of modularity and inheritance to facilitate various experiments.

Common Usage

Note

This section is recommended to be read together with the primary usage in MMEngine: Config.

There are three most common operations in MMOCR: inheritance of configuration files, reference to _base_ variables, and modification of _base_ variables. Config provides two syntaxes for inheriting and modifying _base_, one for Python, Json, and Yaml, and one for Python configuration files only. In MMOCR, we prefer the Python-only syntax, so this will be the basis for further description.

The configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py is used as an example to illustrate the three common uses.

_base_ = [
    '_base_dbnet_resnet18_fpnc.py',
    '../_base_/datasets/icdar2015.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_train.pipeline = _base_.train_pipeline
icdar2015_textdet_test = _base_.icdar2015_textdet_test
icdar2015_textdet_test.pipeline = _base_.test_pipeline

train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=icdar2015_textdet_test)

Configuration Inheritance

There is an inheritance mechanism for configuration files, i.e. one configuration file A can use another configuration file B as its base and inherit all the fields directly from it, thus avoiding a lot of copy-pasting.

In dbnet_resnet18_fpnc_1200e_icdar2015.py you can see that

_base_ = [
    '_base_dbnet_resnet18_fpnc.py',
    '../_base_/datasets/icdar2015.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

The above statement reads all the base configuration files in the list, and all the fields in them are loaded into dbnet_resnet18_fpnc_1200e_icdar2015.py. We can see the structure of the configuration file after it has been parsed by running the following statements in a Python interpreter.

from mmengine import Config
db_config = Config.fromfile('configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py')
print(db_config)

It can be found that the parsed configuration contains all the fields and information in the base configuration.

Note

Variables with the same name cannot be defined in more than one of the _base_ configuration files.

_base_ Variable References

Sometimes we may need to reference some fields in the _base_ configuration directly in order to avoid duplicate definitions. Suppose we want to get the variable pseudo in the _base_ configuration, we can get the variable in the _base_ configuration directly via _base_.pseudo.

This syntax has been used extensively in the configuration of MMOCR, and the dataset and pipeline configurations for each model in MMOCR are referenced in the base configuration. For example,

icdar2015_textdet_train = _base_.icdar2015_textdet_train
# ...
train_dataloader = dict(
    # ...
    dataset=icdar2015_textdet_train)

_base_ Variable Modification

In MMOCR, different algorithms usually have different pipelines for different datasets, so there are often scenarios where the pipeline in a dataset config needs to be modified. There are also many scenarios where you need to modify variables in the _base_ configuration, for example, modifying the training strategy of an algorithm, or replacing some of its modules (backbone, etc.). Users can directly modify the referenced _base_ variables using Python syntax. For dicts, we also provide a method similar to class attribute modification to modify the contents of the dictionary directly.

  1. Dictionary

    Here is an example of modifying pipeline in a dataset.

    The dictionary can be modified using Python syntax:

    # Get the dataset in _base_
    icdar2015_textdet_train = _base_.icdar2015_textdet_train
    # You can modify the variables directly with Python's update
    icdar2015_textdet_train.update(pipeline=_base_.train_pipeline)
    

    It can also be modified in the same way as changing Python class attributes.

    # Get the dataset in _base_
    icdar2015_textdet_train = _base_.icdar2015_textdet_train
    # Modify it as if it were a class attribute
    icdar2015_textdet_train.pipeline = _base_.train_pipeline
    
  2. List

    Suppose the variable pseudo = [1, 2, 3] in the _base_ configuration needs to be modified to [1, 2, 4]:

    # pseudo.py
    pseudo = [1, 2, 3]
    

    It can be rewritten directly as:

    _base_ = ['pseudo.py']
    pseudo = [1, 2, 4]
    

    Or modify the list using Python syntax:

    _base_ = ['pseudo.py']
    pseudo = _base_.pseudo
    pseudo[2] = 4
    

Command Line Modification

Sometimes we only want to modify part of the configuration without changing the configuration file itself. For example, if you want to change the learning rate during an experiment but do not want to write a new configuration file, you can pass in parameters on the command line to override the relevant configuration.

We can pass --cfg-options on the command line and modify the corresponding fields directly with the arguments after it. For example, we can run the following command to modify the learning rate temporarily for this training session.

python tools/train.py example.py --cfg-options optim_wrapper.optimizer.lr=1

For more detailed usage, refer to MMEngine: Command Line Modification.

Configuration Content

With config files and Registry, MMOCR can modify the training parameters as well as the model configuration without invading the code. Specifically, users can customize the following modules in the configuration file: environment configuration, hook configuration, log configuration, training strategy configuration, data-related configuration, model-related configuration, evaluation configuration, and visualization configuration.

This document will take the text detection algorithm DBNet and the text recognition algorithm CRNN as examples to introduce the contents of Config in detail.

Environment Configuration

default_scope = 'mmocr'
env_cfg = dict(
    cudnn_benchmark=True,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
randomness = dict(seed=None)

There are three main components:

  • Set the default scope of all registries to mmocr, ensuring that all modules are searched first from the MMOCR codebase. If the module does not exist, the search will continue from the upstream algorithm libraries MMEngine and MMCV, see MMEngine: Registry for more details.

  • env_cfg configures the distributed environment, see MMEngine: Runner for more details.

  • randomness: Some settings to make the experiment as reproducible as possible like seed and deterministic. See MMEngine: Runner for more details.
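For example, a minimal sketch of pinning the random seed for more reproducible experiments (the seed value is arbitrary; deterministic is the flag mentioned above):

# Make the run reproducible at the cost of some speed
randomness = dict(seed=42, deterministic=True)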

Hook Configuration

Hooks are divided into two main parts: default hooks, which are required for all tasks to run, and custom hooks, which generally serve specific algorithms or tasks (there are no custom hooks in MMOCR so far).

default_hooks = dict(
    timer=dict(type='IterTimerHook'), # Time recording, including data time as well as model inference time
    logger=dict(type='LoggerHook', interval=1), # Collect logs from different components
    param_scheduler=dict(type='ParamSchedulerHook'), # Update some hyper-parameters in optimizer
    checkpoint=dict(type='CheckpointHook', interval=1), # Save checkpoints. `interval` controls the save interval
    sampler_seed=dict(type='DistSamplerSeedHook'), # Data-loading sampler for distributed training.
    sync_buffer=dict(type='SyncBuffersHook'), # Synchronize buffer in case of distributed training
    visualization=dict( # Visualize the results of val and test
        type='VisualizationHook',
        interval=1,
        enable=False,
        show=False,
        draw_gt=False,
        draw_pred=False))
custom_hooks = []

Here is a brief description of a few hooks whose parameters may be changed frequently. For a general modification method, refer to Modify configuration.

  • LoggerHook: Used to configure the behavior of the logger. For example, by modifying interval you can control the interval of log printing, so that the log is printed once every interval iterations. For more settings, refer to the LoggerHook API.

  • CheckpointHook: Used to configure checkpoint-related behavior, such as saving optimal and/or latest weights. You can also modify interval to control the checkpoint saving interval. More settings can be found in the CheckpointHook API.

  • VisualizationHook: Used to configure visualization-related behavior, such as visualizing predicted results during validation or testing. Default is off. This Hook also depends on Visualization Configuration. You can refer to Visualizer for more details. For more configuration, you can refer to VisualizationHook API.

If you want to learn more about the configuration of the default hooks and their functions, you can refer to MMEngine: Hooks.
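For example, a minimal sketch that overrides the checkpoint and logger hooks above in a model config. The save_best metric name follows the icdar/hmean key seen in the evaluation logs earlier; all values are illustrative.

default_hooks = dict(
    # Save a checkpoint every 20 epochs and additionally keep the one with the best hmean
    checkpoint=dict(
        type='CheckpointHook',
        interval=20,
        save_best='icdar/hmean',
        rule='greater'),
    # Print a log record every 100 iterations
    logger=dict(type='LoggerHook', interval=100))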

Log Configuration

This section is mainly used to configure the log level and the log processor.

log_level = 'INFO' # Logging Level
log_processor = dict(type='LogProcessor',
                        window_size=10,
                        by_epoch=True)
  • The logging severity levels are the same as those of Python: logging

  • The log processor is mainly used to control the format of the output, detailed functions can be found in MMEngine: logging.

    • by_epoch=True indicates that the logs are output in accordance with epochs, and the log format needs to be consistent with the type='EpochBasedTrainLoop' parameter in train_cfg. For example, if you want to output logs by iteration number, you need to set by_epoch=False in log_processor and type='IterBasedTrainLoop' in train_cfg, as sketched after this list.

    • window_size indicates the smoothing window of the loss, i.e. each loss value printed in the log is the average of that loss over the last window_size iterations.
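For example, a minimal sketch that switches both the loop and the log processor to iteration-based counting (max_iters and val_interval are illustrative):

# Count and log by iteration instead of by epoch
log_processor = dict(type='LogProcessor', window_size=10, by_epoch=False)
train_cfg = dict(type='IterBasedTrainLoop', max_iters=100000, val_interval=10000)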

Training Strategy Configuration

This section mainly contains optimizer settings, learning rate schedules and Loop settings.

Training strategies usually vary for different tasks (text detection, text recognition, key information extraction). Here we explain the example configuration in CRNN, which is a text recognition model.

# optimizer
optim_wrapper = dict(
    type='OptimWrapper', optimizer=dict(type='Adadelta', lr=1.0))
param_scheduler = [dict(type='ConstantLR', factor=1.0)]
train_cfg = dict(type='EpochBasedTrainLoop',
                    max_epochs=5, # train epochs
                    val_interval=1) # val interval
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
  • optim_wrapper : It contains two main parts, OptimWrapper and Optimizer. Detailed usage information can be found in MMEngine: Optimizer Wrapper.

    • The optimizer wrapper supports different training strategies, including mixed-precision training (AMP), gradient accumulation, and gradient clipping (see the sketch after this list).

    • All PyTorch optimizers are supported in the optimizer settings. All supported optimizers are available in PyTorch Optimizer List.

  • param_scheduler : the learning rate tuning strategy. It supports most of the learning rate schedulers in PyTorch, such as ExponentialLR, LinearLR, StepLR, MultiStepLR, etc., and they are used in much the same way; see the scheduler interface, and more features can be found in MMEngine: Optimizer Parameter Tuning Strategy.

  • train/test/val_cfg : the execution flow of the task. MMEngine provides four kinds of flow: EpochBasedTrainLoop, IterBasedTrainLoop, ValLoop and TestLoop. More can be found in MMEngine: loop controller.
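As referenced above, here is a minimal sketch of enabling mixed-precision training and gradient clipping through the optimizer wrapper (the values are illustrative):

# Wrap the optimizer with AMP support and clip gradients by norm
optim_wrapper = dict(
    type='AmpOptimWrapper',
    optimizer=dict(type='Adadelta', lr=1.0),
    clip_grad=dict(max_norm=5, norm_type=2))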

Dataset Preparation

Introduction

After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides a data preparation script to help users prepare the datasets with only one command.

In this section, we will introduce a typical process of preparing a dataset for MMOCR:

  1. Download datasets and convert their format to the suggested one

  2. Modify the config file

However, the first step is not necessary if you already have a dataset in the format that MMOCR supports. You can read Dataset Classes for more details.

Downloading Datasets and Converting Format

As an example of the data preparation steps, you can use the following command to prepare the ICDAR 2015 dataset for text detection task.

python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet

The dataset will then be downloaded and converted to the MMOCR format, and the resulting file directory structure is as follows:

data/icdar2015
├── textdet_imgs
│   ├── test
│   └── train
├── textdet_test.json
└── textdet_train.json

Once your dataset has been prepared, you can use browse_dataset.py to visualize the dataset and check if the annotations are correct.

python tools/analysis_tools/browse_dataset.py configs/textdet/_base_/datasets/icdar2015.py

Dataset Configuration

Single Dataset Training

When training or evaluating a model on new datasets, we need to write the dataset config, which sets the image path, annotation path, and image prefix. The path configs/xxx/_base_/datasets/ is pre-configured with the commonly used datasets in MMOCR (if you use prepare_dataset.py to prepare a dataset, this config will be generated automatically). Here we take the ICDAR 2015 dataset as an example (see configs/textdet/_base_/datasets/icdar2015.py).

icdar2015_textdet_data_root = 'data/icdar2015' # dataset root path

# Train set config
icdar2015_textdet_train = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,               # dataset root path
    ann_file='textdet_train.json',                       # name of annotation
    filter_cfg=dict(filter_empty_gt=True, min_size=32),  # filtering empty images
    pipeline=None)
# Test set config
icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)

After configuring the dataset, we can import it in the corresponding model configs. For example, to train the “DBNet_R18” model on the ICDAR 2015 dataset:

_base_ = [
    '_base_dbnet_r18_fpnc.py',
    '../_base_/datasets/icdar2015.py',  # import the dataset config
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_sgd_1200e.py',
]

icdar2015_textdet_train = _base_.icdar2015_textdet_train            # specify the training set
icdar2015_textdet_train.pipeline = _base_.train_pipeline   # specify the training pipeline
icdar2015_textdet_test = _base_.icdar2015_textdet_test              # specify the testing set
icdar2015_textdet_test.pipeline = _base_.test_pipeline     # specify the testing pipeline

train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)    # specify the dataset in train_dataloader

val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=icdar2015_textdet_test)    # specify the dataset in val_dataloader

test_dataloader = val_dataloader

Multi-dataset Training

In addition, ConcatDataset enables users to train or test the model on a combination of multiple datasets. You just need to set the dataset type in the dataloader to ConcatDataset in the configuration file and specify the corresponding list of datasets.

train_list = [ic11, ic13, ic15]
train_dataloader = dict(
    dataset=dict(
        type='ConcatDataset', datasets=train_list, pipeline=train_pipeline))

For example, the following configuration uses the MJSynth dataset for training and 6 academic datasets (CUTE80, IIIT5K, SVT, SVTP, ICDAR2013, ICDAR2015) for testing.

_base_ = [ # Import all dataset configurations you want to use
    '../_base_/datasets/mjsynth.py',
    '../_base_/datasets/cute80.py',
    '../_base_/datasets/iiit5k.py',
    '../_base_/datasets/svt.py',
    '../_base_/datasets/svtp.py',
    '../_base_/datasets/icdar2013.py',
    '../_base_/datasets/icdar2015.py',
    '../_base_/default_runtime.py',
    '../_base_/schedules/schedule_adadelta_5e.py',
    '_base_crnn_mini-vgg.py',
]

# List of training datasets
train_list = [_base_.mjsynth_textrecog_train]
# List of testing datasets
test_list = [
    _base_.cute80_textrecog_test, _base_.iiit5k_textrecog_test, _base_.svt_textrecog_test,
    _base_.svtp_textrecog_test, _base_.icdar2013_textrecog_test, _base_.icdar2015_textrecog_test
]

# Use ConcatDataset to combine the datasets in the list
train_dataset = dict(
       type='ConcatDataset', datasets=train_list, pipeline=_base_.train_pipeline)
test_dataset = dict(
       type='ConcatDataset', datasets=test_list, pipeline=_base_.test_pipeline)

train_dataloader = dict(
    batch_size=192 * 4,
    num_workers=32,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=train_dataset)

test_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=test_dataset)

val_dataloader = test_dataloader

Training and Testing

To meet diverse requirements, MMOCR supports training and testing models on various devices, including PCs, workstations, and computation clusters.

Single GPU Training and Testing

Training

tools/train.py provides the basic training service. MMOCR recommends using GPUs for model training and testing, but CPU-only training and testing is also supported. For example, the following commands demonstrate how to train a DBNet model using a single GPU or the CPU.

# Train the specified MMOCR model by calling tools/train.py
CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [PY_ARGS]

# Training
# Example 1: Training DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py

# Example 2: Specify to train DBNet with gpu:0, specify the working directory as dbnet/, and turn on mixed precision (amp) training
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py --work-dir dbnet/ --amp

Note

If multiple GPUs are available, you can specify a certain GPU, e.g. the third one, by setting CUDA_VISIBLE_DEVICES=3.

The following table lists all the arguments supported by train.py. Args without the -- prefix are mandatory, while others are optional.

ARGS | Type | Description
config | str | (required) Path to config.
--work-dir | str | Specify the working directory for the training logs and model checkpoints.
--resume | bool | Whether to resume training from the latest checkpoint.
--amp | bool | Whether to use automatic mixed precision for training.
--auto-scale-lr | bool | Whether to use automatic learning rate scaling.
--cfg-options | str | Override some settings in the config. Example
--launcher | str | Option for the launcher: ['none', 'pytorch', 'slurm', 'mpi'].
--local_rank | int | Rank of the local machine, used for distributed training. Defaults to 0.
--tta | bool | Whether to use test time augmentation.

Test

tools/test.py provides the basic testing service, which is used in a similar way to the training script. For example, the following commands demonstrate how to test a DBNet model on a single GPU or CPU.

# Test a pretrained MMOCR model by calling tools/test.py
CUDA_VISIBLE_DEVICES= python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]

# Test
# Example 1: Testing DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth

# Example 2: Testing DBNet on gpu:0
CUDA_VISIBLE_DEVICES=0 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth

The following table lists all the arguments supported by test.py. Args without the -- prefix are mandatory, while others are optional.

ARGS | Type | Description
config | str | (required) Path to config.
checkpoint | str | (required) The model to be tested.
--work-dir | str | Specify the working directory for the logs.
--save-preds | bool | Whether to save the predictions to a pkl file.
--show | bool | Whether to visualize the predictions.
--show-dir | str | Path to save the visualization results.
--wait-time | float | Interval of visualization (s), defaults to 2.
--cfg-options | str | Override some settings in the config. Example
--launcher | str | Option for the launcher: ['none', 'pytorch', 'slurm', 'mpi'].
--local_rank | int | Rank of the local machine, used for distributed training. Defaults to 0.

Training and Testing with Multiple GPUs

For large models, distributed training or testing significantly improves the efficiency. For this purpose, MMOCR provides distributed scripts tools/dist_train.sh and tools/dist_test.sh implemented based on MMDistributedDataParallel.

# Training
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]

# Testing
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]

The following table lists the arguments supported by dist_*.sh.

ARGS | Type | Description
NNODES | int | The number of nodes. Defaults to 1.
NODE_RANK | int | The rank of the current node. Defaults to 0.
PORT | int | The master port that will be used by the rank 0 node, ranging from 0 to 65535. Defaults to 29500.
MASTER_ADDR | str | The address of the rank 0 node. Defaults to "127.0.0.1".
CONFIG_FILE | str | (required) The path to config.
CHECKPOINT_FILE | str | (required, only used in dist_test.sh) The path to the checkpoint to be tested.
GPU_NUM | int | (required) The number of GPUs to be used per node.
[PY_ARGS] | str | Arguments to be parsed by tools/train.py and tools/test.py.

These two scripts enable training and testing on single-machine multi-GPU or multi-machine multi-GPU. See the following example for usage.

Single-machine Multi-GPU

The following commands demonstrate how to train and test with a specified number of GPUs on a single machine with multiple GPUs.

  1. Training

    Training DBNet using 4 GPUs on a single machine.

    tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 4
    
  2. Testing

    Testing DBNet using 4 GPUs on a single machine.

    tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
    

Launching Multiple Tasks on Single Machine

For a workstation equipped with multiple GPUs, the user can launch multiple tasks simultaneously by specifying the GPU IDs. For example, the following command demonstrates how to test DBNet with GPU [0, 1, 2, 3] and train CRNN on GPU [4, 5, 6, 7].

# Specify gpu:0,1,2,3 for testing and assign port number 29500
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4

# Specify gpu:4,5,6,7 for training and assign port number 29501
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh configs/textrecog/crnn/crnn_academic_dataset.py 4

Note

dist_train.sh sets MASTER_PORT to 29500 by default. When other processes already occupy this port, the program will get a runtime error RuntimeError: Address already in use. In this case, you need to set MASTER_PORT to another free port number in the range of (0~65535).

Multi-machine Multi-GPU Training and Testing

You can launch a task on multiple machines connected to the same network. MMOCR relies on torch.distributed package for distributed training. Find more information at PyTorch’s launch utility.

  1. Training

    The following command demonstrates how to train DBNet on two machines with a total of 4 GPUs.

    # Say that you want to launch the training job on two machines
    # On the first machine:
    NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
    # On the second machine:
    NNODES=2 NODE_RANK=1 PORT=29501 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
    
  2. Testing

    The following command demonstrates how to test DBNet on two machines with a total of 4 GPUs.

    # Say that you want to launch the testing job on two machines
    # On the first machine:
    NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
    # On the second machine:
    NNODES=2 NODE_RANK=1 PORT=29501 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
    

    Note

    The speed of the network could be the bottleneck of training.

Training and Testing with Slurm Cluster

If you run MMOCR on a cluster managed by Slurm, you can use the scripts tools/slurm_train.sh and tools/slurm_test.sh.

# tools/slurm_train.sh provides a script for submitting training tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

# tools/slurm_test.sh provides a script for submitting testing tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${WORK_DIR} [PY_ARGS]
ARGS Type Description
GPUS int The number of GPUs to be used by this task. Defaults to 8.
GPUS_PER_NODE int The number of GPUs to be allocated per node. Defaults to 8.
CPUS_PER_TASK int The number of CPUs to be allocated per task. Defaults to 5.
SRUN_ARGS str Arguments to be parsed by srun. Available options can be found here.
PARTITION str (required) Specify the partition on cluster.
JOB_NAME str (required) Name of the submitted job.
WORK_DIR str (required) Specify the working directory for saving the logs and checkpoints.
CHECKPOINT_FILE str (required, only used in slurm_test.sh) Path to the checkpoint to be tested.
PY_ARGS str Arguments to be parsed by tools/train.py and tools/test.py.

These scripts enable training and testing on slurm clusters, see the following examples.

  1. Training

    Here is an example of using 1 GPU to train a DBNet model on the dev partition.

    # Example: Request 1 GPU resource on dev partition for DBNet training task
    GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_train.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py work_dir
    
  2. Testing

    Similarly, the following example requests 1 GPU for testing.

    # Example: Request 1 GPU resource on dev partition for DBNet testing task
    GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_test.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth work_dir
    

Advanced Tips

Resume Training from a Checkpoint

tools/train.py allows users to resume training from a checkpoint by specifying the --resume parameter, in which case training will automatically resume from the latest saved checkpoint.

# Example: Resuming training from the latest checkpoint
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --resume

By default, the program will automatically resume training from the last successfully saved checkpoint of the previous training session, i.e. latest.pth. However, you can also resume from a specific checkpoint by setting its path in the configuration file:

# Example: Set the path of the checkpoint you want to load in the configuration file
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
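
If you also pass --resume on the command line (or set resume = True in the config), MMEngine will restore the full training state from that checkpoint rather than only loading its weights. The following is a minimal sketch based on MMEngine's resume convention; please verify it against the MMEngine documentation.

# A sketch of the combined config fields (assumption: MMEngine's convention):
# with resume = True, the optimizer, scheduler and iteration count are
# restored from `load_from`; with resume = False, only the model weights are
# loaded, which is the typical fine-tuning setup.
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
resume = True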

Mixed Precision Training

Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. In MMOCR, users can enable automatic mixed precision training by simply adding the --amp flag.

# Example: Using automatic mixed precision training
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --amp

The following table shows the support of each algorithm in MMOCR for automatic mixed precision training.

Model Supports AMP Description
Text Detection
DBNet Y
DBNetpp Y
DRRG N roi_align_rotated does not support fp16
FCENet N BCELoss does not support fp16
Mask R-CNN Y
PANet Y
PSENet Y
TextSnake N
Text Recognition
ABINet Y
CRNN Y
MASTER Y
NRTR Y
RobustScanner Y
SAR Y
SATRN Y

Automatic Learning Rate Scaling

MMOCR sets default initial learning rates for each model in the configuration file. However, these initial learning rates may not be applicable when the user uses a different batch_size than our preset base_batch_size. Therefore, we provide a tool to automatically scale the learning rate, which can be enabled by adding the --auto-scale-lr flag.

# Example: Using automatic learning rate scaling
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --auto-scale-lr
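
Under the hood, the scaling is driven by the auto_scale_lr field of the config, which records the base_batch_size that the preset learning rate was tuned for. The snippet below is only a sketch of that field following the OpenMMLab 2.0 convention; the value 16 is an illustrative placeholder and each MMOCR config ships with its own setting.

# When --auto-scale-lr is passed, the initial learning rate is scaled by
# (actual total batch size) / base_batch_size. The number below is a
# placeholder for illustration only.
auto_scale_lr = dict(base_batch_size=16)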

Visualize the Predictions

tools/test.py provides the visualization interface to facilitate the qualitative analysis of the OCR models.

Detection

(Green boxes are GTs, while red boxes are predictions)

Recognition

(Green font is the GT, red font is the prediction)

KIE

(From left to right: original image, text detection and recognition result, text classification result, relationship)

# Example 1: Show the visualization results per 2 seconds
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show --wait-time 2

# Example 2: For systems that do not support graphical interfaces (such as computing clusters, etc.), the visualization results can be dumped in the specified path
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show-dir ./vis_results

The visualization-related parameters in tools/test.py are described as follows.

ARGS Type Description
--show bool Whether to show the visualization results.
--show-dir str Path to save the visualization results.
--wait-time float Interval of visualization (s), defaults to 2.

Test Time Augmentation

Test time augmentation (TTA) is a technique that improves the performance of a model by performing data augmentation on the input image at test time. It is a simple yet effective method. In MMOCR, TTA can be enabled as follows:

Note

TTA is only supported for text recognition models.

python tools/test.py configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py checkpoints/crnn_mini-vgg_5e_mj.pth --tta

Visualization

Before reading this tutorial, it is recommended to read the MMEngine: Visualization documentation to get a first glimpse of the definition and usage of the Visualizer.

In brief, the Visualizer is implemented in MMEngine to meet the daily visualization needs, and contains three main functions:

  • Implement common drawing APIs, such as draw_bboxes for drawing bounding boxes and draw_lines for drawing lines.

  • Support writing visualization results, learning rate curves, loss function curves, and validation accuracy curves to various backends, including local disks and common deep learning training logging tools such as TensorBoard and Wandb.

  • Support calling anywhere in the code to visualize or record intermediate states of the model during training or testing, such as feature maps and validation results.

Based on MMEngine’s Visualizer, MMOCR comes with a variety of pre-built visualization tools that can be enabled by simply modifying the following configuration files.

  • The tools/analysis_tools/browse_dataset.py script provides a dataset visualization function that draws images and corresponding annotations after Data Transforms, as described in browse_dataset.py.

  • MMEngine implements LoggerHook, which uses the Visualizer to write the learning rate, loss and evaluation results to the backend set by the Visualizer. Therefore, by modifying the Visualizer backend in the configuration file, for example to TensorboardVisBackend or WandbVisBackend, you can log to common training logging tools such as TensorBoard or Wandb, making it easy to analyze and monitor the training process.

  • The VisualizationHook is implemented in MMOCR, which uses the Visualizer to visualize or store the prediction results of the validation or test phase into the backend set by the Visualizer. Therefore, by modifying the Visualizer backend in the configuration file, for example to TensorboardVisBackend or WandbVisBackend, the predicted images can be stored to TensorBoard or Wandb.

Configuration

Thanks to the registry mechanism, in MMOCR we can set the behavior of the Visualizer by modifying the configuration file. Usually, we define the default configuration of the visualizer in task/_base_/default_runtime.py; see the configuration tutorial for details.

vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='TextxxxLocalVisualizer', # use different visualizers for different tasks
    vis_backends=vis_backends,
    name='visualizer')

Based on the above example, we can see that the configuration of Visualizer consists of two main parts, namely, the type of Visualizer and the visualization backend vis_backends it uses.

  • For different OCR tasks, various visualizers are pre-configured in MMOCR, including TextDetLocalVisualizer, TextRecogLocalVisualizer, TextSpottingLocalVisualizer and KIELocalVisualizer. These visualizers extend the basic Visualizer API according to the characteristics of their tasks and implement the corresponding annotation information interface add_datasample. For example, users can directly use TextDetLocalVisualizer to visualize labels or predictions for the text detection task (a config sketch follows this list).

  • MMOCR sets the visualization backend vis_backend to the local visualization backend LocalVisBackend by default, saving all visualization results and other training information in a local folder.
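
For instance, a text detection config would fill in the template above with the task-specific visualizer; a minimal sketch:

# A concrete instance of the template above for the text detection task.
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
    type='TextDetLocalVisualizer',  # task-specific visualizer listed above
    vis_backends=vis_backends,
    name='visualizer')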

Storage

MMOCR uses the local visualization backend LocalVisBackend by default. The information stored by VisualizationHook and LoggerHook, including loss, learning rate, evaluation accuracy and visualization results, will be saved to the {work_dir}/{config_name}/{time}/{vis_data} folder by default. In addition, MMOCR also supports other common visualization backends, such as TensorboardVisBackend and WandbVisBackend; you only need to change the vis_backends type in the configuration file to the corresponding visualization backend. For example, you can store data to TensorBoard and Wandb by simply inserting the following code block into the configuration file.

_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend'),
    dict(type='WandbVisBackend'),
]
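
Each backend also accepts its own initialization options. For example, assuming MMEngine's WandbVisBackend forwards an init_kwargs dictionary to wandb.init(), the W&B project name could be set as sketched below (the project name is a placeholder):

_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    # init_kwargs is forwarded to wandb.init(); the project name here is an
    # illustrative placeholder
    dict(type='WandbVisBackend', init_kwargs=dict(project='mmocr-experiments')),
]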

Plot

Plot the prediction results

MMOCR mainly uses VisualizationHook to plot the prediction results of validation and test. By default, VisualizationHook is off, and its default configuration is as follows.

visualization=dict( # visualization of validation and test results
    type='VisualizationHook',
    enable=False,
    interval=1,
    show=False,
    draw_gt=False,
    draw_pred=False)

The following table shows the parameters supported by VisualizationHook.

Parameters Description
enable Whether to enable VisualizationHook. Defaults to False (off).
interval Controls the interval (in iterations) at which the results of val or test are stored or displayed, if VisualizationHook is enabled.
show Controls whether to visualize the results of val or test.
draw_gt Whether to draw the ground truth annotations for val or test results.
draw_pred Whether to draw the predictions for val or test results.

If you want to enable VisualizationHook during training or testing, you only need to modify the configuration. Taking dbnet_resnet18_fpnc_1200e_icdar2015.py as an example, to draw both annotations and predictions and display the images, the configuration can be modified as follows:

visualization = _base_.default_hooks.visualization
visualization.update(
    dict(enable=True, show=True, draw_gt=True, draw_pred=True))

If you only want to see the predictions, set draw_gt=False and keep draw_pred=True:

visualization = _base_.default_hooks.visualization
visualization.update(
    dict(enable=True, show=True, draw_gt=False, draw_pred=True))

tools/test.py further simplifies this process by providing the --show and --show-dir parameters, so the annotations and predictions can be visualized during testing without modifying the configuration.

# Show test results
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show

# Specify where to store the prediction results
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/

Useful Tools

Visualization Tools

Dataset Visualization Tool

MMOCR provides a dataset visualization tool tools/visualizations/browse_dataset.py to help users troubleshoot possible dataset-related problems. You just need to specify the path to the training config (usually stored in configs/textdet/dbnet/xxx.py) or the dataset config (usually stored in configs/textdet/_base_/datasets/xxx.py), and the tool will automatically plot the transformed (or original) images and labels.

Usage
python tools/visualizations/browse_dataset.py \
    ${CONFIG_FILE} \
    [-o, --output-dir ${OUTPUT_DIR}] \
    [-p, --phase ${DATASET_PHASE}] \
    [-m, --mode ${DISPLAY_MODE}] \
    [-t, --task ${DATASET_TASK}] \
    [-n, --show-number ${NUMBER_IMAGES_DISPLAY}] \
    [-i, --show-interval ${SHOW_INTERVAL}] \
    [--cfg-options ${CFG_OPTIONS}]
ARGS Type Description
config str (required) Path to the config.
-o, --output-dir str If a GUI is not available, specify an output path to save the visualization results.
-p, --phase str Phase of dataset to visualize. Use "train", "test" or "val" if you just want to visualize the default split. It can also be a dataset variable name, which might be useful when a dataset split has multiple variants in the config.
-m, --mode original, transformed, pipeline Display mode: display the original pictures, the transformed pictures, or a comparison of both. original only visualizes the original dataset & annotations; transformed shows the resulting images processed through all the transforms; pipeline shows all the intermediate images. Defaults to "transformed".
-t, --task auto, textdet, textrecog Specify the task type of the dataset. If auto, the task type will be inferred from the config. If the script is unable to infer the task type, you need to specify it manually. Defaults to auto.
-n, --show-number int The number of samples to visualize. If not specified, display all images in the dataset.
-i, --show-interval float Interval of visualization (s), defaults to 2.
--cfg-options str Override some settings in the configs. Example
Examples

The following example demonstrates how to use the tool to visualize the training data used by the “DBNet_R50_icdar2015” model.

# Example: Visualizing the training data used by the dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015 model
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py

By default, the visualization mode is “transformed”, and you will see the images & annotations being transformed by the pipeline:

If you just want to visualize the original dataset, simply set the mode to “original”:

python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m original

Or, to visualize the entire pipeline:

python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m pipeline

In addition, users can also visualize the original images and their corresponding labels of the dataset by specifying the path to the dataset config file, for example:

python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py

Some datasets might have multiple variants. For example, the test split of icdar2015 textrecog dataset has two variants, which the base dataset config defines as follows:

icdar2015_textrecog_test = dict(
    ann_file='textrecog_test.json',
    # ...
    )

icdar2015_1811_textrecog_test = dict(
    ann_file='textrecog_test_1811.json',
    # ...
)

In this case, you can specify the variant name to visualize the corresponding dataset:

python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py -p icdar2015_1811_textrecog_test

Based on this tool, users can easily verify if the annotation of a custom dataset is correct.

Hyper-parameter Scheduler Visualization

This tool aims to help the user check the hyper-parameter scheduler of the optimizer (without training), supporting the visualization of the learning rate or momentum.

Introduce the scheduler visualization tool
python tools/visualizations/vis_scheduler.py \
    ${CONFIG_FILE} \
    [-p, --parameter ${PARAMETER_NAME}] \
    [-d, --dataset-size ${DATASET_SIZE}] \
    [-n, --ngpus ${NUM_GPUs}] \
    [-s, --save-path ${SAVE_PATH}] \
    [--title ${TITLE}] \
    [--style ${STYLE}] \
    [--window-size ${WINDOW_SIZE}] \
    [--cfg-options]

Description of all arguments

  • config: The path of a model config file.

  • -p, --parameter: The parameter whose change curve will be visualized; choose from “lr” and “momentum”. Defaults to “lr”.

  • -d, --dataset-size: The size of the dataset. If set, build_dataset will be skipped and ${DATASET_SIZE} will be used as the size. Defaults to using the function build_dataset.

  • -n, --ngpus: The number of GPUs used in training. Defaults to 1.

  • -s, --save-path: The path to save the learning rate curve plot. Not saved by default.

  • --title: Title of the figure. Defaults to the config file name.

  • --style: Style of the matplotlib plot. Defaults to whitegrid.

  • --window-size: The shape of the display window. If not specified, it will be set to 12*7. If used, it must be in the format 'W*H'.

  • --cfg-options: Modifications to the configuration file, refer to Learn about Configs.

Note

Loading annotations may consume a lot of time; you can directly specify the size of the dataset with -d, --dataset-size to save time.

How to plot the learning rate curve without training

You can use the following command to plot the step learning rate schedule used in the config configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py:

python tools/visualizations/vis_scheduler.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -d 100

Analysis Tools

Offline Evaluation Tool

For saved prediction results, we provide an offline evaluation script tools/analysis_tools/offline_eval.py. The following example demonstrates how to use this tool to evaluate the output of the “PSENet” model offline.

# When running the test script for the first time, you can save the output of the model by specifying the --save-preds parameter
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --save-preds
# Example: Testing on PSENet
python tools/test.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py epoch_600.pth --save-preds

# Then, using the saved outputs for offline evaluation
python tools/analysis_tools/offline_eval.py ${CONFIG_FILE} ${PRED_FILE}
# Example: Offline evaluation of saved PSENet results
python tools/analysis_tools/offline_eval.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py work_dirs/psenet_r50_fpnf_600e_icdar2015/epoch_600.pth_predictions.pkl

--save-preds saves the output to work_dir/CONFIG_NAME/MODEL_NAME_predictions.pkl by default.

In addition, based on this tool, users can also convert predictions obtained from other libraries into MMOCR-supported formats, then use MMOCR’s built-in metrics to evaluate them.

ARGS Type Description
config str (required) Path to the config.
pkl_results str (required) The saved predictions.
--cfg-options str Override some settings in the configs. Example

Calculate FLOPs and the Number of Parameters

We provide a tool to calculate the FLOPs and the number of parameters. First, install the dependency using the following command.

pip install fvcore

The usage of the script to calculate FLOPs and the number of parameters is as follows.

python tools/analysis_tools/get_flops.py ${config} --shape ${IMAGE_SHAPE}
ARGS Type Description
config str (required) Path to the config.
--shape int Image size to use when calculating FLOPs, such as --shape 320 320. Defaults to 640 640.

For example, you can run the following command to get FLOPs and the number of parameters of dbnet_resnet18_fpnc_100k_synthtext.py:

python tools/analysis_tools/get_flops.py configs/textdet/dbnet/dbnet_resnet18_fpnc_100k_synthtext.py --shape 1024 1024

The output is as follows:

input shape is  (1, 3, 1024, 1024)
| module                    | #parameters or shape | #flops  |
| :------------------------ | :------------------- | :------ |
| model                     | 12.341M              | 63.955G |
| backbone                  | 11.177M              | 38.159G |
| backbone.conv1            | 9.408K               | 2.466G  |
| backbone.conv1.weight     | (64, 3, 7, 7)        |         |
| backbone.bn1              | 0.128K               | 83.886M |
| backbone.bn1.weight       | (64,)                |         |
| backbone.bn1.bias         | (64,)                |         |
| backbone.layer1           | 0.148M               | 9.748G  |
| backbone.layer1.0         | 73.984K              | 4.874G  |
| backbone.layer1.1         | 73.984K              | 4.874G  |
| backbone.layer2           | 0.526M               | 8.642G  |
| backbone.layer2.0         | 0.23M                | 3.79G   |
| backbone.layer2.1         | 0.295M               | 4.853G  |
| backbone.layer3           | 2.1M                 | 8.616G  |
| backbone.layer3.0         | 0.919M               | 3.774G  |
| backbone.layer3.1         | 1.181M               | 4.842G  |
| backbone.layer4           | 8.394M               | 8.603G  |
| backbone.layer4.0         | 3.673M               | 3.766G  |
| backbone.layer4.1         | 4.721M               | 4.837G  |
| neck                      | 0.836M               | 14.887G |
| neck.lateral_convs        | 0.246M               | 2.013G  |
| neck.lateral_convs.0.conv | 16.384K              | 1.074G  |
| neck.lateral_convs.1.conv | 32.768K              | 0.537G  |
| neck.lateral_convs.2.conv | 65.536K              | 0.268G  |
| neck.lateral_convs.3.conv | 0.131M               | 0.134G  |
| neck.smooth_convs         | 0.59M                | 12.835G |
| neck.smooth_convs.0.conv  | 0.147M               | 9.664G  |
| neck.smooth_convs.1.conv  | 0.147M               | 2.416G  |
| neck.smooth_convs.2.conv  | 0.147M               | 0.604G  |
| neck.smooth_convs.3.conv  | 0.147M               | 0.151G  |
| det_head                  | 0.329M               | 10.909G |
| det_head.binarize         | 0.164M               | 10.909G |
| det_head.binarize.0       | 0.147M               | 9.664G  |
| det_head.binarize.1       | 0.128K               | 20.972M |
| det_head.binarize.3       | 16.448K              | 1.074G  |
| det_head.binarize.4       | 0.128K               | 83.886M |
| det_head.binarize.6       | 0.257K               | 67.109M |
| det_head.threshold        | 0.164M               |         |
| det_head.threshold.0      | 0.147M               |         |
| det_head.threshold.1      | 0.128K               |         |
| det_head.threshold.3      | 16.448K              |         |
| det_head.threshold.4      | 0.128K               |         |
| det_head.threshold.6      | 0.257K               |         |
!!!Please be cautious if you use the results in papers. You may need to check if all ops are supported and verify that the flops computation is correct.

Data Structures and Elements

MMOCR uses MMEngine: Abstract Data Element to encapsulate the data required for each task into data_sample. The base class has implemented basic add/delete/update/check functions and supports data migration between different devices, as well as dictionary-like and tensor-like operations, which also allows the interfaces of different algorithms to be unified.

Thanks to the unified data structures, the data flow between each module in the algorithm libraries, such as visualizer, evaluator, dataset, is greatly simplified. In MMOCR, we have the following conventions for different data types.

  • xxxData: Single granularity data annotation or model output. Currently MMEngine has three built-in granularities of data elements, including instance-level data (InstanceData), pixel-level data (PixelData) and image-level label data (LabelData). Among the tasks currently supported by MMOCR, text detection and key information extraction tasks use InstanceData to encapsulate the bounding boxes and the corresponding box label, while the text recognition task uses LabelData to encapsulate the text content.

  • xxxDataSample: inherited from MMEngine: Base Data Element, used to hold all the annotation and prediction information required by a single task. For example, TextDetDataSample for the text detection task, TextRecogDataSample for the text recognition task, and KIEDataSample for the key information extraction task.

In the following, we will introduce the practical application of data elements xxxData and data samples xxxDataSample in MMOCR, respectively.

Data Elements - xxxData

InstanceData and LabelData are the BaseDataElement subclasses defined in MMEngine to encapsulate different granularities of annotation data or model output. In MMOCR, we use InstanceData and LabelData to encapsulate the data types actually used in OCR-related tasks.

InstanceData

In the text detection task, the detector concentrates on instance-level text samples, so we use InstanceData to encapsulate the data needed for this task. Typically, its required training annotations and prediction outputs contain rectangular or polygonal bounding boxes, as well as bounding box labels. Since the text detection task has only one positive sample class, “text”, MMOCR uses 0 to number this class by default. The following code example shows how to use InstanceData to encapsulate the data used in the text detection task.

import torch
from mmengine.structures import InstanceData

# defining gt_instance for encapsulating the ground truth data
gt_instance = InstanceData()
gt_instance.bboxes = torch.Tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
gt_instance.polygons = torch.Tensor([[[0, 0], [10, 0], [10, 10], [0, 10]],
                                     [[10, 10], [20, 10], [20, 20], [10, 20]]])
gt_instance.labels = torch.LongTensor([0, 0])

# defining pred_instance for encapsulating the prediction data
pred_instances = InstanceData()
pred_polygons, scores = model(input)
pred_instances.polygons = pred_polygons
pred_instances.scores = scores

The conventions for the fields in InstanceData in MMOCR are shown in the table below. It is important to note that the length of each field in InstanceData must be equal to the number of instances N in the sample.

Field Type Description
bboxes torch.FloatTensor Bounding boxes [x1, y1, x2, y2] with the shape (N, 4).
labels torch.LongTensor Instance label with the shape (N, ). By default, MMOCR uses 0 to represent the "text" class.
polygons list[np.array(dtype=np.float32)] Polygonal bounding boxes with the shape (N, ).
scores torch.Tensor Confidence scores of the predictions of bounding boxes. (N, ).
ignored torch.BoolTensor Whether to ignore the current sample with the shape (N, ).
texts list[str] The text content of each instance with the shape (N, ), used for the e2e text spotting or KIE task.
text_scores torch.FloatTensor Confidence scores of the predictions of text contents with the shape (N, ), used for the e2e text spotting task.
edge_labels torch.IntTensor The node adjacency matrix with the shape (N, N). In KIE, the optional values for the state between nodes are -1 (ignored, not involved in the loss calculation), 0 (disconnected) and 1 (connected).
edge_scores torch.FloatTensor The prediction confidence of each edge in the KIE task, with the shape (N, N).

LabelData

For text recognition tasks, both labeled content and predicted content are wrapped using LabelData.

import torch
from mmengine.structures import LabelData

# defining gt_text for encapsulating the ground truth data
gt_text = LabelData()
gt_text.item = 'MMOCR'

# defining pred_text for encapsulating the prediction data
pred_text = LabelData()
index, score = model(input)
text = dictionary.idx2str(index)
pred_text.score = score
pred_text.item = text

The conventions for the LabelData fields in MMOCR are shown in the following table.

Field Type Description
item str Text content.
score list[float] Confidence score of the predicted text.
indexes torch.LongTensor A sequence of text characters encoded by dictionary and containing all special characters except <UNK>.
padded_indexes torch.LongTensor If the length of indexes is less than the maximum sequence length and pad_idx exists, this field holds the encoded text sequence padded to the maximum sequence length of max_seq_len.
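
To make the last two fields concrete, the sketch below fills them in for the text 'MMOCR'. Every index value here is a made-up placeholder: the actual numbers depend entirely on the Dictionary configured for the model, as do pad_idx and max_seq_len.

import torch
from mmengine.structures import LabelData

gt_text = LabelData()
gt_text.item = 'MMOCR'
# Hypothetical character indices produced by some dictionary.
gt_text.indexes = torch.LongTensor([12, 12, 24, 2, 27])
# The same sequence padded with a hypothetical pad_idx (36) up to a
# hypothetical max_seq_len of 10.
gt_text.padded_indexes = torch.LongTensor([12, 12, 24, 2, 27, 36, 36, 36, 36, 36])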

Data Samples - xxxDataSample

By defining a uniform data structure, we can easily encapsulate the annotation data and prediction results in a unified way, making data transfer between different modules of the code base easier. In MMOCR, we have designed three data structures based on the data needed in three tasks: TextDetDataSample, TextRecogDataSample, and KIEDataSample. These data structures all inherit from MMEngine: Base Data Element, which is used to hold all annotation and prediction information required by each task.

Text Detection - TextDetDataSample

TextDetDataSample is used to encapsulate the data needed for the text detection task. It contains two main fields gt_instances and pred_instances, which are used to store the annotation information and prediction results respectively.

Field Type Description
gt_instances InstanceData Annotation information.
pred_instances InstanceData Prediction results.

The fields of InstanceData that will be used are:

Field Type Description
bboxes torch.FloatTensor Bounding boxes [x1, y1, x2, y2] with the shape (N, 4).
labels torch.LongTensor Instance label with the shape (N, ). By default, MMOCR uses 0 to represent the "text" class.
polygons list[np.array(dtype=np.float32)] Polygonal bounding boxes with the shape (N, ).
scores torch.Tensor Confidence scores of the predictions of bounding boxes. (N, ).
ignored torch.BoolTensor Boolean flags with the shape (N, ), indicating whether to ignore the current sample.

Since text detection models usually only output one of the bboxes/polygons, we only need to make sure that one of these two is assigned a value.

The following sample code demonstrates the use of TextDetDataSample.

import torch
from mmengine.structures import InstanceData
from mmocr.structures import TextDetDataSample

data_sample = TextDetDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.gt_instances = gt_instances

# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.pred_instances = pred_instances

Text Recognition - TextRecogDataSample

TextRecogDataSample is used to encapsulate the data for the text recognition task. It has two fields, gt_text and pred_text, which are used to store annotation information and prediction results, respectively.

Field Type Description
gt_text LabelData Label information.
pred_text LabelData Prediction results.

The following sample code demonstrates the use of TextRecogDataSample.

import torch
from mmengine.structures import LabelData
from mmocr.structures import TextRecogDataSample

data_sample = TextRecogDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_text = LabelData(metainfo=img_meta)
gt_text.item = 'mmocr'
data_sample.gt_text = gt_text

# Define the prediction data
pred_text = LabelData(metainfo=img_meta)
pred_text.item = 'mmocr'
data_sample.pred_text = pred_text

The fields of LabelData that will be used are:

Field Type Description
item list[str] The text corresponding to the instance, of length (N, ), for end-to-end OCR tasks and KIE
score torch.FloatTensor Confidence of the text prediction, of length (N, ), for the end-to-end OCR task
indexes torch.LongTensor A sequence of text characters encoded by dictionary and containing all special characters except <UNK>.
padded_indexes torch.LongTensor If the length of indexes is less than the maximum sequence length and pad_idx exists, this field holds the encoded text sequence padded to the maximum sequence length of max_seq_len.

Key Information Extraction - KIEDataSample

KIEDataSample is used to encapsulate the data needed for the KIE task. It also contains two fields, gt_instances and pred_instances, which are used to store annotation information and prediction results respectively.

Field Type Description
gt_instances InstanceData Annotation information.
pred_instances InstanceData Prediction results.

The InstanceData fields that will be used by this task are shown in the following table.

Field Type Description
bboxes torch.FloatTensor Bounding boxes [x1, y1, x2, y2] with the shape (N, 4).
labels torch.LongTensor Instance label with the shape (N, ).
texts list[str] The text content of each instance with the shape (N, ), used for the e2e text spotting or KIE task.
edge_labels torch.IntTensor The node adjacency matrix with the shape (N, N). In the KIE task, the optional values for the state between nodes are -1 (ignored, not involved in the loss calculation), 0 (disconnected) and 1 (connected).
edge_scores torch.FloatTensor The prediction confidence of each edge in the KIE task, with the shape (N, N).
scores torch.FloatTensor The confidence scores for node label predictions, with the shape (N,).

Warning

Since there is no unified standard for model implementation of KIE tasks, the design currently considers only SDMGR model usage scenarios. Therefore, the design is subject to change as we support more KIE models.

The following sample code shows the use of KIEDataSample.

import torch
from mmengine.structures import InstanceData
from mmocr.structures import KIEDataSample

data_sample = KIEDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
gt_instances.texts = ['text1', 'text2', 'text3', 'text4', 'text5']
gt_instances.edge_labels = torch.randint(-1, 2, (5, 5))
data_sample.gt_instances = gt_instances

# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.rand((5,))
pred_instances.edge_labels = torch.randint(-1, 2, (5, 5))
pred_instances.edge_scores = torch.rand((5, 5))
data_sample.pred_instances = pred_instances

Data Transforms and Pipeline

In the design of MMOCR, dataset construction and preparation are decoupled. That is, dataset construction classes such as OCRDataset are responsible for loading and parsing annotation files; while data transforms further apply data preprocessing, augmentation, formatting, and other related functions. Currently, there are five types of data transforms implemented in MMOCR, as shown in the following table.

Transforms Type File Description
Data Loading loading.py Implements the data loading functions.
Data Formatting formatting.py Formatting the data required by different tasks.
Cross Project Data Adapter adapters.py Converting the data format between other OpenMMLab projects and MMOCR.
Data Augmentation Functions ocr_transforms.py
textdet_transforms.py
textrecog_transforms.py
Various built-in data augmentation methods designed for different tasks.
Wrappers of Third Party Packages wrappers.py Wrapping the transforms implemented in popular third party packages such as ImgAug, and adapting them to MMOCR format.

Since each data transform class is independent of each other, we can easily combine any data transforms to build a data pipeline after we have defined the data fields. As shown in the following figure, in MMOCR, a typical training data pipeline consists of three stages: data loading, data augmentation, and data formatting. Users only need to define the data pipeline list in the configuration file and specify the specific data transform class and its parameters:

Flowchart

train_pipeline_r18 = [
    # Loading images
    dict(
        type='LoadImageFromFile',
        color_type='color_ignore_orientation'),
    # Loading annotations
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    # Data augmentation
    dict(
        type='ImgAugWrapper',
        args=[['Fliplr', 0.5],
              dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    # Data formatting
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]

Tip

More tutorials about data pipeline configuration can be found in the Config Doc. Next, we will briefly introduce the data transforms supported in MMOCR according to their categories.

For each data transform, MMOCR provides a detailed docstring. For example, in the header of each data transform class, we annotate Required Keys, Modified Keys and Added Keys. The Required Keys represent the mandatory fields that should be included in the input required by the data transform, while the Modified Keys and Added Keys indicate that the transform may modify or add the fields into the original data. For example, LoadImageFromFile implements the image loading function, whose Required Keys is the image path img_path, and the Modified Keys includes the loaded image img, the current size of the image img_shape, the original size of the image ori_shape, and other image attributes.

@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
    # We provide detailed docstring for each data transform.
    """Load an image from file.

    Required Keys:

    - img_path

    Modified Keys:

    - img
    - img_shape
    - ori_shape
    """

Note

In the data pipeline of MMOCR, the image and label information are saved in a dictionary. By using the unified fields, the data can be freely transferred between different data transforms. Therefore, it is very important to understand the conventional fields used in MMOCR.

For your convenience, the following table lists the conventional keys used in MMOCR data transforms.

Key Type Description
img np.array(dtype=np.uint8) Image array, shape of (h, w, c).
img_shape tuple(int, int) Current image size (h, w).
ori_shape tuple(int, int) Original image size (h, w).
scale tuple(int, int) Stores the target image size (h, w) specified by the user in the Resize data transform series. Note: This value may not correspond to the actual image size after the transformation.
scale_factor tuple(float, float) Stores the target image scale factor (w_scale, h_scale) specified by the user in the Resize data transform series. Note: This value may not correspond to the actual image size after the transformation.
keep_ratio bool Boolean flag determining whether to keep the aspect ratio while scaling images.
flip bool Boolean flags to indicate whether the image has been flipped.
flip_direction str Flipping direction, options are horizontal, vertical, diagonal.
gt_bboxes np.array(dtype=np.float32) Ground-truth bounding boxes.
gt_polygons list[np.array(dtype=np.float32)] Ground-truth polygons.
gt_bboxes_labels np.array(dtype=np.int64) Category label of bounding boxes. By default, MMOCR uses 0 to represent "text" instances.
gt_texts list[str] Ground-truth text content of the instance.
gt_ignored np.array(dtype=np.bool_) Boolean flag indicating whether ignoring the instance (used in text detection).
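
To make these conventions concrete, the following hand-written sketch shows what such a results dictionary might look like partway through a pipeline, after the image has been resized to 640x640; all values are illustrative placeholders.

import numpy as np

# Illustrative example of the conventional fields carried through the pipeline.
results = dict(
    img=np.zeros((640, 640, 3), dtype=np.uint8),    # image array (h, w, c)
    img_shape=(640, 640),                           # current size (h, w)
    ori_shape=(720, 1280),                          # original size (h, w)
    gt_bboxes=np.array([[0., 0., 10., 10.]], dtype=np.float32),
    gt_bboxes_labels=np.array([0], dtype=np.int64), # 0 represents "text"
    gt_polygons=[np.array([0., 0., 10., 0., 10., 10., 0., 10.], dtype=np.float32)],
    gt_ignored=np.array([False], dtype=np.bool_),
    gt_texts=['mmocr'],
)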

Data Loading

Data loading transforms mainly implement the functions of loading data from different formats and backends. Currently, the following data loading transforms are implemented in MMOCR:

Transforms Name Required Keys Modified/Added Keys Description
LoadImageFromFile img_path img
img_shape
ori_shape
Load image from the specified path, supporting different file storage backends (e.g. disk, http, petrel) and decoding backends (e.g. cv2, turbojpeg, pillow, tifffile).
LoadOCRAnnotations bbox
bbox_label
polygon
ignore
text
gt_bboxes
gt_bboxes_labels
gt_polygons
gt_ignored
gt_texts
Parse the annotation required by OCR task.
LoadKIEAnnotations bboxes bbox_labels edge_labels
texts
gt_bboxes
gt_bboxes_labels
gt_edge_labels
gt_texts
ori_shape
Parse the annotation required by KIE task.

Data Augmentation

Data augmentation is an indispensable process in text detection and recognition tasks. Currently, MMOCR has implemented dozens of data augmentation modules commonly used in OCR fields, which are classified into ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py.

Specifically, ocr_transforms.py implements generic OCR data augmentation modules such as RandomCrop and RandomRotate:

Transforms Name Required Keys Modified/Added Keys Description
RandomCrop img
gt_bboxes
gt_bboxes_labels
gt_polygons
gt_ignored
gt_texts (optional)
img
img_shape
gt_bboxes
gt_bboxes_labels
gt_polygons
gt_ignored
gt_texts (optional)
Randomly crop the image and make sure the cropped image contains at least one text instance. The optional parameter min_side_ratio controls the ratio of the short side of the cropped image to the original image; the default value is 0.4.
RandomRotate img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
img
img_shape
gt_bboxes (optional)
gt_polygons (optional)
rotated_angle
Randomly rotate the image and optionally fill the blank areas of the rotated image.
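
As a sketch of how these generic transforms could appear in a pipeline: min_side_ratio is documented above, while max_angle is an assumed parameter name for RandomRotate that should be verified against the API documentation.

# A fragment of a training pipeline using the generic OCR augmentations.
# max_angle is an assumption; check the RandomRotate docstring for the
# authoritative parameter list.
ocr_augs = [
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='RandomRotate', max_angle=10),
]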

textdet_transforms.py implements text detection related data augmentation modules:

Transforms Name Required Keys Modified/Added Keys Description
RandomFlip img
gt_bboxes
gt_polygons
img
gt_bboxes
gt_polygons
flip
flip_direction
Random flip, supporting horizontal, vertical and diagonal modes. Defaults to horizontal.
FixInvalidPolygon gt_polygons
gt_ignored
gt_polygons
gt_ignored
Automatically fixing the invalid polygons included in the annotations.
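
A corresponding sketch for the detection-specific transforms; prob and direction follow the common MMCV RandomFlip interface and are assumptions here, and FixInvalidPolygon is shown without arguments.

# Detection-oriented augmentations; verify the parameter names against the
# API documentation before use.
textdet_augs = [
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
    dict(type='FixInvalidPolygon'),
]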

textrecog_transforms.py implements text recognition related data augmentation modules:

Transforms Name Required Keys Modified/Added Keys Description
RescaleToHeight img img
img_shape
scale
scale_factor
keep_ratio
Scales the image to the specified height while keeping the aspect ratio. When min_width and max_width are specified, the aspect ratio may be changed.
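
For example, recognition pipelines commonly rescale images to a fixed height while constraining the width; a sketch using the parameters mentioned above (all numbers are illustrative):

# Rescale to a height of 32 px while clamping the width to [32, 160].
textrecog_augs = [
    dict(type='RescaleToHeight', height=32, min_width=32, max_width=160),
]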

Warning

The above table only briefly introduces some selected data augmentation methods, for more information please refer to the API documentation or the code docstrings.

Data Formatting

Data formatting transforms are responsible for packaging images, ground truth labels, and other information into a dictionary. Different tasks usually rely on different formatting transforms. For example:

Transforms Name Required Keys Modified/Added Keys Description
PackTextDetInputs - - Pack the inputs required by text detection.
PackTextRecogInputs - - Pack the inputs required by text recognition.
PackKIEInputs - - Pack the inputs required by KIE.

Cross Project Data Adapters

The cross-project data adapters bridge the data formats between MMOCR and other OpenMMLab libraries such as MMDetection, making it possible to call models implemented in other OpenMMLab projects. Currently, MMOCR has implemented MMDet2MMOCR and MMOCR2MMDet, allowing data to be converted between MMDetection and MMOCR formats; with these adapters, users can easily train any detectors supported by MMDetection in MMOCR. For example, we provide a tutorial to show how to train Mask R-CNN as a text detector in MMOCR.

Transforms Name Required Keys Modified/Added Keys Description
MMDet2MMOCR gt_masks gt_ignore_flags gt_polygons
gt_ignored
Convert the fields used in MMDet to MMOCR.
MMOCR2MMDet img_shape
gt_polygons
gt_ignored
gt_masks gt_ignore_flags Convert the fields used in MMOCR to MMDet.

Wrappers

To facilitate the use of popular third-party CV libraries in MMOCR, we provide wrappers in wrappers.py to unify the data format between MMOCR and other third-party libraries. Users can directly configure the data transforms provided by these libraries in the configuration file of MMOCR. The supported wrappers are as follows:

Transforms Name Required Keys Modified/Added Keys Description
ImgAugWrapper img
gt_polygons (optional for text recognition)
gt_bboxes (optional for text recognition)
gt_bboxes_labels (optional for text recognition)
gt_ignored (optional for text recognition)
gt_texts (optional)
img
gt_polygons (optional for text recognition)
gt_bboxes (optional for text recognition)
gt_bboxes_labels (optional for text recognition)
gt_ignored (optional for text recognition)
img_shape (optional)
gt_texts (optional)
ImgAug wrapper, which bridges the data format and configuration between ImgAug and MMOCR, allowing users to configure the data augmentation methods supported by ImgAug in MMOCR.
TorchVisionWrapper img img
img_shape
TorchVision wrapper, which bridges the data format and configuration between TorchVision and MMOCR, allowing users to configure the data transforms supported by torchvision.transforms in MMOCR.

ImgAugWrapper Example

For example, in the original ImgAug, we can define a Sequential type data augmentation pipeline as follows to perform random flipping, random rotation and random scaling on the image:

import imgaug.augmenters as iaa

aug = iaa.Sequential(
  iaa.Fliplr(0.5),                # horizontally flip 50% of all images
  iaa.Affine(rotate=(-10, 10)),   # rotate by -10 to +10 degrees
  iaa.Resize((0.5, 3.0))          # scale images to 50-300% of their size
)

In MMOCR, we can directly configure the above data augmentation pipeline in train_pipeline as follows:

dict(
  type='ImgAugWrapper',
  args=[
    ['Fliplr', 0.5],
    dict(cls='Affine', rotate=[-10, 10]),
    ['Resize', [0.5, 3.0]],
  ]
)

Specifically, the args parameter accepts a list, and each element in the list can be a list or a dictionary. If it is a list, the first element of the list is the class name in imgaug.augmenters, and the following elements are the initialization parameters of the class; if it is a dictionary, the cls key corresponds to the class name in imgaug.augmenters, and the other key-value pairs correspond to the initialization parameters of the class.

TorchVisionWrapper Example

For example, in the original TorchVision, we can define a Compose type data transformation pipeline as follows to perform color jittering on the image:

import torchvision.transforms as transforms

aug = transforms.Compose([
  transforms.ColorJitter(
    brightness=32.0 / 255,  # brightness jittering range
    saturation=0.5)         # saturation jittering range
])

In MMOCR, we can directly configure the above data transformation pipeline in train_pipeline as follows:

dict(
  type='TorchVisionWrapper',
  op='ColorJitter',
  brightness=32.0 / 255,
  saturation=0.5
)

Specifically, the op parameter is the class name in torchvision.transforms, and the following parameters correspond to the initialization parameters of the class.

Evaluation

Note

Before reading this document, we recommend that you first read MMEngine: Model Accuracy Evaluation Basics.

Metrics

MMOCR implements widely-used evaluation metrics for text detection, text recognition and key information extraction tasks based on the MMEngine: BaseMetric base class. Users can specify the metric used in the validation and test phases by modifying the val_evaluator and test_evaluator fields in the configuration file. For example, the following config shows how to use HmeanIOUMetric to evaluate the model performance in text detection task.

val_evaluator = dict(type='HmeanIOUMetric')
test_evaluator = val_evaluator

# In addition, MMOCR also supports the combined evaluation of multiple metrics for the same task, such as using WordMetric and CharMetric at the same time
val_evaluator = [
    dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol']),
    dict(type='CharMetric')
]

Tip

More evaluation related configurations can be found in the evaluation configuration tutorial.

As shown in the following table, MMOCR currently supports 5 evaluation metrics for text detection, text recognition, and key information extraction tasks, including HmeanIOUMetric, WordMetric, CharMetric, OneMinusNEDMetric, and F1Metric.

Metric Task Input Field Output Field
HmeanIOUMetric TextDet pred_polygons
pred_scores
gt_polygons
recall
precision
hmean
WordMetric TextRec pred_text
gt_text
word_acc
word_acc_ignore_case
word_acc_ignore_case_symbol
CharMetric TextRec pred_text
gt_text
char_recall
char_precision
OneMinusNEDMetric TextRec pred_text
gt_text
1-N.E.D
F1Metric KIE pred_labels
gt_labels
macro_f1
micro_f1

In general, the evaluation metric used in each task is conventionally determined. Users usually do not need to understand or manually modify the internal implementation of the evaluation metric. However, to facilitate more customized requirements, this document will further introduce the specific implementation details and configurable parameters of the built-in metrics in MMOCR.

HmeanIOUMetric

HmeanIOUMetric is one of the most widely used evaluation metrics in text detection tasks, because it calculates the harmonic mean (H-mean) between the detection precision (P) and recall rate (R). The HmeanIOUMetric can be calculated by the following equation:

\[H = \frac{2}{\frac{1}{P} + \frac{1}{R}} = \frac{2PR}{P+R}\]

In addition, since it is equivalent to the F-score (also known as F-measure or F-metric) when \(\beta = 1\), HmeanIOUMetric is sometimes written as F1Metric or f1-score:

\[F_1=(1+\beta^2)\cdot\frac{PR}{\beta^2\cdot P+R} = \frac{2PR}{P+R}\]

In MMOCR, the calculation of HmeanIOUMetric can be summarized as the following steps:

  1. Filter out invalid predictions

    • Filter out predictions with a score lower than pred_score_thrs

    • Filter out predictions overlapping with ignored ground truth boxes with an overlap ratio higher than ignore_precision_thr

    It is worth noting that pred_score_thrs will automatically search for the best threshold within a certain range by default, and users can also customize the search range by manually modifying the configuration file:

    # By default, HmeanIOUMetric searches the best threshold within the range [0.3, 0.9] with a step size of 0.1
    val_evaluator = dict(type='HmeanIOUMetric', pred_score_thrs=dict(start=0.3, stop=0.9, step=0.1))
    
  2. Calculate the IoU matrix

    • At the data processing stage, HmeanIOUMetric will calculate and maintain an \(M \times N\) IoU matrix iou_metric for the convenience of the subsequent bounding box pairing step. Here, M and N represent the number of label bounding boxes and filtered prediction bounding boxes, respectively. Therefore, each element of this matrix stores the IoU between the m-th label bounding box and the n-th prediction bounding box.

  3. Compute the number of GT samples that can be accurately matched based on the corresponding pairing strategy

    Although HmeanIOUMetric can be calculated by a fixed formula, there may still be some subtle differences in the specific implementations. These differences mainly reflect the use of different strategies to match gt and predicted bounding boxes, which leads to the difference in final scores. Currently, MMOCR supports two matching strategies, namely vanilla and max_matching, for the HmeanIOUMetric. As shown below, users can specify the matching strategies in the config.

    • vanilla matching strategy

      By default, HmeanIOUMetric adopts the vanilla matching strategy, which is consistent with the hmean-iou implementation in MMOCR 0.x and the official text detection competition evaluation standard of ICDAR series. The matching strategy adopts the first-come-first-served matching method to pair the labels and predictions.

      # By default, HmeanIOUMetric adopts 'vanilla' matching strategy
      val_evaluator = dict(type='HmeanIOUMetric')
      
    • max_matching matching strategy

      To address the shortcomings of the existing matching mechanism, MMOCR has implemented a more efficient matching strategy to maximize the number of matches.

      # Specify to use 'max_matching' matching strategy
      val_evaluator = dict(type='HmeanIOUMetric', strategy='max_matching')
      

    Note

    We recommend that research-oriented developers use the default vanilla matching strategy to ensure consistency with other papers. For industry-oriented developers, you can use the max_matching matching strategy to achieve optimized performance.

  4. Compute the final evaluation score according to the aforementioned matching strategy

WordMetric

WordMetric implements word-level text recognition evaluation metrics and includes three text matching modes, namely exact, ignore_case, and ignore_case_symbol. Users can freely combine the output of one or more text matching modes in the configuration file by modifying the mode field.

# Use WordMetric for text recognition task
val_evaluator = [
    dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol'])
]
  • exact: Full matching mode, i.e., the predicted text is considered correct only when it is exactly the same as the ground truth text.

  • ignore_case: This mode ignores the case of the predicted text and the ground truth text.

  • ignore_case_symbol: This mode ignores the case and symbols of the predicted text and the ground truth text. This is also the text recognition accuracy reported by most academic papers. The performance reported by MMOCR uses the ignore_case_symbol mode by default.

Assume that the real label is MMOCR! and the model output is mmocr. The WordMetric scores under the three matching modes are: {'exact': 0, 'ignore_case': 0, 'ignore_case_symbol': 1}.

CharMetric

CharMetric implements character-level text recognition evaluation metrics that are case-insensitive.

# Use CharMetric for text recognition task
val_evaluator = [dict(type='CharMetric')]

Specifically, CharMetric will output two evaluation metrics, namely char_precision and char_recall. Let the number of correctly predicted characters (True Positive) be \(\sigma_{tp}\), then the precision P and recall R can be calculated by the following equation:

\[P=\frac{\sigma_{tp}}{\sigma_{pred}}, R = \frac{\sigma_{tp}}{\sigma_{gt}}\]

where \(\sigma_{gt}\) and \(\sigma_{pred}\) represent the total number of characters in the label text and the predicted text, respectively.

For example, assume that the label text is “MMOCR” and the predicted text is “mm0cR1”. The score of the CharMetric is:

\[P=\frac{4}{6}, R=\frac{4}{5}\]
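
The counts in this example can be reproduced with a short sketch. Here we assume, purely for illustration, that the correctly predicted characters are counted case-insensitively with difflib.SequenceMatcher; this is not necessarily MMOCR's exact code.

from difflib import SequenceMatcher


def char_metric(pred: str, gt: str):
    """Case-insensitive character precision/recall (illustrative sketch)."""
    pred_l, gt_l = pred.lower(), gt.lower()
    # total size of the matching blocks = number of correctly predicted chars
    tp = sum(block.size for block in
             SequenceMatcher(None, gt_l, pred_l).get_matching_blocks())
    return dict(char_precision=tp / max(len(pred_l), 1),
                char_recall=tp / max(len(gt_l), 1))


print(char_metric(pred='mm0cR1', gt='MMOCR'))
# char_precision = 4/6 ~ 0.67, char_recall = 4/5 = 0.8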

OneMinusNEDMetric

OneMinusNEDMetric (1-N.E.D) is commonly used for text recognition evaluation of Chinese or English text line-level annotations. Unlike the full matching metric, which requires the prediction and the gt text to be exactly the same, 1-N.E.D uses the normalized edit distance (i.e., the normalized Levenshtein distance) to measure the difference between the predicted and the gt text, so that the performance difference between models can be better distinguished when evaluating long texts. Assume that the real and predicted texts are \(s_i\) and \(\hat{s_i}\), respectively, and their lengths are \(l_{i}\) and \(\hat{l_i}\), respectively. The OneMinusNEDMetric score can be calculated by the following formula:

\[score = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{D(s_i, \hat{s_{i}})}{max(l_{i},\hat{l_{i}})}\]

where N is the total number of samples, and \(D(s_1, s_2)\) is the edit distance between two strings.

For example, assume that the real label is “OpenMMLabMMOCR”, the prediction of model A is “0penMMLabMMOCR”, and the prediction of model B is “uvwxyz”. The results of the full matching and OneMinusNEDMetric evaluation metrics are as follows:

            Full-match    1 - N.E.D.
Model A     0             0.92857
Model B     0             0

As shown in the table above, although model A only predicted one letter incorrectly, both models score 0 under the full-match strategy. The OneMinusNEDMetric, however, can better distinguish the performance of the two models on long texts.
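
Like the other recognition metrics, it can be enabled with a one-line evaluator config (e.g. val_evaluator = [dict(type='OneMinusNEDMetric')]). The 0.92857 above can be reproduced with a plain Levenshtein edit distance, as in the following sketch:

def levenshtein(s1: str, s2: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (c1 != c2)))  # substitution
        prev = curr
    return prev[-1]


def one_minus_ned(pred: str, gt: str) -> float:
    return 1 - levenshtein(pred, gt) / max(len(pred), len(gt))


gt = 'OpenMMLabMMOCR'
print(round(one_minus_ned('0penMMLabMMOCR', gt), 5))  # 0.92857 (model A)
print(round(one_minus_ned('uvwxyz', gt), 5))          # 0.0     (model B)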

F1Metric

F1Metric implements the F1-Metric evaluation metric for KIE tasks and provides two modes, namely micro and macro.

val_evaluator = [
    dict(type='F1Metric', mode=['micro', 'macro'])
]
  • micro mode: Calculate the global F1-Metric score based on the total number of True Positives, False Negatives, and False Positives.

  • macro mode: Calculate the F1-Metric score for each class and then take the average.
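
The difference between the two modes can be seen in a minimal sketch (illustrative only, not MMOCR's implementation), using hypothetical per-class counts for a 2-class KIE problem:

def f1_scores(tp, fp, fn):
    """Compute micro/macro F1 from per-class TP/FP/FN dicts."""
    def f1(t, p, n):
        return 2 * t / (2 * t + p + n) if (2 * t + p + n) > 0 else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in tp) / len(tp)
    return dict(micro_f1=micro, macro_f1=macro)


# Hypothetical per-class counts
tp = {'key': 90, 'value': 10}
fp = {'key': 10, 'value': 30}
fn = {'key': 10, 'value': 30}
print(f1_scores(tp, fp, fn))
# micro_f1 ~ 0.714 (pooled counts), macro_f1 = 0.575 (mean of per-class F1)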

Customized Metric

MMOCR supports the implementation of customized evaluation metrics for users who pursue higher customization. In general, users only need to create a customized evaluation metric class CustomizedMetric that inherits from MMEngine's BaseMetric, override the data format processing method process and the metric calculation method compute_metrics, and finally add it to the METRICS registry to implement any customized evaluation metric.

from typing import Dict, List, Sequence

from mmengine.evaluator import BaseMetric

from mmocr.registry import METRICS


@METRICS.register_module()
class CustomizedMetric(BaseMetric):

    def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
        """process receives two parameters: data_batch stores the gt label
        information, and predictions stores the predicted results.
        """
        pass

    def compute_metrics(self, results: List):
        """compute_metrics receives the results collected by the process
        method as input and returns the evaluation results.
        """
        pass
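
For instance, a toy word-accuracy metric could look roughly like the sketch below. The keys used to read the ground truth and predicted strings (gt_text, pred_text and their item field) are illustrative assumptions; check the structure of the samples actually passed to process in your own setup.

from typing import Dict, List, Sequence

from mmengine.evaluator import BaseMetric

from mmocr.registry import METRICS


@METRICS.register_module()
class WordAccuracyMetric(BaseMetric):
    """Toy metric: ratio of samples whose predicted text equals the gt text."""

    def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
        for pred in predictions:
            # The key names below are assumptions made for illustration only.
            gt_text = pred['gt_text']['item']
            pred_text = pred['pred_text']['item']
            # self.results is provided by BaseMetric and collected across batches
            self.results.append(dict(correct=int(gt_text == pred_text)))

    def compute_metrics(self, results: List[Dict]) -> Dict:
        correct = sum(r['correct'] for r in results)
        return dict(word_acc=correct / max(len(results), 1))

Once registered, it could be referenced in a config just like the built-in metrics, e.g. val_evaluator = [dict(type='WordAccuracyMetric')].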

Note

More details can be found in MMEngine Documentation: BaseMetric.

Dataset

Overview

In MMOCR, all the datasets are processed via different Dataset classes based on mmengine.BaseDataset. Dataset classes are responsible for loading the data and performing initial parsing, and then feeding it to the data pipeline for data preprocessing, augmentation, formatting, etc.

Flowchart

In this tutorial, we will introduce some common interfaces of the Dataset class, and the usage of Dataset implementations in MMOCR as well as the annotation types they support.

Tip

The Dataset class supports some advanced features, such as lazy initialization and data serialization, and takes advantage of various dataset wrappers to perform data concatenation, repeating, and category balancing. This content will not be covered in this tutorial, but you can read MMEngine: BaseDataset for more details.

Common Interfaces

Now, let’s look at a concrete example and learn some typical interfaces of a Dataset class. OCRDataset is a widely used Dataset implementation in MMOCR, and is suggested as the default Dataset type in MMOCR since its associated annotation format is flexible enough to support all the OCR tasks (more info). Now we will instantiate an OCRDataset object that loads the toy dataset in tests/data/det_toy_dataset.

from mmocr.datasets import OCRDataset
from mmengine.registry import init_default_scope
init_default_scope('mmocr')

train_pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    dict(type='RandomCrop', min_side_ratio=0.1),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
dataset = OCRDataset(
    data_root='tests/data/det_toy_dataset',
    ann_file='textdet_test.json',
    test_mode=False,
    pipeline=train_pipeline)

Let’s peek at the size of this dataset:

>>> print(len(dataset))

10

Typically, a Dataset class loads and stores two types of information: (1) meta information: some meta descriptors of the dataset’s properties, such as the available object categories in this dataset; (2) annotations: the paths to images and their labels. We can access the meta information via dataset.metainfo:

>>> from pprint import pprint
>>> pprint(dataset.metainfo)

{'category': [{'id': 0, 'name': 'text'}],
 'dataset_type': 'TextDetDataset',
 'task_name': 'textdet'}

As for the annotations, we can access them via dataset.get_data_info(idx), which returns a dictionary containing the information of the idx-th sample in the dataset after it has been initially parsed, but not yet processed by the data pipeline.

>>> from pprint import pprint
>>> pprint(dataset.get_data_info(0))

{'height': 720,
 'img_path': 'tests/data/det_toy_dataset/test/img_10.jpg',
 'instances': [{'bbox': [260.0, 138.0, 284.0, 158.0],
                'bbox_label': 0,
                'ignore': True,
                'polygon': [261, 138, 284, 140, 279, 158, 260, 158]},
                ...,
               {'bbox': [1011.0, 157.0, 1079.0, 173.0],
                'bbox_label': 0,
                'ignore': True,
                'polygon': [1011, 157, 1079, 160, 1076, 173, 1011, 170]}],
 'sample_idx': 0,
 'seg_map': 'test/gt_img_10.txt',
 'width': 1280}

On the other hand, we can get the sample fully processed by the data pipeline via dataset[idx] or dataset.__getitem__(idx), which can be directly fed to models to perform a full train/test cycle. It has two fields:

  • inputs: The image after data augmentation;

  • data_samples: The DataSample that contains the augmented annotations, and meta information appended by some data transforms to keep track of some key properties of this sample.

>>> pprint(dataset[0])

{'data_samples': <TextDetDataSample(

    META INFORMATION
    ori_shape: (720, 1280)
    img_path: 'tests/data/det_toy_dataset/imgs/test/img_10.jpg'
    img_shape: (640, 640)

    DATA FIELDS
    gt_instances: <InstanceData(

            META INFORMATION

            DATA FIELDS
            labels: tensor([0, 0, 0])
            polygons: [array([207.33984 , 104.65409 , 208.34634 ,  84.528305, 231.49594 ,
                        86.54088 , 226.46341 , 104.65409 , 207.33984 , 104.65409 ],
                      dtype=float32), array([237.53496 , 103.6478  , 235.52196 ,  84.528305, 365.36096 ,
                        86.54088 , 364.35446 , 107.67296 , 237.53496 , 103.6478  ],
                      dtype=float32), array([105.68293, 166.03773, 105.68293, 151.94969, 177.14471, 150.94339,
                       178.15121, 165.03145, 105.68293, 166.03773], dtype=float32)]
            ignored: tensor([ True, False,  True])
            bboxes: tensor([[207.3398,  84.5283, 231.4959, 104.6541],
                        [235.5220,  84.5283, 365.3610, 107.6730],
                        [105.6829, 150.9434, 178.1512, 166.0377]])
        ) at 0x7f7359f04fa0>
) at 0x7f735a0508e0>,
 'inputs': tensor([[[129, 111, 131,  ...,   0,   0,   0], ...
                  [ 19,  18,  15,  ...,   0,   0,   0]]], dtype=torch.uint8)}

Dataset Classes and Annotation Formats

Each Dataset implementation can only load datasets in a specific annotation format. Here we list all supported Dataset classes and their compatible annotation formats, as well as an example config to showcase how to use them in practice.

Note

If you are not familiar with the config system, you may find Dataset Configuration helpful.

OCRDataset

Usually, there are many different types of annotations in OCR datasets, and the formats often vary between different subtasks, such as text detection and text recognition. These differences can result in the need for different data loading code when using different datasets, increasing the learning and maintenance costs for users.

In MMOCR, we propose a unified dataset format that can adapt to all three subtasks of OCR: text detection, text recognition, and text spotting. This design maximizes the uniformity of the dataset, allows for the reuse of data annotations across different tasks, and makes dataset management more convenient. Considering that popular dataset formats are still inconsistent, MMOCR provides Dataset Preparer to help users convert their datasets to MMOCR format. We also strongly encourage researchers to develop their own datasets based on this data format.

Annotation Format

This annotation file is a .json file that stores a dict containing both metainfo and data_list, where the former includes basic information about the dataset and the latter consists of the label items of each target instance. Below is an exhaustive list of all the fields in the annotation file; some fields are only used in a subset of tasks and can be ignored in the others.

{
    "metainfo":
    {
      "dataset_type": "TextDetDataset",  # Options: TextDetDataset/TextRecogDataset/TextSpotterDataset
      "task_name": "textdet",  #  Options: textdet/textspotter/textrecog
      "category": [{"id": 0, "name": "text"}]  # Used in textdet/textspotter
    },
    "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "height": 604,
        "width": 640,
        "instances":  # multiple instances in one image
        [
          {
            "bbox": [0, 0, 10, 20],  # in textdet/textspotter, [x1, y1, x2, y2].
            "bbox_label": 0,  # The object category, always 0 (text) in MMOCR
            "polygon": [0, 0, 0, 10, 10, 20, 20, 0], # in textdet/textspotter. [x1, y1, x2, y2, ....]
            "text": "mmocr",  # in textspotter/textrecog
            "ignore": False # in textspotter/textdet. Whether to ignore this sample during training
          },
          #...
        ],
      }
      #... multiple images
    ]
}
Example Config

Here is a part of a config example where we make train_dataloader use OCRDataset to load the ICDAR2015 dataset for a text detection model. Keep in mind that OCRDataset can load any OCR dataset prepared by Dataset Preparer, regardless of its task. That is, you can use it for text recognition and text spotting as well, but you still have to modify the transform types in pipeline according to the needs of different tasks.

pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]

icdar2015_textdet_train = dict(
    type='OCRDataset',
    data_root='data/icdar2015',
    ann_file='textdet_train.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=pipeline)

train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)

RecogLMDBDataset

Reading images or labels from individual files can be slow when the data volume is huge, e.g. on a scale of millions. Besides, in academia, most scene text recognition datasets are stored in LMDB format, including both images and labels. (Example)

To get closer to the mainstream practice and enhance the data storage efficiency, MMOCR supports loading images and labels from lmdb datasets via RecogLMDBDataset.

Annotation Format

MMOCR requires the following keys for LMDB datasets:

  • num-samples: The key describing the data volume (number of samples) of the dataset.

  • The keys of images and labels are in the format of image-000000001 and label-000000001, respectively. The index starts from 1.

MMOCR has a toy LMDB dataset in tests/data/rec_toy_dataset/imgs.lmdb. You can get a sense of the format with the following code snippet.

>>> import lmdb
>>>
>>> env = lmdb.open('tests/data/rec_toy_dataset/imgs.lmdb')
>>> txn = env.begin()
>>> for k, v in txn.cursor():
...     print(k, v)

b'image-000000001' b'\xff...'
b'image-000000002' b'\xff...'
b'image-000000003' b'\xff...'
b'image-000000004' b'\xff...'
b'image-000000005' b'\xff...'
b'image-000000006' b'\xff...'
b'image-000000007' b'\xff...'
b'image-000000008' b'\xff...'
b'image-000000009' b'\xff...'
b'image-000000010' b'\xff...'
b'label-000000001' b'GRAND'
b'label-000000002' b'HOTEL'
b'label-000000003' b'HOTEL'
b'label-000000004' b'PACIFIC'
b'label-000000005' b'03/09/2009'
b'label-000000006' b'ANING'
b'label-000000007' b'Virgin'
b'label-000000008' b'america'
b'label-000000009' b'ATTACK'
b'label-000000010' b'DAVIDSON'
b'num-samples' b'10'
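
The image values are encoded image buffers (the b'\xff... prefix above suggests JPEG bytes). As a rough illustration of how such a sample could be decoded outside of MMOCR, assuming the toy dataset path above and that OpenCV is installed:

import cv2
import lmdb
import numpy as np

env = lmdb.open('tests/data/rec_toy_dataset/imgs.lmdb', readonly=True)
with env.begin() as txn:
    img_bytes = txn.get(b'image-000000001')
    label = txn.get(b'label-000000001').decode()

# Decode the raw buffer into an H x W x C numpy array
img = cv2.imdecode(np.frombuffer(img_bytes, dtype=np.uint8), cv2.IMREAD_COLOR)
print(img.shape, label)  # prints the decoded image shape and 'GRAND'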
Example Config

Here is a part of a config example where we make train_dataloader use RecogLMDBDataset to load the toy dataset. Since RecogLMDBDataset loads images as numpy arrays, don’t forget to use LoadImageFromNDArray instead of LoadImageFromFile in the pipeline for successful loading.

pipeline = [
    dict(
        type='LoadImageFromNDArray'),
    dict(
        type='LoadOCRAnnotations',
        with_text=True,
    ),
    dict(
        type='PackTextRecogInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]

toy_textrecog_train = dict(
    type='RecogLMDBDataset',
    data_root='tests/data/rec_toy_dataset/',
    ann_file='imgs.lmdb',
    pipeline=pipeline)

train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=toy_textrecog_train)

RecogTextDataset

Prior to MMOCR 1.0, MMOCR 0.x took text files as input for text recognition. These formats have been deprecated in MMOCR 1.0, and this class may be removed at any time in the future. More info

Annotation Format

Text files can either be in txt format or jsonl format. The simple .txt annotations separate the image name and the word annotation by a blank space, which cannot handle the case where spaces are included in a text instance.

img1.jpg OpenMMLab
img2.jpg MMOCR

The JSON Line format uses a dictionary-like structure to represent the annotations, where the keys filename and text store the image name and word label, respectively.

{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}
Example Config

Here is a part of a config example where we use RecogTextDataset to load the old txt labels for training, and the old jsonl labels for testing.

pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_text=True,
    ),
    dict(
        type='PackTextRecogInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]

# loading 0.x txt format annos
txt_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.txt',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineStrParser',
        keys=['filename', 'text'],
        keys_idx=[0, 1]),
    pipeline=pipeline)


train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=txt_dataset)

# loading 0.x json line format annos
jsonl_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.jsonl',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineJsonParser',
        keys=['filename', 'text']),
    pipeline=pipeline)

test_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=jsonl_dataset)

IcdarDataset

Prior to MMOCR 1.0, MMOCR 0.x took COCO-like format annotations as input for text detection. These formats have been deprecated in MMOCR 1.0, and this class may be removed at any time in the future. More info

Annotation Format
{
  "images": [
    {
      "id": 1,
      "width": 800,
      "height": 600,
      "file_name": "test.jpg"
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [0,0,10,10],
      "segmentation": [
          [0,0,10,0,10,10,0,10]
      ],
      "area": 100,
      "iscrowd": 0
    }
  ]
}
Example Config

Here is a part of a config example where we make train_dataloader use IcdarDataset to load the old labels.

pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    dict(
        type='PackTextDetInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]

icdar2015_textdet_train = dict(
    type='IcdarDataset',
    data_root='data/det/icdar2015',
    ann_file='instances_training.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=pipeline)

train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)

WildReceiptDataset

It is customized for the WildReceipt dataset only.

Annotation Format
// Close Set
{
  "file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
  "height": 1200,
  "width": 1600,
  "annotations":
    [
      {
        "box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
        "text": "SAFEWAY",
        "label": 1
      },
      {
        "box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
        "text": "TM",
        "label": 25
      }
    ], //...
}

// Open Set
{
  "file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
  "height": 348,
  "width": 348,
  "annotations":
    [
      {
        "box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
        "text": "CHOEUN",
        "label": 2,
        "edge": 1
      },
      {
        "box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
        "text": "KOREANRESTAURANT",
        "label": 2,
        "edge": 1
      }
    ]
}
Example Config

Please refer to SDMGR’s config for more details.

Overview & Features[coming soon]

Coming Soon!

Data Flow[coming soon]

Coming Soon!

Models[coming soon]

Coming Soon!

Visualizers[coming soon]

Coming Soon!

Convention[coming soon]

Coming Soon!

Engine[coming soon]

Coming Soon!

Overview

Supported Datasets

Dataset Name    Text Detection    Text Recognition    Text Spotting    KIE
cocotextv2      ✓                 ✓                   ✓
ctw1500         ✓                 ✓                   ✓
cute80                            ✓
funsd           ✓                 ✓                   ✓
icdar2013       ✓                 ✓                   ✓
icdar2015       ✓                 ✓                   ✓
iiit5k                            ✓
mjsynth                           ✓
naf             ✓                 ✓                   ✓
sroie           ✓                 ✓                   ✓
svt             ✓                 ✓                   ✓
svtp                              ✓
synthtext       ✓                 ✓                   ✓
textocr         ✓                 ✓                   ✓
totaltext       ✓                 ✓                   ✓
wildreceipt     ✓                 ✓                   ✓                ✓

Dataset Details

COCO Text v2

“COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images”, arXiv, 2016. PDF

A. Basic Info

  • Official Website: cocotextv2

  • Year: 2016

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: CC BY 4.0

B. Annotation Format


Text Detection/Spotting

{
  "cats": {},
  "anns": {
      "45346": {
          "mask":[468.9,286.7,468.9,295.2,493.0,295.8,493.0,287.2],
          "class":"machine printed",
          "bbox":[468.9,286.7,24.1,9.1],
          "image_id":522579,
          "id":167312,
          "language":"english",
          "area":55.5,
          "utf8_string":"the",
          "legibility":"legible"
      },
      // ...
  },
  "imgs": {
      "522579": {
          "file_name":"COCO_train2014_000000522579.jpg",
          "height":476,
          "width":640,
          "id":522579,
          "set":"train",
      },
      // ...
  },
  "imgToAnns": {
      "522579": [167294, 167295, 167296, 167297, 167298, 167299, 167300, 167301, 167302, 167303, 167304, 167305, 167306, 167307, 167308, 167309, 167310, 167311, 167312, 167313, 167314, 167315, 167316, 167317],
      // ...
  },
  "info": {}
}


C. Reference

@article{veit2016coco, title={Coco-text: Dataset and benchmark for text detection and recognition in natural images}, author={Veit, Andreas and Matera, Tomas and Neumann, Lukas and Matas, Jiri and Belongie, Serge}, journal={arXiv preprint arXiv:1601.07140}, year={2016}}

CTW1500

“Curved scene text detection via transverse and longitudinal sequence connection”, PR, 2019. PDF

A. Basic Info

  • Official Website: ctw1500

  • Year: 2019

  • Language: [‘English’]

  • Scene: [‘Scene’]

  • Annotation Granularity: [‘Word’, ‘Line’]

  • Supported Tasks: [‘textrecog’, ‘textdet’, ‘textspotting’]

  • License: N/A

B. Annotation Format



C. Reference

@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and Zhang, Shuaitao and Luo, Canjie and Zhang, Sheng}, journal={Pattern Recognition}, volume={90}, pages={337--345}, year={2019}, publisher={Elsevier} }

CUTE80

“A Robust Arbitrary Text Detection System for Natural Scene Images”, ESWA, 2014. PDF

A. Basic Info

  • Official Website: cute80

  • Year: 2014

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textrecog’]

  • License: N/A

B. Annotation Format


Text Recognition

# timage/img_name text 1 text

timage/001.jpg RONALDO 1 RONALDO
timage/002.jpg 7 1 7
timage/003.jpg SEACREST 1 SEACREST
timage/004.jpg BEACH 1 BEACH


C. Reference

@article{risnumawan2014robust, title={A robust arbitrary text detection system for natural scene images}, author={Risnumawan, Anhar and Shivakumara, Palaiahankote and Chan, Chee Seng and Tan, Chew Lim}, journal={Expert Systems with Applications}, volume={41}, number={18}, pages={8027--8048}, year={2014}, publisher={Elsevier}}

FUNSD

“FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”, ICDAR, 2019. PDF

A. Basic Info

  • Official Website: funsd

  • Year: 2019

  • Language: [‘English’]

  • Scene: [‘Document’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: FUNSD License

B. Annotation Format


Text Detection/Recognition/Spotting

{
  "form": [
    {
      "id": 0,
      "text": "Registration No.",
      "box": [
          94,
          169,
          191,
          186
      ],
      "linking": [
          [
              0,
              1
          ]
      ],
      "label": "question",
      "words": [
          {
              "text": "Registration",
              "box": [
                  94,
                  169,
                  168,
                  186
              ]
          },
          {
              "text": "No.",
              "box": [
                  170,
                  169,
                  191,
                  183
              ]
          }
      ]
    },
    {
      "id": 1,
      "text": "533",
      "box": [
          209,
          169,
          236,
          182
      ],
      "label": "answer",
      "words": [
          {
              "box": [
                  209,
                  169,
                  236,
                  182
              ],
              "text": "533"
          }
      ],
      "linking": [
          [
              0,
              1
          ]
      ]
    }
  ]
}


C. Reference

@inproceedings{jaume2019, title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents}, author = {Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran}, booktitle = {Accepted to ICDAR-OST}, year = {2019}}

Incidental Scene Text IC13

“ICDAR 2013 Robust Reading Competition”, ICDAR, 2013. PDF

A. Basic Info

  • Official Website: icdar2013

  • Year: 2013

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: N/A

B. Annotation Format


Text Detection

# train split
# x1 y1 x2 y2 "transcript"

158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"

# test split
# x1, y1, x2, y2, "transcript"

38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"

Text Recognition

# img_name, "text"

word_1.png, "PROPER"
word_2.png, "FOOD"
word_3.png, "PRONTO"


C. Reference

@inproceedings{karatzas2013icdar, title={ICDAR 2013 robust reading competition}, author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere}, booktitle={2013 12th international conference on document analysis and recognition}, pages={1484--1493}, year={2013}, organization={IEEE}}

Incidental Scene Text IC15

“ICDAR 2015 Competition on Robust Reading”, ICDAR, 2015. PDF

A. Basic Info

  • Official Website: icdar2015

  • Year: 2015

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: CC BY 4.0

B. Annotation Format


Text Detection

# x1,y1,x2,y2,x3,y3,x4,y4,trans

377,117,463,117,465,130,378,130,Genaxis Theatre
493,115,519,115,519,131,493,131,[06]
374,155,409,155,409,170,374,170,###

Text Recognition

# img_name, "text"

word_1.png, "Genaxis Theatre"
word_2.png, "[06]"
word_3.png, "62-03"


C. Reference

@inproceedings{karatzas2015icdar, title={ICDAR 2015 competition on robust reading}, author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others}, booktitle={2015 13th international conference on document analysis and recognition (ICDAR)}, pages={1156--1160}, year={2015}, organization={IEEE}}

IIIT5K

“Scene Text Recognition using Higher Order Language Priors”, BMVC, 2012. PDF

A. Basic Info

  • Official Website: iiit5k

  • Year: 2012

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textrecog’]

  • License: N/A

B. Annotation Format


Text Recognition

# img_name, "text"

train/1009_2.png You
train/1017_1.png Rescue
train/1017_2.png mission


C. Reference

@InProceedings{MishraBMVC12, author    = "Mishra, A. and Alahari, K. and Jawahar, C.~V.", title     = "Scene Text Recognition using Higher Order Language Priors", booktitle = "BMVC", year      = "2012"}

Synthetic Word Dataset (MJSynth/Syn90k)

“Reading Text in the Wild with Convolutional Neural Networks”, International Journal of Computer Vision, 2016. PDF

A. Basic Info

  • Official Website: mjsynth

  • Year: 2016

  • Language: [‘English’]

  • Scene: [‘Synthesis’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textrecog’]

  • License: N/A

B. Annotation Format


Text Recognition

./3000/7/182_slinking_71711.jpg 71711
./3000/7/182_REMODELERS_64541.jpg 64541


C. Reference

@InProceedings{Jaderberg14c, author       = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title        = "Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition", booktitle    = "Workshop on Deep Learning, NIPS", year         = "2014", }
@Article{Jaderberg16, author       = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title        = "Reading Text in the Wild with Convolutional Neural Networks", journal      = "International Journal of Computer Vision", number       = "1", volume       = "116", pages        = "1--20", month        = "jan", year         = "2016", }

NAF

“Deep Visual Template-Free Form Parsing”, ICDAR, 2019. PDF

A. Basic Info

  • Official Website: naf

  • Year: 2019

  • Language: [‘English’]

  • Scene: [‘Document’, ‘Handwritten’]

  • Annotation Granularity: [‘Word’, ‘Line’]

  • Supported Tasks: [‘textrecog’, ‘textdet’, ‘textspotting’]

  • License: CDLA

B. Annotation Format


Text Detection/Recognition/Spotting

{"fieldBBs": [{"poly_points": [[435, 1406], [466, 1406], [466, 1439], [435, 1439]], "type": "fieldCheckBox", "id": "f0", "isBlank": 1}, {"poly_points": [[435, 1444], [469, 1444], [469, 1478], [435, 1478]], "type": "fieldCheckBox", "id": "f1", "isBlank": 1}],
 "textBBs": [{"poly_points": [[1183, 1337], [2028, 1345], [2032, 1395], [1186, 1398]], "type": "text", "id": "t0"}, {"poly_points": [[492, 1336], [809, 1338], [809, 1379], [492, 1378]], "type": "text", "id": "t1"}, {"poly_points": [[512, 1375], [798, 1376], [798, 1405], [512, 1404]], "type": "textInst", "id": "t2"}], "imageFilename": "007182398_00026.jpg", "transcriptions": {"f0": "\u00bf\u00bf\u00bf \u00bf\u00bf\u00bf 18/1/49 \u00bf\u00bf\u00bf\u00bf\u00bf", "f1": "U.S. Navy 53rd. Naval Const. Batt.", "t0": "APPLICATION FOR HEADSTONE OR MARKER", "t1": "ORIGINAL"}}


C. Reference

@inproceedings{davis2019deep, title={Deep visual template-free form parsing}, author={Davis, Brian and Morse, Bryan and Cohen, Scott and Price, Brian and Tensmeyer, Chris}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, pages={134--141}, year={2019}, organization={IEEE}}

Scanned Receipts OCR and Information Extraction

“ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”, ICDAR, 2019. PDF

A. Basic Info

  • Official Website: sroie

  • Year: 2019

  • Language: [‘English’]

  • Scene: [‘Document’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: CC BY 4.0

B. Annotation Format


Text Detection, Text Recognition and Text Spotting

# x1,y1,x2,y2,x3,y3,x4,y4,trans

72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W


C. Reference

@INPROCEEDINGS{8977955, author={Huang, Zheng and Chen, Kai and He, Jianhua and Bai, Xiang and Karatzas, Dimosthenis and Lu, Shijian and Jawahar, C. V.}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, title={ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction}, year={2019}, volume={}, number={}, pages={1516-1520}, doi={10.1109/ICDAR.2019.00244}}

Street View Text Dataset (SVT)

“Word Spotting in the Wild”, ECCV, 2010. PDF

A. Basic Info

  • Official Website: svt

  • Year: 2010

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: N/A

B. Annotation Format


Text Detection/Recognition/Spotting

<image>
  <imageName>img/14_03.jpg</imageName>
  <address>341 Southwest 10th Avenue Portland OR</address>
  <lex>
  LIVING,ROOM,THEATERS,KENNY,ZUKE,DELICATESSEN,CLYDE,COMMON,ACE,HOTEL,PORTLAND,ROSE,CITY,BOOKS,STUMPTOWN,COFFEE,ROASTERS,RED,CAP,GARAGE,FISH,GROTTO,SEAFOOD,RESTAURANT,AURA,RESTAURANT,LOUNGE,ROCCO,PIZZA,PASTA,BUFFALO,EXCHANGE,MARK,SPENCER,LIGHT,FEZ,BALLROOM,READING,FRENZY,ROXY,SCANDALS,MARTINOTTI,CAFE,DELI,CROWSENBERG,HALF
  </lex>
  <Resolution x="1280" y="880"/>
  <taggedRectangles>
    <taggedRectangle height="75" width="236" x="375" y="253">
      <tag>LIVING</tag>
    </taggedRectangle>
    <taggedRectangle height="76" width="175" x="639" y="272">
      <tag>ROOM</tag>
    </taggedRectangle>
    <taggedRectangle height="87" width="281" x="839" y="283">
      <tag>THEATERS</tag>
    </taggedRectangle>
  </taggedRectangles>
</image>


C. Reference

@inproceedings{wang2010word, title={Word spotting in the wild}, author={Wang, Kai and Belongie, Serge}, booktitle={European conference on computer vision}, pages={591--604}, year={2010}, organization={Springer}}

Street View Text Perspective (SVT-P)

“Recognizing Text with Perspective Distortion in Natural Scenes”, ICCV, 2013. PDF

A. Basic Info

  • Official Website: svtp

  • Year: 2013

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textrecog’]

  • License: N/A

B. Annotation Format


Text Recognition

13_15_0_par.jpg WYNDHAM
13_15_1_par.jpg HOTEL
12_16_0_par.jpg UNITED


C. Reference

@inproceedings{phan2013recognizing, title={Recognizing text with perspective distortion in natural scenes}, author={Phan, Trung Quy and Shivakumara, Palaiahnakote and Tian, Shangxuan and Tan, Chew Lim}, booktitle={Proceedings of the IEEE International Conference on Computer Vision}, pages={569--576}, year={2013}}

SynthText in the Wild Dataset

“Synthetic Data for Text Localisation in Natural Images”, CVPR, 2016. PDF

A. Basic Info

  • Official Website: synthtext

  • Year: 2016

  • Language: [‘English’]

  • Scene: [‘Synthesis’]

  • Annotation Granularity: [‘Word’, ‘Character’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: Synthext Custom

B. Annotation Format


Text Detection/Recognition/Spotting

{
    "imnames": [['8/ballet_106_0.jpg', ...]],
    "wordBB": [[[420.58957   418.85016   448.08478   410.3094    117.745026
                322.30963   322.6857    159.09138   154.27284   260.14597
                431.9315    427.52274   296.86508    99.56819   108.96211  ]
               [512.3321    431.88342   519.4515    499.81183   179.0544
                377.97382   376.4993    203.64464   193.77492   313.61514
                487.58023   484.64633   365.83176   142.49403   144.90457  ]
               [511.92203   428.7077    518.7375    499.0373    172.1684
                378.35858   377.2078    203.3191    193.0739    319.69186
                485.6758    482.571     365.76303   142.31898   144.43858  ]
               [420.1795    415.67444   447.3708    409.53485   110.859024
                322.6944    323.3942    158.76585   153.57182   266.2227
                430.02707   425.44742   296.79636    99.39314   108.49613  ]]

              [[ 21.06382    46.19922    47.570374   73.95366   197.17792
                  9.993624   48.437763    9.064571   49.659035  208.57095
                118.41646   162.82489    29.548729    5.800581   28.812992 ]
               [ 23.069519   48.254295   50.130234   77.18146   208.71487
                  8.999153   46.69632     9.698633   50.869553  203.25742
                122.64043   168.38647    29.660484    6.2558594  29.602367 ]
               [ 41.827087   68.39458    70.03627    98.65903   245.30832
                 30.534437   68.589294   32.57161    73.74529   264.40634
                147.7303    189.70224    72.08       22.759935   50.81941  ]
               [ 39.82139    66.3395     67.47641    95.43123   233.77136
                 31.528908   70.33074    31.937548   72.534775  269.71988
                143.50633   184.14066    71.96825    22.304657   50.030033 ]], ...],
    "charBB": [[[423.16126397 439.60847343 450.66887979 466.31976402 479.76190495
                504.59927448 418.80489444 450.13965942 464.16775197 480.46891089
                502.46437709 413.02373632 433.01396211 446.7222192  470.28467827
                482.51674486 116.52285438 139.51408587 150.7448586  162.03366629
                322.84717946 333.54881536 343.28386485 363.07416389 323.48968759
                337.98503283 356.66355903 160.48517048 174.1707753  189.64454066
                155.7637383  167.45490471 179.63644201 262.2183876  271.75848874
                284.05396524 298.26103738 432.8464733  449.15387392 468.07231897
                428.11482147 445.61538159 469.24565878 296.86441324 323.6603118
                344.09880401 101.14677814 110.45423597 120.54555495 131.18342618
                132.20545124 110.01673682 120.83144568 131.35885673]
               [438.2997574  452.61288403 466.31976402 482.22585715 498.3934528
                512.20555863 431.88338084 466.11639619 481.73414937 499.62012025
                519.36789779 432.51717267 449.23571387 465.73425964 484.45139112
                499.59056304 140.27413679 149.59811175 160.13352083 169.59504507
                333.55849014 344.33923741 361.08275796 378.09844418 339.92898685
                355.57692063 376.51230484 174.1707753  189.07871028 203.64462646
                165.22739457 181.27572412 193.60260894 270.99557614 283.13281739
                298.75499435 313.61511672 447.1421735  470.27065563 487.02126631
                446.97485257 468.98979567 484.64633864 317.88691577 341.16094163
                365.8300006  111.15280603 120.54555495 130.72086821 135.27663717
                142.4726875  120.1331955  133.07976304 144.75919258]
               [435.54895424 449.95797159 464.5848793  480.68235876 497.04793842
                511.1101386  428.95660757 463.61882066 480.14247127 498.2535215
                518.03243928 429.36600266 447.19056345 463.89483785 482.21016814
                498.18529977 142.63162835 152.55587851 162.80539142 172.21885945
                333.35620309 344.09880401 360.86201193 377.82379299 339.7646859
                355.37508239 376.1110999  172.46032372 187.37816388 201.39094518
                163.04321987 178.99078221 191.89681939 275.3073355  286.08373072
                301.85539131 318.57227103 444.54207279 467.53925436 485.27070558
                444.57367155 466.90671029 482.56302723 317.62908407 340.9131681
                365.44465854 109.40501176 119.4999228  129.67892444 134.35253232
                140.97421069 118.61779828 131.34019115 143.25688164]
               [420.17946701 436.74150236 448.74896556 464.5848793  478.18853922
                503.4152019  415.67442461 447.3707845  462.35927516 478.8614766
                500.86810735 409.54560397 430.77026495 444.64606264 467.79077782
                480.89051912 119.14629674 142.63162835 153.56593297 164.78799774
                322.69436747 333.35620309 343.11884239 362.84714115 323.37931952
                337.83763574 356.35573621 158.76583616 172.46032372 187.37816388
                153.57183805 165.15781218 177.92125239 266.22269514 274.45156305
                286.82608962 302.69695881 430.02705241 446.01814255 466.05208347
                425.44741792 443.19481667 466.90671029 296.79634428 323.49707084
                343.82488703  99.39315359 109.40501176 119.4999228  130.25798537
                130.70149005 108.49612777 119.08444238 129.84935461]]

              [[ 22.26958901  21.60559248  27.0241972   27.25747678  27.45783459
                 28.73896576  47.91255579  47.80732383  53.77711568  54.24219042
                 52.00169325  74.79043429  80.45929285  81.04748707  76.11658669
                 82.58335942 203.67278213 201.2743445  205.59358622 205.51198143
                 10.06536976  10.82312635  16.77203865  16.31842372  54.80444433
                 54.66492     47.33822371  15.08534083  15.18716407   9.62607092
                 51.06813224  50.18928243  56.16019366 220.78902143 236.08062638
                231.69267533 209.73652786 124.25352842 119.99631725 128.73732717
                165.78411123 167.31764153 167.05531699  29.97351822  31.5116502
                 31.14650552   5.88513488  12.51324147  12.57920537   8.21515307
                  8.21998849  35.66412031  29.17945741  36.00660903]
               [ 22.46075572  21.76391911  27.25747678  27.49456029  27.73554156
                 28.85582217  48.25428361  48.21714995  54.27828788  54.78857757
                 52.4595556   75.57743634  81.15533616  81.86325615  76.681392
                 83.31596322 210.04771309 203.83983042 208.00417391 207.41791524
                  9.79265706  10.55231862  16.36406888  15.97405105  54.64620856
                 54.49559004  47.09756263  15.18716407  15.29808166   9.69862498
                 51.27597632  50.48652154  56.49239954 216.92183074 232.02141018
                226.44624213 203.25738931 125.19349641 121.32658508 130.00428964
                167.43676857 169.36588297 168.38645076  29.58279603  31.19899202
                 30.75826599   5.92344996  12.57920537  12.64571832   8.23451892
                  8.26856497  35.82646468  29.342662    36.22165159]
               [ 40.15739982  40.47241401  40.79219178  41.14411963  41.50190876
                 41.80934074  66.81590976  68.05921213  68.6519006   69.30152766
                 70.01097963  96.14641662  96.04484417  96.89110144  97.81897661
                 98.62829468 237.26055111 240.35280825 243.54641271 245.04022528
                 31.33842788  31.14650552  30.84702178  30.54399042  69.80098672
                 68.7212013   68.62479627  32.13243303  32.34474067  32.54416771
                 72.82501686  73.31372392  73.70922459 267.74318222 265.39839711
                259.52741156 253.14023308 144.60810334 145.23371653 147.69958337
                186.00278322 188.17713786 189.70144388  71.89351759  53.62266986
                 54.40060855  22.41084398  22.51791234  22.62587258  17.11356079
                 22.74567232  50.25232032  46.05692507  50.79345235]
               [ 39.82138755  40.18347166  40.44598236  40.79219178  41.08959901
                 41.64111176  66.33948982  67.47640971  68.01403337  68.60595247
                 69.3953105   95.13188979  95.21297344  95.91593691  97.08847413
                 97.75212171 229.94285119 237.26055111 240.66752705 242.74145162
                 31.52890731  31.33842788  31.16401306  30.81155638  69.87135926
                 68.80273568  68.71664209  31.93753588  32.13243303  32.34474067
                 72.53476992  72.88981775  73.28094858 269.71986636 267.92938572
                262.93698624 256.88902439 143.50635029 143.61251781 146.24080653
                184.14064261 185.86853729 188.17713786  71.96823746  53.79651809
                 54.60870874  22.30465649  22.41084398  22.51791234  17.07939535
                 22.63671808  50.03002471  45.81009198  50.49899163]], ...],
    "txt": [['Lines:\nI lost\nKevin ' 'will                ' 'line\nand            '
              'and\nthe             ' '(and                ' 'the\nout             '
              'you                 ' "don't\n pkg          "], ...]
}


C. Reference

@InProceedings{Gupta16, author       = "Ankush Gupta and Andrea Vedaldi and Andrew Zisserman", title        = "Synthetic Data for Text Localisation in Natural Images", booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition", year         = "2016", }

Text OCR

“TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text”, CVPR, 2021. PDF

A. Basic Info

  • Official Website: textocr

  • Year: 2021

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: CC BY 4.0

B. Annotation Format


Text Detection/Recognition/Spotting

{
  "imgs": {
    "OpenImages_ImageID_1": {
      "id": "OpenImages_ImageID_1",
      "width": "INT, Width of the image",
      "height": "INT, Height of the image",
      "set": "Split train|val|test",
      "filename": "train|test/OpenImages_ImageID_1.jpg"
    },
    "OpenImages_ImageID_2": {
      "...": "..."
    }
  },
  "anns": {
    "OpenImages_ImageID_1_1": {
      "id": "STR, OpenImages_ImageID_1_1, Specifies the nth annotation for an image",
      "image_id": "OpenImages_ImageID_1",
      "bbox": [
        "FLOAT x1",
        "FLOAT y1",
        "FLOAT x2",
        "FLOAT y2"
      ],
      "points": [
        "FLOAT x1",
        "FLOAT y1",
        "FLOAT x2",
        "FLOAT y2",
        "...",
        "FLOAT xN",
        "FLOAT yN"
      ],
      "utf8_string": "text for this annotation",
      "area": "FLOAT, area of this box"
    },
    "OpenImages_ImageID_1_2": {
      "...": "..."
    },
    "OpenImages_ImageID_2_1": {
      "...": "..."
    }
  },
  "img2Anns": {
    "OpenImages_ImageID_1": [
      "OpenImages_ImageID_1_1",
      "OpenImages_ImageID_1_2",
      "OpenImages_ImageID_1_2"
    ],
    "OpenImages_ImageID_N": [
      "..."
    ]
  }
}


C. Reference

@inproceedings{singh2021textocr, title={{TextOCR}: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text}, author={Singh, Amanpreet and Pang, Guan and Toh, Mandy and Huang, Jing and Galuba, Wojciech and Hassner, Tal}, journal={The Conference on Computer Vision and Pattern Recognition}, year={2021}}

Total Text

“Total-Text: Towards Orientation Robustness in Scene Text Detection”, IJDAR, 2020. PDF

A. Basic Info

  • Official Website: totaltext

  • Year: 2020

  • Language: [‘English’]

  • Scene: [‘Natural Scene’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: BSD-3

B. Annotation Format


Text Detection/Spotting

x: [[259 313 389 427 354 302]], y: [[542 462 417 459 507 582]], ornt: [u'c'], transcriptions: [u'PAUL']
x: [[400 478 494 436]], y: [[398 380 448 465]], ornt: [u'#'], transcriptions: [u'#']


C. Reference

@article{CK2019, author = {Chee Kheng Chng and Chee Seng Chan and Chenglin Liu}, title = {Total-Text: Towards Orientation Robustness in Scene Text Detection}, journal = {International Journal on Document Analysis and Recognition (IJDAR)}, volume = {23}, pages = {31-52}, year = {2020}, doi = {10.1007/s10032-019-00334-z}}

WildReceipt

“Spatial Dual-Modality Graph Reasoning for Key Information Extraction”, arXiv, 2021. PDF

A. Basic Info

  • Official Website: wildreceipt

  • Year: 2021

  • Language: [‘English’]

  • Scene: [‘Receipt’]

  • Annotation Granularity: [‘Word’]

  • Supported Tasks: [‘kie’, ‘textdet’, ‘textrecog’, ‘textspotting’]

  • License: N/A

B. Annotation Format


KIE

// Close Set
{
  "file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
  "height": 1200,
  "width": 1600,
  "annotations":
    [
      {
        "box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
        "text": "SAFEWAY",
        "label": 1
      },
      {
        "box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
        "text": "TM",
        "label": 25
      }
    ], //...
}

// Open Set
{
  "file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
  "height": 348,
  "width": 348,
  "annotations":
    [
      {
        "box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
        "text": "CHOEUN",
        "label": 2,
        "edge": 1
      },
      {
        "box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
        "text": "KOREANRESTAURANT",
        "label": 2,
        "edge": 1
      }
    ]
}


C. Reference

@article{sun2021spatial, title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, author={Sun, Hongbin and Kuang, Zhanghui and Yue, Xiaoyu and Lin, Chenhao and Zhang, Wayne}, journal={arXiv preprint arXiv:2103.14470}, year={2021} } 

Dataset Preparer (Beta)

Note

Dataset Preparer is still in beta version and might not be stable enough. You are welcome to try it out and report any issues to us.

One-click data preparation script

MMOCR provides a unified one-stop data preparation script prepare_dataset.py.

Only one line of command is needed to complete data downloading, decompression, format conversion, and basic configuration generation.

python tools/dataset_converters/prepare_dataset.py [-h] [--nproc NPROC] [--task {textdet,textrecog,textspotting,kie}] [--splits SPLITS [SPLITS ...]] [--lmdb] [--overwrite-cfg] [--dataset-zoo-path DATASET_ZOO_PATH] datasets [datasets ...]
ARGS                 Type   Description
dataset_name         str    (required) Dataset name.
--nproc              int    Number of processes to be used. Defaults to 4.
--task               str    Convert the dataset to the format of a specified task supported by MMOCR. Options are 'textdet', 'textrecog', 'textspotting', and 'kie'.
--splits             str    Splits of the dataset to be prepared. Multiple splits can be accepted. Defaults to train val test.
--lmdb               str    Store the data in LMDB format. Only valid when the task is textrecog.
--overwrite-cfg      str    Whether to overwrite the dataset config file if it already exists in configs/{task}/_base_/datasets.
--dataset-zoo-path   str    Path to the dataset config files. If not specified, the default path is ./dataset_zoo.

For example, the following command shows how to use the script to prepare the ICDAR2015 dataset for text detection task.

python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet --overwrite-cfg

Also, the script supports preparing multiple datasets at the same time. For example, the following command shows how to prepare the ICDAR2015 and TotalText datasets for text recognition task.

python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog --overwrite-cfg

To check the datasets supported by Dataset Preparer, please refer to Dataset Zoo. Some other datasets that need to be prepared manually are listed in Text Detection and Text Recognition.

For users in China, more datasets can be downloaded from the open-source dataset platform OpenDataLab. After downloading the data, you can place the files listed in data_obtainer.save_name in data/cache and rerun the script.

Advanced Usage

LMDB Format

In text recognition tasks, we usually use LMDB format to store data to speed up data loading. When using the prepare_dataset.py script, you can store the data in LMDB format with the --lmdb parameter. For example:

python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb

Once the dataset is prepared, Dataset Preparer will generate icdar2015_lmdb.py in the configs/textrecog/_base_/datasets/ directory. You can inherit this file and point the dataloader to the LMDB dataset. Moreover, the LMDB dataset needs to be loaded by LoadImageFromNDArray, so you also need to modify the pipeline.

For example, if we want to change the training set of configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py to the icdar2015 LMDB dataset generated above, we need to make the following modifications:

  1. Modify configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py:

    _base_ = [
        '../_base_/datasets/icdar2015_lmdb.py',  # point to the icdar2015 lmdb dataset
        ...
    ]

    train_list = [_base_.icdar2015_lmdb_textrecog_train]
    ...
    
  2. Modify train_pipeline in configs/textrecog/crnn/_base_crnn_mini-vgg.py, change LoadImageFromFile to LoadImageFromNDArray:

    train_pipeline = [
        dict(
            type='LoadImageFromNDArray',
            color_type='grayscale',
            file_client_args=file_client_args,
            ignore_empty=True,
            min_size=2),
        ...
    ]
    

Design

There are many OCR datasets with different languages, annotation formats, and scenarios. There are generally two usage scenarios for these datasets: quickly getting to know the relevant information about a dataset, or using it to train models. To meet these two scenarios, MMOCR provides an automatic dataset preparation script. The script uses a modular design, which greatly enhances extensibility and allows users to easily configure other public or private datasets. The configuration files for the dataset preparation script are uniformly stored in the dataset_zoo/ directory, where users can find all the configuration files of the dataset preparation scripts officially supported by MMOCR. The directory structure of this folder is as follows:

dataset_zoo/
├── icdar2015
│   ├── metafile.yml
│   ├── sample_anno.md
│   ├── textdet.py
│   ├── textrecog.py
│   └── textspotting.py
└── wildreceipt
    ├── metafile.yml
    ├── sample_anno.md
    ├── kie.py
    ├── textdet.py
    ├── textrecog.py
    └── textspotting.py

Dataset Usage

After decades of development, the OCR field has seen a series of related datasets emerge, often providing text annotation files in various styles, making it necessary for users to perform format conversion when using these datasets. Therefore, to facilitate dataset preparation for users, we have designed the Dataset Preparer to help users quickly prepare datasets in the format supported by MMOCR. For details, please refer to the Dataset Format document. The following figure shows a typical workflow for running the Dataset Preparer.

workflow

The figure shows that when running the Dataset Preparer, the following operations will be performed in sequence:

  1. For the training set, validation set, and test set, the preparers will perform:

    1. Dataset download, extraction, and movement (Obtainer)

    2. Matching annotations with images (Gatherer)

    3. Parsing original annotations (Parser)

    4. Packing annotations into a unified format (Packer)

    5. Saving annotations (Dumper)

  2. Delete files (Delete)

  3. Generate the configuration file for the dataset (Config Generator).

To handle various types of datasets, MMOCR has designed each component as a plug-and-play module, and allows users to configure the dataset preparation process through configuration files located in dataset_zoo/. These configuration files are in Python format and can be used in the same way as other configuration files in MMOCR, as described in the Configuration File documentation.

In dataset_zoo/, each dataset has its own folder, and the configuration files are named after the task to distinguish different configurations under different tasks. Taking the text detection part of ICDAR2015 as an example, the sample configuration file dataset_zoo/icdar2015/textdet.py is shown below:

data_root = 'data/icdar2015'
cache_path = 'data/cache'
train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
                save_name='ic15_textdet_train_img.zip',
                md5='c51cbace155dcc4d98c8dd19d378f30d',
                content=['image'],
                mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'ch4_training_localization_transcription_gt.zip',
                save_name='ic15_textdet_train_gt.zip',
                md5='3bfaf1988960909014f7987d2343060b',
                content=['annotation'],
                mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
        ]),
    gatherer=dict(
        type='PairGatherer',
        img_suffixes=['.jpg', '.JPG'],
        rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
    parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
    packer=dict(type='TextDetPacker'),
    dumper=dict(type='JsonDumper'),
)

test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/ch4_test_images.zip',
                save_name='ic15_textdet_test_img.zip',
                md5='97e4c1ddcf074ffcc75feff2b63c35dd',
                content=['image'],
                mapping=[['ic15_textdet_test_img', 'textdet_imgs/test']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge4_Test_Task4_GT.zip',
                save_name='ic15_textdet_test_gt.zip',
                md5='8bce173b06d164b98c357b0eb96ef430',
                content=['annotation'],
                mapping=[['ic15_textdet_test_gt', 'annotations/test']]),
        ]),
    gatherer=dict(
        type='PairGatherer',
        img_suffixes=['.jpg', '.JPG'],
        rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
    parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
    packer=dict(type='TextDetPacker'),
    dumper=dict(type='JsonDumper'),
)

delete = ['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img']
config_generator = dict(type='TextDetConfigGenerator')

Dataset download, extraction, and movement (Obtainer)

The obtainer module in Dataset Preparer is responsible for downloading, extracting, and moving the dataset. Currently, MMOCR only provides the NaiveDataObtainer. Generally speaking, the built-in NaiveDataObtainer is sufficient for downloading most datasets that can be accessed through direct links, and supports operations such as extraction, moving files, and renaming. However, MMOCR currently does not support automatically downloading datasets stored in resources that require login, such as Baidu or Google Drive. Here is a brief introduction to the NaiveDataObtainer.

  • cache_path: Dataset cache path, used to store the compressed files downloaded during dataset preparation

  • data_root: Root directory where the dataset is stored

  • files: Dataset file list, used to describe the download information of the dataset

The files field is a list, and each element in the list is a dictionary describing the download information of one dataset file. The meaning of each field is as follows:

  • url: Download link for the dataset file

  • save_name: Name used to save the dataset file

  • md5 (optional): MD5 hash of the dataset file, used to check whether the downloaded file is complete

  • split (optional): Dataset split the file belongs to, such as train or test; this field can be omitted

  • content (optional): Content of the dataset file, such as image or annotation; this field can be omitted

  • mapping (optional): Decompression mapping of the dataset file, used to specify the storage location of the file after decompression; this field can be omitted

The Dataset Preparer adheres to the following conventions:

  • Images for different tasks are moved to the corresponding {taskname}_imgs/{split}/ folder, such as textdet_imgs/train/.

  • For an annotation file that contains the annotations of all images, the file is moved to annotations/{split}.*, such as annotations/train.json.

  • For annotation files that each contain the annotations of a single image, all annotation files are moved to the annotations/{split}/ folder, such as annotations/train/.

  • In some special cases, e.g. when all training, validation, and test images are in one folder, the images can be moved to a self-defined folder such as {taskname}_imgs/imgs/, and the image location must then be specified in the subsequent gatherer module.

An example configuration is as follows:

    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
                save_name='ic15_textdet_train_img.zip',
                md5='c51cbace155dcc4d98c8dd19d378f30d',
                content=['image'],
                mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'ch4_training_localization_transcription_gt.zip',
                save_name='ic15_textdet_train_gt.zip',
                md5='3bfaf1988960909014f7987d2343060b',
                content=['annotation'],
                mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
        ]),
Dataset collection (Gatherer)

The gatherer module traverses the files in the dataset directory, matches image files with their corresponding annotation files, and organizes a file list for the parser module to read. Therefore, it is necessary to know the matching rules between image files and annotation files in the current dataset. There are two commonly used annotation storage formats for OCR datasets: one is multiple annotation files corresponding to multiple images, and the other is a single annotation file corresponding to multiple images, for example:

Many-to-Many
├── {taskname}_imgs/{split}/img_1.jpg
├── annotations/{split}/gt_img_1.txt
├── {taskname}_imgs/{split}/img_2.jpg
├── annotations/{split}/gt_img_2.txt
├── {taskname}_imgs/{split}/img_3.JPG
├── annotations/{split}/gt_img_3.txt

One-to-Many
├── {taskname}/{split}/img_1.jpg
├── {taskname}/{split}/img_2.jpg
├── {taskname}/{split}/img_3.JPG
├── annotations/gt.txt

Specific design is as follows:

(Figure: Gatherer)

MMOCR has built-in PairGatherer and MonoGatherer to handle the two common cases mentioned above. PairGatherer is used for many-to-many situations, while MonoGatherer is used for one-to-many situations.

Note

To simplify processing, the gatherer assumes that the dataset’s images and annotations are stored separately in {taskname}_imgs/{split}/ and annotations/, respectively. In particular, for many-to-many situations, the annotation file needs to be placed in annotations/{split}.

  • In the many-to-many case, PairGatherer needs to find the image files and their corresponding annotation files according to a naming convention. First, the image suffixes must be specified with the img_suffixes parameter, as in the example above: img_suffixes=['.jpg', '.JPG']. In addition, a pair of regular expressions, rule, specifies the correspondence between image and annotation files, e.g. rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']. The first regular expression matches the image file name: \d+ matches the image sequence number and ([jJ][pP][gG]) matches the image suffix. The second regular expression matches the annotation file name, where \1 ties the annotation file name to the matched image sequence number. An example configuration is:

    gatherer=dict(
        type='PairGatherer',
        img_suffixes=['.jpg', '.JPG'],
        rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),

The one-to-many case is usually simpler: the user only needs to specify the name of the annotation file. For example, for the training set:

    gatherer=dict(type='MonoGatherer', ann_name='train.txt'),

MMOCR has also made conventions on the return value of the Gatherer: it returns a two-element tuple. The first element is either a list of all image paths or the folder containing all images; the second element is either a list of all annotation file paths or the path of the single annotation file that contains the annotations of all images. Specifically, the return value of PairGatherer is (list of image paths, list of annotation file paths), as shown below:

    (['{taskname}_imgs/{split}/img_1.jpg', '{taskname}_imgs/{split}/img_2.jpg', '{taskname}_imgs/{split}/img_3.JPG'],
    ['annotations/{split}/gt_img_1.txt', 'annotations/{split}/gt_img_2.txt', 'annotations/{split}/gt_img_3.txt'])

MonoGatherer returns a tuple containing the path to the image directory and the path to the annotation file, as follows:

    ('{taskname}/{split}', 'annotations/gt.txt')
Dataset parsing (Parser)

Parser is mainly used to parse the original annotation files. Since original annotation formats vary greatly, MMOCR provides BaseParser as a base class, which users can inherit to implement their own Parser. In BaseParser, MMOCR defines two interfaces, parse_files and parse_file, in which the annotation parsing is carried out by convention. For the two different input situations produced by the Gatherer (many-to-many and one-to-many), the implementations of these two interfaces differ.

  • By default, BaseParser handles the many-to-many situation: parse_files distributes the data in parallel to multiple parse_file processes, and each parse_file parses the annotations of a single image.

  • For the one-to-many situation, the user needs to override parse_files to implement loading the annotation and returning standardized results.

The interface of BaseParser is defined as follows:

# Excerpt of the interface; imports are added here for readability
# (the exact import path of the utility may differ slightly).
from abc import abstractmethod
from typing import List, Tuple, Union

from mmocr.utils import track_parallel_progress_multi_args


class BaseParser:

    def __call__(self, img_paths, ann_paths):
        return self.parse_files(img_paths, ann_paths)

    def parse_files(self, img_paths: Union[List[str], str],
                    ann_paths: Union[List[str], str]) -> List[Tuple]:
        # Dispatch (img_path, ann_path) pairs to parse_file in parallel.
        samples = track_parallel_progress_multi_args(
            self.parse_file, (img_paths, ann_paths), nproc=self.nproc)
        return samples

    @abstractmethod
    def parse_file(self, img_path: str, ann_path: str) -> Tuple:
        """Parse the annotations of a single image."""
        raise NotImplementedError

In order to ensure the uniformity of subsequent modules, MMOCR has made conventions for the return values of parse_files and parse_file. The return value of parse_file is a tuple, the first element of which is the image path, and the second element is the annotation information. The annotation information is a list, each element of which is a dictionary with the fields poly, text, and ignore, as shown below:

# An example of returned values:
(
    'imgs/train/xxx.jpg',
    [
        dict(
            poly=[0, 1, 1, 1, 1, 0, 0, 0],
            text='hello',
            ignore=False),
        ...
    ]
)

The output of parse_files is a list, and each element in the list is the return value of parse_file. An example is:

[
    (
        'imgs/train/xxx.jpg',
        [
            dict(
                poly=[0, 1, 1, 1, 1, 0, 0, 0],
                text='hello',
                ignore=False),
            ...
        ]
    ),
    ...
]
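As a reference, a custom parser for a many-to-many style dataset only needs to subclass BaseParser and implement parse_file following the convention above. The snippet below is a minimal sketch for a hypothetical annotation format of the form x1,y1,x2,y2,x3,y3,x4,y4,transcription; the class name and file format are illustrative and do not correspond to an existing MMOCR parser, and the new class would additionally have to be registered so that it can be referenced by type in the preparer config.

from typing import Tuple

# Import path may vary slightly between versions; parsers live under
# mmocr/datasets/preparers/parsers.
from mmocr.datasets.preparers.parsers import BaseParser


class ToyTextDetAnnParser(BaseParser):
    """Hypothetical parser for lines like 'x1,y1,x2,y2,x3,y3,x4,y4,text'."""

    def parse_file(self, img_path: str, ann_path: str) -> Tuple:
        instances = []
        with open(ann_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                *coords, text = line.split(',')
                # '###' is the usual marker for illegible text in
                # ICDAR-style annotations and is flagged as ignored here.
                instances.append(
                    dict(
                        poly=[float(c) for c in coords],
                        text=text,
                        ignore=text == '###'))
        return img_path, instances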
Dataset Conversion (Packer)

Packer is mainly used to convert data into a unified annotation format. Since the input data is the output of the parser and its format is already fixed, the packer only needs to convert that input into the unified annotation format of each task. Currently, MMOCR supports tasks such as text detection, text recognition, end-to-end OCR, and key information extraction, and provides a corresponding packer for each of them, as shown below:

(Figure: Packer)

For text detection, end-to-end OCR, and key information extraction, MMOCR has a unique corresponding Packer. However, for text recognition, MMOCR provides two Packer options: TextRecogPacker and TextRecogCropPacker, due to the existence of two types of datasets:

  • Each image is a recognition sample, and the annotation information returned by the parser is only a dict(text='xxx'). In this case, TextRecogPacker can be used.

  • The dataset does not crop the text out of the images; its annotations are essentially end-to-end OCR annotations containing both the text positions and the corresponding transcriptions. TextRecogCropPacker first crops the text instances from the images and then converts them into the unified text recognition format (see the sketch after this list).
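To make the distinction concrete, the following sketch contrasts the per-image annotations that each packer consumes, following the parser return convention described earlier (paths and values are purely illustrative):

# Case 1: the images are already cropped text instances, so the parser
# only returns a transcription per image -- handled by TextRecogPacker.
recog_sample = ('textrecog_imgs/train/word_1.png', dict(text='hello'))

# Case 2: full images with end-to-end annotations (positions + texts) --
# TextRecogCropPacker crops each polygon and emits one recognition
# sample per text instance.
e2e_sample = ('textdet_imgs/train/img_1.jpg', [
    dict(poly=[0, 1, 1, 1, 1, 0, 0, 0], text='hello', ignore=False),
    dict(poly=[2, 3, 3, 3, 3, 2, 2, 2], text='world', ignore=False),
])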

Annotation Saving (Dumper)

The dumper module is used to determine what format the data should be saved in. Currently, MMOCR supports JsonDumper, WildreceiptOpensetDumper, and TextRecogLMDBDumper. They are used to save data in the standard MMOCR JSON format, the Wildreceipt format, and the LMDB format commonly used in the academic community for text recognition, respectively.
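For example, switching a text recognition dataset from the default JSON output to LMDB only requires changing the dumper entry in the preparer configuration (a minimal sketch using the dumper classes listed above):

# Default: dump annotations in the standard MMOCR JSON format.
dumper = dict(type='JsonDumper')

# Alternative for text recognition: dump the dataset as an LMDB.
dumper = dict(type='TextRecogLMDBDumper')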

Delete files (Delete)

When processing a dataset, temporary files that are not needed may be generated. Here, a list of such files or folders can be passed in, which will be deleted when the conversion is finished.
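For instance, the ICDAR2015 configuration shown at the beginning of this section removes the intermediate image and annotation folders once the packed JSON files have been written:

# Folders that are only needed during preparation and can be removed
# after the packed annotations have been dumped.
delete = ['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img']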

Generate the configuration file for the dataset (ConfigGenerator)

In order to automatically generate basic configuration files after preparing the dataset, MMOCR has implemented TextDetConfigGenerator, TextRecogConfigGenerator, and TextSpottingConfigGenerator for each task. The main parameters supported by these generators are as follows:

  • data_root: Root directory where the dataset is stored.

  • train_anns: Path to the training set annotations in the configuration file. If not specified, it defaults to [dict(ann_file='{taskname}_train.json', dataset_postfix='')].

  • val_anns: Path to the validation set annotations in the configuration file. If not specified, it defaults to an empty string.

  • test_anns: Path to the test set annotations in the configuration file. If not specified, it defaults to [dict(ann_file='{taskname}_test.json', dataset_postfix='')].

  • config_path: Path to the directory where the algorithm configuration files are stored. The configuration generator writes the default configuration to {config_path}/{taskname}/_base_/datasets/{dataset_name}.py. If not specified, it defaults to configs/.

After preparing all the files for the dataset, the configuration generator will automatically generate the basic configuration files required to call the dataset. Below is a minimal example of a TextDetConfigGenerator configuration:

config_generator = dict(type='TextDetConfigGenerator')

The generated file will be placed by default under configs/{task}/_base_/datasets/. In this example, the basic configuration file for the ICDAR 2015 dataset will be generated at configs/textdet/_base_/datasets/icdar2015.py.

icdar2015_textdet_data_root = 'data/icdar2015'

icdar2015_textdet_train = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,
    ann_file='textdet_train.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)

icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)

If the dataset is special and there are several variants of the annotations, the configuration generator also supports generating variables pointing to each variant in the base configuration. However, this requires users to differentiate them by using different dataset_postfix when setting up. For example, the ICDAR 2015 text recognition dataset has two annotation versions for the test set, the original version and the 1811 version, which can be specified in test_anns as follows:

config_generator = dict(
    type='TextRecogConfigGenerator',
    test_anns=[
        dict(ann_file='textrecog_test.json'),
        dict(dataset_postfix='1811', ann_file='textrecog_test_1811.json')
    ])

The configuration generator will generate the following configurations:

icdar2015_textrecog_data_root = 'data/icdar2015'

icdar2015_textrecog_train = dict(
    type='OCRDataset',
    data_root=icdar2015_textrecog_data_root,
    ann_file='textrecog_train.json',
    pipeline=None)

icdar2015_textrecog_test = dict(
    type='OCRDataset',
    data_root=icdar2015_textrecog_data_root,
    ann_file='textrecog_test.json',
    test_mode=True,
    pipeline=None)

icdar2015_1811_textrecog_test = dict(
    type='OCRDataset',
    data_root=icdar2015_textrecog_data_root,
    ann_file='textrecog_test_1811.json',
    test_mode=True,
    pipeline=None)

With this file, MMOCR can directly import this dataset into the dataloader from the model configuration file (the following sample is excerpted from configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py):

_base_ = [
    '../_base_/datasets/icdar2015.py',
    # ...
]

# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_test = _base_.icdar2015_textdet_test
# ...

train_dataloader = dict(
    dataset=icdar2015_textdet_train)

val_dataloader = dict(
    dataset=icdar2015_textdet_test)

test_dataloader = val_dataloader

Note

By default, the configuration generator does not overwrite existing base configuration files unless the user passes --overwrite-cfg when running the script.

Adding a new dataset to Dataset Preparer

Adding Public Datasets

MMOCR has already supported many commonly used public datasets. If the dataset you want to use has not been supported yet and you are willing to contribute to the MMOCR open-source community, you can follow the steps below to add a new dataset.

In the following example, we will show you how to add the ICDAR2013 dataset step by step.

Adding metafile.yml

First, make sure that the dataset you want to add does not already exist in dataset_zoo/. Then, create a new folder named after the dataset you want to add, such as icdar2013/ (usually, use lowercase alphanumeric characters without symbols to name the dataset). In the icdar2013/ folder, create a metafile.yml file and fill in the basic information of the dataset according to the following template:

Name: 'Focused Scene Text IC13'
Paper:
  Title: ICDAR 2013 Robust Reading Competition
  URL: https://www.imlab.jp/publication_data/1352/icdar_competition_report.pdf
  Venue: ICDAR
  Year: '2013'
  BibTeX: '@inproceedings{karatzas2013icdar,
  title={ICDAR 2013 robust reading competition},
  author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere},
  booktitle={2013 12th international conference on document analysis and recognition},
  pages={1484--1493},
  year={2013},
  organization={IEEE}}'
Data:
  Website: https://rrc.cvc.uab.es/?ch=2
  Language:
    - English
  Scene:
    - Natural Scene
  Granularity:
    - Word
  Tasks:
    - textdet
    - textrecog
    - textspotting
  License:
    Type: N/A
    Link: N/A
  Format: .txt
  Keywords:
    - Horizontal
Add Annotation Examples

Next, you can add an annotation example file sample_anno.md under the dataset_zoo/icdar2013/ directory to help the documentation script add annotation examples when generating documentation. The annotation example file is a Markdown file that typically contains the raw data format of a single sample. For example, the following code block shows a sample data file for the ICDAR2013 dataset:

  **Text Detection**

  ```text
  # train split
  # x1 y1 x2 y2 "transcript"

  158 128 411 181 "Footpath"
  443 128 501 169 "To"
  64 200 363 243 "Colchester"

  # test split
  # x1, y1, x2, y2, "transcript"

  38, 43, 920, 215, "Tiredness"
  275, 264, 665, 450, "kills"
  0, 699, 77, 830, "A"
  ```

Add configuration files for corresponding tasks

In the dataset_zoo/icdar2013 directory, add a .py configuration file named after the task. For example, textdet.py, textrecog.py, textspotting.py, kie.py, etc. The configuration template is shown below:

data_root = ''
cache_path = 'data/cache'
train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
              ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'), # Packer for the task
    dumper=dict(type='JsonDumper'),
)
test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
              ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'), # Packer for the task
    dumper=dict(type='JsonDumper'),
)

Taking the text detection task as an example, let's walk through the specific content of the configuration file. In general, users do not need to implement a new obtainer, gatherer, packer, or dumper, but usually need to implement a new parser according to the annotation format of the dataset.

Regarding the configuration of the obtainer, we will not go into detail here; you can refer to Dataset download, extraction, and movement (Obtainer).

For the gatherer, by observing the obtained ICDAR2013 dataset files, we found that each image has a corresponding .txt format annotation file:

data_root
├── textdet_imgs/train/
│   ├── img_1.jpg
│   ├── img_2.jpg
│   └── ...
├── annotations/train/
│   ├── gt_img_1.txt
│   ├── gt_img_2.txt
│   └── ...

Moreover, the name of each annotation file corresponds to the image: gt_img_1.txt corresponds to img_1.jpg, and so on. Therefore, PairGatherer can be used to match them.

gatherer=dict(
      type='PairGatherer',
      img_suffixes=['.jpg'],
      rule=[r'(\w+)\.jpg', r'gt_\1.txt'])

In the rule, the first regular expression matches the image file name and the second one constructs the corresponding annotation file name. Here, (\w+) captures the image file stem, and gt_\1.txt builds the annotation name from it, where \1 refers to the content captured by the first expression. In other words, img_xx.jpg is mapped to gt_img_xx.txt.
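As a quick sanity check, the same substitution can be reproduced with Python's re module (purely illustrative; the Dataset Preparer applies the rule internally):

import re

# The rule pair from the gatherer configuration above.
img_rule, ann_rule = r'(\w+)\.jpg', r'gt_\1.txt'

for img_name in ['img_1.jpg', 'img_100.jpg']:
    # \1 re-uses the stem captured by (\w+), e.g. img_1 -> gt_img_1.txt
    print(img_name, '->', re.sub(img_rule, ann_rule, img_name))
# img_1.jpg -> gt_img_1.txt
# img_100.jpg -> gt_img_100.txt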

Next, you need to implement a parser to parse the original annotation files into a standard format. Usually, before adding a new dataset, users can browse the details page of the supported datasets and check if there is a dataset with the same format. If there is, you can use the parser of that dataset directly. Otherwise, you need to implement a new format parser.

Data format parsers are stored in the mmocr/datasets/preparers/parsers directory. All parsers need to inherit from BaseParser and implement the parse_file or parse_files method. For more information, please refer to Parsing original annotations (Parser).

By observing the annotation files of the ICDAR2013 dataset:

158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
542, 710, 938, 841, "break"
87, 884, 457, 1021, "could"
517, 919, 831, 1024, "save"

We found that the built-in ICDARTxtTextDetAnnParser already meets the requirements, so we can directly use this parser and configure it in the preparer.

parser=dict(
     type='ICDARTxtTextDetAnnParser',
     remove_strs=[',', '"'],
     encoding='utf-8',
     format='x1 y1 x2 y2 trans',
     separator=' ',
     mode='xyxy')

In the configuration of ICDARTxtTextDetAnnParser, remove_strs=[',', '"'] removes the extra quotes and commas in the annotation files. format='x1 y1 x2 y2 trans' indicates that each line in the annotation file contains four coordinates and a transcription, separated by spaces (separator=' '). mode='xyxy' means the coordinates are those of the top-left and bottom-right corners, so that ICDARTxtTextDetAnnParser can parse the annotations into the unified format.
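The effect of these options can be mimicked with a few lines of plain Python (a simplified sketch, not the actual ICDARTxtTextDetAnnParser implementation):

line = '158 128 411 181 "Footpath"'

# remove_strs=[',', '"']: strip the extra commas and quotes first.
for ch in [',', '"']:
    line = line.replace(ch, '')

# separator=' ' and format='x1 y1 x2 y2 trans': split into four
# coordinates and one transcription.
x1, y1, x2, y2, trans = line.split(' ')

# mode='xyxy': (x1, y1) is the top-left corner, (x2, y2) the bottom-right.
print([int(x1), int(y1), int(x2), int(y2)], trans)
# [158, 128, 411, 181] Footpath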

For the packer, taking the text detection task as an example, we use TextDetPacker, configured as follows:

packer=dict(type='TextDetPacker')

Finally, specify the dumper; annotations are generally saved in JSON format. Its configuration is as follows:

dumper=dict(type='JsonDumper')

After the above configuration, the configuration file for the ICDAR2013 training set is as follows:

train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Training_Task12_Images.zip',
                save_name='ic13_textdet_train_img.zip',
                md5='a443b9649fda4229c9bc52751bad08fb',
                content=['image'],
                mapping=[['ic13_textdet_train_img', 'textdet_imgs/train']]),
            dict(
                url='https://rrc.cvc.uab.es/downloads/'
                'Challenge2_Training_Task1_GT.zip',
                save_name='ic13_textdet_train_gt.zip',
                md5='f3a425284a66cd67f455d389c972cce4',
                content=['annotation'],
                mapping=[['ic13_textdet_train_gt', 'annotations/train']]),
        ]),
    gatherer=dict(
        type='PairGatherer',
        img_suffixes=['.jpg'],
        rule=[r'(\w+)\.jpg', r'gt_\1.txt']),
    parser=dict(
        type='ICDARTxtTextDetAnnParser',
        remove_strs=[',', '"'],
        format='x1 y1 x2 y2 trans',
        separator=' ',
        mode='xyxy'),
    packer=dict(type='TextDetPacker'),
    dumper=dict(type='JsonDumper'),
)

To automatically generate the basic configuration after the dataset is prepared, you also need to configure the corresponding task’s config_generator.

In this example, since it is a text detection task, you only need to set the generator to TextDetConfigGenerator.

config_generator = dict(type='TextDetConfigGenerator')

Use DataPreparer to prepare customized dataset

[Coming Soon]

Text Detection

Note

This page is a manual preparation guide for datasets that are not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.

Overview

| Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
| --- | --- | --- | --- | --- |
| ICDAR2011 | homepage | - | - | - |
| ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
| CurvedSynText150k | homepage, Part1, Part2 | instances_training.json | - | - |
| DeText | homepage | - | - | - |
| Lecture Video DB | homepage | - | - | - |
| LSVT | homepage | - | - | - |
| IMGUR | homepage | - | - | - |
| KAIST | homepage | - | - | - |
| MTWI | homepage | - | - | - |
| ReCTS | homepage | - | - | - |
| IIIT-ILST | homepage | - | - | - |
| VinText | homepage | - | - | - |
| BID | homepage | - | - | - |
| RCTW | homepage | - | - | - |
| HierText | homepage | - | - | - |
| ArT | homepage | - | - | - |

Install AWS CLI (optional)

  • Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      sudo ./aws/install
      ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
      aws configure
      # this command will require you to input keys, you can skip them except
      # for the Default region name
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]
    

For users in China, these datasets can also be downloaded from OpenDataLab at high speed.

Important Note

Note

For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext dataset, there might be some images containing orientation info in EXIF data. The default OpenCV backend used in MMCV would read them and apply the rotation on the images. However, their gold annotations are made on the raw pixels, and such inconsistency results in false examples in the training set. Therefore, users should use dict(type='LoadImageFromFile', color_type='color_ignore_orientation') in pipelines to change MMCV’s default loading behaviour. (see DBNet’s pipeline config for example)
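In a config, this amounts to replacing the image loading transform at the head of the data pipeline; only the relevant line is shown below, and the remaining transforms are dataset- and model-specific:

train_pipeline = [
    # Read the raw pixels and ignore the EXIF orientation flag, so that
    # the images stay consistent with the gold annotations.
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... the rest of the training pipeline stays unchanged
]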

ICDAR 2011 (Born-Digital Images)

  • Step1: Download Challenge1_Training_Task12_Images.zip, Challenge1_Training_Task1_GT.zip, Challenge1_Test_Task12_Images.zip, and Challenge1_Test_Task1_GT.zip from homepage Task 1.1: Text Localization (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir imgs && mkdir annotations
    
    # Download ICDAR 2011
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate
    
    # For images
    unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training
    unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test
    # For annotations
    unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training
    unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test
    
    rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
    
  • Step 2: Generate instances_training.json and instances_test.json with the following command:

    python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── icdar2011
    │   ├── imgs
    │   ├── instances_test.json
    │   └── instances_training.json
    

ICDAR 2017

  • Follow similar steps as ICDAR 2015.

  • The resulting directory structure looks like the following:

    ├── icdar2017
    │   ├── imgs
    │   ├── annotations
    │   ├── instances_training.json
    │   └── instances_val.json
    

CurvedSynText150k

  • Step1: Download syntext1.zip and syntext2.zip to CurvedSynText150k/.

  • Step2:

    unzip -q syntext1.zip
    mv train.json train1.json
    unzip images.zip
    rm images.zip
    
    unzip -q syntext2.zip
    mv train.json train2.json
    unzip images.zip
    rm images.zip
    
  • Step3: Download instances_training.json to CurvedSynText150k/

  • Or, generate instances_training.json with following command:

    python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
    
  • The resulting directory structure looks like the following:

    ├── CurvedSynText150k
    │   ├── syntext_word_eng
    │   ├── emcs_imgs
    │   └── instances_training.json
    

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate instances_training.json and instances_val.json with following command:

    python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── detext
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_test.json
    │   └── instances_training.json
    

Lecture Video DB

  • Step1: Download IIIT-CVid.zip to lv/.

    mkdir lv && cd lv
    
    # Download LV dataset
    wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
    unzip -q IIIT-CVid.zip
    
    mv IIIT-CVid/Frames imgs
    
    rm IIIT-CVid.zip
    
  • Step2: Generate instances_training.json, instances_val.json, and instances_test.json with following command:

    python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
    
  • The resulting directory structure looks like the following:

    │── lv
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_training.json
    │   └── instances_val.json
    

LSVT

  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    
    # Download LSVT dataset
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
    
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Annotations of LSVT test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
    
  • After running the above codes, the directory structure should be as follows:

    |── lsvt
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

IMGUR

  • Step1: Run download_imgur5k.py to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.

    mkdir imgur && cd imgur
    
    git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
    
    # Download images from imgur.com. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    
    rm -rf IMGUR5K-Handwriting-Dataset
    
  • Step2: Generate instances_training.json, instances_val.json and instances_test.json with the following command:

    python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
    
  • After running the above codes, the directory structure should be as follows:

    │── imgur
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_training.json
    │   └── instances_val.json
    

KAIST

  • Step1: Download KAIST_all.zip to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    
    # Download KAIST dataset
    wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
    unzip -q KAIST_all.zip
    
    rm KAIST_all.zip
    
  • Step2: Extract zips:

    python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
    
  • Step3: Generate instances_training.json and instances_val.json (optional) with following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── kaist
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

MTWI

  • Step1: Download mtwi_2018_train.zip from homepage.

    mkdir mtwi && cd mtwi
    
    unzip -q mtwi_2018_train.zip
    mv image_train imgs && mv txt_train annotations
    
    rm mtwi_2018_train.zip
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with the following command:

    # Annotations of MTWI test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── mtwi
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

ReCTS

  • Step1: Download ReCTS.zip to rects/ from the homepage.

    mkdir rects && cd rects
    
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
    unzip -q ReCTS.zip
    
    mv img imgs && mv gt_unicode annotations
    
    rm ReCTS.zip && rm -rf gt
    
  • Step2: Generate instances_training.json and instances_val.json (optional) with following command:

    # Annotations of ReCTS test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
    
  • After running the above codes, the directory structure should be as follows:

    │── rects
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_val.json (optional)
    │   └── instances_training.json
    

ILST

  • Step1: Download IIIT-ILST from onedrive

  • Step2: Run the following commands

    unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
    cd IIIT-ILST
    
    # rename files
    cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
    cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
    cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
    
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
    
  • Step3: Generate instances_training.json and instances_val.json (optional). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── IIIT-ILST
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_val.json (optional)
    │   └── instances_training.json
    

VinText

  • Step1: Download vintext.zip to vintext

    mkdir vintext && cd vintext
    
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
    
    # Extract images and annotations
    unzip -q vintext.zip && rm vintext.zip
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    
    # Rename files
    mv labels annotations && mv test_image test && mv train_images  training && mv unseen_test_images  unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
    
  • Step2: Generate instances_training.json, instances_test.json and instances_unseen_test.json

    python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── vintext
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_test.json
    │   ├── instances_unseen_test.json
    │   └── instances_training.json
    

BID

  • Step1: Download BID Dataset.zip

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\ Dataset.zip BID_Dataset.zip
    
    # Unzip and Rename
    unzip -q BID_Dataset.zip && rm BID_Dataset.zip
    mv BID\ Dataset BID
    
    # The BID dataset has a problem of permission, and you may
    # add permission for this file
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    
    # For images and annotations
    mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
    mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
    mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
    mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
    mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
    mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
    mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
    mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
    
    # Remove unnecessary files
    rm -rf desktop.ini
    
  • Step3: Generate instances_training.json and instances_val.json (optional). Since the original dataset doesn't have a validation set, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are held out as the validation set.

    python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── BID
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

RCTW

  • Step1: Download train_images.zip.001, train_images.zip.002, and train_gts.zip from the homepage, extract the zips to rctw/imgs and rctw/annotations, respectively.

  • Step2: Generate instances_training.json and instances_val.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── rctw
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

HierText

  • Step1 (optional): Install AWS CLI.

  • Step2: Clone HierText repo to get annotations

    mkdir HierText
    git clone https://github.com/google-research-datasets/hiertext.git
    
  • Step3: Download train.tgz, validation.tgz from aws

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
    
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.jsonl.gz
    gzip -d annotations/validation.jsonl.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz
    tar -xzvf imgs/validation.tgz
    
  • Step5: Generate instances_training.json and instances_val.json. HierText includes different levels of annotation, from paragraph and line to word. Check the original paper for details. E.g., set --level paragraph to get paragraph-level annotations, --level line for line-level annotations, or --level word for word-level annotations.

    # Collect word annotation from HierText  --level word
    python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── HierText
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json
    

ArT

  • Step1: Download train_images.tar.gz, and train_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    
    # Download ArT dataset
    wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate
    wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate
    
    # Extract
    tar -xf train_images.tar.gz
    mv train_images imgs
    mv train_labels.json annotations/
    
    # Remove unnecessary files
    rm train_images.tar.gz
    
  • Step2: Generate instances_training.json and instances_val.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

    # Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
    python tools/data/textdet/art_converter.py PATH/TO/art --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    │── art
    │   ├── annotations
    │   ├── imgs
    │   ├── instances_training.json
    │   └── instances_val.json (optional)
    

Text Recognition

Note

This page is a manual preparation guide for datasets that are not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.

Overview

| Dataset | Images | Annotation File (training) | Annotation File (test) |
| --- | --- | --- | --- |
| coco_text | homepage | train_labels.json | - |
| ICDAR2011 | homepage | - | - |
| SynthAdd | SynthText_Add.zip (code: 627x) | train_labels.json | - |
| OpenVINO | Open Images | annotations | annotations |
| DeText | homepage | - | - |
| Lecture Video DB | homepage | - | - |
| LSVT | homepage | - | - |
| IMGUR | homepage | - | - |
| KAIST | homepage | - | - |
| MTWI | homepage | - | - |
| ReCTS | homepage | - | - |
| IIIT-ILST | homepage | - | - |
| VinText | homepage | - | - |
| BID | homepage | - | - |
| RCTW | homepage | - | - |
| HierText | homepage | - | - |
| ArT | homepage | - | - |

(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.

Install AWS CLI (optional)

  • Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      unzip awscliv2.zip
      sudo ./aws/install
      ./aws/install -i /usr/local/aws-cli -b /usr/local/bin
      aws configure
      # this command will require you to input keys, you can skip them except
      # for the Default region name
      # AWS Access Key ID [None]:
      # AWS Secret Access Key [None]:
      # Default region name [None]: us-east-1
      # Default output format [None]
    

For users in China, these datasets can also be downloaded from OpenDataLab at high speed.

ICDAR 2011 (Born-Digital Images)

  • Step1: Download Challenge1_Training_Task3_Images_GT.zip, Challenge1_Test_Task3_Images.zip, and Challenge1_Test_Task3_GT.txt from homepage Task 1.3: Word Recognition (2013 edition).

    mkdir icdar2011 && cd icdar2011
    mkdir annotations
    
    # Download ICDAR 2011
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate
    
    # For images
    mkdir crops
    unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train
    unzip -q Challenge1_Test_Task3_Images.zip -d crops/test
    
    # For annotations
    mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
    
  • Step2: Convert original annotations to train_labels.json and test_labels.json with the following command:

    python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
    
  • After running the above codes, the directory structure should be as follows:

    ├── icdar2011
    │   ├── crops
    │   ├── train_labels.json
    │   └── test_labels.json
    

coco_text

  • Step1: Download from homepage

  • Step2: Download train_labels.json

  • After running the above codes, the directory structure should be as follows:

    ├── coco_text
    │   ├── train_labels.json
    │   └── train_words
    

SynthAdd

  • Step1: Download SynthText_Add.zip from SynthAdd (code: 627x)

  • Step2: Download train_labels.json

  • Step3:

    mkdir SynthAdd && cd SynthAdd
    
    mv /path/to/SynthText_Add.zip .
    
    unzip SynthText_Add.zip
    
    mv /path/to/train_labels.json .
    
    # create soft link
    cd /path/to/mmocr/data/recog
    
    ln -s /path/to/SynthAdd SynthAdd
    
  • After running the above codes, the directory structure should be as follows:

    ├── SynthAdd
    │   ├── train_labels.json
    │   └── SynthText_Add
    

OpenVINO

  • Step1 (optional): Install AWS CLI.

  • Step2: Download Open Images subsets train_1, train_2, train_5, train_f, and validation to openvino/.

    mkdir openvino && cd openvino
    
    # Download Open Images subsets
    for s in 1 2 5 f; do
      aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz .
    done
    aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz .
    
    # Download annotations
    for s in 1 2 5 f; do
      wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json
    done
    wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json
    
    # Extract images
    mkdir -p openimages_v5/val
    for s in 1 2 5 f; do
      tar zxf train_${s}.tar.gz -C openimages_v5
    done
    tar zxf validation.tar.gz -C openimages_v5/val
    
  • Step3: Generate train_{1,2,5,f}_labels.json, val_labels.json and crop images using 4 processes with the following command:

    python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── OpenVINO
    │   ├── image_1
    │   ├── image_2
    │   ├── image_5
    │   ├── image_f
    │   ├── image_val
    │   ├── train_1_labels.json
    │   ├── train_2_labels.json
    │   ├── train_5_labels.json
    │   ├── train_f_labels.json
    │   └── val_labels.json
    

DeText

  • Step1: Download ch9_training_images.zip, ch9_training_localization_transcription_gt.zip, ch9_validation_images.zip, and ch9_validation_localization_transcription_gt.zip from Task 3: End to End on the homepage.

    mkdir detext && cd detext
    mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val
    
    # Download DeText
    wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate
    wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate
    
    # Extract images and annotations
    unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val
    
    # Remove zips
    rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
    
  • Step2: Generate train_labels.json and test_labels.json with following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/detext/ignores
    python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── detext
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── test_labels.json
    

NAF

  • Step1: Download labeled_images.tar.gz to naf/.

    mkdir naf && cd naf
    
    # Download NAF dataset
    wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz
    tar -zxf labeled_images.tar.gz
    
    # For images
    mkdir annotations && mv labeled_images imgs
    
    # For annotations
    git clone https://github.com/herobd/NAF_dataset.git
    mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/
    
    rm -rf NAF_dataset && rm labeled_images.tar.gz
    
  • Step2: Generate train_labels.json, val_labels.json, and test_labels.json with following command:

    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/naf/ignores
    python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── naf
    │   ├── crops
    │   ├── train_labels.json
    │   ├── val_labels.json
    │   └── test_labels.json
    

Lecture Video DB

Warning

This section is not fully tested yet.

Note

The LV dataset has already provided cropped images and the corresponding annotations

  • Step1: Download IIIT-CVid.zip to lv/.

    mkdir lv && cd lv
    
    # Download LV dataset
    wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip
    unzip -q IIIT-CVid.zip
    
    # For image
    mv IIIT-CVid/Crops ./
    
    # For annotation
    mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json
    
    rm IIIT-CVid.zip
    
  • Step2: Generate train_labels.json, val.json, and test.json with the following command:

    python tools/dataset_converters/textrecog/lv_converter.py PATH/TO/lv
    
  • After running the above codes, the directory structure should be as follows:

    ├── lv
    │   ├── Crops
    │   ├── train_labels.json
    │   └── test_labels.json
    

LSVT

Warning

This section is not fully tested yet.

  • Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to lsvt/.

    mkdir lsvt && cd lsvt
    
    # Download LSVT dataset
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json
    
    mkdir annotations
    tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz
    mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/
    mv train_full_images_0 imgs
    
    rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
    
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of LSVT test split is not publicly available, split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/lsvt/ignores
    python tools/dataset_converters/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
    
  • After running the above codes, the directory structure should be as follows:

    ├── lsvt
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

IMGUR

Warning

This section is not fully tested yet.

  • Step1: Run download_imgur5k.py to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.

    mkdir imgur && cd imgur
    
    git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git
    
    # Download images from imgur.com. This may take SEVERAL HOURS!
    python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs
    
    # For annotations
    mkdir annotations
    mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations
    
    rm -rf IMGUR5K-Handwriting-Dataset
    
  • Step2: Generate train_labels.json, val_label.json and test_labels.json and crop images with the following command:

    python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
    
  • After running the above codes, the directory structure should be as follows:

    ├── imgur
    │   ├── crops
    │   ├── train_labels.json
    │   ├── test_labels.json
    │   └── val_label.json
    

KAIST

Warning

This section is not fully tested yet.

  • Step1: Download KAIST_all.zip to kaist/.

    mkdir kaist && cd kaist
    mkdir imgs && mkdir annotations
    
    # Download KAIST dataset
    wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip
    unzip -q KAIST_all.zip && rm KAIST_all.zip
    
  • Step2: Extract zips:

    python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
    
  • Step3: Generate train_labels.json and val_label.json (optional) with following command:

    # Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/kaist/ignores
    python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── kaist
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

MTWI

Warning

This section is not fully tested yet.

  • Step1: Download mtwi_2018_train.zip from the homepage.

    mkdir mtwi && cd mtwi
    
    unzip -q mtwi_2018_train.zip
    mv image_train imgs && mv txt_train annotations
    
    rm mtwi_2018_train.zip
    
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of the MTWI test split are not publicly available; split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/mtwi/ignores
    python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── mtwi
    │   ├── crops
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

ReCTS

Warning

This section is not fully tested yet.

  • Step1: Download ReCTS.zip to rects/ from the homepage.

    mkdir rects && cd rects
    
    # Download ReCTS dataset
    # You can also find Google Drive link on the dataset homepage
    wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate
    unzip -q ReCTS.zip
    
    mv img imgs && mv gt_unicode annotations
    
    rm ReCTS.zip -f && rm -rf gt
    
  • Step2: Generate train_labels.json and val_label.json (optional) with the following command:

    # Annotations of the ReCTS test split are not publicly available; split a validation
    # set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise
    # vertical images will be filtered and stored in PATH/TO/rects/ignores
    python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── rects
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

ILST

Warning

This section is not fully tested yet.

  • Step1: Download IIIT-ILST.zip from the OneDrive link.

  • Step2: Run the following commands

    unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip
    cd IIIT-ILST
    
    # rename files
    cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd ..
    cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd ..
    cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd ..
    
    # transfer image path
    mkdir imgs && mkdir annotations
    mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/
    mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/
    mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/
    
    # remove unnecessary files
    rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
    
  • Step3: Generate train_labels.json and val_label.json (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── IIIT-ILST
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

VinText

Warning

This section is not fully tested yet.

  • Step1: Download vintext.zip to vintext/.

    mkdir vintext && cd vintext
    
    # Download dataset from google drive
    wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
    
    # Extract images and annotations
    unzip -q vintext.zip && rm vintext.zip
    mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
    rm -rf vietnamese
    
    # Rename files
    mv labels annotations && mv test_image test && mv train_images  training && mv unseen_test_images  unseen_test
    mkdir imgs
    mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
    
  • Step2: Generate train_labels.json, test_labels.json, unseen_test_labels.json, and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts).

    python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── vintext
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   ├── test_labels.json
    │   └── unseen_test_labels.json
    

BID

Warning

This section is not fully tested yet.

  • Step1: Download BID Dataset.zip

  • Step2: Run the following commands to preprocess the dataset

    # Rename
    mv BID\ Dataset.zip BID_Dataset.zip
    
    # Unzip and Rename
    unzip -q BID_Dataset.zip && rm BID_Dataset.zip
    mv BID\ Dataset BID
    
    # The BID dataset has a file permission issue; you may need to
    # grant permissions before processing
    chmod -R 777 BID
    cd BID
    mkdir imgs && mkdir annotations
    
    # For images and annotations
    mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta
    mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente
    mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso
    mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente
    mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso
    mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto
    mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente
    mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso
    
    # Remove unnecessary files
    rm -rf desktop.ini
    
  • Step3: Generate train_labels.json and val_label.json (optional) and crop images using 4 processes with the following command (add --preserve-vertical if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    python tools/dataset_converters/textrecog/bid_converter.py PATH/TO/BID --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    ├── BID
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

RCTW

Warning

This section is not fully tested yet.

  • Step1: Download train_images.zip.001, train_images.zip.002, and train_gts.zip from the homepage, and extract them to rctw/imgs and rctw/annotations, respectively.

  • Step2: Generate train_labels.json and val_label.json (optional). Since the original dataset doesn’t have a validation set, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    # Annotations of the RCTW test split are not publicly available; split a validation set by adding --val-ratio 0.2
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
    python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    │── rctw
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

HierText

Warning

This section is not fully tested yet.

  • Step1 (optional): Install AWS CLI.

  • Step2: Clone the HierText repo to get the annotations

    mkdir HierText
    git clone https://github.com/google-research-datasets/hiertext.git
    
  • Step3: Download train.tgz and validation.tgz from AWS

    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
    aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
    
  • Step4: Process raw data

    # process annotations
    mv hiertext/gt ./
    rm -rf hiertext
    mv gt annotations
    gzip -d annotations/train.json.gz
    gzip -d annotations/validation.json.gz
    # process images
    mkdir imgs
    mv train.tgz imgs/
    mv validation.tgz imgs/
    tar -xzvf imgs/train.tgz
    tar -xzvf imgs/validation.tgz
    
  • Step5: Generate train_labels.json and val_label.json. HierText provides annotations at three levels: paragraph, line, and word (check the original paper for details). Choose the level with --level, e.g. --level paragraph for paragraph-level, --level line for line-level, or --level word for word-level annotations.

    # Collect word-level annotations from HierText by setting --level word
    # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
    python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
    
  • After running the above commands, the directory structure should be as follows:

    │── HierText
    │   ├── crops
    │   ├── ignores
    │   ├── train_labels.json
    │   └── val_label.json
    

ArT

Warning

This section is not fully tested yet.

  • Step1: Download train_task2_images.tar.gz and train_task2_labels.json from the homepage to art/

    mkdir art && cd art
    mkdir annotations
    
    # Download ArT dataset
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
    wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json
    
    # Extract
    tar -xf train_task2_images.tar.gz
    mv train_task2_images crops
    mv train_task2_labels.json annotations/
    
    # Remove unnecessary files
    rm train_task2_images.tar.gz
    
  • Step2: Generate train_labels.json and val_label.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. For example, with --val-ratio 0.2, 20% of the data is held out as the validation set.

    # Annotations of the ArT test split are not publicly available; split a validation set by adding --val-ratio 0.2
    python tools/dataset_converters/textrecog/art_converter.py PATH/TO/art
    
  • After running the above commands, the directory structure should be as follows:

    │── art
    │   ├── crops
    │   ├── train_labels.json
    │   └── val_label.json (optional)
    

Key Information Extraction

Note

This page is a manual preparation guide for datasets not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.

Overview

The key information extraction dataset directory is organized as follows.

└── wildreceipt
  ├── class_list.txt
  ├── dict.txt
  ├── image_files
  ├── openset_train.txt
  ├── openset_test.txt
  ├── test.txt
  └── train.txt

Preparation Steps

WildReceipt

WildReceiptOpenset

  • Step0: Have WildReceipt prepared.

  • Step1: Convert annotation files to OpenSet format:

# You may find more available arguments by running
# python tools/data/kie/closeset_to_openset.py -h
python tools/data/kie/closeset_to_openset.py data/wildreceipt/train.txt data/wildreceipt/openset_train.txt
python tools/data/kie/closeset_to_openset.py data/wildreceipt/test.txt data/wildreceipt/openset_test.txt

Note

You can learn more about the key differences between CloseSet and OpenSet annotations in our tutorial.

Overview

Weights

Here is the list of weights available for inference.

For ease of reference, some weights have shorter aliases, which are separated by / in the table. For example, “DB_r18 / dbnet_resnet18_fpnc_1200e_icdar2015” means that you can use either DB_r18 or dbnet_resnet18_fpnc_1200e_icdar2015 to initialize the Inferencer:

>>> from mmocr.apis import TextDetInferencer
>>> inferencer = TextDetInferencer(model='DB_r18')
>>> # equivalent to
>>> inferencer = TextDetInferencer(model='dbnet_resnet18_fpnc_1200e_icdar2015')
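The aliases also work with the end-to-end MMOCRInferencer, which chains a detector and a recognizer. A minimal sketch, assuming the demo image bundled with the MMOCR repository (any image path works):

>>> from mmocr.apis import MMOCRInferencer
>>> # Combine a detection alias and a recognition alias from the tables below
>>> ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
>>> # 'demo/demo_text_ocr.jpg' refers to the sample image in the MMOCR repo;
>>> # substitute any image of your own
>>> result = ocr('demo/demo_text_ocr.jpg')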

Text Detection

Model README ICDAR2015 (hmean-iou) CTW1500 (hmean-iou) Totaltext (hmean-iou)
DB_r18 / dbnet_resnet18_fpnc_1200e_icdar2015 link 0.8169 - -
dbnet_resnet50_fpnc_1200e_icdar2015 link 0.8504 - -
dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015 link 0.8543 - -
DB_r50 / DBNet / dbnet_resnet50-oclip_fpnc_1200e_icdar2015 link 0.8644 - -
dbnet_resnet18_fpnc_1200e_totaltext link - - 0.8182
DBPP_r50 / dbnetpp_resnet50_fpnc_1200e_icdar2015 link 0.8622 - -
dbnetpp_resnet50-dcnv2_fpnc_1200e_icdar2015 link 0.8684 - -
DBNetpp / dbnetpp_resnet50-oclip_fpnc_1200e_icdar2015 link 0.8882 - -
MaskRCNN_CTW / mask-rcnn_resnet50_fpn_160e_ctw1500 link - 0.7458 -
mask-rcnn_resnet50-oclip_fpn_160e_ctw1500 link - 0.7562 -
MaskRCNN_IC15 / mask-rcnn_resnet50_fpn_160e_icdar2015 link 0.8182 - -
MaskRCNN / mask-rcnn_resnet50-oclip_fpn_160e_icdar2015 link 0.8513 - -
DRRG / drrg_resnet50_fpn-unet_1200e_ctw1500 link - 0.8467 -
FCE_CTW_DCNv2 / fcenet_resnet50-dcnv2_fpn_1500e_ctw1500 link - 0.8488 -
fcenet_resnet50-oclip_fpn_1500e_ctw1500 link - 0.8192 -
FCE_IC15 / fcenet_resnet50_fpn_1500e_icdar2015 link 0.8528 - -
FCENet / fcenet_resnet50-oclip_fpn_1500e_icdar2015 link 0.8604 - -
fcenet_resnet50_fpn_1500e_totaltext link - - 0.8134
PANet_CTW / panet_resnet18_fpem-ffm_600e_ctw1500 link - 0.777 -
PANet_IC15 / panet_resnet18_fpem-ffm_600e_icdar2015 link 0.7848 - -
PS_CTW / psenet_resnet50_fpnf_600e_ctw1500 link - 0.7793 -
psenet_resnet50-oclip_fpnf_600e_ctw1500 link - 0.8037 -
PS_IC15 / psenet_resnet50_fpnf_600e_icdar2015 link 0.7998 - -
PSENet / psenet_resnet50-oclip_fpnf_600e_icdar2015 link 0.8478 - -
textsnake_resnet50_fpn-unet_1200e_ctw1500 link - 0.8286 -
TextSnake / textsnake_resnet50-oclip_fpn-unet_1200e_ctw1500 link - 0.8529 -

Text Recognition

Note

Avg is the average on IIIT5K, SVT, ICDAR2013, ICDAR2015, SVTP, CT80.

Model README Avg (word_acc) IIIT5K (word_acc) SVT (word_acc) ICDAR2013 (word_acc) ICDAR2015 (word_acc) SVTP (word_acc) CT80 (word_acc)
ABINet_Vision / abinet-vision_20e_st-an_mj link 0.88 0.95 0.91 0.94 0.79 0.84 0.84
ABINet / abinet_20e_st-an_mj link 0.91 0.96 0.94 0.95 0.81 0.89 0.88
ASTER / aster_resnet45_6e_st_mj link 0.86 0.94 0.89 0.93 0.77 0.81 0.85
CRNN / crnn_mini-vgg_5e_mj link 0.70 0.81 0.81 0.87 0.56 0.61 0.57
MASTER / master_resnet31_12e_st_mj_sa link 0.88 0.95 0.90 0.95 0.76 0.85 0.89
nrtr_modality-transform_6e_st_mj link 0.83 0.92 0.88 0.94 0.72 0.78 0.75
NRTR / NRTR_1/8-1/4 / nrtr_resnet31-1by8-1by4_6e_st_mj link 0.87 0.95 0.88 0.95 0.76 0.80 0.89
NRTR_1/16-1/8 / nrtr_resnet31-1by16-1by8_6e_st_mj link 0.87 0.95 0.90 0.94 0.74 0.80 0.89
svtr-small / svtr-small_20e_st_mj link 0.86 0.86 0.90 0.94 0.75 0.85 0.89
svtr-base / svtr-base_20e_st_mj link 0.87 0.86 0.92 0.94 0.74 0.84 0.90
RobustScanner / robustscanner_resnet31_5e_st-sub_mj-sub_sa_real link 0.87 0.95 0.89 0.93 0.76 0.81 0.87
SAR / sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real link 0.88 0.95 0.88 0.94 0.76 0.83 0.90
sar_resnet31_sequential-decoder_5e_st-sub_mj-sub_sa_real link 0.87 0.96 0.87 0.94 0.77 0.81 0.89
SATRN / satrn_shallow_5e_st_mj link 0.90 0.96 0.92 0.96 0.80 0.88 0.90
SATRN_sm / satrn_shallow-small_5e_st_mj link 0.88 0.94 0.90 0.96 0.79 0.86 0.85
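Recognition weights can be loaded the same way through TextRecInferencer, using either the alias or the full model name from the table above:

>>> from mmocr.apis import TextRecInferencer
>>> inferencer = TextRecInferencer(model='SATRN')
>>> # equivalent to
>>> inferencer = TextRecInferencer(model='satrn_shallow_5e_st_mj')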

Key Information Extraction

Model README wildreceipt (macro_f1)
SDMGR / sdmgr_unet16_60e_wildreceipt link 0.89
sdmgr_novisual_60e_wildreceipt link 0.87
sdmgr_novisual_60e_wildreceipt_openset link 0.93
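The KIE weights can likewise be initialized by alias through KIEInferencer. Note that, unlike the detection and recognition inferencers, SDMGR-style models consume inputs that already carry detected boxes and recognized texts, so the instantiation below is only a sketch; see the inferencer documentation for the expected input format.

>>> from mmocr.apis import KIEInferencer
>>> # Initialize with the alias from the table above
>>> kie = KIEInferencer(model='SDMGR')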

Statistics

  • Number of checkpoints: 48

  • Number of configs: 49

  • Number of papers: 19

    • ALGORITHM: 19

Key Information Extraction Models

SOTA Models

Here are some selected project implementations that are not yet included in the MMOCR package but are ready to use.

ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network

This is an implementation of ABCNet based on MMOCR, MMCV, and MMEngine.

ABCNet is a conceptually novel, efficient, and fully convolutional framework for text spotting, which address the problem by proposing the Adaptive Bezier-Curve Network (ABCNet). Our contributions are three-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance with arbitrary shapes, significantly improving the precision compared with previous methods. 3) Compared with standard bounding box detection, our Bezier curve detection introduces negligible computation overhead, resulting in superiority of our method in both efficiency and accuracy. Experiments on arbitrarily-shaped benchmark datasets, namely Total-Text and CTW1500, demonstrate that ABCNet achieves state-of-the-art accuracy, meanwhile significantly improving the speed. In particular, on Total-Text, our realtime version is over 10 times faster than recent state-of-the-art methods with a competitive recognition accuracy.

Status

Inference

Train

README

️✔

link

ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting

This is an implementation of ABCNetV2 based on MMOCR, MMCV, and MMEngine.

ABCNetV2 contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve, which, compared with segmentation-based methods, can not only provide structured output but also controllable representation. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance of arbitrary shapes, significantly improving the precision of recognition over previous methods. 3) Different from previous methods, which often suffer from complex post-processing and sensitive hyper-parameters, our ABCNet v2 maintains a simple pipeline with the only post-processing non-maximum suppression (NMS). 4) As the performance of text recognition closely depends on feature alignment, ABCNet v2 further adopts a simple yet effective coordinate convolution to encode the position of the convolutional filters, which leads to a considerable improvement with negligible computation overhead. Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 can achieve state-of-the-art performance while maintaining very high efficiency.

Status

Inference

Train

README

️✔

link

SPTS: Single-Point Text Spotting

This is an implementation of SPTS based on MMOCR, MMCV, and MMEngine.

Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single-point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, which can achieve state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the positions of the point annotation, meaning that it can be much easier to be annotated or even be automatically generated than the bounding box that requires precise positions. We believe that such a pioneer attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible.

Status

Inference

Train

README

️✔

link

BackBones

oCLIP

Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting

Abstract

Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot well cope with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e. images and captured texts in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction among textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend texts in images well with character awareness. Besides, these designs enable the learning from weakly annotated texts (i.e. partial texts in images without text bounding boxes) which mitigates the data annotation constraint greatly. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks, respectively. In addition, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500).

Models

Backbone Pre-train Data Model
ResNet-50 SynthText Link

Note

The model is converted from the official oCLIP.

Supported Text Detection Models

DBNet DBNet++ FCENet TextSnake PSENet DRRG Mask R-CNN
ICDAR2015
CTW1500

Citation

@article{xue2022language,
  title={Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting},
  author={Xue, Chuhui and Zhang, Wenqing and Hao, Yu and Lu, Shijian and Torr, Philip and Bai, Song},
  journal={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2022}
}

Text Detection Models

DBNet

Real-time Scene Text Detection with Differentiable Binarization

Abstract

Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset.

Results and models

SynthText
Method Backbone Training set ##iters Download
DBNet_r18 ResNet18 SynthText 100,000 model | log
ICDAR2015
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
DBNet_r18 ResNet18 - ICDAR2015 Train ICDAR2015 Test 1200 736 0.8853 0.7583 0.8169 model | log
DBNet_r50 ResNet50 - ICDAR2015 Train ICDAR2015 Test 1200 1024 0.8744 0.8276 0.8504 model | log
DBNet_r50dcn ResNet50-DCN Synthtext ICDAR2015 Train ICDAR2015 Test 1200 1024 0.8784 0.8315 0.8543 model | log
DBNet_r50-oclip ResNet50-oCLIP - ICDAR2015 Train ICDAR2015 Test 1200 1024 0.9052 0.8272 0.8644 model | log
Total Text
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
DBNet_r18 ResNet18 - Totaltext Train Totaltext Test 1200 736 0.8640 0.7770 0.8182 model | log

Citation

@article{Liao_Wan_Yao_Chen_Bai_2020,
    title={Real-Time Scene Text Detection with Differentiable Binarization},
    journal={Proceedings of the AAAI Conference on Artificial Intelligence},
    author={Liao, Minghui and Wan, Zhaoyi and Yao, Cong and Chen, Kai and Bai, Xiang},
    year={2020},
    pages={11474-11481}}

DBNetpp

Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion

Abstract

Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.

Results and models

SynthText
Method BackBone Training set ##iters Download
DBNetpp_r50dcn ResNet50-dcnv2 SynthText 100,000 model | log
ICDAR2015
Method BackBone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
DBNetpp_r50 ResNet50 - ICDAR2015 Train ICDAR2015 Test 1200 1024 0.9079 0.8209 0.8622 model | log
DBNetpp_r50dcn ResNet50-dcnv2 Synthtext (model) ICDAR2015 Train ICDAR2015 Test 1200 1024 0.9116 0.8291 0.8684 model | log
DBNetpp_r50-oclip ResNet50-oCLIP - ICDAR2015 Train ICDAR2015 Test 1200 1024 0.9174 0.8609 0.8882 model | log

Citation

@article{liao2022real,
    title={Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion},
    author={Liao, Minghui and Zou, Zhisheng and Wan, Zhaoyi and Yao, Cong and Bai, Xiang},
    journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
    year={2022},
    publisher={IEEE}
}

DRRG

Deep relational reasoning graph network for arbitrary shape text detection

Abstract

Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.

Results and models

CTW1500
Method BackBone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
DRRG ResNet50 - CTW1500 Train CTW1500 Test 1200 640 0.8775 0.8179 0.8467 model | log
DRRG_r50-oclip ResNet50-oCLIP - CTW1500 Train CTW1500 Test 1200 model | log

Citation

@article{zhang2020drrg,
  title={Deep relational reasoning graph network for arbitrary shape text detection},
  author={Zhang, Shi-Xue and Zhu, Xiaobin and Hou, Jie-Bo and Liu, Chang and Yang, Chun and Wang, Hongfa and Yin, Xu-Cheng},
  booktitle={CVPR},
  pages={9699-9708},
  year={2020}
}

FCENet

Fourier Contour Embedding for Arbitrary-Shaped Text Detection

Abstract

One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.

Results and models

CTW1500
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
FCENet_r50dcn ResNet50 + DCNv2 - CTW1500 Train CTW1500 Test 1500 (736, 1080) 0.8689 0.8296 0.8488 model | log
FCENet_r50-oclip ResNet50-oCLIP - CTW1500 Train CTW1500 Test 1500 (736, 1080) 0.8383 0.801 0.8192 model | log
ICDAR2015
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
FCENet_r50 ResNet50 - IC15 Train IC15 Test 1500 (2260, 2260) 0.8243 0.8834 0.8528 model | log
FCENet_r50-oclip ResNet50-oCLIP - IC15 Train IC15 Test 1500 (2260, 2260) 0.9176 0.8098 0.8604 model | log
Total Text
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
FCENet_r50 ResNet50 - Totaltext Train Totaltext Test 1500 (1280, 960) 0.8485 0.7810 0.8134 model | log

Citation

@InProceedings{zhu2021fourier,
      title={Fourier Contour Embedding for Arbitrary-Shaped Text Detection},
      author={Yiqin Zhu and Jianyong Chen and Lingyu Liang and Zhanghui Kuang and Lianwen Jin and Wayne Zhang},
      year={2021},
      booktitle = {CVPR}
      }

Mask R-CNN

Mask R-CNN

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

Results and models

CTW1500
Method BackBone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
MaskRCNN - - CTW1500 Train CTW1500 Test 160 1600 0.7165 0.7776 0.7458 model | log
MaskRCNN_r50-oclip ResNet50-oCLIP - CTW1500 Train CTW1500 Test 160 1600 0.753 0.7593 0.7562 model | log
ICDAR2015
Method BackBone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
MaskRCNN ResNet50 - ICDAR2015 Train ICDAR2015 Test 160 1920 0.8644 0.7766 0.8182 model | log
MaskRCNN_r50-oclip ResNet50-oCLIP - ICDAR2015 Train ICDAR2015 Test 160 1920 0.8695 0.8339 0.8513 model | log

Citation

@INPROCEEDINGS{8237584,
  author={K. {He} and G. {Gkioxari} and P. {Dollár} and R. {Girshick}},
  booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
  title={Mask R-CNN},
  year={2017},
  pages={2980-2988},
  doi={10.1109/ICCV.2017.322}}

PANet

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

Abstract

Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical this http URL this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Results and models

CTW1500
Method Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
PANet ImageNet CTW1500 Train CTW1500 Test 600 640 0.8208 0.7376 0.7770 model | log
ICDAR2015
Method Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
PANet ImageNet ICDAR2015 Train ICDAR2015 Test 600 736 0.8455 0.7323 0.7848 model | log

Citation

@inproceedings{WangXSZWLYS19,
  author={Wenhai Wang and Enze Xie and Xiaoge Song and Yuhang Zang and Wenjia Wang and Tong Lu and Gang Yu and Chunhua Shen},
  title={Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network},
  booktitle={ICCV},
  pages={8439--8448},
  year={2019}
  }

PSENet

Shape robust text detection with progressive scale expansion network

Abstract

Scene text detection has witnessed rapid progress especially with the recent development of convolutional neural networks. However, there still exists two challenges which prevent the algorithm into industry applications. On the one hand, most of the state-of-art algorithms require quadrangle bounding box which is in-accurate to locate the texts with arbitrary shape. On the other hand, two text instances which are close to each other may lead to a false detection which covers both instances. Traditionally, the segmentation-based approach can relieve the first problem but usually fail to solve the second challenge. To address these two challenges, in this paper, we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates the different scale of kernels for each text instance, and gradually expands the minimal scale kernel to the text instance with the complete shape. Due to the fact that there are large geometrical margins among the minimal scale kernels, our method is effective to split the close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curve texts, PSENet achieves a F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-art algorithms by 6.6%. The code will be released in the future.

Results and models

CTW1500
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
PSENet ResNet50 - CTW1500 Train CTW1500 Test 600 1280 0.7705 0.7883 0.7793 model | log
PSENet_r50-oclip ResNet50-oCLIP - CTW1500 Train CTW1500 Test 600 1280 0.8483 0.7636 0.8037 model | log
ICDAR2015
Method Backbone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
PSENet ResNet50 - IC15 Train IC15 Test 600 2240 0.8396 0.7636 0.7998 model | log
PSENet_r50-oclip ResNet50-oCLIP - IC15 Train IC15 Test 600 2240 0.8895 0.8098 0.8478 model | log

Citation

@inproceedings{wang2019shape,
  title={Shape robust text detection with progressive scale expansion network},
  author={Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={9336--9345},
  year={2019}
}

Textsnake

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Abstract

Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed as TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.

Results and models

CTW1500
Method BackBone Pretrained Model Training set Test set ##epochs Test size Precision Recall Hmean Download
TextSnake ResNet50 - CTW1500 Train CTW1500 Test 1200 736 0.8535 0.8052 0.8286 model | log
TextSnake_r50-oclip ResNet50-oCLIP - CTW1500 Train CTW1500 Test 1200 736 0.8869 0.8215 0.8529 model | log

Citation

@article{long2018textsnake,
  title={TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes},
  author={Long, Shangbang and Ruan, Jiaqiang and Zhang, Wenjie and He, Xin and Wu, Wenhao and Yao, Cong},
  booktitle={ECCV},
  pages={20-36},
  year={2018}
}

Text Recognition Models

ABINet

Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

Abstract

Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicitly language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous suggests to block gradient flow between vision and language models to enforce explicitly language modeling. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for language model which can effectively alleviate the impact of noise input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition.

Dataset

Train Dataset
trainset instance_num repeat_num note
Syn90k 8919273 1 synth
SynthText 7239272 1 alphanumeric
Test Dataset
testset instance_num note
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and models

methods pretrained Regular Text Irregular Text download
IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
ABINet-Vision - 0.9523 0.9196 0.9369 0.7896 0.8403 0.8437 model | log
ABINet-Vision-TTA - 0.9523 0.9196 0.9360 0.8175 0.8450 0.8542
ABINet Pretrained 0.9603 0.9397 0.9557 0.8146 0.8868 0.8785 model | log
ABINet-TTA Pretrained 0.9597 0.9397 0.9527 0.8426 0.8930 0.8854

Note

  1. ABINet allows its encoder to run and be trained without decoder and fuser. Its encoder is designed to recognize texts as a stand-alone model and therefore can work as an independent text recognizer. We release it as ABINet-Vision.

  2. Facts about the pretrained model: MMOCR does not have a systematic pipeline to pretrain the language model (LM) yet, thus the weights of LM are converted from the official pretrained model. The weights of ABINet-Vision are directly used as the vision model of ABINet.

Citation

@article{fang2021read,
  title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
  author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2021}
}

ASTER

ASTER: An Attentional Scene Text Recognizer with Flexible Rectification

Abstract

A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

Dataset

Train Dataset
trainset instance_num repeat_num note
Syn90k 8919273 1 synth
SynthText 7239272 1 alphanumeric
Test Dataset
testset instance_num note
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and models

Methods Backbone Regular Text Irregular Text download
IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
ASTER ResNet45 0.9357 0.8949 0.9281 0.7665 0.8062 0.8507 model | log
ASTER-TTA ResNet45 0.9337 0.8949 0.9251 0.7925 0.8109 0.8507

Citation

@article{shi2018aster,
  title={Aster: An attentional scene text recognizer with flexible rectification},
  author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  volume={41},
  number={9},
  pages={2035--2048},
  year={2018},
  publisher={IEEE}
}

CRNN

An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition

Abstract

Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

Dataset

Train Dataset
trainset instance_num repeat_num note
Syn90k 8919273 1 synth
Test Dataset
testset instance_num note
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and models

methods Regular Text Irregular Text download
methods IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
CRNN 0.8053 0.7991 0.8739 0.5571 0.6093 0.5694 model | log
CRNN-TTA 0.8013 0.7975 0.8631 0.5763 0.6093 0.5764 model | log

Citation

@article{shi2016end,
  title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
  author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
  journal={IEEE transactions on pattern analysis and machine intelligence},
  year={2016}
}

MASTER

MASTER: Multi-aspect non-local network for scene text recognition

Abstract

Attention-based scene text recognizers have gained huge success, which leverages a more compact intermediate representation to learn 1d- or 2d- attention by a RNN-based encoder-decoder architecture. However, such methods suffer from attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose the MASTER, a self-attention based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention which encodes feature-feature and target-target relationships inside the encoder and decoder and (2) learns a more powerful and robust intermediate representation to spatial distortion, and (3) owns a great training efficiency because of high training parallelization and a high-speed inference because of an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text.

Dataset

Train Dataset
trainset instance_num repeat_num source
SynthText 7266686 1 synth
SynthAdd 1216889 1 synth
Syn90k 8919273 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and Models

Methods Backbone Regular Text Irregular Text download
IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
MASTER R31-GCAModule 0.9490 0.8887 0.9517 0.7650 0.8465 0.8889 model | log
MASTER-TTA R31-GCAModule 0.9450 0.8887 0.9478 0.7906 0.8481 0.8958

Citation

@article{Lu2021MASTER,
  title={MASTER: Multi-Aspect Non-local Network for Scene Text Recognition},
  author={Ning Lu and Wenwen Yu and Xianbiao Qi and Yihao Chen and Ping Gong and Rong Xiao and Xiang Bai},
  journal={Pattern Recognition},
  year={2021}
}

NRTR

NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition

Abstract

Scene text recognition has attracted a great many researches due to its importance to various applications. Existing methods mainly adopt recurrence or convolution based networks. Though have obtained good performance, these methods still suffer from two limitations: slow training speed due to the internal recurrence of RNNs, and high complexity due to stacked convolutional layers for long-term feature extraction. This paper, for the first time, proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that dispenses with recurrences and convolutions entirely. NRTR follows the encoder-decoder paradigm, where the encoder uses stacked self-attention to extract image features, and the decoder applies stacked self-attention to recognize texts based on encoder output. NRTR relies solely on self-attention mechanism thus could be trained with more parallelization and less complexity. Considering scene image has large variation in text and background, we further design a modality-transform block to effectively transform 2D input images to 1D sequences, combined with the encoder to extract more discriminative features. NRTR achieves state-of-the-art or highly competitive performance on both regular and irregular benchmarks, while requires only a small fraction of training time compared to the best model from the literature (at least 8 times faster).

Dataset

Train Dataset
trainset instance_num repeat_num source
SynthText 7266686 1 synth
Syn90k 8919273 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and Models

Methods Backbone Regular Text Irregular Text download
IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
NRTR NRTRModalityTransform 0.9147 0.8841 0.9369 0.7246 0.7783 0.7500 model | log
NRTR-TTA NRTRModalityTransform 0.9123 0.8825 0.9310 0.7492 0.7798 0.7535
NRTR R31-1/8-1/4 0.9483 0.8918 0.9507 0.7578 0.8016 0.8889 model | log
NRTR-TTA R31-1/8-1/4 0.9443 0.8903 0.9478 0.7790 0.8078 0.8854
NRTR R31-1/16-1/8 0.9470 0.8918 0.9399 0.7376 0.7969 0.8854 model | log
NRTR-TTA R31-1/16-1/8 0.9423 0.8903 0.9360 0.7641 0.8016 0.8854

Citation

@inproceedings{sheng2019nrtr,
  title={NRTR: A no-recurrence sequence-to-sequence model for scene text recognition},
  author={Sheng, Fenfen and Chen, Zhineng and Xu, Bo},
  booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
  pages={781--786},
  year={2019},
  organization={IEEE}
}

RobustScanner

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

Abstract

The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.

Dataset

Train Dataset
trainset instance_num repeat_num source
icdar_2011 3567 20 real
icdar_2013 848 20 real
icdar2015 4468 20 real
coco_text 42142 20 real
IIIT5K 2000 20 real
SynthText 2400000 1 synth
SynthAdd 1216889 1 synth, 1.6m in [1]
Syn90k 2400000 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular, 639 in [1]
CT80 288 irregular

Results and Models

Methods GPUs Regular Text Irregular Text download
IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80
RobustScanner 4 0.9510 0.9011 0.9320 0.7578 0.8078 0.8750 model | log
RobustScanner-TTA 4 0.9487 0.9011 0.9261 0.7805 0.8124 0.8819

References

[1] Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI 2019.

Citation

@inproceedings{yue2020robustscanner,
  title={RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition},
  author={Yue, Xiaoyu and Kuang, Zhanghui and Lin, Chenhao and Sun, Hongbin and Zhang, Wayne},
  booktitle={European Conference on Computer Vision},
  year={2020}
}

SAR

Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition

Abstract

Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust and achieves state-of-the-art performance on both regular and irregular scene text recognition benchmarks.

Dataset

Train Dataset
trainset instance_num repeat_num source
icdar_2011 3567 20 real
icdar_2013 848 20 real
icdar2015 4468 20 real
coco_text 42142 20 real
IIIT5K 2000 20 real
SynthText 2400000 1 synth
SynthAdd 1216889 1 synth, 1.6m in [1]
Syn90k 2400000 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular, 639 in [1]
CT80 288 irregular

Results and Models

Methods Backbone Decoder IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80 download
SAR R31-1/8-1/4 ParallelSARDecoder 0.9533 0.8964 0.9369 0.7602 0.8326 0.9062 model | log
SAR-TTA R31-1/8-1/4 ParallelSARDecoder 0.9510 0.8964 0.9340 0.7862 0.8372 0.9132
SAR R31-1/8-1/4 SequentialSARDecoder 0.9553 0.9073 0.9409 0.7761 0.8093 0.8958 model | log
SAR-TTA R31-1/8-1/4 SequentialSARDecoder 0.9530 0.9073 0.9389 0.8002 0.8124 0.9028

Citation

@inproceedings{li2019show,
  title={Show, attend and read: A simple and strong baseline for irregular text recognition},
  author={Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  volume={33},
  number={01},
  pages={8610--8617},
  year={2019}
}

SATRN

On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention

Abstract

Scene text recognition (STR) is the task of recognizing character sequences in natural scenes. While there have been great advances in STR methods, current methods still fail to recognize texts in arbitrary shapes, such as heavily curved or rotated texts, which are abundant in daily life (e.g. restaurant signs, product labels, company logos, etc). This paper introduces a novel architecture to recognizing texts of arbitrary shapes, named Self-Attention Text Recognition Network (SATRN), which is inspired by the Transformer. SATRN utilizes the self-attention mechanism to describe two-dimensional (2D) spatial dependencies of characters in a scene text image. Exploiting the full-graph propagation of self-attention, SATRN can recognize texts with arbitrary arrangements and large inter-character spacing. As a result, SATRN outperforms existing STR models by a large margin of 5.7 pp on average in “irregular text” benchmarks. We provide empirical analyses that illustrate the inner mechanisms and the extent to which the model is applicable (e.g. rotated and multi-line text). We will open-source the code.

Dataset

Train Dataset
trainset instance_num repeat_num source
SynthText 7266686 1 synth
Syn90k 8919273 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and Models

Methods IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80 download
Satrn 0.9600 0.9181 0.9606 0.8045 0.8837 0.8993 model | log
Satrn-TTA 0.9530 0.9181 0.9527 0.8276 0.8884 0.9028
Satrn_small 0.9423 0.9011 0.9567 0.7886 0.8574 0.8472 model | log
Satrn_small-TTA 0.9380 0.8995 0.9488 0.8122 0.8620 0.8507

Citation

@article{junyeop2019recognizing,
  title={On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention},
  author={Junyeop Lee and Sungrae Park and Jeonghun Baek and Seong Joon Oh and Seonghyeon Kim and Hwalsuk Lee},
  year={2019}
}

SVTR

SVTR: Scene Text Recognition with a Single Visual Model

Abstract

Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference.

Dataset

Train Dataset
trainset instance_num repeat_num source
SynthText 7266686 1 synth
Syn90k 8919273 1 synth
Test Dataset
testset instance_num type
IIIT5K 3000 regular
SVT 647 regular
IC13 1015 regular
IC15 2077 irregular
SVTP 645 irregular
CT80 288 irregular

Results and Models

Methods IIIT5K SVT IC13-1015 IC15-2077 SVTP CT80 download
SVTR-tiny - - - - - - -
SVTR-small 0.8553 0.9026 0.9448 0.7496 0.8496 0.8854 model | log
SVTR-small-TTA 0.8397 0.8964 0.9241 0.7597 0.8124 0.8646
SVTR-base 0.8570 0.9181 0.9438 0.7448 0.8388 0.9028 model | log
SVTR-base-TTA 0.8517 0.9011 0.9379 0.7569 0.8279 0.8819
SVTR-large - - - - - - -

Note

The implementation and configuration follow the original code and paper, but there is still a gap between the reproduced results and the official ones. We appreciate any suggestions to improve its performance.

Citation

@inproceedings{ijcai2022p124,
  title     = {SVTR: Scene Text Recognition with a Single Visual Model},
  author    = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
  booktitle = {Proceedings of the Thirty-First International Joint Conference on
               Artificial Intelligence, {IJCAI-22}},
  publisher = {International Joint Conferences on Artificial Intelligence Organization},
  editor    = {Luc De Raedt},
  pages     = {884--890},
  year      = {2022},
  month     = {7},
  note      = {Main Track},
  doi       = {10.24963/ijcai.2022/124},
  url       = {https://doi.org/10.24963/ijcai.2022/124},
}

Key Information Extraction Models

SDMGR

Spatial Dual-Modality Graph Reasoning for Key Information Extraction

Abstract

Key information extraction from document images is of paramount importance in office automation. Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, nodes of which encode both the visual and textual features of detected text regions, and edges of which represent the spatial relations between neighboring text regions. The key information extraction is solved by iteratively propagating messages along graph edges and reasoning the categories of graph nodes. In order to roundly evaluate our proposed method as well as boost the future research, we release a new dataset named WildReceipt, which is collected and annotated tailored for the evaluation of key information extraction from document images of unseen templates in the wild. It contains 25 key information categories, a total of about 69000 text boxes, and is about 2 times larger than the existing public datasets. Extensive experiments validate that all information including visual features, textual features and spatial relations can benefit key information extraction. It has been shown that SDMG-R can effectively extract key information from document images of unseen templates, and obtain new state-of-the-art results on the recent popular benchmark SROIE and our WildReceipt. Our code and dataset will be publicly released.

Results and models

WildReceipt
Method Modality Macro F1-Score Download
sdmgr_unet16 Visual + Textual 0.890 model | log
sdmgr_novisual Textual 0.873 model | log
WildReceiptOpenset
Method Modality Edge F1-Score Node Macro F1-Score Node Micro F1-Score Download
sdmgr_novisual_openset Textual 0.792 0.931 0.940 model | log

Citation

@misc{sun2021spatial,
      title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction},
      author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang},
      year={2021},
      eprint={2103.14470},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Branches

This documentation aims to provide a comprehensive understanding of the purpose and features of each branch in MMOCR.

Branch Overview

1. main

The main branch serves as the default branch for the MMOCR project. It contains the latest stable version of MMOCR, currently housing the code for MMOCR 1.x (e.g. v1.0.0). The main branch ensures users have access to the most recent and reliable version of the software.

2. dev-1.x

The dev-1.x branch is dedicated to the development of the next major version of MMOCR. This branch routinely undergoes reliability tests, and passing commits are squashed into a release and published to the main branch. By having a separate development branch, the project can continue to evolve without impacting the stability of the main branch. All PRs should be merged into the dev-1.x branch.

3. 0.x

The 0.x branch serves as an archive for MMOCR 0.x (e.g. v0.6.3). This branch will no longer actively receive updates or improvements, but it remains accessible for historical reference or for users who have not yet upgraded to MMOCR 1.x.

4. 1.x

It is an alias of the main branch, intended to ensure a smooth transition through the compatibility period. It will be removed in mid-2023.

Note

The branch mapping was changed on 2023.04.06. For the legacy branch mapping and the migration guide, please refer to the branch migration guide.

Contribution Guide

OpenMMLab welcomes everyone who is interested in contributing to our projects and accepts contributions in the form of PRs.

What is PR

PR is the abbreviation of Pull Request. Here is the definition of a PR from GitHub's official documentation.

Pull requests let you tell others about changes you have pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.

Basic Workflow

  1. Get the most recent codebase

  2. Checkout a new branch from dev-1.x branch, depending on the version of the codebase you want to contribute to.

  3. Commit your changes (Don’t forget to use pre-commit hooks!)

  4. Push your changes and create a PR

  5. Discuss and review your code

  6. Merge your branch to dev-1.x branch

Procedures in detail

1. Get the most recent codebase

  • When you work on your first PR

    Fork the OpenMMLab repository: click the fork button at the top right corner of the GitHub page.

    Clone forked repository to local

    git clone git@github.com:XXX/mmocr.git
    

    Add source repository to upstream

    git remote add upstream git@github.com:open-mmlab/mmocr
    
  • After your first PR

    Checkout the latest branch of the local repository and pull the latest branch of the source repository. Here we assume that you are working on the dev-1.x branch.

    git checkout dev-1.x
    git pull upstream dev-1.x
    

2. Checkout a new branch from dev-1.x branch

git checkout -b branchname

Tip

To keep the commit history clean, we strongly recommend that you check out the dev-1.x branch before creating a new branch.

3. Commit your changes

  • If you are a first-time contributor, please install and initialize pre-commit hooks from the repository root directory first.

    pip install -U pre-commit
    pre-commit install
    
  • Commit your changes as usual. Pre-commit hooks will be triggered to stylize your code before each commit.

    # coding
    git add [files]
    git commit -m 'messages'
    

    Note

    Sometimes your code may be changed by pre-commit hooks. In this case, please remember to re-stage the modified files and commit again.

4. Push your changes to the forked repository and create a PR

  • Push the branch to your forked remote repository

    git push origin branchname
    
  • Create a PR

  • Revise the PR message template to describe your motivation and the modifications made in this PR. You can also link related issues to the PR manually in the PR message (for more information, check out the official guidance).

  • Specifically, if you are contributing to dev-1.x, you will have to change the base branch of the PR to dev-1.x in the PR page, since the default base branch is main.

  • You can also ask a specific person to review the changes you’ve proposed.

5. Discuss and review your code

  • Modify your code according to the reviewers' suggestions and then push your changes.

6. Merge your branch to dev-1.x branch and delete the branch

  • After the PR is merged by the maintainer, you can delete the branch you created in your forked repository.

    git branch -d branchname # delete local branch
    git push origin --delete branchname # delete remote branch
    

PR Specs

  1. Use pre-commit hooks to avoid code style issues

  2. One short-lived branch should correspond to only one PR

  3. Accomplish one well-defined change in one PR and avoid large PRs

    • Bad: Support Faster R-CNN

    • Acceptable: Add a box head to Faster R-CNN

    • Good: Add a parameter to box head to support custom conv-layer number

  4. Provide clear and meaningful commit messages

  5. Provide clear and meaningful PR description

    • The task name should be clarified in the title. The general format is: [Prefix] Short description of the PR (Suffix)

    • Prefix: new feature [Feature], bug fix [Fix], documentation-related [Docs], work in progress [WIP] (temporarily not reviewed)

    • Introduce the main changes, results, and impact on other modules in the short description

    • Associate related issues and pull requests with a milestone

Changelog of v1.x

v1.0.0 (04/06/2023)

We are excited to announce the first official release of MMOCR 1.0, with numerous enhancements, bug fixes, and the introduction of new dataset support!

🌟 Highlights

  • Support for SCUT-CTW1500, SynthText, and MJSynth datasets

  • Updated FAQ and documentation

  • Deprecation of file_client_args in favor of backend_args

  • Added a new MMOCR tutorial notebook

🆕 New Features & Enhancement

  • Add SCUT-CTW1500 by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1677

  • Cherry Pick #1205 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1774

  • Make lanms-neo optional by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1772

  • SynthText by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1779

  • Deprecate file_client_args and use backend_args instead by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1765

  • MJSynth by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1791

  • Add MMOCR tutorial notebook by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1771

  • decouple batch_size to det_batch_size, rec_batch_size and kie_batch_size in MMOCRInferencer by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1801

  • Accepts local-rank in train.py and test.py by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1806

  • update stitch_boxes_into_lines by @cherryjm in https://github.com/open-mmlab/mmocr/pull/1824

  • Add tests for pytorch 2.0 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1836

📝 Docs

  • FAQ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1773

  • Remove LoadImageFromLMDB from docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1767

  • Mark projects in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1766

  • add opendatalab download link by @jorie-peng in https://github.com/open-mmlab/mmocr/pull/1753

  • Fix some deadlinks in the docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1469

  • Fix quick run by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1775

  • Dataset by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1782

  • Update faq by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1817

  • more social network links by @fengshiwest in https://github.com/open-mmlab/mmocr/pull/1818

  • Update docs after branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1834

🛠️ Bug Fixes:

  • Place dicts to .mim by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1781

  • Test svtr_small instead of svtr_tiny by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1786

  • Add pse weight to metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1787

  • Synthtext metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1788

  • Clear up some unused scripts by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1798

  • if dst not exists, when move a single file may raise a file not exists error. by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1803

  • CTW1500 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1814

  • MJSynth & SynthText Dataset Preparer config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1805

  • Use poly_intersection instead of poly.intersection to avoid sup… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1811

  • Abinet: fix ValueError: Blur limit must be odd when centered=True. Got: (3, 6) by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1821

  • Bug generated during kie inference visualization by @Yangget in https://github.com/open-mmlab/mmocr/pull/1830

  • Revert sync bn in inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1832

  • Fix mmdet digit version by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1840

🎉 New Contributors

  • @jorie-peng made their first contribution in https://github.com/open-mmlab/mmocr/pull/1753

  • @hugotong6425 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1801

  • @fengshiwest made their first contribution in https://github.com/open-mmlab/mmocr/pull/1818

  • @cherryjm made their first contribution in https://github.com/open-mmlab/mmocr/pull/1824

  • @Yangget made their first contribution in https://github.com/open-mmlab/mmocr/pull/1830

Thank you to all the contributors for making this release possible! We’re excited about the new features and enhancements in this version, and we’re looking forward to your feedback and continued support. Happy coding! 🚀

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc6…v1.0.0

v1.0.0rc6 (03/07/2023)

Highlights

  1. Two new models, ABCNet v2 (inference only) and SPTS, are added to the projects/ folder.

  2. Announcing Inferencer, a unified inference interface in OpenMMLab for easy access and quick inference with all the pre-trained weights. Docs. A minimal usage sketch follows this list.

  3. Users can use test-time augmentation for text recognition tasks. Docs

  4. Support batch augmentation through BatchAugSampler, which is a technique used in SPTS.

  5. Dataset Preparer has been refactored to allow more flexible configurations. Besides, users are now able to prepare text recognition datasets in LMDB formats. Docs

  6. Some textspotting datasets have been revised to enhance the correctness and consistency with the common practice.

  7. Potential spurious warnings from shapely have been eliminated.
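
For reference, here is a minimal Python usage sketch of the new Inferencer. The model aliases ('DBNet', 'SAR'), the demo image path, and the printed output key are illustrative assumptions; please consult the linked Docs for the authoritative interface.

from mmocr.apis import MMOCRInferencer

# Build an end-to-end OCR inferencer from pre-trained detection and
# recognition weights (downloaded from the model zoo on first use).
ocr = MMOCRInferencer(det='DBNet', rec='SAR')

# Run inference on one image and inspect the returned predictions.
results = ocr('demo/demo_text_ocr.jpg')
print(results['predictions'])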

Dependency

This version requires MMEngine >= 0.6.0, MMCV >= 2.0.0rc4 and MMDet >= 3.0.0rc5.

New Features & Enhancements

  • Discard deprecated lmdb dataset format and only support img+label now by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1681

  • abcnetv2 inference by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1657

  • Add RepeatAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1678

  • SPTS by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1696

  • Refactor Inferencers by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1608

  • Dynamic return type for rescale_polygons by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1702

  • Revise upstream version limit by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1703

  • TextRecogCropConverter add crop with opencv warpPersepective function by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1667

  • change cudnn benchmark to false by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1705

  • Add ST-pretrained DB-series models and logs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1635

  • Only keep meta and state_dict when publish model by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1729

  • Rec TTA by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1401

  • Speedup formatting by replacing np.transpose with torch… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1719

  • Support auto import modules from registry. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1731

  • Support batch visualization & dumping in Inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1722

  • add a new argument font_properties to set a specific font file in order to draw Chinese characters properly by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1709

  • Refactor data converter and gather by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1707

  • Support batch augmentation through BatchAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1757

  • Put all registry into registry.py by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1760

  • train by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1756

  • configs for regression benchmark by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1755

  • Support lmdb format in Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1762

Docs

  • update the link of DBNet by @AllentDan in https://github.com/open-mmlab/mmocr/pull/1672

  • Add notice for default branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1693

  • docs: Add twitter discord medium youtube link by @vansin in https://github.com/open-mmlab/mmocr/pull/1724

  • Remove unsupported datasets in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1670

Bug Fixes

  • Update dockerfile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1671

  • Explicitly create np object array for compatibility by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1691

  • Fix a minor error in docstring by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1685

  • Fix lint by @triple-Mu in https://github.com/open-mmlab/mmocr/pull/1694

  • Fix LoadOCRAnnotation ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1695

  • Fix isort pre-commit error by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1697

  • Update owners by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1699

  • Detect intersection before using shapley.intersection to eliminate spurious warnings by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1710

  • Fix some inferencer bugs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1706

  • Fix textocr ignore flag by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1712

  • Add missing softmax in ASTER forward_test by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1718

  • Fix head in readme by @vansin in https://github.com/open-mmlab/mmocr/pull/1727

  • Fix some browse dataset script bugs and draw textdet gt instance with ignore flags by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1701

  • icdar textrecog ann parser skip data with ignore flag by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1708

  • bezier_to_polygon -> bezier2polygon by @double22a in https://github.com/open-mmlab/mmocr/pull/1739

  • Fix docs recog CharMetric P/R error definition by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1740

  • Remove outdated resources in demo/ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1747

  • Fix wrong ic13 textspotting split data; add lexicons to ic13, ic15 and totaltext by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1758

  • SPTS readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1761

New Contributors

  • @triple-Mu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1694

  • @double22a made their first contribution in https://github.com/open-mmlab/mmocr/pull/1739

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc5…v1.0.0rc6

v1.0.0rc5 (01/06/2023)

Highlights

  1. Two models, Aster and SVTR, are added to our model zoo. The full implementation of ABCNet is also available now.

  2. Dataset Preparer supports 5 more datasets: CocoTextV2, FUNSD, TextOCR, NAF, SROIE.

  3. We have 4 more text recognition transforms, and two helper transforms. See https://github.com/open-mmlab/mmocr/pull/1646 https://github.com/open-mmlab/mmocr/pull/1632 https://github.com/open-mmlab/mmocr/pull/1645 for details.

  4. The transform, FixInvalidPolygon, is getting smarter at dealing with invalid polygons, and now capable of handling more weird annotations. As a result, a complete training cycle on TotalText dataset can be performed bug-free. The weights of DBNet and FCENet pretrained on TotalText are also released.

New Features & Enhancements

  • Update ic15 det config according to DataPrepare by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1617

  • Refactor icdardataset metainfo to lowercase. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1620

  • Add ASTER Encoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1239

  • Add ASTER decoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1625

  • Add ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1238

  • Update ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1629

  • Support browse_dataset.py to visualize original dataset by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1503

  • Add CocoTextv2 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1514

  • Add Funsd to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1550

  • Add TextOCR to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1543

  • Refine example projects and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1628

  • Enhance FixInvalidPolygon, add RemoveIgnored transform by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1632

  • ConditionApply by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1646

  • Add NAF to dataset preparer by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1609

  • Add SROIE to dataset preparer by @FerryHuang in https://github.com/open-mmlab/mmocr/pull/1639

  • Add svtr decoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1448

  • Add missing unit tests by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1651

  • Add svtr encoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1483

  • ABCNet train by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1610

  • Totaltext cfgs for DB and FCE by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1633

  • Add Aliases to models by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1611

  • SVTR transforms by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1645

  • Add SVTR framework and configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1621

  • Issue Template by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1663

Docs

  • Add Chinese translation for browse_dataset.py by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1647

  • updata abcnet doc by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1658

  • update the dbnetpp`s readme file by @zhuyue66 in https://github.com/open-mmlab/mmocr/pull/1626

  • Inferencer docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1744

Bug Fixes

  • nn.SmoothL1Loss beta can not be zero in PyTorch 1.13 version by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1616

  • ctc loss bug if target is empty by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1618

  • Add torch 1.13 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1619

  • Remove outdated tutorial link by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1627

  • Dev 1.x some doc mistakes by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1630

  • Support custom font to visualize some languages (e.g. Korean) by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1567

  • db_module_loss,negative number encountered in sqrt by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1640

  • Use int instead of np.int by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1636

  • Remove support for py3.6 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1660

New Contributors

  • @zhuyue66 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1626

  • @KevinNuNu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1630

  • @FerryHuang made their first contribution in https://github.com/open-mmlab/mmocr/pull/1639

  • @willpat1213 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1448

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc4…v1.0.0rc5

v1.0.0rc4 (12/06/2022)

Highlights

  1. Dataset Preparer can automatically generate base dataset configs at the end of the preparation process, and supports 6 more datasets: IIIT5k, CUTE80, ICDAR2013, ICDAR2015, SVT, SVTP.

  2. Introducing our projects/ folder - implementing new models and features in OpenMMLab's algorithm libraries has long been criticized as troublesome due to the rigorous code-quality requirements, which could hinder the fast iteration of SOTA models and might discourage community members from sharing their latest work here. We now introduce the projects/ folder, where experimental features, frameworks and models can be placed and only need to satisfy minimum code-quality requirements. Everyone is welcome to post their implementation of any great ideas in this folder! We also add the first example project to illustrate what we expect a good project to have (check out the raw content of README.md for more info!).

  3. Inside the projects/ folder, we are releasing the preview version of ABCNet, which is the first implementation of text spotting models in MMOCR. It’s inference-only now, but the full implementation will be available very soon.

New Features & Enhancements

  • Add SVT to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1521

  • Polish bbox2poly by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1532

  • Add SVTP to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1523

  • Iiit5k converter by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1530

  • Add cute80 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1522

  • Add IC13 preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1531

  • Add ‘Projects/’ folder, and the first example project by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1524

  • Rename to {dataset-name}_task_train/test by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1541

  • Add print_config.py to the tools by @IncludeMathH in https://github.com/open-mmlab/mmocr/pull/1547

  • Add get_md5 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1553

  • Add config generator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1552

  • Support IC15_1811 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1556

  • Update CT80 config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1555

  • Add config generators to all textdet and textrecog configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1560

  • Refactor TPS by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1240

  • Add TextSpottingConfigGenerator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1561

  • Add common typing by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1596

  • Update textrecog config and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1597

  • Support head loss or postprocessor is None for only infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1594

  • Textspotting datasample by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1593

  • Simplify mono_gather by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1588

  • ABCNet v1 infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1598

Docs

  • Add Chinese Guidance on How to Add New Datasets to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1506

  • Update the qq group link by @vansin in https://github.com/open-mmlab/mmocr/pull/1569

  • Collapse some sections; update logo url by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1571

  • Update dataset preparer (CN) by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1591

Bug Fixes

  • Fix two bugs in dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1513

  • Register bug of CLIPResNet by @jyshee in https://github.com/open-mmlab/mmocr/pull/1517

  • Being more conservative on Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1520

  • python -m pip upgrade in windows by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1525

  • Fix wildreceipt metafile by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1528

  • Fix Dataset Preparer Extract by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1527

  • Fix ICDARTxtParser by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1529

  • Fix Dataset Zoo Script by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1533

  • Fix crop without padding and recog metainfo delete unuse info by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1526

  • Automatically create nonexistent directory for base configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1535

  • Change mmcv.dump to mmengine.dump by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1540

  • mmocr.utils.typing -> mmocr.utils.typing_utils by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1538

  • Wildreceipt tests by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1546

  • Fix judge exist dir by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1542

  • Fix IC13 textdet config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1563

  • Fix IC13 textrecog annotations by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1568

  • Auto scale lr by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1584

  • Fix icdar data parse for text containing separator by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1587

  • Fix textspotting ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1599

  • Fix TextSpottingConfigGenerator and TextSpottingDataConverter by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1604

  • Keep E2E Inferencer output simple by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1559

New Contributors

  • @jyshee made their first contribution in https://github.com/open-mmlab/mmocr/pull/1517

  • @ProtossDragoon made their first contribution in https://github.com/open-mmlab/mmocr/pull/1540

  • @IncludeMathH made their first contribution in https://github.com/open-mmlab/mmocr/pull/1547

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc3…v1.0.0rc4

v1.0.0rc3 (11/03/2022)

Highlights

  1. We release several pretrained models using oCLIP-ResNet as the backbone, which is a ResNet variant trained with oCLIP and can significantly boost the performance of text detection models.

  2. Preparing datasets is troublesome and tedious, especially in the OCR domain where multiple datasets are usually required. In order to free our users from laborious work, we designed a Dataset Preparer to help you get a bunch of datasets ready for use with only one line of command! Dataset Preparer is also crafted to consist of a series of reusable modules, each responsible for handling one of the standardized phases throughout the preparation process, shortening the development cycle on supporting new datasets.

New Features & Enhancements

  • Add Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1484

  • support modified resnet structure used in oCLIP by @HannibalAPE in https://github.com/open-mmlab/mmocr/pull/1458

  • Add oCLIP configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1509

Docs

  • Update install.md by @rogachevai in https://github.com/open-mmlab/mmocr/pull/1494

  • Refine some docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1455

  • Update some dataset preparer related docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1502

  • oclip readme by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1505

Bug Fixes

  • Fix offline_eval error caused by new data flow by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1500

New Contributors

  • @rogachevai made their first contribution in https://github.com/open-mmlab/mmocr/pull/1494

  • @HannibalAPE made their first contribution in https://github.com/open-mmlab/mmocr/pull/1458

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc2…v1.0.0rc3

v1.0.0rc2 (10/14/2022)

This release relaxes the version requirement of MMEngine to >=0.1.0, < 1.0.0.

v1.0.0rc1 (10/09/2022)

Highlights

This release fixes a severe bug that led to inaccurate metric reports in multi-GPU training. We release the weights for all the text recognition models in the MMOCR 1.0 architecture. The inference shorthands for them have also been added back to ocr.py. Besides, more documentation chapters are available now.

New Features & Enhancements

  • Simplify the Mask R-CNN config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1391

  • auto scale lr by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1326

  • Update paths to pretrain weights by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1416

  • Streamline duplicated split_result in pan_postprocessor by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1418

  • Update model links in ocr.py and inference.md by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1431

  • Update rec configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1417

  • Visualizer refine by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1411

  • Support get flops and parameters in dev-1.x by @vansin in https://github.com/open-mmlab/mmocr/pull/1414

Docs

  • intersphinx and api by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1367

  • Fix quickrun by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1374

  • Fix some docs issues by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1385

  • Add Documents for DataElements by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1381

  • config english by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1372

  • Metrics by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1399

  • Add version switcher to menu by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1407

  • Data Transforms by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1392

  • Fix inference docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1415

  • Fix some docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1410

  • Add maintenance plan to migration guide by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1413

  • Update Recog Models by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1402

Bug Fixes

  • clear metric.results only done in main process by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1379

  • Fix a bug in MMDetWrapper by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1393

  • Fix browse_dataset.py by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1398

  • ImgAugWrapper: Do not cilp polygons if not applicable by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1231

  • Fix CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1365

  • Fix merge stage test by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1370

  • Del CI support for torch 1.5.1 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1371

  • Test windows cu111 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1373

  • Fix windows CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1387

  • Upgrade pre commit hooks by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1429

  • Skip invalid augmented polygons in ImgAugWrapper by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1434

New Contributors

  • @vansin made their first contribution in https://github.com/open-mmlab/mmocr/pull/1414

Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc0…v1.0.0rc1

v1.0.0rc0 (09/01/2022)

We are excited to announce the release of MMOCR 1.0.0rc0. MMOCR 1.0.0rc0 is the first version of MMOCR 1.x, a part of the OpenMMLab 2.0 projects. Built upon the new training engine, MMOCR 1.x unifies the interfaces of datasets, models, evaluation, and visualization, and delivers faster training and testing speed.

Highlights

  1. New engines. MMOCR 1.x is based on MMEngine, which provides a general and powerful runner that allows more flexible customizations and significantly simplifies the entrypoints of high-level interfaces.

  2. Unified interfaces. As a part of the OpenMMLab 2.0 projects, MMOCR 1.x unifies and refactors the interfaces and internal logic of training, testing, datasets, models, evaluation, and visualization. All the OpenMMLab 2.0 projects share the same design in these interfaces and logic to allow the emergence of multi-task/modality algorithms.

  3. Cross project calling. Benefiting from the unified design, you can use the models implemented in other OpenMMLab projects, such as MMDet. We provide an example of how to use MMDetection’s Mask R-CNN through MMDetWrapper. Check our documents for more details. More wrappers will be released in the future.

  4. Stronger visualization. We provide a series of useful tools which are mostly based on brand-new visualizers. As a result, it is more convenient for the users to explore the models and datasets now.

  5. More documentation and tutorials. We add a bunch of documentation and tutorials to help users get started more smoothly. Read it here.

Breaking Changes

We briefly list the major breaking changes here. We will update the migration guide to provide complete details and migration instructions.

Dependencies
  • MMOCR 1.x relies on MMEngine to run. MMEngine is a new foundational library for training deep learning models in OpenMMLab 2.0 projects. The dependencies for file IO and training are migrated from MMCV 1.x to MMEngine.

  • MMOCR 1.x relies on MMCV>=2.0.0rc0. Although MMCV no longer maintains the training functionalities since 2.0.0rc0, MMOCR 1.x relies on the data transforms, CUDA operators, and image processing interfaces in MMCV. Note that since MMCV 2.0.0rc0, the package mmcv provides pre-built CUDA operators while mmcv-lite does not, and mmcv-full has been deprecated.

Training and testing
  • MMOCR 1.x uses the Runner in MMEngine rather than the one in MMCV. The new Runner implements and unifies the building logic of the dataset, model, evaluation, and visualizer. Therefore, MMOCR 1.x no longer maintains the building logic of those modules in mmocr.train.apis and tools/train.py; that code has been migrated into MMEngine. Please refer to the migration guide of Runner in MMEngine for more details, and see the launch sketch after this list.

  • The Runner in MMEngine also supports testing and validation. The testing scripts are also simplified, and they build the runner with logic similar to that of the training scripts.

  • The execution points of hooks in the new Runner have been enriched to allow more flexible customization. Please refer to the migration guide of Hook in MMEngine for more details.

  • Learning rate and momentum scheduling has been migrated from Hook to Parameter Scheduler in MMEngine. Please refer to the migration guide of Parameter Scheduler in MMEngine for more details.
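
For orientation, the following sketch shows the MMEngine-style entry point that tools/train.py now wraps; the config path and work_dir are illustrative assumptions, while Config.fromfile and Runner.from_cfg are standard MMEngine APIs.

from mmengine.config import Config
from mmengine.runner import Runner

# Load a config (the path is an example) and set where logs and checkpoints go,
# then let MMEngine build the model, datasets, hooks, etc. and start training.
cfg = Config.fromfile('configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py')
cfg.work_dir = 'work_dirs/dbnet_demo'
runner = Runner.from_cfg(cfg)
runner.train()  # runner.test() runs evaluation with the same config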

Configs
Dataset

The Dataset classes implemented in MMOCR 1.x all inherit from BaseDetDataset, which inherits from BaseDataset in MMEngine. There are several changes to Dataset in MMOCR 1.x.

  • All the datasets support serializing the data list to reduce memory usage when multiple workers are built to accelerate data loading.

  • The interfaces are changed accordingly.

Data Transforms

The data transforms in MMOCR 1.x all inherit from those in MMCV>=2.0.0rc0, which follow a new convention in OpenMMLab 2.0 projects. The changes are listed below:

  • The interfaces are also changed. Please refer to the API Reference.

  • The functionality of some data transforms (e.g., Resize) is decomposed into several transforms.

  • The same data transforms in different OpenMMLab 2.0 libraries share the same augmentation implementation and argument logic, i.e., Resize in MMDet 3.x and MMOCR 1.x will resize the image in exactly the same manner given the same arguments (see the pipeline sketch after this list).
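
As an illustration of this shared convention, here is a hedged sketch of a 1.x-style text detection pipeline; the transform names follow my reading of the 1.x configs, and the argument values are illustrative rather than a recommended recipe.

# Transforms come from MMCV>=2.0.0rc0 and MMOCR 1.x; values are examples only.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='PackTextDetInputs',
         meta_keys=('img_path', 'ori_shape', 'img_shape', 'scale_factor')),
]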

Model

The models in MMOCR 1.x all inherit from BaseModel in MMEngine, which defines a new convention for models in OpenMMLab 2.0 projects. Users can refer to the model tutorial in MMEngine for more details. Accordingly, there are several changes as follows:

  • The model interfaces, including the input and output formats, are significantly simplified and unified following the new convention in MMOCR 1.x. Specifically, all the input data in training and testing are packed into inputs and data_samples, where inputs contains model inputs like a list of image tensors, and data_samples contains other information of the current data sample such as ground truths and model predictions. In this way, different tasks in MMOCR 1.x can share the same input arguments, which makes the models more general and suitable for multi-task learning.

  • The model has a data preprocessor module, which is used to pre-process the input data of the model. In MMOCR 1.x, the data preprocessor usually performs the necessary steps to form the input images into a batch, such as padding. It can also serve as a place for some special data augmentations or more efficient data transformations like normalization.

  • The internal logic of the model has been changed. In MMOCR 0.x, models used forward_train and simple_test to handle different forward logic. In MMOCR 1.x and OpenMMLab 2.0, the forward function has three modes: loss, predict, and tensor, for training, inference, and tracing or other purposes, respectively. The forward function calls self.loss(), self.predict(), and self._forward() given the modes loss, predict, and tensor, respectively. A toy sketch follows this list.
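
The toy module below is plain PyTorch, not an actual MMOCR class; it only sketches how the three modes are dispatched under this convention.

import torch
from torch import nn

class ToyRecognizer(nn.Module):
    """Toy model illustrating the loss/predict/tensor forward convention."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.head = nn.Linear(32, num_classes)

    def loss(self, inputs, data_samples):
        # Training mode: return a dict of loss tensors.
        logits = self._forward(inputs)
        targets = torch.stack([ds['gt_label'] for ds in data_samples])
        return {'loss_ce': nn.functional.cross_entropy(logits, targets)}

    def predict(self, inputs, data_samples):
        # Inference mode: attach predictions to the data samples and return them.
        logits = self._forward(inputs)
        for ds, pred in zip(data_samples, logits.argmax(dim=-1)):
            ds['pred_label'] = pred.item()
        return data_samples

    def _forward(self, inputs, data_samples=None):
        # Tensor mode: raw network outputs, e.g. for tracing or export.
        return self.head(inputs)

    def forward(self, inputs, data_samples=None, mode='tensor'):
        if mode == 'loss':
            return self.loss(inputs, data_samples)
        if mode == 'predict':
            return self.predict(inputs, data_samples)
        return self._forward(inputs, data_samples)

# A batch of 4 fake feature vectors and their data samples.
x = torch.randn(4, 32)
samples = [{'gt_label': torch.tensor(i % 10)} for i in range(4)]
model = ToyRecognizer()
print(model(x, samples, mode='loss'))     # {'loss_ce': tensor(...)}
print(model(x, samples, mode='predict'))  # data samples with 'pred_label' filled in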

Evaluation

MMOCR 1.x mainly implements corresponding metrics for each task, which are manipulated by the Evaluator to complete the evaluation. In addition, users can build an Evaluator in MMOCR 1.x to conduct offline evaluation, i.e., evaluate predictions that may not be produced by MMOCR, as long as the predictions follow our dataset conventions. More details can be found in the Evaluation Tutorial in MMEngine.

Visualization

The visualization functions in MMOCR 0.x are removed. Instead, in OpenMMLab 2.0 projects, we use the Visualizer to visualize data. MMOCR 1.x implements TextDetLocalVisualizer, TextRecogLocalVisualizer, and KIELocalVisualizer to allow visualization of ground truths, model predictions, feature maps, etc., at any place, for the three tasks supported in MMOCR. It also supports dumping the visualization data to external visualization backends such as TensorBoard and Wandb. Check our Visualization Document for more details.

Improvements

  • Most models enjoy a performance improvement from the new framework and refactor of data transforms. For example, in MMOCR 1.x, DBNet-R50 achieves 0.854 hmean score on ICDAR 2015, while the counterpart can only get 0.840 hmean score in MMOCR 0.x.

  • Support mixed precision training of most of the models. However, the remaining models are not supported yet because the operators they use might not be representable in fp16. We will update the documentation and list the results of mixed precision training.

Ongoing changes

  1. Test-time augmentation, which was supported in MMOCR 0.x, is not implemented yet in this version due to limited time. We will support it in the following releases with a new and simplified design.

  2. Inference interfaces: a unified inference interface will be supported in the future to ease the use of released models.

  3. Interfaces of useful tools that can be used in notebooks: more useful tools implemented in the tools/ directory will get Python interfaces so that they can be used in notebooks and downstream libraries.

  4. Documentation: we will add more design docs, tutorials, and migration guidance so that the community can deep dive into our new design, participate in future development, and smoothly migrate downstream libraries to MMOCR 1.x.

Overview

Along with the release of OpenMMLab 2.0, MMOCR 1.0 made many significant changes, resulting in less redundant, more efficient code and a more consistent overall design. However, these changes break backward compatibility. We understand that with such huge changes, it is not easy for users familiar with the old version to adapt to the new version. Therefore, we prepared a detailed migration guide to make the transition as smooth as possible so that all users can enjoy the productivity benefits of the new MMOCR and the entire OpenMMLab 2.0 ecosystem.

Warning

MMOCR 1.0 depends on MMEngine, the new foundational library for training deep learning models, and therefore has an entirely different dependency chain compared with MMOCR 0.x. Even if you already have a well-rounded MMOCR 0.x environment, you still need to create a new Python environment for MMOCR 1.0. We provide a detailed installation guide for reference.

Next, please read the sections according to your requirements.

As shown in the following figure, the maintenance plan of the MMOCR 1.x version is mainly divided into three stages, namely the “RC Period”, the “Compatibility Period” and the “Maintenance Period”. For old versions, we will no longer add major new features. Therefore, we strongly recommend that users migrate to MMOCR 1.x as soon as possible.

(Figure: maintenance plan of the MMOCR 1.x version)

What’s New in MMOCR 1.x

Here are some highlights of MMOCR 1.x compared to 0.x.

  1. New engines. MMOCR 1.x is based on MMEngine, which provides a general and powerful runner that allows more flexible customizations and significantly simplifies the entrypoints of high-level interfaces.

  2. Unified interfaces. As a part of the OpenMMLab 2.0 projects, MMOCR 1.x unifies and refactors the interfaces and internal logic of training, testing, datasets, models, evaluation, and visualization. All the OpenMMLab 2.0 projects share the same design in these interfaces and logic to allow the emergence of multi-task/modality algorithms.

  3. Cross project calling. Benefiting from the unified design, you can use the models implemented in other OpenMMLab projects, such as MMDet. We provide an example of how to use MMDetection’s Mask R-CNN through MMDetWrapper. Check our documents for more details. More wrappers will be released in the future.

  4. Stronger visualization. We provide a series of useful tools which are mostly based on brand-new visualizers. As a result, it is more convenient for the users to explore the models and datasets now.

  5. More documentation and tutorials. We add a bunch of documentation and tutorials to help users get started more smoothly.

  6. One-stop Dataset Preparation. Multiple datasets are instantly ready with only one line of command, via our Dataset Preparer.

  7. Embracing more projects/: We now introduce the projects/ folder, where experimental features, frameworks and models can be placed and only need to satisfy minimum code-quality requirements. Everyone is welcome to post their implementation of any great ideas in this folder! Learn more from our example project.

  8. More models. MMOCR 1.0 supports more tasks and more state-of-the-art models!

Branch Migration

At an earlier stage, MMOCR had three branches: main, 1.x, and dev-1.x. Some of these branches have been renamed together with the official MMOCR 1.0.0 release, and here is the changelog.

  • main branch housed the code for MMOCR 0.x (e.g., v0.6.3). Now it has been renamed to 0.x.

  • 1.x contained the code for MMOCR 1.x (e.g., 1.0.0rc6). Now it is an alias of main, and will be removed in mid 2023.

  • dev-1.x was the development branch for MMOCR 1.x. Now it remains unchanged.

For more information about the branches, check out branches.

Resolving Conflicts When Upgrading the main branch

For users who wish to upgrade from the old main branch, which contained the code for MMOCR 0.x, the non-fast-forwardable nature of the upgrade may cause conflicts. To resolve these conflicts, follow the steps below:

  1. Commit any changes you have on main, then back up your current main branch by creating a copy.

    git checkout main
    git add --all
    git commit -m 'backup'
    git checkout -b main_backup
    
  2. Fetch the latest changes from the remote repository.

    git remote add openmmlab git@github.com:open-mmlab/mmocr.git
    git fetch openmmlab
    
  3. Reset the main branch to the latest main branch on the remote repository by running git reset --hard openmmlab/main.

    git checkout main
    git reset --hard openmmlab/main
    

By following these steps, you can successfully upgrade your main branch.

Code Migration

In order to balance the tasks of text detection, recognition and key information extraction, the initial design of MMOCR had a number of shortcomings. In this 1.0 release, MMOCR synchronizes its new model architecture to align as much as possible with the overall OpenMMLab design and to achieve structural uniformity within the algorithm library. Although this upgrade is not fully backward compatible, we summarize the changes that may be of interest to developers for those who need them.

Fundamental Changes

The functional boundaries of modules were not clearly defined in MMOCR 0.x. In MMOCR 1.0, we address this issue by refactoring the design of model modules. Here are some major changes in 1.0:

  • MMOCR 1.0 no longer supports named entity recognition tasks since it’s not in the scope of OCR.

  • The module that computes the loss in a model is named as Module Loss, which is also responsible for the conversion of gold annotations into loss targets. Another module, Postprocessor, is responsible for decoding the model raw output into DataSample for the corresponding task at prediction time.

  • The inputs of all models are now organized as a dictionary with two keys: inputs, containing the original features of the images, and data_samples (a List[DataSample]), containing the meta-information of the images. At training time, the output format of a model is standardized to a dictionary containing the loss tensors. Similarly, at test time a model generates a sequence of DataSamples containing the prediction outputs.

  • In MMOCR 0.x, most classes named XXLoss had implementations closely bound to the corresponding models, while their names made it hard for users to tell them apart from generic losses like DiceLoss. In 1.0, they are renamed to the form XXModuleLoss (e.g. DBLoss was renamed to DBModuleLoss), and the key to their configuration in config files is changed from loss to module_loss (see the config sketch after this list).

  • The names of generic loss classes that are not tied to any model implementation are kept as XXLoss (e.g. MaskedBCELoss), and they are all placed under mmocr/models/common/losses.

  • Changes under mmocr/models/common/losses: DiceLoss is renamed to MaskedDiceLoss. FocalLoss has been removed.

  • MMOCR 1.0 adds a Dictionary module which originates from label converter. It is used in text recognition and key information extraction tasks.
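To make the loss renaming concrete, here is a hedged before/after sketch of a detection model config; the fields are illustrative fragments rather than a complete, runnable config.

# MMOCR 0.x style (illustrative fragment)
model = dict(
    type='DBNet',
    bbox_head=dict(
        type='DBHead',
        loss=dict(type='DBLoss')))

# MMOCR 1.0 style: bbox_head -> det_head, loss -> module_loss, DBLoss -> DBModuleLoss
model = dict(
    type='DBNet',
    det_head=dict(
        type='DBHead',
        module_loss=dict(type='DBModuleLoss')))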

Text Detection Models

Key Changes (TL;DR)

  • The model weights from MMOCR 0.x still work in 1.0, but the fields starting with bbox_head in the checkpoint's state_dict need to be renamed to det_head (see the sketch after this list).

  • XXTargets transforms, which were responsible for generating detection targets, have been merged into XXModuleLoss.
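A minimal renaming sketch for the state_dict change above, assuming hypothetical checkpoint file names; this is not an official migration tool.

import torch

# MMOCR checkpoints keep the weights under the 'state_dict' key.
ckpt = torch.load('dbnet_0x.pth', map_location='cpu')  # hypothetical input file
ckpt['state_dict'] = {
    # Rename the leading 'bbox_head' prefix to 'det_head' so MMOCR 1.0 can load it.
    k.replace('bbox_head', 'det_head', 1): v
    for k, v in ckpt['state_dict'].items()
}
torch.save(ckpt, 'dbnet_1x.pth')  # hypothetical output file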

SingleStageTextDetector

  • The original inheritance chain was mmdet.BaseDetector->SingleStageDetector->SingleStageTextDetector. Now SingleStageTextDetector is directly inherited from BaseDetector without extra dependency on MMDetection, and SingleStageDetector is deleted.

  • bbox_head is renamed to det_head.

  • train_cfg, test_cfg and pretrained fields are removed.

  • forward_train() and simple_test() are refactored into loss() and predict(). The part of simple_test() that was responsible for splitting the raw output of the model and feeding it into head.get_boundary() is integrated into BaseTextDetPostProcessor.

  • TextDetectorMixin has been removed since its implementation overlaps with TextDetLocalVisualizer.

ModuleLoss

  • Data transforms XXXTargets in text detection tasks are all moved into XXXModuleLoss._get_target_single(). Target-related configurations are no longer specified in the data pipeline but in XXXModuleLoss instead.

Postprocessor

  • The logic of the original XXXPostprocessor.__call__() is transferred to the refactored XXXPostprocessor.get_text_instances().

  • BasePostprocessor is refactored to BaseTextDetPostProcessor. This base class splits and processes the model output predictions one by one and supports automatic scaling of the output polygon or bounding box based on scale_factor.

Text Recognition

Key Changes (TL;DR)

  • Due to the change of the character order and fixes to some bugs in the model architecture, the recognition model weights from 0.x can no longer be directly used in 1.0. We will provide a migration script and tutorial for those who need it.

  • The support of SegOCR has been removed. TPS-CRNN will still be supported in a later version.

  • Test time augmentation will be supported in the upcoming release.

  • Label converter module has been removed and its functions have been split into Dictionary, ModuleLoss and Postprocessor.

  • The definition of max_seq_len has been unified and now it represents the original output length of the model.

Label Converter

  • The original label converters had spelling errors (written as label convertors). We fixed them by removing label converters from this project.

  • The part responsible for converting characters/strings to and from numeric indexes was extracted to Dictionary.

  • In older versions, different label converters would have different special character sets and character order. In version 0.x, the character order was as follows.

Converter                   | Character order
AttnConvertor, ABIConvertor | <UKN>, <BOS/EOS>, <PAD>, characters
CTCConvertor                | <BLK>, <UKN>, characters

In 1.0, instead of designing different dictionaries and character orders for different tasks, we have a unified Dictionary implementation with the character order always as characters, <BOS/EOS>, <PAD>, <UKN>. <BLK> in CTCConvertor has been equivalently replaced by <PAD>.

  • The label converter originally supported three ways to initialize dictionaries: dict_type, dict_file and dict_list, which are reduced to dict_file only in Dictionary. The pre-defined character sets originally supported via dict_type have been moved into the dicts/ directory. The corresponding mapping is as follows (a Dictionary config sketch follows this list):

MMOCR 0.x: dict_type | MMOCR 1.0: Dict path
DICT90               | dicts/english_digits_symbols.txt
DICT91               | dicts/english_digits_symbols_space.txt
DICT36               | dicts/lower_english_digits.txt
DICT37               | dicts/lower_english_digits_space.txt
  • The implementation of str2tensor() in label converter has been moved to ModuleLoss.get_targets(). The following table shows the correspondence between the old and new method implementations. Note that the old and new implementations are not identical.

MMOCR 0.x                                              | MMOCR 1.0                              | Note
ABIConvertor.str2tensor(), AttnConvertor.str2tensor() | BaseTextRecogModuleLoss.get_targets()  | The different implementations of ABIConvertor.str2tensor() and AttnConvertor.str2tensor() have been unified in the new version.
CTCConvertor.str2tensor()                              | CTCModuleLoss.get_targets()            |
  • The implementation of tensor2idx() in label converter has been moved to Postprocessor.get_single_prediction(). The following table shows the correspondence between the old and new method implementations. Note that the old and new implementations are not identical.

MMOCR 0.x                                              | MMOCR 1.0
ABIConvertor.tensor2idx(), AttnConvertor.tensor2idx() | AttentionPostprocessor.get_single_prediction()
CTCConvertor.tensor2idx()                              | CTCPostProcessor.get_single_prediction()
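For reference, here is a hedged sketch of configuring the unified Dictionary with one of the pre-defined character sets from the dict_type table above; the special-token flags shown are the usual switches and should be adapted to the model at hand.

dictionary = dict(
    type='Dictionary',
    dict_file='dicts/lower_english_digits.txt',  # formerly dict_type=DICT36
    with_start=True,       # adds <BOS/EOS>
    with_end=True,
    same_start_end=True,
    with_padding=True,     # adds <PAD>, which replaces <BLK> of CTCConvertor
    with_unknown=True)     # adds <UKN>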

Key Information Extraction

Key Changes (TL;DR)

  • Due to changes in the inputs to the model, the model weights obtained in 0.x can no longer be directly used in 1.0.

KIEDataset & OpensetKIEDataset

  • The part that reads data is kept in WildReceiptDataset.

  • The part that additionally processes the nodes and edges is moved to LoadKIEAnnotation.

  • The part that uses dictionaries to transform text is moved to SDMGRHead.convert_text(), with the help of Dictionary.

  • The part of compute_relation() that computes the relationships between text boxes is moved to SDMGRHead.compute_relations(). It’s now done inside the model.

  • The part that evaluates the model performance is done in F1Metric.

  • The part of OpensetKIEDataset that processes the model's edge outputs is moved to SDMGRPostProcessor.

SDMGR

  • show_result() is integrated into KIEVisualizer.

  • The part of forward_test() that post-processes the output is organized in SDMGRPostProcessor.

Utils Migration

Utility functions are now grouped together under mmocr/utils/. Here are the scopes of the files in this directory (a brief usage sketch follows the list):

  • bbox_utils.py: bounding box related functions.

  • check_argument.py: used to check argument types.

  • collect_env.py: used to collect the running environment.

  • data_converter_utils.py: used for data format conversion.

  • fileio.py: file input and output related functions.

  • img_utils.py: image processing related functions.

  • mask_utils.py: mask related functions.

  • ocr.py: used for MMOCR inference.

  • parsers.py: used for parsing datasets.

  • polygon_utils.py: polygon related functions.

  • setup_env.py: used to initialize MMOCR.

  • string_utils.py: string related functions.

  • typing.py: defines type abbreviations used in MMOCR.
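A tiny usage sketch of two of the utilities above (both also appear in the API reference at the end of this document); treat it as illustrative rather than authoritative.

from mmocr.utils import bbox2poly, poly2bbox

poly = bbox2poly([0, 0, 10, 20])  # [min_x, min_y, max_x, max_y] -> 4-point polygon
bbox = poly2bbox(poly)            # and back to [min_x, min_y, max_x, max_y]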

Dataset Migration

Based on the new design of BaseDataset in MMEngine, we have refactored the base OCR dataset class OCRDataset in MMOCR 1.0. The following document describes the differences between the old and new dataset formats in MMOCR, and how to migrate from the deprecated version to the latest. For users who do not want to migrate datasets at this time, we also provide a temporary solution in Section Compatibility.

Note

The Key Information Extraction task still uses the original WildReceipt dataset annotation format.

Review of Old Dataset Formats

MMOCR version 0.x implements a number of dataset classes, such as IcdarDataset and TextDetDataset for text detection tasks, and OCRDataset and OCRSegDataset for text recognition tasks. At the same time, the annotations may come in different formats, such as .txt, .json, and .jsonl. Users have to manually configure the Loader and the Parser while customizing the datasets.

Text Detection

For the text detection task, IcdarDataset uses a COCO-like annotation format.

{
  "images": [
    {
      "id": 1,
      "width": 800,
      "height": 600,
      "file_name": "test.jpg"
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [0,0,10,10],
      "segmentation": [
          [0,0,10,0,10,10,0,10]
      ],
      "area": 100,
      "iscrowd": 0
    }
  ]
}

The TextDetDataset uses the JSON Line storage format, converting COCO-like labels to strings and saving them in .txt or .jsonl files.

{"file_name": "test/img_2.jpg", "height": 720, "width": 1280,  "annotations": [{"iscrowd": 0, "category_id": 1, "bbox": [602.0, 173.0,  33.0, 24.0], "segmentation": [[602, 173, 635, 175, 634, 197, 602,  196]]}, {"iscrowd": 0, "category_id": 1, "bbox": [734.0, 310.0, 58.0,  54.0], "segmentation": [[734, 310, 792, 320, 792, 364, 738, 361]]}]}
{"file_name": "test/img_5.jpg", "height": 720, "width": 1280,  "annotations": [{"iscrowd": 1, "category_id": 1, "bbox": [405.0, 409.0,  32.0, 52.0], "segmentation": [[408, 409, 437, 436, 434, 461, 405,  433]]}, {"iscrowd": 1, "category_id": 1, "bbox": [435.0, 434.0, 8.0,  33.0], "segmentation": [[437, 434, 443, 440, 441, 467, 435, 462]]}]}

Text Recognition

For text recognition tasks, there are two annotation formats in MMOCR version 0.x. The simple .txt annotations separate image name and word annotation by a blank space, which cannot handle the case when spaces are included in a text instance.

img1.jpg OpenMMLab
img2.jpg MMOCR

The JSON Line format uses a dictionary-like structure to represent the annotations, where the keys filename and text store the image name and word label, respectively.

{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}

New Dataset Format

To solve the dataset issues, MMOCR 1.x adopts a unified dataset design introduced in MMEngine. Each annotation file is a .json file that stores a dict, containing both metainfo and data_list, where the former includes basic information about the dataset and the latter consists of the label item of each target instance.

{
  "metainfo":
    {
      "classes": ("cat", "dog"),
      // ...
    },
  "data_list":
    [
      {
        "img_path": "xxx/xxx_0.jpg",
        "img_label": 0,
        // ...
      },
      // ...
    ]
}

Based on the above structure, we introduce TextDetDataset and TextRecogDataset for MMOCR-specific tasks.

Text Detection

Introduction of the New Format

The TextDetDataset holds the information required by the text detection task, such as bounding boxes and labels. We refer users to tests/data/det_toy_dataset/instances_test.json which is an example annotation for TextDetDataset.

{
  "metainfo":
    {
      "dataset_type": "TextDetDataset",
      "task_name": "textdet",
      "category": [{"id": 0, "name": "text"}]
    },
  "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "height": 640,
        "width": 640,
        "instances":
          [
            {
              "polygon": [0, 0, 0, 10, 10, 20, 20, 0],
              "bbox": [0, 0, 10, 20],
              "bbox_label": 0,
              "ignore": False
            },
            // ...
          ]
      }
    ]
}

The bounding box format is as follows: [min_x, min_y, max_x, max_y]
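Since the old COCO-style annotations store boxes as [x, y, width, height], a conversion to the new [min_x, min_y, max_x, max_y] form looks roughly like the sketch below (a hedged illustration, not part of the migration script).

def coco_xywh_to_xyxy(bbox):
    """Convert a COCO-style [x, y, w, h] box to [min_x, min_y, max_x, max_y]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

print(coco_xywh_to_xyxy([602.0, 173.0, 33.0, 24.0]))  # [602.0, 173.0, 635.0, 197.0]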

Migration Script

We provide a migration script to help users migrate old annotation files to the new format.

python tools/dataset_converters/textdet/data_migrator.py ${IN_PATH} ${OUT_PATH}
ARGS     | Type                             | Description
in_path  | str                              | (Required) Path to the old annotation file.
out_path | str                              | (Required) Path to the new annotation file.
--task   | 'auto', 'textdet', 'textspotter' | Specifies the compatible task for the output dataset annotation. If 'textdet' is specified, the text field in COCO format will not be dumped. The default is 'auto', which automatically determines the output format based on the old annotation files.

Text Recognition

Introduction of the New Format

The TextRecogDataset holds the information required by the text recognition task, such as the text and the image path. We refer users to tests/data/rec_toy_dataset/labels.json, which is an example annotation for TextRecogDataset.

{
  "metainfo":
    {
      "dataset_type": "TextRecogDataset",
      "task_name": "textrecog",
    },
    "data_list":
    [
      {
        "img_path": "test_img.jpg",
        "instances":
            [
              {
                "text": "GRAND"
              }
            ]
      }
    ]
}

Migration Script

We provide a migration script to help users migrate old annotation files to the new format.

python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH} --format ${txt, jsonl, lmdb}
ARGS     | Type                   | Description
in_path  | str                    | (Required) Path to the old annotation file.
out_path | str                    | (Required) Path to the new annotation file.
--format | 'txt', 'jsonl', 'lmdb' | Specify the format of the old dataset annotation.

Compatibility

In consideration of the cost to users for data migration, we have temporarily made MMOCR version 1.x compatible with the old MMOCR 0.x format.

Note

The code and components used for compatibility with the old data format may be completely removed in a future release. Therefore, we strongly recommend that users migrate their datasets to the new data format.

Specifically, we provide three dataset classes IcdarDataset, RecogTextDataset, RecogLMDBDataset to support the old formats.

  1. IcdarDataset supports COCO-like format annotations for text detection. You just need to add a new dataset config to configs/textdet/_base_/datasets and specify its dataset type as IcdarDataset.

    data_root = 'data/det/icdar2015'
    train_anno_path = 'instances_training.json'
    
    train_dataset = dict(
        type='IcdarDataset',
        data_root=data_root,
        ann_file=train_anno_path,
        data_prefix=dict(img_path='imgs/'),
        filter_cfg=dict(filter_empty_gt=True, min_size=32),
        pipeline=None)
    
  2. RecogTextDataset supports .txt and .jsonl format annotations for text recognition. You just need to add a new dataset config to configs/textrecog/_base_/datasets and specify its dataset type as RecogTextDataset. For example, the following config shows how to load the 0.x-format labels old_label.txt and old_label.jsonl from the toy dataset.

     data_root = 'tests/data/rec_toy_dataset/'
    
     # loading 0.x txt format annos
     txt_dataset = dict(
         type='RecogTextDataset',
         data_root=data_root,
         ann_file='old_label.txt',
         data_prefix=dict(img_path='imgs'),
         parser_cfg=dict(
             type='LineStrParser',
             keys=['filename', 'text'],
             keys_idx=[0, 1]),
         pipeline=[])
    
     # loading 0.x json line format annos
     jsonl_dataset = dict(
         type='RecogTextDataset',
         data_root=data_root,
         ann_file='old_label.jsonl',
         data_prefix=dict(img_path='imgs'),
         parser_cfg=dict(
             type='LineJsonParser',
             keys=['filename', 'text']),
         pipeline=[])
    
  3. RecogLMDBDataset supports LMDB-format datasets (images + labels) for text recognition. You just need to add a new dataset config to configs/textrecog/_base_/datasets and specify its dataset type as RecogLMDBDataset. For example, the following config shows how to load both the labels and the images from imgs.lmdb in the toy dataset.

  • set the dataset type to RecogLMDBDataset

# Specify the dataset type as RecogLMDBDataset
 data_root = 'tests/data/rec_toy_dataset/'

 lmdb_dataset = dict(
     type='RecogLMDBDataset',
     data_root=data_root,
     ann_file='imgs.lmdb',
     pipeline=None)
  • replace LoadImageFromFile with LoadImageFromNDArray in both train_pipeline and test_pipeline, for example:

 train_pipeline = [dict(type='LoadImageFromNDArray')]

Pretrained Model Migration

Due to the extensive refactoring and fixing of the model structure in the new version, MMOCR 1.x does not support loading weights trained by the old version. We have updated the pre-trained weights and logs of all models on our website.

In addition, we are working on a weight migration tool for text detection tasks and plan to release it in the near future. Since the text recognition and key information extraction models have been heavily modified and the migration would be lossy, we do not plan to support them for the time being. If you have specific requirements, please feel free to raise an Issue.

Data Transform Migration

Introduction

In MMOCR version 0.x, we implemented a series of Data Transform methods in mmocr/datasets/pipelines/xxx_transforms.py. However, these modules are scattered all over the place and lack a standardized design. Therefore, we refactored all the data transform modules in MMOCR version 1.x. According to the task type, they are now defined in ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py, respectively, under mmocr/datasets/transforms. Specifically, ocr_transforms.py implements the data augmentation methods for OCR-related tasks in general, while textdet_transforms.py and textrecog_transforms.py implement data augmentation transforms related to text detection and text recognition tasks, respectively.
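As a rough illustration, the pipeline fragment below mixes transforms from the three modules described above; the transform names appear in the API reference later in this document, but the arguments are illustrative and should be checked against the actual transform signatures.

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    dict(type='RandomRotate'),                               # ocr_transforms.py
    dict(type='TextDetRandomCrop', target_size=(640, 640)),  # textdet_transforms.py
    dict(type='PackTextDetInputs'),
]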

Since some of the modules were renamed, merged or separated during the refactoring process, the new interface and default parameters may be inconsistent with the old version. Therefore, this migration guide will introduce how to configure the new data transforms to achieve the identical behavior as the old version.

Configuration Migration Guide

mmocr.apis

Inferencers

MMOCRInferencer

MMOCR Inferencer.

TextDetInferencer

Text Detection inferencer.

TextRecInferencer

Text Recognition inferencer.

TextSpotInferencer

Text Spotting inferencer.

KIEInferencer

Key Information Extraction Inferencer.
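As a quick orientation before the class listings, here is a hedged usage sketch of MMOCRInferencer; the model aliases and keyword arguments are illustrative and should be checked against the inference documentation.

from mmocr.apis import MMOCRInferencer

# Chain a text detector and a recognizer (aliases shown here are illustrative).
ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
result = ocr('demo/demo_text_ocr.jpg', save_vis=True, out_dir='outputs/')
print(result['predictions'])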

mmocr.structures

TextDetDataSample

A data structure interface of MMOCR.

TextRecogDataSample

A data structure interface of MMOCR for text recognition.

KIEDataSample

A data structure interface of MMOCR.

mmocr.datasets

Samplers

BatchAugSampler

Sampler that repeats the same data elements for num_repeats times.

Datasets

OCRDataset

OCRDataset for text detection and text recognition.

WildReceiptDataset

WildReceipt Dataset for key information extraction.

Compatible Datasets

IcdarDataset

Dataset for text detection whose ann_file is in COCO format.

RecogLMDBDataset

RecogLMDBDataset for text recognition.

RecogTextDataset

RecogTextDataset for text recognition.

Dataset Wrapper

ConcatDataset

A wrapper of concatenated dataset.

mmocr.datasets

Loading

LoadImageFromFile

Load an image from file.

LoadOCRAnnotations

Load and process the instances annotation provided by dataset.

LoadKIEAnnotations

Load and process the instances annotation provided by dataset.

InferencerLoader

Load the image in Inferencer’s pipeline.

TextDet Transforms

BoundedScaleAspectJitter

First randomly rescale the image so that its long side and short side are around the bound, then jitter its aspect ratio.

RandomFlip

Flip the image & bbox polygon.

SourceImagePad

Pad Image to target size.

ShortScaleAspectJitter

First rescale the image so that its shorter side reaches short_size, then jitter its aspect ratio, and finally rescale the shape so that it is divisible by scale_divisor.

TextDetRandomCrop

Randomly select a region and crop the image to a target size, making sure it contains a text region.

TextDetRandomCropFlip

Random crop and flip a patch in the image.

TextRecog Transforms

TextRecogGeneralAug

A general geometric augmentation tool for text images in the CVPR 2020 paper “Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition”.

CropHeight

Randomly crop the image’s height, either from top or bottom.

ImageContentJitter

Jitter the image contents.

ReversePixels

Reverse image pixels.

PyramidRescale

Resize the image to the base shape, downsample it with a Gaussian pyramid, and rescale it back to the original size.

PadToWidth

Only pad the image’s width.

RescaleToHeight

Rescale the image to the height according to setting and keep the aspect ratio unchanged if possible.

OCR Transforms

RandomCrop

Randomly crop images and make sure to contain at least one intact instance.

RandomRotate

Randomly rotate the image, boxes, and polygons.

Resize

Resize image & bboxes & polygons.

FixInvalidPolygon

Fix invalid polygons in the dataset.

RemoveIgnored

Remove ignored elements from the pipeline.

Formatting

PackTextDetInputs

Pack the inputs data for text detection.

PackTextRecogInputs

Pack the inputs data for text recognition.

PackKIEInputs

Pack the inputs data for key information extraction.

Transform Wrapper

ImgAugWrapper

A wrapper around imgaug https://github.com/aleju/imgaug.

TorchVisionWrapper

A wrapper around torchvision transforms.

Adapter

MMDet2MMOCR

Convert the transforms' data format from MMDet to MMOCR.

MMOCR2MMDet

Convert the transforms' data format from MMOCR to MMDet.

mmocr.models

models.common

BackBones

UNet

UNet backbone.

Dictionary

Dictionary

The class generates a dictionary for recognition.

Losses

MaskedBalancedBCEWithLogitsLoss

This loss combines a Sigmoid layer and a masked balanced BCE loss in a single class.

MaskedDiceLoss

Masked dice loss.

MaskedSmoothL1Loss

Masked Smooth L1 loss.

MaskedSquareDiceLoss

Masked square dice loss.

MaskedBCEWithLogitsLoss

This loss combines a Sigmoid layer and a masked BCE loss in a single class.

SmoothL1Loss

Smooth L1 loss.

CrossEntropyLoss

Cross entropy loss.

MaskedBalancedBCELoss

Masked Balanced BCE loss.

MaskedBCELoss

Masked BCE loss.

Layers

TFEncoderLayer

Transformer Encoder Layer.

TFDecoderLayer

Transformer Decoder Layer.

Modules

ScaledDotProductAttention

Scaled Dot-Product Attention Module.

MultiHeadAttention

Multi-Head Attention module.

PositionwiseFeedForward

Two-layer feed-forward module.

PositionalEncoding

Fixed positional encoding with sine and cosine functions.

models.textdet

Detectors

SingleStageTextDetector

The class for implementing single stage text detector.

DBNet

The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.

PANet

The class for implementing PANet text detector.

PSENet

The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.

TextSnake

The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

FCENet

The class for implementing FCENet text detector (CVPR 2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection.

DRRG

The class for implementing DRRG text detector.

MMDetWrapper

A wrapper of MMDet’s model.

Data Preprocessors

TextDetDataPreprocessor

Image pre-processor for detection tasks.

Necks

FPEM_FFM

This code is from https://github.com/WenmuZhou/PAN.pytorch.

FPNF

FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.

FPNC

FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.

FPN_UNet

The class for implementing DRRG and TextSnake U-Net-like FPN.

Heads

BaseTextDetHead

Base head for text detection, build the loss and postprocessor.

PSEHead

The class for PSENet head.

PANHead

The class for PANet head.

DBHead

The class for DBNet head.

FCEHead

The class for implementing FCENet head.

TextSnakeHead

The class for TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

DRRGHead

The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

Module Losses

SegBasedModuleLoss

Base class for the module loss of segmentation-based text detection algorithms with some handy utilities.

PANModuleLoss

The class for implementing PANet loss.

PSEModuleLoss

The class for implementing PSENet loss.

DBModuleLoss

The class for implementing DBNet loss.

TextSnakeModuleLoss

The class for implementing TextSnake loss.

FCEModuleLoss

The class for implementing FCENet loss.

DRRGModuleLoss

The class for implementing DRRG loss.

Postprocessors

BaseTextDetPostProcessor

Base postprocessor for text detection models.

PSEPostprocessor

Decoding predictions of PSENet to instances.

PANPostprocessor

Convert scores to quadrangles via post processing in PANet.

DBPostprocessor

Decoding predictions of DBNet to instances.

DRRGPostprocessor

Merge text components and construct boundaries of text instances.

FCEPostprocessor

Decoding predictions of FCENet to instances.

TextSnakePostprocessor

Decoding predictions of TextSnake to instances.

models.textrecog

Recognizers

BaseRecognizer

Base class for recognizer.

EncoderDecoderRecognizer

Base class for encode-decode recognizer.

CRNN

CTC-loss based recognizer.

SARNet

Implementation of SAR

NRTR

Implementation of NRTR

RobustScanner

Implementation of RobustScanner.

SATRN

Implementation of SATRN

ABINet

Implementation of Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.

MASTER

Implementation of MASTER

ASTER

Implement ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.

Data Preprocessors

TextRecogDataPreprocessor

Image pre-processor for recognition tasks.

Preprocessors

STN

Implement STN module in ASTER: An Attentional Scene Text Recognizer with Flexible Rectification (https://ieeexplore.ieee.org/abstract/document/8395027/)

BackBones

ResNet31OCR

Implement ResNet backbone for text recognition, modified from

MiniVGG

A mini VGG backbone for text recognition, modified from VGG-VeryDeep.

NRTRModalityTransform

Modality transform in NRTR.

ShallowCNN

Implement Shallow CNN block for SATRN.

ResNetABI

Implement ResNet backbone for text recognition, modified from ResNet.

ResNet

param in_channels: Number of channels of the input image tensor.

MobileNetV2

See mmdet.models.backbones.MobileNetV2 for details.

Encoders

SAREncoder

Implementation of encoder module in SAR.

NRTREncoder

Transformer Encoder block with self attention mechanism.

BaseEncoder

Base Encoder class for text recognition.

ChannelReductionEncoder

Change the channel number with a one-by-one convolutional layer.

SATRNEncoder

Implement encoder for SATRN, see SATRN.

ABIEncoder

Implement transformer encoder for text recognition, modified from https://github.com/FangShancheng/ABINet.

ASTEREncoder

Implement BiLSTM encoder module in ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.

Decoders

BaseDecoder

Base decoder for text recognition, build the loss and postprocessor.

ABILanguageDecoder

Transformer-based language model responsible for spell correction. Implementation of language model of ABINet.

ABIVisionDecoder

Converts visual features into text characters.

ABIFuser

A special decoder responsible for mixing and aligning visual feature and linguistic feature.

CRNNDecoder

Decoder for CRNN.

ParallelSARDecoder

Implementation of the Parallel Decoder module in SAR.

SequentialSARDecoder

Implementation of the Sequential Decoder module in SAR.

ParallelSARDecoderWithBS

Parallel Decoder module with beam-search in SAR.

NRTRDecoder

Transformer Decoder block with self attention mechanism.

SequenceAttentionDecoder

Sequence attention decoder for RobustScanner.

PositionAttentionDecoder

Position attention decoder for RobustScanner.

RobustScannerFuser

Decoder for RobustScanner.

MasterDecoder

Decoder module in MASTER.

ASTERDecoder

Implement attention decoder.

Module Losses

BaseTextRecogModuleLoss

Base recognition loss.

CEModuleLoss

Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.

CTCModuleLoss

Implementation of loss module for CTC-loss based text recognition.

ABIModuleLoss

Implementation of ABINet multiloss that allows mixing different types of losses with weights.

Postprocessors

BaseTextRecogPostprocessor

Base text recognition postprocessor.

AttentionPostprocessor

PostProcessor for seq2seq.

CTCPostProcessor

PostProcessor for CTC.

Layers

BidirectionalLSTM

Adaptive2DPositionalEncoding

Implement Adaptive 2D positional encoder for SATRN, see SATRN.

BasicBlock

Bottleneck

RobustScannerFusionLayer

DotProductAttentionLayer

PositionAwareLayer

SATRNEncoderLayer

Implement encoder layer for SATRN, see SATRN.

models.kie

Extractors

SDMGR

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.

Heads

SDMGRHead

SDMGR Head.

Module Losses

SDMGRModuleLoss

The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.

Postprocessors

SDMGRPostProcessor

Postprocessor for SDMGR.

mmocr.evaluation

Evaluator

MultiDatasetsEvaluator

Wrapper class to compose ConcatDataset and multiple BaseMetric instances.

TextDet Metric

HmeanIOUMetric

HmeanIOU metric.

TextRecog Metric

WordMetric

Word metrics for text recognition task.

CharMetric

Character metrics for text recognition task.

OneMinusNEDMetric

One minus NED metric for text recognition task.

KIE Metric

F1Metric

Compute F1 scores.

mmocr.visualization

BaseLocalVisualizer

The MMOCR base local visualizer.

TextDetLocalVisualizer

The MMOCR Text Detection Local Visualizer.

TextRecogLocalVisualizer

The MMOCR Text Recognition Local Visualizer.

TextSpottingLocalVisualizer

The MMOCR Text Spotting Local Visualizer.

KIELocalVisualizer

The MMOCR Key Information Extraction Local Visualizer.

mmocr.engine

Hooks

VisualizationHook

Detection Visualization Hook.

mmocr.utils

Box Utils

bbox2poly

Converting a bounding box to a polygon.

bbox_center_distance

Calculate the distance between the center points of two bounding boxes.

bbox_diag_distance

Calculate the diagonal length of a bounding box (distance between the top-left and bottom-right).

bezier2polygon

Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by bezier_points.

is_on_same_line

Check if two boxes are on the same line by their y-axis coordinates.

rescale_bboxes

Rescale bboxes according to scale_factor.

stitch_boxes_into_lines

Stitch fragmented boxes of words into lines.

Point Utils

point_distance

Calculate the distance between two points.

points_center

Calculate the center of a set of points.

Polygon Utils

boundary_iou

Calculate the IOU between two boundaries.

crop_polygon

Crop polygon to be within a box region.

is_poly_inside_rect

Check if the polygon is inside the target region.

offset_polygon

Offset (expand/shrink) the polygon by the target distance.

poly2bbox

Converting a polygon to a bounding box.

poly2shapely

Convert a polygon to shapely.geometry.Polygon.

poly_intersection

Calculate the intersection area between two polygons.

poly_iou

Calculate the IOU between two polygons.

poly_make_valid

Convert a potentially invalid polygon to a valid one by eliminating self-crossing or self-touching parts.

poly_union

Calculate the union area between two polygons.

polys2shapely

Convert a nested list of boundaries to a list of Polygons.

rescale_polygon

Rescale a polygon according to scale_factor.

rescale_polygons

Rescale polygons according to scale_factor.

shapely2poly

Convert a shapely.geometry.Polygon to a polygon (the inverse of poly2shapely).

sort_points

Sort arbitrary points in clockwise order in Cartesian coordinates; you may need to reverse the output sequence if you are using OpenCV's image coordinates.

sort_vertex

Sort box vertices in clockwise order from left-top first.

sort_vertex8

Sort vertices given as 8 points [x1, y1, x2, y2, x3, y3, x4, y4].

Mask Utils

fill_hole

Fill holes in matrix.

Misc Utils

equal_len

is_2dlist

Check whether x is a 2D list ([[1], []]) or a 1D empty list ([]).

is_3dlist

Check whether x is a 3D list ([[[1], []]]), a 2D empty list ([[], []]), or a 1D empty list ([]).

is_none_or_type

is_type_list

Setup Env

register_all_modules

Register all modules in mmocr into the registries.

Welcome to the OpenMMLab community

Scan the QR code below to follow the OpenMMLab team's Zhihu Official Account and join the OpenMMLab team's QQ Group, or join the official WeChat communication group, or join our Slack.

Within the OpenMMLab community, we will provide you with:

  • 📢 The latest core technologies of AI frameworks

  • 💻 Explanations of the source code of common PyTorch modules

  • 📰 News about OpenMMLab releases

  • 🚀 Introductions to cutting-edge algorithms developed by OpenMMLab

  • 🏃 More efficient answers and feedback

  • 🔥 A platform for communication with developers from all walks of life

The OpenMMLab community looks forward to your participation! 👬
