Welcome to MMOCR’s documentation!¶
Overview¶
MMOCR is an open source toolkit based on PyTorch and MMDetection, supporting numerous OCR-related models, including text detection, text recognition, and key information extraction. In addition, it supports widely-used academic datasets and provides many useful tools, assisting users in exploring various aspects of models and datasets and implementing high-quality algorithms. Generally, it has the following features.
One-stop, Multi-model: MMOCR supports various OCR-related tasks and implements the latest models for text detection, recognition, and key information extraction.
Modular Design: MMOCR’s modular design allows users to define and reuse modules in the model on demand.
Various Useful Tools: MMOCR provides a number of analysis tools, including visualizers, validation scripts, evaluators, etc., to help users troubleshoot, finetune or compare models.
Powered by OpenMMLab: Like other algorithm libraries in OpenMMLab family, MMOCR follows OpenMMLab’s rigorous development guidelines and interface conventions, significantly reducing the learning cost of users familiar with other projects in OpenMMLab family. In addition, benefiting from the unified interfaces among OpenMMLab, you can easily call the models implemented in other OpenMMLab projects (e.g. MMDetection) in MMOCR, facilitating cross-domain research and real-world applications.
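As an illustration of this interoperability, an MMOCR model config can reference modules from another OpenMMLab library through a scope prefix in the registry. The fragment below is only a sketch of a backbone definition; the exact fields depend on the model you configure:
# Hypothetical excerpt from an MMOCR text detection config.
# The 'mmdet.' prefix asks the registry to resolve ResNet from
# MMDetection rather than from MMOCR itself.
model = dict(
    type='DBNet',
    backbone=dict(
        type='mmdet.ResNet',  # reuse MMDetection's ResNet implementation
        depth=18,
        num_stages=4,
        out_indices=(0, 1, 2, 3)))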
Together with the release of OpenMMLab 2.0, MMOCR has also reached its 1.0.0 version, which introduces significant BC-breaking changes, resulting in less code redundancy, higher code efficiency, and an overall more systematic and consistent design.
Considering that there are some backward incompatible changes in this version compared to 0.x, we have prepared a detailed migration guide. It lists all the changes made in the new version and the steps required to migrate. We hope this guide can help users familiar with the old framework to complete the upgrade as quickly as possible. Though this may take some time, we believe that the new features brought by MMOCR and the OpenMMLab ecosystem will make it all worthwhile. 😊
Next, please read the section according to your actual needs.
We recommend that beginners go through Quick Run to get familiar with MMOCR and master the usage of MMOCR by reading the examples in User Guides.
Intermediate and advanced developers are suggested to learn the background, conventions, and recommended implementations of each component from Basic Concepts.
Read our FAQ to find answers to frequently asked questions.
If you can’t find the answers you need in the documentation, feel free to raise an issue.
Everyone is welcome to be a contributor! Read the contribution guide to learn how to contribute to MMOCR!
Installation¶
Prerequisites¶
Linux | Windows | macOS
Python 3.7
PyTorch 1.6 or higher
torchvision 0.7.0
CUDA 10.1
NCCL 2
GCC 5.4.0 or higher
Environment Setup¶
Note
If you are experienced with PyTorch and have already installed it, just skip this part and jump to the next section. Otherwise, you can follow these steps for the preparation.
Step 0. Download and install Miniconda from the official website.
Step 1. Create a conda environment and activate it.
conda create --name openmmlab python=3.8 -y
conda activate openmmlab
Step 2. Install PyTorch following official instructions, e.g.
# On GPU platforms:
conda install pytorch torchvision -c pytorch
# On CPU platforms:
conda install pytorch torchvision cpuonly -c pytorch
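To quickly confirm that PyTorch was installed correctly (and, on GPU platforms, that it can see your devices), you can run the following in a Python interpreter:
>>> import torch
>>> print(torch.__version__)
>>> print(torch.cuda.is_available())  # expected to be True on a working GPU setup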
Installation Steps¶
We recommend that users follow our best practices to install MMOCR. However, the whole process is highly customizable. See Customize Installation section for more information.
Best Practices¶
Step 0. Install MMEngine, MMCV and MMDetection using MIM.
pip install -U openmim
mim install mmengine
mim install mmcv
mim install mmdet
Step 1. Install MMOCR.
If you wish to run and develop MMOCR directly, install it from source (recommended):
git clone https://github.com/open-mmlab/mmocr.git
cd mmocr
pip install -v -e .
# "-v" increases pip's verbosity.
# "-e" installs the project in editable mode,
# so any local modifications to the code take effect immediately.
If you use MMOCR as a dependency or third-party package, install it via MIM instead:
mim install mmocr
Step 2. (Optional) If you wish to use any transform involving albumentations (for example, Albu in ABINet's pipeline), or any dependency for building documentation or running unit tests, please install it with the following command.
If you installed MMOCR from source:
# install albu
pip install -r requirements/albu.txt
# install the dependencies for building documentation and running unit tests
pip install -r requirements.txt
If you installed MMOCR via MIM:
pip install albumentations>=1.1.0 --no-binary qudida,albumentations
Note
We recommend checking the environment after installing albumentations to ensure that opencv-python and opencv-python-headless are not installed together; otherwise it might cause unexpected issues. If that is unfortunately the case, please uninstall opencv-python-headless to make sure MMOCR's visualization utilities can work.
Refer to albumentations' official documentation for more details.
Verify the installation¶
You may verify the installation via this inference demo.
Run the following code in a Python interpreter:
>>> from mmocr.apis import MMOCRInferencer
>>> ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
>>> ocr('demo/demo_text_ocr.jpg', show=True, print_result=True)
If you installed MMOCR from source, you can run the following in MMOCR’s root directory:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec CRNN --show --print-result
You should be able to see a pop-up image and the inference result printed out in the console upon successful verification.

# Inference result
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}
Note
If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, you may not see the pop-up window.
Customize Installation¶
CUDA versions¶
When installing PyTorch, you need to specify the version of CUDA. If you are not clear on which to choose, follow our recommendations:
For Ampere-based NVIDIA GPUs, such as GeForce 30 series and NVIDIA A100, CUDA 11 is a must.
For older NVIDIA GPUs, CUDA 11 is backward compatible, but CUDA 10.2 offers better compatibility and is more lightweight.
Please make sure the GPU driver satisfies the minimum version requirements. See this table for more information.
Note
Installing CUDA runtime libraries is enough if you follow our best practices, because no CUDA code will be compiled locally. However, if you hope to compile MMCV from source or develop other CUDA operators, you need to install the complete CUDA toolkit from NVIDIA's website, and its version should match the CUDA version of PyTorch, i.e., the version of cudatoolkit specified in the conda install command.
Install MMCV without MIM¶
MMCV contains C++ and CUDA extensions, thus depending on PyTorch in a complex way. MIM solves such dependencies automatically and makes the installation easier. However, it is not a must.
To install MMCV with pip instead of MIM, please follow MMCV installation guides. This requires manually specifying a find-url based on PyTorch version and its CUDA version.
For example, the following command installs mmcv built for PyTorch 1.10.x and CUDA 11.3.
pip install 'mmcv>=2.0.0rc1' -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html
Install on CPU-only platforms¶
MMOCR can be built for CPU-only environments. In CPU mode you can train (requires MMCV >= 1.4.4), test, or run inference with a model.
However, some functionalities are missing in this mode:
Deformable Convolution
Modulated Deformable Convolution
ROI pooling
SyncBatchNorm
If you try to train/test/run inference with a model containing the above ops, an error will be raised. The following table lists the affected algorithms.
Operator | Model |
---|---|
Deformable Convolution/Modulated Deformable Convolution | DBNet (r50dcnv2), DBNet++ (r50dcnv2), FCENet (r50dcnv2) |
SyncBatchNorm | PANet, PSENet |
Using MMOCR with Docker¶
We provide a Dockerfile to build an image.
# build an image with PyTorch 1.6, CUDA 10.1
docker build -t mmocr docker/
Run it with
docker run --gpus all --shm-size=8g -it -v {DATA_DIR}:/mmocr/data mmocr
Dependency on MMEngine, MMCV & MMDetection¶
MMOCR has different version requirements on MMEngine, MMCV and MMDetection at each release to guarantee the implementation correctness. Please refer to the table below and ensure the package versions fit the requirement.
MMOCR | MMEngine | MMCV | MMDetection |
---|---|---|---|
dev-1.x | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0 |
1.0.1 | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0 |
1.0.0 | 0.7.1 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0 |
1.0.0rc6 | 0.6.0 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0 |
1.0.0rc[4-5] | 0.1.0 <= mmengine < 1.0.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0 |
1.0.0rc[0-3] | 0.0.0 <= mmengine < 0.2.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0 |
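If you are unsure which versions are installed in your environment, you can print them and compare against the table above:
>>> import mmengine, mmcv, mmdet, mmocr
>>> print(mmengine.__version__, mmcv.__version__, mmdet.__version__, mmocr.__version__)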
Quick Run¶
This chapter will take you through the basic functions of MMOCR, and we assume that you have installed MMOCR from source. You may check out the tutorial notebook for how to perform inference, training and testing interactively.
Inference¶
Run the following in MMOCR’s root directory:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec CRNN --show --print-result
You should be able to see a pop-up image and the inference result printed out in the console.

# Inference result
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}
Note
If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, you may not see the pop-up window.
A detailed description of MMOCR's inference interface can be found here.
In addition to using our well-provided pre-trained models, you can also train models on your own datasets. In the next section, we will take you through the basic functions of MMOCR by training DBNet on the mini ICDAR 2015 dataset as an example.
Prepare a Dataset¶
Since the wide variety of OCR dataset formats is not conducive to either switching between or jointly training on multiple datasets, MMOCR proposes a uniform data format and provides a Dataset Preparer for commonly used OCR datasets. Usually, to use those datasets in MMOCR, you just need to follow the steps to get them ready for use.
Note
In this quick run, however, efficiency comes first, so we will use a much smaller dataset.
Here, we have prepared a lite version of the ICDAR 2015 dataset for demonstration purposes. Download our pre-prepared archive and extract it to the data/ directory under mmocr to get the prepared images and annotation file.
wget https://download.openmmlab.com/mmocr/data/icdar2015/mini_icdar2015.tar.gz
mkdir -p data/
tar xzvf mini_icdar2015.tar.gz -C data/
Modify the Config¶
Once the dataset is prepared, we will then specify the location of the training set and the training parameters by modifying the config file.
In this example, we will train a DBNet using resnet18 as its backbone. Since MMOCR already has a config file for the full ICDAR 2015 dataset (configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
), we just need to make some modifications on top of it.
We first need to modify the path to the dataset. In this config, most of the key configurations are imported via _base_, such as the dataset configuration from configs/textdet/_base_/datasets/icdar2015.py. Open that file and replace the path assigned to icdar2015_textdet_data_root in its first line with:
icdar2015_textdet_data_root = 'data/mini_icdar2015'
Also, because of the reduced dataset size, we have to reduce the number of training epochs to 400 accordingly, shorten the validation interval as well as the checkpoint saving interval to 10 epochs, and drop the learning rate decay strategy. The following lines of configuration can be directly put into configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py to take effect.
# Save checkpoints every 10 epochs, and only keep the latest checkpoint
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=10,
        max_keep_ckpts=1,
    ))
# Set the maximum number of epochs to 400, and validate the model every 10 epochs
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400, val_interval=10)
# Fix learning rate as a constant
param_scheduler = [
    dict(type='ConstantLR', factor=1.0),
]
Here, we have overridden the corresponding parameters in the base configuration directly through the config inheritance mechanism (see MMEngine: Config). The original fields are located in configs/textdet/_base_/schedules/schedule_sgd_1200e.py and configs/textdet/_base_/default_runtime.py.
Note
For a more detailed description of config, please refer to here.
Browse the Dataset¶
Before we start the training, we can also visualize the image processed by training-time data transforms. It’s quite simple: pass the config file we need to visualize into the browse_dataset.py script.
python tools/analysis_tools/browse_dataset.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
The transformed images and annotations will be displayed one by one in a pop-up window.



Note
For details on the parameters and usage of this script, please refer to here.
Tip
In addition to satisfying our curiosity, visualization can also help us check the parts that may affect the model’s performance before training, such as problems in configs, datasets and data transforms.
Training¶
Start the training by running the following command:
python tools/train.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
Depending on the system environment, MMOCR will automatically select the best device for training. If a GPU is available, single-GPU training will be started by default. Once you start to see the loss outputs, you have successfully started the training.
2022/08/22 18:42:22 - mmengine - INFO - Epoch(train) [1][5/7] lr: 7.0000e-03 memory: 7730 data_time: 0.4496 loss_prob: 14.6061 loss_thr: 2.2904 loss_db: 0.9879 loss: 17.8843 time: 1.8666
2022/08/22 18:42:24 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:28 - mmengine - INFO - Epoch(train) [2][5/7] lr: 7.0000e-03 memory: 6695 data_time: 0.2052 loss_prob: 6.7840 loss_thr: 1.4114 loss_db: 0.9855 loss: 9.1809 time: 0.7506
2022/08/22 18:42:29 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:33 - mmengine - INFO - Epoch(train) [3][5/7] lr: 7.0000e-03 memory: 6690 data_time: 0.2101 loss_prob: 3.0700 loss_thr: 1.1800 loss_db: 0.9967 loss: 5.2468 time: 0.6244
2022/08/22 18:42:33 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
Without extra configurations, model weights will be saved to work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/
, while the logs will be stored in work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/TIMESTAMP/
. Next, we just need to wait with some patience for training to finish.
Note
For advanced usage of training, such as CPU training, multi-GPU training, and cluster training, please refer to Training and Testing.
Testing¶
After 400 epochs, we observe that DBNet performs best in the last epoch, with hmean
reaching 60.86 (You may see a different result):
08/22 19:24:52 - mmengine - INFO - Epoch(val) [400][100/100] icdar/precision: 0.7285 icdar/recall: 0.5226 icdar/hmean: 0.6086
Note
It may not have been trained to be optimal, but it is sufficient for a demo.
However, this value only reflects the performance of DBNet on the mini ICDAR 2015 dataset. For a comprehensive evaluation, we also need to see how it performs on out-of-distribution datasets. For example, tests/data/det_toy_dataset
is a very small real dataset that we can use to verify the actual performance of DBNet.
Before testing, we also need to make some changes to the location of the dataset. Open configs/textdet/_base_/datasets/icdar2015.py and change the data_root of icdar2015_textdet_test to tests/data/det_toy_dataset:
# ...
icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root='tests/data/det_toy_dataset',
    # ...
)
Start testing:
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth
And get the outputs like:
08/21 21:45:59 - mmengine - INFO - Epoch(test) [5/10] memory: 8562
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10] eta: 0:00:00 time: 0.4893 data_time: 0.0191 memory: 283
08/21 21:45:59 - mmengine - INFO - Evaluating hmean-iou...
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.30, recall: 0.6190, precision: 0.4815, hmean: 0.5417
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.40, recall: 0.6190, precision: 0.5909, hmean: 0.6047
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.50, recall: 0.6190, precision: 0.6842, hmean: 0.6500
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.60, recall: 0.6190, precision: 0.7222, hmean: 0.6667
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.70, recall: 0.3810, precision: 0.8889, hmean: 0.5333
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.80, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.90, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10] icdar/precision: 0.7222 icdar/recall: 0.6190 icdar/hmean: 0.6667
The model achieves an hmean of 0.6667 on this dataset.
Note
For advanced usage of testing, such as CPU testing, multi-GPU testing, and cluster testing, please refer to Training and Testing.
Visualize the Outputs¶
We can also visualize the prediction outputs in test.py. You can open a pop-up visualization window with the --show parameter, or specify the directory to which the prediction result images are exported with the --show-dir parameter.
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/
The true labels and predicted values are displayed in a tiled fashion in the visualization results. The green boxes in the left panel indicate the true labels and the red boxes in the right panel indicate the predicted values.

Note
For a description of more visualization features, see here.
FAQ¶
General¶
Q1 I’m getting the warning like unexpected key in source state_dict: fc.weight, fc.bias
, is there something wrong?
A It’s not an error. It occurs because the backbone network is pretrained on image classification tasks, where the last fc layer is required to generate the classification output. However, the fc layer is no longer needed when the backbone network is used to extract features in downstream tasks, and therefore these weights can be safely skipped when loading the checkpoint.
Q2 MMOCR terminates with an error: shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry
. How could I fix it?
A This error occurs because of some invalid polygons (e.g., polygons with self-intersections) existing in the dataset or generated by some non-rigorous data transforms. These polygons can be fixed by adding FixInvalidPolygon
transform after the transform likely to introduce invalid polygons. For example, a common practice is to append it after LoadOCRAnnotations
in both train and test pipeline. The resulting pipeline should look like:
train_pipeline = [
    ...
    dict(
        type='LoadOCRAnnotations',
        with_polygon=True,
        with_bbox=True,
        with_label=True,
    ),
    dict(type='FixInvalidPolygon', min_poly_points=4),
    ...
]
In practice, we find that Totaltext contains some invalid polygons and using FixInvalidPolygon
is a must. Here is an example config.
Q3 Getting libpng warning: iCCP: known incorrect sRGB profile
when loading images with cv2
backend.
A This is a warning from libpng
and it is safe to ignore. It is caused by the icc
profile in the image. You can use pillow
backend to avoid this warning:
train_pipeline = [
    dict(
        type='LoadImageFromFile',
        imdecode_backend='pillow'),
    ...
]
Text Recognition¶
Q1 What are the steps to train text recognition models with my own dictionary?
A In MMOCR 1.0, you only need to modify the config and point Dictionary
to your custom dict file. For example, if you want to train SAR model (https://github.com/open-mmlab/mmocr/blob/75c06d34bbc01d3d11dfd7afc098b6cdeee82579/configs/textrecog/sar/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real.py) with your own dictionary placed at /my/dict.txt
, you can modify dictionary.dict_file
term in base config to:
dictionary = dict(
    type='Dictionary',
    dict_file='/my/dict.txt',
    with_start=True,
    with_end=True,
    same_start_end=True,
    with_padding=True,
    with_unknown=True)
Now you are good to go. You can also find more information in Dictionary API.
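For reference, a dictionary file is a plain text file with one character per line. The sketch below writes a toy dictionary of digits and lowercase letters; the path /my/dict.txt and the character set are placeholders for your own:
# Write a toy dictionary file: one character per line.
# '/my/dict.txt' and the character set are placeholders.
chars = list('0123456789abcdefghijklmnopqrstuvwxyz')
with open('/my/dict.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(chars))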
Q2 How to properly visualize non-English characters?
A You can customize font_families
or font_properties
in the visualizer. For example, to visualize Korean, modify configs/textrecog/_base_/default_runtime.py:
visualizer = dict(
    type='TextRecogLocalVisualizer',
    name='visualizer',
    font_families='NanumGothic',  # new feature
    vis_backends=vis_backends)
It’s also fine to pass the font path to visualizer:
visualizer = dict(
    type='TextRecogLocalVisualizer',
    name='visualizer',
    font_properties='path/to/font_file',
    vis_backends=vis_backends)
Inference¶
In OpenMMLab, all the inference operations are unified into a new interface - Inferencer
. Inferencer
is designed to expose a neat and simple API to users, and shares very similar interface across different OpenMMLab libraries.
In MMOCR, Inferencers are constructed in different levels of task abstraction.
Standard Inferencer: Following OpenMMLab's convention, each fundamental task in MMOCR has a standard Inferencer, namely TextDetInferencer (text detection), TextRecInferencer (text recognition), TextSpottingInferencer (end-to-end OCR), and KIEInferencer (key information extraction). They are designed to perform inference on a single task, and can be chained together to perform inference on a series of tasks. They also share a very similar interface, have a standard input/output protocol, and overall follow the OpenMMLab design.
MMOCRInferencer: We also provide MMOCRInferencer, a convenient inference interface designed only for MMOCR. It encapsulates and chains all the Inferencers in MMOCR, so users can use this Inferencer to perform a series of tasks on an image and directly get the final result in an end-to-end manner. However, it has a relatively different interface from other standard Inferencers, and some standard Inferencer functionalities might be sacrificed for the sake of simplicity.
For new users, we recommend using MMOCRInferencer to test out different combinations of models.
If you are a developer and wish to integrate the models into your own project, we recommend using standard Inferencers, as they are more flexible and standardized, equipped with full functionalities.
Basic Usage¶
As of now, MMOCRInferencer
can perform inference on the following tasks:
Text detection
Text recognition
OCR (text detection + text recognition)
Key information extraction (text detection + text recognition + key information extraction)
OCR (text spotting) (coming soon)
For convenience, MMOCRInferencer provides both Python and command line interfaces. For example, if you want to perform OCR inference on demo/demo_text_ocr.jpg with DBNet as the text detection model and SAR as the text recognition model, you can simply run the following:
>>> from mmocr.apis import MMOCRInferencer
>>> # Load models into memory
>>> ocr = MMOCRInferencer(det='DBNet', rec='SAR')
>>> # Perform inference
>>> ocr('demo/demo_text_ocr.jpg', show=True)
Or equivalently, from the command line:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec SAR --show
The resulting OCR output will be displayed in a new window:

Note
If you are running MMOCR on a server without GUI or via SSH tunnel with X11 forwarding disabled, the show
option will not work. However, you can still save visualizations to files by setting out_dir
and save_vis=True
arguments. Read Dumping Results for details.
Depending on the initialization arguments, MMOCRInferencer
can run in different modes. For example, it can run in KIE mode if it is initialized with det
, rec
and kie
specified.
>>> kie = MMOCRInferencer(det='DBNet', rec='SAR', kie='SDMGR')
>>> kie('demo/demo_kie.jpeg', show=True)
Or equivalently, from the command line:
python tools/infer.py demo/demo_kie.jpeg --det DBNet --rec SAR --kie SDMGR --show
The output image should look like this:

You may have found that the Python interface and the command line interface of MMOCRInferencer
are very similar. The following sections will use the Python interface as an example to introduce the usage of MMOCRInferencer
. For more information about the command line interface, please refer to Command Line Interface.
In general, all the standard Inferencers across OpenMMLab share a very similar interface. The following example shows how to use TextDetInferencer
to perform inference on a single image.
>>> from mmocr.apis import TextDetInferencer
>>> # Load models into memory
>>> inferencer = TextDetInferencer(model='DBNet')
>>> # Inference
>>> inferencer('demo/demo_text_ocr.jpg', show=True)
The visualization result should look like:

Initialization¶
Each Inferencer must be initialized with a model. You can also choose the inference device during initialization.
Model Initialization¶
For each task, MMOCRInferencer
takes two arguments in the form of xxx
and xxx_weights
(e.g. det
and det_weights
) for initialization, and there are many ways to initialize a model for inference. We will take det
and det_weights
as an example to illustrate some typical ways to initialize a model.
To infer with MMOCR's pre-trained models, passing a model name to the argument det works. The weights will be automatically downloaded and loaded from OpenMMLab's model zoo. Check Weights for available model names.
>>> MMOCRInferencer(det='DBNet')
To load a custom config and weight, you can pass the path to the config file to det and the path to the weight to det_weights.
>>> MMOCRInferencer(det='path/to/dbnet_config.py', det_weights='path/to/dbnet.pth')
You may click on the “Standard Inferencer” tab to find more initialization methods.
Every standard Inferencer accepts two parameters, model and weights. (In MMOCRInferencer, they are referred to as xxx and xxx_weights.)
model takes either the name of a model or the path to a config file as input. The name of a model is obtained from the model's metafile (Example) indexed from model-index.yml. You can find the list of available weights here.
weights accepts the path to a weight file.
There are various ways to initialize a model.
To infer with MMOCR's pre-trained models, you can pass the model name to model. The weights will be automatically downloaded and loaded from OpenMMLab's model zoo.
>>> from mmocr.apis import TextDetInferencer
>>> inferencer = TextDetInferencer(model='DBNet')
Note
The model type must match the Inferencer type.
You can load another weight by passing its path/url to weights.
>>> inferencer = TextDetInferencer(model='DBNet', weights='path/to/dbnet.pth')
To load a custom config and weight, you can pass the path to the config file to model and the path to the weight to weights.
>>> inferencer = TextDetInferencer(model='path/to/dbnet_config.py', weights='path/to/dbnet.pth')
By default, MMEngine dumps the config into the weight file. If you have a weight trained with MMEngine, you can also pass the path to the weight file to weights without specifying model:
>>> # It will raise an error if the config file cannot be found in the weight
>>> inferencer = TextDetInferencer(weights='path/to/dbnet.pth')
Passing a config file to model without specifying weights will result in a randomly initialized model.
Device¶
Each Inferencer instance is bound to a device.
By default, the best device is automatically decided by MMEngine. You can also alter the device by specifying the device
argument. For example, you can use the following code to create an Inferencer on GPU 1.
>>> inferencer = MMOCRInferencer(det='DBNet', device='cuda:1')
>>> inferencer = TextDetInferencer(model='DBNet', device='cuda:1')
To create an Inferencer on CPU:
>>> inferencer = MMOCRInferencer(det='DBNet', device='cpu')
>>> inferencer = TextDetInferencer(model='DBNet', device='cpu')
Refer to torch.device for all the supported forms.
Inference¶
Once the Inferencer is initialized, you can directly pass in the raw data to be inferred and get the inference results from return values.
Input¶
Input can be either of these types:
str: Path/URL to the image.
>>> inferencer('demo/demo_text_ocr.jpg')
array: Image in numpy array. It should be in BGR order.
>>> import mmcv
>>> array = mmcv.imread('demo/demo_text_ocr.jpg')
>>> inferencer(array)
list: A list of basic types above. Each element in the list will be processed separately.
>>> inferencer(['img_1.jpg', 'img_2.jpg'])
>>> # You can even mix the types
>>> inferencer(['img_1.jpg', array])
str: Path to the directory. All images in the directory will be processed.
>>> inferencer('tests/data/det_toy_dataset/imgs/test/')
For KIEInferencer, the input can be a dict or list[dict], where each dictionary contains the following keys:
img (str or ndarray): Path to the image or the image itself. If KIE Inferencer is used in no-visual mode, this key is not required. If it's a numpy array, it should be in BGR order.
img_shape (tuple(int, int)): Image shape in (H, W). Only required when KIE Inferencer is used in no-visual mode and no img is provided.
instances (list[dict]): A list of instances.
Each instance looks like the following:
{
# A nested list of 4 numbers representing the bounding box of
# the instance, in (x1, y1, x2, y2) order.
"bbox": np.array([[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
dtype=np.int32),
# List of texts.
"texts": ['text1', 'text2', ...],
}
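Putting the pieces together, a sketch of calling KIEInferencer with such a dict might look like the following. The model name, boxes, and texts are placeholders, and the exact behavior depends on the weights available in your MMOCR version:
import numpy as np
from mmocr.apis import KIEInferencer

kie_inferencer = KIEInferencer(model='SDMGR')  # placeholder model name
data = dict(
    img='demo/demo_kie.jpeg',
    instances=[
        dict(
            # placeholder boxes in (x1, y1, x2, y2) order
            bbox=np.array([[10, 10, 100, 40], [10, 50, 120, 80]], dtype=np.int32),
            # placeholder texts, one per box
            texts=['text1', 'text2'],
        ),
    ])
result = kie_inferencer(data)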
Output¶
By default, each Inferencer
returns the prediction results in a dictionary format.
visualization contains the visualized predictions. It is an empty list by default unless return_vis=True.
predictions contains the prediction results in a json-serializable format. As presented below, the contents differ slightly depending on the task type. For MMOCRInferencer, the result looks like:
{
    'predictions' : [
      # Each instance corresponds to an input image
      {
        'det_polygons': [...],  # 2d list of length (N,), format: [x1, y1, x2, y2, ...]
        'det_scores': [...],  # float list of length (N,)
        'det_bboxes': [...],  # 2d list of shape (N, 4), format: [min_x, min_y, max_x, max_y]
        'rec_texts': [...],  # str list of length (N,)
        'rec_scores': [...],  # float list of length (N,)
        'kie_labels': [...],  # node labels, length (N,)
        'kie_scores': [...],  # node scores, length (N,)
        'kie_edge_scores': [...],  # edge scores, shape (N, N)
        'kie_edge_labels': [...]  # edge labels, shape (N, N)
      },
      ...
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
TextDetInferencer:
{
    'predictions' : [
      # Each instance corresponds to an input image
      {
        'polygons': [...],  # 2d list of len (N,) in the format of [x1, y1, x2, y2, ...]
        'bboxes': [...],  # 2d list of shape (N, 4), in the format of [min_x, min_y, max_x, max_y]
        'scores': [...]  # list of float, len (N,)
      },
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
TextRecInferencer:
{
    'predictions' : [
      # Each instance corresponds to an input image
      {
        'text': '...',  # a string
        'scores': 0.1,  # a float
      },
      ...
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
TextSpottingInferencer:
{
    'predictions' : [
      # Each instance corresponds to an input image
      {
        'polygons': [...],  # 2d list of len (N,) in the format of [x1, y1, x2, y2, ...]
        'bboxes': [...],  # 2d list of shape (N, 4), in the format of [min_x, min_y, max_x, max_y]
        'scores': [...],  # list of float, len (N,)
        'texts': ['...']  # list of texts, len (N,)
      },
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
KIEInferencer:
{
    'predictions' : [
      # Each instance corresponds to an input image
      {
        'labels': [...],  # node labels, len (N,)
        'scores': [...],  # node scores, len (N,)
        'edge_scores': [...],  # edge scores, shape (N, N)
        'edge_labels': [...],  # edge labels, shape (N, N)
      },
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
If you wish to get the raw outputs from the model, you can set return_datasamples to True to get the original DataSample, which will be stored in predictions.
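For example, reusing the inferencer created above:
>>> result = inferencer('demo/demo_text_ocr.jpg', return_datasamples=True)
>>> data_sample = result['predictions'][0]  # a DataSample (e.g. TextDetDataSample) instead of a plain dict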
Dumping Results¶
Apart from obtaining predictions from the return value, you can also export the predictions/visualizations to files by setting out_dir
and save_pred
/save_vis
arguments.
>>> inferencer('img_1.jpg', out_dir='outputs/', save_pred=True, save_vis=True)
This results in a directory structure like:
outputs
├── preds
│ └── img_1.json
└── vis
└── img_1.jpg
The filename of each file is the same as the corresponding input image filename. If the input image is an array, the filename will be a number starting from 0.
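The dumped prediction files are just JSON versions of the predictions described in the Output section, so they can be read back with the standard library. A minimal sketch, assuming the directory layout above:
import json

with open('outputs/preds/img_1.json') as f:
    pred = json.load(f)
print(pred.keys())  # e.g. det_polygons, det_scores, rec_texts, ... depending on the task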
Batch Inference¶
You can customize the batch size by setting batch_size
. The default batch size is 1.
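For example, to run inference on a whole directory with a larger batch size:
>>> inferencer('tests/data/det_toy_dataset/imgs/test/', batch_size=4)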
API¶
Here are extensive lists of parameters that you can use.
MMOCRInferencer.__init__():
Arguments | Type | Default | Description |
---|---|---|---|
det | str or Weights, optional | None | Pretrained text detection algorithm. It's the path to the config file or the model name defined in metafile. |
det_weights | str, optional | None | Path to the custom checkpoint file of the selected det model. If it is not specified and "det" is a model name of metafile, the weights will be loaded from metafile. |
rec | str or Weights, optional | None | Pretrained text recognition algorithm. It's the path to the config file or the model name defined in metafile. |
rec_weights | str, optional | None | Path to the custom checkpoint file of the selected rec model. If it is not specified and "rec" is a model name of metafile, the weights will be loaded from metafile. |
kie [1] | str or Weights, optional | None | Pretrained key information extraction algorithm. It's the path to the config file or the model name defined in metafile. |
kie_weights | str, optional | None | Path to the custom checkpoint file of the selected kie model. If it is not specified and "kie" is a model name of metafile, the weights will be loaded from metafile. |
device | str, optional | None | Device used for inference, accepting all allowed strings by torch.device. E.g., 'cuda:0' or 'cpu'. If None, the available device will be automatically used. Defaults to None. |
[1]: kie
is only effective when both text detection and recognition models are specified.
MMOCRInferencer.__call__()
Arguments | Type | Default | Description |
---|---|---|---|
inputs | str/list/tuple/np.array | required | It can be a path to an image/a folder, an np array or a list/tuple (with img paths or np arrays). |
return_datasamples | bool | False | Whether to return results as DataSamples. If False, the results will be packed into a dict. |
batch_size | int | 1 | Inference batch size. |
det_batch_size | int, optional | None | Inference batch size for the text detection model. Overrides batch_size if it is not None. |
rec_batch_size | int, optional | None | Inference batch size for the text recognition model. Overrides batch_size if it is not None. |
kie_batch_size | int, optional | None | Inference batch size for the KIE model. Overrides batch_size if it is not None. |
return_vis | bool | False | Whether to return the visualization result. |
print_result | bool | False | Whether to print the inference result to the console. |
show | bool | False | Whether to display the visualization results in a popup window. |
wait_time | float | 0 | The interval of show (s). |
out_dir | str | results/ | Output directory of results. |
save_vis | bool | False | Whether to save the visualization results to out_dir. |
save_pred | bool | False | Whether to save the inference results to out_dir. |
Inferencer.__init__():
Arguments | Type | Default | Description |
---|---|---|---|
model | str or Weights, optional | None | Path to the config file or the model name defined in metafile. |
weights | str, optional | None | Path to the custom checkpoint file of the selected model. If it is not specified and "model" is a model name of metafile, the weights will be loaded from metafile. |
device | str, optional | None | Device used for inference, accepting all allowed strings by torch.device. E.g., 'cuda:0' or 'cpu'. If None, the available device will be automatically used. Defaults to None. |
Inferencer.__call__()
Arguments | Type | Default | Description |
---|---|---|---|
inputs | str/list/tuple/np.array | required | It can be a path to an image/a folder, an np array or a list/tuple (with img paths or np arrays). |
return_datasamples | bool | False | Whether to return results as DataSamples. If False, the results will be packed into a dict. |
batch_size | int | 1 | Inference batch size. |
progress_bar | bool | True | Whether to show a progress bar. |
return_vis | bool | False | Whether to return the visualization result. |
print_result | bool | False | Whether to print the inference result to the console. |
show | bool | False | Whether to display the visualization results in a popup window. |
wait_time | float | 0 | The interval of show (s). |
draw_pred | bool | True | Whether to draw predicted bounding boxes. Only applicable on TextDetInferencer and TextSpottingInferencer. |
out_dir | str | results/ | Output directory of results. |
save_vis | bool | False | Whether to save the visualization results to out_dir. |
save_pred | bool | False | Whether to save the inference results to out_dir. |
Command Line Interface¶
Note
This section is only applicable to MMOCRInferencer
.
You can use tools/infer.py
to perform inference through MMOCRInferencer
.
Its general usage is as follows:
python tools/infer.py INPUT_PATH [--det DET] [--det-weights ...] ...
where INPUT_PATH
is a required field, which should be a path to an image or a folder. Command-line parameters follow the mapping relationship with the Python interface parameters as follows:
To convert the Python interface parameters to the command line ones, you need to add two hyphens -- in front of the Python parameter name, and replace the underscore _ with the hyphen -. For example, out_dir becomes --out-dir.
For boolean type parameters, putting the parameter in the command is equivalent to specifying it as True. For example, --show will specify the show parameter as True.
In addition, the command line will not display the inference result by default. You can use the --print-result
parameter to view the inference result.
Here is an example:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec SAR --show --print-result
Running this command will give the following result:
{'predictions': [{'rec_texts': ['CBank', 'Docbcba', 'GROUP', 'MAUN', 'CROBINSONS', 'AOCOC', '916M3', 'BOO9', 'Oven', 'BRANDS', 'ARETAIL', '14', '70<UKN>S', 'ROUND', 'SALE', 'YEAR', 'ALLY', 'SALE', 'SALE'],
'rec_scores': [0.9753464579582214, ...], 'det_polygons': [[551.9930285844646, 411.9138765335083, 553.6153911653112,
383.53195309638977, 620.2410061195247, 387.33785033226013, 618.6186435386782, 415.71977376937866], ...], 'det_scores': [0.8230461478233337, ...]}]}
Config¶
MMOCR mainly uses Python files as configuration files. The design of its configuration file system integrates the ideas of modularity and inheritance to facilitate various experiments.
Common Usage¶
Note
This section is recommended to be read together with the primary usage in MMEngine: Config.
There are three most common operations in MMOCR: inheritance of configuration files, reference to _base_
variables, and modification of _base_
variables. Config provides two syntaxes for inheriting and modifying _base_
, one for Python, Json, and Yaml, and one for Python configuration files only. In MMOCR, we prefer the Python-only syntax, so this will be the basis for further description.
The configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
is used as an example to illustrate the three common uses.
_base_ = [
'_base_dbnet_resnet18_fpnc.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_train.pipeline = _base_.train_pipeline
icdar2015_textdet_test = _base_.icdar2015_textdet_test
icdar2015_textdet_test.pipeline = _base_.test_pipeline
train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)
val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=icdar2015_textdet_test)
Configuration Inheritance¶
There is an inheritance mechanism for configuration files, i.e. one configuration file A can use another configuration file B as its base and inherit all the fields directly from it, thus avoiding a lot of copy-pasting.
In dbnet_resnet18_fpnc_1200e_icdar2015.py
you can see that
_base_ = [
'_base_dbnet_resnet18_fpnc.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
The above statement reads all the base configuration files in the list, and all the fields in them are loaded into dbnet_resnet18_fpnc_1200e_icdar2015.py. We can inspect the structure of the parsed configuration by running the following statements in a Python interpreter.
from mmengine import Config
db_config = Config.fromfile('configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py')
print(db_config)
It can be found that the parsed configuration contains all the fields and information in the base configuration.
Note
Variables with the same name must not appear in more than one _base_ configuration file.
_base_
Variable References¶
Sometimes we may need to reference some fields in the _base_
configuration directly in order to avoid duplicate definitions. Suppose we want to get the variable pseudo
in the _base_
configuration, we can get the variable in the _base_
configuration directly via _base_.pseudo
.
This syntax has been used extensively in the configuration of MMOCR, and the dataset and pipeline configurations for each model in MMOCR are referenced in the base configuration. For example,
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# ...
train_dataloader = dict(
    # ...
    dataset=icdar2015_textdet_train)
_base_
Variable Modification¶
In MMOCR, different algorithms usually have different pipelines in different datasets, so there are often scenarios to modify the pipeline
in the dataset. There are also many scenarios where you need to modify variables in the _base_
configuration, for example, modifying the training strategy of an algorithm, replacing some modules of an algorithm(backbone, etc.). Users can directly modify the referenced _base_
variables using Python syntax. For dict, we also provide a method similar to class attribute modification to modify the contents of the dictionary directly.
Dictionary
Here is an example of modifying pipeline in a dataset. The dictionary can be modified using Python syntax:
# Get the dataset in _base_
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# You can modify the variables directly with Python's update
icdar2015_textdet_train.update(pipeline=_base_.train_pipeline)
It can also be modified in the same way as changing Python class attributes:
# Get the dataset in _base_
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# Modify the dict as if it were a class attribute
icdar2015_textdet_train.pipeline = _base_.train_pipeline
List
Suppose the variable pseudo = [1, 2, 3] in the _base_ configuration needs to be modified to [1, 2, 4]:
# pseudo.py
pseudo = [1, 2, 3]
It can be rewritten directly as:
_base_ = ['pseudo.py']
pseudo = [1, 2, 4]
Or the list can be modified using Python syntax:
_base_ = ['pseudo.py']
pseudo = _base_.pseudo
pseudo[2] = 4
Command Line Modification¶
Sometimes we only want to fix part of the configuration and do not want to modify the configuration file itself. For example, if you want to change the learning rate during an experiment but do not want to write a new configuration file, you can pass in parameters on the command line to override the relevant configuration.
We can pass --cfg-options
on the command line and modify the corresponding fields directly with the arguments after it. For example, we can run the following command to modify the learning rate temporarily for this training session.
python tools/train.py example.py --cfg-options optim_wrapper.optimizer.lr=1
For more detailed usage, refer to MMEngine: Command Line Modification.
Configuration Content¶
With config files and Registry, MMOCR can modify the training parameters as well as the model configuration without invading the code. Specifically, users can customize the following modules in the configuration file: environment configuration, hook configuration, log configuration, training strategy configuration, data-related configuration, model-related configuration, evaluation configuration, and visualization configuration.
This document will take the text detection algorithm DBNet
and the text recognition algorithm CRNN
as examples to introduce the contents of Config in detail.
Environment Configuration¶
default_scope = 'mmocr'
env_cfg = dict(
    cudnn_benchmark=True,
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    dist_cfg=dict(backend='nccl'))
randomness = dict(seed=None)
There are three main components:
Set the default scope of all registries to mmocr, ensuring that all modules are searched first from the MMOCR codebase. If a module does not exist there, the search continues in the upstream algorithm libraries MMEngine and MMCV; see MMEngine: Registry for more details.
env_cfg configures the distributed environment; see MMEngine: Runner for more details.
randomness: Some settings to make the experiment as reproducible as possible, such as seed and deterministic. See MMEngine: Runner for more details.
Hook Configuration¶
Hooks are divided into two main parts, default hooks, which are required for all tasks to run, and custom hooks, which generally serve specific algorithms or specific tasks (there are no custom hooks in MMOCR so far).
default_hooks = dict(
    timer=dict(type='IterTimerHook'),  # Time recording, including data time as well as model inference time
    logger=dict(type='LoggerHook', interval=1),  # Collect logs from different components
    param_scheduler=dict(type='ParamSchedulerHook'),  # Update some hyper-parameters in the optimizer
    checkpoint=dict(type='CheckpointHook', interval=1),  # Save checkpoints; `interval` controls the save interval
    sampler_seed=dict(type='DistSamplerSeedHook'),  # Data-loading sampler for distributed training
    sync_buffer=dict(type='SyncBuffersHook'),  # Synchronize buffers in case of distributed training
    visualization=dict(  # Visualize the results of val and test
        type='VisualizationHook',
        interval=1,
        enable=False,
        show=False,
        draw_gt=False,
        draw_pred=False))
custom_hooks = []
Here is a brief description of a few hooks whose parameters may be changed frequently. For a general modification method, refer to Modify configuration.
LoggerHook: Used to configure the behavior of the logger. For example, by modifying interval you can control the interval of log printing, so that the log is printed once every interval iterations. For more settings, refer to the LoggerHook API.
CheckpointHook: Used to configure checkpoint-related behavior, such as saving optimal and/or latest weights. You can also modify interval to control the checkpoint saving interval. More settings can be found in the CheckpointHook API.
VisualizationHook: Used to configure visualization-related behavior, such as visualizing predicted results during validation or testing (see the example below). It is off by default. This hook also depends on the Visualization Configuration. You can refer to Visualizer for more details. For more configuration, you can refer to the VisualizationHook API.
If you want to learn more about the configuration of the default hooks and their functions, you can refer to MMEngine: Hooks.
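For instance, to turn on visualization during validation and draw both ground truths and predictions, you could override the hook in your config. This is only a sketch based on the fields shown above; adjust the interval to your needs:
default_hooks = dict(
    visualization=dict(
        type='VisualizationHook',
        enable=True,      # turn the hook on (off by default)
        interval=1,       # visualization interval
        draw_gt=True,     # draw ground-truth annotations
        draw_pred=True,   # draw model predictions
        show=False))      # set to True to pop up a window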
Log Configuration¶
This section is mainly used to configure the log level and the log processor.
log_level = 'INFO'  # Logging level
log_processor = dict(type='LogProcessor',
                     window_size=10,
                     by_epoch=True)
The logging severity level is the same as that of Python's logging module.
The log processor is mainly used to control the format of the output; detailed functions can be found in MMEngine: logging.
by_epoch=True indicates that the logs are output in accordance with "epoch", and the log format needs to be consistent with the type='EpochBasedTrainLoop' parameter in train_cfg. For example, if you want to output logs by iteration number, you need to set by_epoch=False in log_processor and type='IterBasedTrainLoop' in train_cfg (see the sketch after this list).
window_size indicates the smoothing window of the loss, i.e. the average value of the various losses for the last window_size iterations. The final loss value printed in the logger is the average of all the losses.
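For example, a sketch of switching to iteration-based logging; the iteration counts below are placeholders:
# Log by iteration: both settings must be changed consistently
log_processor = dict(type='LogProcessor', window_size=10, by_epoch=False)
train_cfg = dict(type='IterBasedTrainLoop', max_iters=100000, val_interval=1000)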
Training Strategy Configuration¶
This section mainly contains optimizer settings, learning rate schedules and Loop
settings.
Training strategies usually vary for different tasks (text detection, text recognition, key information extraction). Here we explain the example configuration in CRNN
, which is a text recognition model.
# optimizer
optim_wrapper = dict(
type='OptimWrapper', optimizer=dict(type='Adadelta', lr=1.0))
param_scheduler = [dict(type='ConstantLR', factor=1.0)]
train_cfg = dict(type='EpochBasedTrainLoop',
                 max_epochs=5,  # train epochs
                 val_interval=1)  # val interval
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
optim_wrapper: It contains two main parts, OptimWrapper and Optimizer. Detailed usage information can be found in MMEngine: Optimizer Wrapper. The optimizer wrapper supports different training strategies, including mixed-precision training (AMP), gradient accumulation, and gradient clipping (see the sketch after this list).
All PyTorch optimizers are supported in the optimizer settings. All supported optimizers are available in the PyTorch Optimizer List.
param_scheduler: The learning rate tuning strategy. It supports most of the learning rate schedulers in PyTorch, such as ExponentialLR, LinearLR, StepLR, MultiStepLR, etc., and is used in much the same way; see the scheduler interface, and more features can be found in MMEngine: Optimizer Parameter Tuning Strategy.
train/test/val_cfg: The execution flow of the task. MMEngine provides four kinds of flows: EpochBasedTrainLoop, IterBasedTrainLoop, ValLoop and TestLoop. More can be found in MMEngine: loop controller.
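As a sketch of the optimizer wrapper features mentioned above (AMP, gradient accumulation and gradient clipping), a config fragment might look like the following; the optimizer type and all values are placeholders, see MMEngine: Optimizer Wrapper for the exact options:
optim_wrapper = dict(
    type='AmpOptimWrapper',        # mixed-precision (AMP) training
    optimizer=dict(type='SGD', lr=0.007, momentum=0.9, weight_decay=0.0001),
    accumulative_counts=2,         # accumulate gradients over 2 iterations
    clip_grad=dict(max_norm=5, norm_type=2))  # gradient clipping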
Dataset Preparation¶
Introduction¶
After decades of development, the OCR community has produced a series of related datasets that often provide annotations of text in a variety of styles, making it necessary for users to convert these datasets to the required format when using them. MMOCR supports dozens of commonly used text-related datasets and provides a data preparation script to help users prepare the datasets with only one command.
In this section, we will introduce a typical process of preparing a dataset for MMOCR: downloading the dataset and converting it to the format supported by MMOCR, and then configuring the dataset in the config file.
However, the first step is not necessary if you already have a dataset in a format that MMOCR supports. You can read Dataset Classes for more details.
Downloading Datasets and Converting Format¶
As an example of the data preparation steps, you can use the following command to prepare the ICDAR 2015 dataset for the text detection task.
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
The dataset will then be downloaded and converted to MMOCR format, resulting in the following file directory structure:
data/icdar2015
├── textdet_imgs
│ ├── test
│ └── train
├── textdet_test.json
└── textdet_train.json
Once your dataset has been prepared, you can use the browse_dataset.py to visualize the dataset and check if the annotations are correct.
python tools/analysis_tools/browse_dataset.py configs/textdet/_base_/datasets/icdar2015.py
Dataset Configuration¶
Single Dataset Training¶
When training or evaluating a model on new datasets, we need to write the dataset config where the image path, annotation path, and image prefix are set. The path configs/xxx/_base_/datasets/
is pre-configured with the commonly used datasets in MMOCR (if you use prepare_dataset.py
to prepare dataset, this config will be generated automatically), here we take the ICDAR 2015 dataset as an example (see configs/textdet/_base_/datasets/icdar2015.py
).
icdar2015_textdet_data_root = 'data/icdar2015' # dataset root path
# Train set config
icdar2015_textdet_train = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,  # dataset root path
    ann_file='textdet_train.json',  # name of annotation file
    filter_cfg=dict(filter_empty_gt=True, min_size=32),  # filtering empty images
    pipeline=None)
# Test set config
icdar2015_textdet_test = dict(
    type='OCRDataset',
    data_root=icdar2015_textdet_data_root,
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)
After configuring the dataset, we can import it in the corresponding model configs. For example, to train the “DBNet_R18” model on the ICDAR 2015 dataset.
_base_ = [
'_base_dbnet_r18_fpnc.py',
'../_base_/datasets/icdar2015.py', # import the dataset config
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
icdar2015_textdet_train = _base_.icdar2015_textdet_train # specify the training set
icdar2015_textdet_train.pipeline = _base_.train_pipeline # specify the training pipeline
icdar2015_textdet_test = _base_.icdar2015_textdet_test # specify the testing set
icdar2015_textdet_test.pipeline = _base_.test_pipeline # specify the testing pipeline
train_dataloader = dict(
    batch_size=16,
    num_workers=8,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=icdar2015_textdet_train)  # specify the dataset in train_dataloader
val_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=icdar2015_textdet_test)  # specify the dataset in val_dataloader
test_dataloader = val_dataloader
Multi-dataset Training¶
In addition, ConcatDataset
enables users to train or test the model on a combination of multiple datasets. You just need to set the dataset type in the dataloader to ConcatDataset
in the configuration file and specify the corresponding list of datasets.
train_list = [ic11, ic13, ic15]
train_dataloader = dict(
    dataset=dict(
        type='ConcatDataset', datasets=train_list, pipeline=train_pipeline))
For example, the following configuration uses the MJSynth dataset for training and 6 academic datasets (CUTE80, IIIT5K, SVT, SVTP, ICDAR2013, ICDAR2015) for testing.
_base_ = [ # Import all dataset configurations you want to use
'../_base_/datasets/mjsynth.py',
'../_base_/datasets/cute80.py',
'../_base_/datasets/iiit5k.py',
'../_base_/datasets/svt.py',
'../_base_/datasets/svtp.py',
'../_base_/datasets/icdar2013.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_adadelta_5e.py',
'_base_crnn_mini-vgg.py',
]
# List of training datasets
train_list = [_base_.mjsynth_textrecog_train]
# List of testing datasets
test_list = [
    _base_.cute80_textrecog_test, _base_.iiit5k_textrecog_test, _base_.svt_textrecog_test,
    _base_.svtp_textrecog_test, _base_.icdar2013_textrecog_test, _base_.icdar2015_textrecog_test
]
# Use ConcatDataset to combine the datasets in the list
train_dataset = dict(
    type='ConcatDataset', datasets=train_list, pipeline=_base_.train_pipeline)
test_dataset = dict(
    type='ConcatDataset', datasets=test_list, pipeline=_base_.test_pipeline)
train_dataloader = dict(
    batch_size=192 * 4,
    num_workers=32,
    persistent_workers=True,
    sampler=dict(type='DefaultSampler', shuffle=True),
    dataset=train_dataset)
test_dataloader = dict(
    batch_size=1,
    num_workers=4,
    persistent_workers=True,
    drop_last=False,
    sampler=dict(type='DefaultSampler', shuffle=False),
    dataset=test_dataset)
val_dataloader = test_dataloader
Training and Testing¶
To meet diverse requirements, MMOCR supports training and testing models on various devices, including PCs, workstations, computation clusters, etc.
Single GPU Training and Testing¶
Training¶
tools/train.py provides the basic training service. MMOCR recommends using GPUs for model training and testing, but it also supports CPU-only training and testing. For example, the following commands demonstrate how to train a DBNet model using a single GPU or CPU.
# Train the specified MMOCR model by calling tools/train.py
CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [PY_ARGS]
# Training
# Example 1: Training DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
# Example 2: Specify to train DBNet with gpu:0, specify the working directory as dbnet/, and turn on mixed precision (amp) training
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py --work-dir dbnet/ --amp
Note
If multiple GPUs are available, you can specify a certain GPU, e.g. the one with index 3, by setting CUDA_VISIBLE_DEVICES=3.
The following table lists all the arguments supported by train.py. Args without the -- prefix are mandatory, while others are optional.
ARGS | Type | Description |
---|---|---|
config | str | (required) Path to config. |
--work-dir | str | Specify the working directory for the training logs and models checkpoints. |
--resume | bool | Whether to resume training from the latest checkpoint. |
--amp | bool | Whether to use automatic mixed precision for training. |
--auto-scale-lr | bool | Whether to use automatic learning rate scaling. |
--cfg-options | str | Override some settings in the configs. Example |
--launcher | str | Options for the launcher: ['none', 'pytorch', 'slurm', 'mpi']. |
--local_rank | int | Rank of the local machine, used for distributed training. Defaults to 0. |
--tta | bool | Whether to use test time augmentation. |
Test¶
tools/test.py provides the basic testing service, which is used in a similar way to the training script. For example, the following commands demonstrate how to test a DBNet model on a single GPU or CPU.
# Test a pretrained MMOCR model by calling tools/test.py
CUDA_VISIBLE_DEVICES= python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
# Test
# Example 1: Testing DBNet with CPU
CUDA_VISIBLE_DEVICES=-1 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
# Example 2: Testing DBNet on gpu:0
CUDA_VISIBLE_DEVICES=0 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
The following table lists all the arguments supported by test.py. Args without the -- prefix are mandatory, while others are optional.
ARGS | Type | Description |
---|---|---|
config | str | (required) Path to config. |
checkpoint | str | (required) The model to be tested. |
--work-dir | str | Specify the working directory for the logs. |
--save-preds | bool | Whether to save the predictions to a pkl file. |
--show | bool | Whether to visualize the predictions. |
--show-dir | str | Path to save the visualization results. |
--wait-time | float | Interval of visualization (s), defaults to 2. |
--cfg-options | str | Override some settings in the configs. Example |
--launcher | str | Options for the launcher: ['none', 'pytorch', 'slurm', 'mpi']. |
--local_rank | int | Rank of the local machine, used for distributed testing. Defaults to 0. |
Training and Testing with Multiple GPUs¶
For large models, distributed training or testing significantly improves efficiency. For this purpose, MMOCR provides the distributed scripts tools/dist_train.sh and tools/dist_test.sh, implemented based on MMDistributedDataParallel.
# Training
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
# Testing
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
The following table lists the arguments supported by dist_*.sh.
ARGS | Type | Description |
---|---|---|
NNODES | int | The number of nodes. Defaults to 1. |
NODE_RANK | int | The rank of current node. Defaults to 0. |
PORT | int | The master port that will be used by rank 0 node, ranging from 0 to 65535. Defaults to 29500. |
MASTER_ADDR | str | The address of rank 0 node. Defaults to "127.0.0.1". |
CONFIG_FILE | str | (required) The path to config. |
CHECKPOINT_FILE | str | (required, only used in dist_test.sh) The path to the checkpoint to be tested. |
GPU_NUM | int | (required) The number of GPUs to be used per node. |
[PY_ARGS] | str | Arguments to be parsed by tools/train.py and tools/test.py. |
These two scripts enable training and testing on single-machine multi-GPU or multi-machine multi-GPU setups. See the following examples for usage.
Single-machine Multi-GPU¶
The following commands demonstrate how to train and test with a specified number of GPUs on a single machine with multiple GPUs.
Training
Training DBNet using 4 GPUs on a single machine.
tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 4
Testing
Testing DBNet using 4 GPUs on a single machine.
tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
Launching Multiple Tasks on Single Machine¶
For a workstation equipped with multiple GPUs, the user can launch multiple tasks simultaneously by specifying the GPU IDs. For example, the following commands demonstrate how to test DBNet with GPUs [0, 1, 2, 3] and train CRNN on GPUs [4, 5, 6, 7].
# Specify gpu:0,1,2,3 for testing and assign port number 29500
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
# Specify gpu:4,5,6,7 for training and assign port number 29501
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh configs/textrecog/crnn/crnn_academic_dataset.py 4
Note
dist_train.sh sets MASTER_PORT to 29500 by default. When other processes already occupy this port, the program will raise RuntimeError: Address already in use. In this case, you need to set MASTER_PORT to another free port number in the range of (0~65535).
Multi-machine Multi-GPU Training and Testing¶
You can launch a task on multiple machines connected to the same network. MMOCR relies on the torch.distributed package for distributed training. Find more information at PyTorch’s launch utility.
Training
The following command demonstrates how to train DBNet on two machines with a total of 4 GPUs.
# Say that you want to launch the training job on two machines
# On the first machine:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
# On the second machine (the master address and port must match the first machine):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
Testing
The following command demonstrates how to test DBNet on two machines with a total of 4 GPUs.
# Say that you want to launch the testing job on two machines
# On the first machine:
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
# On the second machine (the master address and port must match the first machine):
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
Note
The speed of the network could be the bottleneck of training.
Training and Testing with Slurm Cluster¶
If you run MMOCR on a cluster managed with Slurm, you can use the scripts tools/slurm_train.sh and tools/slurm_test.sh.
# tools/slurm_train.sh provides a script for submitting training tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
# tools/slurm_test.sh provides a script for submitting testing tasks to clusters managed by Slurm
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${WORK_DIR} [PY_ARGS]
ARGS | Type | Description |
---|---|---|
GPUS | int | The number of GPUs to be used by this task. Defaults to 8. |
GPUS_PER_NODE | int | The number of GPUs to be allocated per node. Defaults to 8. |
CPUS_PER_TASK | int | The number of CPUs to be allocated per task. Defaults to 5. |
SRUN_ARGS | str | Arguments to be parsed by srun. Available options can be found here. |
PARTITION | str | (required) Specify the partition on cluster. |
JOB_NAME | str | (required) Name of the submitted job. |
WORK_DIR | str | (required) Specify the working directory for saving the logs and checkpoints. |
CHECKPOINT_FILE | str | (required, only used in slurm_test.sh) Path to the checkpoint to be tested. |
PY_ARGS | str | Arguments to be parsed by tools/train.py and tools/test.py. |
These scripts enable training and testing on Slurm clusters; see the following examples.
Training
Here is an example of using 1 GPU to train a DBNet model on the dev partition.
# Example: Request 1 GPU resource on the dev partition for the DBNet training task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_train.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py work_dir
Testing
Similarly, the following example requests 1 GPU for testing.
# Example: Request 1 GPU resource on the dev partition for the DBNet testing task
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_test.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth work_dir
Advanced Tips¶
Resume Training from a Checkpoint¶
tools/train.py allows users to resume training from a checkpoint by specifying the --resume parameter, in which case training automatically resumes from the latest saved checkpoint.
# Example: Resuming training from the latest checkpoint
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --resume
By default, the program will automatically resume training from the last successfully saved checkpoint of the previous training session, i.e. latest.pth. However, you can also specify which checkpoint to load by setting load_from in the configuration file.
# Example: Set the path of the checkpoint you want to load in the configuration file
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
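For reference, here is a minimal config-level sketch (assuming MMEngine's standard resume/load_from fields) of resuming from a specific checkpoint rather than the latest one:
# Hypothetical snippet appended to a config file:
# `load_from` selects the checkpoint, and `resume = True` additionally restores
# the training state (epoch, optimizer, scheduler) rather than only the weights.
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
resume = True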
Mixed Precision Training¶
Mixed precision training offers significant computational speedup by performing operations in half-precision format, while storing minimal information in single precision to retain as much information as possible in critical parts of the network. In MMOCR, users can enable automatic mixed precision training by simply adding --amp.
# Example: Using automatic mixed precision training
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --amp
The following table shows the support of each algorithm in MMOCR for automatic mixed precision training.
Model | Supports AMP | Description |
---|---|---|
Text Detection | | |
DBNet | Y | |
DBNetpp | Y | |
DRRG | N | roi_align_rotated does not support fp16 |
FCENet | N | BCELoss does not support fp16 |
Mask R-CNN | Y | |
PANet | Y | |
PSENet | Y | |
TextSnake | N | |
Text Recognition | | |
ABINet | Y | |
CRNN | Y | |
MASTER | Y | |
NRTR | Y | |
RobustScanner | Y | |
SAR | Y | |
SATRN | Y |
Automatic Learning Rate Scaling¶
MMOCR sets default initial learning rates for each model in the configuration file. However, these initial learning rates may not be applicable when the user uses a different batch_size than our preset base_batch_size. Therefore, we provide a tool to automatically scale the learning rate, which can be enabled by adding the --auto-scale-lr flag.
# Example: Using automatic learning rate scaling
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --auto-scale-lr
Visualize the Predictions¶
tools/test.py provides the visualization interface to facilitate the qualitative analysis of the OCR models.
(Green boxes are GTs, while red boxes are predictions)
(Green text is the GT, while red text is the prediction)
(From left to right: original image, text detection and recognition result, text classification result, relationships)
# Example 1: Show the visualization results per 2 seconds
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show --wait-time 2
# Example 2: For systems that do not support graphical interfaces (such as computing clusters, etc.), the visualization results can be dumped in the specified path
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show-dir ./vis_results
The visualization-related parameters in tools/test.py are described as follows.
ARGS | Type | Description |
---|---|---|
--show | bool | Whether to show the visualization results. |
--show-dir | str | Path to save the visualization results. |
--wait-time | float | Interval of visualization (s), defaults to 2. |
Test Time Augmentation¶
Test time augmentation (TTA) is a simple yet effective technique to improve model performance by applying data augmentation to the input images at test time. In MMOCR, TTA can be enabled by passing the --tta flag to the test script:
Note
TTA is only supported for text recognition models.
python tools/test.py configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py checkpoints/crnn_mini-vgg_5e_mj.pth --tta
Visualization¶
Before reading this tutorial, it is recommended to read MMEngine’s MMEngine: Visualization documentation to get a first glimpse of the Visualizer definition and usage.
In brief, the Visualizer is implemented in MMEngine to meet daily visualization needs, and contains three main functions:
Implement common drawing APIs, such as draw_bboxes, which draws bounding boxes, and draw_lines, which draws lines.
Support writing visualization results, learning rate curves, loss function curves, and validation accuracy curves to various backends, including local disks and common deep learning training logging tools such as TensorBoard and Wandb.
Support calling anywhere in the code to visualize or record intermediate states of the model during training or testing, such as feature maps and validation results.
Based on MMEngine’s Visualizer, MMOCR comes with a variety of pre-built visualization tools that can be used by simply modifying the configuration files.
The tools/analysis_tools/browse_dataset.py script provides a dataset visualization function that draws images and corresponding annotations after data transforms, as described in browse_dataset.py.
MMEngine implements LoggerHook, which uses the Visualizer to write the learning rate, loss and evaluation results to the backend set by the Visualizer. Therefore, by changing the Visualizer backend in the configuration file, for example to TensorboardVisBackend or WandbVisBackend, you can log to common training logging tools such as TensorBoard or WandB, making it easy to analyze and monitor the training process.
The VisualizationHook is implemented in MMOCR, which uses the Visualizer to visualize or store the prediction results of the validation or test phase into the backend set by the Visualizer. Likewise, by changing the Visualizer backend in the configuration file, for example to TensorboardVisBackend or WandbVisBackend, the predicted images can be stored to TensorBoard or Wandb.
Configuration¶
Thanks to the use of the registration mechanism, in MMOCR we can set the behavior of the Visualizer by modifying the configuration file. Usually, we define the default configuration for the visualizer in task/_base_/default_runtime.py; see the configuration tutorial for details.
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='TextxxxLocalVisualizer', # use different visualizers for different tasks
vis_backends=vis_backends,
name='visualizer')
Based on the above example, we can see that the configuration of the Visualizer consists of two main parts, namely, the type of the Visualizer and the visualization backends vis_backends it uses.
For different OCR tasks, various visualizers are pre-configured in MMOCR, including TextDetLocalVisualizer, TextRecogLocalVisualizer, TextSpottingLocalVisualizer and KIELocalVisualizer. These visualizers extend the basic Visualizer API according to the characteristics of their tasks and implement the corresponding annotation information interface add_datasample. For example, users can directly use TextDetLocalVisualizer to visualize labels or predictions for text detection tasks.
MMOCR sets the visualization backend vis_backends to the local visualization backend LocalVisBackend by default, saving all visualization results and other training information in a local folder.
Storage¶
MMOCR uses the local visualization backend LocalVisBackend by default. The information stored by VisualizationHook and LoggerHook, including loss, learning rate, evaluation accuracy and visualization results, will be saved to the {work_dir}/{config_name}/{time}/{vis_data} folder by default. In addition, MMOCR also supports other common visualization backends, such as TensorboardVisBackend and WandbVisBackend; you only need to change the vis_backends type in the configuration file to the corresponding visualization backend. For example, you can store data to TensorBoard and Wandb by simply inserting the following code block into the configuration file.
_base_.visualizer.vis_backends = [
dict(type='LocalVisBackend'),
dict(type='TensorboardVisBackend'),
dict(type='WandbVisBackend'),]
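If a backend needs extra initialization options, they can usually be supplied in the same dict. The snippet below is a minimal sketch assuming that WandbVisBackend forwards an init_kwargs dict to wandb.init; the project name is purely illustrative.
_base_.visualizer.vis_backends = [
    dict(type='LocalVisBackend'),
    # Assumption: WandbVisBackend passes `init_kwargs` through to wandb.init()
    dict(type='WandbVisBackend', init_kwargs=dict(project='mmocr-experiments')),
]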
Plot¶
Plot the prediction results¶
MMOCR mainly uses VisualizationHook to plot the prediction results of validation and test. VisualizationHook is off by default, with the following default configuration.
visualization=dict( # visualization of validation and test results
type='VisualizationHook',
enable=False,
interval=1,
show=False,
draw_gt=False,
draw_pred=False)
The following table shows the parameters supported by VisualizationHook.
Parameters | Description |
---|---|
enable | Whether to turn on the VisualizationHook. Defaults to False. |
interval | The interval (in iterations) at which val or test results are stored or displayed, taking effect only when enable is True. |
show | Controls whether to visualize the results of val or test. |
draw_gt | Whether to draw the ground truth annotations in the val or test results. |
draw_pred | Whether to draw the predictions in the val or test results. |
If you want to enable the VisualizationHook during training or testing, you only need to modify the configuration. Taking dbnet_resnet18_fpnc_1200e_icdar2015.py as an example, to draw annotations and predictions at the same time and display the images, the configuration can be modified as follows:
visualization = _base_.default_hooks.visualization
visualization.update(
dict(enable=True, show=True, draw_gt=True, draw_pred=True))

If you only want to see the prediction results, you can simply set draw_pred=True:
visualization = _base_.default_hooks.visualization
visualization.update(
dict(enable=True, show=True, draw_gt=False, draw_pred=True))

The test.py procedure is further simplified by providing the --show and --show-dir parameters, which visualize the annotation and prediction results during the test without modifying the configuration.
# Show test results
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show
# Specify where to store the prediction results
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/

Useful Tools¶
Visualization Tools¶
Dataset Visualization Tool¶
MMOCR provides a dataset visualization tool tools/visualizations/browse_dataset.py to help users troubleshoot possible dataset-related problems. You just need to specify the path to the training config (usually stored in configs/textdet/dbnet/xxx.py) or the dataset config (usually stored in configs/textdet/_base_/datasets/xxx.py), and the tool will automatically plot the transformed (or original) images and labels.
Usage¶
python tools/visualizations/browse_dataset.py \
${CONFIG_FILE} \
[-o, --output-dir ${OUTPUT_DIR}] \
[-p, --phase ${DATASET_PHASE}] \
[-m, --mode ${DISPLAY_MODE}] \
[-t, --task ${DATASET_TASK}] \
[-n, --show-number ${NUMBER_IMAGES_DISPLAY}] \
[-i, --show-interval ${SHOW_INTERVAL}] \
[--cfg-options ${CFG_OPTIONS}]
ARGS | Type | Description |
---|---|---|
config | str | (required) Path to the config. |
-o, --output-dir | str | If the GUI is not available, specify an output path to save the visualization results. |
-p, --phase | str | Phase of the dataset to visualize. Use "train", "test" or "val" if you just want to visualize the default split. It can also be a dataset variable name, which might be useful when a dataset split has multiple variants in the config. |
-m, --mode | original, transformed, pipeline | Display mode: display the original pictures, the transformed pictures, or comparison pictures. original only visualizes the original dataset & annotations; transformed shows the resulting images processed through all the transforms; pipeline shows all the intermediate images. Defaults to "transformed". |
-t, --task | auto, textdet, textrecog | Specify the task type of the dataset. If auto, the task type will be inferred from the config. If the script is unable to infer the task type, you need to specify it manually. Defaults to auto. |
-n, --show-number | int | The number of samples to visualize. If not specified, display all images in the dataset. |
-i, --show-interval | float | Interval of visualization (s), defaults to 2. |
--cfg-options | str | Override some settings in the configs. Example |
Examples¶
The following example demonstrates how to use the tool to visualize the training data used by the “DBNet_R50_icdar2015” model.
# Example: Visualizing the training data used by the dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015 model
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
By default, the visualization mode is “transformed”, and you will see the images & annotations being transformed by the pipeline:



If you just want to visualize the original dataset, simply set the mode to “original”:
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m original

Or, to visualize the entire pipeline:
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m pipeline

In addition, users can also visualize the original images and their corresponding labels of the dataset by specifying the path to the dataset config file, for example:
python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py
Some datasets might have multiple variants. For example, the test split of the icdar2015 textrecog dataset has two variants, which the base dataset config defines as follows:
icdar2015_textrecog_test = dict(
ann_file='textrecog_test.json',
# ...
)
icdar2015_1811_textrecog_test = dict(
ann_file='textrecog_test_1811.json',
# ...
)
In this case, you can specify the variant name to visualize the corresponding dataset:
python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py -p icdar2015_1811_textrecog_test
Based on this tool, users can easily verify if the annotation of a custom dataset is correct.
Hyper-parameter Scheduler Visualization¶
This tool aims to help the user check the hyper-parameter scheduler of the optimizer (without training). It supports visualizing the learning rate and momentum curves.
Introduction to the scheduler visualization tool¶
python tools/visualizations/vis_scheduler.py \
${CONFIG_FILE} \
[-p, --parameter ${PARAMETER_NAME}] \
[-d, --dataset-size ${DATASET_SIZE}] \
[-n, --ngpus ${NUM_GPUs}] \
[-s, --save-path ${SAVE_PATH}] \
[--title ${TITLE}] \
[--style ${STYLE}] \
[--window-size ${WINDOW_SIZE}] \
[--cfg-options]
Description of all arguments:
config : The path of a model config file.
-p, --parameter : The parameter to visualize, chosen from "lr" and "momentum". Defaults to "lr".
-d, --dataset-size : The size of the dataset. If set, build_dataset will be skipped and ${DATASET_SIZE} will be used as the size. Defaults to using the function build_dataset.
-n, --ngpus : The number of GPUs used in training. Defaults to 1.
-s, --save-path : The path to save the learning rate curve plot. Defaults to not saving.
--title : Title of the figure. If not set, defaults to the config file name.
--style : Style of plt. If not set, defaults to whitegrid.
--window-size : The shape of the display window. If not specified, it will be set to 12*7. If used, it must be in the format 'W*H'.
--cfg-options : Modifications to the configuration file; refer to Learn about Configs.
Note
Loading annotations may consume a lot of time; you can directly specify the size of the dataset with -d, --dataset-size to save time.
How to plot the learning rate curve without training¶
You can use the following command to plot the step learning rate schedule used in the config configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
:
python tools/visualizations/vis_scheduler.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -d 100

Analysis Tools¶
Offline Evaluation Tool¶
For saved prediction results, we provide an offline evaluation script tools/analysis_tools/offline_eval.py. The following example demonstrates how to use this tool to evaluate the output of the “PSENet” model offline.
# When running the test script for the first time, you can save the output of the model by specifying the --save-preds parameter
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --save-preds
# Example: Testing on PSENet
python tools/test.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py epoch_600.pth --save-preds
# Then, using the saved outputs for offline evaluation
python tools/analysis_tools/offline_eval.py ${CONFIG_FILE} ${PRED_FILE}
# Example: Offline evaluation of saved PSENet results
python tools/analysis_tools/offline_eval.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py work_dirs/psenet_r50_fpnf_600e_icdar2015/epoch_600.pth_predictions.pkl
--save-preds saves the output to work_dir/CONFIG_NAME/MODEL_NAME_predictions.pkl by default.
In addition, based on this tool, users can also convert predictions obtained from other libraries into MMOCR-supported formats, then use MMOCR’s built-in metrics to evaluate them.
ARGS | Type | Description |
---|---|---|
config | str | (required) Path to the config. |
pkl_results | str | (required) The saved predictions. |
--cfg-options | str | Override some settings in the configs. Example |
Calculate FLOPs and the Number of Parameters¶
We provide a script to calculate the FLOPs and the number of parameters. First, install the dependencies using the following command.
pip install fvcore
The usage of the script to calculate FLOPs and the number of parameters is as follows.
python tools/analysis_tools/get_flops.py ${config} --shape ${IMAGE_SHAPE}
ARGS | Type | Description |
---|---|---|
config | str | (required) Path to the config. |
--shape | int | Image size to use when calculating FLOPs, such as --shape 320 320. Defaults to 640 640. |
For example, you can run the following command to get the FLOPs and the number of parameters of dbnet_resnet18_fpnc_100k_synthtext.py:
python tools/analysis_tools/get_flops.py configs/textdet/dbnet/dbnet_resnet18_fpnc_100k_synthtext.py --shape 1024 1024
The output is as follows:
input shape is (1, 3, 1024, 1024)
| module | #parameters or shape | #flops |
| :------------------------ | :------------------- | :------ |
| model | 12.341M | 63.955G |
| backbone | 11.177M | 38.159G |
| backbone.conv1 | 9.408K | 2.466G |
| backbone.conv1.weight | (64, 3, 7, 7) | |
| backbone.bn1 | 0.128K | 83.886M |
| backbone.bn1.weight | (64,) | |
| backbone.bn1.bias | (64,) | |
| backbone.layer1 | 0.148M | 9.748G |
| backbone.layer1.0 | 73.984K | 4.874G |
| backbone.layer1.1 | 73.984K | 4.874G |
| backbone.layer2 | 0.526M | 8.642G |
| backbone.layer2.0 | 0.23M | 3.79G |
| backbone.layer2.1 | 0.295M | 4.853G |
| backbone.layer3 | 2.1M | 8.616G |
| backbone.layer3.0 | 0.919M | 3.774G |
| backbone.layer3.1 | 1.181M | 4.842G |
| backbone.layer4 | 8.394M | 8.603G |
| backbone.layer4.0 | 3.673M | 3.766G |
| backbone.layer4.1 | 4.721M | 4.837G |
| neck | 0.836M | 14.887G |
| neck.lateral_convs | 0.246M | 2.013G |
| neck.lateral_convs.0.conv | 16.384K | 1.074G |
| neck.lateral_convs.1.conv | 32.768K | 0.537G |
| neck.lateral_convs.2.conv | 65.536K | 0.268G |
| neck.lateral_convs.3.conv | 0.131M | 0.134G |
| neck.smooth_convs | 0.59M | 12.835G |
| neck.smooth_convs.0.conv | 0.147M | 9.664G |
| neck.smooth_convs.1.conv | 0.147M | 2.416G |
| neck.smooth_convs.2.conv | 0.147M | 0.604G |
| neck.smooth_convs.3.conv | 0.147M | 0.151G |
| det_head | 0.329M | 10.909G |
| det_head.binarize | 0.164M | 10.909G |
| det_head.binarize.0 | 0.147M | 9.664G |
| det_head.binarize.1 | 0.128K | 20.972M |
| det_head.binarize.3 | 16.448K | 1.074G |
| det_head.binarize.4 | 0.128K | 83.886M |
| det_head.binarize.6 | 0.257K | 67.109M |
| det_head.threshold | 0.164M | |
| det_head.threshold.0 | 0.147M | |
| det_head.threshold.1 | 0.128K | |
| det_head.threshold.3 | 16.448K | |
| det_head.threshold.4 | 0.128K | |
| det_head.threshold.6 | 0.257K | |
Note
Please be cautious if you use these results in papers. You may need to check if all ops are supported and verify that the FLOPs computation is correct.
Data Structures and Elements¶
MMOCR uses MMEngine: Abstract Data Element to encapsulate the data required for each task into data_sample. The base class implements basic add/delete/update/check functions and supports data migration between different devices, as well as dictionary-like and tensor-like operations, which also allows the interfaces of different algorithms to be unified.
Thanks to the unified data structures, the data flow between modules in the algorithm libraries, such as the visualizer, evaluator and dataset, is greatly simplified. In MMOCR, we have the following conventions for different data types.
xxxData: Single-granularity data annotation or model output. Currently MMEngine has three built-in granularities of data elements, including instance-level data (InstanceData), pixel-level data (PixelData) and image-level label data (LabelData). Among the tasks currently supported by MMOCR, text detection and key information extraction use InstanceData to encapsulate the bounding boxes and the corresponding box labels, while text recognition uses LabelData to encapsulate the text content.
xxxDataSample: Inherited from MMEngine: Base Data Element, used to hold all annotation and prediction information required by a single task. For example, TextDetDataSample for text detection, TextRecogDataSample for text recognition, and KIEDataSample for key information extraction.
In the following, we will introduce the practical application of data elements xxxData and data samples xxxDataSample in MMOCR, respectively.
Data Elements - xxxData¶
InstanceData and LabelData are the BaseDataElement types defined in MMEngine to encapsulate different granularities of annotation data or model output. In MMOCR, we use InstanceData and LabelData to encapsulate the data types actually used in OCR-related tasks.
InstanceData¶
In the text detection task, the detector concentrates on instance-level text samples, so we use InstanceData to encapsulate the data needed for this task. Typically, its required training annotations and prediction outputs contain rectangular or polygonal bounding boxes, as well as bounding box labels. Since the text detection task has only one positive sample class, “text”, MMOCR uses 0 to number this class by default. The following code example shows how to use InstanceData to encapsulate the data used in the text detection task.
import torch
from mmengine.structures import InstanceData
# defining gt_instance for encapsulating the ground truth data
gt_instance = InstanceData()
gt_instance.bboxes = torch.Tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
gt_instance.polygons = torch.Tensor([[[0, 0], [10, 0], [10, 10], [0, 10]],
[[10, 10], [20, 10], [20, 20], [10, 20]]])
gt_instance.labels = torch.LongTensor([0, 0])
# defining pred_instance for encapsulating the prediction data
pred_instances = InstanceData()
pred_polygons, scores = model(input)
pred_instances.polygons = pred_polygons
pred_instances.scores = scores
The conventions for the fields in InstanceData in MMOCR are shown in the table below. It is important to note that the length of each field in InstanceData must be equal to the number of instances N in the sample.
Field | Type | Description |
---|---|---|
bboxes | torch.FloatTensor | Bounding boxes [x1, y1, x2, y2] with the shape (N, 4). |
labels | torch.LongTensor | Instance labels with the shape (N, ). By default, MMOCR uses 0 to represent the "text" class. |
polygons | list[np.array(dtype=np.float32)] | Polygonal bounding boxes with the shape (N, ). |
scores | torch.Tensor | Confidence scores of the bounding box predictions, with the shape (N, ). |
ignored | torch.BoolTensor | Whether to ignore the current sample, with the shape (N, ). |
texts | list[str] | The text content of each instance with the shape (N, ), used for e2e text spotting or the KIE task. |
text_scores | torch.FloatTensor | Confidence scores of the text content predictions with the shape (N, ), used for the e2e text spotting task. |
edge_labels | torch.IntTensor | The node adjacency matrix with the shape (N, N). In KIE, the optional values for the state between nodes are -1 (ignored, not involved in loss calculation), 0 (disconnected) and 1 (connected). |
edge_scores | torch.FloatTensor | The prediction confidence of each edge in the KIE task, with the shape (N, N). |
LabelData¶
For text recognition tasks, both the labeled content and the predicted content are wrapped in LabelData.
import torch
from mmengine.structures import LabelData
# defining gt_text for encapsulating the ground truth data
gt_text = LabelData()
gt_text.item = 'MMOCR'
# defining pred_text for encapsulating the prediction data
pred_text = LabelData()
index, score = model(input)
text = dictionary.idx2str(index)
pred_text.score = score
pred_text.item = text
The conventions for the LabelData fields in MMOCR are shown in the following table.
Field | Type | Description |
---|---|---|
item | str | Text content. |
score | list[float] | Confidence score of the predicted text. |
indexes | torch.LongTensor | A sequence of text characters encoded by the dictionary, containing all special characters except <UNK>. |
padded_indexes | torch.LongTensor | If the length of indexes is less than the maximum sequence length and pad_idx exists, this field holds the encoded text sequence padded to the maximum sequence length max_seq_len. |
DataSample xxxDataSample¶
By defining a uniform data structure, we can easily encapsulate the annotation data and prediction results in a unified way, making data transfer between different modules of the code base easier. In MMOCR, we have designed three data structures based on the data needed in each task: TextDetDataSample, TextRecogDataSample, and KIEDataSample. These data structures all inherit from MMEngine: Base Data Element, which is used to hold all annotation and prediction information required by each task.
Text Detection - TextDetDataSample¶
TextDetDataSample is used to encapsulate the data needed for the text detection task. It contains two main fields, gt_instances and pred_instances, which are used to store the annotation information and prediction results respectively.
Field | Type | Description |
---|---|---|
gt_instances | InstanceData | Annotation information. |
pred_instances | InstanceData | Prediction results. |
The fields of InstanceData that will be used are:
Field | Type | Description |
---|---|---|
bboxes | torch.FloatTensor | Bounding boxes [x1, y1, x2, y2] with the shape (N, 4). |
labels | torch.LongTensor | Instance labels with the shape (N, ). By default, MMOCR uses 0 to represent the "text" class. |
polygons | list[np.array(dtype=np.float32)] | Polygonal bounding boxes with the shape (N, ). |
scores | torch.Tensor | Confidence scores of the bounding box predictions, with the shape (N, ). |
ignored | torch.BoolTensor | Boolean flags with the shape (N, ), indicating whether to ignore the current sample. |
Since text detection models usually only output one of the bboxes/polygons, we only need to make sure that one of these two is assigned a value.
The following sample code demonstrates the use of TextDetDataSample.
import torch
from mmengine.structures import InstanceData
from mmocr.structures import TextDetDataSample
data_sample = TextDetDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.gt_instances = gt_instances
# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.pred_instances = pred_instances
Text Recognition - TextRecogDataSample¶
TextRecogDataSample is used to encapsulate the data for the text recognition task. It has two fields, gt_text and pred_text, which are used to store the annotation information and prediction results, respectively.
Field | Type | Description |
---|---|---|
gt_text | LabelData | Label information. |
pred_text | LabelData | Prediction results. |
The following sample code demonstrates the use of TextRecogDataSample.
import torch
from mmengine.structures import LabelData
from mmocr.structures import TextRecogDataSample
data_sample = TextRecogDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_text = LabelData(metainfo=img_meta)
gt_text.item = 'mmocr'
data_sample.gt_text = gt_text
# Define the prediction data
pred_text = LabelData(metainfo=img_meta)
pred_text.item = 'mmocr'
data_sample.pred_text = pred_text
The fields of LabelData that will be used are:
Field | Type | Description |
---|---|---|
item | list[str] | The text corresponding to the instance, of length (N, ), used for end-to-end OCR tasks and KIE. |
score | torch.FloatTensor | Confidence of the text prediction, of length (N, ), used for the end-to-end OCR task. |
indexes | torch.LongTensor | A sequence of text characters encoded by the dictionary, containing all special characters except <UNK>. |
padded_indexes | torch.LongTensor | If the length of indexes is less than the maximum sequence length and pad_idx exists, this field holds the encoded text sequence padded to the maximum sequence length max_seq_len. |
Key Information Extraction - KIEDataSample¶
KIEDataSample is used to encapsulate the data needed for the KIE task. It also contains two fields, gt_instances and pred_instances, which are used to store the annotation information and prediction results respectively.
Field | Type | Description |
---|---|---|
gt_instances | InstanceData | Annotation information. |
pred_instances | InstanceData | Prediction results. |
The InstanceData fields that will be used by this task are shown in the following table.
Field | Type | Description |
---|---|---|
bboxes | torch.FloatTensor | Bounding boxes [x1, y1, x2, y2] with the shape (N, 4). |
labels | torch.LongTensor | Instance labels with the shape (N, ). |
texts | list[str] | The text content of each instance with the shape (N, ), used for the e2e text spotting or KIE task. |
edge_labels | torch.IntTensor | The node adjacency matrix with the shape (N, N). In the KIE task, the optional values for the state between nodes are -1 (ignored, not involved in loss calculation), 0 (disconnected) and 1 (connected). |
edge_scores | torch.FloatTensor | The prediction confidence of each edge in the KIE task, with the shape (N, N). |
scores | torch.FloatTensor | The confidence scores for node label predictions, with the shape (N, ). |
Warning
Since there is no unified standard for model implementation of KIE tasks, the design currently considers only SDMGR model usage scenarios. Therefore, the design is subject to change as we support more KIE models.
The following sample code shows the use of KIEDataSample.
import torch
from mmengine.structures import InstanceData
from mmocr.structures import KIEDataSample
data_sample = KIEDataSample()
# Define the ground truth data
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
gt_instances.texts = ['text1', 'text2', 'text3', 'text4', 'text5']
gt_instances.edge_labels = torch.randint(-1, 2, (5, 5))
data_sample.gt_instances = gt_instances
# Define the prediction data
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.zeros((5,), dtype=torch.long)
pred_instances.edge_labels = torch.randint(-1, 2, (5, 5))
pred_instances.edge_scores = torch.rand((5, 5))
data_sample.pred_instances = pred_instances
Data Transforms and Pipeline¶
In the design of MMOCR, dataset construction and data preparation are decoupled. That is, dataset construction classes such as OCRDataset are responsible for loading and parsing annotation files, while data transforms further apply data preprocessing, augmentation, formatting, and other related functions. Currently, there are five types of data transforms implemented in MMOCR, as shown in the following table.
Transforms Type | File | Description |
---|---|---|
Data Loading | loading.py | Implements the data loading functions. |
Data Formatting | formatting.py | Formats the data required by different tasks. |
Cross Project Data Adapter | adapters.py | Converts the data format between other OpenMMLab projects and MMOCR. |
Data Augmentation Functions | ocr_transforms.py textdet_transforms.py textrecog_transforms.py | Various built-in data augmentation methods designed for different tasks. |
Wrappers of Third Party Packages | wrappers.py | Wraps the transforms implemented in popular third-party packages such as ImgAug, and adapts them to the MMOCR format. |
Since the data transform classes are independent of each other, we can easily combine any of them to build a data pipeline after defining the data fields. In MMOCR, a typical training data pipeline consists of three stages: data loading, data augmentation, and data formatting. Users only need to define the data pipeline list in the configuration file and specify the specific data transform classes and their parameters:
train_pipeline_r18 = [
# Loading images
dict(
type='LoadImageFromFile',
color_type='color_ignore_orientation'),
# Loading annotations
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
# Data augmentation
dict(
type='ImgAugWrapper',
args=[['Fliplr', 0.5],
dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
dict(type='RandomCrop', min_side_ratio=0.1),
dict(type='Resize', scale=(640, 640), keep_ratio=True),
dict(type='Pad', size=(640, 640)),
# Data formatting
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
Tip
More tutorials about data pipeline configuration can be found in the Config Doc. Next, we will briefly introduce the data transforms supported in MMOCR according to their categories.
For each data transform, MMOCR provides a detailed docstring. For example, in the header of each data transform class, we annotate Required Keys, Modified Keys and Added Keys. The Required Keys represent the mandatory fields that should be included in the input of the data transform, while the Modified Keys and Added Keys indicate that the transform may modify or add fields to the original data. For example, LoadImageFromFile implements the image loading function, whose Required Keys is the image path img_path, and whose Modified Keys include the loaded image img, the current size of the image img_shape, the original size of the image ori_shape, and other image attributes.
@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
# We provide detailed docstring for each data transform.
"""Load an image from file.
Required Keys:
- img_path
Modified Keys:
- img
- img_shape
- ori_shape
"""
Note
In the data pipeline of MMOCR, the image and label information are saved in a dictionary. By using the unified fields, the data can be freely transferred between different data transforms. Therefore, it is very important to understand the conventional fields used in MMOCR.
For your convenience, the following table lists the conventional keys used in MMOCR data transforms.
Key | Type | Description |
---|---|---|
img | np.array(dtype=np.uint8) | Image array, with the shape (h, w, c). |
img_shape | tuple(int, int) | Current image size (h, w). |
ori_shape | tuple(int, int) | Original image size (h, w). |
scale | tuple(int, int) | Stores the target image size (h, w) specified by the user in the Resize series of data transforms. Note: This value may not correspond to the actual image size after the transformation. |
scale_factor | tuple(float, float) | Stores the target image scale factor (w_scale, h_scale) specified by the user in the Resize series of data transforms. Note: This value may not correspond to the actual image size after the transformation. |
keep_ratio | bool | Boolean flag determining whether to keep the aspect ratio while scaling images. |
flip | bool | Boolean flag indicating whether the image has been flipped. |
flip_direction | str | Flipping direction; options are horizontal, vertical, diagonal. |
gt_bboxes | np.array(dtype=np.float32) | Ground-truth bounding boxes. |
gt_polygons | list[np.array(dtype=np.float32)] | Ground-truth polygons. |
gt_bboxes_labels | np.array(dtype=np.int64) | Category labels of the bounding boxes. By default, MMOCR uses 0 to represent "text" instances. |
gt_texts | list[str] | Ground-truth text content of the instances. |
gt_ignored | np.array(dtype=np.bool_) | Boolean flag indicating whether to ignore the instance (used in text detection). |
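As a concrete illustration of these conventions, the following is a minimal sketch of a custom transform that reads and writes the img key. It assumes mmcv's BaseTransform base class and MMOCR's TRANSFORMS registry; the class itself (InvertColor) is hypothetical and only for demonstration.
import numpy as np
from mmcv.transforms import BaseTransform

from mmocr.registry import TRANSFORMS  # assumed registry location in MMOCR 1.x


@TRANSFORMS.register_module()
class InvertColor(BaseTransform):
    """A toy transform that inverts image colors.

    Required Keys:

    - img

    Modified Keys:

    - img
    """

    def transform(self, results: dict) -> dict:
        # `img` follows the convention above: np.uint8 array of shape (h, w, c)
        results['img'] = (255 - results['img']).astype(np.uint8)
        return results
Once registered, it could be referenced in a pipeline simply as dict(type='InvertColor').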
Data Loading¶
Data loading transforms mainly implement the functions of loading data from different formats and backends. Currently, the following data loading transforms are implemented in MMOCR:
Transforms Name | Required Keys | Modified/Added Keys | Description |
---|---|---|---|
LoadImageFromFile | img_path | img img_shape ori_shape | Load an image from the specified path, supporting different file storage backends (e.g. disk, http, petrel) and decoding backends (e.g. cv2, turbojpeg, pillow, tifffile). |
LoadOCRAnnotations | bbox bbox_label polygon ignore text | gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts | Parse the annotations required by OCR tasks. |
LoadKIEAnnotations | bboxes bbox_labels edge_labels texts | gt_bboxes gt_bboxes_labels gt_edge_labels gt_texts ori_shape | Parse the annotations required by the KIE task. |
Data Augmentation¶
Data augmentation is an indispensable process in text detection and recognition tasks. Currently, MMOCR has implemented dozens of data augmentation modules commonly used in OCR fields, which are classified into ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py.
Specifically, ocr_transforms.py implements generic OCR data augmentation modules such as RandomCrop and RandomRotate:
Transforms Name | Required Keys | Modified/Added Keys | Description |
RandomCrop | img gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts (optional) |
img img_shape gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts (optional) |
Randomly crop the image and make sure the cropped image contains at least one text instance. The optional parameter is min_side_ratio , which controls the ratio of the short side of the cropped image to the original image, the default value is 0.4 . |
RandomRotate | img img_shape gt_bboxes (optional)gt_polygons (optional) |
img img_shape gt_bboxes (optional)gt_polygons (optional)rotated_angle |
Randomly rotate the image and optionally fill the blank areas of the rotated image. |
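Putting the generic transforms above into context, a hypothetical detection pipeline fragment could combine them as follows (parameter values are illustrative, not recommended settings):
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    # Keep at least one text instance in the crop; min_side_ratio as documented above
    dict(type='RandomCrop', min_side_ratio=0.5),
    # Random rotation with its default settings
    dict(type='RandomRotate'),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='Pad', size=(640, 640)),
    dict(type='PackTextDetInputs', meta_keys=('img_path', 'ori_shape', 'img_shape')),
]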
textdet_transforms.py implements data augmentation modules related to text detection:
Transforms Name | Required Keys | Modified/Added Keys | Description |
---|---|---|---|
RandomFlip | img gt_bboxes gt_polygons | img gt_bboxes gt_polygons flip flip_direction | Random flip; supports horizontal, vertical and diagonal modes. Defaults to horizontal. |
FixInvalidPolygon | gt_polygons gt_ignored | gt_polygons gt_ignored | Automatically fix the invalid polygons included in the annotations. |
textrecog_transforms.py implements data augmentation modules related to text recognition:
Transforms Name | Required Keys | Modified/Added Keys | Description |
---|---|---|---|
RescaleToHeight | img | img img_shape scale scale_factor keep_ratio | Scales the image to the specified height while keeping the aspect ratio. When min_width and max_width are specified, the aspect ratio may be changed. |
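For instance, a hypothetical recognition pipeline could apply RescaleToHeight before packing. The values below are illustrative; height, min_width and max_width follow the description above:
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_text=True),
    # Rescale to a fixed height of 32 pixels, constraining the width to [32, 160]
    dict(type='RescaleToHeight', height=32, min_width=32, max_width=160),
    dict(type='PackTextRecogInputs', meta_keys=('img_path', 'ori_shape', 'img_shape')),
]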
Warning
The above table only briefly introduces some selected data augmentation methods, for more information please refer to the API documentation or the code docstrings.
Data Formatting¶
Data formatting transforms are responsible for packaging images, ground truth labels, and other information into a dictionary. Different tasks usually rely on different formatting transforms. For example:
Transforms Name | Required Keys | Modified/Added Keys | Description |
---|---|---|---|
PackTextDetInputs | - | - | Pack the inputs required by text detection. |
PackTextRecogInputs | - | - | Pack the inputs required by text recognition. |
PackKIEInputs | - | - | Pack the inputs required by KIE. |
Cross Project Data Adapters¶
The cross-project data adapters bridge the data formats between MMOCR and other OpenMMLab libraries such as MMDetection, making it possible to call models implemented in other OpenMMLab projects. Currently, MMOCR has implemented MMDet2MMOCR and MMOCR2MMDet, allowing data to be converted between MMDetection and MMOCR formats; with these adapters, users can easily train any detectors supported by MMDetection in MMOCR. For example, we provide a tutorial showing how to train Mask R-CNN as a text detector in MMOCR.
Transforms Name | Required Keys | Modified/Added Keys | Description |
---|---|---|---|
MMDet2MMOCR | gt_masks gt_ignore_flags | gt_polygons gt_ignored | Convert the fields used in MMDet to MMOCR. |
MMOCR2MMDet | img_shape gt_polygons gt_ignored | gt_masks gt_ignore_flags | Convert the fields used in MMOCR to MMDet. |
Wrappers¶
To facilitate the use of popular third-party CV libraries in MMOCR, we provide wrappers in wrappers.py
to unify the data format between MMOCR and other third-party libraries. Users can directly configure the data transforms provided by these libraries in the configuration file of MMOCR. The supported wrappers are as follows:
Transforms Name | Required Keys | Modified/Added Keys | Description |
ImgAugWrapper | img gt_polygons (optional for text recognition)gt_bboxes (optional for text recognition)gt_bboxes_labels (optional for text recognition)gt_ignored (optional for text recognition)gt_texts (optional) |
img gt_polygons (optional for text recognition)gt_bboxes (optional for text recognition)gt_bboxes_labels (optional for text recognition)gt_ignored (optional for text recognition)img_shape (optional)gt_texts (optional) |
ImgAug wrapper, which bridges the data format and configuration between ImgAug and MMOCR, allowing users to config the data augmentation methods supported by ImgAug in MMOCR. |
TorchVisionWrapper | img |
img img_shape |
TorchVision wrapper, which bridges the data format and configuration between TorchVision and MMOCR, allowing users to config the data transforms supported by torchvision.transforms in MMOCR. |
ImgAugWrapper Example¶
For example, in the original ImgAug, we can define a Sequential type data augmentation pipeline as follows to perform random flipping, random rotation and random scaling on the image:
import imgaug.augmenters as iaa
aug = iaa.Sequential(
iaa.Fliplr(0.5), # horizontally flip 50% of all images
iaa.Affine(rotate=(-10, 10)), # rotate by -10 to +10 degrees
iaa.Resize((0.5, 3.0)) # scale images to 50-300% of their size
)
In MMOCR, we can directly configure the above data augmentation pipeline in train_pipeline as follows:
dict(
type='ImgAugWrapper',
args=[
['Fliplr', 0.5],
dict(cls='Affine', rotate=[-10, 10]),
['Resize', [0.5, 3.0]],
]
)
Specifically, the args parameter accepts a list, and each element in the list can be a list or a dictionary. If it is a list, the first element of the list is the class name in imgaug.augmenters, and the following elements are the initialization parameters of the class; if it is a dictionary, the cls key corresponds to the class name in imgaug.augmenters, and the other key-value pairs correspond to the initialization parameters of the class.
TorchVisionWrapper Example¶
For example, in the original TorchVision, we can define a Compose type data transformation pipeline as follows to perform color jittering on the image:
import torchvision.transforms as transforms
aug = transforms.Compose([
transforms.ColorJitter(
brightness=32.0 / 255, # brightness jittering range
saturation=0.5) # saturation jittering range
])
In MMOCR, we can directly configure the above data transformation pipeline in train_pipeline as follows:
dict(
type='TorchVisionWrapper',
op='ColorJitter',
brightness=32.0 / 255,
saturation=0.5
)
Specifically, the op parameter is the class name in torchvision.transforms, and the following parameters correspond to the initialization parameters of the class.
Evaluation¶
Note
Before reading this document, we recommend that you first read MMEngine: Model Accuracy Evaluation Basics.
Metrics¶
MMOCR implements widely-used evaluation metrics for text detection, text recognition and key information extraction tasks based on the MMEngine: BaseMetric base class. Users can specify the metric used in the validation and test phases by modifying the val_evaluator and test_evaluator fields in the configuration file. For example, the following config shows how to use HmeanIOUMetric to evaluate the model performance in the text detection task.
val_evaluator = dict(type='HmeanIOUMetric')
test_evaluator = val_evaluator
# In addition, MMOCR also supports the combined evaluation of multiple metrics for the same task, such as using WordMetric and CharMetric at the same time
val_evaluator = [
dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol']),
dict(type='CharMetric')
]
Tip
More evaluation related configurations can be found in the evaluation configuration tutorial.
As shown in the following table, MMOCR currently supports 5 evaluation metrics for text detection, text recognition, and key information extraction tasks, including HmeanIOUMetric, WordMetric, CharMetric, OneMinusNEDMetric, and F1Metric.
Metric | Task | Input Field | Output Field |
---|---|---|---|
HmeanIOUMetric | TextDet | pred_polygons pred_scores gt_polygons | recall precision hmean |
WordMetric | TextRec | pred_text gt_text | word_acc word_acc_ignore_case word_acc_ignore_case_symbol |
CharMetric | TextRec | pred_text gt_text | char_recall char_precision |
OneMinusNEDMetric | TextRec | pred_text gt_text | 1-N.E.D |
F1Metric | KIE | pred_labels gt_labels | macro_f1 micro_f1 |
In general, the evaluation metric used in each task is conventionally determined. Users usually do not need to understand or manually modify the internal implementation of the evaluation metric. However, to facilitate more customized requirements, this document will further introduce the specific implementation details and configurable parameters of the built-in metrics in MMOCR.
HmeanIOUMetric¶
HmeanIOUMetric is one of the most widely used evaluation metrics in text detection tasks, because it calculates the harmonic mean (H-mean) between the detection precision (P) and recall rate (R). The HmeanIOUMetric can be calculated by the following equation:
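With detection precision \(P\) and recall \(R\), the H-mean takes the standard harmonic-mean form:
\[ H = \frac{2 \times P \times R}{P + R} \]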
In addition, since it is equivalent to the F-score (also known as F-measure or F-metric) when \(\beta = 1\), HmeanIOUMetric is sometimes written as F1Metric or f1-score:
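For reference, the general F-score with weighting factor \(\beta\) is
\[ F_\beta = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R} \]
which reduces to the H-mean above when \(\beta = 1\).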
In MMOCR, the calculation of HmeanIOUMetric can be summarized as the following steps:
Filter out invalid predictions
Filter out predictions with a score lower than pred_score_thrs.
Filter out predictions overlapping with ignored ground truth boxes with an overlap ratio higher than ignore_precision_thr.
It is worth noting that pred_score_thrs will automatically search for the best threshold within a certain range by default, and users can also customize the search range by manually modifying the configuration file:
# By default, HmeanIOUMetric searches the best threshold within the range [0.3, 0.9] with a step size of 0.1
val_evaluator = dict(type='HmeanIOUMetric', pred_score_thrs=dict(start=0.3, stop=0.9, step=0.1))
Calculate the IoU matrix
At the data processing stage, HmeanIOUMetric will calculate and maintain an \(M \times N\) IoU matrix iou_metric for the convenience of the subsequent bounding box pairing step. Here, M and N represent the number of label bounding boxes and filtered prediction bounding boxes, respectively. Therefore, each element of this matrix stores the IoU between the m-th label bounding box and the n-th prediction bounding box.
Compute the number of GT samples that can be accurately matched based on the corresponding pairing strategy
Although HmeanIOUMetric can be calculated by a fixed formula, there may still be some subtle differences among specific implementations. These differences mainly reflect the use of different strategies to match ground truth and predicted bounding boxes, which leads to differences in the final scores. Currently, MMOCR supports two matching strategies for HmeanIOUMetric, namely vanilla and max_matching. As shown below, users can specify the matching strategy in the config.

vanilla matching strategy

By default, HmeanIOUMetric adopts the vanilla matching strategy, which is consistent with the hmean-iou implementation in MMOCR 0.x and the official text detection competition evaluation standard of the ICDAR series. This strategy adopts the first-come-first-served matching method to pair the labels and predictions.

# By default, HmeanIOUMetric adopts 'vanilla' matching strategy
val_evaluator = dict(type='HmeanIOUMetric')

max_matching matching strategy

To address the shortcomings of the existing matching mechanism, MMOCR has implemented a more efficient matching strategy that maximizes the number of matches.

# Specify to use 'max_matching' matching strategy
val_evaluator = dict(type='HmeanIOUMetric', strategy='max_matching')

Note
We recommend that research-oriented developers use the default vanilla matching strategy to ensure consistency with other papers. Industry-oriented developers can use the max_matching strategy to achieve optimized performance.

Compute the final evaluation score according to the aforementioned matching strategy
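For reference, the matching strategy and the threshold search range introduced above can be combined in a single evaluator config; a minimal sketch using only the parameters shown in this section:
# Combine the matching strategy with a custom score threshold search range
val_evaluator = dict(
    type='HmeanIOUMetric',
    strategy='max_matching',
    pred_score_thrs=dict(start=0.3, stop=0.9, step=0.1))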
WordMetric¶
WordMetric implements word-level text recognition evaluation metrics and includes three text matching modes, namely exact
, ignore_case
, and ignore_case_symbol
. Users can freely combine the output of one or more text matching modes in the configuration file by modifying the mode
field.
# Use WordMetric for text recognition task
val_evaluator = [
dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol'])
]
exact: Full matching mode, i.e., the predicted text is considered correct only when it is exactly the same as the ground truth text.

ignore_case: This mode ignores the case of the predicted text and the ground truth text.

ignore_case_symbol: This mode ignores both the case and the symbols of the predicted text and the ground truth text. This is also the text recognition accuracy reported by most academic papers. The performance reported by MMOCR uses the ignore_case_symbol mode by default.
Assume that the real label is MMOCR!
and the model output is mmocr
. The WordMetric
scores under the three matching modes are: {'exact': 0, 'ignore_case': 0, 'ignore_case_symbol': 1}
.
CharMetric¶
CharMetric implements character-level text recognition evaluation metrics that are case-insensitive.
# Use CharMetric for text recognition task
val_evaluator = [dict(type='CharMetric')]
Specifically, CharMetric will output two evaluation metrics, namely char_precision and char_recall. Let the number of correctly predicted characters (True Positive) be \(\sigma_{tp}\); then the precision P and recall R can be calculated by the following equation:

\[ P = \frac{\sigma_{tp}}{\sigma_{pred}}, \quad R = \frac{\sigma_{tp}}{\sigma_{gt}} \]

where \(\sigma_{gt}\) and \(\sigma_{pred}\) represent the total number of characters in the label text and the predicted text, respectively.
For example, assume that the label text is “MMOCR” and the predicted text is “mm0cR1”. Ignoring case, four characters (M, M, C, R) are correctly matched, so the CharMetric scores are char_recall = 4/5 = 0.8 and char_precision = 4/6 ≈ 0.67.
OneMinusNEDMetric¶
OneMinusNEDMetric(1-N.E.D) is commonly used for text recognition evaluation of Chinese or English text line-level annotations. Unlike the full matching metric that requires the prediction and the gt text to be exactly the same, 1-N.E.D
uses the normalized edit distance (also known as Levenshtein Distance) to measure the difference between the predicted and the gt text, so that the performance difference of the model can be better distinguished when evaluating long texts. Assume that the real and predicted texts are \(s_i\) and \(\hat{s_i}\), respectively, and their lengths are \(l_{i}\) and \(\hat{l_i}\), respectively. The OneMinusNEDMetric
score can be calculated by the following formula:

\[ 1\text{-}N.E.D = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{D(s_i, \hat{s_i})}{\max(l_{i}, \hat{l_i})} \]

where N is the total number of samples, and \(D(s_1, s_2)\) is the edit distance between two strings.
For example, assume that the real label is “OpenMMLabMMOCR”, the prediction of model A is “0penMMLabMMOCR”, and the prediction of model B is “uvwxyz”. The results of the full matching and OneMinusNEDMetric
evaluation metrics are as follows:
 | Full-match | 1 - N.E.D. |
---|---|---|
Model A | 0 | 0.92857 |
Model B | 0 | 0 |
As shown in the table above, although model A only predicted one letter incorrectly, both models got 0 when using the full-match strategy. However, the OneMinusNEDMetric
evaluation metric can better distinguish the performance of the two models on long texts.
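OneMinusNEDMetric consumes the same pred_text and gt_text fields as the other recognition metrics, so it can be enabled on its own or alongside them; a minimal sketch using only metric types introduced in this document:
# Evaluate a recognizer with both 1-N.E.D and word accuracy
val_evaluator = [
    dict(type='OneMinusNEDMetric'),
    dict(type='WordMetric', mode=['ignore_case_symbol'])
]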
F1Metric¶
F1Metric implements the F1-Metric evaluation metric for KIE tasks and provides two modes, namely micro
and macro
.
val_evaluator = [
    dict(type='F1Metric', mode=['micro', 'macro'])
]
micro mode: Calculate the global F1-Metric score based on the total number of True Positives, False Negatives, and False Positives.

macro mode: Calculate the F1-Metric score for each class and then take the average.
Customized Metric¶
MMOCR supports the implementation of customized evaluation metrics for users who pursue higher customization. In general, users only need to create a customized evaluation metric class CustomizedMetric
and inherit MMEngine: BaseMetric. Then, the data format processing method process
and the metric calculation method compute_metrics
need to be overwritten respectively. Finally, add it to the METRICS
registry to implement any customized evaluation metric.
from typing import Dict, List, Sequence

from mmengine.evaluator import BaseMetric

from mmocr.registry import METRICS


@METRICS.register_module()
class CustomizedMetric(BaseMetric):

    def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
        """process receives two parameters: data_batch stores the gt label information,
        and predictions stores the predicted results. Intermediate results are
        conventionally appended to self.results here.
        """
        pass

    def compute_metrics(self, results: List) -> Dict:
        """compute_metrics receives the collected results of the process method as
        input and returns a dict of the evaluation results.
        """
        pass
Note
More details can be found in MMEngine Documentation: BaseMetric.
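Once registered, the customized metric can be referenced in a config by its type name just like the built-in metrics. A minimal sketch, where the module path in custom_imports is hypothetical and should point to wherever CustomizedMetric is actually defined:
# Make sure the module defining CustomizedMetric gets imported, then use it as an evaluator
custom_imports = dict(
    imports=['projects.customized_metric'],  # hypothetical module path
    allow_failed_imports=False)
val_evaluator = dict(type='CustomizedMetric')
test_evaluator = val_evaluator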
Dataset¶
Overview¶
In MMOCR, all the datasets are processed via different Dataset classes based on mmengine.BaseDataset. Dataset classes are responsible for loading the data and performing initial parsing; the results are then fed to the data pipeline for data preprocessing, augmentation, formatting, etc.
In this tutorial, we will introduce some common interfaces of the Dataset class, and the usage of Dataset implementations in MMOCR as well as the annotation types they support.
Tip
Dataset classes support some advanced features, such as lazy initialization and data serialization, and take advantage of various dataset wrappers to perform data concatenation, repeating, and category balancing. These features will not be covered in this tutorial, but you can read MMEngine: BaseDataset for more details.
Common Interfaces¶
Now, let’s look at a concrete example and learn some typical interfaces of a Dataset class.
OCRDataset
is a widely used Dataset implementation in MMOCR, and is suggested as the default Dataset type since its associated annotation format is flexible enough to support all the OCR tasks (more info). Next, we will instantiate an OCRDataset
object wherein the toy dataset in tests/data/det_toy_dataset
will be loaded.
from mmocr.datasets import OCRDataset
from mmengine.registry import init_default_scope
init_default_scope('mmocr')
train_pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(type='RandomCrop', min_side_ratio=0.1),
dict(type='Resize', scale=(640, 640), keep_ratio=True),
dict(type='Pad', size=(640, 640)),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
dataset = OCRDataset(
data_root='tests/data/det_toy_dataset',
ann_file='textdet_test.json',
test_mode=False,
pipeline=train_pipeline)
Let’s peek at the size of this dataset:
>>> print(len(dataset))
10
Typically, a Dataset class loads and stores two types of information: (1) meta information: Some meta descriptors of the dataset’s property, such as available object categories in this dataset. (2) annotation: The path to images, and their labels. We can access the meta information in dataset.metainfo
:
>>> from pprint import pprint
>>> pprint(dataset.metainfo)
{'category': [{'id': 0, 'name': 'text'}],
'dataset_type': 'TextDetDataset',
'task_name': 'textdet'}
As for the annotations, we can access them via dataset.get_data_info(idx)
, which returns a dictionary containing the information of the idx
-th sample in the dataset that is initially parsed, but not yet processed by data pipeline.
>>> from pprint import pprint
>>> pprint(dataset.get_data_info(0))
{'height': 720,
'img_path': 'tests/data/det_toy_dataset/test/img_10.jpg',
'instances': [{'bbox': [260.0, 138.0, 284.0, 158.0],
'bbox_label': 0,
'ignore': True,
'polygon': [261, 138, 284, 140, 279, 158, 260, 158]},
...,
{'bbox': [1011.0, 157.0, 1079.0, 173.0],
'bbox_label': 0,
'ignore': True,
'polygon': [1011, 157, 1079, 160, 1076, 173, 1011, 170]}],
'sample_idx': 0,
'seg_map': 'test/gt_img_10.txt',
'width': 1280}
On the other hand, we can get the sample fully processed by data pipeline via dataset[idx]
or dataset.__getitem__(idx)
, which can be directly fed to models to perform a full train/test cycle. It has two fields:
inputs: The image after data augmentation.

data_samples: The DataSample that contains the augmented annotations, plus the meta information appended by some data transforms to keep track of key properties of this sample.
>>> pprint(dataset[0])
{'data_samples': <TextDetDataSample(
META INFORMATION
ori_shape: (720, 1280)
img_path: 'tests/data/det_toy_dataset/imgs/test/img_10.jpg'
img_shape: (640, 640)
DATA FIELDS
gt_instances: <InstanceData(
META INFORMATION
DATA FIELDS
labels: tensor([0, 0, 0])
polygons: [array([207.33984 , 104.65409 , 208.34634 , 84.528305, 231.49594 ,
86.54088 , 226.46341 , 104.65409 , 207.33984 , 104.65409 ],
dtype=float32), array([237.53496 , 103.6478 , 235.52196 , 84.528305, 365.36096 ,
86.54088 , 364.35446 , 107.67296 , 237.53496 , 103.6478 ],
dtype=float32), array([105.68293, 166.03773, 105.68293, 151.94969, 177.14471, 150.94339,
178.15121, 165.03145, 105.68293, 166.03773], dtype=float32)]
ignored: tensor([ True, False, True])
bboxes: tensor([[207.3398, 84.5283, 231.4959, 104.6541],
[235.5220, 84.5283, 365.3610, 107.6730],
[105.6829, 150.9434, 178.1512, 166.0377]])
) at 0x7f7359f04fa0>
) at 0x7f735a0508e0>,
'inputs': tensor([[[129, 111, 131, ..., 0, 0, 0], ...
[ 19, 18, 15, ..., 0, 0, 0]]], dtype=torch.uint8)}
Dataset Classes and Annotation Formats¶
Each Dataset implementation can only load datasets in a specific annotation format. The following lists all supported Dataset classes and their compatible annotation formats, as well as an example config showcasing how to use them in practice.
Note
If you are not familiar with the config system, you may find Dataset Configuration helpful.
OCRDataset¶
Usually, there are many different types of annotations in OCR datasets, and the formats often vary between different subtasks, such as text detection and text recognition. These differences can result in the need for different data loading code when using different datasets, increasing the learning and maintenance costs for users.
In MMOCR, we propose a unified dataset format that can adapt to all three subtasks of OCR: text detection, text recognition, and text spotting. This design maximizes the uniformity of the dataset, allows for the reuse of data annotations across different tasks, and makes dataset management more convenient. Considering that popular dataset formats are still inconsistent, MMOCR provides Dataset Preparer to help users convert their datasets to MMOCR format. We also strongly encourage researchers to develop their own datasets based on this data format.
Annotation Format¶
This annotation file is a .json
file that stores a dict
, containing both metainfo
and data_list
, where the former includes basic information about the dataset and the latter consists of the label items of each target instance. The following is an extensive list of all the fields in the annotation file; some fields are only used in a subset of tasks and can be ignored in the others.
{
"metainfo":
{
"dataset_type": "TextDetDataset", # Options: TextDetDataset/TextRecogDataset/TextSpotterDataset
"task_name": "textdet", # Options: textdet/textspotter/textrecog
"category": [{"id": 0, "name": "text"}] # Used in textdet/textspotter
},
"data_list":
[
{
"img_path": "test_img.jpg",
"height": 604,
"width": 640,
"instances": # multiple instances in one image
[
{
"bbox": [0, 0, 10, 20], # in textdet/textspotter, [x1, y1, x2, y2].
"bbox_label": 0, # The object category, always 0 (text) in MMOCR
"polygon": [0, 0, 0, 10, 10, 20, 20, 0], # in textdet/textspotter. [x1, y1, x2, y2, ....]
"text": "mmocr", # in textspotter/textrecog
"ignore": False # in textspotter/textdet. Whether to ignore this sample during training
},
#...
],
}
#... multiple images
]
}
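As a quick illustration of this format, the following sketch (file names, sizes, and coordinates are made up) assembles a single-image text detection annotation in Python and dumps it with the standard json module:
import json

# Build a minimal MMOCR-format annotation file for one image with one text instance
ann = dict(
    metainfo=dict(
        dataset_type='TextDetDataset',
        task_name='textdet',
        category=[dict(id=0, name='text')]),
    data_list=[
        dict(
            img_path='test_img.jpg',
            height=604,
            width=640,
            instances=[
                dict(
                    bbox=[0, 0, 10, 20],
                    bbox_label=0,
                    polygon=[0, 0, 0, 10, 10, 20, 20, 0],
                    ignore=False)
            ])
    ])

with open('textdet_toy.json', 'w') as f:
    json.dump(ann, f)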
Example Config¶
Here is a part of config example where we make train_dataloader
use OCRDataset
to load the ICDAR2015 dataset for a text detection model. Keep in mind that OCRDataset
can load any OCR datasets prepared by Dataset Preparer regardless of its task. That is, you can use it for text recognition and text spotting, but you still have to modify the transform types in pipeline
according to the needs of different tasks.
pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
icdar2015_textdet_train = dict(
type='OCRDataset',
data_root='data/icdar2015',
ann_file='textdet_train.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train)
RecogLMDBDataset¶
Reading images or labels from files can be slow when the amount of data is large, e.g. on a scale of millions. Besides, in academia, most scene text recognition datasets are stored in LMDB format, including both images and labels. (Example)
To get closer to the mainstream practice and enhance the data storage efficiency, MMOCR supports loading images and labels from lmdb datasets via RecogLMDBDataset
.
Annotation Format¶
MMOCR requires the following keys for LMDB datasets:
num-samples: The key describing the data volume of the dataset.

The keys of images and labels are in the format of image-000000001 and label-000000001, respectively. The index starts from 1.
MMOCR has a toy LMDB dataset in tests/data/rec_toy_dataset/imgs.lmdb
.
You can get a sense of the format with the following code snippet.
>>> import lmdb
>>>
>>> env = lmdb.open('tests/data/rec_toy_dataset/imgs.lmdb')
>>> txn = env.begin()
>>> for k, v in txn.cursor():
...     print(k, v)
b'image-000000001' b'\xff...'
b'image-000000002' b'\xff...'
b'image-000000003' b'\xff...'
b'image-000000004' b'\xff...'
b'image-000000005' b'\xff...'
b'image-000000006' b'\xff...'
b'image-000000007' b'\xff...'
b'image-000000008' b'\xff...'
b'image-000000009' b'\xff...'
b'image-000000010' b'\xff...'
b'label-000000001' b'GRAND'
b'label-000000002' b'HOTEL'
b'label-000000003' b'HOTEL'
b'label-000000004' b'PACIFIC'
b'label-000000005' b'03/09/2009'
b'label-000000006' b'ANING'
b'label-000000007' b'Virgin'
b'label-000000008' b'america'
b'label-000000009' b'ATTACK'
b'label-000000010' b'DAVIDSON'
b'num-samples' b'10'
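Conversely, an LMDB file in this layout can be created with the lmdb package. A minimal sketch (the output path, image file, and label are illustrative):
import lmdb

# Write one image/label pair plus the required num-samples key
env = lmdb.open('toy_imgs.lmdb', map_size=1 << 30)  # reserve up to ~1 GB
with env.begin(write=True) as txn:
    with open('word_1.png', 'rb') as f:  # any encoded image file
        txn.put(b'image-000000001', f.read())
    txn.put(b'label-000000001', b'GRAND')
    txn.put(b'num-samples', b'1')
env.close()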
Example Config¶
Here is a part of config example where we make train_dataloader
use RecogLMDBDataset
to load the toy dataset. Since RecogLMDBDataset
loads images as numpy arrays, don’t forget to use LoadImageFromNDArray
instead of LoadImageFromFile
in the pipeline for successful loading.
pipeline = [
dict(
type='LoadImageFromNDArray'),
dict(
type='LoadOCRAnnotations',
with_text=True,
),
dict(
type='PackTextRecogInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
toy_textrecog_train = dict(
type='RecogLMDBDataset',
data_root='tests/data/rec_toy_dataset/',
ann_file='imgs.lmdb',
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=toy_textrecog_train)
RecogTextDataset¶
Prior to MMOCR 1.0, MMOCR 0.x took text files as input for text recognition. These formats have been deprecated in MMOCR 1.0, and this class could be removed at any time in the future. More info
Annotation Format¶
Text files can either be in txt
format or jsonl
format. The simple .txt
annotations separate the image name and the word annotation with a blank space, and therefore cannot handle the case where spaces are included in a text instance.
img1.jpg OpenMMLab
img2.jpg MMOCR
The JSON Line format uses a dictionary-like structure to represent the annotations, where the keys filename
and text
store the image name and word label, respectively.
{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}
Example Config¶
Here is a part of config example where we use RecogTextDataset
to load the old txt labels in training, and the old jsonl labels in testing.
pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_text=True,
    ),
    dict(
        type='PackTextRecogInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
# loading 0.x txt format annos
txt_dataset = dict(
type='RecogTextDataset',
data_root=data_root,
ann_file='old_label.txt',
data_prefix=dict(img_path='imgs'),
parser_cfg=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1]),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=txt_dataset)
# loading 0.x json line format annos
jsonl_dataset = dict(
type='RecogTextDataset',
data_root=data_root,
ann_file='old_label.jsonl',
data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineJsonParser',
        keys=['filename', 'text']),
    pipeline=pipeline)
test_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=jsonl_dataset)
IcdarDataset¶
Prior to MMOCR 1.0, MMOCR 0.x took COCO-like format annotations as input for text detection. These formats have been deprecated in MMOCR 1.0, and this class could be removed at any time in the future. More info
Annotation Format¶
{
"images": [
{
"id": 1,
"width": 800,
"height": 600,
"file_name": "test.jpg"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [0,0,10,10],
"segmentation": [
[0,0,10,0,10,10,0,10]
],
"area": 100,
"iscrowd": 0
}
]
}
Example Config¶
Here is a part of config example where we make train_dataloader
use IcdarDataset
to load the old labels.
pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
icdar2015_textdet_train = dict(
type='IcdarDataset',
data_root='data/det/icdar2015',
ann_file='instances_training.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train)
WildReceiptDataset¶
It is customized for the WildReceipt dataset only.
Annotation Format¶
// Close Set
{
"file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
"height": 1200,
"width": 1600,
"annotations":
[
{
"box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
"text": "SAFEWAY",
"label": 1
},
{
"box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
"text": "TM",
"label": 25
}
], //...
}
// Open Set
{
"file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
"height": 348,
"width": 348,
"annotations":
[
{
"box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
"text": "CHOEUN",
"label": 2,
"edge": 1
},
{
"box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
"text": "KOREANRESTAURANT",
"label": 2,
"edge": 1
}
]
}
Example Config¶
Please refer to SDMGR’s config for more details.
Overview & Features[coming soon]¶
Coming Soon!
Data Flow[coming soon]¶
Coming Soon!
Models[coming soon]¶
Coming Soon!
Visualizers[coming soon]¶
Coming Soon!
Convention[coming soon]¶
Coming Soon!
Engine[coming soon]¶
Coming Soon!
Overview¶
Supported Datasets¶
Dataset Name | Text Detection | Text Recognition | Text Spotting | KIE |
---|---|---|---|---|
COCO Text v2 | ✓ | ✓ | ✓ | |
CTW1500 | ✓ | ✓ | ✓ | |
CUTE80 | | ✓ | | |
FUNSD | ✓ | ✓ | ✓ | |
Incidental Scene Text IC13 | ✓ | ✓ | ✓ | |
Incidental Scene Text IC15 | ✓ | ✓ | ✓ | |
IIIT5K | | ✓ | | |
Synthetic Word Dataset (MJSynth/Syn90k) | | ✓ | | |
NAF | ✓ | ✓ | ✓ | |
Scanned Receipts OCR and Information Extraction (SROIE) | ✓ | ✓ | ✓ | |
Street View Text Dataset (SVT) | ✓ | ✓ | ✓ | |
Street View Text Perspective (SVT-P) | | ✓ | | |
SynthText in the Wild Dataset | ✓ | ✓ | ✓ | |
Text OCR | ✓ | ✓ | ✓ | |
Total Text | ✓ | ✓ | ✓ | |
WildReceipt | ✓ | ✓ | ✓ | ✓ |
Dataset Details¶
COCO Text v2¶
“COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images”, arXiv, 2016. PDF
A. Basic Info
Official Website: cocotextv2
Year: 2016
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: CC BY 4.0
B. Annotation Format
Text Detection/Spotting
{
"cats": {},
"anns": {
"45346": {
"mask":[468.9,286.7,468.9,295.2,493.0,295.8,493.0,287.2],
"class":"machine printed",
"bbox":[468.9,286.7,24.1,9.1],
"image_id":522579,
"id":167312,
"language":"english",
"area":55.5,
"utf8_string":"the",
"legibility":"legible"
},
// ...
},
"imgs": {
"522579": {
"file_name":"COCO_train2014_000000522579.jpg",
"height":476,
"width":640,
"id":522579,
"set":"train",
},
// ...
},
"imgToAnns": {
"522579": [167294, 167295, 167296, 167297, 167298, 167299, 167300, 167301, 167302, 167303, 167304, 167305, 167306, 167307, 167308, 167309, 167310, 167311, 167312, 167313, 167314, 167315, 167316, 167317],
// ...
},
"info": {}
}
C. Reference
@article{veit2016coco, title={Coco-text: Dataset and benchmark for text detection and recognition in natural images}, author={Veit, Andreas and Matera, Tomas and Neumann, Lukas and Matas, Jiri and Belongie, Serge}, journal={arXiv preprint arXiv:1601.07140}, year={2016}}
CTW1500¶
“Curved scene text detection via transverse and longitudinal sequence connection”, PR, 2019. PDF
A. Basic Info
Official Website: ctw1500
Year: 2019
Language: [‘English’]
Scene: [‘Scene’]
Annotation Granularity: [‘Word’, ‘Line’]
Supported Tasks: [‘textrecog’, ‘textdet’, ‘textspotting’]
License: N/A
B. Annotation Format
C. Reference
@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and Zhang, Shuaitao and Luo, Canjie and Zhang, Sheng}, journal={Pattern Recognition}, volume={90}, pages={337--345}, year={2019}, publisher={Elsevier} }
CUTE80¶
“A Robust Arbitrary Text Detection System for Natural Scene Images”, ESWA, 2014. PDF
A. Basic Info
Official Website: cute80
Year: 2014
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textrecog’]
License: N/A
B. Annotation Format
Text Recognition
# timage/img_name text 1 text
timage/001.jpg RONALDO 1 RONALDO
timage/002.jpg 7 1 7
timage/003.jpg SEACREST 1 SEACREST
timage/004.jpg BEACH 1 BEACH
C. Reference
@article{risnumawan2014robust, title={A robust arbitrary text detection system for natural scene images}, author={Risnumawan, Anhar and Shivakumara, Palaiahankote and Chan, Chee Seng and Tan, Chew Lim}, journal={Expert Systems with Applications}, volume={41}, number={18}, pages={8027--8048}, year={2014}, publisher={Elsevier}}
FUNSD¶
“FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”, ICDAR, 2019. PDF
A. Basic Info
Official Website: funsd
Year: 2019
Language: [‘English’]
Scene: [‘Document’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: FUNSD License
B. Annotation Format
Text Detection/Recognition/Spotting
{
"form": [
{
"id": 0,
"text": "Registration No.",
"box": [
94,
169,
191,
186
],
"linking": [
[
0,
1
]
],
"label": "question",
"words": [
{
"text": "Registration",
"box": [
94,
169,
168,
186
]
},
{
"text": "No.",
"box": [
170,
169,
191,
183
]
}
]
},
{
"id": 1,
"text": "533",
"box": [
209,
169,
236,
182
],
"label": "answer",
"words": [
{
"box": [
209,
169,
236,
182
],
"text": "533"
}
],
"linking": [
[
0,
1
]
]
}
]
}
C. Reference
@inproceedings{jaume2019, title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents}, author = {Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran}, booktitle = {Accepted to ICDAR-OST}, year = {2019}}
Incidental Scene Text IC13¶
“ICDAR 2013 Robust Reading Competition”, ICDAR, 2013. PDF
A. Basic Info
Official Website: icdar2013
Year: 2013
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: N/A
B. Annotation Format
Text Detection
# train split
# x1 y1 x2 y2 "transcript"
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
# test split
# x1, y1, x2, y2, "transcript"
38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
Text Recognition
# img_name, "text"
word_1.png, "PROPER"
word_2.png, "FOOD"
word_3.png, "PRONTO"
C. Reference
@inproceedings{karatzas2013icdar, title={ICDAR 2013 robust reading competition}, author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere}, booktitle={2013 12th international conference on document analysis and recognition}, pages={1484--1493}, year={2013}, organization={IEEE}}
Incidental Scene Text IC15¶
“ICDAR 2015 Competition on Robust Reading”, ICDAR, 2015. PDF
A. Basic Info
Official Website: icdar2015
Year: 2015
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: CC BY 4.0
B. Annotation Format
Text Detection
# x1,y1,x2,y2,x3,y3,x4,y4,trans
377,117,463,117,465,130,378,130,Genaxis Theatre
493,115,519,115,519,131,493,131,[06]
374,155,409,155,409,170,374,170,###
Text Recognition
# img_name, "text"
word_1.png, "Genaxis Theatre"
word_2.png, "[06]"
word_3.png, "62-03"
C. Reference
@inproceedings{karatzas2015icdar, title={ICDAR 2015 competition on robust reading}, author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others}, booktitle={2015 13th international conference on document analysis and recognition (ICDAR)}, pages={1156--1160}, year={2015}, organization={IEEE}}
IIIT5K¶
“Scene Text Recognition using Higher Order Language Priors”, BMVC, 2012. PDF
A. Basic Info
Official Website: iiit5k
Year: 2012
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textrecog’]
License: N/A
B. Annotation Format
Text Recognition
# img_name, "text"
train/1009_2.png You
train/1017_1.png Rescue
train/1017_2.png mission
C. Reference
@InProceedings{MishraBMVC12, author = "Mishra, A. and Alahari, K. and Jawahar, C.~V.", title = "Scene Text Recognition using Higher Order Language Priors", booktitle = "BMVC", year = "2012"}
Synthetic Word Dataset (MJSynth/Syn90k)¶
“Reading Text in the Wild with Convolutional Neural Networks”, International Journal of Computer Vision, 2016. PDF
A. Basic Info
Official Website: mjsynth
Year: 2016
Language: [‘English’]
Scene: [‘Synthesis’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textrecog’]
License: N/A
B. Annotation Format
Text Recognition
./3000/7/182_slinking_71711.jpg 71711
./3000/7/182_REMODELERS_64541.jpg 64541
C. Reference
@InProceedings{Jaderberg14c, author = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title = "Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition", booktitle = "Workshop on Deep Learning, NIPS", year = "2014", }
@Article{Jaderberg16, author = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title = "Reading Text in the Wild with Convolutional Neural Networks", journal = "International Journal of Computer Vision", number = "1", volume = "116", pages = "1--20", month = "jan", year = "2016", }
NAF¶
“Deep Visual Template-Free Form Parsing”, ICDAR, 2019. PDF
A. Basic Info
Official Website: naf
Year: 2019
Language: [‘English’]
Scene: [‘Document’, ‘Handwritten’]
Annotation Granularity: [‘Word’, ‘Line’]
Supported Tasks: [‘textrecog’, ‘textdet’, ‘textspotting’]
License: CDLA
B. Annotation Format
Text Detection/Recognition/Spotting
{"fieldBBs": [{"poly_points": [[435, 1406], [466, 1406], [466, 1439], [435, 1439]], "type": "fieldCheckBox", "id": "f0", "isBlank": 1}, {"poly_points": [[435, 1444], [469, 1444], [469, 1478], [435, 1478]], "type": "fieldCheckBox", "id": "f1", "isBlank": 1}],
"textBBs": [{"poly_points": [[1183, 1337], [2028, 1345], [2032, 1395], [1186, 1398]], "type": "text", "id": "t0"}, {"poly_points": [[492, 1336], [809, 1338], [809, 1379], [492, 1378]], "type": "text", "id": "t1"}, {"poly_points": [[512, 1375], [798, 1376], [798, 1405], [512, 1404]], "type": "textInst", "id": "t2"}], "imageFilename": "007182398_00026.jpg", "transcriptions": {"f0": "\u00bf\u00bf\u00bf \u00bf\u00bf\u00bf 18/1/49 \u00bf\u00bf\u00bf\u00bf\u00bf", "f1": "U.S. Navy 53rd. Naval Const. Batt.", "t0": "APPLICATION FOR HEADSTONE OR MARKER", "t1": "ORIGINAL"}}
C. Reference
@inproceedings{davis2019deep, title={Deep visual template-free form parsing}, author={Davis, Brian and Morse, Bryan and Cohen, Scott and Price, Brian and Tensmeyer, Chris}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, pages={134--141}, year={2019}, organization={IEEE}}
Scanned Receipts OCR and Information Extraction¶
“ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”, ICDAR, 2019. PDF
A. Basic Info
Official Website: sroie
Year: 2019
Language: [‘English’]
Scene: [‘Document’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: CC BY 4.0
B. Annotation Format
Text Detection, Text Recognition and Text Spotting
# x1,y1,x2,y2,x3,y3,x4,y4,trans
72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W
C. Reference
@INPROCEEDINGS{8977955, author={Huang, Zheng and Chen, Kai and He, Jianhua and Bai, Xiang and Karatzas, Dimosthenis and Lu, Shijian and Jawahar, C. V.}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, title={ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction}, year={2019}, volume={}, number={}, pages={1516-1520}, doi={10.1109/ICDAR.2019.00244}}
Street View Text Dataset (SVT)¶
“Word Spotting in the Wild”, ECCV, 2010. PDF
A. Basic Info
Official Website: svt
Year: 2010
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: N/A
B. Annotation Format
Text Detection/Recognition/Spotting
<image>
<imageName>img/14_03.jpg</imageName>
<address>341 Southwest 10th Avenue Portland OR</address>
<lex>
LIVING,ROOM,THEATERS,KENNY,ZUKE,DELICATESSEN,CLYDE,COMMON,ACE,HOTEL,PORTLAND,ROSE,CITY,BOOKS,STUMPTOWN,COFFEE,ROASTERS,RED,CAP,GARAGE,FISH,GROTTO,SEAFOOD,RESTAURANT,AURA,RESTAURANT,LOUNGE,ROCCO,PIZZA,PASTA,BUFFALO,EXCHANGE,MARK,SPENCER,LIGHT,FEZ,BALLROOM,READING,FRENZY,ROXY,SCANDALS,MARTINOTTI,CAFE,DELI,CROWSENBERG,HALF
</lex>
<Resolution x="1280" y="880"/>
<taggedRectangles>
<taggedRectangle height="75" width="236" x="375" y="253">
<tag>LIVING</tag>
</taggedRectangle>
<taggedRectangle height="76" width="175" x="639" y="272">
<tag>ROOM</tag>
</taggedRectangle>
<taggedRectangle height="87" width="281" x="839" y="283">
<tag>THEATERS</tag>
</taggedRectangle>
</taggedRectangles>
</image>
C. Reference
@inproceedings{wang2010word, title={Word spotting in the wild}, author={Wang, Kai and Belongie, Serge}, booktitle={European conference on computer vision}, pages={591--604}, year={2010}, organization={Springer}}
Street View Text Perspective (SVT-P)¶
“Recognizing Text with Perspective Distortion in Natural Scenes”, ICCV, 2013. PDF
A. Basic Info
Official Website: svtp
Year: 2013
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textrecog’]
License: N/A
B. Annotation Format
Text Recognition
13_15_0_par.jpg WYNDHAM
13_15_1_par.jpg HOTEL
12_16_0_par.jpg UNITED
C. Reference
@inproceedings{phan2013recognizing, title={Recognizing text with perspective distortion in natural scenes}, author={Phan, Trung Quy and Shivakumara, Palaiahnakote and Tian, Shangxuan and Tan, Chew Lim}, booktitle={Proceedings of the IEEE International Conference on Computer Vision}, pages={569--576}, year={2013}}
SynthText in the Wild Dataset¶
“Synthetic Data for Text Localisation in Natural Images”, CVPR, 2016. PDF
A. Basic Info
Official Website: synthtext
Year: 2016
Language: [‘English’]
Scene: [‘Synthesis’]
Annotation Granularity: [‘Word’, ‘Character’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: Synthext Custom
B. Annotation Format
Text Detection/Recognition/Spotting
{
"imnames": [['8/ballet_106_0.jpg', ...]],
"wordBB": [[[420.58957 418.85016 448.08478 410.3094 117.745026
322.30963 322.6857 159.09138 154.27284 260.14597
431.9315 427.52274 296.86508 99.56819 108.96211 ]
[512.3321 431.88342 519.4515 499.81183 179.0544
377.97382 376.4993 203.64464 193.77492 313.61514
487.58023 484.64633 365.83176 142.49403 144.90457 ]
[511.92203 428.7077 518.7375 499.0373 172.1684
378.35858 377.2078 203.3191 193.0739 319.69186
485.6758 482.571 365.76303 142.31898 144.43858 ]
[420.1795 415.67444 447.3708 409.53485 110.859024
322.6944 323.3942 158.76585 153.57182 266.2227
430.02707 425.44742 296.79636 99.39314 108.49613 ]]
[[ 21.06382 46.19922 47.570374 73.95366 197.17792
9.993624 48.437763 9.064571 49.659035 208.57095
118.41646 162.82489 29.548729 5.800581 28.812992 ]
[ 23.069519 48.254295 50.130234 77.18146 208.71487
8.999153 46.69632 9.698633 50.869553 203.25742
122.64043 168.38647 29.660484 6.2558594 29.602367 ]
[ 41.827087 68.39458 70.03627 98.65903 245.30832
30.534437 68.589294 32.57161 73.74529 264.40634
147.7303 189.70224 72.08 22.759935 50.81941 ]
[ 39.82139 66.3395 67.47641 95.43123 233.77136
31.528908 70.33074 31.937548 72.534775 269.71988
143.50633 184.14066 71.96825 22.304657 50.030033 ]], ...],
"charBB": [[[423.16126397 439.60847343 450.66887979 466.31976402 479.76190495
504.59927448 418.80489444 450.13965942 464.16775197 480.46891089
502.46437709 413.02373632 433.01396211 446.7222192 470.28467827
482.51674486 116.52285438 139.51408587 150.7448586 162.03366629
322.84717946 333.54881536 343.28386485 363.07416389 323.48968759
337.98503283 356.66355903 160.48517048 174.1707753 189.64454066
155.7637383 167.45490471 179.63644201 262.2183876 271.75848874
284.05396524 298.26103738 432.8464733 449.15387392 468.07231897
428.11482147 445.61538159 469.24565878 296.86441324 323.6603118
344.09880401 101.14677814 110.45423597 120.54555495 131.18342618
132.20545124 110.01673682 120.83144568 131.35885673]
[438.2997574 452.61288403 466.31976402 482.22585715 498.3934528
512.20555863 431.88338084 466.11639619 481.73414937 499.62012025
519.36789779 432.51717267 449.23571387 465.73425964 484.45139112
499.59056304 140.27413679 149.59811175 160.13352083 169.59504507
333.55849014 344.33923741 361.08275796 378.09844418 339.92898685
355.57692063 376.51230484 174.1707753 189.07871028 203.64462646
165.22739457 181.27572412 193.60260894 270.99557614 283.13281739
298.75499435 313.61511672 447.1421735 470.27065563 487.02126631
446.97485257 468.98979567 484.64633864 317.88691577 341.16094163
365.8300006 111.15280603 120.54555495 130.72086821 135.27663717
142.4726875 120.1331955 133.07976304 144.75919258]
[435.54895424 449.95797159 464.5848793 480.68235876 497.04793842
511.1101386 428.95660757 463.61882066 480.14247127 498.2535215
518.03243928 429.36600266 447.19056345 463.89483785 482.21016814
498.18529977 142.63162835 152.55587851 162.80539142 172.21885945
333.35620309 344.09880401 360.86201193 377.82379299 339.7646859
355.37508239 376.1110999 172.46032372 187.37816388 201.39094518
163.04321987 178.99078221 191.89681939 275.3073355 286.08373072
301.85539131 318.57227103 444.54207279 467.53925436 485.27070558
444.57367155 466.90671029 482.56302723 317.62908407 340.9131681
365.44465854 109.40501176 119.4999228 129.67892444 134.35253232
140.97421069 118.61779828 131.34019115 143.25688164]
[420.17946701 436.74150236 448.74896556 464.5848793 478.18853922
503.4152019 415.67442461 447.3707845 462.35927516 478.8614766
500.86810735 409.54560397 430.77026495 444.64606264 467.79077782
480.89051912 119.14629674 142.63162835 153.56593297 164.78799774
322.69436747 333.35620309 343.11884239 362.84714115 323.37931952
337.83763574 356.35573621 158.76583616 172.46032372 187.37816388
153.57183805 165.15781218 177.92125239 266.22269514 274.45156305
286.82608962 302.69695881 430.02705241 446.01814255 466.05208347
425.44741792 443.19481667 466.90671029 296.79634428 323.49707084
343.82488703 99.39315359 109.40501176 119.4999228 130.25798537
130.70149005 108.49612777 119.08444238 129.84935461]]
[[ 22.26958901 21.60559248 27.0241972 27.25747678 27.45783459
28.73896576 47.91255579 47.80732383 53.77711568 54.24219042
52.00169325 74.79043429 80.45929285 81.04748707 76.11658669
82.58335942 203.67278213 201.2743445 205.59358622 205.51198143
10.06536976 10.82312635 16.77203865 16.31842372 54.80444433
54.66492 47.33822371 15.08534083 15.18716407 9.62607092
51.06813224 50.18928243 56.16019366 220.78902143 236.08062638
231.69267533 209.73652786 124.25352842 119.99631725 128.73732717
165.78411123 167.31764153 167.05531699 29.97351822 31.5116502
31.14650552 5.88513488 12.51324147 12.57920537 8.21515307
8.21998849 35.66412031 29.17945741 36.00660903]
[ 22.46075572 21.76391911 27.25747678 27.49456029 27.73554156
28.85582217 48.25428361 48.21714995 54.27828788 54.78857757
52.4595556 75.57743634 81.15533616 81.86325615 76.681392
83.31596322 210.04771309 203.83983042 208.00417391 207.41791524
9.79265706 10.55231862 16.36406888 15.97405105 54.64620856
54.49559004 47.09756263 15.18716407 15.29808166 9.69862498
51.27597632 50.48652154 56.49239954 216.92183074 232.02141018
226.44624213 203.25738931 125.19349641 121.32658508 130.00428964
167.43676857 169.36588297 168.38645076 29.58279603 31.19899202
30.75826599 5.92344996 12.57920537 12.64571832 8.23451892
8.26856497 35.82646468 29.342662 36.22165159]
[ 40.15739982 40.47241401 40.79219178 41.14411963 41.50190876
41.80934074 66.81590976 68.05921213 68.6519006 69.30152766
70.01097963 96.14641662 96.04484417 96.89110144 97.81897661
98.62829468 237.26055111 240.35280825 243.54641271 245.04022528
31.33842788 31.14650552 30.84702178 30.54399042 69.80098672
68.7212013 68.62479627 32.13243303 32.34474067 32.54416771
72.82501686 73.31372392 73.70922459 267.74318222 265.39839711
259.52741156 253.14023308 144.60810334 145.23371653 147.69958337
186.00278322 188.17713786 189.70144388 71.89351759 53.62266986
54.40060855 22.41084398 22.51791234 22.62587258 17.11356079
22.74567232 50.25232032 46.05692507 50.79345235]
[ 39.82138755 40.18347166 40.44598236 40.79219178 41.08959901
41.64111176 66.33948982 67.47640971 68.01403337 68.60595247
69.3953105 95.13188979 95.21297344 95.91593691 97.08847413
97.75212171 229.94285119 237.26055111 240.66752705 242.74145162
31.52890731 31.33842788 31.16401306 30.81155638 69.87135926
68.80273568 68.71664209 31.93753588 32.13243303 32.34474067
72.53476992 72.88981775 73.28094858 269.71986636 267.92938572
262.93698624 256.88902439 143.50635029 143.61251781 146.24080653
184.14064261 185.86853729 188.17713786 71.96823746 53.79651809
54.60870874 22.30465649 22.41084398 22.51791234 17.07939535
22.63671808 50.03002471 45.81009198 50.49899163]], ...],
"txt": [['Lines:\nI lost\nKevin ' 'will ' 'line\nand '
'and\nthe ' '(and ' 'the\nout '
'you ' "don't\n pkg "], ...]
}
C. Reference
@InProceedings{Gupta16, author = "Ankush Gupta and Andrea Vedaldi and Andrew Zisserman", title = "Synthetic Data for Text Localisation in Natural Images", booktitle = "IEEE Conference on Computer Vision and Pattern Recognition", year = "2016", }
Text OCR¶
“TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text”, CVPR, 2021. PDF
A. Basic Info
Official Website: textocr
Year: 2021
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: CC BY 4.0
B. Annotation Format
Text Detection/Recognition/Spotting
{
"imgs": {
"OpenImages_ImageID_1": {
"id": "OpenImages_ImageID_1",
"width": "INT, Width of the image",
"height": "INT, Height of the image",
"set": "Split train|val|test",
"filename": "train|test/OpenImages_ImageID_1.jpg"
},
"OpenImages_ImageID_2": {
"...": "..."
}
},
"anns": {
"OpenImages_ImageID_1_1": {
"id": "STR, OpenImages_ImageID_1_1, Specifies the nth annotation for an image",
"image_id": "OpenImages_ImageID_1",
"bbox": [
"FLOAT x1",
"FLOAT y1",
"FLOAT x2",
"FLOAT y2"
],
"points": [
"FLOAT x1",
"FLOAT y1",
"FLOAT x2",
"FLOAT y2",
"...",
"FLOAT xN",
"FLOAT yN"
],
"utf8_string": "text for this annotation",
"area": "FLOAT, area of this box"
},
"OpenImages_ImageID_1_2": {
"...": "..."
},
"OpenImages_ImageID_2_1": {
"...": "..."
}
},
"img2Anns": {
"OpenImages_ImageID_1": [
"OpenImages_ImageID_1_1",
"OpenImages_ImageID_1_2",
"OpenImages_ImageID_1_2"
],
"OpenImages_ImageID_N": [
"..."
]
}
}
C. Reference
@inproceedings{singh2021textocr, title={{TextOCR}: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text}, author={Singh, Amanpreet and Pang, Guan and Toh, Mandy and Huang, Jing and Galuba, Wojciech and Hassner, Tal}, journal={The Conference on Computer Vision and Pattern Recognition}, year={2021}}
Total Text¶
“Total-Text: Towards Orientation Robustness in Scene Text Detection”, IJDAR, 2020. PDF
A. Basic Info
Official Website: totaltext
Year: 2020
Language: [‘English’]
Scene: [‘Natural Scene’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘textdet’, ‘textrecog’, ‘textspotting’]
License: BSD-3
B. Annotation Format
Text Detection/Spotting
x: [[259 313 389 427 354 302]], y: [[542 462 417 459 507 582]], ornt: [u'c'], transcriptions: [u'PAUL']
x: [[400 478 494 436]], y: [[398 380 448 465]], ornt: [u'#'], transcriptions: [u'#']
C. Reference
@article{CK2019, author = {Chee Kheng Chng and Chee Seng Chan and Chenglin Liu}, title = {Total-Text: Towards Orientation Robustness in Scene Text Detection}, journal = {International Journal on Document Analysis and Recognition (IJDAR)}, volume = {23}, pages = {31-52}, year = {2020}, doi = {10.1007/s10032-019-00334-z}}
WildReceipt¶
“Spatial Dual-Modality Graph Reasoning for Key Information Extraction”, arXiv, 2021. PDF
A. Basic Info
Official Website: wildreceipt
Year: 2021
Language: [‘English’]
Scene: [‘Receipt’]
Annotation Granularity: [‘Word’]
Supported Tasks: [‘kie’, ‘textdet’, ‘textrecog’, ‘textspotting’]
License: N/A
B. Annotation Format
KIE
// Close Set
{
"file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
"height": 1200,
"width": 1600,
"annotations":
[
{
"box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
"text": "SAFEWAY",
"label": 1
},
{
"box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
"text": "TM",
"label": 25
}
], //...
}
// Open Set
{
"file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
"height": 348,
"width": 348,
"annotations":
[
{
"box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
"text": "CHOEUN",
"label": 2,
"edge": 1
},
{
"box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
"text": "KOREANRESTAURANT",
"label": 2,
"edge": 1
}
]
}
C. Reference
@article{sun2021spatial, title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, author={Sun, Hongbin and Kuang, Zhanghui and Yue, Xiaoyu and Lin, Chenhao and Zhang, Wayne}, journal={arXiv preprint arXiv:2103.14470}, year={2021} }
Dataset Preparer (Beta)¶
Note
Dataset Preparer is still in beta version and might not be stable enough. You are welcome to try it out and report any issues to us.
One-click data preparation script¶
MMOCR provides a unified one-stop data preparation script prepare_dataset.py
.
Only one line of command is needed to complete the data download, decompression, format conversion, and base configuration generation.
python tools/dataset_converters/prepare_dataset.py [-h] [--nproc NPROC] [--task {textdet,textrecog,textspotting,kie}] [--splits SPLITS [SPLITS ...]] [--lmdb] [--overwrite-cfg] [--dataset-zoo-path DATASET_ZOO_PATH] datasets [datasets ...]
ARGS | Type | Description |
---|---|---|
datasets | str | (required) One or more dataset names. |
--nproc | int | Number of processes to be used. Defaults to 4. |
--task | str | Convert the dataset to the format of a specified task supported by MMOCR. options are: 'textdet', 'textrecog', 'textspotting', and 'kie'. |
--splits | str | Splits of the dataset to be prepared. Multiple splits can be accepted. Defaults to train val test . |
--lmdb | str | Store the data in LMDB format. Only valid when the task is textrecog . |
--overwrite-cfg | str | Whether to overwrite the dataset config file if it already exists in configs/{task}/_base_/datasets . |
--dataset-zoo-path | str | Path to the dataset config file. If not specified, the default path is ./dataset_zoo . |
For example, the following command shows how to use the script to prepare the ICDAR2015 dataset for text detection task.
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet --overwrite-cfg
Also, the script supports preparing multiple datasets at the same time. For example, the following command shows how to prepare the ICDAR2015 and TotalText datasets for text recognition task.
python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog --overwrite-cfg
To check the datasets supported by Dataset Preparer, please refer to Dataset Zoo. Some other datasets that need to be prepared manually are listed in Text Detection and Text Recognition.
For users in China, more datasets can be downloaded from the opensource dataset platform: OpenDataLab. After downloading the data, you can place the files listed in data_obtainer.save_name
in data/cache
and rerun the script.
Advanced Usage¶
LMDB Format¶
In text recognition tasks, we usually use LMDB format to store data to speed up data loading. When using the prepare_dataset.py
script to prepare data, you can store data to the LMDB format by the --lmdb
parameter. For example:
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb
As soon as the dataset is prepared, Dataset Preparer will generate icdar2015_lmdb.py
in the configs/textrecog/_base_/datasets/
directory. You can inherit this file and point the dataloader
to the LMDB dataset. Moreover, the LMDB dataset needs to be loaded by LoadImageFromNDArray
, thus you also need to modify pipeline
.
For example, if we want to change the training set of configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py
to icdar2015 generated before, we need to perform the following modifications:
Modify configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py:

_base_ = [
    '../_base_/datasets/icdar2015_lmdb.py',  # point to icdar2015 lmdb dataset
    ...
]

train_list = [_base_.icdar2015_lmdb_textrecog_train]
...
Modify train_pipeline in configs/textrecog/crnn/_base_crnn_mini-vgg.py, changing LoadImageFromFile to LoadImageFromNDArray:

train_pipeline = [
    dict(
        type='LoadImageFromNDArray',
        color_type='grayscale',
        file_client_args=file_client_args,
        ignore_empty=True,
        min_size=2),
    ...
]
Design¶
There are many OCR datasets with different languages, annotation formats, and scenarios. There are generally two ways to use these datasets: quickly understanding the relevant information about a dataset, or using it to train models. To meet these two usage scenarios, MMOCR provides automatic dataset preparation scripts. The scripts use a modular design, which greatly enhances scalability and allows users to easily configure other public or private datasets.

The configuration files for the dataset preparation scripts are uniformly stored in the dataset_zoo/ directory. Users can find all the configuration files of the datasets officially supported by MMOCR in this directory. The directory structure of this folder is as follows:
dataset_zoo/
├── icdar2015
│ ├── metafile.yml
│ ├── sample_anno.md
│ ├── textdet.py
│ ├── textrecog.py
│ └── textspotting.py
└── wildreceipt
├── metafile.yml
├── sample_anno.md
├── kie.py
├── textdet.py
├── textrecog.py
└── textspotting.py
Dataset Usage¶
After decades of development, the OCR field has seen a series of related datasets emerge, often providing text annotation files in various styles, making it necessary for users to perform format conversion when using these datasets. Therefore, to facilitate dataset preparation for users, we have designed the Dataset Preparer to help users quickly prepare datasets in the format supported by MMOCR. For details, please refer to the Dataset Format document. The following figure shows a typical workflow for running the Dataset Preparer.
The figure shows that when running the Dataset Preparer, the following operations will be performed in sequence:
For each of the training, validation, and test sets, the preparer will perform:
Dataset download, extraction, and movement (Obtainer)
Matching images with annotations (Gatherer)
Parsing the original annotations (Parser)
Packing the annotations into MMOCR's unified format (Packer)
Saving the packed annotations (Dumper)
After all splits have been processed, the preparer will additionally:
Delete unnecessary files (Delete)
Generate the configuration file for the dataset (Config Generator)
To handle various types of datasets, MMOCR has designed each component as a plug-and-play module, and allows users to configure the dataset preparation process through configuration files located in dataset_zoo/
. These configuration files are in Python format and can be used in the same way as other configuration files in MMOCR, as described in the Configuration File documentation.
In dataset_zoo/
, each dataset has its own folder, and the configuration files are named after the task to distinguish different configurations under different tasks. Taking the text detection part of ICDAR2015 as an example, the sample configuration file dataset_zoo/icdar2015/textdet.py
is shown below:
data_root = 'data/icdar2015'
cache_path = 'data/cache'
train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
save_name='ic15_textdet_train_img.zip',
md5='c51cbace155dcc4d98c8dd19d378f30d',
content=['image'],
mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'ch4_training_localization_transcription_gt.zip',
save_name='ic15_textdet_train_gt.zip',
md5='3bfaf1988960909014f7987d2343060b',
content=['annotation'],
mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
test_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_test_images.zip',
save_name='ic15_textdet_test_img.zip',
md5='97e4c1ddcf074ffcc75feff2b63c35dd',
content=['image'],
mapping=[['ic15_textdet_test_img', 'textdet_imgs/test']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge4_Test_Task4_GT.zip',
save_name='ic15_textdet_test_gt.zip',
md5='8bce173b06d164b98c357b0eb96ef430',
content=['annotation'],
mapping=[['ic15_textdet_test_gt', 'annotations/test']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
delete = ['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img']
config_generator = dict(type='TextDetConfigGenerator')
Dataset download, extraction, and movement (Obtainer)¶
The obtainer
module in Dataset Preparer is responsible for downloading, extracting, and moving the dataset. Currently, MMOCR only provides the NaiveDataObtainer
. Generally speaking, the built-in NaiveDataObtainer
is sufficient for downloading most datasets that can be accessed through direct links, and supports operations such as extraction, moving files, and renaming. However, MMOCR currently does not support automatically downloading datasets stored in resources that require login, such as Baidu or Google Drive. Here is a brief introduction to the NaiveDataObtainer
.
Field Name | Meaning |
---|---|
cache_path | Dataset cache path, used to store the compressed files downloaded during dataset preparation |
data_root | Root directory where the dataset is stored |
files | Dataset file list, used to describe the download information of the dataset |
The files
field is a list, and each element in the list is a dictionary used to describe the download information of a dataset file. The table below shows the meaning of each field:
Field Name | Meaning |
---|---|
url | Download link for the dataset file |
save_name | Name used to save the dataset file |
md5 (optional) | MD5 hash of the dataset file, used to check if the downloaded file is complete |
split (optional) | Dataset split the file belongs to, such as train , test , etc., this field can be omitted |
content (optional) | Content of the dataset file, such as image , annotation , etc., this field can be omitted |
mapping (optional) | Decompression mapping of the dataset file, used to specify the storage location of the file after decompression, this field can be omitted |
The Dataset Preparer follows the following conventions:
Images of each dataset are moved to the corresponding {taskname}_imgs/{split}/ folder, such as textdet_imgs/train/.

For an annotation file containing the annotation information of all images, the annotations are moved to the annotations/{split}.* file, such as annotations/train.json.

For annotation files each containing the annotation information of one image, all annotation files are moved to the annotations/{split}/ folder, such as annotations/train/.

For some other special cases, such as all training, testing, and validation images being in one folder, the images can be moved to a self-set folder, such as {taskname}_imgs/imgs/, and the image storage location should be specified in the subsequent gatherer module.
An example configuration is as follows:
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
save_name='ic15_textdet_train_img.zip',
md5='c51cbace155dcc4d98c8dd19d378f30d',
content=['image'],
mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'ch4_training_localization_transcription_gt.zip',
save_name='ic15_textdet_train_gt.zip',
md5='3bfaf1988960909014f7987d2343060b',
content=['annotation'],
mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
]),
Dataset collection (Gatherer)¶
The gatherer
module traverses the files in the dataset directory, matches image files with their corresponding annotation files, and organizes a file list for the parser
module to read. Therefore, it is necessary to know the matching rules between image files and annotation files in the current dataset. There are two commonly used annotation storage formats for OCR datasets: one is multiple annotation files corresponding to multiple images, and the other is a single annotation file corresponding to multiple images, for example:
Many-to-Many
├── {taskname}_imgs/{split}/img_1.jpg
├── annotations/{split}/gt_img_1.txt
├── {taskname}_imgs/{split}/img_2.jpg
├── annotations/{split}/gt_img_2.txt
├── {taskname}_imgs/{split}/img_3.JPG
├── annotations/{split}/gt_img_3.txt
One-to-Many
├── {taskname}/{split}/img_1.jpg
├── {taskname}/{split}/img_2.jpg
├── {taskname}/{split}/img_3.JPG
├── annotations/gt.txt
Specific design is as follows:
MMOCR has built-in PairGatherer and MonoGatherer to handle the two common cases mentioned above: PairGatherer is used for many-to-many situations, while MonoGatherer is used for one-to-many situations.
Note
To simplify processing, the gatherer assumes that the dataset’s images and annotations are stored separately in {taskname}_imgs/{split}/ and annotations/, respectively. In particular, for many-to-many situations, the annotation files need to be placed in annotations/{split}.
In the many-to-many case, PairGatherer needs to find the image files and their corresponding annotation files according to a certain naming convention. First, the image suffixes must be specified through the img_suffixes parameter, as in the example above: img_suffixes=['.jpg', '.JPG']. In addition, a pair of regular expressions, rule, specifies the correspondence between image and annotation file names, for example rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']. The first regular expression matches the image file name: \d+ matches the image index and ([jJ][pP][gG]) matches the image suffix. The second regular expression builds the annotation file name, where \1 carries the matched image index over to the annotation file name. An example configuration is shown next, followed by a small standalone demonstration of the rule:
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
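The following minimal Python snippet, independent of MMOCR, illustrates how this rule pair maps an image file name to its annotation file name; the file names are taken from the directory example above:
import re

rule = [r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']

for img_name in ['img_1.jpg', 'img_2.jpg', 'img_3.JPG']:
    # substitute the matched pattern to derive the annotation file name
    ann_name = re.sub(rule[0], rule[1], img_name)
    print(img_name, '->', ann_name)
# img_1.jpg -> gt_img_1.txt
# img_2.jpg -> gt_img_2.txt
# img_3.JPG -> gt_img_3.txt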
The one-to-many case is usually simpler: the user only needs to specify the annotation file name. For example, for the training set configuration:
gatherer=dict(type='MonoGatherer', ann_name='train.txt'),
MMOCR has also made conventions on the return value of Gatherer. Gatherer returns a tuple with two elements: the first element is either a list containing all image paths or the folder containing all images; the second element is either a list containing all annotation file paths or the path to a single annotation file that contains the annotations of all images. Specifically, the return value of PairGatherer is (list of image paths, list of annotation file paths), as shown below:
(['{taskname}_imgs/{split}/img_1.jpg', '{taskname}_imgs/{split}/img_2.jpg', '{taskname}_imgs/{split}/img_3.JPG'],
['annotations/{split}/gt_img_1.txt', 'annotations/{split}/gt_img_2.txt', 'annotations/{split}/gt_img_3.txt'])
MonoGatherer
returns a tuple containing the path to the image directory and the path to the annotation file, as follows:
('{taskname}/{split}', 'annotations/gt.txt')
Dataset parsing (Parser)¶
Parser is mainly used to parse the original annotation files. Since original annotation formats vary greatly, MMOCR provides BaseParser as a base class, which users can inherit to implement their own Parser. In BaseParser, MMOCR has designed two interfaces, parse_files and parse_file, where the annotation parsing is conventionally carried out. For the two different input situations of Gatherer (many-to-many and one-to-many), the implementations of these two interfaces should differ.
BaseParser handles the many-to-many situation by default: parse_files distributes the data in parallel to multiple parse_file processes, and each parse_file parses the annotations of a single image. For the one-to-many situation, the user needs to override parse_files to load the annotation file and return standardized results.
The interface of BaseParser is defined as follows:
from abc import abstractmethod
from typing import List, Tuple, Union

# track_parallel_progress_multi_args is a helper shipped with MMOCR's utilities
from mmocr.utils import track_parallel_progress_multi_args


class BaseParser:

    def __call__(self, img_paths, ann_paths):
        return self.parse_files(img_paths, ann_paths)

    def parse_files(self, img_paths: Union[List[str], str],
                    ann_paths: Union[List[str], str]) -> List[Tuple]:
        # dispatch (image, annotation) pairs to parse_file in parallel
        samples = track_parallel_progress_multi_args(
            self.parse_file, (img_paths, ann_paths), nproc=self.nproc)
        return samples

    @abstractmethod
    def parse_file(self, img_path: str, ann_path: str) -> Tuple:
        raise NotImplementedError
In order to ensure the uniformity of subsequent modules, MMOCR has made conventions for the return values of parse_files and parse_file. The return value of parse_file is a tuple: the first element is the image path, and the second element is the annotation information. The annotation information is a list, each element of which is a dictionary with the fields poly, text, and ignore, as shown below:
# An example of returned values:
(
'imgs/train/xxx.jpg',
[
dict(
poly=[0, 1, 1, 1, 1, 0, 0, 0],
text='hello',
ignore=False),
...
]
)
The output of parse_files is a list, and each element in the list is the return value of parse_file. An example is:
[
(
'imgs/train/xxx.jpg',
[
dict(
poly=[0, 1, 1, 1, 1, 0, 0, 0],
text='hello',
ignore=False),
...
]
),
...
]
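As a sketch of how a custom parser might follow these conventions, the hypothetical parser below handles a simple 'x1,y1,x2,y2,text' annotation format in the many-to-many case. It is only an illustration under the assumptions that BaseParser is importable from the parsers package mentioned later in this document and that the DATA_PARSERS registry is available; it is not a parser shipped with MMOCR:
from typing import Tuple

from mmocr.datasets.preparers.parsers import BaseParser  # assumed import path
from mmocr.registry import DATA_PARSERS  # assumed registry name


@DATA_PARSERS.register_module()
class ToyTxtParser(BaseParser):
    """Hypothetical parser for annotation lines formatted as ``x1,y1,x2,y2,text``."""

    def parse_file(self, img_path: str, ann_path: str) -> Tuple:
        instances = []
        with open(ann_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                x1, y1, x2, y2, text = line.split(',', maxsplit=4)
                x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
                instances.append(
                    dict(
                        # expand the box into a clockwise 4-point polygon
                        poly=[x1, y1, x2, y1, x2, y2, x1, y2],
                        text=text,
                        ignore=False))
        return img_path, instances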
Dataset Conversion (Packer)¶
Packer is mainly used to convert data into a unified annotation format. Because the input data is the output of the parser and its format is already fixed, the packer only needs to convert this fixed format into the unified annotation format of each task. Currently, MMOCR supports tasks such as text detection, text recognition, end-to-end OCR, and key information extraction, and provides a corresponding packer for each of them. For text detection, end-to-end OCR, and key information extraction, MMOCR has a single corresponding Packer. For text recognition, however, MMOCR provides two options, TextRecogPacker and TextRecogCropPacker, because there are two types of recognition datasets:
The dataset stores cropped text images: each image is one recognition sample, and the annotation returned by the parser is only a dict(text='xxx'). In this case, TextRecogPacker can be used.

The dataset does not crop the text from the images; it essentially carries end-to-end OCR annotations that contain both the position and the transcription of each text instance. TextRecogCropPacker will crop the text instances from the images and then convert them into the unified text recognition format (see the config sketch after this list).
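In config terms, the choice comes down to which packer type is referenced. Both lines below are minimal sketches using the packer names above:
# dataset already stores cropped text images; annotations are plain transcriptions
packer=dict(type='TextRecogPacker')

# dataset provides full images with localization; text is cropped during packing
packer=dict(type='TextRecogCropPacker')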
Annotation Saving (Dumper)¶
The dumper module determines the format in which the data is saved. Currently, MMOCR supports JsonDumper, WildreceiptOpensetDumper, and TextRecogLMDBDumper. They save data in the standard MMOCR JSON format, the Wildreceipt format, and the LMDB format commonly used in the academic community for text recognition, respectively.
Delete files (Delete)¶
When processing a dataset, temporary files that are no longer needed may be generated. A list of such files or folders can be passed in here, and they will be deleted once the conversion is finished, as in the sketch below.
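A minimal sketch of how this might appear in a preparer config, assuming the field is named delete after the section title; the entries are illustrative names of intermediate files, not a fixed requirement:
# illustrative names only -- list whatever intermediate files/folders
# the preparation steps leave behind under the dataset root
delete=['annotations', 'ic15_textdet_train_img']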
Generate the configuration file for the dataset (ConfigGenerator)¶
In order to automatically generate basic configuration files after preparing the dataset, MMOCR has implemented TextDetConfigGenerator
, TextRecogConfigGenerator
, and TextSpottingConfigGenerator
for each task. The main parameters supported by these generators are as follows:
Field Name | Meaning |
---|---|
data_root | Root directory where the dataset is stored. |
train_anns | Path to the training set annotations in the configuration file. If not specified, it defaults to [dict(ann_file='{taskname}_train.json', dataset_postfix='')]. |
val_anns | Path to the validation set annotations in the configuration file. If not specified, it defaults to an empty string. |
test_anns | Path to the test set annotations in the configuration file. If not specified, it defaults to [dict(ann_file='{taskname}_test.json', dataset_postfix='')]. |
config_path | Path to the directory where the configuration files for the algorithm are stored. The configuration generator will write the default configuration to {config_path}/{taskname}/_base_/datasets/{dataset_name}.py . If not specified, it defaults to configs/ . |
After preparing all the files for the dataset, the configuration generator will automatically generate the basic configuration files required to call the dataset. Below is a minimal example of a TextDetConfigGenerator
configuration:
config_generator = dict(type='TextDetConfigGenerator')
The generated file will be placed by default under configs/{task}/_base_/datasets/
. In this example, the basic configuration file for the ICDAR 2015 dataset will be generated at configs/textdet/_base_/datasets/icdar2015.py
.
icdar2015_textdet_data_root = 'data/icdar2015'
icdar2015_textdet_train = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root,
ann_file='textdet_train.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=None)
icdar2015_textdet_test = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root,
ann_file='textdet_test.json',
test_mode=True,
pipeline=None)
If the dataset is special and there are several variants of the annotations, the configuration generator also supports generating variables pointing to each variant in the base configuration. However, this requires users to differentiate them by using different dataset_postfix
when setting up. For example, the ICDAR 2015 text recognition dataset has two annotation versions for the test set, the original version and the 1811 version, which can be specified in test_anns
as follows:
config_generator = dict(
type='TextRecogConfigGenerator',
test_anns=[
dict(ann_file='textrecog_test.json'),
dict(dataset_postfix='1811', ann_file='textrecog_test_1811.json')
])
The configuration generator will generate the following configurations:
icdar2015_textrecog_data_root = 'data/icdar2015'
icdar2015_textrecog_train = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_train.json',
pipeline=None)
icdar2015_textrecog_test = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_test.json',
test_mode=True,
pipeline=None)
icdar2015_1811_textrecog_test = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_test_1811.json',
test_mode=True,
pipeline=None)
With this file, MMOCR can directly import this dataset into the dataloader
from the model configuration file (the following sample is excerpted from configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
):
_base_ = [
'../_base_/datasets/icdar2015.py',
# ...
]
# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_test = _base_.icdar2015_textdet_test
# ...
train_dataloader = dict(
dataset=icdar2015_textdet_train)
val_dataloader = dict(
dataset=icdar2015_textdet_test)
test_dataloader = val_dataloader
Note
By default, the configuration generator does not overwrite existing base configuration files unless the user manually specifies --overwrite-cfg when running the script.
Adding a new dataset to Dataset Preparer¶
Adding Public Datasets¶
MMOCR has already supported many commonly used public datasets. If the dataset you want to use has not been supported yet and you are willing to contribute to the MMOCR open-source community, you can follow the steps below to add a new dataset.
In the following example, we will show you how to add the ICDAR2013 dataset step by step.
Adding metafile.yml
¶
First, make sure that the dataset you want to add does not already exist in dataset_zoo/
. Then, create a new folder named after the dataset you want to add, such as icdar2013/
(usually, use lowercase alphanumeric characters without symbols to name the dataset). In the icdar2013/
folder, create a metafile.yml
file and fill in the basic information of the dataset according to the following template:
Name: 'Incidental Scene Text IC13'
Paper:
Title: ICDAR 2013 Robust Reading Competition
URL: https://www.imlab.jp/publication_data/1352/icdar_competition_report.pdf
Venue: ICDAR
Year: '2013'
BibTeX: '@inproceedings{karatzas2013icdar,
title={ICDAR 2013 robust reading competition},
author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere},
booktitle={2013 12th international conference on document analysis and recognition},
pages={1484--1493},
year={2013},
organization={IEEE}}'
Data:
Website: https://rrc.cvc.uab.es/?ch=2
Language:
- English
Scene:
- Natural Scene
Granularity:
- Word
Tasks:
- textdet
- textrecog
- textspotting
License:
Type: N/A
Link: N/A
Format: .txt
Keywords:
- Horizontal
Add Annotation Examples¶
Finally, you can add an annotation example file sample_anno.md
under the dataset_zoo/icdar2013/
directory to help the documentation script add annotation examples when generating documentation. The annotation example file is a Markdown file that typically contains the raw data format of a single sample. For example, the following code block shows a sample data file for the ICDAR2013 dataset:
**Text Detection**
```text
# train split
# x1 y1 x2 y2 "transcript"
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
# test split
# x1, y1, x2, y2, "transcript"
38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
```
Add configuration files for corresponding tasks¶
In the dataset_zoo/icdar2013
directory, add a .py
configuration file named after the task. For example, textdet.py
, textrecog.py
, textspotting.py
, kie.py
, etc. The configuration template is shown below:
data_root = ''
cache_path = 'data/cache'

train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
        ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'),  # Packer for the task
    dumper=dict(type='JsonDumper'),
)

test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
        ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'),  # Packer for the task
    dumper=dict(type='JsonDumper'),
)
Taking the text detection task as an example, let’s introduce the specific content of the configuration file. In general, users do not need to implement a new obtainer, gatherer, packer, or dumper, but usually need to implement a new parser according to the annotation format of the dataset.

Regarding the configuration of obtainer, we will not go into detail here; please refer to Dataset download, extraction, and movement (Obtainer).
For the gatherer
, by observing the obtained ICDAR2013 dataset files, we found that each image has a corresponding .txt
format annotation file:
data_root
├── textdet_imgs/train/
│ ├── img_1.jpg
│ ├── img_2.jpg
│ └── ...
├── annotations/train/
│ ├── gt_img_1.txt
│ ├── gt_img_2.txt
│ └── ...
Moreover, the name of each annotation file corresponds to the image: gt_img_1.txt
corresponds to img_1.jpg
, and so on. Therefore, PairGatherer
can be used to match them.
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg'],
rule=[r'(\w+)\.jpg', r'gt_\1.txt'])
The first regular expression in the rule matches the image file name, and the second builds the annotation file name. Here, (\w+) matches the image file name without its suffix, and gt_\1.txt constructs the annotation file name, where \1 represents the content matched by the first group. That is, img_xx.jpg is mapped to gt_img_xx.txt.
Next, you need to implement a parser
to parse the original annotation files into a standard format. Usually, before adding a new dataset, users can browse the details page of the supported datasets and check if there is a dataset with the same format. If there is, you can use the parser of that dataset directly. Otherwise, you need to implement a new format parser.
Data format parsers are stored in the mmocr/datasets/preparers/parsers
directory. All parsers need to inherit from BaseParser
and implement the parse_file
or parse_files
method. For more information, please refer to Parsing original annotations (Parser).
By observing the annotation files of the ICDAR2013 dataset:
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
542, 710, 938, 841, "break"
87, 884, 457, 1021, "could"
517, 919, 831, 1024, "save"
We found that the built-in ICDARTxtTextDetAnnParser
already meets the requirements, so we can directly use this parser and configure it in the preparer
.
parser=dict(
type='ICDARTxtTextDetAnnParser',
remove_strs=[',', '"'],
encoding='utf-8',
format='x1 y1 x2 y2 trans',
separator=' ',
mode='xyxy')
In the configuration for the ICDARTxtTextDetAnnParser, remove_strs=[',', '"'] is specified to remove the extra quotes and commas in the annotation files. The format field, 'x1 y1 x2 y2 trans', indicates that each line in the annotation file contains four coordinates and a transcription separated by spaces (separator=' '). Also, mode is set to 'xyxy', which means the coordinates are those of the top-left and bottom-right corners, so that ICDARTxtTextDetAnnParser can parse the annotations into the unified format.
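For reference, given the first annotation line shown above (158 128 411 181 "Footpath"), the parsed sample would look roughly like the sketch below. The image name is taken from the directory listing earlier, and the two-corner box is expanded into a four-point polygon following the return convention of parse_file:
(
    'textdet_imgs/train/img_1.jpg',
    [
        dict(
            # x1 y1 x2 y2 -> clockwise 4-point polygon
            poly=[158, 128, 411, 128, 411, 181, 158, 181],
            text='Footpath',
            ignore=False),
        # ... one dict per remaining line in gt_img_1.txt
    ]
)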
For the packer, taking the text detection task as an example, its packer is TextDetPacker, and its configuration is as follows:
packer=dict(type='TextDetPacker')
Finally, specify the dumper. The annotations are generally saved in JSON format, with the following configuration:
dumper=dict(type='JsonDumper')
After the above configuration, the configuration file for the ICDAR2013 training set is as follows:
train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge2_Training_Task12_Images.zip',
save_name='ic13_textdet_train_img.zip',
md5='a443b9649fda4229c9bc52751bad08fb',
content=['image'],
mapping=[['ic13_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge2_Training_Task1_GT.zip',
save_name='ic13_textdet_train_gt.zip',
md5='f3a425284a66cd67f455d389c972cce4',
content=['annotation'],
mapping=[['ic13_textdet_train_gt', 'annotations/train']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg'],
rule=[r'(\w+)\.jpg', r'gt_\1.txt']),
parser=dict(
type='ICDARTxtTextDetAnnParser',
remove_strs=[',', '"'],
format='x1 y1 x2 y2 trans',
separator=' ',
mode='xyxy'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
To automatically generate the basic configuration after the dataset is prepared, you also need to configure the corresponding task’s config_generator
.
In this example, since it is a text detection task, you only need to set the generator to TextDetConfigGenerator
.
config_generator = dict(type='TextDetConfigGenerator')
Use DataPreparer to prepare customized dataset¶
[Coming Soon]
Text Detection¶
Note
This page is a manual preparation guide for datasets that are not yet supported by Dataset Preparer, into which all of these scripts will eventually be migrated.
Overview¶
Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
---|---|---|---|---|
ICDAR2011 | homepage | - | - | - |
ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
CurvedSynText150k | homepage, Part1, Part2 | instances_training.json | - | - |
DeText | homepage | - | - | - |
Lecture Video DB | homepage | - | - | - |
LSVT | homepage | - | - | - |
IMGUR | homepage | - | - | - |
KAIST | homepage | - | - | - |
MTWI | homepage | - | - | - |
ReCTS | homepage | - | - | - |
IIIT-ILST | homepage | - | - | - |
VinText | homepage | - | - | - |
BID | homepage | - | - | - |
RCTW | homepage | - | - | - |
HierText | homepage | - | - | - |
ArT | homepage | - | - | - |
Install AWS CLI (optional)¶
Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ./aws/install -i /usr/local/aws-cli -b /usr/local/bin !aws configure # this command will require you to input keys, you can skip them except # for the Default region name # AWS Access Key ID [None]: # AWS Secret Access Key [None]: # Default region name [None]: us-east-1 # Default output format [None]
For users in China, these datasets can also be downloaded from OpenDataLab with high speed.
Important Note¶
Note
For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext dataset, there might be some images containing orientation info in EXIF data. The default OpenCV
backend used in MMCV would read them and apply the rotation on the images. However, their gold annotations are made on the raw pixels, and such
inconsistency results in false examples in the training set. Therefore, users should use dict(type='LoadImageFromFile', color_type='color_ignore_orientation')
in pipelines to change MMCV’s default loading behaviour. (see DBNet’s pipeline config for example)
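Concretely, the override amounts to replacing the image-loading transform at the top of the training and testing pipelines, as in this minimal sketch:
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... keep the remaining transforms of the original pipeline unchanged
]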
ICDAR 2011 (Born-Digital Images)¶
Step1: Download
Challenge1_Training_Task12_Images.zip
,Challenge1_Training_Task1_GT.zip
,Challenge1_Test_Task12_Images.zip
, andChallenge1_Test_Task1_GT.zip
from homepageTask 1.1: Text Localization (2013 edition)
.mkdir icdar2011 && cd icdar2011 mkdir imgs && mkdir annotations # Download ICDAR 2011 wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate # For images unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test # For annotations unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
Step 2: Generate
instances_training.json
andinstances_test.json
with the following command:python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
After running the above codes, the directory structure should be as follows:
│── icdar2011 │ ├── imgs │ ├── instances_test.json │ └── instances_training.json
ICDAR 2017¶
Follow similar steps as ICDAR 2015.
The resulting directory structure looks like the following:
├── icdar2017 │ ├── imgs │ ├── annotations │ ├── instances_training.json │ └── instances_val.json
CurvedSynText150k¶
Step1: Download syntext1.zip and syntext2.zip to
CurvedSynText150k/
.Step2:
unzip -q syntext1.zip mv train.json train1.json unzip images.zip rm images.zip unzip -q syntext2.zip mv train.json train2.json unzip images.zip rm images.zip
Step3: Download instances_training.json to
CurvedSynText150k/
Or, generate
instances_training.json
with following command:python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
The resulting directory structure looks like the following:
├── CurvedSynText150k │ ├── syntext_word_eng │ ├── emcs_imgs │ └── instances_training.json
DeText¶
Step1: Download
ch9_training_images.zip
,ch9_training_localization_transcription_gt.zip
,ch9_validation_images.zip
, andch9_validation_localization_transcription_gt.zip
from Task 3: End to End on the homepage.mkdir detext && cd detext mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val # Download DeText wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate # Extract images and annotations unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val # Remove zips rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
Step2: Generate
instances_training.json
andinstances_val.json
with following command:python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
After running the above codes, the directory structure should be as follows:
│── detext │ ├── annotations │ ├── imgs │ ├── instances_test.json │ └── instances_training.json
Lecture Video DB¶
Step1: Download IIIT-CVid.zip to
lv/
.mkdir lv && cd lv # Download LV dataset wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip unzip -q IIIT-CVid.zip mv IIIT-CVid/Frames imgs rm IIIT-CVid.zip
Step2: Generate
instances_training.json
,instances_val.json
, andinstances_test.json
with following command:python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
The resulting directory structure looks like the following:
│── lv │ ├── imgs │ ├── instances_test.json │ ├── instances_training.json │ └── instances_val.json
LSVT¶
Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to
lsvt/
.mkdir lsvt && cd lsvt # Download LSVT dataset wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json mkdir annotations tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/ mv train_full_images_0 imgs rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
Step2: Generate
instances_training.json
andinstances_val.json
(optional) with the following command:# Annotations of LSVT test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
After running the above codes, the directory structure should be as follows:
|── lsvt │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
IMGUR¶
Step1: Run
download_imgur5k.py
to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.mkdir imgur && cd imgur git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git # Download images from imgur.com. This may take SEVERAL HOURS! python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs # For annotations mkdir annotations mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations rm -rf IMGUR5K-Handwriting-Dataset
Step2: Generate
instances_train.json
,instance_val.json
andinstances_test.json
with the following command:python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
After running the above codes, the directory structure should be as follows:
│── imgur │ ├── annotations │ ├── imgs │ ├── instances_test.json │ ├── instances_training.json │ └── instances_val.json
KAIST¶
Step1: Download KAIST_all.zip to
kaist/
.mkdir kaist && cd kaist mkdir imgs && mkdir annotations # Download KAIST dataset wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip unzip -q KAIST_all.zip rm KAIST_all.zip
Step2: Extract zips:
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
Step3: Generate
instances_training.json
andinstances_val.json
(optional) with following command:# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2 python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
After running the above codes, the directory structure should be as follows:
│── kaist │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
MTWI¶
Step1: Download
mtwi_2018_train.zip
from homepage.mkdir mtwi && cd mtwi unzip -q mtwi_2018_train.zip mv image_train imgs && mv txt_train annotations rm mtwi_2018_train.zip
Step2: Generate
instances_training.json
andinstance_val.json
(optional) with the following command:# Annotations of MTWI test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
After running the above codes, the directory structure should be as follows:
│── mtwi │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
ReCTS¶
Step1: Download ReCTS.zip to
rects/
from the homepage.mkdir rects && cd rects # Download ReCTS dataset # You can also find Google Drive link on the dataset homepage wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate unzip -q ReCTS.zip mv img imgs && mv gt_unicode annotations rm ReCTS.zip && rm -rf gt
Step2: Generate
instances_training.json
andinstances_val.json
(optional) with following command:# Annotations of ReCTS test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
After running the above codes, the directory structure should be as follows:
│── rects │ ├── annotations │ ├── imgs │ ├── instances_val.json (optional) │ └── instances_training.json
ILST¶
Step1: Download
IIIT-ILST
from onedrive.
Step2: Run the following commands
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip cd IIIT-ILST # rename files cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd .. cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd .. cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd .. # transfer image path mkdir imgs && mkdir annotations mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/ mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/ mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/ # remove unnecessary files rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
Step3: Generate
instances_training.json
andinstances_val.json
(optional). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
After running the above codes, the directory structure should be as follows:
│── IIIT-ILST │ ├── annotations │ ├── imgs │ ├── instances_val.json (optional) │ └── instances_training.json
VinText¶
Step1: Download vintext.zip to
vintext
mkdir vintext && cd vintext # Download dataset from google drive wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt # Extract images and annotations unzip -q vintext.zip && rm vintext.zip mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./ rm -rf vietnamese # Rename files mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test mkdir imgs mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
Step2: Generate
instances_training.json
,instances_test.json
andinstances_unseen_test.json
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
After running the above codes, the directory structure should be as follows:
│── vintext │ ├── annotations │ ├── imgs │ ├── instances_test.json │ ├── instances_unseen_test.json │ └── instances_training.json
BID¶
Step1: Download BID Dataset.zip
Step2: Run the following commands to preprocess the dataset
# Rename mv BID\ Dataset.zip BID_Dataset.zip # Unzip and Rename unzip -q BID_Dataset.zip && rm BID_Dataset.zip mv BID\ Dataset BID # The BID dataset has a problem of permission, and you may # add permission for this file chmod -R 777 BID cd BID mkdir imgs && mkdir annotations # For images and annotations mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso # Remove unnecessary files rm -rf desktop.ini
Step3: Generate
instances_training.json
andinstances_val.json
(optional). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
After running the above codes, the directory structure should be as follows:
│── BID │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
RCTW¶
Step1: Download
train_images.zip.001
,train_images.zip.002
, andtrain_gts.zip
from the homepage, extract the zips torctw/imgs
andrctw/annotations
, respectively.Step2: Generate
instances_training.json
andinstances_val.json
(optional). Since the test annotations are not publicly available, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
After running the above codes, the directory structure should be as follows:
│── rctw │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
HierText¶
Step1 (optional): Install AWS CLI.
Step2: Clone HierText repo to get annotations
mkdir HierText git clone https://github.com/google-research-datasets/hiertext.git
Step3: Download
train.tgz
,validation.tgz
from awsaws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz . aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
Step4: Process raw data
# process annotations mv hiertext/gt ./ rm -rf hiertext mv gt annotations gzip -d annotations/train.jsonl.gz gzip -d annotations/validation.jsonl.gz # process images mkdir imgs mv train.tgz imgs/ mv validation.tgz imgs/ tar -xzvf imgs/train.tgz tar -xzvf imgs/validation.tgz
Step5: Generate instances_training.json and instances_val.json. HierText includes different levels of annotation, from paragraph and line to word. Check the original paper for details. E.g., set --level paragraph to get paragraph-level annotation, --level line to get line-level annotation, or --level word to get word-level annotation.
# Collect word annotation from HierText --level word
python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
After running the above codes, the directory structure should be as follows:
│── HierText │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json
ArT¶
Step1: Download
train_images.tar.gz
, andtrain_labels.json
from the homepage toart/
mkdir art && cd art mkdir annotations # Download ArT dataset wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate # Extract tar -xf train_images.tar.gz mv train_images imgs mv train_labels.json annotations/ # Remove unnecessary files rm train_images.tar.gz
Step2: Generate
instances_training.json
andinstances_val.json
(optional). Since the test annotations are not publicly available, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.# Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2 python tools/data/textdet/art_converter.py PATH/TO/art --nproc 4
After running the above codes, the directory structure should be as follows:
│── art │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
Text Recognition¶
Note
This page is a manual preparation guide for datasets that are not yet supported by Dataset Preparer, into which all of these scripts will eventually be migrated.
Overview¶
Dataset | Images | Annotation File (training) | Annotation File (test) |
---|---|---|---|
coco_text | homepage | train_labels.json | - |
ICDAR2011 | homepage | - | - |
SynthAdd | SynthText_Add.zip (code:627x) | train_labels.json | - |
OpenVINO | Open Images | annotations | annotations |
DeText | homepage | - | - |
Lecture Video DB | homepage | - | - |
LSVT | homepage | - | - |
IMGUR | homepage | - | - |
KAIST | homepage | - | - |
MTWI | homepage | - | - |
ReCTS | homepage | - | - |
IIIT-ILST | homepage | - | - |
VinText | homepage | - | - |
BID | homepage | - | - |
RCTW | homepage | - | - |
HierText | homepage | - | - |
ArT | homepage | - | - |
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
Install AWS CLI (optional)¶
Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ./aws/install -i /usr/local/aws-cli -b /usr/local/bin !aws configure # this command will require you to input keys, you can skip them except # for the Default region name # AWS Access Key ID [None]: # AWS Secret Access Key [None]: # Default region name [None]: us-east-1 # Default output format [None]
For users in China, these datasets can also be downloaded from OpenDataLab with high speed.
ICDAR 2011 (Born-Digital Images)¶
Step1: Download
Challenge1_Training_Task3_Images_GT.zip
,Challenge1_Test_Task3_Images.zip
, andChallenge1_Test_Task3_GT.txt
from homepageTask 1.3: Word Recognition (2013 edition)
.mkdir icdar2011 && cd icdar2011 mkdir annotations # Download ICDAR 2011 wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate # For images mkdir crops unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train unzip -q Challenge1_Test_Task3_Images.zip -d crops/test # For annotations mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
Step2: Convert original annotations to
train_labels.json
andtest_labels.json
with the following command:python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
After running the above codes, the directory structure should be as follows:
├── icdar2011 │ ├── crops │ ├── train_labels.json │ └── test_labels.json
coco_text¶
Step1: Download from homepage
Step2: Download train_labels.json
After running the above codes, the directory structure should be as follows:
├── coco_text │ ├── train_labels.json │ └── train_words
SynthAdd¶
Step1: Download
SynthText_Add.zip
from SynthAdd (code:627x))Step2: Download train_labels.json
Step3:
mkdir SynthAdd && cd SynthAdd mv /path/to/SynthText_Add.zip . unzip SynthText_Add.zip mv /path/to/train_labels.json . # create soft link cd /path/to/mmocr/data/recog ln -s /path/to/SynthAdd SynthAdd
After running the above codes, the directory structure should be as follows:
├── SynthAdd │ ├── train_labels.json │ └── SynthText_Add
OpenVINO¶
Step1 (optional): Install AWS CLI.
Step2: Download Open Images subsets
train_1
,train_2
,train_5
,train_f
, andvalidation
toopenvino/
.mkdir openvino && cd openvino # Download Open Images subsets for s in 1 2 5 f; do aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz . done aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz . # Download annotations for s in 1 2 5 f; do wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json done wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json # Extract images mkdir -p openimages_v5/val for s in 1 2 5 f; do tar zxf train_${s}.tar.gz -C openimages_v5 done tar zxf validation.tar.gz -C openimages_v5/val
Step3: Generate
train_{1,2,5,f}_labels.json
,val_labels.json
and crop images using 4 processes with the following command:python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
After running the above codes, the directory structure should be as follows:
├── OpenVINO │ ├── image_1 │ ├── image_2 │ ├── image_5 │ ├── image_f │ ├── image_val │ ├── train_1_labels.json │ ├── train_2_labels.json │ ├── train_5_labels.json │ ├── train_f_labels.json │ └── val_labels.json
DeText¶
Step1: Download
ch9_training_images.zip
,ch9_training_localization_transcription_gt.zip
,ch9_validation_images.zip
, andch9_validation_localization_transcription_gt.zip
from Task 3: End to End on the homepage.mkdir detext && cd detext mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val # Download DeText wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate # Extract images and annotations unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val # Remove zips rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
Step2: Generate
train_labels.json
andtest_labels.json
with following command:# Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/detext/ignores python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
After running the above codes, the directory structure should be as follows:
├── detext │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── test_labels.json
NAF¶
Step1: Download labeled_images.tar.gz to
naf/
.mkdir naf && cd naf # Download NAF dataset wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz tar -zxf labeled_images.tar.gz # For images mkdir annotations && mv labeled_images imgs # For annotations git clone https://github.com/herobd/NAF_dataset.git mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/ rm -rf NAF_dataset && rm labeled_images.tar.gz
Step2: Generate
train_labels.json
,val_labels.json
, andtest_labels.json
with following command:# Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/naf/ignores python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
After running the above codes, the directory structure should be as follows:
├── naf │ ├── crops │ ├── train_labels.json │ ├── val_labels.json │ └── test_labels.json
Lecture Video DB¶
Warning
This section is not fully tested yet.
Note
The LV dataset has already provided cropped images and the corresponding annotations
Step1: Download IIIT-CVid.zip to
lv/
.mkdir lv && cd lv # Download LV dataset wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip unzip -q IIIT-CVid.zip # For image mv IIIT-CVid/Crops ./ # For annotation mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json rm IIIT-CVid.zip
Step2: Generate
train_labels.json
,val.json
, andtest.json
with following command:python tools/dataset_converters/textdreog/lv_converter.py PATH/TO/lv
After running the above codes, the directory structure should be as follows:
├── lv │ ├── Crops │ ├── train_labels.json │ └── test_labels.json
LSVT¶
Warning
This section is not fully tested yet.
Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to
lsvt/
.mkdir lsvt && cd lsvt # Download LSVT dataset wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json mkdir annotations tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/ mv train_full_images_0 imgs rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:# Annotations of LSVT test split is not publicly available, split a validation # set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/lsvt/ignores python tools/dataset_converters/textdrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
After running the above codes, the directory structure should be as follows:
├── lsvt │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
IMGUR¶
Warning
This section is not fully tested yet.
Step1: Run
download_imgur5k.py
to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.mkdir imgur && cd imgur git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git # Download images from imgur.com. This may take SEVERAL HOURS! python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs # For annotations mkdir annotations mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations rm -rf IMGUR5K-Handwriting-Dataset
Step2: Generate
train_labels.json
,val_label.txt
andtest_labels.json
and crop images with the following command:python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
After running the above codes, the directory structure should be as follows:
├── imgur │ ├── crops │ ├── train_labels.json │ ├── test_labels.json │ └── val_label.json
KAIST¶
Warning
This section is not fully tested yet.
Step1: Download KAIST_all.zip to
kaist/
.mkdir kaist && cd kaist mkdir imgs && mkdir annotations # Download KAIST dataset wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip unzip -q KAIST_all.zip && rm KAIST_all.zip
Step2: Extract zips:
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
Step3: Generate
train_labels.json
andval_label.json
(optional) with following command:# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/kaist/ignores python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
After running the above codes, the directory structure should be as follows:
├── kaist │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
MTWI¶
Warning
This section is not fully tested yet.
Step1: Download
mtwi_2018_train.zip
from homepage.mkdir mtwi && cd mtwi unzip -q mtwi_2018_train.zip mv image_train imgs && mv txt_train annotations rm mtwi_2018_train.zip
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:# Annotations of MTWI test split is not publicly available, split a validation # set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/mtwi/ignores python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
After running the above codes, the directory structure should be as follows:
├── mtwi │ ├── crops │ ├── train_labels.json │ └── val_label.json (optional)
ReCTS¶
Warning
This section is not fully tested yet.
Step1: Download ReCTS.zip to
rects/
from the homepage.mkdir rects && cd rects # Download ReCTS dataset # You can also find Google Drive link on the dataset homepage wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate unzip -q ReCTS.zip mv img imgs && mv gt_unicode annotations rm ReCTS.zip -f && rm -rf gt
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:# Annotations of ReCTS test split is not publicly available, split a validation # set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/rects/ignores python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
After running the above codes, the directory structure should be as follows:
├── rects │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
ILST¶
Warning
This section is not fully tested yet.
Step1: Download
IIIT-ILST.zip
from onedrive link.
Step2: Run the following commands
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip cd IIIT-ILST # rename files cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd .. cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd .. cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd .. # transfer image path mkdir imgs && mkdir annotations mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/ mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/ mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/ # remove unnecessary files rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
Step3: Generate
train_labels.json
andval_label.json
(optional) and crop images using 4 processes with the following command (add--preserve-vertical
if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
After running the above codes, the directory structure should be as follows:
├── IIIT-ILST │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
VinText¶
Warning
This section is not fully tested yet.
Step1: Download vintext.zip to
vintext
mkdir vintext && cd vintext # Download dataset from google drive wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt # Extract images and annotations unzip -q vintext.zip && rm vintext.zip mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./ rm -rf vietnamese # Rename files mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test mkdir imgs mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
Step2: Generate
train_labels.json
,test_labels.json
,unseen_test_labels.json
, and crop images using 4 processes with the following command (add--preserve-vertical
if you wish to preserve the images containing vertical texts).python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
After running the above codes, the directory structure should be as follows:
├── vintext │ ├── crops │ ├── ignores │ ├── train_labels.json │ ├── test_labels.json │ └── unseen_test_labels.json
BID¶
Warning
This section is not fully tested yet.
Step1: Download BID Dataset.zip
Step2: Run the following commands to preprocess the dataset
# Rename mv BID\ Dataset.zip BID_Dataset.zip # Unzip and Rename unzip -q BID_Dataset.zip && rm BID_Dataset.zip mv BID\ Dataset BID # The BID dataset has a problem of permission, and you may # add permission for this file chmod -R 777 BID cd BID mkdir imgs && mkdir annotations # For images and annotations mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso # Remove unnecessary files rm -rf desktop.ini
Step3: Generate
train_labels.json
andval_label.json
(optional) and crop images using 4 processes with the following command (add--preserve-vertical
if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify --val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
python tools/dataset_converters/textrecog/bid_converter.py PATH/TO/BID --nproc 4
After running the above codes, the directory structure should be as follows:
├── BID │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
RCTW¶
Warning
This section is not fully tested yet.
Step1: Download
train_images.zip.001
,train_images.zip.002
, andtrain_gts.zip
from the homepage, extract the zips torctw/imgs
andrctw/annotations
, respectively.Step2: Generate
train_labels.json
andval_label.json
(optional). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
After running the above codes, the directory structure should be as follows:
│── rctw │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
HierText¶
Warning
This section is not fully tested yet.
Step1 (optional): Install AWS CLI.
Step2: Clone HierText repo to get annotations
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
Step3: Download train.tgz and validation.tgz from AWS

aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
Step4: Process raw data
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.json.gz
gzip -d annotations/validation.json.gz

# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
Step5: Generate train_labels.json and val_label.json. HierText includes different levels of annotation, including paragraph, line, and word. Check the original paper for details. E.g., set --level paragraph to get paragraph-level annotations, --level line to get line-level annotations, and --level word to get word-level annotations.

# Collect word annotation from HierText with --level word
# Add --preserve-vertical to preserve vertical texts for training; otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
After running the above commands, the directory structure should be as follows:

├── HierText
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json
ArT¶
Warning
This section is not fully tested yet.
Step1: Download train_task2_images.tar.gz and train_task2_labels.json from the homepage to art/

mkdir art && cd art
mkdir annotations

# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json

# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/

# Remove unnecessary files
rm train_task2_images.tar.gz
Step2: Generate train_labels.json and val_label.json (optional). Since the test annotations are not publicly available, you may specify --val-ratio to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.

# Annotations of the ArT test split are not publicly available, so split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textrecog/art_converter.py PATH/TO/art
After running the above commands, the directory structure should be as follows:

├── art
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
Key Information Extraction¶
Note
This page is a manual preparation guide for datasets not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.
Overview¶
The structure of the key information extraction dataset directory is organized as follows.
└── wildreceipt
├── class_list.txt
├── dict.txt
├── image_files
├── openset_train.txt
├── openset_test.txt
├── test.txt
└── train.txt
Preparation Steps¶
WildReceipt¶
Just download and extract wildreceipt.tar.
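Once extracted, the annotation files can be inspected directly. The snippet below is a minimal sketch that assumes each line of train.txt is a JSON object describing one image (its file_name plus a list of annotated boxes with text and label); adjust the path to wherever wildreceipt was extracted.

import json

# Peek at the first annotated image in the WildReceipt training split (assumed layout)
with open('wildreceipt/train.txt') as f:
    sample = json.loads(f.readline())

print(sample['file_name'])
print(sample['annotations'][0])  # e.g. {'box': [...], 'text': ..., 'label': ...}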
WildReceiptOpenset¶
Step0: have WildReceipt prepared.
Step1: Convert annotation files to OpenSet format:
# You may find more available arguments by running
# python tools/data/kie/closeset_to_openset.py -h
python tools/data/kie/closeset_to_openset.py data/wildreceipt/train.txt data/wildreceipt/openset_train.txt
python tools/data/kie/closeset_to_openset.py data/wildreceipt/test.txt data/wildreceipt/openset_test.txt
Note
You can learn more about the key differences between CloseSet and OpenSet annotations in our tutorial.
Overview¶
Weights¶
Here is the list of weights available for inference. For ease of reference, some weights may have shorter aliases, which are separated by / in the table. For example, "DB_r18 / dbnet_resnet18_fpnc_1200e_icdar2015" means that you can use either DB_r18 or dbnet_resnet18_fpnc_1200e_icdar2015 to initialize the Inferencer:
>>> from mmocr.apis import TextDetInferencer
>>> inferencer = TextDetInferencer(model='DB_r18')
>>> # equivalent to
>>> inferencer = TextDetInferencer(model='dbnet_resnet18_fpnc_1200e_icdar2015')
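Once initialized, the Inferencer can be called directly on images. The following is a minimal sketch, where demo.jpg is a hypothetical local image; the returned dictionary is expected to hold the predictions (polygons and scores for detection models):

>>> result = inferencer('demo.jpg')  # hypothetical image path
>>> result['predictions'][0]['polygons']  # detected text polygons
>>> result['predictions'][0]['scores']    # corresponding confidence scores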
Text Detection¶
Model | README | ICDAR2015 (hmean-iou) | CTW1500 (hmean-iou) | Totaltext (hmean-iou) |
---|---|---|---|---|
(See the Results and models section of each model below for the per-weight hmean-iou scores.)
Text Recognition¶
Note
Avg is the average on IIIT5K, SVT, ICDAR2013, ICDAR2015, SVTP, CT80.
Model | README | Avg (word_acc) | IIIT5K (word_acc) | SVT (word_acc) | ICDAR2013 (word_acc) | ICDAR2015 (word_acc) | SVTP (word_acc) | CT80 (word_acc) |
---|---|---|---|---|---|---|---|---|
(See the Results and models section of each model below for the per-weight word_acc scores.)
Statistics¶
Number of checkpoints: 48
Number of configs: 49
Number of papers: 19
ALGORITHM: 19
Text Detection Models¶
Number of checkpoints: 29
Number of configs: 29
Number of papers: 8
[ALGORITHM] Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection
[ALGORITHM] Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network
[ALGORITHM] Fourier Contour Embedding for Arbitrary-Shaped Text Detection
[ALGORITHM] Mask R-CNN
[ALGORITHM] Real-Time Scene Text Detection With Differentiable Binarization and Adaptive Scale Fusion
[ALGORITHM] Real-Time Scene Text Detection With Differentiable Binarization
[ALGORITHM] Shape Robust Text Detection With Progressive Scale Expansion Network
[ALGORITHM] Textsnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Text Recognition Models¶
Number of checkpoints: 16
Number of configs: 17
Number of papers: 9
[ALGORITHM] Aster: An Attentional Scene Text Recognizer With Flexible Rectification
[ALGORITHM] Master: Multi-Aspect Non-Local Network for Scene Text Recognition
[ALGORITHM] Nrtr: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition
[ALGORITHM] On Recognizing Texts of Arbitrary Shapes With 2d Self-Attention
[ALGORITHM] Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
[ALGORITHM] Robustscanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
[ALGORITHM] Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition
[ALGORITHM] Svtr: Scene Text Recognition With a Single Visual Model
Key Information Extraction Models¶
Number of checkpoints: 3
Number of configs: 3
Number of papers: 1
SOTA Models¶
Here are some selected project implementations that are not yet included in the MMOCR package but are ready to use.
ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network¶
This is an implementation of ABCNet based on MMOCR, MMCV, and MMEngine.
ABCNet is a conceptually novel, efficient, and fully convolutional framework for text spotting, which address the problem by proposing the Adaptive Bezier-Curve Network (ABCNet). Our contributions are three-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance with arbitrary shapes, significantly improving the precision compared with previous methods. 3) Compared with standard bounding box detection, our Bezier curve detection introduces negligible computation overhead, resulting in superiority of our method in both efficiency and accuracy. Experiments on arbitrarily-shaped benchmark datasets, namely Total-Text and CTW1500, demonstrate that ABCNet achieves state-of-the-art accuracy, meanwhile significantly improving the speed. In particular, on Total-Text, our realtime version is over 10 times faster than recent state-of-the-art methods with a competitive recognition accuracy.

ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting¶
This is an implementation of ABCNetV2 based on MMOCR, MMCV, and MMEngine.
ABCNetV2 contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve, which, compared with segmentation-based methods, can not only provide structured output but also controllable representation. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance of arbitrary shapes, significantly improving the precision of recognition over previous methods. 3) Different from previous methods, which often suffer from complex post-processing and sensitive hyper-parameters, our ABCNet v2 maintains a simple pipeline with the only post-processing non-maximum suppression (NMS). 4) As the performance of text recognition closely depends on feature alignment, ABCNet v2 further adopts a simple yet effective coordinate convolution to encode the position of the convolutional filters, which leads to a considerable improvement with negligible computation overhead. Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 can achieve state-of-the-art performance while maintaining very high efficiency.

SPTS: Single-Point Text Spotting¶
This is an implementation of SPTS based on MMOCR, MMCV, and MMEngine.
Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single-point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, which can achieve state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the positions of the point annotation, meaning that it can be much easier to be annotated or even be automatically generated than the bounding box that requires precise positions. We believe that such a pioneer attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible.

BackBones¶
oCLIP¶
Abstract¶
Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot well cope with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e. images and captured texts in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction among textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend texts in images well with character awareness. Besides, these designs enable the learning from weakly annotated texts (i.e. partial texts in images without text bounding boxes) which mitigates the data annotation constraint greatly. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks, respectively. In addition, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500).

Models¶
Backbone | Pre-train Data | Model |
---|---|---|
ResNet-50 | SynthText | Link |
Note
The model is converted from the official oCLIP.
Supported Text Detection Models¶
DBNet | DBNet++ | FCENet | TextSnake | PSENet | DRRG | Mask R-CNN | |
---|---|---|---|---|---|---|---|
ICDAR2015 | ✓ | ✓ | ✓ | ✓ | ✓ | ||
CTW1500 | ✓ | ✓ | ✓ | ✓ | ✓ |
Citation¶
@article{xue2022language,
title={Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting},
author={Xue, Chuhui and Zhang, Wenqing and Hao, Yu and Lu, Shijian and Torr, Philip and Bai, Song},
journal={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022}
}
Text Detection Models¶
DBNet¶
Real-time Scene Text Detection with Differentiable Binarization
Abstract¶
Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset.
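As a brief aid to the abstract above, the differentiable binarization it describes can be sketched as the following soft approximation of the step function (the paper's formulation, not MMOCR-specific notation), where P is the probability map, T is the learned threshold map, and k is an amplification factor (the paper uses k = 50):

\hat{B}_{i,j} = \frac{1}{1 + e^{-k\,(P_{i,j} - T_{i,j})}}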

Results and models¶
SynthText¶
Method | Backbone | Training set | ##iters | Download |
---|---|---|---|---|
DBNet_r18 | ResNet18 | SynthText | 100,000 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DBNet_r18 | ResNet18 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 736 | 0.8853 | 0.7583 | 0.8169 | model | log |
DBNet_r50 | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.8744 | 0.8276 | 0.8504 | model | log |
DBNet_r50dcn | ResNet50-DCN | Synthtext | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.8784 | 0.8315 | 0.8543 | model | log |
DBNet_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9052 | 0.8272 | 0.8644 | model | log |
Citation¶
@article{Liao_Wan_Yao_Chen_Bai_2020,
title={Real-Time Scene Text Detection with Differentiable Binarization},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Liao, Minghui and Wan, Zhaoyi and Yao, Cong and Chen, Kai and Bai, Xiang},
year={2020},
pages={11474-11481}}
DBNetpp¶
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion
Abstract¶
Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.

Results and models¶
SynthText¶
Method | BackBone | Training set | ##iters | Download |
---|---|---|---|---|
DBNetpp_r50dcn | ResNet50-dcnv2 | SynthText | 100,000 | model | log |
ICDAR2015¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DBNetpp_r50 | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9079 | 0.8209 | 0.8622 | model | log |
DBNetpp_r50dcn | ResNet50-dcnv2 | Synthtext (model) | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9116 | 0.8291 | 0.8684 | model | log |
DBNetpp_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9174 | 0.8609 | 0.8882 | model | log |
Citation¶
@article{liao2022real,
title={Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion},
author={Liao, Minghui and Zou, Zhisheng and Wan, Zhaoyi and Yao, Cong and Bai, Xiang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2022},
publisher={IEEE}
}
DRRG¶
Deep relational reasoning graph network for arbitrary shape text detection
Abstract¶
Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DRRG | ResNet50 | - | CTW1500 Train | CTW1500 Test | 1200 | 640 | 0.8775 | 0.8179 | 0.8467 | model | log |
DRRG_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1200 | - | - | - | - | model | log |
Citation¶
@article{zhang2020drrg,
title={Deep relational reasoning graph network for arbitrary shape text detection},
author={Zhang, Shi-Xue and Zhu, Xiaobin and Hou, Jie-Bo and Liu, Chang and Yang, Chun and Wang, Hongfa and Yin, Xu-Cheng},
booktitle={CVPR},
pages={9699-9708},
year={2020}
}
FCENet¶
Fourier Contour Embedding for Arbitrary-Shaped Text Detection
Abstract¶
One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.
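As a brief sketch of the core idea, a closed text contour can be written as a truncated Fourier series. FCENet regresses the complex Fourier coefficients c_k (the compact signature mentioned above) and reconstructs the contour with the Inverse Fourier Transformation at test time, where K is the Fourier degree:

c(t) = \sum_{k=-K}^{K} c_k\, e^{2\pi i k t}, \quad t \in [0, 1]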

Results and models¶
CTW1500¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50dcn | ResNet50 + DCNv2 | - | CTW1500 Train | CTW1500 Test | 1500 | (736, 1080) | 0.8689 | 0.8296 | 0.8488 | model | log |
FCENet_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1500 | (736, 1080) | 0.8383 | 0.801 | 0.8192 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50 | ResNet50 | - | IC15 Train | IC15 Test | 1500 | (2260, 2260) | 0.8243 | 0.8834 | 0.8528 | model | log |
FCENet_r50-oclip | ResNet50-oCLIP | - | IC15 Train | IC15 Test | 1500 | (2260, 2260) | 0.9176 | 0.8098 | 0.8604 | model | log |
Total Text¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50 | ResNet50 | - | Totaltext Train | Totaltext Test | 1500 | (1280, 960) | 0.8485 | 0.7810 | 0.8134 | model | log |
Citation¶
@InProceedings{zhu2021fourier,
title={Fourier Contour Embedding for Arbitrary-Shaped Text Detection},
author={Yiqin Zhu and Jianyong Chen and Lingyu Liang and Zhanghui Kuang and Lianwen Jin and Wayne Zhang},
year={2021},
booktitle = {CVPR}
}
Mask R-CNN¶
Abstract¶
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
MaskRCNN | - | - | CTW1500 Train | CTW1500 Test | 160 | 1600 | 0.7165 | 0.7776 | 0.7458 | model | log |
MaskRCNN_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 160 | 1600 | 0.753 | 0.7593 | 0.7562 | model | log |
ICDAR2015¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
MaskRCNN | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 160 | 1920 | 0.8644 | 0.7766 | 0.8182 | model | log |
MaskRCNN_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 160 | 1920 | 0.8695 | 0.8339 | 0.8513 | model | log |
Citation¶
@INPROCEEDINGS{8237584,
author={K. {He} and G. {Gkioxari} and P. {Dollár} and R. {Girshick}},
booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
title={Mask R-CNN},
year={2017},
pages={2980-2988},
doi={10.1109/ICCV.2017.322}}
PANet¶
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
Abstract¶
Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Results and models¶
Citation¶
@inproceedings{WangXSZWLYS19,
author={Wenhai Wang and Enze Xie and Xiaoge Song and Yuhang Zang and Wenjia Wang and Tong Lu and Gang Yu and Chunhua Shen},
title={Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network},
booktitle={ICCV},
pages={8439--8448},
year={2019}
}
PSENet¶
Shape robust text detection with progressive scale expansion network
Abstract¶
Scene text detection has witnessed rapid progress especially with the recent development of convolutional neural networks. However, there still exists two challenges which prevent the algorithm into industry applications. On the one hand, most of the state-of-art algorithms require quadrangle bounding box which is in-accurate to locate the texts with arbitrary shape. On the other hand, two text instances which are close to each other may lead to a false detection which covers both instances. Traditionally, the segmentation-based approach can relieve the first problem but usually fail to solve the second challenge. To address these two challenges, in this paper, we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates the different scale of kernels for each text instance, and gradually expands the minimal scale kernel to the text instance with the complete shape. Due to the fact that there are large geometrical margins among the minimal scale kernels, our method is effective to split the close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curve texts, PSENet achieves a F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-art algorithms by 6.6%. The code will be released in the future.

Results and models¶
CTW1500¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
PSENet | ResNet50 | - | CTW1500 Train | CTW1500 Test | 600 | 1280 | 0.7705 | 0.7883 | 0.7793 | model | log |
PSENet_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 600 | 1280 | 0.8483 | 0.7636 | 0.8037 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
PSENet | ResNet50 | - | IC15 Train | IC15 Test | 600 | 2240 | 0.8396 | 0.7636 | 0.7998 | model | log |
PSENet_r50-oclip | ResNet50-oCLIP | - | IC15 Train | IC15 Test | 600 | 2240 | 0.8895 | 0.8098 | 0.8478 | model | log |
Citation¶
@inproceedings{wang2019shape,
title={Shape robust text detection with progressive scale expansion network},
author={Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={9336--9345},
year={2019}
}
Textsnake¶
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Abstract¶
Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed as TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
TextSnake | ResNet50 | - | CTW1500 Train | CTW1500 Test | 1200 | 736 | 0.8535 | 0.8052 | 0.8286 | model | log |
TextSnake_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1200 | 736 | 0.8869 | 0.8215 | 0.8529 | model | log |
Citation¶
@article{long2018textsnake,
title={TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes},
author={Long, Shangbang and Ruan, Jiaqiang and Zhang, Wenjie and He, Xin and Wu, Wenhao and Yao, Cong},
booktitle={ECCV},
pages={20-36},
year={2018}
}
Text Recognition Models¶
ABINet¶
Abstract¶
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicitly language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous suggests to block gradient flow between vision and language models to enforce explicitly language modeling. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for language model which can effectively alleviate the impact of noise input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
SynthText | 7239272 | 1 | alphanumeric |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
methods | pretrained | Regular Text | Irregular Text | download | ||||
---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||
ABINet-Vision | - | 0.9523 | 0.9196 | 0.9369 | 0.7896 | 0.8403 | 0.8437 | model | log |
ABINet-Vision-TTA | - | 0.9523 | 0.9196 | 0.9360 | 0.8175 | 0.8450 | 0.8542 | |
ABINet | Pretrained | 0.9603 | 0.9397 | 0.9557 | 0.8146 | 0.8868 | 0.8785 | model | log |
ABINet-TTA | Pretrained | 0.9597 | 0.9397 | 0.9527 | 0.8426 | 0.8930 | 0.8854 |
Note
ABINet allows its encoder to run and be trained without decoder and fuser. Its encoder is designed to recognize texts as a stand-alone model and therefore can work as an independent text recognizer. We release it as ABINet-Vision.
Facts about the pretrained model: MMOCR does not have a systematic pipeline to pretrain the language model (LM) yet, thus the weights of LM are converted from the official pretrained model. The weights of ABINet-Vision are directly used as the vision model of ABINet.
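The recognition weights above can be loaded through TextRecInferencer in the same way as the detection weights. The snippet below is a minimal sketch; the alias ABINet and the image path are assumptions, so fall back to the full weight name from the Weights overview if the alias differs:

>>> from mmocr.apis import TextRecInferencer
>>> inferencer = TextRecInferencer(model='ABINet')  # assumed alias
>>> inferencer('word_crop.jpg')  # hypothetical word crop; returns the predicted text and score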
Citation¶
@article{fang2021read,
title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}
ASTER¶
ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
Abstract¶
A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
SynthText | 7239272 | 1 | alphanumeric |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
ASTER | ResNet45 | 0.9357 | 0.8949 | 0.9281 | 0.7665 | 0.8062 | 0.8507 | model | log | |
ASTER-TTA | ResNet45 | 0.9337 | 0.8949 | 0.9251 | 0.7925 | 0.8109 | 0.8507 |
Citation¶
@article{shi2018aster,
title={Aster: An attentional scene text recognizer with flexible rectification},
author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={41},
number={9},
pages={2035--2048},
year={2018},
publisher={IEEE}
}
CRNN¶
Abstract¶
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
methods | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|
methods | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||
CRNN | 0.8053 | 0.7991 | 0.8739 | 0.5571 | 0.6093 | 0.5694 | model | log | |
CRNN-TTA | 0.8013 | 0.7975 | 0.8631 | 0.5763 | 0.6093 | 0.5764 | model | log |
Citation¶
@article{shi2016end,
title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
journal={IEEE transactions on pattern analysis and machine intelligence},
year={2016}
}
MASTER¶
MASTER: Multi-aspect non-local network for scene text recognition
Abstract¶
Attention-based scene text recognizers have gained huge success, which leverages a more compact intermediate representation to learn 1d- or 2d- attention by a RNN-based encoder-decoder architecture. However, such methods suffer from attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose the MASTER, a self-attention based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention which encodes feature-feature and target-target relationships inside the encoder and decoder and (2) learns a more powerful and robust intermediate representation to spatial distortion, and (3) owns a great training efficiency because of high training parallelization and a high-speed inference because of an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text.
Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
SynthAdd | 1216889 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
MASTER | R31-GCAModule | 0.9490 | 0.8887 | 0.9517 | 0.7650 | 0.8465 | 0.8889 | model | log | |
MASTER-TTA | R31-GCAModule | 0.9450 | 0.8887 | 0.9478 | 0.7906 | 0.8481 | 0.8958 |
Citation¶
@article{Lu2021MASTER,
title={MASTER: Multi-Aspect Non-local Network for Scene Text Recognition},
author={Ning Lu and Wenwen Yu and Xianbiao Qi and Yihao Chen and Ping Gong and Rong Xiao and Xiang Bai},
journal={Pattern Recognition},
year={2021}
}
NRTR¶
NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
Abstract¶
Scene text recognition has attracted a great many researches due to its importance to various applications. Existing methods mainly adopt recurrence or convolution based networks. Though have obtained good performance, these methods still suffer from two limitations: slow training speed due to the internal recurrence of RNNs, and high complexity due to stacked convolutional layers for long-term feature extraction. This paper, for the first time, proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that dispenses with recurrences and convolutions entirely. NRTR follows the encoder-decoder paradigm, where the encoder uses stacked self-attention to extract image features, and the decoder applies stacked self-attention to recognize texts based on encoder output. NRTR relies solely on self-attention mechanism thus could be trained with more parallelization and less complexity. Considering scene image has large variation in text and background, we further design a modality-transform block to effectively transform 2D input images to 1D sequences, combined with the encoder to extract more discriminative features. NRTR achieves state-of-the-art or highly competitive performance on both regular and irregular benchmarks, while requires only a small fraction of training time compared to the best model from the literature (at least 8 times faster).

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
NRTR | NRTRModalityTransform | 0.9147 | 0.8841 | 0.9369 | 0.7246 | 0.7783 | 0.7500 | model | log | |
NRTR-TTA | NRTRModalityTransform | 0.9123 | 0.8825 | 0.9310 | 0.7492 | 0.7798 | 0.7535 | ||
NRTR | R31-1/8-1/4 | 0.9483 | 0.8918 | 0.9507 | 0.7578 | 0.8016 | 0.8889 | model | log | |
NRTR-TTA | R31-1/8-1/4 | 0.9443 | 0.8903 | 0.9478 | 0.7790 | 0.8078 | 0.8854 | ||
NRTR | R31-1/16-1/8 | 0.9470 | 0.8918 | 0.9399 | 0.7376 | 0.7969 | 0.8854 | model | log | |
NRTR-TTA | R31-1/16-1/8 | 0.9423 | 0.8903 | 0.9360 | 0.7641 | 0.8016 | 0.8854 |
Citation¶
@inproceedings{sheng2019nrtr,
title={NRTR: A no-recurrence sequence-to-sequence model for scene text recognition},
author={Sheng, Fenfen and Chen, Zhineng and Xu, Bo},
booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
pages={781--786},
year={2019},
organization={IEEE}
}
RobustScanner¶
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
Abstract¶
The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.

Dataset¶
Results and Models¶
Methods | GPUs | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
RobustScanner | 4 | 0.9510 | 0.9011 | 0.9320 | 0.7578 | 0.8078 | 0.8750 | model | log | |
RobustScanner-TTA | 4 | 0.9487 | 0.9011 | 0.9261 | 0.7805 | 0.8124 | 0.8819 |
References¶
[1] Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI 2019.
Citation¶
@inproceedings{yue2020robustscanner,
title={RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition},
author={Yue, Xiaoyu and Kuang, Zhanghui and Lin, Chenhao and Sun, Hongbin and Zhang, Wayne},
booktitle={European Conference on Computer Vision},
year={2020}
}
SAR¶
Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition
Abstract¶
Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust and achieves state-of-the-art performance on both regular and irregular scene text recognition benchmarks.

Dataset¶
Results and Models¶
Methods | Backbone | Decoder | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||||
SAR | R31-1/8-1/4 | ParallelSARDecoder | 0.9533 | 0.8964 | 0.9369 | 0.7602 | 0.8326 | 0.9062 | model | log | |
SAR-TTA | R31-1/8-1/4 | ParallelSARDecoder | 0.9510 | 0.8964 | 0.9340 | 0.7862 | 0.8372 | 0.9132 | ||
SAR | R31-1/8-1/4 | SequentialSARDecoder | 0.9553 | 0.9073 | 0.9409 | 0.7761 | 0.8093 | 0.8958 | model | log | |
SAR-TTA | R31-1/8-1/4 | SequentialSARDecoder | 0.9530 | 0.9073 | 0.9389 | 0.8002 | 0.8124 | 0.9028 |
Citation¶
@inproceedings{li2019show,
title={Show, attend and read: A simple and strong baseline for irregular text recognition},
author={Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={33},
number={01},
pages={8610--8617},
year={2019}
}
SATRN¶
On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
Abstract¶
Scene text recognition (STR) is the task of recognizing character sequences in natural scenes. While there have been great advances in STR methods, current methods still fail to recognize texts in arbitrary shapes, such as heavily curved or rotated texts, which are abundant in daily life (e.g. restaurant signs, product labels, company logos, etc). This paper introduces a novel architecture to recognizing texts of arbitrary shapes, named Self-Attention Text Recognition Network (SATRN), which is inspired by the Transformer. SATRN utilizes the self-attention mechanism to describe two-dimensional (2D) spatial dependencies of characters in a scene text image. Exploiting the full-graph propagation of self-attention, SATRN can recognize texts with arbitrary arrangements and large inter-character spacing. As a result, SATRN outperforms existing STR models by a large margin of 5.7 pp on average in “irregular text” benchmarks. We provide empirical analyses that illustrate the inner mechanisms and the extent to which the model is applicable (e.g. rotated and multi-line text). We will open-source the code.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||
Satrn | 0.9600 | 0.9181 | 0.9606 | 0.8045 | 0.8837 | 0.8993 | model | log | |
Satrn-TTA | 0.9530 | 0.9181 | 0.9527 | 0.8276 | 0.8884 | 0.9028 | ||
Satrn_small | 0.9423 | 0.9011 | 0.9567 | 0.7886 | 0.8574 | 0.8472 | model | log | |
Satrn_small-TTA | 0.9380 | 0.8995 | 0.9488 | 0.8122 | 0.8620 | 0.8507 |
Citation¶
@article{junyeop2019recognizing,
title={On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention},
author={Junyeop Lee, Sungrae Park, Jeonghun Baek, Seong Joon Oh, Seonghyeon Kim, Hwalsuk Lee},
year={2019}
}
SVTR¶
SVTR: Scene Text Recognition with a Single Visual Model
Abstract¶
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||
SVTR-tiny | - | - | - | - | - | - | - | |
SVTR-small | 0.8553 | 0.9026 | 0.9448 | 0.7496 | 0.8496 | 0.8854 | model | log | |
SVTR-small-TTA | 0.8397 | 0.8964 | 0.9241 | 0.7597 | 0.8124 | 0.8646 | ||
SVTR-base | 0.8570 | 0.9181 | 0.9438 | 0.7448 | 0.8388 | 0.9028 | model | log | |
SVTR-base-TTA | 0.8517 | 0.9011 | 0.9379 | 0.7569 | 0.8279 | 0.8819 | ||
SVTR-large | - | - | - | - | - | - | - |
Note
The implementation and configuration follow the original code and paper, but there is still a gap between the reproduced results and the official ones. We appreciate any suggestions to improve its performance.
Citation¶
@inproceedings{ijcai2022p124,
title = {SVTR: Scene Text Recognition with a Single Visual Model},
author = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI-22}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Lud De Raedt},
pages = {884--890},
year = {2022},
month = {7},
note = {Main Track},
doi = {10.24963/ijcai.2022/124},
url = {https://doi.org/10.24963/ijcai.2022/124},
}
Key Information Extraction Models¶
SDMGR¶
Spatial Dual-Modality Graph Reasoning for Key Information Extraction
Abstract¶
Key information extraction from document images is of paramount importance in office automation. Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, nodes of which encode both the visual and textual features of detected text regions, and edges of which represent the spatial relations between neighboring text regions. The key information extraction is solved by iteratively propagating messages along graph edges and reasoning the categories of graph nodes. In order to roundly evaluate our proposed method as well as boost the future research, we release a new dataset named WildReceipt, which is collected and annotated tailored for the evaluation of key information extraction from document images of unseen templates in the wild. It contains 25 key information categories, a total of about 69000 text boxes, and is about 2 times larger than the existing public datasets. Extensive experiments validate that all information including visual features, textual features and spatial relations can benefit key information extraction. It has been shown that SDMG-R can effectively extract key information from document images of unseen templates, and obtain new state-of-the-art results on the recent popular benchmark SROIE and our WildReceipt. Our code and dataset will be publicly released.

Results and models¶
WildReceipt¶
Method | Modality | Macro F1-Score | Download |
---|---|---|---|
sdmgr_unet16 | Visual + Textual | 0.890 | model | log |
sdmgr_novisual | Textual | 0.873 | model | log |
WildReceiptOpenset¶
Method | Modality | Edge F1-Score | Node Macro F1-Score | Node Micro F1-Score | Download |
---|---|---|---|---|---|
sdmgr_novisual_openset | Textual | 0.792 | 0.931 | 0.940 | model | log |
Citation¶
@misc{sun2021spatial,
title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction},
author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang},
year={2021},
eprint={2103.14470},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
Branches¶
This documentation aims to provide a comprehensive understanding of the purpose and features of each branch in MMOCR.
Branch Overview¶
1. main¶
The main branch serves as the default branch for the MMOCR project. It contains the latest stable version of MMOCR, currently housing the code for MMOCR 1.x (e.g. v1.0.0). The main branch ensures users have access to the most recent and reliable version of the software.
2. dev-1.x¶
The dev-1.x branch is dedicated to the development of the next major version of MMOCR. This branch routinely undergoes reliability tests, and the passing commits are squashed into a release and published to the main branch. Having a separate development branch allows the project to continue evolving without impacting the stability of the main branch. All PRs should be merged into the dev-1.x branch.
3. 0.x¶
The 0.x branch serves as an archive for MMOCR 0.x (e.g. v0.6.3). This branch will no longer actively receive updates or improvements, but it remains accessible for historical reference or for users who have not yet upgraded to MMOCR 1.x.
4. 1.x¶
It is an alias of the main branch, intended to smooth the transition during the compatibility period. It will be removed in mid-2023.
Note
The branch mapping was changed on 2023.04.06. For the legacy branch mapping and the migration guide, please refer to the branch migration guide.
Contribution Guide¶
OpenMMLab welcomes everyone who is interested in contributing to our projects and accepts contributions in the form of PRs.
What is PR¶
PR is the abbreviation of Pull Request. Here is the definition of PR in GitHub's official documentation.
Pull requests let you tell others about changes you have pushed to a branch in a repository on GitHub. Once a pull request is opened, you can discuss and review the potential changes with collaborators and add follow-up commits before your changes are merged into the base branch.
Basic Workflow¶
Get the most recent codebase
Checkout a new branch from the dev-1.x branch, depending on the version of the codebase you want to contribute to
Commit your changes (don't forget to use pre-commit hooks!)
Push your changes and create a PR
Discuss and review your code
Merge your branch into the dev-1.x branch
Procedures in detail¶
1. Get the most recent codebase¶
When you work on your first PR
Fork the OpenMMLab repository: click the fork button at the top right corner of the GitHub page
Clone the forked repository to local
git clone git@github.com:XXX/mmocr.git
Add the source repository to upstream
git remote add upstream git@github.com:open-mmlab/mmocr
After your first PR
Checkout the latest branch of the local repository and pull the latest branch of the source repository. Here we assume that you are working on the dev-1.x branch.
git checkout dev-1.x
git pull upstream dev-1.x
2. Checkout a new branch from the dev-1.x branch¶
git checkout -b branchname
Tip
To make the commit history clear, we strongly recommend you checkout the dev-1.x branch before creating a new branch.
3. Commit your changes¶
If you are a first-time contributor, please install and initialize pre-commit hooks from the repository root directory first.
pip install -U pre-commit
pre-commit install
Commit your changes as usual. Pre-commit hooks will be triggered to stylize your code before each commit.
# coding
git add [files]
git commit -m 'messages'
Note
Sometimes your code may be changed by pre-commit hooks. In this case, please remember to re-stage the modified files and commit again.
4. Push your changes to the forked repository and create a PR¶
Push the branch to your forked remote repository
git push origin branchname
Create a PR
Revise the PR message template to describe your motivation and the modifications made in this PR. You can also link the related issue to the PR manually in the PR message (for more information, check out the official guidance).
Specifically, if you are contributing to dev-1.x, you will have to change the base branch of the PR to dev-1.x in the PR page, since the default base branch is main.
You can also ask a specific person to review the changes you've proposed.
5. Discuss and review your code¶
Modify your codes according to reviewers’ suggestions and then push your changes.
6. Merge your branch into the dev-1.x branch and delete the branch¶
After the PR is merged by the maintainer, you can delete the branch you created in your forked repository.
git branch -d branchname # delete local branch
git push origin --delete branchname # delete remote branch
PR Specs¶
Use pre-commit hooks to avoid code style issues
One short-lived branch should be matched with only one PR
Accomplish one focused change in one PR. Avoid large PRs
Bad: Support Faster R-CNN
Acceptable: Add a box head to Faster R-CNN
Good: Add a parameter to box head to support custom conv-layer number
Provide clear and meaningful commit messages
Provide a clear and meaningful PR description
The task name should be clarified in the title. The general format is: [Prefix] Short description of the PR (Suffix)
Prefix: new feature [Feature], bug fix [Fix], documentation-related [Docs], work in progress [WIP] (which will not be reviewed for the time being)
Introduce the main changes, results, and impact on other modules in the short description
Associate related issues and pull requests with a milestone
Changelog of v1.x¶
v1.0.0 (04/06/2023)¶
We are excited to announce the first official release of MMOCR 1.0, with numerous enhancements, bug fixes, and the introduction of new dataset support!
🌟 Highlights¶
Support for SCUT-CTW1500, SynthText, and MJSynth datasets
Updated FAQ and documentation
Deprecation of file_client_args in favor of backend_args
Added a new MMOCR tutorial notebook
🆕 New Features & Enhancement¶
Add SCUT-CTW1500 by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1677
Cherry Pick #1205 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1774
Make lanms-neo optional by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1772
SynthText by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1779
Deprecate file_client_args and use backend_args instead by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1765
MJSynth by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1791
Add MMOCR tutorial notebook by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1771
decouple batch_size to det_batch_size, rec_batch_size and kie_batch_size in MMOCRInferencer by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1801
Accepts local-rank in train.py and test.py by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1806
update stitch_boxes_into_lines by @cherryjm in https://github.com/open-mmlab/mmocr/pull/1824
Add tests for pytorch 2.0 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1836
📝 Docs¶
FAQ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1773
Remove LoadImageFromLMDB from docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1767
Mark projects in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1766
add opendatalab download link by @jorie-peng in https://github.com/open-mmlab/mmocr/pull/1753
Fix some deadlinks in the docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1469
Fix quick run by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1775
Dataset by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1782
Update faq by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1817
more social network links by @fengshiwest in https://github.com/open-mmlab/mmocr/pull/1818
Update docs after branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1834
🛠️ Bug Fixes:¶
Place dicts to .mim by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1781
Test svtr_small instead of svtr_tiny by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1786
Add pse weight to metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1787
Synthtext metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1788
Clear up some unused scripts by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1798
if dst not exists, when move a single file may raise a file not exists error. by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1803
CTW1500 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1814
MJSynth & SynthText Dataset Preparer config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1805
Use poly_intersection instead of poly.intersection to avoid sup… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1811
Abinet: fix ValueError: Blur limit must be odd when centered=True. Got: (3, 6) by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1821
Bug generated during kie inference visualization by @Yangget in https://github.com/open-mmlab/mmocr/pull/1830
Revert sync bn in inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1832
Fix mmdet digit version by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1840
🎉 New Contributors¶
@jorie-peng made their first contribution in https://github.com/open-mmlab/mmocr/pull/1753
@hugotong6425 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1801
@fengshiwest made their first contribution in https://github.com/open-mmlab/mmocr/pull/1818
@cherryjm made their first contribution in https://github.com/open-mmlab/mmocr/pull/1824
@Yangget made their first contribution in https://github.com/open-mmlab/mmocr/pull/1830
Thank you to all the contributors for making this release possible! We’re excited about the new features and enhancements in this version, and we’re looking forward to your feedback and continued support. Happy coding! 🚀
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc6...v1.0.0
v1.0.0rc6 (03/07/2023)¶
Highlights¶
Two new models, ABCNet v2 (inference only) and SPTS, are added to the projects/ folder.
Announcing Inferencer, a unified inference interface in OpenMMLab for everyone's easy access and quick inference with all the pre-trained weights. Docs (a short usage sketch follows this list)
Users can use test-time augmentation for text recognition tasks. Docs
Support batch augmentation through BatchAugSampler, which is a technique used in SPTS.
Dataset Preparer has been refactored to allow more flexible configurations. Besides, users are now able to prepare text recognition datasets in LMDB formats. Docs
Some textspotting datasets have been revised to enhance the correctness and consistency with the common practice.
Potential spurious warnings from shapely have been eliminated.
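To give a quick feel for the unified inference interface mentioned above, here is a minimal usage sketch of MMOCRInferencer (the entry point referenced elsewhere in this changelog); the model aliases 'DBNet'/'CRNN' and the demo image path are illustrative assumptions, and any config/checkpoint pair can be passed instead.
from mmocr.apis import MMOCRInferencer

# Build a joint detection + recognition inferencer from model aliases.
infer = MMOCRInferencer(det='DBNet', rec='CRNN')  # alias names are illustrative

# Run end-to-end OCR on an image; the returned dict carries the predictions.
results = infer('demo/demo_text_ocr.jpg')  # path is a placeholder
print(results['predictions'])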
Dependency¶
This version requires MMEngine >= 0.6.0, MMCV >= 2.0.0rc4 and MMDet >= 3.0.0rc5.
New Features & Enhancements¶
Discard deprecated lmdb dataset format and only support img+label now by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1681
abcnetv2 inference by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1657
Add RepeatAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1678
SPTS by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1696
Refactor Inferencers by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1608
Dynamic return type for rescale_polygons by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1702
Revise upstream version limit by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1703
TextRecogCropConverter add crop with opencv warpPersepective function by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1667
change cudnn benchmark to false by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1705
Add ST-pretrained DB-series models and logs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1635
Only keep meta and state_dict when publish model by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1729
Rec TTA by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1401
Speedup formatting by replacing np.transpose with torch… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1719
Support auto import modules from registry. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1731
Support batch visualization & dumping in Inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1722
add a new argument font_properties to set a specific font file in order to draw Chinese characters properly by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1709
Refactor data converter and gather by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1707
Support batch augmentation through BatchAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1757
Put all registry into registry.py by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1760
train by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1756
configs for regression benchmark by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1755
Support lmdb format in Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1762
Docs¶
update the link of DBNet by @AllentDan in https://github.com/open-mmlab/mmocr/pull/1672
Add notice for default branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1693
docs: Add twitter discord medium youtube link by @vansin in https://github.com/open-mmlab/mmocr/pull/1724
Remove unsupported datasets in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1670
Bug Fixes¶
Update dockerfile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1671
Explicitly create np object array for compatibility by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1691
Fix a minor error in docstring by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1685
Fix lint by @triple-Mu in https://github.com/open-mmlab/mmocr/pull/1694
Fix LoadOCRAnnotation ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1695
Fix isort pre-commit error by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1697
Update owners by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1699
Detect intersection before using shapley.intersection to eliminate spurious warnings by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1710
Fix some inferencer bugs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1706
Fix textocr ignore flag by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1712
Add missing softmax in ASTER forward_test by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1718
Fix head in readme by @vansin in https://github.com/open-mmlab/mmocr/pull/1727
Fix some browse dataset script bugs and draw textdet gt instance with ignore flags by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1701
icdar textrecog ann parser skip data with ignore flag by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1708
bezier_to_polygon -> bezier2polygon by @double22a in https://github.com/open-mmlab/mmocr/pull/1739
Fix docs recog CharMetric P/R error definition by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1740
Remove outdated resources in demo/ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1747
Fix wrong ic13 textspotting split data; add lexicons to ic13, ic15 and totaltext by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1758
SPTS readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1761
New Contributors¶
@triple-Mu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1694
@double22a made their first contribution in https://github.com/open-mmlab/mmocr/pull/1739
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc5...v1.0.0rc6
v1.0.0rc5 (01/06/2023)¶
Highlights¶
Two models, Aster and SVTR, are added to our model zoo. The full implementation of ABCNet is also available now.
Dataset Preparer supports 5 more datasets: CocoTextV2, FUNSD, TextOCR, NAF, SROIE.
We have 4 more text recognition transforms, and two helper transforms. See https://github.com/open-mmlab/mmocr/pull/1646 https://github.com/open-mmlab/mmocr/pull/1632 https://github.com/open-mmlab/mmocr/pull/1645 for details.
The transform FixInvalidPolygon is getting smarter at dealing with invalid polygons, and is now capable of handling more weird annotations. As a result, a complete training cycle on the TotalText dataset can be performed bug-free. The weights of DBNet and FCENet pretrained on TotalText are also released.
New Features & Enhancements¶
Update ic15 det config according to DataPrepare by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1617
Refactor icdardataset metainfo to lowercase. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1620
Add ASTER Encoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1239
Add ASTER decoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1625
Add ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1238
Update ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1629
Support browse_dataset.py to visualize original dataset by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1503
Add CocoTextv2 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1514
Add Funsd to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1550
Add TextOCR to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1543
Refine example projects and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1628
Enhance FixInvalidPolygon, add RemoveIgnored transform by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1632
ConditionApply by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1646
Add NAF to dataset preparer by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1609
Add SROIE to dataset preparer by @FerryHuang in https://github.com/open-mmlab/mmocr/pull/1639
Add svtr decoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1448
Add missing unit tests by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1651
Add svtr encoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1483
ABCNet train by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1610
Totaltext cfgs for DB and FCE by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1633
Add Aliases to models by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1611
SVTR transforms by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1645
Add SVTR framework and configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1621
Issue Template by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1663
Docs¶
Add Chinese translation for browse_dataset.py by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1647
updata abcnet doc by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1658
update the dbnetpp`s readme file by @zhuyue66 in https://github.com/open-mmlab/mmocr/pull/1626
Inferencer docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1744
Bug Fixes¶
nn.SmoothL1Loss beta can not be zero in PyTorch 1.13 version by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1616
ctc loss bug if target is empty by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1618
Add torch 1.13 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1619
Remove outdated tutorial link by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1627
Dev 1.x some doc mistakes by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1630
Support custom font to visualize some languages (e.g. Korean) by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1567
db_module_loss,negative number encountered in sqrt by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1640
Use int instead of np.int by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1636
Remove support for py3.6 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1660
New Contributors¶
@zhuyue66 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1626
@KevinNuNu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1630
@FerryHuang made their first contribution in https://github.com/open-mmlab/mmocr/pull/1639
@willpat1213 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1448
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc4...v1.0.0rc5
v1.0.0rc4 (12/06/2022)¶
Highlights¶
Dataset Preparer can automatically generate base dataset configs at the end of the preparation process, and supports 6 more datasets: IIIT5k, CUTE80, ICDAR2013, ICDAR2015, SVT, SVTP.
Introducing our projects/ folder: implementing new models and features into OpenMMLab's algorithm libraries has long been complained about as troublesome due to the rigorous requirements on code quality, which could hinder the fast iteration of SOTA models and might discourage community members from sharing their latest outcomes here. We now introduce the projects/ folder, where some experimental features, frameworks and models can be placed, only needing to satisfy the minimum requirements on code quality. Everyone is welcome to post their implementation of any great ideas in this folder! We also add the first example project to illustrate what we expect a good project to have (check out the raw content of README.md for more info!).
Inside the projects/ folder, we are releasing the preview version of ABCNet, which is the first implementation of text spotting models in MMOCR. It's inference-only now, but the full implementation will be available very soon.
New Features & Enhancements¶
Add SVT to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1521
Polish bbox2poly by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1532
Add SVTP to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1523
Iiit5k converter by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1530
Add cute80 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1522
Add IC13 preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1531
Add ‘Projects/’ folder, and the first example project by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1524
Rename to {dataset-name}_task_train/test by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1541
Add print_config.py to the tools by @IncludeMathH in https://github.com/open-mmlab/mmocr/pull/1547
Add get_md5 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1553
Add config generator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1552
Support IC15_1811 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1556
Update CT80 config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1555
Add config generators to all textdet and textrecog configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1560
Refactor TPS by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1240
Add TextSpottingConfigGenerator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1561
Add common typing by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1596
Update textrecog config and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1597
Support head loss or postprocessor is None for only infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1594
Textspotting datasample by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1593
Simplify mono_gather by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1588
ABCNet v1 infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1598
Docs¶
Add Chinese Guidance on How to Add New Datasets to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1506
Update the qq group link by @vansin in https://github.com/open-mmlab/mmocr/pull/1569
Collapse some sections; update logo url by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1571
Update dataset preparer (CN) by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1591
Bug Fixes¶
Fix two bugs in dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1513
Register bug of CLIPResNet by @jyshee in https://github.com/open-mmlab/mmocr/pull/1517
Being more conservative on Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1520
python -m pip upgrade in windows by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1525
Fix wildreceipt metafile by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1528
Fix Dataset Preparer Extract by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1527
Fix ICDARTxtParser by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1529
Fix Dataset Zoo Script by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1533
Fix crop without padding and recog metainfo delete unuse info by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1526
Automatically create nonexistent directory for base configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1535
Change mmcv.dump to mmengine.dump by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1540
mmocr.utils.typing -> mmocr.utils.typing_utils by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1538
Wildreceipt tests by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1546
Fix judge exist dir by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1542
Fix IC13 textdet config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1563
Fix IC13 textrecog annotations by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1568
Auto scale lr by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1584
Fix icdar data parse for text containing separator by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1587
Fix textspotting ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1599
Fix TextSpottingConfigGenerator and TextSpottingDataConverter by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1604
Keep E2E Inferencer output simple by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1559
New Contributors¶
@jyshee made their first contribution in https://github.com/open-mmlab/mmocr/pull/1517
@ProtossDragoon made their first contribution in https://github.com/open-mmlab/mmocr/pull/1540
@IncludeMathH made their first contribution in https://github.com/open-mmlab/mmocr/pull/1547
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc3...v1.0.0rc4
v1.0.0rc3 (11/03/2022)¶
Highlights¶
We release several pretrained models using oCLIP-ResNet as the backbone, which is a ResNet variant trained with oCLIP and can significantly boost the performance of text detection models.
Preparing datasets is troublesome and tedious, especially in OCR domain where multiple datasets are usually required. In order to free our users from laborious work, we designed a Dataset Preparer to help you get a bunch of datasets ready for use, with only one line of command! Dataset Preparer is also crafted to consist of a series of reusable modules, each responsible for handling one of the standardized phases throughout the preparation process, shortening the development cycle on supporting new datasets.
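As an illustration of the one-line workflow, preparing a dataset typically looks like the command below; the dataset name and --task value are examples, and the supported options are listed in the Dataset Preparer documentation.
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet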
New Features & Enhancements¶
Add Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1484
support modified resnet structure used in oCLIP by @HannibalAPE in https://github.com/open-mmlab/mmocr/pull/1458
Add oCLIP configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1509
Docs¶
Update install.md by @rogachevai in https://github.com/open-mmlab/mmocr/pull/1494
Refine some docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1455
Update some dataset preparer related docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1502
oclip readme by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1505
Bug Fixes¶
Fix offline_eval error caused by new data flow by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1500
New Contributors¶
@rogachevai made their first contribution in https://github.com/open-mmlab/mmocr/pull/1494
@HannibalAPE made their first contribution in https://github.com/open-mmlab/mmocr/pull/1458
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc2...v1.0.0rc3
v1.0.0rc2 (10/14/2022)¶
This release relaxes the version requirement of MMEngine to >=0.1.0, <1.0.0.
v1.0.0rc1 (10/09/2022)¶
Highlights¶
This release fixes a severe bug leading to inaccurate metric reports in multi-GPU training.
We release the weights for all the text recognition models in the MMOCR 1.0 architecture. The inference shorthands for them have also been added back to ocr.py. Besides, more documentation chapters are available now.
New Features & Enhancements¶
Simplify the Mask R-CNN config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1391
auto scale lr by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1326
Update paths to pretrain weights by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1416
Streamline duplicated split_result in pan_postprocessor by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1418
Update model links in ocr.py and inference.md by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1431
Update rec configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1417
Visualizer refine by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1411
Support get flops and parameters in dev-1.x by @vansin in https://github.com/open-mmlab/mmocr/pull/1414
Docs¶
intersphinx and api by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1367
Fix quickrun by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1374
Fix some docs issues by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1385
Add Documents for DataElements by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1381
config english by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1372
Metrics by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1399
Add version switcher to menu by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1407
Data Transforms by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1392
Fix inference docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1415
Fix some docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1410
Add maintenance plan to migration guide by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1413
Update Recog Models by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1402
Bug Fixes¶
clear metric.results only done in main process by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1379
Fix a bug in MMDetWrapper by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1393
Fix browse_dataset.py by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1398
ImgAugWrapper: Do not cilp polygons if not applicable by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1231
Fix CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1365
Fix merge stage test by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1370
Del CI support for torch 1.5.1 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1371
Test windows cu111 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1373
Fix windows CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1387
Upgrade pre commit hooks by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1429
Skip invalid augmented polygons in ImgAugWrapper by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1434
New Contributors¶
@vansin made their first contribution in https://github.com/open-mmlab/mmocr/pull/1414
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc0...v1.0.0rc1
v1.0.0rc0 (09/01/2022)¶
We are excited to announce the release of MMOCR 1.0.0rc0. MMOCR 1.0.0rc0 is the first version of MMOCR 1.x, a part of the OpenMMLab 2.0 projects. Built upon the new training engine, MMOCR 1.x unifies the interfaces of dataset, models, evaluation, and visualization with faster training and testing speed.
Highlights¶
New engines. MMOCR 1.x is based on MMEngine, which provides a general and powerful runner that allows more flexible customizations and significantly simplifies the entrypoints of high-level interfaces.
Unified interfaces. As a part of the OpenMMLab 2.0 projects, MMOCR 1.x unifies and refactors the interfaces and internal logics of train, testing, datasets, models, evaluation, and visualization. All the OpenMMLab 2.0 projects share the same design in those interfaces and logics to allow the emergence of multi-task/modality algorithms.
Cross project calling. Benefiting from the unified design, you can use the models implemented in other OpenMMLab projects, such as MMDet. We provide an example of how to use MMDetection's Mask R-CNN through MMDetWrapper. Check our documents for more details. More wrappers will be released in the future.
Stronger visualization. We provide a series of useful tools which are mostly based on brand-new visualizers. As a result, it is more convenient for users to explore the models and datasets now.
More documentation and tutorials. We add a bunch of documentation and tutorials to help users get started more smoothly. Read it here.
Breaking Changes¶
We briefly list the major breaking changes here. We will update the migration guide to provide complete details and migration instructions.
Dependencies¶
MMOCR 1.x relies on MMEngine to run. MMEngine is a new foundational library for training deep learning models in OpenMMLab 2.0. The dependencies of file IO and training are migrated from MMCV 1.x to MMEngine.
MMOCR 1.x relies on MMCV>=2.0.0rc0. Although MMCV no longer maintains the training functionalities since 2.0.0rc0, MMOCR 1.x relies on the data transforms, CUDA operators, and image processing interfaces in MMCV. Note that since MMCV 2.0.0rc0, the package mmcv is the version that provides pre-built CUDA operators while mmcv-lite does not, and mmcv-full has been deprecated.
Training and testing¶
MMOCR 1.x uses the Runner in MMEngine rather than that in MMCV. The new Runner implements and unifies the building logic of dataset, model, evaluation, and visualizer. Therefore, MMOCR 1.x no longer maintains the building logic of those modules in mmocr.train.apis and tools/train.py. That code has been migrated into MMEngine. Please refer to the migration guide of Runner in MMEngine for more details.
The Runner in MMEngine also supports testing and validation. The testing scripts are also simplified, and follow a similar logic to the training scripts to build the runner.
The execution points of hooks in the new Runner have been enriched to allow more flexible customization. Please refer to the migration guide of Hook in MMEngine for more details.
Learning rate and momentum scheduling has been migrated from Hook to Parameter Scheduler in MMEngine. Please refer to the migration guide of Parameter Scheduler in MMEngine for more details.
Configs¶
The Runner in MMEngine uses a different config structure to ease the understanding of the components in the runner. Users can read the config example of MMOCR or refer to the migration guide in MMEngine for migration details.
The file names of configs and models are also refactored to follow the new rules unified across OpenMMLab 2.0 projects. Please refer to the user guides of config for more details.
Dataset¶
The Dataset classes implemented in MMOCR 1.x all inherit from BaseDetDataset, which inherits from the BaseDataset in MMEngine. There are several changes to Dataset in MMOCR 1.x.
All the datasets support serializing the data list to reduce memory usage when multiple workers are built to accelerate data loading.
The interfaces are changed accordingly.
Data Transforms¶
The data transforms in MMOCR 1.x all inherit from those in MMCV>=2.0.0rc0, which follows a new convention in OpenMMLab 2.0 projects. The changes are listed below:
The interfaces are also changed. Please refer to the API Reference.
The functionality of some data transforms (e.g., Resize) is decomposed into several transforms.
The same data transforms in different OpenMMLab 2.0 libraries have the same augmentation implementation and argument logic, i.e., Resize in MMDet 3.x and MMOCR 1.x will resize the image in the exact same manner given the same arguments.
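To illustrate the new convention, a minimal text detection pipeline might look like the following sketch; the transform names come from MMOCR/MMCV 2.x, but the specific arguments are illustrative and should be checked against the API Reference.
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    # Resize now only handles resizing; padding and other steps live in dedicated transforms.
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='PackTextDetInputs'),
]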
Model¶
The models in MMOCR 1.x all inherit from BaseModel in MMEngine, which defines a new convention of models in OpenMMLab 2.0 projects. Users can refer to the tutorial of model in MMEngine for more details. Accordingly, there are several changes as follows:
The model interfaces, including the input and output formats, are significantly simplified and unified following the new convention in MMOCR 1.x. Specifically, all the input data in training and testing are packed into inputs and data_samples, where inputs contains model inputs like a list of image tensors, and data_samples contains other information of the current data sample such as ground truths and model predictions. In this way, different tasks in MMOCR 1.x can share the same input arguments, which makes the models more general and suitable for multi-task learning.
The model has a data preprocessor module, which is used to pre-process the input data of the model. In MMOCR 1.x, the data preprocessor usually does the necessary steps to form the input images into a batch, such as padding. It can also serve as a place for some special data augmentations or more efficient data transformations like normalization.
The internal logic of the model has been changed. In MMOCR 0.x, models used forward_train and simple_test to deal with different forward logics. In MMOCR 1.x and OpenMMLab 2.0, the forward function has three modes: loss, predict, and tensor, for training, inference, and tracing or other purposes, respectively. The forward function calls self.loss(), self.predict(), and self._forward() given the modes loss, predict, and tensor, respectively.
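Purely as an illustration of this three-mode convention (not MMOCR's actual base class), a toy model's forward dispatch could look like the sketch below; only the mode names and the loss()/predict()/_forward() split come from the description above, and everything else is made up for the example.
import torch
from torch import nn

class ToyTextModel(nn.Module):
    """Toy model following the three-mode forward convention of OpenMMLab 2.0."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 2)

    def forward(self, inputs, data_samples=None, mode='tensor'):
        # 'loss' -> training, 'predict' -> inference, 'tensor' -> raw outputs (e.g. for tracing).
        if mode == 'loss':
            return self.loss(inputs, data_samples)
        if mode == 'predict':
            return self.predict(inputs, data_samples)
        return self._forward(inputs)

    def loss(self, inputs, data_samples):
        logits = self.backbone(inputs)
        targets = torch.stack([ds['gt_label'] for ds in data_samples])
        return {'loss_ce': nn.functional.cross_entropy(logits, targets)}

    def predict(self, inputs, data_samples):
        preds = self.backbone(inputs).argmax(dim=-1)
        for ds, pred in zip(data_samples, preds):
            ds['pred_label'] = pred.item()
        return data_samples

    def _forward(self, inputs):
        return self.backbone(inputs)

model = ToyTextModel()
images = torch.randn(4, 8)
samples = [{'gt_label': torch.tensor(i % 2)} for i in range(4)]
print(model(images, samples, mode='loss'))     # dict of loss tensors
print(model(images, samples, mode='predict'))  # data samples with predictions attached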
Evaluation¶
MMOCR 1.x mainly implements corresponding metrics for each task, which are manipulated by the Evaluator to complete the evaluation. In addition, users can build an evaluator in MMOCR 1.x to conduct offline evaluation, i.e., evaluate predictions that may not have been produced by MMOCR, as long as the predictions follow our dataset conventions. More details can be found in the Evaluation Tutorial in MMEngine.
Visualization¶
The visualization functions from MMOCR 0.x are removed. Instead, in OpenMMLab 2.0 projects, we use Visualizer to visualize data. MMOCR 1.x implements TextDetLocalVisualizer, TextRecogLocalVisualizer, and KIELocalVisualizer to allow visualization of ground truths, model predictions, feature maps, etc., at any place, for the three tasks supported in MMOCR. It also supports dumping the visualization data to external visualization backends such as TensorBoard and Wandb. Check our Visualization Document for more details.
Improvements¶
Most models enjoy a performance improvement from the new framework and refactor of data transforms. For example, in MMOCR 1.x, DBNet-R50 achieves 0.854 hmean score on ICDAR 2015, while the counterpart can only get 0.840 hmean score in MMOCR 0.x.
Support mixed precision training of most of the models. However, the remaining models are not supported yet because the operators they use might not be representable in fp16. We will update the documentation and list the results of mixed precision training.
Ongoing changes¶
Test-time augmentation, which was supported in MMOCR 0.x, is not implemented yet in this version due to the limited time slot. We will support it in the following releases with a new and simplified design.
Inference interfaces: a unified inference interface will be supported in the future to ease the use of released models.
Interfaces of useful tools that can be used in notebooks: more useful tools implemented in the tools/ directory will have their Python interfaces so that they can be used through notebooks and in downstream libraries.
Documentation: we will add more design docs, tutorials, and migration guidance so that the community can deep dive into our new design, participate in future development, and smoothly migrate downstream libraries to MMOCR 1.x.
Overview¶
Along with the release of OpenMMLab 2.0, MMOCR 1.0 made many significant changes, resulting in less redundant, more efficient code and a more consistent overall design. However, these changes break backward compatibility. We understand that with such huge changes, it is not easy for users familiar with the old version to adapt to the new version. Therefore, we prepared a detailed migration guide to make the transition as smooth as possible so that all users can enjoy the productivity benefits of the new MMOCR and the entire OpenMMLab 2.0 ecosystem.
Warning
MMOCR 1.0 depends on the new foundational library for training deep learning models MMEngine, and therefore has an entirely different dependency chain compared with MMOCR 0.x. Even if you have a well-rounded MMOCR 0.x environment before, you still need to create a new python environment for MMOCR 1.0. We provide a detailed installation guide for reference.
Next, please read the sections according to your requirements.
Read What’s new in MMOCR 1.x to learn about the new features and changes in MMOCR 1.x.
If you want to migrate a model trained in version 0.x to use it directly in version 1.0, please read Pretrained Model Migration.
If you want to train the model, please read Dataset Migration and Data Transform Migration.
If you want to develop on MMOCR, please read Code Migration, Branch Migration and Upstream Library Changes.
As shown in the following figure, the maintenance plan of MMOCR 1.x version is mainly divided into three stages, namely “RC Period”, “Compatibility Period” and “Maintenance Period”. For old versions, we will no longer add major new features. Therefore, we strongly recommend users to migrate to MMOCR 1.x version as soon as possible.
What’s New in MMOCR 1.x¶
Here are some highlights of MMOCR 1.x compared to 0.x.
New engines. MMOCR 1.x is based on MMEngine, which provides a general and powerful runner that allows more flexible customizations and significantly simplifies the entrypoints of high-level interfaces.
Unified interfaces. As a part of the OpenMMLab 2.0 projects, MMOCR 1.x unifies and refactors the interfaces and internal logics of train, testing, datasets, models, evaluation, and visualization. All the OpenMMLab 2.0 projects share the same design in those interfaces and logics to allow the emergence of multi-task/modality algorithms.
Cross project calling. Benefiting from the unified design, you can use the models implemented in other OpenMMLab projects, such as MMDet. We provide an example of how to use MMDetection's Mask R-CNN through MMDetWrapper. Check our documents for more details. More wrappers will be released in the future.
Stronger visualization. We provide a series of useful tools which are mostly based on brand-new visualizers. As a result, it is more convenient for users to explore the models and datasets now.
More documentation and tutorials. We add a bunch of documentation and tutorials to help users get started more smoothly.
One-stop Dataset Preparation. Multiple datasets are instantly ready with only one line of command, via our Dataset Preparer.
Embracing more projects/: we now introduce the projects/ folder, where some experimental features, frameworks and models can be placed, only needing to satisfy the minimum requirements on code quality. Everyone is welcome to post their implementation of any great ideas in this folder! Learn more from our example project.
More models. MMOCR 1.0 supports more tasks and more state-of-the-art models!
Branch Migration¶
At an earlier stage, MMOCR had three branches: main, 1.x, and dev-1.x. Some of these branches have been renamed together with the official MMOCR 1.0.0 release, and here is the changelog.
The main branch housed the code for MMOCR 0.x (e.g., v0.6.3). Now it has been renamed to 0.x.
1.x contained the code for MMOCR 1.x (e.g., 1.0.0rc6). Now it is an alias of main, and will be removed in mid-2023.
dev-1.x was the development branch for MMOCR 1.x. It remains unchanged.
For more information about the branches, check out branches.
Resolving Conflicts When Upgrading the main branch¶
For users who wish to upgrade from the old main branch that has the code for MMOCR 0.x, the non-fast-forwardable nature of the upgrade may cause conflicts. To resolve these conflicts, follow the steps below:
Commit all the changes you have on main if you have any. Back up your current main branch by creating a copy.
git checkout main
git add --all
git commit -m 'backup'
git checkout -b main_backup
Fetch the latest changes from the remote repository.
git remote add openmmlab git@github.com:open-mmlab/mmocr.git
git fetch openmmlab
Reset the main branch to the latest main branch on the remote repository by running git reset --hard openmmlab/main.
git checkout main
git reset --hard openmmlab/main
By following these steps, you can successfully upgrade your main branch.
Code Migration¶
In order to accommodate the tasks of text detection, recognition, and key information extraction, the initial design of MMOCR came with a number of shortcomings. In this 1.0 release, MMOCR synchronizes its new model architecture to align as much as possible with the overall OpenMMLab design and to achieve structural uniformity within the algorithm library. Although this upgrade is not fully backward compatible, we summarize the changes that may be of interest to developers for those who need them.
Fundamental Changes¶
The functional boundaries of modules had not been clearly defined in MMOCR 0.x. In MMOCR 1.0, we address this issue by refactoring the design of model modules. Here are some major changes in 1.0:
MMOCR 1.0 no longer supports named entity recognition tasks since it’s not in the scope of OCR.
The module that computes the loss in a model is named Module Loss, which is also responsible for the conversion of gold annotations into loss targets. Another module, Postprocessor, is responsible for decoding the raw model output into DataSample for the corresponding task at prediction time.
The inputs of all models are now organized as a dictionary that consists of two keys: inputs, containing the original features of the images, and List[DataSample], containing the meta-information of the images. At training time, the output format of a model is standardized to a dictionary containing the loss tensors. Similarly, a model generates a sequence of DataSamples containing the prediction outputs in testing.
In MMOCR 0.x, the majority of classes named XXLoss have implementations closely bound to the corresponding model, while their names made it hard for users to tell them apart from other generic losses like DiceLoss. In 1.0, they are renamed to the form XXModuleLoss (e.g. DBLoss was renamed to DBModuleLoss). The key to their configurations in config files is also changed from loss to module_loss (see the config sketch after this list).
The names of generic loss classes that are not related to the model implementation are kept as XXLoss (e.g. MaskedBCELoss). They are all placed under mmocr/models/common/losses.
Changes under mmocr/models/common/losses: DiceLoss is renamed to MaskedDiceLoss. FocalLoss has been removed.
MMOCR 1.0 adds a Dictionary module which originates from label converter. It is used in text recognition and key information extraction tasks.
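For example, a 1.0-style DBNet head config nests the loss under module_loss; the field values below are illustrative rather than a verbatim MMOCR config, and the backbone, neck and data_preprocessor are omitted.
model = dict(
    type='DBNet',
    # backbone=..., neck=..., data_preprocessor=... omitted for brevity
    det_head=dict(
        type='DBHead',
        in_channels=256,
        # 0.x used loss=dict(type='DBLoss', ...); 1.0 uses module_loss instead.
        module_loss=dict(type='DBModuleLoss'),
        postprocessor=dict(type='DBPostprocessor', text_repr_type='quad')))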
Text Detection Models¶
Key Changes (TL;DR)¶
The model weights from MMOCR 0.x still work in 1.0, but the fields starting with bbox_head in the state dict state_dict need to be renamed to det_head (a renaming sketch follows this list).
XXTargets transforms, which were responsible for generating detection targets, have been merged into XXModuleLoss.
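A minimal sketch of this key renaming with plain PyTorch is shown below; the checkpoint file names are placeholders, and MMOCR may also provide its own conversion tooling.
import torch

ckpt = torch.load('dbnet_mmocr0x.pth', map_location='cpu')  # placeholder path
state_dict = ckpt.get('state_dict', ckpt)

# Rename every key that starts with 'bbox_head' to start with 'det_head' instead.
renamed = {
    ('det_head' + key[len('bbox_head'):] if key.startswith('bbox_head') else key): value
    for key, value in state_dict.items()
}

ckpt['state_dict'] = renamed
torch.save(ckpt, 'dbnet_mmocr1x.pth')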
SingleStageTextDetector¶
The original inheritance chain was mmdet.BaseDetector -> SingleStageDetector -> SingleStageTextDetector. Now SingleStageTextDetector is directly inherited from BaseDetector without extra dependency on MMDetection, and SingleStageDetector is deleted.
bbox_head is renamed to det_head.
train_cfg, test_cfg and pretrained fields are removed.
forward_train() and simple_test() are refactored to loss() and predict(). The part of simple_test() that was responsible for splitting the raw output of the model and feeding it into head.get_boundary() is integrated into BaseTextDetPostProcessor.
TextDetectorMixin has been removed since its implementation overlaps with TextDetLocalVisualizer.
Head¶
HeadMixin, the base class that XXXHead had to inherit from in version 0.x, has been replaced by BaseTextDetHead.
get_boundary() and resize_boundary() are now rewritten as __call__() and rescale() in BaseTextDetPostProcessor.
ModuleLoss¶
Data transforms XXXTargets in text detection tasks are all moved to XXXModuleLoss._get_target_single(). Target-related configurations are no longer specified in the data pipeline but in XXXLoss instead.
Postprocessor¶
The logic in the original XXXPostprocessor.__call__() is transferred to the refactored XXXPostprocessor.get_text_instances().
BasePostprocessor is refactored to BaseTextDetPostProcessor. This base class splits and processes the model output predictions one by one and supports automatic scaling of the output polygon or bounding box based on scale_factor.
Text Recognition¶
Key Changes (TL;DR)¶
Due to the change of the character order and the fixing of some bugs in the model architecture, the recognition model weights in 0.x can no longer be directly used in 1.0. We will provide a migration script and tutorial for those who need them.
The support of SegOCR has been removed. TPS-CRNN will still be supported in a later version.
Test time augmentation will be supported in the upcoming release.
Label converter module has been removed and its functions have been split into Dictionary, ModuleLoss and Postprocessor.
The definition of max_seq_len has been unified and now represents the original output length of the model.
Label Converter¶
The original label converters had spelling errors (written as label convertors). We fixed them by removing label converters from this project.
The part responsible for converting characters/strings to and from numeric indexes was extracted to Dictionary.
In older versions, different label converters would have different special character sets and character order. In version 0.x, the character order was as follows.
Converter | Character order |
---|---|
AttnConvertor, ABIConvertor | <UKN>, <BOS/EOS>, <PAD>, characters |
CTCConvertor | <BLK>, <UKN>, characters |
In 1.0, instead of designing different dictionaries and character orders for different tasks, we have a unified Dictionary implementation with the character order always as characters, <BOS/EOS>, <PAD>, <UKN>. <BLK> in CTCConvertor has been equivalently replaced by <PAD>.
Label convertor originally supported three ways to initialize dictionaries: dict_type, dict_file and dict_list, which are now reduced to dict_file only in Dictionary. Also, we have put those pre-defined character sets originally supported in dict_type into the dicts/ directory now. The corresponding mapping is as follows, and a config sketch using dict_file follows the table:
MMOCR 0.x: dict_type | MMOCR 1.0: Dict path |
---|---|
DICT90 | dicts/english_digits_symbols.txt |
DICT91 | dicts/english_digits_symbols_space.txt |
DICT36 | dicts/lower_english_digits.txt |
DICT37 | dicts/lower_english_digits_space.txt |
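As an illustration, a Dictionary built from one of these files could be configured as follows; the keyword arguments are assumptions about the Dictionary interface and should be verified against the API reference.
dictionary = dict(
    type='Dictionary',
    dict_file='dicts/lower_english_digits.txt',  # one of the paths from the table above
    with_start=True,
    with_end=True,
    with_padding=True,
    with_unknown=True)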
The implementation of str2tensor() in label converter has been moved to ModuleLoss.get_targets(). The following table shows the correspondence between the old and new method implementations. Note that the old and new implementations are not identical.
MMOCR 0.x | MMOCR 1.0 | Note |
---|---|---|
ABIConvertor.str2tensor(), AttnConvertor.str2tensor() | BaseTextRecogModuleLoss.get_targets() | The different implementations between ABIConvertor.str2tensor() and AttnConvertor.str2tensor() have been unified in the new version. |
CTCConvertor.str2tensor() | CTCModuleLoss.get_targets() | |
The implementation of tensor2idx() in label converter has been moved to Postprocessor.get_single_prediction(). The following table shows the correspondence between the old and new method implementations. Note that the old and new implementations are not identical.
MMOCR 0.x | MMOCR 1.0 |
---|---|
ABIConvertor.tensor2idx(), AttnConvertor.tensor2idx() | AttentionPostprocessor.get_single_prediction() |
CTCConvertor.tensor2idx() | CTCPostProcessor.get_single_prediction() |
Key Information Extraction¶
Key Changes (TL;DR)¶
Due to changes in the inputs to the model, the model weights obtained in 0.x can no longer be directly used in 1.0.
KIEDataset & OpensetKIEDataset¶
The part that reads data is kept in WildReceiptDataset.
The part that additionally processes the nodes and edges is moved to LoadKIEAnnotation.
The part that uses dictionaries to transform text is moved to SDMGRHead.convert_text(), with the help of Dictionary.
The part of compute_relation() that computes the relationships between text boxes is moved to SDMGRHead.compute_relations(). It's now done inside the model.
The part that evaluates the model performance is done in F1Metric.
The part of OpensetKIEDataset that processes the model's edge outputs is moved to SDMGRPostProcessor.
SDMGR¶
show_result() is integrated into KIEVisualizer.
The part of forward_test() that post-processes the output is organized in SDMGRPostProcessor.
Utils Migration¶
Utility functions are now grouped together under mmocr/utils/. Here are the scopes of the files in this directory (a short import sketch follows the list):
bbox_utils.py: bounding box related functions.
check_argument.py: used to check argument type.
collect_env.py: used to collect running environment.
data_converter_utils.py: used for data format conversion.
fileio.py: file input and output related functions.
img_utils.py: image processing related functions.
mask_utils.py: mask related functions.
ocr.py: used for MMOCR inference.
parsers.py: used for parsing datasets.
polygon_utils.py: polygon related functions.
setup_env.py: used to initialize MMOCR.
string_utils.py: string related functions.
typing.py: defines the abbreviation of types used in MMOCR.
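For instance, the box/polygon helpers live in bbox_utils.py and polygon_utils.py; assuming they are re-exported at package level (as is common in OpenMMLab projects), a usage sketch looks like this. If the re-export is unavailable in your version, import from the specific submodule instead.
from mmocr.utils import bbox2poly  # assumed re-export from bbox_utils.py

poly = bbox2poly([0, 0, 10, 20])  # [x1, y1, x2, y2] -> 8-point polygon
print(poly)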
Dataset Migration¶
Based on the new design of BaseDataset in MMEngine, we have refactored the base OCR dataset class OCRDataset in MMOCR 1.0. The following document describes the differences between the old and new dataset formats in MMOCR, and how to migrate from the deprecated version to the latest. For users who do not want to migrate datasets at this time, we also provide a temporary solution in Section Compatibility.
Note
The Key Information Extraction task still uses the original WildReceipt dataset annotation format.
Review of Old Dataset Formats¶
MMOCR version 0.x implements a number of dataset classes, such as IcdarDataset and TextDetDataset for text detection tasks, and OCRDataset and OCRSegDataset for text recognition tasks. At the same time, the annotations may come in different formats, such as .txt, .json, and .jsonl. Users have to manually configure the Loader and the Parser while customizing the datasets.
Text Detection¶
For the text detection task, IcdarDataset uses a COCO-like annotation format.
{
"images": [
{
"id": 1,
"width": 800,
"height": 600,
"file_name": "test.jpg"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [0,0,10,10],
"segmentation": [
[0,0,10,0,10,10,0,10]
],
"area": 100,
"iscrowd": 0
}
]
}
The TextDetDataset uses the JSON Line storage format, converting COCO-like labels to strings and saving them in .txt or .jsonl files.
{"file_name": "test/img_2.jpg", "height": 720, "width": 1280, "annotations": [{"iscrowd": 0, "category_id": 1, "bbox": [602.0, 173.0, 33.0, 24.0], "segmentation": [[602, 173, 635, 175, 634, 197, 602, 196]]}, {"iscrowd": 0, "category_id": 1, "bbox": [734.0, 310.0, 58.0, 54.0], "segmentation": [[734, 310, 792, 320, 792, 364, 738, 361]]}]}
{"file_name": "test/img_5.jpg", "height": 720, "width": 1280, "annotations": [{"iscrowd": 1, "category_id": 1, "bbox": [405.0, 409.0, 32.0, 52.0], "segmentation": [[408, 409, 437, 436, 434, 461, 405, 433]]}, {"iscrowd": 1, "category_id": 1, "bbox": [435.0, 434.0, 8.0, 33.0], "segmentation": [[437, 434, 443, 440, 441, 467, 435, 462]]}]}
Text Recognition¶
For text recognition tasks, there are two annotation formats in MMOCR version 0.x. The simple .txt annotations separate the image name and the word annotation with a blank space, which cannot handle the case where spaces are included in a text instance.
img1.jpg OpenMMLab
img2.jpg MMOCR
The JSON Line format uses a dictionary-like structure to represent the annotations, where the keys filename and text store the image name and word label, respectively.
{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}
New Dataset Format¶
To solve the dataset issues, MMOCR 1.x adopts a unified dataset design introduced in MMEngine. Each annotation file is a .json file that stores a dict containing both metainfo and data_list, where the former includes basic information about the dataset and the latter consists of the label item of each target instance.
{
"metainfo":
{
"classes": ("cat", "dog"),
// ...
},
"data_list":
[
{
"img_path": "xxx/xxx_0.jpg",
"img_label": 0,
// ...
},
// ...
]
}
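Because the annotation is plain JSON, it can be inspected without any MMOCR code. A minimal sketch (the file path is hypothetical):

import json

# Inspect a new-style annotation file (hypothetical path).
with open('data/my_dataset/annotation.json') as f:
    anno = json.load(f)

print(anno['metainfo'])           # dataset-level information
for item in anno['data_list']:    # one entry per sample
    print(item['img_path'])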
Based on the above structure, we introduced TextDetDataset and TextRecogDataset for MMOCR-specific tasks.
Text Detection¶
Introduction of the New Format¶
The TextDetDataset holds the information required by the text detection task, such as bounding boxes and labels. We refer users to tests/data/det_toy_dataset/instances_test.json, which is an example annotation for TextDetDataset.
{
"metainfo":
{
"dataset_type": "TextDetDataset",
"task_name": "textdet",
"category": [{"id": 0, "name": "text"}]
},
"data_list":
[
{
"img_path": "test_img.jpg",
"height": 640,
"width": 640,
"instances":
[
{
"polygon": [0, 0, 0, 10, 10, 20, 20, 0],
"bbox": [0, 0, 10, 20],
"bbox_label": 0,
"ignore": False
},
// ...
]
}
]
}
The bounding box format is as follows: [min_x, min_y, max_x, max_y]
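An annotation file in this format is referenced directly from a dataset config. A minimal sketch with hypothetical paths (the filter settings shown are just commonly used values):

# Minimal sketch: a text detection dataset config reading a new-style
# annotation file (hypothetical paths).
my_textdet_train = dict(
    type='OCRDataset',
    data_root='data/det/my_dataset',
    ann_file='textdet_train.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)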
Migration Script¶
We provide a migration script to help users migrate old annotation files to the new format.
python tools/dataset_converters/textdet/data_migrator.py ${IN_PATH} ${OUT_PATH}
ARGS | Type | Description |
---|---|---|
in_path | str | (Required) Path to the old annotation file. |
out_path | str | (Required) Path to the new annotation file. |
--task | 'auto', 'textdet', 'textspotter' | Specifies the compatible task for the output dataset annotation. If 'textdet' is specified, the text field in COCO format will not be dumped. The default is 'auto', which automatically determines the output format based on the old annotation files. |
Text Recognition¶
Introduction of the New Format¶
The TextRecogDataset holds the information required by the text recognition task, such as the text and the image path. We refer users to tests/data/rec_toy_dataset/labels.json, which is an example annotation for TextRecogDataset.
{
"metainfo":
{
"dataset_type": "TextRecogDataset",
"task_name": "textrecog",
},
"data_list":
[
{
"img_path": "test_img.jpg",
"instances":
[
{
"text": "GRAND"
}
]
}
]
}
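As with detection, this annotation file is referenced from a dataset config. A minimal sketch with hypothetical paths:

# Minimal sketch: a text recognition dataset config reading a new-style
# annotation file (hypothetical paths).
my_textrecog_train = dict(
    type='OCRDataset',
    data_root='data/rec/my_dataset',
    data_prefix=dict(img_path='imgs/'),
    ann_file='textrecog_train.json',
    pipeline=None)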
Migration Script¶
We provide a migration script to help users migrate old annotation files to the new format.
python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH} --format ${txt, jsonl, lmdb}
ARGS | Type | Description |
---|---|---|
in_path | str | (Required) Path to the old annotation file. |
out_path | str | (Required) Path to the new annotation file. |
--format | 'txt', 'jsonl', 'lmdb' | Specify the format of the old dataset annotation. |
Compatibility¶
Considering the cost of data migration to users, we have temporarily made MMOCR version 1.x compatible with the old MMOCR 0.x annotation formats.
Note
The code and components used for compatibility with the old data format may be completely removed in a future release. Therefore, we strongly recommend that users migrate their datasets to the new data format.
Specifically, we provide three dataset classes, IcdarDataset, RecogTextDataset and RecogLMDBDataset, to support the old formats.
IcdarDataset supports COCO-like format annotations for text detection. You just need to add a new dataset config to configs/textdet/_base_/datasets and specify its dataset type as IcdarDataset.

data_root = 'data/det/icdar2015'
train_anno_path = 'instances_training.json'

train_dataset = dict(
    type='IcdarDataset',
    data_root=data_root,
    ann_file=train_anno_path,
    data_prefix=dict(img_path='imgs/'),
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)
RecogTextDataset supports .txt and .jsonl format annotations for text recognition. You just need to add a new dataset config to configs/textrecog/_base_/datasets and specify its dataset type as RecogTextDataset. For example, the following config shows how to load the 0.x format labels old_label.txt and old_label.jsonl from the toy dataset.

data_root = 'tests/data/rec_toy_dataset/'

# loading 0.x txt format annos
txt_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.txt',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineStrParser',
        keys=['filename', 'text'],
        keys_idx=[0, 1]),
    pipeline=[])

# loading 0.x json line format annos
jsonl_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.jsonl',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineJsonParser',
        keys=['filename', 'text']),
    pipeline=[])
RecogLMDBDataset supports LMDB-format datasets (images + labels) for text recognition. You just need to add a new dataset config to configs/textrecog/_base_/datasets and specify its dataset type as RecogLMDBDataset. For example, the following example shows how to configure and load both labels and images from the toy dataset's imgs.lmdb.

Set the dataset type to RecogLMDBDataset:

# Specify the dataset type as RecogLMDBDataset
data_root = 'tests/data/rec_toy_dataset/'

lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root=data_root,
    ann_file='imgs.lmdb',
    pipeline=None)

Replace LoadImageFromFile with LoadImageFromNDArray in the data pipelines, i.e. in train_pipeline and test_pipeline. For example:

train_pipeline = [dict(type='LoadImageFromNDArray')]
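For context, a complete recognition pipeline only swaps that first transform; the remaining transforms stay as in a regular config. A rough sketch using transform names from typical MMOCR 1.x recognition configs (the resize scale is a model-dependent placeholder):

train_pipeline = [
    dict(type='LoadImageFromNDArray'),                # images are read from the LMDB file
    dict(type='LoadOCRAnnotations', with_text=True),
    dict(type='Resize', scale=(100, 32), keep_ratio=False),  # placeholder size
    dict(type='PackTextRecogInputs',
         meta_keys=('img_path', 'ori_shape', 'img_shape', 'valid_ratio'))
]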
Pretrained Model Migration¶
Due to the extensive refactoring and fixing of the model structure in the new version, MMOCR 1.x does not support loading weights trained by the old version. We have updated the pre-trained weights and logs of all models on our website.
In addition, we are working on a weight migration tool for text detection tasks and plan to release it in the near future. Since the text recognition and key information extraction models have been modified too extensively and the migration would be lossy, we do not plan to support them for the time being. If you have specific requirements, please feel free to raise an Issue.
Data Transform Migration¶
Introduction¶
In MMOCR version 0.x, we implemented a series of data transform methods in mmocr/datasets/pipelines/xxx_transforms.py. However, these modules were scattered all over the place and lacked a standardized design. Therefore, we refactored all the data transform modules in MMOCR version 1.x. According to the task type, they are now defined in ocr_transforms.py, textdet_transforms.py, and textrecog_transforms.py, respectively, under mmocr/datasets/transforms. Specifically, ocr_transforms.py implements the data augmentation methods for OCR-related tasks in general, while textdet_transforms.py and textrecog_transforms.py implement data augmentation transforms related to text detection and text recognition tasks, respectively.
Since some of the modules were renamed, merged or separated during the refactoring process, the new interfaces and default parameters may be inconsistent with the old version. Therefore, this migration guide introduces how to configure the new data transforms to achieve behavior identical to the old version.
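Regardless of which of the three files defines a transform, all of them are registered under the same package and referred to by class name in pipeline configs. A minimal import sketch (the three transforms are just examples of each group, assuming the package-level re-exports stay in place):

# The refactored transforms all live under mmocr/datasets/transforms.
from mmocr.datasets.transforms import (
    RandomRotate,       # defined in ocr_transforms.py
    TextDetRandomCrop,  # defined in textdet_transforms.py
    RescaleToHeight,    # defined in textrecog_transforms.py
)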
Configuration Migration Guide¶
mmocr.apis¶
Inferencers¶
MMOCR Inferencer.
Text Detection inferencer.
Text Recognition inferencer.
Text Spotting inferencer.
Key Information Extraction Inferencer.
mmocr.structures¶
A data structure interface of MMOCR.
A data structure interface of MMOCR for text recognition.
A data structure interface of MMOCR.
mmocr.datasets¶
Datasets¶
OCRDataset for text detection and text recognition.
WildReceipt Dataset for key information extraction.
Compatible Datasets¶
Dataset for text detection while ann_file in coco format.
RecogLMDBDataset for text recognition.
RecogTextDataset for text recognition.
Dataset Wrapper¶
A wrapper of concatenated dataset.
mmocr.datasets.transforms¶
Loading¶
Load an image from file.
Load and process the …
Load and process the …
Load the image in Inferencer's pipeline.
TextDet Transforms¶
First randomly rescale the image so that the longside and shortside of the image are around the bound; then jitter its aspect ratio.
Flip the image & bbox polygon.
Pad Image to target size.
First rescale the image for its shorter side to reach the short_size and then jitter its aspect ratio, final rescale the shape guaranteed to be divided by scale_divisor.
Randomly select a region and crop images to a target size and make sure to contain text region.
Random crop and flip a patch in the image.
TextRecog Transforms¶
A general geometric augmentation tool for text images in the CVPR 2020 paper “Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition”.
Randomly crop the image’s height, either from top or bottom.
Jitter the image contents.
Reverse image pixels.
Resize the image to the base shape, downsample it with gaussian pyramid, and rescale it back to original size.
Only pad the image’s width.
Rescale the image to the height according to setting and keep the aspect ratio unchanged if possible.
OCR Transforms¶
Randomly crop images and make sure to contain at least one intact instance.
Randomly rotate the image, boxes, and polygons.
Resize image & bboxes & polygons.
Fix invalid polygons in the dataset.
Remove ignored elements from the pipeline.
Formatting¶
Pack the inputs data for text detection.
Pack the inputs data for text recognition.
Pack the inputs data for key information extraction.
Transform Wrapper¶
A wrapper around imgaug: https://github.com/aleju/imgaug.
A wrapper around torchvision transforms.
mmocr.models¶
models.common¶
Dictionary¶
The class generates a dictionary for recognition.
Losses¶
This loss combines a Sigmoid layer and a masked balanced BCE loss in one single class.
Masked dice loss.
Masked Smooth L1 loss.
Masked square dice loss.
This loss combines a Sigmoid layer and a masked BCE loss in one single class.
Smooth L1 loss.
Cross entropy loss.
Masked Balanced BCE loss.
Masked BCE loss.
Layers¶
Transformer Encoder Layer.
Transformer Decoder Layer.
Modules¶
Scaled Dot-Product Attention Module.
Multi-Head Attention module.
Two-layer feed-forward module.
Fixed positional encoding with sine and cosine functions.
models.textdet¶
Detectors¶
The class for implementing single stage text detector.
The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.
The class for implementing PANet text detector: …
The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.
The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
The class for implementing FCENet (CVPR 2021) text detector: Fourier Contour Embedding for Arbitrary-shaped Text Detection.
The class for implementing DRRG text detector.
A wrapper of MMDet’s model.
Data Preprocessors¶
Image pre-processor for detection tasks.
Necks¶
This code is from https://github.com/WenmuZhou/PAN.pytorch.
FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.
FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.
The class for implementing DRRG and TextSnake U-Net-like FPN.
Heads¶
Base head for text detection, build the loss and postprocessor.
The class for PSENet head.
The class for PANet head.
The class for DBNet head.
The class for implementing FCENet head.
The class for TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
Module Losses¶
Base class for the module loss of segmentation-based text detection algorithms with some handy utilities.
The class for implementing PANet loss.
The class for implementing PSENet loss.
The class for implementing DBNet loss.
The class for implementing TextSnake loss.
The class for implementing FCENet loss.
The class for implementing DRRG loss.
Postprocessors¶
Base postprocessor for text detection models.
Decoding predictions of PSENet to instances.
Convert scores to quadrangles via post processing in PANet.
Decoding predictions of DBNet to instances.
Merge text components and construct boundaries of text instances.
Decoding predictions of FCENet to instances.
Decoding predictions of TextSnake to instances.
models.textrecog¶
Recognizers¶
Base class for recognizer.
Base class for encode-decode recognizer.
CTC-loss based recognizer.
Implementation of SAR.
Implementation of NRTR.
Implementation of RobustScanner.
Implementation of SATRN.
Implementation of Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.
Implementation of MASTER.
Implementation of ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.
Data Preprocessors¶
Image pre-processor for recognition tasks.
Preprocessors¶
Implement STN module in ASTER: An Attentional Scene Text Recognizer with Flexible Rectification (https://ieeexplore.ieee.org/abstract/document/8395027/).
BackBones¶
Implement ResNet backbone for text recognition, modified from …
A mini VGG backbone for text recognition, modified from VGG-VeryDeep.
Modality transform in NRTR.
Implement Shallow CNN block for SATRN.
Implement ResNet backbone for text recognition, modified from ResNet.
See mmdet.models.backbones.MobileNetV2 for details.
Encoders¶
Implementation of encoder module in SAR.
Transformer Encoder block with self attention mechanism.
Base Encoder class for text recognition.
Change the channel number with a one-by-one convolutional layer.
Implement encoder for SATRN, see SATRN.
Implement transformer encoder for text recognition, modified from https://github.com/FangShancheng/ABINet.
Implement BiLSTM encoder module in ASTER: An Attentional Scene Text Recognizer with Flexible Rectification.
Decoders¶
Base decoder for text recognition, build the loss and postprocessor.
Transformer-based language model responsible for spell correction. Implementation of the language model of ABINet.
Converts visual features into text characters.
A special decoder responsible for mixing and aligning visual feature and linguistic feature.
Decoder for CRNN.
Implementation of the Parallel Decoder module in SAR.
Implementation of the Sequential Decoder module in SAR.
Parallel Decoder module with beam-search in SAR.
Transformer Decoder block with self attention mechanism.
Sequence attention decoder for RobustScanner.
Position attention decoder for RobustScanner.
Decoder for RobustScanner.
Decoder module in MASTER.
Implement attention decoder.
Module Losses¶
Base recognition loss.
Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.
Implementation of loss module for CTC-loss based text recognition.
Implementation of ABINet multiloss that allows mixing different types of losses with weights.
Postprocessors¶
Base text recognition postprocessor.
PostProcessor for seq2seq.
PostProcessor for CTC.
models.kie¶
Extractors¶
The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.
Module Losses¶
The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.
Postprocessors¶
Postprocessor for SDMGR.
mmocr.evaluation¶
TextDet Metric¶
HmeanIOU metric.
TextRecog Metric¶
Word metrics for text recognition task.
Character metrics for text recognition task.
One minus NED metric for text recognition task.
KIE Metric¶
Compute F1 scores.
mmocr.visualization¶
The MMOCR Text Detection Local Visualizer.
The MMOCR Text Detection Local Visualizer.
MMOCR Text Detection Local Visualizer.
The MMOCR Text Detection Local Visualizer.
mmocr.utils¶
Box Utils¶
Converting a bounding box to a polygon.
Calculate the distance between the center points of two bounding boxes.
Calculate the diagonal length of a bounding box (distance between the top-left and bottom-right).
Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by …
Check if two boxes are on the same line by their y-axis coordinates.
Rescale bboxes according to scale_factor.
Stitch fragmented boxes of words into lines.
Point Utils¶
Calculate the distance between two points.
Calculate the center of a set of points.
Polygon Utils¶
Calculate the IOU between two boundaries.
Crop polygon to be within a box region.
Check if the polygon is inside the target region.
Offset (expand/shrink) the polygon by the target distance.
Converting a polygon to a bounding box.
Convert a polygon to shapely.geometry.Polygon.
Calculate the intersection area between two polygons.
Calculate the IOU between two polygons.
Convert a potentially invalid polygon to a valid one by eliminating self-crossing or self-touching parts.
Calculate the union area between two polygons.
Convert a nested list of boundaries to a list of Polygons.
Rescale a polygon according to scale_factor.
Rescale polygons according to scale_factor.
Convert a nested list of boundaries to a list of Polygons.
Sort arbitrary points in clockwise order in Cartesian coordinate, you may need to reverse the output sequence if you are using OpenCV’s image coordinate.
Sort box vertices in clockwise order from left-top first.
Sort vertex with 8 points [x1 y1 x2 y2 x3 y3 x4 y4].
Mask Utils¶
Fill holes in matrix.
Misc Utils¶
Check whether x is a 2d-list ([[1], []]) or a 1d empty list ([]).
Check whether x is a 3d-list ([[[1], []]]), a 2d empty list ([[], []]), or a 1d empty list ([]).
Welcome to the OpenMMLab community¶
Scan the QR codes below to follow the OpenMMLab team’s Zhihu Official Account, join the OpenMMLab team’s QQ Group, join the official WeChat communication group by adding us on WeChat, or join our Slack.
(QR codes: Zhihu Official Account, QQ Group, WeChat)
In the OpenMMLab community, we will provide you with:
📢 the latest core technologies of AI frameworks
💻 explanations of the source code of common PyTorch modules
📰 news about OpenMMLab releases
🚀 introductions to cutting-edge algorithms developed by OpenMMLab
🏃 more efficient answers and feedback
🔥 a platform for communication with developers from all walks of life
The OpenMMLab community looks forward to your participation! 👬