欢迎来到 MMOCR 的中文文档!¶
您可以在页面左下角切换中英文文档。
概览¶
MMOCR 是一个基于 PyTorch 和 MMDetection 的开源工具箱,支持众多 OCR 相关的模型,涵盖了文本检测、文本识别以及关键信息提取等多个主要方向。它还支持了大多数流行的学术数据集,并提供了许多实用工具帮助用户对数据集和模型进行多方面的探索和调试,助力优质模型的产出和落地。它具有以下特点:
全流程,多模型:支持了全流程的 OCR 任务,包括文本检测、文本识别及关键信息提取的各种最新模型。
模块化设计:MMOCR 的模块化设计使用户可以按需定义及复用模型中的各个模块。
实用工具众多:MMOCR 提供了全面的可视化工具、验证工具和性能评测工具,帮助用户对模型进行排错、调优或客观比较。
由 OpenMMLab 强力驱动:与家族内的其它算法库一样,MMOCR 遵循着 OpenMMLab 严谨的开发准则和接口约定,极大地降低了用户切换各算法库时的学习成本。同时,MMOCR 也可以非常便捷地与家族内其他算法库跨库联动,从而满足用户跨领域研究和落地的需求。
随着 OpenMMLab 家族架构的整体升级, MMOCR 也相应地进行了大幅度的升级和修改。在这个大版本的更新中,MMOCR 中大量的冗余代码和重复实现被移除,多个关键方法的运行效率得到了提升,且整体框架设计上变得更为统一。考虑到该版本相较于 0.x 存在一些后向不兼容的修改,我们准备了一份详细的迁移指南,并在里面列出了新版本所作出的所有改动和迁移所需的步骤,力求帮助熟悉旧版框架的用户尽快完成升级。尽管这可能需要一定时间,但我们相信由 MMOCR 和 OpenMMLab 生态系统整体带来的新特性会让这一切变得尤为值得。😊
接下来,请根据实际需求选择你需要阅读的章节。
安装¶
环境依赖¶
Linux | Windows | macOS
Python 3.7
PyTorch 1.6 或更高版本
torchvision 0.7.0
CUDA 10.1
NCCL 2
GCC 5.4.0 或更高版本
准备环境¶
注解
如果你已经在本地安装了 PyTorch,请直接跳转到安装步骤。
第一步 下载并安装 Miniconda.
第二步 创建并激活一个 conda 环境:
conda install pytorch torchvision -c pytorch
conda install pytorch torchvision cpuonly -c pytorch
安装步骤¶
我们建议大多数用户采用我们的推荐方式安装 MMOCR。倘若你需要更灵活的安装过程,则可以参考自定义安装一节。
推荐步骤¶
第一步 使用 MIM 安装 MMEngine, MMCV 和 MMDetection。
pip install -U openmim
mim install mmengine
mim install mmcv
mim install mmdet
第二步 安装 MMOCR。
若你需要直接运行 MMOCR 或在其基础上进行开发,则通过源码安装(推荐):
git clone https://github.com/open-mmlab/mmocr.git
cd mmocr
pip install -v -e .
# "-v" 会让安装过程产生更详细的输出
# "-e" 会以可编辑的方式安装该代码库,你对该代码库所作的任何更改都会立即生效
如果你将 MMOCR 作为一个外置依赖库使用,则可以通过 MIM 安装:
mim install mmocr
第三步(可选) 如果你需要使用与 albumentations
有关的变换(如 ABINet 数据流水线中的 Albu
),或需要构建文档、运行单元测试的依赖,请使用以下命令安装依赖:
# 安装 albu
pip install -r requirements/albu.txt
# 安装文档、测试等依赖
pip install -r requirements.txt
pip install albumentations>=1.1.0 --no-binary qudida,albumentations
注解
我们建议在安装 albumentations
之后检查当前环境,确保 opencv-python
和 opencv-python-headless
没有同时被安装,否则有可能会产生一些无法预知的错误。如果它们不巧同时存在于环境当中,请卸载 opencv-python-headless
以确保 MMOCR 的可视化工具可以正常运行。
查看 albumentations
的官方文档以获知详情。
检验¶
你可以通过运行一个简单的推理任务来检验 MMOCR 的安装是否成功。
在 Python 中运行以下代码:
>>> from mmocr.apis import MMOCRInferencer
>>> ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
>>> ocr('demo/demo_text_ocr.jpg', show=True, print_result=True)
如果你是通过源码安装的 MMOCR,你可以在 MMOCR 的根目录下运行以下命令:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec CRNN --show --print-result
若 MMOCR 的安装无误,你在这一节完成后应当能看到以图片和文字形式表示的识别结果:

# 识别结果
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}
注解
如果你在没有 GUI 的服务器上运行 MMOCR,或者通过没有开启 X11 转发的 SSH 隧道运行 MMOCR,你可能无法看到弹出的窗口。
自定义安装¶
CUDA 版本¶
安装 PyTorch 时,需要指定 CUDA 版本。如果您不清楚选择哪个,请遵循我们的建议:
对于 Ampere 架构的 NVIDIA GPU,例如 GeForce 30 series 以及 NVIDIA A100,CUDA 11 是必需的。
对于更早的 NVIDIA GPU,CUDA 11 是向后兼容的,但 CUDA 10.2 能够提供更好的兼容性,也更加轻量。
请确保你的 GPU 驱动版本满足最低的版本需求,参阅这张表。
注解
如果按照我们的最佳实践进行安装,CUDA 运行时库就足够了,因为我们提供相关 CUDA 代码的预编译,你不需要进行本地编译。
但如果你希望从源码进行 MMCV 的编译,或是进行其他 CUDA 算子的开发,那么就必须安装完整的 CUDA 工具链,参见
NVIDIA 官网,另外还需要确保该 CUDA 工具链的版本与 PyTorch 安装时
的配置相匹配(如用 conda install
安装 PyTorch 时指定的 cudatoolkit 版本)。
不使用 MIM 安装 MMCV¶
MMCV 包含 C++ 和 CUDA 扩展,因此其对 PyTorch 的依赖比较复杂。MIM 会自动解析这些依赖,选择合适的 MMCV 预编译包,使安装更简单,但它并不是必需的。
要使用 pip 而不是 MIM 来安装 MMCV,请遵照 MMCV 安装指南。 它需要你用指定 url 的形式手动指定对应的 PyTorch 和 CUDA 版本。
举个例子,如下命令将会安装基于 PyTorch 1.10.x 和 CUDA 11.3 编译的 mmcv。
pip install 'mmcv>=2.0.0rc1' -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html
在 CPU 环境中安装¶
MMOCR 可以仅在 CPU 环境中安装,在 CPU 模式下,你可以完成训练(需要 MMCV 版本 >= 1.4.4)、测试和模型推理等所有操作。
在 CPU 模式下,MMCV 中的以下算子将不可用:
Deformable Convolution
Modulated Deformable Convolution
ROI pooling
SyncBatchNorm
如果你尝试使用用到了以上算子的模型进行训练、测试或推理,程序将会报错。以下为可能受到影响的模型列表:
算子 | 模型 |
---|---|
Deformable Convolution/Modulated Deformable Convolution | DBNet (r50dcnv2), DBNet++ (r50dcnv2), FCENet (r50dcnv2) |
SyncBatchNorm | PANet, PSENet |
通过 Docker 使用 MMOCR¶
我们提供了一个 Dockerfile 文件以构建 Docker 镜像。
# build an image with PyTorch 1.6, CUDA 10.1
docker build -t mmocr docker/
使用以下命令运行。
docker run --gpus all --shm-size=8g -it -v {实际数据目录}:/mmocr/data mmocr
对 MMEngine、MMCV 和 MMDetection 的版本依赖¶
为了确保代码实现的正确性,MMOCR 每个版本都有可能改变对 MMEngine、MMCV 和 MMDetection 版本的依赖。请根据以下表格确保版本之间的相互匹配。
MMOCR | MMEngine | MMCV | MMDetection |
---|---|---|---|
dev-1.x | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0 |
1.0.1 | 0.7.1 <= mmengine < 1.1.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.2.0 |
1.0.0 | 0.7.1 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0 |
1.0.0rc6 | 0.6.0 <= mmengine < 1.0.0 | 2.0.0rc4 <= mmcv < 2.1.0 | 3.0.0rc5 <= mmdet < 3.1.0 |
1.0.0rc[4-5] | 0.1.0 <= mmengine < 1.0.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0 |
1.0.0rc[0-3] | 0.0.0 <= mmengine < 0.2.0 | 2.0.0rc1 <= mmcv < 2.1.0 | 3.0.0rc0 <= mmdet < 3.1.0 |
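若不确定当前环境是否满足上表的版本约束,可以用如下几行 Python 代码快速查看各库版本(仅为示意,假设上述库均已安装):
import mmengine
import mmcv
import mmdet
import mmocr

# 打印各库版本,便于与上表对照
print('mmengine:', mmengine.__version__)
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)
print('mmocr:', mmocr.__version__)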
快速运行¶
这个章节会介绍 MMOCR 的一些基本功能。我们假设你已经从源码安装了 MMOCR。此外,你也可以通过教程 Notebook来了解如何在交互式环境下实现推理、训练和测试。
推理¶
在 MMOCR 的根目录下运行以下命令:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec CRNN --show --print-result
你可以看到弹出的预测结果,以及在控制台中打印出的推理结果。

# 识别结果
{'predictions': [{'rec_texts': ['cbanks', 'docecea', 'grouf', 'pwate', 'chobnsonsg', 'soxee', 'oeioh', 'c', 'sones', 'lbrandec', 'sretalg', '11', 'to8', 'round', 'sale', 'year',
'ally', 'sie', 'sall'], 'rec_scores': [...], 'det_polygons': [...], 'det_scores':
[...]}]}
注解
如果你在没有 GUI 的服务器上运行 MMOCR,或者通过没有开启 X11 转发的 SSH 隧道运行 MMOCR,你可能无法看到弹出的窗口。
对 MMOCR 中推理接口更为详细的说明,可以在这里找到。
除了使用我们提供好的预训练模型,用户也可以在自己的数据集上训练流行模型。接下来我们以在迷你的 ICDAR 2015 数据集上训练 DBNet 为例,带大家熟悉 MMOCR 的基本功能。
准备数据集¶
由于 OCR 任务的数据集种类多样,格式不一,不利于多数据集的切换和联合训练,因此 MMOCR 约定了一种统一的数据格式,并针对常用的 OCR 数据集提供了一键式数据准备脚本。通常,要在 MMOCR 中使用数据集,你只需要按照对应步骤运行指令即可。
注解
但我们亦深知,效率就是生命——尤其对想要快速上手 MMOCR 的你来说。
在这里,我们准备了一个用于演示的精简版 ICDAR 2015 数据集。下载我们预先准备好的压缩包,解压到 mmocr 的 data/
目录下,就能得到我们准备好的图片和标注文件。
wget https://download.openmmlab.com/mmocr/data/icdar2015/mini_icdar2015.tar.gz
mkdir -p data/
tar xzvf mini_icdar2015.tar.gz -C data/
修改配置¶
准备好数据集后,我们接下来就需要通过修改配置的方式指定训练集的位置和训练参数。
在这个例子中,我们将会训练一个以 resnet18 作为骨干网络(backbone)的 DBNet。由于 MMOCR 已经有针对完整 ICDAR 2015 数据集的配置 (configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
),我们只需要在它的基础上作出一点修改。
我们首先需要修改数据集的路径。在这个配置中,大部分关键的配置文件都在 _base_
中被导入,如数据库的配置就来自 configs/textdet/_base_/datasets/icdar2015.py
。打开该文件,把第一行 icdar2015_textdet_data_root
指向的路径替换:
icdar2015_textdet_data_root = 'data/mini_icdar2015'
另外,因为数据集尺寸缩小了,我们也要相应地减少训练的轮次到 400,缩短验证和储存权重的间隔到10轮,并放弃学习率衰减策略。直接把以下几行配置放入 configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
即可生效:
# 每 10 个 epoch 储存一次权重,且只保留最后一个权重
default_hooks = dict(
checkpoint=dict(
type='CheckpointHook',
interval=10,
max_keep_ckpts=1,
))
# 设置最大 epoch 数为 400,每 10 个 epoch 运行一次验证
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=400, val_interval=10)
# 令学习率为常量,即不进行学习率衰减
param_scheduler = [dict(type='ConstantLR', factor=1.0),]
这里,我们通过配置的继承 (MMEngine: Config) 机制将基础配置中的相应参数直接进行了改写。原本的字段分布在 configs/textdet/_base_/schedules/schedule_sgd_1200e.py
和 configs/textdet/_base_/default_runtime.py
中,感兴趣的读者可以自行查看。
注解
关于配置文件更加详尽的说明,请参考此处。
可视化数据集¶
在正式开始训练前,我们还可以可视化一下经过训练过程中数据变换(transforms)后的图像。方法也很简单,把我们需要可视化的配置传入 browse_dataset.py 脚本即可:
python tools/analysis_tools/browse_dataset.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
数据变换后的图片和标签会在弹窗中逐张被展示出来。



注解
有关该脚本更详细的指南,请参考此处.
小技巧
除了满足好奇心之外,可视化还可以帮助我们在训练前检查可能影响到模型表现的部分,如配置文件、数据集及数据变换中的问题。
训练¶
万事俱备,只欠东风。运行以下命令启动训练:
python tools/train.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
根据系统情况,MMOCR 会自动使用最佳的设备进行训练。如果有 GPU,则会默认在第一张卡启动单卡训练。当开始看到 loss 的输出,就说明你已经成功启动了训练。
2022/08/22 18:42:22 - mmengine - INFO - Epoch(train) [1][5/7] lr: 7.0000e-03 memory: 7730 data_time: 0.4496 loss_prob: 14.6061 loss_thr: 2.2904 loss_db: 0.9879 loss: 17.8843 time: 1.8666
2022/08/22 18:42:24 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:28 - mmengine - INFO - Epoch(train) [2][5/7] lr: 7.0000e-03 memory: 6695 data_time: 0.2052 loss_prob: 6.7840 loss_thr: 1.4114 loss_db: 0.9855 loss: 9.1809 time: 0.7506
2022/08/22 18:42:29 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
2022/08/22 18:42:33 - mmengine - INFO - Epoch(train) [3][5/7] lr: 7.0000e-03 memory: 6690 data_time: 0.2101 loss_prob: 3.0700 loss_thr: 1.1800 loss_db: 0.9967 loss: 5.2468 time: 0.6244
2022/08/22 18:42:33 - mmengine - INFO - Exp name: dbnet_resnet18_fpnc_1200e_icdar2015
在不指定额外参数时,训练的权重默认会被保存到 work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/
下面,而日志则会保存在work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/开始训练的时间戳/
里。接下来,我们只需要耐心等待模型训练完成即可。
注解
若需要了解训练的高级用法,如 CPU 训练、多卡训练及集群训练等,请查阅训练与测试。
测试¶
经过数十分钟的等待,模型顺利完成了 400 个 epoch 的训练。我们通过控制台的输出,观察到 DBNet 在最后一个 epoch 的表现最好,hmean 达到了 0.6086(你可能会得到一个不太一样的结果):
08/22 19:24:52 - mmengine - INFO - Epoch(val) [400][100/100] icdar/precision: 0.7285 icdar/recall: 0.5226 icdar/hmean: 0.6086
注解
它或许还没被训练到最优状态,但对于一个演示而言已经足够了。
然而,这个数值只反映了 DBNet 在迷你 ICDAR 2015 数据集上的性能。要想更加客观地评判它的检测能力,我们还要看看它在分布外数据集上的表现。例如,tests/data/det_toy_dataset
就是一个很小的真实数据集,我们可以用它来验证一下 DBNet 的实际性能。
在测试前,我们同样需要对数据集的位置做一下修改。打开 configs/textdet/_base_/datasets/icdar2015.py
,修改 icdar2015_textdet_test
的 data_root
为 tests/data/det_toy_dataset
:
# ...
icdar2015_textdet_test = dict(
type='OCRDataset',
data_root='tests/data/det_toy_dataset',
# ...
)
修改完毕,运行命令启动测试。
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth
得到输出:
08/21 21:45:59 - mmengine - INFO - Epoch(test) [5/10] memory: 8562
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10] eta: 0:00:00 time: 0.4893 data_time: 0.0191 memory: 283
08/21 21:45:59 - mmengine - INFO - Evaluating hmean-iou...
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.30, recall: 0.6190, precision: 0.4815, hmean: 0.5417
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.40, recall: 0.6190, precision: 0.5909, hmean: 0.6047
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.50, recall: 0.6190, precision: 0.6842, hmean: 0.6500
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.60, recall: 0.6190, precision: 0.7222, hmean: 0.6667
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.70, recall: 0.3810, precision: 0.8889, hmean: 0.5333
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.80, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - prediction score threshold: 0.90, recall: 0.0000, precision: 0.0000, hmean: 0.0000
08/21 21:45:59 - mmengine - INFO - Epoch(test) [10/10] icdar/precision: 0.7222 icdar/recall: 0.6190 icdar/hmean: 0.6667
可以发现,模型在这个数据集上能达到的 hmean 为 0.6667,效果还是不错的。
注解
若需要了解测试的高级用法,如 CPU 测试、多卡测试及集群测试等,请查阅训练与测试。
可视化输出¶
为了对模型的输出有一个更直观的感受,我们还可以直接可视化它的预测输出。在 test.py
中,用户可以通过 show
参数打开弹窗可视化;也可以通过 show-dir
参数指定预测结果图导出的目录。
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py work_dirs/dbnet_resnet18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/
真实标签和预测值会在可视化结果中以平铺的方式展示。左图的绿框表示真实标签,右图的红框表示预测值。

注解
有关更多可视化功能的介绍,请参阅这里。
FAQ¶
General¶
Q1 I’m getting the warning like unexpected key in source state_dict: fc.weight, fc.bias
, is there something wrong?
A It’s not an error. It occurs because the backbone network is pretrained on image classification tasks, where the last fc layer is required to generate the classification output. However, the fc layer is no longer needed when the backbone network is used to extract features in downstream tasks, and therefore these weights can be safely skipped when loading the checkpoint.
Q2 MMOCR terminates with an error: shapely.errors.TopologicalError: The operation 'GEOSIntersection_r' could not be performed. Likely cause is invalidity of the geometry
. How could I fix it?
A This error occurs because of some invalid polygons (e.g., polygons with self-intersections) existing in the dataset or generated by some non-rigorous data transforms. These polygons can be fixed by adding FixInvalidPolygon
transform after the transform likely to introduce invalid polygons. For example, a common practice is to append it after LoadOCRAnnotations
in both train and test pipeline. The resulting pipeline should look like:
train_pipeline = [
...
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(type='FixInvalidPolygon', min_poly_points=4),
...
]
In practice, we find that Totaltext contains some invalid polygons and using FixInvalidPolygon
is a must. Here is an example config.
Q3 Getting libpng warning: iCCP: known incorrect sRGB profile
when loading images with cv2
backend.
A This is a warning from libpng
and it is safe to ignore. It is caused by the icc
profile in the image. You can use pillow
backend to avoid this warning:
train_pipeline = [
dict(
type='LoadImageFromFile',
imdecode_backend='pillow'),
...
]
Text Recognition¶
Q1 What are the steps to train text recognition models with my own dictionary?
A In MMOCR 1.0, you only need to modify the config and point Dictionary
to your custom dict file. For example, if you want to train SAR model (https://github.com/open-mmlab/mmocr/blob/75c06d34bbc01d3d11dfd7afc098b6cdeee82579/configs/textrecog/sar/sar_resnet31_parallel-decoder_5e_st-sub_mj-sub_sa_real.py) with your own dictionary placed at /my/dict.txt
, you can modify dictionary.dict_file
term in base config to:
dictionary = dict(
type='Dictionary',
dict_file='/my/dict.txt',
with_start=True,
with_end=True,
same_start_end=True,
with_padding=True,
with_unknown=True)
Now you are good to go. You can also find more information in Dictionary API.
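If you want to double-check that the dictionary file can be loaded as expected before training, a minimal sketch like the following may help (the registry-based build and the attribute names are assumptions based on MMOCR 1.x; adjust to your environment):
from mmocr.registry import TASK_UTILS

# Build the Dictionary exactly as the config would
dictionary = TASK_UTILS.build(
    dict(
        type='Dictionary',
        dict_file='/my/dict.txt',
        with_start=True,
        with_end=True,
        same_start_end=True,
        with_padding=True,
        with_unknown=True))
print(dictionary.num_classes)     # total number of tokens, including special ones
print(dictionary.str2idx('abc'))  # indices of a sample string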
Q2 How to properly visualize non-English characters?
A You can customize font_families
or font_properties
in visualizer. For example, to visualize Korean:
configs/textrecog/_base_/default_runtime.py
:
visualizer = dict(
type='TextRecogLocalVisualizer',
name='visualizer',
font_families='NanumGothic', # new feature
vis_backends=vis_backends)
It’s also fine to pass the font path to visualizer:
visualizer = dict(
type='TextRecogLocalVisualizer',
name='visualizer',
font_properties='path/to/font_file',
vis_backends=vis_backends)
推理¶
在 OpenMMLab 中,所有的推理操作都被统一到了推理器 Inferencer
中。推理器被设计成为一个简洁易用的 API,它在不同的 OpenMMLab 库中都有着非常相似的接口。
MMOCR 中存在两种不同的推理器:
标准推理器:MMOCR 中的每个基本任务都有一个标准推理器,即 TextDetInferencer(文本检测)、TextRecInferencer(文本识别)、TextSpottingInferencer(端到端 OCR)和 KIEInferencer(关键信息提取)。它们具有非常相似的接口,具有标准的输入/输出协议,并且总体遵循 OpenMMLab 的设计。这些推理器也可以被串联在一起,以便对一系列任务进行推理。
MMOCRInferencer:我们还提供了 MMOCRInferencer,一个专门为 MMOCR 设计的便捷推理接口。它封装和链接了 MMOCR 中的所有推理器,因此用户可以使用此推理器对图像执行一系列任务,并直接获得最终结果。但是,它的接口与标准推理器有一些不同,并且为了简单起见,可能会牺牲一些标准的推理器功能。
对于新用户,我们建议使用 MMOCRInferencer 来测试不同模型的组合。
如果你是开发人员并希望将模型集成到自己的项目中,我们建议使用标准推理器,因为它们更灵活且标准化,并具有完整的功能。
基础用法¶
目前,MMOCRInferencer
可以对以下任务进行推理:
文本检测
文本识别
OCR(文本检测 + 文本识别)
关键信息提取(文本检测 + 文本识别 + 关键信息提取)
OCR(text spotting)(即将推出)
为了便于使用,MMOCRInferencer
向用户提供了 Python 接口和命令行接口。例如,如果你想要对 demo/demo_text_ocr.jpg 进行 OCR 推理,使用 DBNet
作为文本检测模型,SAR 作为文本识别模型,只需执行以下命令:
>>> from mmocr.apis import MMOCRInferencer
>>> # 读取模型
>>> ocr = MMOCRInferencer(det='DBNet', rec='SAR')
>>> # 进行推理并可视化结果
>>> ocr('demo/demo_text_ocr.jpg', show=True)
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec SAR --show
可视化结果将被显示在一个新窗口中:

注解
如果你在没有 GUI 的服务器上运行 MMOCR,或者是通过禁用 X11 转发的 SSH 隧道运行该指令,show
选项将不起作用。然而,你仍然可以通过设置 out_dir
和 save_vis=True
参数将可视化数据保存到文件。阅读 储存结果 了解详情。
根据初始化参数,MMOCRInferencer
可以在不同模式下运行。例如,如果初始化时指定了 det
、rec
和 kie
,它可以在 KIE 模式下运行。
>>> kie = MMOCRInferencer(det='DBNet', rec='SAR', kie='SDMGR')
>>> kie('demo/demo_kie.jpeg', show=True)
python tools/infer.py demo/demo_kie.jpeg --det DBNet --rec SAR --kie SDMGR --show
可视化结果如下:

可以见到,MMOCRInferencer 的 Python 接口与命令行接口的使用方法非常相似。下文将以 Python 接口为例,介绍 MMOCRInferencer 的具体用法。关于命令行接口的更多信息,请参考 命令行接口。
通常,OpenMMLab 中的所有标准推理器都具有非常相似的接口。下面的例子展示了如何使用 TextDetInferencer
对单个图像进行推理。
>>> from mmocr.apis import TextDetInferencer
>>> # 读取模型
>>> inferencer = TextDetInferencer(model='DBNet')
>>> # 推理
>>> inferencer('demo/demo_text_ocr.jpg', show=True)
可视化结果如图:

初始化¶
每个推理器必须使用一个模型进行初始化。初始化时,可以手动选择推理设备。
模型初始化¶
对于每个任务,MMOCRInferencer
需要两个参数 xxx
和 xxx_weights
(例如 det
和 det_weights
)以对模型进行初始化。此处将以det
和det_weights
为例来说明一些典型的初始化模型的方法。
要用 MMOCR 的预训练模型进行推理,只需要把它的名字传给参数
det
,权重将自动从 OpenMMLab 的模型库中下载和加载。此处记录了 MMOCR 中可以通过该方法初始化的所有模型。
>>> MMOCRInferencer(det='DBNet')
要加载自定义的配置和权重,你可以把配置文件的路径传给
det
,把权重的路径传给det_weights
。
>>> MMOCRInferencer(det='path/to/dbnet_config.py', det_weights='path/to/dbnet.pth')
如果需要查看更多的初始化方法,请点击“标准推理器”选项卡。
每个标准的 Inferencer
都接受两个参数,model
和 weights
。在 MMOCRInferencer 中,这两个参数分别对应 xxx
和 xxx_weights
(例如 det
和 det_weights
)。
model
接受模型的名称或配置文件的路径作为输入。模型的名称从 model-index.yml 中的模型的元文件(示例 )中获取。你可以在此处找到可用权重的列表。weights
接受权重文件的路径。
此处列举了一些常见的初始化模型的方法。
你可以通过传递模型的名称给
model
来推理 MMOCR 的预训练模型。权重将会自动从 OpenMMLab 的模型库中下载并加载。
>>> from mmocr.apis import TextDetInferencer
>>> inferencer = TextDetInferencer(model='DBNet')
注解
模型与推理器的任务种类必须匹配。
你可以通过将权重的路径或 URL 传递给
weights
来让推理器加载自定义的权重。
>>> inferencer = TextDetInferencer(model='DBNet', weights='path/to/dbnet.pth')
如果有自定义的配置和权重,你可以将配置文件的路径传递给
model
,将权重的路径传递给weights
。
>>> inferencer = TextDetInferencer(model='path/to/dbnet_config.py', weights='path/to/dbnet.pth')
默认情况下,MMEngine 会在训练模型时自动将配置文件转储到权重文件中。如果你有一个在 MMEngine 上训练的权重,你也可以将权重文件的路径传递给
weights
,而不需要指定model
:
>>> # 如果无法在权重中找到配置文件,则会引发错误
>>> inferencer = TextDetInferencer(weights='path/to/dbnet.pth')
传递配置文件到
model
而不指定 weights
则会产生一个随机初始化的模型。
推理设备¶
每个推理器实例都会跟一个设备绑定。默认情况下,最佳设备是由 MMEngine 自动决定的。你也可以通过指定 device
参数来改变设备。例如,你可以使用以下代码在 GPU 1上创建一个推理器。
>>> inferencer = MMOCRInferencer(det='DBNet', device='cuda:1')
>>> inferencer = TextDetInferencer(model='DBNet', device='cuda:1')
如要在 CPU 上创建一个推理器:
>>> inferencer = MMOCRInferencer(det='DBNet', device='cpu')
>>> inferencer = TextDetInferencer(model='DBNet', device='cpu')
请参考 torch.device 了解 device
参数支持的所有形式。
推理¶
当推理器初始化后,你可以直接传入要推理的原始数据,从返回值中获取推理结果。
输入¶
输入可以是以下任意一种格式:
str: 图像的路径/URL。
>>> inferencer('demo/demo_text_ocr.jpg')
array: 图像的 numpy 数组。它应该是 BGR 格式。
>>> import mmcv
>>> array = mmcv.imread('demo/demo_text_ocr.jpg')
>>> inferencer(array)
list: 基本类型的列表。列表中的每个元素都将单独处理。
>>> inferencer(['img_1.jpg', 'img_2.jpg'])
>>> # 列表内混合类型也是允许的
>>> inferencer(['img_1.jpg', array])
str: 目录的路径。目录中的所有图像都将被处理。
>>> inferencer('tests/data/det_toy_dataset/imgs/test/')
对于 KIEInferencer,输入则可以是一个字典或者一个字典列表,其中每个字典包含以下键:
img
(str 或者 ndarray): 图像的路径或图像本身。如果 KIE 推理器在无可视模式下使用,则不需要此键。如果它是一个 numpy 数组,则应该是 BGR 顺序编码的图片。img_shape
(tuple(int, int)): 图像的形状 (H, W)。仅在 KIE 推理器在无可视模式下使用且没有提供img
时才需要。instances
(list[dict]): 实例列表。
每个 instance
都应该包含以下键:
{
# 一个嵌套列表,其中包含 4 个数字,表示实例的边界框,顺序为 (x1, y1, x2, y2)
"bbox": np.array([[x1, y1, x2, y2], [x1, y1, x2, y2], ...],
dtype=np.int32),
# 文本列表
"texts": ['text1', 'text2', ...],
}
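下面是一段按上述键值格式构造输入并调用 KIE 推理器的示意代码(模型名、图片路径与坐标均为假设,仅供参考):
import numpy as np
from mmocr.apis import KIEInferencer

kie = KIEInferencer(model='SDMGR')  # 模型名仅为示例

kie_input = dict(
    img='demo/demo_kie.jpeg',
    instances=[{
        # 边界框,顺序为 (x1, y1, x2, y2)
        'bbox': np.array([[10, 10, 100, 40], [10, 60, 120, 90]], dtype=np.int32),
        # 与边界框对应的文本
        'texts': ['hello', 'world'],
    }])
result = kie(kie_input)
print(result['predictions'][0]['labels'])  # 节点标签,格式见下文“输出”一节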
输出¶
默认情况下,每个推理器都以字典格式返回预测结果。
visualization
包含可视化的预测结果。但默认情况下,它是一个空列表,除非return_vis=True
。predictions
包含以 json-可序列化格式返回的预测结果。如下所示,内容因任务类型而异(依次为 MMOCRInferencer、TextDetInferencer、TextRecInferencer、TextSpottingInferencer 与 KIEInferencer 的输出格式)。
# MMOCRInferencer
{
    'predictions' : [
      # 每个实例都对应于一个输入图像
      {
        'det_polygons': [...],  # 2d 列表,长度为 (N,),格式为 [x1, y1, x2, y2, ...]
        'det_scores': [...],  # 浮点列表,长度为 (N, )
        'det_bboxes': [...],  # 2d 列表,形状为 (N, 4),格式为 [min_x, min_y, max_x, max_y]
        'rec_texts': [...],  # 字符串列表,长度为 (N, )
        'rec_scores': [...],  # 浮点列表,长度为 (N, )
        'kie_labels': [...],  # 节点标签,长度为 (N, )
        'kie_scores': [...],  # 节点置信度,长度为 (N, )
        'kie_edge_scores': [...],  # 边预测置信度,形状为 (N, N)
        'kie_edge_labels': [...]  # 边标签,形状为 (N, N)
      },
      ...
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
# TextDetInferencer
{
    'predictions' : [
      # 每个实例都对应于一个输入图像
      {
        'polygons': [...],  # 2d 列表,长度为 (N,),格式为 [x1, y1, x2, y2, ...]
        'bboxes': [...],  # 2d 列表,形状为 (N, 4),格式为 [min_x, min_y, max_x, max_y]
        'scores': [...]  # 浮点列表,长度为 (N, )
      },
      ...
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
# TextRecInferencer
{
    'predictions' : [
      # 每个实例都对应于一个输入图像
      {
        'text': '...',  # 字符串
        'scores': 0.1,  # 浮点
      },
      ...
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
# TextSpottingInferencer
{
    'predictions' : [
      # 每个实例都对应于一个输入图像
      {
        'polygons': [...],  # 2d 列表,长度为 (N,),格式为 [x1, y1, x2, y2, ...]
        'bboxes': [...],  # 2d 列表,形状为 (N, 4),格式为 [min_x, min_y, max_x, max_y]
        'scores': [...],  # 浮点列表,长度为 (N, )
        'texts': ['...',]  # 字符串列表,长度为 (N, )
      },
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
# KIEInferencer
{
    'predictions' : [
      # 每个实例都对应于一个输入图像
      {
        'labels': [...],  # 节点标签,长度为 (N, )
        'scores': [...],  # 节点置信度,长度为 (N, )
        'edge_scores': [...],  # 边预测置信度,形状为 (N, N)
        'edge_labels': [...],  # 边标签,形状为 (N, N)
      },
    ],
    'visualization' : [
      array(..., dtype=uint8),
    ]
}
如果你想要从模型中获取原始输出,可以将 return_datasamples
设置为 True
来获取原始的 DataSample,它将存储在 predictions
中。
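结合上述输出格式,下面的示意代码演示了如何从返回值中取用预测结果(模型与图片仅为示例):
from mmocr.apis import MMOCRInferencer

ocr = MMOCRInferencer(det='DBNet', rec='CRNN')
result = ocr('demo/demo_text_ocr.jpg', return_vis=True)

pred = result['predictions'][0]      # 每个元素对应一张输入图像
print(pred['rec_texts'])             # 识别文本列表
print(pred['det_polygons'][0])       # 第一个检测多边形
print(len(result['visualization']))  # return_vis=True 时为可视化图像列表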
储存结果¶
除了从返回值中获取预测结果,你还可以通过设置 out_dir
和 save_pred
/save_vis
参数将预测结果和可视化结果导出到文件中。
>>> inferencer('img_1.jpg', out_dir='outputs/', save_pred=True, save_vis=True)
结果目录结构如下:
outputs
├── preds
│ └── img_1.json
└── vis
└── img_1.jpg
文件名与对应的输入图像文件名相同。 如果输入图像是数组,则文件名将是从0开始的数字。
批量推理¶
你可以通过设置 batch_size
来自定义批量推理的批大小。 默认批大小为 1。
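例如,下面的示意代码一次性传入多张图片,并以 2 为批大小进行推理(图片路径仅为示例):
results = inferencer(
    ['img_1.jpg', 'img_2.jpg', 'img_3.jpg', 'img_4.jpg'],
    batch_size=2)  # 每批同时处理 2 张图片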
API¶
这里列出了推理器详尽的参数列表。
MMOCRInferencer.__init__():
参数 | 类型 | 默认值 | 描述 |
---|---|---|---|
det | str 或 权重, 可选 | None | 预训练的文本检测算法。它是配置文件的路径或者是 metafile 中定义的模型名称。 |
det_weights | str, 可选 | None | det 模型的权重文件的路径。 |
rec | str 或 权重, 可选 | None | 预训练的文本识别算法。它是配置文件的路径或者是 metafile 中定义的模型名称。 |
rec_weights | str, 可选 | None | rec 模型的权重文件的路径。 |
kie [1] | str 或 权重, 可选 | None | 预训练的关键信息提取算法。它是配置文件的路径或者是 metafile 中定义的模型名称。 |
kie_weights | str, 可选 | None | kie 模型的权重文件的路径。 |
device | str, 可选 | None | 推理使用的设备,接受 torch.device 允许的所有字符串。例如,'cuda:0' 或 'cpu'。如果为 None,将自动使用可用设备。默认为 None。 |
[1]: 当同时指定了文本检测和识别模型时,kie
才会生效。
MMOCRInferencer.__call__()
参数 | 类型 | 默认值 | 描述 |
---|---|---|---|
inputs | str/list/tuple/np.array | 必需 | 它可以是一个图片/文件夹的路径,一个 numpy 数组,或者是一个包含图片路径或 numpy 数组的列表/元组 |
return_datasamples | bool | False | 是否将结果作为 DataSample 返回。如果为 False,结果将被打包成一个字典。 |
batch_size | int | 1 | 推理的批大小。 |
det_batch_size | int, 可选 | None | 推理的批大小(文本检测模型)。如果不为 None,则覆盖 batch_size。 |
rec_batch_size | int, 可选 | None | 推理的批大小(文本识别模型)。如果不为 None,则覆盖 batch_size。 |
kie_batch_size | int, 可选 | None | 推理的批大小(关键信息提取模型)。如果不为 None,则覆盖 batch_size。 |
return_vis | bool | False | 是否返回可视化结果。 |
print_result | bool | False | 是否将推理结果打印到控制台。 |
show | bool | False | 是否在弹出窗口中显示可视化结果。 |
wait_time | float | 0 | 弹窗展示可视化结果的时间间隔。 |
out_dir | str | results/ | 结果的输出目录。 |
save_vis | bool | False | 是否将可视化结果保存到 out_dir。 |
save_pred | bool | False | 是否将推理结果保存到 out_dir。 |
Inferencer.__init__():
参数 | 类型 | 默认值 | 描述 |
---|---|---|---|
model | str 或 权重, 可选 | None | 配置文件的路径或者在 metafile 中定义的模型名称。 |
weights | str, 可选 | None | 权重文件的路径。 |
device | str, 可选 | None | 推理使用的设备,接受 torch.device 允许的所有字符串。例如,'cuda:0' 或 'cpu'。如果为 None,则将自动使用可用设备。默认为 None。 |
Inferencer.__call__()
参数 | 类型 | 默认值 | 描述 |
---|---|---|---|
inputs | str/list/tuple/np.array | 必需 | 可以是图像的路径/文件夹,np 数组或列表/元组(带有图像路径或 np 数组) |
return_datasamples | bool | False | 是否将结果作为 DataSamples 返回。如果为 False,则结果将被打包到一个 dict 中。 |
batch_size | int | 1 | 推理批大小。 |
progress_bar | bool | True | 是否显示进度条。 |
return_vis | bool | False | 是否返回可视化结果。 |
print_result | bool | False | 是否将推理结果打印到控制台。 |
show | bool | False | 是否在弹出窗口中显示可视化结果。 |
wait_time | float | 0 | 弹窗展示可视化结果的时间间隔。 |
draw_pred | bool | True | 是否绘制预测的边界框。仅适用于 TextDetInferencer 和 TextSpottingInferencer。 |
out_dir | str | results/ | 结果的输出目录。 |
save_vis | bool | False | 是否将可视化结果保存到 out_dir。 |
save_pred | bool | False | 是否将推理结果保存到 out_dir。 |
命令行接口¶
注解
该节仅适用于 MMOCRInferencer
.
MMOCRInferencer
的命令行形式可以通过 tools/infer.py
调用,大致形式如下:
python tools/infer.py INPUT_PATH [--det DET] [--det-weights ...] ...
其中,INPUT_PATH
为必须字段,内容应当为指向图片或文件目录的路径。其他参数与 Python 接口遵循的映射关系如下:
在命令行中调用参数时,需要在 Python 接口的参数前面加上两个
-
,然后把下划线_
替换成连字符-
。例如,out_dir
会变成--out-dir
。对于布尔类型的参数,将参数放在命令中就相当于将其指定为 True。例如,
--show
会将show
参数指定为 True。
此外,命令行中默认不会回显推理结果,你可以通过 --print-result
参数来查看推理结果。
下面是一个例子:
python tools/infer.py demo/demo_text_ocr.jpg --det DBNet --rec SAR --show --print-result
运行该命令,可以得到如下结果:
{'predictions': [{'rec_texts': ['CBank', 'Docbcba', 'GROUP', 'MAUN', 'CROBINSONS', 'AOCOC', '916M3', 'BOO9', 'Oven', 'BRANDS', 'ARETAIL', '14', '70<UKN>S', 'ROUND', 'SALE', 'YEAR', 'ALLY', 'SALE', 'SALE'],
'rec_scores': [0.9753464579582214, ...], 'det_polygons': [[551.9930285844646, 411.9138765335083, 553.6153911653112,
383.53195309638977, 620.2410061195247, 387.33785033226013, 618.6186435386782, 415.71977376937866], ...], 'det_scores': [0.8230461478233337, ...]}]}
配置文件¶
MMOCR 主要使用 Python 文件作为配置文件。其配置文件系统的设计整合了模块化与继承的思想,方便用户进行各种实验。
常见用法¶
注解
本小节建议结合 MMEngine: 配置(Config) 中的初级用法共同阅读。
MMOCR 最常用的操作为三种:配置文件的继承,对 _base_
变量的引用以及对 _base_
变量的修改。对于 _base_
的继承与修改, MMEngine.Config 提供了两种语法,一种是针对 Python,Json, Yaml 均可使用的操作;另一种则仅适用于 Python 配置文件。在 MMOCR 中,我们更推荐使用只针对Python的语法,因此下文将以此为基础作进一步介绍。
这里以 configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
为例,说明常用的三种用法。
_base_ = [
'_base_dbnet_resnet18_fpnc.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_train.pipeline = _base_.train_pipeline
icdar2015_textdet_test = _base_.icdar2015_textdet_test
icdar2015_textdet_test.pipeline = _base_.test_pipeline
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train)
val_dataloader = dict(
batch_size=1,
num_workers=4,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=icdar2015_textdet_test)
配置文件的继承¶
配置文件存在继承的机制,即一个配置文件 A 可以将另一个配置文件 B 作为自己的基础并直接继承其中的所有字段,从而避免了大量的复制粘贴。
在 dbnet_resnet18_fpnc_1200e_icdar2015.py 中可以看到:
_base_ = [
'_base_dbnet_resnet18_fpnc.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
上述语句会读取列表中的所有基础配置文件,它们中的所有字段都会被载入到 dbnet_resnet18_fpnc_1200e_icdar2015.py 中。我们可以通过在 Python 解释器中运行以下语句,了解配置文件被解析后的结构:
from mmengine import Config
db_config = Config.fromfile('configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py')
print(db_config)
可以发现,被解析的配置包含了所有base配置中的字段和信息。
注解
请注意:各 base 配置文件中不能存在同名变量。
_base_
变量的引用¶
有时,我们可能需要直接引用 _base_
配置中的某些字段,以避免重复定义。假设我们想要获取 _base_
配置中的变量 pseudo
,就可以直接通过 _base_.pseudo
获得 _base_
配置中的变量。
该语法已广泛用于 MMOCR 的配置中。MMOCR 中各个模型的数据集和管道(pipeline)配置都引用于基本配置。如在
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# ...
train_dataloader = dict(
# ...
dataset=icdar2015_textdet_train)
_base_
变量的修改¶
在 MMOCR 中,不同算法在不同数据集上通常有不同的数据流水线(pipeline),因此经常会存在修改数据集中 pipeline
的场景。同时还存在很多场景需要修改 _base_
配置中的变量,例如想修改某个算法的训练策略,或某个模型的某些算法模块(更换 backbone 等)。用户可以直接利用 Python 的语法修改引用的 _base_
变量。针对 dict,我们也提供了与修改类属性类似的方法,可以直接以属性访问的方式修改字典内的内容。
字典
这里以修改数据集中的
pipeline
为例:可以利用 Python 语法修改字典:
# 获取 _base_ 中的数据集
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# 可以直接利用 Python 的 update 修改变量
icdar2015_textdet_train.update(pipeline=_base_.train_pipeline)
也可以使用类属性的方法进行修改:
# 获取 _base_ 中的数据集
icdar2015_textdet_train = _base_.icdar2015_textdet_train
# 类属性方法修改
icdar2015_textdet_train.pipeline = _base_.train_pipeline
列表
假设
_base_
配置中的变量pseudo = [1, 2, 3]
, 需要修改为[1, 2, 4]
:
# pseudo.py
pseudo = [1, 2, 3]
可以直接重写:
_base_ = ['pseudo.py']
pseudo = [1, 2, 4]
或者利用 Python 语法修改列表:
_base_ = ['pseudo.py']
pseudo = _base_.pseudo
pseudo[2] = 4
命令行修改配置¶
有时候我们只希望修改部分配置,而不想修改配置文件本身。例如实验过程中想更换学习率,但是又不想重新写一个配置文件,可以通过命令行传入参数来覆盖相关配置。
我们可以在命令行里传入 --cfg-options
,并在其之后的参数直接修改对应字段,例如我们想在运行 train 的时候修改学习率,只需要在命令行执行:
python tools/train.py example.py --cfg-options optim_wrapper.optimizer.lr=1
更多详细用法参考 MMEngine: 命令行修改配置.
配置内容¶
通过配置文件与注册器的配合,MMOCR 可以在不侵入代码的前提下修改训练参数以及模型配置。具体而言,用户可以在配置文件中对如下模块进行自定义修改:环境配置、Hook 配置、日志配置、训练策略配置、数据相关配置、模型相关配置、评测配置、可视化配置。
本文档将以文字检测算法 DBNet
和文字识别算法 CRNN
为例来详细介绍 Config 中的内容。
环境配置¶
default_scope = 'mmocr'
env_cfg = dict(
cudnn_benchmark=True,
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
dist_cfg=dict(backend='nccl'))
randomness = dict(seed=None)
主要包含三个部分:
设置所有注册器的默认 scope 为 mmocr,保证所有的模块首先从 MMOCR 代码库中进行搜索。如果该模块不存在,则继续从上游算法库 MMEngine 和 MMCV 中进行搜索,详见 MMEngine: 注册器。
env_cfg 设置分布式环境配置,更多配置可以详见 MMEngine: Runner。
randomness 设置 numpy, torch, cudnn 等随机种子,更多配置详见 MMEngine: Runner。
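例如,若希望实验尽可能可复现,可以在配置中为 randomness 指定固定的随机种子(示意写法,字段含义详见 MMEngine: Runner):
randomness = dict(seed=42, deterministic=True)  # 固定随机种子;deterministic=True 会关闭 cudnn 的非确定性算法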
Hook 配置¶
Hook 主要分为两个部分,默认 hook 以及自定义 hook。默认 hook 为所有任务想要运行所必须的配置,自定义 hook 一般服务于特定的算法或某些特定任务(目前为止 MMOCR 中没有自定义的 Hook)。
default_hooks = dict(
timer=dict(type='IterTimerHook'), # 时间记录,包括数据增强时间以及模型推理时间
logger=dict(type='LoggerHook', interval=1), # 日志打印间隔
param_scheduler=dict(type='ParamSchedulerHook'), # 更新学习率等超参
checkpoint=dict(type='CheckpointHook', interval=1),# 保存 checkpoint, interval控制保存间隔
sampler_seed=dict(type='DistSamplerSeedHook'), # 多机情况下设置种子
sync_buffer=dict(type='SyncBuffersHook'), # 多卡情况下,同步buffer
visualization=dict( # 可视化val 和 test 的结果
type='VisualizationHook',
interval=1,
enable=False,
show=False,
draw_gt=False,
draw_pred=False))
custom_hooks = []
这里简单介绍几个经常可能会变动的 hook,通用的修改方法参考修改配置。
LoggerHook:用于配置日志记录器的行为。例如,通过修改 interval 可以控制日志打印的间隔,每 interval 次迭代 (iteration) 打印一次日志,更多设置可参考 LoggerHook API。
CheckpointHook:用于配置模型断点保存相关的行为,如保存最优权重,保存最新权重等。同样可以修改 interval 控制保存 checkpoint 的间隔。更多设置可参考 CheckpointHook API。
VisualizationHook:用于配置可视化相关行为,例如在验证或测试时可视化预测结果,默认为关。同时该 Hook 依赖可视化配置。想要了解详细功能可以参考 Visualizer。更多配置可以参考 VisualizationHook API。
如果想进一步了解默认 hook 的配置以及功能,可以参考 MMEngine: 钩子(Hook)。
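以 CheckpointHook 为例,若希望在保存最新权重之外额外保留验证指标最优的权重,可以参考如下示意配置(save_best 的指标名以实际评测输出为准,此处沿用前文日志中的 icdar/hmean):
default_hooks = dict(
    checkpoint=dict(
        type='CheckpointHook',
        interval=1,               # 每个 epoch 保存一次权重
        save_best='icdar/hmean',  # 以验证集上的 hmean 选择最优权重
        rule='greater'))          # 指标越大越好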
日志配置¶
此部分主要用来配置日志配置等级以及日志处理器。
log_level = 'INFO' # 日志记录等级
log_processor = dict(type='LogProcessor',
window_size=10,
by_epoch=True)
日志配置等级与 Python: logging 的配置一致,
日志处理器主要用来控制输出的格式,详细功能可参考 MMEngine: 记录日志:
by_epoch=True 表示按照 epoch 输出日志,日志格式需要和 train_cfg 中的 type='EpochBasedTrainLoop' 参数保持一致。例如想按迭代次数输出日志,就需要令 log_processor 中的 by_epoch=False 的同时 train_cfg 中的 type='IterBasedTrainLoop'。
window_size 表示损失的平滑窗口,即最近 window_size 次迭代的各种损失的均值。logger 中最终打印的 loss 值为各种损失的平均值。
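例如,若想改为按迭代次数输出日志,可参考如下示意配置(max_iters 等数值仅为示例,需与训练循环类型保持一致):
log_processor = dict(type='LogProcessor', window_size=10, by_epoch=False)
train_cfg = dict(type='IterBasedTrainLoop', max_iters=100000, val_interval=1000)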
训练策略配置¶
此部分主要包含优化器设置、学习率策略和 Loop
设置。
对不同算法任务(文字检测,文字识别,关键信息提取),通常有自己任务常用的调参策略。这里列出了文字识别中的 CRNN
所用涉及的相应配置。
# 优化器
optim_wrapper = dict(
type='OptimWrapper', optimizer=dict(type='Adadelta', lr=1.0))
param_scheduler = [dict(type='ConstantLR', factor=1.0)]
train_cfg = dict(type='EpochBasedTrainLoop',
max_epochs=5, # 训练轮数
val_interval=1) # 评测间隔
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
optim_wrapper: 主要包含两个部分,优化器封装 (OptimWrapper) 以及优化器 (Optimizer),详细使用信息可见 MMEngine: 优化器封装。优化器封装支持不同的训练策略,包括混合精度训练(AMP)、梯度累加和梯度截断,可参考本节末尾的示意配置。优化器设置中支持了 PyTorch 所有的优化器,所有支持的优化器见 PyTorch 优化器列表。
param_scheduler: 学习率调整策略,支持大部分 PyTorch 中的学习率调度器,例如 ExponentialLR,LinearLR,StepLR,MultiStepLR 等,使用方式也基本一致,所有支持的调度器见调度器接口文档,更多功能可以参考 MMEngine: 优化器参数调整策略。
train/test/val_cfg: 任务的执行流程,MMEngine 提供了四种流程:EpochBasedTrainLoop,IterBasedTrainLoop,ValLoop,TestLoop。更多可以参考 MMEngine: 循环控制器。
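以混合精度训练(AMP)为例,只需将优化器封装的类型换为 AmpOptimWrapper 即可(示意配置,优化器仍沿用 CRNN 的 Adadelta):
optim_wrapper = dict(
    type='AmpOptimWrapper',  # 使用混合精度训练的优化器封装
    optimizer=dict(type='Adadelta', lr=1.0))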
数据相关配置¶
数据集配置¶
主要用于配置两个方向:
数据集的图像与标注文件的位置。
数据增强相关的配置。在 OCR 领域中,数据增强通常与模型强相关。
更多参数配置可以参考数据基类。
数据集字段的命名规则在 MMOCR 中为:
{数据集名称缩写}_{算法任务}_{训练/测试/验证} = dict(...)
数据集缩写:见 数据集名称对应表
算法任务:文本检测-det,文字识别-rec,关键信息提取-kie
训练/测试/验证:数据集用于训练,测试还是验证
以识别为例,使用 Syn90k 作为训练集,以 icdar2013 和 icdar2015 作为测试集配置如下:
# 识别数据集配置
mjsynth_textrecog_train = dict(
type='OCRDataset',
data_root='data/rec/Syn90k/',
data_prefix=dict(img_path='mnt/ramdisk/max/90kDICT32px'),
ann_file='train_labels.json',
test_mode=False,
pipeline=None)
icdar2013_textrecog_test = dict(
type='OCRDataset',
data_root='data/rec/icdar_2013/',
data_prefix=dict(img_path='Challenge2_Test_Task3_Images/'),
ann_file='test_labels.json',
test_mode=True,
pipeline=None)
icdar2015_textrecog_test = dict(
type='OCRDataset',
data_root='data/rec/icdar_2015/',
data_prefix=dict(img_path='ch4_test_word_images_gt/'),
ann_file='test_labels.json',
test_mode=True,
pipeline=None)
数据流水线配置¶
MMOCR 中,数据集的构建与数据准备是相互解耦的。也就是说,OCRDataset
等数据集构建类负责完成标注文件的读取与解析功能;而数据变换方法(Data Transforms)则进一步实现了数据读取、数据增强、数据格式化等相关功能。
同时一般情况下训练和测试会存在不同的增强策略,因此一般会存在训练流水线(train_pipeline)和测试流水线(test_pipeline)。更多信息可以参考数据流水线
训练流水线的数据增强流程通常为:数据读取(LoadImageFromFile)->标注信息读取(LoadXXXAnnotations)->数据增强->数据格式化(PackXXXInputs)。
测试流水线的数据增强流程通常为:数据读取(LoadImageFromFile)->数据增强->标注信息读取(LoadXXXAnnotations)->数据格式化(PackXXXInputs)。
由于 OCR 任务的特殊性,一般情况下不同模型有不同数据增强的方式,相同模型在不同数据集一般也会有不同的数据增强方式。以 CRNN 为例:
# 数据增强
train_pipeline = [
dict(
type='LoadImageFromFile',
color_type='grayscale',
ignore_empty=True,
min_size=5),
dict(type='LoadOCRAnnotations', with_text=True),
dict(type='Resize', scale=(100, 32), keep_ratio=False),
dict(
type='PackTextRecogInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape', 'valid_ratio'))
]
test_pipeline = [
dict(
type='LoadImageFromFile',
color_type='grayscale'),
dict(
type='RescaleToHeight',
height=32,
min_width=32,
max_width=None,
width_divisor=16),
dict(type='LoadOCRAnnotations', with_text=True),
dict(
type='PackTextRecogInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape', 'valid_ratio'))
]
Dataloader 配置¶
主要为构造数据集加载器(dataloader)所需的配置信息,更多教程看参考 PyTorch 数据加载器。
# Dataloader 部分
train_dataloader = dict(
batch_size=64,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=dict(
type='ConcatDataset',
datasets=[mjsynth_textrecog_train],
pipeline=train_pipeline))
val_dataloader = dict(
batch_size=1,
num_workers=4,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=dict(
type='ConcatDataset',
datasets=[icdar2013_textrecog_test, icdar2015_textrecog_test],
pipeline=test_pipeline))
test_dataloader = val_dataloader
模型相关配置¶
网络配置¶
用于配置模型的网络结构,不同的算法任务有不同的网络结构。更多信息可以参考网络结构
文本检测¶
文本检测主要包含几个部分:
data_preprocessor: 数据处理器
backbone: 特征提取网络
neck: 颈网络配置
det_head: 检测头网络配置
module_loss: 模型损失函数配置
postprocessor: 模型预测结果后处理配置
我们以 DBNet 为例,介绍文字检测中模型配置:
model = dict(
type='DBNet',
data_preprocessor=dict(
type='TextDetDataPreprocessor',
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
bgr_to_rgb=True,
pad_size_divisor=32),
backbone=dict(
type='mmdet.ResNet',
depth=18,
num_stages=4,
out_indices=(0, 1, 2, 3),
frozen_stages=-1,
norm_cfg=dict(type='BN', requires_grad=True),
init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18'),
norm_eval=False,
style='caffe'),
neck=dict(
type='FPNC', in_channels=[64, 128, 256, 512], lateral_channels=256),
det_head=dict(
type='DBHead',
in_channels=256,
module_loss=dict(type='DBModuleLoss'),
postprocessor=dict(type='DBPostprocessor', text_repr_type='quad')))
文本识别¶
文本识别主要包含:
data_preprocessor: 数据预处理配置
preprocessor: 网络预处理配置,如 TPS 等
backbone: 特征提取配置
encoder: 编码器配置
decoder: 解码器配置
module_loss: 解码器损失
postprocessor: 解码器后处理
dictionary: 字典配置
以 CRNN 为例:
# 模型部分
model = dict(
type='CRNN',
data_preprocessor=dict(
type='TextRecogDataPreprocessor', mean=[127], std=[127]),
preprocessor=None,
backbone=dict(type='VeryDeepVgg', leaky_relu=False, input_channels=1),
encoder=None,
decoder=dict(
type='CRNNDecoder',
in_channels=512,
rnn_flag=True,
module_loss=dict(type='CTCModuleLoss', letter_case='lower'),
postprocessor=dict(type='CTCPostProcessor'),
dictionary=dict(
type='Dictionary',
dict_file='dicts/lower_english_digits.txt',
with_padding=True)))
权重加载配置¶
可以通过 load_from
参数加载检查点(checkpoint)文件中的模型权重,只需要将 load_from
参数设置为检查点文件的路径即可。
用户也可通过设置 resume=True
,加载检查点中的训练状态信息来恢复训练。当 load_from
和 resume=True
同时被设置时,执行器将加载 load_from
路径对应的检查点文件中的训练状态。
如果仅设置 resume=True
,执行器将会尝试从 work_dir
文件夹中寻找并读取最新的检查点文件
load_from = None # 加载checkpoint的路径
resume = False # 是否 resume
更多可以参考 MMEngine: 加载权重或恢复训练 与 OCR 进阶技巧-断点恢复训练。
评测配置¶
在模型验证和模型测试中,通常需要对模型精度做定量评测。MMOCR 通过评测指标(Metric)和评测器(Evaluator)来完成这一功能。更多可以参考MMEngine: 评测指标(Metric)和评测器(Evaluator) 和 评测器
评测部分包含两个部分,评测器和评测指标。接下来我们分部分展开讲解。
评测器¶
评测器主要用来管理多个数据集以及多个 Metric
。针对单数据集与多数据集情况,评测器分为了单数据集评测器与多数据集评测器,这两种评测器均可管理多个 Metric
.
单数据集评测器配置如下:
# 单个数据集 单个 Metric 情况
val_evaluator = dict(
type='Evaluator',
metrics=dict())
# 单个数据集 多个 Metric 情况
val_evaluator = dict(
type='Evaluator',
metrics=[...])
在实现中默认为单数据集评测器,因此对单数据集评测情况下,一般情况下只需配置评测器,即为
# 单个数据集 单个 Metric 情况
val_evaluator = dict()
# 单个数据集 多个 Metric 情况
val_evaluator = [...]
多数据集评测与单数据集评测存在两个位置上的不同:评测器类别与前缀。评测器类别必须为MultiDatasetsEvaluator
且不能省略,前缀主要用来区分不同数据集在相同评测指标下的结果,请参考多数据集评测。
假设我们需要在 IC13 和 IC15 情况下测试精度,则配置如下:
# 多个数据集,单个 Metric 情况
val_evaluator = dict(
type='MultiDatasetsEvaluator',
metrics=dict(),
dataset_prefixes=['IC13', 'IC15'])
# 多个数据集,多个 Metric 情况
val_evaluator = dict(
type='MultiDatasetsEvaluator',
metrics=[...],
dataset_prefixes=['IC13', 'IC15'])
评测指标¶
评测指标指不同度量精度的方法,同时可以多个评测指标共同使用,更多评测指标原理参考 MMEngine: 评测指标,在 MMOCR 中不同算法任务有不同的评测指标。 更多 OCR 相关的评测指标可以参考 评测指标。
文字检测: HmeanIOUMetric
文字识别: WordMetric,CharMetric,OneMinusNEDMetric
关键信息提取: F1Metric
以文本检测为例说明,在单数据集评测情况下,使用单个 Metric
:
val_evaluator = dict(type='HmeanIOUMetric')
以文本识别为例,对多个数据集(IC13 和 IC15)用多个 Metric
(WordMetric
和 CharMetric
)进行评测:
# 评测部分
val_evaluator = dict(
type='MultiDatasetsEvaluator',
metrics=[
dict(
type='WordMetric',
mode=['exact', 'ignore_case', 'ignore_case_symbol']),
dict(type='CharMetric')
],
dataset_prefixes=['IC13', 'IC15'])
test_evaluator = val_evaluator
可视化配置¶
每个任务配置该任务对应的可视化器。可视化器主要用于用户模型中间结果的可视化或存储,及 val 和 test 预测结果的可视化。同时可视化的结果可以通过可视化后端储存到不同的后端,比如 WandB,TensorBoard 等。常用修改操作可见可视化。
文本检测的可视化默认配置如下:
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='TextDetLocalVisualizer', # 不同任务有不同的可视化器
vis_backends=vis_backends,
name='visualizer')
目录结构¶
MMOCR
所有配置文件都放置在 configs
文件夹下。为了避免配置文件过长,同时提高配置文件的可复用性以及清晰性,MMOCR 利用 Config 文件的继承特性,将配置内容的八个部分做了拆分。因为每部分均与算法任务相关,因此 MMOCR 对每个任务在 Config 中提供了一个任务文件夹,即 textdet
(文字检测任务)、textrecog
(文字识别任务)、kie
(关键信息提取)。同时各个任务算法配置文件夹下进一步划分为两个部分:_base_
文件夹与诸多算法文件夹:
_base_
文件夹下主要存放与具体算法无关的一些通用配置文件,各部分依目录分为常用的数据集、常用的训练策略以及通用的运行配置。算法配置文件夹中存放与算法强相关的配置项。算法配置文件夹主要分为两部分:
算法的模型与数据流水线:OCR 领域中一般情况下数据增强策略与算法强相关,因此模型与数据流水线通常置于统一位置。
算法在指定数据集上的特定配置:用于训练和测试的配置,将分散在不同位置的 base 配置汇总。同时可能会修改一些
_base_
中的变量,如batch size, 数据流水线,训练策略等
最后,配置内容中的各个模块被分布在不同的配置文件中,各配置文件的内容如下:
目录 | 文件 | 配置内容 |
---|---|---|
textdet/_base_/datasets | icdar_datasets.py, ctw1500.py, ... | 数据集配置 |
textdet/_base_/schedules | schedule_adam_600e.py, ... | 训练策略配置 |
textdet/_base_ | default_runtime.py | 环境配置、默认 hook 配置、日志配置、权重加载配置、评测配置、可视化配置 |
textdet/dbnet | _base_dbnet_resnet18_fpnc.py | 网络配置、数据流水线 |
textdet/dbnet | dbnet_resnet18_fpnc_1200e_icdar2015.py | Dataloader 配置、数据流水线(可选) |
最终目录结构如下:
configs
├── textdet
│ ├── _base_
│ │ ├── datasets
│ │ │ ├── icdar2015.py
│ │ │ ├── icdar2017.py
│ │ │ └── totaltext.py
│ │ ├── schedules
│ │ │ └── schedule_adam_600e.py
│ │ └── default_runtime.py
│ └── dbnet
│ ├── _base_dbnet_resnet18_fpnc.py
│ └── dbnet_resnet18_fpnc_1200e_icdar2015.py
├── textrecog
│ ├── _base_
│ │ ├── datasets
│ │ │ ├── icdar2015.py
│ │ │ ├── icdar2017.py
│ │ │ └── totaltext.py
│ │ ├── schedules
│ │ │ └── schedule_adam_base.py
│ │ └── default_runtime.py
│ └── crnn
│ ├── _base_crnn_mini-vgg.py
│ └── crnn_mini-vgg_5e_mj.py
└── kie
├── _base_
│ ├──datasets
│ └── default_runtime.py
└── sgdmr
└── sdmgr_novisual_60e_wildreceipt_openset.py
配置文件以及权重命名规则¶
MMOCR 按照以下风格进行配置文件命名,代码库的贡献者需要遵循相同的命名规则。文件名总体分为四部分:算法信息,模块信息,训练信息和数据信息。逻辑上属于不同部分的单词之间用下划线 '_'
连接,同一部分有多个单词用短横线 '-'
连接。
{{算法信息}}_{{模块信息}}_{{训练信息}}_{{数据信息}}.py
算法信息(algorithm info):算法名称,如 dbnet, crnn 等
模块信息(module info):按照数据流的顺序列举一些中间的模块,其内容依赖于算法任务,同时为了避免Config过长,会省略一些与模型强相关的模块。下面举例说明:
对于文字检测任务和关键信息提取任务:
{{算法信息}}_{{backbone}}_{{neck}}_{{head}}_{{训练信息}}_{{数据信息}}.py
一般情况下,head 位置为算法专有的 head,因此通常省略。
对于文本识别任务:
{{算法信息}}_{{backbone}}_{{encoder}}_{{decoder}}_{{训练信息}}_{{数据信息}}.py
一般情况下,encoder 和 decoder 位置为算法专有,因此通常省略。
训练信息(training info):训练策略的一些设置,包括 batch size,schedule 等
数据信息(data info):数据集名称、模态、输入尺寸等,如 icdar2015,synthtext 等
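下面以一个实际的配置文件名为例,对应关系如下:
# 以 dbnet_resnet18_fpnc_1200e_icdar2015.py 为例:
# dbnet         -> 算法信息
# resnet18_fpnc -> 模块信息(backbone + neck)
# 1200e         -> 训练信息(训练 1200 个 epoch)
# icdar2015     -> 数据信息(数据集名称)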
数据集准备¶
前言¶
经过数十年的发展,OCR 领域涌现出了一系列的相关数据集,这些数据集往往采用风格各异的格式来提供文本的标注文件,使得用户在使用这些数据集时不得不进行格式转换。因此,为了方便用户进行数据集准备,我们提供了一键式的数据准备脚本,使得用户仅需使用一行命令即可完成数据集准备的全部步骤。
在这一节,我们将介绍一个典型的数据集准备流程:
然而,如果你已经有了 MMOCR 支持的格式的数据集,那么第一步就不是必须的。你可以阅读数据集类及标注格式来了解更多细节。
数据集下载及格式转换¶
以 ICDAR 2015 数据集的文本检测任务准备步骤为例,你可以执行以下命令来完成数据集准备:
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet
命令执行完成后,数据集将被下载并转换至 MMOCR 格式,文件目录结构如下:
data/icdar2015
├── textdet_imgs
│ ├── test
│ └── train
├── textdet_test.json
└── textdet_train.json
数据准备完毕以后,你也可以通过使用我们提供的数据集浏览工具 browse_dataset.py 来可视化数据集的标签是否被正确生成,例如:
python tools/analysis_tools/browse_dataset.py configs/textdet/_base_/datasets/icdar2015.py
修改配置文件¶
单数据集训练¶
在使用新的数据集时,我们需要对其图像、标注文件的路径等基础信息进行配置。configs/xxx/_base_/datasets/
路径下已预先配置了 MMOCR 中常用的数据集(当你使用 prepare_dataset.py
来准备数据集时,这个配置文件通常会在数据集准备就绪后自动生成),这里我们以 ICDAR 2015 数据集为例(见 configs/textdet/_base_/datasets/icdar2015.py
):
icdar2015_textdet_data_root = 'data/icdar2015' # 数据集根目录
# 训练集配置
icdar2015_textdet_train = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root, # 数据根目录
ann_file='textdet_train.json', # 标注文件名称
filter_cfg=dict(filter_empty_gt=True, min_size=32), # 数据过滤
pipeline=None)
# 测试集配置
icdar2015_textdet_test = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root,
ann_file='textdet_test.json',
test_mode=True,
pipeline=None)
在配置好数据集后,我们还需要在相应的算法模型配置文件中导入想要使用的数据集。例如,在 ICDAR 2015 数据集上训练 “DBNet_R18” 模型:
_base_ = [
'_base_dbnet_r18_fpnc.py',
'../_base_/datasets/icdar2015.py', # 导入数据集配置文件
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_sgd_1200e.py',
]
icdar2015_textdet_train = _base_.icdar2015_textdet_train # 指定训练集
icdar2015_textdet_train.pipeline = _base_.train_pipeline # 指定训练集使用的数据流水线
icdar2015_textdet_test = _base_.icdar2015_textdet_test # 指定测试集
icdar2015_textdet_test.pipeline = _base_.test_pipeline # 指定测试集使用的数据流水线
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train) # 在 train_dataloader 中指定使用的训练数据集
val_dataloader = dict(
batch_size=1,
num_workers=4,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=icdar2015_textdet_test) # 在 val_dataloader 中指定使用的验证数据集
test_dataloader = val_dataloader
多数据集训练¶
此外,基于 ConcatDataset
,用户还可以使用多个数据集组合来训练或测试模型。用户只需在配置文件中将 dataloader 中的 dataset 类型设置为 ConcatDataset
,并指定对应的数据集列表即可。
train_list = [ic11, ic13, ic15]
train_dataloader = dict(
dataset=dict(
type='ConcatDataset', datasets=train_list, pipeline=train_pipeline))
例如,以下配置使用了 MJSynth 数据集进行训练,并使用 6 个学术数据集(CUTE80, IIIT5K, SVT, SVTP, ICDAR2013, ICDAR2015)进行测试。
_base_ = [ # 导入所有需要使用的数据集配置
'../_base_/datasets/mjsynth.py',
'../_base_/datasets/cute80.py',
'../_base_/datasets/iiit5k.py',
'../_base_/datasets/svt.py',
'../_base_/datasets/svtp.py',
'../_base_/datasets/icdar2013.py',
'../_base_/datasets/icdar2015.py',
'../_base_/default_runtime.py',
'../_base_/schedules/schedule_adadelta_5e.py',
'_base_crnn_mini-vgg.py',
]
# 训练集列表
train_list = [_base_.mjsynth_textrecog_train]
# 测试集列表
test_list = [
_base_.cute80_textrecog_test, _base_.iiit5k_textrecog_test, _base_.svt_textrecog_test,
_base_.svtp_textrecog_test, _base_.icdar2013_textrecog_test, _base_.icdar2015_textrecog_test
]
# 使用 ConcatDataset 来级联列表中的多个数据集
train_dataset = dict(
type='ConcatDataset', datasets=train_list, pipeline=_base_.train_pipeline)
test_dataset = dict(
type='ConcatDataset', datasets=test_list, pipeline=_base_.test_pipeline)
train_dataloader = dict(
batch_size=192 * 4,
num_workers=32,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=train_dataset)
test_dataloader = dict(
batch_size=1,
num_workers=4,
persistent_workers=True,
drop_last=False,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=test_dataset)
val_dataloader = test_dataloader
训练与测试¶
为了适配多样化的用户需求,MMOCR 实现了多种不同操作系统及设备上的模型训练及测试。无论是使用本地机器进行单机单卡训练测试,还是在部署了 slurm 系统的大规模集群上进行训练测试,MMOCR 都提供了便捷的解决方案。
单卡机器训练及测试¶
训练¶
tools/train.py
实现了基础的训练服务。MMOCR 推荐用户使用 GPU 进行模型训练和测试,但是,用户也可以通过指定 CUDA_VISIBLE_DEVICES=-1
来使用 CPU 设备进行模型训练及测试。例如,以下命令演示了如何使用 CPU 或单卡 GPU 来训练 DBNet 文本检测器。
# 通过调用 tools/train.py 来训练指定的 MMOCR 模型
CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [PY_ARGS]
# 训练
# 示例 1:使用 CPU 训练 DBNet
CUDA_VISIBLE_DEVICES=-1 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
# 示例 2:指定使用 gpu:0 训练 DBNet,指定工作目录为 dbnet/,并打开混合精度(amp)训练
CUDA_VISIBLE_DEVICES=0 python tools/train.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py --work-dir dbnet/ --amp
注解
此外,如需使用指定编号的 GPU 进行训练或测试,例如使用3号 GPU,则可以通过设定 CUDA_VISIBLE_DEVICES=3 来实现。
下表列出了 train.py
支持的所有参数。其中,不带 --
前缀的参数为必须的位置参数,带 --
前缀的参数为可选参数。
参数 | 类型 | 说明 |
---|---|---|
config | str | (必须)配置文件路径。 |
--work-dir | str | 指定工作目录,用于存放训练日志以及模型 checkpoints。 |
--resume | bool | 是否从断点处恢复训练。 |
--amp | bool | 是否使用混合精度。 |
--auto-scale-lr | bool | 是否使用学习率自动缩放。 |
--cfg-options | str | 用于覆写配置文件中的指定参数。示例 |
--launcher | str | 启动器选项,可选项目为 ['none', 'pytorch', 'slurm', 'mpi']。 |
--local_rank | int | 本地机器编号,用于多机多卡分布式训练,默认为 0。 |
测试¶
tools/test.py
提供了基础的测试服务,其使用原理和训练脚本类似。例如,以下命令演示了 CPU 或 GPU 单卡测试 DBNet 模型。
# 通过调用 tools/test.py 来测试指定的 MMOCR 模型
CUDA_VISIBLE_DEVICES= python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} [PY_ARGS]
# 测试
# 示例 1:使用 CPU 测试 DBNet
CUDA_VISIBLE_DEVICES=-1 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
# 示例 2:使用 gpu:0 测试 DBNet
CUDA_VISIBLE_DEVICES=0 python tools/test.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth
下表列出了 test.py
支持的所有参数。其中,不带 --
前缀的参数为必须的位置参数,带 --
前缀的参数为可选参数。
参数 | 类型 | 说明 |
---|---|---|
config | str | (必须)配置文件路径。 |
checkpoint | str | (必须)待测试模型路径。 |
--work-dir | str | 工作目录,用于存放训练日志以及模型 checkpoints。 |
--save-preds | bool | 是否将预测结果写入 pkl 文件并保存。 |
--show | bool | 是否可视化预测结果。 |
--show-dir | str | 将可视化的预测结果保存至指定路径。 |
--wait-time | float | 可视化间隔时间(秒),默认为 2 秒。 |
--cfg-options | str | 用于覆写配置文件中的指定参数。示例 |
--launcher | str | 启动器选项,可选项目为 ['none', 'pytorch', 'slurm', 'mpi']。 |
--local_rank | int | 本地机器编号,用于多机多卡分布式训练,默认为 0。 |
--tta | bool | 是否使用测试时数据增强 |
多卡机器训练及测试¶
对于大规模模型,采用多 GPU 训练和测试可以极大地提升操作的效率。为此,MMOCR 提供了基于 MMDistributedDataParallel 实现的分布式脚本 tools/dist_train.sh
和 tools/dist_test.sh
。
# 训练
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [PY_ARGS]
# 测试
NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} [PY_ARGS]
下表列出了 dist_*.sh
支持的参数:
参数 | 类型 | 说明 |
---|---|---|
NNODES | int | 总共使用的机器节点个数,默认为 1。 |
NODE_RANK | int | 节点编号,默认为 0。 |
PORT | int | 在 RANK 0 机器上使用的 MASTER_PORT 端口号,取值范围是 0 至 65535,默认值为 29500。 |
MASTER_ADDR | str | RANK 0 机器的 IP 地址,默认值为 127.0.0.1。 |
CONFIG_FILE | str | (必须)指定配置文件的地址。 |
CHECKPOINT_FILE | str | (必须,仅在 dist_test.sh 中适用)指定模型权重的地址。 |
GPU_NUM | int | (必须)指定 GPU 的数量。 |
[PY_ARGS] | str | 该部分一切的参数都会被直接传入 tools/train.py 或 tools/test.py 中。 |
这两个脚本可以实现单机多卡或多机多卡的训练和测试,下面演示了它们在不同场景下的用法。
单机多卡¶
以下命令演示了如何在搭载多块 GPU 的单台机器上使用指定数目的 GPU 进行训练及测试:
训练
使用单台机器上的 4 块 GPU 训练 DBNet。
# 单机 4 卡训练 DBNet
tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 4
测试
使用单台机器上的 4 块 GPU 测试 DBNet。
# 单机 4 卡测试 DBNet
tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
单机多任务训练及测试¶
对于搭载多块 GPU 的单台服务器而言,用户可以通过指定 GPU 的形式来同时执行不同的训练任务。例如,以下命令演示了如何在一台 8 卡 GPU 服务器上分别使用 [0, 1, 2, 3]
卡测试 DBNet 及 [4, 5, 6, 7]
卡训练 CRNN:
# 指定使用 gpu:0,1,2,3 测试 DBNet,并分配端口号 29500
CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 4
# 指定使用 gpu:4,5,6,7 训练 CRNN,并分配端口号 29501
CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh configs/textrecog/crnn/crnn_academic_dataset.py 4
注解
dist_train.sh
默认将 MASTER_PORT
设置为 29500
,当单台机器上有其它进程已占用该端口时,程序则会出现运行时错误 RuntimeError: Address already in use
。此时,用户需要将 MASTER_PORT
设置为 (0~65535)
范围内的其它空闲端口号。
多机多卡训练及测试¶
MMOCR 基于torch.distributed 提供了相同局域网下的多台机器间的多卡分布式训练。
训练
以下命令演示了如何在两台机器上分别使用 2 张 GPU 合计 4 卡训练 DBNet:
# 示例:在两台机器上分别使用 2 张 GPU 合计 4 卡训练 DBNet
# 在 “机器1” 上运行以下命令
NNODES=2 NODE_RANK=0 PORT=29501 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
# 在 “机器2” 上运行以下命令
NNODES=2 NODE_RANK=1 PORT=29501 MASTER_ADDR=10.140.0.169 tools/dist_train.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py 2
测试
以下命令演示了如何在两台机器上分别使用 2 张 GPU 合计 4 卡测试:
# 示例:在两台机器上分别使用 2 张 GPU 合计 4 卡测试
# 在 “机器1” 上运行以下命令
NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
# 在 “机器2” 上运行以下命令
NNODES=2 NODE_RANK=1 PORT=29501 MASTER_ADDR=10.140.0.169 tools/dist_test.sh configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth 2
注解
需要注意的是,采用多机多卡训练时,机器间的网络传输速度可能成为训练速度的瓶颈。
集群训练及测试¶
针对 Slurm 调度系统管理的计算集群,MMOCR 提供了对应的训练和测试任务提交脚本 tools/slurm_train.sh
及 tools/slurm_test.sh
。
# tools/slurm_train.sh 提供基于 slurm 调度系统管理的计算集群上提交训练任务的脚本
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
# tools/slurm_test.sh 提供基于 slurm 调度系统管理的计算集群上提交测试任务的脚本
GPUS=${GPUS} GPUS_PER_NODE=${GPUS_PER_NODE} CPUS_PER_TASK=${CPUS_PER_TASK} SRUN_ARGS=${SRUN_ARGS} ./tools/slurm_test.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${CHECKPOINT_FILE} ${WORK_DIR} [PY_ARGS]
参数 | 类型 | 说明 |
---|---|---|
GPUS | int | 使用的 GPU 数目,默认为8。 |
GPUS_PER_NODE | int | 每台节点机器上搭载的 GPU 数目,默认为8。 |
CPUS_PER_TASK | int | 任务使用的 CPU 个数,默认为5。 |
SRUN_ARGS | str | 其他 srun 支持的参数。详见这里 |
PARTITION | str | (必须)指定使用的集群分区。 |
JOB_NAME | str | (必须)提交任务的名称。 |
WORK_DIR | str | (必须)任务的工作目录,训练日志以及模型的 checkpoints 将被保存至该目录。 |
CHECKPOINT_FILE | str | (必须,仅在 slurm_test.sh 中适用)指向模型权重的地址。 |
[PY_ARGS] | str | tools/train.py 以及 tools/test.py 支持的参数。 |
这两个脚本可以实现 slurm 集群上的训练和测试,下面演示了它们在不同场景下的用法。
训练
以下示例为在 slurm 集群 dev 分区申请 1 块 GPU 进行 DBNet 训练。
# 示例:在 slurm 集群 dev 分区申请 1块 GPU 资源进行 DBNet 训练任务
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_train.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py work_dir
测试
同理,tools/slurm_test.sh 提供了测试任务提交脚本。以下示例为在 slurm 集群 dev 分区申请 1 块 GPU 资源进行 DBNet 测试。
# 示例:在 slurm 集群 dev 分区申请 1块 GPU 资源进行 DBNet 测试任务
GPUS=1 GPUS_PER_NODE=1 CPUS_PER_TASK=5 tools/slurm_test.sh dev db_r50 configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth work_dir
进阶技巧¶
从断点恢复训练¶
tools/train.py
提供了从断点恢复训练的功能,用户仅需在命令中指定 --resume
参数,即可自动从断点恢复训练。
# 示例:从断点恢复训练
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --resume
默认地,程序将自动从上次训练过程中最后成功保存的断点,即 latest.pth
处开始继续训练。如果用户希望指定从特定的断点处开始恢复训练,则可以按如下格式在模型的配置文件中设定该断点的路径。
# 示例:在配置文件中设置想要加载的断点路径
load_from = 'work_dir/dbnet/models/epoch_10000.pth'
混合精度训练¶
混合精度训练可以在缩减内存占用的同时提升训练速度,为此,MMOCR 提供了一键式的混合精度训练方案,仅需在训练时添加 --amp
参数即可。
# 示例:使用自动混合精度训练
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --amp
下表列出了 MMOCR 中各算法对自动混合精度训练的支持情况:
任务/模型 | 是否支持混合精度训练 | 备注 |
---|---|---|
文本检测 | ||
DBNet | 是 | |
DBNetpp | 是 | |
DRRG | 否 | roi_align_rotated 不支持 fp16 |
FCENet | 否 | BCELoss 不支持 fp16 |
Mask R-CNN | 是 | |
PANet | 是 | |
PSENet | 是 | |
TextSnake | 否 | |
文本识别 | ||
ABINet | 是 | |
ASTER | 是 | |
CRNN | 是 | |
MASTER | 是 | |
NRTR | 是 | |
RobustScanner | 是 | |
SAR | 是 | |
SATRN | 是 |
自动学习率缩放¶
MMOCR 在配置文件中为每一个模型设置了默认的初始学习率,然而,当用户使用的 batch_size
不同于我们预设的 base_batch_size
时,这些初始学习率可能不再完全适用。因此,我们提供了自动学习率缩放工具。当使用不同于 MMOCR 预设的 base_batch_size
进行训练时,用户仅需添加 --auto-scale-lr
参数即可自动依据新的 batch_size
将学习率缩放至对应尺度。
# 示例:使用自动学习率缩放
python tools/train.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py --auto-scale-lr
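自动缩放功能依赖配置中的 auto_scale_lr.base_batch_size 字段;如果自定义配置中缺少该字段,可以参考如下示意写法自行补充(数值仅为示例):
auto_scale_lr = dict(base_batch_size=16)  # 依据该基准 batch size 按线性规则缩放学习率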
可视化模型测试结果¶
tools/test.py
提供了可视化接口,以方便用户对模型进行定性分析。
(绿色框为真实标注,红色框为预测结果)
(绿色字体为真实标注,红色字体为预测结果)
(从左至右分别为:原图,文本检测和识别结果,文本分类结果,关系图)
# 示例 1:每隔 2 秒绘制出一张可视化结果
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show --wait-time 2
# 示例2:对于不支持图形化界面的系统(如计算集群等),可以将可视化结果存入指定路径
python tools/test.py configs/textdet/dbnet/dbnet_r50dcnv2_fpnc_1200e_icdar2015.py dbnet_r50.pth --show-dir ./vis_results
tools/test.py
中可视化相关参数说明:
参数 | 类型 | 说明 |
---|---|---|
--show | bool | 是否绘制可视化结果。 |
--show-dir | str | 可视化图片存储路径。 |
--wait-time | float | 可视化间隔时间(秒),默认为 2。 |
测试时数据增强¶
测试时增强,指的是在推理(预测)阶段,将原始图片进行水平翻转、垂直翻转、对角线翻转、旋转角度等数据增强操作,得到多张图,分别进行推理,再对多个结果进行综合分析,得到最终输出结果。
为此,MMOCR 提供了一键式测试时数据增强,仅需在测试时添加 --tta
参数即可。
注解
TTA 仅支持文本识别模型。
python tools/test.py configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py checkpoints/crnn_mini-vgg_5e_mj.pth --tta
可视化¶
阅读本文前建议先阅读 MMEngine: 可视化 以初步了解 Visualizer 的定义及相关用法。
简单来说,MMEngine 中实现了用于满足日常可视化需求的可视化器件 Visualizer
,其主要包含三个功能:
实现了常用的绘图 API,例如 draw_bboxes 实现了边界盒的绘制功能,draw_lines 实现了线条的绘制功能。
支持将可视化结果、学习率曲线、损失函数曲线以及验证精度曲线等写入多种后端中,包括本地磁盘以及常用的深度学习训练日志记录工具,如 TensorBoard 和 WandB。
支持在代码中的任意位置进行调用,例如在训练或测试过程中可视化或记录模型的中间状态,如特征图及验证结果等。
基于 MMEngine 的 Visualizer,MMOCR 内预置了多种可视化工具,用户仅需简单修改配置文件即可使用:
tools/analysis_tools/browse_dataset.py
脚本提供了数据集可视化功能,其可以绘制经过数据变换(Data Transforms)之后的图像及对应的标注内容,详见browse_dataset.py
。MMEngine 中实现了
LoggerHook
,该 Hook 利用Visualizer
将学习率、损失以及评估结果等数据写入Visualizer
设置的后端中,因此通过修改配置文件中的Visualizer
后端,比如修改为TensorBoardVISBackend
或WandbVISBackend
,可以实现将日志记录到 TensorBoard
或WandB
等常见的训练日志记录工具中,从而方便用户使用这些可视化工具来分析和监控训练流程。MMOCR 中实现了
VisualizerHook
,该 Hook 利用Visualizer
将验证阶段或预测阶段的预测结果进行可视化或储存至Visualizer
设置的后端中,因此通过修改配置文件中的Visualizer
后端,比如修改为TensorBoardVISBackend
或WandbVISBackend
,可以实现将预测的图像存储到TensorBoard
或Wandb
中。
配置¶
得益于注册机制的使用,在 MMOCR 中,我们可以通过修改配置文件来设置可视化器件 Visualizer
的行为。通常,我们在 task/_base_/default_runtime.py
中定义可视化相关的默认配置, 详见配置教程。
vis_backends = [dict(type='LocalVisBackend')]
visualizer = dict(
type='TextxxxLocalVisualizer', # 不同任务使用不同的可视化器
vis_backends=vis_backends,
name='visualizer')
依据以上示例,我们可以看出 Visualizer
的配置主要由两个部分组成,即,Visualizer
的类型以及其采用的可视化后端 vis_backends
。
针对不同的 OCR 任务,MMOCR 中预置了多种可视化器件,包括
TextDetLocalVisualizer
,TextRecogLocalVisualizer
,TextSpottingLocalVisualizer
以及KIELocalVisualizer
。这些可视化器件依照自身任务的特点对基础的 Visulizer API 进行了拓展,并实现了相应的标签信息接口add_datasamples
。例如,用户可以直接使用TextDetLocalVisualizer
来可视化文本检测任务的标签或预测结果。MMOCR 默认将可视化后端
vis_backend
设置为本地可视化后端LocalVisBackend
,将所有可视化结果及其他训练信息保存在本地文件夹中。
存储¶
MMOCR 默认使用本地可视化后端 LocalVisBackend
,VisualizerHook
和LoggerHook
中存储的模型损失、学习率、模型评估精度以及可视化结果等信息将被默认保存至{work_dir}/{config_name}/{time}/{vis_data}
文件夹。此外,MMOCR 也支持其它常用的可视化后端,如 TensorboardVisBackend
以及 WandbVisBackend
用户只需要将配置文件中的 vis_backends
类型修改为对应的可视化后端即可。例如,用户只需要在配置文件中插入以下代码块,即可将数据存储至 TensorBoard
以及 WandB
中。
_base_.visualizer.vis_backends = [
dict(type='LocalVisBackend'),
dict(type='TensorboardVisBackend'),
dict(type='WandbVisBackend'),]
绘制¶
绘制预测结果信息¶
MMOCR 主要利用 VisualizationHook
可视化 validation 和 test 的预测结果,默认情况下 VisualizationHook
为关闭状态,默认配置如下:
visualization=dict( # 用户可视化 validation 和 test 的结果
type='VisualizationHook',
enable=False,
interval=1,
show=False,
draw_gt=False,
draw_pred=False)
下表为 VisualizationHook
支持的参数:
参数 | 说明 |
---|---|
enable | VisualizationHook 的开启和关闭由参数 enable 控制,默认是关闭的状态。 |
interval | 在 VisualizationHook 开启的情况下,用以控制每隔多少 iteration 存储或展示一次 val 或 test 的结果。 |
show | 控制是否可视化 val 或 test 的结果 |
draw_gt | val 或 test 的结果是否绘制标注信息 |
draw_pred | val 或 test 的结果是否绘制预测结果 |
如果在训练或者测试过程中想开启 VisualizationHook
相关功能和配置,仅需修改配置即可,以 dbnet_resnet18_fpnc_1200e_icdar2015.py
为例, 同时绘制标注和预测,并且将图像展示,配置可进行如下修改
visualization = _base_.default_hooks.visualization
visualization.update(
dict(enable=True, show=True, draw_gt=True, draw_pred=True))

如果只想查看预测结果信息,可以只将 draw_pred 设为 True:
visualization = _base_.default_hooks.visualization
visualization.update(
dict(enable=True, show=True, draw_gt=False, draw_pred=True))

在 test.py
过程中进一步简化,提供了 --show
和 --show-dir
两个参数,无需修改配置即可视化测试过程中绘制标注和预测结果。
# 展示test 结果
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show
# 指定预测结果的存储位置
python tools/test.py configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py dbnet_r18_fpnc_1200e_icdar2015/epoch_400.pth --show-dir imgs/

常用工具¶
可视化工具¶
数据集可视化工具¶
MMOCR 提供了数据集可视化工具 tools/visualizations/browse_dataset.py
以辅助用户排查可能遇到的数据集相关的问题。用户只需要指定所使用的训练配置文件(通常存放在如 configs/textdet/dbnet/xxx.py
文件中)或数据集配置(通常存放在 configs/textdet/_base_/datasets/xxx.py
文件中)路径。该工具将依据输入的配置文件类型自动将经过数据流水线(data pipeline)处理过的图像及其对应的标签,或原始图片及其对应的标签绘制出来。
支持参数¶
python tools/visualizations/browse_dataset.py \
${CONFIG_FILE} \
[-o, --output-dir ${OUTPUT_DIR}] \
[-p, --phase ${DATASET_PHASE}] \
[-m, --mode ${DISPLAY_MODE}] \
[-t, --task ${DATASET_TASK}] \
[-n, --show-number ${NUMBER_IMAGES_DISPLAY}] \
[-i, --show-interval ${SHOW_INTERVAL}] \
[--cfg-options ${CFG_OPTIONS}]
参数名 | 类型 | 描述 |
---|---|---|
config | str | (必须) 配置文件路径。 |
-o, --output-dir | str | 如果图形化界面不可用,请指定一个输出路径来保存可视化结果。 |
-p, --phase | str | 用于指定需要可视化的数据集切片,如 "train", "test", "val"。当数据集存在多个变种时,也可以通过该参数来指定待可视化的切片。 |
-m, --mode | original, transformed, pipeline | 用于指定数据可视化的模式。original:原始模式,仅可视化数据集的原始标注;transformed:变换模式,展示经过所有数据变换步骤的最终图像;pipeline:流水线模式,展示数据变换过程中每一个中间步骤的变换图像。默认使用 transformed 变换模式。 |
-t, --task | auto, textdet, textrecog | 用于指定可视化数据集的任务类型。auto:自动模式,将依据给定的配置文件自动选择合适的任务类型,如果无法自动获取任务类型,则需要用户手动指定为 textdet 文本检测任务或 textrecog 文本识别任务。默认采用 auto 自动模式。 |
-n, --show-number | int | 指定需要可视化的样本数量。若该参数缺省则默认将可视化全部图片。 |
-i, --show-interval | float | 可视化图像间隔时间,默认为 2 秒。 |
--cfg-options | str | 用于覆盖配置文件中的参数,详见示例。 |
用法示例¶
以下示例演示了如何使用该工具可视化 “DBNet_R50_icdar2015” 模型使用的训练数据。
# 使用默认参数可视化 "dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015" 模型的训练数据
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
默认情况下,可视化模式为 “transformed”,您将看到经由数据流水线变换过后的图像和标注:



如果您只想可视化原始数据集,只需将模式设置为 “original”:
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m original

或者,您也可以使用 “pipeline” 模式来可视化整个数据流水线的中间结果:
python tools/visualizations/browse_dataset.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -m pipeline

另外,用户还可以通过指定数据集配置文件的路径来可视化数据集的原始图像及其对应的标注,例如:
python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py
部分数据集可能有多个变体。例如,icdar2015
文本识别数据集的配置文件中包含两个测试集变体,分别为 icdar2015_textrecog_test
和 icdar2015_1811_textrecog_test
,如下所示:
icdar2015_textrecog_test = dict(
ann_file='textrecog_test.json',
# ...
)
icdar2015_1811_textrecog_test = dict(
ann_file='textrecog_test_1811.json',
# ...
)
在这种情况下,用户可以通过指定 -p
参数来可视化不同的变体,例如,使用以下命令可视化 icdar2015_1811_textrecog_test
变体:
python tools/visualizations/browse_dataset.py configs/textrecog/_base_/datasets/icdar2015.py -p icdar2015_1811_textrecog_test
基于该工具,用户可以轻松地查看数据集的原始图像及其对应的标注,以便于检查数据集的标注是否正确。
优化器参数策略可视化工具¶
MMOCR 提供了优化器参数可视化工具 tools/visualizations/vis_scheduler.py
以辅助用户排查优化器的超参数调度器(无需训练),支持学习率(learning rate)和动量(momentum)。
工具简介¶
python tools/visualizations/vis_scheduler.py \
${CONFIG_FILE} \
[-p, --parameter ${PARAMETER_NAME}] \
[-d, --dataset-size ${DATASET_SIZE}] \
[-n, --ngpus ${NUM_GPUs}] \
[-s, --save-path ${SAVE_PATH}] \
[--title ${TITLE}] \
[--style ${STYLE}] \
[--window-size ${WINDOW_SIZE}] \
[--cfg-options]
所有参数的说明:
- config:模型配置文件的路径。
- -p, --parameter:可视化参数名,只能为 ["lr", "momentum"] 之一,默认为 "lr"。
- -d, --dataset-size:数据集的大小。如果指定,build_dataset 将被跳过并使用这个大小作为数据集大小;默认使用 build_dataset 所得数据集的大小。
- -n, --ngpus:使用 GPU 的数量,默认为 1。
- -s, --save-path:保存的可视化图片的路径,默认不保存。
- --title:可视化图片的标题,默认为配置文件名。
- --style:可视化图片的风格,默认为 whitegrid。
- --window-size:可视化窗口大小,如果没有指定,默认为 12*7。如果需要指定,按照格式 W*H。
- --cfg-options:对配置文件的修改,参考学习配置文件。
注解
部分数据集在解析标注阶段比较耗时,可直接通过 -d, --dataset-size
指定数据集的大小,以节约时间。
如何在开始训练前可视化学习率曲线¶
你可以使用如下命令来绘制配置文件 configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py
将会使用的学习率变化曲线:
python tools/visualizations/vis_scheduler.py configs/textdet/dbnet/dbnet_resnet50-dcnv2_fpnc_1200e_icdar2015.py -d 100

分析工具¶
离线评测工具¶
对于已保存的预测结果,我们提供了离线评测脚本 tools/analysis_tools/offline_eval.py
。例如,以下代码演示了如何使用该工具对 “PSENet” 模型的输出结果进行离线评估:
# 初次运行测试脚本时,用户可以通过指定 --save-preds 参数来保存模型的输出结果
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --save-preds
# 示例:对 PSENet 进行测试
python tools/test.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py epoch_600.pth --save-preds
# 之后即可使用已保存的输出文件进行离线评估
python tools/analysis_tools/offline_eval.py ${CONFIG_FILE} ${PRED_FILE}
# 示例:对已保存的 PSENet 结果进行离线评估
python tools/analysis_tools/offline_eval.py configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py work_dirs/psenet_r50_fpnf_600e_icdar2015/epoch_600.pth_predictions.pkl
--save-preds
默认将输出结果保存至 work_dir/CONFIG_NAME/MODEL_NAME_predictions.pkl
此外,基于此工具,用户也可以将其他算法库获取的预测结果转换成 MMOCR 支持的格式,从而使用 MMOCR 内置的评估指标来对其他算法库的模型进行评测。
参数 | 类型 | 说明 |
---|---|---|
config | str | (必须)配置文件路径。 |
pkl_results | str | (必须)预先保存的预测结果文件。 |
--cfg-options | str | 用于覆写配置文件中的指定参数。示例 |
计算 FLOPs 和参数量¶
我们提供一个计算 FLOPs 和参数量的方法,首先我们使用以下命令安装依赖。
pip install fvcore
计算 FLOPs 和参数量的脚本使用方法如下:
python tools/analysis_tools/get_flops.py ${config} --shape ${IMAGE_SHAPE}
参数 | 类型 | 说明 |
---|---|---|
config | str | (必须) 配置文件路径。 |
--shape | int*2 | 计算 FLOPs 使用的图片尺寸,如 --shape 320 320。默认为 640 640。 |
获取 dbnet_resnet18_fpnc_100k_synthtext.py
FLOPs 和参数量的示例命令如下。
python tools/analysis_tools/get_flops.py configs/textdet/dbnet/dbnet_resnet18_fpnc_100k_synthtext.py --shape 1024 1024
输出如下:
input shape is (1, 3, 1024, 1024)
| module | #parameters or shape | #flops |
| :------------------------ | :------------------- | :------ |
| model | 12.341M | 63.955G |
| backbone | 11.177M | 38.159G |
| backbone.conv1 | 9.408K | 2.466G |
| backbone.conv1.weight | (64, 3, 7, 7) | |
| backbone.bn1 | 0.128K | 83.886M |
| backbone.bn1.weight | (64,) | |
| backbone.bn1.bias | (64,) | |
| backbone.layer1 | 0.148M | 9.748G |
| backbone.layer1.0 | 73.984K | 4.874G |
| backbone.layer1.1 | 73.984K | 4.874G |
| backbone.layer2 | 0.526M | 8.642G |
| backbone.layer2.0 | 0.23M | 3.79G |
| backbone.layer2.1 | 0.295M | 4.853G |
| backbone.layer3 | 2.1M | 8.616G |
| backbone.layer3.0 | 0.919M | 3.774G |
| backbone.layer3.1 | 1.181M | 4.842G |
| backbone.layer4 | 8.394M | 8.603G |
| backbone.layer4.0 | 3.673M | 3.766G |
| backbone.layer4.1 | 4.721M | 4.837G |
| neck | 0.836M | 14.887G |
| neck.lateral_convs | 0.246M | 2.013G |
| neck.lateral_convs.0.conv | 16.384K | 1.074G |
| neck.lateral_convs.1.conv | 32.768K | 0.537G |
| neck.lateral_convs.2.conv | 65.536K | 0.268G |
| neck.lateral_convs.3.conv | 0.131M | 0.134G |
| neck.smooth_convs | 0.59M | 12.835G |
| neck.smooth_convs.0.conv | 0.147M | 9.664G |
| neck.smooth_convs.1.conv | 0.147M | 2.416G |
| neck.smooth_convs.2.conv | 0.147M | 0.604G |
| neck.smooth_convs.3.conv | 0.147M | 0.151G |
| det_head | 0.329M | 10.909G |
| det_head.binarize | 0.164M | 10.909G |
| det_head.binarize.0 | 0.147M | 9.664G |
| det_head.binarize.1 | 0.128K | 20.972M |
| det_head.binarize.3 | 16.448K | 1.074G |
| det_head.binarize.4 | 0.128K | 83.886M |
| det_head.binarize.6 | 0.257K | 67.109M |
| det_head.threshold | 0.164M | |
| det_head.threshold.0 | 0.147M | |
| det_head.threshold.1 | 0.128K | |
| det_head.threshold.3 | 16.448K | |
| det_head.threshold.4 | 0.128K | |
| det_head.threshold.6 | 0.257K | |
!!!Please be cautious if you use the results in papers. You may need to check if all ops are supported and verify that the flops computation is correct.
数据元素与数据结构¶
MMOCR 基于 MMEngine: 抽象数据接口 将各任务所需的数据统一封装入 data_sample
中。MMEngine 的抽象数据接口实现了基础的增/删/改/查功能,且支持不同设备间的数据迁移,也支持了类字典和张量的操作,充分满足了数据的日常使用需求,这也使得不同算法的数据接口可以得到统一。
得益于统一的数据封装,算法库内的 visualizer
,evaluator
,dataset
等各个模块间的数据流通都得到了极大的简化。在 MMOCR 中,我们对数据接口类型作出以下约定:
xxxData: 单一粒度的数据标注或模型输出。目前 MMEngine 内置了三种粒度的数据元素,包括实例级数据(
InstanceData
),像素级数据(PixelData
)以及图像级的标签数据(LabelData
)。在 MMOCR 目前支持的任务中,文本检测以及关键信息抽取任务使用InstanceData
来封装文本实例的检测框及对应标签,而文本识别任务则使用了LabelData
来封装文本内容。xxxDataSample: 继承自 MMEngine: 数据基类
BaseDataElement
,用于保存单个任务的训练或测试样本的所有标注及预测信息。如文本检测任务的数据样本类TextDetDataSample
,文本识别任务的数据样本类TextRecogDataSample
,以及关键信息抽取任务的数据样本类KIEDataSample
。
下面,我们将分别介绍数据元素 xxxData 与数据样本 xxxDataSample 在 MMOCR 中的实际应用。
数据元素 xxxData¶
InstanceData
和 LabelData
是 MMEngine
中定义的基础数据元素,用于封装不同粒度的标注数据或模型输出。在 MMOCR 中,我们针对不同任务中实际使用的数据类型,分别采用了 InstanceData
与 LabelData
进行了封装。
InstanceData¶
在文本检测任务中,检测器关注的是实例级别的文字样本,因此我们使用 InstanceData
来封装该任务所需的数据。其所需的训练标注和预测输出通常包含了矩形或多边形边界盒,以及边界盒标签。由于文本检测任务只有一种正样本类,即 “text”,在 MMOCR 中我们默认使用 0
来编号该类别。以下代码示例展示了如何使用 InstanceData
数据抽象接口来封装文本检测任务中使用的数据类型。
import torch
from mmengine.structures import InstanceData
# 定义 gt_instance 用于封装边界盒的标注信息
gt_instance = InstanceData()
gt_instance.bboxes = torch.Tensor([[0, 0, 10, 10], [10, 10, 20, 20]])
gt_instance.polygons = torch.Tensor([[[0, 0], [10, 0], [10, 10], [0, 10]],
                                     [[10, 10], [20, 10], [20, 20], [10, 20]]])
gt_instance.labels = torch.LongTensor([0, 0])
# 定义 pred_instance 用于封装模型的输出信息
pred_instances = InstanceData()
pred_polygons, scores = model(input)
pred_instances.polygons = pred_polygons
pred_instances.scores = scores
MMOCR 中对 InstanceData
字段的约定如下表所示。值得注意的是,InstanceData
中各字段的长度必须与样本中的实例个数 N 相等。
字段 | 类型 | 说明 |
bboxes | torch.FloatTensor |
文本边界框 [x1, y1, x2, y2] ,形状为 (N, 4) 。 |
labels | torch.LongTensor |
实例的类别,长度为 (N, ) 。MMOCR 中默认使用 0 来表示正样本类,即 “text” 类。 |
polygons | list[np.array(dtype=np.float32)] |
表示文本实例的多边形,列表长度为 (N, ) 。 |
scores | torch.Tensor |
文本实例检测框的置信度,长度为 (N, ) 。 |
ignored | torch.BoolTensor |
是否在训练中忽略当前文本实例,长度为 (N, ) 。 |
texts | list[str] |
实例对应的文本,长度为 (N, ) ,用于端到端 OCR 任务和 KIE。 |
text_scores | torch.FloatTensor |
文本预测的置信度,长度为(N, ) ,用于端到端 OCR 任务。 |
edge_labels | torch.IntTensor |
节点的邻接矩阵,形状为 (N, N) 。在 KIE 任务中,节点之间状态的可选值为 -1 (忽略,不参与 loss 计算),0 (断开)和 1 (连接)。 |
edge_scores | torch.FloatTensor |
用于 KIE 任务中每条边的预测置信度,形状为 (N, N) 。 |
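下面给出一个简要的示意代码(字段取值为随机生成,仅用于演示上表的字段约定),展示在同一个 InstanceData 中按约定字段名存放数据时,各字段的长度需与实例个数 N 保持一致:
import numpy as np
import torch
from mmengine.structures import InstanceData
N = 2  # 实例个数
instances = InstanceData()
instances.bboxes = torch.rand((N, 4))                      # 文本边界框
instances.labels = torch.zeros((N, ), dtype=torch.long)    # 类别标签,MMOCR 中均为 0
instances.polygons = [np.random.rand(8).astype(np.float32) for _ in range(N)]  # 多边形
instances.scores = torch.rand((N, ))                       # 检测框置信度
instances.ignored = torch.zeros((N, ), dtype=torch.bool)   # 是否在训练中忽略
print(len(instances))  # 2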
LabelData¶
对于文字识别任务,标注内容和预测内容都会使用 LabelData
进行封装。
import torch
from mmengine.structures import LabelData
# 定义一个 gt_text 用于封装标签文本内容
gt_text = LabelData()
gt_text.item = 'MMOCR'
# 定义一个 pred_text 对象用于封装预测文本以及置信度
pred_text = LabelData()
index, score = model(input)
text = dictionary.idx2str(index)
pred_text.score = score
pred_text.item = text
MMOCR 中对 LabelData
字段的约定如下表所示:
字段 | 类型 | 说明 |
item | str |
文本内容。 |
score | list[float] |
预测的文本内容的置信度。 |
indexes | torch.LongTensor |
文本字符经过字典编码后的序列,且包含了除 <UNK> 以外的所有特殊字符。 |
padded_indexes | torch.LongTensor |
如果 indexes 的长度小于最大序列长度,且 pad_idx 存在时,该字段保存了填充至最大序列长度 max_seq_len 的编码后的文本序列。 |
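作为补充,下面的示意代码展示了 indexes 与 padded_indexes 两个字段的典型形式。其中的字符编码映射、max_seq_len=8 与 pad_idx=0 均为假设值,仅用于说明:
import torch
from mmengine.structures import LabelData
gt_text = LabelData()
gt_text.item = 'MMOCR'
# 假设字典将 'M'、'O'、'C'、'R' 分别编码为 1、2、3、4
gt_text.indexes = torch.LongTensor([1, 1, 2, 3, 4])
# 假设 max_seq_len=8 且 pad_idx=0,则填充后的序列如下
gt_text.padded_indexes = torch.LongTensor([1, 1, 2, 3, 4, 0, 0, 0])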
数据样本 xxxDataSample¶
通过定义统一的数据结构,我们可以方便地将标注数据和预测结果进行统一封装,使代码库不同模块间的数据传递更加便捷。在 MMOCR 中,我们基于现在支持的三个任务及其所需要的数据分别封装了三种数据抽象,包括文本检测任务数据抽象 TextDetDataSample
,文本识别任务数据抽象 TextRecogDataSample
,以及关键信息抽取任务数据抽象 KIEDataSample
。这些数据抽象均继承自 MMEngine: 数据基类 BaseDataElement
,用于保存单个任务的训练或测试样本的所有标注及预测信息。
文本检测任务数据抽象 TextDetDataSample¶
TextDetDataSample 用于封装文字检测任务所需的数据,其主要包含了两个字段 gt_instances
与 pred_instances
,分别用于存放标注信息与预测结果。
字段 | 类型 | 说明 |
gt_instances | InstanceData |
标注信息。 |
pred_instances | InstanceData |
预测结果。 |
其中会用到的 InstanceData
约定字段有:
字段 | 类型 | 说明 |
bboxes | torch.FloatTensor |
文本边界框 [x1, y1, x2, y2] ,形状为 (N, 4) 。 |
labels | torch.LongTensor |
实例的类别,长度为 (N, ) 。在 MMOCR 中通常使用 0 来表示正样本类,即 “text” 类 |
polygons | list[np.array(dtype=np.float32)] |
表示文本实例的多边形,列表长度为 (N, ) 。 |
scores | torch.Tensor |
文本实例任务预测的检测框的置信度,长度为 (N, ) 。 |
ignored | torch.BoolTensor |
是否在训练中忽略当前文本实例,长度为 (N, ) 。 |
由于文本检测模型通常只会输出 bboxes/polygons 中的一项,因此我们只需确保这两项中的一个被赋值即可。
以下示例代码展示了 TextDetDataSample
的使用方法:
import torch
from mmengine.structures import InstanceData
from mmocr.structures import TextDetDataSample
data_sample = TextDetDataSample()
# 指定当前图片的标注信息
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.gt_instances = gt_instances
# 指定当前图片的预测信息
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.zeros((5,), dtype=torch.long)
data_sample.pred_instances = pred_instances
文本识别任务数据抽象 TextRecogDataSample¶
TextRecogDataSample
用于封装文字识别任务的数据。它有两个属性,gt_text
和 pred_text
, 分别用于存放标注信息和预测结果。
字段 | 类型 | 说明 |
gt_text | LabelData |
标注信息。 |
pred_text | LabelData |
预测结果。 |
以下示例代码展示了 TextRecogDataSample
的使用方法:
import torch
from mmengine.structures import LabelData
from mmocr.structures import TextRecogDataSample
data_sample = TextRecogDataSample()
# 指定当前图片的标注信息
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_text = LabelData(metainfo=img_meta)
gt_text.item = 'mmocr'
data_sample.gt_text = gt_text
# 指定当前图片的预测结果
pred_text = LabelData(metainfo=img_meta)
pred_text.item = 'mmocr'
data_sample.pred_text = pred_text
其中会用到的 LabelData
字段有:
字段 | 类型 | 说明 |
item | list[str] |
实例对应的文本,长度为 (N, ) ,用于端到端 OCR 任务和 KIE |
score | torch.FloatTensor |
文本预测的置信度,长度为 (N, ),用于端到端 OCR 任务 |
indexes | torch.LongTensor |
文本字符经过字典编码后的序列,且包含了除 <UNK> 以外的所有特殊字符。 |
padded_indexes | torch.LongTensor |
如果 indexes 的长度小于最大序列长度,且 pad_idx 存在时,该字段保存了填充至最大序列长度 max_seq_len 的编码后的文本序列。 |
关键信息抽取任务数据抽象 KIEDataSample¶
KIEDataSample
用于封装 KIE 任务所需的数据,其同样约定了两个属性,即 gt_instances
与 pred_instances
,分别用于存放标注信息与预测结果。
字段 | 类型 | 说明 |
gt_instances | InstanceData |
标注信息。 |
pred_instances | InstanceData |
预测结果。 |
该任务会用到的 InstanceData
字段如下表所示:
字段 | 类型 | 说明 |
bboxes | torch.Tensor |
文本边界框 [x1, y1, x2, y2] ,形状为 (N, 4) 。 |
labels | torch.LongTensor |
实例的类别,长度为 (N, ) 。在 MMOCR 中通常为 0,即 “text” 类。 |
texts | list[str] |
实例对应的文本,长度为 (N, ) ,用于端到端 OCR 任务和 KIE 任务。 |
edge_labels | torch.IntTensor |
节点之间的邻接矩阵,形状为 (N, N) 。在 KIE 任务中,节点之间状态的可选值为 -1 (不关心,且不参与 loss 计算),0 (断开)和 1 (连接)。 |
edge_scores | torch.FloatTensor |
每条边的预测置信度,形状为 (N, N) 。 |
scores | torch.FloatTensor |
节点标签的预测置信度, 形状为 (N,) 。 |
警告
由于 KIE 任务的模型实现尚未有统一标准,该设计目前仅考虑了 SDMGR 模型的使用场景。因此,该设计有可能在我们支持更多 KIE 模型后产生变动。
以下示例代码展示了 KIEDataSample
的使用方法。
import torch
from mmengine.structures import InstanceData
from mmocr.structures import KIEDataSample
data_sample = KIEDataSample()
# 指定当前图片的标注信息
img_meta = dict(img_shape=(800, 1196, 3), pad_shape=(800, 1216, 3))
gt_instances = InstanceData(metainfo=img_meta)
gt_instances.bboxes = torch.rand((5, 4))
gt_instances.labels = torch.zeros((5,), dtype=torch.long)
gt_instances.texts = ['text1', 'text2', 'text3', 'text4', 'text5']
gt_instances.edge_labels = torch.randint(-1, 2, (5, 5))
data_sample.gt_instances = gt_instances
# 指定当前图片的预测信息
pred_instances = InstanceData()
pred_instances.bboxes = torch.rand((5, 4))
pred_instances.labels = torch.rand((5,))
pred_instances.edge_labels = torch.randint(-1, 2, (5, 5))
pred_instances.edge_scores = torch.rand((5, 5))
data_sample.pred_instances = pred_instances
数据变换与流水线¶
在 MMOCR 的设计中,数据集的构建与数据准备是相互解耦的。也就是说,OCRDataset
等数据集构建类负责完成标注文件的读取与解析功能;而数据变换方法(Data Transforms)则进一步实现了数据预处理、数据增强、数据格式化等相关功能。目前,如下表所示,MMOCR 中共实现了 5 类数据变换方法:
数据变换类型 | 对应文件 | 功能说明 |
数据读取 | loading.py | 实现了不同格式数据的读取功能。 |
数据格式化 | formatting.py | 完成不同任务所需数据的格式化功能。 |
跨库数据适配器 | adapters.py | 负责 OpenMMLab 项目内跨库调用的数据格式转换功能。 |
数据增强 | ocr_transforms.py textdet_transforms.py textrecog_transforms.py |
实现了不同任务下的各类数据增强方法。 |
包装类 | wrappers.py | 实现了对 ImgAug 等常用算法库的包装,使其适配 MMOCR 的内部数据格式。 |
由于每一个数据变换类之间都是相互独立的,因此,在约定好固定的数据存储字段后,我们可以便捷地采用任意的数据变换组合来构建数据流水线(Pipeline)。如下图所示,在 MMOCR 中,一个典型的训练数据流水线主要由数据读取、图像增强以及数据格式化三部分构成,用户只需要在配置文件中定义相关的数据流水线列表,并指定具体所需的数据变换类及其参数即可:
train_pipeline_r18 = [
# 数据读取(图像)
dict(
type='LoadImageFromFile',
color_type='color_ignore_orientation'),
# 数据读取(标注)
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
# 使用 ImgAug 作数据增强
dict(
type='ImgAugWrapper',
args=[['Fliplr', 0.5],
dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]]),
# 使用 MMOCR 内置的图像增强
dict(type='RandomCrop', min_side_ratio=0.1),
dict(type='Resize', scale=(640, 640), keep_ratio=True),
dict(type='Pad', size=(640, 640)),
# 数据格式化
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
小技巧
更多有关数据流水线配置的教程可见配置文档。下面,我们将简单介绍 MMOCR 中已支持的数据变换类型。
对于每一个数据变换方法,MMOCR 都严格按照文档字符串(docstring)规范在源码中提供了详细的代码注释。例如,每一个数据转换类的头部我们都注释了 “需求字段”(Required keys
), “修改字段”(Modified Keys
)与 “添加字段”(Added Keys
)。其中,“需求字段”代表该数据转换方法对于输入数据所需包含字段的强制需求,而“修改字段”与“添加字段”则表明该方法可能会在原有数据基础之上修改或添加的字段。例如,LoadImageFromFile
实现了图片的读取功能,其需求字段为图像的存储路径 img_path
,而修改字段则包括了读入的图像信息 img
,以及图片当前尺寸 img_shape
,图片原始尺寸 ori_shape
等图片属性。
@TRANSFORMS.register_module()
class LoadImageFromFile(MMCV_LoadImageFromFile):
# 在每一个数据变换方法的头部,我们都提供了详细的代码注释。
"""Load an image from file.
Required Keys:
- img_path
Modified Keys:
- img
- img_shape
- ori_shape
"""
注解
在 MMOCR 的数据流水线中,图像及标签等信息被统一保存在字典中。通过统一的字段名,我们可以在不同的数据变换方法间灵活地传递数据。因此,了解 MMOCR 中常用的约定字段名是非常重要的。
为方便用户查询,下表列出了 MMOCR 中各数据转换(Data Transform)类常用的字段约定和说明。
字段 | 类型 | 说明 |
img | np.array(dtype=np.uint8) |
图像信息,形状为 (h, w, c) 。 |
img_shape | tuple(int, int) |
当前图像尺寸 (h, w) 。 |
ori_shape | tuple(int, int) |
图像在初始化时的尺寸 (h, w) 。 |
scale | tuple(int, int) |
存放用户在 Resize 系列数据变换(Transform)中指定的目标图像尺寸 (h, w) 。注意:该值未必与变换后的实际图像尺寸相符。 |
scale_factor | tuple(float, float) |
存放用户在 Resize 系列数据变换(Transform)中指定的目标图像缩放因子 (w_scale, h_scale) 。注意:该值未必与变换后的实际图像尺寸相符。 |
keep_ratio | bool |
是否按等比例对图像进行缩放。 |
flip | bool |
图像是否被翻转。 |
flip_direction | str |
翻转方向。可选项为 horizontal , vertical , diagonal 。 |
gt_bboxes | np.array(dtype=np.float32) |
文本实例边界框的真实标签。 |
gt_polygons | list[np.array(dtype=np.float32) |
文本实例边界多边形的真实标签。 |
gt_bboxes_labels | np.array(dtype=np.int64) |
文本实例对应的类别标签。在 MMOCR 中通常为 0,代指 "text" 类别。 |
gt_texts | list[str] |
与文本实例对应的字符串标注。 |
gt_ignored | np.array(dtype=np.bool_) |
是否要在计算目标时忽略该实例(用于检测任务中)。 |
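为了更直观地说明这些字段如何在数据变换之间传递,这里给出一个假设性的自定义变换示意(类名与实现均为演示用途,并非 MMOCR 内置变换),它读取约定字段 img,并写入 flip 与 flip_direction:
import numpy as np
from mmcv.transforms import BaseTransform
from mmocr.registry import TRANSFORMS
@TRANSFORMS.register_module()
class ToyHorizontalFlip(BaseTransform):
    """示意用变换:水平翻转图像,并同步更新约定字段。"""
    def transform(self, results: dict) -> dict:
        results['img'] = np.flip(results['img'], axis=1)  # 读取并修改约定字段 img
        results['flip'] = True                            # 写入约定字段 flip
        results['flip_direction'] = 'horizontal'          # 写入约定字段 flip_direction
        return results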
数据读取 - loading.py¶
数据读取类主要实现了不同文件格式、后端读取图片及加载标注信息的功能。目前,MMOCR 内部共实现了以下数据读取类的 Data Transforms:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
LoadImageFromFile | img_path |
img img_shape ori_shape |
从图片路径读取图片,支持多种文件存储后端(如 disk , http , petrel 等)及图片解码后端(如 cv2 , turbojpeg , pillow , tifffile 等)。 |
LoadOCRAnnotations | bbox bbox_label polygon ignore text |
gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts |
解析 OCR 任务所需的标注信息。 |
LoadKIEAnnotations | bboxes bbox_labels edge_labels texts |
gt_bboxes gt_bboxes_labels gt_edge_labels gt_texts ori_shape |
解析 KIE 任务所需的标注信息。 |
数据增强 - xxx_transforms.py¶
数据增强是文本检测、识别等任务中必不可少的流程之一。目前,MMOCR 中共实现了数十种文本领域内常用的数据增强模块,依据其任务类型,分别为通用 OCR 数据增强模块 ocr_transforms.py,文本检测数据增强模块 textdet_transforms.py,以及文本识别数据增强模块 textrecog_transforms.py。
具体而言,ocr_transforms.py
中实现了随机剪裁、随机旋转等各任务通用的数据增强模块:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
RandomCrop | img gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts (optional) |
img img_shape gt_bboxes gt_bboxes_labels gt_polygons gt_ignored gt_texts (optional) |
随机裁剪,并确保裁剪后的图片至少包含一个文本实例。可选参数为 min_side_ratio ,用以控制裁剪图片的短边占原始图片的比例,默认值为 0.4 。 |
RandomRotate | img img_shape gt_bboxes (optional)gt_polygons (optional) |
img img_shape gt_bboxes (optional)gt_polygons (optional)rotated_angle |
随机旋转,并可选择对旋转后图像的黑边进行填充。 |
textdet_transforms.py
则实现了文本检测任务中常用的数据增强模块:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
RandomFlip | img gt_bboxes gt_polygons |
img gt_bboxes gt_polygons flip flip_direction |
随机翻转,支持水平、垂直和对角三种方向的图像翻转。默认使用水平翻转。 |
FixInvalidPolygon | gt_polygons gt_ignored |
gt_polygons gt_ignored |
自动修复或忽略非法多边形标注。 |
textrecog_transforms.py
中实现了文本识别任务中常用的数据增强模块:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
RescaleToHeight | img |
img img_shape scale scale_factor keep_ratio |
缩放图像至指定高度,并尽可能保持长宽比不变。当 min_width 及 max_width 被指定时,长宽比则可能会被改变。 |
警告
以上表格仅选择性地对部分数据增强方法作简要介绍,更多数据增强方法介绍请参考API 文档或阅读代码内的文档注释。
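例如,下面的配置片段(参数取值仅为示意,并非推荐设置)展示了如何在数据流水线中组合使用上述数据增强:
# 文本检测流水线中常见的数据增强组合(示意)
det_augments = [
    dict(type='RandomCrop', min_side_ratio=0.4),
    dict(type='RandomRotate', max_angle=10),
    dict(type='RandomFlip', prob=0.5, direction='horizontal'),
]
# 文本识别流水线中常见的数据增强(示意)
recog_augments = [
    dict(type='RescaleToHeight', height=32, min_width=32, max_width=128),
]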
数据格式化 - formatting.py¶
数据格式化负责将图像、真实标签以及其它常用信息等打包成一个字典。不同的任务通常依赖于不同的数据格式化数据变换类。例如:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
PackTextDetInputs | - | - | 用于打包文本检测任务所需要的输入信息。 |
PackTextRecogInputs | - | - | 用于打包文本识别任务所需要的输入信息。 |
PackKIEInputs | - | - | 用于打包关键信息抽取任务所需要的输入信息。 |
跨库数据适配器 - adapters.py¶
跨库数据适配器打通了 MMOCR 与其他 OpenMMLab 系列算法库如 MMDetection 之间的数据格式,使得跨项目调用其它开源算法库的配置文件及算法成为了可能。目前,MMOCR 实现了 MMDet2MMOCR
以及 MMOCR2MMDet
,使得数据可以在 MMDetection 与 MMOCR 的格式之间自由转换;借助这些适配转换器,用户可以在 MMOCR 算法库内部轻松调用任何 MMDetection 已支持的检测算法,并在 OCR 相关数据集上进行训练。例如,我们以 Mask R-CNN 为例提供了教程,展示了如何在 MMOCR 中使用 MMDetection 的检测算法训练文本检测器。
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
MMDet2MMOCR | gt_masks gt_ignore_flags |
gt_polygons gt_ignored |
将 MMDet 中采用的字段转换为对应的 MMOCR 字段。 |
MMOCR2MMDet | img_shape gt_polygons gt_ignored |
gt_masks gt_ignore_flags |
将 MMOCR 中采用的字段转换为对应的 MMDet 字段。 |
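下面给出一个简化的流水线片段作为示意(其中参数取值仅为演示),展示先加载 MMOCR 格式标注、再借助 MMOCR2MMDet 转换为 MMDetection 所需字段的用法:
dict(
    type='LoadOCRAnnotations',
    with_polygon=True,
    with_bbox=True,
    with_label=True),
dict(type='MMOCR2MMDet', poly2mask=True),
# 之后即可衔接 MMDetection 的数据变换与打包步骤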
包装类 - wrappers.py¶
为了方便用户在 MMOCR 内部无缝调用常用的 CV 算法库,我们在 wrappers.py 中提供了相应的包装类。其主要打通了 MMOCR 与其它第三方算法库之间的数据格式和转换标准,使得用户可以在 MMOCR 的配置文件内直接配置使用这些第三方库提供的数据变换方法。目前支持的包装类有:
数据转换类名称 | 需求字段 | 修改/添加字段 | 说明 |
ImgAugWrapper | img gt_polygons (optional for text recognition)gt_bboxes (optional for text recognition)gt_bboxes_labels (optional for text recognition)gt_ignored (optional for text recognition)gt_texts (optional) |
img gt_polygons (optional for text recognition)gt_bboxes (optional for text recognition)gt_bboxes_labels (optional for text recognition)gt_ignored (optional for text recognition)img_shape (optional)gt_texts (optional) |
ImgAug 包装类,用于打通 ImgAug 与 MMOCR 的数据格式及配置,方便用户调用 ImgAug 实现的一系列数据增强方法。 |
TorchVisionWrapper | img |
img img_shape |
TorchVision 包装类,用于打通 TorchVision 与 MMOCR 的数据格式及配置,方便用户调用 torchvision.transforms 中实现的一系列数据变换方法。 |
ImgAugWrapper
示例¶
例如,在原生的 ImgAug 中,我们可以按照如下代码定义一个 Sequential
类型的数据增强流程,对图像分别进行随机翻转、随机旋转和随机缩放:
import imgaug.augmenters as iaa
aug = iaa.Sequential(
iaa.Fliplr(0.5), # 以概率 0.5 进行水平翻转
iaa.Affine(rotate=(-10, 10)), # 随机旋转 -10 到 10 度
iaa.Resize((0.5, 3.0)) # 随机缩放到 50% 到 300% 的尺寸
)
而在 MMOCR 中,我们可以通过 ImgAugWrapper
包装类,将上述数据增强流程直接配置到 train_pipeline
中:
dict(
type='ImgAugWrapper',
args=[
['Fliplr', 0.5],
dict(cls='Affine', rotate=[-10, 10]),
['Resize', [0.5, 3.0]],
]
)
其中,args
参数接收一个列表,列表中的每个元素可以是一个列表,也可以是一个字典。如果是列表,则列表的第一个元素为 imgaug.augmenters
中的类名,后面的元素为该类的初始化参数;如果是字典,则字典的 cls
键对应 imgaug.augmenters
中的类名,其他键值对则对应该类的初始化参数。
TorchVisionWrapper
示例¶
例如,在原生的 TorchVision 中,我们可以按照如下代码定义一个 Compose
类型的数据变换流程,对图像进行色彩抖动:
import torchvision.transforms as transforms
aug = transforms.Compose([
transforms.ColorJitter(
brightness=32.0 / 255, # 亮度抖动范围
saturation=0.5) # 饱和度抖动范围
])
而在 MMOCR 中,我们可以通过 TorchVisionWrapper
包装类,将上述数据变换流程直接配置到 train_pipeline
中:
dict(
type='TorchVisionWrapper',
op='ColorJitter',
brightness=32.0 / 255,
saturation=0.5
)
其中,op
参数为 torchvision.transforms
中的类名,后面的参数则对应该类的初始化参数。
模型评测¶
注解
阅读此文档前,建议您先了解 MMEngine: 模型精度评测基本概念。
评测指标¶
MMOCR 基于 MMEngine: BaseMetric 基类实现了常用的文本检测、文本识别以及关键信息抽取任务的评测指标,用户可以通过修改配置文件中的 val_evaluator
与 test_evaluator
字段来便捷地指定验证与测试阶段采用的评测方法。例如,以下配置展示了如何在文本检测算法中使用 HmeanIOUMetric
来评测模型性能。
# 文本检测任务中通常使用 HmeanIOUMetric 来评测模型性能
val_evaluator = [dict(type='HmeanIOUMetric')]
# 此外,MMOCR 也支持相同任务下的多种指标组合评测,如同时使用 WordMetric 及 CharMetric
val_evaluator = [
dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol']),
dict(type='CharMetric')
]
小技巧
更多评测相关配置请参考评测配置教程。
如下表所示,MMOCR 目前针对文本检测、识别、及关键信息抽取等任务共内置了 5 种评测指标,分别为 HmeanIOUMetric
,WordMetric
,CharMetric
,OneMinusNEDMetric
,和 F1Metric
。
评测指标 | 任务类型 | 输入字段 | 输出字段 |
HmeanIOUMetric | 文本检测 | pred_polygons pred_scores gt_polygons |
recall precision hmean |
WordMetric | 文本识别 | pred_text gt_text |
word_acc word_acc_ignore_case word_acc_ignore_case_symbol |
CharMetric | 文本识别 | pred_text gt_text |
char_recall char_precision |
OneMinusNEDMetric | 文本识别 | pred_text gt_text |
1-N.E.D |
F1Metric | 关键信息抽取 | pred_labels gt_labels |
macro_f1 micro_f1 |
通常来说,每一类任务所采用的评测标准是约定俗成的,用户一般无须深入了解或手动修改评测方法的内部实现。然而,为了方便用户实现更加定制化的需求,本文档将进一步介绍了 MMOCR 内置评测算法的具体实现策略,以及可配置参数。
HmeanIOUMetric¶
HmeanIOUMetric 是文本检测任务中应用最广泛的评测指标之一,因其计算了检测精度(Precision)与召回率(Recall)之间的调和平均数(Harmonic mean, H-mean),故得名 HmeanIOUMetric
。记精度为 P,召回率为 R,则 HmeanIOUMetric
可由下式计算得到:

\[ H = \frac{2 \times P \times R}{P + R} \]
另外,由于其等价于 \(\beta = 1\) 时的 F-score (又称 F-measure 或 F-metric),HmeanIOUMetric
有时也被写作 F1Metric
或 f1-score
等:

\[ F_{1} = \frac{(1 + 1^2) \times P \times R}{1^2 \times P + R} = \frac{2 \times P \times R}{P + R} \]
在 MMOCR 的设计中,HmeanIOUMetric
的计算可以概括为以下几个步骤:
1. 过滤无效的预测边界盒

   - 依据置信度阈值 pred_score_thrs 过滤掉得分较低的预测边界盒
   - 依据 ignore_precision_thr 阈值过滤掉与 ignored 样本重合度过高的预测边界盒

   值得注意的是,pred_score_thrs 默认将自动搜索一定范围内的最佳阈值,用户也可以通过手动修改配置文件来自定义搜索范围:

   # HmeanIOUMetric 默认以 0.1 为步长搜索 [0.3, 0.9] 范围内的最佳得分阈值
   val_evaluator = dict(type='HmeanIOUMetric', pred_score_thrs=dict(start=0.3, stop=0.9, step=0.1))

2. 计算 IoU 矩阵

   在数据处理阶段,HmeanIOUMetric 会计算并维护一个 \(M \times N\) 的 IoU 矩阵 iou_metric,以方便后续的边界盒配对步骤。其中,M 和 N 分别为标签边界盒与过滤后预测边界盒的数量。由此,该矩阵的每个元素都存放了第 m 个标签边界盒与第 n 个预测边界盒之间的交并比(IoU)。

3. 基于相应的配对策略统计能被准确匹配的 GT 样本数

   尽管 HmeanIOUMetric 可以由固定的公式计算取得,不同的任务或算法库内部的具体实现仍可能存在一些细微差别。这些差异主要体现在采用不同的策略来匹配真实与预测边界盒,从而导致最终得分的差距。目前,MMOCR 内部的 HmeanIOUMetric 共支持两种不同的匹配策略,即 vanilla 与 max_matching。如下所示,用户可以通过修改配置文件来指定不同的匹配策略。

   - vanilla 匹配策略

     HmeanIOUMetric 默认采用 vanilla 匹配策略,该实现与 MMOCR 0.x 版本中的 hmean-iou 及 ICDAR 系列官方文本检测竞赛的评测标准保持一致,采用先到先得的匹配方式对标签边界盒(Ground-truth bbox)与预测边界盒(Predicted bbox)进行配对。

     # 不指定 strategy 时,HmeanIOUMetric 默认采用 'vanilla' 匹配策略
     val_evaluator = dict(type='HmeanIOUMetric')

   - max_matching 匹配策略

     针对现有匹配机制中的不完善之处,MMOCR 算法库实现了一套更高效的匹配策略,用以最大化匹配数目。

     # 指定采用 'max_matching' 匹配策略
     val_evaluator = dict(type='HmeanIOUMetric', strategy='max_matching')

   注解

   我们建议面向学术研究的开发用户采用默认的 vanilla 匹配策略,以保证与其他论文的对比结果保持一致。而面向工业应用的开发用户则可以采用 max_matching 匹配策略,以获得精准的结果。

4. 根据上文介绍的 HmeanIOUMetric 公式计算最终的评测得分
WordMetric¶
WordMetric 实现了单词级别的文本识别评测指标,并内置了 exact
,ignore_case
,及 ignore_case_symbol
三种文本匹配模式,用户可以在配置文件中修改 mode
字段来自由组合输出一种或多种文本匹配模式下的 WordMetric
得分。
# 在文本识别任务中使用 WordMetric 评测
val_evaluator = [
dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol'])
]
exact
:全匹配模式,即,预测与标签完全一致才能被记录为正确样本。ignore_case
:忽略大小写的匹配模式。ignore_case_symbol
:忽略大小写及符号的匹配模式,这也是大部分学术论文中报告的文本识别准确率;MMOCR 报告的识别模型性能默认采用该匹配模式。
假设真实标签为 MMOCR!
,模型的输出结果为 mmocr
,则三种匹配模式下的 WordMetric
得分分别为:{'exact': 0, 'ignore_case': 0, 'ignore_case_symbol': 1}
。
CharMetric¶
CharMetric 实现了不区分大小写的字符级别的文本识别评测指标。
# 在文本识别任务中使用 CharMetric 评测
val_evaluator = [dict(type='CharMetric')]
具体而言,CharMetric
会输出两个评测指标,即字符精度 char_precision
和字符召回率 char_recall
。设正确预测的字符(True Positive)数量为 \(\sigma_{tp}\),则精度 P 和召回率 R 可由下式计算取得:

\[ P = \frac{\sigma_{tp}}{\sigma_{pred}}, \quad R = \frac{\sigma_{tp}}{\sigma_{gt}} \]
其中,\(\sigma_{gt}\) 与 \(\sigma_{pred}\) 分别为标签文本与预测文本所包含的字符总数。
例如,假设标签文本为 “MMOCR”,预测文本为 “mm0cR1”,则正确预测的字符为 “m”、“m”、“c”、“R” 共 4 个,标签字符总数为 5,预测字符总数为 6,使用 CharMetric
评测指标的得分为:{'char_recall': 0.8, 'char_precision': 0.6667}。
OneMinusNEDMetric¶
OneMinusNEDMetric(1-N.E.D)
常用于中文或英文文本行级别标注的文本识别评测,不同于全匹配的评测标准要求预测与真实样本完全一致,该评测指标使用归一化的编辑距离(Edit Distance,又名莱温斯坦距离 Levenshtein Distance)来测量预测文本与真实文本之间的差异性,从而在评测长文本样本时能够更好地区分出模型的性能差异。假设真实和预测文本分别为 \(s_i\) 和 \(\hat{s_i}\),其长度分别为 \(l_{i}\) 和 \(\hat{l_i}\),则 OneMinusNEDMetric
得分可由下式计算得到:

\[ \mathrm{score} = 1 - \frac{1}{N}\sum_{i=1}^{N}\frac{D(s_i, \hat{s_i})}{\max(l_i, \hat{l_i})} \]
其中,N 是样本总数,\(D(s_1, s_2)\) 为两个字符串之间的编辑距离。
例如,假设真实标签为 “OpenMMLabMMOCR”,模型 A 的预测结果为 “0penMMLabMMOCR”, 模型 B 的预测结果为 “uvwxyz”,则采用全匹配和 OneMinusNEDMetric
评测指标的结果分别为:
全匹配 | 1 - N.E.D. | |
模型 A | 0 | 0.92857 |
模型 B | 0 | 0 |
由上表可以发现,尽管模型 A 仅预测错了一个字母,而模型 B 全部预测错误,在使用全匹配的评测指标时,这两个模型的得分都为 0;而使用 OneMinusNEDMetric
的评测指标则能够更好地区分模型在长文本上的性能差异。
F1Metric¶
F1Metric 实现了针对 KIE 任务的 F1-Metric 评测指标,并提供了 micro
和 macro
两种评测模式。
val_evaluator = [
    dict(type='F1Metric', mode=['micro', 'macro'])
]
micro
模式:依据 True Positive,False Negative,及 False Positive 总数来计算全局 F1-Metric 得分。macro
模式:依据类别标签计算每一类的 F1-Metric,并求平均值。
自定义评测指标¶
对于追求更高定制化功能的用户,MMOCR 也支持自定义实现不同类型的评测指标。一般来说,用户只需要新建自定义评测指标类 CustomizedMetric
并继承 MMEngine: BaseMetric,然后分别重写数据格式处理方法 process
以及指标计算方法 compute_metrics
。最后,将其加入 METRICS
注册器即可实现任意定制化的评测指标。
from typing import Dict, List, Sequence
from mmengine.evaluator import BaseMetric
from mmocr.registry import METRICS
@METRICS.register_module()
class CustomizedMetric(BaseMetric):
def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
""" process 接收两个参数,分别为 data_batch 存放真实标签信息,以及 predictions
存放预测结果。process 方法负责将标签信息转换并存放至 self.results 变量中
"""
pass
def compute_metrics(self, results: List):
""" compute_metric 使用经过 process 方法处理过的标签数据计算最终评测得分
"""
pass
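在此基础上,下面给出一个假设性的完整小例(指标名与字段访问方式基于前文 TextRecogDataSample 的字段约定,仅作参考,并非 MMOCR 内置指标),用于统计预测文本与标签完全一致的样本比例:
from typing import Dict, List, Sequence
from mmengine.evaluator import BaseMetric
from mmocr.registry import METRICS
@METRICS.register_module()
class ToyWordAccMetric(BaseMetric):
    """示意指标:统计预测文本与标签文本完全一致的比例。"""
    def process(self, data_batch: Sequence[Dict], predictions: Sequence[Dict]):
        for data_sample in predictions:
            # 按 TextRecogDataSample 的约定取出预测与标签文本,暂存至 self.results
            self.results.append(
                dict(pred=data_sample['pred_text']['item'],
                     gt=data_sample['gt_text']['item']))
    def compute_metrics(self, results: List) -> Dict:
        correct = sum(r['pred'] == r['gt'] for r in results)
        return dict(toy_word_acc=correct / max(len(results), 1))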
注解
更多内容可参见 MMEngine 文档: BaseMetric。
数据集类¶
概览¶
在 MMOCR 中,所有的数据集都通过不同的基于 mmengine.BaseDataset 的 Dataset 类进行处理。 Dataset 类负责加载数据并进行初始解析,然后将其馈送到 数据流水线 进行数据预处理、增强、格式化等操作。
在本教程中,我们将介绍 Dataset 类的一些常见接口,以及 MMOCR 中 Dataset 实现的使用以及它们支持的注释类型。
小技巧
Dataset 类支持一些高级功能,例如懒加载、数据序列化、利用各种数据集包装器执行数据连接、重复和类别平衡。这些内容将不在本教程中介绍,但您可以阅读 MMEngine: BaseDataset 了解更多详细信息。
常见接口¶
现在,让我们看一个具体的示例并学习 Dataset 类的一些典型接口。OCRDataset
是 MMOCR 中默认使用的 Dataset 实现,因为它的标注格式足够灵活,支持 所有 OCR 任务(详见 OCRDataset)。现在我们将实例化一个 OCRDataset
对象,其中将加载 tests/data/det_toy_dataset
中的玩具数据集。
from mmocr.datasets import OCRDataset
from mmengine.registry import init_default_scope
init_default_scope('mmocr')
train_pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(type='RandomCrop', min_side_ratio=0.1),
dict(type='Resize', scale=(640, 640), keep_ratio=True),
dict(type='Pad', size=(640, 640)),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
dataset = OCRDataset(
data_root='tests/data/det_toy_dataset',
ann_file='textdet_test.json',
test_mode=False,
pipeline=train_pipeline)
让我们查看一下这个数据集的大小:
>>> print(len(dataset))
10
通常,Dataset 类加载并存储两种类型的信息:(1)元信息:储存数据集的属性,例如此数据集中可用的对象类别。 (2)标注:图像的路径及其标签。我们可以通过 dataset.metainfo
访问元信息:
>>> from pprint import pprint
>>> pprint(dataset.metainfo)
{'category': [{'id': 0, 'name': 'text'}],
'dataset_type': 'TextDetDataset',
'task_name': 'textdet'}
对于标注,我们可以通过 dataset.get_data_info(idx)
访问它。该方法返回一个字典,其中包含数据集中第 idx
个样本的信息。该样本已经经过初步解析,但尚未由 数据流水线 处理。
>>> from pprint import pprint
>>> pprint(dataset.get_data_info(0))
{'height': 720,
'img_path': 'tests/data/det_toy_dataset/test/img_10.jpg',
'instances': [{'bbox': [260.0, 138.0, 284.0, 158.0],
'bbox_label': 0,
'ignore': True,
'polygon': [261, 138, 284, 140, 279, 158, 260, 158]},
...,
{'bbox': [1011.0, 157.0, 1079.0, 173.0],
'bbox_label': 0,
'ignore': True,
'polygon': [1011, 157, 1079, 160, 1076, 173, 1011, 170]}],
'sample_idx': 0,
'seg_map': 'test/gt_img_10.txt',
'width': 1280}
另一方面,我们可以通过 dataset[idx]
或 dataset.__getitem__(idx)
获取由数据流水线完整处理过后的样本,该样本可以直接馈入模型并执行完整的训练/测试循环。它有两个字段:
inputs
:经过数据增强后的图像;data_samples
:包含经过数据增强后的标注和元信息的 DataSample,这些元信息可能由一些数据变换产生,并用以记录该样本的某些关键属性。
>>> pprint(dataset[0])
{'data_samples': <TextDetDataSample(
META INFORMATION
ori_shape: (720, 1280)
img_path: 'tests/data/det_toy_dataset/imgs/test/img_10.jpg'
img_shape: (640, 640)
DATA FIELDS
gt_instances: <InstanceData(
META INFORMATION
DATA FIELDS
labels: tensor([0, 0, 0])
polygons: [array([207.33984 , 104.65409 , 208.34634 , 84.528305, 231.49594 ,
86.54088 , 226.46341 , 104.65409 , 207.33984 , 104.65409 ],
dtype=float32), array([237.53496 , 103.6478 , 235.52196 , 84.528305, 365.36096 ,
86.54088 , 364.35446 , 107.67296 , 237.53496 , 103.6478 ],
dtype=float32), array([105.68293, 166.03773, 105.68293, 151.94969, 177.14471, 150.94339,
178.15121, 165.03145, 105.68293, 166.03773], dtype=float32)]
ignored: tensor([ True, False, True])
bboxes: tensor([[207.3398, 84.5283, 231.4959, 104.6541],
[235.5220, 84.5283, 365.3610, 107.6730],
[105.6829, 150.9434, 178.1512, 166.0377]])
) at 0x7f7359f04fa0>
) at 0x7f735a0508e0>,
'inputs': tensor([[[129, 111, 131, ..., 0, 0, 0], ...
[ 19, 18, 15, ..., 0, 0, 0]]], dtype=torch.uint8)}
数据集类及标注格式¶
每个数据集实现只能加载特定格式的数据集。这里列出了所有支持的数据集类及其兼容的格式,以及一个示例配置,以演示如何在实践中使用它们。
注解
如果您不熟悉配置系统,可以阅读 数据集配置文件。
OCRDataset¶
通常,OCR 数据集中有许多不同类型的标注,在不同的子任务(如文本检测和文本识别)中,格式也经常会有所不同。这些差异可能会导致在使用不同数据集时需要不同的数据加载代码,增加了用户的学习和维护成本。
在 MMOCR 中,我们提出了一种统一的数据集格式,可以适应 OCR 的所有三个子任务:文本检测、文本识别和端到端 OCR。这种设计最大程度地提高了数据集的一致性,允许在不同任务之间重复使用数据标注,也使得数据集管理更加方便。考虑到流行的数据集格式并不一致,MMOCR 提供了 Dataset Preparer 来帮助用户将其数据集转换为 MMOCR 格式。我们也十分鼓励研究人员基于此数据格式开发自己的数据集。
标注格式¶
此标注文件是一个 .json
文件,存储一个包含 metainfo
和 data_list
的 dict
,前者包括有关数据集的基本信息,后者由每个图片的标注组成。这里呈现了标注文件中的所有字段的列表,但其中某些字段仅会在特定任务中被用到。
{
"metainfo":
{
"dataset_type": "TextDetDataset", # 可选项: TextDetDataset/TextRecogDataset/TextSpotterDataset
"task_name": "textdet", # 可选项: textdet/textspotter/textrecog
"category": [{"id": 0, "name": "text"}] # 在 textdet/textspotter 里用到
},
"data_list":
[
{
"img_path": "test_img.jpg",
"height": 604,
"width": 640,
"instances": # 一图内的多个实例
[
{
"bbox": [0, 0, 10, 20], # textdet/textspotter 内用到, [x1, y1, x2, y2]。
"bbox_label": 0, # 对象类别, 在 MMOCR 中恒为 0 (文本)
"polygon": [0, 0, 0, 10, 10, 20, 20, 0], # textdet/textspotter 内用到。 [x1, y1, x2, y2, ....]
"text": "mmocr", # textspotter/textrecog 内用到
"ignore": False # textspotter/textdet 内用到,决定是否在训练时忽略该实例
},
#...
],
}
#... 多图片
]
}
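由于该标注文件本质上是一个普通的 JSON 文件,用户也可以直接加载并检查其内容,例如(文件路径仅为示意):
from mmengine import load
ann = load('textdet_train.json')  # 亦可使用标准库 json 读取
print(ann['metainfo']['task_name'])  # 'textdet'
print(len(ann['data_list']))         # 数据集中的图片数量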
示例配置¶
以下是配置的一部分,我们在 train_dataloader
中使用 OCRDataset
加载用于文本检测模型的 ICDAR2015 数据集。请注意,OCRDataset
可以加载由 Dataset Preparer 准备的任何 OCR 数据集。也就是说,您可以将其用于文本识别和文本检测,但您仍然需要根据不同任务的需求修改 pipeline
中的数据变换。
pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
icdar2015_textdet_train = dict(
type='OCRDataset',
data_root='data/icdar2015',
ann_file='textdet_train.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train)
RecogLMDBDataset¶
当数据量非常大时,从文件中读取图像或标签可能会很慢。此外,在学术界,大多数场景文本识别数据集的图像和标签都以 lmdb 格式存储。(示例)
为了更接近主流实践并提高数据存储效率,MMOCR支持通过 RecogLMDBDataset
从 lmdb 数据集加载图像和标签。
标注格式¶
MMOCR 会读取 lmdb 数据集中的以下键:
num_samples
:描述数据集的数据量的参数。图像和标签的键分别以
image-000000001
和label-000000001
的格式命名,索引从1开始。
MMOCR 在 tests/data/rec_toy_dataset/imgs.lmdb
中提供了一个 toy lmdb 数据集。您可以使用以下代码片段了解其格式。
>>> import lmdb
>>>
>>> env = lmdb.open('tests/data/rec_toy_dataset/imgs.lmdb')
>>> txn = env.begin()
>>> for k, v in txn.cursor():
>>> print(k, v)
b'image-000000001' b'\xff...'
b'image-000000002' b'\xff...'
b'image-000000003' b'\xff...'
b'image-000000004' b'\xff...'
b'image-000000005' b'\xff...'
b'image-000000006' b'\xff...'
b'image-000000007' b'\xff...'
b'image-000000008' b'\xff...'
b'image-000000009' b'\xff...'
b'image-000000010' b'\xff...'
b'label-000000001' b'GRAND'
b'label-000000002' b'HOTEL'
b'label-000000003' b'HOTEL'
b'label-000000004' b'PACIFIC'
b'label-000000005' b'03/09/2009'
b'label-000000006' b'ANING'
b'label-000000007' b'Virgin'
b'label-000000008' b'america'
b'label-000000009' b'ATTACK'
b'label-000000010' b'DAVIDSON'
b'num-samples' b'10'
示例配置¶
以下是示例配置的一部分,我们在其中使用 RecogLMDBDataset
加载 toy 数据集。由于 RecogLMDBDataset
会将图像加载为 numpy 数组,因此如果要在数据管道中成功加载图像,应该记得把LoadImageFromFile
替换成 LoadImageFromNDArray
。
pipeline = [
dict(
type='LoadImageFromNDArray'),
dict(
type='LoadOCRAnnotations',
with_text=True,
),
dict(
type='PackTextRecogInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
toy_textrecog_train = dict(
type='RecogLMDBDataset',
data_root='tests/data/rec_toy_dataset/',
ann_file='imgs.lmdb',
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=toy_textrecog_train)
RecogTextDataset¶
在 MMOCR 1.0 之前,MMOCR 0.x 的文本识别任务的输入是文本文件。这些格式已在 MMOCR 1.0 中弃用,这个类随时可能被删除。更多信息
标注格式¶
文本文件可以是 txt
格式或 jsonl
格式。简单的 .txt
标注通过空格将图像名称和词语标注分隔开,因此这种格式无法处理文本实例中包含空格的情况。
img1.jpg OpenMMLab
img2.jpg MMOCR
jsonl
格式使用类似字典的结构来表示标注,其中键 filename
和 text
存储图像名称和单词标签。
{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}
示例配置¶
以下是一个示例配置,我们在训练中使用 RecogTextDataset
加载 txt 标签,而在测试中使用 jsonl 标签。
pipeline = [
    dict(
        type='LoadImageFromFile'),
    dict(
        type='LoadOCRAnnotations',
        with_text=True,
    ),
    dict(
        type='PackTextRecogInputs',
        meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
# loading 0.x txt format annos
txt_dataset = dict(
type='RecogTextDataset',
data_root=data_root,
ann_file='old_label.txt',
data_prefix=dict(img_path='imgs'),
parser_cfg=dict(
type='LineStrParser',
keys=['filename', 'text'],
keys_idx=[0, 1]),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=txt_dataset)
# loading 0.x json line format annos
jsonl_dataset = dict(
type='RecogTextDataset',
data_root=data_root,
ann_file='old_label.jsonl',
data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineJsonParser',
        keys=['filename', 'text']),
    pipeline=pipeline)
test_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=False),
dataset=jsonl_dataset)
IcdarDataset¶
在 MMOCR 1.0 之前,MMOCR 0.x 的文本检测输入采用了类似 COCO 格式的注释。这些格式已在 MMOCR 1.0 中弃用,这个类在将来的任何时候都可能被删除。更多信息
标注格式¶
{
"images": [
{
"id": 1,
"width": 800,
"height": 600,
"file_name": "test.jpg"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [0,0,10,10],
"segmentation": [
[0,0,10,0,10,10,0,10]
],
"area": 100,
"iscrowd": 0
}
]
}
配置示例¶
这是配置示例的一部分,其中我们令 train_dataloader
使用 IcdarDataset
来加载旧标签。
pipeline = [
dict(
type='LoadImageFromFile'),
dict(
type='LoadOCRAnnotations',
with_polygon=True,
with_bbox=True,
with_label=True,
),
dict(
type='PackTextDetInputs',
meta_keys=('img_path', 'ori_shape', 'img_shape'))
]
icdar2015_textdet_train = dict(
    type='IcdarDataset',
data_root='data/det/icdar2015',
ann_file='instances_training.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=pipeline)
train_dataloader = dict(
batch_size=16,
num_workers=8,
persistent_workers=True,
sampler=dict(type='DefaultSampler', shuffle=True),
dataset=icdar2015_textdet_train)
WildReceiptDataset¶
该类为 WildReceipt 数据集定制。
标注格式¶
// Close Set
{
"file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
"height": 1200,
"width": 1600,
"annotations":
[
{
"box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
"text": "SAFEWAY",
"label": 1
},
{
"box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
"text": "TM",
"label": 25
}
], //...
}
// Open Set
{
"file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
"height": 348,
"width": 348,
"annotations":
[
{
"box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
"text": "CHOEUN",
"label": 2,
"edge": 1
},
{
"box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
"text": "KOREANRESTAURANT",
"label": 2,
"edge": 1
}
]
}
设计理念与特性[待更新]¶
待更新
数据流[待更新]¶
待更新
模型[待更新]¶
待更新
可视化组件[待更新]¶
待更新
开发默认约定[待更新]¶
待更新
引擎[待更新]¶
待更新
支持数据集一览¶
支持的数据集¶
数据集名称 | 文本检测 | 文本识别 | 端到端文本检测识别 | 关键信息抽取 |
---|---|---|---|---|
COCO Text v2 | ✓ | ✓ | ✓ | |
CTW1500 | ✓ | ✓ | ✓ | |
CUTE80 | | ✓ | | |
FUNSD | ✓ | ✓ | ✓ | |
Incidental Scene Text IC13 | ✓ | ✓ | ✓ | |
Incidental Scene Text IC15 | ✓ | ✓ | ✓ | |
IIIT5K | | ✓ | | |
Synthetic Word Dataset (MJSynth/Syn90k) | | ✓ | | |
NAF | ✓ | ✓ | ✓ | |
Scanned Receipts OCR and Information Extraction (SROIE) | ✓ | ✓ | ✓ | |
Street View Text Dataset (SVT) | ✓ | ✓ | ✓ | |
Street View Text Perspective (SVT-P) | | ✓ | | |
SynthText in the Wild Dataset | ✓ | ✓ | ✓ | |
Text OCR | ✓ | ✓ | ✓ | |
Total Text | ✓ | ✓ | ✓ | |
WildReceipt | ✓ | ✓ | ✓ | ✓ |
数据集详情¶
COCO Text v2¶
“COCO-Text: Dataset and Benchmark for Text Detection and Recognition in Natural Images”, arXiv, 2016. PDF
A. 数据集基础信息
官方网址: cocotextv2
发布年份: 2016
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: CC BY 4.0
B. 标注格式
Text Detection/Spotting
{
"cats": {},
"anns": {
"45346": {
"mask":[468.9,286.7,468.9,295.2,493.0,295.8,493.0,287.2],
"class":"machine printed",
"bbox":[468.9,286.7,24.1,9.1],
"image_id":522579,
"id":167312,
"language":"english",
"area":55.5,
"utf8_string":"the",
"legibility":"legible"
},
// ...
},
"imgs": {
"522579": {
"file_name":"COCO_train2014_000000522579.jpg",
"height":476,
"width":640,
"id":522579,
"set":"train",
},
// ...
},
"imgToAnns": {
"522579": [167294, 167295, 167296, 167297, 167298, 167299, 167300, 167301, 167302, 167303, 167304, 167305, 167306, 167307, 167308, 167309, 167310, 167311, 167312, 167313, 167314, 167315, 167316, 167317],
// ...
},
"info": {}
}
C. 参考文献
@article{veit2016coco, title={Coco-text: Dataset and benchmark for text detection and recognition in natural images}, author={Veit, Andreas and Matera, Tomas and Neumann, Lukas and Matas, Jiri and Belongie, Serge}, journal={arXiv preprint arXiv:1601.07140}, year={2016}}
CTW1500¶
“Curved scene text detection via transverse and longitudinal sequence connection”, PR, 2019. PDF
A. 数据集基础信息
官方网址: ctw1500
发布年份: 2019
语言: [‘English’]
场景: [‘Scene’]
标注粒度: [‘Word’, ‘Line’]
支持任务: [‘textrecog’, ‘textdet’, ‘textspotting’]
数据集许可证: N/A
B. 标注格式
C. 参考文献
@article{liu2019curved, title={Curved scene text detection via transverse and longitudinal sequence connection}, author={Liu, Yuliang and Jin, Lianwen and Zhang, Shuaitao and Luo, Canjie and Zhang, Sheng}, journal={Pattern Recognition}, volume={90}, pages={337--345}, year={2019}, publisher={Elsevier} }
CUTE80¶
“A Robust Arbitrary Text Detection System for Natural Scene Images”, ESWA, 2014. PDF
A. 数据集基础信息
官方网址: cute80
发布年份: 2014
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textrecog’]
数据集许可证: N/A
B. 标注格式
Text Recognition
# timage/img_name text 1 text
timage/001.jpg RONALDO 1 RONALDO
timage/002.jpg 7 1 7
timage/003.jpg SEACREST 1 SEACREST
timage/004.jpg BEACH 1 BEACH
C. 参考文献
@article{risnumawan2014robust, title={A robust arbitrary text detection system for natural scene images}, author={Risnumawan, Anhar and Shivakumara, Palaiahankote and Chan, Chee Seng and Tan, Chew Lim}, journal={Expert Systems with Applications}, volume={41}, number={18}, pages={8027--8048}, year={2014}, publisher={Elsevier}}
FUNSD¶
“FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents”, ICDAR, 2019. PDF
A. 数据集基础信息
官方网址: funsd
发布年份: 2019
语言: [‘English’]
场景: [‘Document’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: FUNSD License
B. 标注格式
Text Detection/Recognition/Spotting
{
"form": [
{
"id": 0,
"text": "Registration No.",
"box": [
94,
169,
191,
186
],
"linking": [
[
0,
1
]
],
"label": "question",
"words": [
{
"text": "Registration",
"box": [
94,
169,
168,
186
]
},
{
"text": "No.",
"box": [
170,
169,
191,
183
]
}
]
},
{
"id": 1,
"text": "533",
"box": [
209,
169,
236,
182
],
"label": "answer",
"words": [
{
"box": [
209,
169,
236,
182
],
"text": "533"
}
],
"linking": [
[
0,
1
]
]
}
]
}
C. 参考文献
@inproceedings{jaume2019, title = {FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents}, author = {Guillaume Jaume, Hazim Kemal Ekenel, Jean-Philippe Thiran}, booktitle = {Accepted to ICDAR-OST}, year = {2019}}
Incidental Scene Text IC13¶
“ICDAR 2013 Robust Reading Competition”, ICDAR, 2013. PDF
A. 数据集基础信息
官方网址: icdar2013
发布年份: 2013
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: N/A
B. 标注格式
Text Detection
# train split
# x1 y1 x2 y2 "transcript"
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
# test split
# x1, y1, x2, y2, "transcript"
38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
Text Recognition
# img_name, "text"
word_1.png, "PROPER"
word_2.png, "FOOD"
word_3.png, "PRONTO"
C. 参考文献
@inproceedings{karatzas2013icdar, title={ICDAR 2013 robust reading competition}, author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere}, booktitle={2013 12th international conference on document analysis and recognition}, pages={1484--1493}, year={2013}, organization={IEEE}}
Incidental Scene Text IC15¶
“ICDAR 2015 Competition on Robust Reading”, ICDAR, 2015. PDF
A. 数据集基础信息
官方网址: icdar2015
发布年份: 2015
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: CC BY 4.0
B. 标注格式
Text Detection
# x1,y1,x2,y2,x3,y3,x4,y4,trans
377,117,463,117,465,130,378,130,Genaxis Theatre
493,115,519,115,519,131,493,131,[06]
374,155,409,155,409,170,374,170,###
Text Recognition
# img_name, "text"
word_1.png, "Genaxis Theatre"
word_2.png, "[06]"
word_3.png, "62-03"
C. 参考文献
@inproceedings{karatzas2015icdar, title={ICDAR 2015 competition on robust reading}, author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others}, booktitle={2015 13th international conference on document analysis and recognition (ICDAR)}, pages={1156--1160}, year={2015}, organization={IEEE}}
IIIT5K¶
“Scene Text Recognition using Higher Order Language Priors”, BMVC, 2012. PDF
A. 数据集基础信息
官方网址: iiit5k
发布年份: 2012
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textrecog’]
数据集许可证: N/A
B. 标注格式
Text Recognition
# img_name, "text"
train/1009_2.png You
train/1017_1.png Rescue
train/1017_2.png mission
C. 参考文献
@InProceedings{MishraBMVC12, author = "Mishra, A. and Alahari, K. and Jawahar, C.~V.", title = "Scene Text Recognition using Higher Order Language Priors", booktitle = "BMVC", year = "2012"}
Synthetic Word Dataset (MJSynth/Syn90k)¶
“Reading Text in the Wild with Convolutional Neural Networks”, International Journal of Computer Vision, 2016. PDF
A. 数据集基础信息
官方网址: mjsynth
发布年份: 2016
语言: [‘English’]
场景: [‘Synthesis’]
标注粒度: [‘Word’]
支持任务: [‘textrecog’]
数据集许可证: N/A
B. 标注格式
Text Recognition
./3000/7/182_slinking_71711.jpg 71711
./3000/7/182_REMODELERS_64541.jpg 64541
C. 参考文献
@InProceedings{Jaderberg14c, author = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title = "Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition", booktitle = "Workshop on Deep Learning, NIPS", year = "2014", }
@Article{Jaderberg16, author = "Max Jaderberg and Karen Simonyan and Andrea Vedaldi and Andrew Zisserman", title = "Reading Text in the Wild with Convolutional Neural Networks", journal = "International Journal of Computer Vision", number = "1", volume = "116", pages = "1--20", month = "jan", year = "2016", }
NAF¶
“Deep Visual Template-Free Form Parsing”, ICDAR, 2019. PDF
A. 数据集基础信息
官方网址: naf
发布年份: 2019
语言: [‘English’]
场景: [‘Document’, ‘Handwritten’]
标注粒度: [‘Word’, ‘Line’]
支持任务: [‘textrecog’, ‘textdet’, ‘textspotting’]
数据集许可证: CDLA
B. 标注格式
Text Detection/Recognition/Spotting
{"fieldBBs": [{"poly_points": [[435, 1406], [466, 1406], [466, 1439], [435, 1439]], "type": "fieldCheckBox", "id": "f0", "isBlank": 1}, {"poly_points": [[435, 1444], [469, 1444], [469, 1478], [435, 1478]], "type": "fieldCheckBox", "id": "f1", "isBlank": 1}],
"textBBs": [{"poly_points": [[1183, 1337], [2028, 1345], [2032, 1395], [1186, 1398]], "type": "text", "id": "t0"}, {"poly_points": [[492, 1336], [809, 1338], [809, 1379], [492, 1378]], "type": "text", "id": "t1"}, {"poly_points": [[512, 1375], [798, 1376], [798, 1405], [512, 1404]], "type": "textInst", "id": "t2"}], "imageFilename": "007182398_00026.jpg", "transcriptions": {"f0": "\u00bf\u00bf\u00bf \u00bf\u00bf\u00bf 18/1/49 \u00bf\u00bf\u00bf\u00bf\u00bf", "f1": "U.S. Navy 53rd. Naval Const. Batt.", "t0": "APPLICATION FOR HEADSTONE OR MARKER", "t1": "ORIGINAL"}}
C. 参考文献
@inproceedings{davis2019deep, title={Deep visual template-free form parsing}, author={Davis, Brian and Morse, Bryan and Cohen, Scott and Price, Brian and Tensmeyer, Chris}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, pages={134--141}, year={2019}, organization={IEEE}}
Scanned Receipts OCR and Information Extraction¶
“ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction”, ICDAR, 2019. PDF
A. 数据集基础信息
官方网址: sroie
发布年份: 2019
语言: [‘English’]
场景: [‘Document’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: CC BY 4.0
B. 标注格式
Text Detection, Text Recognition and Text Spotting
# x1,y1,x2,y2,x3,y3,x4,y4,trans
72,25,326,25,326,64,72,64,TAN WOON YANN
50,82,440,82,440,121,50,121,BOOK TA .K(TAMAN DAYA) SDN BND
205,121,285,121,285,139,205,139,789417-W
C. 参考文献
@INPROCEEDINGS{8977955, author={Huang, Zheng and Chen, Kai and He, Jianhua and Bai, Xiang and Karatzas, Dimosthenis and Lu, Shijian and Jawahar, C. V.}, booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)}, title={ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction}, year={2019}, volume={}, number={}, pages={1516-1520}, doi={10.1109/ICDAR.2019.00244}}
Street View Text Dataset (SVT)¶
“Word Spotting in the Wild”, ECCV, 2010. PDF
A. 数据集基础信息
官方网址: svt
发布年份: 2010
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: N/A
B. 标注格式
Text Detection/Recognition/Spotting
<image>
<imageName>img/14_03.jpg</imageName>
<address>341 Southwest 10th Avenue Portland OR</address>
<lex>
LIVING,ROOM,THEATERS,KENNY,ZUKE,DELICATESSEN,CLYDE,COMMON,ACE,HOTEL,PORTLAND,ROSE,CITY,BOOKS,STUMPTOWN,COFFEE,ROASTERS,RED,CAP,GARAGE,FISH,GROTTO,SEAFOOD,RESTAURANT,AURA,RESTAURANT,LOUNGE,ROCCO,PIZZA,PASTA,BUFFALO,EXCHANGE,MARK,SPENCER,LIGHT,FEZ,BALLROOM,READING,FRENZY,ROXY,SCANDALS,MARTINOTTI,CAFE,DELI,CROWSENBERG,HALF
</lex>
<Resolution x="1280" y="880"/>
<taggedRectangles>
<taggedRectangle height="75" width="236" x="375" y="253">
<tag>LIVING</tag>
</taggedRectangle>
<taggedRectangle height="76" width="175" x="639" y="272">
<tag>ROOM</tag>
</taggedRectangle>
<taggedRectangle height="87" width="281" x="839" y="283">
<tag>THEATERS</tag>
</taggedRectangle>
</taggedRectangles>
</image>
C. 参考文献
@inproceedings{wang2010word, title={Word spotting in the wild}, author={Wang, Kai and Belongie, Serge}, booktitle={European conference on computer vision}, pages={591--604}, year={2010}, organization={Springer}}
Street View Text Perspective (SVT-P)¶
“Recognizing Text with Perspective Distortion in Natural Scenes”, ICCV, 2013. PDF
A. 数据集基础信息
官方网址: svtp
发布年份: 2013
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textrecog’]
数据集许可证: N/A
B. 标注格式
Text Recognition
13_15_0_par.jpg WYNDHAM
13_15_1_par.jpg HOTEL
12_16_0_par.jpg UNITED
C. 参考文献
@inproceedings{phan2013recognizing, title={Recognizing text with perspective distortion in natural scenes}, author={Phan, Trung Quy and Shivakumara, Palaiahnakote and Tian, Shangxuan and Tan, Chew Lim}, booktitle={Proceedings of the IEEE International Conference on Computer Vision}, pages={569--576}, year={2013}}
SynthText in the Wild Dataset¶
“Synthetic Data for Text Localisation in Natural Images”, CVPR, 2016. PDF
A. 数据集基础信息
官方网址: synthtext
发布年份: 2016
语言: [‘English’]
场景: [‘Synthesis’]
标注粒度: [‘Word’, ‘Character’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: Synthext Custom
B. 标注格式
Text Detection/Recognition/Spotting
{
"imnames": [['8/ballet_106_0.jpg', ...]],
"wordBB": [[[420.58957 418.85016 448.08478 410.3094 117.745026
322.30963 322.6857 159.09138 154.27284 260.14597
431.9315 427.52274 296.86508 99.56819 108.96211 ]
[512.3321 431.88342 519.4515 499.81183 179.0544
377.97382 376.4993 203.64464 193.77492 313.61514
487.58023 484.64633 365.83176 142.49403 144.90457 ]
[511.92203 428.7077 518.7375 499.0373 172.1684
378.35858 377.2078 203.3191 193.0739 319.69186
485.6758 482.571 365.76303 142.31898 144.43858 ]
[420.1795 415.67444 447.3708 409.53485 110.859024
322.6944 323.3942 158.76585 153.57182 266.2227
430.02707 425.44742 296.79636 99.39314 108.49613 ]]
[[ 21.06382 46.19922 47.570374 73.95366 197.17792
9.993624 48.437763 9.064571 49.659035 208.57095
118.41646 162.82489 29.548729 5.800581 28.812992 ]
[ 23.069519 48.254295 50.130234 77.18146 208.71487
8.999153 46.69632 9.698633 50.869553 203.25742
122.64043 168.38647 29.660484 6.2558594 29.602367 ]
[ 41.827087 68.39458 70.03627 98.65903 245.30832
30.534437 68.589294 32.57161 73.74529 264.40634
147.7303 189.70224 72.08 22.759935 50.81941 ]
[ 39.82139 66.3395 67.47641 95.43123 233.77136
31.528908 70.33074 31.937548 72.534775 269.71988
143.50633 184.14066 71.96825 22.304657 50.030033 ]], ...],
"charBB": [[[423.16126397 439.60847343 450.66887979 466.31976402 479.76190495
504.59927448 418.80489444 450.13965942 464.16775197 480.46891089
502.46437709 413.02373632 433.01396211 446.7222192 470.28467827
482.51674486 116.52285438 139.51408587 150.7448586 162.03366629
322.84717946 333.54881536 343.28386485 363.07416389 323.48968759
337.98503283 356.66355903 160.48517048 174.1707753 189.64454066
155.7637383 167.45490471 179.63644201 262.2183876 271.75848874
284.05396524 298.26103738 432.8464733 449.15387392 468.07231897
428.11482147 445.61538159 469.24565878 296.86441324 323.6603118
344.09880401 101.14677814 110.45423597 120.54555495 131.18342618
132.20545124 110.01673682 120.83144568 131.35885673]
[438.2997574 452.61288403 466.31976402 482.22585715 498.3934528
512.20555863 431.88338084 466.11639619 481.73414937 499.62012025
519.36789779 432.51717267 449.23571387 465.73425964 484.45139112
499.59056304 140.27413679 149.59811175 160.13352083 169.59504507
333.55849014 344.33923741 361.08275796 378.09844418 339.92898685
355.57692063 376.51230484 174.1707753 189.07871028 203.64462646
165.22739457 181.27572412 193.60260894 270.99557614 283.13281739
298.75499435 313.61511672 447.1421735 470.27065563 487.02126631
446.97485257 468.98979567 484.64633864 317.88691577 341.16094163
365.8300006 111.15280603 120.54555495 130.72086821 135.27663717
142.4726875 120.1331955 133.07976304 144.75919258]
[435.54895424 449.95797159 464.5848793 480.68235876 497.04793842
511.1101386 428.95660757 463.61882066 480.14247127 498.2535215
518.03243928 429.36600266 447.19056345 463.89483785 482.21016814
498.18529977 142.63162835 152.55587851 162.80539142 172.21885945
333.35620309 344.09880401 360.86201193 377.82379299 339.7646859
355.37508239 376.1110999 172.46032372 187.37816388 201.39094518
163.04321987 178.99078221 191.89681939 275.3073355 286.08373072
301.85539131 318.57227103 444.54207279 467.53925436 485.27070558
444.57367155 466.90671029 482.56302723 317.62908407 340.9131681
365.44465854 109.40501176 119.4999228 129.67892444 134.35253232
140.97421069 118.61779828 131.34019115 143.25688164]
[420.17946701 436.74150236 448.74896556 464.5848793 478.18853922
503.4152019 415.67442461 447.3707845 462.35927516 478.8614766
500.86810735 409.54560397 430.77026495 444.64606264 467.79077782
480.89051912 119.14629674 142.63162835 153.56593297 164.78799774
322.69436747 333.35620309 343.11884239 362.84714115 323.37931952
337.83763574 356.35573621 158.76583616 172.46032372 187.37816388
153.57183805 165.15781218 177.92125239 266.22269514 274.45156305
286.82608962 302.69695881 430.02705241 446.01814255 466.05208347
425.44741792 443.19481667 466.90671029 296.79634428 323.49707084
343.82488703 99.39315359 109.40501176 119.4999228 130.25798537
130.70149005 108.49612777 119.08444238 129.84935461]]
[[ 22.26958901 21.60559248 27.0241972 27.25747678 27.45783459
28.73896576 47.91255579 47.80732383 53.77711568 54.24219042
52.00169325 74.79043429 80.45929285 81.04748707 76.11658669
82.58335942 203.67278213 201.2743445 205.59358622 205.51198143
10.06536976 10.82312635 16.77203865 16.31842372 54.80444433
54.66492 47.33822371 15.08534083 15.18716407 9.62607092
51.06813224 50.18928243 56.16019366 220.78902143 236.08062638
231.69267533 209.73652786 124.25352842 119.99631725 128.73732717
165.78411123 167.31764153 167.05531699 29.97351822 31.5116502
31.14650552 5.88513488 12.51324147 12.57920537 8.21515307
8.21998849 35.66412031 29.17945741 36.00660903]
[ 22.46075572 21.76391911 27.25747678 27.49456029 27.73554156
28.85582217 48.25428361 48.21714995 54.27828788 54.78857757
52.4595556 75.57743634 81.15533616 81.86325615 76.681392
83.31596322 210.04771309 203.83983042 208.00417391 207.41791524
9.79265706 10.55231862 16.36406888 15.97405105 54.64620856
54.49559004 47.09756263 15.18716407 15.29808166 9.69862498
51.27597632 50.48652154 56.49239954 216.92183074 232.02141018
226.44624213 203.25738931 125.19349641 121.32658508 130.00428964
167.43676857 169.36588297 168.38645076 29.58279603 31.19899202
30.75826599 5.92344996 12.57920537 12.64571832 8.23451892
8.26856497 35.82646468 29.342662 36.22165159]
[ 40.15739982 40.47241401 40.79219178 41.14411963 41.50190876
41.80934074 66.81590976 68.05921213 68.6519006 69.30152766
70.01097963 96.14641662 96.04484417 96.89110144 97.81897661
98.62829468 237.26055111 240.35280825 243.54641271 245.04022528
31.33842788 31.14650552 30.84702178 30.54399042 69.80098672
68.7212013 68.62479627 32.13243303 32.34474067 32.54416771
72.82501686 73.31372392 73.70922459 267.74318222 265.39839711
259.52741156 253.14023308 144.60810334 145.23371653 147.69958337
186.00278322 188.17713786 189.70144388 71.89351759 53.62266986
54.40060855 22.41084398 22.51791234 22.62587258 17.11356079
22.74567232 50.25232032 46.05692507 50.79345235]
[ 39.82138755 40.18347166 40.44598236 40.79219178 41.08959901
41.64111176 66.33948982 67.47640971 68.01403337 68.60595247
69.3953105 95.13188979 95.21297344 95.91593691 97.08847413
97.75212171 229.94285119 237.26055111 240.66752705 242.74145162
31.52890731 31.33842788 31.16401306 30.81155638 69.87135926
68.80273568 68.71664209 31.93753588 32.13243303 32.34474067
72.53476992 72.88981775 73.28094858 269.71986636 267.92938572
262.93698624 256.88902439 143.50635029 143.61251781 146.24080653
184.14064261 185.86853729 188.17713786 71.96823746 53.79651809
54.60870874 22.30465649 22.41084398 22.51791234 17.07939535
22.63671808 50.03002471 45.81009198 50.49899163]], ...],
"txt": [['Lines:\nI lost\nKevin ' 'will ' 'line\nand '
'and\nthe ' '(and ' 'the\nout '
'you ' "don't\n pkg "], ...]
}
C. 参考文献
@InProceedings{Gupta16, author = "Ankush Gupta and Andrea Vedaldi and Andrew Zisserman", title = "Synthetic Data for Text Localisation in Natural Images", booktitle = "IEEE Conference on Computer Vision and Pattern Recognition", year = "2016", }
Text OCR¶
“TextOCR: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text”, CVPR, 2021. PDF
A. 数据集基础信息
官方网址: textocr
发布年份: 2021
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: CC BY 4.0
B. 标注格式
Text Detection/Recognition/Spotting
{
"imgs": {
"OpenImages_ImageID_1": {
"id": "OpenImages_ImageID_1",
"width": "INT, Width of the image",
"height": "INT, Height of the image",
"set": "Split train|val|test",
"filename": "train|test/OpenImages_ImageID_1.jpg"
},
"OpenImages_ImageID_2": {
"...": "..."
}
},
"anns": {
"OpenImages_ImageID_1_1": {
"id": "STR, OpenImages_ImageID_1_1, Specifies the nth annotation for an image",
"image_id": "OpenImages_ImageID_1",
"bbox": [
"FLOAT x1",
"FLOAT y1",
"FLOAT x2",
"FLOAT y2"
],
"points": [
"FLOAT x1",
"FLOAT y1",
"FLOAT x2",
"FLOAT y2",
"...",
"FLOAT xN",
"FLOAT yN"
],
"utf8_string": "text for this annotation",
"area": "FLOAT, area of this box"
},
"OpenImages_ImageID_1_2": {
"...": "..."
},
"OpenImages_ImageID_2_1": {
"...": "..."
}
},
"img2Anns": {
"OpenImages_ImageID_1": [
"OpenImages_ImageID_1_1",
"OpenImages_ImageID_1_2",
"OpenImages_ImageID_1_2"
],
"OpenImages_ImageID_N": [
"..."
]
}
}
C. 参考文献
@inproceedings{singh2021textocr, title={{TextOCR}: Towards large-scale end-to-end reasoning for arbitrary-shaped scene text}, author={Singh, Amanpreet and Pang, Guan and Toh, Mandy and Huang, Jing and Galuba, Wojciech and Hassner, Tal}, journal={The Conference on Computer Vision and Pattern Recognition}, year={2021}}
Total Text¶
“Total-Text: Towards Orientation Robustness in Scene Text Detection”, IJDAR, 2020. PDF
A. 数据集基础信息
官方网址: totaltext
发布年份: 2020
语言: [‘English’]
场景: [‘Natural Scene’]
标注粒度: [‘Word’]
支持任务: [‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: BSD-3
B. 标注格式
Text Detection/Spotting
x: [[259 313 389 427 354 302]], y: [[542 462 417 459 507 582]], ornt: [u'c'], transcriptions: [u'PAUL']
x: [[400 478 494 436]], y: [[398 380 448 465]], ornt: [u'#'], transcriptions: [u'#']
C. 参考文献
@article{CK2019, author = {Chee Kheng Chng and Chee Seng Chan and Chenglin Liu}, title = {Total-Text: Towards Orientation Robustness in Scene Text Detection}, journal = {International Journal on Document Analysis and Recognition (IJDAR)}, volume = {23}, pages = {31-52}, year = {2020}, doi = {10.1007/s10032-019-00334-z}}
WildReceipt¶
“Spatial Dual-Modality Graph Reasoning for Key Information Extraction”, arXiv, 2021. PDF
A. 数据集基础信息
官方网址: wildreceipt
发布年份: 2021
语言: [‘English’]
场景: [‘Receipt’]
标注粒度: [‘Word’]
支持任务: [‘kie’, ‘textdet’, ‘textrecog’, ‘textspotting’]
数据集许可证: N/A
B. 标注格式
KIE
// Close Set
{
"file_name": "image_files/Image_16/11/d5de7f2a20751e50b84c747c17a24cd98bed3554.jpeg",
"height": 1200,
"width": 1600,
"annotations":
[
{
"box": [550.0, 190.0, 937.0, 190.0, 937.0, 104.0, 550.0, 104.0],
"text": "SAFEWAY",
"label": 1
},
{
"box": [1048.0, 211.0, 1074.0, 211.0, 1074.0, 196.0, 1048.0, 196.0],
"text": "TM",
"label": 25
}
], //...
}
// Open Set
{
"file_name": "image_files/Image_12/10/845be0dd6f5b04866a2042abd28d558032ef2576.jpeg",
"height": 348,
"width": 348,
"annotations":
[
{
"box": [114.0, 19.0, 230.0, 19.0, 230.0, 1.0, 114.0, 1.0],
"text": "CHOEUN",
"label": 2,
"edge": 1
},
{
"box": [97.0, 35.0, 236.0, 35.0, 236.0, 19.0, 97.0, 19.0],
"text": "KOREANRESTAURANT",
"label": 2,
"edge": 1
}
]
}
C. 参考文献
@article{sun2021spatial, title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction}, author={Sun, Hongbin and Kuang, Zhanghui and Yue, Xiaoyu and Lin, Chenhao and Zhang, Wayne}, journal={arXiv preprint arXiv:2103.14470}, year={2021} }
数据准备 (Beta)¶
注解
Dataset Preparer 目前仍处在公测阶段,欢迎尝鲜试用!如遇到任何问题,请及时向我们反馈。
一键式数据准备脚本¶
MMOCR 提供了统一的一站式数据集准备脚本 prepare_dataset.py
。
仅需一行命令即可完成数据的下载、解压、格式转换,及基础配置的生成。
python tools/dataset_converters/prepare_dataset.py [-h] [--nproc NPROC] [--task {textdet,textrecog,textspotting,kie}] [--splits SPLITS [SPLITS ...]] [--lmdb] [--overwrite-cfg] [--dataset-zoo-path DATASET_ZOO_PATH] datasets [datasets ...]
参数 | 类型 | 说明 |
---|---|---|
datasets | str | (必须)需要准备的数据集名称,可接受一个或多个数据集。 |
--nproc | int | 使用的进程数,默认为 4。 |
--task | str | 将数据集格式转换为指定任务的 MMOCR 格式。可选项为: 'textdet', 'textrecog', 'textspotting' 和 'kie'。 |
--splits | ['train', 'val', 'test'] | 希望准备的数据集分割,可以接受多个参数。默认为 train val test 。 |
--lmdb | bool | 把数据储存为 LMDB 格式,仅当任务为 textrecog 时生效。 |
--overwrite-cfg | bool | 若数据集的基础配置已经在 configs/{task}/_base_/datasets 中存在,依然重写该配置。 |
--dataset-zoo-path | str | 存放数据集配置文件的路径。若不指定,则默认为 ./dataset_zoo 。 |
例如,以下命令展示了如何使用该脚本为 ICDAR2015 数据集准备文本检测任务所需的数据。
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textdet --overwrite-cfg
该脚本也支持同时准备多个数据集,例如,以下命令展示了如何使用该脚本同时为 ICDAR2015 和 TotalText 数据集准备文本识别任务所需的数据。
python tools/dataset_converters/prepare_dataset.py icdar2015 totaltext --task textrecog --overwrite-cfg
进一步了解 Dataset Preparer 支持的数据集,您可以浏览支持的数据集文档。一些需要手动准备的数据集也列在了 文字检测 和 文字识别 内。
对于中国境内的用户,我们也推荐通过开源数据平台 OpenDataLab 来下载数据,以获得更好的下载体验。数据下载后,参考脚本配置中 obtainer 的 save_name 字段,将文件放在 data/cache/ 下并重新运行脚本即可。
进阶用法¶
LMDB 格式¶
在文本识别任务中,通常使用 LMDB 格式来存储数据,以加快数据的读取速度。在使用 prepare_dataset.py
脚本准备数据时,可以通过 --lmdb
参数来指定将数据转换为 LMDB 格式。例如:
python tools/dataset_converters/prepare_dataset.py icdar2015 --task textrecog --lmdb
数据集准备完成后,Dataset Preparer 会在 configs/textrecog/_base_/datasets/
中生成 icdar2015_lmdb.py
配置。你可以继承该配置,并将 dataloader
指向 LMDB 数据集。然而,LMDB 数据集的读取需要配合 LoadImageFromNDArray
,因此你也同样需要修改 pipeline
。
例如,想要将 configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py
的训练集改为刚刚生成的 icdar2015,则需要作如下修改:
修改
configs/textrecog/crnn/crnn_mini-vgg_5e_mj.py
:
_base_ = [
    '../_base_/datasets/icdar2015_lmdb.py',  # 指向 icdar2015 lmdb 数据集
    ...  # 省略
]

train_list = [_base_.icdar2015_lmdb_textrecog_train]
...
修改
configs/textrecog/crnn/_base_crnn_mini-vgg.py
中的train_pipeline
, 将LoadImageFromFile
改为LoadImageFromNDArray
:
train_pipeline = [
    dict(
        type='LoadImageFromNDArray',
        color_type='grayscale',
        file_client_args=file_client_args,
        ignore_empty=True,
        min_size=2),
    ...
]
设计¶
OCR 数据集数量众多,不同的数据集有着不同的语言、不同的标注格式、不同的场景等。数据集的使用情况一般有两种:一种是快速地了解数据集的相关信息,另一种是使用数据集来训练模型。为了满足这两种使用场景,MMOCR 提供了数据集自动化准备脚本。该脚本采用模块化的设计,极大地增强了扩展性,用户能够很方便地配置其他公开数据集或私有数据集。数据集自动化准备脚本的配置文件被统一存储在 dataset_zoo/
目录下,用户可以在该目录下找到所有已由 MMOCR 官方支持的数据集准备脚本配置文件。该文件夹的目录结构如下:
dataset_zoo/
├── icdar2015
│ ├── metafile.yml
│ ├── sample_anno.md
│ ├── textdet.py
│ ├── textrecog.py
│ └── textspotting.py
└── wildreceipt
├── metafile.yml
├── sample_anno.md
├── kie.py
├── textdet.py
├── textrecog.py
└── textspotting.py
数据集相关信息¶
数据集的相关信息包括标注格式、标注示例以及基本统计信息等。虽然这些信息在各数据集的官网上都能找到,但它们分散在不同的页面中,用户需要花费大量时间才能收集齐全。因此,MMOCR 设计了一些范式来帮助用户快速了解数据集的基本信息。MMOCR 将数据集的相关信息分为两部分:一部分是数据集的基本信息,包括发布年份、论文作者以及版权等;另一部分是数据集的标注信息,包括标注格式与标注示例。每一部分 MMOCR 都提供了一个范式,贡献者可以根据范式来填写,用户也就能快速了解数据集的基本信息。基于数据集的基本信息,MMOCR 提供了一个 metafile.yml
文件,其中存放了对应数据集的发布年份、论文作者以及版权等信息。该文件在数据集准备过程中并不是强制要求的(因此用户在添加自己的私有数据集时可以忽略该文件),但为了更好地了解各个公开数据集,MMOCR 建议用户在使用数据集准备脚本前阅读对应的元文件,以确认该数据集的特征是否符合需求。MMOCR 以 ICDAR2015 作为示例, 其示例内容如下所示:
Name: 'Incidental Scene Text IC15'
Paper:
  Title: ICDAR 2015 Competition on Robust Reading
  URL: https://rrc.cvc.uab.es/files/short_rrc_2015.pdf
  Venue: ICDAR
  Year: '2015'
  BibTeX: '@inproceedings{karatzas2015icdar,
    title={ICDAR 2015 competition on robust reading},
    author={Karatzas, Dimosthenis and Gomez-Bigorda, Lluis and Nicolaou, Anguelos and Ghosh, Suman and Bagdanov, Andrew and Iwamura, Masakazu and Matas, Jiri and Neumann, Lukas and Chandrasekhar, Vijay Ramaseshan and Lu, Shijian and others},
    booktitle={2015 13th international conference on document analysis and recognition (ICDAR)},
    pages={1156--1160},
    year={2015},
    organization={IEEE}}'
Data:
  Website: https://rrc.cvc.uab.es/?ch=4
  Language:
    - English
  Scene:
    - Natural Scene
  Granularity:
    - Word
  Tasks:
    - textdet
    - textrecog
    - textspotting
  License:
    Type: CC BY 4.0
    Link: https://creativecommons.org/licenses/by/4.0/
具体地,MMOCR 在下表中列出每个字段对应的含义:
字段名 | 含义 |
---|---|
Name | 数据集的名称 |
Paper.Title | 数据集论文的标题 |
Paper.URL | 数据集论文的链接 |
Paper.Venue | 数据集论文发表的会议/期刊名称 |
Paper.Year | 数据集论文发表的年份 |
Paper.BibTeX | 数据集论文的 BibTeX 引用 |
Data.Website | 数据集的官方网站 |
Data.Language | 数据集支持的语言 |
Data.Scene | 数据集支持的场景,如 Natural Scene , Document , Handwritten 等 |
Data.Granularity | 数据集支持的粒度,如 Character , Word , Line 等 |
Data.Tasks | 数据集支持的任务,如 textdet , textrecog , textspotting , kie 等 |
Data.License | 数据集的许可证信息,如果不存在许可证,则使用 N/A 填充 |
Data.Format | 数据集标注文件的格式,如 .txt , .xml , .json 等 |
Data.Keywords | 数据集的特性关键词,如 Horizontal , Vertical , Curved 等 |
对于数据集的标注信息,MMOCR 提供了一个 sample_anno.md
文件,贡献者可以根据范式填写数据集的标注信息,方便用户快速了解数据集的标注格式。MMOCR 以 ICDAR2015 作为示例, 其示例内容如下所示:
**Text Detection**
```text
# x1,y1,x2,y2,x3,y3,x4,y4,trans
377,117,463,117,465,130,378,130,Genaxis Theatre
493,115,519,115,519,131,493,131,[06]
374,155,409,155,409,170,374,170,###
```
sample_anno.md
中包含了数据集针对不同任务的标注信息,包括标注文件的格式(如 text 代码块对应 txt 格式的标注文件;该格式也可以在 metafile.yml 的 Data.Format 字段中找到)以及标注示例。
通过上述两个文件,用户就可以快速了解数据集的基本情况。同时,MMOCR 也汇总了所有数据集的基本信息,用户可以在 Overview 中统一查看。
数据集使用¶
经过数十年的发展,OCR 领域涌现出了一系列相关数据集,这些数据集往往采用风格各异的格式来提供文本的标注文件,使得用户在使用这些数据集时不得不进行格式转换。因此,为了方便用户进行数据集准备,我们设计了 Dataset Preparer,帮助用户快速将数据集准备为 MMOCR 支持的格式, 详见数据格式文档。下图展示了 Dataset Preparer 的典型运行流程。
由图可见,Dataset Preparer 在运行时,会依次执行以下操作:
对训练集、验证集和测试集,由各 preparer 依次进行:
数据集的下载、解压、移动(Obtainer)
匹配标注与图像(Gatherer)
解析原始标注(Parser)
打包标注为统一格式(Packer)
保存标注(Dumper)
删除临时文件(Delete)
生成数据集的配置文件(Config Generator)
为了便于应对各种数据集的情况,MMOCR 将每个部分均设计为可插拔的模块,并允许用户通过 dataset_zoo/ 下的配置文件对数据集准备流程进行配置。这些配置文件采用了 Python 格式,其使用方法与 MMOCR 算法库的其他配置文件完全一致,详见配置文件文档。
在 dataset_zoo/
下,每个数据集均占有一个文件夹,文件夹下会以任务名命名配置文件,以区分不同任务下的配置。以 ICDAR2015 文字检测部分为例,示例配置 dataset_zoo/icdar2015/textdet.py
如下所示:
data_root = 'data/icdar2015'
cache_path = 'data/cache'
train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
save_name='ic15_textdet_train_img.zip',
md5='c51cbace155dcc4d98c8dd19d378f30d',
content=['image'],
mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'ch4_training_localization_transcription_gt.zip',
save_name='ic15_textdet_train_gt.zip',
md5='3bfaf1988960909014f7987d2343060b',
content=['annotation'],
mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
test_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_test_images.zip',
save_name='ic15_textdet_test_img.zip',
md5='97e4c1ddcf074ffcc75feff2b63c35dd',
content=['image'],
mapping=[['ic15_textdet_test_img', 'textdet_imgs/test']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge4_Test_Task4_GT.zip',
save_name='ic15_textdet_test_gt.zip',
md5='8bce173b06d164b98c357b0eb96ef430',
content=['annotation'],
mapping=[['ic15_textdet_test_gt', 'annotations/test']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
parser=dict(type='ICDARTxtTextDetAnnParser', encoding='utf-8-sig'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
delete = ['annotations', 'ic15_textdet_test_img', 'ic15_textdet_train_img']
config_generator = dict(type='TextDetConfigGenerator')
数据集下载、解压、移动 (Obtainer)¶
Dataset Preparer 中,obtainer
模块负责数据集的下载、解压和移动。目前,MMOCR 仅提供了 NaiveDataObtainer
。通常来说,内置的 NaiveDataObtainer
即可完成绝大部分可以通过直链访问的数据集的下载,并支持解压、移动文件和重命名等操作。然而,MMOCR 暂时不支持自动下载存储在百度网盘或谷歌网盘等需要登录才能访问的平台上的数据集。这里简要介绍一下 NaiveDataObtainer
的各个字段:
字段名 | 含义 |
---|---|
cache_path | 数据集缓存路径,用于存储数据集准备过程中下载的压缩包等文件 |
data_root | 数据集存储的根目录 |
files | 数据集文件列表,用于描述数据集的下载信息 |
files
字段是一个列表,列表中的每个元素都是一个字典,用于描述一个数据集文件的下载信息。如下表所示:
字段名 | 含义 |
---|---|
url | 数据集文件的下载链接 |
save_name | 数据集文件的保存名称 |
md5 (可选) | 数据集文件的 md5 值,用于校验下载的文件是否完整 |
split (可选) | 数据集文件所属的数据集划分,如 train ,test 等,该字段可以空缺 |
content (可选) | 数据集文件的内容,如 image ,annotation 等,该字段可以空缺 |
mapping (可选) | 数据集文件的解压映射,用于指定解压后的文件存储的位置,该字段可以空缺 |
同时,Dataset Preparer 存在以下约定:
不同类型的数据集的图片统一移动到对应类别
{taskname}_imgs/{split}/
文件夹下,如textdet_imgs/train/
对于一个标注文件包含所有图像的标注信息的情况,标注移动到
annotations/{split}.*
文件中。 如annotations/train.json
。对于一个标注文件包含一个图像的标注信息的情况,所有的标注文件移动到
annotations/{split}/
文件夹下。 如annotations/train/
。对于一些其他的特殊情况,比如所有训练、测试、验证的图像都在一个文件夹下,可以将图像移动到自己设定的文件夹下,比如
{taskname}_imgs/imgs/
,同时要在后续的gatherer
模块中指定图像的存储位置。
示例配置如下:
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/ch4_training_images.zip',
save_name='ic15_textdet_train_img.zip',
md5='c51cbace155dcc4d98c8dd19d378f30d',
content=['image'],
mapping=[['ic15_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'ch4_training_localization_transcription_gt.zip',
save_name='ic15_textdet_train_gt.zip',
md5='3bfaf1988960909014f7987d2343060b',
content=['annotation'],
mapping=[['ic15_textdet_train_gt', 'annotations/train']]),
]),
数据集收集 (Gatherer)¶
gatherer
遍历数据集目录下的文件,将图像与标注文件一一对应,并整理出一份文件列表供 parser
读取。因此,首先需要知道当前数据集下,图片文件与标注文件匹配的规则。OCR 数据集有两种常用标注保存形式,一种为多个标注文件对应多张图片,一种则为单个标注文件对应多张图片,如:
多对多
├── {taskname}_imgs/{split}/img_1.jpg
├── annotations/{split}/gt_img_1.txt
├── {taskname}_imgs/{split}/img_2.jpg
├── annotations/{split}/gt_img_2.txt
├── {taskname}_imgs/{split}/img_3.JPG
├── annotations/{split}/gt_img_3.txt
单对多
├── {taskname}/{split}/img_1.jpg
├── {taskname}/{split}/img_2.jpg
├── {taskname}/{split}/img_3.JPG
├── annotations/gt.txt
具体设计如下所示
MMOCR 内置了 PairGatherer
与 MonoGatherer
来处理以上这两种常用情况。其中 PairGatherer
用于多对多的情况,MonoGatherer
用于单对多的情况。
注解
为了简化处理,gatherer 约定数据集的图片和标注需要分别储存在 {taskname}_imgs/{split}/
和 annotations/
下。特别地,对于多对多的情况,标注文件需要放置于 annotations/{split}
。
在多对多的情况下,
PairGatherer
需要按照一定的命名规则找到图片文件和对应的标注文件。首先,需要通过img_suffixes
参数指定图片的后缀名,如上述例子中的 img_suffixes=['.jpg', '.JPG']
。此外,还需要通过正则表达式rule
, 来指定图片与标注文件的对应关系,其中,规则rule
是一个正则表达式对,例如rule=[r'img_(\d+)\.([jJ][pP][gG])',r'gt_img_\1.txt']
。 第一个正则表达式用于匹配图片文件名,\d+
用于匹配图片的序号,([jJ][pP][gG])
用于匹配图片的后缀名。 第二个正则表达式用于匹配标注文件名,其中\1
则将匹配到的图片序号与标注文件序号对应起来。示例配置为
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg', '.JPG'],
rule=[r'img_(\d+)\.([jJ][pP][gG])', r'gt_img_\1.txt']),
单对多的情况通常比较简单,用户只需要指定标注文件名即可。对于训练集示例配置为
gatherer=dict(type='MonoGatherer', ann_name='train.txt'),
MMOCR 同样对 Gatherer
的返回值做了约定,Gatherer
会返回两个元素的元组,第一个元素为图像路径列表(包含所有图像路径) 或者所有图像所在的文件夹, 第二个元素为标注文件路径列表(包含所有标注文件路径)或者标注文件的路径(该标注文件包含所有图像标注信息)。
具体而言,PairGatherer
的返回值为(图像路径列表, 标注文件路径列表),示例如下:
(['{taskname}_imgs/{split}/img_1.jpg', '{taskname}_imgs/{split}/img_2.jpg', '{taskname}_imgs/{split}/img_3.JPG'],
['annotations/{split}/gt_img_1.txt', 'annotations/{split}/gt_img_2.txt', 'annotations/{split}/gt_img_3.txt'])
MonoGatherer
的返回值为(图像文件夹路径, 标注文件路径), 示例为:
('{taskname}/{split}', 'annotations/gt.txt')
数据集解析 (Parser)¶
Parser
主要用于解析原始的标注文件,因为原始标注情况多种多样,因此 MMOCR 提供了 BaseParser
作为基类,用户可以继承该类来实现自己的 Parser
。在 BaseParser
中,MMOCR 设计了两个接口:parse_files
和 parse_file
,约定在其中进行标注的解析。而对于 Gatherer
的两种不同输入情况(多对多、单对多),这两个接口的实现则应有所不同。
BaseParser
默认处理多对多的情况。其中,由 parse_files
将数据并行分发至多个parse_file
进程,并由每个parse_file
分别进行单个图像标注的解析。对于单对多的情况,用户则需要重写
parse_files
,以实现加载标注,并返回规范的结果。
BaseParser
的接口定义如下所示:
from abc import abstractmethod
from typing import List, Tuple, Union

# 注:track_parallel_progress_multi_args 为 MMOCR 提供的并行处理工具函数


class BaseParser:

    def __call__(self, img_paths, ann_paths):
        return self.parse_files(img_paths, ann_paths)

    def parse_files(self, img_paths: Union[List[str], str],
                    ann_paths: Union[List[str], str]) -> List[Tuple]:
        samples = track_parallel_progress_multi_args(
            self.parse_file, (img_paths, ann_paths), nproc=self.nproc)
        return samples

    @abstractmethod
    def parse_file(self, img_path: str, ann_path: str) -> Tuple:
        raise NotImplementedError
为了保证后续模块的统一性,MMOCR 对 parse_files
与 parse_file
的返回值做了约定。 parse_file
的返回值为一个元组,元组中的第一个元素为图像路径,第二个元素为标注信息。标注信息为一个列表,列表中的每个元素为一个字典,字典中的字段为poly
, text
, ignore
,如下所示:
# An example of returned values:
(
'imgs/train/xxx.jpg',
[
dict(
poly=[0, 1, 1, 1, 1, 0, 0, 0],
text='hello',
ignore=False),
...
]
)
parse_files
的输出为一个列表,列表中的每个元素为 parse_file
的返回值。 示例为:
[
(
'imgs/train/xxx.jpg',
[
dict(
poly=[0, 1, 1, 1, 1, 0, 0, 0],
text='hello',
ignore=False),
...
]
),
...
]
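下面给出一个继承 BaseParser 并实现 parse_file 的简化示例:假设每张图片对应一个标注文件,且每行格式为 x1,y1,x2,y2,x3,y3,x4,y4,文本(该标注格式、类名与导入路径均为演示用的假设,实际请以所用 MMOCR 版本和具体数据集为准):
from mmocr.datasets.preparers.parsers.base import BaseParser
from mmocr.registry import DATA_PARSERS


@DATA_PARSERS.register_module()
class MyTxtTextDetAnnParser(BaseParser):
    """解析每行为 "x1,y1,...,x4,y4,text" 的标注文件(假设格式)。"""

    def parse_file(self, img_path: str, ann_path: str) -> tuple:
        instances = []
        with open(ann_path, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                *coords, text = line.split(',')
                poly = [float(x) for x in coords]  # 8 个多边形坐标
                # 约定:标注为 "###" 的文本在训练时忽略
                instances.append(
                    dict(poly=poly, text=text, ignore=text == '###'))
        return img_path, instances
随后即可在 dataset_zoo 的配置中通过 parser=dict(type='MyTxtTextDetAnnParser') 使用它。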
数据集转换 (Packer)¶
packer
负责将数据转化为统一的标注格式。由于输入数据为 Parser 的输出,格式已经固定,因此 Packer 只需要将其转化为各任务统一的标注格式即可。目前 MMOCR 支持的任务有文本检测、文本识别、端对端 OCR 以及关键信息提取,MMOCR 针对每个任务均有对应的 Packer,如下所示:
对于文字检测、端对端OCR及关键信息提取,MMOCR 均有唯一对应的 Packer
。而在文字识别领域, MMOCR 则提供了两种 Packer
,分别为 TextRecogPacker
和 TextRecogCropPacker
,其原因在于文字识别的数据集存在两种情况(两种 Packer 的配置示例见本节末尾):
每个图像均为一个识别样本,
parser
返回的标注信息仅为一个dict(text='xxx')
,此时使用TextRecogPacker
即可。数据集没有将文字从图像中裁剪出来,本质是一个端对端OCR的标注,包含了文字的位置信息以及对应的文本信息,
TextRecogCropPacker
会将文字从图像中裁剪出来,然后再转化成文字识别的统一格式。
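两种情况对应的配置片段大致如下(仅为示意):
# 图像本身即为裁剪好的识别样本
packer=dict(type='TextRecogPacker')

# 标注中包含文字位置,需要先把文字区域裁剪出来
packer=dict(type='TextRecogCropPacker')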
标注保存 (Dumper)¶
dumper
决定了数据要被保存为何种格式。目前,MMOCR 支持 JsonDumper
, WildreceiptOpensetDumper
,及 TextRecogLMDBDumper
。它们分别用于将数据保存为标准的 MMOCR Json 格式、Wildreceipt 格式,及文本识别领域学术界常用的 LMDB 格式。
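例如,若想把文本识别标注保存为 LMDB 格式,可以将 dumper 配置为(示意):
dumper=dict(type='TextRecogLMDBDumper')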
临时文件清理 (Delete)¶
在处理数据集时,往往会产生一些不需要的临时文件。这里可以以列表的形式传入这些文件或文件夹,在结束转换时即会删除。
生成基础配置 (ConfigGenerator)¶
为了在数据集准备完毕后可以自动生成基础配置,目前,MMOCR 按任务实现了 TextDetConfigGenerator
、TextRecogConfigGenerator
和 TextSpottingConfigGenerator
。它们支持的主要参数如下:
字段名 | 含义 |
---|---|
data_root | 数据集存储的根目录 |
train_anns | 配置文件内训练集标注的路径。若不指定,则默认为 [dict(ann_file='{taskname}_train.json', dataset_postfix='')] 。 |
val_anns | 配置文件内验证集标注的路径。若不指定,则默认为空。 |
test_anns | 配置文件内测试集标注的路径。若不指定,则默认指向 [dict(ann_file='{taskname}_test.json', dataset_postfix='')] 。 |
config_path | 算法库存放配置文件的路径,配置生成器会将默认配置写入 {config_path}/{taskname}/_base_/datasets/{dataset_name}.py 下。若不指定,则默认为 configs/ |
在准备好数据集的所有文件后,配置生成器就会自动生成调用该数据集所需要的基础配置文件。下面给出了一个最小化的 TextDetConfigGenerator
配置示例:
config_generator = dict(type='TextDetConfigGenerator')
生成后的文件默认会被置于 configs/{task}/_base_/datasets/
下。例如,本例中,icdar 2015 的基础配置文件就会被生成在 configs/textdet/_base_/datasets/icdar2015.py
下:
icdar2015_textdet_data_root = 'data/icdar2015'
icdar2015_textdet_train = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root,
ann_file='textdet_train.json',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=None)
icdar2015_textdet_test = dict(
type='OCRDataset',
data_root=icdar2015_textdet_data_root,
ann_file='textdet_test.json',
test_mode=True,
pipeline=None)
假如数据集比较特殊,标注存在着几个变体,配置生成器也支持在基础配置中生成指向各自变体的变量,但这需要用户在设置时用不同的 dataset_postfix
区分。例如,ICDAR 2015 文字识别数据的测试集就存在着原版和 1811 两种标注版本,可以在 test_anns
中指定它们,如下所示:
config_generator = dict(
type='TextRecogConfigGenerator',
test_anns=[
dict(ann_file='textrecog_test.json'),
dict(dataset_postfix='1811', ann_file='textrecog_test_1811.json')
])
配置生成器会生成以下配置:
icdar2015_textrecog_data_root = 'data/icdar2015'
icdar2015_textrecog_train = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_train.json',
pipeline=None)
icdar2015_textrecog_test = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_test.json',
test_mode=True,
pipeline=None)
icdar2015_1811_textrecog_test = dict(
type='OCRDataset',
data_root=icdar2015_textrecog_data_root,
ann_file='textrecog_test_1811.json',
test_mode=True,
pipeline=None)
有了该文件后,MMOCR 就能从模型的配置文件中直接导入该数据集到 dataloader
中使用(以下样例节选自 configs/textdet/dbnet/dbnet_resnet18_fpnc_1200e_icdar2015.py
):
_base_ = [
'../_base_/datasets/icdar2015.py',
# ...
]
# dataset settings
icdar2015_textdet_train = _base_.icdar2015_textdet_train
icdar2015_textdet_test = _base_.icdar2015_textdet_test
# ...
train_dataloader = dict(
dataset=icdar2015_textdet_train)
val_dataloader = dict(
dataset=icdar2015_textdet_test)
test_dataloader = val_dataloader
注解
除非用户在运行脚本的时候手动指定了 --overwrite-cfg
,配置生成器默认不会自动覆盖已经存在的基础配置文件。
向 Dataset Preparer 添加新的数据集¶
添加公开数据集¶
MMOCR 已经支持了许多常用的公开数据集。如果你想用的数据集还没有被支持,并且你也愿意为 MMOCR 开源社区贡献代码,你可以按照以下步骤来添加一个新的数据集。
接下来以添加 ICDAR2013 数据集为例,展示如何一步一步地添加一个新的公开数据集。
添加 metafile.yml
¶
首先,确认 dataset_zoo/
中不存在准备添加的数据集。然后我们先新建以待添加数据集命名的文件夹,如 icdar2013/
(通常,使用不包含符号的小写英文字母及数字来命名数据集)。在 icdar2013/
文件夹中,新建 metafile.yml
文件,并按照以下模板来填充数据集的基本信息:
Name: 'Focused Scene Text IC13'
Paper:
  Title: ICDAR 2013 Robust Reading Competition
  URL: https://www.imlab.jp/publication_data/1352/icdar_competition_report.pdf
  Venue: ICDAR
  Year: '2013'
  BibTeX: '@inproceedings{karatzas2013icdar,
    title={ICDAR 2013 robust reading competition},
    author={Karatzas, Dimosthenis and Shafait, Faisal and Uchida, Seiichi and Iwamura, Masakazu and i Bigorda, Lluis Gomez and Mestre, Sergi Robles and Mas, Joan and Mota, David Fernandez and Almazan, Jon Almazan and De Las Heras, Lluis Pere},
    booktitle={2013 12th international conference on document analysis and recognition},
    pages={1484--1493},
    year={2013},
    organization={IEEE}}'
Data:
  Website: https://rrc.cvc.uab.es/?ch=2
  Language:
    - English
  Scene:
    - Natural Scene
  Granularity:
    - Word
  Tasks:
    - textdet
    - textrecog
    - textspotting
  License:
    Type: N/A
    Link: N/A
  Format: .txt
  Keywords:
    - Horizontal
添加标注示例¶
接着,可以在 dataset_zoo/icdar2013/
目录下添加标注示例文件 sample_anno.md
以帮助文档脚本在生成文档时添加标注示例,标注示例文件是一个 Markdown 文件,其内容通常包含了单个样本的原始数据格式。例如,以下代码块展示了 ICDAR2013 数据集的数据样例文件:
**Text Detection**
```text
# train split
# x1 y1 x2 y2 "transcript"
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
# test split
# x1, y1, x2, y2, "transcript"
38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
```
添加对应任务的配置文件¶
在 dataset_zoo/icdar2013
中,接着添加以任务名称命名的 .py
配置文件。如 textdet.py
,textrecog.py
,textspotting.py
,kie.py
等。配置模板如下所示:
data_root = ''
cache_path = 'data/cache'
train_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
        ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'),  # 对应任务的 Packer
    dumper=dict(type='JsonDumper'),
)
test_preparer = dict(
    obtainer=dict(
        type='NaiveDataObtainer',
        cache_path=cache_path,
        files=[
            dict(
                url='xx',
                md5='',
                save_name='xxx',
                mapping=list())
        ]),
    gatherer=dict(type='xxxGatherer', **kwargs),
    parser=dict(type='xxxParser', **kwargs),
    packer=dict(type='TextxxxPacker'),  # 对应任务的 Packer
    dumper=dict(type='JsonDumper'),
)
下面以文字检测任务为例,介绍配置文件的具体内容。
一般情况下用户无需重新实现新的 obtainer
, gatherer
, packer
或 dumper
,但是通常需要根据数据集的标注格式实现新的 parser
。
对于 obtainer
的配置这里不再做过多的介绍,详见 数据集下载、解压、移动。
针对 gatherer
,通过观察获取的 ICDAR2013 数据集文件发现,其每一张图片都有一个对应的 .txt
格式的标注文件:
data_root
├── textdet_imgs/train/
│ ├── img_1.jpg
│ ├── img_2.jpg
│ └── ...
├── annotations/train/
│ ├── gt_img_1.txt
│ ├── gt_img_2.txt
│ └── ...
且每个标注文件名与图片的对应关系为:gt_img_1.txt
对应 img_1.jpg
,以此类推。因此可以使用 PairGatherer
来进行匹配。
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg'],
rule=[r'(\w+)\.jpg', r'gt_\1.txt'])
规则 rule
第一个正则表达式用于匹配图片文件名,第二个正则表达式用于匹配标注文件名。在这里,使用 (\w+)
来匹配图片文件名,使用 gt_\1.txt
来匹配标注文件名,其中 \1
表示第一个正则表达式匹配到的内容。即,实现了将 img_xx.jpg
替换为 gt_img_xx.txt
的功能。
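这一对应关系可以用 Python 的 re 模块直观地验证(以下仅为演示该正则替换逻辑的示意代码,并非 PairGatherer 的实际实现):
import re

img_name = 'img_12.jpg'
# 用 rule 中的第一个正则匹配图片文件名,再按第二个模板生成对应的标注文件名
ann_name = re.sub(r'(\w+)\.jpg', r'gt_\1.txt', img_name)
print(ann_name)  # 输出: gt_img_12.txt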
接下来,需要实现 parser
,即将原始标注文件解析为标准格式。通常来说,用户在添加新的数据集前,可以浏览已支持数据集的详情页,并查看是否已有相同格式的数据集。如果已有相同格式的数据集,则可以直接使用该数据集的 parser
。否则,则需要实现新的格式解析器。
数据格式解析器被统一存储在 mmocr/datasets/preparers/parsers
目录下。所有的 parser
都需要继承 BaseParser
,并实现 parse_file
或 parse_files
方法。具体可以参考数据集解析
通过观察 ICDAR2013 数据集的标注文件:
158 128 411 181 "Footpath"
443 128 501 169 "To"
64 200 363 243 "Colchester"
542, 710, 938, 841, "break"
87, 884, 457, 1021, "could"
517, 919, 831, 1024, "save"
我们发现内置的 ICDARTxtTextDetAnnParser
已经可以满足需求,因此可以直接使用该 parser
,并将其配置到 preparer
中。
parser=dict(
type='ICDARTxtTextDetAnnParser',
remove_strs=[',', '"'],
encoding='utf-8',
format='x1 y1 x2 y2 trans',
separator=' ',
mode='xyxy')
其中,由于标注文件中混杂了多余的引号 “”
和逗号 ,
,可以通过指定 remove_strs=[',', '"']
来进行移除。另外在 format
中指定了标注文件的格式,其中 x1 y1 x2 y2 trans
表示标注文件中的每一行包含了四个坐标和一个文本内容,且坐标和文本内容之间使用空格分隔(separator
=' ')。另外,需要指定 mode
为 xyxy
,表示标注文件中的坐标是左上角和右下角的坐标。这样一来,ICDARTxtTextDetAnnParser
即可将该格式的标注解析为统一格式。
对于 packer
,以文字检测任务为例,其 packer
为 TextDetPacker
,其配置如下:
packer=dict(type='TextDetPacker')
最后,指定 dumper
,一般情况下将标注保存为 JSON 格式,其配置如下:
dumper=dict(type='JsonDumper')
经过上述配置后,针对 ICDAR2013 训练集的配置文件如下:
train_preparer = dict(
obtainer=dict(
type='NaiveDataObtainer',
cache_path=cache_path,
files=[
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge2_Training_Task12_Images.zip',
save_name='ic13_textdet_train_img.zip',
md5='a443b9649fda4229c9bc52751bad08fb',
content=['image'],
mapping=[['ic13_textdet_train_img', 'textdet_imgs/train']]),
dict(
url='https://rrc.cvc.uab.es/downloads/'
'Challenge2_Training_Task1_GT.zip',
save_name='ic13_textdet_train_gt.zip',
md5='f3a425284a66cd67f455d389c972cce4',
content=['annotation'],
mapping=[['ic13_textdet_train_gt', 'annotations/train']]),
]),
gatherer=dict(
type='PairGatherer',
img_suffixes=['.jpg'],
rule=[r'(\w+)\.jpg', r'gt_\1.txt']),
parser=dict(
type='ICDARTxtTextDetAnnParser',
remove_strs=[',', '"'],
format='x1 y1 x2 y2 trans',
separator=' ',
mode='xyxy'),
packer=dict(type='TextDetPacker'),
dumper=dict(type='JsonDumper'),
)
为了在数据集准备完毕后可以自动生成基础配置, 还需要配置一下对应任务的 config_generator
。
在本例中,因为是文字检测任务,仅需将 Generator 设置为 TextDetConfigGenerator
即可:
config_generator = dict(type='TextDetConfigGenerator')
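数据集准备完成后,按照前文 icdar2015 的例子类推,配置生成器会在 configs/textdet/_base_/datasets/icdar2013.py 中生成类似下面的基础配置(此处假设 data_root 被设置为 'data/icdar2013',具体内容以实际生成结果为准):
icdar2013_textdet_data_root = 'data/icdar2013'

icdar2013_textdet_train = dict(
    type='OCRDataset',
    data_root=icdar2013_textdet_data_root,
    ann_file='textdet_train.json',
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)

icdar2013_textdet_test = dict(
    type='OCRDataset',
    data_root=icdar2013_textdet_data_root,
    ann_file='textdet_test.json',
    test_mode=True,
    pipeline=None)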
添加私有数据集¶
待更新…
Text Detection¶
注解
This page is a manual preparation guide for datasets not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.
Overview¶
Dataset | Images | Annotation Files (training) | Annotation Files (validation) | Annotation Files (testing) |
---|---|---|---|---|
ICDAR2011 | homepage | - | - | - |
ICDAR2017 | homepage | instances_training.json | instances_val.json | - |
CurvedSynText150k | homepage, Part1, Part2 | instances_training.json | - | - |
DeText | homepage | - | - | - |
Lecture Video DB | homepage | - | - | - |
LSVT | homepage | - | - | - |
IMGUR | homepage | - | - | - |
KAIST | homepage | - | - | - |
MTWI | homepage | - | - | - |
ReCTS | homepage | - | - | - |
IIIT-ILST | homepage | - | - | - |
VinText | homepage | - | - | - |
BID | homepage | - | - | - |
RCTW | homepage | - | - | - |
HierText | homepage | - | - | - |
ArT | homepage | - | - | - |
Install AWS CLI (optional)¶
Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ./aws/install -i /usr/local/aws-cli -b /usr/local/bin !aws configure # this command will require you to input keys, you can skip them except # for the Default region name # AWS Access Key ID [None]: # AWS Secret Access Key [None]: # Default region name [None]: us-east-1 # Default output format [None]
For users in China, these datasets can also be downloaded from OpenDataLab with high speed:
Important Note¶
注解
For users who want to train models on CTW1500, ICDAR 2015/2017, and Totaltext dataset, there might be some images containing orientation info in EXIF data. The default OpenCV
backend used in MMCV would read them and apply the rotation on the images. However, their gold annotations are made on the raw pixels, and such
inconsistency results in false examples in the training set. Therefore, users should use dict(type='LoadImageFromFile', color_type='color_ignore_orientation')
in pipelines to change MMCV’s default loading behaviour. (see DBNet’s pipeline config for example)
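For instance, a detection pipeline that applies this setting may start with the following loading step (a minimal sketch; the remaining transforms depend on the model config you use):
train_pipeline = [
    dict(type='LoadImageFromFile', color_type='color_ignore_orientation'),
    # ... followed by annotation loading, augmentation and packing transforms,
    #     e.g. as in DBNet's pipeline config
]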
ICDAR 2011 (Born-Digital Images)¶
Step1: Download
Challenge1_Training_Task12_Images.zip
,Challenge1_Training_Task1_GT.zip
,Challenge1_Test_Task12_Images.zip
, andChallenge1_Test_Task1_GT.zip
from homepageTask 1.1: Text Localization (2013 edition)
.mkdir icdar2011 && cd icdar2011 mkdir imgs && mkdir annotations # Download ICDAR 2011 wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task12_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task1_GT.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task12_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task1_GT.zip --no-check-certificate # For images unzip -q Challenge1_Training_Task12_Images.zip -d imgs/training unzip -q Challenge1_Test_Task12_Images.zip -d imgs/test # For annotations unzip -q Challenge1_Training_Task1_GT.zip -d annotations/training unzip -q Challenge1_Test_Task1_GT.zip -d annotations/test rm Challenge1_Training_Task12_Images.zip && rm Challenge1_Test_Task12_Images.zip && rm Challenge1_Training_Task1_GT.zip && rm Challenge1_Test_Task1_GT.zip
Step 2: Generate
instances_training.json
andinstances_test.json
with the following command:python tools/dataset_converters/textdet/ic11_converter.py PATH/TO/icdar2011 --nproc 4
After running the above codes, the directory structure should be as follows:
│── icdar2011 │ ├── imgs │ ├── instances_test.json │ └── instances_training.json
ICDAR 2017¶
Follow similar steps as ICDAR 2015.
The resulting directory structure looks like the following:
├── icdar2017 │ ├── imgs │ ├── annotations │ ├── instances_training.json │ └── instances_val.json
CurvedSynText150k¶
Step1: Download syntext1.zip and syntext2.zip to
CurvedSynText150k/
.Step2:
unzip -q syntext1.zip mv train.json train1.json unzip images.zip rm images.zip unzip -q syntext2.zip mv train.json train2.json unzip images.zip rm images.zip
Step3: Download instances_training.json to
CurvedSynText150k/
Or, generate
instances_training.json
with following command:python tools/dataset_converters/common/curvedsyntext_converter.py PATH/TO/CurvedSynText150k --nproc 4
The resulting directory structure looks like the following:
├── CurvedSynText150k │ ├── syntext_word_eng │ ├── emcs_imgs │ └── instances_training.json
DeText¶
Step1: Download
ch9_training_images.zip
,ch9_training_localization_transcription_gt.zip
,ch9_validation_images.zip
, andch9_validation_localization_transcription_gt.zip
from Task 3: End to End on the homepage.mkdir detext && cd detext mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val # Download DeText wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate # Extract images and annotations unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val # Remove zips rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
Step2: Generate
instances_training.json
andinstances_val.json
with following command:python tools/dataset_converters/textdet/detext_converter.py PATH/TO/detext --nproc 4
After running the above codes, the directory structure should be as follows:
│── detext │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json
Lecture Video DB¶
Step1: Download IIIT-CVid.zip to
lv/
.mkdir lv && cd lv # Download LV dataset wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip unzip -q IIIT-CVid.zip mv IIIT-CVid/Frames imgs rm IIIT-CVid.zip
Step2: Generate
instances_training.json
,instances_val.json
, andinstances_test.json
with following command:python tools/dataset_converters/textdet/lv_converter.py PATH/TO/lv --nproc 4
The resulting directory structure looks like the following:
│── lv │ ├── imgs │ ├── instances_test.json │ ├── instances_training.json │ └── instances_val.json
LSVT¶
Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to
lsvt/
.mkdir lsvt && cd lsvt # Download LSVT dataset wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json mkdir annotations tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/ mv train_full_images_0 imgs rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
Step2: Generate
instances_training.json
andinstances_val.json
(optional) with the following command:# Annotations of LSVT test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/lsvt_converter.py PATH/TO/lsvt
After running the above codes, the directory structure should be as follows:
|── lsvt │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
IMGUR¶
Step1: Run
download_imgur5k.py
to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.mkdir imgur && cd imgur git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git # Download images from imgur.com. This may take SEVERAL HOURS! python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs # For annotations mkdir annotations mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations rm -rf IMGUR5K-Handwriting-Dataset
Step2: Generate
instances_training.json
,instances_val.json
andinstances_test.json
with the following command:python tools/dataset_converters/textdet/imgur_converter.py PATH/TO/imgur
After running the above codes, the directory structure should be as follows:
│── imgur │ ├── annotations │ ├── imgs │ ├── instances_test.json │ ├── instances_training.json │ └── instances_val.json
KAIST¶
Step1: Download KAIST_all.zip to
kaist/
.mkdir kaist && cd kaist mkdir imgs && mkdir annotations # Download KAIST dataset wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip unzip -q KAIST_all.zip rm KAIST_all.zip
Step2: Extract zips:
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
Step3: Generate
instances_training.json
andinstances_val.json
(optional) with following command:# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2 python tools/dataset_converters/textdet/kaist_converter.py PATH/TO/kaist --nproc 4
After running the above codes, the directory structure should be as follows:
│── kaist │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
MTWI¶
Step1: Download
mtwi_2018_train.zip
from homepage.mkdir mtwi && cd mtwi unzip -q mtwi_2018_train.zip mv image_train imgs && mv txt_train annotations rm mtwi_2018_train.zip
Step2: Generate
instances_training.json
andinstances_val.json
(optional) with the following command:# Annotations of MTWI test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/mtwi_converter.py PATH/TO/mtwi --nproc 4
After running the above codes, the directory structure should be as follows:
│── mtwi │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
ReCTS¶
Step1: Download ReCTS.zip to
rects/
from the homepage.mkdir rects && cd rects # Download ReCTS dataset # You can also find Google Drive link on the dataset homepage wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate unzip -q ReCTS.zip mv img imgs && mv gt_unicode annotations rm ReCTS.zip && rm -rf gt
Step2: Generate
instances_training.json
andinstances_val.json
(optional) with following command:# Annotations of ReCTS test split is not publicly available, split a validation # set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/rects_converter.py PATH/TO/rects --nproc 4 --val-ratio 0.2
After running the above codes, the directory structure should be as follows:
│── rects │ ├── annotations │ ├── imgs │ ├── instances_val.json (optional) │ └── instances_training.json
ILST¶
Step1: Download
IIIT-ILST
from onedriveStep2: Run the following commands
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip cd IIIT-ILST # rename files cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd .. cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd .. cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd .. # transfer image path mkdir imgs && mkdir annotations mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/ mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/ mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/ # remove unnecessary files rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
Step3: Generate
instances_training.json
andinstances_val.json
(optional). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textdet/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
After running the above codes, the directory structure should be as follows:
│── IIIT-ILST │ ├── annotations │ ├── imgs │ ├── instances_val.json (optional) │ └── instances_training.json
VinText¶
Step1: Download vintext.zip to
vintext
mkdir vintext && cd vintext
# Download dataset from google drive
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt
# Extract images and annotations
unzip -q vintext.zip && rm vintext.zip
mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./
rm -rf vietnamese
# Rename files
mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test
mkdir imgs
mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
Step2: Generate
instances_training.json
,instances_test.json
andinstances_unseen_test.json
python tools/dataset_converters/textdet/vintext_converter.py PATH/TO/vintext --nproc 4
After running the above codes, the directory structure should be as follows:
│── vintext │ ├── annotations │ ├── imgs │ ├── instances_test.json │ ├── instances_unseen_test.json │ └── instances_training.json
BID¶
Step1: Download BID Dataset.zip
Step2: Run the following commands to preprocess the dataset
# Rename mv BID\ Dataset.zip BID_Dataset.zip # Unzip and Rename unzip -q BID_Dataset.zip && rm BID_Dataset.zip mv BID\ Dataset BID # The BID dataset has a problem of permission, and you may # add permission for this file chmod -R 777 BID cd BID mkdir imgs && mkdir annotations # For images and annotations mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso # Remove unnecessary files rm -rf desktop.ini
Step3: Generate
instances_training.json
andinstances_val.json
(optional). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textdet/bid_converter.py PATH/TO/BID --nproc 4
After running the above codes, the directory structure should be as follows:
│── BID │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
RCTW¶
Step1: Download
train_images.zip.001
,train_images.zip.002
, andtrain_gts.zip
from the homepage, extract the zips torctw/imgs
andrctw/annotations
, respectively.Step2: Generate
instances_training.json
andinstances_val.json
(optional). Since the test annotations are not publicly available, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2 python tools/dataset_converters/textdet/rctw_converter.py PATH/TO/rctw --nproc 4
After running the above codes, the directory structure should be as follows:
│── rctw │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
HierText¶
Step1 (optional): Install AWS CLI.
Step2: Clone HierText repo to get annotations
mkdir HierText git clone https://github.com/google-research-datasets/hiertext.git
Step3: Download
train.tgz
,validation.tgz
from awsaws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz . aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
Step4: Process raw data
# process annotations mv hiertext/gt ./ rm -rf hiertext mv gt annotations gzip -d annotations/train.jsonl.gz gzip -d annotations/validation.jsonl.gz # process images mkdir imgs mv train.tgz imgs/ mv validation.tgz imgs/ tar -xzvf imgs/train.tgz tar -xzvf imgs/validation.tgz
Step5: Generate
instances_training.json
andinstances_val.json
. HierText includes different levels of annotation, from paragraph, line, to word. Check the original paper for details. E.g. set--level paragraph
to get paragraph-level annotation. Set--level line
to get line-level annotation. set--level word
to get word-level annotation.# Collect word annotation from HierText --level word python tools/dataset_converters/textdet/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
After running the above codes, the directory structure should be as follows:
│── HierText │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json
ArT¶
Step1: Download
train_images.tar.gz
, andtrain_labels.json
from the homepage toart/
mkdir art && cd art mkdir annotations # Download ArT dataset wget https://dataset-bj.cdn.bcebos.com/art/train_images.tar.gz --no-check-certificate wget https://dataset-bj.cdn.bcebos.com/art/train_labels.json --no-check-certificate # Extract tar -xf train_images.tar.gz mv train_images imgs mv train_labels.json annotations/ # Remove unnecessary files rm train_images.tar.gz
Step2: Generate
instances_training.json
andinstances_val.json
(optional). Since the test annotations are not publicly available, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
# Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textdet/art_converter.py PATH/TO/art --nproc 4
After running the above codes, the directory structure should be as follows:
│── art │ ├── annotations │ ├── imgs │ ├── instances_training.json │ └── instances_val.json (optional)
Text Recognition¶
注解
This page is a manual preparation guide for datasets not yet supported by Dataset Preparer, into which all these scripts will eventually be migrated.
Overview¶
Dataset | Images | Annotation File (training) | Annotation File (test) |
---|---|---|---|
coco_text | homepage | train_labels.json | - |
ICDAR2011 | homepage | - | - |
SynthAdd | SynthText_Add.zip (code:627x) | train_labels.json | - |
OpenVINO | Open Images | annotations | annotations |
DeText | homepage | - | - |
Lecture Video DB | homepage | - | - |
LSVT | homepage | - | - |
IMGUR | homepage | - | - |
KAIST | homepage | - | - |
MTWI | homepage | - | - |
ReCTS | homepage | - | - |
IIIT-ILST | homepage | - | - |
VinText | homepage | - | - |
BID | homepage | - | - |
RCTW | homepage | - | - |
HierText | homepage | - | - |
ArT | homepage | - | - |
(*) Since the official homepage is unavailable now, we provide an alternative for quick reference. However, we do not guarantee the correctness of the dataset.
Install AWS CLI (optional)¶
Since there are some datasets that require the AWS CLI to be installed in advance, we provide a quick installation guide here:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" unzip awscliv2.zip sudo ./aws/install ./aws/install -i /usr/local/aws-cli -b /usr/local/bin !aws configure # this command will require you to input keys, you can skip them except # for the Default region name # AWS Access Key ID [None]: # AWS Secret Access Key [None]: # Default region name [None]: us-east-1 # Default output format [None]
For users in China, these datasets can also be downloaded from OpenDataLab with high speed:
ICDAR 2011 (Born-Digital Images)¶
Step1: Download
Challenge1_Training_Task3_Images_GT.zip
,Challenge1_Test_Task3_Images.zip
, andChallenge1_Test_Task3_GT.txt
from homepageTask 1.3: Word Recognition (2013 edition)
.mkdir icdar2011 && cd icdar2011 mkdir annotations # Download ICDAR 2011 wget https://rrc.cvc.uab.es/downloads/Challenge1_Training_Task3_Images_GT.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_Images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/Challenge1_Test_Task3_GT.txt --no-check-certificate # For images mkdir crops unzip -q Challenge1_Training_Task3_Images_GT.zip -d crops/train unzip -q Challenge1_Test_Task3_Images.zip -d crops/test # For annotations mv Challenge1_Test_Task3_GT.txt annotations && mv crops/train/gt.txt annotations/Challenge1_Train_Task3_GT.txt
Step2: Convert original annotations to
train_labels.json
andtest_labels.json
with the following command:python tools/dataset_converters/textrecog/ic11_converter.py PATH/TO/icdar2011
After running the above codes, the directory structure should be as follows:
├── icdar2011 │ ├── crops │ ├── train_labels.json │ └── test_labels.json
coco_text¶
Step1: Download from homepage
Step2: Download train_labels.json
After running the above codes, the directory structure should be as follows:
├── coco_text │ ├── train_labels.json │ └── train_words
SynthAdd¶
Step1: Download
SynthText_Add.zip
from SynthAdd (code:627x). Step2: Download train_labels.json
Step3:
mkdir SynthAdd && cd SynthAdd mv /path/to/SynthText_Add.zip . unzip SynthText_Add.zip mv /path/to/train_labels.json . # create soft link cd /path/to/mmocr/data/recog ln -s /path/to/SynthAdd SynthAdd
After running the above codes, the directory structure should be as follows:
├── SynthAdd │ ├── train_labels.json │ └── SynthText_Add
OpenVINO¶
Step1 (optional): Install AWS CLI.
Step2: Download Open Images subsets
train_1
,train_2
,train_5
,train_f
, andvalidation
toopenvino/
.mkdir openvino && cd openvino # Download Open Images subsets for s in 1 2 5 f; do aws s3 --no-sign-request cp s3://open-images-dataset/tar/train_${s}.tar.gz . done aws s3 --no-sign-request cp s3://open-images-dataset/tar/validation.tar.gz . # Download annotations for s in 1 2 5 f; do wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_train_${s}.json done wget https://storage.openvinotoolkit.org/repositories/openvino_training_extensions/datasets/open_images_v5_text/text_spotting_openimages_v5_validation.json # Extract images mkdir -p openimages_v5/val for s in 1 2 5 f; do tar zxf train_${s}.tar.gz -C openimages_v5 done tar zxf validation.tar.gz -C openimages_v5/val
Step3: Generate
train_{1,2,5,f}_labels.json
,val_labels.json
and crop images using 4 processes with the following command:python tools/dataset_converters/textrecog/openvino_converter.py /path/to/openvino 4
After running the above codes, the directory structure should be as follows:
├── OpenVINO │ ├── image_1 │ ├── image_2 │ ├── image_5 │ ├── image_f │ ├── image_val │ ├── train_1_labels.json │ ├── train_2_labels.json │ ├── train_5_labels.json │ ├── train_f_labels.json │ └── val_labels.json
DeText¶
Step1: Download
ch9_training_images.zip
,ch9_training_localization_transcription_gt.zip
,ch9_validation_images.zip
, andch9_validation_localization_transcription_gt.zip
from Task 3: End to End on the homepage.mkdir detext && cd detext mkdir imgs && mkdir annotations && mkdir imgs/training && mkdir imgs/val && mkdir annotations/training && mkdir annotations/val # Download DeText wget https://rrc.cvc.uab.es/downloads/ch9_training_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_training_localization_transcription_gt.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_images.zip --no-check-certificate wget https://rrc.cvc.uab.es/downloads/ch9_validation_localization_transcription_gt.zip --no-check-certificate # Extract images and annotations unzip -q ch9_training_images.zip -d imgs/training && unzip -q ch9_training_localization_transcription_gt.zip -d annotations/training && unzip -q ch9_validation_images.zip -d imgs/val && unzip -q ch9_validation_localization_transcription_gt.zip -d annotations/val # Remove zips rm ch9_training_images.zip && rm ch9_training_localization_transcription_gt.zip && rm ch9_validation_images.zip && rm ch9_validation_localization_transcription_gt.zip
Step2: Generate
train_labels.json
andtest_labels.json
with following command:# Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/detext/ignores python tools/dataset_converters/textrecog/detext_converter.py PATH/TO/detext --nproc 4
After running the above codes, the directory structure should be as follows:
├── detext │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── test_labels.json
NAF¶
Step1: Download labeled_images.tar.gz to
naf/
.mkdir naf && cd naf # Download NAF dataset wget https://github.com/herobd/NAF_dataset/releases/download/v1.0/labeled_images.tar.gz tar -zxf labeled_images.tar.gz # For images mkdir annotations && mv labeled_images imgs # For annotations git clone https://github.com/herobd/NAF_dataset.git mv NAF_dataset/train_valid_test_split.json annotations/ && mv NAF_dataset/groups annotations/ rm -rf NAF_dataset && rm labeled_images.tar.gz
Step2: Generate
train_labels.json
,val_labels.json
, andtest_labels.json
with following command:# Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/naf/ignores python tools/dataset_converters/textrecog/naf_converter.py PATH/TO/naf --nproc 4
After running the above codes, the directory structure should be as follows:
├── naf │ ├── crops │ ├── train_labels.json │ ├── val_labels.json │ └── test_labels.json
Lecture Video DB¶
警告
This section is not fully tested yet.
注解
The LV dataset has already provided cropped images and the corresponding annotations
Step1: Download IIIT-CVid.zip to
lv/
.mkdir lv && cd lv # Download LV dataset wget http://cdn.iiit.ac.in/cdn/preon.iiit.ac.in/~kartik/IIIT-CVid.zip unzip -q IIIT-CVid.zip # For image mv IIIT-CVid/Crops ./ # For annotation mv IIIT-CVid/train.txt train_labels.json && mv IIIT-CVid/val.txt val_label.txt && mv IIIT-CVid/test.txt test_labels.json rm IIIT-CVid.zip
Step2: Generate
train_labels.json
,val.json
, andtest.json
with following command:python tools/dataset_converters/textrecog/lv_converter.py PATH/TO/lv
After running the above codes, the directory structure should be as follows:
├── lv │ ├── Crops │ ├── train_labels.json │ └── test_labels.json
LSVT¶
警告
This section is not fully tested yet.
Step1: Download train_full_images_0.tar.gz, train_full_images_1.tar.gz, and train_full_labels.json to
lsvt/
.mkdir lsvt && cd lsvt # Download LSVT dataset wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_0.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_images_1.tar.gz wget https://dataset-bj.cdn.bcebos.com/lsvt/train_full_labels.json mkdir annotations tar -xf train_full_images_0.tar.gz && tar -xf train_full_images_1.tar.gz mv train_full_labels.json annotations/ && mv train_full_images_1/*.jpg train_full_images_0/ mv train_full_images_0 imgs rm train_full_images_0.tar.gz && rm train_full_images_1.tar.gz && rm -rf train_full_images_1
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:
# Annotations of LSVT test split is not publicly available, split a validation
# set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise
# vertical images will be filtered and stored in PATH/TO/lsvt/ignores
python tools/dataset_converters/textrecog/lsvt_converter.py PATH/TO/lsvt --nproc 4
After running the above codes, the directory structure should be as follows:
├── lsvt │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
IMGUR¶
警告
This section is not fully tested yet.
Step1: Run
download_imgur5k.py
to download images. You can merge PR#5 in your local repository to enable a much faster parallel execution of image download.mkdir imgur && cd imgur git clone https://github.com/facebookresearch/IMGUR5K-Handwriting-Dataset.git # Download images from imgur.com. This may take SEVERAL HOURS! python ./IMGUR5K-Handwriting-Dataset/download_imgur5k.py --dataset_info_dir ./IMGUR5K-Handwriting-Dataset/dataset_info/ --output_dir ./imgs # For annotations mkdir annotations mv ./IMGUR5K-Handwriting-Dataset/dataset_info/*.json annotations rm -rf IMGUR5K-Handwriting-Dataset
Step2: Generate
train_labels.json
,val_label.json
andtest_labels.json
and crop images with the following command:python tools/dataset_converters/textrecog/imgur_converter.py PATH/TO/imgur
After running the above codes, the directory structure should be as follows:
├── imgur │ ├── crops │ ├── train_labels.json │ ├── test_labels.json │ └── val_label.json
KAIST¶
警告
This section is not fully tested yet.
Step1: Download KAIST_all.zip to
kaist/
.mkdir kaist && cd kaist mkdir imgs && mkdir annotations # Download KAIST dataset wget http://www.iapr-tc11.org/dataset/KAIST_SceneText/KAIST_all.zip unzip -q KAIST_all.zip && rm KAIST_all.zip
Step2: Extract zips:
python tools/dataset_converters/common/extract_kaist.py PATH/TO/kaist
Step3: Generate
train_labels.json
andval_label.json
(optional) with following command:# Since KAIST does not provide an official split, you can split the dataset by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/kaist/ignores python tools/dataset_converters/textrecog/kaist_converter.py PATH/TO/kaist --nproc 4
After running the above codes, the directory structure should be as follows:
├── kaist │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
MTWI¶
警告
This section is not fully tested yet.
Step1: Download
mtwi_2018_train.zip
from homepage.mkdir mtwi && cd mtwi unzip -q mtwi_2018_train.zip mv image_train imgs && mv txt_train annotations rm mtwi_2018_train.zip
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:# Annotations of MTWI test split is not publicly available, split a validation # set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/mtwi/ignores python tools/dataset_converters/textrecog/mtwi_converter.py PATH/TO/mtwi --nproc 4
After running the above codes, the directory structure should be as follows:
├── mtwi │ ├── crops │ ├── train_labels.json │ └── val_label.json (optional)
ReCTS¶
警告
This section is not fully tested yet.
Step1: Download ReCTS.zip to
rects/
from the homepage.mkdir rects && cd rects # Download ReCTS dataset # You can also find Google Drive link on the dataset homepage wget https://datasets.cvc.uab.es/rrc/ReCTS.zip --no-check-certificate unzip -q ReCTS.zip mv img imgs && mv gt_unicode annotations rm ReCTS.zip -f && rm -rf gt
Step2: Generate
train_labels.json
andval_label.json
(optional) with the following command:# Annotations of ReCTS test split is not publicly available, split a validation # set by adding --val-ratio 0.2 # Add --preserve-vertical to preserve vertical texts for training, otherwise # vertical images will be filtered and stored in PATH/TO/rects/ignores python tools/dataset_converters/textrecog/rects_converter.py PATH/TO/rects --nproc 4
After running the above codes, the directory structure should be as follows:
├── rects │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
ILST¶
警告
This section is not fully tested yet.
Step1: Download
IIIT-ILST.zip
from onedrive linkStep2: Run the following commands
unzip -q IIIT-ILST.zip && rm IIIT-ILST.zip cd IIIT-ILST # rename files cd Devanagari && for i in `ls`; do mv -f $i `echo "devanagari_"$i`; done && cd .. cd Malayalam && for i in `ls`; do mv -f $i `echo "malayalam_"$i`; done && cd .. cd Telugu && for i in `ls`; do mv -f $i `echo "telugu_"$i`; done && cd .. # transfer image path mkdir imgs && mkdir annotations mv Malayalam/{*jpg,*jpeg} imgs/ && mv Malayalam/*xml annotations/ mv Devanagari/*jpg imgs/ && mv Devanagari/*xml annotations/ mv Telugu/*jpeg imgs/ && mv Telugu/*xml annotations/ # remove unnecessary files rm -rf Devanagari && rm -rf Malayalam && rm -rf Telugu && rm -rf README.txt
Step3: Generate
train_labels.json
andval_label.json
(optional) and crop images using 4 processes with the following command (add--preserve-vertical
if you wish to preserve the images containing vertical texts). Since the original dataset doesn’t have a validation set, you may specify--val-ratio
to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.python tools/dataset_converters/textrecog/ilst_converter.py PATH/TO/IIIT-ILST --nproc 4
After running the above codes, the directory structure should be as follows:
├── IIIT-ILST │ ├── crops │ ├── ignores │ ├── train_labels.json │ └── val_label.json (optional)
VinText¶
警告
This section is not fully tested yet.
Step1: Download vintext.zip to
vintext
mkdir vintext && cd vintext # Download dataset from google drive wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1UUQhNvzgpZy7zXBFQp0Qox-BBjunZ0ml" -O vintext.zip && rm -rf /tmp/cookies.txt # Extract images and annotations unzip -q vintext.zip && rm vintext.zip mv vietnamese/labels ./ && mv vietnamese/test_image ./ && mv vietnamese/train_images ./ && mv vietnamese/unseen_test_images ./ rm -rf vietnamese # Rename files mv labels annotations && mv test_image test && mv train_images training && mv unseen_test_images unseen_test mkdir imgs mv training imgs/ && mv test imgs/ && mv unseen_test imgs/
Step2: Generate
train_labels.json
,test_labels.json
,unseen_test_labels.json
, and crop images using 4 processes with the following command (add--preserve-vertical
if you wish to preserve the images containing vertical texts).python tools/dataset_converters/textrecog/vintext_converter.py PATH/TO/vietnamese --nproc 4
After running the above codes, the directory structure should be as follows:
├── vintext │ ├── crops │ ├── ignores │ ├── train_labels.json │ ├── test_labels.json │ └── unseen_test_labels.json
BID¶
警告
This section is not fully tested yet.
Step1: Download BID Dataset.zip
Step2: Run the following commands to preprocess the dataset
# Rename mv BID\ Dataset.zip BID_Dataset.zip # Unzip and Rename unzip -q BID_Dataset.zip && rm BID_Dataset.zip mv BID\ Dataset BID # The BID dataset has a problem of permission, and you may # add permission for this file chmod -R 777 BID cd BID mkdir imgs && mkdir annotations # For images and annotations mv CNH_Aberta/*in.jpg imgs && mv CNH_Aberta/*txt annotations && rm -rf CNH_Aberta mv CNH_Frente/*in.jpg imgs && mv CNH_Frente/*txt annotations && rm -rf CNH_Frente mv CNH_Verso/*in.jpg imgs && mv CNH_Verso/*txt annotations && rm -rf CNH_Verso mv CPF_Frente/*in.jpg imgs && mv CPF_Frente/*txt annotations && rm -rf CPF_Frente mv CPF_Verso/*in.jpg imgs && mv CPF_Verso/*txt annotations && rm -rf CPF_Verso mv RG_Aberto/*in.jpg imgs && mv RG_Aberto/*txt annotations && rm -rf RG_Aberto mv RG_Frente/*in.jpg imgs && mv RG_Frente/*txt annotations && rm -rf RG_Frente mv RG_Verso/*in.jpg imgs && mv RG_Verso/*txt annotations && rm -rf RG_Verso # Remove unnecessary files rm -rf desktop.ini
Step3: Generate `train_labels.json` and `val_label.json` (optional) and crop images using 4 processes with the following command (add `--preserve-vertical` if you wish to preserve the images containing vertical texts). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
python tools/dataset_converters/textrecog/bid_converter.py PATH/TO/BID --nproc 4
After running the above commands, the directory structure should be as follows:
├── BID
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
RCTW¶
Warning
This section is not fully tested yet.
Step1: Download `train_images.zip.001`, `train_images.zip.002`, and `train_gts.zip` from the homepage, and extract the zips to `rctw/imgs` and `rctw/annotations`, respectively.
Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the original dataset doesn't have a validation set, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
# Annotations of RCTW test split is not publicly available, split a validation set by adding --val-ratio 0.2
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/rctw/ignores
python tools/dataset_converters/textrecog/rctw_converter.py PATH/TO/rctw --nproc 4
After running the above commands, the directory structure should be as follows:
│── rctw
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json (optional)
HierText¶
Warning
This section is not fully tested yet.
Step1 (optional): Install AWS CLI.
Step2: Clone HierText repo to get annotations
mkdir HierText
git clone https://github.com/google-research-datasets/hiertext.git
Step3: Download `train.tgz` and `validation.tgz` from AWS
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/train.tgz .
aws s3 --no-sign-request cp s3://open-images-dataset/ocr/validation.tgz .
Step4: Process raw data
# process annotations
mv hiertext/gt ./
rm -rf hiertext
mv gt annotations
gzip -d annotations/train.json.gz
gzip -d annotations/validation.json.gz
# process images
mkdir imgs
mv train.tgz imgs/
mv validation.tgz imgs/
tar -xzvf imgs/train.tgz
tar -xzvf imgs/validation.tgz
Step5: Generate `train_labels.json` and `val_label.json`. HierText includes different levels of annotation, including `paragraph`, `line`, and `word`. Check the original paper for details. E.g., set `--level paragraph` to get paragraph-level annotation, set `--level line` to get line-level annotation, or set `--level word` to get word-level annotation.
# Collect word annotation from HierText --level word
# Add --preserve-vertical to preserve vertical texts for training, otherwise vertical images will be filtered and stored in PATH/TO/HierText/ignores
python tools/dataset_converters/textrecog/hiertext_converter.py PATH/TO/HierText --level word --nproc 4
After running the above commands, the directory structure should be as follows:
│── HierText
│   ├── crops
│   ├── ignores
│   ├── train_labels.json
│   └── val_label.json
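To get a feel for the three annotation levels before converting, you can inspect the raw HierText annotations directly. The sketch below assumes the official HierText JSON layout (an `annotations` list whose entries contain `paragraphs`, which contain `lines`, which contain `words`); adjust the keys if your copy differs.
import json

# Hypothetical walk over the HierText annotation hierarchy.
with open('annotations/validation.json', 'r', encoding='utf-8') as f:
    gt = json.load(f)

num_paragraphs = num_lines = num_words = 0
for image_ann in gt['annotations']:
    for paragraph in image_ann['paragraphs']:
        num_paragraphs += 1
        for line in paragraph['lines']:
            num_lines += 1
            num_words += len(line['words'])

print(num_paragraphs, num_lines, num_words)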
ArT¶
Warning
This section is not fully tested yet.
Step1: Download `train_task2_images.tar.gz` and `train_task2_labels.json` from the homepage to `art/`
mkdir art && cd art
mkdir annotations
# Download ArT dataset
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_images.tar.gz
wget https://dataset-bj.cdn.bcebos.com/art/train_task2_labels.json
# Extract
tar -xf train_task2_images.tar.gz
mv train_task2_images crops
mv train_task2_labels.json annotations/
# Remove unnecessary files
rm train_task2_images.tar.gz
Step2: Generate `train_labels.json` and `val_label.json` (optional). Since the test annotations are not publicly available, you may specify `--val-ratio` to split the dataset. E.g., if val-ratio is 0.2, then 20% of the data are left out as the validation set in this example.
# Annotations of ArT test split is not publicly available, split a validation set by adding --val-ratio 0.2
python tools/dataset_converters/textrecog/art_converter.py PATH/TO/art
After running the above commands, the directory structure should be as follows:
│── art
│   ├── crops
│   ├── train_labels.json
│   └── val_label.json (optional)
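Once a dataset has been converted as above, it can be referenced from a config through MMOCR's `OCRDataset`. The following is only a minimal sketch assuming the default annotation file names produced by the converters; `data_root` and `pipeline` must be adapted to your project.
# Minimal dataset config sketch for a converted recognition dataset (e.g. ArT).
art_textrecog_train = dict(
    type='OCRDataset',
    data_root='data/art',          # assumed location of the converted data
    ann_file='train_labels.json',  # produced by the converter above
    pipeline=None)                 # filled in by the training config

art_textrecog_val = dict(
    type='OCRDataset',
    data_root='data/art',
    ann_file='val_label.json',     # only exists if --val-ratio was used
    test_mode=True,
    pipeline=None)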
Key Information Extraction¶
Note
We are working on adding more datasets to the Dataset Preparer. For datasets that are not yet fully supported by the Dataset Preparer, this page provides a series of manual download steps for users who need them.
Overview¶
The dataset directory for the key information extraction task should be organized as follows:
└── wildreceipt
├── class_list.txt
├── dict.txt
├── image_files
├── test.txt
└── train.txt
Preparation Steps¶
WildReceipt¶
Download and extract wildreceipt.tar
WildReceiptOpenset¶
Prepare WildReceipt.
Convert WildReceipt to the OpenSet format:
# You can run the following command to see more available arguments:
# python tools/data/kie/closeset_to_openset.py -h
python tools/data/kie/closeset_to_openset.py data/wildreceipt/train.txt data/wildreceipt/openset_train.txt
python tools/data/kie/closeset_to_openset.py data/wildreceipt/test.txt data/wildreceipt/openset_test.txt
Note
This tutorial describes the differences between the CloseSet and OpenSet data formats in more detail.
Overview¶
Weights¶
The following is the list of weights available for inference.
For convenience, a weight may have several shorter aliases, which are separated by "/" in the table.
For example, `DB_r18 / dbnet_resnet18_fpnc_1200e_icdar2015` shown in the table means that you can use either `DB_r18` or `dbnet_resnet18_fpnc_1200e_icdar2015` to initialize the inferencer:
>>> from mmocr.apis import TextDetInferencer
>>> inferencer = TextDetInferencer(model='DB_r18')
>>> # Equivalent to
>>> inferencer = TextDetInferencer(model='dbnet_resnet18_fpnc_1200e_icdar2015')
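Once initialized, running the inferencer on an image is a single call. A brief sketch follows; the demo image path is just a placeholder, and the exact fields of the output may vary slightly between versions:
>>> result = inferencer('demo/demo_text_det.jpg', show=False)
>>> # each prediction holds the detected polygons and their scores
>>> print(result['predictions'])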
Text Detection¶
| Model | README | ICDAR2015 (hmean-iou) | CTW1500 (hmean-iou) | Totaltext (hmean-iou) |
| --- | --- | --- | --- | --- |
| | | 0.8169 | - | - |
| | | 0.8504 | - | - |
| | | 0.8543 | - | - |
| | | 0.8644 | - | - |
| | | - | - | 0.8182 |
| | | 0.8622 | - | - |
| | | 0.8684 | - | - |
| | | 0.8882 | - | - |
| | | - | 0.7458 | - |
| | | - | 0.7562 | - |
| | | 0.8182 | - | - |
| | | 0.8513 | - | - |
| | | - | 0.8467 | - |
| | | - | 0.8488 | - |
| | | - | 0.8192 | - |
| | | 0.8528 | - | - |
| | | 0.8604 | - | - |
| | | - | - | 0.8134 |
| | | - | 0.777 | - |
| | | 0.7848 | - | - |
| | | - | 0.7793 | - |
| | | - | 0.8037 | - |
| | | 0.7998 | - | - |
| | | 0.8478 | - | - |
| | | - | 0.8286 | - |
| | | - | 0.8529 | - |
Text Recognition¶
Note
Avg is the average result of the model on IIIT5K, SVT, ICDAR2013, ICDAR2015, SVTP, and CT80.
| Model | README | Avg (word_acc) | IIIT5K (word_acc) | SVT (word_acc) | ICDAR2013 (word_acc) | ICDAR2015 (word_acc) | SVTP (word_acc) | CT80 (word_acc) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | 0.88 | 0.95 | 0.91 | 0.94 | 0.79 | 0.84 | 0.84 |
| | | 0.91 | 0.96 | 0.94 | 0.95 | 0.81 | 0.89 | 0.88 |
| | | 0.86 | 0.94 | 0.89 | 0.93 | 0.77 | 0.81 | 0.85 |
| | | 0.70 | 0.81 | 0.81 | 0.87 | 0.56 | 0.61 | 0.57 |
| | | 0.96 | 0.98 | 0.98 | 0.98 | 0.90 | 0.94 | 0.99 |
| | | 0.88 | 0.95 | 0.90 | 0.95 | 0.76 | 0.85 | 0.89 |
| | | 0.83 | 0.92 | 0.88 | 0.94 | 0.72 | 0.78 | 0.75 |
| | | 0.87 | 0.95 | 0.88 | 0.95 | 0.76 | 0.80 | 0.89 |
| | | 0.87 | 0.95 | 0.90 | 0.94 | 0.74 | 0.80 | 0.89 |
| | | 0.86 | 0.86 | 0.90 | 0.94 | 0.75 | 0.85 | 0.89 |
| | | 0.87 | 0.86 | 0.92 | 0.94 | 0.74 | 0.84 | 0.90 |
| | | 0.87 | 0.95 | 0.89 | 0.93 | 0.76 | 0.81 | 0.87 |
| | | 0.88 | 0.95 | 0.88 | 0.94 | 0.76 | 0.83 | 0.90 |
| | | 0.87 | 0.96 | 0.87 | 0.94 | 0.77 | 0.81 | 0.89 |
| | | 0.90 | 0.96 | 0.92 | 0.96 | 0.80 | 0.88 | 0.90 |
| | | 0.88 | 0.94 | 0.90 | 0.96 | 0.79 | 0.86 | 0.85 |
Statistics¶
Number of checkpoints: 55
Number of configs: 49
Number of papers: 20
ALGORITHM: 20
Text Detection Models¶
Number of checkpoints: 29
Number of configs: 29
Number of papers: 8
[ALGORITHM] Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection
[ALGORITHM] Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network
[ALGORITHM] Fourier Contour Embedding for Arbitrary-Shaped Text Detection
[ALGORITHM] Mask R-CNN
[ALGORITHM] Real-Time Scene Text Detection With Differentiable Binarization and Adaptive Scale Fusion
[ALGORITHM] Real-Time Scene Text Detection With Differentiable Binarization
[ALGORITHM] Shape Robust Text Detection With Progressive Scale Expansion Network
[ALGORITHM] Textsnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Text Recognition Models¶
Number of checkpoints: 23
Number of configs: 17
Number of papers: 10
[ALGORITHM] Aster: An Attentional Scene Text Recognizer With Flexible Rectification
[ALGORITHM] Master: Multi-Aspect Non-Local Network for Scene Text Recognition
[ALGORITHM] Nrtr: A No-Recurrence Sequence-to-Sequence Model for Scene Text Recognition
[ALGORITHM] On Recognizing Texts of Arbitrary Shapes With 2d Self-Attention
[ALGORITHM] Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition
[ALGORITHM] Revisiting Scene Text Recognition: A Data Perspective
[ALGORITHM] Robustscanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
[ALGORITHM] Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition
[ALGORITHM] Svtr: Scene Text Recognition With a Single Visual Model
Cutting-Edge Models¶
Here are some cutting-edge models that have been reimplemented but are not yet included in the MMOCR package.
ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network¶
This is an implementation of ABCNet based on MMOCR, MMCV, and MMEngine.
ABCNet is a conceptually novel, efficient, and fully convolutional framework for text spotting, which address the problem by proposing the Adaptive Bezier-Curve Network (ABCNet). Our contributions are three-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance with arbitrary shapes, significantly improving the precision compared with previous methods. 3) Compared with standard bounding box detection, our Bezier curve detection introduces negligible computation overhead, resulting in superiority of our method in both efficiency and accuracy. Experiments on arbitrarily-shaped benchmark datasets, namely Total-Text and CTW1500, demonstrate that ABCNet achieves state-of-the-art accuracy, meanwhile significantly improving the speed. In particular, on Total-Text, our realtime version is over 10 times faster than recent state-of-the-art methods with a competitive recognition accuracy.

ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting¶
This is an implementation of ABCNetV2 based on MMOCR, MMCV, and MMEngine.
ABCNetV2 contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text by a parameterized Bezier curve, which, compared with segmentation-based methods, can not only provide structured output but also controllable representation. 2) We design a novel BezierAlign layer for extracting accurate convolution features of a text instance of arbitrary shapes, significantly improving the precision of recognition over previous methods. 3) Different from previous methods, which often suffer from complex post-processing and sensitive hyper-parameters, our ABCNet v2 maintains a simple pipeline with the only post-processing non-maximum suppression (NMS). 4) As the performance of text recognition closely depends on feature alignment, ABCNet v2 further adopts a simple yet effective coordinate convolution to encode the position of the convolutional filters, which leads to a considerable improvement with negligible computation overhead. Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 can achieve state-of-the-art performance while maintaining very high efficiency.

SPTS: Single-Point Text Spotting¶
This is an implementation of SPTS based on MMOCR, MMCV, and MMEngine.
Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single-point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, which can achieve state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the positions of the point annotation, meaning that it can be much easier to be annotated or even be automatically generated than the bounding box that requires precise positions. We believe that such a pioneer attempt indicates a significant opportunity for scene text spotting applications of a much larger scale than previously possible.

Backbones¶
oCLIP¶
Abstract¶
Recently, Vision-Language Pre-training (VLP) techniques have greatly benefited various vision-language tasks by jointly learning visual and textual representations, which intuitively helps in Optical Character Recognition (OCR) tasks due to the rich visual and textual information in scene text images. However, these methods cannot well cope with OCR tasks because of the difficulty in both instance-level text encoding and image-text pair acquisition (i.e. images and captured texts in them). This paper presents a weakly supervised pre-training method, oCLIP, which can acquire effective scene text representations by jointly learning and aligning visual and textual information. Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features, respectively, as well as a visual-textual decoder that models the interaction among textual and visual features for learning effective scene text representations. With the learning of textual features, the pre-trained model can attend texts in images well with character awareness. Besides, these designs enable the learning from weakly annotated texts (i.e. partial texts in images without text bounding boxes) which mitigates the data annotation constraint greatly. Experiments over the weakly annotated images in ICDAR2019-LSVT show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks, respectively. In addition, the proposed method outperforms existing pre-training techniques consistently across multiple public datasets (e.g., +3.2% and +1.3% for Total-Text and CTW1500).

Models¶
Backbone | Pre-train Data | Model |
---|---|---|
ResNet-50 | SynthText | Link |
Note
The model is converted from the official oCLIP.
Supported Text Detection Models¶
| | DBNet | DBNet++ | FCENet | TextSnake | PSENet | DRRG | Mask R-CNN |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ICDAR2015 | ✓ | ✓ | ✓ | | ✓ | | ✓ |
| CTW1500 | | | ✓ | ✓ | ✓ | ✓ | ✓ |
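To fine-tune one of the detectors above from the oCLIP-pretrained backbone, the usual OpenMMLab pattern is to point the backbone's `init_cfg` at the released checkpoint. Below is only a rough sketch with a placeholder checkpoint path; the actual backbone type and keys come from the detector's config in your MMOCR checkout.
# Sketch: load oCLIP-pretrained backbone weights into a detector config (placeholder path).
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='PATH/TO/oclip_resnet50.pth')))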
Citation¶
@article{xue2022language,
title={Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting},
author={Xue, Chuhui and Zhang, Wenqing and Hao, Yu and Lu, Shijian and Torr, Philip and Bai, Song},
journal={Proceedings of the European Conference on Computer Vision (ECCV)},
year={2022}
}
Text Detection Models¶
DBNet¶
Real-time Scene Text Detection with Differentiable Binarization
Abstract¶
Recently, segmentation-based methods are quite popular in scene text detection, as the segmentation results can more accurately describe scene text of various shapes such as curve text. However, the post-processing of binarization is essential for segmentation-based detection, which converts probability maps produced by a segmentation method into bounding boxes/regions of text. In this paper, we propose a module named Differentiable Binarization (DB), which can perform the binarization process in a segmentation network. Optimized along with a DB module, a segmentation network can adaptively set the thresholds for binarization, which not only simplifies the post-processing but also enhances the performance of text detection. Based on a simple segmentation network, we validate the performance improvements of DB on five benchmark datasets, which consistently achieves state-of-the-art results, in terms of both detection accuracy and speed. In particular, with a light-weight backbone, the performance improvements by DB are significant so that we can look for an ideal tradeoff between detection accuracy and efficiency. Specifically, with a backbone of ResNet-18, our detector achieves an F-measure of 82.8, running at 62 FPS, on the MSRA-TD500 dataset.

Results and models¶
SynthText¶
Method | Backbone | Training set | ##iters | Download |
---|---|---|---|---|
DBNet_r18 | ResNet18 | SynthText | 100,000 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DBNet_r18 | ResNet18 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 736 | 0.8853 | 0.7583 | 0.8169 | model | log |
DBNet_r50 | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.8744 | 0.8276 | 0.8504 | model | log |
DBNet_r50dcn | ResNet50-DCN | Synthtext | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.8784 | 0.8315 | 0.8543 | model | log |
DBNet_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9052 | 0.8272 | 0.8644 | model | log |
Citation¶
@article{Liao_Wan_Yao_Chen_Bai_2020,
title={Real-Time Scene Text Detection with Differentiable Binarization},
journal={Proceedings of the AAAI Conference on Artificial Intelligence},
author={Liao, Minghui and Wan, Zhaoyi and Yao, Cong and Chen, Kai and Bai, Xiang},
year={2020},
pages={11474-11481}}
DBNetpp¶
Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion
Abstract¶
Recently, segmentation-based scene text detection methods have drawn extensive attention in the scene text detection field, because of their superiority in detecting the text instances of arbitrary shapes and extreme aspect ratios, profiting from the pixel-level descriptions. However, the vast majority of the existing segmentation-based approaches are limited to their complex post-processing algorithms and the scale robustness of their segmentation models, where the post-processing algorithms are not only isolated to the model optimization but also time-consuming and the scale robustness is usually strengthened by fusing multi-scale feature maps directly. In this paper, we propose a Differentiable Binarization (DB) module that integrates the binarization process, one of the most important steps in the post-processing procedure, into a segmentation network. Optimized along with the proposed DB module, the segmentation network can produce more accurate results, which enhances the accuracy of text detection with a simple pipeline. Furthermore, an efficient Adaptive Scale Fusion (ASF) module is proposed to improve the scale robustness by fusing features of different scales adaptively. By incorporating the proposed DB and ASF with the segmentation network, our proposed scene text detector consistently achieves state-of-the-art results, in terms of both detection accuracy and speed, on five standard benchmarks.

Results and models¶
SynthText¶
Method | BackBone | Training set | ##iters | Download |
---|---|---|---|---|
DBNetpp_r50dcn | ResNet50-dcnv2 | SynthText | 100,000 | model | log |
ICDAR2015¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DBNetpp_r50 | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9079 | 0.8209 | 0.8622 | model | log |
DBNetpp_r50dcn | ResNet50-dcnv2 | Synthtext (model) | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9116 | 0.8291 | 0.8684 | model | log |
DBNetpp_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 1200 | 1024 | 0.9174 | 0.8609 | 0.8882 | model | log |
Citation¶
@article{liao2022real,
title={Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion},
author={Liao, Minghui and Zou, Zhisheng and Wan, Zhaoyi and Yao, Cong and Bai, Xiang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2022},
publisher={IEEE}
}
DRRG¶
Deep relational reasoning graph network for arbitrary shape text detection
Abstract¶
Arbitrary shape text detection is a challenging task due to the high variety and complexity of scenes texts. In this paper, we propose a novel unified relational reasoning graph network for arbitrary shape text detection. In our method, an innovative local graph bridges a text proposal model via Convolutional Neural Network (CNN) and a deep relational reasoning network via Graph Convolutional Network (GCN), making our network end-to-end trainable. To be concrete, every text instance will be divided into a series of small rectangular components, and the geometry attributes (e.g., height, width, and orientation) of the small components will be estimated by our text proposal model. Given the geometry attributes, the local graph construction model can roughly establish linkages between different text components. For further reasoning and deducing the likelihood of linkages between the component and its neighbors, we adopt a graph-based network to perform deep relational reasoning on local graphs. Experiments on public available datasets demonstrate the state-of-the-art performance of our method.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
DRRG | ResNet50 | - | CTW1500 Train | CTW1500 Test | 1200 | 640 | 0.8775 | 0.8179 | 0.8467 | model \ log |
DRRG_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1200 | model \ log |
Citation¶
@article{zhang2020drrg,
title={Deep relational reasoning graph network for arbitrary shape text detection},
author={Zhang, Shi-Xue and Zhu, Xiaobin and Hou, Jie-Bo and Liu, Chang and Yang, Chun and Wang, Hongfa and Yin, Xu-Cheng},
booktitle={CVPR},
pages={9699-9708},
year={2020}
}
FCENet¶
Fourier Contour Embedding for Arbitrary-Shaped Text Detection
Abstract¶
One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse text geometry variances. Most of existing methods model text instances in image spatial domain via masks or contour point sequences in the Cartesian or the polar coordinate system. However, the mask representation might lead to expensive post-processing, while the point sequence one may have limited capability to model texts with highly-curved shapes. To tackle these problems, we model text instances in the Fourier domain and propose one novel Fourier Contour Embedding (FCE) method to represent arbitrary shaped text contours as compact signatures. We further construct FCENet with a backbone, feature pyramid networks (FPN) and a simple post-processing with the Inverse Fourier Transformation (IFT) and Non-Maximum Suppression (NMS). Different from previous methods, FCENet first predicts compact Fourier signatures of text instances, and then reconstructs text contours via IFT and NMS during test. Extensive experiments demonstrate that FCE is accurate and robust to fit contours of scene texts even with highly-curved shapes, and also validate the effectiveness and the good generalization of FCENet for arbitrary-shaped text detection. Furthermore, experimental results show that our FCENet is superior to the state-of-the-art (SOTA) methods on CTW1500 and Total-Text, especially on challenging highly-curved text subset.

Results and models¶
CTW1500¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50dcn | ResNet50 + DCNv2 | - | CTW1500 Train | CTW1500 Test | 1500 | (736, 1080) | 0.8689 | 0.8296 | 0.8488 | model | log |
FCENet_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1500 | (736, 1080) | 0.8383 | 0.801 | 0.8192 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50 | ResNet50 | - | IC15 Train | IC15 Test | 1500 | (2260, 2260) | 0.8243 | 0.8834 | 0.8528 | model | log |
FCENet_r50-oclip | ResNet50-oCLIP | - | IC15 Train | IC15 Test | 1500 | (2260, 2260) | 0.9176 | 0.8098 | 0.8604 | model | log |
Total Text¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
FCENet_r50 | ResNet50 | - | Totaltext Train | Totaltext Test | 1500 | (1280, 960) | 0.8485 | 0.7810 | 0.8134 | model | log |
Citation¶
@InProceedings{zhu2021fourier,
title={Fourier Contour Embedding for Arbitrary-Shaped Text Detection},
author={Yiqin Zhu and Jianyong Chen and Lingyu Liang and Zhanghui Kuang and Lianwen Jin and Wayne Zhang},
year={2021},
booktitle = {CVPR}
}
Mask R-CNN¶
Abstract¶
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
MaskRCNN | - | - | CTW1500 Train | CTW1500 Test | 160 | 1600 | 0.7165 | 0.7776 | 0.7458 | model | log |
MaskRCNN_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 160 | 1600 | 0.753 | 0.7593 | 0.7562 | model | log |
ICDAR2015¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
MaskRCNN | ResNet50 | - | ICDAR2015 Train | ICDAR2015 Test | 160 | 1920 | 0.8644 | 0.7766 | 0.8182 | model | log |
MaskRCNN_r50-oclip | ResNet50-oCLIP | - | ICDAR2015 Train | ICDAR2015 Test | 160 | 1920 | 0.8695 | 0.8339 | 0.8513 | model | log |
Citation¶
@INPROCEEDINGS{8237584,
author={K. {He} and G. {Gkioxari} and P. {Dollár} and R. {Girshick}},
booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
title={Mask R-CNN},
year={2017},
pages={2980-2988},
doi={10.1109/ICCV.2017.322}}
PANet¶
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network
Abstract¶
Scene text detection, an important step of scene text reading systems, has witnessed rapid development with convolutional neural networks. Nonetheless, two main challenges still exist and hamper its deployment to real-world applications. The first problem is the trade-off between speed and accuracy. The second one is to model the arbitrary-shaped text instance. Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing. More specifically, the segmentation head is made up of Feature Pyramid Enhancement Module (FPEM) and Feature Fusion Module (FFM). FPEM is a cascadable U-shaped module, which can introduce multi-level information to guide the better segmentation. FFM can gather the features given by the FPEMs of different depths into a final feature for segmentation. The learnable post-processing is implemented by Pixel Aggregation (PA), which can precisely aggregate text pixels by predicted similarity vectors. Experiments on several standard benchmarks validate the superiority of the proposed PAN. It is worth noting that our method can achieve a competitive F-measure of 79.9% at 84.2 FPS on CTW1500.

Results and models¶
Citation¶
@inproceedings{WangXSZWLYS19,
author={Wenhai Wang and Enze Xie and Xiaoge Song and Yuhang Zang and Wenjia Wang and Tong Lu and Gang Yu and Chunhua Shen},
title={Efficient and Accurate Arbitrary-Shaped Text Detection With Pixel Aggregation Network},
booktitle={ICCV},
pages={8439--8448},
year={2019}
}
PSENet¶
Shape robust text detection with progressive scale expansion network
Abstract¶
Scene text detection has witnessed rapid progress especially with the recent development of convolutional neural networks. However, there still exists two challenges which prevent the algorithm into industry applications. On the one hand, most of the state-of-art algorithms require quadrangle bounding box which is in-accurate to locate the texts with arbitrary shape. On the other hand, two text instances which are close to each other may lead to a false detection which covers both instances. Traditionally, the segmentation-based approach can relieve the first problem but usually fail to solve the second challenge. To address these two challenges, in this paper, we propose a novel Progressive Scale Expansion Network (PSENet), which can precisely detect text instances with arbitrary shapes. More specifically, PSENet generates the different scale of kernels for each text instance, and gradually expands the minimal scale kernel to the text instance with the complete shape. Due to the fact that there are large geometrical margins among the minimal scale kernels, our method is effective to split the close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances. Extensive experiments on CTW1500, Total-Text, ICDAR 2015 and ICDAR 2017 MLT validate the effectiveness of PSENet. Notably, on CTW1500, a dataset full of long curve texts, PSENet achieves a F-measure of 74.3% at 27 FPS, and our best F-measure (82.2%) outperforms state-of-art algorithms by 6.6%. The code will be released in the future.

Results and models¶
CTW1500¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
PSENet | ResNet50 | - | CTW1500 Train | CTW1500 Test | 600 | 1280 | 0.7705 | 0.7883 | 0.7793 | model | log |
PSENet_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 600 | 1280 | 0.8483 | 0.7636 | 0.8037 | model | log |
ICDAR2015¶
Method | Backbone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
PSENet | ResNet50 | - | IC15 Train | IC15 Test | 600 | 2240 | 0.8396 | 0.7636 | 0.7998 | model | log |
PSENet_r50-oclip | ResNet50-oCLIP | - | IC15 Train | IC15 Test | 600 | 2240 | 0.8895 | 0.8098 | 0.8478 | model | log |
Citation¶
@inproceedings{wang2019shape,
title={Shape robust text detection with progressive scale expansion network},
author={Wang, Wenhai and Xie, Enze and Li, Xiang and Hou, Wenbo and Lu, Tong and Yu, Gang and Shao, Shuai},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={9336--9345},
year={2019}
}
Textsnake¶
TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes
Abstract¶
Driven by deep neural networks and large scale datasets, scene text detection methods have progressed substantially over the past years, continuously refreshing the performance records on various standard benchmarks. However, limited by the representations (axis-aligned rectangles, rotated rectangles or quadrangles) adopted to describe text, existing methods may fall short when dealing with much more free-form text instances, such as curved text, which are actually very common in real-world scenarios. To tackle this problem, we propose a more flexible representation for scene text, termed as TextSnake, which is able to effectively represent text instances in horizontal, oriented and curved forms. In TextSnake, a text instance is described as a sequence of ordered, overlapping disks centered at symmetric axes, each of which is associated with potentially variable radius and orientation. Such geometry attributes are estimated via a Fully Convolutional Network (FCN) model. In experiments, the text detector based on TextSnake achieves state-of-the-art or comparable performance on Total-Text and SCUT-CTW1500, the two newly published benchmarks with special emphasis on curved text in natural images, as well as the widely-used datasets ICDAR 2015 and MSRA-TD500. Specifically, TextSnake outperforms the baseline on Total-Text by more than 40% in F-measure.

Results and models¶
CTW1500¶
Method | BackBone | Pretrained Model | Training set | Test set | ##epochs | Test size | Precision | Recall | Hmean | Download |
---|---|---|---|---|---|---|---|---|---|---|
TextSnake | ResNet50 | - | CTW1500 Train | CTW1500 Test | 1200 | 736 | 0.8535 | 0.8052 | 0.8286 | model | log |
TextSnake_r50-oclip | ResNet50-oCLIP | - | CTW1500 Train | CTW1500 Test | 1200 | 736 | 0.8869 | 0.8215 | 0.8529 | model | log |
Citation¶
@article{long2018textsnake,
title={TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes},
author={Long, Shangbang and Ruan, Jiaqiang and Zhang, Wenjie and He, Xin and Wu, Wenhao and Yao, Cong},
booktitle={ECCV},
pages={20-36},
year={2018}
}
Text Recognition Models¶
ABINet¶
Abstract¶
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicitly language modeling; 2) unidirectional feature representation; and 3) language model with noise input. Correspondingly, we propose an autonomous, bidirectional and iterative ABINet for scene text recognition. Firstly, the autonomous suggests to block gradient flow between vision and language models to enforce explicitly language modeling. Secondly, a novel bidirectional cloze network (BCN) as the language model is proposed based on bidirectional feature representation. Thirdly, we propose an execution manner of iterative correction for language model which can effectively alleviate the impact of noise input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method which can learn from unlabeled images effectively. Extensive experiments indicate that ABINet has superiority on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Besides, the ABINet trained with ensemble self-training shows promising improvement in realizing human-level recognition.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
SynthText | 7239272 | 1 | alphanumeric |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
methods | pretrained | Regular Text | Irregular Text | download | ||||
---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||
ABINet-Vision | - | 0.9523 | 0.9196 | 0.9369 | 0.7896 | 0.8403 | 0.8437 | model | log |
ABINet-Vision-TTA | - | 0.9523 | 0.9196 | 0.9360 | 0.8175 | 0.8450 | 0.8542 | |
ABINet | Pretrained | 0.9603 | 0.9397 | 0.9557 | 0.8146 | 0.8868 | 0.8785 | model | log |
ABINet-TTA | Pretrained | 0.9597 | 0.9397 | 0.9527 | 0.8426 | 0.8930 | 0.8854 |
Note
ABINet allows its encoder to run and be trained without decoder and fuser. Its encoder is designed to recognize texts as a stand-alone model and therefore can work as an independent text recognizer. We release it as ABINet-Vision.
Facts about the pretrained model: MMOCR does not have a systematic pipeline to pretrain the language model (LM) yet, thus the weights of LM are converted from the official pretrained model. The weights of ABINet-Vision are directly used as the vision model of ABINet.
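For inference-only use of these weights, a recognition inferencer can be built directly from them. A sketch, assuming an `ABINet` alias is registered in your MMOCR version (otherwise pass the full config name listed in the weights table):
>>> from mmocr.apis import TextRecInferencer
>>> inferencer = TextRecInferencer(model='ABINet')
>>> inferencer('demo/demo_text_recog.jpg', print_result=True)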
We also provide ABINet trained on Union14M
Evaluated on six common benchmarks
methods | pretrained | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
ABINet-Vision | - | 0.9730 | 0.9645 | 0.9552 | 0.8536 | 0.8977 | 0.9479 | model |
Evaluated on Union14M-Benchmark
Methods | Unsolved Challenges | Additional Challenges | General | download | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | |||
ABINet-Vision | 0.750 | 0.615 | 0.653 | 0.711 | 0.729 | 0.591 | 0.026 | 0.794 | model |
Citation¶
@article{fang2021read,
title={Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition},
author={Fang, Shancheng and Xie, Hongtao and Wang, Yuxin and Mao, Zhendong and Zhang, Yongdong},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
year={2021}
}
ASTER¶
ASTER: An Attentional Scene Text Recognizer with Flexible Rectification
Abstract¶
A challenging aspect of scene text recognition is to handle text with distortions or irregular layout. In particular, perspective text and curved text are common in natural scenes and are difficult to recognize. In this work, we introduce ASTER, an end-to-end neural network model that comprises a rectification network and a recognition network. The rectification network adaptively transforms an input image into a new one, rectifying the text in it. It is powered by a flexible Thin-Plate Spline transformation which handles a variety of text irregularities and is trained without human annotations. The recognition network is an attentional sequence-to-sequence model that predicts a character sequence directly from the rectified image. The whole model is trained end to end, requiring only images and their groundtruth text. Through extensive experiments, we verify the effectiveness of the rectification and demonstrate the state-of-the-art recognition performance of ASTER. Furthermore, we demonstrate that ASTER is a powerful component in end-to-end recognition systems, for its ability to enhance the detector.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
SynthText | 7239272 | 1 | alphanumeric |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
ASTER | ResNet45 | 0.9357 | 0.8949 | 0.9281 | 0.7665 | 0.8062 | 0.8507 | model | log | |
ASTER-TTA | ResNet45 | 0.9337 | 0.8949 | 0.9251 | 0.7925 | 0.8109 | 0.8507 |
We also provide ASTER trained on Union14M
Evaluated on six common benchmarks
Methods | pretrained | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
ASTER | - | 0.9437 | 0.8903 | 0.9360 | 0.7857 | 0.8093 | 0.9097 | model |
Evaluated on Union14M-Benchmark
Methods | Unsolved Challenges | Additional Challenges | General | download | ||||||
---|---|---|---|---|---|---|---|---|---|---|
Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | |||
ASTER | 0.384 | 0.130 | 0.418 | 0.529 | 0.319 | 0.498 | 0.013 | 0.667 | model |
Citation¶
@article{shi2018aster,
title={Aster: An attentional scene text recognizer with flexible rectification},
author={Shi, Baoguang and Yang, Mingkun and Wang, Xinggang and Lyu, Pengyuan and Yao, Cong and Bai, Xiang},
journal={IEEE transactions on pattern analysis and machine intelligence},
volume={41},
number={9},
pages={2035--2048},
year={2018},
publisher={IEEE}
}
CRNN¶
Abstract¶
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences in arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performances in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies the generality of it.

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | note |
---|---|---|---|
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | note |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and models¶
methods | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|
methods | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||
CRNN | 0.8053 | 0.7991 | 0.8739 | 0.5571 | 0.6093 | 0.5694 | model | log | |
CRNN-TTA | 0.8013 | 0.7975 | 0.8631 | 0.5763 | 0.6093 | 0.5764 | model | log |
Citation¶
@article{shi2016end,
title={An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition},
author={Shi, Baoguang and Bai, Xiang and Yao, Cong},
journal={IEEE transactions on pattern analysis and machine intelligence},
year={2016}
}
MAERec¶
Revisiting Scene Text Recognition: A Data Perspective
Abstract¶
This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation, whereby only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, however, we argue that this is primarily due to the less challenging nature of the common benchmarks, thus concealing the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models can only achieve an average accuracy of 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR model in real-world scenarios and leads to state-of-the-art performance.
Dataset¶
Test Dataset¶
On six common benchmarks
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
On Union14M-Benchmark
testset | instance_num | type |
---|---|---|
Artistic | 900 | Unsolved Challenge |
Curve | 2426 | Unsolved Challenge |
Multi-Oriented | 1369 | Unsolved Challenge |
Contextless | 779 | Additional Challenge |
Multi-Words | 829 | Additional Challenge |
Salient | 1585 | Additional Challenge |
Incomplete | 1495 | Additional Challenge |
General | 400,000 | - |
Results and Models¶
Evaluated on six common benchmarks
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
MAERec-S | ViT-Small (Pretrained on Union14M-U) | 98.0 | 97.6 | 96.8 | 87.1 | 93.2 | 97.9 | model | |
MAERec-B | ViT-Base (Pretrained on Union14M-U) | 98.5 | 98.1 | 97.8 | 89.5 | 94.4 | 98.6 | model |
Evaluated on Union14M-Benchmark
Methods | Backbone | Unsolved Challenges | Additional Challenges | General | download | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | ||||
MAERec-S | ViT-Small (Pretrained on Union14M-U) | 81.4 | 71.4 | 72.0 | 82.0 | 78.5 | 82.4 | 2.7 | 82.5 | model | |
MAERec-B | ViT-Base (Pretrained on Union14M-U) | 88.8 | 83.9 | 80.0 | 85.5 | 84.9 | 87.5 | 2.6 | 85.8 | model |
To train with MAERec, you need to download pretrained ViT weight and load it in the config file. Check here for instructions
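The exact keys depend on the MAERec config in your MMOCR checkout, but the general pattern is again an `init_cfg` that points at the downloaded ViT weight; a rough, hypothetical sketch:
# Sketch: point the MAERec backbone at the downloaded pretrained ViT weight (placeholder path).
model = dict(
    backbone=dict(
        init_cfg=dict(
            type='Pretrained',
            checkpoint='PATH/TO/vit_pretrained_on_union14m_u.pth')))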
Citation¶
@misc{jiang2023revisiting,
title={Revisiting Scene Text Recognition: A Data Perspective},
author={Qing Jiang and Jiapeng Wang and Dezhi Peng and Chongyu Liu and Lianwen Jin},
year={2023},
eprint={2307.08723},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
MASTER¶
MASTER: Multi-aspect non-local network for scene text recognition
Abstract¶
Attention-based scene text recognizers have gained huge success, which leverages a more compact intermediate representation to learn 1d- or 2d- attention by a RNN-based encoder-decoder architecture. However, such methods suffer from attention-drift problem because high similarity among encoded features leads to attention confusion under the RNN-based local attention mechanism. Moreover, RNN-based methods have low efficiency due to poor parallelization. To overcome these problems, we propose the MASTER, a self-attention based scene text recognizer that (1) not only encodes the input-output attention but also learns self-attention which encodes feature-feature and target-target relationships inside the encoder and decoder and (2) learns a more powerful and robust intermediate representation to spatial distortion, and (3) owns a great training efficiency because of high training parallelization and a high-speed inference because of an efficient memory-cache mechanism. Extensive experiments on various benchmarks demonstrate the superior performance of our MASTER on both regular and irregular scene text.
Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
SynthAdd | 1216889 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
MASTER | R31-GCAModule | 0.9490 | 0.8887 | 0.9517 | 0.7650 | 0.8465 | 0.8889 | model | log | |
MASTER-TTA | R31-GCAModule | 0.9450 | 0.8887 | 0.9478 | 0.7906 | 0.8481 | 0.8958 |
Citation¶
@article{Lu2021MASTER,
title={MASTER: Multi-Aspect Non-local Network for Scene Text Recognition},
author={Ning Lu and Wenwen Yu and Xianbiao Qi and Yihao Chen and Ping Gong and Rong Xiao and Xiang Bai},
journal={Pattern Recognition},
year={2021}
}
NRTR¶
NRTR: A No-Recurrence Sequence-to-Sequence Model For Scene Text Recognition
Abstract¶
Scene text recognition has attracted a great many researches due to its importance to various applications. Existing methods mainly adopt recurrence or convolution based networks. Though have obtained good performance, these methods still suffer from two limitations: slow training speed due to the internal recurrence of RNNs, and high complexity due to stacked convolutional layers for long-term feature extraction. This paper, for the first time, proposes a no-recurrence sequence-to-sequence text recognizer, named NRTR, that dispenses with recurrences and convolutions entirely. NRTR follows the encoder-decoder paradigm, where the encoder uses stacked self-attention to extract image features, and the decoder applies stacked self-attention to recognize texts based on encoder output. NRTR relies solely on self-attention mechanism thus could be trained with more parallelization and less complexity. Considering scene image has large variation in text and background, we further design a modality-transform block to effectively transform 2D input images to 1D sequences, combined with the encoder to extract more discriminative features. NRTR achieves state-of-the-art or highly competitive performance on both regular and irregular benchmarks, while requires only a small fraction of training time compared to the best model from the literature (at least 8 times faster).

Dataset¶
Train Dataset¶
trainset | instance_num | repeat_num | source |
---|---|---|---|
SynthText | 7266686 | 1 | synth |
Syn90k | 8919273 | 1 | synth |
Test Dataset¶
testset | instance_num | type |
---|---|---|
IIIT5K | 3000 | regular |
SVT | 647 | regular |
IC13 | 1015 | regular |
IC15 | 2077 | irregular |
SVTP | 645 | irregular |
CT80 | 288 | irregular |
Results and Models¶
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
NRTR | NRTRModalityTransform | 0.9147 | 0.8841 | 0.9369 | 0.7246 | 0.7783 | 0.7500 | model | log | |
NRTR-TTA | NRTRModalityTransform | 0.9123 | 0.8825 | 0.9310 | 0.7492 | 0.7798 | 0.7535 | ||
NRTR | R31-1/8-1/4 | 0.9483 | 0.8918 | 0.9507 | 0.7578 | 0.8016 | 0.8889 | model | log | |
NRTR-TTA | R31-1/8-1/4 | 0.9443 | 0.8903 | 0.9478 | 0.7790 | 0.8078 | 0.8854 | ||
NRTR | R31-1/16-1/8 | 0.9470 | 0.8918 | 0.9399 | 0.7376 | 0.7969 | 0.8854 | model | log | |
NRTR-TTA | R31-1/16-1/8 | 0.9423 | 0.8903 | 0.9360 | 0.7641 | 0.8016 | 0.8854 |
We also provide NRTR trained on Union14M
Evaluated on six common benchmarks
Methods | Backbone | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
NRTR | R31-1/8-1/4 | 0.9673 | 0.9320 | 0.9557 | 0.8074 | 0.8357 | 0.9201 | model |
Evaluated on Union14M-Benchmark
Methods | Backbone | Unsolved Challenges | Additional Challenges | General | download | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | ||||
NRTR | R31-1/8-1/4 | 0.493 | 0.406 | 0.543 | 0.696 | 0.429 | 0.755 | 0.015 | 0.752 | model |
Citation¶
@inproceedings{sheng2019nrtr,
title={NRTR: A no-recurrence sequence-to-sequence model for scene text recognition},
author={Sheng, Fenfen and Chen, Zhineng and Xu, Bo},
booktitle={2019 International Conference on Document Analysis and Recognition (ICDAR)},
pages={781--786},
year={2019},
organization={IEEE}
}
RobustScanner¶
RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition
Abstract¶
The attention-based encoder-decoder framework has recently achieved impressive results for scene text recognition, and many variants have emerged with improvements in recognition quality. However, it performs poorly on contextless texts (e.g., random character sequences) which is unacceptable in most of real application scenarios. In this paper, we first deeply investigate the decoding process of the decoder. We empirically find that a representative character-level sequence decoder utilizes not only context information but also positional information. Contextual information, which the existing approaches heavily rely on, causes the problem of attention drift. To suppress such side-effect, we propose a novel position enhancement branch, and dynamically fuse its outputs with those of the decoder attention module for scene text recognition. Specifically, it contains a position aware module to enable the encoder to output feature vectors encoding their own spatial positions, and an attention module to estimate glimpses using the positional clue (i.e., the current decoding time step) only. The dynamic fusion is conducted for more robust feature via an element-wise gate mechanism. Theoretically, our proposed method, dubbed \emph{RobustScanner}, decodes individual characters with dynamic ratio between context and positional clues, and utilizes more positional ones when the decoding sequences with scarce context, and thus is robust and practical. Empirically, it has achieved new state-of-the-art results on popular regular and irregular text recognition benchmarks while without much performance drop on contextless benchmarks, validating its robustness in both contextual and contextless application scenarios.

Dataset¶
Results and Models¶
Methods | GPUs | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | ||||
RobustScanner | 4 | 0.9510 | 0.9011 | 0.9320 | 0.7578 | 0.8078 | 0.8750 | model | log | |
RobustScanner-TTA | 4 | 0.9487 | 0.9011 | 0.9261 | 0.7805 | 0.8124 | 0.8819 |
References¶
[1] Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu. Show, attend and read: A simple and strong baseline for irregular text recognition. In AAAI 2019.
Citation¶
@inproceedings{yue2020robustscanner,
title={RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition},
author={Yue, Xiaoyu and Kuang, Zhanghui and Lin, Chenhao and Sun, Hongbin and Zhang, Wayne},
booktitle={European Conference on Computer Vision},
year={2020}
}
SAR¶
Show, Attend and Read: A Simple and Strong Baseline for Irregular Text Recognition
Abstract¶
Recognizing irregular text in natural scene images is challenging due to the large variance in text appearance, such as curvature, orientation and distortion. Most existing approaches rely heavily on sophisticated model designs and/or extra fine-grained annotations, which, to some extent, increase the difficulty in algorithm implementation and data collection. In this work, we propose an easy-to-implement strong baseline for irregular scene text recognition, using off-the-shelf neural network components and only word-level annotations. It is composed of a 31-layer ResNet, an LSTM-based encoder-decoder framework and a 2-dimensional attention module. Despite its simplicity, the proposed method is robust and achieves state-of-the-art performance on both regular and irregular scene text recognition benchmarks.

Dataset¶
Results and Models¶
Methods | Backbone | Decoder | Regular Text | Irregular Text | download | |||||
---|---|---|---|---|---|---|---|---|---|---|
IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | |||||
SAR | R31-1/8-1/4 | ParallelSARDecoder | 0.9533 | 0.8964 | 0.9369 | 0.7602 | 0.8326 | 0.9062 | model | log | |
SAR-TTA | R31-1/8-1/4 | ParallelSARDecoder | 0.9510 | 0.8964 | 0.9340 | 0.7862 | 0.8372 | 0.9132 | ||
SAR | R31-1/8-1/4 | SequentialSARDecoder | 0.9553 | 0.9073 | 0.9409 | 0.7761 | 0.8093 | 0.8958 | model | log | |
SAR-TTA | R31-1/8-1/4 | SequentialSARDecoder | 0.9530 | 0.9073 | 0.9389 | 0.8002 | 0.8124 | 0.9028 |
We also provide SAR trained on Union14M
Evaluated on six common benchmarks
| Methods | Backbone | Decoder | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAR | R31-1/8-1/4 | SequentialSARDecoder | 0.9707 | 0.9366 | 0.9576 | 0.8219 | 0.8698 | 0.9201 | model |
Evaluated on Union14M-Benchmark
(Unsolved Challenges: Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-Words. Additional Challenges: Incomplete.)

| Methods | Backbone | Decoder | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SAR | R31-1/8-1/4 | SequentialSARDecoder | 0.689 | 0.569 | 0.606 | 0.733 | 0.601 | 0.746 | 0.021 | 0.760 | model |
Citation¶
@inproceedings{li2019show,
title={Show, attend and read: A simple and strong baseline for irregular text recognition},
author={Li, Hui and Wang, Peng and Shen, Chunhua and Zhang, Guyu},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={33},
number={01},
pages={8610--8617},
year={2019}
}
SATRN¶
On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention
Abstract¶
Scene text recognition (STR) is the task of recognizing character sequences in natural scenes. While there have been great advances in STR methods, current methods still fail to recognize texts in arbitrary shapes, such as heavily curved or rotated texts, which are abundant in daily life (e.g. restaurant signs, product labels, company logos, etc). This paper introduces a novel architecture to recognizing texts of arbitrary shapes, named Self-Attention Text Recognition Network (SATRN), which is inspired by the Transformer. SATRN utilizes the self-attention mechanism to describe two-dimensional (2D) spatial dependencies of characters in a scene text image. Exploiting the full-graph propagation of self-attention, SATRN can recognize texts with arbitrary arrangements and large inter-character spacing. As a result, SATRN outperforms existing STR models by a large margin of 5.7 pp on average in “irregular text” benchmarks. We provide empirical analyses that illustrate the inner mechanisms and the extent to which the model is applicable (e.g. rotated and multi-line text). We will open-source the code.

Dataset¶
Train Dataset¶
| trainset | instance_num | repeat_num | source |
| --- | --- | --- | --- |
| SynthText | 7266686 | 1 | synth |
| Syn90k | 8919273 | 1 | synth |
Test Dataset¶
| testset | instance_num | type |
| --- | --- | --- |
| IIIT5K | 3000 | regular |
| SVT | 647 | regular |
| IC13 | 1015 | regular |
| IC15 | 2077 | irregular |
| SVTP | 645 | irregular |
| CT80 | 288 | irregular |
Results and Models¶
(Regular Text: IIIT5K, SVT, IC13-1015. Irregular Text: IC15-2077, SVTP, CT80.)

| Methods | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Satrn | 0.9600 | 0.9181 | 0.9606 | 0.8045 | 0.8837 | 0.8993 | model \| log |
| Satrn-TTA | 0.9530 | 0.9181 | 0.9527 | 0.8276 | 0.8884 | 0.9028 | |
| Satrn_small | 0.9423 | 0.9011 | 0.9567 | 0.7886 | 0.8574 | 0.8472 | model \| log |
| Satrn_small-TTA | 0.9380 | 0.8995 | 0.9488 | 0.8122 | 0.8620 | 0.8507 | |
We also provide SATRN trained on Union14M
Evaluated on six common benchmarks
| Methods | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SATRN | 0.9727 | 0.9536 | 0.9685 | 0.8714 | 0.9039 | 0.9618 | model |
Evaluated on Union14M-Benchmark
(Unsolved Challenges: Curve, Multi-Oriented, Artistic, Contextless, Salient, Multi-Words. Additional Challenges: Incomplete.)

| Methods | Curve | Multi-Oriented | Artistic | Contextless | Salient | Multi-Words | Incomplete | General | download |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SATRN | 0.748 | 0.647 | 0.671 | 0.761 | 0.722 | 0.741 | 0.009 | 0.758 | model |
Citation¶
@article{junyeop2019recognizing,
  title={On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention},
  author={Lee, Junyeop and Park, Sungrae and Baek, Jeonghun and Oh, Seong Joon and Kim, Seonghyeon and Lee, Hwalsuk},
  year={2019}
}
SVTR¶
SVTR: Scene Text Recognition with a Single Visual Model
Abstract¶
Dominant scene text recognition models commonly contain two building blocks, a visual model for feature extraction and a sequence model for text transcription. This hybrid architecture, although accurate, is complex and less efficient. In this study, we propose a Single Visual model for Scene Text recognition within the patch-wise image tokenization framework, which dispenses with the sequential modeling entirely. The method, termed SVTR, firstly decomposes an image text into small patches named character components. Afterward, hierarchical stages are recurrently carried out by component-level mixing, merging and/or combining. Global and local mixing blocks are devised to perceive the inter-character and intra-character patterns, leading to a multi-grained character component perception. Thus, characters are recognized by a simple linear prediction. Experimental results on both English and Chinese scene text recognition tasks demonstrate the effectiveness of SVTR. SVTR-L (Large) achieves highly competitive accuracy in English and outperforms existing methods by a large margin in Chinese, while running faster. In addition, SVTR-T (Tiny) is an effective and much smaller model, which shows appealing speed at inference.

Dataset¶
Train Dataset¶
| trainset | instance_num | repeat_num | source |
| --- | --- | --- | --- |
| SynthText | 7266686 | 1 | synth |
| Syn90k | 8919273 | 1 | synth |
Test Dataset¶
| testset | instance_num | type |
| --- | --- | --- |
| IIIT5K | 3000 | regular |
| SVT | 647 | regular |
| IC13 | 1015 | regular |
| IC15 | 2077 | irregular |
| SVTP | 645 | irregular |
| CT80 | 288 | irregular |
Results and Models¶
(Regular Text: IIIT5K, SVT, IC13-1015. Irregular Text: IC15-2077, SVTP, CT80.)

| Methods | IIIT5K | SVT | IC13-1015 | IC15-2077 | SVTP | CT80 | download |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SVTR-tiny | - | - | - | - | - | - | - |
| SVTR-small | 0.8553 | 0.9026 | 0.9448 | 0.7496 | 0.8496 | 0.8854 | model \| log |
| SVTR-small-TTA | 0.8397 | 0.8964 | 0.9241 | 0.7597 | 0.8124 | 0.8646 | |
| SVTR-base | 0.8570 | 0.9181 | 0.9438 | 0.7448 | 0.8388 | 0.9028 | model \| log |
| SVTR-base-TTA | 0.8517 | 0.9011 | 0.9379 | 0.7569 | 0.8279 | 0.8819 | |
| SVTR-large | - | - | - | - | - | - | - |
注解
The implementation and configuration follow the original code and paper, but there is still a gap between the reproduced results and the official ones. We appreciate any suggestions to improve its performance.
Citation¶
@inproceedings{ijcai2022p124,
title = {SVTR: Scene Text Recognition with a Single Visual Model},
author = {Du, Yongkun and Chen, Zhineng and Jia, Caiyan and Yin, Xiaoting and Zheng, Tianlun and Li, Chenxia and Du, Yuning and Jiang, Yu-Gang},
booktitle = {Proceedings of the Thirty-First International Joint Conference on
Artificial Intelligence, {IJCAI-22}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
editor = {Lud De Raedt},
pages = {884--890},
year = {2022},
month = {7},
note = {Main Track},
doi = {10.24963/ijcai.2022/124},
url = {https://doi.org/10.24963/ijcai.2022/124},
}
关键信息提取模型¶
SDMGR¶
Spatial Dual-Modality Graph Reasoning for Key Information Extraction
Abstract¶
Key information extraction from document images is of paramount importance in office automation. Conventional template matching based approaches fail to generalize well to document images of unseen templates, and are not robust against text recognition errors. In this paper, we propose an end-to-end Spatial Dual-Modality Graph Reasoning method (SDMG-R) to extract key information from unstructured document images. We model document images as dual-modality graphs, nodes of which encode both the visual and textual features of detected text regions, and edges of which represent the spatial relations between neighboring text regions. The key information extraction is solved by iteratively propagating messages along graph edges and reasoning the categories of graph nodes. In order to roundly evaluate our proposed method as well as boost the future research, we release a new dataset named WildReceipt, which is collected and annotated tailored for the evaluation of key information extraction from document images of unseen templates in the wild. It contains 25 key information categories, a total of about 69000 text boxes, and is about 2 times larger than the existing public datasets. Extensive experiments validate that all information including visual features, textual features and spatial relations can benefit key information extraction. It has been shown that SDMG-R can effectively extract key information from document images of unseen templates, and obtain new state-of-the-art results on the recent popular benchmark SROIE and our WildReceipt. Our code and dataset will be publicly released.

Results and models¶
WildReceipt¶
| Method | Modality | Macro F1-Score | Download |
| --- | --- | --- | --- |
| sdmgr_unet16 | Visual + Textual | 0.890 | model \| log |
| sdmgr_novisual | Textual | 0.873 | model \| log |
WildReceiptOpenset¶
| Method | Modality | Edge F1-Score | Node Macro F1-Score | Node Micro F1-Score | Download |
| --- | --- | --- | --- | --- | --- |
| sdmgr_novisual_openset | Textual | 0.792 | 0.931 | 0.940 | model \| log |
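As a quick sanity check, SDMGR can be run end to end together with a detector and a recognizer through MMOCRInferencer. A minimal sketch, assuming the model shorthands and the demo image path below are valid in your MMOCR installation:

from mmocr.apis import MMOCRInferencer

# Chain detection, recognition and KIE; predictions and visualizations go to out_dir
kie_inferencer = MMOCRInferencer(det='DBNet', rec='SAR', kie='SDMGR')
kie_inferencer('demo/demo_kie.jpeg', save_vis=True, out_dir='outputs/')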
Citation¶
@misc{sun2021spatial,
title={Spatial Dual-Modality Graph Reasoning for Key Information Extraction},
author={Hongbin Sun and Zhanghui Kuang and Xiaoyu Yue and Chenhao Lin and Wayne Zhang},
year={2021},
eprint={2103.14470},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
分支¶
本文档旨在全面解释 MMOCR 中每个分支的目的和功能。
分支概述¶
1. main¶
main 分支是 MMOCR 项目的默认分支。它包含了 MMOCR 的最新稳定版本,目前包含了 MMOCR 1.x(例如 v1.0.0)的代码。main 分支确保用户能够使用最新和最可靠的软件版本。
2. dev-1.x¶
dev-1.x 分支用于开发 MMOCR 的下一个版本。此分支将在发版前进行依赖性测试,通过的提交将会合入新版本中,并被发布到 main 分支。通过设置单独的开发分支,项目可以在不影响 main 分支稳定性的情况下继续发展。所有 PR 应合并到 dev-1.x 分支。
3. 0.x¶
0.x 分支用作 MMOCR 0.x(例如 v0.6.3)的存档。此分支将不再积极接受更新或改进,但它仍可作为历史参考,或供尚未升级到 MMOCR 1.x 的用户使用。
贡献指南¶
OpenMMLab 欢迎所有人参与我们项目的共建。本文档将指导您如何通过拉取请求为 OpenMMLab 项目作出贡献。
基本的工作流:¶
获取最新的代码库
从最新的 dev-1.x 分支创建分支进行开发
提交修改(不要忘记使用 pre-commit hooks!)
推送你的修改并创建一个拉取请求
讨论、审核代码
将开发分支合并到 dev-1.x 分支
具体步骤¶
1. 获取最新的代码库¶
当你第一次提 PR 时
复刻 OpenMMLab 原代码库,点击 GitHub 页面右上角的 Fork 按钮即可
克隆复刻的代码库到本地
git clone git@github.com:XXX/mmocr.git
添加原代码库为上游代码库
git remote add upstream git@github.com:open-mmlab/mmocr
从第二个 PR 起
检出本地代码库的主分支,然后从最新的原代码库的主分支拉取更新。这里假设你正基于 dev-1.x 开发。
git checkout dev-1.x
git pull upstream dev-1.x
2. 从 dev-1.x 分支创建一个新的开发分支¶
git checkout -b branchname
小技巧
为了保证提交历史清晰可读,我们强烈推荐您先切换到 dev-1.x 分支,再创建新的分支。
3. 提交你的修改¶
如果你是第一次尝试贡献,请在 MMOCR 的目录下安装并初始化 pre-commit hooks。
pip install -U pre-commit
pre-commit install
提交修改。在每次提交前,pre-commit hooks 都会被触发并规范化你的代码格式。
# coding
git add [files]
git commit -m 'messages'
注解
有时你的文件可能会在提交时被 pre-commit hooks 自动修改。这时请重新添加并提交修改后的文件。
4. 推送你的修改到复刻的代码库,并创建一个拉取请求¶
推送当前分支到远端复刻的代码库
git push origin branchname
创建一个拉取请求
修改拉取请求信息模板,描述修改原因和修改内容。还可以在 PR 描述中,手动关联到相关的议题 (issue),(更多细节,请参考官方文档)。
另外,如果你正在往 dev-1.x 分支提交代码,你还需要在创建 PR 的界面中将基础分支改为 dev-1.x,因为现在默认的基础分支是 main。你同样可以把 PR 关联给相关人员进行评审。
5. 讨论并评审你的代码¶
根据评审人员的意见修改代码,并推送修改
6. 拉取请求合并之后删除该分支¶
在 PR 合并之后,你就可以删除该分支了。
git branch -d branchname # 删除本地分支
git push origin --delete branchname # 删除远程分支
PR 规范¶
使用 pre-commit hook,尽量减少代码风格相关问题
一个 PR 对应一个短期分支
粒度要细,一个PR只做一件事情,避免超大的PR
Bad:实现 Faster R-CNN
Acceptable:给 Faster R-CNN 添加一个 box head
Good:给 box head 增加一个参数来支持自定义的 conv 层数
每次 Commit 时需要提供清晰且有意义的 commit 信息
提供清晰且有意义的拉取请求描述
标题写明白任务名称,一般格式:[Prefix] Short description of the pull request (Suffix)
prefix: 新增功能 [Feature], 修 bug [Fix], 文档相关 [Docs], 开发中 [WIP] (暂时不会被 review)
描述里介绍拉取请求的主要修改内容、结果,以及对其他部分的影响,参考拉取请求模板
关联相关的议题 (issue) 和其他拉取请求
Changelog of v1.x¶
v1.0.0 (04/06/2023)¶
We are excited to announce the first official release of MMOCR 1.0, with numerous enhancements, bug fixes, and the introduction of new dataset support!
🌟 Highlights¶
Support for SCUT-CTW1500, SynthText, and MJSynth datasets
Updated FAQ and documentation
Deprecation of file_client_args in favor of backend_args
Added a new MMOCR tutorial notebook
🆕 New Features & Enhancement¶
Add SCUT-CTW1500 by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1677
Cherry Pick #1205 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1774
Make lanms-neo optional by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1772
SynthText by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1779
Deprecate file_client_args and use backend_args instead by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1765
MJSynth by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1791
Add MMOCR tutorial notebook by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1771
decouple batch_size to det_batch_size, rec_batch_size and kie_batch_size in MMOCRInferencer by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1801
Accepts local-rank in train.py and test.py by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1806
update stitch_boxes_into_lines by @cherryjm in https://github.com/open-mmlab/mmocr/pull/1824
Add tests for pytorch 2.0 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1836
📝 Docs¶
FAQ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1773
Remove LoadImageFromLMDB from docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1767
Mark projects in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1766
add opendatalab download link by @jorie-peng in https://github.com/open-mmlab/mmocr/pull/1753
Fix some deadlinks in the docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1469
Fix quick run by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1775
Dataset by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1782
Update faq by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1817
more social network links by @fengshiwest in https://github.com/open-mmlab/mmocr/pull/1818
Update docs after branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1834
🛠️ Bug Fixes:¶
Place dicts to .mim by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1781
Test svtr_small instead of svtr_tiny by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1786
Add pse weight to metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1787
Synthtext metafile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1788
Clear up some unused scripts by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1798
if dst not exists, when move a single file may raise a file not exists error. by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1803
CTW1500 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1814
MJSynth & SynthText Dataset Preparer config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1805
Use poly_intersection instead of poly.intersection to avoid sup… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1811
Abinet: fix ValueError: Blur limit must be odd when centered=True. Got: (3, 6) by @hugotong6425 in https://github.com/open-mmlab/mmocr/pull/1821
Bug generated during kie inference visualization by @Yangget in https://github.com/open-mmlab/mmocr/pull/1830
Revert sync bn in inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1832
Fix mmdet digit version by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1840
🎉 New Contributors¶
@jorie-peng made their first contribution in https://github.com/open-mmlab/mmocr/pull/1753
@hugotong6425 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1801
@fengshiwest made their first contribution in https://github.com/open-mmlab/mmocr/pull/1818
@cherryjm made their first contribution in https://github.com/open-mmlab/mmocr/pull/1824
@Yangget made their first contribution in https://github.com/open-mmlab/mmocr/pull/1830
Thank you to all the contributors for making this release possible! We’re excited about the new features and enhancements in this version, and we’re looking forward to your feedback and continued support. Happy coding! 🚀
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc6…v1.0.0
v1.0.0rc6 (03/07/2023)¶
Highlights¶
Two new models, ABCNet v2 (inference only) and SPTS, are added to the projects/ folder.
Announcing Inferencer, a unified inference interface in OpenMMLab for everyone's easy access and quick inference with all the pre-trained weights. Docs
Users can use test-time augmentation for text recognition tasks. Docs
Support batch augmentation through BatchAugSampler, which is a technique used in SPTS.
Dataset Preparer has been refactored to allow more flexible configurations. Besides, users are now able to prepare text recognition datasets in LMDB formats. Docs
Some textspotting datasets have been revised to enhance the correctness and consistency with the common practice.
Potential spurious warnings from shapely have been eliminated.
Dependency¶
This version requires MMEngine >= 0.6.0, MMCV >= 2.0.0rc4 and MMDet >= 3.0.0rc5.
New Features & Enhancements¶
Discard deprecated lmdb dataset format and only support img+label now by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1681
abcnetv2 inference by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1657
Add RepeatAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1678
SPTS by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1696
Refactor Inferencers by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1608
Dynamic return type for rescale_polygons by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1702
Revise upstream version limit by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1703
TextRecogCropConverter add crop with opencv warpPersepective function by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1667
change cudnn benchmark to false by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1705
Add ST-pretrained DB-series models and logs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1635
Only keep meta and state_dict when publish model by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1729
Rec TTA by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1401
Speedup formatting by replacing np.transpose with torch… by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1719
Support auto import modules from registry. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1731
Support batch visualization & dumping in Inferencer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1722
add a new argument font_properties to set a specific font file in order to draw Chinese characters properly by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1709
Refactor data converter and gather by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1707
Support batch augmentation through BatchAugSampler by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1757
Put all registry into registry.py by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1760
train by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1756
configs for regression benchmark by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1755
Support lmdb format in Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1762
Docs¶
update the link of DBNet by @AllentDan in https://github.com/open-mmlab/mmocr/pull/1672
Add notice for default branch switching by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1693
docs: Add twitter discord medium youtube link by @vansin in https://github.com/open-mmlab/mmocr/pull/1724
Remove unsupported datasets in docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1670
Bug Fixes¶
Update dockerfile by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1671
Explicitly create np object array for compatibility by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1691
Fix a minor error in docstring by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1685
Fix lint by @triple-Mu in https://github.com/open-mmlab/mmocr/pull/1694
Fix LoadOCRAnnotation ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1695
Fix isort pre-commit error by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1697
Update owners by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1699
Detect intersection before using shapley.intersection to eliminate spurious warnings by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1710
Fix some inferencer bugs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1706
Fix textocr ignore flag by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1712
Add missing softmax in ASTER forward_test by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1718
Fix head in readme by @vansin in https://github.com/open-mmlab/mmocr/pull/1727
Fix some browse dataset script bugs and draw textdet gt instance with ignore flags by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1701
icdar textrecog ann parser skip data with ignore flag by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1708
bezier_to_polygon -> bezier2polygon by @double22a in https://github.com/open-mmlab/mmocr/pull/1739
Fix docs recog CharMetric P/R error definition by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1740
Remove outdated resources in demo/ by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1747
Fix wrong ic13 textspotting split data; add lexicons to ic13, ic15 and totaltext by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1758
SPTS readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1761
New Contributors¶
@triple-Mu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1694
@double22a made their first contribution in https://github.com/open-mmlab/mmocr/pull/1739
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc5…v1.0.0rc6
v1.0.0rc5 (01/06/2023)¶
Highlights¶
Two models, Aster and SVTR, are added to our model zoo. The full implementation of ABCNet is also available now.
Dataset Preparer supports 5 more datasets: CocoTextV2, FUNSD, TextOCR, NAF, SROIE.
We have 4 more text recognition transforms, and two helper transforms. See https://github.com/open-mmlab/mmocr/pull/1646 https://github.com/open-mmlab/mmocr/pull/1632 https://github.com/open-mmlab/mmocr/pull/1645 for details.
The transform FixInvalidPolygon is getting smarter at dealing with invalid polygons, and is now capable of handling more weird annotations. As a result, a complete training cycle on the TotalText dataset can be performed bug-free. The weights of DBNet and FCENet pretrained on TotalText are also released. A pipeline sketch using this transform is shown below.
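A minimal sketch of how such a pipeline could be assembled (the transform names are the ones registered in MMOCR 1.x; the arguments shown are assumptions rather than a tuned recipe):

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    dict(type='FixInvalidPolygon'),  # repair or drop invalid polygons early
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    dict(type='PackTextDetInputs',
         meta_keys=('img_path', 'ori_shape', 'img_shape', 'scale_factor')),
]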
New Features & Enhancements¶
Update ic15 det config according to DataPrepare by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1617
Refactor icdardataset metainfo to lowercase. by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1620
Add ASTER Encoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1239
Add ASTER decoder by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1625
Add ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1238
Update ASTER config by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1629
Support browse_dataset.py to visualize original dataset by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1503
Add CocoTextv2 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1514
Add Funsd to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1550
Add TextOCR to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1543
Refine example projects and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1628
Enhance FixInvalidPolygon, add RemoveIgnored transform by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1632
ConditionApply by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1646
Add NAF to dataset preparer by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1609
Add SROIE to dataset preparer by @FerryHuang in https://github.com/open-mmlab/mmocr/pull/1639
Add svtr decoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1448
Add missing unit tests by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1651
Add svtr encoder by @willpat1213 in https://github.com/open-mmlab/mmocr/pull/1483
ABCNet train by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1610
Totaltext cfgs for DB and FCE by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1633
Add Aliases to models by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1611
SVTR transforms by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1645
Add SVTR framework and configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1621
Issue Template by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1663
Docs¶
Add Chinese translation for browse_dataset.py by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1647
updata abcnet doc by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1658
update the dbnetpp`s readme file by @zhuyue66 in https://github.com/open-mmlab/mmocr/pull/1626
Inferencer docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1744
Bug Fixes¶
nn.SmoothL1Loss beta can not be zero in PyTorch 1.13 version by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1616
ctc loss bug if target is empty by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1618
Add torch 1.13 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1619
Remove outdated tutorial link by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1627
Dev 1.x some doc mistakes by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1630
Support custom font to visualize some languages (e.g. Korean) by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1567
db_module_loss,negative number encountered in sqrt by @KevinNuNu in https://github.com/open-mmlab/mmocr/pull/1640
Use int instead of np.int by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1636
Remove support for py3.6 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1660
New Contributors¶
@zhuyue66 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1626
@KevinNuNu made their first contribution in https://github.com/open-mmlab/mmocr/pull/1630
@FerryHuang made their first contribution in https://github.com/open-mmlab/mmocr/pull/1639
@willpat1213 made their first contribution in https://github.com/open-mmlab/mmocr/pull/1448
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc4…v1.0.0rc5
v1.0.0rc4 (12/06/2022)¶
Highlights¶
Dataset Preparer can automatically generate base dataset configs at the end of the preparation process, and supports 6 more datasets: IIIT5k, CUTE80, ICDAR2013, ICDAR2015, SVT, SVTP.
Introducing our projects/ folder - implementing new models and features into OpenMMLab's algorithm libraries has long been complained to be troublesome due to the rigorous requirements on code quality, which could hinder the fast iteration of SOTA models and might discourage community members from sharing their latest outcome here. We now introduce the projects/ folder, where some experimental features, frameworks and models can be placed, only needing to satisfy the minimum requirements on code quality. Everyone is welcome to post their implementation of any great ideas in this folder! We also add the first example project to illustrate what we expect a good project to have (check out the raw content of README.md for more info!).
Inside the projects/ folder, we are releasing the preview version of ABCNet, which is the first implementation of text spotting models in MMOCR. It's inference-only now, but the full implementation will be available very soon.
New Features & Enhancements¶
Add SVT to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1521
Polish bbox2poly by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1532
Add SVTP to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1523
Iiit5k converter by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1530
Add cute80 to dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1522
Add IC13 preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1531
Add ‘Projects/’ folder, and the first example project by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1524
Rename to {dataset-name}_task_train/test by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1541
Add print_config.py to the tools by @IncludeMathH in https://github.com/open-mmlab/mmocr/pull/1547
Add get_md5 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1553
Add config generator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1552
Support IC15_1811 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1556
Update CT80 config by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1555
Add config generators to all textdet and textrecog configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1560
Refactor TPS by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1240
Add TextSpottingConfigGenerator by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1561
Add common typing by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1596
Update textrecog config and readme by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1597
Support head loss or postprocessor is None for only infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1594
Textspotting datasample by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1593
Simplify mono_gather by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1588
ABCNet v1 infer by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1598
Docs¶
Add Chinese Guidance on How to Add New Datasets to Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1506
Update the qq group link by @vansin in https://github.com/open-mmlab/mmocr/pull/1569
Collapse some sections; update logo url by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1571
Update dataset preparer (CN) by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1591
Bug Fixes¶
Fix two bugs in dataset preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1513
Register bug of CLIPResNet by @jyshee in https://github.com/open-mmlab/mmocr/pull/1517
Being more conservative on Dataset Preparer by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1520
python -m pip upgrade in windows by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1525
Fix wildreceipt metafile by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1528
Fix Dataset Preparer Extract by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1527
Fix ICDARTxtParser by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1529
Fix Dataset Zoo Script by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1533
Fix crop without padding and recog metainfo delete unuse info by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1526
Automatically create nonexistent directory for base configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1535
Change mmcv.dump to mmengine.dump by @ProtossDragoon in https://github.com/open-mmlab/mmocr/pull/1540
mmocr.utils.typing -> mmocr.utils.typing_utils by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1538
Wildreceipt tests by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1546
Fix judge exist dir by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1542
Fix IC13 textdet config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1563
Fix IC13 textrecog annotations by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1568
Auto scale lr by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1584
Fix icdar data parse for text containing separator by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1587
Fix textspotting ut by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1599
Fix TextSpottingConfigGenerator and TextSpottingDataConverter by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1604
Keep E2E Inferencer output simple by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1559
New Contributors¶
@jyshee made their first contribution in https://github.com/open-mmlab/mmocr/pull/1517
@ProtossDragoon made their first contribution in https://github.com/open-mmlab/mmocr/pull/1540
@IncludeMathH made their first contribution in https://github.com/open-mmlab/mmocr/pull/1547
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc3…v1.0.0rc4
v1.0.0rc3 (11/03/2022)¶
Highlights¶
We release several pretrained models using oCLIP-ResNet as the backbone, which is a ResNet variant trained with oCLIP and can significantly boost the performance of text detection models.
Preparing datasets is troublesome and tedious, especially in OCR domain where multiple datasets are usually required. In order to free our users from laborious work, we designed a Dataset Preparer to help you get a bunch of datasets ready for use, with only one line of command! Dataset Preparer is also crafted to consist of a series of reusable modules, each responsible for handling one of the standardized phases throughout the preparation process, shortening the development cycle on supporting new datasets.
New Features & Enhancements¶
Add Dataset Preparer by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1484
support modified resnet structure used in oCLIP by @HannibalAPE in https://github.com/open-mmlab/mmocr/pull/1458
Add oCLIP configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1509
Docs¶
Update install.md by @rogachevai in https://github.com/open-mmlab/mmocr/pull/1494
Refine some docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1455
Update some dataset preparer related docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1502
oclip readme by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1505
Bug Fixes¶
Fix offline_eval error caused by new data flow by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1500
New Contributors¶
@rogachevai made their first contribution in https://github.com/open-mmlab/mmocr/pull/1494
@HannibalAPE made their first contribution in https://github.com/open-mmlab/mmocr/pull/1458
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc2…v1.0.0rc3
v1.0.0rc2 (10/14/2022)¶
This release relaxes the version requirement of MMEngine to >=0.1.0, <1.0.0.
v1.0.0rc1 (10/09/2022)¶
Highlights¶
This release fixes a severe bug leading to inaccurate metric report in multi-GPU training.
We release the weights for all the text recognition models in the MMOCR 1.0 architecture. The inference shorthands for them are also added back to ocr.py. Besides, more documentation chapters are available now.
New Features & Enhancements¶
Simplify the Mask R-CNN config by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1391
auto scale lr by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1326
Update paths to pretrain weights by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1416
Streamline duplicated split_result in pan_postprocessor by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1418
Update model links in ocr.py and inference.md by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1431
Update rec configs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1417
Visualizer refine by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1411
Support get flops and parameters in dev-1.x by @vansin in https://github.com/open-mmlab/mmocr/pull/1414
Docs¶
intersphinx and api by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1367
Fix quickrun by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1374
Fix some docs issues by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1385
Add Documents for DataElements by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1381
config english by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1372
Metrics by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1399
Add version switcher to menu by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1407
Data Transforms by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1392
Fix inference docs by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1415
Fix some docs by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1410
Add maintenance plan to migration guide by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1413
Update Recog Models by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1402
Bug Fixes¶
clear metric.results only done in main process by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1379
Fix a bug in MMDetWrapper by @xinke-wang in https://github.com/open-mmlab/mmocr/pull/1393
Fix browse_dataset.py by @Mountchicken in https://github.com/open-mmlab/mmocr/pull/1398
ImgAugWrapper: Do not cilp polygons if not applicable by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1231
Fix CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1365
Fix merge stage test by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1370
Del CI support for torch 1.5.1 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1371
Test windows cu111 by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1373
Fix windows CI by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1387
Upgrade pre commit hooks by @Harold-lkk in https://github.com/open-mmlab/mmocr/pull/1429
Skip invalid augmented polygons in ImgAugWrapper by @gaotongxiao in https://github.com/open-mmlab/mmocr/pull/1434
New Contributors¶
@vansin made their first contribution in https://github.com/open-mmlab/mmocr/pull/1414
Full Changelog: https://github.com/open-mmlab/mmocr/compare/v1.0.0rc0…v1.0.0rc1
v1.0.0rc0 (09/01/2022)¶
We are excited to announce the release of MMOCR 1.0.0rc0. MMOCR 1.0.0rc0 is the first version of MMOCR 1.x, a part of the OpenMMLab 2.0 projects. Built upon the new training engine, MMOCR 1.x unifies the interfaces of dataset, models, evaluation, and visualization with faster training and testing speed.
Highlights¶
New engines. MMOCR 1.x is based on MMEngine, which provides a general and powerful runner that allows more flexible customizations and significantly simplifies the entrypoints of high-level interfaces.
Unified interfaces. As a part of the OpenMMLab 2.0 projects, MMOCR 1.x unifies and refactors the interfaces and internal logics of train, testing, datasets, models, evaluation, and visualization. All the OpenMMLab 2.0 projects share the same design in those interfaces and logics to allow the emergence of multi-task/modality algorithms.
Cross project calling. Benefiting from the unified design, you can use the models implemented in other OpenMMLab projects, such as MMDet. We provide an example of how to use MMDetection's Mask R-CNN through MMDetWrapper. Check our documents for more details. More wrappers will be released in the future.
Stronger visualization. We provide a series of useful tools which are mostly based on brand-new visualizers. As a result, it is more convenient for the users to explore the models and datasets now.
More documentation and tutorials. We add a bunch of documentation and tutorials to help users get started more smoothly. Read it here.
Breaking Changes¶
We briefly list the major breaking changes here. We will update the migration guide to provide complete details and migration instructions.
Dependencies¶
MMOCR 1.x relies on MMEngine to run. MMEngine is a new foundational library for training deep learning models in OpenMMLab 2.0 projects. The dependencies of file IO and training are migrated from MMCV 1.x to MMEngine.
MMOCR 1.x relies on MMCV>=2.0.0rc0. Although MMCV no longer maintains the training functionalities since 2.0.0rc0, MMOCR 1.x relies on the data transforms, CUDA operators, and image processing interfaces in MMCV. Note that since MMCV 2.0.0rc0, the package mmcv is the version that provides pre-built CUDA operators while mmcv-lite does not, and mmcv-full has been deprecated.
Training and testing¶
MMOCR 1.x uses the Runner in MMEngine rather than the one in MMCV. The new Runner implements and unifies the building logic of dataset, model, evaluation, and visualizer. Therefore, MMOCR 1.x no longer maintains the building logic of those modules in mmocr.train.apis and tools/train.py. That code has been migrated into MMEngine. Please refer to the migration guide of Runner in MMEngine for more details.
The Runner in MMEngine also supports testing and validation. The testing scripts are also simplified, and follow a similar logic as the training scripts to build the runner.
The execution points of hooks in the new Runner have been enriched to allow more flexible customization. Please refer to the migration guide of Hook in MMEngine for more details.
Learning rate and momentum scheduling has been migrated from Hook to Parameter Scheduler in MMEngine. Please refer to the migration guide of Parameter Scheduler in MMEngine for more details.
Configs¶
The Runner in MMEngine uses a different config structures to ease the understanding of the components in runner. Users can read the config example of MMOCR or refer to the migration guide in MMEngine for migration details.
The file names of configs and models are also refactored to follow the new rules unified across OpenMMLab 2.0 projects. Please refer to the user guides of config for more details.
Dataset¶
The Dataset classes implemented in MMOCR 1.x all inherit from BaseDetDataset, which inherits from the BaseDataset in MMEngine. There are several changes of Dataset in MMOCR 1.x.
All the datasets support serializing the data list to reduce the memory footprint when multiple workers are built to accelerate data loading.
The interfaces are changed accordingly.
Data Transforms¶
The data transforms in MMOCR 1.x all inherit from those in MMCV>=2.0.0rc0, which follow a new convention in OpenMMLab 2.0 projects. The changes are listed below:
The interfaces are also changed. Please refer to the API Reference.
The functionality of some data transforms (e.g., Resize) is decomposed into several transforms.
The same data transforms in different OpenMMLab 2.0 libraries share the same augmentation implementation and the same argument logic, i.e., Resize in MMDet 3.x and MMOCR 1.x will resize the image in the exact same manner given the same arguments.
Model¶
The models in MMOCR 1.x all inherit from BaseModel in MMEngine, which defines a new convention for models in OpenMMLab 2.0 projects. Users can refer to the tutorial of model in MMEngine for more details. Accordingly, there are several changes as the following:
The model interfaces, including the input and output formats, are significantly simplified and unified following the new convention in MMOCR 1.x. Specifically, all the input data in training and testing are packed into inputs and data_samples, where inputs contains model inputs like a list of image tensors, and data_samples contains other information of the current data sample such as ground truths and model predictions. In this way, different tasks in MMOCR 1.x can share the same input arguments, which makes the models more general and suitable for multi-task learning.
The model has a data preprocessor module, which is used to pre-process the input data of the model. In MMOCR 1.x, the data preprocessor usually does the necessary steps to form the input images into a batch, such as padding. It can also serve as a place for some special data augmentations or more efficient data transformations like normalization.
The internal logic of the model has been changed. In MMOCR 0.x, the model used forward_train and simple_test to deal with different model forward logics. In MMOCR 1.x and OpenMMLab 2.0, the forward function has three modes: loss, predict, and tensor for training, inference, and tracing or other purposes, respectively. The forward function calls self.loss(), self.predict(), and self._forward() given the modes loss, predict, and tensor, respectively, as sketched below.
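The following toy model is an illustrative sketch only (not the actual MMOCR source; the class and layer names are made up) of how a model following this convention dispatches the three forward modes:

import torch
from typing import Dict, List, Optional

class ToyRecognizer(torch.nn.Module):
    """Illustrative only: follows the loss/predict/tensor forward convention."""

    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(32, 8)  # stand-in for a real backbone/head

    def loss(self, inputs, data_samples=None) -> Dict[str, torch.Tensor]:
        # 'loss' mode (training): return a dict of loss terms
        return {'loss_ce': self._forward(inputs).mean()}

    def predict(self, inputs, data_samples=None) -> List:
        # 'predict' mode (inference): attach predictions to data samples and return them
        logits = self._forward(inputs)
        return [{'pred': sample_logits.argmax(-1)} for sample_logits in logits]

    def _forward(self, inputs, data_samples=None) -> torch.Tensor:
        # 'tensor' mode: raw network outputs, useful for tracing or downstream use
        return self.backbone(inputs)

    def forward(self, inputs, data_samples: Optional[List] = None, mode: str = 'tensor'):
        if mode == 'loss':
            return self.loss(inputs, data_samples)
        if mode == 'predict':
            return self.predict(inputs, data_samples)
        if mode == 'tensor':
            return self._forward(inputs, data_samples)
        raise RuntimeError(f'Invalid mode "{mode}"')

model = ToyRecognizer()
x = torch.rand(4, 32)
print(model(x, mode='loss'))            # dict of losses
print(model(x, mode='predict'))         # list of per-sample predictions
print(model(x, mode='tensor').shape)    # raw output tensor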
Evaluation¶
MMOCR 1.x mainly implements corresponding metrics for each task, which are manipulated by Evaluator to complete the evaluation. In addition, users can build an evaluator in MMOCR 1.x to conduct offline evaluation, i.e., evaluate predictions that may not be produced by MMOCR, as long as they follow our dataset conventions. More details can be found in the Evaluation Tutorial in MMEngine.
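For instance, a config-level sketch of wiring up evaluators (the metric names are taken from MMOCR 1.x model zoo configs; verify them against your installed version):

# Text detection: hmean-iou metric
val_evaluator = dict(type='HmeanIOUMetric')
# Text recognition: word accuracy plus a character-level metric
# val_evaluator = [
#     dict(type='WordMetric', mode=['exact', 'ignore_case', 'ignore_case_symbol']),
#     dict(type='CharMetric')]
test_evaluator = val_evaluator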
Visualization¶
The functions of visualization in MMOCR 0.x are removed. Instead, in OpenMMLab 2.0 projects, we use Visualizer to visualize data. MMOCR 1.x implements TextDetLocalVisualizer, TextRecogLocalVisualizer, and KIELocalVisualizer to allow visualization of ground truths, model predictions, feature maps, etc., at any place, for the three tasks supported in MMOCR. It also supports dumping the visualization data to any external visualization backends such as Tensorboard and Wandb. Check our Visualization Document for more details.
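A config-level sketch of registering such a visualizer (the names follow the defaults shipped in MMOCR 1.x runtime configs and MMEngine backends; treat it as an assumption to verify against your version):

# Dump visualizations to local files and TensorBoard
vis_backends = [
    dict(type='LocalVisBackend'),
    dict(type='TensorboardVisBackend')]
visualizer = dict(
    type='TextDetLocalVisualizer',
    name='visualizer',
    vis_backends=vis_backends)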
Improvements¶
Most models enjoy a performance improvement from the new framework and refactor of data transforms. For example, in MMOCR 1.x, DBNet-R50 achieves 0.854 hmean score on ICDAR 2015, while the counterpart can only get 0.840 hmean score in MMOCR 0.x.
Support mixed precision training for most of the models. However, the remaining models are not supported yet because some operators they use might not be representable in fp16. We will update the documentation and list the results of mixed precision training.
Ongoing changes¶
Test-time augmentation: which was supported in MMOCR 0.x, is not implemented yet in this version due to the limited time slot. We will support it in the following releases with a new and simplified design.
Inference interfaces: a unified inference interface will be supported in the future to ease the use of released models.
Interfaces of useful tools that can be used in notebooks: more useful tools implemented in the tools/ directory will have their Python interfaces so that they can be used through notebooks and in downstream libraries.
Documentation: we will add more design docs, tutorials, and migration guidance so that the community can deep dive into our new design, participate in future development, and smoothly migrate downstream libraries to MMOCR 1.x.
概览¶
伴随着 OpenMMLab 2.0 的发布,MMOCR 1.0 本身也作出了许多突破性的改变,使得代码的冗余度降低,代码效率提高,整体设计上也变得更为一致。然而,这些改变使得完美的后向兼容不再可能。我们也深知在这样巨大的变动之下,老用户想第一时间适应新版本也绝非易事。因此,我们推出了详细的迁移指南,旨在让老用户们尽可能平滑地过渡到全新的框架,最终能享受到全新的 MMOCR 和整个OpenMMLab 2.0 生态系统为生产力带来的巨大优势。
警告
MMOCR 1.0 依赖于新的基础训练框架 MMEngine,因而有着与 MMOCR 0.x 完全不同的依赖链。尽管你可能已经拥有了一个可以正常运行 MMOCR 0.x 的环境,但你仍然需要创建一个新的 python 环境来安装 MMOCR 1.0 版本所需要的依赖库。我们提供了详细的安装文档以供参考。
接下来,请根据你的实际需求,阅读需要的章节:
若需要了解 MMOCR 1.0 的主要变化,请阅读 MMOCR 1.x 更新汇总
如果你需要把 0.x 版本中训练的模型直接迁移到 1.0 版本中使用,请阅读 预训练模型迁移
如下图所示,MMOCR 1.x 版本的维护计划主要分为三个阶段,即“公测期”,“兼容期”以及“维护期”。对于旧版本,我们将不再增加主要新功能。因此,我们强烈建议用户尽早迁移至 MMOCR 1.x 版本。
MMOCR 1.x 更新汇总¶
此处列出了 MMOCR 1.x 相对于 0.x 版本的重大更新。
架构升级:MMOCR 1.x 是基于 MMEngine,提供了一个通用的、强大的执行器,允许更灵活的定制,提供了统一的训练和测试入口。
统一接口:MMOCR 1.x 统一了数据集、模型、评估和可视化的接口和内部逻辑。支持更强的扩展性。
跨项目调用:受益于统一的设计,你可以使用其他OpenMMLab项目中实现的模型,如MMDet。 我们提供了一个例子,说明如何通过MMDetWrapper使用MMDetection的Mask R-CNN。查看我们的文档以了解更多细节。更多的包装器将在未来发布。
更强的可视化:我们提供了一系列可视化工具, 用户现在可以更方便可视化数据。
更多的文档和教程:我们增加了更多的教程,降低用户的学习门槛。
一站式数据准备:准备数据集已经不再是难事。使用我们的 Dataset Preparer,一行命令即可让多个数据集准备就绪。
拥抱更多 projects/:我们推出了 projects/ 文件夹,用于存放一些实验性的新特性、框架和模型。我们对这个文件夹下的代码规范不作过多要求,力求让社区的所有想法第一时间得到实现和展示。请查看我们的样例 project 以了解更多。
更多新模型:MMOCR 1.0 支持了更多模型和模型种类。
分支迁移¶
在早期阶段,MMOCR 有三个分支:main、1.x 和 dev-1.x。随着 MMOCR 1.0.0 正式版的发布,我们也重命名了其中一些分支,下面提供了新旧分支的对照。
main 分支包括了 MMOCR 0.x(例如 v0.6.3)的代码,现在已经被重命名为 0.x。
1.x 包含了 MMOCR 1.x(例如 1.0.0rc6)的代码,现在它是 main 分支的别名,会在 2023 年年中删除。
dev-1.x 是 MMOCR 1.x 的开发分支,现在保持不变。
有关分支的更多信息,请查看分支。
升级 main 分支时解决冲突¶
对于希望从旧 main 分支(包含 MMOCR 0.x 代码)升级的用户,代码可能会导致冲突。要避免这些冲突,请按照以下步骤操作:
请 commit 在 main 上的所有更改(若有),并备份您当前的 main 分支。
git checkout main
git add --all
git commit -m 'backup'
git checkout -b main_backup
从远程存储库获取最新更改。
git remote add openmmlab git@github.com:open-mmlab/mmocr.git
git fetch openmmlab
通过运行 git reset --hard openmmlab/main 将 main 分支重置为远程存储库上的最新 main 分支。
git checkout main
git reset --hard openmmlab/main
按照这些步骤,您可以成功升级您的 main 分支。
代码结构变动¶
MMOCR 为了兼顾文本检测、识别和关键信息提取等任务,在初版设计时存在许多欠缺考虑的地方。在本次 1.0 版本的升级中,MMOCR 同步提出了新的模型架构,旨在尽量与 OpenMMLab 整体的设计对齐,且在算法库内部达成结构上的统一。虽然本次升级并非完全后向兼容,但所有的变动都是有迹可循的。因此,我们在本章节总结出了开发者可能会关心的改动,供有需要的用户参考。
整体改动¶
MMOCR 0.x 存在着对模块功能边界定义不清晰的问题。在 MMOCR 1.0 中,我们重构了模型模块的设计,并定义了它们的模块边界。
考虑到方向差异过大,MMOCR 1.0 中取消了对命名实体识别的支持。
模型中计算损失(loss)的部分模块被抽象化为 Module Loss,转换原始标注为损失目标(loss target)的功能也被包括在内。另一个模块抽象 Postprocessor 则负责在预测时解码模型原始输出为对应任务的 DataSample。
所有模型的输入简化为包含图像原始特征的 inputs 和图片元信息的 List[DataSample]。输出格式也得到统一,训练时是包含 loss 的字典,测试时的输出为包含预测结果的对应任务的 DataSample。
Module Loss 来源于 0.x 版本中实现与单个模型强相关的 XXLoss 类,它们在 1.0 中均被统一重命名为 XXModuleLoss 的形式(如 DBLoss 被重命名为 DBModuleLoss),head 传入的 loss 配置参数名也从 loss 改为 module_loss。
与模型实现无关的通用损失类名称保持 XXLoss 的形式,并放置于 mmocr/models/common/losses 下,如 MaskedBCELoss。
mmocr/models/common/losses 下的改动:0.x 中 DiceLoss 被重命名为 MaskedDiceLoss,FocalLoss 被移除。
增加了起源于 label converter 的 Dictionary 模块,它会在文本识别和关键信息提取任务中被用到。
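下面给出一个示意性的配置片段(仅作说明,字段名请以实际版本的 config 为准),展示 1.x 中检测 head 如何通过 module_loss 与 postprocessor 组织损失计算和后处理:

model = dict(
    type='DBNet',
    det_head=dict(
        type='DBHead',
        in_channels=256,
        # 0.x 中 head 的 loss=... 参数在 1.x 中改名为 module_loss
        module_loss=dict(type='DBModuleLoss'),
        postprocessor=dict(type='DBPostprocessor', text_repr_type='quad')))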
文本检测¶
关键改动(太长不看版)¶
旧版的模型权重仍然适用于新版,但需要将权重字典 state_dict 中以 bbox_head 开头的字段重命名为 det_head。
计算 target 有关的变换 XXTargets 被转移到了 XXModuleLoss 中。
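以下为一个示意脚本(文件名均为假设,仅作参考),用于把 0.x 检测模型权重中以 bbox_head 开头的键重命名为 det_head:

import torch

ckpt = torch.load('dbnet_0x.pth', map_location='cpu')  # 旧版权重(假设的文件名)
state_dict = ckpt.get('state_dict', ckpt)
renamed = {
    ('det_head' + k[len('bbox_head'):] if k.startswith('bbox_head') else k): v
    for k, v in state_dict.items()}
if isinstance(ckpt, dict) and 'state_dict' in ckpt:
    ckpt['state_dict'] = renamed  # 保留 meta 等其余字段
else:
    ckpt = renamed
torch.save(ckpt, 'dbnet_1x.pth')  # 供新版读取的权重(假设的文件名)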
SingleStageTextDetector¶
原本继承链为
mmdet.BaseDetector->SingleStageDetector->SingleStageTextDetector
,现在改为直接继承自BaseDetector
, 中间的SingleStageDetector
被删除。bbox_head
改名为det_head
。train_cfg
、test_cfg
和pretrained
字段被移除。forward_train()
与simple_test()
分别被重构为loss()
与predict()
方法。其中simple_test()
中负责将模型原始输出拆分并输入head.get_bounary()
的部分被整合进了BaseTextDetPostProcessor
中。TextDetectorMixin
中只实现了show_result()
方法,实现与TextDetLocalVisualizer
重合,因此已经被移除。
Head¶
HeadMixin
为XXXHead
在 0.x 版本中必须继承的基类,现在被BaseTextDetHead
代替。里面的get_boundary()
和resize_boundary()
方法被重写为BaseTextDetPostProcessor
的__call__()
和rescale()
方法。
ModuleLoss¶
文本检测中特有的数据变换
XXXTargets
全部移动到XXXModuleLoss._get_target_single
中,与生成 target 相关的配置不再在数据流水线(pipeline)中设置,转而在XXXLoss
中被配置。例如,DBNetTargets
的实现被移动到DBModuleLoss._get_target_single()
中,而用户可以通过设置DBModuleLoss
的初始化参数来控制损失目标的生成。
Postprocessor¶
原本的
XXXPostprocessor.__call__()
中的逻辑转移到重构后的XXXPostprocessor.get_text_instances()
。BasePostprocessor
重构为BaseTextDetPostProcessor
,此基类会将模型输出的预测结果拆分并逐个进行处理,并支持根据scale_factor
自动缩放输出的多边形(polygon)或界定框(bounding box)。
文本识别¶
关键改动(太长不看版)¶
由于字典序发生了变化,且存在部分模型架构上的 bug 被修复,旧版的识别模型权重已经不再能直接应用于 1.0 中,我们将会在后续为有需要的用户推出迁移脚本教程。
0.x 版本中的 SegOCR 支持暂时移除,TPS-CRNN 会在后续版本中被支持。
测试时增强(test time augmentation)在此版本中暂未支持,但将会在后续版本中更新。
Label converter 模块被移除,里面的功能被拆分至 Dictionary, ModuleLoss 和 Postprocessor 模块中。
统一模型中对
max_seq_len
的定义为模型的原始输出长度。
Label Converter¶
原有的 label converter 存在拼写错误 (label convertor),我们通过删除掉这个类规避了这个问题。
负责对字符/字符串与数字索引互相转换的部分被提取至
Dictionary
类中。在旧版本中,不同的 label converter 会有不一样的特殊字符集和字符序。在 0.x 版本中,字符序如下:
| Converter | 字符序 |
| --- | --- |
| AttnConvertor, ABIConvertor | <UKN>, <BOS/EOS>, <PAD>, characters |
| CTCConvertor | <BLK>, <UKN>, characters |
在 1.0 中,我们不再以任务为边界设计不同的字典和字符序,取而代之的是统一了字符序的 Dictionary,其字符序为 characters, <BOS/EOS>, <PAD>, <UKN>。CTCConvertor
中 <BLK> 被等价替换为 <PAD>。
label_convertor
中原本支持三种方式初始化字典:dict_type
、dict_file
和dict_list
,现在在Dictionary
中被简化为dict_file
一种。同时,我们也把原本在dict_type
中支持的字典格式转化为现在dicts/
目录下的预设字典文件。对应映射如下:
| MMOCR 0.x: dict_type | MMOCR 1.0: 字典路径 |
| --- | --- |
| DICT90 | dicts/english_digits_symbols.txt |
| DICT91 | dicts/english_digits_symbols_space.txt |
| DICT36 | dicts/lower_english_digits.txt |
| DICT37 | dicts/lower_english_digits_space.txt |
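以下为一个示意性的字典配置片段(参数名以实际版本的 Dictionary 实现为准,仅作说明),展示如何通过 dict_file 初始化字典:

dictionary = dict(
    type='Dictionary',
    dict_file='dicts/lower_english_digits.txt',
    with_start=True,
    with_end=True,
    same_start_end=True,
    with_padding=True,
    with_unknown=True)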
label_converter
中str2tensor()
的实现被转移到ModuleLoss.get_targets()
中。下面的表格列出了旧版与新版方法实现的对应关系。注意,新旧版的实现并非完全一致。
| MMOCR 0.x | MMOCR 1.0 | 备注 |
| --- | --- | --- |
| ABIConvertor.str2tensor(), AttnConvertor.str2tensor() | BaseTextRecogModuleLoss.get_targets() | 原本两个类中的实现存在的差异在新版本中被统一 |
| CTCConvertor.str2tensor() | CTCModuleLoss.get_targets() | |
label_converter
中tensor2idx()
的实现被转移到Postprocessor.get_single_prediction()
中。下面的表格列出了旧版与新版方法实现的对应关系。注意,新旧版的实现并非完全一致。
| MMOCR 0.x | MMOCR 1.0 |
| --- | --- |
| ABIConvertor.tensor2idx(), AttnConvertor.tensor2idx() | AttentionPostprocessor.get_single_prediction() |
| CTCConvertor.tensor2idx() | CTCPostProcessor.get_single_prediction() |
关键信息提取¶
关键改动(太长不看版)¶
由于模型的输入发生了变化,旧版模型的权重已经不再能直接应用于 1.0 中。
KIEDataset & OpensetKIEDataset¶
读取数据的部分被简化到
WildReceiptDataset
中。对节点和边作额外处理的部分被转移到了
LoadKIEAnnotation
中。使用字典对文本进行转化的部分被转移到了
SDMGRHead.convert_text()
中,使用Dictionary
实现。计算文本框之间关系的部分
compute_relation()
被转移到SDMGRHead.compute_relations()
中,在模型内进行。评估模型表现的部分被简化为
F1Metric
。OpensetKIEDataset
中处理模型边输出的部分被整理到SDMGRPostProcessor
中。
SDMGR¶
show_result()
被整合到KIEVisualizer
中。forward_test()
中对输出进行后处理的部分被整理到SDMGRPostProcessor
中。
Utils 变动¶
原本散布在各处的功能函数现已被统一归类在 mmocr/utils/
下。以下为该目录下各文件的作用域:
bbox_utils.py:四边界定框(bounding box)有关的功能函数。
check_argument.py:检查参数类型的功能函数。
collect_env.py:收集运行环境的功能函数。
data_converter_utils.py:用于数据集转换的功能函数。
fileio.py:输入/输出有关的功能函数。
img_utils.py:处理图片的功能函数。
mask_utils.py:与掩码有关的功能函数。
ocr.py:用于 MMOCR 推理的功能函数。
parsers.py:解码文件的功能函数。
polygon_utils.py:多边形的功能函数。
setup_env.py:存放初始化 MMOCR 的功能函数。
string_utils.py:存放字符串的功能函数。
typing.py:存放 MMOCR 中常用数据类型的缩写。
数据集迁移¶
OpenMMLab 2.0 系列算法库基于 MMEngine 设计了统一的数据集基类 BaseDataset,并制定了数据集标注文件规范。基于此,我们在 MMOCR 1.0 版本中重构了 OCR 任务数据集基类 OCRDataset。以下文档将介绍 MMOCR 中新旧数据集格式的区别,以及如何将旧数据集迁移至新版本中。对于暂不方便进行数据迁移的用户,我们也在第三节提供了临时的代码兼容方案。
注解
关键信息抽取任务仍采用原有的 WildReceipt 数据集标注格式。
旧版数据格式回顾¶
针对不同任务,MMOCR 0.x 版本实现了多种不同的数据集类型,如文本检测任务的 IcdarDataset、TextDetDataset;文本识别任务的 OCRDataset、OCRSegDataset 等。而不同的数据集类型同时还可能存在多种不同的标注及文件存储后端,如 .txt、.json、.jsonl 等,使得用户在自定义数据集时需要配置各类数据加载器 (Loader) 以及数据解析器 (Parser)。这不仅增加了用户的使用难度,也带来了许多问题和隐患。例如,以 .txt 格式存储的简单 OCRDataset 在遇到包含空格的文本标注时将会报错。
文本检测¶
文本检测任务中,IcdarDataset 采用了与通用目标检测 COCO 数据集一致的标注格式。
{
"images": [
{
"id": 1,
"width": 800,
"height": 600,
"file_name": "test.jpg"
}
],
"annotations": [
{
"id": 1,
"image_id": 1,
"category_id": 1,
"bbox": [0,0,10,10],
"segmentation": [
[0,0,10,0,10,10,0,10]
],
"area": 100,
"iscrowd": 0
}
]
}
而 TextDetDataset 则采用了 JSON Line 的存储格式,将类似 COCO 格式的标签转换成文本存放在 .txt 或 .jsonl 格式文件中。
{"file_name": "test/img_2.jpg", "height": 720, "width": 1280, "annotations": [{"iscrowd": 0, "category_id": 1, "bbox": [602.0, 173.0, 33.0, 24.0], "segmentation": [[602, 173, 635, 175, 634, 197, 602, 196]]}, {"iscrowd": 0, "category_id": 1, "bbox": [734.0, 310.0, 58.0, 54.0], "segmentation": [[734, 310, 792, 320, 792, 364, 738, 361]]}]}
{"file_name": "test/img_5.jpg", "height": 720, "width": 1280, "annotations": [{"iscrowd": 1, "category_id": 1, "bbox": [405.0, 409.0, 32.0, 52.0], "segmentation": [[408, 409, 437, 436, 434, 461, 405, 433]]}, {"iscrowd": 1, "category_id": 1, "bbox": [435.0, 434.0, 8.0, 33.0], "segmentation": [[437, 434, 443, 440, 441, 467, 435, 462]]}]}
文本识别¶
对于文本识别任务,MMOCR 0.x 版本中存在两种数据标注格式。其中 .txt 格式的标注文件每一行共有两个字段,分别存放了图片名以及标注的文本内容,并以空格分隔。
img1.jpg OpenMMLab
img2.jpg MMOCR
而 JSON Line 格式则使用 json.dumps 将 JSON 格式的标注转换为文本内容后存放在 .jsonl 文件中,其内容形似一个字典,将文件名和文本标注信息分别存放在 filename 和 text 字段中。
{"filename": "img1.jpg", "text": "OpenMMLab"}
{"filename": "img2.jpg", "text": "MMOCR"}
新版数据格式¶
为解决 0.x 版本中数据集格式过于混杂的情况,MMOCR 1.x 采用了基于 MMEngine 设计的统一数据标准。每一个数据标注文件存放在 .json 文件中,并使用类似字典的格式分别存放了数据集的元信息(metainfo)与具体的标注内容(data_list)。
{
"metainfo":
{
"classes": ("cat", "dog"),
// ...
},
"data_list":
[
{
"img_path": "xxx/xxx_0.jpg",
"img_label": 0,
// ...
},
// ...
]
}
基于此,我们针对 MMOCR 特有的任务设计了 TextDetDataset、TextRecogDataset。
文本检测¶
新版格式介绍¶
TextDetDataset 中存放了文本检测任务所需的边界盒标注、文件名等信息。由于文本检测任务中只有 1 个类别,因此我们将其类别 id 默认设置为 0,而背景类则为 1。tests/data/det_toy_dataset/instances_test.json 中存放了一个文本检测任务的数据标注示例,用户可以参考该文件来将自己的数据集转换为我们支持的格式。
{
"metainfo":
{
"dataset_type": "TextDetDataset",
"task_name": "textdet",
"category": [{"id": 0, "name": "text"}]
},
"data_list":
[
{
"img_path": "test_img.jpg",
"height": 640,
"width": 640,
"instances":
[
{
"polygon": [0, 0, 0, 10, 10, 20, 20, 0],
"bbox": [0, 0, 10, 20],
"bbox_label": 0,
"ignore": False
},
// ...
]
}
]
}
其中,bbox 字段的格式为 [min_x, min_y, max_x, max_y]。
迁移脚本¶
为帮助用户将旧版本标注文件迁移至新格式,我们提供了迁移脚本。使用方法如下:
python tools/dataset_converters/textdet/data_migrator.py ${IN_PATH} ${OUT_PATH}
| 参数 | 类型 | 说明 |
| --- | --- | --- |
| in_path | str | (必须)旧版标注的路径 |
| out_path | str | (必须)新版标注的路径 |
| --task | 'auto', 'textdet', 'textspotter' | 指定输出数据集标注所兼容的任务。若指定为 textdet,则不会转存 coco 格式中的 text 字段。默认为 auto,即根据旧版标注的格式自动决定输出的标注格式。 |
文本识别¶
新版格式介绍¶
TextRecogDataset 中存放了文本识别任务所需的文本内容。通常而言,文本识别数据集中的每一张图片都仅包含一个文本实例。我们在 tests/data/rec_toy_dataset/labels.json 提供了一个简单的识别数据格式示例,用户可以参考该文件以进一步了解其中的细节。
{
"metainfo":
{
"dataset_type": "TextRecogDataset",
"task_name": "textrecog",
},
"data_list":
[
{
"img_path": "test_img.jpg",
"instances":
[
{
"text": "GRAND"
}
]
}
]
}
迁移脚本¶
为帮助用户将旧版本标注文件迁移至新格式,我们提供了迁移脚本。使用方法如下:
python tools/dataset_converters/textrecog/data_migrator.py ${IN_PATH} ${OUT_PATH} --format ${txt, jsonl, lmdb}
| 参数 | 类型 | 说明 |
| --- | --- | --- |
| in_path | str | (必须)旧版标注的路径 |
| out_path | str | (必须)新版标注的路径 |
| --format | 'txt', 'jsonl', 'lmdb' | 指定旧版数据集标注的格式。 |
兼容性¶
考虑到用户对数据迁移所需的成本,我们在 MMOCR 1.x 版本中暂时对 MMOCR 0.x 旧版本格式进行了兼容。
注解
用于兼容旧数据格式的代码和组件可能在未来的版本中被完全移除。因此,我们强烈建议用户将数据集迁移至新的数据格式标准。
Specifically, we provide three temporary dataset classes, `IcdarDataset`, `RecogTextDataset`, and `RecogLMDBDataset`, to support annotation files in the old formats. They correspond to the text detection dataset `IcdarDataset` of MMOCR 0.x and to text recognition annotations in the `.txt`, `.jsonl`, and `LMDB` formats, respectively. Their usage is the same as in 0.x.
IcdarDataset supports the COCO annotation format used by the text detection task in 0.x. Simply add a new dataset config file under `configs/textdet/_base_/datasets` and set its dataset type to `IcdarDataset`:

data_root = 'data/det/icdar2015'

train_dataset = dict(
    type='IcdarDataset',
    data_root=data_root,
    ann_file='instances_training.json',
    data_prefix=dict(img_path='imgs/'),
    filter_cfg=dict(filter_empty_gt=True, min_size=32),
    pipeline=None)
RecogTextDataset supports the `txt` and `jsonl` annotation formats used by the text recognition task in 0.x. Simply add a new dataset config file under `configs/textrecog/_base_/datasets` and set its dataset type to `RecogTextDataset`. For example, the following configs read the old-format labels `old_label.txt` and `old_label.jsonl` from the toy dataset:

data_root = 'tests/data/rec_toy_dataset/'

# Read the old txt-format recognition labels
txt_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.txt',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineStrParser',
        keys=['filename', 'text'],
        keys_idx=[0, 1]),
    pipeline=[])

# Read the old JSON Lines-format recognition labels
jsonl_dataset = dict(
    type='RecogTextDataset',
    data_root=data_root,
    ann_file='old_label.jsonl',
    data_prefix=dict(img_path='imgs'),
    parser_cfg=dict(
        type='LineJsonParser',
        keys=['filename', 'text']),
    pipeline=[])
RecogLMDBDataset supports the `LMDB` annotation format (image + text) used by the text recognition task in 0.x. Simply add a new dataset config file under `configs/textrecog/_base_/datasets` and set its dataset type to `RecogLMDBDataset`. For example, the following config reads `imgs.lmdb` from the toy dataset, an `lmdb` file that contains both labels and images:

# Set the dataset type to RecogLMDBDataset
data_root = 'tests/data/rec_toy_dataset/'

lmdb_dataset = dict(
    type='RecogLMDBDataset',
    data_root=data_root,
    ann_file='imgs.lmdb',
    pipeline=None)
You also need to replace the image loading transform in `train_pipeline` and `test_pipeline`, such as `LoadImageFromFile`, with `LoadImageFromNDArray`:

train_pipeline = [dict(type='LoadImageFromNDArray')]
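For context, a recognition pipeline for this LMDB dataset could then start as sketched below. This is only an assumption-laden outline: it presumes the standard MMOCR 1.x transforms LoadOCRAnnotations and PackTextRecogInputs, and the remaining augmentations would follow your usual recognition pipeline:

# minimal sketch of a pipeline head for RecogLMDBDataset (assumed 1.x transform names)
train_pipeline = [
    dict(type='LoadImageFromNDArray'),                # images arrive as decoded arrays from LMDB
    dict(type='LoadOCRAnnotations', with_text=True),  # load the ground-truth transcription
    # ... model-specific augmentations ...
    dict(type='PackTextRecogInputs',
         meta_keys=('img_path', 'ori_shape', 'img_shape', 'valid_ratio'))
]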
Pretrained Model Migration Guide¶
Since the model structures have been heavily refactored and fixed in the new version, MMOCR 1.x cannot load the old pretrained weights directly. We have updated the pretrained weights and logs of all models on our website for users who need them.
In addition, a weight migration tool for text detection models is under development and is planned for an upcoming release. Because the text recognition and key information extraction models have changed too much and the migration would be lossy, we do not plan to support them for now. If you have specific needs, feel free to ask us through an Issue.
Data Transform Migration¶
Introduction¶
In MMOCR 0.x, the data transforms were implemented in `mmocr/datasets/pipelines/xxx_transforms.py`. However, these modules were scattered across the codebase and lacked a uniform design. In MMOCR 1.x we therefore refactored all data augmentation modules and grouped them by task under the `mmocr/datasets/transforms` directory, in `ocr_transforms.py`, `textdet_transforms.py`, and `textrecog_transforms.py`. Here, `ocr_transforms.py` implements augmentations shared across OCR tasks, while `textdet_transforms.py` and `textrecog_transforms.py` implement augmentations specific to text detection and text recognition, respectively.
Since some modules were renamed, merged, or split during the refactoring, the new interfaces and default parameters may be inconsistent with the old version. This document therefore explains in detail how to migrate the data augmentation modules, that is, how to configure the current transforms so that they behave identically to the old ones.
Configuration Migration Guide¶
Data Formatting Transforms¶
Collect + CustomFormatBundle -> PackTextDetInputs / PackTextRecogInputs
`PackxxxInputs` covers the functionality of both `Collect` and `CustomFormatBundle`, and it no longer takes a `key` argument; the generation of the training targets is now handled inside the `loss` module.
MMOCR 0.x configuration:

dict(
    type='CustomFormatBundle',
    keys=['gt_shrink', 'gt_shrink_mask', 'gt_thr', 'gt_thr_mask'],
    meta_keys=['img_path', 'ori_shape', 'img_shape'],
    visualize=dict(flag=False, boundary_key='gt_shrink')),
dict(
    type='Collect',
    keys=['img', 'gt_shrink', 'gt_shrink_mask', 'gt_thr', 'gt_thr_mask'])

MMOCR 1.x configuration:

dict(
    type='PackTextDetInputs',
    meta_keys=('img_path', 'ori_shape', 'img_shape'))
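In practice, the packing step simply sits at the end of a 1.x pipeline. The sketch below is a minimal, non-authoritative example that assumes the standard 1.x loading and resizing transforms; real configs add model-specific augmentations in between:

train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadOCRAnnotations', with_bbox=True, with_polygon=True, with_label=True),
    dict(type='Resize', scale=(640, 640), keep_ratio=True),
    # pack images, targets and the listed meta information into the model inputs
    dict(type='PackTextDetInputs',
         meta_keys=('img_path', 'ori_shape', 'img_shape'))
]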
Data Augmentation Transforms¶
ResizeOCR -> Resize, RescaleToHeight, PadToWidth
The original `ResizeOCR` has been split into three independent transforms. When `keep_aspect_ratio=False`, it is equivalent to `Resize` in 1.x, and the config can be rewritten as follows.
MMOCR 0.x configuration:

dict(
    type='ResizeOCR',
    height=32,
    min_width=100,
    max_width=100,
    keep_aspect_ratio=False)

MMOCR 1.x configuration:

dict(
    type='Resize',
    scale=(100, 32),
    keep_ratio=False)
When `keep_aspect_ratio=True` and `max_width=None`: the image height is rescaled to a fixed value and the width is scaled with the same ratio.
MMOCR 0.x configuration:

dict(
    type='ResizeOCR',
    height=32,
    min_width=32,
    max_width=None,
    width_downsample_ratio=1.0 / 16,
    keep_aspect_ratio=True)

MMOCR 1.x configuration:

dict(
    type='RescaleToHeight',
    height=32,
    min_width=32,
    max_width=None,
    width_divisor=16)
When `keep_aspect_ratio=True` and `max_width` is a fixed value: the image height is rescaled to a fixed value and the width is scaled with the same ratio. If the resulting width is smaller than `max_width`, the image is padded to `max_width`; otherwise it is cropped to `max_width`. In other words, the output image size is always `(height, max_width)`.
MMOCR 0.x configuration:

dict(
    type='ResizeOCR',
    height=32,
    min_width=32,
    max_width=100,
    width_downsample_ratio=1.0 / 16,
    keep_aspect_ratio=True)

MMOCR 1.x configuration:

dict(
    type='RescaleToHeight',
    height=32,
    min_width=32,
    max_width=100,
    width_divisor=16),
dict(
    type='PadToWidth',
    width=100)
RandomRotateTextDet & RandomRotatePolyInstances -> RandomRotate
The random rotation augmentations have been consolidated into `RandomRotate`. Its default behavior is consistent with `RandomRotateTextDet` in 0.x; in that case, only the maximum rotation angle `max_angle` needs to be specified.
Note
The default value of "max_angle" differs between the old and new versions, so it needs to be specified explicitly.
MMOCR 0.x configuration:

dict(type='RandomRotateTextDet')

MMOCR 1.x configuration:

dict(type='RandomRotate', max_angle=10)
For `RandomRotatePolyInstances`, the parameter `use_canvas=True` must additionally be specified.
MMOCR 0.x configuration:

dict(
    type='RandomRotatePolyInstances',
    rotate_ratio=0.5,  # execution probability of 0.5
    max_angle=60,
    pad_with_fixed_color=False)

MMOCR 1.x configuration:

# Wrap the transform with RandomApply and specify its execution probability
dict(
    type='RandomApply',
    transforms=[
        dict(type='RandomRotate',
             max_angle=60,
             pad_with_fixed_color=False,
             use_canvas=True)],
    prob=0.5)  # execution probability of 0.5
Note
In 0.x, some augmentations specified their execution probability through an internal variable "xxx_ratio", such as "rotate_ratio" or "crop_ratio". In 1.x, these parameters have been removed across the board. Instead, a transform can now be wrapped with "RandomApply" to specify its execution probability.
RandomCropFlip -> TextDetRandomCropFlip
Only the name of the transform has changed so far; all other parameters stay the same.
RandomCropPolyInstances -> RandomCrop
The new version removes `crop_ratio` and `instance_key`, and always uses `gt_polygons` as the cropping target.
MMOCR 0.x configuration:

dict(
    type='RandomCropPolyInstances',
    instance_key='gt_masks',
    crop_ratio=0.8,  # execution probability of 0.8
    min_side_ratio=0.3)

MMOCR 1.x configuration:

# Wrap the transform with RandomApply and specify its execution probability
dict(
    type='RandomApply',
    transforms=[dict(type='RandomCrop', min_side_ratio=0.3)],
    prob=0.8)  # execution probability of 0.8
RandomCropInstances -> TextDetRandomCrop
The new version removes `instance_key` and `mask_type`, and always uses `gt_polygons` as the cropping target.
MMOCR 0.x configuration:

dict(
    type='RandomCropInstances',
    target_size=(800, 800),
    instance_key='gt_kernels')

MMOCR 1.x configuration:

dict(
    type='TextDetRandomCrop',
    target_size=(800, 800))
EastRandomCrop -> RandomCrop + Resize + mmengine.Pad
The original `EastRandomCrop` cropped, resized, and padded the image all in one step. In the new version, the same effect can be achieved by combining three transforms.
MMOCR 0.x configuration:

dict(
    type='EastRandomCrop',
    max_tries=10,
    min_crop_side_ratio=0.1,
    target_size=(640, 640))

MMOCR 1.x configuration:

dict(type='RandomCrop', min_side_ratio=0.1),
dict(type='Resize', scale=(640, 640), keep_ratio=True),
dict(type='Pad', size=(640, 640))
RandomScaling -> mmengine.RandomResize
In the new version, the `RandomResize` provided through MMEngine is used directly in place of the original implementation.
MMOCR 0.x configuration:

dict(
    type='RandomScaling',
    size=800,
    scale=(0.75, 2.5))

MMOCR 1.x configuration:

dict(
    type='RandomResize',
    scale=(800, 800),
    ratio_range=(0.75, 2.5),
    keep_ratio=True)
Note
By default, the data pipeline searches the registry of the current scope for a transform; if it is not found there, the search continues in upstream libraries such as MMCV and MMEngine. For example, `RandomResize` is not implemented in MMOCR, yet it can still be referenced directly in a config because the program automatically finds it upstream in MMCV. Users can also specify the scope explicitly by adding a prefix: `mmengine.RandomResize` forces the use of the `RandomResize` implemented in MMCV, which is useful when a transform with the same name exists in both upstream and downstream libraries. Also note that all data transforms in MMCV are registered into MMEngine, which is why we write `mmengine.RandomResize` rather than `mmcv.RandomResize`.
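As a small sketch of this mechanism (reusing the transform discussed above), a pipeline entry can pin the scope explicitly by prefixing the type name:

train_pipeline = [
    # scope-pinned reference to the upstream RandomResize registered under MMEngine
    dict(type='mmengine.RandomResize',
         scale=(800, 800),
         ratio_range=(0.75, 2.5),
         keep_ratio=True),
]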
SquareResizePad -> Resize + SourceImagePad
The original `SquareResizePad` implemented two internal branches and randomly chose one of them according to the probability `pad_ratio`: one branch resizes the image and then pads it, while the other only resizes the image. To improve the reusability of the individual modules, in 1.x this method is split into the combination of `Resize` + `SourceImagePad`, with the branch selection controlled by `RandomChoice` from MMCV.
MMOCR 0.x configuration:

dict(
    type='SquareResizePad',
    target_size=800,
    pad_ratio=0.6)

MMOCR 1.x configuration:

dict(
    type='RandomChoice',
    transforms=[
        [
            dict(
                type='Resize',
                scale=800,
                keep_ratio=True),
            dict(
                type='SourceImagePad',
                target_scale=800)
        ],
        [
            dict(
                type='Resize',
                scale=800,
                keep_ratio=False)
        ]
    ],
    prob=[0.4, 0.6])  # probabilities of choosing each branch
Note
In 1.x, the random-choice wrapper "RandomChoice" replaces "OneOfWrapper". It randomly picks one combination from a list of transform combinations and applies it.
RandomWrapper -> mmengine.RandomApply
In 1.x, the `RandomWrapper` wrapper is replaced by `RandomApply` implemented in MMCV, which specifies the probability of applying the wrapped transforms. The probability parameter `p` is now named `prob`.
MMOCR 0.x configuration:

dict(
    type='RandomWrapper',
    p=0.25,
    transforms=[
        dict(type='PyramidRescale'),
    ])

MMOCR 1.x configuration:

dict(
    type='RandomApply',
    prob=0.25,
    transforms=[
        dict(type='PyramidRescale'),
    ])
OneOfWrapper -> mmengine.RandomChoice
The random-choice wrapper has been renamed to `RandomChoice` and is used in exactly the same way as before.
ScaleAspectJitter -> ShortScaleAspectJitter, BoundedScaleAspectJitter
The original `ScaleAspectJitter` implemented several different image-size jittering strategies. In the new version, it is split into several independent transforms with clearer logic.
When `resize_type='indep_sample_in_range'`, it is equivalent to randomly resizing the image within the given range.
MMOCR 0.x configuration:

dict(
    type='ScaleAspectJitter',
    img_scale=None,
    keep_ratio=False,
    resize_type='indep_sample_in_range',
    scale_range=(640, 2560))

MMOCR 1.x configuration:

dict(
    type='RandomResize',
    scale=(640, 640),
    ratio_range=(1.0, 4.125),
    resize_type='Resize',
    keep_ratio=True)
When `resize_type='long_short_bound'`, the image is resized to the specified size and then its aspect ratio is jittered. This logic is now implemented by the new transform `BoundedScaleAspectJitter`.
MMOCR 0.x configuration:

dict(
    type='ScaleAspectJitter',
    img_scale=[(3000, 736)],  # Unused
    ratio_range=(0.7, 1.3),
    aspect_ratio_range=(0.9, 1.1),
    multiscale_mode='value',
    long_size_bound=800,
    short_size_bound=480,
    resize_type='long_short_bound',
    keep_ratio=False)

MMOCR 1.x configuration:

dict(
    type='BoundedScaleAspectJitter',
    long_size_bound=800,
    short_size_bound=480,
    ratio_range=(0.7, 1.3),
    aspect_ratio_range=(0.9, 1.1))
When `resize_type='around_min_img_scale'` (the default), the short side of the image is resized to the specified size, then the aspect ratio is jittered within the given range, and finally the side lengths are guaranteed to be divisible by `scale_divisor`. This logic is implemented by the new transform `ShortScaleAspectJitter`.
MMOCR 0.x configuration:

dict(
    type='ScaleAspectJitter',
    img_scale=[(3000, 640)],
    ratio_range=(0.7, 1.3),
    aspect_ratio_range=(0.9, 1.1),
    multiscale_mode='value',
    keep_ratio=False)

MMOCR 1.x configuration:

dict(
    type='ShortScaleAspectJitter',
    short_size=640,
    ratio_range=(0.7, 1.3),
    aspect_ratio_range=(0.9, 1.1),
    scale_divisor=32)
mmocr.apis¶
Inferencers¶
- MMOCR Inferencer.
- Text Detection inferencer.
- Text Recognition inferencer.
- Text Spotting inferencer.
- Key Information Extraction Inferencer.
mmocr.structures¶
- A data structure interface of MMOCR.
- A data structure interface of MMOCR for text recognition.
- A data structure interface of MMOCR.
mmocr.datasets¶
Datasets¶
- OCRDataset for text detection and text recognition.
- WildReceipt Dataset for key information extraction.
Compatible Datasets¶
- Dataset for text detection while ann_file in coco format.
- RecogLMDBDataset for text recognition.
- RecogTextDataset for text recognition.
Dataset Wrapper¶
- A wrapper of concatenated dataset.
mmocr.datasets.transforms¶
Loading¶
- Load an image from file.
- Load and process the
- Load and process the
TextDet Transforms¶
- First randomly rescale the image so that the longside and shortside of the image are around the bound; then jitter its aspect ratio.
- Flip the image & bbox polygon.
- Pad Image to target size.
- First rescale the image for its shorter side to reach the short_size and then jitter its aspect ratio, final rescale the shape guaranteed to be divided by scale_divisor.
- Randomly select a region and crop images to a target size and make sure to contain text region.
- Random crop and flip a patch in the image.
TextRecog Transforms¶
- A general geometric augmentation tool for text images in the CVPR 2020 paper "Learn to Augment: Joint Data Augmentation and Network Optimization for Text Recognition".
- Randomly crop the image's height, either from top or bottom.
- Jitter the image contents.
- Reverse image pixels.
- Resize the image to the base shape, downsample it with gaussian pyramid, and rescale it back to original size.
- Only pad the image's width.
- Rescale the image to the height according to setting and keep the aspect ratio unchanged if possible.
OCR Transforms¶
- Randomly crop images and make sure to contain at least one intact instance.
- Randomly rotate the image, boxes, and polygons.
- Resize image & bboxes & polygons.
- Fix invalid polygons in the dataset.
- Remove ignored elements from the pipeline.
Formatting¶
- Pack the inputs data for text detection.
- Pack the inputs data for text recognition.
- Pack the inputs data for key information extraction.
Transform Wrapper¶
- A wrapper around imgaug: https://github.com/aleju/imgaug.
- A wrapper around torchvision transforms.
mmocr.models¶
models.common¶
Dictionary¶
- The class generates a dictionary for recognition.
Losses¶
- This loss combines a Sigmoid layer and a masked balanced BCE loss in one single class.
- Masked dice loss.
- Masked Smooth L1 loss.
- Masked square dice loss.
- This loss combines a Sigmoid layer and a masked BCE loss in one single class.
- Smooth L1 loss.
- Cross entropy loss.
- Masked Balanced BCE loss.
- Masked BCE loss.
Layers¶
- Transformer Encoder Layer.
- Transformer Decoder Layer.
Modules¶
- Scaled Dot-Product Attention Module.
- Multi-Head Attention module.
- Two-layer feed-forward module.
- Fixed positional encoding with sine and cosine functions.
models.textdet¶
Detectors¶
- The class for implementing single stage text detector.
- The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.
- The class for implementing PANet text detector.
- The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.
- The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
- The class for implementing FCENet (CVPR 2021) text detector: Fourier Contour Embedding for Arbitrary-shaped Text Detection.
- The class for implementing DRRG text detector.
- A wrapper of MMDet's model.
Data Preprocessors¶
- Image pre-processor for detection tasks.
Necks¶
- This code is from https://github.com/WenmuZhou/PAN.pytorch.
- FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.
- FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.
- The class for implementing DRRG and TextSnake U-Net-like FPN.
Heads¶
- Base head for text detection, build the loss and postprocessor.
- The class for PSENet head.
- The class for PANet head.
- The class for DBNet head.
- The class for implementing FCENet head.
- The class for TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.
- The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.
Module Losses¶
- Base class for the module loss of segmentation-based text detection algorithms with some handy utilities.
- The class for implementing PANet loss.
- The class for implementing PSENet loss.
- The class for implementing DBNet loss.
- The class for implementing TextSnake loss.
- The class for implementing FCENet loss.
- The class for implementing DRRG loss.
Postprocessors¶
- Base postprocessor for text detection models.
- Decoding predictions of PSENet to instances.
- Convert scores to quadrangles via post processing in PANet.
- Decoding predictions of DBNet to instances.
- Merge text components and construct boundaries of text instances.
- Decoding predictions of FCENet to instances.
- Decoding predictions of TextSnake to instances.
models.textrecog¶
Recognizers¶
Data Preprocessors¶
Preprocessors¶
BackBones¶
Encoders¶
Decoders¶
Module Losses¶
Postprocessors¶
Layers¶
models.kie¶
Extractors¶
- The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.
Module Losses¶
- The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.
Postprocessors¶
- Postprocessor for SDMGR.
mmocr.evaluation¶
TextDet Metric¶
- HmeanIOU metric.
TextRecog Metric¶
- Word metrics for text recognition task.
- Character metrics for text recognition task.
- One minus NED metric for text recognition task.
KIE Metric¶
- Compute F1 scores.
mmocr.visualization¶
- The MMOCR Text Detection Local Visualizer.
- The MMOCR Text Detection Local Visualizer.
- MMOCR Text Detection Local Visualizer.
- The MMOCR Text Detection Local Visualizer.
mmocr.utils¶
Box Utils¶
- Converting a bounding box to a polygon.
- Calculate the distance between the center points of two bounding boxes.
- Calculate the diagonal length of a bounding box (distance between the top-left and bottom-right).
- Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by
- Check if two boxes are on the same line by their y-axis coordinates.
- Rescale bboxes according to scale_factor.
- Stitch fragmented boxes of words into lines.
Point Utils¶
- Calculate the distance between two points.
- Calculate the center of a set of points.
Polygon Utils¶
- Calculate the IOU between two boundaries.
- Crop polygon to be within a box region.
- Check if the polygon is inside the target region.
- Offset (expand/shrink) the polygon by the target distance.
- Converting a polygon to a bounding box.
- Convert a polygon to shapely.geometry.Polygon.
- Calculate the intersection area between two polygons.
- Calculate the IOU between two polygons.
- Convert a potentially invalid polygon to a valid one by eliminating self-crossing or self-touching parts.
- Calculate the union area between two polygons.
- Convert a nested list of boundaries to a list of Polygons.
- Rescale a polygon according to scale_factor.
- Rescale polygons according to scale_factor.
- Convert a nested list of boundaries to a list of Polygons.
- Sort arbitrary points in clockwise order in Cartesian coordinate; you may need to reverse the output sequence if you are using OpenCV's image coordinate.
- Sort box vertices in clockwise order from left-top first.
- Sort vertex with 8 points [x1 y1 x2 y2 x3 y3 x4 y4].
Mask Utils¶
- Fill holes in matrix.
Misc Utils¶
- Check whether x is a 2d list ([[1], []]) or a 1d empty list ([]).
- Check whether x is a 3d list ([[[1], []]]), a 2d empty list ([[], []]), or a 1d empty list ([]).
Join the OpenMMLab Community¶
Scan the QR codes below to follow the official Zhihu account of the OpenMMLab team, join the official OpenMMLab QQ group, join the official WeChat group by adding the WeChat account "Open小喵Lab", or join our Slack community.



In the OpenMMLab community, we
📢 share the cutting-edge core technologies of AI frameworks
💻 walk through the source code of commonly used PyTorch modules
📰 publish OpenMMLab-related news
🚀 introduce state-of-the-art algorithms developed by OpenMMLab
🏃 offer more efficient Q&A and feedback channels
🔥 provide a platform for in-depth exchanges with developers from all industries
Packed with useful content 📘 and waiting for you 💗, the OpenMMLab community looks forward to having you on board 👬