API Reference

mmocr.apis

mmocr.apis.model_inference(model, imgs, batch_mode=False)[source]

Run inference on image(s) with the detector.

Parameters
  • model (nn.Module) – The loaded detector.

  • imgs (str/ndarray or list[str/ndarray] or tuple[str/ndarray]) – Either image files or loaded images.

  • batch_mode (bool) – If True, use batch mode for inference.

Returns

Predicted results.

Return type

result (dict)
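
A minimal usage sketch (the config and checkpoint paths are placeholders; init_detector is the companion loader used by MMOCR's demo scripts):

    from mmocr.apis import init_detector, model_inference

    # Placeholder paths; substitute your own config and checkpoint.
    config_file = 'configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py'
    checkpoint_file = 'psenet_r50_fpnf_600e_icdar2015.pth'

    model = init_detector(config_file, checkpoint_file, device='cpu')

    # A single image yields a single result dict; pass a list of images
    # with batch_mode=True to run batched inference.
    result = model_inference(model, 'demo/demo_text_det.jpg')
    print(type(result))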

mmocr.core

evaluation

mmocr.core.evaluation.eval_hmean_ic13(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]

Evaluate the hmean of text detection using the ICDAR 2013 standard.

Parameters
  • det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.

  • gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.

  • gt_ignored_boxes (list[list[list[float]]]) – List of arrays of shape (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.

  • precision_thr (float) – Precision threshold on the IoU of one (gt_box, det_box) pair.

  • recall_thr (float) – Recall threshold on the IoU of one (gt_box, det_box) pair.

  • center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.

  • one2one_score (float) – Reward when one gt matches one det_box.

  • one2many_score (float) – Reward when one gt matches many det_boxes.

  • many2one_score (float) – Reward when many gts match one det_box.

Returns

Tuple of dicts encoding the hmean for the dataset and all images.

Return type

hmean (tuple[dict])
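
A hedged toy example (one image with quadrilateral boxes, so k=4; the exact keys of the returned dicts depend on the implementation):

    from mmocr.core.evaluation import eval_hmean_ic13

    # One image; each box is a flat polygon [x1, y1, ..., x4, y4].
    det_boxes = [[[0, 0, 10, 0, 10, 10, 0, 10]]]
    gt_boxes = [[[1, 1, 9, 1, 9, 9, 1, 9]]]
    gt_ignored_boxes = [[]]  # no ignored ground truth for this image

    dataset_res, img_res = eval_hmean_ic13(det_boxes, gt_boxes,
                                           gt_ignored_boxes)
    print(dataset_res)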

mmocr.core.evaluation.eval_hmean_iou(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]

Evaluate the hmean of text detection using the IoU standard.

Parameters
  • pred_boxes (list[list[list[float]]]) – Predicted text boxes for a list of images. Each box has 2k (>=8) values.

  • gt_boxes (list[list[list[float]]]) – Ground truth text boxes for a list of images. Each box has 2k (>=8) values.

  • gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for a list of images. Each box has 2k (>=8) values.

  • iou_thr (float) – IoU threshold for matching one (gt_box, det_box) pair.

  • precision_thr (float) – Precision threshold for matching one (gt_box, det_box) pair.

Returns

Tuple of dicts indicating the hmean for the dataset and all images.

Return type

hmean (tuple[dict])
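
The same toy setup works here (a sketch; the keys of the returned dicts depend on the implementation):

    from mmocr.core.evaluation import eval_hmean_iou

    # One image; each box has 2k (>=8) values, a quadrilateral here.
    pred_boxes = [[[0, 0, 10, 0, 10, 10, 0, 10]]]
    gt_boxes = [[[1, 1, 9, 1, 9, 9, 1, 9]]]
    gt_ignored_boxes = [[]]

    dataset_res, img_res = eval_hmean_iou(pred_boxes, gt_boxes,
                                          gt_ignored_boxes)
    print(dataset_res)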

mmocr.core.evaluation.eval_ocr_metric(pred_texts, gt_texts)[source]

Evaluate the text recognition performance with two metrics: word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.

Parameters
  • pred_texts (list[str]) – Text strings of prediction.

  • gt_texts (list[str]) – Text strings of ground truth.

Returns

Metric dict for text recognition, including:
  • word_acc: Accuracy at the word level.

  • word_acc_ignore_case: Accuracy at the word level, ignoring letter case.

  • word_acc_ignore_case_symbol: Accuracy at the word level, ignoring letter case and symbols (the default metric for academic evaluation).

  • char_recall: Recall at the character level, ignoring letter case and symbols.

  • char_precision: Precision at the character level, ignoring letter case and symbols.

  • 1-N.E.D: 1 - normalized edit distance.

Return type

eval_res (dict[str, float])
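
A short example; the expected values follow the metric definitions above:

    from mmocr.core.evaluation import eval_ocr_metric

    pred_texts = ['hello', 'World!']
    gt_texts = ['hello', 'world']

    metrics = eval_ocr_metric(pred_texts, gt_texts)
    # 'World!' only matches 'world' once case and symbols are ignored,
    # so word_acc should be 0.5 and word_acc_ignore_case_symbol 1.0.
    print(metrics)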

mmocr.core.evaluation.eval_hmean(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]

Evaluation with the hmean metric.

Parameters
  • results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result

  • img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width

  • ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore

  • metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}.

  • score_thr (float) – Score threshold of the prediction map.

Returns

The evaluation results.

Return type

dict[str, float]
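
A structural sketch of the inputs, built from the parameter docs above (the boundary layout, a flat polygon followed by a confidence score, is an assumption consistent with score_thr):

    from mmocr.core.evaluation import eval_hmean

    results = [{'boundary_result': [[0, 0, 10, 0, 10, 10, 0, 10, 0.9]]}]
    img_infos = [{'filename': 'sample.jpg', 'height': 100, 'width': 100}]
    ann_infos = [{'masks': [[[1, 1, 9, 1, 9, 9, 1, 9]]],
                  'masks_ignore': []}]

    eval_res = eval_hmean(results, img_infos, ann_infos,
                          metrics={'hmean-iou'}, score_thr=0.3)
    print(eval_res)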

mmocr.core.evaluation.compute_f1_score(preds, gts, ignores=[])[source]

Compute the F1-score of prediction.

Parameters
  • preds (Tensor) – The predicted probability map of shape (N, C), where N and C are the sample number and the class number, respectively.

  • gts (Tensor) – The ground truth vector of size N.

  • ignores – The index set of classes that are ignored when reporting results. Note: all samples still participate in the computation.
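
A toy call (assuming the function returns the per-class F1 values for the non-ignored classes):

    import torch
    from mmocr.core.evaluation import compute_f1_score

    preds = torch.tensor([[0.9, 0.1],   # predicted class 0
                          [0.2, 0.8],   # predicted class 1
                          [0.6, 0.4]])  # predicted class 0
    gts = torch.tensor([0, 1, 1])

    print(compute_f1_score(preds, gts, ignores=[]))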

mmocr.core.evaluation.eval_ner_f1(results, gt_infos)[source]

Evaluate the NER task.

Parameters
  • results (list) – Predict results of entities.

  • gt_infos (list[dict]) – Ground-truth information, which contains text and label.

Returns

Precision, recall, and f1-score, both overall and per category.

Return type

class_info (dict)

mmocr.utils

mmocr.utils.get_root_logger(log_file=None, log_level=20)[source]

Use the get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized already. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmocr”.

Parameters
  • log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.

  • log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger
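
A short usage sketch (the log file name is a placeholder; 20 equals logging.INFO):

    import logging
    from mmocr.utils import get_root_logger

    # Logs to stdout and, because log_file is given, to disk as well.
    logger = get_root_logger(log_file='run.log', log_level=logging.INFO)
    logger.info('Starting training')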

mmocr.utils.collect_env()[source]

Collect information about the running environment.

mmocr.utils.drop_orientation(img_file)[source]

Check whether the image has orientation information. If it does, drop the orientation by converting the image to PNG format and return the new filename; otherwise, return the original filename.

Parameters

img_file (str) – The image path

Returns

The converted image filename with proper postfix

mmocr.utils.convert_annotations(image_infos, out_json_name)[source]

Convert the annotations into COCO style.

Parameters
  • image_infos (list) – The list of image information dicts

  • out_json_name (str) – The output json filename

Returns

The coco style dict

Return type

out_json (dict)

mmocr.utils.is_not_png(img_file)[source]

Check whether img_file is not a PNG image.

Parameters

img_file (str) – The input image file name

Returns

The bool flag indicating whether the image is not a PNG.
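
A sketch combining the two utilities above (the image path is a placeholder):

    from mmocr.utils import drop_orientation, is_not_png

    img_file = 'data/imgs/sample.jpg'
    if is_not_png(img_file):
        # Re-encodes to PNG (and returns the new filename) only if the
        # image carries orientation information.
        img_file = drop_orientation(img_file)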

mmocr.models

common_backbones

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None)[source]

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

Parameters
  • in_channels (int) – Number of input image channels. Default: 3.

  • base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.

  • num_stages (int) – Number of stages in encoder, normally 5. Default: 5.

  • strides (Sequence[int 1 | 2]) – Strides of each stage in the encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in the encoder is 1. If strides[i]=2, strided convolution is used to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).

  • enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).

  • dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).

  • downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of the encoder (stages: [1, num_stages)). If the corresponding encoder stage uses strided convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).

  • enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).

  • dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).

  • with_cp (bool) – Whether to use checkpointing. Checkpointing saves some memory at the cost of slower training speed. Default: False.

  • conv_cfg (dict | None) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.

  • plugins (dict) – plugins for convolutional layers. Default: None.

Notice:

The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

train(mode=True)[source]

Convert the model into training mode while keeping the normalization layers frozen.
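
A minimal forward-pass sketch (assuming the MMSegmentation-style forward, which returns a list of decoder feature maps; with the default strides/downsamples the whole downsample rate is 16):

    import torch
    from mmocr.models.common.backbones import UNet

    model = UNet(in_channels=3, base_channels=16, num_stages=5)
    model.init_weights()
    model.eval()

    # H and W must be divisible by the whole downsample rate (16 here).
    x = torch.randn(1, 3, 64, 64)
    with torch.no_grad():
        outs = model(x)
    print([o.shape for o in outs])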

class mmocr.models.common.losses.DiceLoss(eps=1e-06)[source]

class mmocr.models.common.losses.FocalLoss(gamma=2, weight=None, ignore_index=-100)[source]

Multi-class Focal loss implementation.

Parameters
  • gamma (float) – The larger the gamma, the smaller the loss weight of easier samples.

  • weight (float) – A manual rescaling weight given to each class.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
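
A short sketch (assuming the loss is called with raw logits and integer targets, as with torch.nn.CrossEntropyLoss):

    import torch
    from mmocr.models.common.losses import FocalLoss

    criterion = FocalLoss(gamma=2, ignore_index=-100)

    logits = torch.randn(4, 10)           # 4 samples, 10 classes
    targets = torch.tensor([1, 3, 3, 7])  # class indices

    loss = criterion(logits, targets)
    print(loss.item())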

textdet_dense_heads

textdet_necks

textdet_detectors

textdet_losses

textdet_postprocess

textrecog_recognizer

textrecog_backbones

textrecog_necks

textrecog_heads

textrecog_convertors

textrecog_encoders

textrecog_decoders

textrecog_losses

textrecog_backbones

textrecog_layers

kie_extractors

class mmocr.models.kie.extractors.SDMGR(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, pretrained=None, class_list=None)[source]

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

Parameters
  • visual_modality (bool) – Whether to use the visual modality.

  • class_list (None | str) – Mapping file from class indices to class names. If None, class indices are shown in show_result; otherwise, class names are shown.

forward_train(img, img_metas, relations, texts, gt_bboxes, gt_labels)[source]

Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.

  • relations (list[tensor]) – Relations between bboxes.

  • texts (list[tensor]) – Texts in bboxes.

  • gt_bboxes (list[tensor]) – Each item contains the ground-truth boxes for one image, in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[tensor]) – Class indices corresponding to each box.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

show_result(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • boxes (list) – Bbox of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

The image with the result drawn on it, returned only if show is False and out_file is not specified.

Return type

img (tensor)

kie_heads

class mmocr.models.kie.heads.SDMGRHead(num_chars=92, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, num_classes=26, loss={'type': 'SDMGRLoss'}, bidirectional=False, train_cfg=None, test_cfg=None)[source]

kie_losses

class mmocr.models.kie.losses.SDMGRLoss(node_weight=1.0, edge_weight=1.0, ignore=0)[source]

The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470

mmocr.datasets

datasets

class mmocr.datasets.base_dataset.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows. The annotations field is optional for testing (this is one line of anno_file, with the line json string converted to a dict for visualization only):

    {
        "file_name": "sample.jpg",
        "height": 1080,
        "width": 960,
        "annotations": [
            {
                "iscrowd": 0,
                "category_id": 1,
                "bbox": [357.0, 667.0, 804.0, 100.0],
                "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
            }
        ]
    }

  2. The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 fields are used for online crop augmentation during training:

    format1: sample.jpg hello
    format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set to True, the try-except block in __getitem__ is turned off.
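
A construction sketch; the paths are placeholders, and the HardDiskLoader/LineJsonParser loader config follows MMOCR's registry conventions but should be checked against your version:

    from mmocr.datasets.base_dataset import BaseDataset

    loader = dict(
        type='HardDiskLoader',
        repeat=1,
        parser=dict(
            type='LineJsonParser',
            keys=['file_name', 'height', 'width', 'annotations']))

    dataset = BaseDataset(
        ann_file='data/train_labels.txt',  # one json line per image
        loader=loader,
        pipeline=[],                       # processing pipeline config
        img_prefix='data/imgs')
    print(len(dataset))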

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

The evaluation results.

Return type

dict[str, float]

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

img_info (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by the pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by the pipeline.

Return type

dict

class mmocr.datasets.icdar_dataset.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]

evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – JSON file used to save the evaluation result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str, float]]

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from the COCO API.

Return type

list[dict]
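
A construction-and-evaluation sketch (paths are placeholders for a COCO-style ICDAR annotation file):

    from mmocr.datasets.icdar_dataset import IcdarDataset

    dataset = IcdarDataset(
        ann_file='data/icdar2015/instances_test.json',
        pipeline=[],
        img_prefix='data/icdar2015/imgs',
        test_mode=True)

    # `results` holds one dict per image with a 'boundary_result' key,
    # as produced by a text detector's test loop:
    # eval_res = dataset.evaluate(results, metric='hmean-iou')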

pipelines