API Reference

mmocr.apis

mmocr.apis.model_inference(model, imgs, batch_mode=False)[source]

Run inference on image(s) with the detector.

Parameters
  • model (nn.Module) – The loaded detector.

  • imgs (str/ndarray or list[str/ndarray] or tuple[str/ndarray]) – Either image files or loaded images.

  • batch_mode (bool) – If True, use batch mode for inference.

Returns

Predicted results.

Return type

result (dict)
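
A minimal usage sketch (the config and checkpoint paths are placeholders; init_detector is the companion loader used by MMOCR's demo scripts):

    from mmocr.apis import init_detector, model_inference

    # Placeholder paths; substitute your own config and checkpoint.
    config_file = 'configs/textdet/psenet/psenet_r50_fpnf_600e_icdar2015.py'
    checkpoint_file = 'psenet_r50_fpnf_600e_icdar2015.pth'

    model = init_detector(config_file, checkpoint_file, device='cpu')

    # A single image yields a single result dict; pass a list of images
    # with batch_mode=True to run batched inference.
    result = model_inference(model, 'demo/demo_text_det.jpg')
    print(type(result))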

mmocr.core

evaluation

mmocr.core.evaluation.eval_hmean_ic13(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]

Evaluate the hmean of text detection using the ICDAR 2013 standard.

Parameters
  • det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.

  • gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.

  • gt_ignored_boxes (list[list[list[float]]]) – List of arrays of shape (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.

  • precision_thr (float) – Precision threshold on the IoU of one (gt_box, det_box) pair.

  • recall_thr (float) – Recall threshold on the IoU of one (gt_box, det_box) pair.

  • center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.

  • one2one_score (float) – Reward when one gt matches one det_box.

  • one2many_score (float) – Reward when one gt matches many det_boxes.

  • many2one_score (float) – Reward when many gts match one det_box.

Returns

Tuple of dicts encoding the hmean for the dataset and all images.

Return type

hmean (tuple[dict])
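
A hedged toy example (one image with quadrilateral boxes, so k=4; the exact keys of the returned dicts depend on the implementation):

    from mmocr.core.evaluation import eval_hmean_ic13

    # One image; each box is a flat polygon [x1, y1, ..., x4, y4].
    det_boxes = [[[0, 0, 10, 0, 10, 10, 0, 10]]]
    gt_boxes = [[[1, 1, 9, 1, 9, 9, 1, 9]]]
    gt_ignored_boxes = [[]]  # no ignored ground truth for this image

    dataset_res, img_res = eval_hmean_ic13(det_boxes, gt_boxes,
                                           gt_ignored_boxes)
    print(dataset_res)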

mmocr.core.evaluation.eval_hmean_iou(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]

Evaluate the hmean of text detection using the IoU standard.

Parameters
  • pred_boxes (list[list[list[float]]]) – Predicted text boxes for a list of images. Each box has 2k (>=8) values.

  • gt_boxes (list[list[list[float]]]) – Ground truth text boxes for a list of images. Each box has 2k (>=8) values.

  • gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for a list of images. Each box has 2k (>=8) values.

  • iou_thr (float) – IoU threshold for matching one (gt_box, det_box) pair.

  • precision_thr (float) – Precision threshold for matching one (gt_box, det_box) pair.

Returns

Tuple of dicts indicating the hmean for the dataset and all images.

Return type

hmean (tuple[dict])
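
The same toy setup works here (a sketch; the keys of the returned dicts depend on the implementation):

    from mmocr.core.evaluation import eval_hmean_iou

    # One image; each box has 2k (>=8) values, a quadrilateral here.
    pred_boxes = [[[0, 0, 10, 0, 10, 10, 0, 10]]]
    gt_boxes = [[[1, 1, 9, 1, 9, 9, 1, 9]]]
    gt_ignored_boxes = [[]]

    dataset_res, img_res = eval_hmean_iou(pred_boxes, gt_boxes,
                                          gt_ignored_boxes)
    print(dataset_res)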

mmocr.core.evaluation.eval_ocr_metric(pred_texts, gt_texts)[source]

Evaluate the text recognition performance with two metrics: word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.

Parameters
  • pred_texts (list[str]) – Text strings of prediction.

  • gt_texts (list[str]) – Text strings of ground truth.

Returns

Metric dict for text recognition, including:
  • word_acc: Accuracy at the word level.

  • word_acc_ignore_case: Accuracy at the word level, ignoring letter case.

  • word_acc_ignore_case_symbol: Accuracy at the word level, ignoring letter case and symbols (the default metric for academic evaluation).

  • char_recall: Recall at the character level, ignoring letter case and symbols.

  • char_precision: Precision at the character level, ignoring letter case and symbols.

  • 1-N.E.D: 1 - normalized edit distance.

Return type

eval_res (dict[str, float])
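
A short example; the expected values follow the metric definitions above:

    from mmocr.core.evaluation import eval_ocr_metric

    pred_texts = ['hello', 'World!']
    gt_texts = ['hello', 'world']

    metrics = eval_ocr_metric(pred_texts, gt_texts)
    # 'World!' only matches 'world' once case and symbols are ignored,
    # so word_acc should be 0.5 and word_acc_ignore_case_symbol 1.0.
    print(metrics)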

mmocr.core.evaluation.eval_hmean(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]

Evaluation with the hmean metric.

Parameters
  • results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result

  • img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width

  • ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore

  • metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}.

  • score_thr (float) – Score threshold of the prediction map.

Returns

The evaluation results.

Return type

dict[str, float]
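
A structural sketch of the inputs, built from the parameter docs above (the boundary layout, a flat polygon followed by a confidence score, is an assumption consistent with score_thr):

    from mmocr.core.evaluation import eval_hmean

    results = [{'boundary_result': [[0, 0, 10, 0, 10, 10, 0, 10, 0.9]]}]
    img_infos = [{'filename': 'sample.jpg', 'height': 100, 'width': 100}]
    ann_infos = [{'masks': [[[1, 1, 9, 1, 9, 9, 1, 9]]],
                  'masks_ignore': []}]

    eval_res = eval_hmean(results, img_infos, ann_infos,
                          metrics={'hmean-iou'}, score_thr=0.3)
    print(eval_res)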

mmocr.core.evaluation.compute_f1_score(preds, gts, ignores=[])[source]

Compute the F1-score of prediction.

Parameters
  • preds (Tensor) – The predicted probability map of shape (N, C), where N and C are the sample number and the class number, respectively.

  • gts (Tensor) – The ground truth vector of size N.

  • ignores – The index set of classes that are ignored when reporting results. Note: all samples still participate in the computation.
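
A toy call (assuming the function returns the per-class F1 values for the non-ignored classes):

    import torch
    from mmocr.core.evaluation import compute_f1_score

    preds = torch.tensor([[0.9, 0.1],   # predicted class 0
                          [0.2, 0.8],   # predicted class 1
                          [0.6, 0.4]])  # predicted class 0
    gts = torch.tensor([0, 1, 1])

    print(compute_f1_score(preds, gts, ignores=[]))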

mmocr.core.evaluation.eval_ner_f1(results, gt_infos)[source]

Evaluate the NER task.

Parameters
  • results (list) – Predict results of entities.

  • gt_infos (list[dict]) – Ground-truth information, which contains text and label.

Returns

Precision, recall, and f1-score, both overall and per category.

Return type

class_info (dict)

mmocr.utils

mmocr.utils.get_root_logger(log_file=None, log_level=20)[source]

Use the get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized already. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmocr”.

Parameters
  • log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.

  • log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger
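
A short usage sketch (the log file name is a placeholder; 20 equals logging.INFO):

    import logging
    from mmocr.utils import get_root_logger

    # Logs to stdout and, because log_file is given, to disk as well.
    logger = get_root_logger(log_file='run.log', log_level=logging.INFO)
    logger.info('Starting training')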

mmocr.utils.collect_env()[source]

Collect information about the running environment.

mmocr.utils.drop_orientation(img_file)[source]

Check whether the image has orientation information. If it does, drop the orientation by converting the image to PNG format and return the new filename; otherwise, return the original filename.

Parameters

img_file (str) – The image path

Returns

The converted image filename with proper postfix

mmocr.utils.convert_annotations(image_infos, out_json_name)[source]

Convert the annotations into COCO style.

Parameters
  • image_infos (list) – The list of image information dicts

  • out_json_name (str) – The output json filename

Returns

The coco style dict

Return type

out_json (dict)

mmocr.utils.is_not_png(img_file)[source]

Check whether img_file is not a PNG image.

Parameters

img_file (str) – The input image file name

Returns

The bool flag indicating whether the image is not a PNG.
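
A sketch combining the two utilities above (the image path is a placeholder):

    from mmocr.utils import drop_orientation, is_not_png

    img_file = 'data/imgs/sample.jpg'
    if is_not_png(img_file):
        # Re-encodes to PNG (and returns the new filename) only if the
        # image carries orientation information.
        img_file = drop_orientation(img_file)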

mmocr.models

common_backbones

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None)[source]

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

Parameters
  • in_channels (int) – Number of input image channels. Default: 3.

  • base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.

  • num_stages (int) – Number of stages in encoder, normally 5. Default: 5.

  • strides (Sequence[int 1 | 2]) – Strides of each stage in the encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in the encoder is 1. If strides[i]=2, strided convolution is used to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).

  • enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).

  • dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).

  • downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of the encoder (stages: [1, num_stages)). If the corresponding encoder stage uses strided convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).

  • enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).

  • dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).

  • with_cp (bool) – Whether to use checkpointing. Checkpointing saves some memory at the cost of slower training speed. Default: False.

  • conv_cfg (dict | None) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.

  • plugins (dict) – plugins for convolutional layers. Default: None.

Notice:

The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

train(mode=True)[source]

Convert the model into training mode while keeping the normalization layers frozen.
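
A minimal forward-pass sketch (assuming the MMSegmentation-style forward, which returns a list of decoder feature maps; with the default strides/downsamples the whole downsample rate is 16):

    import torch
    from mmocr.models.common.backbones import UNet

    model = UNet(in_channels=3, base_channels=16, num_stages=5)
    model.init_weights()
    model.eval()

    # H and W must be divisible by the whole downsample rate (16 here).
    x = torch.randn(1, 3, 64, 64)
    with torch.no_grad():
        outs = model(x)
    print([o.shape for o in outs])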

class mmocr.models.common.losses.DiceLoss(eps=1e-06)[source]

class mmocr.models.common.losses.FocalLoss(gamma=2, weight=None, ignore_index=-100)[source]

Multi-class Focal loss implementation.

Parameters
  • gamma (float) – The larger the gamma, the smaller the loss weight of easier samples.

  • weight (float) – A manual rescaling weight given to each class.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
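
A short sketch (assuming the loss is called with raw logits and integer targets, as with torch.nn.CrossEntropyLoss):

    import torch
    from mmocr.models.common.losses import FocalLoss

    criterion = FocalLoss(gamma=2, ignore_index=-100)

    logits = torch.randn(4, 10)           # 4 samples, 10 classes
    targets = torch.tensor([1, 3, 3, 7])  # class indices

    loss = criterion(logits, targets)
    print(loss.item())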

textdet_dense_heads

textdet_necks

textdet_detectors

textdet_losses

textdet_postprocess

textrecog_recognizer

textrecog_backbones

textrecog_necks

textrecog_heads

textrecog_convertors

textrecog_encoders

textrecog_decoders

textrecog_losses

textrecog_backbones

textrecog_layers

kie_extractors

class mmocr.models.kie.extractors.SDMGR(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, pretrained=None, class_list=None)[source]

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

Parameters
  • visual_modality (bool) – Whether to use the visual modality.

  • class_list (None | str) – Mapping file from class indices to class names. If None, class indices are shown in show_result; otherwise, class names are shown.

forward_train(img, img_metas, relations, texts, gt_bboxes, gt_labels)[source]

Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.

  • relations (list[tensor]) – Relations between bboxes.

  • texts (list[tensor]) – Texts in bboxes.

  • gt_bboxes (list[tensor]) – Each item contains the ground-truth boxes for one image, in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[tensor]) – Class indices corresponding to each box.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

show_result(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • boxes (list) – Bbox of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

The image with the result drawn on it, returned only if show is False and out_file is not specified.

Return type

img (tensor)

kie_heads

class mmocr.models.kie.heads.SDMGRHead(num_chars=92, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, num_classes=26, loss={'type': 'SDMGRLoss'}, bidirectional=False, train_cfg=None, test_cfg=None)[source]

kie_losses

class mmocr.models.kie.losses.SDMGRLoss(node_weight=1.0, edge_weight=1.0, ignore=0)[source]

The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470

mmocr.datasets

datasets

class mmocr.datasets.base_dataset.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows. The annotations field is optional for testing (this is one line of anno_file, with the line json string converted to a dict for visualization only):

    {
        "file_name": "sample.jpg",
        "height": 1080,
        "width": 960,
        "annotations": [
            {
                "iscrowd": 0,
                "category_id": 1,
                "bbox": [357.0, 667.0, 804.0, 100.0],
                "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
            }
        ]
    }

  2. The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 fields are used for online crop augmentation during training:

    format1: sample.jpg hello
    format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set to True, the try-except block in __getitem__ is turned off.
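
A construction sketch; the paths are placeholders, and the HardDiskLoader/LineJsonParser loader config follows MMOCR's registry conventions but should be checked against your version:

    from mmocr.datasets.base_dataset import BaseDataset

    loader = dict(
        type='HardDiskLoader',
        repeat=1,
        parser=dict(
            type='LineJsonParser',
            keys=['file_name', 'height', 'width', 'annotations']))

    dataset = BaseDataset(
        ann_file='data/train_labels.txt',  # one json line per image
        loader=loader,
        pipeline=[],                       # processing pipeline config
        img_prefix='data/imgs')
    print(len(dataset))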

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

The evaluation results.

Return type

dict[str, float]

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

img_info (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by the pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by the pipeline.

Return type

dict

class mmocr.datasets.icdar_dataset.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]

evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – JSON file used to save the evaluation result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str, float]]

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from the COCO API.

Return type

list[dict]
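
A construction-and-evaluation sketch (paths are placeholders for a COCO-style ICDAR annotation file):

    from mmocr.datasets.icdar_dataset import IcdarDataset

    dataset = IcdarDataset(
        ann_file='data/icdar2015/instances_test.json',
        pipeline=[],
        img_prefix='data/icdar2015/imgs',
        test_mode=True)

    # `results` holds one dict per image with a 'boundary_result' key,
    # as produced by a text detector's test loop:
    # eval_res = dataset.evaluate(results, metric='hmean-iou')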

pipelines