API Reference

mmocr.apis

mmocr.apis.model_inference(model, imgs, batch_mode=False)[source]

Inference image(s) with the detector.

Parameters
  • model (nn.Module) – The loaded detector.

  • imgs (str/ndarray or list[str/ndarray] or tuple[str/ndarray]) – Either image files or loaded images.

  • batch_mode (bool) – If True, use batch mode for inference.

Returns

Predicted results.

Return type

result (dict)

mmocr.core

evaluation

mmocr.core.evaluation.eval_hmean_ic13(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]

Evalute hmean of text detection using the icdar2013 standard.

Parameters
  • det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.

  • gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.

  • gt_ignored_boxes (list[list[list[float]]]) – List of arrays of (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.

  • precision_thr (float) – Precision threshold of the iou of one (gt_box, det_box) pair.

  • recall_thr (float) – Recall threshold of the iou of one (gt_box, det_box) pair.

  • center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.

  • one2one_score (float) – Reward when one gt matches one det_box.

  • one2many_score (float) – Reward when one gt matches many det_boxes.

  • many2one_score (float) – Reward when many gts match one det_box.

Returns

Tuple of dicts which encodes the hmean for the dataset and all images.

Return type

hmean (tuple[dict])

mmocr.core.evaluation.eval_hmean_iou(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]

Evalute hmean of text detection using IOU standard.

Parameters
  • pred_boxes (list[list[list[float]]]) – Text boxes for an img list. Each box has 2k (>=8) values.

  • gt_boxes (list[list[list[float]]]) – Ground truth text boxes for an img list. Each box has 2k (>=8) values.

  • gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for an img list. Each box has 2k (>=8) values.

  • iou_thr (float) – Iou threshold when one (gt_box, det_box) pair is matched.

  • precision_thr (float) – Precision threshold when one (gt_box, det_box) pair is matched.

Returns

Tuple of dicts indicates the hmean for the dataset

and all images.

Return type

hmean (tuple[dict])

mmocr.core.evaluation.eval_ocr_metric(pred_texts, gt_texts)[source]

Evaluate the text recognition performance with metric: word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.

Parameters
  • pred_texts (list[str]) – Text strings of prediction.

  • gt_texts (list[str]) – Text strings of ground truth.

Returns

float]): Metric dict for text recognition, include:
  • word_acc: Accuracy in word level.

  • word_acc_ignore_case: Accuracy in word level, ignore letter case.

  • word_acc_ignore_case_symbol: Accuracy in word level, ignore

    letter case and symbol. (default metric for academic evaluation)

  • char_recall: Recall in character level, ignore

    letter case and symbol.

  • char_precision: Precision in character level, ignore

    letter case and symbol.

  • 1-N.E.D: 1 - normalized_edit_distance.

Return type

eval_res (dict[str

mmocr.core.evaluation.eval_hmean(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]

Evaluation in hmean metric.

Parameters
  • results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result

  • img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width

  • ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore

  • score_thr (float) – Score threshold of prediction map.

  • metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}

Returns

float]

Return type

dict[str

mmocr.core.evaluation.compute_f1_score(preds, gts, ignores=[])[source]

Compute the F1-score of prediction.

Parameters
  • preds (Tensor) – The predicted probability NxC map with N and C being the sample number and class number respectively.

  • gts (Tensor) – The ground truth vector of size N.

  • ignores – The index set of classes that are ignored when reporting results. Note: all samples are participated in computing.

mmocr.utils

mmocr.utils.get_root_logger(log_file=None, log_level=20)[source]

Use get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmpose”.

Parameters
  • log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.

  • log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger

mmocr.utils.collect_env()[source]

Collect the information of the running environments.

mmocr.utils.drop_orientation(img_file)[source]

Check if the image has orientation information. If yes, ignore it by converting the image format to png, and return new filename, otherwise return the original filename.

Parameters

img_file (str) – The image path

Returns

The converted image filename with proper postfix

mmocr.utils.convert_annotations(image_infos, out_json_name)[source]

Convert the annotation into coco style.

Parameters
  • image_infos (list) – The list of image information dicts

  • out_json_name (str) – The output json filename

Returns

The coco style dict

Return type

out_json(dict)

mmocr.utils.is_not_png(img_file)[source]

Check img_file is not png image.

Parameters

img_file (str) – The input image file name

Returns

The bool flag indicating whether it is not png

mmocr.models

common_backbones

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None)[source]

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

Parameters
  • in_channels (int) – Number of input image channels. Default” 3.

  • base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.

  • num_stages (int) – Number of stages in encoder, normally 5. Default: 5.

  • strides (Sequence[int 1 | 2]) – Strides of each stage in encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in encoder is 1. If strides[i]=2, it uses stride convolution to downsample in the correspondence encoder stage. Default: (1, 1, 1, 1, 1).

  • enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the correspondence encoder stage. Default: (2, 2, 2, 2, 2).

  • dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the correspondence decoder stage. Default: (2, 2, 2, 2).

  • downsamples (Sequence[int]) – Whether use MaxPool to downsample the feature map after the first stage of encoder (stages: [1, num_stages)). If the correspondence encoder stage use stride convolution (strides[i]=2), it will never use MaxPool to downsample, even downsamples[i-1]=True. Default: (True, True, True, True).

  • enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).

  • dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • conv_cfg (dict | None) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.

  • plugins (dict) – plugins for convolutional layers. Default: None.

Notice:

The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.

init_weights(pretrained=None)[source]

Initialize the weights in backbone.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

train(mode=True)[source]

Convert the model into training mode while keep normalization layer freezed.

class mmocr.models.common.losses.DiceLoss(eps=1e-06)[source]

textdet_dense_heads

class mmocr.models.textdet.dense_heads.PSEHead(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PSELoss'}, train_cfg=None, test_cfg=None)[source]

The class for PANet head.

class mmocr.models.textdet.dense_heads.PANHead(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PANLoss'}, train_cfg=None, test_cfg=None)[source]

The class for PANet head.

class mmocr.models.textdet.dense_heads.DBHead(in_channels, with_bias=False, decoding_type='db', text_repr_type='poly', downsample_ratio=1.0, loss={'type': 'DBLoss'}, train_cfg=None, test_cfg=None)[source]

The class for DBNet head.

This was partially adapted from https://github.com/MhLiao/DB

class mmocr.models.textdet.dense_heads.FCEHead(in_channels, scales, fourier_degree=5, sample_num=50, reconstr_points=50, decoding_type='fcenet', loss={'type': 'FCELoss'}, score_thresh=0.3, nms_thresh=0.1, alpha=1.0, beta=1.0, train_cfg=None, test_cfg=None)[source]

The class for implementing FCENet head. FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

Parameters
  • in_channels (int) – The number of input channels.

  • scales (list[int]) – The scale of each layer.

  • fourier_degree (int) – The maximum Fourier transform degree k.

  • sample_num (int) – The sampling points number of regression loss. If it is too small, FCEnet tends to be overfitting.

  • score_thresh (float) – The threshold to filter out the final candidates.

  • nms_thresh (float) – The threshold of nms.

  • alpha (float) – The parameter to calculate final scores. Score_{final} = (Score_{text region} ^ alpha) * (Score{text center region} ^ beta)

  • beta (float) – The parameter to calculate final scores.

get_boundary(score_maps, img_metas, rescale)[source]

Compute text boundaries via post processing.

Parameters
  • score_maps (Tensor) – The text score map.

  • img_metas (dict) – The image meta info.

  • rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.

Returns

The result dict.

Return type

results (dict)

class mmocr.models.textdet.dense_heads.HeadMixin[source]

The head minxin for dbnet and pannet heads.

get_boundary(score_maps, img_metas, rescale)[source]

Compute text boundaries via post processing.

Parameters
  • score_maps (Tensor) – The text score map.

  • img_metas (dict) – The image meta info.

  • rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.

Returns

The result dict.

Return type

results (dict)

loss(pred_maps, **kwargs)[source]

Compute the loss for text detection.

Parameters

pred_maps (tensor) – The input score maps of NxCxHxW.

Returns

The dict for losses.

Return type

losses (dict)

resize_boundary(boundaries, scale_factor)[source]

Rescale boundaries via scale_factor.

Parameters
  • boundaries (list[list[float]]) – The boundary list. Each boundary

  • size 2k+1 with k>=4. (with) –

  • scale_factor (ndarray) – The scale factor of size (4,).

Returns

The scaled boundaries.

Return type

boundaries (list[list[float]])

class mmocr.models.textdet.dense_heads.TextSnakeHead(in_channels, decoding_type='textsnake', text_repr_type='poly', loss={'type': 'TextSnakeLoss'}, train_cfg=None, test_cfg=None)[source]

The class for TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

textdet_necks

class mmocr.models.textdet.necks.FPEM_FFM(in_channels, conv_out=128, fpem_repeat=2, align_corners=False)[source]

This code is from https://github.com/WenmuZhou/PAN.pytorch.

init_weights()[source]

Initialize the weights of FPN module.

class mmocr.models.textdet.necks.FPNF(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', upsample_ratio=1)[source]

FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.

class mmocr.models.textdet.necks.FPNC(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, conv_after_concat=False)[source]

FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.

This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch

init_weights()[source]

Initialize the weights of FPN module.

class mmocr.models.textdet.necks.FPN_UNET(in_channels, out_channels)[source]

The class for implementing DRRG and TextSnake U-Net-like FPN.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection [https://arxiv.org/abs/2003.07493]. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544].

textdet_detectors

class mmocr.models.textdet.detectors.TextDetectorMixin(show_score)[source]

The class for implementing text detector auxiliary methods.

get_boundary(results)[source]

Convert segmentation into text boundaries.

Parameters

results (tuple) – The result tuple. The first element is segmentation while the second is its scores.

Returns

A result dict containing ‘boundary_result’.

Return type

results (dict)

show_result(img, result, score_thr=0.5, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]

Draw result over img.

Parameters
  • img (str or Tensor) – The image to be displayed.

  • result (dict) – The results to draw over img.

  • score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.3.

  • bbox_color (str or tuple or Color) – Color of bbox lines.

  • text_color (str or tuple or Color) – Color of texts.

  • thickness (int) – Thickness of lines.

  • font_scale (float) – Font scales of texts.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The filename to write the image. Default: None.imshow_pred_boundary`

class mmocr.models.textdet.detectors.SingleStageTextDetector(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None)[source]

The class for implementing single stage text detector.

It is the parent class of PANet, PSENet, and DBNet.

forward_train(img, img_metas, **kwargs)[source]
Parameters
  • img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

class mmocr.models.textdet.detectors.OCRMaskRCNN(backbone, rpn_head, roi_head, train_cfg, test_cfg, neck=None, pretrained=None, text_repr_type='quad', show_score=False)[source]

Mask RCNN tailored for OCR.

class mmocr.models.textdet.detectors.DBNet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]

The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.

[https://arxiv.org/abs/1911.08947].

class mmocr.models.textdet.detectors.PANet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]

The class for implementing PANet text detector:

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].

class mmocr.models.textdet.detectors.PSENet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]

The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.

[https://arxiv.org/abs/1806.02559].

class mmocr.models.textdet.detectors.TextSnake(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]

The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

class mmocr.models.textdet.detectors.FCENet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]

The class for implementing FCENet text detector FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text

Detection

[https://arxiv.org/abs/2104.10442]

textdet_losses

class mmocr.models.textdet.losses.PANLoss(alpha=0.5, beta=0.25, delta_aggregation=0.5, delta_discrimination=3, ohem_ratio=3, reduction='mean', speedup_bbox_thr=-1)[source]

The class for implementing PANet loss: Efficient and Accurate Arbitrary- Shaped Text Detection with Pixel Aggregation Network.

[https://arxiv.org/abs/1908.05900]. This was partially adapted from https://github.com/WenmuZhou/PAN.pytorch

aggregation_discrimination_loss(gt_texts, gt_kernels, inst_embeds)[source]

Compute the aggregation and discrimnative losses.

Parameters
  • gt_texts (tensor) – The ground truth text mask of size Nx1xHxW.

  • gt_kernels (tensor) – The ground truth text kernel mask of size Nx1xHxW.

  • inst_embeds (tensor) – The text instance embedding tensor of size Nx4xHxW.

Returns

The aggregation loss before reduction. loss_discrs (tensor): The discriminative loss before reduction.

Return type

loss_aggrs (tensor)

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor size HxW.

Returns
results (list[tensor]): The list of kernel tensors. Each

element is for one kernel level.

forward(preds, downsample_ratio, gt_kernels, gt_mask)[source]

Compute PANet loss.

Parameters
  • preds (tensor) – The output tensor with size of Nx6xHxW.

  • gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.

  • gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask fo one img.

  • downsample_ratio (float) – The downsample ratio between preds and the input img.

Returns

The loss dictionary.

Return type

results (dict)

ohem_batch(text_scores, gt_texts, gt_mask)[source]

OHEM sampling for a batch of imgs.

Parameters
  • text_scores (Tensor) – The text scores of size NxHxW.

  • gt_texts (Tensor) – The gt text masks of size NxHxW.

  • gt_mask (Tensor) – The gt effective mask of size NxHxW.

Returns

The sampled mask of size NxHxW.

Return type

sampled_masks (Tensor)

ohem_img(text_score, gt_text, gt_mask)[source]

Sample the top-k maximal negative samples and all positive samples.

Parameters
  • text_score (Tensor) – The text score with size of HxW.

  • gt_text (Tensor) – The ground truth text mask of HxW.

  • gt_mask (Tensor) – The effective region mask of HxW.

Returns

The sampled pixel mask of size HxW.

Return type

sampled_mask (Tensor)

class mmocr.models.textdet.losses.PSELoss(alpha=0.7, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive')[source]

The class for implementing PSENet loss: Shape Robust Text Detection with Progressive Scale Expansion Network [https://arxiv.org/abs/1806.02559].

This is partially adapted from https://github.com/whai362/PSENet.

forward(score_maps, downsample_ratio, gt_kernels, gt_mask)[source]

Compute PSENet loss.

Parameters
  • score_maps (tensor) – The output tensor with size of Nx6xHxW.

  • gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.

  • gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask fo one img.

  • downsample_ratio (float) – The downsample ratio between score_maps and the input img.

Returns

The loss.

Return type

results (dict)

class mmocr.models.textdet.losses.DBLoss(alpha=1, beta=1, reduction='mean', negative_ratio=3.0, eps=1e-06, bbce_loss=False)[source]

The class for implementing DBNet loss.

This is partially adapted from https://github.com/MhLiao/DB.

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitMasks]) – The BitMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor size of KxHxW with K being the number of kernels.

Returns
result_tensors (list[tensor]): The list of kernel tensors. Each

element is for one kernel level.

forward(preds, downsample_ratio, gt_shrink, gt_shrink_mask, gt_thr, gt_thr_mask)[source]

Compute DBNet loss.

Parameters
  • preds (tensor) – The output tensor with size of Nx3xHxW.

  • downsample_ratio (float) – The downsample ratio for the ground truths.

  • gt_shrink (list[BitmapMasks]) – The mask list with each element being the shrinked text mask for one img.

  • gt_shrink_mask (list[BitmapMasks]) – The effective mask list with each element being the shrinked effective mask for one img.

  • gt_thr (list[BitmapMasks]) – The mask list with each element being the threshold text mask for one img.

  • gt_thr_mask (list[BitmapMasks]) – The effective mask list with each element being the threshold effective mask for one img.

Returns

The dict for dbnet losses with loss_prob,

loss_db and loss_thresh.

Return type

results(dict)

class mmocr.models.textdet.losses.TextSnakeLoss(ohem_ratio=3.0)[source]

The class for implementing TextSnake loss: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544]. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor size HxW.

Returns
results (list[tensor]): The list of kernel tensors. Each

element is for one kernel level.

class mmocr.models.textdet.losses.FCELoss(fourier_degree, sample_num, ohem_ratio=3.0)[source]

The class for implementing FCENet loss FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped

Text Detection

[https://arxiv.org/abs/2104.10442]

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • sample_num (int) – The sampling points number of regression loss. If it is too small, fcenet tends to be overfitting.

  • ohem_ratio (float) – the negative/positive ratio in OHEM.

fourier2poly(real_maps, imag_maps)[source]

Transform Fourier coefficient maps to polygon maps.

Parameters
  • real_maps (tensor) – A map composed of the real parts of the Fourier coefficients, whose shape is (-1, 2k+1)

  • imag_maps (tensor) – A map composed of the imag parts of the Fourier coefficients, whose shape is (-1, 2k+1)

Returns
x_maps (tensor): A map composed of the x value of the polygon

represented by n sample points (xn, yn), whose shape is (-1, n)

y_maps (tensor): A map composed of the y value of the polygon

represented by n sample points (xn, yn), whose shape is (-1, n)

textdet_postprocess

textrecog_recognizer

class mmocr.models.textrecog.recognizer.BaseRecognizer[source]

Base class for text recognition.

abstract aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

Parameters
  • imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[list[dict]]) – The metadata of images.

abstract extract_feat(imgs)[source]

Extract features from images.

forward(img, img_metas, return_loss=True, **kwargs)[source]

Calls either forward_train() or forward_test() depending on whether return_loss is True.

Note that img and img_meta are single-nested (i.e. tensor and list[dict]).

forward_test(imgs, img_metas, **kwargs)[source]
Parameters
  • imgs (tensor | list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[dict] | list[list[dict]]) – The outer list indicates images in a batch.

abstract forward_train(imgs, img_metas, **kwargs)[source]
Parameters
  • img (tensor) – tensors with shape (N, C, H, W). Typically should be mean centered and std scaled.

  • img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, see mmdet.datasets.pipelines.Collect.

  • kwargs (keyword arguments) – Specific to concrete implementation.

init_weights(pretrained=None)[source]

Initialize the weights for detector.

Parameters

pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

show_result(img, result, gt_label='', win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • gt_label (str) – Ground truth label of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

Only if not show or out_file.

Return type

img (tensor)

train_step(data, optimizer)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer update, which are done by an optimizer hook. Note that in some complicated cases or models (e.g. GAN), the whole process (including the back propagation and optimizer update) is also defined by this method.

Parameters
  • data (dict) – The outputs of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars,

num_samples.

  • loss is a tensor for back propagation, which is a

weighted sum of multiple losses. - log_vars contains all the variables to be sent to the logger. - num_samples indicates the batch size used for averaging the logs (Note: for the DDP model, num_samples refers to the batch size for each GPU).

Return type

dict

val_step(data, optimizer)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented by this method, but by an evaluation hook.

class mmocr.models.textrecog.recognizer.CRNNNet(*args: Any, **kwargs: Any)[source]

CTC-loss based recognizer.

class mmocr.models.textrecog.recognizer.SARNet(*args: Any, **kwargs: Any)[source]

Implementation of SAR

class mmocr.models.textrecog.recognizer.NRTR(*args: Any, **kwargs: Any)[source]

Implementation of NRTR

class mmocr.models.textrecog.recognizer.RobustScanner(*args: Any, **kwargs: Any)[source]

Implementation of `RobustScanner.

<https://arxiv.org/pdf/2007.07542.pdf>

textrecog_backbones

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source]
Implement ResNet backbone for text recognition, modified from

ResNet

Parameters
  • base_channels (int) – Number of channels of input image tensor.

  • layers (list[int]) – List of BasicBlock number for each stage.

  • channels (list[int]) – List of out_channels of Conv2d layer.

  • out_indices (None | Sequence[int]) – Indices of output stages.

  • stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

class mmocr.models.textrecog.backbones.VeryDeepVgg(leaky_relu=True, input_channels=3)[source]
Implement VGG-VeryDeep backbone for text recognition, modified from

VGG-VeryDeep

Parameters
  • leaky_relu (bool) – Use leakyRelu or not.

  • input_channels (int) – Number of channels of input image tensor.

class mmocr.models.textrecog.backbones.NRTRModalityTransform(input_channels=3, input_height=32)[source]

textrecog_necks

class mmocr.models.textrecog.necks.FPNOCR(in_channels, out_channels, last_stage_only=True)[source]

FPN-like Network for segmentation based text recognition.

Parameters
  • in_channels (list[int]) – Number of input channels for each scale.

  • out_channels (int) – Number of output channels for each scale.

  • last_stage_only (bool) – If True, output last stage only.

textrecog_heads

class mmocr.models.textrecog.heads.SegHead(in_channels=128, num_classes=37, upsample_param=None)[source]

Head for segmentation based text recognition.

Parameters
  • in_channels (int) – Number of input channels.

  • num_classes (int) – Number of output classes.

  • upsample_param (dict | None) – Config dict for interpolation layer. Default: dict(scale_factor=1.0, mode=’nearest’)

textrecog_convertors

class mmocr.models.textrecog.convertors.BaseConvertor(dict_type='DICT90', dict_file=None, dict_list=None)[source]

Convert between text, index and tensor for text recognize pipeline.

Parameters
  • dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.

  • dict_file (None|str) – Character dict file path. If not none, the dict_file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.

idx2str(indexes)[source]

Convert indexes to text strings.

Parameters

indexes (list[list[int]]) – [[1,2,3,3,4], [5,4,6,3,7]].

Returns

[‘hello’, ‘world’].

Return type

strings (list[str])

num_classes()[source]

Number of output classes.

str2idx(strings)[source]

Convert strings to indexes.

Parameters

strings (list[str]) – [‘hello’, ‘world’].

Returns

[[1,2,3,3,4], [5,4,6,3,7]].

Return type

indexes (list[list[int]])

str2tensor(strings)[source]

Convert text-string to input tensor.

Parameters

strings (list[str]) – [‘hello’, ‘world’].

Returns

[torch.Tensor([1,2,3,3,4]),

torch.Tensor([5,4,6,3,7])].

Return type

tensors (list[torch.Tensor])

tensor2idx(output)[source]

Convert model output tensor to character indexes and scores. :param output: The model outputs with size: N * T * C :type output: tensor

Returns

[[1,2,3,3,4], [5,4,6,3,7]]. scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94],

[0.9,0.9,0.98,0.97,0.96]].

Return type

indexes (list[list[int]])

class mmocr.models.textrecog.convertors.CTCConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]

Convert between text, index and tensor for CTC loss-based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.

  • dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • lower (bool) – If True, convert original string to lower case.

str2tensor(strings)[source]

Convert text-string to ctc-loss input tensor.

Parameters

strings (list[str]) – [‘hello’, ‘world’].

Returns

tensor | list[tensor]):
tensors (list[tensor]): [torch.Tensor([1,2,3,3,4]),

torch.Tensor([5,4,6,3,7])].

flatten_targets (tensor): torch.Tensor([1,2,3,3,4,5,4,6,3,7]). target_lengths (tensor): torch.IntTensot([5,5]).

Return type

dict (str

tensor2idx(output, img_metas, topk=1, return_topk=False)[source]

Convert model output tensor to index-list. :param output: The model outputs with size: N * T * C. :type output: tensor :param img_metas: Each dict contains one image info. :type img_metas: list[dict] :param topk: The highest k classes to be returned. :type topk: int :param return_topk: Whether to return topk or just top1. :type return_topk: bool

Returns

[[1,2,3,3,4], [5,4,6,3,7]]. scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94],

[0.9,0.9,0.98,0.97,0.96]] (

indexes_topk (list[list[list[int]->len=topk]]): scores_topk (list[list[list[float]->len=topk]])

).

Return type

indexes (list[list[int]])

class mmocr.models.textrecog.convertors.AttnConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, max_seq_len=40, lower=False, start_end_same=True, **kwargs)[source]

Convert between text, index and tensor for encoder-decoder based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be one of {‘DICT36’, ‘DICT90’}.

  • dict_file (None|str) – Character dict file path. If not none, higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • max_seq_len (int) – Maximum sequence length of label.

  • lower (bool) – If True, convert original string to lower case.

  • start_end_same (bool) – Whether use the same index for start and end token or not. Default: True.

str2tensor(strings)[source]

Convert text-string into tensor. :param strings: [‘hello’, ‘world’] :type strings: list[str]

Returns

Tensor | list[tensor]):
tensors (list[Tensor]): [torch.Tensor([1,2,3,3,4]),

torch.Tensor([5,4,6,3,7])]

padded_targets (Tensor(bsz * max_seq_len))

Return type

dict (str

tensor2idx(outputs, img_metas=None)[source]

Convert output tensor to text-index :param outputs: model outputs with size: N * T * C :type outputs: tensor :param img_metas: Each dict contains one image info. :type img_metas: list[dict]

Returns

[[1,2,3,3,4], [5,4,6,3,7]] scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94],

[0.9,0.9,0.98,0.97,0.96]]

Return type

indexes (list[list[int]])

class mmocr.models.textrecog.convertors.SegConvertor(dict_type='DICT36', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]

Convert between text, index and tensor for segmentation based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.

  • dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list

  • of higher priority than dict_type, but lower than dict_file. (is) –

  • with_unknown (bool) – If True, add UKN token to class.

  • lower (bool) – If True, convert original string to lower case.

tensor2str(output, img_metas=None)[source]

Convert model output tensor to string labels. :param output: Model outputs with size: N * C * H * W :type output: tensor :param img_metas: Each dict contains one image info. :type img_metas: list[dict]

Returns

Decoded text labels. scores (list[list[float]]): Decoded chars scores.

Return type

texts (list[str])

textrecog_encoders

class mmocr.models.textrecog.encoders.SAREncoder(enc_bi_rnn=False, enc_do_rnn=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, **kwargs)[source]

Implementation of encoder module in `SAR.

<https://arxiv.org/abs/1811.00751>`_

Parameters
  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • enc_do_rnn (float) – Dropout probability of RNN layer in encoder.

  • enc_gru (bool) – If True, use GRU, else LSTM in encoder.

  • d_model (int) – Dim of channels from backbone.

  • d_enc (int) – Dim of encoder RNN layer.

  • mask (bool) – If True, mask padding in RNN sequence.

class mmocr.models.textrecog.encoders.TFEncoder(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, **kwargs)[source]

Encode 2d feature map to 1d sequence.

class mmocr.models.textrecog.encoders.BaseEncoder(*args: Any, **kwargs: Any)[source]

Base Encoder class for text recognition.

class mmocr.models.textrecog.encoders.ChannelReductionEncoder(in_channels, out_channels)[source]

textrecog_decoders

class mmocr.models.textrecog.decoders.CRNNDecoder(in_channels=None, num_classes=None, rnn_flag=False, **kwargs)[source]
class mmocr.models.textrecog.decoders.ParallelSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]

Implementation Parallel Decoder module in `SAR.

<https://arxiv.org/abs/1811.00751>`_

Parameters
  • number_classes (int) – Output class number.

  • channels (list[int]) – Network layer channels.

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.

  • dec_do_rnn (float) – Dropout of RNN layer in decoder.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder.

  • d_model (int) – Dim of channels from backbone.

  • d_enc (int) – Dim of encoder RNN layer.

  • d_k (int) – Dim of channels of attention module.

  • pred_dropout (float) – Dropout probability of prediction layer.

  • max_seq_len (int) – Maximum sequence length for decoding.

  • mask (bool) – If True, mask padding in feature map.

  • start_idx (int) – Index of start token.

  • padding_idx (int) – Index of padding token.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

class mmocr.models.textrecog.decoders.SequentialSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]

Implementation Sequential Decoder module in `SAR.

<https://arxiv.org/abs/1811.00751>`_.

Parameters
  • number_classes (int) – Number of output class.

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.

  • dec_do_rnn (float) – Dropout of RNN layer in decoder.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder.

  • d_k (int) – Dim of conv layers in attention module.

  • d_model (int) – Dim of channels from backbone.

  • d_enc (int) – Dim of encoder RNN layer.

  • pred_dropout (float) – Dropout probability of prediction layer.

  • max_seq_len (int) – Maximum sequence length during decoding.

  • mask (bool) – If True, mask padding in feature map.

  • start_idx (int) – Index of start token.

  • padding_idx (int) – Index of padding token.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

class mmocr.models.textrecog.decoders.ParallelSARDecoderWithBS(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, **kwargs)[source]

Parallel Decoder module with beam-search in SAR.

Parameters

beam_width (int) – Width for beam search.

class mmocr.models.textrecog.decoders.TFDecoder(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, num_classes=93, max_seq_len=40, start_idx=1, padding_idx=92, **kwargs)[source]

Transformer Decoder block with self attention mechanism.

class mmocr.models.textrecog.decoders.BaseDecoder(**kwargs)[source]

Base decoder class for text recognition.

class mmocr.models.textrecog.decoders.SequenceAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, dropout_ratio=0, return_feature=False, encode_value=False)[source]
class mmocr.models.textrecog.decoders.PositionAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, return_feature=False, encode_value=False)[source]
class mmocr.models.textrecog.decoders.RobustScannerDecoder(num_classes=None, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, encode_value=False, hybrid_decoder=None, position_decoder=None)[source]

textrecog_losses

class mmocr.models.textrecog.losses.CELoss(ignore_index=-1, reduction='none')[source]

Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.

Parameters
  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

class mmocr.models.textrecog.losses.SARLoss(ignore_index=0, reduction='mean', **kwargs)[source]

Implementation of loss module in `SAR.

<https://arxiv.org/abs/1811.00751>`_.

Parameters
  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

class mmocr.models.textrecog.losses.CTCLoss(flatten=True, blank=0, reduction='mean', zero_infinity=False, **kwargs)[source]

Implementation of loss module for CTC-loss based text recognition.

Parameters
  • flatten (bool) – If True, use flattened targets, else padded targets.

  • blank (int) – Blank label. Default 0.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

  • zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Default: False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.

class mmocr.models.textrecog.losses.TFLoss(ignore_index=-1, reduction='none', flatten=True, **kwargs)[source]

Implementation of loss module for transformer.

class mmocr.models.textrecog.losses.SegLoss(seg_downsample_ratio=0.5, seg_with_loss_weight=True, ignore_index=255, **kwargs)[source]

Implementation of loss module for segmentation based text recognition method.

Parameters
  • seg_downsample_ratio (float) – Downsample ratio of segmentation map.

  • seg_with_loss_weight (bool) – If True, set weight for segmentation loss.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

textrecog_backbones

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source]
Implement ResNet backbone for text recognition, modified from

ResNet

Parameters
  • base_channels (int) – Number of channels of input image tensor.

  • layers (list[int]) – List of BasicBlock number for each stage.

  • channels (list[int]) – List of out_channels of Conv2d layer.

  • out_indices (None | Sequence[int]) – Indices of output stages.

  • stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

class mmocr.models.textrecog.backbones.VeryDeepVgg(leaky_relu=True, input_channels=3)[source]
Implement VGG-VeryDeep backbone for text recognition, modified from

VGG-VeryDeep

Parameters
  • leaky_relu (bool) – Use leakyRelu or not.

  • input_channels (int) – Number of channels of input image tensor.

class mmocr.models.textrecog.backbones.NRTRModalityTransform(input_channels=3, input_height=32)[source]

textrecog_layers

class mmocr.models.textrecog.layers.BidirectionalLSTM(nIn, nHidden, nOut)[source]
class mmocr.models.textrecog.layers.MultiHeadAttention(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, mask_value=0)[source]

Multi-Head Attention module.

class mmocr.models.textrecog.layers.PositionalEncoding(d_hid=512, n_position=200)[source]
class mmocr.models.textrecog.layers.PositionwiseFeedForward(d_in, d_hid, dropout=0.1, act_layer=torch.nn.GELU)[source]

A two-feed-forward-layer module.

class mmocr.models.textrecog.layers.BasicBlock(inplanes, planes, stride=1, downsample=False)[source]
class mmocr.models.textrecog.layers.Bottleneck(inplanes, planes, stride=1, downsample=False)[source]
class mmocr.models.textrecog.layers.RobustScannerFusionLayer(dim_model, dim=-1)[source]
class mmocr.models.textrecog.layers.DotProductAttentionLayer(dim_model=None)[source]
class mmocr.models.textrecog.layers.PositionAwareLayer(dim_model, rnn_layers=2)[source]
mmocr.models.textrecog.layers.get_subsequent_mask(seq)[source]

For masking out the subsequent info.

class mmocr.models.textrecog.layers.TransformerDecoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, mask_value=0, act_layer=torch.nn.GELU)[source]

kie_extractors

class mmocr.models.kie.extractors.SDMGR(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, pretrained=None, class_list=None)[source]

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

Parameters
  • visual_modality (bool) – Whether use the visual modality.

  • class_list (None | str) – Mapping file of class index to class name. If None, class index will be shown in show_results, else class name.

forward_train(img, img_metas, relations, texts, gt_bboxes, gt_labels)[source]
Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.

  • relations (list[tensor]) – Relations between bboxes.

  • texts (list[tensor]) – Texts in bboxes.

  • gt_bboxes (list[tensor]) – Each item is the truth boxes for each image in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[tensor]) – Class indices corresponding to each box.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

show_result(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • boxes (list) – Bbox of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

Only if not show or out_file.

Return type

img (tensor)

kie_heads

class mmocr.models.kie.heads.SDMGRHead(num_chars=92, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, num_classes=26, loss={'type': 'SDMGRLoss'}, bidirectional=False, train_cfg=None, test_cfg=None)[source]

kie_losses

class mmocr.models.kie.losses.SDMGRLoss(node_weight=1.0, edge_weight=1.0, ignore=0)[source]

The implementation the loss of key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.

https://arxiv.org/abs/2103.14470.

mmocr.datasets

class mmocr.datasets.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]
evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

float]]: The evaluation results.

Return type

dict[dict[str

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from COCO api.

Return type

list[dict]

class mmocr.datasets.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows: The annotations field is optional for testing (this is one line of anno_file, with line-json-str

    converted to dict for visualizing only).

    {

    “file_name”: “sample.jpg”, “height”: 1080, “width”: 960, “annotations”:

    [
    {

    “iscrowd”: 0, “category_id”: 1, “bbox”: [357.0, 667.0, 804.0, 100.0], “segmentation”: [[361, 667, 710, 670,

    72, 767, 357, 763]]

    }

    ]

    }

  2. The two text recognition annotation formats are as follows: The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

    format1: sample.jpg hello format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by

pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='acc', logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

pre_pipeline(results)[source]

Prepare results dict for pipeline.

class mmocr.datasets.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • score_thr (float) – Score threshold for prediction map.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

float]

Return type

dict[str

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:

  • gt_kernels: to DataContainer (cpu_only=True)

  • gt_effective_mask: to DataContainer (cpu_only=True)

Parameters
  • keys (list[str]) – Fields to be formatted to DC only.

  • call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.

  • visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]

Generate gt shrinked text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters
  • shrink_ratio (float) – The area shrinked ratio between text kernels and their text masks.

  • thr_min (float) – The minimum value of the threshold map.

  • thr_max (float) – The maximum value of the threshold map.

  • min_short_size (int) – The minimum size of polygon below which the polygon is invalid.

draw_border_map(polygon, canvas, mask)[source]

Generate threshold map for one polygon.

Parameters
  • polygon (ndarray) – The polygon boundary ndarray.

  • canvas (ndarray) – The generated threshold map.

  • mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]

Find invalid polygons.

Parameters

results (dict) – The dict containing gt_mask.

Returns

The indicators for ignoring polygons.

Return type

ignore_tags (list[bool])

generate_targets(results)[source]

Generate the gt targets for DBNet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_thr_map(img_size, polygons)[source]

Generate threshold map.

Parameters
  • img_size (tuple(int)) – The image size (h,w)

  • polygons (list(ndarray)) – The polygon list.

Returns

The generated threshold map. thr_mask (ndarray): The effective mask of threshold map.

Return type

thr_map (ndarray)

ignore_texts(results, ignore_tags)[source]

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters
  • results (dict) – Result for one image.

  • ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]

Judge the input polygon is invalid or not. It is invalid if its area smaller than 1 or the shorter side of its minimum bounding box smaller than min_short_size.

Parameters

poly (ndarray) – The polygon boundary point sequence.

Returns

Whether the polygon is invalid.

Return type

True/False (bool)

class mmocr.datasets.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.KIEDataset(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]
Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

  • dict_file (str) – Character dict file path.

  • norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]

Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

list_to_numpy(ann_infos)[source]

Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]

Pad text index to same length.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.FCENetTargets(fourier_degree=5, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)))[source]

Generate the ground truth targets of FCENet: Fourier Contour Embedding for Arbitrary-Shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.

  • center_region_shrink_ratio (float) – The shrink ratio of text center region.

  • level_size_divisors (tuple(int)) – The downsample ratio on each level.

  • level_proportion_range (tuple(tuple(int))) – The range of text sizes assigned to each level.

cal_fourier_signature(polygon, fourier_degree)[source]

Calculate Fourier signature from input polygon.

Parameters
  • polygon (ndarray) – The input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

An array shaped (2k+1, 2) containing

real part and image part of 2k+1 Fourier coefficients.

Return type

fourier_signature (ndarray)

clockwise(c, fourier_degree)[source]

Make sure the polygon reconstructed from Fourier coefficients c in the clockwise direction.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon in clockwise point order.

Return type

new_polygon (lost[float])

fourier_transform(polygon, fourier_degree)[source]

Perform Fourier transformation to generate Fourier coefficients ck from polygon.

Parameters
  • polygon (ndarray) – An input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

Fourier coefficients.

Return type

c (ndarray(complex))

generate_center_region_mask(img_size, text_polys)[source]

Generate text center region mask.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text center region mask.

Return type

center_region_mask (ndarray)

generate_fourier_maps(img_size, text_polys)[source]

Generate Fourier coefficient maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The Fourier coefficient real part maps. fourier_image_map (ndarray): The Fourier coefficient image part

maps.

Return type

fourier_real_map (ndarray)

generate_level_targets(img_size, text_polys, ignore_polys)[source]

Generate ground truth target on each level.

Parameters
  • img_size (list[int]) – Shape of input image.

  • text_polys (list[list[ndarray]]) – A list of ground truth polygons.

  • ignore_polys (list[list[ndarray]]) – A list of ignored polygons.

Returns

A list of ground target on each level.

Return type

level_maps (list(ndarray))

generate_targets(results)[source]

Generate the ground truth targets for FCENet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

normalize_polygon(polygon)[source]

Normalize one polygon so that its start point is at right most.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon with start point at right.

Return type

new_polygon (lost[float])

resample_polygon(polygon, n=400)[source]

Resample one polygon with n points on its boundary.

Parameters
  • polygon (list[float]) – The input polygon.

  • n (int) – The number of resampled points.

Returns

The resampled polygon.

Return type

resampled_polygon (list[float])

class mmocr.datasets.HardDiskLoader(ann_file, parser, repeat=1)[source]

Load annotation file from hard disk to RAM.

Parameters

ann_file (str) – Annotation file path.

class mmocr.datasets.LmdbLoader(ann_file, parser, repeat=1)[source]

Load annotation file with lmdb storage backend.

class mmocr.datasets.LineStrParser(keys=['filename', 'text'], keys_idx=[0, 1], separator=' ')[source]

Parse string of one line in annotation file to dict format.

Parameters
  • keys (list[str]) – Keys in result dict.

  • keys_idx (list[int]) – Value index in sub-string list for each key above.

  • separator (str) – Separator to separate string to list of sub-string.

class mmocr.datasets.LineJsonParser(keys=[], **kwargs)[source]

Parse json-string of one line in annotation file to dict format.

Parameters

keys (list[str]) – Keys in both json-string and result dict.

datasets

class mmocr.datasets.base_dataset.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows: The annotations field is optional for testing (this is one line of anno_file, with line-json-str

    converted to dict for visualizing only).

    {

    “file_name”: “sample.jpg”, “height”: 1080, “width”: 960, “annotations”:

    [
    {

    “iscrowd”: 0, “category_id”: 1, “bbox”: [357.0, 667.0, 804.0, 100.0], “segmentation”: [[361, 667, 710, 670,

    72, 767, 357, 763]]

    }

    ]

    }

  2. The two text recognition annotation formats are as follows: The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

    format1: sample.jpg hello format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by

pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.icdar_dataset.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]
evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

float]]: The evaluation results.

Return type

dict[dict[str

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from COCO api.

Return type

list[dict]

class mmocr.datasets.ocr_dataset.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='acc', logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

pre_pipeline(results)[source]

Prepare results dict for pipeline.

class mmocr.datasets.ocr_seg_dataset.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.text_det_dataset.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • score_thr (float) – Score threshold for prediction map.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

float]

Return type

dict[str

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

class mmocr.datasets.kie_dataset.KIEDataset(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]
Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

  • dict_file (str) – Character dict file path.

  • norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]

Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

float]

Return type

dict[str

list_to_numpy(ann_infos)[source]

Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]

Pad text index to same length.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys

introduced by pipeline.

Return type

dict

pipelines

class mmocr.datasets.pipelines.LoadTextAnnotations(with_bbox=True, with_label=True, with_mask=False, with_seg=False, poly2mask=True)[source]
process_polygons(polygons)[source]

Convert polygons to list of ndarray and filter invalid polygons.

Parameters

polygons (list[list]) – Polygons of one instance.

Returns

Processed polygons.

Return type

list[numpy.ndarray]

class mmocr.datasets.pipelines.NormalizeOCR(mean, std)[source]

Normalize a tensor image with mean and standard deviation.

class mmocr.datasets.pipelines.OnlineCropOCR(box_keys=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'], jitter_prob=0.5, max_jitter_ratio_x=0.05, max_jitter_ratio_y=0.02)[source]

Crop text areas from whole image with bounding box jitter. If no bbox is given, return directly.

Parameters
  • box_keys (list[str]) – Keys in results which correspond to RoI bbox.

  • jitter_prob (float) – The probability of box jitter.

  • max_jitter_ratio_x (float) – Maximum horizontal jitter ratio relative to height.

  • max_jitter_ratio_y (float) – Maximum vertical jitter ratio relative to height.

class mmocr.datasets.pipelines.ResizeOCR(height, min_width=None, max_width=None, keep_aspect_ratio=True, img_pad_value=0, width_downsample_ratio=0.0625)[source]

Image resizing and padding for OCR.

Parameters
  • height (int | tuple(int)) – Image height after resizing.

  • min_width (none | int | tuple(int)) – Image minimum width after resizing.

  • max_width (none | int | tuple(int)) – Image maximum width after resizing.

  • keep_aspect_ratio (bool) – Keep image aspect ratio if True during resizing, Otherwise resize to the size height * max_width.

  • img_pad_value (int) – Scalar to fill padding area.

  • width_downsample_ratio (float) – Downsample ratio in horizontal direction from input image to output feature.

class mmocr.datasets.pipelines.ToTensorOCR[source]

Convert a PIL Image or numpy.ndarray to tensor.

class mmocr.datasets.pipelines.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:

  • gt_kernels: to DataContainer (cpu_only=True)

  • gt_effective_mask: to DataContainer (cpu_only=True)

Parameters
  • keys (list[str]) – Fields to be formatted to DC only.

  • call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.

  • visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.pipelines.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]

Generate gt shrinked text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters
  • shrink_ratio (float) – The area shrinked ratio between text kernels and their text masks.

  • thr_min (float) – The minimum value of the threshold map.

  • thr_max (float) – The maximum value of the threshold map.

  • min_short_size (int) – The minimum size of polygon below which the polygon is invalid.

draw_border_map(polygon, canvas, mask)[source]

Generate threshold map for one polygon.

Parameters
  • polygon (ndarray) – The polygon boundary ndarray.

  • canvas (ndarray) – The generated threshold map.

  • mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]

Find invalid polygons.

Parameters

results (dict) – The dict containing gt_mask.

Returns

The indicators for ignoring polygons.

Return type

ignore_tags (list[bool])

generate_targets(results)[source]

Generate the gt targets for DBNet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_thr_map(img_size, polygons)[source]

Generate threshold map.

Parameters
  • img_size (tuple(int)) – The image size (h,w)

  • polygons (list(ndarray)) – The polygon list.

Returns

The generated threshold map. thr_mask (ndarray): The effective mask of threshold map.

Return type

thr_map (ndarray)

ignore_texts(results, ignore_tags)[source]

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters
  • results (dict) – Result for one image.

  • ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]

Judge the input polygon is invalid or not. It is invalid if its area smaller than 1 or the shorter side of its minimum bounding box smaller than min_short_size.

Parameters

poly (ndarray) – The polygon boundary point sequence.

Returns

Whether the polygon is invalid.

Return type

True/False (bool)

class mmocr.datasets.pipelines.PANetTargets(shrink_ratio=(1.0, 0.5), max_shrink=20)[source]

Generate the ground truths for PANet: Efficient and Accurate Arbitrary- Shaped Text Detection with Pixel Aggregation Network.

[https://arxiv.org/abs/1908.05900]. This code is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • shrink_ratio (tuple[float]) – The ratios for shrinking text instances.

  • max_shrink (int) – The maximum shrink distance.

generate_targets(results)[source]

Generate the gt targets for PANet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

class mmocr.datasets.pipelines.ColorJitter(**kwargs)[source]

An interface for torch color jitter so that it can be invoked in mmdetection pipeline.

class mmocr.datasets.pipelines.RandomCropInstances(target_size, instance_key, mask_type='inx0', positive_sample_ratio=0.625)[source]

Randomly crop images and make sure to contain text instances.

Parameters
  • target_size (tuple or int) – (height, width)

  • positive_sample_ratio (float) – The probability of sampling regions that go through positive regions.

class mmocr.datasets.pipelines.RandomRotateTextDet(rotate_ratio=1.0, max_angle=10)[source]

Randomly rotate images.

class mmocr.datasets.pipelines.ScaleAspectJitter(img_scale=None, multiscale_mode='range', ratio_range=None, keep_ratio=False, resize_type='around_min_img_scale', aspect_ratio_range=None, long_size_bound=None, short_size_bound=None, scale_range=None)[source]

Resize image and segmentation mask encoded by coordinates.

Allowed resize types are around_min_img_scale, long_short_bound, and indep_sample_in_range.

class mmocr.datasets.pipelines.MultiRotateAugOCR(transforms, rotate_degrees=None, force_rotate=False)[source]

Test-time augmentation with multiple rotations in the case that img_height > img_width.

An example configuration is as follows:

rotate_degrees=[0, 90, 270],
transforms=[
    dict(
        type='ResizeOCR',
        height=32,
        min_width=32,
        max_width=160,
        keep_aspect_ratio=True),
    dict(type='ToTensorOCR'),
    dict(type='NormalizeOCR', **img_norm_cfg),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'filename', 'ori_shape', 'img_shape', 'valid_ratio'
        ]),
]

After MultiRotateAugOCR with above configuration, the results are wrapped into lists of the same length as follows:

dict(
    img=[...],
    img_shape=[...]
    ...
)
Parameters
  • transforms (list[dict]) – Transformation applied for each augmentation.

  • rotate_degrees (list[int] | None) – Degrees of anti-clockwise rotation.

  • force_rotate (bool) – If True, rotate image by ‘rotate_degrees’ while ignore image aspect ratio.

class mmocr.datasets.pipelines.OCRSegTargets(label_convertor=None, attn_shrink_ratio=0.5, seg_shrink_ratio=0.25, box_type='char_rects', pad_val=255)[source]

Generate gt shrinked kernels for segmentation based OCR framework.

Parameters
  • label_convertor (dict) – Dictionary to construct label_convertor to convert char to index.

  • attn_shrink_ratio (float) – The area shrinked ratio between attention kernels and gt text masks.

  • seg_shrink_ratio (float) – The area shrinked ratio between segmentation kernels and gt text masks.

  • box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

generate_kernels(resize_shape, pad_shape, char_boxes, char_inds, shrink_ratio=0.5, binary=True)[source]

Generate char instance kernels for one shrink ratio.

Parameters
  • resize_shape (tuple(int, int)) – Image size (height, width) after resizing.

  • pad_shape (tuple(int, int)) – Image size (height, width) after padding.

  • char_boxes (list[list[float]]) – The list of char polygons.

  • char_inds (list[int]) – List of char indexes.

  • shrink_ratio (float) – The shrink ratio of kernel.

  • binary (bool) – If True, return binary ndarray containing 0 & 1 only.

Returns

The text kernel mask of (height, width).

Return type

char_kernel (ndarray)

shrink_char_quad(char_quad, shrink_ratio)[source]

Shrink char box in style of quadrangle.

Parameters
  • char_quad (list[float]) – Char box with format [x1, y1, x2, y2, x3, y3, x4, y4].

  • shrink_ratio (float) – The area shrinked ratio between gt kernels and gt text masks.

shrink_char_rect(char_rect, shrink_ratio)[source]

Shrink char box in style of rectangle.

Parameters
  • char_rect (list[float]) – Char box with format [x_min, y_min, x_max, y_max].

  • shrink_ratio (float) – The area shrinked ratio between gt kernels and gt text masks.

class mmocr.datasets.pipelines.FancyPCA(eig_vec=None, eig_val=None)[source]

Implementation of PCA based image augmentation, proposed in the paper Imagenet Classification With Deep Convolutional Neural Networks.

It alters the intensities of RGB values along the principal components of ImageNet dataset.

class mmocr.datasets.pipelines.RandomCropPolyInstances(instance_key='gt_masks', crop_ratio=0.625, min_side_ratio=0.4)[source]

Randomly crop images and make sure to contain at least one intact instance.

sample_crop_box(img_size, results)[source]

Generate crop box and make sure not to crop the polygon instances.

Parameters
  • img_size (tuple(int)) – The image size (h, w).

  • results (dict) – The results dict.

class mmocr.datasets.pipelines.RandomPaddingOCR(max_ratio=None, box_type=None)[source]

Pad the given image on all sides, as well as modify the coordinates of character bounding box in image.

Parameters
  • max_ratio (list[int]) – [left, top, right, bottom].

  • box_type (None|str) – Character box type. If not none, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.ImgAug(args=None)[source]

A wrapper to use imgaug https://github.com/aleju/imgaug.

Parameters

args ([list[list|dict]]) – The argumentation list. For details, please refer to imgaug document. Take args=[[‘Fliplr’, 0.5], dict(cls=’Affine’, rotate=[-10, 10]), [‘Resize’, [0.5, 3.0]]] as an example. The args horizontally flip images with probability 0.5, followed by random rotation with angles in range [-10, 10], and resize with an independent scale in range [0.5, 3.0] for each side of images.

class mmocr.datasets.pipelines.RandomRotateImageBox(min_angle=-10, max_angle=10, box_type='char_quads')[source]

Rotate augmentation for segmentation based text recognition.

Parameters
  • min_angle (int) – Minimum rotation angle for image and box.

  • max_angle (int) – Maximum rotation angle for image and box.

  • box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.OpencvToPil(**kwargs)[source]

Convert numpy.ndarray (bgr) to PIL Image (rgb).

class mmocr.datasets.pipelines.PilToOpencv(**kwargs)[source]

Convert PIL Image (rgb) to numpy.ndarray (bgr).

class mmocr.datasets.pipelines.KIEFormatBundle(*args: Any, **kwargs: Any)[source]

Key information extraction formatting bundle.

Based on the DefaultFormatBundle, itt simplifies the pipeline of formatting common fields, including “img”, “proposals”, “gt_bboxes”, “gt_labels”, “gt_masks”, “gt_semantic_seg”, “relations” and “texts”. These fields are formatted as follows.

  • img: (1) transpose, (2) to tensor, (3) to DataContainer (stack=True)

  • proposals: (1) to tensor, (2) to DataContainer

  • gt_bboxes: (1) to tensor, (2) to DataContainer

  • gt_bboxes_ignore: (1) to tensor, (2) to DataContainer

  • gt_labels: (1) to tensor, (2) to DataContainer

  • gt_masks: (1) to tensor, (2) to DataContainer (cpu_only=True)

  • gt_semantic_seg: (1) unsqueeze dim-0 (2) to tensor,
    1. to DataContainer (stack=True)

  • relations: (1) scale, (2) to tensor, (3) to DataContainer

  • texts: (1) to tensor, (2) to DataContainer

class mmocr.datasets.pipelines.TextSnakeTargets(orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3)[source]

Generate the ground truth targets of TextSnake: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

Parameters

orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

draw_center_region_maps(top_line, bot_line, center_line, center_region_mask, radius_map, sin_map, cos_map, region_shrink_ratio)[source]

Draw attributes on text center region.

Parameters
  • top_line (ndarray) – The points composing top curved sideline of text polygon.

  • bot_line (ndarray) – The points composing bottom curved sideline of text polygon.

  • center_line (ndarray) – The points composing the center line of text instance.

  • center_region_mask (ndarray) – The text center region mask.

  • radius_map (ndarray) – The map where the distance from point to sidelines will be drawn on for each pixel in text center region.

  • sin_map (ndarray) – The map where vector_sin(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).

  • cos_map (ndarray) – The map where vector_cos(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).

  • region_shrink_ratio (float) – The shrink ratio of text center.

find_head_tail(points, orientation_thr)[source]

Find the head edge and tail edge of a text polygon.

Parameters
  • points (ndarray) – The points composing a text polygon.

  • orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

Returns

The indexes of two points composing head edge. tail_inds (list): The indexes of two points composing tail edge.

Return type

head_inds (list)

generate_center_mask_attrib_maps(img_size, text_polys)[source]

Generate text center region mask and geometric attribute maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text center region mask. radius_map (ndarray): The distance map from each pixel in text

center region to top sideline.

sin_map (ndarray): The sin(theta) map where theta is the angle

between vector (top point - bottom point) and vector (1, 0).

cos_map (ndarray): The cos(theta) map where theta is the angle

between vector (top point - bottom point) and vector (1, 0).

Return type

center_region_mask (ndarray)

generate_targets(results)[source]

Generate the gt targets for TextSnake.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_text_region_mask(img_size, text_polys)[source]

Generate text center region mask and geometry attribute maps.

Parameters
  • img_size (tuple) – The image size (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text region mask.

Return type

text_region_mask (ndarray)

reorder_poly_edge(points)[source]

Get the respective points composing head edge, tail edge, top sideline and bottom sideline.

Parameters

points (ndarray) – The points composing a text polygon.

Returns

The two points composing the head edge of text

polygon.

tail_edge (ndarray): The two points composing the tail edge of text

polygon.

top_sideline (ndarray): The points composing top curved sideline of

text polygon.

bot_sideline (ndarray): The points composing bottom curved sideline

of text polygon.

Return type

head_edge (ndarray)

resample_line(line, n)[source]

Resample n points on a line.

Parameters
  • line (ndarray) – The points composing a line.

  • n (int) – The resampled points number.

Returns

The points composing the resampled line.

Return type

resampled_line (ndarray)

resample_sidelines(sideline1, sideline2, resample_step)[source]

Resample two sidelines to be of the same points number according to step size.

Parameters
  • sideline1 (ndarray) – The points composing a sideline of a text polygon.

  • sideline2 (ndarray) – The points composing another sideline of a text polygon.

  • resample_step (float) – The resampled step size.

Returns

The resampled line 1. resampled_line2 (ndarray): The resampled line 2.

Return type

resampled_line1 (ndarray)

mmocr.datasets.pipelines.sort_vertex(points_x, points_y)[source]

Sort box vertices in clockwise order from left-top first.

Parameters
  • points_x (list[float]) – x of four vertices.

  • points_y (list[float]) – y of four vertices.

Returns

x of sorted four vertices. sorted_points_y (list[float]): y of sorted four vertices.

Return type

sorted_points_x (list[float])

class mmocr.datasets.pipelines.LoadImageFromNdarray(*args: Any, **kwargs: Any)[source]

Load an image from np.ndarray.

Similar with LoadImageFromFile, but the image read from results['img'], which is np.ndarray.

mmocr.datasets.pipelines.sort_vertex8(points)[source]

Sort vertex with 8 points [x1 y1 x2 y2 x3 y3 x4 y4]

class mmocr.datasets.pipelines.FCENetTargets(fourier_degree=5, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)))[source]

Generate the ground truth targets of FCENet: Fourier Contour Embedding for Arbitrary-Shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.

  • center_region_shrink_ratio (float) – The shrink ratio of text center region.

  • level_size_divisors (tuple(int)) – The downsample ratio on each level.

  • level_proportion_range (tuple(tuple(int))) – The range of text sizes assigned to each level.

cal_fourier_signature(polygon, fourier_degree)[source]

Calculate Fourier signature from input polygon.

Parameters
  • polygon (ndarray) – The input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

An array shaped (2k+1, 2) containing

real part and image part of 2k+1 Fourier coefficients.

Return type

fourier_signature (ndarray)

clockwise(c, fourier_degree)[source]

Make sure the polygon reconstructed from Fourier coefficients c in the clockwise direction.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon in clockwise point order.

Return type

new_polygon (lost[float])

fourier_transform(polygon, fourier_degree)[source]

Perform Fourier transformation to generate Fourier coefficients ck from polygon.

Parameters
  • polygon (ndarray) – An input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

Fourier coefficients.

Return type

c (ndarray(complex))

generate_center_region_mask(img_size, text_polys)[source]

Generate text center region mask.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text center region mask.

Return type

center_region_mask (ndarray)

generate_fourier_maps(img_size, text_polys)[source]

Generate Fourier coefficient maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The Fourier coefficient real part maps. fourier_image_map (ndarray): The Fourier coefficient image part

maps.

Return type

fourier_real_map (ndarray)

generate_level_targets(img_size, text_polys, ignore_polys)[source]

Generate ground truth target on each level.

Parameters
  • img_size (list[int]) – Shape of input image.

  • text_polys (list[list[ndarray]]) – A list of ground truth polygons.

  • ignore_polys (list[list[ndarray]]) – A list of ignored polygons.

Returns

A list of ground target on each level.

Return type

level_maps (list(ndarray))

generate_targets(results)[source]

Generate the ground truth targets for FCENet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

normalize_polygon(polygon)[source]

Normalize one polygon so that its start point is at right most.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon with start point at right.

Return type

new_polygon (lost[float])

resample_polygon(polygon, n=400)[source]

Resample one polygon with n points on its boundary.

Parameters
  • polygon (list[float]) – The input polygon.

  • n (int) – The number of resampled points.

Returns

The resampled polygon.

Return type

resampled_polygon (list[float])