API Reference¶

mmocr.apis¶

mmocr.apis.model_inference(model, img)[source]¶

Inference image(s) with the detector.

Parameters

model (nn.Module) – The loaded detector.
img (str) – Image files.

Returns

Detection results.

Return type

result (dict)

mmocr.core¶

evaluation¶

mmocr.core.evaluation.eval_hmean_ic13(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]¶

Evaluate hmean of text detection using the ICDAR2013 standard.

Parameters

det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.
gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.
gt_ignored_boxes (list[list[list[float]]]) – List of arrays of shape (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.
precision_thr (float) – Precision threshold of the iou of one (gt_box, det_box) pair.
recall_thr (float) – Recall threshold of the iou of one (gt_box, det_box) pair.
center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.
one2one_score (float) – Reward when one gt matches one det_box.
one2many_score (float) – Reward when one gt matches many det_boxes.
many2one_score (float) – Reward when many gts match one det_box.

Returns

Tuple of dicts which encodes the hmean for the dataset and all images.

Return type

hmean (tuple[dict])

mmocr.core.evaluation.eval_hmean_iou(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]¶

Evaluate hmean of text detection using the IoU standard.

Parameters

pred_boxes (list[list[list[float]]]) – Text boxes for an img list. Each box has 2k (>=8) values.
gt_boxes (list[list[list[float]]]) – Ground truth text boxes for an img list. Each box has 2k (>=8) values.
gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for an img list. Each box has 2k (>=8) values.
iou_thr (float) – Iou threshold when one (gt_box, det_box) pair is matched.
precision_thr (float) – Precision threshold when one (gt_box, det_box) pair is matched.

Returns

Tuple of dicts which encodes the hmean for the dataset and all images.

Return type

hmean (tuple[dict])
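As a rough illustration of what the hmean value summarises, the sketch below computes precision, recall and their harmonic mean from match counts. The helper name compute_hmean and its arguments are hypothetical and not part of mmocr; the matching step itself (IoU or ICDAR2013 rules) is left out.

```python
def compute_hmean(num_matched, num_gt, num_det):
    """Return (precision, recall, hmean) from match counts.

    Hypothetical helper for illustration; not the mmocr implementation.
    """
    precision = num_matched / num_det if num_det > 0 else 0.0
    recall = num_matched / num_gt if num_gt > 0 else 0.0
    denom = precision + recall
    hmean = 2.0 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, hmean


# 8 of 10 detections match 8 of 16 non-ignored ground truths.
precision, recall, hmean = compute_hmean(num_matched=8, num_gt=16, num_det=10)
```

The zero checks keep the sketch defined for images without detections or without ground truths.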

mmocr.core.evaluation.eval_ocr_metric(pred_texts, gt_texts)[source]¶

Evaluate the text recognition performance with metric: word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.

Parameters

pred_texts (list[str]) – Text strings of prediction.
gt_texts (list[str]) – Text strings of ground truth.

Returns

Metric dict for text recognition, including:

word_acc: Accuracy in word level.
word_acc_ignore_case: Accuracy in word level, ignore letter case.
word_acc_ignore_case_symbol: Accuracy in word level, ignore letter case and symbol. (default metric for academic evaluation)
char_recall: Recall in character level, ignore letter case and symbol.
char_precision: Precision in character level, ignore letter case and symbol.
1-N.E.D: 1 - normalized_edit_distance.

Return type

eval_res (dict[str, float])
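The word-level metrics and 1-N.E.D. can be sketched in plain Python as below. This is only an approximation under stated assumptions: the function names are hypothetical, symbols are taken to be anything outside a-z and 0-9, and the edit distance is normalized by the longer of the two cleaned strings. The library may differ on each of these points.

```python
import re


def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def sketch_ocr_metric(pred_texts, gt_texts):
    """Word accuracy variants and 1-N.E.D. for paired predictions."""
    assert len(pred_texts) == len(gt_texts) and gt_texts
    drop = re.compile(r'[^a-z0-9]')  # assumed definition of "symbol"
    exact = ignore_case = ignore_case_symbol = 0
    ned_sum = 0.0
    for pred, gt in zip(pred_texts, gt_texts):
        exact += pred == gt
        p_low, g_low = pred.lower(), gt.lower()
        ignore_case += p_low == g_low
        p_clean, g_clean = drop.sub('', p_low), drop.sub('', g_low)
        ignore_case_symbol += p_clean == g_clean
        longest = max(len(p_clean), len(g_clean), 1)
        ned_sum += edit_distance(p_clean, g_clean) / longest
    n = len(gt_texts)
    return {
        'word_acc': exact / n,
        'word_acc_ignore_case': ignore_case / n,
        'word_acc_ignore_case_symbol': ignore_case_symbol / n,
        '1-N.E.D': 1.0 - ned_sum / n,
    }


res = sketch_ocr_metric(['Hello', 'w0rld!'], ['hello', 'world'])
```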

mmocr.core.evaluation.eval_hmean(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶

Evaluation in hmean metric.

Parameters

results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result
img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width
ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore
score_thr (float) – Score threshold of prediction map.
metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}

Returns

The evaluation results.

Return type

dict[str, float]

mmocr.core.evaluation.compute_f1_score(preds, gts, ignores=[])[source]¶

Compute the F1-score of prediction.

Parameters

preds (Tensor) – The predicted probability NxC map with N and C being the sample number and class number respectively.
gts (Tensor) – The ground truth vector of size N.
ignores – The index set of classes that are ignored when reporting results. Note: all samples still participate in the computation.
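A minimal pure-Python sketch of a per-class F1 computation follows. It assumes the predicted label is the argmax over the C class probabilities and uses nested lists in place of tensors; the function name is hypothetical and the actual implementation may organise the computation differently.

```python
def sketch_f1_scores(preds, gts, ignores=()):
    """Per-class F1 from an NxC probability map and N ground-truth labels.

    Samples of ignored classes still enter the counts of the reported
    classes; only the reporting skips the ignored class indexes.
    """
    num_classes = len(preds[0])
    pred_labels = [max(range(num_classes), key=row.__getitem__)
                   for row in preds]
    f1 = {}
    for c in range(num_classes):
        if c in ignores:
            continue
        tp = sum(p == c and g == c for p, g in zip(pred_labels, gts))
        fp = sum(p == c and g != c for p, g in zip(pred_labels, gts))
        fn = sum(p != c and g == c for p, g in zip(pred_labels, gts))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return f1


preds = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
gts = [0, 1, 1, 2]
scores = sketch_f1_scores(preds, gts, ignores=[2])
```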

mmocr.utils¶

mmocr.utils.get_root_logger(log_file=None, log_level=20)[source]¶

Use get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmocr”.

Parameters

log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.
log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger
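The handler setup described above can be approximated with the standard logging module alone. The sketch below is not the mmcv implementation: the function name and the explicit rank argument are assumptions made for illustration (a real distributed setup would query the process rank itself).

```python
import logging


def sketch_root_logger(name, log_file=None, log_level=logging.INFO, rank=0):
    """Hypothetical stand-in showing the handler setup described above."""
    logger = logging.getLogger(name)
    if logger.handlers:  # already initialized: reuse as-is
        return logger
    logger.addHandler(logging.StreamHandler())
    if log_file is not None:
        logger.addHandler(logging.FileHandler(log_file))
    # Only rank 0 uses the requested level; other ranks stay at ERROR.
    logger.setLevel(log_level if rank == 0 else logging.ERROR)
    return logger


logger = sketch_root_logger('sketch_pkg')  # default level 20 is logging.INFO
worker = sketch_root_logger('sketch_pkg_worker', rank=1)
```

Calling the function a second time with the same name returns the already configured logger instead of adding duplicate handlers.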

mmocr.utils.collect_env()[source]¶: Collect the information of the running environments.

mmocr.utils.drop_orientation(img_file)[source]¶

Check if the image has orientation information. If yes, ignore it by converting the image format to png, and return new filename, otherwise return the original filename.

Parameters: img_file (str) – The image path
Returns: The converted image filename with proper postfix

mmocr.models¶

common_backbones¶

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None)[source]¶

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

Parameters

in_channels (int) – Number of input image channels. Default: 3.
base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.
num_stages (int) – Number of stages in encoder, normally 5. Default: 5.
strides (Sequence[int 1 | 2]) – Strides of each stage in encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in encoder is 1. If strides[i]=2, it uses stride convolution to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).
enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).
dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).
downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of encoder (stages: [1, num_stages)). If the corresponding encoder stage uses stride convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).
enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).
dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict | None) – Config dict for convolution layer. Default: None.
norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type='BN').
act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type='ReLU').
upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type='InterpConv').
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.
plugins (dict) – plugins for convolutional layers. Default: None.

Notice: The input image size should be divisible by the whole downsample rate of the encoder. More details of the whole downsample rate can be found in UNet._check_input_divisible.
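To make the divisibility requirement concrete, the sketch below derives a whole downsample rate from strides and downsamples, assuming that every stage after the first halves the resolution when it either uses a stride-2 convolution or MaxPool. Both helper names are hypothetical; UNet._check_input_divisible remains the authoritative check.

```python
def whole_downsample_rate(num_stages, strides, downsamples):
    """Overall encoder downsample rate under the assumptions stated above."""
    rate = 1
    for i in range(1, num_stages):
        # A stage halves the resolution via a stride-2 conv or via MaxPool.
        if strides[i] == 2 or downsamples[i - 1]:
            rate *= 2
    return rate


def is_input_divisible(height, width, rate):
    """True if both spatial dimensions are multiples of the rate."""
    return height % rate == 0 and width % rate == 0


# Default configuration from the signature above.
rate = whole_downsample_rate(
    num_stages=5,
    strides=(1, 1, 1, 1, 1),
    downsamples=(True, True, True, True))
```

With the default arguments the four downsampling stages give a rate of 16, so a 512x640 input passes the check while a 500x640 input does not.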

init_weights(pretrained=None)[source]¶

Initialize the weights in backbone.

Parameters: pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

train(mode=True)[source]¶: Convert the model into training mode while keeping normalization layers frozen.

class mmocr.models.common.losses.DiceLoss(eps=1e-06)[source]¶

textdet_dense_heads¶

class mmocr.models.textdet.dense_heads.PSEHead(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PSELoss'}, train_cfg=None, test_cfg=None)[source]¶: The class for PSENet head.

class mmocr.models.textdet.dense_heads.PANHead(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PANLoss'}, train_cfg=None, test_cfg=None)[source]¶: The class for PANet head.

class mmocr.models.textdet.dense_heads.DBHead(in_channels, with_bias=False, decoding_type='db', text_repr_type='poly', downsample_ratio=1.0, loss={'type': 'DBLoss'}, train_cfg=None, test_cfg=None)[source]¶

The class for DBNet head.

This was partially adapted from https://github.com/MhLiao/DB

class mmocr.models.textdet.dense_heads.HeadMixin[source]¶

The head mixin for DBNet and PANet heads.

get_boundary(score_maps, img_metas, rescale)[source]¶

Compute text boundaries via post processing.

Parameters

score_maps (Tensor) – The text score map.
img_metas (dict) – The image meta info.
rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.

Returns

The result dict.

Return type

results (dict)

loss(pred_maps, **kwargs)[source]¶

Compute the loss for text detection.

Parameters: pred_maps (tensor) – The input score maps of NxCxHxW.
Returns: The dict for losses.
Return type: losses (dict)

resize_boundary(boundaries, scale_factor)[source]¶

Rescale boundaries via scale_factor.

Parameters

boundaries (list[list[float]]) – The boundary list. Each boundary has size 2k+1 with k>=4.
scale_factor (ndarray) – The scale factor of size (4,).

Returns

The scaled boundaries.

Return type

boundaries (list[list[float]])
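A minimal sketch of the rescaling step, under two assumptions that the text above does not spell out: each boundary is laid out as [x1, y1, ..., xk, yk, score], and scale_factor is ordered (w_scale, h_scale, w_scale, h_scale). The function name is hypothetical.

```python
def sketch_resize_boundary(boundaries, scale_factor):
    """Scale the x/y coordinates of each boundary, keeping the last value.

    Assumes each boundary is [x1, y1, ..., xk, yk, score] with k >= 4.
    """
    w_scale, h_scale = scale_factor[0], scale_factor[1]
    resized = []
    for boundary in boundaries:
        assert len(boundary) % 2 == 1 and len(boundary) >= 9
        coords, score = boundary[:-1], boundary[-1]
        # Even positions hold x values, odd positions hold y values.
        scaled = [v * (w_scale if i % 2 == 0 else h_scale)
                  for i, v in enumerate(coords)]
        resized.append(scaled + [score])
    return resized


boundaries = [[0, 0, 10, 0, 10, 5, 0, 5, 0.9]]
scaled = sketch_resize_boundary(boundaries, (2.0, 4.0, 2.0, 4.0))
```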

class mmocr.models.textdet.dense_heads.TextSnakeHead(in_channels, decoding_type='textsnake', text_repr_type='poly', loss={'type': 'TextSnakeLoss'}, train_cfg=None, test_cfg=None)[source]¶

The class for TextSnake head: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

textdet_necks¶

class mmocr.models.textdet.necks.FPEM_FFM(in_channels, conv_out=128, fpem_repeat=2, align_corners=False)[source]¶

This code is from https://github.com/WenmuZhou/PAN.pytorch.

init_weights()[source]¶: Initialize the weights of FPN module.

class mmocr.models.textdet.necks.FPNF(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', upsample_ratio=1)[source]¶: FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.

class mmocr.models.textdet.necks.FPNC(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, conv_after_concat=False)[source]¶

FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.

This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch

init_weights()[source]¶: Initialize the weights of FPN module.

class mmocr.models.textdet.necks.FPN_UNET(in_channels, out_channels)[source]¶

The class for implementing DRRG and TextSnake U-Net-like FPN.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection [https://arxiv.org/abs/2003.07493]. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544].

textdet_detectors¶

class mmocr.models.textdet.detectors.TextDetectorMixin(show_score)[source]¶

The class for implementing text detector auxiliary methods.

get_boundary(results)[source]¶

Convert segmentation into text boundaries.

Parameters: results (tuple) – The result tuple. The first element is segmentation while the second is its scores.
Returns: A result dict containing ‘boundary_result’.
Return type: results (dict)

show_result(img, result, score_thr=0.5, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]¶

Draw result over img.

Parameters

img (str or Tensor) – The image to be displayed.
result (dict) – The results to draw over img.
score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.5.
bbox_color (str or tuple or Color) – Color of bbox lines.
text_color (str or tuple or Color) – Color of texts.
thickness (int) – Thickness of lines.
font_scale (float) – Font scales of texts.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The filename to write the image. Default: None.

class mmocr.models.textdet.detectors.SingleStageTextDetector(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None)[source]¶

The class for implementing single stage text detector.

It is the parent class of PANet, PSENet, and DBNet.

forward_train(img, img_metas, **kwargs)[source]¶

Parameters

img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – A list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

class mmocr.models.textdet.detectors.OCRMaskRCNN(backbone, rpn_head, roi_head, train_cfg, test_cfg, neck=None, pretrained=None, text_repr_type='quad', show_score=False)[source]¶: Mask RCNN tailored for OCR.

class mmocr.models.textdet.detectors.DBNet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶

The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.

[https://arxiv.org/abs/1911.08947].

class mmocr.models.textdet.detectors.PANet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶

The class for implementing PANet text detector:

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].

class mmocr.models.textdet.detectors.PSENet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶

The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.

[https://arxiv.org/abs/1806.02559].

class mmocr.models.textdet.detectors.TextSnake(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶

The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

textdet_losses¶

class mmocr.models.textdet.losses.PANLoss(alpha=0.5, beta=0.25, delta_aggregation=0.5, delta_discrimination=3, ohem_ratio=3, reduction='mean', speedup_bbox_thr=-1)[source]¶

The class for implementing PANet loss: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network.

[https://arxiv.org/abs/1908.05900]. This was partially adapted from https://github.com/WenmuZhou/PAN.pytorch

aggregation_discrimination_loss(gt_texts, gt_kernels, inst_embeds)[source]¶

Compute the aggregation and discriminative losses.

Parameters

gt_texts (tensor) – The ground truth text mask of size Nx1xHxW.
gt_kernels (tensor) – The ground truth text kernel mask of size Nx1xHxW.
inst_embeds (tensor) – The text instance embedding tensor of size Nx4xHxW.

Returns

loss_aggrs (tensor): The aggregation loss before reduction.
loss_discrs (tensor): The discriminative loss before reduction.

Return type

tuple(tensor, tensor)

bitmasks2tensor(bitmasks, target_sz)[source]¶

Convert Bitmasks to tensor.

Parameters

bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size HxW.

Returns

results (list[tensor]): The list of kernel tensors. Each element is for one kernel level.

forward(preds, downsample_ratio, gt_kernels, gt_mask)[source]¶

Compute PANet loss.

Parameters

preds (tensor) – The output tensor with size of Nx6xHxW.
gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.
gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.
downsample_ratio (float) – The downsample ratio between preds and the input img.

Returns

The loss dictionary.

Return type

results (dict)

ohem_batch(text_scores, gt_texts, gt_mask)[source]¶

OHEM sampling for a batch of imgs.

Parameters

text_scores (Tensor) – The text scores of size NxHxW.
gt_texts (Tensor) – The gt text masks of size NxHxW.
gt_mask (Tensor) – The gt effective mask of size NxHxW.

Returns

The sampled mask of size NxHxW.

Return type

sampled_masks (Tensor)

ohem_img(text_score, gt_text, gt_mask)[source]¶

Sample the top-k maximal negative samples and all positive samples.

Parameters

text_score (Tensor) – The text score with size of HxW.
gt_text (Tensor) – The ground truth text mask of HxW.
gt_mask (Tensor) – The effective region mask of HxW.

Returns

The sampled pixel mask of size HxW.

Return type

sampled_mask (Tensor)
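The per-image sampling can be sketched on nested lists as follows. The sketch assumes that positives are pixels with gt_text > 0.5 inside the effective mask, that at most ohem_ratio times as many negatives as positives are kept, and that the effective mask is returned unchanged when there is nothing to mine. The function name and the ohem_ratio argument are illustrative only.

```python
def sketch_ohem_img(text_score, gt_text, gt_mask, ohem_ratio=3):
    """Keep all positive pixels plus the highest-scoring negative pixels.

    All inputs are HxW nested lists; the result is an HxW 0/1 mask.
    """
    h, w = len(gt_text), len(gt_text[0])
    pos = [[gt_text[y][x] > 0.5 and gt_mask[y][x] > 0.5 for x in range(w)]
           for y in range(h)]
    neg_scores = [text_score[y][x] for y in range(h) for x in range(w)
                  if gt_text[y][x] <= 0.5]
    num_pos = sum(map(sum, pos))
    num_neg = min(len(neg_scores), int(num_pos * ohem_ratio))
    if num_pos == 0 or num_neg == 0:
        # Nothing to mine: fall back to the effective mask.
        return [[float(gt_mask[y][x] > 0.5) for x in range(w)]
                for y in range(h)]
    # Score of the num_neg-th hardest negative acts as the cut-off.
    threshold = sorted(neg_scores, reverse=True)[num_neg - 1]
    return [[float((pos[y][x] or text_score[y][x] >= threshold)
                   and gt_mask[y][x] > 0.5) for x in range(w)]
            for y in range(h)]


text_score = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.3]]
gt_text = [[1, 0, 0], [0, 0, 0]]
gt_mask = [[1, 1, 1], [1, 1, 1]]
sampled = sketch_ohem_img(text_score, gt_text, gt_mask, ohem_ratio=2)
```

Here one positive pixel allows two negatives, so the negatives scoring 0.8 and 0.7 are kept alongside the positive.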

class mmocr.models.textdet.losses.PSELoss(alpha=0.7, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive')[source]¶

The class for implementing PSENet loss: Shape Robust Text Detection with Progressive Scale Expansion Network [https://arxiv.org/abs/1806.02559].

This is partially adapted from https://github.com/whai362/PSENet.

forward(score_maps, downsample_ratio, gt_kernels, gt_mask)[source]¶

Compute PSENet loss.

Parameters

score_maps (tensor) – The output tensor with size of Nx6xHxW.
gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.
gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.
downsample_ratio (float) – The downsample ratio between score_maps and the input img.

Returns

The loss.

Return type

results (dict)

class mmocr.models.textdet.losses.DBLoss(alpha=1, beta=1, reduction='mean', negative_ratio=3.0, eps=1e-06, bbce_loss=False)[source]¶

The class for implementing DBNet loss.

This is partially adapted from https://github.com/MhLiao/DB.

bitmasks2tensor(bitmasks, target_sz)[source]¶

Convert Bitmasks to tensor.

Parameters

bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size of KxHxW with K being the number of kernels.

Returns

result_tensors (list[tensor]): The list of kernel tensors. Each element is for one kernel level.

forward(preds, downsample_ratio, gt_shrink, gt_shrink_mask, gt_thr, gt_thr_mask)[source]¶

Compute DBNet loss.

Parameters

preds (tensor) – The output tensor with size of Nx3xHxW.
downsample_ratio (float) – The downsample ratio for the ground truths.
gt_shrink (list[BitmapMasks]) – The mask list with each element being the shrinked text mask for one img.
gt_shrink_mask (list[BitmapMasks]) – The effective mask list with each element being the shrinked effective mask for one img.
gt_thr (list[BitmapMasks]) – The mask list with each element being the threshold text mask for one img.
gt_thr_mask (list[BitmapMasks]) – The effective mask list with each element being the threshold effective mask for one img.

Returns

The dict for DBNet losses with loss_prob, loss_db and loss_thresh.

Return type

results (dict)

class mmocr.models.textdet.losses.TextSnakeLoss(ohem_ratio=3.0)[source]¶

The class for implementing TextSnake loss: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544]. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

bitmasks2tensor(bitmasks, target_sz)[source]¶

Convert Bitmasks to tensor.

Parameters

bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size HxW.

Returns

results (list[tensor]): The list of kernel tensors. Each element is for one kernel level.

textdet_postprocess¶

textrecog_recognizer¶

class mmocr.models.textrecog.recognizer.BaseRecognizer[source]¶

Base class for text recognition.

abstract aug_test(imgs, img_metas, **kwargs)[source]¶

Test function with test time augmentation.

Parameters

imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.
img_metas (list[list[dict]]) – The metadata of images.

abstract extract_feat(imgs)[source]¶: Extract features from images.

forward(img, img_metas, return_loss=True, **kwargs)[source]¶

Calls either forward_train() or forward_test() depending on whether return_loss is True.

Note that img and img_metas are single-nested (i.e. tensor and list[dict]).

forward_test(imgs, img_metas, **kwargs)[source]¶

Parameters

imgs (tensor | list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.
img_metas (list[dict] | list[list[dict]]) – The outer list indicates images in a batch.

abstract forward_train(imgs, img_metas, **kwargs)[source]¶

Parameters

imgs (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, see mmdet.datasets.pipelines.Collect.
kwargs (keyword arguments) – Specific to concrete implementation.

init_weights(pretrained=None)[source]¶

Initialize the weights of the recognizer.

Parameters: pretrained (str, optional) – Path to pre-trained weights. Defaults to None.

show_result(img, result, gt_label='', win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]¶

Draw result on img.

Parameters

img (str or tensor) – The image to be displayed.
result (dict) – The results to draw on img.
gt_label (str) – Ground truth label of img.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The output filename. Default: None.

Returns

The image with the result drawn on it. Returned only if neither show nor out_file is set.

Return type

img (tensor)

train_step(data, optimizer)[source]¶

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer update, which are done by an optimizer hook. Note that in some complicated cases or models (e.g. GAN), the whole process (including the back propagation and optimizer update) is also defined by this method.

Parameters

data (dict) – The outputs of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

It should contain at least 3 keys: loss, log_vars, num_samples.

loss is a tensor for back propagation, which is a weighted sum of multiple losses.
log_vars contains all the variables to be sent to the logger.
num_samples indicates the batch size used for averaging the logs (Note: for the DDP model, num_samples refers to the batch size for each GPU).

Return type

dict

val_step(data, optimizer)[source]¶

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented by this method, but by an evaluation hook.

class mmocr.models.textrecog.recognizer.CRNNNet(*args: Any, **kwargs: Any)[source]¶: CTC-loss based recognizer.

class mmocr.models.textrecog.recognizer.SARNet(*args: Any, **kwargs: Any)[source]¶: Implementation of SAR

class mmocr.models.textrecog.recognizer.NRTR(*args: Any, **kwargs: Any)[source]¶: Implementation of NRTR

class mmocr.models.textrecog.recognizer.RobustScanner(*args: Any, **kwargs: Any)[source]¶

Implementation of RobustScanner [https://arxiv.org/pdf/2007.07542.pdf].

textrecog_backbones¶

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source]¶

Implement ResNet backbone for text recognition, modified from: ResNet

Parameters

base_channels (int) – Number of channels of input image tensor.
layers (list[int]) – List of BasicBlock number for each stage.
channels (list[int]) – List of out_channels of Conv2d layer.
out_indices (None | Sequence[int]) – Indices of output stages.
stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

class mmocr.models.textrecog.backbones.VeryDeepVgg(leakyRelu=True, input_channels=3)[source]¶

Implement VGG-VeryDeep backbone for text recognition, modified from: VGG-VeryDeep

Parameters

input_channels (int) – Number of channels of input image tensor.
leakyRelu (bool) – Use leakyRelu or not.

class mmocr.models.textrecog.backbones.NRTRModalityTransform(input_channels=3, input_height=32)[source]¶

textrecog_necks¶

class mmocr.models.textrecog.necks.FPNOCR(in_channels, out_channels, last_stage_only=True)[source]¶

FPN-like Network for segmentation based text recognition.

Parameters

in_channels (list[int]) – Number of input channels for each scale.
out_channels (int) – Number of output channels for each scale.
last_stage_only (bool) – If True, output last stage only.

textrecog_heads¶

class mmocr.models.textrecog.heads.SegHead(in_channels=128, num_classes=37, upsample_param=None)[source]¶

Head for segmentation based text recognition.

Parameters

in_channels (int) – Number of input channels.
num_classes (int) – Number of output classes.
upsample_param (dict | None) – Config dict for interpolation layer. Default: dict(scale_factor=1.0, mode='nearest')

textrecog_convertors¶

class mmocr.models.textrecog.convertors.BaseConvertor(dict_type='DICT90', dict_file=None, dict_list=None)[source]¶

Convert between text, index and tensor for text recognition pipeline.

Parameters

dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the dict_file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.

idx2str(indexes)[source]¶

Convert indexes to text strings.

Parameters: indexes (list[list[int]]) – [[1,2,3,3,4], [5,4,6,3,7]].
Returns: [‘hello’, ‘world’].
Return type: strings (list[str])

num_classes()[source]¶: Number of output classes.

str2idx(strings)[source]¶

Convert strings to indexes.

Parameters: strings (list[str]) – [‘hello’, ‘world’].
Returns: [[1,2,3,3,4], [5,4,6,3,7]].
Return type: indexes (list[list[int]])
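The idx2str and str2idx round trip can be illustrated with a small stand-in class. SketchConvertor is hypothetical and only mirrors the mapping logic; the character list is chosen so that the indexes match the ‘hello’ and ‘world’ example, and is not one of the built-in dictionaries.

```python
class SketchConvertor:
    """Hypothetical character-index mapping; not the mmocr class."""

    def __init__(self, dict_list):
        self.idx2char = list(dict_list)
        self.char2idx = {c: i for i, c in enumerate(self.idx2char)}

    def num_classes(self):
        """Number of entries in the character list."""
        return len(self.idx2char)

    def str2idx(self, strings):
        """Map each string to the list of its character indexes."""
        return [[self.char2idx[c] for c in s] for s in strings]

    def idx2str(self, indexes):
        """Map each index list back to a text string."""
        return [''.join(self.idx2char[i] for i in idx) for idx in indexes]


# Index 0 holds a filler character so the indexes line up with the example.
convertor = SketchConvertor(['_', 'h', 'e', 'l', 'o', 'w', 'r', 'd'])
indexes = convertor.str2idx(['hello', 'world'])
strings = convertor.idx2str(indexes)
```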

str2tensor(strings)[source]¶

Convert text-string to input tensor.

Parameters

strings (list[str]) – [‘hello’, ‘world’].

Returns

[torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].

Return type

tensors (list[torch.Tensor])

tensor2idx(output)[source]¶

Convert model output tensor to character indexes and scores.

Parameters: output (tensor) – The model outputs with size: N * T * C.

Returns

indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].

Return type

tuple(list[list[int]], list[list[float]])

class mmocr.models.textrecog.convertors.CTCConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]¶

Convert between text, index and tensor for CTC loss-based pipeline.

Parameters

dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
lower (bool) – If True, convert original string to lower case.

str2tensor(strings)[source]¶

Convert text-string to ctc-loss input tensor.

Parameters

strings (list[str]) – [‘hello’, ‘world’].

Returns

tensors (list[tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].
flatten_targets (tensor): torch.Tensor([1,2,3,3,4,5,4,6,3,7]).
target_lengths (tensor): torch.IntTensor([5,5]).

Return type

dict (str: tensor | list[tensor])
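How the flattened targets and target lengths relate to the per-string index lists can be sketched with plain lists standing in for tensors. The function name is hypothetical; the dict keys follow the field names listed above.

```python
def sketch_ctc_targets(indexes):
    """Build the three CTC target fields from per-string index lists."""
    flatten_targets = [i for idx in indexes for i in idx]
    target_lengths = [len(idx) for idx in indexes]
    return {
        'tensors': indexes,
        'flatten_targets': flatten_targets,
        'target_lengths': target_lengths,
    }


targets = sketch_ctc_targets([[1, 2, 3, 3, 4], [5, 4, 6, 3, 7]])
```

The lengths let a CTC loss split the flattened sequence back into one target per sample.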

tensor2idx(output, img_metas, topk=1, return_topk=False)[source]¶

Convert model output tensor to index-list.

Parameters

output (tensor) – The model outputs with size: N * T * C.
img_metas (list[dict]) – Each dict contains one image info.
topk (int) – The highest k classes to be returned.
return_topk (bool) – Whether to return topk or just top1.

Returns

indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].
(indexes_topk (list[list[list[int]->len=topk]]), scores_topk (list[list[list[float]->len=topk]])).

Return type

tuple
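For orientation, a common way to turn a T * C output into an index list for a CTC model is greedy decoding: take the best class per time step, merge consecutive repeats, then drop blanks. The sketch below shows that standard technique for one sample; it assumes the blank class has index 0 and is not claimed to match this method line for line.

```python
def sketch_ctc_greedy_decode(probs, blank_idx=0):
    """Greedy CTC decoding for one sample given T x C probabilities."""
    indexes, scores = [], []
    prev = blank_idx
    for step in probs:
        best = max(range(len(step)), key=step.__getitem__)
        # Emit only non-blank classes that differ from the previous step.
        if best != blank_idx and best != prev:
            indexes.append(best)
            scores.append(step[best])
        prev = best
    return indexes, scores


probs = [
    [0.1, 0.8, 0.1],    # class 1
    [0.2, 0.7, 0.1],    # class 1 again, merged with the previous step
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.6, 0.3],    # class 1, kept because a blank separates it
    [0.1, 0.2, 0.7],    # class 2
]
indexes, scores = sketch_ctc_greedy_decode(probs)
```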

class mmocr.models.textrecog.convertors.AttnConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, max_seq_len=40, lower=False, start_end_same=True, **kwargs)[source]¶

Convert between text, index and tensor for encoder-decoder based pipeline.

Parameters

dict_type (str) – Type of dict, should be one of {‘DICT36’, ‘DICT90’}.
dict_file (None|str) – Character dict file path. If not none, higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
max_seq_len (int) – Maximum sequence length of label.
lower (bool) – If True, convert original string to lower case.
start_end_same (bool) – Whether use the same index for start and end token or not. Default: True.

str2tensor(strings)[source]¶

Convert text-string into tensor.

Parameters: strings (list[str]) – [‘hello’, ‘world’].

Returns

tensors (list[Tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].
padded_targets (Tensor): Tensor of size bsz * max_seq_len.

Return type

dict (str: Tensor | list[Tensor])
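One plausible way to build fixed-length padded targets for an encoder-decoder pipeline is to wrap each index list in start and end tokens and pad to max_seq_len. The sketch below uses plain lists, and every token value in it (90 for start and end, 91 for padding) is a made-up placeholder; the real indexes depend on the dictionary and on start_end_same.

```python
def sketch_padded_targets(indexes, max_seq_len, start_idx, end_idx,
                          padding_idx):
    """Wrap each index list in start/end tokens and pad to max_seq_len."""
    padded = []
    for idx in indexes:
        seq = [start_idx] + list(idx) + [end_idx]
        seq = seq[:max_seq_len]  # truncate overly long labels
        seq += [padding_idx] * (max_seq_len - len(seq))
        padded.append(seq)
    return padded


# start_idx == end_idx mimics start_end_same=True; values are placeholders.
padded = sketch_padded_targets(
    [[1, 2, 3, 3, 4], [5, 4]], max_seq_len=8,
    start_idx=90, end_idx=90, padding_idx=91)
```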

tensor2idx(outputs, img_metas=None)[source]¶

Convert output tensor to text-index.

Parameters

outputs (tensor) – Model outputs with size: N * T * C.
img_metas (list[dict]) – Each dict contains one image info.

Returns

indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].

Return type

tuple(list[list[int]], list[list[float]])

class mmocr.models.textrecog.convertors.SegConvertor(dict_type='DICT36', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]¶

Convert between text, index and tensor for segmentation based pipeline.

Parameters

dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
lower (bool) – If True, convert original string to lower case.

tensor2str(output, img_metas=None)[source]¶

Convert model output tensor to string labels.

Parameters

output (tensor) – Model outputs with size: N * C * H * W.
img_metas (list[dict]) – Each dict contains one image info.

Returns

texts (list[str]): Decoded text labels.

scores (list[list[float]]): Decoded chars scores.

textrecog_encoders¶

class mmocr.models.textrecog.encoders.SAREncoder(enc_bi_rnn=False, enc_do_rnn=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, **kwargs)[source]¶

Implementation of the encoder module in SAR (https://arxiv.org/abs/1811.00751).

Parameters

enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
enc_do_rnn (float) – Dropout probability of RNN layer in encoder.
enc_gru (bool) – If True, use GRU, else LSTM in encoder.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
mask (bool) – If True, mask padding in RNN sequence.

class mmocr.models.textrecog.encoders.TFEncoder(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, **kwargs)[source]¶: Encode 2d feature map to 1d sequence.

class mmocr.models.textrecog.encoders.BaseEncoder(*args: Any, **kwargs: Any)[source]¶: Base Encoder class for text recognition.

class mmocr.models.textrecog.encoders.ChannelReductionEncoder(in_channels, out_channels)[source]¶

textrecog_decoders¶

class mmocr.models.textrecog.decoders.CRNNDecoder(in_channels=None, num_classes=None, rnn_flag=False, **kwargs)[source]¶

class mmocr.models.textrecog.decoders.ParallelSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]¶

Implementation of the Parallel Decoder module in SAR (https://arxiv.org/abs/1811.00751).

Parameters

num_classes (int) – Output class number.
channels (list[int]) – Network layer channels.
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.
dec_do_rnn (float) – Dropout of RNN layer in decoder.
dec_gru (bool) – If True, use GRU, else LSTM in decoder.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
d_k (int) – Dim of channels of attention module.
pred_dropout (float) – Dropout probability of prediction layer.
max_seq_len (int) – Maximum sequence length for decoding.
mask (bool) – If True, mask padding in feature map.
start_idx (int) – Index of start token.
padding_idx (int) – Index of padding token.
pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

class mmocr.models.textrecog.decoders.SequentialSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]¶

Implementation of the Sequential Decoder module in SAR (https://arxiv.org/abs/1811.00751).

Parameters

num_classes (int) – Number of output classes.
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.
dec_do_rnn (float) – Dropout of RNN layer in decoder.
dec_gru (bool) – If True, use GRU, else LSTM in decoder.
d_k (int) – Dim of conv layers in attention module.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
pred_dropout (float) – Dropout probability of prediction layer.
max_seq_len (int) – Maximum sequence length during decoding.
mask (bool) – If True, mask padding in feature map.
start_idx (int) – Index of start token.
padding_idx (int) – Index of padding token.
pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

class mmocr.models.textrecog.decoders.ParallelSARDecoderWithBS(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, **kwargs)[source]¶

Parallel Decoder module with beam-search in SAR.

Parameters: beam_width (int) – Width for beam search.

class mmocr.models.textrecog.decoders.TFDecoder(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, num_classes=93, max_seq_len=40, start_idx=1, padding_idx=92, **kwargs)[source]¶: Transformer Decoder block with self attention mechanism.

class mmocr.models.textrecog.decoders.BaseDecoder(**kwargs)[source]¶: Base decoder class for text recognition.

class mmocr.models.textrecog.decoders.SequenceAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, dropout_ratio=0, return_feature=False, encode_value=False)[source]¶

class mmocr.models.textrecog.decoders.PositionAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, return_feature=False, encode_value=False)[source]¶

class mmocr.models.textrecog.decoders.RobustScannerDecoder(num_classes=None, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, encode_value=False, hybrid_decoder=None, position_decoder=None)[source]¶

textrecog_losses¶

class mmocr.models.textrecog.losses.CELoss(ignore_index=-1, reduction='none')[source]¶

Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.

Parameters

ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

class mmocr.models.textrecog.losses.SARLoss(ignore_index=0, reduction='mean', **kwargs)[source]¶

Implementation of the loss module in SAR (https://arxiv.org/abs/1811.00751).

Parameters

ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

class mmocr.models.textrecog.losses.CTCLoss(flatten=True, blank=0, reduction='mean', zero_infinity=False, **kwargs)[source]¶

Implementation of loss module for CTC-loss based text recognition.

Parameters

flatten (bool) – If True, use flattened targets, else padded targets.
blank (int) – Blank label. Default 0.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).
zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Default: False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.
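The flatten option distinguishes two target layouts. The sketch below builds both layouts from the same label sequences in plain Python; the helper names and the padding value are assumptions for illustration, and the actual loss module operates on tensors.

```python
from typing import List, Tuple


def flatten_targets(seqs: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Concatenate all label sequences into one list and keep their lengths."""
    flat = [idx for seq in seqs for idx in seq]
    lengths = [len(seq) for seq in seqs]
    return flat, lengths


def pad_targets(seqs: List[List[int]],
                pad_value: int = 0) -> Tuple[List[List[int]], List[int]]:
    """Pad every label sequence to the longest one and keep their lengths."""
    max_len = max(len(seq) for seq in seqs)
    padded = [list(seq) + [pad_value] * (max_len - len(seq)) for seq in seqs]
    lengths = [len(seq) for seq in seqs]
    return padded, lengths


labels = [[3, 1, 4], [1, 5]]
flat, flat_lengths = flatten_targets(labels)
padded, padded_lengths = pad_targets(labels)
```

Both layouts carry the same information as long as the per-sequence lengths are kept alongside the targets.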

class mmocr.models.textrecog.losses.TFLoss(ignore_index=-1, reduction='none', flatten=True, **kwargs)[source]¶: Implementation of loss module for transformer.

class mmocr.models.textrecog.losses.SegLoss(seg_downsample_ratio=0.5, seg_with_loss_weight=True, ignore_index=255, **kwargs)[source]¶

Implementation of loss module for segmentation based text recognition method.

Parameters

seg_downsample_ratio (float) – Downsample ratio of segmentation map.
seg_with_loss_weight (bool) – If True, set weight for segmentation loss.
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

textrecog_backbones¶

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source]

Implement ResNet backbone for text recognition, modified from: ResNet

Parameters

base_channels (int) – Number of channels of input image tensor.
layers (list[int]) – List of BasicBlock number for each stage.
channels (list[int]) – List of out_channels of Conv2d layer.
out_indices (None | Sequence[int]) – Indices of output stages.
stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

class mmocr.models.textrecog.backbones.VeryDeepVgg(leakyRelu=True, input_channels=3)[source]

Implement VGG-VeryDeep backbone for text recognition, modified from: VGG-VeryDeep

Parameters

input_channels (int) – Number of channels of input image tensor.
leakyRelu (bool) – Use leakyRelu or not.

class mmocr.models.textrecog.backbones.NRTRModalityTransform(input_channels=3, input_height=32)[source]

textrecog_layers¶

class mmocr.models.textrecog.layers.BidirectionalLSTM(nIn, nHidden, nOut)[source]¶

class mmocr.models.textrecog.layers.MultiHeadAttention(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, mask_value=0)[source]¶: Multi-Head Attention module.

class mmocr.models.textrecog.layers.PositionalEncoding(d_hid=512, n_position=200)[source]¶

class mmocr.models.textrecog.layers.PositionwiseFeedForward(d_in, d_hid, dropout=0.1, act_layer=torch.nn.GELU)[source]¶: A two-feed-forward-layer module.

class mmocr.models.textrecog.layers.BasicBlock(inplanes, planes, stride=1, downsample=False)[source]¶

class mmocr.models.textrecog.layers.Bottleneck(inplanes, planes, stride=1, downsample=False)[source]¶

class mmocr.models.textrecog.layers.RobustScannerFusionLayer(dim_model, dim=-1)[source]¶

class mmocr.models.textrecog.layers.DotProductAttentionLayer(dim_model=None)[source]¶

class mmocr.models.textrecog.layers.PositionAwareLayer(dim_model, rnn_layers=2)[source]¶

mmocr.models.textrecog.layers.get_subsequent_mask(seq)[source]¶: For masking out the subsequent info.
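As a rough illustration of what masking out subsequent information means, the sketch below builds a lower-triangular mask in plain Python. It is not the library function: the name `subsequent_mask` and the convention that True marks a visible position are assumptions, and real implementations return a tensor.

```python
from typing import List


def subsequent_mask(seq_len: int) -> List[List[bool]]:
    """Row i is True for columns 0..i, so step i cannot see later steps."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = subsequent_mask(3)
```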

class mmocr.models.textrecog.layers.TransformerDecoderLayer(d_model=512, d_inner=256, n_head=8, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, mask_value=0, act_layer=torch.nn.GELU)[source]¶

kie_extractors¶

class mmocr.models.kie.extractors.SDMGR(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, pretrained=None, class_list=None)[source]¶

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

Parameters

visual_modality (bool) – Whether to use the visual modality.
class_list (None | str) – Mapping file of class index to class name. If None, class index will be shown in show_results, else class name.

forward_train(img, img_metas, relations, texts, gt_bboxes, gt_labels)[source]¶

Parameters

img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.
relations (list[tensor]) – Relations between bboxes.
texts (list[tensor]) – Texts in bboxes.
gt_bboxes (list[tensor]) – Each item is the truth boxes for each image in [tl_x, tl_y, br_x, br_y] format.
gt_labels (list[tensor]) – Class indices corresponding to each box.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

show_result(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]¶

Draw result on img.

Parameters

img (str or tensor) – The image to be displayed.
result (dict) – The results to draw on img.
boxes (list) – Bbox of img.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The output filename. Default: None.

Returns

The drawn image, returned only if neither show nor out_file is set.

Return type

img (tensor)

kie_heads¶

class mmocr.models.kie.heads.SDMGRHead(num_chars=92, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, num_classes=26, loss={'type': 'SDMGRLoss'}, bidirectional=False, train_cfg=None, test_cfg=None)[source]¶

kie_losses¶

class mmocr.models.kie.losses.SDMGRLoss(node_weight=1.0, edge_weight=1.0, ignore=0)[source]¶

The implementation of the loss for key information extraction proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

mmocr.datasets¶

class mmocr.datasets.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]¶

evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]¶

Evaluate the hmean metric.

Parameters

results (list[dict]) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str: float]]

load_annotations(ann_file)[source]¶

Load annotation from COCO style annotation file.

Parameters: ann_file (str) – Path of annotation file.
Returns: Annotation info from COCO api.
Return type: list[dict]

class mmocr.datasets.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

Custom dataset for text detection, text recognition, and their downstream tasks.

The text detection annotation format is as follows. The annotations field is optional for testing (this is one line of anno_file, with line-json-str converted to dict for visualizing only).

{
    "file_name": "sample.jpg",
    "height": 1080,
    "width": 960,
    "annotations": [
        {
            "iscrowd": 0,
            "category_id": 1,
            "bbox": [357.0, 667.0, 804.0, 100.0],
            "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
        }
    ]
}

The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

format1: sample.jpg hello
format2: sample.jpg 20 20 100 20 100 40 20 40 hello
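The two recognition formats above can be told apart by checking whether eight numeric fields follow the file name. The parser below is a hedged sketch under that assumption; `parse_recog_line` and the returned keys are hypothetical names, and whitespace-separated fields are assumed, so a format1 label that itself consists of nine or more tokens starting with eight numbers would be misread.

```python
from typing import Dict, List, Optional


def _is_number(token: str) -> bool:
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_recog_line(line: str) -> Dict[str, object]:
    """Hypothetical parser for one text recognition annotation line."""
    parts = line.strip().split()
    filename = parts[0]
    box: Optional[List[float]] = None
    if len(parts) >= 10 and all(_is_number(p) for p in parts[1:9]):
        # format2: file name, x1 y1 x2 y2 x3 y3 x4 y4, then the text
        box = [float(p) for p in parts[1:9]]
        text = " ".join(parts[9:])
    else:
        # format1: file name followed by the text
        text = " ".join(parts[1:])
    return {"filename": filename, "text": text, "box": box}


plain = parse_recog_line("sample.jpg hello")
boxed = parse_recog_line("sample.jpg 20 20 100 20 100 40 20 40 hello")
```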

Parameters

ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

format_results(results, **kwargs)[source]¶: Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_test_img(img_info)[source]¶

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by pipeline.

Return type

dict

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

evaluate(results, metric='acc', logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

class mmocr.datasets.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

evaluate(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
score_thr (float) – Score threshold for prediction map.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.

Returns

dict[str: float]

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]¶

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:

gt_kernels: to DataContainer (cpu_only=True)
gt_effective_mask: to DataContainer (cpu_only=True)

Parameters

keys (list[str]) – Fields to be formatted to DC only.
call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.
visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]¶

Generate gt shrunk text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters

shrink_ratio (float) – The area shrink ratio between text kernels and their text masks.
thr_min (float) – The minimum value of the threshold map.
thr_max (float) – The maximum value of the threshold map.
min_short_size (int) – The minimum size of polygon below which the polygon is invalid.

draw_border_map(polygon, canvas, mask)[source]¶

Generate threshold map for one polygon.

Parameters

polygon (ndarray) – The polygon boundary ndarray.
canvas (ndarray) – The generated threshold map.
mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]¶

Find invalid polygons.

Parameters: results (dict) – The dict containing gt_mask.
Returns: The indicators for ignoring polygons.
Return type: ignore_tags (list[bool])

generate_targets(results)[source]¶

Generate the gt targets for DBNet.

Parameters: results (dict) – The input result dictionary.
Returns: The output result dictionary.
Return type: results (dict)

generate_thr_map(img_size, polygons)[source]¶

Generate threshold map.

Parameters

img_size (tuple(int)) – The image size (h,w)
polygons (list(ndarray)) – The polygon list.

Returns

thr_map (ndarray): The generated threshold map.
thr_mask (ndarray): The effective mask of the threshold map.

ignore_texts(results, ignore_tags)[source]¶

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters

results (dict) – Result for one image.
ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]¶

Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.

Parameters: poly (ndarray) – The polygon boundary point sequence.
Returns: Whether the polygon is invalid.
Return type: True/False (bool)
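The validity rule above can be approximated in plain Python as shown below. This is a simplified sketch, not the library code: it uses the shoelace formula for the area and an axis-aligned bounding box in place of the minimum (possibly rotated) bounding box, which is an assumption made to avoid extra dependencies. The names `polygon_area` and `is_invalid_polygon` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def is_invalid_polygon(points: List[Point], min_short_size: float = 8) -> bool:
    """Invalid if the area is below 1 or the short box side is too small."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Simplification: axis-aligned box, not the minimum-area rotated box.
    short_side = min(max(xs) - min(xs), max(ys) - min(ys))
    return polygon_area(points) < 1 or short_side < min_short_size


wide = [(0, 0), (100, 0), (100, 20), (0, 20)]
thin = [(0, 0), (100, 0), (100, 5), (0, 5)]
tiny = [(0, 0), (0.5, 0), (0.5, 0.5), (0, 0.5)]
```

For rotated text the axis-aligned box overestimates the short side, so a rotated-rectangle routine would be the closer match to the documented rule.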

class mmocr.datasets.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.KIEDataset(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]¶

Parameters

ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.
dict_file (str) – Character dict file path.
norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]¶: Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

list_to_numpy(ann_infos)[source]¶: Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]¶: Pad text index to same length.
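Padding index sequences to a common length might look like the following plain-Python sketch. The helper name `pad_to_same_length` and the pad value of -1 are assumptions for illustration; the method itself is documented only as padding text indexes to the same length.

```python
from typing import List


def pad_to_same_length(text_inds: List[List[int]],
                       pad_value: int = -1) -> List[List[int]]:
    """Right-pad every index list to the length of the longest one."""
    max_len = max(len(inds) for inds in text_inds)
    return [list(inds) + [pad_value] * (max_len - len(inds))
            for inds in text_inds]


padded_inds = pad_to_same_length([[1, 2, 3], [4]])
```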

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

datasets¶

class mmocr.datasets.base_dataset.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

Custom dataset for text detection, text recognition, and their downstream tasks.

The text detection annotation format is as follows. The annotations field is optional for testing (this is one line of anno_file, with line-json-str converted to dict for visualizing only).

{
    "file_name": "sample.jpg",
    "height": 1080,
    "width": 960,
    "annotations": [
        {
            "iscrowd": 0,
            "category_id": 1,
            "bbox": [357.0, 667.0, 804.0, 100.0],
            "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
        }
    ]
}

The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

format1: sample.jpg hello
format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters

ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

format_results(results, **kwargs)[source]¶: Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_test_img(img_info)[source]¶

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by pipeline.

Return type

dict

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.icdar_dataset.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]¶

evaluate(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]¶

Evaluate the hmean metric.

Parameters

results (list[dict]) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str: float]]

load_annotations(ann_file)[source]¶

Load annotation from COCO style annotation file.

Parameters: ann_file (str) – Path of annotation file.
Returns: Annotation info from COCO api.
Return type: list[dict]

class mmocr.datasets.ocr_dataset.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

evaluate(results, metric='acc', logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

class mmocr.datasets.ocr_seg_dataset.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.text_det_dataset.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶

evaluate(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
score_thr (float) – Score threshold for prediction map.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.

Returns

dict[str: float]

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.kie_dataset.KIEDataset(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]¶

Parameters

ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.
dict_file (str) – Character dict file path.
norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]¶: Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]¶

Evaluate the dataset.

Parameters

results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

dict[str: float]

list_to_numpy(ann_infos)[source]¶: Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]¶: Pad text index to same length.

pre_pipeline(results)[source]¶: Prepare results dict for pipeline.

prepare_train_img(index)[source]¶

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

pipelines¶

class mmocr.datasets.pipelines.LoadTextAnnotations(with_bbox=True, with_label=True, with_mask=False, with_seg=False, poly2mask=True)[source]¶

process_polygons(polygons)[source]¶

Convert polygons to list of ndarray and filter invalid polygons.

Parameters: polygons (list[list]) – Polygons of one instance.
Returns: Processed polygons.
Return type: list[numpy.ndarray]

class mmocr.datasets.pipelines.NormalizeOCR(mean, std)[source]¶: Normalize a tensor image with mean and standard deviation.

class mmocr.datasets.pipelines.OnlineCropOCR(box_keys=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'], jitter_prob=0.5, max_jitter_ratio_x=0.05, max_jitter_ratio_y=0.02)[source]¶

Crop text areas from whole image with bounding box jitter. If no bbox is given, return directly.

Parameters

box_keys (list[str]) – Keys in results which correspond to RoI bbox.
jitter_prob (float) – The probability of box jitter.
max_jitter_ratio_x (float) – Maximum horizontal jitter ratio relative to height.
max_jitter_ratio_y (float) – Maximum vertical jitter ratio relative to height.

class mmocr.datasets.pipelines.ResizeOCR(height, min_width=None, max_width=None, keep_aspect_ratio=True, img_pad_value=0, width_downsample_ratio=0.0625)[source]¶

Image resizing and padding for OCR.

Parameters

height (int | tuple(int)) – Image height after resizing.
min_width (none | int | tuple(int)) – Image minimum width after resizing.
max_width (none | int | tuple(int)) – Image maximum width after resizing.
keep_aspect_ratio (bool) – If True, keep the image aspect ratio during resizing; otherwise resize to the size height * max_width.
img_pad_value (int) – Scalar to fill padding area.
width_downsample_ratio (float) – Downsample ratio in horizontal direction from input image to output feature.
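How these parameters might interact when choosing the output width can be sketched as follows. This is an assumption-laden illustration, not the ResizeOCR implementation: the helper name `resized_width`, the rounding to a multiple of 1 / width_downsample_ratio, and the order of the min/max clamps are all hypothetical.

```python
import math
from typing import Optional


def resized_width(ori_h: int, ori_w: int, height: int,
                  min_width: Optional[int] = None,
                  max_width: Optional[int] = None,
                  keep_aspect_ratio: bool = True,
                  width_downsample_ratio: float = 1.0 / 16) -> Optional[int]:
    """Hypothetical helper: pick the target width for a resized text image."""
    if not keep_aspect_ratio:
        # Without aspect-ratio preservation the image is height * max_width.
        return max_width
    new_w = math.ceil(height / ori_h * ori_w)
    # Assumption: snap the width to a multiple of the downsample divisor.
    divisor = int(1 / width_downsample_ratio)
    if new_w % divisor != 0:
        new_w = round(new_w / divisor) * divisor
    if min_width is not None:
        new_w = max(min_width, new_w)
    if max_width is not None:
        new_w = min(max_width, new_w)
    return new_w


w_mid = resized_width(64, 200, height=32, min_width=32, max_width=160)
w_long = resized_width(32, 1000, height=32, min_width=32, max_width=160)
w_fixed = resized_width(64, 200, height=32, max_width=160,
                        keep_aspect_ratio=False)
```

In this sketch an image narrower than max_width would still need padding with img_pad_value to reach a fixed batch width; that step is left out.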

class mmocr.datasets.pipelines.ToTensorOCR[source]¶: Convert a PIL Image or numpy.ndarray to tensor.

class mmocr.datasets.pipelines.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]¶

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:

gt_kernels: to DataContainer (cpu_only=True)
gt_effective_mask: to DataContainer (cpu_only=True)

Parameters

keys (list[str]) – Fields to be formatted to DC only.
call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.
visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.pipelines.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]¶

Generate gt shrunk text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters

shrink_ratio (float) – The area shrink ratio between text kernels and their text masks.
thr_min (float) – The minimum value of the threshold map.
thr_max (float) – The maximum value of the threshold map.
min_short_size (int) – The minimum size of polygon below which the polygon is invalid.

draw_border_map(polygon, canvas, mask)[source]¶

Generate threshold map for one polygon.

Parameters

polygon (ndarray) – The polygon boundary ndarray.
canvas (ndarray) – The generated threshold map.
mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]¶

Find invalid polygons.

Parameters: results (dict) – The dict containing gt_mask.
Returns: The indicators for ignoring polygons.
Return type: ignore_tags (list[bool])

generate_targets(results)[source]¶

Generate the gt targets for DBNet.

Parameters: results (dict) – The input result dictionary.
Returns: The output result dictionary.
Return type: results (dict)

generate_thr_map(img_size, polygons)[source]¶

Generate threshold map.

Parameters

img_size (tuple(int)) – The image size (h,w)
polygons (list(ndarray)) – The polygon list.

Returns

thr_map (ndarray) – The generated threshold map.
thr_mask (ndarray) – The effective mask of the threshold map.

Return type

tuple(ndarray, ndarray)

ignore_texts(results, ignore_tags)[source]¶

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters

results (dict) – Result for one image.
ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]¶

Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.

Parameters: poly (ndarray) – The polygon boundary point sequence.
Returns: Whether the polygon is invalid.
Return type: True/False (bool)
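
The validity rule can be sketched in plain Python as follows. The function name is hypothetical, and the sketch makes one loud simplification: it uses the axis-aligned bounding box in place of the minimum bounding box, which may be rotated in the actual implementation.

```python
def is_invalid_polygon(points, min_short_size=8):
    """Hypothetical check mirroring the rule described above.

    Assumption: the axis-aligned bounding box stands in for the minimum
    (possibly rotated) bounding box.
    """
    # Polygon area via the shoelace formula.
    area = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        area += x1 * y2 - x2 * y1
    area = abs(area) / 2.0

    # Shorter side of the axis-aligned bounding box.
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    short_side = min(max(xs) - min(xs), max(ys) - min(ys))

    return area < 1 or short_side < min_short_size


print(is_invalid_polygon([(0, 0), (20, 0), (20, 10), (0, 10)]))  # False
print(is_invalid_polygon([(0, 0), (20, 0), (20, 5), (0, 5)]))    # True
```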

class mmocr.datasets.pipelines.PANetTargets(shrink_ratio=(1.0, 0.5), max_shrink=20)[source]¶

Generate the ground truths for PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900]. This code is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

Parameters

shrink_ratio (tuple[float]) – The ratios for shrinking text instances.
max_shrink (int) – The maximum shrink distance.

generate_targets(results)[source]¶

Generate the gt targets for PANet.

Parameters: results (dict) – The input result dictionary.
Returns: The output result dictionary.
Return type: results (dict)

class mmocr.datasets.pipelines.ColorJitter(**kwargs)[source]¶: An interface for torch color jitter so that it can be invoked in mmdetection pipeline.

class mmocr.datasets.pipelines.RandomCropInstances(target_size, instance_key, mask_type='inx0', positive_sample_ratio=0.625)[source]¶

Randomly crop images and make sure to contain text instances.

Parameters

target_size (tuple or int) – (height, width)
positive_sample_ratio (float) – The probability of sampling regions that go through positive regions.

class mmocr.datasets.pipelines.RandomRotateTextDet(rotate_ratio=1.0, max_angle=10)[source]¶: Randomly rotate images.

class mmocr.datasets.pipelines.ScaleAspectJitter(img_scale=None, multiscale_mode='range', ratio_range=None, keep_ratio=False, resize_type='around_min_img_scale', aspect_ratio_range=None, long_size_bound=None, short_size_bound=None, scale_range=None)[source]¶

Resize image and segmentation mask encoded by coordinates.

Allowed resize types are around_min_img_scale, long_short_bound, and indep_sample_in_range.

class mmocr.datasets.pipelines.MultiRotateAugOCR(transforms, rotate_degrees=None, force_rotate=False)[source]¶

Test-time augmentation with multiple rotations in the case that img_height > img_width.

An example configuration is as follows:

rotate_degrees=[0, 90, 270],
transforms=[
    dict(
        type='ResizeOCR',
        height=32,
        min_width=32,
        max_width=160,
        keep_aspect_ratio=True),
    dict(type='ToTensorOCR'),
    dict(type='NormalizeOCR', **img_norm_cfg),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'filename', 'ori_shape', 'img_shape', 'valid_ratio'
        ]),
]

After MultiRotateAugOCR with the above configuration, the results are wrapped into lists of the same length as follows:

dict(
    img=[...],
    img_shape=[...],
    ...
)

Parameters

transforms (list[dict]) – Transformation applied for each augmentation.
rotate_degrees (list[int] | None) – Degrees of anti-clockwise rotation.
force_rotate (bool) – If True, rotate the image by ‘rotate_degrees’ regardless of its aspect ratio.
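
The rule for when rotations are applied can be sketched as below. The helper select_rotations is hypothetical and rests on the assumption that images which are not taller than wide are passed through unrotated unless force_rotate is set.

```python
def select_rotations(img_height, img_width, rotate_degrees=None,
                     force_rotate=False):
    """Hypothetical helper: which rotations to apply at test time."""
    # Assumption: no configured degrees means "keep the original only".
    degrees = list(rotate_degrees) if rotate_degrees else [0]
    if force_rotate or img_height > img_width:
        return degrees
    # Images that are not taller than wide keep their orientation.
    return [0]


print(select_rotations(100, 32, [0, 90, 270]))                    # [0, 90, 270]
print(select_rotations(32, 100, [0, 90, 270]))                    # [0]
print(select_rotations(32, 100, [0, 90, 270], force_rotate=True))  # [0, 90, 270]
```

Each selected rotation yields one augmented copy, which is why the output fields are lists of equal length.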

class mmocr.datasets.pipelines.OCRSegTargets(label_convertor=None, attn_shrink_ratio=0.5, seg_shrink_ratio=0.25, box_type='char_rects', pad_val=255)[source]¶

Generate gt shrunk kernels for segmentation-based OCR frameworks.

Parameters

label_convertor (dict) – Dictionary to construct label_convertor to convert char to index.
attn_shrink_ratio (float) – The area shrink ratio between attention kernels and gt text masks.
seg_shrink_ratio (float) – The area shrink ratio between segmentation kernels and gt text masks.
box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

generate_kernels(resize_shape, pad_shape, char_boxes, char_inds, shrink_ratio=0.5, binary=True)[source]¶

Generate char instance kernels for one shrink ratio.

Parameters

resize_shape (tuple(int, int)) – Image size (height, width) after resizing.
pad_shape (tuple(int, int)) – Image size (height, width) after padding.
char_boxes (list[list[float]]) – The list of char polygons.
char_inds (list[int]) – List of char indexes.
shrink_ratio (float) – The shrink ratio of kernel.
binary (bool) – If True, return binary ndarray containing 0 & 1 only.

Returns

The text kernel mask of (height, width).

Return type

char_kernel (ndarray)

shrink_char_quad(char_quad, shrink_ratio)[source]¶

Shrink char box in style of quadrangle.

Parameters

char_quad (list[float]) – Char box with format [x1, y1, x2, y2, x3, y3, x4, y4].
shrink_ratio (float) – The area shrink ratio between gt kernels and gt text masks.

shrink_char_rect(char_rect, shrink_ratio)[source]¶

Shrink char box in style of rectangle.

Parameters

char_rect (list[float]) – Char box with format [x_min, y_min, x_max, y_max].
shrink_ratio (float) – The area shrink ratio between gt kernels and gt text masks.
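
One way to shrink a rectangle about its center is sketched below. The function shrink_rect is hypothetical, and the sketch assumes that each side is scaled by shrink_ratio; the library may interpret the ratio differently or round the resulting coordinates.

```python
def shrink_rect(char_rect, shrink_ratio):
    """Hypothetical sketch: scale a [x_min, y_min, x_max, y_max] box
    about its center so that each side is `shrink_ratio` times as long."""
    x_min, y_min, x_max, y_max = char_rect
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half_w = (x_max - x_min) * shrink_ratio / 2
    half_h = (y_max - y_min) * shrink_ratio / 2
    return [cx - half_w, cy - half_h, cx + half_w, cy + half_h]


print(shrink_rect([0, 0, 10, 20], 0.5))  # [2.5, 5.0, 7.5, 15.0]
```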

class mmocr.datasets.pipelines.FancyPCA(eig_vec=None, eig_val=None)[source]¶

Implementation of PCA based image augmentation, proposed in the paper Imagenet Classification With Deep Convolutional Neural Networks.

It alters the intensities of RGB values along the principal components of ImageNet dataset.

class mmocr.datasets.pipelines.RandomCropPolyInstances(instance_key='gt_masks', crop_ratio=0.625, min_side_ratio=0.4)[source]¶

Randomly crop images and make sure to contain at least one intact instance.

sample_crop_box(img_size, masks)[source]¶

Generate crop box and make sure not to crop the polygon instances.

Parameters

img_size (tuple(int)) – The image size.
masks (list[list[ndarray]]) – The polygon masks.

class mmocr.datasets.pipelines.RandomPaddingOCR(max_ratio=None, box_type=None)[source]¶

Pad the given image on all sides, as well as modify the coordinates of character bounding box in image.

Parameters

max_ratio (list[int]) – [left, top, right, bottom].
box_type (None|str) – Character box type. If not none, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.ImgAug(args=None)[source]¶

A wrapper to use imgaug https://github.com/aleju/imgaug.

Parameters: args ([list[list|dict]]) – The augmentation list. For details, please refer to the imgaug documentation. Take args=[['Fliplr', 0.5], dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]] as an example: it flips images horizontally with probability 0.5, then applies a random rotation with an angle in the range [-10, 10], and finally resizes each side of the image by an independent scale in the range [0.5, 3.0].
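
Written as a pipeline entry, the example args might look like the fragment below. It follows the dict(type=...) style of the MultiRotateAugOCR example in this reference; the variable name imgaug_step is hypothetical, and registration of the transform under the name 'ImgAug' is an assumption.

```python
# Hypothetical pipeline entry; assumes the transform is registered as 'ImgAug'.
imgaug_step = dict(
    type='ImgAug',
    args=[
        ['Fliplr', 0.5],                       # horizontal flip, p = 0.5
        dict(cls='Affine', rotate=[-10, 10]),  # rotation in [-10, 10] degrees
        ['Resize', [0.5, 3.0]],                # per-side scale in [0.5, 3.0]
    ])
```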

class mmocr.datasets.pipelines.RandomRotateImageBox(min_angle=-10, max_angle=10, box_type='char_quads')[source]¶

Rotate augmentation for segmentation based text recognition.

Parameters

min_angle (int) – Minimum rotation angle for image and box.
max_angle (int) – Maximum rotation angle for image and box.
box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.OpencvToPil(**kwargs)[source]¶: Convert numpy.ndarray (bgr) to PIL Image (rgb).

class mmocr.datasets.pipelines.PilToOpencv(**kwargs)[source]¶: Convert PIL Image (rgb) to numpy.ndarray (bgr).

class mmocr.datasets.pipelines.KIEFormatBundle(*args: Any, **kwargs: Any)[source]¶

Key information extraction formatting bundle.

Based on the DefaultFormatBundle, it simplifies the pipeline of formatting common fields, including “img”, “proposals”, “gt_bboxes”, “gt_labels”, “gt_masks”, “gt_semantic_seg”, “relations” and “texts”. These fields are formatted as follows.

img: (1) transpose, (2) to tensor, (3) to DataContainer (stack=True)
proposals: (1) to tensor, (2) to DataContainer
gt_bboxes: (1) to tensor, (2) to DataContainer
gt_bboxes_ignore: (1) to tensor, (2) to DataContainer
gt_labels: (1) to tensor, (2) to DataContainer
gt_masks: (1) to tensor, (2) to DataContainer (cpu_only=True)
gt_semantic_seg: (1) unsqueeze dim-0, (2) to tensor, (3) to DataContainer (stack=True)
relations: (1) scale, (2) to tensor, (3) to DataContainer
texts: (1) to tensor, (2) to DataContainer

class mmocr.datasets.pipelines.TextSnakeTargets(orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3)[source]¶

Generate the ground truth targets of TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544]. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

Parameters: orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

draw_center_region_maps(top_line, bot_line, center_line, center_region_mask, radius_map, sin_map, cos_map, region_shrink_ratio)[source]¶

Draw attributes on text center region.

Parameters

top_line (ndarray) – The points composing top curved sideline of text polygon.
bot_line (ndarray) – The points composing bottom curved sideline of text polygon.
center_line (ndarray) – The points composing the center line of text instance.
center_region_mask (ndarray) – The text center region mask.
radius_map (ndarray) – The map where the distance from point to sidelines will be drawn on for each pixel in text center region.
sin_map (ndarray) – The map where vector_sin(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).
cos_map (ndarray) – The map where vector_cos(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).
region_shrink_ratio (float) – The shrink ratio of text center.
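
The values stored in sin_map and cos_map are the sine and cosine of the angle between a direction vector and (1, 0). A minimal sketch of that computation follows; the helper name is hypothetical, and taking the direction from two neighbouring points is an assumption made for illustration.

```python
import math


def direction_sin_cos(p_start, p_end):
    """Hypothetical helper: (sin(theta), cos(theta)) of the angle between
    the vector p_start -> p_end and the x axis (1, 0)."""
    dx = p_end[0] - p_start[0]
    dy = p_end[1] - p_start[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError('the two points must be distinct')
    return dy / norm, dx / norm


sin_theta, cos_theta = direction_sin_cos((0, 0), (3, 4))
print(sin_theta, cos_theta)  # 0.8 0.6
```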

find_head_tail(points, orientation_thr)[source]¶

Find the head edge and tail edge of a text polygon.

Parameters

points (ndarray) – The points composing a text polygon.
orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

Returns

head_inds (list) – The indexes of two points composing head edge.
tail_inds (list) – The indexes of two points composing tail edge.

Return type

tuple(list, list)

generate_center_mask_attrib_maps(img_size, text_polys)[source]¶

Generate text center region mask and geometric attribute maps.

Parameters

img_size (tuple) – The image size of (height, width).
text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

center_region_mask (ndarray) – The text center region mask.
radius_map (ndarray) – The distance map from each pixel in text center region to top sideline.
sin_map (ndarray) – The sin(theta) map where theta is the angle between vector (top point - bottom point) and vector (1, 0).
cos_map (ndarray) – The cos(theta) map where theta is the angle between vector (top point - bottom point) and vector (1, 0).

Return type

tuple(ndarray, ndarray, ndarray, ndarray)

generate_targets(results)[source]¶

Generate the gt targets for TextSnake.

Parameters: results (dict) – The input result dictionary.
Returns: The output result dictionary.
Return type: results (dict)

generate_text_region_mask(img_size, text_polys)[source]¶

Generate text region mask.

Parameters

img_size (tuple) – The image size (height, width).
text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text region mask.

Return type

text_region_mask (ndarray)

reorder_poly_edge(points)[source]¶

Get the respective points composing head edge, tail edge, top sideline and bottom sideline.

Parameters

points (ndarray) – The points composing a text polygon.

Returns

head_edge (ndarray) – The two points composing the head edge of text polygon.
tail_edge (ndarray) – The two points composing the tail edge of text polygon.
top_sideline (ndarray) – The points composing top curved sideline of text polygon.
bot_sideline (ndarray) – The points composing bottom curved sideline of text polygon.

Return type

tuple(ndarray, ndarray, ndarray, ndarray)

resample_line(line, n)[source]¶

Resample n points on a line.

Parameters

line (ndarray) – The points composing a line.
n (int) – The number of resampled points.

Returns

The points composing the resampled line.

Return type

resampled_line (ndarray)
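
Resampling a polyline to n points can be sketched as below, assuming the points should be spaced evenly by arc length and the endpoints kept. The function resample_polyline is a hypothetical plain-Python stand-in; the library's own sampling scheme may differ.

```python
import math


def resample_polyline(line, n):
    """Return n points spaced evenly by arc length along `line`."""
    if n < 2 or len(line) < 2:
        raise ValueError('need at least two input points and n >= 2')
    seg_lens = [
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(line, line[1:])
    ]
    total = sum(seg_lens)
    resampled = [tuple(line[0])]
    seg, walked = 0, 0.0  # current segment index, arc length before it
    for i in range(1, n - 1):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(seg_lens) - 1 and walked + seg_lens[seg] < target:
            walked += seg_lens[seg]
            seg += 1
        t = (target - walked) / seg_lens[seg] if seg_lens[seg] else 0.0
        (x1, y1), (x2, y2) = line[seg], line[seg + 1]
        resampled.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    resampled.append(tuple(line[-1]))
    return resampled


print(resample_polyline([(0, 0), (4, 0), (4, 4)], 5))
```

For the L-shaped input above, the five points fall at arc lengths 0, 2, 4, 6 and 8 along the line.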

resample_sidelines(sideline1, sideline2, resample_step)[source]¶

Resample two sidelines to have the same number of points according to the step size.

Parameters

sideline1 (ndarray) – The points composing a sideline of a text polygon.
sideline2 (ndarray) – The points composing another sideline of a text polygon.
resample_step (float) – The resampled step size.

Returns

resampled_line1 (ndarray) – The resampled line 1.
resampled_line2 (ndarray) – The resampled line 2.

Return type

tuple(ndarray, ndarray)

mmocr.datasets.pipelines.sort_vertex(points_x, points_y)[source]¶

Sort box vertices in clockwise order from left-top first.

Parameters

points_x (list[float]) – x of four vertices.
points_y (list[float]) – y of four vertices.

Returns

sorted_points_x (list[float]) – x of sorted four vertices.
sorted_points_y (list[float]) – y of sorted four vertices.

Return type

tuple(list[float], list[float])
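
A clockwise ordering that starts at the top-left vertex can be sketched as follows. The function name is hypothetical, and two assumptions are made: image coordinates have the y axis pointing down, and the top-left vertex is the one with the smallest x + y. The library's own rule for choosing the first vertex may differ.

```python
import math


def sort_vertices_clockwise(points_x, points_y):
    """Hypothetical sketch: order box vertices clockwise, top-left first."""
    pts = list(zip(points_x, points_y))
    cx = sum(points_x) / len(pts)
    cy = sum(points_y) / len(pts)
    # With the y axis pointing down, increasing atan2 angle around the
    # centroid corresponds to clockwise order on screen.
    pts.sort(key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    # Assumption: the top-left vertex has the smallest x + y.
    start = min(range(len(pts)), key=lambda i: pts[i][0] + pts[i][1])
    pts = pts[start:] + pts[:start]
    return [p[0] for p in pts], [p[1] for p in pts]


xs, ys = sort_vertices_clockwise([10, 0, 0, 10], [5, 5, 0, 0])
print(xs, ys)  # [0, 10, 10, 0] [0, 0, 5, 5]
```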