API Reference¶
mmocr.apis¶
mmocr.core¶
evaluation¶
-
mmocr.core.evaluation.
eval_hmean_ic13
(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]¶ Evaluate hmean of text detection using the ICDAR 2013 standard.
- Parameters
det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.
gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.
gt_ignored_boxes (list[list[list[float]]]) – List of arrays of shape (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.
precision_thr (float) – Precision threshold of the iou of one (gt_box, det_box) pair.
recall_thr (float) – Recall threshold of the iou of one (gt_box, det_box) pair.
center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.
one2one_score (float) – Reward when one gt matches one det_box.
one2many_score (float) – Reward when one gt matches many det_boxes.
many2one_score (float) – Reward when many gts match one det_box.
- Returns
Tuple of dicts which encodes the hmean for the dataset and all images.
- Return type
hmean (tuple[dict])
-
mmocr.core.evaluation.
eval_hmean_iou
(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]¶ Evaluate hmean of text detection using the IoU standard.
- Parameters
pred_boxes (list[list[list[float]]]) – Text boxes for an img list. Each box has 2k (>=8) values.
gt_boxes (list[list[list[float]]]) – Ground truth text boxes for an img list. Each box has 2k (>=8) values.
gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for an img list. Each box has 2k (>=8) values.
iou_thr (float) – Iou threshold when one (gt_box, det_box) pair is matched.
precision_thr (float) – Precision threshold when one (gt_box, det_box) pair is matched.
- Returns
Tuple of dicts that encodes the hmean for the dataset and all images.
- Return type
hmean (tuple[dict])
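Both eval_hmean_ic13 and eval_hmean_iou report hmean, the harmonic mean of precision and recall. The sketch below shows only that final step, assuming the match counts have already been accumulated. The helper name compute_hmean_sketch and its arguments are hypothetical and are not part of mmocr.

```python
def compute_hmean_sketch(num_matched, num_gt, num_det):
    """Return (recall, precision, hmean) from accumulated match counts.

    Hypothetical helper for illustration only; not mmocr's API.
    """
    # Guard against empty ground truth or empty detections.
    recall = num_matched / num_gt if num_gt > 0 else 0.0
    precision = num_matched / num_det if num_det > 0 else 0.0
    denom = recall + precision
    # Harmonic mean of recall and precision (same form as the F1 score).
    hmean = 2.0 * recall * precision / denom if denom > 0 else 0.0
    return recall, precision, hmean


# 8 matched boxes out of 10 ground truths and 16 detections.
recall, precision, hmean = compute_hmean_sketch(num_matched=8, num_gt=10, num_det=16)
print(recall, precision, round(hmean, 4))
```

How a (gt_box, det_box) pair counts as matched is what differs between the two standards. That part is governed by the thresholds documented above and is not reproduced here.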
-
mmocr.core.evaluation.
eval_ocr_metric
(pred_texts, gt_texts)[source]¶ Evaluate the text recognition performance with metric: word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.
- Parameters
pred_texts (list[str]) – Text strings of prediction.
gt_texts (list[str]) – Text strings of ground truth.
- Returns
Metric dict for text recognition, including:
word_acc: Accuracy at word level.
word_acc_ignore_case: Accuracy at word level, ignoring letter case.
word_acc_ignore_case_symbol: Accuracy at word level, ignoring letter case and symbols (default metric for academic evaluation).
char_recall: Recall at character level, ignoring letter case and symbols.
char_precision: Precision at character level, ignoring letter case and symbols.
1-N.E.D: 1 - normalized_edit_distance.
- Return type
eval_res (dict[str, float])
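To make the word-level and edit-distance metrics concrete, here is a minimal pure-Python sketch. It is not mmocr's implementation. The names edit_distance and ocr_metric_sketch are hypothetical, strings are compared as given (no symbol stripping), and each edit distance is assumed to be normalized by the length of the longer string.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def ocr_metric_sketch(pred_texts, gt_texts):
    """Compute a subset of the metrics listed above (illustrative only)."""
    assert len(pred_texts) == len(gt_texts) and len(gt_texts) > 0
    n = len(gt_texts)
    exact = sum(p == g for p, g in zip(pred_texts, gt_texts))
    exact_ignore_case = sum(p.lower() == g.lower()
                            for p, g in zip(pred_texts, gt_texts))
    ned_sum = 0.0
    for p, g in zip(pred_texts, gt_texts):
        longest = max(len(p), len(g))
        # Assumption: normalize by the longer of the two strings.
        ned_sum += edit_distance(p, g) / longest if longest else 0.0
    return {
        'word_acc': exact / n,
        'word_acc_ignore_case': exact_ignore_case / n,
        '1-N.E.D': 1.0 - ned_sum / n,
    }


metrics = ocr_metric_sketch(['hello', 'World'], ['hello', 'world'])
print(metrics)
```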
-
mmocr.core.evaluation.
eval_hmean
(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶ Evaluation in hmean metric.
- Parameters
results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result
img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width
ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore
score_thr (float) – Score threshold of prediction map.
metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}
- Return type
dict[str, float]
-
mmocr.core.evaluation.
compute_f1_score
(preds, gts, ignores=[])[source]¶ Compute the F1-score of prediction.
- Parameters
preds (Tensor) – The predicted probability NxC map with N and C being the sample number and class number respectively.
gts (Tensor) – The ground truth vector of size N.
ignores – The index set of classes that are ignored when reporting results. Note: all samples participate in the computation.
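The sketch below illustrates one way an F1-score can be derived from an NxC probability map and a length-N ground truth vector. It uses plain Python lists instead of tensors, and the function name f1_per_class_sketch and its per-class dict return value are assumptions made for illustration; the return format of compute_f1_score is not documented above.

```python
def f1_per_class_sketch(preds, gts, ignores=()):
    """Per-class F1 from an NxC score map (nested lists) and N labels.

    Illustrative only: every sample is used for counting, and ignored
    classes are simply left out of the reported result.
    """
    num_classes = len(preds[0])
    # Predicted label = index of the highest score in each row.
    pred_labels = [max(range(num_classes), key=row.__getitem__) for row in preds]
    scores = {}
    for c in range(num_classes):
        if c in ignores:
            continue
        tp = sum(p == c and g == c for p, g in zip(pred_labels, gts))
        fp = sum(p == c and g != c for p, g in zip(pred_labels, gts))
        fn = sum(p != c and g == c for p, g in zip(pred_labels, gts))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
gts = [0, 1, 1, 1]
f1_all = f1_per_class_sketch(preds, gts)
f1_without_class0 = f1_per_class_sketch(preds, gts, ignores=[0])
```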
mmocr.utils¶
-
mmocr.utils.
get_root_logger
(log_file=None, log_level=20)[source]¶ Use get_logger method in mmcv to get the root logger.
The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmocr”.
- Parameters
log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.
log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.
- Returns
The root logger.
- Return type
logging.Logger
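The handler behavior described above can be sketched with the standard logging module alone. This is an illustration of the pattern, not the mmcv implementation: get_root_logger_sketch and its name argument are hypothetical, and the rank-dependent level handling for distributed runs is omitted. The default log_level=20 corresponds to logging.INFO.

```python
import logging

_initialized = set()  # logger names that already have handlers attached


def get_root_logger_sketch(name, log_file=None, log_level=logging.INFO):
    """Return a package-level logger, initializing it on first use only."""
    logger = logging.getLogger(name)
    if name in _initialized:
        return logger

    # A StreamHandler is always added; a FileHandler only if log_file is given.
    handlers = [logging.StreamHandler()]
    if log_file is not None:
        handlers.append(logging.FileHandler(log_file, 'w'))

    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    for handler in handlers:
        handler.setFormatter(formatter)
        handler.setLevel(log_level)
        logger.addHandler(handler)

    logger.setLevel(log_level)
    _initialized.add(name)
    return logger


logger = get_root_logger_sketch('demo_pkg')
logger.info('logger ready')
```

Calling the function a second time with the same name returns the same logger without adding duplicate handlers.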
-
mmocr.utils.
drop_orientation
(img_file)[source]¶ Check if the image has orientation information. If yes, ignore it by converting the image format to png and return the new filename; otherwise return the original filename.
- Parameters
img_file (str) – The image path
- Returns
The converted image filename with proper postfix
mmocr.models¶
common_backbones¶
-
class
mmocr.models.common.backbones.
UNet
(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None)[source]¶ UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf
- Parameters
in_channels (int) – Number of input image channels. Default: 3.
base_channels (int) – Number of base channels of each stage. The output channels of the first stage. Default: 64.
num_stages (int) – Number of stages in encoder, normally 5. Default: 5.
strides (Sequence[int 1 | 2]) – Strides of each stage in encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in encoder is 1. If strides[i]=2, it uses stride convolution to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).
enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).
dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).
downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of encoder (stages: [1, num_stages)). If the corresponding encoder stage uses stride convolution (strides[i]=2), it will never use MaxPool to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).
enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).
dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).
with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.
conv_cfg (dict | None) – Config dict for convolution layer. Default: None.
norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).
act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).
upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).
norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.
dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.
plugins (dict) – plugins for convolutional layers. Default: None.
- Notice:
The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.
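As a rough illustration of the divisibility requirement, the sketch below derives a whole downsample rate from strides and downsamples, assuming each encoder stage after the first halves the resolution when it uses either a stride-2 convolution or MaxPool. The function names are hypothetical; the authoritative logic lives in UNet._check_input_divisible.

```python
def whole_downsample_rate(strides, downsamples):
    """Total spatial reduction of the encoder under the stated assumption."""
    rate = 1
    for i in range(1, len(strides)):
        # Stage i downsamples by 2 via stride conv or via MaxPool, never both.
        if strides[i] == 2 or downsamples[i - 1]:
            rate *= 2
    return rate


def is_input_divisible(height, width,
                       strides=(1, 1, 1, 1, 1),
                       downsamples=(True, True, True, True)):
    """Check that both image sides are multiples of the downsample rate."""
    rate = whole_downsample_rate(strides, downsamples)
    return height % rate == 0 and width % rate == 0


default_rate = whole_downsample_rate((1, 1, 1, 1, 1), (True, True, True, True))
print(default_rate)                   # 16 with the default configuration
print(is_input_divisible(512, 640))   # True
print(is_input_divisible(500, 640))   # False: 500 is not a multiple of 16
```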
textdet_dense_heads¶
-
class
mmocr.models.textdet.dense_heads.
PSEHead
(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PSELoss'}, train_cfg=None, test_cfg=None)[source]¶ The class for PSENet head.
-
class
mmocr.models.textdet.dense_heads.
PANHead
(in_channels, out_channels, text_repr_type='poly', downsample_ratio=0.25, loss={'type': 'PANLoss'}, train_cfg=None, test_cfg=None)[source]¶ The class for PANet head.
-
class
mmocr.models.textdet.dense_heads.
DBHead
(in_channels, with_bias=False, decoding_type='db', text_repr_type='poly', downsample_ratio=1.0, loss={'type': 'DBLoss'}, train_cfg=None, test_cfg=None)[source]¶ The class for DBNet head.
This was partially adapted from https://github.com/MhLiao/DB
-
class
mmocr.models.textdet.dense_heads.
HeadMixin
[source]¶ The head mixin for DBNet and PANet heads.
-
get_boundary
(score_maps, img_metas, rescale)[source]¶ Compute text boundaries via post processing.
- Parameters
score_maps (Tensor) – The text score map.
img_metas (dict) – The image meta info.
rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.
- Returns
The result dict.
- Return type
results (dict)
-
loss
(pred_maps, **kwargs)[source]¶ Compute the loss for text detection.
- Parameters
pred_maps (tensor) – The input score maps of NxCxHxW.
- Returns
The dict for losses.
- Return type
losses (dict)
-
resize_boundary
(boundaries, scale_factor)[source]¶ Rescale boundaries via scale_factor.
- Parameters
boundaries (list[list[float]]) – The boundary list. Each boundary has size 2k+1 with k>=4.
scale_factor (ndarray) – The scale factor of size (4,).
- Returns
The scaled boundaries.
- Return type
boundaries (list[list[float]])
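A minimal sketch of boundary rescaling follows. It assumes each boundary is laid out as [x1, y1, ..., xk, yk, score] and that scale_factor is ordered (w_scale, h_scale, w_scale, h_scale), so only the first two entries are needed. Both assumptions, and the name resize_boundary_sketch, are illustrative. Whether the caller passes a factor or its inverse depends on the calling code and is not shown here.

```python
def resize_boundary_sketch(boundaries, scale_factor):
    """Multiply x coordinates by scale_factor[0] and y by scale_factor[1].

    The trailing score of every boundary is kept unchanged.
    """
    scale_x, scale_y = scale_factor[0], scale_factor[1]
    resized = []
    for boundary in boundaries:
        coords, score = boundary[:-1], boundary[-1]
        # 2k coordinates with k >= 4, as documented above.
        assert len(coords) >= 8 and len(coords) % 2 == 0
        scaled = [value * (scale_x if i % 2 == 0 else scale_y)
                  for i, value in enumerate(coords)]
        resized.append(scaled + [score])
    return resized


boxes = [[0, 0, 10, 0, 10, 5, 0, 5, 0.9]]
resized = resize_boundary_sketch(boxes, [2.0, 0.5, 2.0, 0.5])
```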
-
textdet_necks¶
-
class
mmocr.models.textdet.necks.
FPEM_FFM
(in_channels, conv_out=128, fpem_repeat=2, align_corners=False)[source]¶ This code is from https://github.com/WenmuZhou/PAN.pytorch.
-
class
mmocr.models.textdet.necks.
FPNF
(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', upsample_ratio=1)[source]¶ FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.
-
class
mmocr.models.textdet.necks.
FPNC
(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, conv_after_concat=False)[source]¶ FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.
This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch
-
class
mmocr.models.textdet.necks.
FPN_UNET
(in_channels, out_channels)[source]¶ The class for implementing DRRG and TextSnake U-Net-like FPN.
DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection [https://arxiv.org/abs/2003.07493]. TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544].
textdet_detectors¶
-
class
mmocr.models.textdet.detectors.
TextDetectorMixin
(show_score)[source]¶ The class for implementing text detector auxiliary methods.
-
get_boundary
(results)[source]¶ Convert segmentation into text boundaries.
- Parameters
results (tuple) – The result tuple. The first element is segmentation while the second is its scores.
- Returns
A result dict containing ‘boundary_result’.
- Return type
results (dict)
-
show_result
(img, result, score_thr=0.5, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]¶ Draw result over img.
- Parameters
img (str or Tensor) – The image to be displayed.
result (dict) – The results to draw over img.
score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.5.
bbox_color (str or tuple or Color) – Color of bbox lines.
text_color (str or tuple or Color) – Color of texts.
thickness (int) – Thickness of lines.
font_scale (float) – Font scales of texts.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The filename to write the image. Default: None.
-
-
class
mmocr.models.textdet.detectors.
SingleStageTextDetector
(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None)[source]¶ The class for implementing single stage text detector.
It is the parent class of PANet, PSENet, and DBNet.
-
forward_train
(img, img_metas, **kwargs)[source]¶ - Parameters
img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – A list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.
- Returns
A dictionary of loss components.
- Return type
dict[str, Tensor]
-
-
class
mmocr.models.textdet.detectors.
OCRMaskRCNN
(backbone, rpn_head, roi_head, train_cfg, test_cfg, neck=None, pretrained=None, text_repr_type='quad', show_score=False)[source]¶ Mask RCNN tailored for OCR.
-
class
mmocr.models.textdet.detectors.
DBNet
(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶ The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.
-
class
mmocr.models.textdet.detectors.
PANet
(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶ The class for implementing PANet text detector:
Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].
-
class
mmocr.models.textdet.detectors.
PSENet
(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False)[source]¶ The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.
textdet_losses¶
-
class
mmocr.models.textdet.losses.
PANLoss
(alpha=0.5, beta=0.25, delta_aggregation=0.5, delta_discrimination=3, ohem_ratio=3, reduction='mean', speedup_bbox_thr=-1)[source]¶ The class for implementing PANet loss: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].
This was partially adapted from https://github.com/WenmuZhou/PAN.pytorch
-
aggregation_discrimination_loss
(gt_texts, gt_kernels, inst_embeds)[source]¶ Compute the aggregation and discriminative losses.
- Parameters
gt_texts (tensor) – The ground truth text mask of size Nx1xHxW.
gt_kernels (tensor) – The ground truth text kernel mask of size Nx1xHxW.
inst_embeds (tensor) – The text instance embedding tensor of size Nx4xHxW.
- Returns
loss_aggrs (tensor): The aggregation loss before reduction.
loss_discrs (tensor): The discriminative loss before reduction.
-
bitmasks2tensor
(bitmasks, target_sz)[source]¶ Convert Bitmasks to tensor.
- Parameters
bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size HxW.
- Returns
The list of kernel tensors. Each element is for one kernel level.
- Return type
results (list[tensor])
-
forward
(preds, downsample_ratio, gt_kernels, gt_mask)[source]¶ Compute PANet loss.
- Parameters
preds (tensor) – The output tensor with size of Nx6xHxW.
gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.
gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.
downsample_ratio (float) – The downsample ratio between preds and the input img.
- Returns
The loss dictionary.
- Return type
results (dict)
-
ohem_batch
(text_scores, gt_texts, gt_mask)[source]¶ OHEM sampling for a batch of imgs.
- Parameters
text_scores (Tensor) – The text scores of size NxHxW.
gt_texts (Tensor) – The gt text masks of size NxHxW.
gt_mask (Tensor) – The gt effective mask of size NxHxW.
- Returns
The sampled mask of size NxHxW.
- Return type
sampled_masks (Tensor)
-
ohem_img
(text_score, gt_text, gt_mask)[source]¶ Sample the top-k maximal negative samples and all positive samples.
- Parameters
text_score (Tensor) – The text score with size of HxW.
gt_text (Tensor) – The ground truth text mask of HxW.
gt_mask (Tensor) – The effective region mask of HxW.
- Returns
The sampled pixel mask of size HxW.
- Return type
sampled_mask (Tensor)
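The per-image sampling rule (keep all positives plus the hardest negatives) can be illustrated on nested lists. This sketch is not the tensor-based code in PANLoss: the name ohem_img_sketch is hypothetical, and it assumes the number of kept negatives is capped at ohem_ratio times the number of positives and that pixels outside the effective mask are never sampled.

```python
def ohem_img_sketch(text_score, gt_text, gt_mask, ohem_ratio=3):
    """Return an HxW 0/1 mask with all positives and the top-k negatives."""
    height, width = len(text_score), len(text_score[0])
    pixels = [(r, c) for r in range(height) for c in range(width)]

    # Only pixels inside the effective region are candidates.
    pos = [(r, c) for r, c in pixels
           if gt_text[r][c] > 0.5 and gt_mask[r][c] > 0.5]
    neg = [(r, c) for r, c in pixels
           if gt_text[r][c] <= 0.5 and gt_mask[r][c] > 0.5]

    num_neg = min(int(len(pos) * ohem_ratio), len(neg))
    if not pos or num_neg == 0:
        # Nothing to balance: fall back to the effective mask itself.
        return [[1 if gt_mask[r][c] > 0.5 else 0 for c in range(width)]
                for r in range(height)]

    # Hardest negatives are the background pixels with the highest text score.
    hardest = sorted(neg, key=lambda rc: text_score[rc[0]][rc[1]],
                     reverse=True)[:num_neg]
    keep = set(pos) | set(hardest)
    return [[1 if (r, c) in keep else 0 for c in range(width)]
            for r in range(height)]


text_score = [[0.9, 0.8, 0.1, 0.7],
              [0.2, 0.6, 0.3, 0.05]]
gt_text = [[1, 0, 0, 0],
           [0, 0, 0, 0]]
gt_mask = [[1, 0, 1, 1],
           [1, 1, 1, 1]]
sampled_mask = ohem_img_sketch(text_score, gt_text, gt_mask, ohem_ratio=3)
```

With one positive pixel and ohem_ratio=3, three negatives are kept; the high-scoring pixel at row 0, column 1 is skipped because it lies outside the effective mask.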
-
-
class
mmocr.models.textdet.losses.
PSELoss
(alpha=0.7, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive')[source]¶ The class for implementing PSENet loss: Shape Robust Text Detection with Progressive Scale Expansion Network [https://arxiv.org/abs/1806.02559].
This is partially adapted from https://github.com/whai362/PSENet.
-
forward
(score_maps, downsample_ratio, gt_kernels, gt_mask)[source]¶ Compute PSENet loss.
- Parameters
score_maps (tensor) – The output tensor with size of Nx6xHxW.
gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.
gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.
downsample_ratio (float) – The downsample ratio between score_maps and the input img.
- Returns
The loss.
- Return type
results (dict)
-
-
class
mmocr.models.textdet.losses.
DBLoss
(alpha=1, beta=1, reduction='mean', negative_ratio=3.0, eps=1e-06, bbce_loss=False)[source]¶ The class for implementing DBNet loss.
This is partially adapted from https://github.com/MhLiao/DB.
-
bitmasks2tensor
(bitmasks, target_sz)[source]¶ Convert Bitmasks to tensor.
- Parameters
bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size of KxHxW with K being the number of kernels.
- Returns
The list of kernel tensors. Each element is for one kernel level.
- Return type
result_tensors (list[tensor])
-
forward
(preds, downsample_ratio, gt_shrink, gt_shrink_mask, gt_thr, gt_thr_mask)[source]¶ Compute DBNet loss.
- Parameters
preds (tensor) – The output tensor with size of Nx3xHxW.
downsample_ratio (float) – The downsample ratio for the ground truths.
gt_shrink (list[BitmapMasks]) – The mask list with each element being the shrunk text mask for one img.
gt_shrink_mask (list[BitmapMasks]) – The effective mask list with each element being the shrunk effective mask for one img.
gt_thr (list[BitmapMasks]) – The mask list with each element being the threshold text mask for one img.
gt_thr_mask (list[BitmapMasks]) – The effective mask list with each element being the threshold effective mask for one img.
- Returns
The dict for DBNet losses with loss_prob, loss_db and loss_thresh.
- Return type
results (dict)
-
-
class
mmocr.models.textdet.losses.
TextSnakeLoss
(ohem_ratio=3.0)[source]¶ The class for implementing TextSnake loss: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544]. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.
-
bitmasks2tensor
(bitmasks, target_sz)[source]¶ Convert Bitmasks to tensor.
- Parameters
bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.
target_sz (tuple(int, int)) – The target tensor size HxW.
- Returns
The list of kernel tensors. Each element is for one kernel level.
- Return type
results (list[tensor])
-
textdet_postprocess¶
textrecog_recognizer¶
-
class
mmocr.models.textrecog.recognizer.
BaseRecognizer
[source]¶ Base class for text recognition.
-
abstract
aug_test
(imgs, img_metas, **kwargs)[source]¶ Test function with test time augmentation.
- Parameters
imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.
img_metas (list[list[dict]]) – The metadata of images.
-
forward
(img, img_metas, return_loss=True, **kwargs)[source]¶ Calls either forward_train() or forward_test() depending on whether return_loss is True.
Note that img and img_meta are single-nested (i.e. tensor and list[dict]).
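The dispatch described for forward can be pictured with a toy class. RecognizerSketch is hypothetical and its two branch methods return placeholder strings only, so that the routing is visible; real recognizers return losses and predictions instead.

```python
class RecognizerSketch:
    """Toy stand-in that shows only the return_loss routing."""

    def forward_train(self, img, img_metas, **kwargs):
        return 'train'  # placeholder for a dict of losses

    def forward_test(self, img, img_metas, **kwargs):
        return 'test'  # placeholder for recognition results

    def forward(self, img, img_metas, return_loss=True, **kwargs):
        # return_loss=True selects the training branch, otherwise testing.
        if return_loss:
            return self.forward_train(img, img_metas, **kwargs)
        return self.forward_test(img, img_metas, **kwargs)


model = RecognizerSketch()
train_out = model.forward(img=None, img_metas=[{}])
test_out = model.forward(img=None, img_metas=[{}], return_loss=False)
```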
-
forward_test
(imgs, img_metas, **kwargs)[source]¶ - Parameters
imgs (tensor | list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.
img_metas (list[dict] | list[list[dict]]) – The outer list indicates images in a batch.
-
abstract
forward_train
(imgs, img_metas, **kwargs)[source]¶ - Parameters
imgs (tensor) – Tensors with shape (N, C, H, W). Typically should be mean centered and std scaled.
img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, see mmdet.datasets.pipelines.Collect.
kwargs (keyword arguments) – Specific to concrete implementation.
-
init_weights
(pretrained=None)[source]¶ Initialize the weights for detector.
- Parameters
pretrained (str, optional) – Path to pre-trained weights. Defaults to None.
-
show_result
(img, result, gt_label='', win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]¶ Draw result on img.
- Parameters
img (str or tensor) – The image to be displayed.
result (dict) – The results to draw on img.
gt_label (str) – Ground truth label of img.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The output filename. Default: None.
- Returns
The drawn image, returned only if show is False and out_file is None.
- Return type
img (tensor)
-
train_step
(data, optimizer)[source]¶ The iteration step during training.
This method defines an iteration step during training, except for the back propagation and optimizer update, which are done by an optimizer hook. Note that in some complicated cases or models (e.g. GAN), the whole process (including the back propagation and optimizer update) is also defined by this method.
- Parameters
data (dict) – The outputs of dataloader.
optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.
- Returns
It should contain at least 3 keys: loss, log_vars, num_samples.
loss is a tensor for back propagation, which is a weighted sum of multiple losses.
log_vars contains all the variables to be sent to the logger.
num_samples indicates the batch size used for averaging the logs (Note: for the DDP model, num_samples refers to the batch size for each GPU).
- Return type
dict
-
val_step
(data, optimizer)[source]¶ The iteration step during validation.
This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented by this method, but by an evaluation hook.
-
-
class
mmocr.models.textrecog.recognizer.
CRNNNet
(*args: Any, **kwargs: Any)[source]¶ CTC-loss based recognizer.
-
class
mmocr.models.textrecog.recognizer.
SARNet
(*args: Any, **kwargs: Any)[source]¶ Implementation of SAR
-
class
mmocr.models.textrecog.recognizer.
NRTR
(*args: Any, **kwargs: Any)[source]¶ Implementation of NRTR
textrecog_backbones¶
-
class
mmocr.models.textrecog.backbones.
ResNet31OCR
(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source]¶ - Implement ResNet backbone for text recognition, modified from
- Parameters
base_channels (int) – Number of channels of input image tensor.
layers (list[int]) – List of BasicBlock number for each stage.
channels (list[int]) – List of out_channels of Conv2d layer.
out_indices (None | Sequence[int]) – Indices of output stages.
stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.
textrecog_necks¶
-
class
mmocr.models.textrecog.necks.
FPNOCR
(in_channels, out_channels, last_stage_only=True)[source]¶ FPN-like Network for segmentation based text recognition.
- Parameters
in_channels (list[int]) – Number of input channels for each scale.
out_channels (int) – Number of output channels for each scale.
last_stage_only (bool) – If True, output last stage only.
textrecog_heads¶
-
class
mmocr.models.textrecog.heads.
SegHead
(in_channels=128, num_classes=37, upsample_param=None)[source]¶ Head for segmentation based text recognition.
- Parameters
in_channels (int) – Number of input channels.
num_classes (int) – Number of output classes.
upsample_param (dict | None) – Config dict for interpolation layer. Default: dict(scale_factor=1.0, mode=’nearest’)
textrecog_convertors¶
-
class
mmocr.models.textrecog.convertors.
BaseConvertor
(dict_type='DICT90', dict_file=None, dict_list=None)[source]¶ Convert between text, index and tensor for text recognition pipeline.
- Parameters
dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the dict_file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
-
idx2str
(indexes)[source]¶ Convert indexes to text strings.
- Parameters
indexes (list[list[int]]) – [[1,2,3,3,4], [5,4,6,3,7]].
- Returns
[‘hello’, ‘world’].
- Return type
strings (list[str])
-
str2idx
(strings)[source]¶ Convert strings to indexes.
- Parameters
strings (list[str]) – [‘hello’, ‘world’].
- Returns
[[1,2,3,3,4], [5,4,6,3,7]].
- Return type
indexes (list[list[int]])
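The str2idx and idx2str examples above can be reproduced with a small character table. ConvertorSketch is a hypothetical stand-in, not BaseConvertor, and the dictionary used here (a placeholder at index 0 followed by the letters of the two example words) is chosen only so that the indexes match the documented values.

```python
class ConvertorSketch:
    """Map characters to indexes and back using a fixed character list."""

    def __init__(self, dict_list):
        self.idx2char = list(dict_list)
        self.char2idx = {ch: i for i, ch in enumerate(self.idx2char)}

    def str2idx(self, strings):
        # Each string becomes a list of character indexes.
        return [[self.char2idx[ch] for ch in s] for s in strings]

    def idx2str(self, indexes):
        # Each index list is joined back into a string.
        return [''.join(self.idx2char[i] for i in seq) for seq in indexes]


# Index 0 is a placeholder; 'h'=1, 'e'=2, 'l'=3, 'o'=4, 'w'=5, 'r'=6, 'd'=7.
convertor = ConvertorSketch(['_', 'h', 'e', 'l', 'o', 'w', 'r', 'd'])
indexes = convertor.str2idx(['hello', 'world'])
strings = convertor.idx2str(indexes)
```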
-
str2tensor
(strings)[source]¶ Convert text-string to input tensor.
- Parameters
strings (list[str]) – [‘hello’, ‘world’].
- Returns
- [torch.Tensor([1,2,3,3,4]),
torch.Tensor([5,4,6,3,7])].
- Return type
tensors (list[torch.Tensor])
-
tensor2idx
(output)[source]¶ Convert model output tensor to character indexes and scores.
- Parameters
output (tensor) – The model outputs with size: N * T * C.
- Returns
indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].
-
class
mmocr.models.textrecog.convertors.
CTCConvertor
(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]¶ Convert between text, index and tensor for CTC loss-based pipeline.
- Parameters
dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
lower (bool) – If True, convert original string to lower case.
-
str2tensor
(strings)[source]¶ Convert text-string to ctc-loss input tensor.
- Parameters
strings (list[str]) – [‘hello’, ‘world’].
- Returns
tensors (list[tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].
flatten_targets (tensor): torch.Tensor([1,2,3,3,4,5,4,6,3,7]).
target_lengths (tensor): torch.IntTensor([5,5]).
- Return type
dict (str: tensor | list[tensor])
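The relation between the three returned entries can be shown with plain lists. ctc_targets_sketch is a hypothetical helper that starts from already-indexed strings and returns lists rather than tensors. The key names flatten_targets and target_lengths come from the documentation above; the key 'targets' is an assumption.

```python
def ctc_targets_sketch(indexes):
    """Build per-sample, flattened, and length targets for a CTC-style loss."""
    return {
        # One index list per sample.
        'targets': [list(seq) for seq in indexes],
        # All samples concatenated into one 1-D sequence.
        'flatten_targets': [i for seq in indexes for i in seq],
        # Length of each sample, used to split the flattened sequence again.
        'target_lengths': [len(seq) for seq in indexes],
    }


ctc_targets = ctc_targets_sketch([[1, 2, 3, 3, 4], [5, 4, 6, 3, 7]])
```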
-
tensor2idx
(output, img_metas, topk=1, return_topk=False)[source]¶ Convert model output tensor to index-list.
- Parameters
output (tensor) – The model outputs with size: N * T * C.
img_metas (list[dict]) – Each dict contains one image info.
topk (int) – The highest k classes to be returned.
return_topk (bool) – Whether to return topk or just top1.
- Returns
indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].
Optional top-k outputs: indexes_topk (list[list[list[int]->len=topk]]), scores_topk (list[list[list[float]->len=topk]]).
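For readers unfamiliar with CTC decoding, the sketch below shows greedy (top-1) decoding for a single T x C sequence: take the best class at each step, collapse consecutive repeats, and drop the blank. It is a generic illustration, not CTCConvertor.tensor2idx. The function name is hypothetical and the blank index is assumed to be 0, which the documentation above does not state.

```python
def ctc_greedy_decode_sketch(probs, blank_idx=0):
    """Decode one T x C probability sequence into indexes and scores."""
    indexes, scores = [], []
    prev = blank_idx
    for step in probs:
        best = max(range(len(step)), key=step.__getitem__)
        # Emit a class only when it is not blank and differs from the
        # previous step, so repeated frames collapse into one character.
        if best != blank_idx and best != prev:
            indexes.append(best)
            scores.append(step[best])
        prev = best
    return indexes, scores


probs = [
    [0.10, 0.80, 0.10],  # class 1
    [0.20, 0.70, 0.10],  # class 1 again (collapsed)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.60, 0.30],  # class 1 (new, since a blank came in between)
    [0.10, 0.20, 0.70],  # class 2
    [0.10, 0.30, 0.60],  # class 2 again (collapsed)
    [0.80, 0.10, 0.10],  # blank
]
decoded_indexes, decoded_scores = ctc_greedy_decode_sketch(probs)
```

Here each score is the probability at the first frame of a run; other conventions, such as averaging over the run, are equally possible.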
-
class
mmocr.models.textrecog.convertors.
AttnConvertor
(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, max_seq_len=40, lower=False, start_end_same=True, **kwargs)[source]¶ Convert between text, index and tensor for encoder-decoder based pipeline.
- Parameters
dict_type (str) – Type of dict, should be one of {‘DICT36’, ‘DICT90’}.
dict_file (None|str) – Character dict file path. If not none, higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
max_seq_len (int) – Maximum sequence length of label.
lower (bool) – If True, convert original string to lower case.
start_end_same (bool) – Whether to use the same index for start and end token or not. Default: True.
-
str2tensor
(strings)[source]¶ Convert text-string into tensor.
- Parameters
strings (list[str]) – [‘hello’, ‘world’].
- Returns
tensors (list[Tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].
padded_targets (Tensor(bsz * max_seq_len)).
- Return type
dict (str: Tensor | list[Tensor])
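One plausible way to obtain a bsz * max_seq_len target block is sketched below with plain lists. The layout is an assumption made for illustration: each sequence is wrapped in start and end tokens, truncated to max_seq_len, and right-padded with a padding index. The function name and the token indexes used in the example are hypothetical and are not taken from AttnConvertor.

```python
def pad_targets_sketch(indexes, max_seq_len, start_idx, end_idx, padding_idx):
    """Return a bsz x max_seq_len nested list of padded target indexes."""
    padded = []
    for seq in indexes:
        row = [start_idx] + list(seq) + [end_idx]
        row = row[:max_seq_len]                           # truncate if too long
        row += [padding_idx] * (max_seq_len - len(row))   # pad if too short
        padded.append(row)
    return padded


# start_idx == end_idx mirrors the start_end_same=True setting.
padded_targets = pad_targets_sketch(
    [[1, 2, 3], [5, 4, 6, 3, 7]],
    max_seq_len=6, start_idx=8, end_idx=8, padding_idx=9)
```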
-
tensor2idx
(outputs, img_metas=None)[source]¶ Convert output tensor to text-index.
- Parameters
outputs (tensor) – Model outputs with size: N * T * C.
img_metas (list[dict]) – Each dict contains one image info.
- Returns
indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].
-
class
mmocr.models.textrecog.convertors.
SegConvertor
(dict_type='DICT36', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]¶ Convert between text, index and tensor for segmentation based pipeline.
- Parameters
dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.
dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.
dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
with_unknown (bool) – If True, add UKN token to class.
lower (bool) – If True, convert original string to lower case.
-
tensor2str
(output, img_metas=None)[source]¶ Convert model output tensor to string labels.
- Parameters
output (tensor) – Model outputs with size: N * C * H * W.
img_metas (list[dict]) – Each dict contains one image info.
- Returns
texts (list[str]): Decoded text labels.
scores (list[list[float]]): Decoded chars scores.
textrecog_encoders¶
-
class
mmocr.models.textrecog.encoders.
SAREncoder
(enc_bi_rnn=False, enc_do_rnn=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, **kwargs)[source]¶ Implementation of encoder module in SAR [https://arxiv.org/abs/1811.00751].
- Parameters
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
enc_do_rnn (float) – Dropout probability of RNN layer in encoder.
enc_gru (bool) – If True, use GRU, else LSTM in encoder.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
mask (bool) – If True, mask padding in RNN sequence.
-
class
mmocr.models.textrecog.encoders.
TFEncoder
(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, **kwargs)[source]¶ Encode 2d feature map to 1d sequence.
textrecog_decoders¶
-
class
mmocr.models.textrecog.decoders.
CRNNDecoder
(in_channels=None, num_classes=None, rnn_flag=False, **kwargs)[source]¶
-
class
mmocr.models.textrecog.decoders.
ParallelSARDecoder
(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]¶ Implementation of Parallel Decoder module in SAR [https://arxiv.org/abs/1811.00751].
- Parameters
num_classes (int) – Output class number.
channels (list[int]) – Network layer channels.
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.
dec_do_rnn (float) – Dropout of RNN layer in decoder.
dec_gru (bool) – If True, use GRU, else LSTM in decoder.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
d_k (int) – Dim of channels of attention module.
pred_dropout (float) – Dropout probability of prediction layer.
max_seq_len (int) – Maximum sequence length for decoding.
mask (bool) – If True, mask padding in feature map.
start_idx (int) – Index of start token.
padding_idx (int) – Index of padding token.
pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.
-
class
mmocr.models.textrecog.decoders.
SequentialSARDecoder
(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, start_idx=0, padding_idx=92, pred_concat=False, **kwargs)[source]¶ Implementation of the Sequential Decoder module in SAR [https://arxiv.org/abs/1811.00751].
- Parameters
num_classes (int) – Number of output classes.
enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.
dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.
dec_do_rnn (float) – Dropout of RNN layer in decoder.
dec_gru (bool) – If True, use GRU, else LSTM in decoder.
d_k (int) – Dim of conv layers in attention module.
d_model (int) – Dim of channels from backbone.
d_enc (int) – Dim of encoder RNN layer.
pred_dropout (float) – Dropout probability of prediction layer.
max_seq_len (int) – Maximum sequence length during decoding.
mask (bool) – If True, mask padding in feature map.
start_idx (int) – Index of start token.
padding_idx (int) – Index of padding token.
pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.
-
class
mmocr.models.textrecog.decoders.
ParallelSARDecoderWithBS
(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, **kwargs)[source]¶ Parallel Decoder module with beam-search in SAR.
- Parameters
beam_width (int) – Width for beam search.
-
class
mmocr.models.textrecog.decoders.
TFDecoder
(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, num_classes=93, max_seq_len=40, start_idx=1, padding_idx=92, **kwargs)[source]¶ Transformer Decoder block with self attention mechanism.
-
class
mmocr.models.textrecog.decoders.
BaseDecoder
(**kwargs)[source]¶ Base decoder class for text recognition.
-
class
mmocr.models.textrecog.decoders.
SequenceAttentionDecoder
(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, dropout_ratio=0, return_feature=False, encode_value=False)[source]¶
textrecog_losses¶
-
class
mmocr.models.textrecog.losses.
CELoss
(ignore_index=-1, reduction='none')[source]¶ Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.
- Parameters
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).
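The interaction between ignore_index and reduction can be illustrated with a small pure-Python sketch. This is not the CELoss implementation; the function name `cross_entropy` and the choice to return 0.0 when every target is ignored are assumptions made for illustration.

```python
import math


def cross_entropy(logits, targets, ignore_index=-1, reduction="none"):
    """Cross entropy per position; ignored targets contribute nothing.

    Hypothetical helper for illustration only, not part of mmocr.
    """
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            losses.append(0.0)  # ignored position: zero loss, zero gradient
            continue
        top = max(row)  # subtract the max for numerical stability
        log_sum_exp = top + math.log(sum(math.exp(v - top) for v in row))
        losses.append(log_sum_exp - row[target])
    if reduction == "none":
        return losses
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        kept = sum(1 for t in targets if t != ignore_index)
        # Assumption: average over non-ignored positions only.
        return sum(losses) / kept if kept else 0.0
    raise ValueError(f"unsupported reduction: {reduction}")


logits = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
targets = [0, -1, 1]  # the second position is ignored
per_token = cross_entropy(logits, targets, ignore_index=-1, reduction="none")
mean_loss = cross_entropy(logits, targets, ignore_index=-1, reduction="mean")
```

With reduction='none' the caller receives one value per position and can apply its own weighting; 'mean' and 'sum' collapse them into a scalar.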
-
class
mmocr.models.textrecog.losses.
SARLoss
(ignore_index=0, reduction='mean', **kwargs)[source]¶ Implementation of loss module in SAR [https://arxiv.org/abs/1811.00751].
- Parameters
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).
-
class
mmocr.models.textrecog.losses.
CTCLoss
(flatten=True, blank=0, reduction='mean', zero_infinity=False, **kwargs)[source]¶ Implementation of loss module for CTC-loss based text recognition.
- Parameters
flatten (bool) – If True, use flattened targets, else padded targets.
blank (int) – Blank label. Default 0.
reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).
zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Default: False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.
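To see why inputs that are too short produce infinite losses, note that CTC needs one time step per target label plus one blank step between each pair of identical adjacent labels. The sketch below checks that condition; the helper names are hypothetical and not part of mmocr.

```python
def ctc_min_input_length(target):
    """Minimum number of time steps CTC needs to emit `target`.

    Each label takes one step, and every pair of identical adjacent
    labels needs one extra blank step to separate them.
    """
    repeats = sum(1 for a, b in zip(target, target[1:]) if a == b)
    return len(target) + repeats


def is_alignable(input_length, target):
    """True if a sequence of `input_length` steps can be aligned to `target`."""
    return input_length >= ctc_min_input_length(target)


# Label indices for a word with a doubled letter, e.g. "hello".
target = [8, 5, 12, 12, 15]
min_steps = ctc_min_input_length(target)
```

When `is_alignable` is False no valid alignment exists, so the loss is infinite; setting zero_infinity=True zeroes such losses and their gradients.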
-
class
mmocr.models.textrecog.losses.
TFLoss
(ignore_index=-1, reduction='none', flatten=True, **kwargs)[source]¶ Implementation of loss module for transformer.
-
class
mmocr.models.textrecog.losses.
SegLoss
(seg_downsample_ratio=0.5, seg_with_loss_weight=True, ignore_index=255, **kwargs)[source]¶ Implementation of loss module for segmentation based text recognition method.
- Parameters
seg_downsample_ratio (float) – Downsample ratio of segmentation map.
seg_with_loss_weight (bool) – If True, set weight for segmentation loss.
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
textrecog_backbones¶
-
class
mmocr.models.textrecog.backbones.
ResNet31OCR
(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False)[source] - Implement ResNet backbone for text recognition, modified from
- Parameters
base_channels (int) – Number of channels of input image tensor.
layers (list[int]) – List of BasicBlock number for each stage.
channels (list[int]) – List of out_channels of Conv2d layer.
out_indices (None | Sequence[int]) – Indices of output stages.
stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.
last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.
-
class
mmocr.models.textrecog.backbones.
VeryDeepVgg
(leakyRelu=True, input_channels=3)[source] - Implement VGG-VeryDeep backbone for text recognition, modified from
- Parameters
input_channels (int) – Number of channels of input image tensor.
leakyRelu (bool) – Use leakyRelu or not.
-
class
mmocr.models.textrecog.backbones.
NRTRModalityTransform
(input_channels=3, input_height=32)[source]
textrecog_layers¶
-
class
mmocr.models.textrecog.layers.
MultiHeadAttention
(n_head=8, d_model=512, d_k=64, d_v=64, dropout=0.1, qkv_bias=False, mask_value=0)[source]¶ Multi-Head Attention module.
-
class
mmocr.models.textrecog.layers.
PositionwiseFeedForward
(d_in, d_hid, dropout=0.1, act_layer=torch.nn.GELU)[source]¶ A two-feed-forward-layer module.
-
class
mmocr.models.textrecog.layers.
BasicBlock
(inplanes, planes, stride=1, downsample=False)[source]¶
-
class
mmocr.models.textrecog.layers.
Bottleneck
(inplanes, planes, stride=1, downsample=False)[source]¶
kie_extractors¶
-
class
mmocr.models.kie.extractors.
SDMGR
(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, pretrained=None, class_list=None)[source]¶ The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.
- Parameters
visual_modality (bool) – Whether use the visual modality.
class_list (None | str) – Mapping file of class index to class name. If None, class index will be shown in show_results, else class name.
-
forward_train
(img, img_metas, relations, texts, gt_bboxes, gt_labels)[source]¶ - Parameters
img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.
img_metas (list[dict]) – A list of image info dicts, where each dict contains ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.
relations (list[tensor]) – Relations between bboxes.
texts (list[tensor]) – Texts in bboxes.
gt_bboxes (list[tensor]) – Each item is the truth boxes for each image in [tl_x, tl_y, br_x, br_y] format.
gt_labels (list[tensor]) – Class indices corresponding to each box.
- Returns
A dictionary of loss components.
- Return type
dict[str, tensor]
-
show_result
(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]¶ Draw result on img.
- Parameters
img (str or tensor) – The image to be displayed.
result (dict) – The results to draw on img.
boxes (list) – Bbox of img.
win_name (str) – The window name.
wait_time (int) – Value of waitKey param. Default: 0.
show (bool) – Whether to show the image. Default: False.
out_file (str or None) – The output filename. Default: None.
- Returns
The image with the results drawn on it. Returned only if neither show nor out_file is set.
- Return type
img (tensor)
kie_heads¶
kie_losses¶
mmocr.datasets¶
-
class
mmocr.datasets.
IcdarDataset
(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]¶ -
evaluate
(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]¶ Evaluate the hmean metric.
- Parameters
results (list[dict]) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.
- Returns
The evaluation results.
- Return type
dict[dict[str, float]]
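The hmean metric is the harmonic mean of precision and recall. A minimal sketch of that arithmetic follows, assuming the numbers of matched, ground-truth, and detected boxes are already known; the function name `hmean` and the returned keys are illustrative, not the library's API.

```python
def hmean(num_matched, num_gt, num_det):
    """Return recall, precision and their harmonic mean.

    Hypothetical helper: the counts are assumed to come from a prior
    matching step between detections and ground truths.
    """
    recall = num_matched / num_gt if num_gt > 0 else 0.0
    precision = num_matched / num_det if num_det > 0 else 0.0
    denom = recall + precision
    score = 2.0 * recall * precision / denom if denom > 0 else 0.0
    return {"recall": recall, "precision": precision, "hmean": score}


# 8 of 10 ground-truth boxes matched by 16 detections.
scores = hmean(num_matched=8, num_gt=10, num_det=16)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so a detector cannot score well by maximizing recall alone.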
-
-
class
mmocr.datasets.
BaseDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ Custom dataset for text detection, text recognition, and their downstream tasks.
The text detection annotation format is as follows. The annotations field is optional for testing. (This is one line of anno_file, with the line-json-str converted to dict for visualizing only.)
{
    "file_name": "sample.jpg",
    "height": 1080,
    "width": 960,
    "annotations": [
        {
            "iscrowd": 0,
            "category_id": 1,
            "bbox": [357.0, 667.0, 804.0, 100.0],
            "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
        }
    ]
}
The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.
format1: sample.jpg hello
format2: sample.jpg 20 20 100 20 100 40 20 40 hello
- Parameters
ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.
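The annotation formats described above can be read with a few lines of standard-library Python. This is only a sketch: the function name `parse_recog_line` and the output keys are hypothetical, and it assumes the text label contains no spaces. The actual parsing is done by the configured loader.

```python
import json


def parse_recog_line(line):
    """Parse 'filename text' or 'filename x1 y1 x2 y2 x3 y3 x4 y4 text'."""
    parts = line.strip().split(" ")
    if len(parts) == 2:  # format1: no box
        return {"filename": parts[0], "text": parts[1]}
    if len(parts) == 10:  # format2: eight box coordinates before the text
        box = [float(v) for v in parts[1:9]]
        return {"filename": parts[0], "box": box, "text": parts[9]}
    raise ValueError(f"unexpected annotation line: {line!r}")


rec1 = parse_recog_line("sample.jpg hello")
rec2 = parse_recog_line("sample.jpg 20 20 100 20 100 40 20 40 hello")

# One line of a text detection annotation file is a JSON string.
det_line = (
    '{"file_name": "sample.jpg", "height": 1080, "width": 960, '
    '"annotations": [{"iscrowd": 0, "category_id": 1, '
    '"bbox": [357.0, 667.0, 804.0, 100.0], '
    '"segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]}]}'
)
det_info = json.loads(det_line)
```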
-
evaluate
(results, metric=None, logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
-
class
mmocr.datasets.
OCRDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ -
evaluate
(results, metric='acc', logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
-
-
class
mmocr.datasets.
TextDetDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ -
evaluate
(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
score_thr (float) – Score threshold for prediction map.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.
- Return type
dict[str, float]
-
-
class
mmocr.datasets.
CustomFormatBundle
(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]¶ Custom formatting bundle.
It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:
gt_kernels: to DataContainer (cpu_only=True)
gt_effective_mask: to DataContainer (cpu_only=True)
- Parameters
keys (list[str]) – Fields to be formatted to DC only.
call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.
visualize (dict) – If flag=True, visualize gt mask for debugging.
-
class
mmocr.datasets.
DBNetTargets
(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]¶ Generate gt shrunk text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.
- Parameters
shrink_ratio (float) – The area shrink ratio between text kernels and their text masks.
thr_min (float) – The minimum value of the threshold map.
thr_max (float) – The maximum value of the threshold map.
min_short_size (int) – The minimum size of polygon below which the polygon is invalid.
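In the DBNet paper, a text polygon with area A and perimeter L is shrunk by the offset D = A(1 - r^2) / L, where r is the shrink ratio. The sketch below computes that offset for a polygon; whether DBNetTargets applies exactly this formula internally is an assumption, and the helper names are hypothetical.

```python
import math


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    n = len(points)
    twice_area = sum(
        points[i][0] * points[(i + 1) % n][1]
        - points[(i + 1) % n][0] * points[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    n = len(points)
    return sum(
        math.hypot(
            points[(i + 1) % n][0] - points[i][0],
            points[(i + 1) % n][1] - points[i][1],
        )
        for i in range(n)
    )


def shrink_distance(points, shrink_ratio=0.4):
    """Offset D = A * (1 - r^2) / L from the DBNet paper."""
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return area * (1 - shrink_ratio ** 2) / perimeter


# A 100 x 20 text box: area 2000, perimeter 240.
box = [(0, 0), (100, 0), (100, 20), (0, 20)]
offset = shrink_distance(box, shrink_ratio=0.4)
```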
-
draw_border_map
(polygon, canvas, mask)[source]¶ Generate threshold map for one polygon.
- Parameters
polygon (ndarray) – The polygon boundary ndarray.
canvas (ndarray) – The generated threshold map.
mask (ndarray) – The generated threshold mask.
-
find_invalid
(results)[source]¶ Find invalid polygons.
- Parameters
results (dict) – The dict containing gt_mask.
- Returns
The indicators for ignoring polygons.
- Return type
ignore_tags (list[bool])
-
generate_targets
(results)[source]¶ Generate the gt targets for DBNet.
- Parameters
results (dict) – The input result dictionary.
- Returns
The output result dictionary.
- Return type
results (dict)
-
generate_thr_map
(img_size, polygons)[source]¶ Generate threshold map.
- Parameters
img_size (tuple(int)) – The image size (h,w)
polygons (list(ndarray)) – The polygon list.
- Returns
thr_map (ndarray) – The generated threshold map.
thr_mask (ndarray) – The effective mask of threshold map.
-
ignore_texts
(results, ignore_tags)[source]¶ Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.
- Parameters
results (dict) – Result for one image.
ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.
- Returns
Results after filtering.
- Return type
results (dict)
-
invalid_polygon
(poly)[source]¶ Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.
- Parameters
poly (ndarray) – The polygon boundary point sequence.
- Returns
Whether the polygon is invalid.
- Return type
True/False (bool)
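The validity rule can be sketched in plain Python. As a simplifying assumption, this version measures the shorter side of the axis-aligned bounding box rather than the minimum (possibly rotated) bounding box, so it is only an approximation for rotated text. The function name `is_invalid_polygon` is hypothetical.

```python
def is_invalid_polygon(points, min_short_size=8):
    """Return True if the polygon is too small to be a usable text region.

    Assumption: the axis-aligned bounding box stands in for the
    minimum bounding box mentioned in the description.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    # Shoelace formula for the polygon area.
    twice_area = sum(
        xs[i] * ys[(i + 1) % n] - xs[(i + 1) % n] * ys[i] for i in range(n)
    )
    area = abs(twice_area) / 2.0
    short_side = min(max(xs) - min(xs), max(ys) - min(ys))
    return area < 1 or short_side < min_short_size


normal_box = [(0, 0), (100, 0), (100, 20), (0, 20)]
thin_box = [(0, 0), (100, 0), (100, 5), (0, 5)]
tiny_box = [(0, 0), (1, 0), (1, 0.5), (0, 0.5)]
```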
-
class
mmocr.datasets.
OCRSegDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶
-
class
mmocr.datasets.
KIEDataset
(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]¶ - Parameters
ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.
dict_file (str) – Character dict file path.
norm (float) – Norm to map value from one range to another.
-
evaluate
(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
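The default metric, macro_f1, averages per-class F1 scores, and the ignores option in metric_options excludes classes from that average. A minimal sketch follows, assuming integer class labels; the function `macro_f1` here is illustrative rather than the library's implementation, and scoring an empty class as 0 is an assumption.

```python
def macro_f1(preds, labels, num_classes, ignores=()):
    """Average per-class F1, skipping the class indices in `ignores`."""
    f1_scores = []
    for cls in range(num_classes):
        if cls in ignores:
            continue
        tp = sum(1 for p, t in zip(preds, labels) if p == cls and t == cls)
        fp = sum(1 for p, t in zip(preds, labels) if p == cls and t != cls)
        fn = sum(1 for p, t in zip(preds, labels) if p != cls and t == cls)
        denom = 2 * tp + fp + fn
        # Assumption: a class with no predictions and no labels scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


preds = [0, 1, 1, 2, 2, 0]
labels = [0, 1, 2, 2, 2, 0]
all_classes = macro_f1(preds, labels, num_classes=3)
without_class_0 = macro_f1(preds, labels, num_classes=3, ignores=[0])
```

Ignoring a dominant, easy class (for example a background class) usually lowers the score, since the average then reflects only the harder classes.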
datasets¶
-
class
mmocr.datasets.base_dataset.
BaseDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ Custom dataset for text detection, text recognition, and their downstream tasks.
The text detection annotation format is as follows. The annotations field is optional for testing. (This is one line of anno_file, with the line-json-str converted to dict for visualizing only.)
{
    "file_name": "sample.jpg",
    "height": 1080,
    "width": 960,
    "annotations": [
        {
            "iscrowd": 0,
            "category_id": 1,
            "bbox": [357.0, 667.0, 804.0, 100.0],
            "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
        }
    ]
}
The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.
format1: sample.jpg hello
format2: sample.jpg 20 20 100 20 100 40 20 40 hello
- Parameters
ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.
-
evaluate
(results, metric=None, logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
-
class
mmocr.datasets.icdar_dataset.
IcdarDataset
(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1)[source]¶ -
evaluate
(results, metric='hmean-iou', logger=None, score_thr=0.3, rank_list=None, **kwargs)[source]¶ Evaluate the hmean metric.
- Parameters
results (list[dict]) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.
- Returns
The evaluation results.
- Return type
dict[dict[str, float]]
-
-
class
mmocr.datasets.ocr_dataset.
OCRDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ -
evaluate
(results, metric='acc', logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
-
-
class
mmocr.datasets.ocr_seg_dataset.
OCRSegDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶
-
class
mmocr.datasets.text_det_dataset.
TextDetDataset
(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]¶ -
evaluate
(results, metric='hmean-iou', score_thr=0.3, rank_list=None, logger=None, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
score_thr (float) – Score threshold for prediction map.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
rank_list (str) – json file used to save eval result of each image after ranking.
- Return type
dict[str, float]
-
-
class
mmocr.datasets.kie_dataset.
KIEDataset
(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]¶ - Parameters
ann_file (str) – Annotation file path.
pipeline (list[dict]) – Processing pipeline.
loader (dict) – Dictionary to construct loader to load annotation infos.
img_prefix (str, optional) – Image prefix to generate full image path.
test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.
dict_file (str) – Character dict file path.
norm (float) – Norm to map value from one range to another.
-
evaluate
(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]¶ Evaluate the dataset.
- Parameters
results (list) – Testing results of the dataset.
metric (str | list[str]) – Metrics to be evaluated.
logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.
- Return type
dict[str, float]
pipelines¶
-
class
mmocr.datasets.pipelines.
LoadTextAnnotations
(with_bbox=True, with_label=True, with_mask=False, with_seg=False, poly2mask=True)[source]¶
-
class
mmocr.datasets.pipelines.
NormalizeOCR
(mean, std)[source]¶ Normalize a tensor image with mean and standard deviation.
-
class
mmocr.datasets.pipelines.
OnlineCropOCR
(box_keys=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'], jitter_prob=0.5, max_jitter_ratio_x=0.05, max_jitter_ratio_y=0.02)[source]¶ Crop text areas from whole image with bounding box jitter. If no bbox is given, return directly.
- Parameters
box_keys (list[str]) – Keys in results which correspond to RoI bbox.
jitter_prob (float) – The probability of box jitter.
max_jitter_ratio_x (float) – Maximum horizontal jitter ratio relative to height.
max_jitter_ratio_y (float) – Maximum vertical jitter ratio relative to height.
-
class
mmocr.datasets.pipelines.
ResizeOCR
(height, min_width=None, max_width=None, keep_aspect_ratio=True, img_pad_value=0, width_downsample_ratio=0.0625)[source]¶ Image resizing and padding for OCR.
- Parameters
height (int | tuple(int)) – Image height after resizing.
min_width (none | int | tuple(int)) – Image minimum width after resizing.
max_width (none | int | tuple(int)) – Image maximum width after resizing.
keep_aspect_ratio (bool) – If True, keep the image aspect ratio during resizing; otherwise resize to the size height * max_width.
img_pad_value (int) – Scalar to fill padding area.
width_downsample_ratio (float) – Downsample ratio in horizontal direction from input image to output feature.
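How these parameters might combine when keep_aspect_ratio is True can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the scaled width is rounded to a multiple of 1 / width_downsample_ratio and then clamped to [min_width, max_width]. The function name `resize_width` and the returned valid ratio are hypothetical.

```python
import math


def resize_width(ori_height, ori_width, height, min_width=None,
                 max_width=None, width_downsample_ratio=1.0 / 16):
    """Return (new_width, valid_ratio) for an aspect-preserving resize."""
    new_width = math.ceil(height / ori_height * ori_width)
    # Assumption: the width should be a multiple of the downsample factor.
    divisor = int(1 / width_downsample_ratio)
    if new_width % divisor != 0:
        new_width = round(new_width / divisor) * divisor
    if min_width is not None:
        new_width = max(min_width, new_width)
    valid_ratio = 1.0
    if max_width is not None:
        # Fraction of the padded width that holds real image content.
        valid_ratio = min(1.0, new_width / max_width)
        new_width = min(max_width, new_width)
    return new_width, valid_ratio


# A 48 x 200 (height x width) crop resized to height 32.
width, ratio = resize_width(48, 200, height=32, min_width=32, max_width=160)
```

A very wide image is capped at max_width, in which case the valid ratio is 1.0 and no padding is needed.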
-
class
mmocr.datasets.pipelines.
CustomFormatBundle
(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]¶ Custom formatting bundle.
It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:
gt_kernels: to DataContainer (cpu_only=True)
gt_effective_mask: to DataContainer (cpu_only=True)
- Parameters
keys (list[str]) – Fields to be formatted to DC only.
call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.
visualize (dict) – If flag=True, visualize gt mask for debugging.
-
class
mmocr.datasets.pipelines.
DBNetTargets
(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]¶ Generate gt shrunk text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.
- Parameters
shrink_ratio (float) – The area shrink ratio between text kernels and their text masks.
thr_min (float) – The minimum value of the threshold map.
thr_max (float) – The maximum value of the threshold map.
min_short_size (int) – The minimum size of polygon below which the polygon is invalid.
-
draw_border_map
(polygon, canvas, mask)[source]¶ Generate threshold map for one polygon.
- Parameters
polygon (ndarray) – The polygon boundary ndarray.
canvas (ndarray) – The generated threshold map.
mask (ndarray) – The generated threshold mask.
-
find_invalid
(results)[source]¶ Find invalid polygons.
- Parameters
results (dict) – The dict containing gt_mask.
- Returns
The indicators for ignoring polygons.
- Return type
ignore_tags (list[bool])
-
generate_targets
(results)[source]¶ Generate the gt targets for DBNet.
- Parameters
results (dict) – The input result dictionary.
- Returns
The output result dictionary.
- Return type
results (dict)
-
generate_thr_map
(img_size, polygons)[source]¶ Generate threshold map.
- Parameters
img_size (tuple(int)) – The image size (h,w)
polygons (list(ndarray)) – The polygon list.
- Returns
thr_map (ndarray) – The generated threshold map.
thr_mask (ndarray) – The effective mask of threshold map.
-
ignore_texts
(results, ignore_tags)[source]¶ Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.
- Parameters
results (dict) – Result for one image.
ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.
- Returns
Results after filtering.
- Return type
results (dict)
-
invalid_polygon
(poly)[source]¶ Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.
- Parameters
poly (ndarray) – The polygon boundary point sequence.
- Returns
Whether the polygon is invalid.
- Return type
True/False (bool)
-
class
mmocr.datasets.pipelines.
PANetTargets
(shrink_ratio=(1.0, 0.5), max_shrink=20)[source]¶ Generate the ground truths for PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900]. This code is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.
- Parameters
shrink_ratio (tuple[float]) – The ratios for shrinking text instances.
max_shrink (int) – The maximum shrink distance.
-
class
mmocr.datasets.pipelines.
ColorJitter
(**kwargs)[source]¶ An interface for torch color jitter so that it can be invoked in mmdetection pipeline.
-
class
mmocr.datasets.pipelines.
RandomCropInstances
(target_size, instance_key, mask_type='inx0', positive_sample_ratio=0.625)[source]¶ Randomly crop images and make sure to contain text instances.
- Parameters
target_size (tuple or int) – (height, width)
positive_sample_ratio (float) – The probability of sampling regions that go through positive regions.
-
class
mmocr.datasets.pipelines.
RandomRotateTextDet
(rotate_ratio=1.0, max_angle=10)[source]¶ Randomly rotate images.
-
class
mmocr.datasets.pipelines.
ScaleAspectJitter
(img_scale=None, multiscale_mode='range', ratio_range=None, keep_ratio=False, resize_type='around_min_img_scale', aspect_ratio_range=None, long_size_bound=None, short_size_bound=None, scale_range=None)[source]¶ Resize image and segmentation mask encoded by coordinates.
Allowed resize types are around_min_img_scale, long_short_bound, and indep_sample_in_range.
-
class
mmocr.datasets.pipelines.
MultiRotateAugOCR
(transforms, rotate_degrees=None, force_rotate=False)[source]¶ Test-time augmentation with multiple rotations in the case that img_height > img_width.
An example configuration is as follows:
rotate_degrees=[0, 90, 270],
transforms=[
    dict(
        type='ResizeOCR',
        height=32,
        min_width=32,
        max_width=160,
        keep_aspect_ratio=True),
    dict(type='ToTensorOCR'),
    dict(type='NormalizeOCR', **img_norm_cfg),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'filename', 'ori_shape', 'img_shape', 'valid_ratio'
        ]),
]
After MultiRotateAugOCR with the above configuration, the results are wrapped into lists of the same length as follows:
dict(
    img=[...],
    img_shape=[...]
    ...
)
- Parameters
transforms (list[dict]) – Transformation applied for each augmentation.
rotate_degrees (list[int] | None) – Degrees of anti-clockwise rotation.
force_rotate (bool) – If True, rotate image by ‘rotate_degrees’ while ignoring image aspect ratio.
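The rotation rule described above (rotate only when img_height > img_width, unless force_rotate is set) can be sketched as a small selection function. The name `select_rotations` is hypothetical, and falling back to a single 0-degree pass for wide images is an assumption about the behavior.

```python
def select_rotations(img_height, img_width, rotate_degrees=None,
                     force_rotate=False):
    """Return the rotation angles to apply to one image at test time."""
    degrees = list(rotate_degrees) if rotate_degrees else [0]
    if force_rotate or img_height > img_width:
        return degrees
    # Assumption: wide images are processed once, without rotation.
    return [0]


tall = select_rotations(100, 32, rotate_degrees=[0, 90, 270])
wide = select_rotations(32, 100, rotate_degrees=[0, 90, 270])
forced = select_rotations(32, 100, rotate_degrees=[0, 90, 270],
                          force_rotate=True)
```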
-
class
mmocr.datasets.pipelines.
OCRSegTargets
(label_convertor=None, attn_shrink_ratio=0.5, seg_shrink_ratio=0.25, box_type='char_rects', pad_val=255)[source]¶ Generate gt shrunk kernels for segmentation based OCR framework.
- Parameters
label_convertor (dict) – Dictionary to construct label_convertor to convert char to index.
attn_shrink_ratio (float) – The area shrink ratio between attention kernels and gt text masks.
seg_shrink_ratio (float) – The area shrink ratio between segmentation kernels and gt text masks.
box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.
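The two character box layouts can be converted into each other with simple list operations. The sketch below assumes quadrangle points are ordered clockwise starting from the top-left corner; the helper names are hypothetical.

```python
def rect_to_quad(rect):
    """Convert an xyxy rectangle to an x1y1x2y2x3y3x4y4 quadrangle."""
    x1, y1, x2, y2 = rect
    # Assumption: clockwise order starting at the top-left corner.
    return [x1, y1, x2, y1, x2, y2, x1, y2]


def quad_to_rect(quad):
    """Convert a quadrangle to its enclosing xyxy rectangle."""
    xs = quad[0::2]
    ys = quad[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


quad = rect_to_quad([20, 20, 100, 40])
rect = quad_to_rect(quad)
```

Going from a rotated quadrangle to a rectangle loses the rotation, so the round trip is exact only for axis-aligned boxes.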
-
generate_kernels
(resize_shape, pad_shape, char_boxes, char_inds, shrink_ratio=0.5, binary=True)[source]¶ Generate char instance kernels for one shrink ratio.
- Parameters
resize_shape (tuple(int, int)) – Image size (height, width) after resizing.
pad_shape (tuple(int, int)) – Image size (height, width) after padding.
char_boxes (list[list[float]]) – The list of char polygons.
char_inds (list[int]) – List of char indexes.
shrink_ratio (float) – The shrink ratio of kernel.
binary (bool) – If True, return binary ndarray containing 0 & 1 only.
- Returns
The text kernel mask of (height, width).
- Return type
char_kernel (ndarray)
-
class
mmocr.datasets.pipelines.
FancyPCA
(eig_vec=None, eig_val=None)[source]¶ Implementation of PCA based image augmentation, proposed in the paper ImageNet Classification with Deep Convolutional Neural Networks. It alters the intensities of RGB values along the principal components of the ImageNet dataset.
-
class
mmocr.datasets.pipelines.
RandomCropPolyInstances
(instance_key='gt_masks', crop_ratio=0.625, min_side_ratio=0.4)[source]¶ Randomly crop images and make sure to contain at least one intact instance.
-
class
mmocr.datasets.pipelines.
RandomPaddingOCR
(max_ratio=None, box_type=None)[source]¶ Pad the given image on all sides, as well as modify the coordinates of character bounding box in image.
- Parameters
max_ratio (list[int]) – [left, top, right, bottom].
box_type (None|str) – Character box type. If not None, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.
-
class
mmocr.datasets.pipelines.
ImgAug
(args=None)[source]¶ A wrapper to use imgaug https://github.com/aleju/imgaug.
- Parameters
args ([list[list|dict]]) – The augmentation list. For details, please refer to the imgaug documentation. Take args=[['Fliplr', 0.5], dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]] as an example. These args horizontally flip images with probability 0.5, then rotate them randomly by an angle in the range [-10, 10], then resize them with an independent scale in the range [0.5, 3.0] for each side of the image.
-
class
mmocr.datasets.pipelines.
RandomRotateImageBox
(min_angle=-10, max_angle=10, box_type='char_quads')[source]¶ Rotate augmentation for segmentation based text recognition.
- Parameters
min_angle (int) – Minimum rotation angle for image and box.
max_angle (int) – Maximum rotation angle for image and box.
box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.
-
class
mmocr.datasets.pipelines.
OpencvToPil
(**kwargs)[source]¶ Convert numpy.ndarray (bgr) to PIL Image (rgb).
-
class
mmocr.datasets.pipelines.
PilToOpencv
(**kwargs)[source]¶ Convert PIL Image (rgb) to numpy.ndarray (bgr).
-
class
mmocr.datasets.pipelines.
KIEFormatBundle
(*args: Any, **kwargs: Any)[source]¶ Key information extraction formatting bundle.
Based on the DefaultFormatBundle, it simplifies the pipeline of formatting common fields, including “img”, “proposals”, “gt_bboxes”, “gt_labels”, “gt_masks”, “gt_semantic_seg”, “relations” and “texts”. These fields are formatted as follows.
img: (1) transpose, (2) to tensor, (3) to DataContainer (stack=True)
proposals: (1) to tensor, (2) to DataContainer
gt_bboxes: (1) to tensor, (2) to DataContainer
gt_bboxes_ignore: (1) to tensor, (2) to DataContainer
gt_labels: (1) to tensor, (2) to DataContainer
gt_masks: (1) to tensor, (2) to DataContainer (cpu_only=True)
gt_semantic_seg: (1) unsqueeze dim-0, (2) to tensor, (3) to DataContainer (stack=True)
relations: (1) scale, (2) to tensor, (3) to DataContainer
texts: (1) to tensor, (2) to DataContainer
-
class
mmocr.datasets.pipelines.
TextSnakeTargets
(orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3)[source]¶ Generate the ground truth targets of TextSnake, as described in “TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes” [https://arxiv.org/abs/1807.01544]. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.
- Parameters
orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.
-
draw_center_region_maps
(top_line, bot_line, center_line, center_region_mask, radius_map, sin_map, cos_map, region_shrink_ratio)[source]¶ Draw attributes on text center region.
- Parameters
top_line (ndarray) – The points composing top curved sideline of text polygon.
bot_line (ndarray) – The points composing bottom curved sideline of text polygon.
center_line (ndarray) – The points composing the center line of text instance.
center_region_mask (ndarray) – The text center region mask.
radius_map (ndarray) – The map where the distance from point to sidelines will be drawn on for each pixel in text center region.
sin_map (ndarray) – The map where vector_sin(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).
cos_map (ndarray) – The map where vector_cos(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).
region_shrink_ratio (float) – The shrink ratio of text center.
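The sin_map and cos_map values can be understood through a small worked sketch: for a direction vector, the sine and cosine of its angle to the vector (1, 0) are its normalized y and x components. The helper `direction_sin_cos` below is hypothetical and only illustrates the quantity being drawn, not how MMOCR computes it.

```python
import math


def direction_sin_cos(p_start, p_end):
    """Return (sin(theta), cos(theta)) of the vector p_start -> p_end,
    where theta is its angle to the vector (1, 0).

    Hypothetical helper written for illustration only.
    """
    dx = p_end[0] - p_start[0]
    dy = p_end[1] - p_start[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError('the two points must differ')
    return dy / norm, dx / norm


sin_theta, cos_theta = direction_sin_cos((0.0, 0.0), (3.0, 4.0))
```

For the 3-4-5 triangle used here, the sine is 4/5 and the cosine is 3/5.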
-
find_head_tail
(points, orientation_thr)[source]¶ Find the head edge and tail edge of a text polygon.
- Parameters
points (ndarray) – The points composing a text polygon.
orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.
- Returns
head_inds (list) – The indexes of two points composing the head edge.
tail_inds (list) – The indexes of two points composing the tail edge.
- Return type
tuple[list, list]
-
generate_center_mask_attrib_maps
(img_size, text_polys)[source]¶ Generate text center region mask and geometric attribute maps.
- Parameters
img_size (tuple) – The image size of (height, width).
text_polys (list[list[ndarray]]) – The list of text polygons.
- Returns
center_region_mask (ndarray) – The text center region mask.
radius_map (ndarray) – The distance map from each pixel in the text center region to the top sideline.
sin_map (ndarray) – The sin(theta) map, where theta is the angle between vector (top point - bottom point) and vector (1, 0).
cos_map (ndarray) – The cos(theta) map, where theta is the angle between vector (top point - bottom point) and vector (1, 0).
- Return type
tuple[ndarray, ndarray, ndarray, ndarray]
-
generate_targets
(results)[source]¶ Generate the gt targets for TextSnake.
- Parameters
results (dict) – The input result dictionary.
- Returns
The output result dictionary.
- Return type
results (dict)
-
generate_text_region_mask
(img_size, text_polys)[source]¶ Generate the text region mask.
- Parameters
img_size (tuple) – The image size (height, width).
text_polys (list[list[ndarray]]) – The list of text polygons.
- Returns
The text region mask.
- Return type
text_region_mask (ndarray)
-
reorder_poly_edge
(points)[source]¶ Get the respective points composing head edge, tail edge, top sideline and bottom sideline.
- Parameters
points (ndarray) – The points composing a text polygon.
- Returns
head_edge (ndarray) – The two points composing the head edge of the text polygon.
tail_edge (ndarray) – The two points composing the tail edge of the text polygon.
top_sideline (ndarray) – The points composing the top curved sideline of the text polygon.
bot_sideline (ndarray) – The points composing the bottom curved sideline of the text polygon.
- Return type
tuple[ndarray, ndarray, ndarray, ndarray]
-
resample_line
(line, n)[source]¶ Resample n points on a line.
- Parameters
line (ndarray) – The points composing a line.
n (int) – The number of resampled points.
- Returns
The points composing the resampled line.
- Return type
resampled_line (ndarray)
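A minimal sketch of arc-length resampling is given below to illustrate the idea. It is not MMOCR's implementation: the function `resample_polyline` is hypothetical, it works on plain lists of (x, y) tuples rather than ndarrays, and the choice to return exactly n points including both endpoints is an assumption.

```python
import math


def resample_polyline(line, n):
    """Return n points evenly spaced by arc length along a polyline.

    Hypothetical helper: `line` is a list of (x, y) tuples, both
    endpoints are kept, and n must be at least 2.
    """
    if n < 2 or len(line) < 2:
        raise ValueError('need n >= 2 and at least two input points')
    # Cumulative arc length at each input vertex.
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(line, line[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    resampled = []
    seg = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while seg < len(line) - 2 and cum[seg + 1] < target:
            seg += 1
        seg_len = cum[seg + 1] - cum[seg]
        t = 0.0 if seg_len == 0 else (target - cum[seg]) / seg_len
        (x0, y0), (x1, y1) = line[seg], line[seg + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


points = resample_polyline([(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)], 5)
```

On this L-shaped line of total length 8, the five points fall at arc lengths 0, 2, 4, 6 and 8.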
-
resample_sidelines
(sideline1, sideline2, resample_step)[source]¶ Resample two sidelines so that they have the same number of points, according to the step size.
- Parameters
sideline1 (ndarray) – The points composing a sideline of a text polygon.
sideline2 (ndarray) – The points composing another sideline of a text polygon.
resample_step (float) – The resampling step size.
- Returns
resampled_line1 (ndarray) – The resampled line 1.
resampled_line2 (ndarray) – The resampled line 2.
- Return type
tuple[ndarray, ndarray]
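One plausible way to derive a shared point count from a step size is sketched below. The rule used here (mean sideline length divided by the step, with a floor of 1) is an assumption made for illustration and is not confirmed by this page; both helper names are hypothetical.

```python
import math


def polyline_length(line):
    """Total length of a polyline given as a list of (x, y) tuples."""
    return sum(math.hypot(x1 - x0, y1 - y0)
               for (x0, y0), (x1, y1) in zip(line, line[1:]))


def shared_point_count(sideline1, sideline2, resample_step):
    """Pick one point count for both sidelines from a step size.

    Assumed rule: mean length divided by the step, never below 1.
    """
    if resample_step <= 0:
        raise ValueError('resample_step must be positive')
    mean_length = (polyline_length(sideline1)
                   + polyline_length(sideline2)) / 2
    return max(int(mean_length / resample_step), 1)


n = shared_point_count([(0, 0), (10, 0)], [(0, 5), (14, 5)], 4.0)
```

Using a single count for both sidelines keeps their points paired one-to-one, which is what makes it possible to take a midpoint between corresponding top and bottom points.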
-
mmocr.datasets.pipelines.
sort_vertex
(points_x, points_y)[source]¶ Sort box vertices in clockwise order, starting from the top-left vertex.
- Parameters
points_x (list[float]) – x of four vertices.
points_y (list[float]) – y of four vertices.
- Returns
sorted_points_x (list[float]) – x of sorted four vertices.
sorted_points_y (list[float]) – y of sorted four vertices.
- Return type
tuple[list[float], list[float]]
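The ordering can be sketched as follows, assuming image coordinates with y pointing down. This is an illustrative approach, not necessarily how MMOCR implements it: vertices are sorted by their angle around the centroid, and the sequence is then rotated to start at the vertex nearest the top-left corner of the bounding box. The function name `sort_vertex_sketch` is hypothetical.

```python
import math


def sort_vertex_sketch(points_x, points_y):
    """Order four vertices clockwise (y down), top-left vertex first.

    Hypothetical sketch of the behaviour described above.
    """
    if len(points_x) != 4 or len(points_y) != 4:
        raise ValueError('exactly four vertices are required')
    cx = sum(points_x) / 4
    cy = sum(points_y) / 4
    # With y pointing down, ascending atan2 angle is clockwise on screen.
    order = sorted(
        range(4),
        key=lambda i: math.atan2(points_y[i] - cy, points_x[i] - cx))
    # Rotate so the vertex nearest the bounding box's top-left is first.
    x_min, y_min = min(points_x), min(points_y)
    start = min(
        range(4),
        key=lambda k: math.hypot(points_x[order[k]] - x_min,
                                 points_y[order[k]] - y_min))
    order = order[start:] + order[:start]
    return [points_x[i] for i in order], [points_y[i] for i in order]


xs, ys = sort_vertex_sketch([10, 0, 0, 10], [10, 10, 0, 0])
```

For the axis-aligned square in the usage line, the result runs top-left, top-right, bottom-right, bottom-left.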