mmocr.apis

mmocr.apis.disable_text_recog_aug_test(cfg, set_types=None)[source]

Remove aug_test from the test pipeline for text recognition.

Parameters
  • cfg (mmcv.Config) – Input config.

  • set_types (list[str]) – Types of dataset source. Should be None or a sublist of [‘test’, ‘val’].

mmocr.apis.init_detector(config, checkpoint=None, device='cuda:0', cfg_options=None)[source]

Initialize a detector from a config file.

Parameters
  • config (str or mmcv.Config) – Config file path or the config object.

  • checkpoint (str, optional) – Checkpoint path. If left as None, the model will not load any weights.

  • cfg_options (dict) – Options to override some settings in the used config.

Returns

The constructed detector.

Return type

nn.Module

mmocr.apis.init_random_seed(seed=None, device='cuda')[source]

Initialize the random seed. If the seed is None, it is replaced by a random number and then broadcast to all processes.

Parameters
  • seed (int, Optional) – The seed.

  • device (str) – The device on which the seed will be placed.

Returns

Seed to be used.

Return type

int

mmocr.apis.model_inference(model, imgs, ann=None, batch_mode=False, return_data=False)[source]

Run inference on image(s) with the detector.

Parameters
  • model (nn.Module) – The loaded detector.

  • imgs (str/ndarray or list[str/ndarray] or tuple[str/ndarray]) – Either image files or loaded images.

  • batch_mode (bool) – If True, use batch mode for inference.

  • ann (dict) – Annotation info for key information extraction.

  • return_data (bool) – Whether to return the postprocessed data.

Returns

Predicted results.

Return type

result (dict)

mmocr.apis.replace_image_to_tensor(cfg, set_types=None)[source]

Replace ‘ImageToTensor’ with ‘DefaultFormatBundle’.

mmocr.apis.tensor2grayimgs(tensor, mean=(127), std=(127), **kwargs)[source]

Convert tensor to 1-channel gray images.

Parameters
  • tensor (torch.Tensor) – Tensor that contains multiple images, shape (N, C, H, W).

  • mean (tuple[float], optional) – Mean of images. Defaults to (127).

  • std (tuple[float], optional) – Standard deviation of images. Defaults to (127).

Returns

A list that contains multiple images.

Return type

list[np.ndarray]

mmocr.core

evaluation

mmocr.core.evaluation.compute_f1_score(preds, gts, ignores=[])[source]

Compute the F1-score of the predictions.

Parameters
  • preds (Tensor) – The predicted probability NxC map with N and C being the sample number and class number respectively.

  • gts (Tensor) – The ground truth vector of size N.

  • ignores – The index set of classes that are ignored when reporting results. Note: all samples still participate in the computation.
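The following is a minimal sketch of the idea, not MMOCR's implementation. It uses plain Python lists instead of tensors, and the helper name `f1_per_class` and the dict return value are assumptions made for illustration. It shows that ignored classes are only left out of the report, while every sample still counts towards the other classes' statistics.

```python
def f1_per_class(preds, gts, ignores=()):
    """Per-class F1 from an N x C score table and N ground-truth labels.

    Hypothetical helper for illustration; not part of mmocr.
    """
    num_classes = len(preds[0])
    # Predicted class = arg-max over the C scores of each sample.
    pred_labels = [max(range(num_classes), key=row.__getitem__) for row in preds]
    scores = {}
    for c in range(num_classes):
        if c in ignores:
            continue  # ignored classes are hidden from the report only
        pairs = list(zip(pred_labels, gts))
        tp = sum(p == c and g == c for p, g in pairs)
        fp = sum(p == c and g != c for p, g in pairs)
        fn = sum(p != c and g == c for p, g in pairs)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


preds = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
gts = [0, 1, 1]
all_scores = f1_per_class(preds, gts)
reported = f1_per_class(preds, gts, ignores=[0])
```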

mmocr.core.evaluation.eval_hmean(results, img_infos, ann_infos, metrics={'hmean-iou'}, score_thr=None, min_score_thr=0.3, max_score_thr=0.9, step=0.1, rank_list=None, logger=None, **kwargs)[source]

Evaluate with the hmean metric. It conducts a grid search over a range of boundary score thresholds and reports the best result.

Parameters
  • results (list[dict]) – Each dict corresponds to one image, containing the following keys: boundary_result

  • img_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: filename, height, width

  • ann_infos (list[dict]) – Each dict corresponds to one image, containing the following keys: masks, masks_ignore

  • score_thr (float) – Deprecated. Please use min_score_thr instead.

  • min_score_thr (float) – Minimum score threshold of prediction map.

  • max_score_thr (float) – Maximum score threshold of prediction map.

  • step (float) – The spacing between score thresholds.

  • metrics (set{str}) – Hmean metric set, should be one or all of {‘hmean-iou’, ‘hmean-ic13’}

Return type

dict[str, float]
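As a rough sketch of the grid search described above (not the library's code): hmean is the harmonic mean of precision and recall, and the search evaluates each threshold and keeps the best. The helper names, the `evaluate` callback, and the assumption that `max_score_thr` is included in the grid are all illustrative.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


def score_thresholds(min_score_thr=0.3, max_score_thr=0.9, step=0.1):
    """Evenly spaced thresholds; the upper bound is assumed inclusive."""
    n = int(round((max_score_thr - min_score_thr) / step))
    return [round(min_score_thr + i * step, 10) for i in range(n + 1)]


def best_threshold(evaluate, thresholds):
    """Return (best_hmean, threshold).

    `evaluate` is a hypothetical callback mapping a threshold to a
    (precision, recall) pair.
    """
    return max((hmean(*evaluate(t)), t) for t in thresholds)


thresholds = score_thresholds()
# Toy trade-off: precision rises and recall falls as the threshold grows.
best_score, best_thr = best_threshold(lambda t: (t, 1 - t), thresholds)
```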

mmocr.core.evaluation.eval_hmean_ic13(det_boxes, gt_boxes, gt_ignored_boxes, precision_thr=0.4, recall_thr=0.8, center_dist_thr=1.0, one2one_score=1.0, one2many_score=0.8, many2one_score=1.0)[source]

Evaluate hmean of text detection using the ICDAR 2013 standard.

Parameters
  • det_boxes (list[list[list[float]]]) – List of arrays of shape (n, 2k). Each element is the det_boxes for one img. k>=4.

  • gt_boxes (list[list[list[float]]]) – List of arrays of shape (m, 2k). Each element is the gt_boxes for one img. k>=4.

  • gt_ignored_boxes (list[list[list[float]]]) – List of arrays of shape (l, 2k). Each element is the ignored gt_boxes for one img. k>=4.

  • precision_thr (float) – Precision threshold of the iou of one (gt_box, det_box) pair.

  • recall_thr (float) – Recall threshold of the iou of one (gt_box, det_box) pair.

  • center_dist_thr (float) – Distance threshold of one (gt_box, det_box) center point pair.

  • one2one_score (float) – Reward when one gt matches one det_box.

  • one2many_score (float) – Reward when one gt matches many det_boxes.

  • many2one_score (float) – Reward when many gts match one det_box.

Returns

A tuple of dicts encoding the hmean for the dataset and for all images.

Return type

hmean (tuple[dict])

mmocr.core.evaluation.eval_hmean_iou(pred_boxes, gt_boxes, gt_ignored_boxes, iou_thr=0.5, precision_thr=0.5)[source]

Evaluate hmean of text detection using the IoU standard.

Parameters
  • pred_boxes (list[list[list[float]]]) – Text boxes for an img list. Each box has 2k (>=8) values.

  • gt_boxes (list[list[list[float]]]) – Ground truth text boxes for an img list. Each box has 2k (>=8) values.

  • gt_ignored_boxes (list[list[list[float]]]) – Ignored ground truth text boxes for an img list. Each box has 2k (>=8) values.

  • iou_thr (float) – IoU threshold when one (gt_box, det_box) pair is matched.

  • precision_thr (float) – Precision threshold when one (gt_box, det_box) pair is matched.

Returns

A tuple of dicts indicating the hmean for the dataset and for all images.

Return type

hmean (tuple[dict])

mmocr.core.evaluation.eval_ner_f1(results, gt_infos)[source]

Evaluate the NER task.

Parameters
  • results (list) – Predict results of entities.

  • gt_infos (list[dict]) – Ground-truth information which contains text and label.

Returns

Precision, recall and F1-score in total and for each category.

Return type

class_info (dict)

mmocr.core.evaluation.eval_ocr_metric(pred_texts, gt_texts, metric='acc')[source]

Evaluate the text recognition performance with the metrics word accuracy and 1-N.E.D. See https://rrc.cvc.uab.es/?ch=14&com=tasks for details.

Parameters
  • pred_texts (list[str]) – Text strings of prediction.

  • gt_texts (list[str]) – Text strings of ground truth.

  • metric (str | list[str]) –

    Metric(s) to be evaluated. Options are:

    • ‘word_acc’: Accuracy at word level.

    • ‘word_acc_ignore_case’: Accuracy at word level, ignoring letter case.

    • ‘word_acc_ignore_case_symbol’: Accuracy at word level, ignoring letter case and symbol. (Default metric for academic evaluation)

    • ‘char_recall’: Recall at character level, ignoring letter case and symbol.

    • ‘char_precision’: Precision at character level, ignoring letter case and symbol.

    • ‘one_minus_ned’: 1 - normalized_edit_distance

    In particular, if metric == 'acc', results on all metrics above will be reported.

Returns

Result dict for text recognition; keys could be some of the following: [‘word_acc’, ‘word_acc_ignore_case’, ‘word_acc_ignore_case_symbol’, ‘char_recall’, ‘char_precision’, ‘1-N.E.D’].

Return type

dict{str: float}
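To make two of the metrics concrete, here is a small pure-Python sketch of word accuracy and 1 - N.E.D. It is an illustration only: the helper names are hypothetical, and normalizing each edit distance by the longer of the two strings before averaging is an assumption rather than a statement about how MMOCR normalizes.

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def word_acc(pred_texts, gt_texts, ignore_case=False):
    """Fraction of predictions that match the ground truth exactly."""
    if ignore_case:
        pred_texts = [t.lower() for t in pred_texts]
        gt_texts = [t.lower() for t in gt_texts]
    hits = sum(p == g for p, g in zip(pred_texts, gt_texts))
    return hits / len(gt_texts)


def one_minus_ned(pred_texts, gt_texts):
    """1 minus the mean normalized edit distance (normalization assumed)."""
    total = 0.0
    for p, g in zip(pred_texts, gt_texts):
        longest = max(len(p), len(g))
        total += edit_distance(p, g) / longest if longest else 0.0
    return 1.0 - total / len(gt_texts)


pred = ['hello', 'wor1d']
gt = ['hello', 'world']
acc = word_acc(pred, gt)
ned_score = one_minus_ned(pred, gt)
```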

mmocr.utils

class mmocr.utils.Registry(name, build_func=None, parent=None, scope=None)[source]

A registry to map strings to classes or functions.

Registered object could be built from registry. Meanwhile, registered functions could be called from registry.

Example

>>> MODELS = Registry('models')
>>> @MODELS.register_module()
... class ResNet:
...     pass
>>> resnet = MODELS.build(dict(type='ResNet'))
>>> @MODELS.register_module()
... def resnet50():
...     pass
>>> resnet = MODELS.build(dict(type='resnet50'))

Please refer to https://mmcv.readthedocs.io/en/latest/understand_mmcv/registry.html for advanced usage.

Parameters
  • name (str) – Registry name.

  • build_func (func, optional) – Build function to construct an instance from the Registry. build_from_cfg is used if neither parent nor build_func is specified. If parent is specified and build_func is not given, build_func will be inherited from parent. Default: None.

  • parent (Registry, optional) – Parent registry. The class registered in children registry could be built from parent. Default: None.

  • scope (str, optional) – The scope of registry. It is the key to search for children registry. If not specified, scope will be the name of the package where class is defined, e.g. mmdet, mmcls, mmseg. Default: None.
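The registry pattern itself can be sketched in a few lines. The toy class below is not mmcv's Registry: `MiniRegistry` is a hypothetical name, and it omits parents, scopes and custom build functions. It only shows the string-to-object mapping, registration as a decorator or a direct call, and building from a config dict with a "type" key.

```python
class MiniRegistry:
    """Toy string-to-object registry (illustrative, not mmcv's class)."""

    def __init__(self, name):
        self.name = name
        self._module_dict = {}

    def register_module(self, name=None, force=False, module=None):
        def _register(obj):
            key = name or obj.__name__
            if not force and key in self._module_dict:
                raise KeyError(f'{key} is already registered in {self.name}')
            self._module_dict[key] = obj
            return obj

        # Called with module=...: register immediately.
        # Otherwise return the decorator.
        return _register(module) if module is not None else _register

    def get(self, key):
        return self._module_dict.get(key)

    def build(self, cfg):
        args = dict(cfg)  # copy so the caller's config is not mutated
        obj_type = args.pop('type')
        obj = self.get(obj_type)
        if obj is None:
            raise KeyError(f'{obj_type} is not in the {self.name} registry')
        return obj(**args)


MODELS = MiniRegistry('models')


@MODELS.register_module()
class ResNet:
    def __init__(self, depth=50):
        self.depth = depth


net = MODELS.build(dict(type='ResNet', depth=18))
```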

get(key)[source]

Get the registry record.

Parameters

key (str) – The class name in string format.

Returns

The corresponding class.

Return type

class

static infer_scope()[source]

Infer the scope of registry.

The name of the package where registry is defined will be returned.

Example

>>> # in mmdet/models/backbone/resnet.py
>>> MODELS = Registry('models')
>>> @MODELS.register_module()
... class ResNet:
...     pass

The scope of ResNet will be mmdet.

Returns

The inferred scope name.

Return type

str

register_module(name=None, force=False, module=None)[source]

Register a module.

A record will be added to self._module_dict, whose key is the class name or the specified name, and value is the class itself. It can be used as a decorator or a normal function.

Example

>>> backbones = Registry('backbone')
>>> @backbones.register_module()
... class ResNet:
...     pass

>>> backbones = Registry('backbone')
>>> @backbones.register_module(name='mnet')
... class MobileNet:
...     pass

>>> backbones = Registry('backbone')
>>> class ResNet:
...     pass
>>> backbones.register_module(module=ResNet)

Parameters
  • name (str | None) – The module name to be registered. If not specified, the class name will be used.

  • force (bool, optional) – Whether to override an existing class with the same name. Default: False.

  • module (type) – Module class or function to be registered.

static split_scope_key(key)[source]

Split scope and key.

The first scope will be split from key.

Examples

>>> Registry.split_scope_key('mmdet.ResNet')
('mmdet', 'ResNet')
>>> Registry.split_scope_key('ResNet')
(None, 'ResNet')

Returns

The former element is the first scope of the key, which can be None. The latter is the remaining key.

Return type

tuple[str | None, str]
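The behaviour described above, splitting off only the first scope, can be sketched with `str.partition`. The function name below is hypothetical and this is an illustration rather than the library's source.

```python
def split_scope_key_sketch(key):
    """Split 'scope.rest' at the first dot; scope is None if there is no dot."""
    scope, sep, rest = key.partition('.')
    return (scope, rest) if sep else (None, key)


nested = split_scope_key_sketch('mmdet.models.ResNet')
```

Only the first scope is removed, so a nested key keeps its remaining scopes in the second element.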

class mmocr.utils.StringStrip(strip=True, strip_pos='both', strip_str=None)[source]

Remove the leading and/or trailing characters based on the string argument passed.

Parameters
  • strip (bool) – Whether to remove characters from both left and right of the string. Default: True.

  • strip_pos (str) – Which side to strip; can be one of (‘both’, ‘left’, ‘right’). Default: ‘both’.

  • strip_str (str|None) – A string specifying the set of characters to be removed from the left and right part of the string. If None, all leading and trailing whitespaces are removed from the string. Default: None.
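The three options map naturally onto Python's built-in `str.strip`, `str.lstrip` and `str.rstrip`. The function below is a hedged sketch of that mapping; its name is hypothetical, and whether the real class is implemented this way is an assumption.

```python
def strip_text(text, strip=True, strip_pos='both', strip_str=None):
    """Strip characters from one or both ends of `text` (illustrative)."""
    if not strip:
        return text
    if strip_pos == 'left':
        return text.lstrip(strip_str)
    if strip_pos == 'right':
        return text.rstrip(strip_str)
    if strip_pos == 'both':
        return text.strip(strip_str)
    raise ValueError(f'Unsupported strip_pos: {strip_pos!r}')


cleaned = strip_text('  hello  ')
left_only = strip_text('--hello--', strip_pos='left', strip_str='-')
```

Note that `strip_str` is treated as a set of characters, not as a prefix or suffix, which matches the parameter description above.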

mmocr.utils.bezier_to_polygon(bezier_points, num_sample=20)[source]

Sample points from the boundary of a polygon enclosed by two Bezier curves, which are controlled by bezier_points.

Parameters
  • bezier_points (ndarray) – A \((2, 4, 2)\) array of 8 Bezier points or its equivalent. The first 4 points control the curve at one side and the last four control the other side.

  • num_sample (int) – The number of sample points at each Bezier curve.

Returns

A list of 2*num_sample points representing the polygon extracted from Bezier curves.

Return type

list[ndarray]

Warning

The points are not guaranteed to be ordered. Please use mmocr.utils.sort_points() to sort points if necessary.
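For intuition, sampling a cubic Bezier curve uses the Bernstein weights \((1-t)^3, 3t(1-t)^2, 3t^2(1-t), t^3\). The pure-Python sketch below evaluates two such curves and concatenates the samples. The function names, the use of tuples instead of ndarrays, and the evenly spaced parameter values are assumptions for illustration.

```python
def cubic_bezier(ctrl, t):
    """Evaluate a cubic Bezier curve with 4 control points at parameter t."""
    (x0, y0), (x1, y1), (x2, y2), (x3, y3) = ctrl
    b0 = (1 - t) ** 3
    b1 = 3 * t * (1 - t) ** 2
    b2 = 3 * t ** 2 * (1 - t)
    b3 = t ** 3
    return (b0 * x0 + b1 * x1 + b2 * x2 + b3 * x3,
            b0 * y0 + b1 * y1 + b2 * y2 + b3 * y3)


def sample_two_curves(bezier_points, num_sample=20):
    """Sample num_sample points (num_sample >= 2) on each of two curves."""
    ts = [i / (num_sample - 1) for i in range(num_sample)]
    return [cubic_bezier(curve, t) for curve in bezier_points for t in ts]


# Two straight "curves": the top edge and the reversed bottom edge of a strip.
top = [(0, 0), (1, 0), (2, 0), (3, 0)]
bottom = [(3, 1), (2, 1), (1, 1), (0, 1)]
polygon = sample_two_curves([top, bottom], num_sample=5)
```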

mmocr.utils.build_from_cfg(cfg: Dict, registry: mmcv.utils.registry.Registry, default_args: Optional[Dict] = None) → Any[source]

Build a module from config dict when it is a class configuration, or call a function from config dict when it is a function configuration.

Example

>>> MODELS = Registry('models')
>>> @MODELS.register_module()
... class ResNet:
...     pass
>>> resnet = build_from_cfg(dict(type='ResNet'), MODELS)
>>> # Returns an instantiated object
>>> @MODELS.register_module()
... def resnet50():
...     pass
>>> resnet = build_from_cfg(dict(type='resnet50'), MODELS)
>>> # Returns the result of calling the function

Parameters
  • cfg (dict) – Config dict. It should at least contain the key “type”.

  • registry (Registry) – The registry to search the type from.

  • default_args (dict, optional) – Default initialization arguments.

Returns

The constructed object.

Return type

object

mmocr.utils.collect_env()[source]

Collect the information of the running environments.

mmocr.utils.convert_annotations(image_infos, out_json_name)[source]

Convert the annotations into COCO style.

Parameters
  • image_infos (list) – The list of image information dicts

  • out_json_name (str) – The output json filename

Returns

The COCO-style dict

Return type

out_json (dict)

mmocr.utils.drop_orientation(img_file)[source]

Check if the image has orientation information. If yes, ignore it by converting the image format to png and return the new filename; otherwise return the original filename.

Parameters

img_file (str) – The image path

Returns

The converted image filename with proper postfix

mmocr.utils.get_root_logger(log_file=None, log_level=20)[source]

Use the get_logger method in mmcv to get the root logger.

The logger will be initialized if it has not been initialized. By default a StreamHandler will be added. If log_file is specified, a FileHandler will also be added. The name of the root logger is the top-level package name, e.g., “mmocr”.

Parameters
  • log_file (str | None) – The log filename. If specified, a FileHandler will be added to the root logger.

  • log_level (int) – The root logger level. Note that only the process of rank 0 is affected, while other processes will set the level to “Error” and be silent most of the time.

Returns

The root logger.

Return type

logging.Logger

mmocr.utils.is_2dlist(x)[source]

Check whether x is a 2d-list ([[1], []]) or an empty 1d-list ([]).

Notice:

The reason that it contains 1d empty list is because some arguments from gt annotation file or model prediction may be empty, but usually, it should be 2d-list.

mmocr.utils.is_3dlist(x)[source]

Check whether x is a 3d-list ([[[1], []]]), an empty 2d-list ([[], []]) or an empty 1d-list ([]).

Notice:

The reason that it contains 1d or 2d empty list is because some arguments from gt annotation file or model prediction may be empty, but usually, it should be 3d-list.
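The nesting rules for both checks can be sketched recursively. The functions below are illustrative stand-ins with hypothetical names; they follow the descriptions above, including the deliberate acceptance of empty lists.

```python
def is_2dlist_sketch(x):
    """True for a list of lists, or for an empty list."""
    if not isinstance(x, list):
        return False
    return all(isinstance(item, list) for item in x)


def is_3dlist_sketch(x):
    """True for a list of 2d-lists, or for an empty list."""
    if not isinstance(x, list):
        return False
    return all(is_2dlist_sketch(item) for item in x)


checks = [
    is_2dlist_sketch([[1], []]),      # regular 2d-list
    is_2dlist_sketch([]),             # empty 1d-list is accepted
    is_2dlist_sketch([1, 2]),         # flat list is rejected
    is_3dlist_sketch([[[1], []]]),    # regular 3d-list
    is_3dlist_sketch([[], []]),       # empty 2d-list is accepted
    is_3dlist_sketch([[1]]),          # 2d-list of numbers is rejected
]
```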

mmocr.utils.is_not_png(img_file)[source]

Check whether img_file is not a png image.

Parameters

img_file (str) – The input image file name

Returns

The bool flag indicating whether it is not png

mmocr.utils.is_on_same_line(box_a, box_b, min_y_overlap_ratio=0.8)[source]

Check if two boxes are on the same line by their y-axis coordinates.

Two boxes are on the same line if they overlap vertically, and the length of the overlapping line segment is greater than min_y_overlap_ratio * the height of either of the boxes.

Parameters
  • box_a (list), box_b (list) – Two bounding boxes to be checked

  • min_y_overlap_ratio (float) – The minimum vertical overlapping ratio allowed for boxes in the same line

Returns

The bool flag indicating if they are on the same line
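The vertical-overlap rule can be sketched as follows. This is an interpretation, not the library's code: boxes are assumed to be flat lists [x1, y1, x2, y2, x3, y3, x4, y4], "either of the boxes" is read as "at least one of the two boxes", and the function name is hypothetical.

```python
def same_line_sketch(box_a, box_b, min_y_overlap_ratio=0.8):
    """Check vertical overlap of two flat [x1, y1, ..., x4, y4] boxes."""
    a_ys, b_ys = box_a[1::2], box_b[1::2]  # every second value is a y
    a_min, a_max = min(a_ys), max(a_ys)
    b_min, b_max = min(b_ys), max(b_ys)
    overlap = min(a_max, b_max) - max(a_min, b_min)
    if overlap <= 0:
        return False  # no vertical overlap at all
    # Assumed reading: the overlap must exceed the ratio for at least one box.
    return (overlap > min_y_overlap_ratio * (a_max - a_min)
            or overlap > min_y_overlap_ratio * (b_max - b_min))


word_a = [0, 0, 10, 0, 10, 10, 0, 10]
word_b = [12, 1, 22, 1, 22, 11, 12, 11]      # shifted down by 1 pixel
word_c = [12, 5, 22, 5, 22, 15, 12, 15]      # shifted down by 5 pixels
next_line = [0, 20, 10, 20, 10, 30, 0, 30]   # entirely below word_a
```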

mmocr.utils.list_from_file(filename, encoding='utf-8')[source]

Load a text file and parse the content as a list of strings. The trailing “\r” and “\n” of each line will be removed.

Note

This will be replaced by mmcv’s version after it supports encoding.

Parameters
  • filename (str) – Filename.

  • encoding (str) – Encoding used to open the file. Default utf-8.

Returns

A list of strings.

Return type

list[str]

mmocr.utils.list_to_file(filename, lines)[source]

Write a list of strings to a text file.

Parameters
  • filename (str) – The output filename. It will be created/overwritten.

  • lines (list(str)) – Data to be written.
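Reading and writing line lists are inverse operations, which the standard-library sketch below demonstrates with a round trip. The helper names are hypothetical stand-ins, and writing one string per line terminated by "\n" is an assumption about the file layout.

```python
import os
import tempfile


def write_lines(filename, lines, encoding='utf-8'):
    """Write one string per line; the file is created or overwritten."""
    with open(filename, 'w', encoding=encoding) as f:
        for line in lines:
            f.write(f'{line}\n')


def read_lines(filename, encoding='utf-8'):
    """Read a text file into a list, dropping trailing '\\r' and '\\n'."""
    with open(filename, encoding=encoding) as f:
        return [line.rstrip('\r\n') for line in f]


lines = ['img1.jpg HELLO', 'img2.jpg WORLD']
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'label.txt')
    write_lines(path, lines)
    restored = read_lines(path)
```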

mmocr.utils.recog2lmdb(img_root, label_path, output, label_format='txt', label_only=False, batch_size=1000, encoding='utf-8', lmdb_map_size=109951162776, verify=True)[source]

Convert a text recognition dataset to LMDB format.

Parameters
  • img_root (str) – Path to images.

  • label_path (str) – Path to label file.

  • output (str) – LMDB output path.

  • label_format (str) – Format of the label file, either txt or jsonl.

  • label_only (bool) – Only convert label to lmdb format.

  • batch_size (int) – Number of files written to the cache each time.

  • encoding (str) – Label encoding method.

  • lmdb_map_size (int) – Maximum size database may grow to.

  • verify (bool) – If true, check the validity of every image. Defaults to True.

This function supports MMOCR’s recognition data format, in which the label file can be either txt or jsonl, as follows:

├── img_root
│   ├── img1.jpg
│   ├── img2.jpg
│   └── …
└── label.txt (or label.jsonl)

label.txt:

    img1.jpg HELLO
    img2.jpg WORLD
    …

label.jsonl:

    {"filename": "img1.jpg", "text": "HELLO"}
    {"filename": "img2.jpg", "text": "WORLD"}
    …
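As an illustration of the two label formats, the sketch below parses a single label line into a (filename, text) pair. The function name is hypothetical. It assumes that in txt labels the first whitespace separates the filename from the text, and that jsonl lines are valid JSON with double-quoted keys.

```python
import json


def parse_label_line(line, label_format='txt'):
    """Parse one label line into (filename, text); illustrative only."""
    line = line.strip()
    if label_format == 'txt':
        # Assumed layout: "<filename> <text>", split at the first whitespace.
        filename, text = line.split(maxsplit=1)
    elif label_format == 'jsonl':
        record = json.loads(line)
        filename, text = record['filename'], record['text']
    else:
        raise ValueError(f'Unsupported label_format: {label_format!r}')
    return filename, text


txt_pair = parse_label_line('img1.jpg HELLO')
jsonl_pair = parse_label_line(
    '{"filename": "img2.jpg", "text": "WORLD"}', label_format='jsonl')
```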

mmocr.utils.revert_sync_batchnorm(module)[source]

Helper function to convert all SyncBatchNorm layers in the model to BatchNormXd layers.

Adapted from @kapily’s work: (https://github.com/pytorch/pytorch/issues/41081#issuecomment-783961547)

Parameters

module (nn.Module) – The module containing SyncBatchNorm layers.

Returns

The converted module with BatchNormXd layers.

Return type

module_output

mmocr.utils.setup_multi_processes(cfg)[source]

Setup multi-processing environment variables.

mmocr.utils.sort_points(points)[source]

Sort arbitrary points in clockwise order. Reference: https://stackoverflow.com/a/6989383.

Parameters

points (list[ndarray] or ndarray or list[list]) – A list of unsorted boundary points.

Returns

A list of points sorted in clockwise order.

Return type

list[ndarray]
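One common way to order points clockwise is to sort them by their angle around the centroid. The sketch below uses that approach, which may differ from the cross-product comparison in the referenced answer. The function name is hypothetical, and it assumes image coordinates (y grows downward), where increasing `atan2` angle corresponds to clockwise order on screen.

```python
import math


def sort_points_sketch(points):
    """Sort (x, y) points clockwise around their centroid (image coords)."""
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    # With y pointing down, a growing atan2 angle sweeps clockwise on screen.
    return sorted(points, key=lambda p: math.atan2(p[1] - cy, p[0] - cx))


corners = [(1, 1), (0, 0), (1, 0), (0, 1)]
ordered = sort_points_sketch(corners)
```

For the unit square this yields top-left, top-right, bottom-right, bottom-left. The starting point depends on where the angle wraps around and is not otherwise meaningful.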

mmocr.utils.stitch_boxes_into_lines(boxes, max_x_dist=10, min_y_overlap_ratio=0.8)[source]

Stitch fragmented boxes of words into lines.

Note: part of its logic is inspired by @Johndirr (https://github.com/faustomorales/keras-ocr/issues/22)

Parameters
  • boxes (list) – List of ocr results to be stitched

  • max_x_dist (int) – The maximum horizontal distance between the closest edges of neighboring boxes in the same line

  • min_y_overlap_ratio (float) – The minimum vertical overlapping ratio allowed for any pairs of neighboring boxes in the same line

Returns

List of merged boxes and texts

Return type

merged_boxes (list[dict])

mmocr.models

Common Backbones

class mmocr.models.common.backbones.UNet(in_channels=3, base_channels=64, num_stages=5, strides=(1, 1, 1, 1, 1), enc_num_convs=(2, 2, 2, 2, 2), dec_num_convs=(2, 2, 2, 2), downsamples=(True, True, True, True), enc_dilations=(1, 1, 1, 1, 1), dec_dilations=(1, 1, 1, 1), with_cp=False, conv_cfg=None, norm_cfg={'type': 'BN'}, act_cfg={'type': 'ReLU'}, upsample_cfg={'type': 'InterpConv'}, norm_eval=False, dcn=None, plugins=None, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Constant', 'layer': ['_BatchNorm', 'GroupNorm'], 'val': 1}])[source]

UNet backbone. U-Net: Convolutional Networks for Biomedical Image Segmentation. https://arxiv.org/pdf/1505.04597.pdf

Parameters
  • in_channels (int) – Number of input image channels. Default: 3.

  • base_channels (int) – Number of base channels of each stage, i.e. the output channels of the first stage. Default: 64.

  • num_stages (int) – Number of stages in the encoder, normally 5. Default: 5.

  • strides (Sequence[int 1 | 2]) – Strides of each stage in the encoder. len(strides) is equal to num_stages. Normally the stride of the first stage in the encoder is 1. If strides[i]=2, strided convolution is used to downsample in the corresponding encoder stage. Default: (1, 1, 1, 1, 1).

  • enc_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding encoder stage. Default: (2, 2, 2, 2, 2).

  • dec_num_convs (Sequence[int]) – Number of convolutional layers in the convolution block of the corresponding decoder stage. Default: (2, 2, 2, 2).

  • downsamples (Sequence[int]) – Whether to use MaxPool to downsample the feature map after the first stage of the encoder (stages: [1, num_stages)). If the corresponding encoder stage uses strided convolution (strides[i]=2), MaxPool is never used to downsample, even if downsamples[i-1]=True. Default: (True, True, True, True).

  • enc_dilations (Sequence[int]) – Dilation rate of each stage in encoder. Default: (1, 1, 1, 1, 1).

  • dec_dilations (Sequence[int]) – Dilation rate of each stage in decoder. Default: (1, 1, 1, 1).

  • with_cp (bool) – Use checkpoint or not. Using checkpoint will save some memory while slowing down the training speed. Default: False.

  • conv_cfg (dict | None) – Config dict for convolution layer. Default: None.

  • norm_cfg (dict | None) – Config dict for normalization layer. Default: dict(type=’BN’).

  • act_cfg (dict | None) – Config dict for activation layer in ConvModule. Default: dict(type=’ReLU’).

  • upsample_cfg (dict) – The upsample config of the upsample module in decoder. Default: dict(type=’InterpConv’).

  • norm_eval (bool) – Whether to set norm layers to eval mode, namely, freeze running stats (mean and var). Note: Effect on Batch Norm and its variants only. Default: False.

  • dcn (bool) – Use deformable convolution in convolutional layer or not. Default: None.

  • plugins (dict) – plugins for convolutional layers. Default: None.

Notice:

The input image size should be divisible by the whole downsample rate of the encoder. More detail of the whole downsample rate can be found in UNet._check_input_divisible.
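The divisibility requirement can be made concrete with a small sketch. It assumes that every encoder stage after the first halves the resolution whenever it uses a stride of 2 or MaxPool, so the whole downsample rate is a power of two. The function names are hypothetical and this is not the body of UNet._check_input_divisible.

```python
def whole_downsample_rate(strides=(1, 1, 1, 1, 1),
                          downsamples=(True, True, True, True)):
    """Total encoder downsample factor, assuming a factor of 2 per stage."""
    rate = 1
    for i in range(1, len(strides)):
        # Stage i downsamples through a strided conv or through MaxPool.
        if strides[i] == 2 or downsamples[i - 1]:
            rate *= 2
    return rate


def is_input_divisible(height, width, rate):
    """Check that both spatial dimensions are multiples of the rate."""
    return height % rate == 0 and width % rate == 0


default_rate = whole_downsample_rate()
ok = is_input_divisible(512, 640, default_rate)
bad = is_input_divisible(500, 640, default_rate)
```

Under these assumptions the default configuration gives a rate of 16, so inputs such as 512 x 640 pass while 500 x 640 does not.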

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

train(mode=True)[source]

Convert the model into training mode while keeping the normalization layers frozen.

class mmocr.models.common.losses.DiceLoss(eps=1e-06)[source]
forward(pred, target, mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.common.losses.FocalLoss(gamma=2, weight=None, ignore_index=-100)[source]

Multi-class Focal loss implementation.

Parameters
  • gamma (float) – The larger the gamma, the smaller the loss weight of easier samples.

  • weight (float) – A manual rescaling weight given to each class.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
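The effect of gamma can be seen in a small pure-Python sketch of the standard focal loss, \(-(1 - p_t)^{\gamma} \log p_t\), where \(p_t\) is the softmax probability of the target class. The function name, the list-based inputs and the plain mean over non-ignored samples are assumptions; this is not the class's actual forward pass.

```python
import math


def focal_loss_sketch(logits, targets, gamma=2.0, weight=None,
                      ignore_index=-100):
    """Mean focal loss over samples whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # ignored targets contribute nothing
        peak = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(v - peak) for v in row]
        p_t = exps[target] / sum(exps)
        class_weight = 1.0 if weight is None else weight[target]
        losses.append(-class_weight * (1.0 - p_t) ** gamma * math.log(p_t))
    return sum(losses) / len(losses) if losses else 0.0


uncertain = focal_loss_sketch([[0.0, 0.0]], [0])           # p_t = 0.5
confident = focal_loss_sketch([[4.0, 0.0]], [0])           # p_t close to 1
plain_ce = focal_loss_sketch([[0.0, 0.0]], [0], gamma=0)   # cross-entropy
```

With gamma set to 0 the expression reduces to ordinary cross-entropy. Larger gamma values shrink the loss of confidently classified samples much more than that of uncertain ones.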

forward(input, target)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Text Detection Detectors

class mmocr.models.textdet.detectors.DBNet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing DBNet text detector: Real-time Scene Text Detection with Differentiable Binarization.

[https://arxiv.org/abs/1911.08947].

class mmocr.models.textdet.detectors.DRRG(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing DRRG text detector. Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

[https://arxiv.org/abs/2003.07493]

forward_train(img, img_metas, **kwargs)[source]
Parameters
  • img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

simple_test(img, img_metas, rescale=False)[source]

Test function without test-time augmentation.

Parameters
  • img (torch.Tensor) – Images with shape (N, C, H, W).

  • img_metas (list[dict]) – List of image information.

  • rescale (bool, optional) – Whether to rescale the results. Defaults to False.

Returns

BBox results of each image and classes.

The outer list corresponds to each image. The inner list corresponds to each class.

Return type

list[list[np.ndarray]]

class mmocr.models.textdet.detectors.FCENet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing FCENet text detector. FCENet (CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

simple_test(img, img_metas, rescale=False)[source]

Test function without test-time augmentation.

Parameters
  • img (torch.Tensor) – Images with shape (N, C, H, W).

  • img_metas (list[dict]) – List of image information.

  • rescale (bool, optional) – Whether to rescale the results. Defaults to False.

Returns

BBox results of each image and classes.

The outer list corresponds to each image. The inner list corresponds to each class.

Return type

list[list[np.ndarray]]

class mmocr.models.textdet.detectors.OCRMaskRCNN(backbone, rpn_head, roi_head, train_cfg, test_cfg, neck=None, pretrained=None, text_repr_type='quad', show_score=False, init_cfg=None)[source]

Mask RCNN tailored for OCR.

get_boundary(results)[source]

Convert segmentation into text boundaries.

Parameters

results (tuple) – The result tuple. The first element is segmentation while the second is its scores.

Returns

A result dict containing ‘boundary_result’.

Return type

dict

simple_test(img, img_metas, proposals=None, rescale=False)[source]

Test without augmentation.

class mmocr.models.textdet.detectors.PANet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing PANet text detector:

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900].

class mmocr.models.textdet.detectors.PSENet(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing PSENet text detector: Shape Robust Text Detection with Progressive Scale Expansion Network.

[https://arxiv.org/abs/1806.02559].

class mmocr.models.textdet.detectors.SingleStageTextDetector(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, init_cfg=None)[source]

The class for implementing single stage text detector.

forward_train(img, img_metas, **kwargs)[source]
Parameters
  • img (Tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details on the values of these keys, see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, Tensor]

simple_test(img, img_metas, rescale=False)[source]

Test function without test-time augmentation.

Parameters
  • img (torch.Tensor) – Images with shape (N, C, H, W).

  • img_metas (list[dict]) – List of image information.

  • rescale (bool, optional) – Whether to rescale the results. Defaults to False.

Returns

BBox results of each image and classes.

The outer list corresponds to each image. The inner list corresponds to each class.

Return type

list[list[np.ndarray]]

class mmocr.models.textdet.detectors.TextDetectorMixin(show_score)[source]

Base class for text detectors, used only to show results.

Parameters

show_score (bool) – Whether to show text instance score.

show_result(img, result, score_thr=0.5, bbox_color='green', text_color='green', thickness=1, font_scale=0.5, win_name='', show=False, wait_time=0, out_file=None)[source]

Draw result over img.

Parameters
  • img (str or Tensor) – The image to be displayed.

  • result (dict) – The results to draw over img.

  • score_thr (float, optional) – Minimum score of bboxes to be shown. Default: 0.5.

  • bbox_color (str or tuple or Color) – Color of bbox lines.

  • text_color (str or tuple or Color) – Color of texts.

  • thickness (int) – Thickness of lines.

  • font_scale (float) – Font scales of texts.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The filename to write the image. Default: None.

class mmocr.models.textdet.detectors.TextSnake(backbone, neck, bbox_head, train_cfg=None, test_cfg=None, pretrained=None, show_score=False, init_cfg=None)[source]

The class for implementing TextSnake text detector: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

[https://arxiv.org/abs/1807.01544]

Text Detection Heads

class mmocr.models.textdet.dense_heads.DBHead(in_channels, with_bias=False, downsample_ratio=1.0, loss={'type': 'DBLoss'}, postprocessor={'text_repr_type': 'quad', 'type': 'DBPostprocessor'}, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}], train_cfg=None, test_cfg=None, **kwargs)[source]

The class for DBNet head.

This was partially adapted from https://github.com/MhLiao/DB

Parameters
  • in_channels (int) – The number of input channels of the DB head.

  • with_bias (bool) – Whether to add bias in the Conv2d layer.

  • downsample_ratio (float) – The downsample ratio of ground truths.

  • loss (dict) – Config of loss for dbnet.

  • postprocessor (dict) – Config of postprocessor for dbnet.

forward(inputs)[source]
Parameters

inputs (Tensor) – Shape (batch_size, hidden_size, h, w).

Returns

A tensor of the same shape as input.

Return type

Tensor

class mmocr.models.textdet.dense_heads.DRRGHead(in_channels, k_at_hops=(8, 4), num_adjacent_linkages=3, node_geo_feat_len=120, pooling_scale=1.0, pooling_output_size=(4, 3), nms_thr=0.3, min_width=8.0, max_width=24.0, comp_shrink_ratio=1.03, comp_ratio=0.4, comp_score_thr=0.3, text_region_thr=0.2, center_region_thr=0.2, center_region_area_thr=50, local_graph_thr=0.7, loss={'type': 'DRRGLoss'}, postprocessor={'link_thr': 0.85, 'type': 'DRRGPostprocessor'}, train_cfg=None, test_cfg=None, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]

The class for DRRG head: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

Parameters
  • k_at_hops (tuple(int)) – The number of i-hop neighbors, i = 1, 2.

  • num_adjacent_linkages (int) – The number of linkages when constructing adjacent matrix.

  • node_geo_feat_len (int) – The length of embedded geometric feature vector of a component.

  • pooling_scale (float) – The spatial scale of rotated RoI-Align.

  • pooling_output_size (tuple(int)) – The output size of RRoI-Aligning.

  • nms_thr (float) – The locality-aware NMS threshold of text components.

  • min_width (float) – The minimum width of text components.

  • max_width (float) – The maximum width of text components.

  • comp_shrink_ratio (float) – The shrink ratio of text components.

  • comp_ratio (float) – The reciprocal of aspect ratio of text components.

  • comp_score_thr (float) – The score threshold of text components.

  • text_region_thr (float) – The threshold for text region probability map.

  • center_region_thr (float) – The threshold for text center region probability map.

  • center_region_area_thr (int) – The threshold for filtering small-sized text center region.

  • local_graph_thr (float) – The threshold to filter identical local graphs.

  • loss (dict) – The config of loss that DRRGHead uses.

  • postprocessor (dict) – Config of postprocessor for DRRG.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(inputs, gt_comp_attribs)[source]
Parameters
  • inputs (Tensor) – Shape of \((N, C, H, W)\).

  • gt_comp_attribs (list[ndarray]) – The padded text component attributes. Shape: (num_component, 8).

Returns

A tuple (pred_maps, (gcn_pred, gt_labels)):

  • pred_maps (Tensor): Prediction map with shape \((N, C_{out}, H, W)\).
  • gcn_pred (Tensor): Prediction from GCN module, with shape \((N, 2)\).
  • gt_labels (Tensor): Ground-truth label with shape \((N, 8)\).

Return type

tuple

get_boundary(edges, scores, text_comps, img_metas, rescale)[source]

Compute text boundaries via post processing.

Parameters
  • edges (ndarray) – The edge array of shape \((N, 2)\), where each row is a pair of text component indices that makes up an edge in the graph.

  • scores (ndarray) – The edge score array.

  • text_comps (ndarray) – The text components.

  • img_metas (list[dict]) – The image meta infos.

  • rescale (bool) – Rescale boundaries to the original image resolution.

Returns

The result dict containing the key boundary_result.

Return type

dict

single_test(feat_maps)[source]
Parameters

feat_maps (Tensor) – Shape of \((N, C, H, W)\).

Returns

A tuple (edge, score, text_comps):

  • edge (ndarray): The edge array of shape \((N, 2)\) where each row is a pair of text component indices that makes up an edge in graph.
  • score (ndarray): The score array of shape \((N,)\), corresponding to the edge above.
  • text_comps (ndarray): The text components of shape \((N, 9)\) where each row corresponds to one box and its score: (x1, y1, x2, y2, x3, y3, x4, y4, score).

Return type

tuple

class mmocr.models.textdet.dense_heads.FCEHead(in_channels, scales, fourier_degree=5, nms_thr=0.1, loss={'num_sample': 50, 'type': 'FCELoss'}, postprocessor={'alpha': 1.0, 'beta': 2.0, 'num_reconstr_points': 50, 'score_thr': 0.3, 'text_repr_type': 'poly', 'type': 'FCEPostprocessor'}, train_cfg=None, test_cfg=None, init_cfg={'mean': 0, 'override': [{'name': 'out_conv_cls'}, {'name': 'out_conv_reg'}], 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]

The class for implementing FCENet head.

FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection

Parameters
  • in_channels (int) – The number of input channels.

  • scales (list[int]) – The scale of each layer.

  • fourier_degree (int) – The maximum Fourier transform degree k.

  • nms_thr (float) – The threshold of nms.

  • loss (dict) – Config of loss for FCENet.

  • postprocessor (dict) – Config of postprocessor for FCENet.

forward(feats)[source]
Parameters

feats (list[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\).

Returns

Each pair of tensors corresponds to the classification result and regression result computed from the input tensor with the same index. They have the shapes of \((N, C_{cls,i}, H_i, W_i)\) and \((N, C_{out,i}, H_i, W_i)\).

Return type

list[[Tensor, Tensor]]

get_boundary(score_maps, img_metas, rescale)[source]

Compute text boundaries via post processing.

Parameters
  • score_maps (Tensor) – The text score map.

  • img_metas (dict) – The image meta info.

  • rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.

Returns

A dict where boundary results are stored in boundary_result.

Return type

dict

class mmocr.models.textdet.dense_heads.HeadMixin(loss, postprocessor)[source]

Base head class for text detection, including loss calculation and postprocessing.

Parameters
  • loss (dict) – Config to build loss.

  • postprocessor (dict) – Config to build postprocessor.

get_boundary(score_maps, img_metas, rescale)[source]

Compute text boundaries via post processing.

Parameters
  • score_maps (Tensor) – The text score map.

  • img_metas (dict) – The image meta info.

  • rescale (bool) – Rescale boundaries to the original image resolution if true, and keep the score_maps resolution if false.

Returns

A dict where boundary results are stored in boundary_result.

Return type

dict

loss(pred_maps, **kwargs)[source]

Compute the loss for scene text detection.

Parameters

pred_maps (Tensor) – The input score maps of shape \((N, C, H, W)\).

Returns

The dict for losses.

Return type

dict

resize_boundary(boundaries, scale_factor)[source]

Rescale boundaries via scale_factor.

Parameters
  • boundaries (list[list[float]]) – The boundary list. Each boundary has \(2k+1\) elements with \(k \ge 4\).

  • scale_factor (ndarray) – The scale factor of size \((4,)\).

Returns

The scaled boundaries.

Return type

list[list[float]]
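
The rescaling described above can be illustrated with a small pure-Python sketch. It is not the library implementation. It assumes that each boundary is laid out as k (x, y) pairs followed by one score, that the first two entries of scale_factor are the horizontal and vertical factors, and that coordinates are multiplied by them. The function name resize_boundary_sketch is hypothetical.

```python
def resize_boundary_sketch(boundaries, scale_factor):
    """Rescale flat boundaries [x1, y1, ..., xk, yk, score] (assumed layout)."""
    scaled = []
    for boundary in boundaries:
        num_coords = len(boundary) - 1  # the last element is assumed to be the score
        assert num_coords >= 8 and num_coords % 2 == 0
        new_boundary = []
        for i in range(num_coords):
            # Even positions hold x values, odd positions hold y values.
            factor = scale_factor[0] if i % 2 == 0 else scale_factor[1]
            new_boundary.append(boundary[i] * factor)
        new_boundary.append(boundary[-1])  # the score is left unchanged
        scaled.append(new_boundary)
    return scaled


boundary = [0.0, 0.0, 10.0, 0.0, 10.0, 5.0, 0.0, 5.0, 0.9]
result = resize_boundary_sketch([boundary], [2.0, 4.0, 2.0, 4.0])
print(result)
```

With k = 4 the boundary has 9 elements, which matches the \(2k+1\) layout stated in the parameter description.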

class mmocr.models.textdet.dense_heads.PANHead(in_channels, out_channels, downsample_ratio=0.25, loss={'type': 'PANLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PANPostprocessor'}, train_cfg=None, test_cfg=None, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]

The class for PANet head.

Parameters
  • in_channels (list[int]) – A list of 4 numbers of input channels.

  • out_channels (int) – Number of output channels.

  • downsample_ratio (float) – Downsample ratio.

  • loss (dict) – Configuration dictionary for loss type. Supported loss types are “PANLoss” and “PSELoss”.

  • postprocessor (dict) – Config of postprocessor for PANet.

  • train_cfg (dict) – Deprecated.

  • test_cfg (dict) – Deprecated.

  • init_cfg (dict or list[dict], optional) – Initialization configs.
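
As a rough illustration of how these arguments fit together, the following config fragment spells out the documented defaults as a Python dict. The in_channels and out_channels values are assumptions chosen for the example (four 128-channel inputs and 6 output maps, matching the \((N, 6, H, W)\) prediction shape documented for PANLoss); they are not prescribed by this page.

```python
# A hedged sketch of a head config; channel numbers are illustrative assumptions.
pan_head_cfg = dict(
    type='PANHead',
    in_channels=[128, 128, 128, 128],  # assumed: one entry per input feature map
    out_channels=6,                    # assumed: matches the PANLoss input shape
    downsample_ratio=0.25,             # documented default
    loss=dict(type='PANLoss'),         # documented default
    postprocessor=dict(type='PANPostprocessor', text_repr_type='poly'),
)

print(pan_head_cfg['loss']['type'])
```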

forward(inputs)[source]
Parameters

inputs (list[Tensor] | Tensor) – Each tensor has the shape of \((N, C_i, W, H)\), where \(\sum_iC_i=C_{in}\) and \(C_{in}\) is the total number of input channels.

Returns

A tensor of shape \((N, C_{out}, W, H)\) where \(C_{out}\) is out_channels.

Return type

Tensor

class mmocr.models.textdet.dense_heads.PSEHead(in_channels, out_channels, downsample_ratio=0.25, loss={'type': 'PSELoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'PSEPostprocessor'}, train_cfg=None, test_cfg=None, init_cfg=None, **kwargs)[source]

The class for PSENet head.

Parameters
  • in_channels (list[int]) – A list of 4 numbers of input channels.

  • out_channels (int) – Number of output channels.

  • downsample_ratio (float) – Downsample ratio.

  • loss (dict) – Configuration dictionary for loss type. Supported loss types are “PANLoss” and “PSELoss”.

  • postprocessor (dict) – Config of postprocessor for PSENet.

  • train_cfg (dict) – Deprecated.

  • test_cfg (dict) – Deprecated.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

class mmocr.models.textdet.dense_heads.TextSnakeHead(in_channels, out_channels=5, downsample_ratio=1.0, loss={'type': 'TextSnakeLoss'}, postprocessor={'text_repr_type': 'poly', 'type': 'TextSnakePostprocessor'}, train_cfg=None, test_cfg=None, init_cfg={'mean': 0, 'override': {'name': 'out_conv'}, 'std': 0.01, 'type': 'Normal'}, **kwargs)[source]

The class for TextSnake head.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • downsample_ratio (float) – Downsample ratio.

  • loss (dict) – Configuration dictionary for loss type.

  • postprocessor (dict) – Config of postprocessor for TextSnake.

  • train_cfg – Deprecated.

  • test_cfg – Deprecated.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(inputs)[source]
Parameters

inputs (Tensor) – Shape \((N, C_{in}, H, W)\), where \(C_{in}\) is in_channels. \(H\) and \(W\) should be the same as the input of backbone.

Returns

A tensor of shape \((N, 5, H, W)\).

Return type

Tensor

Text Detection Necks

class mmocr.models.textdet.necks.FPEM_FFM(in_channels, conv_out=128, fpem_repeat=2, align_corners=False, init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]

This code is from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • in_channels (list[int]) – A list of 4 numbers of input channels.

  • conv_out (int) – Number of output channels.

  • fpem_repeat (int) – Number of FPEM layers before FFM operations.

  • align_corners (bool) – The interpolation behaviour in FFM operation, used in torch.nn.functional.interpolate().

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(x)[source]
Parameters

x (list[Tensor]) – A list of four tensors of shape \((N, C_i, H_i, W_i)\), representing C2, C3, C4, C5 features respectively. \(C_i\) should match the corresponding number in in_channels.

Returns

Four tensors of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is conv_out.

Return type

list[Tensor]

class mmocr.models.textdet.necks.FPNC(in_channels, lateral_channels=256, out_channels=64, bias_on_lateral=False, bn_re_on_lateral=False, bias_on_smooth=False, bn_re_on_smooth=False, asf_cfg=None, conv_after_concat=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv'}, {'type': 'Constant', 'layer': 'BatchNorm', 'val': 1.0, 'bias': 0.0001}])[source]

FPN-like fusion module in Real-time Scene Text Detection with Differentiable Binarization.

This was partially adapted from https://github.com/MhLiao/DB and https://github.com/WenmuZhou/DBNet.pytorch.

Parameters
  • in_channels (list[int]) – A list of numbers of input channels.

  • lateral_channels (int) – Number of channels for lateral layers.

  • out_channels (int) – Number of output channels.

  • bias_on_lateral (bool) – Whether to use bias on lateral convolutional layers.

  • bn_re_on_lateral (bool) – Whether to use BatchNorm and ReLU on lateral convolutional layers.

  • bias_on_smooth (bool) – Whether to use bias on smoothing layer.

  • bn_re_on_smooth (bool) – Whether to use BatchNorm and ReLU on smoothing layer.

  • asf_cfg (dict) – Adaptive Scale Fusion module configs. The attention_type can be ‘ScaleChannelSpatial’.

  • conv_after_concat (bool) – Whether to add a convolution layer after the concatenation of predictions.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(inputs)[source]
Parameters

inputs (list[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\). It usually expects 4 tensors (C2-C5 features) from ResNet.

Returns

A tensor of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is out_channels.

Return type

Tensor

class mmocr.models.textdet.necks.FPNF(in_channels=[256, 512, 1024, 2048], out_channels=256, fusion_type='concat', init_cfg={'distribution': 'uniform', 'layer': 'Conv2d', 'type': 'Xavier'})[source]

FPN-like fusion module in Shape Robust Text Detection with Progressive Scale Expansion Network.

Parameters
  • in_channels (list[int]) – A list of numbers of input channels.

  • out_channels (int) – The number of output channels.

  • fusion_type (str) – Type of the final feature fusion layer. Available options are “concat” and “add”.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(inputs)[source]
Parameters

inputs (list[Tensor]) – Each tensor has the shape of \((N, C_i, H_i, W_i)\). It usually expects 4 tensors (C2-C5 features) from ResNet.

Returns

A tensor of shape \((N, C_{out}, H_0, W_0)\) where \(C_{out}\) is out_channels.

Return type

Tensor

class mmocr.models.textdet.necks.FPN_UNet(in_channels, out_channels, init_cfg={'distribution': 'uniform', 'layer': ['Conv2d', 'ConvTranspose2d'], 'type': 'Xavier'})[source]

The class for implementing DRRG and TextSnake U-Net-like FPN.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters
  • in_channels (list[int]) – Number of input channels at each scale. The length of the list should be 4.

  • out_channels (int) – The number of output channels.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(x)[source]
Parameters

x (list[Tensor] | tuple[Tensor]) – A list of four tensors of shape \((N, C_i, H_i, W_i)\), representing C2, C3, C4, C5 features respectively. \(C_i\) should match the corresponding number in in_channels.

Returns

A tensor of shape \((N, C, H, W)\) where \(H=4H_0\) and \(W=4W_0\).

Return type

Tensor

Text Detection Losses

class mmocr.models.textdet.losses.DBLoss(alpha=1, beta=1, reduction='mean', negative_ratio=3.0, eps=1e-06, bbce_loss=False)[source]

The class for implementing DBNet loss.

This is partially adapted from https://github.com/MhLiao/DB.

Parameters
  • alpha (float) – The binary loss coef.

  • beta (float) – The threshold loss coef.

  • reduction (str) – The way to reduce the loss.

  • negative_ratio (float) – The ratio of negatives to positives.

  • eps (float) – Epsilon in the threshold loss function.

  • bbce_loss (bool) – Whether to use balanced bce for probability loss. If False, dice loss will be used instead.

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor of size \((H, W)\).

Returns

The list of kernel tensors. Each element stands for one kernel level.

Return type

list[Tensor]

forward(preds, downsample_ratio, gt_shrink, gt_shrink_mask, gt_thr, gt_thr_mask)[source]

Compute DBNet loss.

Parameters
  • preds (Tensor) – The output tensor with size \((N, 3, H, W)\).

  • downsample_ratio (float) – The downsample ratio for the ground truths.

  • gt_shrink (list[BitmapMasks]) – The mask list with each element being the shrunk text mask for one img.

  • gt_shrink_mask (list[BitmapMasks]) – The effective mask list with each element being the shrunk effective mask for one img.

  • gt_thr (list[BitmapMasks]) – The mask list with each element being the threshold text mask for one img.

  • gt_thr_mask (list[BitmapMasks]) – The effective mask list with each element being the threshold effective mask for one img.

Returns

The dict for DBNet losses with “loss_prob”, “loss_db” and “loss_thresh”.

Return type

dict

class mmocr.models.textdet.losses.DRRGLoss(ohem_ratio=3.0)[source]

The class for implementing DRRG loss. This is partially adapted from https://github.com/GXYM/DRRG licensed under the MIT license.

DRRG: Deep Relational Reasoning Graph Network for Arbitrary Shape Text Detection.

Parameters

ohem_ratio (float) – The negative/positive ratio in ohem.

balance_bce_loss(pred, gt, mask)[source]

Balanced Binary-CrossEntropy Loss.

Parameters
  • pred (Tensor) – Shape of \((1, H, W)\).

  • gt (Tensor) – Shape of \((1, H, W)\).

  • mask (Tensor) – Shape of \((1, H, W)\).

Returns

Balanced BCE loss.

Return type

Tensor

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor of size \((H, W)\).

Returns

The list of kernel tensors. Each element stands for one kernel level.

Return type

list[Tensor]

forward(preds, downsample_ratio, gt_text_mask, gt_center_region_mask, gt_mask, gt_top_height_map, gt_bot_height_map, gt_sin_map, gt_cos_map)[source]

Compute DRRG loss.

Parameters
  • preds (tuple(Tensor)) – The first is the prediction map with shape \((N, C_{out}, H, W)\). The second is prediction from GCN module, with shape \((N, 2)\). The third is ground-truth label with shape \((N, 8)\).

  • downsample_ratio (float) – The downsample ratio.

  • gt_text_mask (list[BitmapMasks]) – Text mask.

  • gt_center_region_mask (list[BitmapMasks]) – Center region mask.

  • gt_mask (list[BitmapMasks]) – Effective mask.

  • gt_top_height_map (list[BitmapMasks]) – Top height map.

  • gt_bot_height_map (list[BitmapMasks]) – Bottom height map.

  • gt_sin_map (list[BitmapMasks]) – Sinusoid map.

  • gt_cos_map (list[BitmapMasks]) – Cosine map.

Returns

A loss dict with loss_text, loss_center, loss_height, loss_sin, loss_cos, and loss_gcn.

Return type

dict

gcn_loss(gcn_data)[source]

CrossEntropy Loss from gcn module.

Parameters

gcn_data (tuple(Tensor, Tensor)) – The first is the prediction with shape \((N, 2)\) and the second is the gt label with shape \((m, n)\) where \(m * n = N\).

Returns

CrossEntropy loss.

Return type

Tensor

class mmocr.models.textdet.losses.FCELoss(fourier_degree, num_sample, ohem_ratio=3.0)[source]

The class for implementing FCENet loss.

FCENet(CVPR2021): Fourier Contour Embedding for Arbitrary-shaped Text Detection

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • num_sample (int) – The number of sampling points for the regression loss. If it is too small, FCENet tends to overfit.

  • ohem_ratio (float) – The negative/positive ratio in OHEM.

forward(preds, _, p3_maps, p4_maps, p5_maps)[source]

Compute FCENet loss.

Parameters
  • preds (list[list[Tensor]]) – The outer list indicates images in a batch, and the inner list indicates the classification prediction map (with shape \((N, C, H, W)\)) and regression map (with shape \((N, C, H, W)\)).

  • p3_maps (list[ndarray]) – List of level 3 ground truth target maps with shape \((C, H, W)\).

  • p4_maps (list[ndarray]) – List of level 4 ground truth target maps with shape \((C, H, W)\).

  • p5_maps (list[ndarray]) – List of level 5 ground truth target maps with shape \((C, H, W)\).

Returns

A loss dict with loss_text, loss_center, loss_reg_x and loss_reg_y.

Return type

dict

fourier2poly(real_maps, imag_maps)[source]

Transform Fourier coefficient maps to polygon maps.

Parameters
  • real_maps (tensor) – A map composed of the real parts of the Fourier coefficients, whose shape is (-1, 2k+1).

  • imag_maps (tensor) – A map composed of the imaginary parts of the Fourier coefficients, whose shape is (-1, 2k+1).

Returns

  • x_maps (tensor): A map composed of the x values of the polygon represented by n sample points (xn, yn), whose shape is (-1, n).
  • y_maps (tensor): A map composed of the y values of the polygon represented by n sample points (xn, yn), whose shape is (-1, n).
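
To make the transform concrete, here is a minimal pure-Python sketch of an inverse Fourier reconstruction for a single contour. It is not the library code, which works on batched tensors. The coefficient ordering (degree -k to +k) and the helper name fourier_to_polygon are assumptions made for this example.

```python
import cmath
import math


def fourier_to_polygon(real_parts, imag_parts, num_points):
    """Sample num_points contour points from 2k+1 Fourier coefficients."""
    # Assumed layout: coefficients ordered from degree -k to degree +k.
    k = (len(real_parts) - 1) // 2
    coeffs = [complex(re, im) for re, im in zip(real_parts, imag_parts)]
    xs, ys = [], []
    for j in range(num_points):
        t = j / num_points
        # Sum c_m * exp(2*pi*i*m*t) over all degrees m = -k..k.
        z = sum(c * cmath.exp(2j * math.pi * (idx - k) * t)
                for idx, c in enumerate(coeffs))
        xs.append(z.real)
        ys.append(z.imag)
    return xs, ys


# k = 1: c_0 = 5 + 5j places the centre, c_1 = 2 traces a circle of radius 2.
xs, ys = fourier_to_polygon([0.0, 5.0, 2.0], [0.0, 5.0, 0.0], num_points=4)
print([round(v, 6) for v in xs], [round(v, 6) for v in ys])
```

The degree-0 coefficient sets the centre of the contour and higher degrees add finer shape detail, which is why a larger fourier_degree can represent more complex text outlines.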

class mmocr.models.textdet.losses.PANLoss(alpha=0.5, beta=0.25, delta_aggregation=0.5, delta_discrimination=3, ohem_ratio=3, reduction='mean', speedup_bbox_thr=-1)[source]

The class for implementing PANet loss. This was partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network.

Parameters
  • alpha (float) – The kernel loss coef.

  • beta (float) – The aggregation and discriminative loss coef.

  • delta_aggregation (float) – The constant for aggregation loss.

  • delta_discrimination (float) – The constant for discriminative loss.

  • ohem_ratio (float) – The negative/positive ratio in ohem.

  • reduction (str) – The way to reduce the loss.

  • speedup_bbox_thr (int) – Speed up if speedup_bbox_thr > 0 and < bbox num.

aggregation_discrimination_loss(gt_texts, gt_kernels, inst_embeds)[source]

Compute the aggregation and discriminative losses.

Parameters
  • gt_texts (Tensor) – The ground truth text mask of size \((N, 1, H, W)\).

  • gt_kernels (Tensor) – The ground truth text kernel mask of size \((N, 1, H, W)\).

  • inst_embeds (Tensor) – The text instance embedding tensor of size \((N, 1, H, W)\).

Returns

A tuple of aggregation loss and discriminative loss before reduction.

Return type

(Tensor, Tensor)

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor of size \((H, W)\).

Returns

The list of kernel tensors. Each element stands for one kernel level.

Return type

list[Tensor]

forward(preds, downsample_ratio, gt_kernels, gt_mask)[source]

Compute PANet loss.

Parameters
  • preds (Tensor) – The output tensor of size \((N, 6, H, W)\).

  • downsample_ratio (float) – The downsample ratio between preds and the input img.

  • gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.

  • gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.

Returns

A loss dict with loss_text, loss_kernel, loss_aggregation and loss_discrimination.

Return type

dict

ohem_batch(text_scores, gt_texts, gt_mask)[source]

OHEM sampling for a batch of imgs.

Parameters
  • text_scores (Tensor) – The text scores of size \((H, W)\).

  • gt_texts (Tensor) – The gt text masks of size \((H, W)\).

  • gt_mask (Tensor) – The gt effective mask of size \((H, W)\).

Returns

The sampled mask of size \((H, W)\).

Return type

Tensor

ohem_img(text_score, gt_text, gt_mask)[source]

Sample the top-k maximal negative samples and all positive samples.

Parameters
  • text_score (Tensor) – The text score of size \((H, W)\).

  • gt_text (Tensor) – The ground truth text mask of size \((H, W)\).

  • gt_mask (Tensor) – The effective region mask of size \((H, W)\).

Returns

The sampled pixel mask of size \((H, W)\).

Return type

Tensor
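
The sampling rule can be sketched on nested Python lists. This is only an illustration of the idea (keep every positive pixel plus the highest-scoring negatives, up to ohem_ratio times the number of positives), not the library code, which operates on tensors. The helper name ohem_mask_sketch, the 0.5 binarization threshold and the fallback for images without positives are assumptions.

```python
def ohem_mask_sketch(text_score, gt_text, gt_mask, ohem_ratio=3):
    """Return an H x W 0/1 mask of pixels selected for the loss."""
    h, w = len(text_score), len(text_score[0])
    pixels = [(y, x) for y in range(h) for x in range(w)]
    # Positives are text pixels that lie inside the effective mask.
    pos_num = sum(1 for y, x in pixels
                  if gt_text[y][x] > 0.5 and gt_mask[y][x] > 0.5)
    neg_scores = sorted((text_score[y][x] for y, x in pixels
                         if gt_text[y][x] <= 0.5), reverse=True)
    neg_num = min(len(neg_scores), int(pos_num * ohem_ratio))
    if pos_num == 0 or neg_num == 0:
        # Assumed fallback: nothing to balance, so use the effective mask.
        return [[1 if gt_mask[y][x] > 0.5 else 0 for x in range(w)]
                for y in range(h)]
    threshold = neg_scores[neg_num - 1]  # score of the last negative kept
    return [[1 if (text_score[y][x] >= threshold or gt_text[y][x] > 0.5)
             and gt_mask[y][x] > 0.5 else 0
             for x in range(w)] for y in range(h)]


scores = [[0.9, 0.8, 0.1, 0.7], [0.2, 0.6, 0.3, 0.05]]
gt_text = [[1, 0, 0, 0], [0, 0, 0, 0]]
gt_mask = [[1, 1, 1, 1], [1, 1, 1, 1]]
sampled = ohem_mask_sketch(scores, gt_text, gt_mask, ohem_ratio=3)
print(sampled)
```

In this example there is one positive pixel, so the three highest-scoring negatives (0.8, 0.7 and 0.6) are kept alongside it.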

class mmocr.models.textdet.losses.PSELoss(alpha=0.7, ohem_ratio=3, reduction='mean', kernel_sample_type='adaptive')[source]

The class for implementing PSENet loss. This is partially adapted from https://github.com/whai362/PSENet.

PSENet: Shape Robust Text Detection with Progressive Scale Expansion Network.

Parameters
  • alpha (float) – Text loss coefficient, and \(1-\alpha\) is the kernel loss coefficient.

  • ohem_ratio (float) – The negative/positive ratio in ohem.

  • reduction (str) – The way to reduce the loss. Available options are “mean” and “sum”.

forward(score_maps, downsample_ratio, gt_kernels, gt_mask)[source]

Compute PSENet loss.

Parameters
  • score_maps (tensor) – The output tensor of size \((N, 6, H, W)\).

  • downsample_ratio (float) – The downsample ratio between score_maps and the input img.

  • gt_kernels (list[BitmapMasks]) – The kernel list with each element being the text kernel mask for one img.

  • gt_mask (list[BitmapMasks]) – The effective mask list with each element being the effective mask for one img.

Returns

A loss dict with loss_text and loss_kernel.

Return type

dict

class mmocr.models.textdet.losses.TextSnakeLoss(ohem_ratio=3.0)[source]

The class for implementing TextSnake loss. This is partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes.

Parameters

ohem_ratio (float) – The negative/positive ratio in ohem.

bitmasks2tensor(bitmasks, target_sz)[source]

Convert Bitmasks to tensor.

Parameters
  • bitmasks (list[BitmapMasks]) – The BitmapMasks list. Each item is for one img.

  • target_sz (tuple(int, int)) – The target tensor of size \((H, W)\).

Returns

The list of kernel tensors. Each element stands for one kernel level.

Return type

list[Tensor]

forward(pred_maps, downsample_ratio, gt_text_mask, gt_center_region_mask, gt_mask, gt_radius_map, gt_sin_map, gt_cos_map)[source]
Parameters
  • pred_maps (Tensor) – The prediction map of shape \((N, 5, H, W)\), where each dimension is the map of “text_region”, “center_region”, “sin_map”, “cos_map”, and “radius_map” respectively.

  • downsample_ratio (float) – Downsample ratio.

  • gt_text_mask (list[BitmapMasks]) – Gold text masks.

  • gt_center_region_mask (list[BitmapMasks]) – Gold center region masks.

  • gt_mask (list[BitmapMasks]) – Gold general masks.

  • gt_radius_map (list[BitmapMasks]) – Gold radius maps.

  • gt_sin_map (list[BitmapMasks]) – Gold sin maps.

  • gt_cos_map (list[BitmapMasks]) – Gold cos maps.

Returns

A loss dict with loss_text, loss_center, loss_radius, loss_sin and loss_cos.

Return type

dict

Text Detection Postprocessors

class mmocr.models.textdet.postprocess.DBPostprocessor(text_repr_type='poly', mask_thr=0.3, min_text_score=0.3, min_text_width=5, unclip_ratio=1.5, epsilon_ratio=0.01, max_candidates=3000, **kwargs)[source]

Decoding predictions of DBNet to instances. This is partially adapted from https://github.com/MhLiao/DB.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.

  • mask_thr (float) – The mask threshold value for binarization.

  • min_text_score (float) – The threshold value for converting binary map to shrink text regions.

  • min_text_width (int) – The minimum width of boundary polygon/box predicted.

  • unclip_ratio (float) – The unclip ratio for text regions dilation.

  • epsilon_ratio (float) – The epsilon ratio for approximation accuracy.

  • max_candidates (int) – The maximum candidate number.
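
To give a feel for what unclip_ratio controls, the sketch below computes the dilation offset for one polygon using the formula from the DB paper, distance = area × unclip_ratio / perimeter. That this postprocessor follows exactly that formula is an assumption here, and the helper name unclip_distance is hypothetical. The actual polygon offsetting step is omitted.

```python
import math


def unclip_distance(polygon, unclip_ratio=1.5):
    """Offset distance used to dilate a shrunk text region (assumed formula)."""
    n = len(polygon)
    twice_area = 0.0
    perimeter = 0.0
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1             # shoelace formula term
        perimeter += math.hypot(x2 - x1, y2 - y1)   # edge length
    area = abs(twice_area) / 2.0
    return area * unclip_ratio / perimeter


# A 100 x 20 box: area 2000, perimeter 240.
distance = unclip_distance([(0, 0), (100, 0), (100, 20), (0, 20)])
print(distance)
```

A larger unclip_ratio pushes the predicted boundary further outwards from the shrunk region.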

class mmocr.models.textdet.postprocess.DRRGPostprocessor(link_thr, **kwargs)[source]

Merge text components and construct boundaries of text instances.

Parameters

link_thr (float) – The edge score threshold.

class mmocr.models.textdet.postprocess.FCEPostprocessor(fourier_degree, num_reconstr_points, text_repr_type='poly', alpha=1.0, beta=2.0, score_thr=0.3, nms_thr=0.1, **kwargs)[source]

Decoding predictions of FCENet to instances.

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • num_reconstr_points (int) – The points number of the polygon reconstructed from predicted Fourier coefficients.

  • text_repr_type (str) – Boundary encoding type ‘poly’ or ‘quad’.

  • scale (int) – The down-sample scale of the prediction.

  • alpha (float) – The parameter to calculate final scores: \(\text{Score}_{\text{final}} = (\text{Score}_{\text{text region}})^{\alpha} \cdot (\text{Score}_{\text{text center region}})^{\beta}\).

  • beta (float) – The parameter to calculate final score.

  • score_thr (float) – The threshold used to filter out the final candidates.

  • nms_thr (float) – The threshold of nms.
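
The score combination controlled by alpha and beta can be worked through with a few lines of Python. This is a sketch for a single candidate using the documented defaults; the function name fce_final_score and the strict "greater than" comparison against score_thr are assumptions.

```python
def fce_final_score(text_region_score, center_region_score, alpha=1.0, beta=2.0):
    """Combine the two probabilities as score_tr ** alpha * score_tcl ** beta."""
    return (text_region_score ** alpha) * (center_region_score ** beta)


score = fce_final_score(0.9, 0.5)  # 0.9 ** 1.0 * 0.5 ** 2.0
keep = score > 0.3                 # compare against score_thr (assumed strict)
print(score, keep)
```

With beta = 2.0 a weak center-region response is penalized more heavily than a weak text-region response.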

class mmocr.models.textdet.postprocess.PANPostprocessor(text_repr_type='poly', min_text_confidence=0.5, min_kernel_confidence=0.5, min_text_avg_confidence=0.85, min_text_area=16, **kwargs)[source]

Convert scores to quadrangles via post processing in PANet. This is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.

  • min_text_confidence (float) – The minimal text confidence.

  • min_kernel_confidence (float) – The minimal kernel confidence.

  • min_text_avg_confidence (float) – The minimal text average confidence.

  • min_text_area (int) – The minimal text instance region area.

class mmocr.models.textdet.postprocess.PSEPostprocessor(text_repr_type='poly', min_kernel_confidence=0.5, min_text_avg_confidence=0.85, min_kernel_area=0, min_text_area=16, **kwargs)[source]

Decoding predictions of PSENet to instances. This is partially adapted from https://github.com/whai362/PSENet.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.

  • min_kernel_confidence (float) – The minimal kernel confidence.

  • min_text_avg_confidence (float) – The minimal text average confidence.

  • min_kernel_area (int) – The minimal text kernel area.

  • min_text_area (int) – The minimal text instance region area.

class mmocr.models.textdet.postprocess.TextSnakePostprocessor(text_repr_type='poly', min_text_region_confidence=0.6, min_center_region_confidence=0.2, min_center_area=30, disk_overlap_thr=0.03, radius_shrink_ratio=1.03, **kwargs)[source]

Decoding predictions of TextSnake to instances. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

Parameters
  • text_repr_type (str) – The boundary encoding type ‘poly’ or ‘quad’.

  • min_text_region_confidence (float) – The confidence threshold of text region in TextSnake.

  • min_center_region_confidence (float) – The confidence threshold of text center region in TextSnake.

  • min_center_area (int) – The minimal text center region area.

  • disk_overlap_thr (float) – The radius overlap threshold for merging disks.

  • radius_shrink_ratio (float) – The shrink ratio of ordered disks radii.

Text Recognition Recognizer

class mmocr.models.textrecog.recognizer.ABINet(preprocessor=None, backbone=None, encoder=None, decoder=None, iter_size=1, fuser=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition (https://arxiv.org/pdf/2103.06495.pdf).

forward_train(img, img_metas)[source]
Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘filename’, and may also contain ‘ori_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

simple_test(img, img_metas, **kwargs)[source]

Test function without test time augmentation.

Parameters
  • img (torch.Tensor) – Image input tensor.

  • img_metas (list[dict]) – List of image information.

Returns

Text label result of each image.

Return type

list[str]

class mmocr.models.textrecog.recognizer.BaseRecognizer(init_cfg=None)[source]

Base class for text recognition.

abstract aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

Parameters
  • imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[list[dict]]) – The metadata of images.

abstract extract_feat(imgs)[source]

Extract features from images.

forward(img, img_metas, return_loss=True, **kwargs)[source]

Calls either forward_train() or forward_test() depending on whether return_loss is True.

Note that img and img_metas are single-nested (i.e. tensor and list[dict]).
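
The dispatch described above can be sketched with a toy class. ToyRecognizer and its return values are hypothetical stand-ins used only to show the control flow; the real recognizer works on tensors and returns losses or predictions from the model.

```python
class ToyRecognizer:
    """Hypothetical stand-in that mimics the forward() dispatch only."""

    def forward_train(self, img, img_metas, **kwargs):
        return {'loss': 0.0}  # placeholder loss dict

    def forward_test(self, img, img_metas, **kwargs):
        return ['text' for _ in img_metas]  # placeholder predictions

    def forward(self, img, img_metas, return_loss=True, **kwargs):
        # Training path when return_loss is True, inference path otherwise.
        if return_loss:
            return self.forward_train(img, img_metas, **kwargs)
        return self.forward_test(img, img_metas, **kwargs)


model = ToyRecognizer()
metas = [{'filename': 'a.jpg'}, {'filename': 'b.jpg'}]
train_out = model.forward(None, metas)
test_out = model.forward(None, metas, return_loss=False)
print(train_out, test_out)
```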

forward_test(imgs, img_metas, **kwargs)[source]
Parameters
  • imgs (tensor | list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[dict] | list[list[dict]]) – The outer list indicates images in a batch.

abstract forward_train(imgs, img_metas, **kwargs)[source]
Parameters
  • imgs (tensor) – Tensors with shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – List of image info dict where each dict has: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, see mmdet.datasets.pipelines.Collect.

  • kwargs (keyword arguments) – Specific to concrete implementation.

static show_result(img, result, gt_label='', win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • gt_label (str) – Ground truth label of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

The image with the result drawn on it, returned only if neither show nor out_file is set.

Return type

tensor

train_step(data, optimizer)[source]

The iteration step during training.

This method defines an iteration step during training, except for the back propagation and optimizer update, which are done by an optimizer hook. Note that in some complicated cases or models (e.g. GAN), the whole process (including the back propagation and optimizer update) is also defined by this method.

Parameters
  • data (dict) – The outputs of dataloader.

  • optimizer (torch.optim.Optimizer | dict) – The optimizer of runner is passed to train_step(). This argument is unused and reserved.

Returns

A dict that should contain at least 3 keys: loss, log_vars, and num_samples.

  • loss is a tensor for back propagation, which is a weighted sum of multiple losses.
  • log_vars contains all the variables to be sent to the logger.
  • num_samples indicates the batch size used for averaging the logs (note: for the DDP model, num_samples refers to the batch size for each GPU).

Return type

dict

val_step(data, optimizer)[source]

The iteration step during validation.

This method shares the same signature as train_step(), but is used during val epochs. Note that the evaluation after training epochs is not implemented by this method, but by an evaluation hook.

class mmocr.models.textrecog.recognizer.CRNNNet(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

CTC-loss based recognizer.

class mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Base class for encode-decode recognizer.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

Parameters
  • imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[list[dict]]) – The metadata of images.

extract_feat(img)[source]

Directly extract features from the backbone.

forward_train(img, img_metas)[source]
Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘filename’, and may also contain ‘ori_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

simple_test(img, img_metas, **kwargs)[source]

Test function without test time augmentation.

Parameters
  • img (torch.Tensor) – Image input tensor.

  • img_metas (list[dict]) – List of image information.

Returns

Text label result of each image.

Return type

list[str]

class mmocr.models.textrecog.recognizer.MASTER(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of MASTER

class mmocr.models.textrecog.recognizer.NRTR(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of NRTR

class mmocr.models.textrecog.recognizer.RobustScanner(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of RobustScanner <https://arxiv.org/pdf/2007.07542.pdf>.

class mmocr.models.textrecog.recognizer.SARNet(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of SAR

class mmocr.models.textrecog.recognizer.SATRN(preprocessor=None, backbone=None, encoder=None, decoder=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, max_seq_len=40, pretrained=None, init_cfg=None)[source]

Implementation of SATRN
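As an illustration of how a recognizer's constructor arguments fit together, the sketch below writes a registry-style config dict using only component names and default values that appear in the signatures on this page. It assumes the usual convention of selecting a class through a `type` key. The decoder and loss entries are left as `None` placeholders because their classes are not documented in this section, so this is not a complete, trainable config.

```python
# Label convertor settings, taken from the AttnConvertor signature defaults.
label_convertor = dict(
    type="AttnConvertor",
    dict_type="DICT90",
    with_unknown=True,
    max_seq_len=40,
)

model = dict(
    type="SATRN",
    # ShallowCNN emits hidden_dim channels, which the encoder reads as d_model.
    backbone=dict(type="ShallowCNN", input_channels=1, hidden_dim=512),
    encoder=dict(
        type="SatrnEncoder",
        n_layers=12,
        n_head=8,
        d_k=64,
        d_v=64,
        d_model=512,
        n_position=100,
        d_inner=256,
        dropout=0.1,
    ),
    decoder=None,  # placeholder: a decoder config would go here
    loss=None,     # placeholder: a loss config would go here
    label_convertor=label_convertor,
    max_seq_len=40,
)
```

The one constraint worth checking by hand is that the backbone's `hidden_dim` matches the encoder's `d_model`, since both denote the same feature dimension.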

class mmocr.models.textrecog.recognizer.SegRecognizer(preprocessor=None, backbone=None, neck=None, head=None, loss=None, label_convertor=None, train_cfg=None, test_cfg=None, pretrained=None, init_cfg=None)[source]

Base class for segmentation based recognizer.

aug_test(imgs, img_metas, **kwargs)[source]

Test function with test time augmentation.

Parameters
  • imgs (list[tensor]) – Tensor should have shape NxCxHxW, which contains all images in the batch.

  • img_metas (list[list[dict]]) – The metadata of images.

extract_feat(img)[source]

Directly extract features from the backbone.

forward_train(img, img_metas, gt_kernels=None)[source]
Parameters
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘filename’, and may also contain ‘ori_shape’, and ‘img_norm_cfg’. For details on the values of these keys see mmdet.datasets.pipelines.Collect.

Returns

A dictionary of loss components.

Return type

dict[str, tensor]

simple_test(img, img_metas, **kwargs)[source]

Test function without test time augmentation.

Parameters
  • img (torch.Tensor) – Image input tensor.

  • img_metas (list[dict]) – List of image information.

Returns

Text label result of each image.

Return type

list[str]

Text Recognition Backbones

class mmocr.models.textrecog.backbones.NRTRModalityTransform(input_channels=3, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.backbones.ResNet(in_channels, stem_channels, block_cfgs, arch_layers, arch_channels, strides, out_indices=None, plugins=None, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[source]
Parameters
  • in_channels (int) – Number of channels of input image tensor.

  • stem_channels (list[int]) – List of channels in each stem layer. E.g., [64, 128] stands for 64 and 128 channels in the first and second stem layers.

  • block_cfgs (dict) – Configs of block

  • arch_layers (list[int]) – List of Block number for each stage.

  • arch_channels (list[int]) – List of channels for each stage.

  • strides (Sequence[int] | Sequence[tuple]) – Strides of the first block of each stage.

  • out_indices (None | Sequence[int]) – Indices of output stages. If not specified, only the last stage will be returned.

  • plugins (dict) – Configs of stage plugins.

  • init_cfg (dict or list[dict], optional) – Initialization config dict.

forward(x)[source]
Parameters

x (Tensor) – Image tensor of shape \((N, 3, H, W)\).

Returns

Feature tensor. It can be a list of feature outputs at specific layers if out_indices is specified.

Return type

Tensor or list[Tensor]

class mmocr.models.textrecog.backbones.ResNet31OCR(base_channels=3, layers=[1, 2, 5, 3], channels=[64, 128, 256, 256, 512, 512, 512], out_indices=None, stage4_pool_cfg={'kernel_size': (2, 1), 'stride': (2, 1)}, last_stage_pool=False, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Implement ResNet backbone for text recognition, modified from ResNet.

Parameters
  • base_channels (int) – Number of channels of input image tensor.

  • layers (list[int]) – List of BasicBlock number for each stage.

  • channels (list[int]) – List of out_channels of Conv2d layer.

  • out_indices (None | Sequence[int]) – Indices of output stages.

  • stage4_pool_cfg (dict) – Dictionary to construct and configure pooling layer in stage 4.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.backbones.ResNetABI(in_channels=3, stem_channels=32, base_channels=32, arch_settings=[3, 4, 6, 6, 3], strides=[2, 1, 2, 1, 1], out_indices=None, last_stage_pool=False, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Constant', 'val': 1, 'layer': 'BatchNorm2d'}])[source]

Implement ResNet backbone for text recognition, modified from ResNet <https://arxiv.org/pdf/1512.03385.pdf> and https://github.com/FangShancheng/ABINet.

Parameters
  • in_channels (int) – Number of channels of input image tensor.

  • stem_channels (int) – Number of stem channels.

  • base_channels (int) – Number of base channels.

  • arch_settings (list[int]) – List of BasicBlock number for each stage.

  • strides (Sequence[int]) – Strides of the first block of each stage.

  • out_indices (None | Sequence[int]) – Indices of output stages. If not specified, only the last stage will be returned.

  • last_stage_pool (bool) – If True, add MaxPool2d layer to last stage.

forward(x)[source]
Parameters

x (Tensor) – Image tensor of shape \((N, 3, H, W)\).

Returns

Feature tensor. Its shape depends on ResNetABI’s config. It can be a list of feature outputs at specific layers if out_indices is specified.

Return type

Tensor or list[Tensor]

class mmocr.models.textrecog.backbones.ShallowCNN(input_channels=1, hidden_dim=512, init_cfg=[{'type': 'Kaiming', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Implement Shallow CNN block for SATRN.

SATRN: On Recognizing Texts of Arbitrary Shapes with 2D Self-Attention.

Parameters
  • input_channels (int) – Number of channels of input image tensor \(D_i\).

  • hidden_dim (int) – Size of hidden layers of the model \(D_m\).

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(x)[source]
Parameters

x (Tensor) – Input image feature \((N, D_i, H, W)\).

Returns

A tensor of shape \((N, D_m, H/4, W/4)\).

Return type

Tensor

class mmocr.models.textrecog.backbones.VeryDeepVgg(leaky_relu=True, input_channels=3, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Implement VGG-VeryDeep backbone for text recognition, modified from VGG-VeryDeep.

Parameters
  • leaky_relu (bool) – Use leakyRelu or not.

  • input_channels (int) – Number of channels of input image tensor.

forward(x)[source]
Parameters

x (Tensor) – Images of shape \((N, C, H, W)\).

Returns

The feature Tensor of shape \((N, 512, H/32, W/4+1)\).

Return type

Tensor
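The output shape stated for VeryDeepVgg.forward can be worked through with a small helper. This is a sketch of the shape arithmetic only, not of the network; the function name is hypothetical, and it assumes the divisions round down when H or W is not an exact multiple.

```python
def very_deep_vgg_output_shape(input_shape):
    """Map an (N, C, H, W) input shape to (N, 512, H/32, W/4 + 1).

    Integer floor division is an assumption made for this sketch.
    """
    n, _, h, w = input_shape
    return (n, 512, h // 32, w // 4 + 1)


# A batch of four 32x100 text-line crops.
feature_shape = very_deep_vgg_output_shape((4, 3, 32, 100))
```

For a 32-pixel-high input the feature map collapses to height 1, so the width axis can be read as a sequence of 26 steps for a 100-pixel-wide image.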

Text Recognition Necks

class mmocr.models.textrecog.necks.FPNOCR(in_channels, out_channels, last_stage_only=True, init_cfg=None)[source]

FPN-like Network for segmentation based text recognition.

Parameters
  • in_channels (list[int]) – Number of input channels \(C_i\) for each scale.

  • out_channels (int) – Number of output channels \(C_{out}\) for each scale.

  • last_stage_only (bool) – If True, output last stage only.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(inputs)[source]
Parameters

inputs (list[Tensor]) – A list of n tensors. Each tensor has the shape of \((N, C_i, H_i, W_i)\). It usually expects 4 tensors (C2-C5 features) from ResNet.

Returns

A tuple of n-1 tensors. Each has the shape of \((N, C_{out}, H_{n-2-i}, W_{n-2-i})\). If last_stage_only=True (default), the size of the tuple is 1 and only the last element will be returned.

Return type

tuple(Tensor)
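The indexing in the FPNOCR return description is easier to follow with concrete shapes. The helper below is a sketch of the shape bookkeeping only (no convolutions are run); the function name and the example feature-map sizes are assumptions made for illustration.

```python
def fpn_ocr_output_shapes(input_shapes, out_channels, last_stage_only=True):
    """Return the output shapes for n input feature maps.

    Output i takes its spatial size from input n-2-i, so the outputs run
    from the second-coarsest level up to the finest one.
    """
    n = len(input_shapes)
    shapes = []
    for i in range(n - 1):
        batch, _, h, w = input_shapes[n - 2 - i]
        shapes.append((batch, out_channels, h, w))
    # With last_stage_only, only the final (finest) output is kept.
    return tuple(shapes[-1:]) if last_stage_only else tuple(shapes)


# Four hypothetical C2-C5 feature maps, finest first.
c2_to_c5 = [(2, 64, 16, 64), (2, 128, 8, 32), (2, 256, 4, 16), (2, 512, 2, 8)]
all_levels = fpn_ocr_output_shapes(c2_to_c5, out_channels=128, last_stage_only=False)
last_only = fpn_ocr_output_shapes(c2_to_c5, out_channels=128)
```

With four inputs there are three outputs, and the default setting returns a one-element tuple holding the output at the resolution of the first input.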

Text Recognition Heads

class mmocr.models.textrecog.heads.SegHead(in_channels=128, num_classes=37, upsample_param=None, init_cfg=None)[source]

Head for segmentation based text recognition.

Parameters
  • in_channels (int) – Number of input channels \(C\).

  • num_classes (int) – Number of output classes \(C_{out}\).

  • upsample_param (dict | None) – Config dict for interpolation layer. Default: dict(scale_factor=1.0, mode='nearest')

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(out_neck)[source]
Parameters

out_neck (list[Tensor]) – A list of tensor of shape \((N, C_i, H_i, W_i)\). The network only uses the last one (out_neck[-1]).

Returns

A tensor of shape \((N, C_{out}, kH, kW)\) where \(k\) is determined by upsample_param.

Return type

Tensor

Text Recognition Preprocessors

class mmocr.models.textrecog.preprocessor.BasePreprocessor(init_cfg: Optional[dict] = None)[source]

Base Preprocessor class for text recognition.

forward(x, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.preprocessor.TPSPreprocessor(num_fiducial=20, img_size=(32, 100), rectified_img_size=(32, 100), num_img_channel=1, init_cfg=None)[source]

Rectification Network of RARE, namely TPS based STN in https://arxiv.org/pdf/1603.03915.pdf.

Parameters
  • num_fiducial (int) – Number of fiducial points of TPS-STN.

  • img_size (tuple(int, int)) – Size \((H, W)\) of the input image.

  • rectified_img_size (tuple(int, int)) – Size \((H_r, W_r)\) of the rectified image.

  • num_img_channel (int) – Number of channels of the input image.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(batch_img)[source]
Parameters

batch_img (Tensor) – Images to be rectified with size \((N, C, H, W)\).

Returns

Rectified image with size \((N, C, H_r, W_r)\).

Return type

Tensor

Text Recognition Layers

class mmocr.models.textrecog.layers.Adaptive2DPositionalEncoding(d_hid=512, n_height=100, n_width=100, dropout=0.1, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}])[source]

Implement Adaptive 2D positional encoder for SATRN, see SATRN. Modified from https://github.com/Media-Smart/vedastr, licensed under the Apache License, Version 2.0 (the “License”).

Parameters
  • d_hid (int) – Dimensions of hidden layer.

  • n_height (int) – Max height of the 2D feature output.

  • n_width (int) – Max width of the 2D feature output.

  • dropout (float) – Dropout rate.

forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.BasicBlock(inplanes, planes, stride=1, downsample=None, use_conv1x1=False, plugins=None)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

make_block_plugins(in_channels, plugins)[source]

Make plugins for block.

Parameters
  • in_channels (int) – Input channels of plugin.

  • plugins (list[dict]) – List of plugins cfg to build.

Returns

List of the names of plugin.

Return type

list[str]

class mmocr.models.textrecog.layers.BidirectionalLSTM(nIn, nHidden, nOut)[source]
forward(input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.Bottleneck(inplanes, planes, stride=1, downsample=False)[source]
forward(x)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.DotProductAttentionLayer(dim_model=None)[source]
forward(query, key, value, mask=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.PositionAwareLayer(dim_model, rnn_layers=2)[source]
forward(img_feature)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.layers.RobustScannerFusionLayer(dim_model, dim=-1, init_cfg=None)[source]
forward(x0, x1)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Text Recognition Convertors

class mmocr.models.textrecog.convertors.ABIConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, max_seq_len=40, lower=False, start_end_same=True, **kwargs)[source]

Convert between text, index and tensor for encoder-decoder based pipeline. Modified from AttnConvertor to get closer to ABINet’s original implementation.

Parameters
  • dict_type (str) – Type of dict, should be one of {‘DICT36’, ‘DICT90’}.

  • dict_file (None|str) – Character dict file path. If not none, higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • max_seq_len (int) – Maximum sequence length of label.

  • lower (bool) – If True, convert original string to lower case.

  • start_end_same (bool) – Whether use the same index for start and end token or not. Default: True.

str2tensor(strings)[source]

Convert text-string into tensor. Different from mmocr.models.textrecog.convertors.AttnConvertor, the targets field returns target index no longer than max_seq_len (EOS token included).

Parameters

strings (list[str]) – For instance, [‘hello’, ‘world’].

Returns

A dict with two tensors.

  • targets (list[Tensor]): [torch.Tensor([1,2,3,3,4,8]), torch.Tensor([5,4,6,3,7,8])]
  • padded_targets (Tensor): Tensor of shape (bsz * max_seq_len).

Return type

dict
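The targets and padded_targets fields described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: lists stand in for tensors, and the toy character table, the end-token index 8, and the padding index 9 are hypothetical values chosen so that the targets match the example in the return description.

```python
def str2tensor_sketch(strings, char2idx, end_idx, pad_idx, max_seq_len):
    """Build per-string targets (EOS included) and fixed-length padded targets."""
    targets = []
    padded_targets = []
    for text in strings:
        # Keep at most max_seq_len - 1 characters so that the EOS token fits.
        idx = [char2idx[c] for c in text][: max_seq_len - 1] + [end_idx]
        targets.append(idx)
        padded_targets.append(idx + [pad_idx] * (max_seq_len - len(idx)))
    return {"targets": targets, "padded_targets": padded_targets}


# Toy table chosen so that 'hello' maps to 1,2,3,3,4 as in the example.
toy_char2idx = {"h": 1, "e": 2, "l": 3, "o": 4, "w": 5, "r": 6, "d": 7}
result = str2tensor_sketch(
    ["hello", "world"], toy_char2idx, end_idx=8, pad_idx=9, max_seq_len=8
)
```

Every row of `padded_targets` has length `max_seq_len`, which is what gives the padded tensor its (bsz * max_seq_len) shape.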

class mmocr.models.textrecog.convertors.AttnConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, max_seq_len=40, lower=False, start_end_same=True, **kwargs)[source]

Convert between text, index and tensor for encoder-decoder based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be one of {‘DICT36’, ‘DICT90’}.

  • dict_file (None|str) – Character dict file path. If not none, higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • max_seq_len (int) – Maximum sequence length of label.

  • lower (bool) – If True, convert original string to lower case.

  • start_end_same (bool) – Whether use the same index for start and end token or not. Default: True.

str2tensor(strings)[source]

Convert text-string into tensor.

Parameters

strings (list[str]) – For instance, [‘hello’, ‘world’].

Returns

A dict with two entries.

  • tensors (list[Tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])]
  • padded_targets (Tensor): Tensor of shape (bsz * max_seq_len).

Return type

dict (str: Tensor | list[Tensor])

tensor2idx(outputs, img_metas=None)[source]

Convert output tensor to text-index.

Parameters
  • outputs (tensor) – Model outputs with size: N * T * C.

  • img_metas (list[dict]) – Each dict contains one image info.

Returns

  • indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]]
  • scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]]

Return type

indexes (list[list[int]]), scores (list[list[float]])

class mmocr.models.textrecog.convertors.BaseConvertor(dict_type='DICT90', dict_file=None, dict_list=None)[source]

Convert between text, index and tensor for text recognition pipeline.

Parameters
  • dict_type (str) – Type of dict, options are ‘DICT36’, ‘DICT37’, ‘DICT90’ and ‘DICT91’.

  • dict_file (None|str) – Character dict file path. If not none, the dict_file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.
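The priority order among the three dictionary arguments (dict_file first, then dict_list, then dict_type) can be written as a short resolution function. This is a sketch of the rule only: `resolve_charset` and `dict_file_chars` are hypothetical names, the latter standing in for characters already read from a dict file, and the built-in table shown is illustrative and may differ from the library's own.

```python
# Illustrative built-in table; the library's actual tables may differ.
BUILTIN_DICTS = {"DICT36": "0123456789abcdefghijklmnopqrstuvwxyz"}


def resolve_charset(dict_type="DICT36", dict_file_chars=None, dict_list=None):
    """Pick the character list by priority: file, then list, then type."""
    if dict_file_chars is not None:  # stands in for the contents of dict_file
        return list(dict_file_chars)
    if dict_list is not None:
        return list(dict_list)
    return list(BUILTIN_DICTS[dict_type])


from_type = resolve_charset()
from_list = resolve_charset(dict_list=["a", "b"])
from_file = resolve_charset(dict_file_chars=["x", "y", "z"], dict_list=["a", "b"])
```

When both a file and a list are supplied, the file wins and the list is ignored.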

idx2str(indexes)[source]

Convert indexes to text strings.

Parameters

indexes (list[list[int]]) – For instance, [[1,2,3,3,4], [5,4,6,3,7]].

Returns

Strings, for instance [‘hello’, ‘world’].

Return type

list[str]

num_classes()[source]

Number of output classes.

str2idx(strings)[source]

Convert strings to indexes.

Parameters

strings (list[str]) – For instance, [‘hello’, ‘world’].

Returns

Indexes, for instance [[1,2,3,3,4], [5,4,6,3,7]].

Return type

list[list[int]]
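The round trip between str2idx and idx2str amounts to a pair of lookup tables. The toy class below is a sketch and not the library's convertor: `ToyConvertor` is a hypothetical name, and its character table reserves index 0 with a placeholder so that the indexes match the example values used in this reference.

```python
class ToyConvertor:
    """Minimal string <-> index mapping over a fixed character table."""

    def __init__(self, chars):
        self.idx2char = list(chars)
        self.char2idx = {c: i for i, c in enumerate(self.idx2char)}

    def num_classes(self):
        return len(self.idx2char)

    def str2idx(self, strings):
        return [[self.char2idx[c] for c in s] for s in strings]

    def idx2str(self, indexes):
        return ["".join(self.idx2char[i] for i in idx) for idx in indexes]


# '_' occupies index 0 so that 'h' starts at 1, as in the documented example.
convertor = ToyConvertor("_helowrd")
indexes = convertor.str2idx(["hello", "world"])
strings = convertor.idx2str(indexes)
```

Because the table is a bijection between characters and positions, converting to indexes and back returns the original strings.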

str2tensor(strings)[source]

Convert text-string to input tensor.

Parameters

strings (list[str]) – For instance, [‘hello’, ‘world’].

Returns

Tensors, for instance [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].

Return type

list[torch.Tensor]

tensor2idx(output)[source]

Convert model output tensor to character indexes and scores.

Parameters

output (tensor) – The model outputs with size: N * T * C.

Returns

  • indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
  • scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].

Return type

indexes (list[list[int]]), scores (list[list[float]])

class mmocr.models.textrecog.convertors.CTCConvertor(dict_type='DICT90', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]

Convert between text, index and tensor for CTC loss-based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.

  • dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • lower (bool) – If True, convert original string to lower case.

str2tensor(strings)[source]

Convert text-string to ctc-loss input tensor.

Parameters

strings (list[str]) – For instance, [‘hello’, ‘world’].

Returns

A dict with three entries.

  • tensors (list[tensor]): [torch.Tensor([1,2,3,3,4]), torch.Tensor([5,4,6,3,7])].
  • flatten_targets (tensor): torch.Tensor([1,2,3,3,4,5,4,6,3,7]).
  • target_lengths (tensor): torch.IntTensor([5,5]).

Return type

dict (str: tensor | list[tensor])
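The relation between the per-string targets, flatten_targets, and target_lengths can be sketched in a few lines. Lists stand in for tensors here, and the function name and toy character table are hypothetical, chosen to reproduce the example values in the return description.

```python
def ctc_targets_sketch(strings, char2idx):
    """Build per-string targets, their concatenation, and their lengths."""
    targets = [[char2idx[c] for c in s] for s in strings]
    flatten_targets = [idx for target in targets for idx in target]
    target_lengths = [len(target) for target in targets]
    return dict(
        targets=targets,
        flatten_targets=flatten_targets,
        target_lengths=target_lengths,
    )


toy_char2idx = {"h": 1, "e": 2, "l": 3, "o": 4, "w": 5, "r": 6, "d": 7}
ctc_inputs = ctc_targets_sketch(["hello", "world"], toy_char2idx)
```

The lengths are what let a CTC loss split the flat target sequence back into one label sequence per image.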

tensor2idx(output, img_metas, topk=1, return_topk=False)[source]

Convert model output tensor to index-list.

Parameters
  • output (tensor) – The model outputs with size: N * T * C.

  • img_metas (list[dict]) – Each dict contains one image info.

  • topk (int) – The highest k classes to be returned.

  • return_topk (bool) – Whether to return topk or just top1.

Returns

  • indexes (list[list[int]]): [[1,2,3,3,4], [5,4,6,3,7]].
  • scores (list[list[float]]): [[0.9,0.8,0.95,0.97,0.94], [0.9,0.9,0.98,0.97,0.96]].
  • (indexes_topk (list[list[list[int]->len=topk]]), scores_topk (list[list[list[float]->len=topk]])): optional top-k outputs.

Return type

indexes (list[list[int]]), scores (list[list[float]])
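A common way to turn an N * T * C output into index lists is greedy CTC decoding: take the best class at each time step, merge consecutive repeats, and drop the blank class. The sketch below shows that standard technique on nested lists; it is not taken from the library's source, and the blank index of 0 is an assumption.

```python
def ctc_greedy_decode(output, blank_idx=0):
    """Greedy CTC decoding over nested lists shaped N x T x C."""
    indexes, scores = [], []
    for sequence in output:
        idx_seq, score_seq = [], []
        prev = blank_idx
        for probs in sequence:
            best = max(range(len(probs)), key=probs.__getitem__)
            # Emit only when the class is not blank and differs from the
            # previous time step, which merges repeated frames.
            if best != blank_idx and best != prev:
                idx_seq.append(best)
                score_seq.append(probs[best])
            prev = best
        indexes.append(idx_seq)
        scores.append(score_seq)
    return indexes, scores


# One sequence, five time steps, three classes (class 0 is the blank).
toy_output = [[
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]]
decoded_indexes, decoded_scores = ctc_greedy_decode(toy_output)
```

The blank frame in the middle is what allows class 1 to appear twice in a row in the decoded result instead of being merged into a single character.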

class mmocr.models.textrecog.convertors.SegConvertor(dict_type='DICT36', dict_file=None, dict_list=None, with_unknown=True, lower=False, **kwargs)[source]

Convert between text, index and tensor for segmentation based pipeline.

Parameters
  • dict_type (str) – Type of dict, should be either ‘DICT36’ or ‘DICT90’.

  • dict_file (None|str) – Character dict file path. If not none, the file is of higher priority than dict_type.

  • dict_list (None|list[str]) – Character list. If not none, the list is of higher priority than dict_type, but lower than dict_file.

  • with_unknown (bool) – If True, add UKN token to class.

  • lower (bool) – If True, convert original string to lower case.

tensor2str(output, img_metas=None)[source]

Convert model output tensor to string labels.

Parameters
  • output (tensor) – Model outputs with size: N * C * H * W.

  • img_metas (list[dict]) – Each dict contains one image info.

Returns

  • texts (list[str]): Decoded text labels.
  • scores (list[list[float]]): Decoded chars scores.

Return type

texts (list[str]), scores (list[list[float]])

Text Recognition Encoders

class mmocr.models.textrecog.encoders.ABIVisionModel(encoder={'type': 'TransformerEncoder'}, decoder={'type': 'ABIVisionDecoder'}, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[source]

A wrapper of visual feature encoder and language token decoder that converts visual features into text tokens.

Implementation of VisionEncoder in ABINet.

Parameters
  • encoder (dict) – Config for image feature encoder.

  • decoder (dict) – Config for language token decoder.

  • init_cfg (dict) – Specifies the initialization method for model layers.

forward(feat, img_metas=None)[source]
Parameters

feat (Tensor) – Images of shape (N, E, H, W).

Returns

A dict with keys feature, logits and attn_scores.

  • feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.
  • logits (Tensor): Shape (N, T, C). The raw logits for characters. C is the number of characters.
  • attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.

Return type

dict

class mmocr.models.textrecog.encoders.BaseEncoder(init_cfg: Optional[dict] = None)[source]

Base Encoder class for text recognition.

forward(feat, **kwargs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.encoders.ChannelReductionEncoder(in_channels, out_channels, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'})[source]

Change the channel number with a 1x1 convolutional layer.

Parameters
  • in_channels (int) – Number of input channels.

  • out_channels (int) – Number of output channels.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(feat, img_metas=None)[source]
Parameters
  • feat (Tensor) – Image features with the shape of \((N, C_{in}, H, W)\).

  • img_metas (None) – Unused.

Returns

A tensor of shape \((N, C_{out}, H, W)\).

Return type

Tensor
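A 1x1 convolution changes the channel count by applying the same linear map to the channel vector at every pixel, which is why H and W are unchanged. The pure-Python sketch below illustrates that idea for a single image on nested lists; it omits the bias term and is not the library's layer, and `conv1x1` is a hypothetical helper name.

```python
def conv1x1(feat, weight):
    """Apply a 1x1 convolution to one image.

    feat:   nested list indexed as [C_in][H][W]
    weight: nested list indexed as [C_out][C_in]
    Returns a nested list indexed as [C_out][H][W].
    """
    c_in = len(feat)
    height = len(feat[0])
    width = len(feat[0][0])
    return [
        [
            [sum(row[c] * feat[c][y][x] for c in range(c_in)) for x in range(width)]
            for y in range(height)
        ]
        for row in weight
    ]


# Two input channels on a 1x2 feature map, reduced to one output channel.
feat = [[[1.0, 2.0]], [[3.0, 4.0]]]
reduced = conv1x1(feat, weight=[[1.0, 1.0]])
```

Each output pixel depends only on the input pixel at the same location, so the spatial layout of the feature map is preserved.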

class mmocr.models.textrecog.encoders.NRTREncoder(n_layers=6, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, dropout=0.1, init_cfg=None, **kwargs)[source]

Transformer Encoder block with self attention mechanism.

Parameters
  • n_layers (int) – The number of sub-encoder-layers in the encoder (default=6).

  • n_head (int) – The number of heads in the multi-head attention models (default=8).

  • d_k (int) – Total number of features in key.

  • d_v (int) – Total number of features in value.

  • d_model (int) – The number of expected features in the decoder inputs (default=512).

  • d_inner (int) – The dimension of the feedforward network model (default=256).

  • dropout (float) – Dropout layer on attn_output_weights.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(feat, img_metas=None)[source]
Parameters
  • feat (Tensor) – Backbone output of shape \((N, C, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

Returns

The encoder output tensor. Shape \((N, T, C)\).

Return type

Tensor

class mmocr.models.textrecog.encoders.SAREncoder(enc_bi_rnn=False, enc_do_rnn=0.0, enc_gru=False, d_model=512, d_enc=512, mask=True, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}], **kwargs)[source]

Implementation of encoder module in SAR <https://arxiv.org/abs/1811.00751>.

Parameters
  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • enc_do_rnn (float) – Dropout probability of RNN layer in encoder.

  • enc_gru (bool) – If True, use GRU, else LSTM in encoder.

  • d_model (int) – Dim \(D_i\) of channels from backbone.

  • d_enc (int) – Dim \(D_m\) of encoder RNN layer.

  • mask (bool) – If True, mask padding in RNN sequence.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(feat, img_metas=None)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A tensor of shape \((N, D_m)\).

返回类型

Tensor

class mmocr.models.textrecog.encoders.SatrnEncoder(n_layers=12, n_head=8, d_k=64, d_v=64, d_model=512, n_position=100, d_inner=256, dropout=0.1, init_cfg=None, **kwargs)[源代码]

Implements the encoder for SATRN <https://arxiv.org/abs/1910.04396>.

参数
  • n_layers (int) – Number of attention layers.

  • n_head (int) – Number of parallel attention heads.

  • d_k (int) – Dimension of the key vector.

  • d_v (int) – Dimension of the value vector.

  • d_model (int) – Dimension \(D_m\) of the input from previous model.

  • n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len.

  • d_inner (int) – Hidden dimension of feedforward layers.

  • dropout (float) – Dropout rate.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(feat, img_metas=None)[源代码]
参数
  • feat (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A tensor of shape \((N, T, D_m)\).

返回类型

Tensor

class mmocr.models.textrecog.encoders.TransformerEncoder(n_layers=2, n_head=8, d_model=512, d_inner=2048, dropout=0.1, max_len=256, init_cfg=None)[源代码]

Implement transformer encoder for text recognition, modified from <https://github.com/FangShancheng/ABINet>.

参数
  • n_layers (int) – Number of attention layers.

  • n_head (int) – Number of parallel attention heads.

  • d_model (int) – Dimension \(D_m\) of the input from previous model.

  • d_inner (int) – Hidden dimension of feedforward layers.

  • dropout (float) – Dropout rate.

  • max_len (int) – Maximum output sequence length \(T\).

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward(feature)[源代码]
参数

feature (Tensor) – Feature tensor of shape \((N, D_m, H, W)\).

返回

Features of shape \((N, D_m, H, W)\).

返回类型

Tensor

Text Recognition Decoders

class mmocr.models.textrecog.decoders.ABILanguageDecoder(d_model=512, n_head=8, d_inner=2048, n_layers=4, max_seq_len=40, dropout=0.1, detach_tokens=True, num_chars=90, use_self_attn=False, pad_idx=0, init_cfg=None, **kwargs)[源代码]

Transformer-based language model responsible for spell correction. Implementation of the language model of ABINet.

参数
  • d_model (int) – Hidden size of input.

  • n_head (int) – Number of multi-attention heads.

  • d_inner (int) – Hidden size of feedforward network model.

  • n_layers (int) – The number of similar decoding layers.

  • max_seq_len (int) – Maximum text sequence length \(T\).

  • dropout (float) – Dropout rate.

  • detach_tokens (bool) – Whether to block the gradient flow at input tokens.

  • num_chars (int) – Number of text characters \(C\).

  • use_self_attn (bool) – If True, use self attention in decoder layers, otherwise cross attention will be used.

  • pad_idx (int) – The index of the token indicating the end of output, which is used to compute the length of output. It is usually the index of <EOS> or <PAD> token.

  • init_cfg (dict) – Specifies the initialization method for model layers.

forward_train(feat, logits, targets_dict, img_metas)[源代码]
参数

logits (Tensor) – Raw language logits. Shape (N, T, C).

返回

A dict with keys feature and logits.

  • feature (Tensor): Shape (N, T, E). Raw textual features for vision language aligner.
  • logits (Tensor): Shape (N, T, C). The raw logits for characters after spell correction.

class mmocr.models.textrecog.decoders.ABIVisionDecoder(in_channels=512, num_channels=64, attn_height=8, attn_width=32, attn_mode='nearest', max_seq_len=40, num_chars=90, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[源代码]

Converts visual features into text characters.

Implementation of VisionEncoder in ABINet.

参数
  • in_channels (int) – Number of channels \(E\) of input vector.

  • num_channels (int) – Number of channels of hidden vectors in mini U-Net.

  • attn_height (int) – Height \(H\) of input image features.

  • attn_width (int) – Width \(W\) of input image features.

  • attn_mode (str) – Upsampling mode for torch.nn.Upsample in mini U-Net.

  • max_seq_len (int) – Maximum text sequence length \(T\).

  • num_chars (int) – Number of text characters \(C\).

  • init_cfg (dict) – Specifies the initialization method for model layers.

forward_train(feat, out_enc=None, targets_dict=None, img_metas=None)[源代码]
参数

feat (Tensor) – Image features of shape (N, E, H, W).

返回

A dict with keys feature, logits and attn_scores.

  • feature (Tensor): Shape (N, T, E). Raw visual features for language decoder.
  • logits (Tensor): Shape (N, T, C). The raw logits for characters.
  • attn_scores (Tensor): Shape (N, T, H, W). Intermediate result for vision-language aligner.

返回类型

dict

class mmocr.models.textrecog.decoders.BaseDecoder(init_cfg=None, **kwargs)[源代码]

Base decoder class for text recognition.

forward(feat, out_enc, targets_dict=None, img_metas=None, train_mode=True)[源代码]

Defines the computation performed at every call.

Should be overridden by all subclasses.

注解

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class mmocr.models.textrecog.decoders.CRNNDecoder(in_channels=None, num_classes=None, rnn_flag=False, init_cfg={'layer': 'Conv2d', 'type': 'Xavier'}, **kwargs)[源代码]

Decoder for CRNN.

参数
  • in_channels (int) – Number of input channels.

  • num_classes (int) – Number of output classes.

  • rnn_flag (bool) – Use RNN or CNN as the decoder.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward_test(feat, out_enc, img_metas)[源代码]
参数

feat (Tensor) – A Tensor of shape \((N, H, 1, W)\).

返回

The raw logit tensor. Shape \((N, W, C)\) where \(C\) is num_classes.

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数

feat (Tensor) – A Tensor of shape \((N, H, 1, W)\).

返回

The raw logit tensor. Shape \((N, W, C)\) where \(C\) is num_classes.

返回类型

Tensor

class mmocr.models.textrecog.decoders.MasterDecoder(start_idx, padding_idx, num_classes=93, n_layers=3, n_head=8, d_model=512, feat_size=240, d_inner=2048, attn_drop=0.0, ffn_drop=0.0, feat_pe_drop=0.2, max_seq_len=30, init_cfg=None)[源代码]

Decoder module in MASTER.

Code is partially modified from https://github.com/wenwenyu/MASTER-pytorch.

参数
  • start_idx (int) – The index of <SOS>.

  • padding_idx (int) – The index of <PAD>.

  • num_classes (int) – Number of text characters \(C\).

  • n_layers (int) – Number of attention layers.

  • n_head (int) – Number of parallel attention heads.

  • d_model (int) – Dimension \(E\) of the input from previous model.

  • feat_size (int) – The size of the input feature from previous model, usually \(H * W\).

  • d_inner (int) – Hidden dimension of feedforward layers.

  • attn_drop (float) – Dropout rate of the attention layer.

  • ffn_drop (float) – Dropout rate of the feedforward layer.

  • feat_pe_drop (float) – Dropout rate of the feature positional encoding layer.

  • max_seq_len (int) – Maximum output sequence length \(T\).

  • init_cfg (dict or list[dict], optional) – Initialization configs.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – The feature map from backbone of shape \((N, E, H, W)\).

  • out_enc (Tensor) – Encoder output.

  • img_metas – Unused.

返回

Raw logit tensor of shape \((N, T, C)\).

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas=None)[源代码]
参数
  • feat (Tensor) – The feature map from backbone of shape \((N, E, H, W)\).

  • out_enc (Tensor) – Encoder output.

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas – Unused.

返回

Raw logit tensor of shape \((N, T, C)\).

返回类型

Tensor

make_mask(tgt, device)[源代码]

Make mask for self attention.

参数
  • tgt (Tensor) – Shape [N, l_tgt]

  • device (torch.Device) – Mask device.

返回

Mask of shape [N * self.n_head, l_tgt, l_tgt]

返回类型

Tensor
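
As a rough illustration of what a self-attention mask for a padded target batch looks like, the sketch below combines a padding mask with a lower-triangular (causal) mask using plain Python lists. It assumes this is the intent of make_mask. The function name, the boolean convention (True means "may attend"), and the omission of the head dimension are simplifications made for illustration.

```python
def make_causal_padding_mask(tgt, padding_idx):
    """Return a [N][l_tgt][l_tgt] nested list of booleans.

    mask[n][i][j] is True when query position i may attend to key
    position j: j must not lie in the future (j <= i) and the query
    token itself must not be padding. (Assumed convention.)
    """
    masks = []
    for seq in tgt:
        length = len(seq)
        masks.append([
            [seq[i] != padding_idx and j <= i for j in range(length)]
            for i in range(length)
        ])
    return masks


# One sequence of three tokens followed by one <PAD> (index 0 here).
mask = make_causal_padding_mask([[5, 7, 9, 0]], padding_idx=0)
```

Each row i allows attention to positions 0..i only, and the row belonging to the padded position is disabled entirely.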

class mmocr.models.textrecog.decoders.NRTRDecoder(n_layers=6, d_embedding=512, n_head=8, d_k=64, d_v=64, d_model=512, d_inner=256, n_position=200, dropout=0.1, num_classes=93, max_seq_len=40, start_idx=1, padding_idx=92, init_cfg=None, **kwargs)[源代码]

Transformer Decoder block with self attention mechanism.

参数
  • n_layers (int) – Number of attention layers.

  • d_embedding (int) – Language embedding dimension.

  • n_head (int) – Number of parallel attention heads.

  • d_k (int) – Dimension of the key vector.

  • d_v (int) – Dimension of the value vector.

  • d_model (int) – Dimension \(D_m\) of the input from previous model.

  • d_inner (int) – Hidden dimension of feedforward layers.

  • n_position (int) – Length of the positional encoding vector. Must be greater than max_seq_len.

  • dropout (float) – Dropout rate.

  • num_classes (int) – Number of output classes \(C\).

  • max_seq_len (int) – Maximum output sequence length \(T\).

  • start_idx (int) – The index of <SOS>.

  • padding_idx (int) – The index of <PAD>.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

警告

This decoder will not predict the final class which is assumed to be <PAD>. Therefore, its output size is always \(C - 1\). <PAD> is also ignored by loss as specified in mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer.
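
To make the warning concrete: with the default num_classes=93 and padding_idx=92, each time step carries 92 logits, so an argmax over them can never select <PAD>. The helper names below are hypothetical and only illustrate the arithmetic.

```python
def logit_size(num_classes):
    """Size of the last logit dimension when <PAD> is never predicted."""
    return num_classes - 1


def greedy_index(logits_row):
    """Index of the largest logit for one time step."""
    return max(range(len(logits_row)), key=logits_row.__getitem__)


num_classes, padding_idx = 93, 92      # defaults from the signature above
size = logit_size(num_classes)         # number of logits per time step
step_logits = [0.0] * size
step_logits[-1] = 3.5                  # make the last predictable class win
pred = greedy_index(step_logits)
```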

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数
  • feat (None) – Unused.

  • out_enc (Tensor) – Encoder output of shape \((N, T, D_m)\) where \(D_m\) is d_model.

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

The raw logit tensor. Shape \((N, T, C)\).

返回类型

Tensor

static get_subsequent_mask(seq)[源代码]

For masking out the subsequent info.

class mmocr.models.textrecog.decoders.ParallelSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0.0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=92, pred_concat=False, init_cfg=None, **kwargs)[源代码]

Implementation of the parallel decoder module in SAR <https://arxiv.org/abs/1811.00751>.

参数
  • num_classes (int) – Output class number \(C\).

  • channels (list[int]) – Network layer channels.

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.

  • dec_do_rnn (float) – Dropout of RNN layer in decoder.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder.

  • d_model (int) – Dim of channels from backbone \(D_i\).

  • d_enc (int) – Dim of encoder RNN layer \(D_m\).

  • d_k (int) – Dim of channels of attention module.

  • pred_dropout (float) – Dropout probability of prediction layer.

  • max_seq_len (int) – Maximum sequence length for decoding.

  • mask (bool) – If True, mask padding in feature map.

  • start_idx (int) – Index of start token.

  • padding_idx (int) – Index of padding token.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

警告

This decoder will not predict the final class which is assumed to be <PAD>. Therefore, its output size is always \(C - 1\). <PAD> is also ignored by loss as specified in mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor

class mmocr.models.textrecog.decoders.ParallelSARDecoderWithBS(beam_width=5, num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_do_rnn=0, dec_gru=False, d_model=512, d_enc=512, d_k=64, pred_dropout=0.0, max_seq_len=40, mask=True, start_idx=0, padding_idx=0, pred_concat=False, init_cfg=None, **kwargs)[源代码]

Parallel Decoder module with beam-search in SAR.

参数

beam_width (int) – Width for beam search.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor
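
For readers unfamiliar with the role of beam_width, here is a generic beam search sketch in plain Python. It is not the MMOCR implementation: the callback step_log_probs, the toy model, and the fixed-length stopping rule are all assumptions chosen to keep the example short.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, start_idx):
    """Generic beam search (illustrative, not MMOCR's implementation).

    ``step_log_probs(prefix)`` is a hypothetical callback returning a list
    of log-probabilities over the next token given the decoded prefix.
    Returns the highest-scoring (tokens, score) pair after ``max_len`` steps.
    """
    beams = [([start_idx], 0.0)]
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, log_p in enumerate(step_log_probs(tokens)):
                candidates.append((tokens + [token], score + log_p))
        # Keep only the ``beam_width`` best partial sequences.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_model(prefix):
    # Toy distribution: after token 0 prefer 1, otherwise prefer 2.
    probs = [0.1, 0.7, 0.2] if prefix[-1] == 0 else [0.2, 0.2, 0.6]
    return [math.log(p) for p in probs]


best_tokens, best_score = beam_search(
    toy_model, beam_width=2, max_len=2, start_idx=0)
```

A larger beam_width keeps more partial hypotheses alive at each step, trading extra computation for a lower chance of discarding the best sequence early.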

class mmocr.models.textrecog.decoders.PositionAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, mask=True, return_feature=False, encode_value=False, init_cfg=None)[源代码]

Position attention decoder for RobustScanner.

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

参数
  • num_classes (int) – Number of output classes \(C\).

  • rnn_layers (int) – Number of RNN layers.

  • dim_input (int) – Dimension \(D_i\) of input vector feat.

  • dim_model (int) – Dimension \(D_m\) of the model. Should also be the same as encoder output vector out_enc.

  • max_seq_len (int) – Maximum output sequence length \(T\).

  • mask (bool) – Whether to mask input features according to img_meta['valid_ratio'].

  • return_feature (bool) – Return feature or logits as the result.

  • encode_value (bool) – Whether to use the output of encoder out_enc as value of attention layer. If False, the original feature feat will be used.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

警告

This decoder will not predict the final class which is assumed to be <PAD>. Therefore, its output size is always \(C - 1\). <PAD> is also ignored by loss as specified in mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\) if return_feature=False. Otherwise it would be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\) if return_feature=False. Otherwise it will be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

返回类型

Tensor

class mmocr.models.textrecog.decoders.RobustScannerDecoder(num_classes=None, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, encode_value=False, hybrid_decoder=None, position_decoder=None, init_cfg=None)[源代码]

Decoder for RobustScanner.

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

参数
  • num_classes (int) – Number of output classes \(C\).

  • dim_input (int) – Dimension \(D_i\) of input vector feat.

  • dim_model (int) – Dimension \(D_m\) of the model. Should also be the same as encoder output vector out_enc.

  • max_seq_len (int) – Maximum output sequence length \(T\).

  • start_idx (int) – The index of <SOS>.

  • mask (bool) – Whether to mask input features according to img_meta['valid_ratio'].

  • padding_idx (int) – The index of <PAD>.

  • encode_value (bool) – Whether to use the output of encoder out_enc as value of attention layer. If False, the original feature feat will be used.

  • hybrid_decoder (dict) – Configuration dict for hybrid decoder.

  • position_decoder (dict) – Configuration dict for position decoder.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

警告

This decoder will not predict the final class which is assumed to be <PAD>. Therefore, its output size is always \(C - 1\). <PAD> is also ignored by loss as specified in mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

The output logit sequence tensor of shape \((N, T, C-1)\).

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor

class mmocr.models.textrecog.decoders.SequenceAttentionDecoder(num_classes=None, rnn_layers=2, dim_input=512, dim_model=128, max_seq_len=40, start_idx=0, mask=True, padding_idx=None, dropout=0, return_feature=False, encode_value=False, init_cfg=None)[源代码]

Sequence attention decoder for RobustScanner.

RobustScanner: Dynamically Enhancing Positional Clues for Robust Text Recognition

参数
  • num_classes (int) – Number of output classes \(C\).

  • rnn_layers (int) – Number of RNN layers.

  • dim_input (int) – Dimension \(D_i\) of input vector feat.

  • dim_model (int) – Dimension \(D_m\) of the model. Should also be the same as encoder output vector out_enc.

  • max_seq_len (int) – Maximum output sequence length \(T\).

  • start_idx (int) – The index of <SOS>.

  • mask (bool) – Whether to mask input features according to img_meta['valid_ratio'].

  • padding_idx (int) – The index of <PAD>.

  • dropout (float) – Dropout rate.

  • return_feature (bool) – Return feature or logits as the result.

  • encode_value (bool) – Whether to use the output of encoder out_enc as value of attention layer. If False, the original feature feat will be used.

  • init_cfg (dict or list[dict], optional) – Initialization configs.

警告

This decoder will not predict the final class which is assumed to be <PAD>. Therefore, its output size is always \(C - 1\). <PAD> is also ignored by loss as specified in mmocr.models.textrecog.recognizer.EncodeDecodeRecognizer.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

The output logit sequence tensor of shape \((N, T, C-1)\).

返回类型

Tensor

forward_test_step(feat, out_enc, decode_sequence, current_step, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • decode_sequence (Tensor) – Shape \((N, T)\). The tensor that stores history decoding result.

  • current_step (int) – Current decoding step.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

Shape \((N, C-1)\). The logit tensor of predicted tokens at current time step.

返回类型

Tensor
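
The arguments decode_sequence and current_step suggest a step-by-step loop in which each prediction is written back into the history before the next step. A minimal sketch of such a greedy loop is shown below. The callback step_fn is a hypothetical stand-in for a per-step forward pass, and the toy model exists only to make the example runnable.

```python
def greedy_decode(step_fn, batch_size, max_seq_len, start_idx):
    """Fill ``decode_sequence`` one step at a time (illustrative sketch).

    ``step_fn(decode_sequence, current_step)`` is a hypothetical stand-in
    for a per-step forward pass; it returns one row of logits per sample.
    """
    # Every slot starts as <SOS>; later slots are overwritten while decoding.
    decode_sequence = [[start_idx] * max_seq_len for _ in range(batch_size)]
    outputs = []
    for step in range(max_seq_len):
        step_logits = step_fn(decode_sequence, step)
        picks = [max(range(len(row)), key=row.__getitem__)
                 for row in step_logits]
        outputs.append(picks)
        if step < max_seq_len - 1:
            for n, token in enumerate(picks):
                # Feed the prediction back as history for the next step.
                decode_sequence[n][step + 1] = token
    # Transpose from step-major to sample-major: shape (N, T).
    return [list(sample) for sample in zip(*outputs)]


def toy_step(decode_sequence, current_step):
    # Toy model over 4 classes: always predict (previous token + 1) mod 4.
    rows = []
    for seq in decode_sequence:
        target = (seq[current_step] + 1) % 4
        rows.append([1.0 if c == target else 0.0 for c in range(4)])
    return rows


decoded = greedy_decode(toy_step, batch_size=2, max_seq_len=3, start_idx=0)
```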

forward_train(feat, out_enc, targets_dict, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\) if return_feature=False. Otherwise it would be the hidden feature before the prediction projection layer, whose shape is \((N, T, D_m)\).

返回类型

Tensor

class mmocr.models.textrecog.decoders.SequentialSARDecoder(num_classes=37, enc_bi_rnn=False, dec_bi_rnn=False, dec_gru=False, d_k=64, d_model=512, d_enc=512, pred_dropout=0.0, mask=True, max_seq_len=40, start_idx=0, padding_idx=92, pred_concat=False, init_cfg=None, **kwargs)[源代码]

Implementation of the sequential decoder module in SAR <https://arxiv.org/abs/1811.00751>.

参数
  • num_classes (int) – Output class number \(C\).

  • enc_bi_rnn (bool) – If True, use bidirectional RNN in encoder.

  • dec_bi_rnn (bool) – If True, use bidirectional RNN in decoder.

  • dec_do_rnn (float) – Dropout of RNN layer in decoder.

  • dec_gru (bool) – If True, use GRU, else LSTM in decoder.

  • d_k (int) – Dim of conv layers in attention module.

  • d_model (int) – Dim of channels from backbone \(D_i\).

  • d_enc (int) – Dim of encoder RNN layer \(D_m\).

  • pred_dropout (float) – Dropout probability of prediction layer.

  • max_seq_len (int) – Maximum sequence length during decoding.

  • mask (bool) – If True, mask padding in feature map.

  • start_idx (int) – Index of start token.

  • padding_idx (int) – Index of padding token.

  • pred_concat (bool) – If True, concat glimpse feature from attention with holistic feature and hidden state.

forward_test(feat, out_enc, img_metas)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor

forward_train(feat, out_enc, targets_dict, img_metas=None)[源代码]
参数
  • feat (Tensor) – Tensor of shape \((N, D_i, H, W)\).

  • out_enc (Tensor) – Encoder output of shape \((N, D_m, H, W)\).

  • targets_dict (dict) – A dict with the key padded_targets, a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

A raw logit tensor of shape \((N, T, C-1)\).

返回类型

Tensor

Text Recognition Fusers

class mmocr.models.textrecog.fusers.ABIFuser(d_model=512, max_seq_len=40, num_chars=90, init_cfg=None, **kwargs)[源代码]

Mix and align visual features and linguistic features. Implementation of the fuser (aligner) of ABINet.

参数
  • d_model (int) – Hidden size of input.

  • max_seq_len (int) – Maximum text sequence length \(T\).

  • num_chars (int) – Number of text characters \(C\).

  • init_cfg (dict) – Specifies the initialization method for model layers.

forward(l_feature, v_feature)[源代码]
参数
  • l_feature – (N, T, E) where T is length, N is batch size and E is dim of model.

  • v_feature – (N, T, E) shape the same as l_feature.

返回

A dict with key logits. The logits of shape (N, T, C) where N is batch size, T is length and C is the number of characters.
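
One common way to "mix and align" two feature streams is a learned element-wise gate. The sketch below shows that idea on plain Python lists. Whether ABIFuser computes exactly this is an assumption, and gate_logits is a hypothetical stand-in for the output of a learned projection of the concatenated features.

```python
import math


def gated_fuse(l_feature, v_feature, gate_logits):
    """Element-wise gated mix of two equally shaped feature vectors.

    ``gate_logits`` stands in for the output of a learned projection
    (an assumption here); its sigmoid weights the visual feature, and
    the complement weights the linguistic feature.
    """
    fused = []
    for l, v, g in zip(l_feature, v_feature, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))
        fused.append(gate * v + (1.0 - gate) * l)
    return fused


# A gate logit of 0 gives an even 50/50 blend of the two features.
fused = gated_fuse([0.0, 2.0], [1.0, 4.0], [0.0, 0.0])
```

In a full model the fused feature would then pass through a classifier to produce the logits of shape (N, T, C).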

Text Recognition Losses

class mmocr.models.textrecog.losses.ABILoss(enc_weight=1.0, dec_weight=1.0, fusion_weight=1.0, num_classes=37, **kwargs)[源代码]

Implementation of ABINet multiloss that allows mixing different types of losses with weights.

参数
  • enc_weight (float) – The weight of encoder loss. Defaults to 1.0.

  • dec_weight (float) – The weight of decoder loss. Defaults to 1.0.

  • fusion_weight (float) – The weight of fuser (aligner) loss. Defaults to 1.0.

  • num_classes (int) – Number of unique output language tokens.

返回

A dictionary whose key/value pairs are the losses of three modules.

forward(outputs, targets_dict, img_metas=None)[源代码]
参数
  • outputs (dict) – The output dictionary with at least one of out_enc, out_dec and out_fusers specified.

  • targets_dict (dict) – The target dictionary containing the key padded_targets, which represents target sequences in shape (batch_size, sequence_length).

返回

A loss dictionary with loss_visual, loss_lang and loss_fusion. Each should either be the loss tensor or 0 if the output of its corresponding module is not given.
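
The weighting described above can be pictured with the following sketch, which scales already computed scalar losses and substitutes 0 for missing modules. The function name and the mapping of out_enc, out_dec and out_fusers to loss_visual, loss_lang and loss_fusion are assumptions for illustration; the actual per-module loss computation is omitted.

```python
def combine_losses(losses, enc_weight=1.0, dec_weight=1.0, fusion_weight=1.0):
    """Scale each available module loss by its weight (illustrative).

    ``losses`` maps a subset of 'out_enc', 'out_dec', 'out_fusers' to an
    already computed scalar loss; a module with no output contributes 0.
    """
    return {
        "loss_visual": enc_weight * losses.get("out_enc", 0.0),
        "loss_lang": dec_weight * losses.get("out_dec", 0.0),
        "loss_fusion": fusion_weight * losses.get("out_fusers", 0.0),
    }


# No decoder output here, so loss_lang falls back to 0.
result = combine_losses({"out_enc": 2.0, "out_fusers": 1.0}, fusion_weight=0.5)
```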

class mmocr.models.textrecog.losses.CELoss(ignore_index=-1, reduction='none', ignore_first_char=False)[源代码]

Implementation of loss module for encoder-decoder based text recognition method with CrossEntropy loss.

参数
  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

  • ignore_first_char (bool) – Whether to ignore the first token in target (usually the start token). If True, the last token of the output sequence will also be removed to be aligned with the target length.

forward(outputs, targets_dict, img_metas=None)[源代码]
参数
  • outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).

  • targets_dict (dict) – A dict with a key padded_targets, which is a tensor of shape \((N, T)\). Each element is the index of a character.

  • img_metas (None) – Unused.

返回

A loss dict with the key loss_ce.

返回类型

dict
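
To illustrate how ignore_index and ignore_first_char interact with the \((N, T, C)\) logits, here is a dependency-free sketch of a per-token cross entropy in the spirit of reduction='none'. It works on nested lists instead of tensors, and the function name ce_loss is hypothetical; it is meant as a reading aid rather than a replacement for the real loss.

```python
import math


def ce_loss(outputs, padded_targets, ignore_index=-1, ignore_first_char=False):
    """Per-token cross entropy over nested lists (reduction='none' style).

    outputs: logits of shape (N, T, C); padded_targets: indices of shape
    (N, T). Ignored positions get a loss of 0.0.
    """
    if ignore_first_char:
        # Drop the leading start token from the targets and the last
        # output step so both sides keep the same length.
        padded_targets = [seq[1:] for seq in padded_targets]
        outputs = [steps[:-1] for steps in outputs]
    losses = []
    for steps, seq in zip(outputs, padded_targets):
        row = []
        for logits, target in zip(steps, seq):
            if target == ignore_index:
                row.append(0.0)
                continue
            # Numerically stable log-softmax at the target index.
            peak = max(logits)
            log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
            row.append(log_norm - logits[target])
        losses.append(row)
    return {"loss_ce": losses}


logits = [[[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]]   # N=1, T=3, C=2
targets = [[0, 1, -1]]                             # last slot is ignored
loss = ce_loss(logits, targets)["loss_ce"]
```

With uniform logits over two classes, every non-ignored position costs log(2), which makes the effect of the ignored slot easy to see.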

class mmocr.models.textrecog.losses.CTCLoss(flatten=True, blank=0, reduction='mean', zero_infinity=False, **kwargs)[源代码]

Implementation of loss module for CTC-loss based text recognition.

参数
  • flatten (bool) – If True, use flattened targets, else padded targets.

  • blank (int) – Blank label. Default 0.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (‘none’, ‘mean’, ‘sum’).

  • zero_infinity (bool) – Whether to zero infinite losses and the associated gradients. Default: False. Infinite losses mainly occur when the inputs are too short to be aligned to the targets.

forward(outputs, targets_dict, img_metas=None)[源代码]
参数
  • outputs (Tensor) – A raw logit tensor of shape \((N, T, C)\).

  • targets_dict (dict) –

    A dict with 3 keys target_lengths, flatten_targets and targets.

    • target_lengths (Tensor): A tensor of shape \((N)\). Each item is the length of a word.
    • flatten_targets (Tensor): Used if self.flatten=True (default). A tensor of shape (sum(targets_dict['target_lengths'])). Each item is the index of a character.
    • targets (Tensor): Used if self.flatten=False. A tensor of \((N, T)\). Empty slots are padded with self.blank.

  • img_metas (dict) – A dict that contains meta information of input images. Preferably with the key valid_ratio.

返回

The loss dict with key loss_ctc.

返回类型

dict
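
The relation between the three keys of targets_dict can be shown with a small sketch that derives a flattened target list from padded targets and their lengths. The helper name and the sample indices are made up for illustration.

```python
def flatten_targets(padded_targets, target_lengths):
    """Concatenate the first ``length`` indices of every padded row.

    Shows how a 1-D ``flatten_targets`` of size sum(target_lengths)
    relates to padded ``targets`` of shape (N, T).
    """
    flat = []
    for row, length in zip(padded_targets, target_lengths):
        flat.extend(row[:length])
    return flat


blank = 0
targets = [[3, 5, blank, blank], [7, 2, 9, blank]]   # padded with the blank label
target_lengths = [2, 3]
flat = flatten_targets(targets, target_lengths)
```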

class mmocr.models.textrecog.losses.SARLoss(ignore_index=-1, reduction='mean', **kwargs)[源代码]

Implementation of the loss module in SAR <https://arxiv.org/abs/1811.00751>.

参数
  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

  • reduction (str) – Specifies the reduction to apply to the output, should be one of the following: (“none”, “mean”, “sum”).

警告

SARLoss assumes that the first input token is always <SOS>.

class mmocr.models.textrecog.losses.SegLoss(seg_downsample_ratio=0.5, seg_with_loss_weight=True, ignore_index=255, **kwargs)[源代码]

Implementation of loss module for segmentation based text recognition method.

参数
  • seg_downsample_ratio (float) – Downsample ratio of segmentation map.

  • seg_with_loss_weight (bool) – If True, set weight for segmentation loss.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

forward(out_neck, out_head, gt_kernels)[源代码]
参数
  • out_neck (None) – Unused.

  • out_head (Tensor) – The output from head whose shape is \((N, C, H, W)\).

  • gt_kernels (BitmapMasks) – The ground truth masks.

返回

A loss dictionary with the key loss_seg.

返回类型

dict

class mmocr.models.textrecog.losses.TFLoss(ignore_index=-1, reduction='none', flatten=True, **kwargs)[源代码]

Implementation of loss module for transformer.

参数
  • ignore_index (int, optional) – The character index to be ignored in loss computation.

  • reduction (str) – Type of reduction to apply to the output, should be one of the following: (“none”, “mean”, “sum”).

  • flatten (bool) – Whether to flatten the vectors for loss computation.

警告

TFLoss assumes that the first input token is always <SOS>.

KIE Extractors

class mmocr.models.kie.extractors.SDMGR(backbone, neck=None, bbox_head=None, extractor={'featmap_strides': [1], 'roi_layer': {'output_size': 7, 'type': 'RoIAlign'}, 'type': 'mmdet.SingleRoIExtractor'}, visual_modality=False, train_cfg=None, test_cfg=None, class_list=None, init_cfg=None, openset=False)[源代码]

The implementation of the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction. https://arxiv.org/abs/2103.14470.

参数
  • visual_modality (bool) – Whether to use the visual modality.

  • class_list (None | str) – Mapping file of class index to class name. If None, class index will be shown in show_results, else class name.

extract_feat(img, gt_bboxes)[源代码]

Directly extract features from the backbone+neck.

forward_test(img, img_metas, relations, texts, gt_bboxes, rescale=False)[源代码]

参数
  • imgs (List[Tensor]) – The outer list indicates test-time augmentations, and each inner Tensor should have a shape NxCxHxW, which contains all images in the batch.

  • img_metas (List[List[dict]]) – The outer list indicates test-time augs (multiscale, flip, etc.) and the inner list indicates images in a batch.

forward_train(img, img_metas, relations, texts, gt_bboxes, gt_labels)[源代码]
参数
  • img (tensor) – Input images of shape (N, C, H, W). Typically these should be mean centered and std scaled.

  • img_metas (list[dict]) – A list of image info dict where each dict contains: ‘img_shape’, ‘scale_factor’, ‘flip’, and may also contain ‘filename’, ‘ori_shape’, ‘pad_shape’, and ‘img_norm_cfg’. For details of the values of these keys, please see mmdet.datasets.pipelines.Collect.

  • relations (list[tensor]) – Relations between bboxes.

  • texts (list[tensor]) – Texts in bboxes.

  • gt_bboxes (list[tensor]) – Each item contains the ground-truth boxes for one image in [tl_x, tl_y, br_x, br_y] format.

  • gt_labels (list[tensor]) – Class indices corresponding to each box.

返回

A dictionary of loss components.

返回类型

dict[str, tensor]

show_result(img, result, boxes, win_name='', show=False, wait_time=0, out_file=None, **kwargs)[source]

Draw result on img.

Parameters
  • img (str or tensor) – The image to be displayed.

  • result (dict) – The results to draw on img.

  • boxes (list) – Bbox of img.

  • win_name (str) – The window name.

  • wait_time (int) – Value of waitKey param. Default: 0.

  • show (bool) – Whether to show the image. Default: False.

  • out_file (str or None) – The output filename. Default: None.

Returns

The image with the results drawn on it. Returned only if not show or out_file.

Return type

img (tensor)

KIE Heads

class mmocr.models.kie.heads.SDMGRHead(num_chars=92, visual_dim=64, fusion_dim=1024, node_input=32, node_embed=256, edge_input=5, edge_embed=256, num_gnn=2, num_classes=26, loss={'type': 'SDMGRLoss'}, bidirectional=False, train_cfg=None, test_cfg=None, init_cfg={'mean': 0, 'override': {'name': 'edge_embed'}, 'std': 0.01, 'type': 'Normal'})[source]
forward(relations, texts, x=None)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

KIE Losses

class mmocr.models.kie.losses.SDMGRLoss(node_weight=1.0, edge_weight=1.0, ignore=-100)[source]

The implementation of the key information extraction loss proposed in the paper: Spatial Dual-Modality Graph Reasoning for Key Information Extraction.

https://arxiv.org/abs/2103.14470.

forward(node_preds, edge_preds, gts)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

NER Encoders

class mmocr.models.ner.encoders.BertEncoder(num_hidden_layers=12, initializer_range=0.02, vocab_size=21128, hidden_size=768, max_position_embeddings=128, type_vocab_size=2, layer_norm_eps=1e-12, hidden_dropout_prob=0.1, output_attentions=False, output_hidden_states=False, num_attention_heads=12, attention_probs_dropout_prob=0.1, intermediate_size=3072, hidden_act_cfg={'type': 'GeluNew'}, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

Bert encoder.

Parameters
  • num_hidden_layers (int) – The number of hidden layers.

  • initializer_range (float)

  • vocab_size (int) – Number of words supported.

  • hidden_size (int) – Hidden size.

  • max_position_embeddings (int) – Max positions embedding size.

  • type_vocab_size (int) – The size of type_vocab.

  • layer_norm_eps (float) – Epsilon of layer norm.

  • hidden_dropout_prob (float) – The dropout probability of hidden layer.

  • output_attentions (bool) – Whether to use the attentions in output.

  • output_hidden_states (bool) – Whether to use the hidden_states in output.

  • num_attention_heads (int) – The number of attention heads.

  • attention_probs_dropout_prob (float) – The dropout probability of attention.

  • intermediate_size (int) – The size of intermediate layer.

  • hidden_act_cfg (dict) – Hidden layer activation.

forward(results)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

NER Decoders

class mmocr.models.ner.decoders.FCDecoder(num_labels=None, hidden_dropout_prob=0.1, hidden_size=768, init_cfg=[{'type': 'Xavier', 'layer': 'Conv2d'}, {'type': 'Uniform', 'layer': 'BatchNorm2d'}])[source]

FC Decoder class for NER.

Parameters
  • num_labels (int) – Number of categories mapped by entity label.

  • hidden_dropout_prob (float) – The dropout probability of hidden layer.

  • hidden_size (int) – Number of output channels of the hidden layer.

forward(outputs)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

NER Losses

class mmocr.models.ner.losses.MaskedCrossEntropyLoss(num_labels=None, ignore_index=0)[source]

The implementation of masked cross entropy loss.

The mask has 1 for real tokens and 0 for padding tokens, so only the active parts of the cross entropy loss are kept.

Parameters
  • num_labels (int) – Number of classes in labels.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.
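
The masking described above can be sketched in plain Python. This is a minimal illustration, not MMOCR's implementation (which operates on torch tensors); the function name masked_cross_entropy and the list-based inputs are assumptions made for the example.

```python
import math


def masked_cross_entropy(logits, labels, attention_mask):
    """Mean cross entropy over the tokens whose mask value is 1.

    logits: list of per-token score lists, shape [N, C].
    labels: list of N class indices.
    attention_mask: list of N ints, 1 for real tokens, 0 for padding.
    """
    total, count = 0.0, 0
    for scores, label, mask in zip(logits, labels, attention_mask):
        if mask == 0:
            continue  # padding token: contributes nothing to the loss
        # Numerically stable log-sum-exp over the class scores.
        peak = max(scores)
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_sum - scores[label]
        count += 1
    return total / count if count else 0.0
```

Only the first token below is real, so the second token's scores do not affect the result: `masked_cross_entropy([[0.0, 0.0], [5.0, 0.0]], [0, 1], [1, 0])` equals `math.log(2)`.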

forward(logits, img_metas)[source]

Loss forward.

Parameters
  • logits – Model output with shape [N, C].

  • img_metas – A dict containing the following keys:

    • img (list): This parameter is reserved.

    • labels (list[int]): The labels for each word of the sequence.

    • texts (list): The words of the sequence.

    • input_ids (list): The ids for each word of the sequence.

    • attention_mask (list): The mask for each word of the sequence. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

    • token_type_ids (list): The tokens for each word of the sequence.

class mmocr.models.ner.losses.MaskedFocalLoss(num_labels=None, ignore_index=0)[source]

The implementation of masked focal loss.

The mask has 1 for real tokens and 0 for padding tokens, so only the active parts of the focal loss are kept.

Parameters
  • num_labels (int) – Number of classes in labels.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the input gradient.

forward(logits, img_metas)[source]

Loss forward.

Parameters
  • logits – Model output with shape [N, C].

  • img_metas – A dict containing the following keys:

    • img (list): This parameter is reserved.

    • labels (list[int]): The labels for each word of the sequence.

    • texts (list): The words of the sequence.

    • input_ids (list): The ids for each word of the sequence.

    • attention_mask (list): The mask for each word of the sequence. The mask has 1 for real tokens and 0 for padding tokens. Only real tokens are attended to.

    • token_type_ids (list): The tokens for each word of the sequence.

mmocr.datasets

class mmocr.datasets.AnnFileLoader(ann_file, parser, repeat=1, file_storage_backend='disk', file_format='txt', **kwargs)[source]

Annotation file loader that loads annotations from ann_file and parses each raw annotation into dict format with the given parser.

Parameters
  • ann_file (str) – Annotation file path.

  • parser (dict) – Dictionary to construct parser to parse original annotation infos.

  • repeat (int|float) – Repeated times of dataset.

  • file_storage_backend (str) – The storage backend type for annotation file. Options are “disk”, “http” and “petrel”. Default: “disk”.

  • file_format (str) – The format of annotation file. Options are “txt” and “lmdb”. Default: “txt”.

close()[source]

For ann_file with lmdb format only.

class mmocr.datasets.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows. The annotations field is optional for testing. (This is one line of anno_file, with the line-json-str converted to dict for visualizing only.)

    {
        "file_name": "sample.jpg",
        "height": 1080,
        "width": 960,
        "annotations": [
            {
                "iscrowd": 0,
                "category_id": 1,
                "bbox": [357.0, 667.0, 804.0, 100.0],
                "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
            }
        ]
    }

  2. The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

    format1: sample.jpg hello
    format2: sample.jpg 20 20 100 20 100 40 20 40 hello
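
As a rough illustration of the annotation formats listed above, the following sketch parses one line of each kind with the standard library only. The helper names parse_recog_line and parse_det_line are hypothetical and are not part of MMOCR, which uses its own loader and parser classes for this.

```python
import json


def parse_recog_line(line):
    """Parse a recognition line in format1 or format2 into a dict."""
    parts = line.strip().split(' ')
    try:
        box = [float(v) for v in parts[1:9]]
    except ValueError:
        box = []  # the fields after the filename are not coordinates
    if len(parts) >= 10 and len(box) == 8:
        # format2: filename, eight box coordinates, then the text
        return {'filename': parts[0], 'box': box,
                'text': ' '.join(parts[9:])}
    # format1: filename followed by the text
    return {'filename': parts[0], 'text': ' '.join(parts[1:])}


def parse_det_line(line):
    """Parse one line-json detection annotation.

    Returns the file name and the list of bboxes; the annotations
    field may be absent at test time, in which case the list is empty.
    """
    info = json.loads(line)
    boxes = [ann['bbox'] for ann in info.get('annotations', [])]
    return info['file_name'], boxes
```

For instance, `parse_recog_line('sample.jpg hello')` gives a dict with only `filename` and `text`, while the format2 line additionally yields an eight-value `box`.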

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ will be formatted to DC as follows:

  • gt_kernels: to DataContainer (cpu_only=True)

  • gt_effective_mask: to DataContainer (cpu_only=True)

Parameters
  • keys (list[str]) – Fields to be formatted to DC only.

  • call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.

  • visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]

Generate gt shrunk text, gt threshold map, and their effective region masks to learn DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters
  • shrink_ratio (float) – The area shrunk ratio between text kernels and their text masks.

  • thr_min (float) – The minimum value of the threshold map.

  • thr_max (float) – The maximum value of the threshold map.

  • min_short_size (int) – The minimum size of polygon below which the polygon is invalid.
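
To make the role of shrink_ratio concrete, the DBNet paper defines the shrink offset as D = A(1 - r^2) / L, where A is the polygon area, L its perimeter, and r the shrink ratio. The sketch below evaluates that formula in plain Python. The function names are hypothetical, and MMOCR's own target generation may round or clamp the value, so treat this as an illustration rather than the library's code.

```python
import math


def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def polygon_perimeter(points):
    """Sum of the edge lengths of a closed polygon."""
    return sum(
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]))


def shrink_distance(points, shrink_ratio=0.4):
    """Shrink offset D = A * (1 - r^2) / L from the DBNet paper."""
    area = polygon_area(points)
    perimeter = polygon_perimeter(points)
    return area * (1 - shrink_ratio ** 2) / perimeter
```

For a 100 x 20 rectangle with the default ratio of 0.4, the area is 2000 and the perimeter is 240, so the offset is 2000 * 0.84 / 240 = 7.0 pixels.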

draw_border_map(polygon, canvas, mask)[source]

Generate threshold map for one polygon.

Parameters
  • polygon (ndarray) – The polygon boundary ndarray.

  • canvas (ndarray) – The generated threshold map.

  • mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]

Find invalid polygons.

Parameters

results (dict) – The dict containing gt_mask.

Returns

The indicators for ignoring polygons.

Return type

ignore_tags (list[bool])

generate_targets(results)[source]

Generate the gt targets for DBNet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_thr_map(img_size, polygons)[source]

Generate threshold map.

Parameters
  • img_size (tuple(int)) – The image size (h,w)

  • polygons (list(ndarray)) – The polygon list.

Returns

  • thr_map (ndarray) – The generated threshold map.

  • thr_mask (ndarray) – The effective mask of threshold map.

ignore_texts(results, ignore_tags)[source]

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters
  • results (dict) – Result for one image.

  • ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]

Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.

Parameters

poly (ndarray) – The polygon boundary point sequence.

Returns

Whether the polygon is invalid.

Return type

True/False (bool)

class mmocr.datasets.FCENetTargets(fourier_degree=5, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)))[source]

Generate the ground truth targets of FCENet: Fourier Contour Embedding for Arbitrary-Shaped Text Detection.

[https://arxiv.org/abs/2104.10442]

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.

  • center_region_shrink_ratio (float) – The shrink ratio of text center region.

  • level_size_divisors (tuple(int)) – The downsample ratio on each level.

  • level_proportion_range (tuple(tuple(float))) – The range of text sizes assigned to each level.

cal_fourier_signature(polygon, fourier_degree)[source]

Calculate Fourier signature from input polygon.

Parameters
  • polygon (ndarray) – The input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

An array shaped (2k+1, 2) containing the real part and imaginary part of 2k+1 Fourier coefficients.

Return type

fourier_signature (ndarray)

clockwise(c, fourier_degree)[source]

Make sure the polygon reconstructed from Fourier coefficients c is in the clockwise direction.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon in clockwise point order.

Return type

new_polygon (list[float])

generate_center_region_mask(img_size, text_polys)[source]

Generate text center region mask.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text center region mask.

Return type

center_region_mask (ndarray)

generate_fourier_maps(img_size, text_polys)[source]

Generate Fourier coefficient maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

  • fourier_real_map (ndarray) – The Fourier coefficient real part maps.

  • fourier_image_map (ndarray) – The Fourier coefficient imaginary part maps.

generate_level_targets(img_size, text_polys, ignore_polys)[source]

Generate ground truth target on each level.

Parameters
  • img_size (list[int]) – Shape of input image.

  • text_polys (list[list[ndarray]]) – A list of ground truth polygons.

  • ignore_polys (list[list[ndarray]]) – A list of ignored polygons.

Returns

A list of ground truth targets on each level.

Return type

level_maps (list(ndarray))

generate_targets(results)[source]

Generate the ground truth targets for FCENet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

normalize_polygon(polygon)[source]

Normalize one polygon so that its start point is the rightmost point.

Parameters

polygon (list[float]) – The origin polygon.

Returns

The polygon with its start point at the right.

Return type

new_polygon (list[float])

poly2fourier(polygon, fourier_degree)[source]

Perform Fourier transformation to generate Fourier coefficients c_k from polygon.

Parameters
  • polygon (ndarray) – An input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

Fourier coefficients.

Return type

c (ndarray(complex))
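
The idea of turning a polygon into Fourier coefficients can be sketched with the standard library alone. Each boundary point (x, y) is treated as the complex number x + iy, and the coefficients c_k for k = -K..K are taken from the discrete Fourier transform of that sequence. The function name, the 1/N normalization, and the -K..K ordering are assumptions made for this illustration; they are not confirmed details of MMOCR's implementation.

```python
import cmath
import math


def poly2fourier_sketch(points, fourier_degree):
    """Return the coefficients c_k, k = -K..K, of a closed polygon.

    points: list of (x, y) boundary samples, read as z = x + iy.
    fourier_degree: the maximum Fourier degree K.
    """
    n = len(points)
    z = [complex(x, y) for x, y in points]
    coeffs = []
    for k in range(-fourier_degree, fourier_degree + 1):
        # Discrete Fourier transform term for frequency k.
        c_k = sum(
            z[m] * cmath.exp(-2j * math.pi * k * m / n)
            for m in range(n)) / n
        coeffs.append(c_k)
    return coeffs


def fourier_signature_sketch(points, fourier_degree):
    """Stack real and imaginary parts into 2K+1 rows of two values."""
    return [[c.real, c.imag]
            for c in poly2fourier_sketch(points, fourier_degree)]
```

For points sampled counter-clockwise on a circle of radius 3 centred at (1, 2), c_0 recovers the centre 1 + 2i and c_1 recovers the radius 3, while the other coefficients vanish up to rounding error.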

resample_polygon(polygon, n=400)[source]

Resample one polygon with n points on its boundary.

Parameters
  • polygon (list[float]) – The input polygon.

  • n (int) – The number of resampled points.

Returns

The resampled polygon.

Return type

resampled_polygon (list[float])
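
One plausible way to resample a boundary is to place the n points at equal arc-length spacing, starting from the first vertex. The sketch below does this in plain Python. The function name and the list-of-(x, y)-tuples input are assumptions for illustration, and MMOCR's resample_polygon may distribute points differently.

```python
import math


def resample_polygon_sketch(points, n=400):
    """Place n points at equal arc-length spacing on a closed polygon."""
    closed = points + points[:1]
    seg_lengths = [
        math.hypot(x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(closed, closed[1:])]
    step = sum(seg_lengths) / n
    resampled = []
    seg, seg_start = 0, 0.0  # current edge index and its start arc length
    for i in range(n):
        target = i * step
        # Advance to the edge that contains the target arc length.
        while (seg < len(seg_lengths) - 1
               and seg_start + seg_lengths[seg] <= target):
            seg_start += seg_lengths[seg]
            seg += 1
        (x0, y0), (x1, y1) = closed[seg], closed[seg + 1]
        length = seg_lengths[seg]
        t = (target - seg_start) / length if length else 0.0
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled
```

Resampling a 4 x 4 square with n=8 returns its four corners plus the four edge midpoints, in boundary order.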

class mmocr.datasets.HardDiskLoader(ann_file, parser, repeat=1)[source]

Load txt format annotation file from hard disks.

class mmocr.datasets.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1, ann_file_backend='disk')[source]

Dataset for text detection when ann_file is in COCO format.

Parameters

ann_file_backend (str) – Storage backend for annotation file, should be one of [‘disk’, ‘petrel’, ‘http’]. Defaults to ‘disk’.

evaluate(results, metric='hmean-iou', logger=None, score_thr=None, min_score_thr=0.3, max_score_thr=0.9, step=0.1, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • score_thr (float) – Deprecated. Please use min_score_thr instead.

  • min_score_thr (float) – Minimum score threshold of prediction map.

  • max_score_thr (float) – Maximum score threshold of prediction map.

  • step (float) – The spacing between score thresholds.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str, float]]
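
To illustrate how min_score_thr, max_score_thr and step relate to the hmean metric, here is a small sketch. The hmean is the harmonic mean of precision and recall. The threshold grid below includes both endpoints, which is an assumption for the example and not a confirmed detail of how MMOCR sweeps thresholds; both function names are hypothetical.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def score_thresholds(min_score_thr=0.3, max_score_thr=0.9, step=0.1):
    """Evenly spaced score thresholds from min to max, both inclusive."""
    count = int(round((max_score_thr - min_score_thr) / step))
    # Rounding removes floating-point drift such as 0.6000000000000001.
    return [round(min_score_thr + i * step, 10) for i in range(count + 1)]
```

With the default arguments, the grid holds seven thresholds from 0.3 to 0.9; an evaluation loop would typically compute precision and recall at each one and report the best hmean.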

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from COCO api.

Return type

list[dict]

class mmocr.datasets.KIEDataset(ann_file=None, loader=None, dict_file=None, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]
Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

  • dict_file (str) – Character dict file path.

  • norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]

Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

list_to_numpy(ann_infos)[source]

Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]

Pad text index to same length.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.LineJsonParser(keys=[])[source]

Parse json-string of one line in annotation file to dict format.

Parameters

keys (list[str]) – Keys in both json-string and result dict.
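
The behaviour described above, reading one JSON line and keeping only the requested keys, can be sketched as follows. The helper name parse_line_json is hypothetical, and the choice to raise KeyError on a missing key is an assumption; this is not MMOCR's LineJsonParser.

```python
import json


def parse_line_json(line, keys):
    """Decode one JSON line and keep only the requested keys."""
    obj = json.loads(line)
    missing = [key for key in keys if key not in obj]
    if missing:
        raise KeyError(f'keys missing from annotation line: {missing}')
    return {key: obj[key] for key in keys}
```

For example, `parse_line_json('{"filename": "a.jpg", "text": "hi", "extra": 1}', ['filename', 'text'])` drops the `extra` field and returns the other two.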

class mmocr.datasets.LineStrParser(keys=['filename', 'text'], keys_idx=[0, 1], separator=' ', **kwargs)[source]

Parse string of one line in annotation file to dict format.

Parameters
  • keys (list[str]) – Keys in result dict.

  • keys_idx (list[int]) – Value index in sub-string list for each key above.

  • separator (str) – Separator to separate string to list of sub-string.
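
How keys, keys_idx and separator work together can be shown with a short sketch: the line is split on the separator, and each key takes the sub-string at its index. The helper name parse_line_str and its error handling are assumptions for illustration, not MMOCR's LineStrParser.

```python
def parse_line_str(line, keys=('filename', 'text'), keys_idx=(0, 1),
                   separator=' '):
    """Split a line and map each key to the sub-string at its index."""
    parts = line.strip().split(separator)
    if len(parts) <= max(keys_idx):
        raise ValueError(
            f'expected at least {max(keys_idx) + 1} fields, '
            f'got {len(parts)}')
    return {key: parts[idx] for key, idx in zip(keys, keys_idx)}
```

With the defaults, `parse_line_str('sample.jpg hello')` returns `{'filename': 'sample.jpg', 'text': 'hello'}`; passing `keys_idx=(0, 2)` and `separator=','` would instead pick the first and third comma-separated fields.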

class mmocr.datasets.LmdbLoader(ann_file, parser, repeat=1)[source]

Load lmdb format annotation file from hard disks.

class mmocr.datasets.NerDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for named entity recognition tasks.

Parameters
  • ann_file (str) – Annotation file path (txt format).

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • pipeline (list[dict]) – Processing pipeline.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

A dict containing the following keys: 'acc', 'recall', 'f1-score'.

Return type

info (dict)

prepare_train_img(index)[source]

Get training data and annotations after pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='acc', logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

pre_pipeline(results)[source]

Prepare results dict for pipeline.

class mmocr.datasets.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.OpensetKIEDataset(ann_file, loader, dict_file, img_prefix='', pipeline=None, norm=10.0, link_type='one-to-one', edge_thr=0.5, test_mode=True, key_node_idx=1, value_node_idx=2, node_classes=4)[source]

Openset KIE classifies the nodes (i.e. text boxes) into bg/key/value categories, and additionally learns key-value relationships among nodes.

Parameters
  • ann_file (str) – Annotation file path.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • dict_file (str) – Character dict file path.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • pipeline (list[dict]) – Processing pipeline.

  • norm (float) – Norm to map value from one range to another.

  • link_type (str) – one-to-one | one-to-many | many-to-one | many-to-many. For many-to-many, one key box can have many values and vice versa.

  • edge_thr (float) – Score threshold for a valid edge.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

  • key_node_idx (int) – Index of key in node classes.

  • value_node_idx (int) – Index of value in node classes.

  • node_classes (int) – Number of node classes.

compute_openset_f1(preds, gts)[source]

Compute openset macro-f1 and micro-f1 score.

Parameters
  • preds (list[dict]) – List of prediction results, including keys: filename, pairs, etc.

  • gts (list[dict]) – List of ground-truth infos, including keys: filename, pairs, etc.

Returns

Evaluation result with keys: node_openset_micro_f1, node_openset_macro_f1, edge_openset_f1.

Return type

dict

decode_gt(filename)[source]

Decode ground truth.

Assemble boxes and labels into bboxes.

decode_pred(result)[source]

Decode prediction.

Assemble boxes and predicted labels into bboxes, and convert edges into matrix.

evaluate(results, metric='openset_f1', metric_options=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

list_to_numpy(ann_infos)[source]

Convert bboxes, relations, texts and labels to ndarray.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

class mmocr.datasets.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='hmean-iou', score_thr=None, min_score_thr=0.3, max_score_thr=0.9, step=0.1, rank_list=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • score_thr (float) – Deprecated. Please use min_score_thr instead.

  • min_score_thr (float) – Minimum score threshold of prediction map.

  • max_score_thr (float) – Maximum score threshold of prediction map.

  • step (float) – The spacing between score thresholds.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Return type

dict[str, float]

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.UniformConcatDataset(datasets, separate_eval=True, show_mean_scores='auto', pipeline=None, force_apply=False, **kwargs)[source]

A wrapper of ConcatDataset which supports dataset pipeline assignment and replacement.

Parameters
  • datasets (list[dict] | list[list[dict]]) – A list of datasets cfgs.

  • separate_eval (bool) – Whether to evaluate the results separately if it is used as validation dataset. Defaults to True.

  • show_mean_scores (str | bool) – Whether to compute the mean evaluation results, only applicable when separate_eval=True. Options are [True, False, auto]. If True, mean results will be added to the result dictionary with keys in the form of mean_{metric_name}. If ‘auto’, mean results will be shown only when more than 1 dataset is wrapped.

  • pipeline (None | list[dict] | list[list[dict]]) – If None, each dataset in datasets use its own pipeline; If list[dict], it will be assigned to the dataset whose pipeline is None in datasets; If list[list[dict]], pipeline of dataset which is None in datasets will be replaced by the corresponding pipeline in the list.

  • force_apply (bool) – If True, apply pipeline above to each dataset even if it has its own pipeline. Default: False.
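
The interaction between pipeline and force_apply described above can be sketched on plain config dicts. This simplified version handles a flat list of dataset configs, with pipeline given either as one shared list[dict] or as one pipeline per dataset; the function name assign_pipelines and that simplification are assumptions, so it should not be read as UniformConcatDataset's actual logic.

```python
import copy


def assign_pipelines(datasets, pipeline=None, force_apply=False):
    """Return dataset cfgs with missing pipelines filled in.

    datasets: list of cfg dicts, each with an optional 'pipeline' key.
    pipeline: None, a shared list[dict], or one list[dict] per dataset.
    """
    datasets = copy.deepcopy(datasets)  # leave the caller's cfgs intact
    if pipeline is None:
        return datasets  # every dataset keeps its own pipeline
    per_dataset = bool(pipeline) and isinstance(pipeline[0], list)
    for i, cfg in enumerate(datasets):
        if cfg.get('pipeline') is None or force_apply:
            chosen = pipeline[i] if per_dataset else pipeline
            cfg['pipeline'] = copy.deepcopy(chosen)
    return datasets
```

With force_apply=False only the datasets whose pipeline is None receive the shared pipeline; with force_apply=True every dataset's pipeline is overwritten.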

evaluate(results, logger=None, **kwargs)[source]

Evaluate the results.

Parameters
  • results (list[list | tuple]) – Testing results of the dataset.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Returns

Results of each separate dataset if self.separate_eval=True.

Return type

dict[str, float]

mmocr.datasets.build_dataloader(dataset, samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True, shuffle=True, seed=None, runner_type='EpochBasedRunner', persistent_workers=False, class_aware_sampler=None, **kwargs)[source]

Build PyTorch DataLoader.

In distributed training, each GPU/process has a dataloader. In non-distributed training, there is only one dataloader for all GPUs.

Parameters
  • dataset (Dataset) – A PyTorch dataset.

  • samples_per_gpu (int) – Number of training samples on each GPU, i.e., batch size of each GPU.

  • workers_per_gpu (int) – How many subprocesses to use for data loading for each GPU.

  • num_gpus (int) – Number of GPUs. Only used in non-distributed training.

  • dist (bool) – Distributed training/test or not. Default: True.

  • shuffle (bool) – Whether to shuffle the data at every epoch. Default: True.

  • seed (int, Optional) – Seed to be used. Default: None.

  • runner_type (str) – Type of runner. Default: EpochBasedRunner

  • persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive. This argument is only valid when PyTorch>=1.7.0. Default: False.

  • class_aware_sampler (dict) – Whether to use ClassAwareSampler during training. Default: None.

  • kwargs – any keyword argument to be used to initialize DataLoader

Returns

A PyTorch dataloader.

Return type

DataLoader
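
Since num_gpus is only used in non-distributed training, where a single dataloader serves all GPUs, the effective loader sizes can be sketched as below. The helper loader_sizes is hypothetical, and the scaling rule is an assumption based on the description above rather than a quotation of the function's source.

```python
def loader_sizes(samples_per_gpu, workers_per_gpu, num_gpus=1, dist=True):
    """Return (batch_size, num_workers) for one DataLoader."""
    if dist:
        # One dataloader per GPU/process: sizes are per GPU.
        return samples_per_gpu, workers_per_gpu
    # One dataloader shared by all GPUs: scale by the GPU count.
    return num_gpus * samples_per_gpu, num_gpus * workers_per_gpu
```

Under this assumption, 8 samples and 2 workers per GPU on 4 GPUs give a single loader with batch size 32 and 8 workers in non-distributed mode, and four loaders with batch size 8 and 2 workers each in distributed mode.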

datasets

class mmocr.datasets.base_dataset.BaseDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

Custom dataset for text detection, text recognition, and their downstream tasks.

  1. The text detection annotation format is as follows. The annotations field is optional for testing. (This is one line of anno_file, with the line-json-str converted to dict for visualizing only.)

    {
        "file_name": "sample.jpg",
        "height": 1080,
        "width": 960,
        "annotations": [
            {
                "iscrowd": 0,
                "category_id": 1,
                "bbox": [357.0, 667.0, 804.0, 100.0],
                "segmentation": [[361, 667, 710, 670, 72, 767, 357, 763]]
            }
        ]
    }

  2. The two text recognition annotation formats are as follows. The x1,y1,x2,y2,x3,y3,x4,y4 field is used for online crop augmentation during training.

    format1: sample.jpg hello
    format2: sample.jpg 20 20 100 20 100 40 20 40 hello

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If set True, try…except will be turned off in __getitem__.

evaluate(results, metric=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

format_results(results, **kwargs)[source]

Placeholder to format result to dataset-specific output.

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_test_img(img_info)[source]

Get testing data from pipeline.

Parameters

idx (int) – Index of data.

Returns

Testing data after pipeline with new keys introduced by pipeline.

Return type

dict

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.icdar_dataset.IcdarDataset(ann_file, pipeline, classes=None, data_root=None, img_prefix='', seg_prefix=None, proposal_file=None, test_mode=False, filter_empty_gt=True, select_first_k=-1, ann_file_backend='disk')[source]

Dataset for text detection when ann_file is in COCO format.

Parameters

ann_file_backend (str) – Storage backend for annotation file, should be one of [‘disk’, ‘petrel’, ‘http’]. Defaults to ‘disk’.

evaluate(results, metric='hmean-iou', logger=None, score_thr=None, min_score_thr=0.3, max_score_thr=0.9, step=0.1, rank_list=None, **kwargs)[source]

Evaluate the hmean metric.

Parameters
  • results (list[dict]) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • score_thr (float) – Deprecated. Please use min_score_thr instead.

  • min_score_thr (float) – Minimum score threshold of prediction map.

  • max_score_thr (float) – Maximum score threshold of prediction map.

  • step (float) – The spacing between score thresholds.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Returns

The evaluation results.

Return type

dict[dict[str, float]]

load_annotations(ann_file)[source]

Load annotation from COCO style annotation file.

Parameters

ann_file (str) – Path of annotation file.

Returns

Annotation info from COCO api.

Return type

list[dict]

class mmocr.datasets.ocr_dataset.OCRDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]
evaluate(results, metric='acc', logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

pre_pipeline(results)[source]

Prepare results dict for pipeline.

class mmocr.datasets.ocr_seg_dataset.OCRSegDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.text_det_dataset.TextDetDataset(ann_file, loader, pipeline, img_prefix='', test_mode=False)[source]

evaluate(results, metric='hmean-iou', score_thr=None, min_score_thr=0.3, max_score_thr=0.9, step=0.1, rank_list=None, logger=None, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • score_thr (float) – Deprecated. Please use min_score_thr instead.

  • min_score_thr (float) – Minimum score threshold of prediction map.

  • max_score_thr (float) – Maximum score threshold of prediction map.

  • step (float) – The spacing between score thresholds.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

  • rank_list (str) – json file used to save eval result of each image after ranking.

Return type

dict[str, float]

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

class mmocr.datasets.kie_dataset.KIEDataset(ann_file=None, loader=None, dict_file=None, img_prefix='', pipeline=None, norm=10.0, directed=False, test_mode=True, **kwargs)[source]

Parameters
  • ann_file (str) – Annotation file path.

  • pipeline (list[dict]) – Processing pipeline.

  • loader (dict) – Dictionary to construct loader to load annotation infos.

  • img_prefix (str, optional) – Image prefix to generate full image path.

  • test_mode (bool, optional) – If True, try…except will be turned off in __getitem__.

  • dict_file (str) – Character dict file path.

  • norm (float) – Norm to map value from one range to another.

compute_relation(boxes)[source]

Compute relation between every two boxes.

evaluate(results, metric='macro_f1', metric_options={'macro_f1': {'ignores': []}}, **kwargs)[source]

Evaluate the dataset.

Parameters
  • results (list) – Testing results of the dataset.

  • metric (str | list[str]) – Metrics to be evaluated.

  • logger (logging.Logger | str | None) – Logger used for printing related information during evaluation. Default: None.

Return type

dict[str, float]

list_to_numpy(ann_infos)[source]

Convert bboxes, relations, texts and labels to ndarray.

pad_text_indices(text_inds)[source]

Pad text index to same length.
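A rough sketch of what padding index lists to a common length can look like. The padding value of -1, right-side padding, and the helper name pad_to_same_length are all assumptions made for illustration; this is not the method's actual implementation.

```python
def pad_to_same_length(text_inds, pad_value=-1):
    """Right-pad each index list so that all lists share the longest length.

    Hypothetical helper; pad_value=-1 is an assumed convention.
    """
    max_len = max((len(inds) for inds in text_inds), default=0)
    return [list(inds) + [pad_value] * (max_len - len(inds))
            for inds in text_inds]


padded = pad_to_same_length([[3, 7], [5], [1, 2, 4]])
print(padded)
```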

pre_pipeline(results)[source]

Prepare results dict for pipeline.

prepare_train_img(index)[source]

Get training data and annotations from pipeline.

Parameters

index (int) – Index of data.

Returns

Training data and annotation after pipeline with new keys introduced by pipeline.

Return type

dict

pipelines

class mmocr.datasets.pipelines.ColorJitter(**kwargs)[source]

An interface for torch color jitter so that it can be invoked in the mmdetection pipeline.

class mmocr.datasets.pipelines.CustomFormatBundle(keys=[], call_super=True, visualize={'boundary_key': None, 'flag': False})[source]

Custom formatting bundle.

It formats common fields such as ‘img’ and ‘proposals’ as done in DefaultFormatBundle, while other fields such as ‘gt_kernels’ and ‘gt_effective_region_mask’ are formatted to DataContainer (DC) as follows:

  • gt_kernels: to DataContainer (cpu_only=True)

  • gt_effective_mask: to DataContainer (cpu_only=True)

Parameters
  • keys (list[str]) – Fields to be formatted to DC only.

  • call_super (bool) – If True, format common fields by DefaultFormatBundle, else format fields in keys above only.

  • visualize (dict) – If flag=True, visualize gt mask for debugging.

class mmocr.datasets.pipelines.DBNetTargets(shrink_ratio=0.4, thr_min=0.3, thr_max=0.7, min_short_size=8)[source]

Generate gt shrunk text, gt threshold map, and their effective region masks to train DBNet: Real-time Scene Text Detection with Differentiable Binarization [https://arxiv.org/abs/1911.08947]. This was partially adapted from https://github.com/MhLiao/DB.

Parameters
  • shrink_ratio (float) – The area shrunk ratio between text kernels and their text masks.

  • thr_min (float) – The minimum value of the threshold map.

  • thr_max (float) – The maximum value of the threshold map.

  • min_short_size (int) – The minimum size of polygon below which the polygon is invalid.

draw_border_map(polygon, canvas, mask)[source]

Generate threshold map for one polygon.

Parameters
  • polygon (ndarray) – The polygon boundary ndarray.

  • canvas (ndarray) – The generated threshold map.

  • mask (ndarray) – The generated threshold mask.

find_invalid(results)[source]

Find invalid polygons.

Parameters

results (dict) – The dict containing gt_mask.

Returns

The indicators for ignoring polygons.

Return type

ignore_tags (list[bool])

generate_targets(results)[source]

Generate the gt targets for DBNet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_thr_map(img_size, polygons)[source]

Generate threshold map.

Parameters
  • img_size (tuple(int)) – The image size (h,w)

  • polygons (list(ndarray)) – The polygon list.

Returns

  • thr_map (ndarray): The generated threshold map.
  • thr_mask (ndarray): The effective mask of threshold map.

Return type

tuple

ignore_texts(results, ignore_tags)[source]

Ignore gt masks and gt_labels while padding gt_masks_ignore in results given ignore_tags.

Parameters
  • results (dict) – Result for one image.

  • ignore_tags (list[int]) – Indicate whether to ignore its corresponding ground truth text.

Returns

Results after filtering.

Return type

results (dict)

invalid_polygon(poly)[source]

Judge whether the input polygon is invalid. It is invalid if its area is smaller than 1 or the shorter side of its minimum bounding box is smaller than min_short_size.

Parameters

poly (ndarray) – The polygon boundary point sequence.

Returns

Whether the polygon is invalid.

Return type

bool
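The validity rule above can be sketched in plain Python. Note the simplification: this sketch measures the shorter side of the axis-aligned bounding box, whereas the description refers to the minimum bounding box, which may be rotated. The function names are hypothetical and the code is only an illustration of the rule.

```python
def polygon_area(points):
    """Absolute polygon area via the shoelace formula; points is [(x, y), ...]."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def is_invalid_polygon(points, min_short_size=8):
    """Return True if the polygon is too small to be a useful target.

    Simplification (assumption): uses the axis-aligned bounding box rather
    than the minimum (possibly rotated) bounding box.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    short_side = min(max(xs) - min(xs), max(ys) - min(ys))
    return polygon_area(points) < 1 or short_side < min_short_size


print(is_invalid_polygon([(0, 0), (20, 0), (20, 10), (0, 10)]))
print(is_invalid_polygon([(0, 0), (20, 0), (20, 5), (0, 5)]))
```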

class mmocr.datasets.pipelines.FCENetTargets(fourier_degree=5, resample_step=4.0, center_region_shrink_ratio=0.3, level_size_divisors=(8, 16, 32), level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0)))[source]

Generate the ground truth targets of FCENet: Fourier Contour Embedding for Arbitrary-Shaped Text Detection [https://arxiv.org/abs/2104.10442].

Parameters
  • fourier_degree (int) – The maximum Fourier transform degree k.

  • resample_step (float) – The step size for resampling the text center line (TCL). It’s better not to exceed half of the minimum width.

  • center_region_shrink_ratio (float) – The shrink ratio of text center region.

  • level_size_divisors (tuple(int)) – The downsample ratio on each level.

  • level_proportion_range (tuple(tuple(int))) – The range of text sizes assigned to each level.

cal_fourier_signature(polygon, fourier_degree)[source]

Calculate Fourier signature from input polygon.

Parameters
  • polygon (ndarray) – The input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

An array shaped (2k+1, 2) containing the real part and imaginary part of the 2k+1 Fourier coefficients.

Return type

fourier_signature (ndarray)

clockwise(c, fourier_degree)[source]

Make sure the polygon reconstructed from Fourier coefficients c is in the clockwise direction.

Parameters

polygon (list[float]) – The original polygon.

Returns

The polygon in clockwise point order.

Return type

new_polygon (list[float])

generate_center_region_mask(img_size, text_polys)[source]

Generate text center region mask.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text center region mask.

Return type

center_region_mask (ndarray)

generate_fourier_maps(img_size, text_polys)[source]

Generate Fourier coefficient maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

  • fourier_real_map (ndarray): The Fourier coefficient real part maps.
  • fourier_image_map (ndarray): The Fourier coefficient imaginary part maps.

Return type

tuple

generate_level_targets(img_size, text_polys, ignore_polys)[source]

Generate ground truth target on each level.

Parameters
  • img_size (list[int]) – Shape of input image.

  • text_polys (list[list[ndarray]]) – A list of ground truth polygons.

  • ignore_polys (list[list[ndarray]]) – A list of ignored polygons.

Returns

A list of ground truth targets on each level.

Return type

level_maps (list(ndarray))
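To illustrate how level_proportion_range can route text instances to feature levels, here is a hedged sketch. It assumes the proportion is the longer side of a polygon's bounding box divided by the longer side of the image, and that a polygon is assigned to every level whose open interval contains that proportion; both are assumptions, and assign_levels is a hypothetical helper.

```python
def assign_levels(poly_width, poly_height, img_height, img_width,
                  level_proportion_range=((0, 0.4), (0.3, 0.7), (0.6, 1.0))):
    """Return the indexes of the levels a polygon of the given size falls into.

    Assumption: proportion = longer box side / longer image side, and the
    range bounds are exclusive. Image sizes are assumed to be positive.
    """
    proportion = max(poly_width, poly_height) / max(img_height, img_width)
    return [idx for idx, (low, high) in enumerate(level_proportion_range)
            if low < proportion < high]


# The default ranges overlap, so a mid-sized text can land on two levels.
print(assign_levels(280, 40, img_height=600, img_width=800))
```

Because the default ranges overlap (0.3 to 0.4 and 0.6 to 0.7), text near a boundary contributes targets to two adjacent levels under this reading.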

generate_targets(results)[source]

Generate the ground truth targets for FCENet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

normalize_polygon(polygon)[source]

Normalize one polygon so that its start point is the rightmost point.

Parameters

polygon (list[float]) – The original polygon.

Returns

The polygon with start point at right.

Return type

new_polygon (list[float])

poly2fourier(polygon, fourier_degree)[source]

Perform Fourier transformation to generate Fourier coefficients ck from polygon.

Parameters
  • polygon (ndarray) – An input polygon.

  • fourier_degree (int) – The maximum Fourier degree K.

Returns

Fourier coefficients.

Return type

c (ndarray(complex))
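As a sketch of the idea, a polygon with N boundary points can be treated as complex numbers z_n = x_n + i*y_n, with coefficients c_k = (1/N) * sum_n z_n * exp(-2*pi*i*k*n/N) for k from -K to K. The plain-Python version below is for illustration only; the function name and the ordering of the output (from -K up to K) are assumptions, and a real implementation would more likely use an FFT.

```python
import cmath
import math


def poly_to_fourier(points, fourier_degree):
    """Compute 2K+1 Fourier coefficients c_{-K}..c_{K} of a closed polygon.

    points is a list of (x, y) boundary samples. Hypothetical helper using a
    direct discrete Fourier transform.
    """
    n = len(points)
    z = [complex(x, y) for x, y in points]
    coeffs = []
    for k in range(-fourier_degree, fourier_degree + 1):
        c_k = sum(z[m] * cmath.exp(-2j * math.pi * k * m / n)
                  for m in range(n)) / n
        coeffs.append(c_k)
    return coeffs


# A unit circle sampled at 16 points has a single non-zero coefficient, c_1.
circle = [(math.cos(2 * math.pi * m / 16), math.sin(2 * math.pi * m / 16))
          for m in range(16)]
coeffs = poly_to_fourier(circle, fourier_degree=2)
print([round(abs(c), 6) for c in coeffs])
```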

resample_polygon(polygon, n=400)[source]

Resample one polygon with n points on its boundary.

Parameters
  • polygon (list[float]) – The input polygon.

  • n (int) – The number of resampled points.

Returns

The resampled polygon.

Return type

resampled_polygon (list[float])

class mmocr.datasets.pipelines.FancyPCA(eig_vec=None, eig_val=None)[source]

Implementation of PCA based image augmentation, proposed in the paper ImageNet Classification with Deep Convolutional Neural Networks.

It alters the intensities of RGB values along the principal components of the ImageNet dataset.

class mmocr.datasets.pipelines.ImgAug(args=None, clip_invalid_ploys=True)[source]

A wrapper to use imgaug https://github.com/aleju/imgaug.

Parameters
  • args ([list[list|dict]]) – The augmentation list. For details, please refer to the imgaug documentation. Take args=[['Fliplr', 0.5], dict(cls='Affine', rotate=[-10, 10]), ['Resize', [0.5, 3.0]]] as an example. These args horizontally flip images with probability 0.5, then apply a random rotation with an angle in the range [-10, 10], and finally resize with an independent scale in the range [0.5, 3.0] for each side of the image.

  • clip_invalid_polys (bool) – Whether to clip invalid polygons after transformation. False keeps the behavior used in DBNet.

class mmocr.datasets.pipelines.KIEFormatBundle(img_to_float=True, pad_val={'img': 0, 'masks': 0, 'seg': 255})[source]

Key information extraction formatting bundle.

Based on the DefaultFormatBundle, it simplifies the pipeline of formatting common fields, including “img”, “proposals”, “gt_bboxes”, “gt_labels”, “gt_masks”, “gt_semantic_seg”, “relations” and “texts”. These fields are formatted as follows.

  • img: (1) transpose, (2) to tensor, (3) to DataContainer (stack=True)

  • proposals: (1) to tensor, (2) to DataContainer

  • gt_bboxes: (1) to tensor, (2) to DataContainer

  • gt_bboxes_ignore: (1) to tensor, (2) to DataContainer

  • gt_labels: (1) to tensor, (2) to DataContainer

  • gt_masks: (1) to tensor, (2) to DataContainer (cpu_only=True)

  • gt_semantic_seg: (1) unsqueeze dim-0, (2) to tensor, (3) to DataContainer (stack=True)

  • relations: (1) scale, (2) to tensor, (3) to DataContainer

  • texts: (1) to tensor, (2) to DataContainer

class mmocr.datasets.pipelines.LoadImageFromLMDB(color_type='color')[source]

Load an image from lmdb file.

Similar to LoadImageFromFile, but the image is read from results['img_info']['filename'], which is a data index of the lmdb file.

class mmocr.datasets.pipelines.LoadImageFromNdarray(to_float32=False, color_type='color', channel_order='bgr', file_client_args={'backend': 'disk'})[source]

Load an image from np.ndarray.

Similar to LoadImageFromFile, but the image is read from results['img'], which is an np.ndarray.

class mmocr.datasets.pipelines.LoadTextAnnotations(with_bbox=True, with_label=True, with_mask=False, with_seg=False, poly2mask=True, use_img_shape=False)[source]

Load annotations for text detection.

Parameters
  • with_bbox (bool) – Whether to parse and load the bbox annotation. Default: True.

  • with_label (bool) – Whether to parse and load the label annotation. Default: True.

  • with_mask (bool) – Whether to parse and load the mask annotation. Default: False.

  • with_seg (bool) – Whether to parse and load the semantic segmentation annotation. Default: False.

  • poly2mask (bool) – Whether to convert the instance masks from polygons to bitmaps. Default: True.

  • use_img_shape (bool) – Use the shape of loaded image from previous pipeline LoadImageFromFile to generate mask.

process_polygons(polygons)[source]

Convert polygons to list of ndarray and filter invalid polygons.

Parameters

polygons (list[list]) – Polygons of one instance.

Returns

Processed polygons.

Return type

list[numpy.ndarray]

class mmocr.datasets.pipelines.MultiRotateAugOCR(transforms, rotate_degrees=None, force_rotate=False)[source]

Test-time augmentation with multiple rotations in the case that img_height > img_width.

An example configuration is as follows:

rotate_degrees=[0, 90, 270],
transforms=[
    dict(
        type='ResizeOCR',
        height=32,
        min_width=32,
        max_width=160,
        keep_aspect_ratio=True),
    dict(type='ToTensorOCR'),
    dict(type='NormalizeOCR', **img_norm_cfg),
    dict(
        type='Collect',
        keys=['img'],
        meta_keys=[
            'filename', 'ori_shape', 'img_shape', 'valid_ratio'
        ]),
]

After MultiRotateAugOCR with above configuration, the results are wrapped into lists of the same length as follows:

dict(
    img=[...],
    img_shape=[...],
    ...
)

Parameters
  • transforms (list[dict]) – Transformation applied for each augmentation.

  • rotate_degrees (list[int] | None) – Degrees of anti-clockwise rotation.

  • force_rotate (bool) – If True, rotate image by ‘rotate_degrees’ while ignoring image aspect ratio.

class mmocr.datasets.pipelines.NerTransform(label_convertor, max_len)[source]

Convert text to ID and entity in ground truth to label ID. The masks and tokens are generated at the same time. The four parameters will be used as input to the model.

Parameters
  • label_convertor – Convert text to ID and entity in ground truth to label ID.

  • max_len (int) – Limited maximum input length.

class mmocr.datasets.pipelines.NormalizeOCR(mean, std)[source]

Normalize a tensor image with mean and standard deviation.

class mmocr.datasets.pipelines.OCRSegTargets(label_convertor=None, attn_shrink_ratio=0.5, seg_shrink_ratio=0.25, box_type='char_rects', pad_val=255)[source]

Generate gt shrunk kernels for segmentation based OCR framework.

Parameters
  • label_convertor (dict) – Dictionary to construct label_convertor to convert char to index.

  • attn_shrink_ratio (float) – The area shrunk ratio between attention kernels and gt text masks.

  • seg_shrink_ratio (float) – The area shrunk ratio between segmentation kernels and gt text masks.

  • box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

generate_kernels(resize_shape, pad_shape, char_boxes, char_inds, shrink_ratio=0.5, binary=True)[source]

Generate char instance kernels for one shrink ratio.

Parameters
  • resize_shape (tuple(int, int)) – Image size (height, width) after resizing.

  • pad_shape (tuple(int, int)) – Image size (height, width) after padding.

  • char_boxes (list[list[float]]) – The list of char polygons.

  • char_inds (list[int]) – List of char indexes.

  • shrink_ratio (float) – The shrink ratio of kernel.

  • binary (bool) – If True, return binary ndarray containing 0 & 1 only.

Returns

The text kernel mask of (height, width).

Return type

char_kernel (ndarray)

shrink_char_quad(char_quad, shrink_ratio)[source]

Shrink char box in style of quadrangle.

Parameters
  • char_quad (list[float]) – Char box with format [x1, y1, x2, y2, x3, y3, x4, y4].

  • shrink_ratio (float) – The area shrunk ratio between gt kernels and gt text masks.

shrink_char_rect(char_rect, shrink_ratio)[source]

Shrink char box in style of rectangle.

Parameters
  • char_rect (list[float]) – Char box with format [x_min, y_min, x_max, y_max].

  • shrink_ratio (float) – The area shrunk ratio between gt kernels and gt text masks.
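A minimal sketch of shrinking an [x_min, y_min, x_max, y_max] box around its center. This assumes the ratio scales each side length, so the area shrinks by the square of the ratio; whether the library applies the ratio per side or per area is not stated here, so treat this purely as an illustration with a hypothetical function name.

```python
def shrink_rect(char_rect, shrink_ratio):
    """Shrink a rectangle about its center, scaling each side by shrink_ratio.

    Hypothetical helper; the per-side interpretation is an assumption.
    """
    x_min, y_min, x_max, y_max = char_rect
    dx = (x_max - x_min) * (1 - shrink_ratio) / 2
    dy = (y_max - y_min) * (1 - shrink_ratio) / 2
    return [x_min + dx, y_min + dy, x_max - dx, y_max - dy]


print(shrink_rect([0, 0, 10, 4], 0.5))
```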

class mmocr.datasets.pipelines.OneOfWrapper(transforms)[source]

Randomly select and apply one of the transforms, each with equal probability.

Warning

Different from albumentations, this wrapper only runs the selected transform, but doesn’t guarantee the transform can always be applied to the input if the transform comes with a probability to run.

Parameters

transforms (list[dict|callable]) – Candidate transforms to be applied.

class mmocr.datasets.pipelines.OnlineCropOCR(box_keys=['x1', 'y1', 'x2', 'y2', 'x3', 'y3', 'x4', 'y4'], jitter_prob=0.5, max_jitter_ratio_x=0.05, max_jitter_ratio_y=0.02)[source]

Crop text areas from whole image with bounding box jitter. If no bbox is given, return directly.

Parameters
  • box_keys (list[str]) – Keys in results which correspond to RoI bbox.

  • jitter_prob (float) – The probability of box jitter.

  • max_jitter_ratio_x (float) – Maximum horizontal jitter ratio relative to height.

  • max_jitter_ratio_y (float) – Maximum vertical jitter ratio relative to height.

class mmocr.datasets.pipelines.OpencvToPil(**kwargs)[source]

Convert numpy.ndarray (bgr) to PIL Image (rgb).

class mmocr.datasets.pipelines.PANetTargets(shrink_ratio=(1.0, 0.5), max_shrink=20)[source]

Generate the ground truths for PANet: Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network [https://arxiv.org/abs/1908.05900]. This code is partially adapted from https://github.com/WenmuZhou/PAN.pytorch.

Parameters
  • shrink_ratio (tuple[float]) – The ratios for shrinking text instances.

  • max_shrink (int) – The maximum shrink distance.

generate_targets(results)[source]

Generate the gt targets for PANet.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

class mmocr.datasets.pipelines.PilToOpencv(**kwargs)[source]

Convert PIL Image (rgb) to numpy.ndarray (bgr).

class mmocr.datasets.pipelines.PyramidRescale(factor=4, base_shape=(128, 512), randomize_factor=True)[source]

Resize the image to the base shape, downsample it with a Gaussian pyramid, and rescale it back to the original size.

Adapted from https://github.com/FangShancheng/ABINet.

Parameters
  • factor (int) – The decay factor from base size, or the number of downsampling operations from the base layer.

  • base_shape (tuple(int)) – The shape of the base layer of the pyramid.

  • randomize_factor (bool) – If True, the final factor would be a random integer in [0, factor].

Required Keys
  • img (ndarray): The input image.

Affected Keys

Modified
  • img (ndarray): The modified image.

class mmocr.datasets.pipelines.RandomCropInstances(target_size, instance_key, mask_type='inx0', positive_sample_ratio=0.625)[source]

Randomly crop images and make sure to contain text instances.

Parameters
  • target_size (tuple or int) – (height, width)

  • positive_sample_ratio (float) – The probability of sampling regions that go through positive regions.

class mmocr.datasets.pipelines.RandomCropPolyInstances(instance_key='gt_masks', crop_ratio=0.625, min_side_ratio=0.4)[source]

Randomly crop images and make sure to contain at least one intact instance.

sample_crop_box(img_size, results)[source]

Generate crop box and make sure not to crop the polygon instances.

Parameters
  • img_size (tuple(int)) – The image size (h, w).

  • results (dict) – The results dict.

class mmocr.datasets.pipelines.RandomPaddingOCR(max_ratio=None, box_type=None)[source]

Pad the given image on all sides, and modify the coordinates of the character bounding boxes in the image accordingly.

Parameters
  • max_ratio (list[int]) – [left, top, right, bottom].

  • box_type (None|str) – Character box type. If not none, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.RandomRotateImageBox(min_angle=-10, max_angle=10, box_type='char_quads')[source]

Rotate augmentation for segmentation based text recognition.

Parameters
  • min_angle (int) – Minimum rotation angle for image and box.

  • max_angle (int) – Maximum rotation angle for image and box.

  • box_type (str) – Character box type, should be either ‘char_rects’ or ‘char_quads’, with ‘char_rects’ for rectangle with xyxy style and ‘char_quads’ for quadrangle with x1y1x2y2x3y3x4y4 style.

class mmocr.datasets.pipelines.RandomRotateTextDet(rotate_ratio=1.0, max_angle=10)[source]

Randomly rotate images.

class mmocr.datasets.pipelines.RandomWrapper(transforms, p)[source]

Run a transform or a sequence of transforms with probability p.

Parameters
  • transforms (list[dict|callable]) – Transform(s) to be applied.

  • p (int|float) – Probability of running transform(s).
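The two wrappers can be nested in a pipeline config. The fragment below is a hypothetical example built only from transforms documented on this page; the specific parameter values and the variable name train_augment are illustrative assumptions rather than a recommended setting.

```python
# Hypothetical augmentation entry for a training pipeline.
# With probability p=0.5 the wrapped sequence runs; inside it, exactly one of
# the candidate transforms is picked at random.
train_augment = dict(
    type='RandomWrapper',
    p=0.5,
    transforms=[
        dict(
            type='OneOfWrapper',
            transforms=[
                dict(type='RandomRotateTextDet', max_angle=15),
                dict(type='PyramidRescale', factor=4, randomize_factor=True),
            ]),
    ])

print(train_augment['type'], len(train_augment['transforms'][0]['transforms']))
```

As the OneOfWrapper warning notes, if a selected candidate has its own probability of running, the input may pass through unchanged even when the outer wrapper fires.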

class mmocr.datasets.pipelines.ResizeNoImg(img_scale, keep_ratio=True)[source]

Image resizing without img.

Used for KIE.

class mmocr.datasets.pipelines.ResizeOCR(height, min_width=None, max_width=None, keep_aspect_ratio=True, img_pad_value=0, width_downsample_ratio=0.0625, backend=None)[source]

Image resizing and padding for OCR.

Parameters
  • height (int | tuple(int)) – Image height after resizing.

  • min_width (none | int | tuple(int)) – Image minimum width after resizing.

  • max_width (none | int | tuple(int)) – Image maximum width after resizing.

  • keep_aspect_ratio (bool) – If True, keep the image aspect ratio during resizing; otherwise resize to the size height * max_width.

  • img_pad_value (int) – Scalar to fill padding area.

  • width_downsample_ratio (float) – Downsample ratio in horizontal direction from input image to output feature.

  • backend (str | None) – The image resize backend type. Options are cv2, pillow, None. If backend is None, the global imread_backend specified by mmcv.use_backend() will be used. Default: None.
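To make the interaction between height, min_width, max_width and width_downsample_ratio concrete, here is a rough sketch of one way a target width could be derived when the aspect ratio is kept. The rounding policy (snap to a multiple of 1 / width_downsample_ratio, then clamp) is an assumption, and resized_width is a hypothetical helper, not the class's actual code.

```python
import math


def resized_width(ori_height, ori_width, height=32, min_width=32,
                  max_width=160, width_downsample_ratio=1 / 16):
    """Estimate the resized width for an image scaled to a fixed height.

    Assumed policy: scale by height, snap to a multiple of the downsample
    divisor, then clamp to [min_width, max_width].
    """
    new_width = math.ceil(height / ori_height * ori_width)
    divisor = int(1 / width_downsample_ratio)
    if new_width % divisor != 0:
        new_width = round(new_width / divisor) * divisor
    if min_width is not None:
        new_width = max(min_width, new_width)
    if max_width is not None:
        new_width = min(max_width, new_width)
    return new_width


print(resized_width(64, 200))   # moderately wide image
print(resized_width(32, 1000))  # very wide image, clamped to max_width
print(resized_width(64, 20))    # narrow image, raised to min_width
```

Under this reading, any remaining space up to max_width would be filled with img_pad_value.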

class mmocr.datasets.pipelines.ScaleAspectJitter(img_scale=None, multiscale_mode='range', ratio_range=None, keep_ratio=False, resize_type='around_min_img_scale', aspect_ratio_range=None, long_size_bound=None, short_size_bound=None, scale_range=None)[source]

Resize image and segmentation mask encoded by coordinates.

Allowed resize types are around_min_img_scale, long_short_bound, and indep_sample_in_range.

class mmocr.datasets.pipelines.TextSnakeTargets(orientation_thr=2.0, resample_step=4.0, center_region_shrink_ratio=0.3)[source]

Generate the ground truth targets of TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes [https://arxiv.org/abs/1807.01544]. This was partially adapted from https://github.com/princewang1994/TextSnake.pytorch.

Parameters

orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

cal_curve_length(line)[source]

Calculate the length of each edge on the discrete curve and the sum.

Parameters

line (ndarray) – The points composing a discrete curve.

Returns

Returns (edges_length, total_length).

  • edges_length (ndarray): The length of each edge on the discrete curve.
  • total_length (float): The total length of the discrete curve.

Return type

tuple
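The computation described above amounts to summing Euclidean distances between consecutive points. A small pure-Python sketch (the function name curve_length and the use of lists instead of ndarrays are choices made for this illustration):

```python
import math


def curve_length(line):
    """Return (edges_length, total_length) for a polyline given as [(x, y), ...]."""
    edges_length = [math.dist(p, q) for p, q in zip(line[:-1], line[1:])]
    return edges_length, sum(edges_length)


edges, total = curve_length([(0, 0), (3, 4), (3, 10)])
print(edges, total)
```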

draw_center_region_maps(top_line, bot_line, center_line, center_region_mask, radius_map, sin_map, cos_map, region_shrink_ratio)[source]

Draw attributes on text center region.

Parameters
  • top_line (ndarray) – The points composing top curved sideline of text polygon.

  • bot_line (ndarray) – The points composing bottom curved sideline of text polygon.

  • center_line (ndarray) – The points composing the center line of text instance.

  • center_region_mask (ndarray) – The text center region mask.

  • radius_map (ndarray) – The map where the distance from point to sidelines will be drawn on for each pixel in text center region.

  • sin_map (ndarray) – The map where vector_sin(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).

  • cos_map (ndarray) – The map where vector_cos(theta) will be drawn on text center regions. Theta is the angle between tangent line and vector (1, 0).

  • region_shrink_ratio (float) – The shrink ratio of text center.

find_head_tail(points, orientation_thr)[source]

Find the head edge and tail edge of a text polygon.

Parameters
  • points (ndarray) – The points composing a text polygon.

  • orientation_thr (float) – The threshold for distinguishing between head edge and tail edge among the horizontal and vertical edges of a quadrangle.

Returns

  • head_inds (list): The indexes of two points composing head edge.
  • tail_inds (list): The indexes of two points composing tail edge.

Return type

tuple

generate_center_mask_attrib_maps(img_size, text_polys)[source]

Generate text center region mask and geometric attribute maps.

Parameters
  • img_size (tuple) – The image size of (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

  • center_region_mask (ndarray): The text center region mask.
  • radius_map (ndarray): The distance map from each pixel in text center region to top sideline.
  • sin_map (ndarray): The sin(theta) map where theta is the angle between vector (top point - bottom point) and vector (1, 0).
  • cos_map (ndarray): The cos(theta) map where theta is the angle between vector (top point - bottom point) and vector (1, 0).

Return type

tuple

generate_targets(results)[source]

Generate the gt targets for TextSnake.

Parameters

results (dict) – The input result dictionary.

Returns

The output result dictionary.

Return type

results (dict)

generate_text_region_mask(img_size, text_polys)[source]

Generate text region mask.

Parameters
  • img_size (tuple) – The image size (height, width).

  • text_polys (list[list[ndarray]]) – The list of text polygons.

Returns

The text region mask.

Return type

text_region_mask (ndarray)

reorder_poly_edge(points)[source]

Get the respective points composing head edge, tail edge, top sideline and bottom sideline.

Parameters

points (ndarray) – The points composing a text polygon.

Returns

  • head_edge (ndarray): The two points composing the head edge of text polygon.
  • tail_edge (ndarray): The two points composing the tail edge of text polygon.
  • top_sideline (ndarray): The points composing top curved sideline of text polygon.
  • bot_sideline (ndarray): The points composing bottom curved sideline of text polygon.

Return type

tuple

resample_line(line, n)[source]

Resample n points on a line.

Parameters
  • line (ndarray) – The points composing a line.

  • n (int) – The resampled points number.

Returns

The points composing the resampled line.

Return type

resampled_line (ndarray)
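One plausible way to resample a polyline is to place points at equal arc-length intervals. The sketch below does that in plain Python; it assumes n is the total number of output points including both endpoints, which may differ from the library's convention, and resample_polyline is a hypothetical name.

```python
import math


def resample_polyline(line, n):
    """Return n points spaced evenly by arc length along a polyline.

    line is [(x, y), ...] with at least two points; n must be at least 2.
    """
    if n < 2 or len(line) < 2:
        raise ValueError("need at least two input points and n >= 2")
    seg = [math.dist(p, q) for p, q in zip(line[:-1], line[1:])]
    total = sum(seg)
    if total == 0:
        return [tuple(line[0])] * n
    resampled = []
    idx = 0        # index of the segment currently being walked
    walked = 0.0   # total length of the segments before idx
    for i in range(n):
        target = total * i / (n - 1)
        while idx < len(seg) - 1 and walked + seg[idx] < target:
            walked += seg[idx]
            idx += 1
        t = (target - walked) / seg[idx] if seg[idx] > 0 else 0.0
        t = min(max(t, 0.0), 1.0)
        (x0, y0), (x1, y1) = line[idx], line[idx + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled


print(resample_polyline([(0, 0), (10, 0)], 5))
```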

resample_sidelines(sideline1, sideline2, resample_step)[source]

Resample two sidelines so that they have the same number of points, according to the step size.

Parameters
  • sideline1 (ndarray) – The points composing a sideline of a text polygon.

  • sideline2 (ndarray) – The points composing another sideline of a text polygon.

  • resample_step (float) – The resampled step size.

Returns

  • resampled_line1 (ndarray): The resampled line 1.
  • resampled_line2 (ndarray): The resampled line 2.

Return type

tuple

class mmocr.datasets.pipelines.ToTensorNER[source]

Convert data with list type to tensor.

class mmocr.datasets.pipelines.ToTensorOCR[source]

Convert a PIL Image or numpy.ndarray to tensor.

class mmocr.datasets.pipelines.TorchVisionWrapper(op, **kwargs)[source]

A wrapper of torchvision transforms. It applies a specific transform to img and updates img_shape accordingly.

Warning

This transform only affects the image but not its associated annotations, such as word bounding boxes and polygon masks. Therefore, it may only be applicable to text recognition tasks.

Parameters
  • op (str) – The name of any transform class in torchvision.transforms().

  • **kwargs – Arguments that will be passed to initializer of torchvision transform.

Required Keys
  • img (ndarray): The input image.

Affected Keys

Modified
  • img (ndarray): The modified image.

Added
  • img_shape (tuple(int)): Size of the modified image.

mmocr.datasets.pipelines.sort_vertex(points_x, points_y)[source]

Sort box vertices in clockwise order, starting from the top-left vertex.

Parameters
  • points_x (list[float]) – x of four vertices.

  • points_y (list[float]) – y of four vertices.

Returns

  • sorted_points_x (list[float]): x of sorted four vertices.
  • sorted_points_y (list[float]): y of sorted four vertices.

Return type

tuple
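A hedged sketch of clockwise vertex ordering for a convex quadrilateral in image coordinates (y pointing down). It sorts by angle around the centroid and then rotates the order so that the vertex with the smallest x + y comes first; that definition of "top-left", like the function name, is an assumption made for this illustration.

```python
import math


def sort_vertices_clockwise(points_x, points_y):
    """Order four vertices clockwise (image coordinates), top-left first."""
    cx = sum(points_x) / 4
    cy = sum(points_y) / 4
    # With y pointing down, increasing atan2 angle is clockwise on screen.
    pts = sorted(zip(points_x, points_y),
                 key=lambda p: math.atan2(p[1] - cy, p[0] - cx))
    # Assumed definition of top-left: the vertex minimizing x + y.
    start = min(range(4), key=lambda i: pts[i][0] + pts[i][1])
    pts = pts[start:] + pts[:start]
    return [p[0] for p in pts], [p[1] for p in pts]


xs, ys = sort_vertices_clockwise([10, 0, 0, 10], [0, 0, 5, 5])
print(xs, ys)
```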

mmocr.datasets.pipelines.sort_vertex8(points)[source]

Sort vertices given as 8 values [x1 y1 x2 y2 x3 y3 x4 y4].

utils

class mmocr.datasets.utils.AnnFileLoader(ann_file, parser, repeat=1, file_storage_backend='disk', file_format='txt', **kwargs)[source]

Annotation file loader that loads annotations from ann_file and parses raw annotations to dict format with the given parser.

Parameters
  • ann_file (str) – Annotation file path.

  • parser (dict) – Dictionary to construct parser to parse original annotation infos.

  • repeat (int|float) – Repeated times of dataset.

  • file_storage_backend (str) – The storage backend type for annotation file. Options are “disk”, “http” and “petrel”. Default: “disk”.

  • file_format (str) – The format of annotation file. Options are “txt” and “lmdb”. Default: “txt”.

close()[source]

For ann_file with lmdb format only.

class mmocr.datasets.utils.HardDiskLoader(ann_file, parser, repeat=1)[source]

Load txt format annotation file from hard disks.

class mmocr.datasets.utils.LineJsonParser(keys=[])[source]

Parse json-string of one line in annotation file to dict format.

Parameters

keys (list[str]) – Keys in both json-string and result dict.
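As an illustration of the idea, one annotation line holding a JSON object can be reduced to the requested keys as follows. parse_json_line is a hypothetical stand-in written for this sketch, and raising on missing keys is an assumed behavior.

```python
import json


def parse_json_line(line, keys):
    """Parse one JSON-encoded annotation line and keep only the given keys."""
    record = json.loads(line)
    missing = [key for key in keys if key not in record]
    if missing:
        raise KeyError(f"keys not found in annotation line: {missing}")
    return {key: record[key] for key in keys}


info = parse_json_line(
    '{"filename": "a.jpg", "text": "hi", "extra": 1}', ['filename', 'text'])
print(info)
```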

class mmocr.datasets.utils.LineStrParser(keys=['filename', 'text'], keys_idx=[0, 1], separator=' ', **kwargs)[source]

Parse string of one line in annotation file to dict format.

Parameters
  • keys (list[str]) – Keys in result dict.

  • keys_idx (list[int]) – Value index in sub-string list for each key above.

  • separator (str) – Separator to separate string to list of sub-string.
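The relationship between keys, keys_idx and separator can be sketched like this: the line is split on the separator, and each key takes the sub-string at its index. parse_str_line is a hypothetical helper for illustration, and the error raised for short lines is an assumed behavior.

```python
def parse_str_line(line, keys=('filename', 'text'), keys_idx=(0, 1),
                   separator=' '):
    """Split one annotation line and map selected sub-strings to keys."""
    parts = line.strip().split(separator)
    if len(parts) <= max(keys_idx):
        raise ValueError(f"line has too few fields: {line!r}")
    return {key: parts[idx] for key, idx in zip(keys, keys_idx)}


item = parse_str_line('img_1.jpg hello')
print(item)
```

With the default keys and indexes, a line such as "img_1.jpg hello" maps the first field to filename and the second to text.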

class mmocr.datasets.utils.LmdbLoader(ann_file, parser, repeat=1)[source]

Load lmdb format annotation file from hard disks.
