Training on a Single Machine

You can use tools/ to train a model in a single machine with one or more GPUs.

Here is the full usage of the script:

python tools/ ${CONFIG_FILE} [ARGS]
ARGS Type Description
--work-dir str The target folder to save logs and checkpoints. Defaults to ./work_dirs.
--load-from str The checkpoint file to load from.
--resume-from bool The checkpoint file to resume the training from.
--no-validate bool Disable checkpoint evaluation during training. Defaults to False.
--gpus int Numbers of gpus to use. Only applicable to non-distributed training.
--gpu-ids int*N A list of GPU ids to use. Only applicable to non-distributed training.
--seed int Random seed.
--deterministic bool Whether to set deterministic options for CUDNN backend.
--cfg-options str Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed.
--launcher 'none', 'pytorch', 'slurm', 'mpi' Options for job launcher.
--local_rank int Used for distributed training.
--mc-config str Memory cache config for image loading speed-up during training.

Training on Multiple Machines

MMOCR implements distributed training with MMDistributedDataParallel. (Please refer to to prepare your datasets)

Arguments Type Description
PORT int The master port that will be used by the machine with rank 0. Defaults to 29500. Note: If you are launching multiple distrbuted training jobs on a single machine, you need to specify different ports for each job to avoid port conflicts.
PY_ARGS str Arguments to be parsed by tools/

Training with Slurm

If you run MMOCR on a cluster managed with Slurm, you can use the script

Arguments Type Description
GPUS int The number of GPUs to be used by this task. Defaults to 8.
GPUS_PER_NODE int The number of GPUs to be allocated per node. Defaults to 8.
CPUS_PER_TASK int The number of CPUs to be allocated per task. Defaults to 5.
SRUN_ARGS str Arguments to be parsed by srun. Available options can be found here.
PY_ARGS str Arguments to be parsed by tools/

Here is an example of using 8 GPUs to train a text detection model on the dev partition.

./tools/ dev psenet-ic15 configs/textdet/psenet/ /nfs/xxxx/psenet-ic15

Running Multiple Training Jobs on a Single Machine

If you are launching multiple training jobs on a single machine with Slurm, you may need to modify the port in configs to avoid communication conflicts.

For example, in,

dist_params = dict(backend='nccl', port=29500)


dist_params = dict(backend='nccl', port=29501)

Then you can launch two jobs with ang


Commonly Used Training Configs

Here we list some configs that are frequently used during training for quick reference.

total_epochs = 1200
data = dict(
    # Note: User can configure general settings of train, val and test dataloader by specifying them here. However, their values can be overridden in dataloader's config.
    samples_per_gpu=8, # Batch size per GPU
    workers_per_gpu=4, # Number of workers to process data for each GPU
    train_dataloader=dict(samples_per_gpu=10, drop_last=True),   # Batch size = 10, workers_per_gpu = 4
    val_dataloader=dict(samples_per_gpu=6, workers_per_gpu=1),  # Batch size = 6, workers_per_gpu = 1
    test_dataloader=dict(workers_per_gpu=16),  # Batch size = 8, workers_per_gpu = 16
# Evaluation
evaluation = dict(interval=1, by_epoch=True)  # Evaluate the model every epoch
# Saving and Logging
checkpoint_config = dict(interval=1)  # Save a checkpoint every epoch
log_config = dict(
    interval=5,  # Print out the model's performance every 5 iterations
# Optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)  # Supports all optimizers in PyTorch and shares the same parameters
optimizer_config = dict(grad_clip=None)  # Parameters for the optimizer hook. See for implementation details
# Learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-7, by_epoch=True)
Read the Docs v: v0.4.0
On Read the Docs
Project Home

Free document hosting provided by Read the Docs.