Note

You are reading the documentation for MMOCR 0.x, which will soon be deprecated by the end of 2022. We recommend you upgrade to MMOCR 1.0 to enjoy fruitful new features and better performance brought by OpenMMLab 2.0. Check out the maintenance plan, changelog, code and documentation of MMOCR 1.0 for more details.

Training¶

Training on a Single GPU¶

You can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/train.py ${CONFIG_FILE} [ARGS]

Note

By default, MMOCR prefers GPU to CPU. If you want to train a model on CPU, please empty CUDA_VISIBLE_DEVICES or set it to -1 to make GPU invisible to the program. Note that CPU training requires MMCV >= 1.4.4.

CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [ARGS]

ARGS	Type	Description
`--work-dir`	str	The target folder to save logs and checkpoints. Defaults to `./work_dirs`.
`--load-from`	str	Path to the pre-trained model, which will be used to initialize the network parameters.
`--resume-from`	str	Resume training from a previously saved checkpoint, which will inherit the training epoch and optimizer parameters.
`--no-validate`	bool	Disable checkpoint evaluation during training. Defaults to `False`.
`--gpus`	int	Deprecated, please use --gpu-id. Numbers of gpus to use. Only applicable to non-distributed training.
`--gpu-ids`	int*N	Deprecated, please use --gpu-id. A list of GPU ids to use. Only applicable to non-distributed training.
`--gpu-id`	int	The GPU id to use. Only applicable to non-distributed training.
`--seed`	int	Random seed.
`--diff-seed`	bool	Whether or not set different seeds for different ranks.
`--deterministic`	bool	Whether to set deterministic options for CUDNN backend.
`--cfg-options`	str	Override some settings in the used config, the key-value pair in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form of either key="[a,b]" or key=a,b. The argument also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no white space is allowed.
`--launcher`	'none', 'pytorch', 'slurm', 'mpi'	Options for job launcher.
`--local_rank`	int	Used for distributed training.
`--mc-config`	str	Memory cache config for image loading speed-up during training.

Training on Multiple GPUs¶

MMOCR implements distributed training with MMDistributedDataParallel. (Please refer to datasets.md to prepare your datasets)

[PORT={PORT}] ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]

Arguments	Type	Description
`PORT`	int	The master port that will be used by the machine with rank 0. Defaults to 29500. Note: If you are launching multiple distrbuted training jobs on a single machine, you need to specify different ports for each job to avoid port conflicts.
`CONFIG_FILE`	str	The path to config.
`WORK_DIR`	str	The path to the working directory.
`GPU_NUM`	int	The number of GPUs to be used per node. Defaults to 8.
`PY_ARGS`	str	Arguments to be parsed by `tools/train.py`.

Training on Multiple Machines¶

You can launch a task on multiple machines connected to the same network.

NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]

Arguments	Type	Description
`NNODES`	int	The number of nodes.
`NODE_RANK`	int	The rank of current node.
`PORT`	int	The master port that will be used by rank 0 node. Defaults to 29500.
`MASTER_ADDR`	str	The address of rank 0 node. Defaults to "127.0.0.1".
`CONFIG_FILE`	str	The path to config.
`WORK_DIR`	str	The path to the working directory.
`GPU_NUM`	int	The number of GPUs to be used per node. Defaults to 8.
`PY_ARGS`	str	Arguments to be parsed by `tools/train.py`.

Note

MMOCR relies on torch.distributed package for distributed training. Find more information at PyTorch’s launch utility.

Say that you want to launch a job on two machines. On the first machine:

NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]

On the second machine:

NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]

Note

The speed of the network could be the bottleneck of training.

Training with Slurm¶

If you run MMOCR on a cluster managed with Slurm, you can use the script slurm_train.sh.

[GPUS=${GPUS}] [GPUS_PER_NODE=${GPUS_PER_NODE}] [CPUS_PER_TASK=${CPUS_PER_TASK}] [SRUN_ARGS=${SRUN_ARGS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]

Arguments	Type	Description
`GPUS`	int	The number of GPUs to be used by this task. Defaults to 8.
`GPUS_PER_NODE`	int	The number of GPUs to be allocated per node. Defaults to 8.
`CPUS_PER_TASK`	int	The number of CPUs to be allocated per task. Defaults to 5.
`SRUN_ARGS`	str	Arguments to be parsed by srun. Available options can be found here.
`PY_ARGS`	str	Arguments to be parsed by `tools/train.py`.

Here is an example of using 8 GPUs to train a text detection model on the dev partition.

./tools/slurm_train.sh dev psenet-ic15 configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py /nfs/xxxx/psenet-ic15

Running Multiple Training Jobs on a Single Machine¶

If you are launching multiple training jobs on a single machine with Slurm, you may need to modify the port in configs to avoid communication conflicts.

For example, in config1.py,

dist_params = dict(backend='nccl', port=29500)

In config2.py,

dist_params = dict(backend='nccl', port=29501)

Then you can launch two jobs with config1.py ang config2.py.

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

Commonly Used Training Configs¶

Here we list some configs that are frequently used during training for quick reference.

total_epochs = 1200
data = dict(
    # Note: User can configure general settings of train, val and test dataloader by specifying them here. However, their values can be overridden in dataloader's config.
    samples_per_gpu=8, # Batch size per GPU
    workers_per_gpu=4, # Number of workers to process data for each GPU
    train_dataloader=dict(samples_per_gpu=10, drop_last=True),   # Batch size = 10, workers_per_gpu = 4
    val_dataloader=dict(samples_per_gpu=6, workers_per_gpu=1),  # Batch size = 6, workers_per_gpu = 1
    test_dataloader=dict(workers_per_gpu=16),  # Batch size = 8, workers_per_gpu = 16
    ...
)
# Evaluation
evaluation = dict(interval=1, by_epoch=True)  # Evaluate the model every epoch
# Saving and Logging
checkpoint_config = dict(interval=1)  # Save a checkpoint every epoch
log_config = dict(
    interval=5,  # Print out the model's performance every 5 iterations
    hooks=[
        dict(type='TextLoggerHook')
    ])
# Optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)  # Supports all optimizers in PyTorch and shares the same parameters
optimizer_config = dict(grad_clip=None)  # Parameters for the optimizer hook. See https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/optimizer.py for implementation details
# Learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-7, by_epoch=True)