Training

Training on a Single GPU

You can use tools/train.py to train a model on a single machine with a CPU and optionally a GPU.

Here is the full usage of the script:

python tools/train.py ${CONFIG_FILE} [ARGS]

Note

By default, MMOCR prefers GPU to CPU. If you want to train a model on CPU, set CUDA_VISIBLE_DEVICES to an empty value or -1 to make the GPUs invisible to the program. Note that CPU training requires MMCV >= 1.4.4.

CUDA_VISIBLE_DEVICES= python tools/train.py ${CONFIG_FILE} [ARGS]
| ARGS | Type | Description |
| --- | --- | --- |
| --work-dir | str | The target folder to save logs and checkpoints. Defaults to ./work_dirs. |
| --load-from | str | Path to the pre-trained model, which will be used to initialize the network parameters. |
| --resume-from | str | Resume training from a previously saved checkpoint, inheriting the training epoch and optimizer state. |
| --no-validate | bool | Disable checkpoint evaluation during training. Defaults to False. |
| --gpus | int | Deprecated, please use --gpu-id. Number of GPUs to use. Only applicable to non-distributed training. |
| --gpu-ids | int*N | Deprecated, please use --gpu-id. A list of GPU ids to use. Only applicable to non-distributed training. |
| --gpu-id | int | The GPU id to use. Only applicable to non-distributed training. |
| --seed | int | Random seed. |
| --diff-seed | bool | Whether to set different seeds for different ranks. |
| --deterministic | bool | Whether to set deterministic options for the CUDNN backend. |
| --cfg-options | str | Override some settings in the config; key-value pairs in xxx=yyy format will be merged into the config file. If the value to be overwritten is a list, it should be of the form key="[a,b]" or key=a,b. Nested list/tuple values are also allowed, e.g. key="[(a,b),(c,d)]". Note that the quotation marks are necessary and that no whitespace is allowed. |
| --launcher | 'none', 'pytorch', 'slurm', 'mpi' | Options for job launcher. |
| --local_rank | int | Used for distributed training. |
| --mc-config | str | Memory cache config to speed up image loading during training. |
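
For instance, a single-GPU run that fixes the random seed and writes logs and checkpoints to a custom directory could look like this (the config path is the one from the Slurm example below; all values are illustrative):

python tools/train.py configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py --work-dir work_dirs/psenet-ic15 --seed 0 --deterministic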

Training on Multiple GPUs

MMOCR implements distributed training with MMDistributedDataParallel. (Please refer to datasets.md to prepare your datasets.)

[PORT=${PORT}] ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]
| Arguments | Type | Description |
| --- | --- | --- |
| PORT | int | The master port used by the machine with rank 0. Defaults to 29500. Note: if you launch multiple distributed training jobs on a single machine, specify a different port for each job to avoid port conflicts. |
| CONFIG_FILE | str | The path to the config file. |
| WORK_DIR | str | The path to the working directory. |
| GPU_NUM | int | The number of GPUs to be used per node. Defaults to 8. |
| PY_ARGS | str | Arguments to be parsed by tools/train.py. |
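
For example, to train with 4 GPUs on a non-default port (config path, work directory and port are illustrative):

PORT=29666 ./tools/dist_train.sh configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py work_dirs/psenet-ic15 4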

Training on Multiple Machines

You can launch a task on multiple machines connected to the same network.

NNODES=${NNODES} NODE_RANK=${NODE_RANK} PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]
| Arguments | Type | Description |
| --- | --- | --- |
| NNODES | int | The number of nodes. |
| NODE_RANK | int | The rank of the current node. |
| PORT | int | The master port used by the rank 0 node. Defaults to 29500. |
| MASTER_ADDR | str | The address of the rank 0 node. Defaults to "127.0.0.1". |
| CONFIG_FILE | str | The path to the config file. |
| WORK_DIR | str | The path to the working directory. |
| GPU_NUM | int | The number of GPUs to be used per node. Defaults to 8. |
| PY_ARGS | str | Arguments to be parsed by tools/train.py. |

Note

MMOCR relies on torch.distributed package for distributed training. Find more information at PyTorch’s launch utility.

Say that you want to launch a job on two machines. On the first machine:

NNODES=2 NODE_RANK=0 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]

On the second machine:

NNODES=2 NODE_RANK=1 PORT=${MASTER_PORT} MASTER_ADDR=${MASTER_ADDR} ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} ${GPU_NUM} [PY_ARGS]
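
With concrete values, if the address of the rank 0 machine were 10.1.1.1 (illustrative), both machines would share the same MASTER_ADDR and PORT and differ only in NODE_RANK:

NNODES=2 NODE_RANK=0 PORT=29500 MASTER_ADDR=10.1.1.1 ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} 8  # on the first machine
NNODES=2 NODE_RANK=1 PORT=29500 MASTER_ADDR=10.1.1.1 ./tools/dist_train.sh ${CONFIG_FILE} ${WORK_DIR} 8  # on the second machine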

Note

The network bandwidth between machines can become the bottleneck of training speed.

Training with Slurm

If you run MMOCR on a cluster managed with Slurm, you can use the script slurm_train.sh.

[GPUS=${GPUS}] [GPUS_PER_NODE=${GPUS_PER_NODE}] [CPUS_PER_TASK=${CPUS_PER_TASK}] [SRUN_ARGS=${SRUN_ARGS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR} [PY_ARGS]
| Arguments | Type | Description |
| --- | --- | --- |
| GPUS | int | The number of GPUs to be used by this task. Defaults to 8. |
| GPUS_PER_NODE | int | The number of GPUs to be allocated per node. Defaults to 8. |
| CPUS_PER_TASK | int | The number of CPUs to be allocated per task. Defaults to 5. |
| SRUN_ARGS | str | Arguments to be parsed by srun. See the srun documentation for available options. |
| PY_ARGS | str | Arguments to be parsed by tools/train.py. |

Here is an example of using 8 GPUs to train a text detection model on the dev partition.

./tools/slurm_train.sh dev psenet-ic15 configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py /nfs/xxxx/psenet-ic15
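
Assuming the script maps one task to each GPU (as the GPUS/GPUS_PER_NODE split suggests), the same job can be scaled across nodes by setting the optional variables. A two-node, 16-GPU run might look like this (values illustrative):

GPUS=16 GPUS_PER_NODE=8 ./tools/slurm_train.sh dev psenet-ic15 configs/textdet/psenet/psenet_r50_fpnf_sbn_1x_icdar2015.py /nfs/xxxx/psenet-ic15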

Running Multiple Training Jobs on a Single Machine

If you are launching multiple training jobs on a single machine with Slurm, you may need to modify the port in configs to avoid communication conflicts.

For example, in config1.py,

dist_params = dict(backend='nccl', port=29500)

In config2.py,

dist_params = dict(backend='nccl', port=29501)

Then you can launch two jobs with config1.py and config2.py.

CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}

Commonly Used Training Configs

Here we list some configs that are frequently used during training for quick reference.

total_epochs = 1200
data = dict(
    # General settings for the train, val and test dataloaders can be
    # configured here; each dataloader's own config may override them.
    samples_per_gpu=8, # Batch size per GPU
    workers_per_gpu=4, # Number of workers to process data for each GPU
    train_dataloader=dict(samples_per_gpu=10, drop_last=True),   # Batch size = 10, workers_per_gpu = 4
    val_dataloader=dict(samples_per_gpu=6, workers_per_gpu=1),  # Batch size = 6, workers_per_gpu = 1
    test_dataloader=dict(workers_per_gpu=16),  # Batch size = 8, workers_per_gpu = 16
    ...
)
# Evaluation
evaluation = dict(interval=1, by_epoch=True)  # Evaluate the model every epoch
# Saving and Logging
checkpoint_config = dict(interval=1)  # Save a checkpoint every epoch
log_config = dict(
    interval=5,  # Print out the model's performance every 5 iterations
    hooks=[
        dict(type='TextLoggerHook')
    ])
# Optimizer
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)  # All PyTorch optimizers are supported, with the same parameter names as in PyTorch
optimizer_config = dict(grad_clip=None)  # Parameters for the optimizer hook. See https://github.com/open-mmlab/mmcv/blob/master/mmcv/runner/hooks/optimizer.py for implementation details
# Learning policy
lr_config = dict(policy='poly', power=0.9, min_lr=1e-7, by_epoch=True)
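
Any of the fields above can also be overridden at launch time with --cfg-options, without editing the config file. The values below are illustrative only:

python tools/train.py ${CONFIG_FILE} --cfg-options total_epochs=600 data.samples_per_gpu=4 optimizer.lr=0.01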