Checkpointing

An optimization sometimes needs more time than the computational infrastructure allows in a single run, for example on a shared high-performance computing center that limits jobs to 72 hours. To address this, GloMPO incorporates checkpointing functionality, which constructs a snapshot of a managed optimization at some point in time and persists it to disk. A new GloMPOManager instance can load this snapshot at a later time to resume the optimization.

Checkpointing options are provided to the GloMPOManager through a CheckpointingControl instance.

Checkpointing tries to create a complete image of the GloMPO state. It is the user’s responsibility to ensure that the optimizers used are restartable.

Tip

Within tests/test_optimizers.py there is the TestSubclassesGlompoCompatible class which can be used to ensure an optimizer is compatible with all of GloMPO’s functionality.

Depending on its complexity and its interfaces to other codes, the optimization task cannot always be reduced to a pickled state. GloMPO will first attempt to pickle the object. Failing that, GloMPO will attempt to call checkpoint_save(). If this also fails, the checkpoint is created without the optimization task. GloMPO can be restarted from an incomplete checkpoint if the missing components are provided.
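The fallback order described above can be sketched roughly as follows. This is an illustration of the logic only, not GloMPO's implementation: the helper name save_task, the file name task.pkl, and the choice to pass the checkpoint directory to checkpoint_save() are all assumptions made for the example.

```python
import pickle
from pathlib import Path
from typing import Any


def save_task(task: Any, checkpoint_dir: Path) -> str:
    """Hypothetical helper: report how (or whether) the task was stored."""
    # Step 1: try to pickle the task object.
    try:
        payload = pickle.dumps(task)
    except Exception:
        payload = None
    if payload is not None:
        (checkpoint_dir / "task.pkl").write_bytes(payload)
        return "pickled"

    # Step 2: fall back to the task's own checkpoint_save() method.
    # Passing the target directory is an assumption of this sketch.
    try:
        task.checkpoint_save(checkpoint_dir)
        return "checkpoint_save"
    except Exception:
        # Step 3: the checkpoint is written without the task.
        return "omitted"
```

A checkpoint that ends in the "omitted" branch is the incomplete case mentioned above, where the task has to be supplied again when the checkpoint is loaded.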

As with the manual stopping of optimizers (see User Interventions), a manual checkpoint can be requested by creating a file named CHKPT in the working directory. Note that the manager deletes this file when the checkpoint is created.
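Creating the flag file needs nothing more than the standard library. A minimal sketch, in which request_checkpoint is a hypothetical helper name and working_dir is assumed to be the manager's working directory:

```python
from pathlib import Path


def request_checkpoint(working_dir: str = ".") -> Path:
    """Create the empty CHKPT flag file in the given working directory."""
    flag = Path(working_dir) / "CHKPT"
    flag.touch()
    return flag
```

From a shell, running touch CHKPT inside the working directory has the same effect.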

Checkpointing & Log Files

Caution

Please pay close attention to how GloMPO handles log files and loading checkpoints.

The HDF5 log file is not included inside the checkpoints since it can become extremely large if it is being used to gather lots of data. GloMPO always aims to continue an optimization by appending data to a matching log file rather than creating a new one. To do this, the following conditions must be met:

  1. A log file called glompo_log.h5 must be present in the working directory.

  2. The log file must contain a key matching the one in the checkpoint.

If a file named glompo_log.h5 is not present then a warning is issued and GloMPO will begin logging into a new file of this name.

If the file exists but does not contain a matching key an error will be raised.

It is the user’s responsibility to ensure that log files are located and named correctly in the working directory when loading checkpoints.
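The three outcomes described above can be summarised as a small decision function. This is a sketch of the documented rules only: it does not read an HDF5 file, the plain string keys stand in for whatever GloMPO actually stores, and the use of KeyError is an assumption because the text does not name the exception type.

```python
import warnings
from typing import Optional


def log_action(log_exists: bool, log_key: Optional[str], checkpoint_key: str) -> str:
    """Hypothetical helper: decide what happens to glompo_log.h5 on loading."""
    if not log_exists:
        # No log in the working directory: warn and start a new file.
        warnings.warn("glompo_log.h5 not found; logging to a new file of this name.")
        return "new_file"
    if log_key != checkpoint_key:
        # A log exists but belongs to a different optimization.
        raise KeyError("glompo_log.h5 does not contain the key stored in the checkpoint.")
    # A matching log exists: continue by appending to it.
    return "append"
```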

Caution

GloMPO will overwrite existing data if a matching log is found in the working directory but it contains more iteration information than the checkpoint.

For example, suppose a checkpoint was created at the 1000th function evaluation of an optimization, but the manager continued until exiting after 1398 function evaluations. If the checkpoint is loaded, it will expect the log file to contain only 1000 function evaluations.

The only way to load this checkpoint (and ensure duplicate iterations are not included in the log) is to remove any values in the log which were generated after the checkpoint. To avoid data being overwritten, the user can manually copy/rename the log file they wish to retain before loading a checkpoint.
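One way to keep the longer log is to copy it before loading the checkpoint. A minimal sketch using only the standard library, in which the helper name backup_log and the .bak suffix are arbitrary choices for the example:

```python
import shutil
from pathlib import Path


def backup_log(working_dir: str = ".", suffix: str = ".bak") -> Path:
    """Copy glompo_log.h5 to glompo_log.h5<suffix> and return the copy's path."""
    log = Path(working_dir) / "glompo_log.h5"
    backup = log.with_name(log.name + suffix)
    shutil.copy2(log, backup)  # copy2 also preserves timestamps
    return backup
```

The original file stays in place under its expected name, so the checkpoint can still find and append to it.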

Checkpointing Control Settings

class CheckpointingControl(checkpoint_time_interval=inf, checkpoint_iter_interval=inf, checkpoint_at_init=False, checkpoint_at_conv=False, raise_checkpoint_fail=False, force_task_save=False, keep_past=-1, naming_format='glompo_checkpoint_%(date)_%(time)', checkpointing_dir='checkpoints')

Class to set up and control the checkpointing behaviour of the GloMPOManager. This class has limited functionality and is mainly a container for various settings. The initialisation arguments match the class attributes of the same name.
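As an illustration, the settings for a job running under a 72-hour limit might look like the following. The keyword names come from the signature above; the values are arbitrary example choices, and the final commented line assumes CheckpointingControl has been imported from an installed copy of GloMPO.

```python
# Illustrative settings only; every value here is an example choice.
checkpoint_settings = dict(
    checkpoint_time_interval=6 * 60 * 60,  # seconds: one checkpoint every 6 hours
    checkpoint_at_init=True,               # snapshot before any work is done
    checkpoint_at_conv=True,               # snapshot just before the manager exits
    keep_past=2,                           # retain two older checkpoints
    naming_format="chkpt_%(count)",        # chkpt_000, chkpt_001, ...
    checkpointing_dir="checkpoints",
)

# With GloMPO installed, the settings would be unpacked into the class:
# control = CheckpointingControl(**checkpoint_settings)
```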

Attributes:

checkpoint_at_conv : bool

If True a checkpoint is built just before the manager exits.

checkpoint_at_init : bool

If True a checkpoint is built at the very start of the optimization. This can make starting duplicate jobs easier.

checkpoint_iter_interval : float

Number of function evaluations between checkpoints being saved to disk during an optimization. Function-call-based checkpointing is not performed if this parameter is not provided.

checkpoint_time_interval : float

Number of seconds between checkpoints being saved to disk during an optimization. Time-based checkpointing is not performed if this parameter is not provided.

checkpointing_dir : Union[pathlib.Path, str]

Directory in which checkpoints are saved. Defaults to 'checkpoints'.

Important

This path is always converted to an absolute path. If a relative path is provided, it is resolved against the current working directory at the moment this object is created. There is no relation to GloMPOManager.working_dir.
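The consequence of resolving the path at creation time can be demonstrated with the standard library alone. In this sketch resolve_at_creation is a hypothetical stand-in for what happens when the object is built; it is not GloMPO code.

```python
import os
import tempfile
from pathlib import Path


def resolve_at_creation(relative: str) -> Path:
    """Fix a relative path against the current working directory right now."""
    return Path(relative).resolve()


def demo() -> bool:
    """Return True if a later change of directory leaves the saved path alone."""
    start = Path.cwd()
    with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
        try:
            os.chdir(first)
            saved = resolve_at_creation("checkpoints")
            os.chdir(second)  # changing directory afterwards has no effect
            return saved.parent == Path(first).resolve()
        finally:
            os.chdir(start)  # restore the original directory
```

In practice this means a relative checkpointing_dir should be set with the intended launch directory in mind, or replaced by an absolute path.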

count : int

Counter for checkpoint naming patterns which rely on incrementing filenames. Count starts from the largest existing match in checkpointing_dir or zero otherwise. Formatted to 3 digits.

force_task_save : bool

Some tasks may pickle successfully but fail to load properly. If this is an issue, setting this parameter to True will cause the manager to bypass the task pickling step and immediately attempt the checkpoint_save() method.

keep_past : int

The number of older checkpoints retained when a new checkpoint is made. Any checkpoints older than these are deleted. The default of -1 performs no deletion. With keep_past = 0 no previous checkpoints are retained and only the newly constructed checkpoint will exist.

Note

  1. GloMPO will only count the directories in checkpointing_dir that match the supplied naming_format.

  2. Existing checkpoints will only be deleted if the new checkpoint is successfully constructed.
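The retention rule can be sketched as a pure function over checkpoint names. This reflects one reading of the description above, in which keep_past counts older checkpoints in addition to the new one; the helper name and the oldest-first ordering of the input are assumptions of the example.

```python
from typing import List


def checkpoints_to_delete(existing: List[str], keep_past: int) -> List[str]:
    """Return the names to remove once a new checkpoint has been built.

    `existing` holds the matching older checkpoints, oldest first, and
    does not include the newly constructed checkpoint.
    """
    if keep_past < 0:
        return []              # -1: never delete anything
    if keep_past == 0:
        return list(existing)  # keep only the new checkpoint
    return existing[:-keep_past]  # keep the `keep_past` most recent older ones


old = ["chkpt_000", "chkpt_001", "chkpt_002"]
print(checkpoints_to_delete(old, 1))  # ['chkpt_000', 'chkpt_001']
```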

naming_format : str

Convention used to name the checkpoints. Special keys that can be used:

=================  =========================================================
Naming Format Key  Checkpoint Name Result
=================  =========================================================
'%(date)'          Current calendar date in YYYYMMDD format
'%(year)'          Year formatted to YYYY
'%(yr)'            Year formatted to YY
'%(month)'         Numerical month formatted to MM
'%(day)'           Calendar day of the month formatted to DD
'%(time)'          Current calendar time formatted to HHMMSS (24-hour style)
'%(hour)'          Hour formatted to HH (24-hour style)
'%(min)'           Minutes formatted to MM
'%(sec)'           Seconds formatted to SS
'%(count)'         Index count of the number of checkpoints constructed.
=================  =========================================================
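To make the keys concrete, the substitution can be sketched with datetime.strftime. The function expand_name is a hypothetical illustration rather than GloMPO's get_name(), and the three-digit count follows the description of the count attribute.

```python
from datetime import datetime


def expand_name(naming_format: str, now: datetime, count: int) -> str:
    """Replace each %(key) token with its formatted value."""
    values = {
        "date": now.strftime("%Y%m%d"),
        "year": now.strftime("%Y"),
        "yr": now.strftime("%y"),
        "month": now.strftime("%m"),
        "day": now.strftime("%d"),
        "time": now.strftime("%H%M%S"),
        "hour": now.strftime("%H"),
        "min": now.strftime("%M"),
        "sec": now.strftime("%S"),
        "count": f"{count:03d}",  # zero-padded to three digits
    }
    name = naming_format
    for key, value in values.items():
        name = name.replace(f"%({key})", value)
    return name


stamp = datetime(2024, 1, 31, 9, 30, 5)
print(expand_name("glompo_checkpoint_%(date)_%(time)", stamp, 0))
# glompo_checkpoint_20240131_093005
```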

raise_checkpoint_fail : bool

If True a failed checkpoint will cause the manager to end the optimization in error. Note that GloMPO will always write out some data when it terminates, which can be a way of preserving data if the checkpoint fails. If False, an error in constructing a checkpoint simply raises a warning and the optimization continues.

property any_true

Returns True if at least one of the four checkpointing conditions is set to produce checkpoints.
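The four conditions are presumably the two intervals and the two boolean flags. Under that assumption, and treating an interval left at its default of inf as disabled, the check could be sketched as follows; this is a free-standing function written for illustration, not the property itself.

```python
import math


def any_true(
    checkpoint_time_interval: float = math.inf,
    checkpoint_iter_interval: float = math.inf,
    checkpoint_at_init: bool = False,
    checkpoint_at_conv: bool = False,
) -> bool:
    """True if any setting would cause a checkpoint to be produced."""
    return bool(
        checkpoint_at_init
        or checkpoint_at_conv
        or math.isfinite(checkpoint_time_interval)  # inf means "never"
        or math.isfinite(checkpoint_iter_interval)
    )
```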

get_name()

Returns a new name for a checkpoint matching the naming format.

matches_naming_format(name)

Returns True if the provided name matches the pattern in the naming_format.
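One plausible way to perform such a check is to turn the naming format into a regular expression in which each key becomes a fixed-width run of digits. The sketch below shows that idea under the digit widths listed for naming_format; it is not taken from GloMPO's source, and the function name matches is hypothetical.

```python
import re

# Digit widths follow the formats listed for each naming key.
KEY_PATTERNS = {
    "date": r"\d{8}", "year": r"\d{4}", "yr": r"\d{2}", "month": r"\d{2}",
    "day": r"\d{2}", "time": r"\d{6}", "hour": r"\d{2}", "min": r"\d{2}",
    "sec": r"\d{2}", "count": r"\d{3}",
}


def matches(name: str, naming_format: str) -> bool:
    """True if `name` could have been produced from `naming_format`."""
    # Escape the literal parts first, then swap in the digit patterns.
    pattern = re.escape(naming_format)
    for key, digits in KEY_PATTERNS.items():
        pattern = pattern.replace(re.escape(f"%({key})"), digits)
    return re.fullmatch(pattern, name) is not None
```

A check of this kind is what allows unrelated directories inside checkpointing_dir to be ignored when older checkpoints are counted.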