4.9. Optimization

The Optimization class is where the other components – Job Collection, Data Set, Parameter Interfaces and Optimizers – come together. It is responsible for the selection, generation, execution and evaluation of new jobs for every new parameter set.

See also

Architecture Quick Reference for an overview

An Optimization instance is usually initialized once all other components have been defined:

>>> interface     = ReaxFFParameters('path/to/ffield.ff')
>>> jc            = JobCollection('path/to/jobcol.yml')
>>> training_set  = DataSet('path/to/data_set.yml')
>>> optimizer     = CMAOptimizer(popsize=15)
>>> optimization  = Optimization(jc, training_set, interface, optimizer)

Once initialized, the following will run a complete optimization:

>>> optimization.optimize()

After instantiation, a summary of all relevant settings can be printed with summary():

>>> optimization.summary()
Optimization() Instance Settings:
=================================
Workdir:                           opt
JobCollection size:                20
Interface:                         ReaxFFParameters
Active parameters:                 207
Optimizer:                         CMAOptimizer
Callbacks:                         Timeout
                                   Logger

Evaluators:
-----------
Name:                              training_set (_LossEvaluator)
Loss:                              SSE
Evaluation frequency:              1

Data Set entries:                  20
Data Set jobs:                     20
Batch size:                        None

CPU cores:                         6
Use PIPE:                          True
---
===

4.9.1. Optimization Setup

The optimization can be further controlled by passing a number of optional keyword arguments to the Optimization instance. While the full list of arguments is documented in the API section below, the most relevant ones are presented here, followed by a short usage sketch.

parallel
An instance of the ParallelLevels class describing how the optimization is to be parallelized.
constraints
Constraints that further restrict the parameter search space; every candidate solution is checked for consistency with them.
callbacks
A list of callback instances. Callbacks provide a versatile way to interact with the optimization process at every iteration.
validation
Percentage of the training_set entries to be used for validation. Can be used with the Early Stopping callback.
loss
The loss function to be used for this optimization instance.
batch_size
Instead of evaluating every entry in the training_set, evaluate at most batch_size randomly selected entries per iteration.
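
The sketch below (building on the instances defined earlier) illustrates how such keyword arguments could be passed. It assumes that ParallelLevels and Timeout can be imported from the scm.params package; the particular values, the Timeout callback argument and the ParallelLevels settings are illustrative assumptions, not recommended defaults:

>>> from scm.params import ParallelLevels, Timeout
>>> optimization = Optimization(jc, training_set, interface, optimizer,
...                             parallel=ParallelLevels(jobs=4),  # assumed: run up to 4 jobs in parallel
...                             validation=0.1,                   # use 10% of the training set for validation
...                             callbacks=[Timeout(6*60*60)],     # assumed: stop the optimization after 6 hours
...                             loss='sse',                       # sum-of-squares loss (the default)
...                             batch_size=10)                    # at most 10 randomly picked entries per iteration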

4.9.2. Optimization API

class Optimization

The top level class managing an entire optimization.

__init__(job_collection: scm.params.core.jobcollection.JobCollection, data_sets: Union[scm.params.core.dataset.DataSet, Sequence[scm.params.core.dataset.DataSet]], parameter_interface: Type[scm.params.parameterinterfaces.base.BaseParameters], optimizer: Type[scm.params.optimizers.base.BaseOptimizer], workdir: str = 'opt', plams_workdir_path: str = None, validation: float = None, callbacks: Sequence[scm.params.core.callbacks.Callback] = None, constraints: Sequence[scm.params.parameterinterfaces.base.Constraint] = None, parallel: scm.params.common.parallellevels.ParallelLevels = None, verbose: bool = True, skip_x0: bool = False, logger_every: Union[dict, int] = None, loss: Union[scm.params.core.lossfunctions.Loss, Sequence[scm.params.core.lossfunctions.Loss]] = 'sse', batch_size: Union[int, Sequence[int]] = None, use_pipe: Union[bool, Sequence[bool]] = True, data_set_names: Sequence[str] = None, eval_every: Union[int, Sequence[int]] = 1, maxjobs: Union[None, Sequence[int]] = None, maxjobs_shuffle: Union[bool, Sequence[bool]] = False)
Parameters:
job_collection : JobCollection
Job collection holding all jobs necessary to evaluate the data_sets
data_sets : DataSet, list(DataSet)
Data Set(s) to be evaluated.
In the simplest case, one data set will be evaluated as the training set. Multiple data sets can be passed to be evaluated sequentially at every optimizer step. In this case, the first data set will be interpreted as the training set, the second as a validation set.
parameter_interface : any parameter interface
The interface to the parameters that are to be optimized.
optimizer : optimizer class
An instance of an optimizer class.
workdir : optional, str
The working directory for this optimization. Once optimize() is called, the instance will switch to this directory.
plams_workdir_path : optional, str
The folder in which the PLAMS working directory is created. By default the PLAMS working directory is created inside of $SCM_TMPDIR or /tmp if the former is not defined. When running on a compute cluster this variable can be set to a local directory of the machine where the jobs are running, avoiding a potentially slow PLAMS working directory that is mounted over the network.
validation : optional, float, int
If the passed value is 0 < validation < 1, a validation set containing that fraction of the entries of the first data set in data_sets will be created. If the passed value is 1 < validation < len(data_sets[0]), a validation set with validation entries taken from the first data set in data_sets will be created. If you would like to pass a DataSet instance instead, you can do so in the data_sets parameter.
callbacks : optional, List of callback instances
List of callbacks interacting with the optimization instance. A Logger callback will always be added if not already present in the list. See also the logger_every argument.
constraints : optional, List of parameter constraints
Additional constraints for candidate solutions \(\boldsymbol{x}^*\). If any of these return False, the solution will not be considered.
parallel : optional, ParallelLevels
Configuration for the parallelization at all levels of a parameter optimization.
verbose : bool
Print the current best loss function value every time it improves.
skip_x0 : bool
Before an optimization process starts, the DataSet will be evaluated with the initial parameters \(\boldsymbol{x}_0\). If this initial evaluation returns an infinite loss function value, an error will be raised by default. This behavior assumes that the initial parameters are generally valid and that a non-finite loss is most likely caused by bad plams.Settings of an entry in the JobCollection.
However, if it is not known whether the initial parameters can be trusted, or if raising an error is not desired for other reasons, this parameter can be set to True to skip the initial evaluation.
logger_every : dict or int
See every_n_iter in Logger. This option is ignored if a Logger is provided in the callbacks.
Per Data Set Parameters:

Note

The following parameters will be applied to all entries in data_sets, meaning each Data Set will be evaluated with the same settings. To override this, any of the parameters below can also take a list with the same number of elements as data_sets, mapping individual settings to every data_sets entry; a sketch of this follows at the end of this parameter list.

loss : optional, Loss, str
A Loss Function instance (or its string name) used to compute the loss of every new parameter set. Sum of Squares Error ('sse') by default.
batch_size : optional, int

The number of entries to be evaluated per epoch. If None, all entries will be evaluated.

Note: One job calculation can have multiple property entries in a training set (e.g. Energy and Forces); thus, this parameter is not the same as maxjobs.

Note: If both maxjobs and batch_size are set, the former will be applied first. If the resulting set is still larger than batch_size, filtering by batch_size will be applied.

use_pipe : optional, bool
Whether to use the AMSWorker interface for suitable jobs.
data_set_names : optional, List[str]
When evaluating multiple data_sets, can be set to give each entry a name. Any Logger callbacks will create a subdirectory with this name and write their data into it.
Defaults to ['training_set', 'validation_set', 'data_set03', ..., 'data_setXX']
eval_every : optional, int

Evaluate the Data Set only at every eval_every-th call.

Warning

The first entry in data_sets represents the training set and must be evaluated at every call. Its frequency will always be 1.

maxjobs : optional, int
Limit each Data Set evaluation to a subset of at most maxjobs jobs. Ignored if None.
maxjobs_shuffle : optional, bool
If maxjobs is set, will generate a new subset of the Data Set with maxjobs at every evaluation.
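
As a sketch of the list form described in the note above, assume a second DataSet instance (here a hypothetical validation_set) should be evaluated under its own name only every tenth iteration; all values are illustrative:

>>> optimization = Optimization(jc, [training_set, validation_set], interface, optimizer,
...                             data_set_names=['training_set', 'validation_set'],
...                             loss=['sse', 'sse'],   # one loss function per data set
...                             eval_every=[1, 10])    # training set every iteration, validation set every 10th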
optimize() → scm.params.optimizers.base.MinimizeResult

Start the optimization given the initial parameters
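
A minimal sketch of inspecting the returned MinimizeResult, assuming it exposes the best parameters and loss value as x and fx attributes:

>>> result = optimization.optimize()
>>> result.x    # assumed attribute: best parameter values found
>>> result.fx   # assumed attribute: loss function value at result.x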

initial_eval()

Evaluate x0 before the optimization. Returns (fx, abort) where abort is a bool signifying whether to abort (whether a callback returned True)
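
A minimal usage sketch, following the return signature stated above:

>>> fx, abort = optimization.initial_eval()  # fx: loss at x0; abort: True if a callback requested termination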

summary(file=None)

Prints a summary of the current instance

__str__()

Return str(self).

delete()

Remove the working directory from disk