4.9. Optimization¶
The Optimization class is where the other components – Job Collection, Data Set, Parameter Interfaces and Optimizers – come together. It is responsible for the selection, generation, execution, and evaluation of new jobs for every new parameter set.
See also
Architecture Quick Reference for an overview
An Optimization instance will usually be initialized once every other component is defined:
>>> from scm.params import *  # assuming the top-level scm.params package exports these classes
>>> interface = ReaxFFParameters('path/to/ffield.ff')
>>> jc = JobCollection('path/to/jobcol.yml')
>>> training_set = DataSet('path/to/data_set.yml')
>>> optimizer = CMAOptimizer(popsize=15)
>>> optimization = Optimization(jc, training_set, interface, optimizer)
Once initialized, the following will run a complete optimization:
>>> optimization.optimize()
After instantiation, a summary of all relevant settings can be printed with summary():
>>> optimization.summary()
Optimization() Instance Settings:
=================================
Workdir: opt
JobCollection size: 20
Interface: ReaxFFParameters
Active parameters: 207
Optimizer: CMAOptimizer
Callbacks: Timeout
Logger
Evaluators:
-----------
Name: training_set (_LossEvaluator)
Loss: SSE
Evaluation frequency: 1
Data Set entries: 20
Data Set jobs: 20
Batch size: None
CPU cores: 6
Use PIPE: True
---
===
4.9.1. Optimization Setup¶
The optimization can be further controlled by providing a number of optional keyword arguments to the Optimization instance. While the full list of arguments is documented in the API section below, the most relevant ones are presented here, followed by a short sketch after the list.
- parallel
- An instance of the ParallelLevels class describing how the optimization is to be parallelized.
- constraints
- Constraints additionally restrict the parameter search space by checking that every candidate solution is consistent with their definitions.
- callbacks
- A list of callback instances. Callbacks provide a versatile way to interact with the optimization process at every iteration.
- validation
- Percentage of the training_set entries to be used for validation. Can be used with the Early Stopping callback.
- loss
- The loss function to be used for this optimization instance.
- batch_size
- Instead of evaluating all properties in the training_set, evaluate at most batch_size randomly picked entries per iteration.
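As a minimal sketch, the keyword arguments above could be combined as follows. The ParallelLevels and Timeout arguments shown here are assumptions about those classes' signatures, so consult their documentation before use:
>>> parallel = ParallelLevels(parametervectors=4)  # assumed argument: evaluate 4 parameter vectors in parallel
>>> callbacks = [Timeout(24*60*60)]                # assumed signature: stop after 24 hours
>>> optimization = Optimization(jc, training_set, interface, optimizer,
...                             parallel=parallel,
...                             callbacks=callbacks,
...                             validation=0.1,    # hold out 10% of the training set entries
...                             loss='sse',        # sum of squared errors (the default)
...                             batch_size=10)     # evaluate at most 10 random entries per iteration
>>> optimization.optimize()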
4.9.2. Optimization API¶
class Optimization¶
The top-level class managing an entire optimization.

__init__(job_collection: scm.params.core.jobcollection.JobCollection, data_sets: Union[scm.params.core.dataset.DataSet, Sequence[scm.params.core.dataset.DataSet]], parameter_interface: Type[scm.params.parameterinterfaces.base.BaseParameters], optimizer: Type[scm.params.optimizers.base.BaseOptimizer], workdir: str = 'opt', plams_workdir_path: str = None, validation: float = None, callbacks: Sequence[scm.params.core.callbacks.Callback] = None, constraints: Sequence[scm.params.parameterinterfaces.base.Constraint] = None, parallel: scm.params.common.parallellevels.ParallelLevels = None, verbose: bool = True, skip_x0: bool = False, logger_every: Union[dict, int] = None, loss: Union[scm.params.core.lossfunctions.Loss, Sequence[scm.params.core.lossfunctions.Loss]] = 'sse', batch_size: Union[int, Sequence[int]] = None, use_pipe: Union[bool, Sequence[bool]] = True, data_set_names: Sequence[str] = None, eval_every: Union[int, Sequence[int]] = 1, maxjobs: Union[None, Sequence[int]] = None, maxjobs_shuffle: Union[bool, Sequence[bool]] = False)¶
Parameters:
- job_collection : JobCollection
- Job collection holding all jobs necessary to evaluate the data_sets.
- data_sets : DataSet, list(DataSet)
- Data Set(s) to be evaluated. In the simplest case, a single data set will be evaluated as the training set. Multiple data sets can be passed to be evaluated sequentially at every optimizer step; in that case, the first data set will be interpreted as the training set and the second as a validation set.
- parameter_interface : any parameter interface
- The interface to the parameters that are to be optimized.
- optimizer : optimizer class
- An instance of an optimizer class.
- workdir : optional, str
- The working directory for this optimization.
Once optimize() is called, the process will switch to this directory.
- plams_workdir_path : optional, str
- The folder in which the PLAMS working directory is created. By default, the PLAMS working directory is created inside $SCM_TMPDIR, or /tmp if the former is not defined. When running on a compute cluster, this variable can be set to a directory local to the machine where the jobs run, avoiding a potentially slow PLAMS working directory that is mounted over the network.
- validation : optional, float, int
- If the passed value is 0 < value < 1, a validation set will be created from that fraction of the first data set in data_sets. If the passed value is 1 < value < len(data_sets[0]), a validation set with that many entries will be created from the first data set in data_sets. If you would like to pass a DataSet instance instead, you can do so through the data_sets parameter.
- callbacks : optional, list of callback instances
- List of callbacks interacting with the optimization instance. A Logger callback will always be added if not already present in the list. See also the logger_every argument.
- constraints : optional, list of parameter constraints
- Additional constraints for candidate solutions \(\boldsymbol{x}^*\). If any of these return False, the solution will not be considered.
- parallel : optional, ParallelLevels
- Configuration for the parallelization at all levels of a parameter optimization.
- verbose : bool
- Print the current best loss function value each time it improves.
- skip_x0 : bool
- Before an optimization process starts, the DataSet will be evaluated with the initial parameters \(\boldsymbol{x}_0\). If this initial evaluation returns an infinite loss function value, an error will be raised by default. This behavior assumes that the initial parameters are generally valid, and that a non-finite loss is most likely caused by bad plams.Settings of an entry in the JobCollection. However, if it is not known whether the initial parameters can be trusted, or if raising an error is not desired for other reasons, this parameter can be set to True to skip the initial evaluation.
- logger_every : dict or int
- See every_n_iter in Logger. This option is ignored if a Logger is provided in the callbacks.
Per Data Set Parameters:
Note
The following parameters will be applied to all entries in data_sets, meaning each data set will be evaluated with the same settings. To override this, any of the parameters below can also take a list with the same number of elements as len(data_sets), mapping individual settings to every data_sets entry. A sketch of such a per data set setup is given after this parameter list.
- loss : optional, Loss, str
- A Loss Function instance to compute the loss of every new parameter set. Residual Sum of Squares by default.
- batch_size : optional, int
- The number of entries to be evaluated per epoch. If None, all entries will be evaluated.
Note: One job calculation can have multiple property entries in a training set (e.g., Energy and Forces); thus, this parameter is not the same as maxjobs.
Note: If both maxjobs and batch_size are set, the former will be applied first. If the resulting set is still larger than batch_size, it will be further reduced to batch_size entries.
- use_pipe : optional, bool
- Whether to use the AMSWorker interface for suitable jobs.
- data_set_names : optional, List[str]
- When evaluating multiple data_sets, can be set to give each entry a name.
Possible logger callbacks will create and write data into a subdirectory of that name.
Defaults to ['training_set', 'validation_set', 'data_set03', ..., 'data_setXX'].
- eval_every : optional, int
- Evaluate the data set at every eval_every call.
Warning
The first entry in data_sets represents the training set and must be evaluated at every call. Its frequency will always be 1.
- maxjobs : optional, int
- Limit each data set evaluation to a subset of at most maxjobs jobs. Ignored if None.
- maxjobs_shuffle : optional, bool
- If maxjobs is set, a new random subset of the data set with maxjobs jobs will be generated at every evaluation.
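To illustrate the per data set parameters, the following sketch evaluates a training and a validation set with individual settings per set. The file names are placeholders, and the 'rmse' loss string is an assumption about the available loss function names:
>>> ts = DataSet('path/to/training_set.yml')    # placeholder paths
>>> vs = DataSet('path/to/validation_set.yml')
>>> optimization = Optimization(jc, [ts, vs], interface, optimizer,
...                             loss=['sse', 'rmse'],  # per data set losses ('rmse' assumed available)
...                             eval_every=[1, 10],    # training set every call, validation set every 10th
...                             use_pipe=[True, False])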
optimize() → scm.params.optimizers.base.MinimizeResult¶
Start the optimization given the initial parameters.
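The returned MinimizeResult can then be inspected for the optimized parameters and final loss. The attribute names success, x, and fx below are assumptions based on common minimizer result conventions, not confirmed by this page:
>>> result = optimization.optimize()
>>> result.success  # assumed attribute: whether the optimization finished successfully
>>> result.x        # assumed attribute: best parameter vector found
>>> result.fx       # assumed attribute: loss value at result.x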
initial_eval()¶
Evaluate x0 before the optimization. Returns (fx, abort), where abort is a bool signifying whether to abort (i.e., whether a callback returned True).
summary(file=None)¶
Prints a summary of the current instance.
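If file follows the usual Python print(file=...) convention of accepting a writable stream (an assumption, as this page does not specify the type), the summary could be written to disk like so:
>>> with open('optimization_settings.txt', 'w') as f:
...     optimization.summary(file=f)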
__str__()¶
Return str(self).
delete()¶
Remove the working directory from disk.