2.10. Training and validation sets
Important
First go through the Getting Started: Lennard-Jones Potential for Argon tutorial.
With ParAMS, you can have a validation set in addition to the training set.
The optimizer will minimize the loss function on the training set. During the parametrization, you can also monitor the loss function on the validation set (see figure above). This helps to detect overfitting: as long as the loss function on the (appropriately chosen) validation set decreases similarly to the loss function on the training set, there is likely no overfitting.
There are two ways to include a validation set in ParAMS:
- An explicit validation set in the same format as the training set
- (Scripting only:) Setting the validation option to some fraction (e.g. validation = 0.1), which will randomly split the original data set into a training set and a validation set.
2.10.1. Explicit validation set (validation_set.yaml)
Make a copy of the directory $AMSHOME/scripting/scm/params/examples/LJ_Ar_validation_set.
- Start the ParAMS GUI: SCM → ParAMS
- File → Open, and browse to the job_collection.yaml file in the example directory
- On the Training Set panel are four entries: two of type Energy and two of type Forces.
- On the Validation Set panel is one entry: the Forces for Ar32_frame001.
- On the All panel, all validation set entries are marked with a gray color.
Move an entry from the training set to validation set:
- On the All or Training Set panel, select the entry for the Forces of Ar32_frame003.
- Training Set → Move to Validation Set
- The entry disappears from the Training Set panel, and appears on the Validation Set panel
Move an entry from the validation set to training set:
- On the All or Validation Set panel, select the entry for the Forces of Ar32_frame003.
- Training Set → Move to Training Set
- The entry disappears from the Validation Set panel, and appears on the Training Set panel
You can also do a random training/validation split on selected entries:
- On the All panel, select the two Energy entries and three Forces entries
- Training Set → Generate Validation Set…
- Percentage of entries to use for validation: 40.0
- Click OK
- This places 2 (40%) random entries in the validation set, and 3 in the training set.
Before continuing, revert all changes:
- File → Revert Training/Validation/Jobs
This places just the forces for the Ar32_frame001 job in the validation set again.
The example directory also contains a validation_set.yaml file with one data set entry, with the Expression forces('Ar32_frame001'). The corresponding expression has been removed from training_set.yaml.
Note
There is only one job collection! It is used for both the training and validation sets. Here, the job Ar32_frame001 is needed for both the training and validation sets:
- energy('Ar32_frame001')-energy('Ar32_frame002') is part of the training set
- forces('Ar32_frame001') is part of the validation set
2.10.1.1. Validation set settings
- Switch to the Settings panel in the bottom half
- The logger_every option can be toggled by selecting Logger in the Settings drop-down
- The eval_every option can be toggled by selecting Validation Set Eval Every in the Settings drop-down
The params.conf.py file now references both the training_set.yaml and validation_set.yaml files.
training_set = 'training_set.yaml'
validation_set = 'validation_set.yaml'
job_collection = 'job_collection.yaml'
parameter_interface = 'parameter_interface.yaml'

### Define an optimizer for the optimization task. Use either a CMAOptimizer or Scipy
#optimizer = CMAOptimizer(sigma=0.1, popsize=10, minsigma=5e-4)
optimizer = Scipy(method='Nelder-Mead') # Nelder-Mead

### run the optimization in serial
parallel = ParallelLevels(parametervectors=1, jobs=1)

# log training set loss every 5 iterations
logger_every = 5
# evaluate (and log) validation set loss every 5 iterations
eval_every = 5
- logger_every = 5 means that information about the optimization is logged every 5 iterations (for the training and validation sets)
- eval_every = 5 means that the validation set will only be evaluated every 5 iterations (the training set must by definition be evaluated every iteration).
Tip
We recommend setting eval_every and logger_every to the same value. If the validation set is very expensive to calculate, set eval_every to a multiple of logger_every, otherwise the validation error will not be logged!
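The interplay between logger_every and eval_every can be sketched in plain Python (an illustrative sketch of the interval scheduling described above, not actual ParAMS code):

```python
def logged_validation_iterations(n_iter, logger_every, eval_every):
    """Iterations at which the validation loss is both evaluated
    (every eval_every iterations) and written to the log
    (every logger_every iterations)."""
    logged = set(range(0, n_iter, logger_every))
    evaluated = set(range(0, n_iter, eval_every))
    return sorted(logged & evaluated)

# eval_every is a multiple of logger_every: every validation evaluation is logged
print(logged_validation_iterations(20, logger_every=5, eval_every=10))  # [0, 10]

# eval_every is NOT a multiple of logger_every: the evaluations at
# iterations 7 and 14 never coincide with a logging iteration
print(logged_validation_iterations(20, logger_every=5, eval_every=7))  # [0]
```

This is why choosing eval_every as a multiple of logger_every guarantees that every computed validation loss also appears in the log.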
2.10.1.2. Run the optimization
- Make sure that there is at least one entry on the Validation Set panel.
- File → Save As
- Save the project in a new directory
- File → Run
In a terminal, run
"$AMSBIN/params" optimize
2.10.1.3. Training and validation set results
On the Validation Set panel, you can see the predictions and loss contributions just as on the Training Set panel.
On the Graphs panel, you can plot results for the training set, validation set, or both.
- In one of the graph drop-downs, choose Loss → loss
- This plots the loss function for both the training and validation set as a function of evaluation number
- In the Data From: drop-down, you can toggle whether to plot the training loss, validation loss, or both.
In this case, both the training set and validation set losses decrease, so there is no sign of overfitting.
You can also plot the root-mean-squared error (RMSE) or mean absolute error (MAE) as a function of evaluation for both the training and validation sets:
- In one of the graph drop-downs, choose Stats → Forces
- This plots the RMSE of the forces for the training and validation sets
Note
If you plot Stats → Energy then you do not get any validation set results, since there were no Energy entries in the validation set!
For scatter plots:
- In one of the graph drop-downs, choose Forces.
- This plots two curves with titles Training (134): forces and Validation (134): forces.
The (134) (in your case the number may be different) indicates which evaluation the parameters came from. Every set of parameters during the parametrization has a unique evaluation number.
The default plot is Best Training, which in this case corresponds to the parameters at evaluation 134.
- In the Best Training drop-down, select Best Validation
- This plots curves with titles Validation (140): forces and Training (140): forces.
- Evaluation 140 corresponds to the parameters that gave the lowest loss function on the validation set.
Note
Only after the parametrization has finished (from convergence, max_evaluations, or timeout) will you get the validation set results for the best training set parameters and vice versa. While the parametrization is running you will only be able to plot results for the training set using the best training set parameters, and for the validation set using the best validation set parameters.
You can also plot the latest evaluation:
- In the Best Validation drop-down, select Latest Training
- This plots a curve with the title Training (149).
- In the Latest Training drop-down, select Latest Validation
- This plots a curve with the title Validation (145).
Here, the latest training set evaluation was done with a different set of parameters than the latest validation set evaluation. This is because the validation set is only evaluated every 5 iterations (as specified by the eval_every keyword in the settings).
Tip
You can double-click on a plot axis to access plot configuration settings.
The results for the validation set are stored in a directory called validation_set_results:
optimization/
├── settings_and_initial_data
├── summary.txt
├── training_set_results
├── validation_set_results
The layout is exactly the same as for training_set_results.
Important
The validation_set_results/best directory contains results for the parameters that had the lowest validation set loss. These are generally not the same parameters that give the lowest error on the training set (for which the results are in training_set_results/best).
The training_set_results/best and validation_set_results/best directories get updated as the parametrization progresses.
If the parametrization finishes successfully, either from reaching convergence, max_evaluations, or timeout, then two extra “best” directories will be created:
Directory | Results for | Parameters gave lowest loss for
training_set_results/best | training set | training set
training_set_results/validation_set_best_parameters | training set | validation set
validation_set_results/best | validation set | validation set
validation_set_results/training_set_best_parameters | validation set | training set
The following pairs of files should be identical:
- training_set_results/best/parameter_interface.yaml and validation_set_results/training_set_best_parameters/parameter_interface.yaml
- training_set_results/best/evaluation.txt and validation_set_results/training_set_best_parameters/evaluation.txt
- validation_set_results/best/parameter_interface.yaml and training_set_results/validation_set_best_parameters/parameter_interface.yaml
- validation_set_results/best/evaluation.txt and training_set_results/validation_set_best_parameters/evaluation.txt
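If you want to verify this on your own results, the pairs can be compared byte-by-byte with the Python standard library (a sketch; the paths follow the directory layout described above):

```python
import filecmp
import os

# Pairs of files that should be identical after a successful parametrization
# (relative to the optimization/ directory, following the layout above)
PAIRS = [
    ("training_set_results/best/parameter_interface.yaml",
     "validation_set_results/training_set_best_parameters/parameter_interface.yaml"),
    ("training_set_results/best/evaluation.txt",
     "validation_set_results/training_set_best_parameters/evaluation.txt"),
    ("validation_set_results/best/parameter_interface.yaml",
     "training_set_results/validation_set_best_parameters/parameter_interface.yaml"),
    ("validation_set_results/best/evaluation.txt",
     "training_set_results/validation_set_best_parameters/evaluation.txt"),
]

def check_pairs(root, pairs):
    """Compare each file pair byte-by-byte; return (path_a, path_b, identical) tuples."""
    results = []
    for a, b in pairs:
        same = filecmp.cmp(os.path.join(root, a), os.path.join(root, b), shallow=False)
        results.append((a, b, same))
    return results

# usage (after the optimization has finished):
# for a, b, same in check_pairs("optimization", PAIRS):
#     print("identical" if same else "DIFFERENT", a, "<->", b)
```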
2.10.2. Scripting: Random split of a dataset into training and validation sets
The following shows how to do a random split into training and validation sets using scripting. How to do it in the GUI was explained above.
Download LJ_Ar_example.zip, or make a copy of the directory $AMSHOME/scripting/scm/params/examples/LJ_Ar. These are the same files that were used in the first tutorial.
Modify params.conf.py by adding the following to the bottom of the file:
validation = 0.40
This will split the dataset such that roughly 40% of the dataset entries enter the validation set, and the remaining 60% enter the training set.
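Conceptually, the split works like the following (an illustrative sketch, not the actual ParAMS implementation; the helper name and seed handling are made up for the example):

```python
import random

def random_split(entries, validation_fraction, seed=None):
    """Randomly partition entries into (training, validation) lists,
    placing roughly validation_fraction of them in the validation set."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]

# the five data set entries from the explicit example above
entries = ["energy('Ar32_frame001')-energy('Ar32_frame002')",
           "energy('Ar32_frame002')",
           "forces('Ar32_frame001')",
           "forces('Ar32_frame002')",
           "forces('Ar32_frame003')"]
training, validation = random_split(entries, 0.40, seed=0)
print(len(training), len(validation))  # 3 2
```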
You can then run the optimization as normal:
"$AMSBIN/params" optimize
To find out which dataset entries entered the validation set or training set, look in the optimization/settings_and_initial_data/data_sets directory, which contains the two files training_set.pkl.gz and validation_set.pkl.gz. You can load these files using params as a Python library with the help of the DataSet class.
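As a generic starting point, such gzip-compressed pickle files can be opened with the Python standard library (a sketch; in practice the params DataSet class is the supported route, and unpickling the real files requires the scm.params package to be importable):

```python
import gzip
import pickle

def load_gzipped_pickle(path):
    """Load any gzip-compressed pickle file, such as training_set.pkl.gz.
    Note: unpickling the actual ParAMS files requires scm.params installed."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

# usage:
# training_set = load_gzipped_pickle(
#     "optimization/settings_and_initial_data/data_sets/training_set.pkl.gz")
```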