2.4. Import training data (Python)

Note

For the corresponding GUI tutorial, see Import training data (GUI).

This tutorial will show you how to use a ResultsImporter to add the following training data for a water molecule:

The training data comes from a UFF force field geometry optimization. The figure below shows some of the properties: the O atomic charge -0.834, the H atomic charge +0.417, the O-H bond length 0.990 Å, and the H-O-H angle 104.5°.

../../_images/water_UFF.png

Tip

Before starting the tutorial, get familiar with the Data Set class, in particular the Demonstration: Working with a DataSet.

To follow along with the tutorial, download

2.4.1. Run a reference job

ResultsImporters work on finished reference jobs. Here, we run a geometry optimization on a water molecule with the UFF force field. The calculation is run from python with PLAMS, but you can also run it with the graphical user interface and skip to Water molecule reference data.

from scm.params import *
from scm.plams import *
init(path='/tmp', folder='demo_results_importer') # set path=None to use the current working directory, for example if you do not have /tmp on your system

# set up a water molecule, can also be set up as mol = Molecule('water.yz') or mol = from_smiles('O')
mol = Molecule()
mol.add_atom(Atom(symbol='O', coords=(0,0,0)))
mol.add_atom(Atom(symbol='H', coords=(1,0,0)))
mol.add_atom(Atom(symbol='H', coords=(-0.7,0.7,0)))

ref_settings = Settings()
ref_settings.runscript.nproc = 1   # run jobs in serial
ref_settings.input.ams.task = 'GeometryOptimization'
ref_settings.input.forcefield.type = 'UFF'        # run UFF reference calculation
#ref_settings.input.adf.xc.gga = 'PBE'            # run DFT reference calculation with ADF
#ref_settings.input.band.xc.gga = 'PBE'           # run DFT reference calculation with BAND
#ref_settings.input.dftb.model = 'GFN1-xTB'       # run DFTB reference calculation
#ref_settings.input.reaxff.forcefield = 'CHO.ff'  # run ReaxFF reference calculation

water_ams_results_file = AMSJob(settings=ref_settings,
                                molecule=mol,
                                name='water_ref_geo_opt').run().rkfpath()

print("AMS results file located in " + water_ams_results_file)
[01.02|09:00:42] PLAMS working folder: /tmp/demo_results_importer
[01.02|09:00:42] JOB water_ref_geo_opt STARTED
[01.02|09:00:42] JOB water_ref_geo_opt RUNNING
[01.02|09:00:42] JOB water_ref_geo_opt FINISHED
[01.02|09:00:42] JOB water_ref_geo_opt SUCCESSFUL
AMS results file located in /tmp/demo_results_importer/water_ref_geo_opt/ams.rkf

This creates a directory containing the ams.rkf result file. You can open it in AMSmovie to visualize the trajectory.

../../_images/water_UFF_AMSmovie.png

2.4.2. Water molecule reference data

The goal of a ParAMS parametrization is to, as closely as possible, reproduce results from reference calculations. Four types of results will be demonstrated for the water molecule reference calculation:

  • The calculated charges and forces at the final geometry. The corresponding job during the parametrization is a SinglePoint on the optimized geometry from the reference calculation.
  • The optimized O-H bond length and H-O-H angle with the parametrized engine. The corresponding job during the parametrization is a GeometryOptimization starting from the optimized geometry from the reference calculation.
  • Relative energies between multiple snapshots from the geometry optimization trajectory, and forces from those same snapshots.
  • A reaction energy, in this case the propene combustion energy for the reaction C\(_3\)H\(_6\)(g) + (9/2) O\(_2\)(g) → 3 CO\(_2\)(g) + 3 H\(_2\)O(g).

2.4.3. Initialize the ResultsImporter

You can set the preferred units that you like to work in when initializing the ResultsImporter. Here, we set the energy unit to kcal/mol, and the force unit to kcal/mol/angstrom. If no units are set, some default units will be used.

You can also specify the default maximum number of iterations for geometry optimization jobs. During the parametrization, it is often convenient to limit the maximum number of steps for individual geometry optimization jobs, so that the job is guaranteed to finish in a reasonable amount of time, even for unreasonable parameter combinations.

results_importer_settings = Settings()
results_importer_settings.units.energy = 'kcal/mol'
results_importer_settings.units.forces = 'kcal/mol/angstrom'
results_importer_settings.default_go_settings.MaxIterations = 20   # set to False to have unlimited iterations
ri = ResultsImporter(settings=results_importer_settings)

The ri variable has a Job Collection, Engine Collection, and Data Set (training set):

print(type(ri.job_collection))
print(type(ri.engine_collection))
print(type(ri.data_set))
<class 'scm.params.core.jobcollection.JobCollection'>
<class 'scm.params.core.engines.EngineCollection'>
<class 'scm.params.core.dataset.DataSet'>

2.4.4. Add a singlepoint calculation on the optimized geometry

The jobs in the job collection are run repeatedly during the ParAMS optimization. They can have any task, for example SinglePoint or GeometryOptimization.

ResultsImporters add jobs to the job collection. The molecular structure is taken from the reference job, but the Task is not. The Task for jobs added via a results importer defaults to SinglePoint, even if the reference job was a GeometryOptimization!

ri.add_singlejob(water_ams_results_file,   # path to ams.rkf or to the jobname.results folder
                 properties=['charges', 'forces'],
                 name='water_singlepoint') # custom name for this job
["charges('water_singlepoint')", "forces('water_singlepoint')"]

The add_singlejob results importer has affected the JobCollection, DataSet, and EngineCollection inside the ri variable. It returns the expressions added to the DataSet (see more below).

The job ‘water_singlepoint’ has been added to the job collection:

print(ri.job_collection)
---
dtype: JobCollection
version: '2022.101'
---
ID: 'water_singlepoint'
ReferenceEngineID: forcefield;;type;UFF;
AMSInput: |
   properties
     gradients Yes
   End
   system
     Atoms
                 O      -0.0546774806      -0.1402484791       0.0000000000
                 H       0.9005794476       0.1209122488       0.0000000000
                 H      -0.5459019670       0.7193362302       0.0000000000
     End
   End
   task singlepoint
Origin: /tmp/demo_results_importer/water_ref_geo_opt/ams.rkf
OriginalEnergyHartree: 1.280587231654336e-07
_Hash_1: 2f7e7abbc8b0b00713beb419d3a5d8cd3b7bb2dbfa536f87b345ec8715d450d6
_Hash_2: 23176273e7d0a59e60f2ebb9b58b048d608e41d0cfd6b50bda3c427fc6876475
...

Above, the AMSInput block contains input to the AMS Driver, defining a job that will be repeatedly run during the parametrization. The Atoms block contains the optimized water molecule from the reference UFF calculation, and the task is set to SinglePoint. Because we requested the forces, properties gradients yes is set. This guarantees that we will be able to extract the forces during the parametrization.

The Origin, OriginalEnergyHartree, Hash_1 and Hash_2 are only used internally by the ResultsImporter. For example, the hashes help to prevent adding the same job twice to the job collection.

The ReferenceEngineID contains a reference to an engine in the EngineCollection. This entry in the engine collection contains the settings for the reference job that was run in the beginning. It is not strictly needed for the parametrization.

print(ri.engine_collection)
---
dtype: EngineCollection
version: '2022.101'
---
ID: 'forcefield;;type;UFF;'
AMSInput: |
   Engine forcefield
     type UFF
   EndEngine
...

The data_set (training set) contains two entries

print(ri.data_set)
---
dtype: DataSet
version: '2022.101'
---
Expression: charges('water_singlepoint')
Weight: 1.0
Sigma: 0.1
ReferenceValue: |
   array([-0.83399999,  0.417     ,  0.417     ])
Unit: au, 1.0
Group: Charges
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_singlepoint
---
Expression: forces('water_singlepoint')
Weight: 1.0
Sigma: 3.557463138142747
ReferenceValue: |
   array([[ 0.25021751, -0.01895832, -0.        ],
          [-0.03370832, -0.12831032, -0.        ],
          [-0.21650919,  0.14726863, -0.        ]])
Unit: kcal/mol/angstrom, 1185.8210460475823
Group: Forces
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_singlepoint
...

The two data_set entries have ReferenceValues taken from the reference job. The Unit for the forces is kcal/mol/angstrom, which was specified as the preferred unit when initializing the ResultsImporter. The Sigma values correspond to the default sigma values for the charges and forces extractors. The Weight default value is 1.0

The results importer has added three pieces of metadata: Group, SubGroup, and INFO_ReferenceEngineIDs. The Group specifies the type of quantity, and the SubGroup from which job the reference data comes. The Group and SubGroup can be used to help with postprocessing results. The INFO_ReferenceEngineIDs contains all the reference engines that were used in the calculation of the reference value.

You can change the Weight or Sigma just like in a standalone DataSet:

ri.data_set[0].sigma = 0.2
ri.data_set[0].weight = 3.14
print(ri.data_set[0])
---
Expression: charges('water_singlepoint')
Weight: 3.14
Sigma: 0.2
ReferenceValue: |
   array([-0.83399999,  0.417     ,  0.417     ])
Unit: au, 1.0
Group: Charges
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_singlepoint

Alternatively, you can also specify the weight and sigma values when calling the add_singlejob results importer (see next section).

2.4.5. Add a geometry optimization job extracting the bond length and bond angle

To fit structural results like bond lengths or bond angles, it is necessary to run GeometryOptimization jobs during the parametrization. Use the task keyword in add_singlejob().

The properties can either be given in a list as before, or a dict with more details (i.e., weight, sigma, and unit).

The distance and angle extractors accept the atom ids of the atoms. The atom ids start with 0, so atom 0 is the first atom (the O atom), atom 1 is the second atom (one of the H atoms), and atom 2 is the the third atom (the other H atom).

ri.add_singlejob(water_ams_results_file,         # path to ams.rkf
                 properties={
                     'distance(0,1)': {  # extract O-H bond length
                         'weight': 2.0,
                         'sigma': 0.1,
                      },
                      'angle(1,0,2)': {}, #  H-O-H angle, use default weight and sigma
                 },
                 task='GeometryOptimization',    # set task to GeometryOptimization
                 name='water_geo_opt')           # set a custom name
["distance('water_geo_opt',0,1)", "angle('water_geo_opt',1,0,2)"]
print(ri.job_collection['water_geo_opt'])
ReferenceEngineID: forcefield;;type;UFF;
AMSInput: |
   geometryoptimization
     MaxIterations 20
     PretendConverged Yes
   End
   system
     Atoms
                 O      -0.0546774806      -0.1402484791       0.0000000000
                 H       0.9005794476       0.1209122488       0.0000000000
                 H      -0.5459019670       0.7193362302       0.0000000000
     End
   End
   task geometryoptimization
Origin: /tmp/demo_results_importer/water_ref_geo_opt/ams.rkf
OriginalEnergyHartree: 1.280587231654336e-07
_Hash_1: 5d9741178ce4d1ce13e4797af21815ab3346dfca206c2da9a934f72b091ec656
_Hash_2: 12ae5bc2c4266f8eda52b423d2d8a74b409d2fc573992d30ba7b4745d2fdf764

Above, the task is set to GeometryOptimization as requested. There’s also a geometryoptimization block setting the maximum number of iterations to 20, as was specified in the ResultsImporter initialization.

print(ri.data_set[-2:])   # the two last added entries
[---
Expression: distance('water_geo_opt',0,1)
Weight: 2.0
Sigma: 0.1
ReferenceValue: 0.990313447773963
Unit: angstrom, 1.0
Group: Distances
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_geo_opt
, ---
Expression: angle('water_geo_opt',1,0,2)
Weight: 1.0
Sigma: 2.0
ReferenceValue: 104.45592580880005
Unit: degree, 1.0
Group: Angles
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_geo_opt
]

The UFF-optimized water molecule has an O-H bond length of 0.99 angstrom, and an H-O-H angle of 104.46 degrees.

2.4.6. Add a trajectory

If you have a trajectory file, for example from a geometry optimization or molecular dynamics simulation, you can extract reference results from N equally spaced frames in the trajectory with the add_trajectory_singlepoints() results importer. The following properties are supported:

  • energy
  • relative_energies
  • forces
  • stresstensor

The properties can only be extracted if they are stored on the trajectory file. For example, forces are not stored by default in AMS MD simulations, but must be enabled before running the simulation.

Below the relative_energies and forces from the UFF geometry optimization on the water molecule are extracted. The relative_to: min_global ensures that the global minimum is included, and that energies are relative to that one. Setting N = 4 will give 3 relative energies and 4 forces added to the data_set.

new_expr = ri.add_trajectory_singlepoints(water_ams_results_file,
                             properties={
                                'relative_energies': {
                                    'relative_to': 'min_global',
                                },
                                'forces': {},
                             },
                             N=4) # get 4 frames

# print only expression and reference value, the full data_set entry can also be printed: print(ri.data_set[expression])
for expression in new_expr:
    print(expression)
    print(ri.data_set[expression].reference,'\n')
energy('water_ref_geo_opt_frame001')-energy('water_ref_geo_opt_frame041')
13.453134044851748

energy('water_ref_geo_opt_frame014')-energy('water_ref_geo_opt_frame041')
0.33548696220146534

energy('water_ref_geo_opt_frame027')-energy('water_ref_geo_opt_frame041')
0.00018039405874487025

forces('water_ref_geo_opt_frame001')
[[-19.33638687 -71.03436494  -0.        ]
 [-11.76011501  38.24820485  -0.        ]
 [ 31.09650188  32.7861601   -0.        ]]

forces('water_ref_geo_opt_frame014')
[[ 10.69896839  17.22055512  -0.        ]
 [-20.71034462  -3.62029166  -0.        ]
 [ 10.01137623 -13.60026346  -0.        ]]

forces('water_ref_geo_opt_frame027')
[[ 0.04078747 -0.68772989 -0.        ]
 [ 0.30451539  0.08345281 -0.        ]
 [-0.34530285  0.60427708 -0.        ]]

forces('water_ref_geo_opt_frame041')
[[ 0.25021751 -0.01895832 -0.        ]
 [-0.03370832 -0.12831032 -0.        ]
 [-0.21650919  0.14726863 -0.        ]]

The job collection contains single point calculations with molecules taken from the respective frames in the reference trajectory. Above we didn’t specify a name when calling add_trajectory_singlepoints, so the name is guessed from the reference job.

For example, water_ref_geo_opt_frame001 corresponds to the input molecule defined at the beginning of this demonstration:

print(ri.job_collection['water_ref_geo_opt_frame001'])
ReferenceEngineID: forcefield;;type;UFF;
AMSInput: |
   properties
     gradients Yes
   End
   system
     Atoms
                 O       0.0000000000       0.0000000000       0.0000000000
                 H       1.0000000000       0.0000000000       0.0000000000
                 H      -0.7000000000       0.7000000000       0.0000000000
     End
   End
   task singlepoint
Frame: 1
Origin: /tmp/demo_results_importer/water_ref_geo_opt/ams.rkf
OriginalEnergyHartree: 0.02143906182312575

The add_trajectory_singlepoints results importer adds another metadata to the job_collection: Frame, corresponding to the frame in the Origin file from which the input molecule was taken.

The frames to extract can also be specified with

  • add_trajectory_singlepoints(..., start=0, end=40, step=10) to extract every 10th frame between steps 0 (inclusive) and 40 (exclusive), or
  • add_trajectory_singlepoints(..., indices=[0,4,5]) to extract the first, fifth, and sixth frames.

For more details, see add_trajectory_singlepoints().

2.4.7. Add a reaction energy

Add a reaction energy with the add_reaction_energy results importer. The chemical reaction will be automatically balanced.

Here we demonstrate propene combustion: C\(_3\)H\(_6\)(g) + (9/2) O\(_2\)(g) → 3 CO\(_2\)(g) + 3 H\(_2\)O(g). We will first need to calculate three more reference jobs for C\(_3\)H\(_6\) (propene), O\(_2\) and CO\(_2\). Note that this reaction energy is described very poorly with UFF (ΔH⁰ = -1.4 kcal/mol, experimental value: -491.8 kcal/mol). To get a more realistic energy, use a different reference engine.

Below, the from_smiles function from PLAMS converts a SMILES string to a Molecule. Before running the jobs, init() needs to have been called (that was done in Run a reference job).

results_files = {}
for name, smiles in ('O2', 'O=O'), ('propene', 'CC=C'), ('CO2', 'O=C=O'):
    results_files[name] = AMSJob(settings=ref_settings, molecule=from_smiles(smiles), name=name).run().rkfpath()
print(results_files)
[01.02|09:00:43] JOB O2 STARTED
[01.02|09:00:43] JOB O2 RUNNING
[01.02|09:00:43] JOB O2 FINISHED
[01.02|09:00:43] JOB O2 SUCCESSFUL
[01.02|09:00:43] JOB propene STARTED
[01.02|09:00:43] JOB propene RUNNING
[01.02|09:00:43] JOB propene FINISHED
[01.02|09:00:43] JOB propene SUCCESSFUL
[01.02|09:00:43] JOB CO2 STARTED
[01.02|09:00:43] JOB CO2 RUNNING
[01.02|09:00:43] JOB CO2 FINISHED
[01.02|09:00:43] JOB CO2 SUCCESSFUL
{'O2': '/tmp/demo_results_importer/O2/ams.rkf', 'propene': '/tmp/demo_results_importer/propene/ams.rkf', 'CO2': '/tmp/demo_results_importer/CO2/ams.rkf'}
ri.add_reaction_energy(reactants=(results_files['propene'], results_files['O2']),
                       products=(results_files['CO2'], water_ams_results_file),
                       normalization='r0',   # make the coefficient for the first reactant (propene)
                       normalization_value=1.0, #  == 1.0
                       task='GeometryOptimization',
                       sigma=5.0)
print(ri.data_set[-1])
---
Expression: +3.0*energy('CO2')+3.0*energy('water_geo_opt')-1.0*energy('propene')-4.5*energy('O2')
Weight: 1.0
Sigma: 5.0
ReferenceValue: -1.399510907987769
Unit: kcal/mol, 627.5094737775374
Group: ReactionEnergy
INFO_ReferenceEngineIDs: forcefield;;type;UFF;

The add_reaction_energy results importer also works for reactions without reference jobs. If you know the reaction energy, for example from experimental data or from computational codes not supported by ParAMS, it is enough to simply pass in Molecules instead of ams.rkf files. This is demonstrated below for the reaction C\(_2\)H\(_4\)(g) + HCl(g) → C\(_2\)H\(_5\)Cl(g).

ri.add_reaction_energy(reactants=[from_smiles('C=C'), from_smiles('Cl')],
                       products=[from_smiles('CCCl')],
                       reactants_names=['ethylene', 'HCl'],
                       products_names=['chloroethane'],
                       task='GeometryOptimization',
                       normalization='r0',
                       reference=-17.1,
                       unit='kcal/mol',
                       metadata={
                           'Source': 'J. Appl. Chem. USSR, 1979, 52, 1439-1442',
                       })
print(ri.data_set[-1])
---
Expression: +1.0*energy('chloroethane')-1.0*energy('ethylene')-1.0*energy('HCl')
Weight: 1.0
Sigma: 1.2550189475550748
ReferenceValue: -17.1
Unit: kcal/mol, 627.5094737775374
Group: ReactionEnergy
Source: J. Appl. Chem. USSR, 1979, 52, 1439-1442

The jobs added to the job collection will be run during the parametrization. The ReferenceEngineID is set to None, since the molecule did not come from a reference calculation but simply as a set of XYZ coordinates. For example, the job 'chloroethane' looks like this:

print(ri.job_collection['chloroethane'])
ReferenceEngineID: None
AMSInput: |
   geometryoptimization
     MaxIterations 20
     PretendConverged Yes
   End
   system
     Atoms
                 C       0.7764540253      -0.0718094540       0.0077352022
                 C      -0.7405504503      -0.0784441089      -0.0210560326
                Cl      -1.2082117678       1.6413909680      -0.1197007086
                 H       1.1086665750       0.3584893732       0.9692166602
                 H       1.2119761077       0.4960752072      -0.8245467312
                 H       1.0858003363      -1.1282582526      -0.1074755487
                 H      -1.0878268370      -0.5387769551       0.9353459698
                 H      -1.1463079893      -0.6786667777      -0.8395188111
     End
   End
   task geometryoptimization

2.4.8. Training set, validation set, and other data sets

ParAMS can handle an arbitrary number of data sets. Often during parametrization, it is a good idea to have a validation set. During the parameter fitting, the loss function for the training set is minimized, and the loss function for the validation set is monitored to make sure that there is no overfitting.

By default, the data set entries added with a results importer are added to the training set. To specify a different data set, simply add for example data_set='validation_set' to the results importer. For example:

ri.add_singlejob(water_ams_results_file, properties=['energy'], data_set='validation_set')
["energy('water_singlepoint')"]

The training set can be accessed with either ri.data_set or ri.get_data_set('training_set'). The validation set is accessed with ri.get_data_set('validation_set'):

print(ri.get_data_set('validation_set'))
---
dtype: DataSet
version: '2022.101'
---
Expression: energy('water_singlepoint')
Weight: 1.0
Sigma: 1.2550189475550748
ReferenceValue: 8.035806198616458e-05
Unit: kcal/mol, 627.5094737775374
Group: TotalEnergy
INFO_ReferenceEngineIDs: forcefield;;type;UFF;
SubGroup: water_singlepoint
...

2.4.9. Exit PLAMS

If you used PLAMS to run the reference jobs, the finish() function should be called at the end, if init() was called at the beginning.

finish()
[01.02|09:00:43] PLAMS run finished. Goodbye

2.4.10. Save to disk

The store() method saves

  • the job collection to job_collection.yaml
  • the training set to training_set.yaml
  • any other data sets, e.g. validation_set.yaml a
  • the engine collection to engine_collection.yaml
  • the settings used for the results importer to results_importer_settings.yaml
ri.store(folder='saved_files', backup=True)
['saved_files/job_collection.yaml',
 'saved_files/results_importer_settings.yaml',
 'saved_files/training_set.yaml',
 'saved_files/validation_set.yaml']

You can later load those files into another results importer in a separate script. It is in general recommended to repeatedly extend a dataset, rather than to create many small datasets and try to merge them all at once. This is because entries in the job_collection and data_set need to be unique.

To load from files, initialize the results importer with

new_ri = ResultsImporter(job_collection='/path/job_collection.yaml',
                         data_set={'training_set': '/path/training_set.yaml', 'validation_set': '/path/validation_set.yaml'},
                         engine_collection='/path/engine_collection.yaml',
                         settings='/path/results_importer_settings.yaml')

2.4.11. More ResultsImporters

This tutorial showed you the add_singlejob, add_trajectory_singlepoints, and add_reaction_energy ResultsImporters. It is possible to set even more options than were shown here. For details, see the ResultsImporters API.

You can also use the add_neb_singlepoints, add_pesscan_singlepoints and add_pesexploration_singlepoints for AMS NEB, PES scans and PES exploration jobs. For details, see the documentation pages.