trainset.in
Description of the trainset.in file
The trainset.in file contains the training set data and tells the program how to calculate the cost function \(F = \sum_i \left( (y_i - y^{ref}_i) / acc_i \right)^2\), which can be used to optimize the force field parameters. Here \(y_i\) is a value computed with the force field, \(y^{ref}_i\) is the corresponding reference value and \(acc_i\) is the target accuracy of that data point. The trainset.in file uses molecule identifiers, or keys, defined in the DESCRP field of the geo file (in BGF format) or in the models.in file, to compare force-field-derived geometries and energy differences to the reference values.

The trainset.in file has a free format as far as numbers are concerned, although it does require that fields are space-separated. In addition, the “-”, “+” and “/” symbols have a special meaning in the trainset.in file and should not be used in identifiers. The trainset.in file is divided into the six sections listed below. Each section begins with a start keyword and ends with the corresponding end keyword. The words in “CELL PARAMETERS” and “ENDCELL PARAMETERS” must be separated by exactly one space.
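As a rough illustration of the cost function above, the following minimal Python sketch sums the squared, accuracy-scaled deviations. The function name `cost_function`, the tuple layout `(y, y_ref, acc)` and the numerical values are assumptions made for this example only; they are not part of the trainset.in format or of the program itself.

```python
def cost_function(entries):
    """Return F = sum(((y - y_ref) / acc) ** 2) over all data points.

    entries: iterable of (y, y_ref, acc) tuples (hypothetical layout).
    """
    return sum(((y - y_ref) / acc) ** 2 for y, y_ref, acc in entries)


# Two made-up data points, for illustration only.
example_entries = [
    (1.52, 1.50, 0.01),   # e.g. a bond length with acc = 0.01
    (-88.5, -90.0, 1.5),  # e.g. an energy difference with acc = 1.5
]
total_error = cost_function(example_entries)
```

In this sketch the first point deviates by two accuracy units and contributes about 4, while the second deviates by one unit and contributes 1, which shows how a smaller Acc makes a data point count more heavily.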
Sections format
Block name | Start keyword | End keyword | Format |
charges | CHARGE | ENDCHARGE | Key Acc Atom Ref |
geometries | GEOMETRY | ENDGEOMETRY | Key Acc [Atom1 [Atom2 [Atom3 [Atom4]]]] Ref |
forces | FORCES | ENDFORCES | Key Acc Atom Ref |
cell parameters | CELL PARAMETERS | ENDCELL PARAMETERS | Key Acc Type Ref |
energy differences | ENERGY | ENDENERGY | Acc [+-] Key1/n1 ... [+-] Key5/n5 Ref |
heat of formation | HEATFO | ENDHEATFO | Key Acc Ref |
“Key” is the molecule name from the geo file. “Atom” is an atom index in the corresponding molecule. “Acc” is a value of the target accuracy desired for the given error function contribution. This value is often called “weight” although in practice it is 1/weight. “Ref” is the reference value.
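The start/end keyword structure in the table can be sketched as a small Python splitter. This is only an illustration, not the program's own reader: the names `SECTIONS` and `split_sections` are hypothetical, and the treatment of `#` as the start of a comment is an assumption based on the examples in this document.

```python
# Start keyword -> end keyword, as listed in the table above.
SECTIONS = {
    "CHARGE": "ENDCHARGE",
    "GEOMETRY": "ENDGEOMETRY",
    "FORCES": "ENDFORCES",
    "CELL PARAMETERS": "ENDCELL PARAMETERS",
    "ENERGY": "ENDENERGY",
    "HEATFO": "ENDHEATFO",
}


def split_sections(text):
    """Return {start_keyword: [data lines]} for every section found."""
    sections = {}
    current = None  # start keyword of the section being read, if any
    for raw in text.splitlines():
        # Assumption: everything after '#' is a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if current is None:
            if line in SECTIONS:
                current = line
                sections.setdefault(current, [])
        elif line == SECTIONS[current]:
            current = None
        else:
            sections[current].append(line)
    return sections


sample = """CHARGE
#Key Acc Atom Ref
chexane 0.1 1 -0.15
ENDCHARGE
CELL PARAMETERS
chex_cryst 0.01 a 11.20
ENDCELL PARAMETERS
"""
parsed = split_sections(sample)
```

Matching whole lines against the keywords keeps the single-space requirement for “CELL PARAMETERS” explicit: a line with two spaces between the words would not be recognized by this sketch.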
Format description
In all sections except “ENERGY”, each data line starts with the structure identifier (the Key), followed by the “Acc” of the data point. This is followed by a type identifier. The sections contain the following data entries:
- CHARGE
In the CHARGE section the type identifier is the number of the atom in the molecule and the reference value is its charge. Example:
    CHARGE
    #Key     Acc  Atom  Ref
    chexane  0.1  1     -0.15
    ENDCHARGE
- GEOMETRY
In the GEOMETRY section the type ID is the list of atoms defining an internal coordinate (two for an interatomic distance, three for a valence angle and four for a torsion angle). When only one atom index is specified, the Euclidean distance between the positions of that atom in the two geometries is calculated. When the index is -1, the average Euclidean distance between the two geometries is used instead. Please note that a reference value different from zero does not make much sense for the Euclidean distances. Moreover, these distances are computed in Cartesian coordinates, so a simple translation of the molecule during energy minimization may result in large Euclidean distances for otherwise similar geometries. If no identifier is provided, the ReaxFF RMS force is compared with the reference (which should probably be zero in most cases). Example:
    GEOMETRY
    #Key     Acc   At1 At2 At3 At4  Ref
    chexane  0.01  1                0.0    # Euclidean distance between atom in the reference and the trial structure
    chexane  0.01  -1               0.0    # Average Euclidean distance between atoms in the two structures
    chexane  0.01  1   2            1.5    # Interatomic distance
    chexane  1.00  1   2   3        120.0  # Valence angle
    chexane  1.00  1   2   3   4    180.0  # Torsion angle
    chexane  1.00                   0.0    # RMS force
    ENDGEOMETRY
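The rule that the number of atom indices selects the kind of geometry entry can be sketched as follows. This is an illustrative reading of the description above, not the program's parser: the function name `classify_geometry_line` and the returned labels are hypothetical, and the handling of `#` comments is an assumption based on the example.

```python
def classify_geometry_line(line):
    """Split one GEOMETRY data line into (key, acc, kind, atoms, ref)."""
    # Assumption: everything after '#' is a comment.
    fields = line.split("#", 1)[0].split()
    if len(fields) < 3:
        raise ValueError("expected at least: Key Acc Ref")
    key, acc, ref = fields[0], float(fields[1]), float(fields[-1])
    atoms = [int(tok) for tok in fields[2:-1]]  # zero to four atom indices

    if len(atoms) == 0:
        kind = "rms_force"
    elif len(atoms) == 1:
        # Index -1 selects the average over all atoms.
        kind = "average_displacement" if atoms[0] == -1 else "atom_displacement"
    elif len(atoms) == 2:
        kind = "distance"
    elif len(atoms) == 3:
        kind = "valence_angle"
    elif len(atoms) == 4:
        kind = "torsion_angle"
    else:
        raise ValueError("at most four atom indices are allowed")
    return key, acc, kind, atoms, ref


distance_entry = classify_geometry_line("chexane 0.01 1 2 1.5 # Interatomic distance")
force_entry = classify_geometry_line("chexane 1.00 0.0 # RMS force")
```

Because the reference value is always the last field and the Key and Acc are always the first two, whatever lies between them can be counted directly to decide the entry type.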
- CELL PARAMETERS
In the CELL PARAMETERS section the type IDs are names of the corresponding lattice parameters. Example:
    CELL PARAMETERS
    #Key        Acc   Type   Ref
    chex_cryst  0.01  a      11.20
    chex_cryst  0.01  b      11.20
    chex_cryst  0.01  c      11.20
    chex_cryst  0.01  alpha  90.00
    chex_cryst  0.01  beta   90.00
    chex_cryst  0.01  gamma  90.00
    ENDCELL PARAMETERS
- HEATFO
The HEATFO section does not require a type ID, as it compares the ReaxFF heat of formation with the reference value. Example:
    HEATFO
    #Key     Acc   Ref
    methane  2.00  -17.80
    ENDHEATFO
- ENERGY
This section allows comparison of ReaxFF energy differences between structures to the reference data. In this case, each data line starts with the Acc of the data point, followed by up to five operator/identifier/divider parts, and finishes with the reference value. The operator is either ‘+’ or ‘-’ (‘+’ is the default). The energy associated with the identifier is divided by the divider, allowing comparison of condensed structures to monomers. The ‘/’ character in the ENERGY section data lines is optional. Example:
    ENERGY
    #Acc  op1 Key1 n1   op2 Key2 n2    DeltaE
    1.5   + butbenz/1   - butbenz_a/1  -90.00
    1.5   + butbenz/1   - butbenz_b/1  -71.00
    1.5   + butbenz/1   - butbenz_c/1  -78.00
    ENDENERGY
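The structure of an ENERGY data line can be illustrated with a short Python sketch that reads the Acc, the operator/identifier/divider parts and the reference value, and then combines per-structure energies accordingly. The function names `parse_energy_line` and `energy_difference` are hypothetical, the energies in the usage lines are made up, and the sketch assumes that operators are written as separate space-delimited tokens, as in the example above.

```python
def parse_energy_line(line):
    """Return (acc, terms, ref); terms is a list of (sign, key, divider)."""
    # The '/' is optional, so treat it like a space. Assumption: '#' starts a comment.
    fields = line.split("#", 1)[0].replace("/", " ").split()
    if len(fields) < 4:
        raise ValueError("expected: Acc [+-] Key n ... Ref")
    acc, ref = float(fields[0]), float(fields[-1])

    terms = []
    sign = 1.0  # '+' is the default operator
    middle = iter(fields[1:-1])
    for tok in middle:
        if tok in ("+", "-"):
            sign = 1.0 if tok == "+" else -1.0
            continue
        divider = next(middle, None)  # the token after a key is its divider
        if divider is None:
            raise ValueError("missing divider after key " + tok)
        terms.append((sign, tok, float(divider)))
        sign = 1.0  # reset to the default for the next part
    if not 1 <= len(terms) <= 5:
        raise ValueError("expected between one and five key/divider parts")
    return acc, terms, ref


def energy_difference(terms, energies):
    """Combine per-structure energies as sum(sign * E[key] / divider)."""
    return sum(sign * energies[key] / n for sign, key, n in terms)


acc, terms, ref = parse_energy_line("1.5 + butbenz/1 - butbenz_a/1 -90.00")
# Made-up energies, chosen only to illustrate the arithmetic.
delta_e = energy_difference(terms, {"butbenz": -100.0, "butbenz_a": -10.0})
```

The value `delta_e` obtained this way is what would be compared with `ref`, scaled by `acc`, in the cost function.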