3.4. Extractors and Comparators

The previous section described how an arbitrary linear combination of properties extracted from jobs \(P(J)\) can be added to a Data Set for fitting. In this chapter, we are going to describe what happens when entries are added and evaluated by the Data Set and how the user can extend the number of properties that can be fitted with ParAMS.

We have briefly described the extractors in the Data Set: They are a collection of Python code snippets available to the DataSet instance which define how to extract an individual property \(P\) from a job. The extractors available to each DataSet can be checked with the extractors attribute:

>>> ds = DataSet()
>>> ds.extractors
{'angles', 'vibfreq', 'charges', 'distance', 'stresstensor', 'energy', 'forces', 'hessian', 'dihedral'}

Any expression passed to the DataSet.add_entry() method can be constructed from the available extractors, for example "angles('myJob1')".

The design of the Data Set allows for an easy extension of the extractors to suit personal needs beyond what is already provided in the package. Users are encouraged to define additional extractors whenever a new property becomes relevant for the fitting process. In the following example, we will add the functionality to extract and evaluate elastic tensors to the Data Set:

>>> 'elastictensor' in ds.extractors
False

Start by creating an elastictensor.py in an empty directory of your choice with the following contents:

from numpy import ndarray, asarray

sigma = 1e-2
def extract(amsresults) -> ndarray:
  return asarray(amsresults.get_engine_results('ElasticTensor'))

This is a minimal and complete definition of our extractor. The code snippet can be saved under any accessible path and used by providing DataSet with the corresponding directory. The base file name will be used as the extractor’s name (make sure not include extractors with the same names). Here, we have saved the above under path/to/extractors/:

>>> ds = DataSet(more_extractors='path/to/extractors/')
>>> 'elastictensor' ds.extractors
True

Because the Data Set natively works with plams.AMSResults, we were able to take a shortcut in the creation of our extractor: All we needed to do in this case is wrap a PLAMS method around the extract() function. Note that in addition to the function definition, we also provided a sigma variable. This serves as a default value when a related entry is added with through DataSet.add_entry() and should roughly be in the same order of magnitude as the “accepted accuracy” for this property. ParAMS will evaluate all entries according to \((w/\sigma)(y-\hat{y})\), where \(w\) is the weight. Providing a sigma is not mandatory but highly recommended, any extractor without a sigma will set it’s value to 1.

Important

  • An extractor stored as basename.py will be available through the basename string at runtime
  • For every extractor, the extract() function returns the property value from a job
  • ParAMS expects the first argument passed to any extract() function to be a plams.AMSResults instance (see PLAMS documentation)
  • It is recommended to define a sigma:float variable, which should be in the same order of magnitude as the prediction accuracy for this property

However, because the definition of extract() is completely up to the user, it can be made to read and process data from any other source as well. Assuming that a property is stored in a text file, an extractor that is independent of plams.AMSResults could look like this:

def extract(_, filepath):
  # AMSResults instance: `_` is ignored.
  with open(filepath) as f:
    return float(f.read())

3.4.1. Supported Data Structures

The above definition of our elastic tensor extractor returns a numpy array to the DataSet instance when evaluated, however, in theory extractors can return an arbitrary data structure which calls for additional processing to make all results consistent. Internally, each time DataSet.evaluate() is called, reference and predicted results are reduced to one weighted vector of residuals \((\boldsymbol{w}/\boldsymbol{\sigma})(\boldsymbol{y} - \boldsymbol{\hat{y}})\), which is then passed to a loss function and evaluated. This implies that any two return values of the same extractor support subtraction and multiplication (which is why we convert the the elastic tensor to a numpy array before returning). For anything that does not, a custom compare() function has to be implemented alongside extract().

3.4.2. Custom Comparators

There might be cases where an extractor either returns a data type that does not support mathematical operations or the quality of a reference and predicted value can not be measured by a simple subtraction \(y - \hat{y}\). In such cases it is necessary to define an additional compare() in the extractor:

# dictextractor.py
from typing import Dict

sigma = 0.1
def extract(amsresults) -> Dict:
  return amsresults.some_dict_property()

def comparator(y:Dict, yhat:Dict, au_to_ref:float) -> float:
  y    = list(y.values())
  yhat = list(yhat.values())

  # Unit conversion must be handled by the custom comparator:
  yhat = [i*au_to_ref for i in yhat]

  for i,ihat  in zip(y, yhat):
    ...
  return ...

A custom comparator defined in such a way will automatically be used in combination with the extractor.

Important

  • Whenever the extract() function returns a data type that does not support mathematical operations, a compare() function must be additionally defined.
  • compare() always follows the signature compare(y:Any, yhat:Any, au_to_ref:float) -> Union[ndarray,float,int]
  • Note that compare() should handle possible unit conversions by applying the equivalent of yhat = au_to_ref*yhat

3.4.3. Available Extractors

3.4.3.1. Distance

extract(amsresults, atom1: int, atom2: int) → float

Extract the interatomic distance between two atoms, defined by their indices atom1 and atom2 (atom indexing starts with 0).

Unit:
au
Sigma:
0.05

3.4.3.2. Angles

extract(amsresult, atom_ids: Tuple[int, int, int] = None) → Union[numpy.ndarray, float]

Extract the angles of the calculated molecule. To extract one specific angle only, provide the atom_ids argument as a tuple of three atom indices (atom indexing starts with 0).

Unit:
deg
Sigma:
2.0

3.4.3.3. Dihedral

extract(amsresults, atom1: int, atom2: int, atom3: int, atom4: int) → float

Extract the dihedral angle as defined by the four atom ids (atom indexing starts with 0).

Unit:
deg
Sigma:
0.2

3.4.3.4. RMSD

extract(amsresult)

Uses the Kabsch algorithm to align and calculate the root-mean-square deviation of the atomic positions given two systems. Assumes the same atoms are referenced by a common index in both molecules. When providing an external reference value, the object must be a 2d numpy array of the shape (Natoms, 4), where each inner element mus have the form [Symbol:str, x:float, y:float, z:float].

Unit:
Ångstörm
Sigma:
0.1

3.4.3.5. Energy

extract(amsresult) → float

Extract the energy of the system.

Unit:
au
Sigma:
2e-3

3.4.3.6. Forces

extract(amsresult, atindex: int = None, xyz: int = None) → Union[float, numpy.ndarray]

Extract the atomic forces of a system (array), or the specific component xyz at index atindex (float).

Unit:
au
Sigma:
0.01
Properties:
atindex : optional, int
Atom index, starting with 0
xyz : optional, 0 <= int <= 2
x(0), y(1) or z(2) component to be extracted.

3.4.3.7. Hessian

extract(amsresult) → numpy.ndarray

Extract the Cartiesian Hessian matrix.

Unit:
au
Sigma:
Undefined (defaults to 1.0)

3.4.3.8. Stress Tensor

extract(amsresult) → numpy.ndarray

Extracts the stress tensor

Unit:
au
Sigma:
1e-4

3.4.3.9. Charges

extract(amsresult, atomindex: int = None) → Union[numpy.ndarray, float]

Extract the atomic charges.
Charges at one specific atom can be requested with atomindex (atom indexing starts with 0).

Unit:
au
Sigma:
0.1

3.4.3.10. Vibrational Frequencies

extract(results) → numpy.ndarray

Extracts the vibrational frequencies.

Unit:
1/cm
Sigma:
5.0

3.4.3.11. PES

extract(results) → numpy.ndarray

Extracts and compares the results of a PES Scan. The return value r is calculated as

\[ \begin{align}\begin{aligned}\mu = \frac{1}{N} \sum_i^N y_i - \hat{y}_i\\r = \sqrt{ \sum_i^N (\hat{y}_i - y_i + \mu)^2 }\end{aligned}\end{align} \]

where N is the number of points in the PES scan.

Unit:
au
Sigma:
2e-3