3.3. Data Set¶
Following the Introduction, the DataSet class represents a set of physicochemical properties that the user would like to see reproduced by a parametric model.
We define an arbitrary property \(P\) as a function of a computational Job. A single entry in the DataSet can then be expressed as a linear combination of multiple properties:
\[y = \sum_i c_i P_i(\mathrm{Job}_i) \qquad (3.1)\]
For every property \(P\) there exists an extractor which is linked to the Data Set. Extractors define how to access the desired property from a calculated job, and allow the user to easily extend the scope of what can be fitted during the parameterization.
Following the above equation, the DataSet has two purposes:
- Storage of all reference values \(\boldsymbol{y}\)
- Consecutive comparison of the values predicted by the parametric model \(\boldsymbol{\hat{y}}\) with the reference, using a specific Loss function
Just like any of the collections, the DataSet can be stored in a text format, which looks as follows:
---
Expression: forces("H2O_1")
Weight: 100.0
Unit: kcal/(mol*angstrom), 1.1858e3
ReferenceValue: |
array([ -0., -0., 1.75e-5,
-0., -3.45e-5, -1.75e-5,
-0., 3.45e-5, 0. ])
---
Expression: angles("H2O_2",1,2,3)
Weight: 0.333
Sigma: 0.5
---
Expression: energy("H2O_3")
Weight: 1.0
ReferenceValue: -3.44
...
At runtime, the Data Set behaves like a list. Each element in it is an instance of DataSetEntry, with the following attributes:
- expression
- weight
- unit
- reference
- sigma
- jobids (read-only)
- extractors (read-only)
3.3.1. Adding Entries¶
Addition of entries follows Eq. (3.1): a weight has to be provided alongside each entry, marking the relative ‘importance’ of this entry in the total Data Set (see Loss Functions for how the weight influences the total loss):
>>> ds = DataSet()
>>> # Adding without a reference value or unit:
>>> ds.add_entry("forces('Methane01')", 15.0)
>>> # Adding with a reference value:
>>> ds.add_entry("energy('Methane01') - 4*energy('Hydrogen')", 15.0, reference=0.033)
The default property units of ParAMS are atomic units (au). If the user wishes to fit properties in any other unit, an additional unit argument has to be provided in the form of a tuple:
>>> ds.add_entry("energy('Methane02')", 2.0, unit=('kcal/mol',6.2751e+02))
Here, the first item is the string name of the desired unit and the second item is a conversion factor from atomic units to the unit of choice.
Important
The default property units in ParAMS are atomic units. For any other units a conversion factor from au to the desired unit has to be provided.
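The conversion factor does not have to be hard-coded: it can, for instance, be computed with the PLAMS Units class. A small sketch (the jobID 'Methane03' is hypothetical):
>>> from scm.plams import Units
>>> factor = Units.conversion_ratio('hartree', 'kcal/mol') # ~6.2751e+02
>>> ds.add_entry("energy('Methane03')", 2.0, unit=('kcal/mol', factor))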
Optional metadata can be added to every entry by passing a dictionary to the metadata argument:
>>> m = {'Origin' : 'A very important journal', 'Version' : 1.1}
>>> ds.add_entry("forces('Methane02')", 1.0, metadata=m)
>>> print(ds[-1])
---
Expression: forces('Methane02')
Weight: 1.0
Origin: 'A very important journal'
Version: 1.1
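The metadata dictionary can later be retrieved through the entry's metadata attribute (cf. add_entry() in the Data Set API below):
>>> ds[-1].metadata
{'Origin': 'A very important journal', 'Version': 1.1}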
3.3.2. Accessing Entries¶
>>> ds["forces('Methane01')"] # If the expression is known.
>>> ds[0].expression # Otherwise the class can be accessed as a list:
"forces('Methane01')"
>>> ds[0].weight
15.0
>>> ds[0].unit
('au', 1.)
>>> ds[0].reference
np.array([...])
>>> ds[0].jobids # The ID of the job linking to the AMSResults value. Needs to be the key in the dictionary passed to `calculate_reference()` and `evaluate()`.
{"Methane01"}
>>> ds[0].extractors # Set of extractors that will be called when evaluating
{'forces'}
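Since the Data Set also iterates like a list (see __iter__() in the API below), entries can be filtered with plain comprehensions, e.g. by extractor (output hypothetical):
>>> [e.expression for e in ds if 'forces' in e.extractors]
["forces('Methane01')", "forces('Methane02')"]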
3.3.3. Removing Entries¶
>>> len(ds)
10
>>> entry = ds[3] # Single entry
>>> entry = ds["forces('Methane01')"] # Same, but with the expression known
>>> ds.index(entry)
3
>>> del ds[3] # Delete by index
>>> ds.remove(entry) # Or delete by entry
>>> len(ds)
9
The del statement can also delete multiple indices at once:
>>> len(ds)
9
>>> del_ids = [1,2,3]
>>> del ds[del_ids]
>>> len(ds)
6
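Combined with iteration, this allows bulk removal. A sketch that drops all entries still lacking a reference value (evaluation, covered below, requires every entry to have one):
>>> no_ref = [i for i, e in enumerate(ds) if e.reference is None]
>>> del ds[no_ref]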
3.3.4. Adding External Reference Data¶
External reference data can be easily provided with the reference argument when calling add_entry(). If your reference data is not in atomic units, make sure to either convert it to au before adding, or specify a proper conversion factor from au to your desired unit in the unit argument:
>>> forces = np.array([[-0, -0, 1.75e-5], [-0, -3.45e-5, -1.75e-5], [-0, 3.45e-5, 0]])
>>>
>>> ds = DataSet()
>>> ds.add_entry('forces("H2O_1")' , 100., reference=forces)
>>> ds.add_entry('angles("H2O_2",1,2,3)', 0.33, reference=122)
>>> ds.add_entry('energy("H2O_3")'         , 1.  , reference=92.551, unit=('kcal/mol', 6.2751e2))
The unit has to be a list or tuple where the first element is a string with the target unit name and the second element is the conversion factor from au to the target unit.
References can also be added or modified for entries that have already been added:
>>> ds['energy("H2O_3")'].reference = 12345
3.3.5. Calculating and Adding Reference Data with AMS¶
If the reference data is not yet available to the Data Set, the user can choose to extract it from an AMS calculation. In this case, any AMS Engine can be used as the Reference Engine. Assuming a Data Set with a single entry
>>> ds = DataSet()
>>> ds.add_entry('angles("H2O_2",1,2,3)', 1)
>>> print(ds)
---
Expression: angles("H2O_2",1,2,3)
Weight: 1
Sigma: 1.0
...
the user has multiple ways of calculating the reference values. The fastest is through JobCollection.run, if a Job Collection is available. Here, we make use of the fact that system geometry and job settings are already stored in the collection. The only additional input needed in this case is the selection of a reference engine:
>>> from scm.plams import Settings
>>> engine = Settings()
>>> engine.input.MOPAC # Our reference engine is MOPAC
>>> r = jc.run(engine, jobids=['H2O_2']) # r is a dictionary of {jobid : AMSResults}
Alternatively, if no Job Collection is yet created, the regular PLAMS syntax can be used:
>>> from scm.plams import *
>>> mol = Molecule('H2O_2.xyz')
>>> s = Settings()
>>> s.input.AMS.Task='GeometryOptimization'
>>> s.input.MOPAC
>>> init()
>>> job = AMSJob(name='H2O_2', molecule=mol, settings=s)
>>> r = job.run()
>>> r = {'H2O_2' : r} # The DataSet expects a dictionary
See also
The above examples use the plams.Settings class. See the PLAMS documentation for more information about it.
We have now created and executed a reference job. Passing it to the Data Set can be done with calculate_reference():
>>> ds.calculate_reference(r)
The Data Set will automatically extract and store the correct results:
>>> print(ds) # Check our objective function
---
Expression: angles("H2O_2",1,2,3)
Weight: 1
ReferenceValue: 103.87
Sigma: 1.0
...
Warning
By default, when calculate_reference() is called, any already present reference values will be overwritten with the ones extracted from plams.AMSResults. Set overwrite=False to keep existing values.
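For example, to keep any reference values that are already set:
>>> ds.calculate_reference(r, overwrite=False) # existing references are kept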
3.3.6. Storage and I/O¶
Once all reference values are computed, it might be useful to store them. The Data Set can be stored in the YAML format, or pickled if disk space and speed are more important than readability:
>>> ds.store('dataset.yml') # Same as print(ds) to file
>>> ds.pickle_dump('costfunction.pkl') # Binary
The Data Set can then be restored with:
>>> ds = DataSet('dataset.yml')
Alternatively, when an instance is already present:
>>> ds.load('dataset.yml') # or
>>> ds.pickle_load('costfunction.pkl')
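Because Data Sets support equality comparison (see __eq__() in the API below), a round trip can serve as a quick check that storage was lossless:
>>> ds.store('dataset.yml')
>>> ds == DataSet('dataset.yml')
True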
3.3.7. Calculating the Loss Function Value¶
If each entry has a reference value, the computational results of any engine can be evaluated and compared with a suitable loss function. This can be useful for a manual, quantitative comparison between different parameters and/or engines. The evaluation mostly follows the same signature as the calculation of reference data.
Note
The execution of jobs and evaluation of the Data Set is handled automatically during the Optimization. In most cases the user does not need to manually reproduce the workflow covered in this section.
>>> from scm.plams import Settings
>>> engine_settings = Settings()
>>> engine_settings.input.DFTB
>>> jobresults = jc.run(engine_settings) # Assuming we already have a JobCollection() in jc
>>> ds.evaluate(jobresults, loss='rmse') # Assuming we already have a DataSet() in ds
1234.5
Important
The Data Set value can only be calculated (i.e. evaluate() called) when each entry in it has a reference value.
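When return_residuals=True is passed, evaluate() additionally returns the per-entry residuals and contributions (see the Data Set API below), which can be written to disk for inspection. A sketch with hypothetical file names:
>>> fx, residuals, contributions = ds.evaluate(jobresults, loss='rmse', return_residuals=True)
>>> ds.print_contributions(contributions, 'contribs.txt')
>>> ds.print_residuals(residuals, 'residuals.txt')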
3.3.8. Checking for Consistency with a given Job Collection¶
Data Set entries are tied to a JobCollection by a common jobID. The consistency of every DataSet instance can be checked with the DataSet.check_consistency() method. It will check if any DataSetEntry has jobids that are not included in a JobCollection instance and, if so, return a list of indices of entries that cannot be calculated given the Job Collection:
>>> # DataSet() in ds, JobCollection() in jc
>>> len(ds)
10
>>> bad_ids = ds.check_consistency(jc)
>>> bad_ids # Entries at these indices cannot be calculated given `jc`: they require a chemical system that is not present in the collection
[1,10,13]
>>> del ds[bad_ids]
>>> len(ds)
7
>>> ds.check_consistency(jc) # No more issues
[]
The DataSet.check_consistency() method is equivalent to:
>>> bad_ids = [num for num,entry in enumerate(ds) if any(i not in jc for i in entry.jobids)]
3.3.9. Splitting into Subsets¶
The Data Set can be further split into subsets. Each subset is a new instance of DataSet. If you are interested in splitting by size, see the split(), random() and maxjobs() methods in the API below.
Splitting can also be based on extractors:
>>> ds.forces() # Returns a DataSet instance with forces only
>>> ds.angles() # Returns a DataSet instance with angles only, and so on
These methods are created automatically, based on the extractors available to the instance. Limiting a Data Set to only specific jobIDs can be done with the from_jobids() method.
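A typical application is a random train/validation split. A sketch, assuming a Data Set of ten entries:
>>> train, val = ds.split(0.8, 0.2, seed=42)
>>> len(train), len(val)
(8, 2)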
3.3.10. Data Set Entry API¶
class DataSetEntry(expression, weight, reference=None, unit=None, sigma=None, metadata=None)¶
A helper class representing a single entry in the DataSet.
Note
This class is not public in this module, as it is not possible to create these entries on their own. They can only be managed by an instance of a DataSet class.
Attributes:
- weight : float or ndarray
  The weight \(w\), denoting the relative ‘importance’ of a single entry. See Loss Functions for more information on how this parameter affects the overall loss.
- reference : Any
  Reference value of the entry. Consecutive DataSet.evaluate() calls will compare to this value.
- unit : Tuple[str,float]
  The unit of this property.
- sigma : float
  To calculate the loss function value, each entry’s residuals are calculated as \((w/\sigma)(y-\hat{y})\). Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry. Default values differ depending on the property and are stored in the individual extractor files. If the extractor file does not define a sigma, it defaults to 1 (see Extractors and Comparators for more details).
- expression : str
  The expression that will be evaluated during a DataSet.evaluate() call. A combination of extractors and jobIDs.
- jobids : set, read-only
  All jobIDs needed for the calculation of this entry.
- extractors : set, read-only
  All extractors needed for the calculation of this entry.
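To make the interplay of weight and sigma concrete, a minimal numpy sketch of a single entry's residual and its contribution under a plain sum-of-squares loss (values are hypothetical; the authoritative definitions are in Loss Functions):
>>> import numpy as np
>>> y, yhat = np.array([0., 0., 1.75e-5]), np.array([0., 1.0e-6, 1.60e-5])
>>> w, sigma = 100.0, 1.0               # entry.weight, entry.sigma
>>> residual = (w / sigma) * (y - yhat) # as defined for the sigma attribute above
>>> float(np.sum(residual**2))          # this entry's contribution to an SSE loss
3.25e-08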
3.3.11. Data Set API¶
class DataSet(file=None, more_extractors=None)¶
A class representing the data set \(DS\).
Attributes:
- extractors_folder : str
  Default extractors location.
- extractors : set
  Set of all extractors available to this instance. See Extractors and Comparators for more information.
- jobids : set
  Set of all jobIDs in this instance.
- header : optional, str
  A string that will be printed at the beginning of the file when store() is called.
__init__(file=None, more_extractors=None)¶
Initialize a new DataSet instance.
Parameters:
- file : str
  Load a previously saved DataSet from this path.
- more_extractors : str or List[str]
  Path to a user-defined extractors folder. See Extractors and Comparators for more information.
add_entry(expression, weight, reference=None, unit=None, sigma=None, metadata=None, dupe_check=True)¶
Adds a new entry to the cost function, given the expression, the desired weight and (optionally) the external reference value. Skips checking the expression for duplicates if dupe_check is False; this is recommended for large cost functions (>20k entries).
Parameters:
- expression : str
  The string representation of the extractor and jobID to be evaluated. See Adding Entries and Extractors and Comparators for more details.
- weight : float, 1d array
  Relative weight of this entry. See Adding Entries for more details; see Loss Functions to check how weights influence the overall value.
- reference : optional, Any
  External reference values can be provided through this argument.
- unit : optional, Tuple[str, float]
  Whenever the reference needs to be stored in any other unit than the default (atomic units), use this argument to provide a tuple where the first element is the string name of the unit and the second is the conversion factor from atomic units to the desired unit. See Adding External Reference Data for more details.
- sigma : optional, float
  To calculate the loss function value, each entry’s residuals are calculated as \((w/\sigma)(y-\hat{y})\). Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry. Default values differ depending on the property and are stored in the individual extractor files. If the extractor file does not define a sigma, it defaults to 1 (see Extractors and Comparators for more details).
- metadata : optional, dict
  Optional metadata; will be printed when store() is called and can be accessed through each entry’s metadata attribute.
- dupe_check : bool
  If True, will check that every expression is unique per Data Set instance. Disable if working with large data sets. Rather than adding an entry multiple times, consider increasing its weight.
calculate_reference(results, overwrite=True)¶
Calculate the reference values for every entry, based on results.
Parameters:
- results : dict
  A {jobid : AMSResults} dictionary:
  >>> job = AMSJob( ... )
  >>> job.run()
  >>> DataSet.calculate_reference({job.name : job.results})
  Can be ‘sparse’ (or empty == {}), as long as the respective entry (or all entries) already has a reference value defined.
- overwrite : bool
  By default, when this method is called, any already present reference values will be overwritten with the ones extracted from the results dictionary. Set this to False to keep existing values.
evaluate(results, loss: Type[scm.params.core.lossfunctions.Loss] = 'sse', return_residuals=False)¶
Compares the optimized results with the respective reference. Returns a single cost function value based on the loss function.
Parameters:
- results : dict
  Same as calculate_reference().
- loss : Loss, str
  A subclass of Loss, holding the mathematical definition of the loss function to be applied to every entry, or a registered string shortcut.
- return_residuals : bool
  Whether to return residuals and contributions in addition to fx.
Returns:
- fx : float
  The overall cost function value after the evaluation of results.
- residuals : List[1d-array]
  List of unweighted per-entry residuals (or the return values of compare(y,yhat) in the case of custom comparators).
- contributions : List[float]
  List of relative per-entry contributions to fx.
get(key: str)¶
Return a list of per-entry attributes, where key determines the attribute; equivalent to [getattr(i,key) for i in self].

load(yamlfile='dataset.yaml')¶
Loads a DataSet from a YAML file.

store(yamlfile='dataset.yaml')¶
Stores the DataSet to a YAML file.

pickle_load(fname)¶
Loads a DataSet from a pickled file.

pickle_dump(fname)¶
Stores the DataSet to a pickled file.

copy()¶
Return a deep copy of the instance.

__getitem__(idx)¶
Dict-like behavior if idx is a string, list-like if it is an int.
__delitem__(idx: Union[int, Sequence[int]])¶
Delete one entry at index idx if idx is an int, or multiple entries at their indices if idx is a sequence of ints.

remove(entry)¶
Remove the DataSetEntry instance from this Data Set.

index(entry)¶
Return the index of a DataSetEntry instance in this Data Set.

__len__()¶
Return the length of this Data Set.

__iter__()¶
Iterate over all DataSetEntries in this Data Set.

__call__(key)¶
Same as __getitem__().

__eq__(other)¶
Check if two Data Sets are equal.

__ne__(other)¶
Return self!=value.

__add__(other)¶
Add two Data Sets, returning a new instance. Does not check for duplicates.

__sub__(other)¶
Subtract two Data Sets, returning a new instance. Does not check for duplicates.

__str__()¶
Return a string representation of this instance.
jobids¶
Return a set of all jobIDs necessary for the evaluation of this instance.

check_consistency(jc: scm.params.core.jobcollection.JobCollection) → Sequence¶
Checks if this instance is consistent with jc, i.e. if all entries can be calculated from the given JobCollection.
Parameters:
- jc : JobCollection
  The Job Collection to check against.
Returns: List of entry indices that contain jobIDs not present in jc. Use del self[i] to delete the entry at index i.
maxjobs(njobs, seed=None, _patience=1000)¶
Returns a random subset of self, reduced to len(DataSet.jobids) <= njobs AMS jobs.
Note
Can result in a subset with a lower (or zero) number of jobs than specified. This happens when the data set consists of entries that are computed from multiple jobs and adding any single entry would result in a data set with more jobs than requested. Will warn if a smaller subset is generated and raise a ValueError if the return value would be an empty data set.

split(*percentages, seed=None)¶
Returns N=len(percentages) random subsets of self, where every percentage is the relative number of entries per new instance returned.
>>> len(self) == 10
>>> a,b,c = self.split(.5, .2, .3) # len(a) == 5, len(b) == 2, len(c) == 3, no overlap

random(N, seed=None)¶
Returns a new subset of self with len(subset) == N.

from_jobids(jobids: Set[str])¶
Generate a subset containing only the provided jobIDs.
print_contributions(contribs, fname, sort=True)¶
Print contribs to fname, assuming the former has the same ordering as the return value of get().
Parameters:
- contribs : List
  Return value of evaluate().
- fname : str
  File location for printing.
- sort : bool
  Sort the contributions from max to min. Original order if disabled.

print_residuals(residuals, fname, weigh=True, extractors: List[str] = None)¶
Print residuals to fname, assuming the former has the same ordering as the return value of get(). Entries can be limited to certain extractors with the respective keyword argument.
Parameters:
- residuals : List
  Return value of evaluate().
- fname : str
  File location for printing.
- weigh : bool
  Whether or not to apply the weights when printing the residuals.
- extractors : optional, List[str]
  If provided, will limit the extractors to only the ones in the list. Defaults to all extractors.

predictions_from_residuals(residuals: List, extractors: List = None) → List¶
Return the absolute predicted values for each data set entry, based on the residuals vector as returned by evaluate().
Returns:
- preds : List of 2-Tuples
  A list where each element is a tuple of (expression, predicted value).
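For instance (output hypothetical), the predictions corresponding to a residuals vector can be recovered as:
>>> _, residuals, _ = ds.evaluate(jobresults, return_residuals=True)
>>> ds.predictions_from_residuals(residuals)[0]
('angles("H2O_2",1,2,3)', 104.2)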