8.7. Data Set¶
8.7.1. An example DataSet¶
The DataSet class contains expressions that are evaluated during the parameter optimization. For each new parameter set, the expressions are evaluated, and the results are compared to reference values.
The DataSet class contains a series of DataSetEntry instances. For example, a DataSet instance with 3 entries might look like this when stored on disk in the text-based YAML format:
---
Comment: An example data_set
Date: 21-May-2001
dtype: DataSet
version: '2024.101'
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Sigma: 3.0
Unit: degree, 1.0
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
---
Expression: forces('distorted_H2O')
Weight: 1.0
ReferenceValue: |
array([[ 0.0614444 , -0.11830478, 0.03707212],
[-0.05000567, 0.09744271, -0.03291899],
[-0.01143873, 0.02086207, -0.00415313]])
Unit: Ha/bohr, 1.0
Source: Calculated_with_DFT
...
where

- Comment, Date, dtype, and version are part of the header.
- H2O, O2, H2, and distorted_H2O are jobs that appear in the Job Collection.
- energy, forces, and angle are extractors that extract some result from a finished job.
- Expression and Weight are required for each data_set entry.
- ReferenceValue is the reference value expressed in Unit. If no Unit is given, the unit must equal the default unit for the given extractor. If no ReferenceValue is given, it can be calculated by the ParAMS Main Script or with the Data Set Evaluator class.
- Sigma is the sigma value expressed in Unit. If no Sigma is given, it will equal the default sigma for the given extractor. For details, see Sigma vs. weight: What is the difference?.
- Source and Description are optional metadata keys. Arbitrary metadata keys can be used.
During the parameter optimization you may use both a training set and a validation set. In that case you would have two separate DataSet instances: one for the training set and one for the validation set. See the tutorial Training and validation sets.
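For example, assuming a data_set variable holding a populated DataSet, a random 80/20 split (a minimal sketch using the split() method documented below; the seed value is arbitrary) would produce the two instances:
training_set, validation_set = data_set.split(0.8, 0.2, seed=42)
training_set.store("training_set.yaml")
validation_set.store("validation_set.yaml")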
Use the Data Set Evaluator class to evaluate the data_set expressions with any engine settings and to compare the results to the ReferenceValues. For more details, see the Data Set Evaluator documentation.
8.7.2. Load or store DataSet¶
Save the above text block with the name data_set.yaml. You can then load it into a DataSet instance with:
from scm.params import *
data_set = DataSet('data_set.yaml')
print(data_set)
To save it to a new .yaml file, call the store() method:
data_set.store('new_file.yaml')
8.7.3. Adding entries¶
Add entries with the add_entry() method.
The arguments are:
expression (required): The expression to be evaluated. The expression must be unique.
weight (required): The weight of the entry. A larger weight will give a bigger contribution to the overall loss function; a larger weight thus indicates a more important data_set entry. The weight can either be a scalar or a numpy array with the same dimensions as reference. If weight is a scalar but reference is an array, then every component of the reference will be weighted with weight. See also Sigma vs. weight: What is the difference?.
reference (recommended): The reference value, expressed in unit. If no reference value is given, it is possible to calculate it before the parametrization using the params main script. Can either be a scalar or a numpy array, depending on the extractor in expression.
sigma (recommended): A value to normalize the expression (see Sigma vs. weight: What is the difference?). If no sigma value is given, a default one will be used depending on the extractor in the expression. If the expression contains more than one unique extractor, sigma is required. Sigma has the same unit as reference.
unit (recommended): The unit of reference and sigma. Should be expressed as a 2-tuple ('label', conversion_ratio_float), where 'label' is not used other than for being printed to the output, and conversion_ratio_float is a floating point number used to convert the default unit to the new unit. For example, unit for an energy might equal ('eV', 27.211), which will convert the default unit ('hartree', 1.0) to eV. The reference and sigma values should then be expressed in eV (see the sketch after this list). NOTE: If you specify a unit you must also specify a sigma value, otherwise the default sigma will have the wrong unit.
metadata (optional): A dictionary containing arbitrary metadata (for example sources for experimental reference data, or other metadata to help with postprocessing).
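As a sketch of the unit mechanics described above (the job name some_job and all numbers are illustrative only):
from scm.params import *

data_set = DataSet()
# 27.211 converts the default energy unit ('hartree', 1.0) to eV,
# so both reference and sigma must be given in eV
data_set.add_entry(
    "energy('some_job')",
    weight=1.0,
    reference=-2.5,  # illustrative reference energy in eV
    unit=("eV", 27.211),
    sigma=0.05,  # acceptable prediction error, also in eV
)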
8.7.4. Demonstration: Working with a DataSet¶
Download data_set_demo.ipynb or data_set_demo.py
8.7.4.1. Add an entry¶
from scm.params import *
import numpy as np
data_set = DataSet()
data_set.add_entry("angle('H2O', 0, 1, 2)", weight=0.333)
To access the last added element, use data_set[-1]
print("String representation of data_set[-1]")
print(data_set[-1])
print("Type: {}".format(type(data_set[-1])))
String representation of data_set[-1]
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Unit: degree, 1.0
Type: <class 'scm.params.core.dataset.DataSetEntry'>
You can also change it after you’ve added it:
data_set[-1].sigma = 3.0
print(data_set[-1])
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Sigma: 3.0
Unit: degree, 1.0
We recommend always specifying the reference value, the unit, and the sigma value when adding an entry, and also specifying any meaningful metadata about the data set entry.
data_set.add_entry(
"energy('H2O')-0.5*energy('O2')-energy('H2')",
weight=2.0,
reference=-241.8,
unit=("kJ/mol", 2625.15),
sigma=10.0,
metadata={
"Source": "NIST Chemistry WebBook",
"Description": "Hydrogen combustion (gasphase) per mol H2",
},
)
print(data_set[-1])
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
All expressions in a single DataSet must be unique:
try:
data_set.add_entry("energy('H2O')-0.5*energy('O2')-energy('H2')", weight=2.0)
except Exception as e:
print("Caught the following exception: {}".format(e))
Caught the following exception: Expression energy('H2O')-0.5*energy('O2')-energy('H2') already in DataSet.
The reference values can also be numpy arrays, for example when extracting forces or charges:
forces = np.array(
[
[0.0614444, -0.11830478, 0.03707212],
[-0.05000567, 0.09744271, -0.03291899],
[-0.01143873, 0.02086207, -0.00415313],
]
)
data_set.add_entry(
"forces('distorted_H2O')",
weight=1.0,
reference=forces,
metadata={"Source": "Calculated_with_DFT"},
)
print(data_set[-1])
---
Expression: forces('distorted_H2O')
Weight: 1.0
ReferenceValue: |
array([[ 0.0614444 , -0.11830478, 0.03707212],
[-0.05000567, 0.09744271, -0.03291899],
[-0.01143873, 0.02086207, -0.00415313]])
Unit: Ha/bohr, 1.0
Source: Calculated_with_DFT
8.7.4.2. DataSetEntry attributes¶
A DataSetEntry has the following attributes:
expression : str
weight : float or numpy array
unit : 2-tuple (str, float)
reference : float or numpy array
sigma : float
jobids : set of str (read-only). The job ids that appear in the expression.
extractors : set of str (read-only). The extractors that appear in the expression.
print(data_set[-2].expression)
print(data_set[-2].weight)
print(data_set[-2].unit)
print(data_set[-2].reference)
print(data_set[-2].sigma)
energy('H2O')-0.5*energy('O2')-energy('H2')
2.0
('kJ/mol', 2625.15)
-241.8
10.0
print(data_set[-2].jobids)
{'O2', 'H2', 'H2O'}
print(data_set[-2].extractors)
{'energy'}
8.7.4.3. Accessing the DataSet entries¶
Above, data_set[-1] was used to access the last added element, and data_set[-2] to access the second to last added element. More generally, the DataSet can be indexed either as a list or as a dict:
print(data_set[0].expression)
print(data_set[1].expression)
print(data_set[1].reference)
print(data_set["energy('H2O')-0.5*energy('O2')-energy('H2')"].reference)
angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
-241.8
-241.8
Get the number of entries in the DataSet with len():
print(len(data_set))
3
Get all of the expressions with get('expression') or keys():
print(data_set.get("expression"))
print(data_set.keys())
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]
The get method also works for all other DataSetEntry attributes, e.g.:
print(data_set.get("weight"))
print(data_set.get("extractors"))
[0.333, 2.0, 1.0]
[{'angle'}, {'energy'}, {'forces'}]
Loop over DataSet entries:
for ds_entry in data_set:
print(ds_entry.expression)
angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
forces('distorted_H2O')
or using the get method:
for expr in data_set.get("expression"):
print(expr)
angle('H2O', 0, 1, 2)
energy('H2O')-0.5*energy('O2')-energy('H2')
forces('distorted_H2O')
Use the DataSet.index() method to get the index of a DataSetEntry:
ds_entry = data_set["energy('H2O')-0.5*energy('O2')-energy('H2')"]
print(data_set.index(ds_entry))
1
print(data_set[1].expression)
energy('H2O')-0.5*energy('O2')-energy('H2')
8.7.4.4. Delete a DataSet entry¶
Remove an entry with del:
data_set.add_entry("energy('some_job')", weight=1.0)
print(len(data_set))
print(data_set[-1].expression)
del data_set[-1] # or del data_set["energy('some_job')"]
print(len(data_set))
print(data_set[-1].expression)
4
energy('some_job')
3
forces('distorted_H2O')
del can also be used to delete multiple entries at once, as in del data_set[0,2] to remove the first and third entries.
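A minimal sketch continuing the running example (the job names job_a and job_b are throwaway):
data_set.add_entry("energy('job_a')", weight=1.0)
data_set.add_entry("energy('job_b')", weight=1.0)
print(len(data_set))  # 5
del data_set[3, 4]  # remove both new entries (indices 3 and 4) at once
print(len(data_set))  # 3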
8.7.4.5. Compute the intersection of two DataSets¶
Intersect two DataSets with &:
Important
This creates a copy of the left-hand side dataset (e.g. header information) that only has entries also present in the right-hand side dataset.
another_data_set = DataSet()
another_data_set.header = {"info": "another data set"}
another_data_set.add_entry("energy('some_job')", weight=1.0)
intersected_data_set = data_set & another_data_set
print(len(intersected_data_set))
print(data_set.header)
print(another_data_set.header)
print(intersected_data_set.header)
0
{'dtype': 'DataSet', 'version': '2023.202'}
{'info': 'another data set', 'dtype': 'DataSet', 'version': '2023.202'}
{'dtype': 'DataSet', 'version': '2023.202'}
8.7.4.6. Split a DataSet into subsets¶
The following methods return a new DataSet:

- split to get a list of nonoverlapping subsets.
- split_by_jobids to get a list of nonoverlapping subsets where e.g. forces and energies from the same job end up in the same subset.
- maxjobs
- random
- from_expressions
- from_jobids
- from_extractors
- from_metadata
- from_atomic_expressions to obtain the subset of all entries that have an “atomic” expression.
Important
For all of the above methods, modifying entries in a subset will also modify the entries in the original data_set, and vice versa! If you do not want this behavior, apply the copy method to the created subsets.
Subset from a list of given expressions
subset = data_set.from_expressions(
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
)
print(subset.keys())
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
Important
Modifying entries in a subset will also modify the entries in the original data_set, and vice versa!
expression = "angle('H2O', 0, 1, 2)"
original_sigma = data_set[expression].sigma
print(
"For expression {} the original sigma value is: {}".format(
expression, original_sigma
)
)
subset[expression].sigma = 1234 # this modifies the entry in the original data_set
print(data_set[expression].sigma)
print(subset[expression].sigma)
# restore the original value, this modifies the subset!
data_set[expression].sigma = original_sigma
print(data_set[expression].sigma)
print(subset[expression].sigma)
For expression angle('H2O', 0, 1, 2) the original sigma value is: 3.0
1234
1234
3.0
3.0
To modify a subset without modifying the original DataSet, you must create a copy:
new_subset = subset.copy()
new_subset[expression].sigma = 2345
print(new_subset[expression].sigma)
print(subset[expression].sigma)
print(data_set[expression].sigma)
2345
3.0
3.0
Subset from a list of job ids
subset = data_set.from_jobids(["H2O", "O2", "H2"])
print(subset.keys())
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
Subset from metadata key-value pairs
subset = data_set.from_metadata("Source", "NIST Chemistry WebBook")
print(subset)
---
dtype: DataSet
version: '2023.202'
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
...
You can also match using regular expressions:
subset = data_set.from_metadata("Source", "^N[iI]ST\s+Che\w", regex=True)
print(subset.keys())
["energy('H2O')-0.5*energy('O2')-energy('H2')"]
Subset from extractors
subset = data_set.from_extractors("forces")
print(subset.get("expression"))
["forces('distorted_H2O')"]
A subset from multiple extractors can be generated by passing a list:
subset = data_set.from_extractors(["angle", "forces"])
print(subset.get("expression"))
["angle('H2O', 0, 1, 2)", "forces('distorted_H2O')"]
Subset from atomic expressions
subset = data_set.from_atomic_expressions()
print(data_set.get("expression"))
print(subset.get("expression"))
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')", "forces('distorted_H2O')"]
["angle('H2O', 0, 1, 2)", "forces('distorted_H2O')"]
Random subset with N entries
subset = data_set.random(2, seed=314)
print(subset.keys())
["angle('H2O', 0, 1, 2)", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
Split the data_set into random nonoverlapping subsets
subset_list = data_set.split(2 / 3.0, 1 / 3.0, seed=314)
print(subset_list[0].keys())
print(subset_list[1].keys())
["forces('distorted_H2O')", "energy('H2O')-0.5*energy('O2')-energy('H2')"]
["angle('H2O', 0, 1, 2)"]
Split the data_set into random nonoverlapping subsets based on the jobids of the entries
mixed_dataset = DataSet()
for molecule in ["H2O", "NH3", "CH4"]:
mixed_dataset.add_entry(f"forces('{molecule}')")
mixed_dataset.add_entry(f"energy('{molecule}')")
mixed_dataset.add_entry(f"angle('{molecule}', 0, 1, 2)")
mixed_dataset.add_entry(f"angle('{molecule}', 0, 2, 1)")
subset = list(mixed_dataset.split_by_jobids(2 / 3.0, 1 / 3.0, seed=314))
print(subset[0].keys())
print(subset[1].keys())
["forces('H2O')", "energy('H2O')", "angle('H2O', 0, 1, 2)", "angle('H2O', 0, 2, 1)", "forces('NH3')", "energy('NH3')", "angle('NH3', 0, 1, 2)", "angle('NH3', 0, 2, 1)"]
["forces('CH4')", "energy('CH4')", "angle('CH4', 0, 1, 2)", "angle('CH4', 0, 2, 1)"]
8.7.4.7. DataSet header¶
The header can be used to store comments about a data_set. When storing as a .yaml file, the header is printed as a separate YAML entry at the top of the file.
data_set.header = {"Comment": "An example data_set", "Date": "21-May-2001"}
print(data_set)
---
Comment: An example data_set
Date: 21-May-2001
dtype: DataSet
version: '2023.202'
---
Expression: angle('H2O', 0, 1, 2)
Weight: 0.333
Sigma: 3.0
Unit: degree, 1.0
---
Expression: energy('H2O')-0.5*energy('O2')-energy('H2')
Weight: 2.0
Sigma: 10.0
ReferenceValue: -241.8
Unit: kJ/mol, 2625.15
Description: Hydrogen combustion (gasphase) per mol H2
Source: NIST Chemistry WebBook
---
Expression: forces('distorted_H2O')
Weight: 1.0
ReferenceValue: |
array([[ 0.0614444 , -0.11830478, 0.03707212],
[-0.05000567, 0.09744271, -0.03291899],
[-0.01143873, 0.02086207, -0.00415313]])
Unit: Ha/bohr, 1.0
Source: Calculated_with_DFT
...
8.7.4.8. Save the data set¶
See also Load or store DataSet
data_set.store("data_set.yaml")
8.7.5. Calculating and Adding Reference Data with AMS¶
Note
This functionality is also covered by the Data Set Evaluator class.
If some reference values are not yet available in the Data Set, the user can run AMS calculations to calculate them. Most conveniently this can be combined with JobCollection.run(), which will calculate all jobs with one engine. The DataSet.calculate_reference() method is responsible for extracting the defined properties from a dictionary of AMSResults instances:
jc = JobCollection()
ds = DataSet()
# ... populate jc and ds
# Run all jobs in jc with MOPAC:
from scm.plams import Settings
engine = Settings()
engine.input.mopac  # bare key access creates an empty MOPAC engine block in the Settings
results = jc.run(engine)
# Extract calculated results and store in ds:
ds.calculate_reference(results)
See also
The object passed to calculate_reference() must be a dictionary of {key : AMSResults}, where key is the string name of a job (see Adding Jobs to the Job Collection). The Data Set will extract properties of all matching job names, as defined by the entries in the instance. For example, a Data Set instance with the entries hessian('water') and forces('methane') will expect a dictionary of {'water' : AMSResults, 'methane' : AMSResults} to be passed to calculate_reference() for a successful extraction.
8.7.6. Calculating the Loss Function Value¶
Note
This functionality is also covered by the Data Set Evaluator class.
The loss function value can only be calculated if each entry in the data set has a reference value.
The execution of jobs and evaluation of the Data Set is handled automatically during the Optimization. In most cases the user does not need to manually calculate the loss function value.
The loss function value is a metric of how similar two sets of results are. If all entries in a Data Set instance have a reference value, the DataSet.evaluate() method can be used to calculate the loss between the reference and a results dictionary. The signature is similar to the above:
jc = JobCollection()
ds = DataSet()
# ... populate jc and ds
# Run all jobs in jc with UFF:
from scm.plams import Settings
engine = Settings()
engine.input.forcefield  # bare key access creates an empty ForceField engine block in the Settings
results = jc.run(engine)
# Extract calculated results and store in ds:
loss = ds.evaluate(results, loss='rmsd')
When calculating the loss, each entry’s weight and sigma values will be considered. The loss argument can take a number of different Loss Functions.
8.7.7. Checking for Consistency with a given Job Collection¶
Data Set entries are tied to a JobCollection
by a common jobID.
The consistency of every DataSet instance can be checked with the DataSet.check_consistency()
method.
It will check if any DataSetEntry
has jobids
that are not included in a JobCollection
instance
and if so, return a list of indices with entries that can not be calculated given the Job Collection:
>>> # DataSet() in ds, JobCollection() in jc
>>> len(ds)
10
>>> bad_ids = ds.check_consistency(jc)
>>> bad_ids  # `ds` entries with these indices cannot be calculated given `jc`: their properties require a chemical system that is not present in `jc`
[1, 6, 8]
>>> del ds[bad_ids]
>>> len(ds)
7
>>> ds.check_consistency(jc)  # No more issues
[]
The DataSet.check_consistency() method is equivalent to:
>>> bad_ids = [num for num,entry in enumerate(ds) if any(i not in jc for i in entry.jobids)]
8.7.8. Sigma vs. weight: What is the difference?¶
Sigma (σ) and weight both affect how much a given data_set entry contributes to the loss function.
For example, the loss function might be a sum-of-squared-errors (SSE) function:

\[\text{SSE} = \sum_i \left( \frac{w_i}{\sigma_i} \left( y_i - \hat{y}_i \right) \right)^2\]

where the sum runs over all data_set entries, \(w_i\) is the weight for data_set entry \(i\), \(y_i\) is the reference value, \(\hat{y}_i\) is the predicted value, and \(\sigma_i\) is the sigma. This matches the per-entry weighted residual \((w/\sigma)(y-\hat{y})\) used in the API reference below.
The interpretation for sigma is that it corresponds to an “acceptable prediction error”. Sigma has the same unit as the reference value, and its magnitude therefore depends on which unit the reference value is expressed in. The purpose of sigma is to normalize the residual \((y_i-\hat{y}_i)\), no matter which unit \(y_i\) and \(\hat{y}_i\) are expressed in. In this way, it is possible to mix different physical quantities (energies, forces, charges, etc.) in the same training set.
The interpretation for weight is that it corresponds to how important one thinks a datapoint is. It has no unit.
For array reference values, like forces, the sigma value is the same for each force component, but the weight can vary for different force components in the same structure. If there are \(M_i\) force components for the structure \(i\), then

\[\text{SSE} = \sum_i \sum_{j=1}^{M_i} \left( \frac{w_{ij}}{\sigma_i} \left( y_{ij} - \hat{y}_{ij} \right) \right)^2\]
Summary table:

| | Sigma | Weight |
|---|---|---|
| Unit | Same as the reference value | None |
| Interpretation | Acceptable prediction error | Importance of this data_set entry |
| Default value | Extractor-dependent | None (must be set explicitly) |
| Element-dependent for arrays | No | Yes |
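As a minimal numerical sketch of the SSE formula above (the residuals and the third sigma are made up; the weights loosely echo the example data_set):
import numpy as np

weights = np.array([0.333, 2.0, 1.0])  # w_i, dimensionless importance
sigmas = np.array([3.0, 10.0, 0.3])  # sigma_i, same unit as the corresponding reference
residuals = np.array([1.2, -4.0, 0.02])  # y_i - yhat_i, made-up values

# each weighted residual (w_i/sigma_i)*(y_i - yhat_i) is dimensionless,
# so quantities with different units can contribute to the same loss
sse = np.sum((weights / sigmas * residuals) ** 2)
print(sse)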
8.7.9. Data Set Entry API¶
- class DataSetEntry(expression, weight, reference=None, unit: Tuple[str, float] | None = None, sigma: float | None = None, metadata=None)¶
A helper class representing a single entry in the DataSet.
Note
This class is not public in this module, as it is not possible to create these entries on their own. They can only be managed by an instance of a DataSet class.
- Attributes:
- weight : float or ndarray
The weight \(w\), denoting the relative ‘importance’ of a single entry. See Loss Functions for more information on how this parameter affects the overall loss.
- reference : Any
Reference value of the entry. Consecutive DataSet.evaluate() calls will compare to this value.
- unit : Tuple[str, float]
Whenever the reference needs to be stored in any other unit than the default (atomic units), use this argument to provide a tuple where the first element is the string name of the unit and the second is the conversion factor from atomic units to the desired unit. The weighted residual will be calculated as \((w/\sigma)(y-c\hat{y})\), where \(c\) is the unit conversion factor.
Important
Unit conversion will not be applied to sigma. Adjust manually if needed.
- sigma : float
To calculate the loss function value, each entry’s residuals are calculated as \((w/\sigma)(y-\hat{y})\). Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry. Default values differ depending on the property and are stored in individual extractor files. If the extractor file has no sigma defined, it will default to 1 (see Extractors and Comparators for more details).
- expression : str
The expression that will be evaluated during a DataSet.evaluate() call. A combination of extractors and jobids.
- jobids : set, read-only
All Job IDs needed for the calculation of this entry.
- extractors : set, read-only
All extractors needed for the calculation of this entry.
8.7.10. Data Set API¶
- class DataSet(file=None, more_extractors=None)¶
A class representing the data set \(DS\).
Attributes:
- extractors_folder : str
Default extractors location.
- extractors : set
Set of all extractors available to this instance. See Extractors and Comparators for more information.
- jobids : set
Set of all jobIDs in this instance.
- header : dict
A dictionary with global metadata that will be printed at the beginning of the file when store() is called. Will always contain the ParAMS version number and class name.
- __init__(file=None, more_extractors=None)¶
Initialize a new DataSet instance.
Parameters:
- file : str
Load a previously saved DataSet from this path.
- more_extractors : str or List[str]
Path to a user-defined extractors folder. See Extractors and Comparators for more information.
- add_entry(expression, weight=1.0, reference=None, unit=None, sigma=None, metadata=None, dupe_check=True)¶
Adds a new entry to the cost function, given the expression, the desired weight and (optionally) the external reference value. Skips checking the expression for duplicates if dupe_check is False. This is recommended for large cost functions with more than 20k entries.
- Parameters:
- expression : str
The string representation of the extractor and jobid to be evaluated. See Adding entries and Extractors and Comparators for more details.
- weight : float, 1d array
Relative weight of this entry. See Adding entries for more details, see Loss Functions to check how weights influence the overall value.
- reference : optional, Any
External reference values can be provided through this argument.
- unit : optional, Tuple[str, float]
Whenever the reference needs to be stored in any other unit than the default (atomic units), use this argument to provide a tuple where the first element is the string name of the unit and the second is the conversion factor from atomic units to the desired unit. The weighted residual will be calculated as \((w/\sigma)(y-c\hat{y})\), where \(c\) is the unit conversion factor.
Important
Unit conversion will not be applied to sigma. Adjust manually if needed.
- sigma : optional, float
To calculate the loss function value, each entry’s residuals are calculated as \((w/\sigma)(y-\hat{y})\). Ideally, sigma represents the accepted ‘accuracy’ the user expects from a particular entry. Default values differ depending on the property and are stored in individual extractor files. If the extractor file has no sigma defined, it will default to 1 (see Extractors and Comparators for more details).
- metadata : optional, dict
Optional metadata, will be printed when store() is called, can be accessed through each entry’s metadata attribute.
- dupe_check : bool
If True, will check that every expression is unique per Data Set instance. Disable if working with large data sets. Rather than adding an entry multiple times, consider increasing its weight.
- calculate_reference(results, overwrite=True)¶
Calculate the reference values for every entry, based on results.
- Parameters:
- results : dict
A {jobid : AMSResults} dictionary:
>>> job = AMSJob( ... )
>>> job.run()
>>> DataSet.calculate_reference({job.name: job.results})
Can be ‘sparse’ (or empty == {}), as long as the respective entry (or all) already has a reference value defined.
- overwrite : bool
By default, when this method is called, possibly present reference values will be overwritten with the ones extracted from the results dictionary. Set this to False if you want to override this behavior.
- evaluate(results, loss: Type[Loss] = 'sse', return_residuals=False, zeros_if_failed: bool = False)¶
Compares the optimized results with the respective reference. Returns a single cost function value based on the loss function.
- Parameters:
- results : dict or DataSet
Same as calculate_reference(). Alternatively, a second DataSet instance can be used for an evaluation. In this case, the second instance’s reference values will be compared.
- loss : Loss, str
A subclass of Loss, holding the mathematical definition of the loss function to be applied to every entry, or a registered string shortcut.
- return_residuals : bool
Whether to return residuals, contributions and predictions in addition to fx.
- zeros_if_failed : bool
If an extractor fails to extract data, return zeros with the same shape as the reference data. This means that no errors will be raised when calculating the loss function. This is useful for the ML RunAMSAtEnd singlepoint.
- Returns:
- fx : float
The overall cost function value after the evaluation of results.
- residuals : List[1d-array]
List of unweighted per-entry residuals (or the return values of compare(y, yhat) in the case of custom comparators). Only returned when return_residuals is True.
- contributions : List[float]
List of relative per-entry contributions to fx. Only returned when return_residuals is True.
- predictions : List[Any]
List of raw predictions extracted from the results. Note: the elements of this list do not necessarily have the same size as the elements of residuals. Only returned when return_residuals is True.
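A usage sketch (assuming the four return values are unpacked in the documented order):
>>> fx, residuals, contributions, predictions = ds.evaluate(results, loss='sse', return_residuals=True)
>>> print(fx)  # overall loss value
>>> print(contributions)  # per-entry contributions, same order as ds.get('expression')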
- get(key: str)¶
Return a list of per-entry attributes, where key determines the attribute, equivalent to [getattr(i, key) for i in self].
- set(key: str, values: Sequence)¶
Batch-set the key attribute of every entry to a value from values. values must be the same length as the number of entries in the DataSet instance. Equivalent to [setattr(e, key, v) for e, v in zip(self, values)].
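For example, to double every weight (a sketch; the factor is arbitrary):
>>> ds.set('weight', [2.0 * w for w in ds.get('weight')])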
- load(yamlfile='data_set.yaml')¶
Loads a DataSet from a (compressed) YAML file.
- store(yamlfile='data_set.yaml')¶
Stores the DataSet to a (compressed) YAML file.
The file will be automatically compressed when the file ending is .gz or .gzip.
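A sketch (assuming, per load() above, that the constructor also accepts the compressed file):
>>> ds.store('data_set.yaml.gz')  # the .gz suffix triggers compression
>>> ds_restored = DataSet('data_set.yaml.gz')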
- __getitem__(idx)¶
Dict-like behavior if idx is a string, list-like if an int
- __delitem__(idx: int | Sequence[int])¶
Delete one or more entries. idx can either be an index (int) or an expression (str); in both cases a sequence can be passed to delete multiple elements at once.
- remove(entry)¶
Remove the DataSetEntry instance from this Data Set.
- index(entry)¶
Return the index of a DataSetEntry instance in this Data Set.
- keys()¶
Same as DataSet.get('expression').
- __len__()¶
Return the length of this Data Set
- __iter__()¶
Iterate over all DataSetEntries in this Data Set.
- __call__(key)¶
Same as __getitem__().
- __eq__(other)¶
Check if two Data Sets are equal.
- __ne__(other)¶
Return self!=value.
- __add__(other)¶
Add two Data Sets, returning a new instance. Does not check for duplicates.
- __and__(other)¶
Obtain the intersection of two DataSets.
- __sub__(other)¶
Subtract two Data Sets, returning a new instance. Does not check for duplicates.
- __str__()¶
Return a string representation of this instance.
- __repr__()¶
Return repr(self).
- property jobids¶
Return a set of all jobIDs necessary for the evaluation of this instance.
- check_consistency(jc: JobCollection) Sequence ¶
Checks if this instance is consistent with jc, i.e. if all entries can be calculated from the given JobCollection.
- Parameters:
- jcJobCollection
The Job Collection to check against
- Returns:
List of entry indices that contain jobIDs not present in jc. Use del self[i] to delete the entry at index i.
- from_extractors(extractors: str | List[str]) DataSet ¶
Returns a new Data Set instance that only contains entries with the requested extractors. For a single extractor, a string matching any of the ones in the extractors attribute can be passed. Otherwise a list of strings is expected.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- maxjobs(njobs, seed=None, _patience=100, _warn=True) DataSet ¶
Returns a random subset of self, reduced to len(DataSet.jobids) <= njobs AMS jobs.
Note
Can result in a subset with a lower (or zero) number of jobs than specified. This happens when the data set consists of entries that are computed from multiple jobs and adding any single entry would result in a data set with more jobs than requested. Will warn if a smaller subset is generated and raise a ValueError if the return value would be an empty data set.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- split(*percentages, seed=None) Tuple[DataSet] ¶
Returns N=len(percentages) random subsets of self, where every percentage is the relative number of entries per new instance returned.
>>> len(self) == 10
>>> a, b, c = self.split(.5, .2, .3)  # len(a) == 5, len(b) == 2, len(c) == 3, no overlap
Note
Shallow copies will be created, meaning that changes to the entries in the child instances will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned objects.
- split_by_jobids(*percentages: float, seed: int | None = None, atomic: bool = True) Iterator[DataSet] ¶
Returns N=len(percentages) random subsets of the DataSet based on partitioning the jobids, where every percentage is the relative number of entries per new instance returned. seed can be used to obtain reproducible splits. By default only “atomic” entries are split, meaning those entries that only extract a specific value from a single jobid.
Note
Shallow copies will be created, meaning that changes to the entries in the child instances will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned objects.
- random(N, seed=None) DataSet ¶
Returns a new subset of self with len(subset) == N.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_jobids(jobids: Set[str]) DataSet ¶
Generate a subset only containing the provided jobids.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_metadata(key, value, regex=False) DataSet ¶
Generate a subset for which the metadata key is equal to value. If regex, value can be a regular expression.
- Parameters:
- key : str
Metadata key
- value : str
Metadata value
- regex : bool
Whether value should be matched as a regular expression.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_expressions(expressions: Sequence) DataSet ¶
Generate a subset data_set from the given expressions.
- expressions : Sequence of str
The expressions that will form the new data_set.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_data_set_entries(entries: Sequence[DataSetEntry]) DataSet ¶
Generate a subset data_set from the given data_set entries.
- entries : sequence of DataSetEntry
The data_set entries that will make up the new data_set.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_weights(min_val=1.0, max_val=None, tol=1e-08) DataSet ¶
Generate a subset for which the weights are in the interval [min_val, max_val]. If no max_val is given, max_val is set to equal min_val.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
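A sketch using the three-entry demonstration data_set from above (weights 0.333, 2.0 and 1.0):
>>> subset = ds.from_weights(min_val=1.0, max_val=2.0)
>>> subset.get('weight')  # the 0.333 entry falls outside [1.0, 2.0]
[2.0, 1.0]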
- from_reference_values(min_val: float | None = None, max_val: float | None = None, tol: float = 1e-08) DataSet ¶
Generate a subset for which all reference values are in the interval [min_val-tol, max_val+tol].
If min_val is None, there is no lower bound. If max_val is None, there is no upper bound.
Entries without reference values will NOT be returned.
Note
A shallow copy will be created, meaning that changes to the entries in the child instance will also affect the parent instance. If you do not want this behavior, use the copy() method on the returned object.
- from_atomic_expressions() DataSet ¶
Generate a subset consisting of DataSet entries with only “atomic” expressions, meaning there is only one extractor extractor('jobid') and no further manipulations.
- print_contributions(contribs, fname, sort=True)¶
Print contribs to fname, assuming the former has the same ordering as the return value of get().
- Parameters:
- contribs : List
Return value of evaluate()
- fname : str
File location for printing
- sort : bool
Sort the contributions from max to min. Original order if disabled.
- print_residuals(residuals, fname, weigh=True, extractors: List[str] | None = None)¶
Print residuals to fname, assuming the former has the same ordering as the return value of get(). Entries can be limited to certain extractors with the respective keyword argument.
- Parameters:
- residuals : List
Return value of evaluate()
- fname : str
File location for printing
- weigh : bool
Whether or not to apply the weights when printing the residuals
- extractors : optional, List[str]
If provided, will limit the extractors to only the ones in the list. Defaults to all extractors.
- get_predictions(residuals: List, extractors: List | None = None, fpath: str | None = None, return_reference=False)¶
Return the absolute predicted values for each data set entry based on the residuals vector as returned by evaluate(), optionally printing reference and predicted values for each entry to file.
- Parameters:
- residuals : List
Return value of evaluate()
- extractors : optional, Sequence of strings
A list of extractors to be included in the output. Includes all by default.
- fpath : optional, str
If provided, will print EXPRESSION, SIGMA, WEIGHT, REFERENCE, PREDICTION to file.
- return_reference : bool
If true, will return a 2-tuple (preds, reference) where each item is a List.
- Returns:
- preds : List of 2-tuples
Returns a list where each element is a tuple of (expression, predicted value).
- reference : List of 2-tuples
Returns a list where each element is a tuple of (expression, reference value). NOTE: this reference value is an “effective” reference value (np.zeros_like(preds)) if a comparator is used.
- get_unique_metadata_values(key)¶
Return a set of all unique values for metadata entries.
Example for a data_set with 4 entries, where the third lacks the “Group” metadata key:
- entry 1: Group: a
- entry 2: Group: b
- entry 3: (no Group)
- entry 4: Group: b
get_unique_metadata_values('Group') then returns {None, 'a', 'b'}.
- key : str
The metadata key.
- group_by_metadata(key: str, key2: str | None = None)¶
Return a dictionary of data_sets grouped by the metadata key. If key2 is given, return a nested dictionary grouped first by key and then by key2.
Example for a data_set with 5 entries, where the third lacks the “Group” metadata key:
- entry 1: Group: a, SubGroup: d
- entry 2: Group: a, SubGroup: d
- entry 3: (no Group), SubGroup: f
- entry 4: Group: b, SubGroup: d
- entry 5: Group: b, SubGroup: e
If key == 'Group' and key2 == None, returns {'a': DataSet[dsentry1, dsentry2], None: DataSet[dsentry3], 'b': DataSet[dsentry4, dsentry5]}.
If key == 'Group' and key2 == 'SubGroup', returns {'a': {'d': DataSet[dsentry1, dsentry2]}, None: {'f': DataSet[dsentry3]}, 'b': {'d': DataSet[dsentry4], 'e': DataSet[dsentry5]}}.
- key : str
The first metadata key
- key2 : str or None
The second metadata key
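A usage sketch (the 'Source' key matches the metadata used in the demonstration above):
>>> groups = ds.group_by_metadata('Source')
>>> for source, sub_ds in groups.items():
...     print(source, len(sub_ds))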
- apply_weights_scheme(weights_scheme: WeightsScheme)¶
Apply the weights scheme to all entries in the data_set that have ReferenceValues (entries without ReferenceValues are skipped).
- weights_scheme : WeightsScheme
The weights scheme to apply.
- get_raveled_reference_values()¶
Returns all reference values in a 1D numpy array. Can be used, for example, to plot a histogram of the reference values.
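A minimal plotting sketch (assumes matplotlib is available; the bin count is arbitrary):
>>> import matplotlib.pyplot as plt
>>> values = ds.get_raveled_reference_values()  # 1D numpy array
>>> plt.hist(values, bins=50)
>>> plt.xlabel("Reference value")
>>> plt.savefig("reference_histogram.png")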
- __hash__ = None¶