Database

This submodule contains several classes that provide an interface to an SQL database for managing COSKF files and physical properties.

class pyCRS.Database.COSKFDatabase(path)

A class that provides an interface to an SQL database containing the following tables.

Table name – Description

  • Compound – contains unique compounds along with their COSKF files, identified by either CAS number or any preferred identifier

  • Conformer – contains multiple conformers along with their COSKF files

  • PhysicalProperty – contains the physical properties entered by the user

  • PropPred – contains the physical properties estimated by QSPR methods based on SMILES

Parameters:

path (str) – a path to the database file. If this file does not exist, it will be created.

add_compound(coskf_file, name=None, cas=None, identifier=None, coskf_path=None, smiles=None, nring=None, ignore_smiles_check=False, ignore_duplicates=False)

Adds a new .coskf file to the database.

Parameters:

coskf_file (str) – a path to the .coskf file, or alternatively, the file name of the .coskf file if the coskf_path is provided.

Keyword Arguments:
  • name (str) – The entry’s name, such as the compound name. If not provided, a name is chosen in order of priority from the IUPAC name, CAS number, identifier, or .coskf file name, using values passed to add_compound() or stored in the ‘Compound Data’ section of the .coskf file.

  • cas (str) – The CAS number of the molecule. If not provided, it will attempt to use the CAS within the .coskf file if available.

  • identifier (str) – The chemical identifier of the molecule.

  • coskf_path (str) – The directory path to the .coskf file. If not provided, it will attempt to locate the path of the ADFCRS-2018 database.

  • smiles (str) – The SMILES string of the molecule. If not provided, it will attempt to use the SMILES within the .coskf file if available.

  • nring (int) – The number of ring atoms. If not provided, it will attempt to use the Nring within the .coskf file if available.

  • ignore_smiles_check (bool) – If set to True, skip generating the SMILES from the compound’s coordinates to confirm its identity against the database. Default is False.

  • ignore_duplicates (bool) – If set to True, skip duplicate recognition using UniqueConformersCrest in the AMSConformer tool. Default is False.

Note

Ensure every compound has a unique representation, either by CAS number or by a preferred identifier. During add_compound, both the CAS number and the identifier are checked for uniqueness in the database. If a new compound shares its CAS number or identifier with an existing compound, an error is raised. For instance, the operation below is not allowed because both compounds share identifier="CRS0001":

db.add_compound("Benzene.coskf", cas="71-43-2", identifier="CRS0001")
db.add_compound("Ethanol.coskf", cas="64-17-5", identifier="CRS0001")

add_physical_property(identifier, attribute, value, unit=None)

Add the value of a physical property to the PhysicalProperty TABLE in the database for the compound matching the given identifier.

Parameters:
  • identifier (str) – the string representing either CAS, identifier or name of a compound

  • attribute (str) – the name of the physical property (e.g. meltingpoint or hfusion)

  • value (float) – the value of the physical property

Keyword Arguments:

unit (str) – (optional) the unit of the input value. The default units are K, kcal/mol, and kcal/mol-K. Accepted units are currently K, C, kcal/mol, kJ/mol, cal/g, J/g, kcal/mol-K, kJ/mol-K, cal/g-K, and J/g-K.

Example

db.add_physical_property('Benzene', 'meltingpoint', 278.7)
db.add_physical_property('Benzene', 'hfusion', 9.91, unit='kJ/mol')
db.add_physical_property('Benzene', 'vp_equation', 'Antoine')
db.add_physical_property('Benzene', 'vp_params', '4.72583, 1660.652, -1.461')
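The unit options listed above can be illustrated with a small stand-alone converter to the default units. This is only a sketch: the function name `to_default_unit`, the factor 1 kcal = 4.184 kJ, and the use of a molar mass for the per-gram units are assumptions made for illustration. The source does not state how pyCRS performs the conversion internally.

```python
KJ_PER_KCAL = 4.184  # thermochemical calorie; an assumption, not taken from pyCRS


def to_default_unit(value, unit, molar_mass=None):
    """Convert value to the default unit of its kind (K, kcal/mol or kcal/mol-K).

    molar_mass (g/mol) is only needed for the per-gram units.
    """
    if unit == "K":
        return value
    if unit == "C":
        return value + 273.15
    # Heat capacities use the same energy conversion; only the "-K" suffix differs.
    base = unit[:-2] if unit.endswith("-K") else unit
    if base == "kcal/mol":
        return value
    if base == "kJ/mol":
        return value / KJ_PER_KCAL
    if base in ("cal/g", "J/g"):
        if molar_mass is None:
            raise ValueError("molar_mass (g/mol) is required for per-gram units")
        per_mol = value * molar_mass  # now in cal/mol or J/mol
        if base == "cal/g":
            return per_mol / 1000.0
        return per_mol / (KJ_PER_KCAL * 1000.0)
    raise ValueError(f"unsupported unit: {unit}")


# Hypothetical usage: the hfusion value from the example above, in kcal/mol.
hfusion_kcal = to_default_unit(9.91, "kJ/mol")
```

The per-gram branches show why such units cannot be converted from the value alone: a molar mass has to come from somewhere, and where pyCRS obtains it is not covered here.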

clear_physical_property(identifier: str, attribute: str | None = None)

Clear the value of a physical property in the PhysicalProperty TABLE in the database for the compound matching the given identifier.

Parameters:

identifier (str) – the string representing either CAS, identifier or name of a compound

Keyword Arguments:

attribute (str, optional) – The name of the physical property to clear. If not provided, all physical properties will be cleared.

del_row(dbrow: CompoundRow)

Remove a compound from the database and delete the corresponding .coskf file.

Parameters:

dbrow (CompoundRow) – the row to remove from the database

del_row_by_conformer_id(conformer_id)

Remove the conformer from the database.

Parameters:

conformer_id (int) – An integer representing the conformer in the CONFORMER TABLE.

Example

db.del_row_by_conformer_id(1)

del_rows(dbrows)

Remove multiple compounds from the database and delete the corresponding .coskf files.

Parameters:

dbrows (list) – the rows to remove from the database, represented as a list of CompoundRow objects

Example

db.del_rows(db.get_compounds('benzene'))

estimate_physical_property(identifier=None, compound_id=None)

Estimate the physical properties using the property prediction tool and add the values to the PropPred TABLE in the database

Keyword Arguments:
  • identifier (str or list) – a string or a list of strings representing either CAS, identifier or name of a compound

  • compound_id (int or list) – an integer or a list representing the compound ID(s).

Note

The QSPR descriptors used in the property prediction tool are determined from the SMILES string. The tool first attempts to use the SMILES string provided by the user via the add_compound method or the modify_attribute_by_compound_id method. If that is unavailable, it uses the SMILES generated by OpenBabel from the compound’s coordinates in the COSKF file. Please note that the resolved SMILES may be incorrect for some molecules, for instance when bond orders cannot be determined automatically or for charged species.
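The fallback described in this note can be written as a one-step selection. The helper below is a hypothetical illustration, not part of pyCRS; it assumes that a missing user SMILES is represented by None or an empty string, which the source does not specify.

```python
def choose_smiles(smiles, resolved_smiles):
    """Return the user-supplied SMILES if present, otherwise the resolved one."""
    # Assumption: None and "" both mean "no SMILES was provided by the user".
    if smiles:
        return smiles
    return resolved_smiles


# Hypothetical values: a user-provided SMILES takes priority over the
# OpenBabel-derived one.
adopted = choose_smiles("c1ccccc1", "C1=CC=CC=C1")
fallback = choose_smiles(None, "C1=CC=CC=C1")
```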

Example

db.estimate_physical_property("Benzene")

get_all_compounds()

Collect all compounds in the database

Returns:

The full list of CompoundRow instances in the database

Return type:

list of CompoundRow

get_all_conformers()

Collects all conformers in the database

Returns:

The full list of ConformerRow instances in the database.

Return type:

list of ConformerRow

get_all_physical_properties(source='PhysicalProperty')

Collect all physical properties in the database

Keyword Arguments:

source (str) – The string should be either ‘PhysicalProperty’ or ‘PropPred’. Defaults to ‘PhysicalProperty’, returning properties from the PhysicalProperty TABLE. If set to ‘PropPred’, it will return the estimated properties in PropPred TABLE.

Returns:

The full list of PhysicalPropertyRow instances or PropPredRow instances in the database

get_attribute_by_compound_id(attributes, compound_id)

Retrieve the list of values for compounds with specified compound_id(s)

Parameters:
  • attributes (str or list) – A string or a list of strings naming the attributes to retrieve from the COMPOUND TABLE.

  • compound_id (int or list) – An integer or a list of integers used to search for compounds in the COMPOUND TABLE.

Returns:

A list of tuples containing the values of the specified attributes for the compounds.

Return type:

list of tuple
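Because the result is a list of tuples, it can be convenient to label each value with its attribute name. The helper below is a hypothetical sketch, not part of pyCRS; it assumes that the values in each tuple appear in the same order as the requested attributes, which the source does not state explicitly.

```python
def rows_to_dicts(attributes, rows):
    """Turn a list of value tuples into a list of {attribute: value} dicts."""
    # Accept a single attribute name as well as a list of names.
    names = [attributes] if isinstance(attributes, str) else list(attributes)
    return [dict(zip(names, row)) for row in rows]


# Hypothetical data shaped like the documented return value.
example_rows = [("Benzene", "71-43-2"), ("Ethanol", "64-17-5")]
labelled = rows_to_dicts(["name", "cas"], example_rows)
```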

get_compounds(values)

Retrieves compounds from the COMPOUND TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

values (str or list) – A string or a list of strings used for searching, representing CAS numbers, chemical identifiers, or names.

Returns:

A list of CompoundRow instances that match the search criteria

Return type:

list of CompoundRow

get_compounds_id(values)

Retrieves compound IDs from the COMPOUND TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

values (str or list) – A string or a list of strings used for searching, representing CAS numbers, chemical identifiers, or names.

Returns:

A list of compound IDs that match the search criteria.

Return type:

list of int

get_conformers(values)

Retrieves conformers from the CONFORMER TABLE in the database by matching CAS number, chemical identifier, or name.

Parameters:

values (str or list) – A string or a list of strings used for searching, representing CAS numbers, chemical identifiers, or names.

Returns:

A list of ConformerRow instances that match the search criteria.

Return type:

list of ConformerRow

get_physical_properties(identifier=None, compound_id=None, source='PhysicalProperty')

Collect physical properties in the database by matching CAS number, chemical identifier, name or compound id.

Keyword Arguments:
  • identifier (str or list) – a string or a list of strings representing either CAS, identifier or name of a compound

  • compound_id (int or list) – An integer or a list of integers representing the compound ID(s) in the database.

  • source (str) – The string should be either ‘PhysicalProperty’ or ‘PropPred’. Defaults to ‘PhysicalProperty’, returning properties from the PhysicalProperty TABLE. If set to ‘PropPred’, it will return the estimated properties in PropPred TABLE.

Returns:

The list of PhysicalPropertyRow instances or PropPredRow instances in the database

Return type:

list of PhysicalPropertyRow or PropPredRow
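The source argument makes it possible to put user-entered and estimated values side by side. The function below is a sketch under assumptions: `compare_with_estimate` is a hypothetical helper, and `db` is any object that behaves like a COSKFDatabase. Only get_physical_properties() and the documented row attributes are used, and the chosen attribute must exist in both row classes (for example meltingpoint, but not cpfusion).

```python
def compare_with_estimate(db, identifier, attribute="meltingpoint"):
    """Pair user-entered and estimated values of one property per compound.

    Returns a list of (compound_id, user_value, estimated_value) tuples;
    estimated_value is None when no PropPred row exists for that compound.
    """
    user_rows = db.get_physical_properties(
        identifier=identifier, source="PhysicalProperty"
    )
    pred_rows = db.get_physical_properties(identifier=identifier, source="PropPred")
    # Index the estimated rows by compound_id for direct lookup.
    estimated = {row.compound_id: getattr(row, attribute) for row in pred_rows}
    return [
        (row.compound_id, getattr(row, attribute), estimated.get(row.compound_id))
        for row in user_rows
    ]
```

With a real database this would be called as compare_with_estimate(db, "Benzene"), after estimate_physical_property has filled the PropPred TABLE.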

modify_attribute_by_compound_id(attribute, value, compound_id)

Modify the attribute value for an entry associated with the compound id.

Parameters:
  • attribute (str) – the attribute to be modified. It can be one of the following: ‘name’, ‘cas’, ‘identifier’, ‘smiles’, ‘nring’.

  • value (str) – the new value of the specified attribute for the compound ID(s).

  • compound_id (int) – an integer representing the compound ID.

Example

db.modify_attribute_by_compound_id("identifier","InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", 0)

update_compound_by_conformer_id(compound_id, conformer_id)

Update the data for a compound ID row in the COMPOUND TABLE using the data from a conformer ID row in the CONFORMER TABLE.

Parameters:
  • compound_id (int) – An integer representing the compound ID of a specific row in the COMPOUND TABLE of the database

  • conformer_id (int) – An integer representing the conformer ID of a specific row in the CONFORMER TABLE of the database

update_compound_by_lowestE(compound_id=None)

Update the data for a compound ID row in the COMPOUND TABLE using the data from a conformer ID row with the lowest energy having the same compound ID in the CONFORMER TABLE.

Keyword Arguments:
  • compound_id (int or list) – An integer or a list of integers representing the compound ID(s) of specific rows in the COMPOUND TABLE of the database. If compound_id is not specified, the method is applied to the whole database.
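The selection performed by this method can be pictured as a minimum search per compound. The sketch below is not the pyCRS implementation: `Conformer` is a hypothetical stand-in for a conformer row, and using Ecosmo as the energy to minimize is an assumption, since the source only says "lowest energy".

```python
from dataclasses import dataclass


@dataclass
class Conformer:
    """Hypothetical stand-in for a row of the CONFORMER TABLE."""
    conformer_id: int
    compound_id: int
    Ecosmo: float  # kcal/mol; assumed to be the energy that is compared


def lowest_energy_conformers(conformers):
    """Map each compound_id to the conformer_id with the lowest Ecosmo."""
    best = {}
    for conf in conformers:
        current = best.get(conf.compound_id)
        if current is None or conf.Ecosmo < current.Ecosmo:
            best[conf.compound_id] = conf
    return {cid: conf.conformer_id for cid, conf in best.items()}


# Hypothetical data: two conformers of compound 1, one of compound 2.
selection = lowest_energy_conformers([
    Conformer(conformer_id=10, compound_id=1, Ecosmo=-1500.250),
    Conformer(conformer_id=11, compound_id=1, Ecosmo=-1501.125),
    Conformer(conformer_id=20, compound_id=2, Ecosmo=-800.000),
])
```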

visualize_conformers(compound_id)

Visualize a set of conformers in order of conformer ID

Parameters:

compound_id (int) – an integer representing the compound ID.
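The methods of COSKFDatabase documented above can be combined into a short workflow. The following is a sketch only: `open_database` and `register_compound` are hypothetical helper names, the file names in the comments are placeholders, and the pyCRS import is done lazily so the snippet can be loaded on a machine without an AMS installation. Only the documented signatures are used.

```python
def open_database(path):
    """Open (or create) a COSKF database at the given path."""
    # Imported here so that loading this sketch does not require pyCRS.
    from pyCRS.Database import COSKFDatabase
    return COSKFDatabase(path)


def register_compound(db, coskf_file, name, cas, melting_point_K):
    """Add a compound, store one measured property, and run the estimator.

    `db` is expected to behave like a COSKFDatabase instance.
    Returns the matching CompoundRow list from get_compounds().
    """
    db.add_compound(coskf_file, name=name, cas=cas)
    db.add_physical_property(name, "meltingpoint", melting_point_K)
    db.estimate_physical_property(name)
    return db.get_compounds(name)


# Hypothetical usage (placeholder file names):
#   db = open_database("my_compounds.db")
#   rows = register_compound(db, "Benzene.coskf", "Benzene", "71-43-2", 278.7)
#   for row in rows:
#       print(row.name, row.cas, row.get_full_coskf_path())
```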

class pyCRS.Database.CompoundRow(compound_id: int, conformer_id: int, name: str, cas: str, identifier: str, smiles: str, resolved_smiles: str, coskf: str, Egas: float, Ecosmo: float, nring: int)

A data class to represent the contents of a row in a COMPOUND TABLE in COSKFDatabase

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

conformer_id

A unique identifier for a specific row in the CONFORMER TABLE of the database

Type:

int

name

The name associated with the row in the COMPOUND TABLE

Type:

str

cas

The CAS number associated with the row, i.e., the compound

Type:

str

identifier

The chemical identifier associated with the row, i.e., the compound

Type:

str

smiles

The SMILES string provided by the user

Type:

str

resolved_smiles

The derived SMILES string obtained using OpenBabel from the coordinates in the COSKF file.

Type:

str

coskf

The filename of the .coskf file stored in the local SCM_PYCRS_COSKF_DB directory

Type:

str

Egas

The gas phase bond energy rounded to 3 decimal places in kcal/mol

Type:

float

Ecosmo

The bond energy in a perfect conductor rounded to 3 decimal places in kcal/mol

Type:

float

nring

The number of ring atoms

Type:

int

db_path

The path to the .coskf file directory

Type:

str

get_full_coskf_path()

Returns the full path of the corresponding .coskf file

read_coskf()

Opens the .coskf file corresponding to the database entry and returns a scm.plams.KFFile instance

class pyCRS.Database.ConformerRow(conformer_id: int, compound_id: int, name: str, cas: str, identifier: str, smiles: str, resolved_smiles: str, coskf: str, Egas: float, Ecosmo: float, nring: int)

A data class to represent the contents of a row in a CONFORMER TABLE in COSKFDatabase

conformer_id

A unique identifier for a specific row in the CONFORMER TABLE of the database

Type:

int

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

name

The name associated with the row in the CONFORMER TABLE

Type:

str

cas

The CAS number associated with the row, i.e., the compound

Type:

str

identifier

The chemical identifier associated with the row, i.e., the compound

Type:

str

smiles

The SMILES string provided by the user

Type:

str

resolved_smiles

The derived SMILES string obtained using OpenBabel from the coordinates in the COSKF file

Type:

str

coskf

The filename of the .coskf file stored in the local SCM_PYCRS_COSKF_DB directory

Type:

str

Egas

The gas phase bond energy rounded to 3 decimal places in kcal/mol

Type:

float

Ecosmo

The bond energy in a perfect conductor rounded to 3 decimal places in kcal/mol

Type:

float

nring

The number of ring atoms

Type:

int

db_path

The path to the .coskf file directory

Type:

str

get_full_coskf_path()

Returns the full path of the corresponding .coskf file

read_coskf()

Opens the .coskf file corresponding to the database entry and returns a scm.plams.KFFile instance

class pyCRS.Database.PhysicalPropertyRow(compound_id: int, meltingpoint: float, hfusion: float, cpfusion: float, boilingpoint: float, density: float, flashpoint: float, dielectricconstant: float, vp_equation: str, vp_params: str, tvap: float, pvap: float, Mn: float)

A data class to represent the contents of a row in a PhysicalProperty TABLE in COSKFDatabase

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

meltingpoint

melting temperature (K)

Type:

float

hfusion

enthalpy of fusion (kcal/mol)

Type:

float

cpfusion

heat capacity of fusion (kcal/mol-K) calculated as the difference between the heat capacity in the liquid state and the heat capacity in the solid state.

Type:

float

boilingpoint

boiling point (K)

Type:

float

density

liquid density (kg/L)

Type:

float

flashpoint

flash point (K)

Type:

float

dielectricconstant

dielectric constant

Type:

float

vp_equation

The vapor pressure equation to use, with pressure in bar. Options are ANTOINE, VPM1, and DIPPR101.

Type:

str

vp_params

Parameters for the vp_equation, expressed as “A, B, C, D, E”

Type:

str

tvap

Temperature (K) at pvap

Type:

float

pvap

Pressure (bar) at tvap

Type:

float

Mn

polymer average molecular weight (g/mol)

Type:

float

Vapor Pressure Equations:
ANTOINE:

log10(P) = A - B/(C+T)

DIPPR101:

ln(P) = A + B/T + C*ln(T) + D*T**E

VPM1:

ln(P) = A/T + B*ln(T) + C*T + D
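The three equations above can be evaluated directly from a vp_equation name and a vp_params string. The sketch below is a stand-alone illustration, not pyCRS code: the function names are hypothetical, the temperature is assumed to be in K, the pressure is returned in bar as stated for vp_equation, and the equation name is matched case-insensitively as an assumption (the example for add_physical_property writes 'Antoine').

```python
import math


def parse_vp_params(vp_params):
    """Split a string such as "A, B, C" into a list of floats."""
    return [float(p) for p in vp_params.split(",")]


def vapor_pressure(vp_equation, vp_params, T):
    """Evaluate the vapor pressure (bar) at temperature T (assumed K)."""
    p = parse_vp_params(vp_params)
    name = vp_equation.upper()  # assumption: names are case-insensitive
    if name == "ANTOINE":
        A, B, C = p[:3]
        return 10.0 ** (A - B / (C + T))
    if name == "DIPPR101":
        A, B, C, D, E = p[:5]
        return math.exp(A + B / T + C * math.log(T) + D * T ** E)
    if name == "VPM1":
        A, B, C, D = p[:4]
        return math.exp(A / T + B * math.log(T) + C * T + D)
    raise ValueError(f"unknown vapor pressure equation: {vp_equation}")


# The benzene Antoine parameters from the add_physical_property example,
# evaluated near benzene's normal boiling point.
p_benzene = vapor_pressure("Antoine", "4.72583, 1660.652, -1.461", 353.24)
```

If fewer parameters are supplied than the chosen equation needs, the tuple unpacking raises a ValueError, which is a deliberate choice here to avoid silently filling in missing coefficients.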

class pyCRS.Database.PropPredRow(compound_id: int, adopt_smiles: str, meltingpoint: float, hfusion: float, boilingpoint: float, density: float, flashpoint: float, dielectricconstant: float, vp_equation: str, vp_params: str)

A data class to represent the contents of a row in a PropPred TABLE in COSKFDatabase

compound_id

A unique identifier for a specific row in the COMPOUND TABLE of the database

Type:

int

adopt_smiles

The SMILES used for QSPR method

Type:

str

meltingpoint

melting temperature (K)

Type:

float

hfusion

enthalpy of fusion (kcal/mol)

Type:

float

boilingpoint

boiling point (K)

Type:

float

density

liquid density (kg/L)

Type:

float

flashpoint

flash point (K)

Type:

float

dielectricconstant

dielectric constant

Type:

float

vp_equation

The vapor pressure equation used, with pressure in bar: VPM1

Type:

str

vp_params

Parameters for the vp_equation, expressed as “A, B, C, D, E”

Type:

str