The Database Class

A Class designed for the storing, retrieval and updating of results.

_images/Database.png

The methods of the Database class can be divided into three categories accoring to their functionality:

  • Opening & closing the database - these methods serve as context managers for loading and unloading parts of the database from the harddrive. These methods should be used in conjunction with with statements:

    import CAT
    
    database = CAT.Database()
    with database.open_csv_lig(db.csv_lig) as db:
        print('my ligand database')
    with database.open_yaml(db.yaml) as db:
        print('my job settings database')
    with h5py.File(db.hdf5) as db:
        print('my structure database')
    
    open_csv_lig open_csv_qd open_yaml h5py.File
  • Importing to the database - these methods handle the importing of new data from python objects to the Database class:

    update_csv() update_yaml() update_hdf5() update_mongodb()
  • Exporting from the database - these methods handle the exporting of data from the Database class to other python objects or remote locations:

    from_csv() from_hdf5()

Class API

class CAT.data_handling.database.Database(path=None, host: str = 'localhost', port: int = 27017, **kwargs)[source]

The Database class.

Atributes:
  • csv_lig (str) – Path and filename of the .csv file containing all ligand related results.
  • csv_qd (str) – Path and filename of the .csv file containing all quantum dot related results.
  • yaml (str) – Path and filename of the .yaml file containing all job settings.
  • hdf5 (str) – Path and filename of the .hdf5 file containing all structures (as partiallize de-serialized .pdb files).
  • mongodb (None or dict) – Optional: A dictionary with keyword

arguments for pymongo.MongoClient. # noqa

class open_yaml(path=None, write=True)[source]

Context manager for opening and closing the job settings database.

Parameters:
  • path (str) – The path+filename to the database component.
  • write (bool) – Whether or not the database file should be updated after closing self.
class open_csv_lig(path=None, write=True)[source]

Context manager for opening and closing the ligand database.

Parameters:
  • path (str) – The path+filename to the database component.
  • write (bool) – Whether or not the database file should be updated after closing self.
class open_csv_qd(path=None, write=True)[source]

Context manager for opening and closing the quantum dot database.

Parameters:
  • path (str) – The path+filename to the database component.
  • write (bool) – Whether or not the database file should be updated after closing self.
class DF(df: pandas.core.frame.DataFrame)[source]

A mutable container for holding dataframes.

A subclass of dict containing a single key ("df") and value (a Pandas DataFrame). Calling an item or attribute of DF will call said method on the underlaying DataFrame (self["df"]). An exception to this is the "df" key, which will get/set the DataFrame instead.

update_mongodb(database: str = 'ligand', overwrite: bool = False) → None[source]

Export ligand or qd results to the MongoDB database.

Parameters:
  • database (str) – The type of database; accepted values are "ligand" and "QD".
  • overwrite (bool) – Whether or not previous entries can be overwritten or not.
update_csv(df, database='ligand', columns=None, overwrite=False, job_recipe=None, opt=False)[source]

Update self.csv_lig or self.csv_qd with (potentially) new user provided settings.

Parameters:
  • df (pd.DataFrame (columns: str, index: str, values: plams.Molecule)) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are ligand and QD.
  • columns (None or list [tuple [str]]) – A list of column keys in df which (potentially) are to be added to self. If None: Add all columns.
  • overwrite (bool) – Whether or not previous entries can be overwritten or not.
  • job_recipe (None or plams.Settings (superclass: dict)) – A Settings object with settings specific to a job.
update_yaml(job_recipe)[source]

Update self.yaml with (potentially) new user provided settings.

Parameters:job_recipe (plams.Settings (superclass: dict)) – A settings object with one or more settings specific to a job.
Returns:A dictionary with the column names as keys and the key for self.yaml as matching values.
Return type:dict (keys: str, values: str)
update_hdf5(df, database='ligand', overwrite=False, opt=False)[source]

Export molecules (see the mol column in df) to the structure database. Returns a series with the self.hdf5 indices of all new entries.

Parameters:
  • df (pd.DataFrame (columns: str, index: str, values: plams.Molecule)) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are ligand and QD.
  • overwrite (bool) – Whether or not previous entries can be overwritten or not.
Returns:

A series with the index of all new molecules in self.hdf5

Return type:

pd.Series (index: str, values: np.int64)

from_csv(df, database='ligand', get_mol=True, inplace=True)[source]

Pull results from self.csv_lig or self.csv_qd. Performs in inplace update of df if inplace = True, returing None.

Parameters:
  • df (pd.DataFrame (columns: str, index: str, values: plams.Molecule)) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are ligand and QD.
  • columns – A list of to be updated columns in df.
  • get_mol (bool) – Attempt to pull preexisting molecules from the database. See inplace for more details.
  • inplace (bool) – If True perform an inplace update of the mol column in df. Otherwise Return a new series of PLAMS molecules.
Returns:

If inplace = False: return a new series of PLAMS molecules pulled from self, else return None

Return type:

None or pd.Series (index: str, values: plams.Molecule)

from_hdf5(index, database='ligand', rdmol=True, close=True)[source]

Import structures from the hdf5 database as RDKit or PLAMS molecules.

Parameters:
  • index (list [int]) – The indices of the to be retrieved structures.
  • database (str) – The type of database; accepted values are ligand and QD.
  • rdmol (bool) – If True, return an RDKit molecule instead of a PLAMS molecule.
  • close (bool) – If the database component should be closed afterwards.
Returns:

A list of PLAMS or RDKit molecules.

Return type:

list [plams.Molecule or rdkit.Chem.Mol]

hdf5_availability(timeout: float = 5.0, max_attempts: Optional[int] = None) → None[source]

Check if a .hdf5 file is opened by another process; return once it is not.

If two processes attempt to simultaneously open a single hdf5 file then h5py will raise an OSError. The purpose of this function is ensure that a .hdf5 is actually closed, thus allowing to_hdf5() to safely access filename without the risk of raising an OSError.

Parameters:
  • filename (str) – The path+filename of the hdf5 file.
  • timeout (float) – Time timeout, in seconds, between subsequent attempts of opening filename.
  • max_attempts (int) – Optional: The maximum number attempts for opening filename. If the maximum number of attempts is exceeded, raise an OSError.
Raises:

OSError – Raised if max_attempts is exceded.

Function API

CAT.data_handling.database_functions.mol_to_file(mol_list, path=None, overwrite=False, mol_format=['xyz', 'pdb'])[source]

Export all molecules in mol_list to .pdb and/or .xyz files.

Parameters:
  • mol_list (list [plams.Molecule]) – A list of PLAMS molecules.
  • path (None or str) – The path to the directory where the molecules will be stored. Defaults to the current working directory if None.
  • overwrite (bool) – If previously generated structures can be overwritten or not.
  • mol_format (list [str]) – A list of strings with the to-be exported file types. Accepted values are xyz and/or pdb.
CAT.data_handling.database_functions.as_pdb_array(mol_list, min_size=0)[source]

Converts a list of PLAMS molecule into an array of strings representing (partially) de-serialized .pdb files.

Parameters:
  • mol_list (list [plams.Molecule]) – A list of PLAMS molecules.
  • min_size (int) – The minimumum length of the pdb_array. The array is padded with empty strings if required.
Returns:

An array with m partially deserialized .pdb files with up to n lines each.

Return type:

m*n np.ndarray [np.bytes |S80]

CAT.data_handling.database_functions.from_pdb_array(array, rdmol=True)[source]

Converts an array with a (partially) de-serialized .pdb file into an RDKit or PLAMS molecule.

Parameters:
  • array (n np.ndarray [np.bytes / S80]) – A (partially) de-serialized .pdb file with n lines.
  • rdmol (bool) – If True, return an RDKit molecule instead of a PLAMS molecule.
Returns:

A PLAMS or RDKit molecule build from array.

Return type:

plams.Molecule or rdkit.Chem.Mol

CAT.data_handling.database_functions.sanitize_yaml_settings(settings, job_type)[source]

Remove a predetermined set of unwanted keys and values from a settings object.

Parameters:settings (plams.Settings (superclass: dict)) – A settings object with, potentially, undesired keys and values.
Returns:A (nested) dictionary with unwanted keys and values removed.
Return type:dict