The Database Class

A class designed for storing, retrieving and updating results.

_images/Database.png

The methods of the Database class can be divided into three categories according to their functionality:

  • Opening & closing the database - these methods serve as context managers for loading and unloading parts of the database from the hard drive.

    The context managers can be accessed by calling either Database.csv_lig, Database.csv_qd, or Database.hdf5, with the option of passing additional positional or keyword arguments.

    >>> from dataCAT import Database
    
    >>> database = Database()
    >>> with database.csv_lig(write=False) as db:
    ...     print(repr(db))
    DFProxy(ndframe=<pandas.core.frame.DataFrame at 0x7ff8e958ce80>)
    
    >>> with database.hdf5('r') as db:
    ...     print(type(db))
    <class 'h5py._hl.files.File'>
    
  • Importing to the database - these methods handle the importing of new data from Python objects to the Database class:

    update_csv() update_hdf5() update_mongodb()
  • Exporting from the database - these methods handle the exporting of data from the Database class to other Python objects or remote locations:

    from_csv() from_hdf5()
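
The open/close pattern of the first category can be illustrated with a minimal sketch. It is not the dataCAT implementation: OpenRows and RowStore are hypothetical names, and the standard-library csv module stands in for pandas. The sketch shows a property that returns a function for building a context manager, and a write flag that decides whether changes are saved on exit.

```python
import csv
import functools
import os
import tempfile


class OpenRows:
    """Hypothetical stand-in for a context manager such as OpenLig."""

    def __init__(self, filename, write=True):
        self.filename = filename
        self.write = write
        self.rows = None

    def __enter__(self):
        # Load the file from disk when the block is entered.
        with open(self.filename, newline="") as f:
            self.rows = list(csv.reader(f))
        return self.rows

    def __exit__(self, exc_type, exc_value, traceback):
        # Write the (possibly modified) rows back only if requested.
        if self.write:
            with open(self.filename, "w", newline="") as f:
                csv.writer(f).writerows(self.rows)
        self.rows = None


class RowStore:
    """Hypothetical stand-in for Database; only the csv part is sketched."""

    def __init__(self, filename):
        self._filename = filename

    @property
    def csv(self):
        # Return a function that builds the context manager; extra
        # positional or keyword arguments are forwarded to OpenRows.
        return functools.partial(OpenRows, self._filename)


with tempfile.TemporaryDirectory() as tmpdir:
    # Create a small csv file to work with.
    path = os.path.join(tmpdir, "ligand.csv")
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows([["smiles", "anchor"], ["CCO", "O3"]])

    store = RowStore(path)

    # write=False: changes made inside the block are discarded on exit.
    with store.csv(write=False) as rows:
        rows.append(["CCS", "S3"])

    # write=True: changes are saved when the block exits.
    with store.csv(write=True) as rows:
        rows.append(["CCN", "N3"])

    # Re-open read-only to inspect what ended up on disk.
    with store.csv(write=False) as rows:
        final_rows = [list(r) for r in rows]
```

Only the row appended under write=True survives in final_rows; the one appended under write=False is never written back.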

Index

Database.dirname Get the full path of the directory containing all database components.
Database.csv_lig Get a function for constructing a dataCAT.OpenLig context manager.
Database.csv_qd Get a function for constructing a dataCAT.OpenQD context manager.
Database.hdf5 Get a function for constructing a h5py.File context manager.
Database.mongodb Get a mapping with keyword arguments for pymongo.MongoClient.
Database.update_mongodb([database, overwrite]) Export ligand or qd results to the MongoDB database.
Database.update_csv(df[, index, database, …]) Update Database.csv_lig or Database.csv_qd with new settings.
Database.update_hdf5(df, index[, database, …]) Export molecules (see the "mol" column in df) to the structure database.
Database.from_csv(df[, database, get_mol, …]) Pull results from Database.csv_lig or Database.csv_qd.
Database.from_hdf5(index[, database, rdmol, …]) Import structures from the hdf5 database as RDKit or PLAMS molecules.
DFProxy(ndframe) A mutable wrapper providing a view of the underlying dataframes.
OpenLig(filename[, write]) Context manager for opening and closing the ligand database (Database.csv_lig).
OpenQD(filename[, write]) Context manager for opening and closing the QD database (Database.csv_qd).

API

class dataCAT.Database(path=None, host='localhost', port=27017, **kwargs)[source]

The Database class.

property dirname

Get the full path of the directory containing all database components.

property csv_lig

Get a function for constructing a dataCAT.OpenLig context manager.

Type: Callable[..., dataCAT.OpenLig]
property csv_qd

Get a function for constructing a dataCAT.OpenQD context manager.

Type: Callable[..., dataCAT.OpenQD]
property hdf5

Get a function for constructing a h5py.File context manager.

Type: Callable[..., h5py.File]
property mongodb

Get a mapping with keyword arguments for pymongo.MongoClient.

Type: Mapping[str, Any], optional
update_mongodb(database='ligand', overwrite=False)[source]

Export ligand or qd results to the MongoDB database.

Examples

>>> from dataCAT import Database

>>> kwargs = dict(...)  
>>> db = Database(**kwargs)  

# Update from db.csv_lig
>>> db.update_mongodb('ligand')  

# Update from lig_df, a user-provided DataFrame
>>> db.update_mongodb({'ligand': lig_df})  
>>> print(type(lig_df))  
<class 'pandas.core.frame.DataFrame'>
Parameters:
  • database (str or Mapping[str, pandas.DataFrame]) – The type of database. Accepted values are "ligand" and "qd", opening Database.csv_lig and Database.csv_qd, respectively. Alternatively, a dictionary with the database name and a matching DataFrame can be passed directly.
  • overwrite (bool) – Whether previous entries can be overwritten.
Return type:

None
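
The two accepted forms of the database argument (a name or a mapping) can be reconciled with a small dispatch step. The following is a sketch under stated assumptions, not the dataCAT code: normalize_database and the loaders mapping are hypothetical, and plain lists stand in for DataFrames.

```python
from collections.abc import Mapping


def normalize_database(database, loaders):
    """Return a {name: table} dict for either accepted form of database.

    Hypothetical helper: loaders maps "ligand"/"qd" to zero-argument
    callables that load the matching table from disk.
    """
    if isinstance(database, str):
        if database not in loaders:
            raise ValueError(f"unknown database: {database!r}")
        # A name was passed: load the matching table.
        return {database: loaders[database]()}
    if isinstance(database, Mapping):
        unknown = set(database) - set(loaders)
        if unknown:
            raise ValueError(f"unknown database(s): {sorted(unknown)}")
        # A mapping was passed: use the user-provided tables directly.
        return dict(database)
    raise TypeError("database must be a str or a Mapping")


loaders = {
    "ligand": lambda: ["lig-from-disk"],
    "qd": lambda: ["qd-from-disk"],
}

from_name = normalize_database("ligand", loaders)
from_mapping = normalize_database({"ligand": ["user-provided"]}, loaders)
```

With a name the table is loaded via the matching loader; with a mapping the user-provided table is passed through unchanged.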

update_csv(df, index=None, database='ligand', columns=None, overwrite=False, job_recipe=None, status=None)[source]

Update Database.csv_lig or Database.csv_qd with new settings.

Parameters:
  • df (pandas.DataFrame) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are "ligand" (Database.csv_lig) and "qd" (Database.csv_qd).
  • columns (Sequence, optional) – A sequence of column keys in df which (potentially) are to be added to this instance. If None, add all columns.
  • overwrite (bool) – Whether previous entries can be overwritten.
  • status (str, optional) – A descriptor of the status of the molecular structures. Set to "optimized" to treat them as optimized geometries.
Return type:

None
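
One plausible reading of the columns and overwrite parameters is sketched below. It assumes that new values are always added while existing values are replaced only when overwrite is true; update_table is a hypothetical function operating on nested dictionaries, not the dataCAT implementation.

```python
def update_table(table, new_rows, columns=None, overwrite=False):
    """Merge new_rows into table (both map row keys to {column: value}).

    Hypothetical sketch of the columns/overwrite semantics.
    """
    for key, row in new_rows.items():
        target = table.setdefault(key, {})
        for column, value in row.items():
            if columns is not None and column not in columns:
                continue  # column was not selected
            if overwrite or column not in target:
                target[column] = value


table = {"CCO": {"energy": -1.0}}
new_rows = {
    "CCO": {"energy": -2.0, "volume": 5.0},
    "CCS": {"energy": -3.0},
}

update_table(table, new_rows)                  # keeps the old energy
kept = table["CCO"]["energy"]

update_table(table, new_rows, overwrite=True)  # replaces the old energy
replaced = table["CCO"]["energy"]
```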

update_hdf5(df, index, database='ligand', overwrite=False, status=None)[source]

Export molecules (see the "mol" column in df) to the structure database.

Returns a series with the Database.hdf5 indices of all new entries.

Parameters:
  • df (pandas.DataFrame) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are "ligand" and "qd".
  • overwrite (bool) – Whether previous entries can be overwritten.
  • status (str, optional) – A descriptor of the status of the molecular structures. Set to "optimized" to treat them as optimized geometries.
Returns:

A series with the indices of all new molecules in Database.hdf5.

Return type:

pandas.Series

from_csv(df, database='ligand', get_mol=True, inplace=True)[source]

Pull results from Database.csv_lig or Database.csv_qd.

Performs an in-place update of df if inplace = True, thus returning None.

Parameters:
  • df (pandas.DataFrame) – A dataframe of new (potential) database entries.
  • database (str) – The type of database; accepted values are "ligand" and "qd".
  • get_mol (bool) – Attempt to pull preexisting molecules from the database. See the inplace argument for more details.
  • inplace (bool) – If True, perform an in-place update of the "mol" column in df. Otherwise, return a new series of PLAMS molecules.
Returns:

Optional: A Series of PLAMS molecules if get_mol = True and inplace = False.

Return type:

pandas.Series, optional
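
The inplace convention described here (mutate the input and return None, or leave it untouched and return the result) can be sketched as follows. pull_mol is a hypothetical function using dictionaries in place of a DataFrame and the database; it only illustrates the return-value contract.

```python
def pull_mol(table, stored, inplace=True):
    """Fill the "mol" entries of table from stored.

    Hypothetical sketch: table maps row keys to dicts with a "mol" field
    and stored maps row keys to previously saved molecules.
    """
    found = {key: stored[key] for key in table if key in stored}
    if inplace:
        for key, mol in found.items():
            table[key]["mol"] = mol
        return None
    return found


table = {"CCO": {"mol": None}, "CCS": {"mol": None}}
stored = {"CCO": "<ethanol structure>"}

returned = pull_mol(table, stored, inplace=False)  # table is left untouched
untouched = table["CCO"]["mol"]

result = pull_mol(table, stored, inplace=True)     # table is updated
```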

from_hdf5(index, database='ligand', rdmol=True, mol_list=None)[source]

Import structures from the hdf5 database as RDKit or PLAMS molecules.

Parameters:
  • index (Sequence[int] or slice) – The indices of the structures to be retrieved.
  • database (str) – The type of database; accepted values are "ligand" and "qd".
  • rdmol (bool) – If True, return an RDKit molecule instead of a PLAMS molecule.
Returns:

A list of PLAMS or RDKit molecules.

Return type:

List[plams.Molecule] or List[rdkit.Mol]

hdf5_availability(timeout=5.0, max_attempts=10)[source]

Check if a .hdf5 file is opened by another process; return once it is not.

If two processes attempt to simultaneously open a single hdf5 file then h5py will raise an OSError.

The purpose of this method is to ensure that a .hdf5 file is actually closed, thus allowing the Database.from_hdf5() method to safely access filename without the risk of raising an OSError.

Parameters:
  • timeout (float) – The timeout, in seconds, between subsequent attempts to open filename.
  • max_attempts (int, optional) – The maximum number of attempts for opening filename. If the maximum number of attempts is exceeded, raise an OSError. Setting this value to None allows an unlimited number of attempts.
Raises:

OSError – Raised if max_attempts is exceeded.

See also

dataCAT.functions.hdf5_availability()
This method as a function.
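
The retry behaviour described for hdf5_availability() can be sketched with the standard library alone. This is an illustration under stated assumptions, not the dataCAT source: wait_until_available is a hypothetical name, and opener is any zero-argument callable that raises OSError while the file is still in use (for a real file it could wrap an h5py.File call).

```python
import time


def wait_until_available(opener, timeout=5.0, max_attempts=10):
    """Call opener() until it stops raising OSError.

    Returns the number of attempts that were needed. If max_attempts is
    None, keep trying indefinitely.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            opener()
            return attempt
        except OSError:
            if max_attempts is not None and attempt >= max_attempts:
                raise OSError(
                    f"file still unavailable after {max_attempts} attempts"
                )
            # Wait before the next attempt.
            time.sleep(timeout)


# Simulate a file that is locked for the first two attempts.
calls = {"n": 0}


def fake_opener():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise OSError("file is locked")


attempts_needed = wait_until_available(fake_opener, timeout=0.0)
```

The timeout of 0.0 is used here only to keep the demonstration instantaneous.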
class dataCAT.DFProxy(ndframe)[source]

A mutable wrapper providing a view of the underlying dataframes.

ndframe

The embedded DataFrame.

Type: pandas.DataFrame
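
The idea of a mutable wrapper that provides a view of an embedded object can be sketched generically. Proxy below is a hypothetical class, not DFProxy itself, and a list stands in for the DataFrame. It forwards attribute access to the embedded object, which can be swapped without invalidating existing references to the wrapper.

```python
class Proxy:
    """Hypothetical sketch of a mutable wrapper around an embedded object."""

    def __init__(self, wrapped):
        self.wrapped = wrapped

    def __getattr__(self, name):
        # Only called when normal lookup fails, so "wrapped" itself is
        # found on the instance and everything else is forwarded.
        return getattr(self.wrapped, name)

    def __repr__(self):
        return f"{type(self).__name__}(wrapped={self.wrapped!r})"


proxy = Proxy([1, 2, 3])
alias = proxy              # a second reference to the same proxy

proxy.append(4)            # forwarded to the embedded list
first = list(proxy.wrapped)

proxy.wrapped = [10, 20]   # swap the embedded object
count_seen_by_alias = alias.count(10)  # alias sees the new object
```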
class dataCAT.OpenLig(filename, write=True)[source]

Context manager for opening and closing the ligand database (Database.csv_lig).

property filename

Get the name of the file to be opened.

Type: AnyStr
property write

Get whether or not filename should be written to when closing the context manager.

Type: bool
class dataCAT.OpenQD(filename, write=True)[source]

Context manager for opening and closing the QD database (Database.csv_qd).

property filename

Get the name of the file to be opened.

Type: AnyStr
property write

Get whether or not filename should be written to when closing the context manager.

Type: bool