HDF5 Property Storage

A module for storing quantum mechanical properties in HDF5 format.

Index

create_prop_group(file, scale) Create a group for holding user-specified properties.
create_prop_dset(group, name[, dtype, …]) Construct a new dataset for holding a user-defined molecular property.
update_prop_dset(dset, data[, index]) Update dset at position index with data.
validate_prop_group(group) Validate the passed hdf5 group, ensuring it is compatible with create_prop_group() and create_prop_dset().
index_to_pandas(dset[, fields]) Construct a MultiIndex from the passed index dataset.
prop_to_dataframe(dset[, dtype]) Convert the passed property Dataset into a DataFrame.

API

dataCAT.create_prop_group(file, scale)

Create a group for holding user-specified properties.

>>> import h5py
>>> import numpy as np
>>> from dataCAT import create_prop_group

>>> hdf5_file = str(...)  
>>> with h5py.File(hdf5_file, 'r+') as f:
...     scale = f.create_dataset('index', data=np.arange(10))
...     scale.make_scale('index')
...
...     group = create_prop_group(f, scale=scale)
...     print('group', '=', group)
group = <HDF5 group "/properties" (0 members)>
Parameters:
  • file (h5py.File or h5py.Group) – The File or Group where the new "properties" group should be created.
  • scale (h5py.Dataset) – The dimensional scale which will be attached to all property datasets created by dataCAT.create_prop_dset().
Returns:

The newly created group.

Return type:

h5py.Group
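
The snippet below is a minimal sketch of what "attaching a dimensional scale" means at the h5py level. It is not the implementation of create_prop_group() or create_prop_dset(); the in-memory file, the dataset name "E_solv" and its shape are assumptions chosen only for illustration.

```python
import h5py
import numpy as np

# An in-memory file (nothing is written to disk); the file name is a placeholder.
with h5py.File("sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # The scale: a plain dataset that is marked as a dimension scale.
    scale = f.create_dataset("index", data=np.arange(10))
    scale.make_scale("index")

    # A hypothetical property dataset whose first axis matches the scale.
    dset = f.create_dataset("E_solv", shape=(10, 3), dtype="f4")
    dset.dims[0].label = "index"
    dset.dims[0].attach_scale(scale)

    # Collect plain Python values before the file is closed.
    n_scales = len(dset.dims[0])           # number of scales on axis 0
    attached_path = dset.dims[0][0].name   # path of the first attached scale
    axis_label = dset.dims[0].label
```

Once attached, the scale can be recovered from any property dataset via its dims attribute, so the row labels travel with the data.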

dataCAT.create_prop_dset(group, name, dtype=None, prop_names=None, **kwargs)

Construct a new dataset for holding a user-defined molecular property.

Examples

In the example below a new dataset is created for storing solvation energies in water, methanol and ethanol.

>>> import h5py
>>> from dataCAT import create_prop_dset

>>> hdf5_file = str(...)  

>>> with h5py.File(hdf5_file, 'r+') as f:
...     group = f['properties']
...     prop_names = ['water', 'methanol', 'ethanol']
...
...     dset = create_prop_dset(group, 'E_solv', prop_names=prop_names)
...     dset_names = group['E_solv_names']
...
...     print('group', '=', group)
...     print('group["E_solv"]', '=', dset)
...     print('group["E_solv_names"]', '=', dset_names)
group = <HDF5 group "/properties" (2 members)>
group["E_solv"] = <HDF5 dataset "E_solv": shape (10, 3), type "<f4">
group["E_solv_names"] = <HDF5 dataset "E_solv_names": shape (3,), type "|S8">
Parameters:
  • group (h5py.Group) – The "properties" group where the new dataset will be created.
  • name (str) – The name of the new dataset.
  • prop_names (Sequence[str], optional) – The names of each column in the to-be created dataset. They define the length of the second axis and are used as a dimensional scale for that axis. If None, create a 1D dataset (with no columns) instead.
  • dtype (dtype-like, optional) – The data type of the to-be created dataset.
  • **kwargs (Any) – Further keyword arguments for the h5py.Group.create_dataset() method.
Returns:

The newly created dataset.

Return type:

h5py.Dataset
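
As a rough illustration of how prop_names relates to the shapes and types printed in the example above, the sketch below uses NumPy only. The helper planned_shape() is hypothetical and not part of dataCAT; the row count of 10 is taken from the np.arange(10) scale used in the create_prop_group() example.

```python
import numpy as np


def planned_shape(n_rows, prop_names=None):
    """Hypothetical helper: the dataset shape implied by ``prop_names``."""
    if prop_names is None:
        # No column names: a 1D dataset with one value per row.
        return (n_rows,)
    # One column per property name.
    return (n_rows, len(prop_names))


prop_names = ["water", "methanol", "ethanol"]

# Fixed-width byte strings; the width is set by the longest name ("methanol").
names = np.array(prop_names, dtype="S")

shape_2d = planned_shape(10, prop_names)
shape_1d = planned_shape(10)
```

The width of 8 bytes is consistent with the "|S8" type reported for E_solv_names, and the 2D shape with the (10, 3) reported for E_solv.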

dataCAT.update_prop_dset(dset, data, index=None)

Update dset at position index with data.

Parameters:
  • dset (h5py.Dataset) – The to-be updated h5py dataset.
  • data (numpy.ndarray) – An array containing the to-be added data.
  • index (slice or numpy.ndarray, optional) – The indices of all to-be updated elements in dset. index should be of the same length as data.
Return type:

None
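
The two accepted index types can be pictured with a plain NumPy array standing in for the dataset. This sketch does not call update_prop_dset(); it assumes the update behaves like ordinary indexed assignment, and the array shape and values are placeholders.

```python
import numpy as np

# In-memory stand-in for a (10, 3) property dataset.
values = np.zeros((10, 3), dtype="f4")

# Case 1: a slice index selecting three rows, with three rows of data.
values[0:3] = np.ones((3, 3), dtype="f4")

# Case 2: an integer array index; its length matches the length of the data.
rows = np.array([5, 7, 9])
data = np.full((3, 3), 2.0, dtype="f4")
assert len(rows) == len(data)
values[rows] = data
```

Rows that are not selected by the index (for example row 4) keep their previous values.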

dataCAT.validate_prop_group(group)

Validate the passed hdf5 group, ensuring it is compatible with create_prop_group() and create_prop_dset().

This method is called automatically when an exception is raised by update_prop_dset().

Parameters:
  • group (h5py.Group) – The to-be validated hdf5 Group.
Raises:

AssertionError – Raised if the validation process fails.

dataCAT.index_to_pandas(dset, fields=None)

Construct a MultiIndex from the passed index dataset.

Examples

>>> from dataCAT import index_to_pandas
>>> import h5py

>>> filename = str(...)  

# Convert the entire dataset
>>> with h5py.File(filename, "r") as f:
...     dset: h5py.Dataset = f["ligand"]["index"]
...     index_to_pandas(dset)
MultiIndex([('O=C=O', 'O1'),
            ('O=C=O', 'O3'),
            ( 'CCCO', 'O4')],
           names=['ligand', 'ligand anchor'])

# Convert a subset of fields
>>> with h5py.File(filename, "r") as f:
...     dset = f["ligand"]["index"]
...     index_to_pandas(dset, fields=["ligand"])
MultiIndex([('O=C=O',),
            ('O=C=O',),
            ( 'CCCO',)],
           names=['ligand'])
Parameters:
  • dset (h5py.Dataset) – The relevant index dataset.
  • fields (Sequence[str], optional) – The names of the index fields that are to be included in the returned MultiIndex. If None, include all fields.
Returns:

A multi-index constructed from the passed dataset.

Return type:

pandas.MultiIndex
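
The conversion can be approximated without an HDF5 file by treating the index dataset as a NumPy structured array. The function structured_to_multiindex() below is a hypothetical stand-in, not the dataCAT implementation, and the field widths ("S10", "S2") are assumptions.

```python
import numpy as np
import pandas as pd

# Stand-in for a compound-dtype index dataset with two fields.
index = np.array(
    [(b"O=C=O", b"O1"), (b"O=C=O", b"O3"), (b"CCCO", b"O4")],
    dtype=[("ligand", "S10"), ("ligand anchor", "S2")],
)


def structured_to_multiindex(array, fields=None):
    """Hypothetical helper: build a MultiIndex from a structured array."""
    names = list(array.dtype.names) if fields is None else list(fields)
    columns = []
    for name in names:
        column = array[name]
        if column.dtype.kind == "S":
            # Decode fixed-width byte strings into regular strings.
            column = column.astype(str)
        columns.append(column)
    return pd.MultiIndex.from_arrays(columns, names=names)


full = structured_to_multiindex(index)
subset = structured_to_multiindex(index, fields=["ligand"])
```

Passing fields selects a subset of the compound fields, mirroring the second example above.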

dataCAT.prop_to_dataframe(dset, dtype=None)

Convert the passed property Dataset into a DataFrame.

Examples

>>> import h5py
>>> from dataCAT import prop_to_dataframe

>>> hdf5_file = str(...)  

>>> with h5py.File(hdf5_file, 'r') as f:
...     dset = f['ligand/properties/E_solv']
...     df = prop_to_dataframe(dset)
...     print(df)  
E_solv_names             water  methanol   ethanol
ligand ligand anchor
O=C=O  O1            -0.918837 -0.151129 -0.177396
       O3            -0.221182 -0.261591 -0.712906
CCCO   O4            -0.314799 -0.784353 -0.190898
Parameters:
  • dset (h5py.Dataset) – The property-containing Dataset of interest.
  • dtype (dtype-like, optional) – The data type of the to-be returned DataFrame. Use None to default to the data type of dset.
Returns:

A DataFrame constructed from the passed dset.

Return type:

pandas.DataFrame
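
The layout of the returned DataFrame can be reproduced by hand from three pieces: the values, the row index and the column names. The sketch below uses pandas only and zero-filled placeholder values; it is an illustration of the resulting structure, not the dataCAT implementation.

```python
import numpy as np
import pandas as pd

# Placeholder values; a real dataset would hold the stored properties.
values = np.zeros((3, 3), dtype="f4")

# Row labels, as produced from the index dataset.
rows = pd.MultiIndex.from_tuples(
    [("O=C=O", "O1"), ("O=C=O", "O3"), ("CCCO", "O4")],
    names=["ligand", "ligand anchor"],
)

# Column labels, named after the dataset that stores the property names.
columns = pd.Index(["water", "methanol", "ethanol"], name="E_solv_names")

df = pd.DataFrame(values, index=rows, columns=columns)

# Analogous to passing an explicit dtype: cast the values to another type.
df64 = df.astype("f8")
```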