The PDBContainer Class

A module for constructing array-representations of .pdb files.

Index

`PDBContainer`(atoms, bonds, atom_count, ...)	An (immutable) class for holding array-like representations of a set of .pdb files.
`PDBContainer.atoms`	Get a read-only padded recarray for keeping track of all atom-related information.
`PDBContainer.bonds`	Get a read-only padded recarray for keeping track of all bond-related information.
`PDBContainer.atom_count`	Get a read-only ndarray for keeping track of the number of atoms in each molecule in `atoms`.
`PDBContainer.bond_count`	Get a read-only ndarray for keeping track of the number of bonds in each molecule in `bonds`.
`PDBContainer.scale`	Get a recarray representing an index.

`PDBContainer.__init__`(atoms, bonds, ...[, ...])	Initialize an instance.
`PDBContainer.__getitem__`(index)	Implement `self[index]`.
`PDBContainer.__len__`()	Implement `len(self)`.
`PDBContainer.keys`()	Yield the (public) attribute names in this class.
`PDBContainer.values`()	Yield the (public) attributes in this instance.
`PDBContainer.items`()	Yield the (public) attribute name/value pairs in this instance.
`PDBContainer.concatenate`(*args)	Concatenate \(n\) PDBContainers into a single new instance.

`PDBContainer.from_molecules`(mol_list[, ...])	Convert an iterable or sequence of molecules into a new `PDBContainer` instance.
`PDBContainer.to_molecules`([index, mol])	Create a molecule or list of molecules from this instance.
`PDBContainer.to_rdkit`([index, sanitize])	Create an rdkit molecule or list of rdkit molecules from this instance.
`PDBContainer.create_hdf5_group`(file, name, *)	Create a h5py Group for storing `dataCAT.PDBContainer` instances.
`PDBContainer.validate_hdf5`(group)	Validate the passed hdf5 group, ensuring it is compatible with `PDBContainer` instances.
`PDBContainer.from_hdf5`(group[, index])	Construct a new PDBContainer from the passed hdf5 group.
`PDBContainer.to_hdf5`(group, index[, ...])	Update all datasets in group positioned at index with their counterparts from this instance.

`PDBContainer.intersection`(value)	Construct a new PDBContainer by the intersection of self and value.
`PDBContainer.difference`(value)	Construct a new PDBContainer by the difference of self and value.
`PDBContainer.symmetric_difference`(value)	Construct a new PDBContainer by the symmetric difference of self and value.
`PDBContainer.union`(value)	Construct a new PDBContainer by the union of self and value.

API

class dataCAT.PDBContainer(atoms, bonds, atom_count, bond_count, scale=None, validate=True, copy=True, index_dtype=None)

An (immutable) class for holding array-like representations of a set of .pdb files.

The PDBContainer class serves as an (intermediate) container for storing .pdb files in the hdf5 format, thus facilitating the storage and interconversion between PLAMS molecules and the h5py interface.

The methods implemented in this class can roughly be divided into three categories:

Molecule-interconversion: to_molecules(), from_molecules() & to_rdkit().
hdf5-interconversion: create_hdf5_group(), validate_hdf5(), to_hdf5() & from_hdf5().
Miscellaneous: keys(), values(), items(), __getitem__() & __len__().

Examples

>>> import h5py
>>> from scm.plams import readpdb
>>> from dataCAT import PDBContainer

>>> mol_list = [readpdb(...), ...]  
>>> pdb = PDBContainer.from_molecules(mol_list)
>>> print(pdb)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)

>>> hdf5_file = str(...)  
>>> with h5py.File(hdf5_file, 'a') as f:
...     group = pdb.create_hdf5_group(f, name='ligand')
...     pdb.to_hdf5(group, None)
...
...     print('group', '=', group)
...     for name, dset in group.items():
...         print(f'group[{name!r}]', '=', dset)
group = <HDF5 group "/ligand" (5 members)>
group['atoms'] = <HDF5 dataset "atoms": shape (23, 76), type "|V46">
group['bonds'] = <HDF5 dataset "bonds": shape (23, 75), type "|V9">
group['atom_count'] = <HDF5 dataset "atom_count": shape (23,), type "<i4">
group['bond_count'] = <HDF5 dataset "bond_count": shape (23,), type "<i4">
group['index'] = <HDF5 dataset "index": shape (23,), type "<i4">

property atoms

Get a read-only padded recarray for keeping track of all atom-related information.

See dataCAT.dtype.ATOMS_DTYPE for a comprehensive overview of all field names and dtypes.

Type: numpy.recarray, shape \((n, m)\)

property bonds

Get a read-only padded recarray for keeping track of all bond-related information.

Note that all atomic indices are 1-based.

See dataCAT.dtype.BONDS_DTYPE for a comprehensive overview of all field names and dtypes.

Type: numpy.recarray, shape \((n, k)\)
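
Because the atomic indices stored in bonds are 1-based, they have to be shifted by one before they can be used for indexing a NumPy array. The sketch below illustrates this with a small hand-written array. The field names atom1, atom2 and order, as well as the xyz coordinate array, are assumptions made for this illustration; see dataCAT.dtype.BONDS_DTYPE for the actual fields.

```python
import numpy as np

# Assumed, simplified bond dtype (illustrative; not taken from dataCAT itself)
bond_dtype = np.dtype([("atom1", "i4"), ("atom2", "i4"), ("order", "i1")])

# Two bonds of a three-atom molecule, stored with 1-based atomic indices
bonds = np.array([(1, 2, 1), (1, 3, 1)], dtype=bond_dtype)

# Hypothetical Cartesian coordinates of the three atoms
xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])

# Shift to 0-based indices before indexing the coordinate array
i = bonds["atom1"] - 1
j = bonds["atom2"] - 1
bond_lengths = np.linalg.norm(xyz[i] - xyz[j], axis=1)
```

Forgetting the shift would silently pair each bond with the wrong atoms, or raise an IndexError for the last atom.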

property atom_count

Get a read-only ndarray for keeping track of the number of atoms in each molecule in atoms.

Type: numpy.ndarray[int32], shape \((n,)\)

property bond_count

Get a read-only ndarray for keeping track of the number of bonds in each molecule in bonds.

Type: numpy.ndarray[int32], shape \((n,)\)
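
The relationship between a padded array and its count array can be sketched as follows. This is a minimal illustration, not the library's own construction code: the dtype below is a simplified stand-in for dataCAT.dtype.ATOMS_DTYPE and the molecules are hand-written.

```python
import numpy as np

# Hypothetical, simplified atom dtype (the real fields are in ATOMS_DTYPE)
atom_dtype = np.dtype([("symbol", "U2"), ("x", "f4"), ("y", "f4"), ("z", "f4")])

# Two molecules of unequal size
molecules = [
    [("O", 0.0, 0.0, 0.0), ("H", 0.96, 0.0, 0.0), ("H", -0.24, 0.93, 0.0)],
    [("H", 0.0, 0.0, 0.0), ("H", 0.74, 0.0, 0.0)],
]

# The number of atoms per molecule, shape (n,)
atom_count = np.array([len(m) for m in molecules], dtype=np.int32)

# A padded array of shape (n, m), where m is the size of the largest molecule
atoms = np.zeros((len(molecules), atom_count.max()), dtype=atom_dtype).view(np.recarray)
for k, mol in enumerate(molecules):
    atoms[k, :len(mol)] = np.array(mol, dtype=atom_dtype)

# The count array is what allows the padding to be stripped again
unpadded = [atoms[k, :n] for k, n in enumerate(atom_count)]
```

The same pattern applies to bonds and bond_count, with \(k\) rather than \(m\) as the padded width.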

property scale

Get a recarray representing an index.

Used as dimensional scale in the h5py Group.

Type: numpy.recarray, shape \((n,)\)
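
All array properties above are described as read-only. In plain NumPy terms this corresponds to an array whose writeable flag has been cleared, as in the sketch below; whether PDBContainer sets the flag in exactly this way is an assumption.

```python
import numpy as np

counts = np.array([3, 2], dtype=np.int32)
counts.setflags(write=False)  # mark the array as read-only

# Any attempt at in-place modification now raises a ValueError
try:
    counts[0] = 10
except ValueError as ex:
    error = ex
else:
    error = None

# Modifications require an explicit (writeable) copy
writeable = counts.copy()
writeable[0] = 10
```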

__init__(atoms, bonds, atom_count, bond_count, scale=None, validate=True, copy=True, index_dtype=None)

Initialize an instance.

Parameters

atoms (numpy.recarray, shape \((n, m)\)) – A padded recarray for keeping track of all atom-related information. See PDBContainer.atoms.
bonds (numpy.recarray, shape \((n, k)\)) – A padded recarray for keeping track of all bond-related information. See PDBContainer.bonds.
atom_count (numpy.ndarray[int32], shape \((n,)\)) – An ndarray for keeping track of the number of atoms in each molecule in atoms. See PDBContainer.atom_count.
bond_count (numpy.ndarray[int32], shape \((n,)\)) – An ndarray for keeping track of the number of bonds in each molecule in bonds. See PDBContainer.bond_count.
scale (numpy.recarray, shape \((n,)\), optional) – A recarray representing an index. If None, use a simple numerical index (i.e. numpy.arange()). See PDBContainer.scale.

Keyword Arguments

validate (bool) – If True, perform more thorough validation of the input arrays. Note that this also allows the parameters to be passed as array-like objects in addition to the aforementioned ndarray or recarray instances.
copy (bool) – If True, store copies of the passed arrays. Only relevant if validate = True.

Return type

None

API: Miscellaneous Methods

PDBContainer.__getitem__(index)

Implement self[index].

Constructs a new PDBContainer instance by slicing all arrays with index. Follows the standard NumPy indexing rules: if an integer or slice is passed then a shallow copy is returned; otherwise a deep copy will be created.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)

>>> pdb[0]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(1, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(1, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(1,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(1,), dtype=int32),
    scale      = numpy.recarray(..., shape=(1,), dtype=...)
)

>>> pdb[:10]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(10, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(10, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(10,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(10,), dtype=int32),
    scale      = numpy.recarray(..., shape=(10,), dtype=...)
)

>>> pdb[[0, 5, 7, 9, 10]]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(5, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(5, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(5,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(5,), dtype=int32),
    scale      = numpy.recarray(..., shape=(5,), dtype=...)
)

Parameters: index (int, Sequence[int] or slice) – An object for slicing arrays along axis=0.
Returns: A shallow or deep copy of a slice of this instance.
Return type: dataCAT.PDBContainer
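
The shallow versus deep copy distinction described for `self[index]` mirrors the difference between views and copies in NumPy itself. The sketch below demonstrates that difference on a plain ndarray; it does not use PDBContainer and the array is purely illustrative.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

basic = arr[:2]         # slice: basic indexing, returns a view
advanced = arr[[0, 1]]  # list of integers: advanced indexing, returns a copy

is_view = np.shares_memory(arr, basic)
is_copy = not np.shares_memory(arr, advanced)

# Writing through the view alters the parent array; the copy is unaffected
basic[0, 0] = -1
```

Note that in the examples above `pdb[0]` keeps its leading axis (shape `(1, 76)`), which for a plain array corresponds to `arr[0:1]` rather than `arr[0]`.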

PDBContainer.__len__()

Implement len(self).

Returns: The length of the arrays embedded within this instance (which are all of the same length).
Return type: int

classmethod PDBContainer.keys()

Yield the (public) attribute names in this class.

Examples

>>> from dataCAT import PDBContainer

>>> for name in PDBContainer.keys():
...     print(name)
atoms
bonds
atom_count
bond_count
scale

Yields: str – The names of all attributes in this class.

PDBContainer.values()

Yield the (public) attributes in this instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> for value in pdb.values():
...     print(object.__repr__(value))  
<numpy.recarray object at ...>
<numpy.recarray object at ...>
<numpy.ndarray object at ...>
<numpy.ndarray object at ...>
<numpy.recarray object at ...>

Yields: numpy.ndarray / numpy.recarray – The values of all attributes in this instance.

PDBContainer.items()

Yield the (public) attribute name/value pairs in this instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> for name, value in pdb.items():
...     print(name, '=', object.__repr__(value))  
atoms = <numpy.recarray object at ...>
bonds = <numpy.recarray object at ...>
atom_count = <numpy.ndarray object at ...>
bond_count = <numpy.ndarray object at ...>
scale = <numpy.recarray object at ...>

Yields: str and numpy.ndarray / numpy.recarray – The names and values of all attributes in this instance.
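
The keys(), values() and items() trio follows the familiar dict-like pattern, with keys() being a classmethod as the names are a property of the class rather than of an instance. A minimal sketch of such a protocol is given below; MiniContainer and its two attributes are hypothetical and only serve to illustrate the pattern, not the actual implementation.

```python
from typing import Any, Iterator, Tuple


class MiniContainer:
    """A hypothetical stand-in with two public attributes."""

    _KEYS = ("atoms", "bonds")

    def __init__(self, atoms: Any, bonds: Any) -> None:
        self.atoms = atoms
        self.bonds = bonds

    @classmethod
    def keys(cls) -> Iterator[str]:
        # The names are known without requiring an instance
        return iter(cls._KEYS)

    def values(self) -> Iterator[Any]:
        return (getattr(self, k) for k in self.keys())

    def items(self) -> Iterator[Tuple[str, Any]]:
        return ((k, getattr(self, k)) for k in self.keys())


mini = MiniContainer(atoms=[1, 2], bonds=[3])
as_dict = dict(mini.items())
```

Since items() yields name/value pairs, passing it to dict() is a convenient way of collecting all arrays in a single mapping.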

PDBContainer.concatenate(*args)

Concatenate \(n\) PDBContainers into a single new instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb1 = PDBContainer(...)  
>>> pdb2 = PDBContainer(...)  
>>> pdb3 = PDBContainer(...)  
>>> print(len(pdb1), len(pdb2), len(pdb3))
23 23 23

>>> pdb_new = pdb1.concatenate(pdb2, pdb3)
>>> print(pdb_new)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(69, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(69, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(69,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(69,), dtype=int32),
    scale      = numpy.recarray(..., shape=(69,), dtype=...)
)

Parameters: *args (PDBContainer) – One or more PDBContainers.
Returns: A new PDBContainer constructed by concatenating self and args.
Return type: PDBContainer
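
When padded arrays of different widths are joined along the first axis, the narrower arrays first have to be padded to a common width. The sketch below shows one way of doing so with plain integer arrays; concat_padded is a hypothetical helper and no claim is made that concatenate() is implemented this way.

```python
import numpy as np


def concat_padded(*arrays: np.ndarray) -> np.ndarray:
    """Concatenate 2D arrays along axis 0, zero-padding axis 1 where needed."""
    width = max(a.shape[1] for a in arrays)
    padded = [np.pad(a, [(0, 0), (0, width - a.shape[1])]) for a in arrays]
    return np.concatenate(padded, axis=0)


a = np.ones((2, 3), dtype=np.int32)     # 2 molecules, at most 3 atoms
b = np.full((3, 4), 2, dtype=np.int32)  # 3 molecules, at most 4 atoms
merged = concat_padded(a, b)
```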

API: Object Interconversion

classmethod PDBContainer.from_molecules(mol_list, min_atom=0, min_bond=0, scale=None)

Convert an iterable or sequence of molecules into a new PDBContainer instance.

Examples

>>> from typing import List
>>> from dataCAT import PDBContainer
>>> from scm.plams import readpdb, Molecule

>>> mol_list: List[Molecule] = [readpdb(...), ...]  
>>> PDBContainer.from_molecules(mol_list)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)

Parameters

mol_list (Iterable[Molecule]) – An iterable consisting of PLAMS molecules.
min_atom (int) – The minimum number of atoms which PDBContainer.atoms should accommodate.
min_bond (int) – The minimum number of bonds which PDBContainer.bonds should accommodate.
scale (array-like, optional) – An array-like object representing a user-specified index. Defaults to a simple range index if None (see numpy.arange()).

Returns

A pdb container.

Return type

dataCAT.PDBContainer

PDBContainer.to_molecules(index=None, mol=None)

Create a molecule or list of molecules from this instance.

Examples

An example where one or more new molecules are created.

>>> from dataCAT import PDBContainer
>>> from scm.plams import Molecule

>>> pdb = PDBContainer(...)  

# Create a single new molecule from `pdb`
>>> pdb.to_molecules(index=0)  
<scm.plams.mol.molecule.Molecule object at ...>

# Create two new molecules from `pdb`
>>> pdb.to_molecules(index=[0, 1])  
[<scm.plams.mol.molecule.Molecule object at ...>,
 <scm.plams.mol.molecule.Molecule object at ...>]

An example where one or more existing molecules are updated in-place.

# Update `mol` with the info from `pdb`
>>> mol = Molecule(...)  # doctest: +SKIP
>>> mol_new = pdb.to_molecules(index=2, mol=mol)
>>> mol is mol_new
True

# Update all molecules in `mol_list` with info from `pdb`
>>> mol_list = [Molecule(...), Molecule(...), Molecule(...)]  # doctest: +SKIP
>>> mol_list_new = pdb.to_molecules(index=range(3), mol=mol_list)
>>> for m, m_new in zip(mol_list, mol_list_new):
...     print(m is m_new)
True
True
True

Parameters

index (int, Sequence[int] or slice, optional) – An object for slicing the arrays embedded within this instance. Follows the standard NumPy indexing rules (e.g. self.atoms[index]). If a scalar is provided (e.g. an integer) then a single molecule will be returned. If a sequence, range, slice, etc. is provided then a list of molecules will be returned.
mol (Molecule or Iterable[Molecule], optional) – A molecule or list of molecules. If one or more molecules are provided here then they will be updated in-place.

Returns

A molecule or list of molecules, depending on whether index is a scalar or a sequence / slice. Note that if mol is not None, then the to-be returned molecules won’t be copies.

Return type

Molecule or List[Molecule]

PDBContainer.to_rdkit(index=None, sanitize=True)

Create an rdkit molecule or list of rdkit molecules from this instance.

Examples

An example where one or more new molecules are created.

>>> from dataCAT import PDBContainer
>>> from rdkit.Chem import Mol

>>> pdb = PDBContainer(...)  

# Create a single new molecule from `pdb`
>>> pdb.to_rdkit(index=0)  
<rdkit.Chem.rdchem.Mol object at ...>

# Create two new molecules from `pdb`
>>> pdb.to_rdkit(index=[0, 1])  
[<rdkit.Chem.rdchem.Mol object at ...>,
 <rdkit.Chem.rdchem.Mol object at ...>]

Parameters

index (int, Sequence[int] or slice, optional) – An object for slicing the arrays embedded within this instance. Follows the standard NumPy indexing rules (e.g. self.atoms[index]). If a scalar is provided (e.g. an integer) then a single molecule will be returned. If a sequence, range, slice, etc. is provided then a list of molecules will be returned.
sanitize (bool) – Whether to sanitize the molecule before returning or not.

Returns

A molecule or list of molecules, depending on whether or not index is a scalar or sequence / slice.

Return type

Mol or list[Mol]

classmethod PDBContainer.create_hdf5_group(file, name, *, scale=None, scale_dtype=None, **kwargs)

Create a h5py Group for storing dataCAT.PDBContainer instances.

Notes

The scale and scale_dtype parameters are mutually exclusive.

Parameters

file (h5py.File or h5py.Group) – The h5py File or Group where the new Group will be created.
name (str) – The name of the to-be created Group.

Keyword Arguments

scale (h5py.Dataset, keyword-only) – A pre-existing dataset serving as dimensional scale. See scale_dtype to create a new one instead.
scale_dtype (dtype-like, keyword-only) – The datatype of the to-be created dimensional scale. See scale to use a pre-existing dataset for this purpose.
**kwargs (Any) – Further keyword arguments for the creation of each dataset. Arguments already specified by default are: name, shape, maxshape and dtype.

Returns

The newly created Group.

Return type

h5py.Group

classmethod PDBContainer.validate_hdf5(group)

Validate the passed hdf5 group, ensuring it is compatible with PDBContainer instances.

An AssertionError will be raised if group does not validate.

This method is called automatically when an exception is raised by to_hdf5() or from_hdf5().

Parameters: group (h5py.Group) – The to-be validated hdf5 Group.
Raises: AssertionError – Raised if the validation process fails.

classmethod PDBContainer.from_hdf5(group, index=None)

Construct a new PDBContainer from the passed hdf5 group.

Parameters

group (h5py.Group) – The to-be read h5py group.
index (int, Sequence[int] or slice, optional) – An object for slicing all datasets in group.

Returns

A new PDBContainer constructed from group.

Return type

dataCAT.PDBContainer

PDBContainer.to_hdf5(group, index, update_scale=True)

Update all datasets in group positioned at index with their counterparts from this instance.

Follows the standard broadcasting rules as employed by h5py.

Important

If index is passed as a sequence of integers then, contrary to NumPy, they will have to be sorted.

Parameters

group (h5py.Group) – The to-be updated h5py group.
index (int, Sequence[int] or slice) – An object for slicing all datasets in group. Note that, contrary to NumPy, if a sequence of integers is provided then they’ll have to be sorted.
update_scale (bool) – If True, also export PDBContainer.scale to the dimensional scale in the passed group.
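
When the target indices are not in increasing order, the indices and the data can be reordered together before they are written. The sketch below shows the reordering with numpy.argsort(); a plain ndarray is used as a stand-in for the h5py dataset, and all names and values are illustrative assumptions.

```python
import numpy as np

# Unsorted target positions and the values destined for them
index = np.array([7, 2, 5])
values = np.array([70, 20, 50])

# Sort the indices and apply the same permutation to the data
order = np.argsort(index)
sorted_index = index[order]
sorted_values = values[order]

# Stand-in for a dataset; h5py expects index lists in increasing order
target = np.zeros(10, dtype=np.int64)
target[sorted_index] = sorted_values
```

Under the same assumptions, the equivalent for a container would be along the lines of `pdb[order].to_hdf5(group, index[order])`, relying on `self[index]` accepting a sequence of integers.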

API: Set Operations

PDBContainer.intersection(value)

Construct a new PDBContainer by the intersection of self and value.

Examples

An example where the scale is intersected with a range of indices.

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.intersection(range(4))
>>> print(pdb_new.scale)
[0 1 2 3]

Parameters: value (PDBContainer or array-like) – Another PDBContainer or an array-like object representing PDBContainer.scale. Note that both value and self.scale should consist of unique elements.
Returns: A new instance by intersecting self.scale and value.
Return type: PDBContainer

See also

set.intersection: Return the intersection of two sets as a new set.

PDBContainer.difference(value)

Construct a new PDBContainer by the difference of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.difference(range(10, 30))
>>> print(pdb_new.scale)
[0 1 2 3 4 5 6 7 8 9]

Parameters: value (PDBContainer or array-like) – Another PDBContainer or an array-like object representing PDBContainer.scale. Note that both value and self.scale should consist of unique elements.
Returns: A new instance as the difference of self.scale and value.
Return type: PDBContainer

See also

set.difference: Return the difference of two or more sets as a new set.

PDBContainer.symmetric_difference(value)

Construct a new PDBContainer by the symmetric difference of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> pdb2 = PDBContainer(..., scale=range(10, 30))  

>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.symmetric_difference(pdb2)
>>> print(pdb_new.scale)
[ 0  1  2  3  4  5  6  7  8  9 23 24 25 26 27 28 29]

Parameters: value (PDBContainer) – Another PDBContainer. Note that both value.scale and self.scale should consist of unique elements.
Returns: A new instance as the symmetric difference of self.scale and value.
Return type: PDBContainer

See also

set.symmetric_difference: Return the symmetric difference of two sets as a new set.

PDBContainer.union(value)

Construct a new PDBContainer by the union of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> pdb2 = PDBContainer(..., scale=range(10, 30))  

>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.union(pdb2)
>>> print(pdb_new.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]

Parameters: value (PDBContainer) – Another PDBContainer. Note that both value.scale and self.scale should consist of unique elements.
Returns: A new instance as the union of self.scale and value.
Return type: PDBContainer

See also

set.union: Return the union of sets as a new set.
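
The scale values printed in the four set-operation examples can be reproduced with NumPy's own set routines, as sketched below. This only illustrates the expected semantics on plain integer arrays; it is not a statement about how the methods are implemented internally.

```python
import numpy as np

scale = np.arange(23)      # the scale of `pdb` in the examples above
other = np.arange(10, 30)  # the scale of `pdb2`, i.e. range(10, 30)

intersection = np.intersect1d(scale, np.arange(4))
difference = np.setdiff1d(scale, other)
symmetric_difference = np.setxor1d(scale, other)
union = np.union1d(scale, other)

# A boolean mask such as this one could be used for selecting the matching rows
keep = np.isin(scale, np.arange(4))
```

All four routines return sorted, unique values, which is consistent with the requirement that the scales consist of unique elements.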