The PDBContainer Class

A module for constructing array-representations of .pdb files.

Index

PDBContainer(atoms, bonds, atom_count, ...)

An (immutable) class for holding array-like representations of a set of .pdb files.

PDBContainer.atoms

Get a read-only padded recarray for keeping track of all atom-related information.

PDBContainer.bonds

Get a read-only padded recarray for keeping track of all bond-related information.

PDBContainer.atom_count

Get a read-only ndarray for keeping track of the number of atoms in each molecule in atoms.

PDBContainer.bond_count

Get a read-only ndarray for keeping track of the number of bonds in each molecule in bonds.

PDBContainer.scale

Get a recarray representing an index.

PDBContainer.__init__(atoms, bonds, ...[, ...])

Initialize an instance.

PDBContainer.__getitem__(index)

Implement self[index].

PDBContainer.__len__()

Implement len(self).

PDBContainer.keys()

Yield the (public) attribute names in this class.

PDBContainer.values()

Yield the (public) attributes in this instance.

PDBContainer.items()

Yield the (public) attribute name/value pairs in this instance.

PDBContainer.concatenate(*args)

Concatenate \(n\) PDBContainers into a single new instance.

PDBContainer.from_molecules(mol_list[, ...])

Convert an iterable or sequence of molecules into a new PDBContainer instance.

PDBContainer.to_molecules([index, mol])

Create a molecule or list of molecules from this instance.

PDBContainer.to_rdkit([index, sanitize])

Create an rdkit molecule or list of rdkit molecules from this instance.

PDBContainer.create_hdf5_group(file, name, *)

Create a h5py Group for storing dataCAT.PDBContainer instances.

PDBContainer.validate_hdf5(group)

Validate the passed hdf5 group, ensuring it is compatible with PDBContainer instances.

PDBContainer.from_hdf5(group[, index])

Construct a new PDBContainer from the passed hdf5 group.

PDBContainer.to_hdf5(group, index[, ...])

Update all datasets in group positioned at index with their counterparts from this instance.

PDBContainer.intersection(value)

Construct a new PDBContainer by the intersection of self and value.

PDBContainer.difference(value)

Construct a new PDBContainer by the difference of self and value.

PDBContainer.symmetric_difference(value)

Construct a new PDBContainer by the symmetric difference of self and value.

PDBContainer.union(value)

Construct a new PDBContainer by the union of self and value.

API

class dataCAT.PDBContainer(atoms, bonds, atom_count, bond_count, scale=None, validate=True, copy=True, index_dtype=None)[source]

An (immutable) class for holding array-like representations of a set of .pdb files.

The PDBContainer class serves as an (intermediate) container for storing .pdb files in the hdf5 format, thus facilitating the storage and interconversion between PLAMS molecules and the h5py interface.

The methods implemented in this class can roughly be divided into three categories: miscellaneous methods, object interconversion and set operations.

Examples

>>> import h5py
>>> from scm.plams import readpdb
>>> from dataCAT import PDBContainer

>>> mol_list = [readpdb(...), ...]
>>> pdb = PDBContainer.from_molecules(mol_list)
>>> print(pdb)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)

>>> hdf5_file = str(...)  
>>> with h5py.File(hdf5_file, 'a') as f:
...     group = pdb.create_hdf5_group(f, name='ligand')
...     pdb.to_hdf5(group, None)
...
...     print('group', '=', group)
...     for name, dset in group.items():
...         print(f'group[{name!r}]', '=', dset)
group = <HDF5 group "/ligand" (5 members)>
group['atoms'] = <HDF5 dataset "atoms": shape (23, 76), type "|V46">
group['bonds'] = <HDF5 dataset "bonds": shape (23, 75), type "|V9">
group['atom_count'] = <HDF5 dataset "atom_count": shape (23,), type "<i4">
group['bond_count'] = <HDF5 dataset "bond_count": shape (23,), type "<i4">
group['index'] = <HDF5 dataset "index": shape (23,), type "<i4">
property atoms

Get a read-only padded recarray for keeping track of all atom-related information.

See dataCAT.dtype.ATOMS_DTYPE for a comprehensive overview of all field names and dtypes.

Type

numpy.recarray, shape \((n, m)\)

property bonds

Get a read-only padded recarray for keeping track of all bond-related information.

Note that all atomic indices are 1-based.

See dataCAT.dtype.BONDS_DTYPE for a comprehensive overview of all field names and dtypes.

Type

numpy.recarray, shape \((n, k)\)
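Because the atomic indices are 1-based, they have to be shifted by one before they can be used for indexing a (0-based) NumPy array. The snippet below is a minimal sketch of that conversion. The dtypes and the field names (symbol, atom1, atom2) are simplified assumptions made for illustration; consult dataCAT.dtype.ATOMS_DTYPE and dataCAT.dtype.BONDS_DTYPE for the actual fields.

```python
import numpy as np

# Hypothetical, simplified dtypes; the real ones are defined in
# dataCAT.dtype.ATOMS_DTYPE and dataCAT.dtype.BONDS_DTYPE.
atoms_dtype = np.dtype([("symbol", "S2")])
bonds_dtype = np.dtype([("atom1", "i4"), ("atom2", "i4")])

# A single water molecule: two O-H bonds stored with 1-based indices.
atoms = np.array([(b"O",), (b"H",), (b"H",)], dtype=atoms_dtype)
bonds = np.array([(1, 2), (1, 3)], dtype=bonds_dtype)

# Subtract 1 to obtain 0-based indices suitable for NumPy indexing.
idx1 = bonds["atom1"] - 1
idx2 = bonds["atom2"] - 1

bonded_pairs = [
    (atoms["symbol"][i].decode(), atoms["symbol"][j].decode())
    for i, j in zip(idx1, idx2)
]
print(bonded_pairs)
```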

property atom_count

Get a read-only ndarray for keeping track of the number of atoms in each molecule in atoms.

Type

numpy.ndarray[int32], shape \((n,)\)
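The relation between the padded atoms array and atom_count can be illustrated with plain NumPy. This is only a sketch: the dtype below is a hypothetical, simplified stand-in for dataCAT.dtype.ATOMS_DTYPE, and the padding values are assumed to be zero-like.

```python
import numpy as np

# Hypothetical, simplified dtype (see dataCAT.dtype.ATOMS_DTYPE for the real one).
dtype = np.dtype([("symbol", "S2"), ("x", "f8")])

# Two molecules with 3 and 1 atoms, padded to a common width of m = 3.
atoms = np.array(
    [
        [(b"O", 0.0), (b"H", 0.8), (b"H", -0.8)],
        [(b"He", 0.0), (b"", 0.0), (b"", 0.0)],
    ],
    dtype=dtype,
).view(np.recarray)
atom_count = np.array([3, 1], dtype=np.int32)

# Strip the padding of each molecule by slicing up to its atom count.
unpadded = [row[:n] for row, n in zip(atoms, atom_count)]
print(atoms.shape)
print([len(mol) for mol in unpadded])
```

Storing the counts separately is what allows molecules of different sizes to share a single rectangular array.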

property bond_count

Get a read-only ndarray for keeping track of the number of bonds in each molecule in bonds.

Type

numpy.ndarray[int32], shape \((n,)\)

property scale

Get a recarray representing an index.

Used as dimensional scale in the h5py Group.

Type

numpy.recarray, shape \((n,)\)

__init__(atoms, bonds, atom_count, bond_count, scale=None, validate=True, copy=True, index_dtype=None)[source]

Initialize an instance.

Parameters
Keyword Arguments
  • validate (bool) – If True, perform more thorough validation of the input arrays. Note that this also allows the parameters to be passed as array-like objects in addition to the aforementioned ndarray or recarray instances.

  • copy (bool) – If True, set the passed arrays as copies. Only relevant if validate=True.

Return type

None

API: Miscellaneous Methods

PDBContainer.__getitem__(index)[source]

Implement self[index].

Constructs a new PDBContainer instance by slicing all arrays with index. Follows the standard NumPy indexing rules: if an integer or slice is passed then a shallow copy is returned; otherwise a deep copy will be created.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)

>>> pdb[0]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(1, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(1, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(1,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(1,), dtype=int32),
    scale      = numpy.recarray(..., shape=(1,), dtype=...)
)

>>> pdb[:10]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(10, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(10, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(10,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(10,), dtype=int32),
    scale      = numpy.recarray(..., shape=(10,), dtype=...)
)

>>> pdb[[0, 5, 7, 9, 10]]
PDBContainer(
    atoms      = numpy.recarray(..., shape=(5, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(5, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(5,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(5,), dtype=int32),
    scale      = numpy.recarray(..., shape=(5,), dtype=...)
)
Parameters

index (int, Sequence[int] or slice) – An object for slicing arrays along axis=0.

Returns

A shallow or deep copy of a slice of this instance.

Return type

dataCAT.PDBContainer
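The distinction between a shallow and a deep copy mirrors the difference between basic and advanced indexing in NumPy. A minimal sketch using a plain ndarray (not a PDBContainer) is given below. Note that plain NumPy drops the first axis when indexing with an integer, whereas pdb[0] keeps an axis of length 1; the sketch therefore uses a slice for the shallow case.

```python
import numpy as np

arr = np.arange(12).reshape(4, 3)

# Basic indexing (a slice) returns a view that shares memory with `arr`.
view = arr[:2]
is_shallow = np.shares_memory(arr, view)

# Advanced indexing (a sequence of integers) returns an independent copy.
copy = arr[[0, 1]]
is_deep = not np.shares_memory(arr, copy)

print(is_shallow, is_deep)
```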

PDBContainer.__len__()[source]

Implement len(self).

Returns

Returns the length of the arrays embedded within this instance (which are all of the same length).

Return type

int

classmethod PDBContainer.keys()[source]

Yield the (public) attribute names in this class.

Examples

>>> from dataCAT import PDBContainer

>>> for name in PDBContainer.keys():
...     print(name)
atoms
bonds
atom_count
bond_count
scale
Yields

str – The names of all attributes in this class.

PDBContainer.values()[source]

Yield the (public) attributes in this instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> for value in pdb.values():
...     print(object.__repr__(value))  
<numpy.recarray object at ...>
<numpy.recarray object at ...>
<numpy.ndarray object at ...>
<numpy.ndarray object at ...>
<numpy.recarray object at ...>
Yields

numpy.ndarray / numpy.recarray – The values of all attributes in this instance.

PDBContainer.items()[source]

Yield the (public) attribute name/value pairs in this instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> for name, value in pdb.items():
...     print(name, '=', object.__repr__(value))  
atoms = <numpy.recarray object at ...>
bonds = <numpy.recarray object at ...>
atom_count = <numpy.ndarray object at ...>
bond_count = <numpy.ndarray object at ...>
scale = <numpy.recarray object at ...>
Yields

str and numpy.ndarray / numpy.recarray – The names and values of all attributes in this instance.

PDBContainer.concatenate(*args)[source]

Concatenate \(n\) PDBContainers into a single new instance.

Examples

>>> from dataCAT import PDBContainer

>>> pdb1 = PDBContainer(...)  
>>> pdb2 = PDBContainer(...)  
>>> pdb3 = PDBContainer(...)  
>>> print(len(pdb1), len(pdb2), len(pdb3))
23 23 23

>>> pdb_new = pdb1.concatenate(pdb2, pdb3)
>>> print(pdb_new)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(69, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(69, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(69,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(69,), dtype=int32),
    scale      = numpy.recarray(..., shape=(69,), dtype=...)
)
Parameters

*args (PDBContainer) – One or more PDBContainers.

Returns

A new PDBContainer constructed by concatenating self and args.

Return type

PDBContainer

API: Object Interconversion

classmethod PDBContainer.from_molecules(mol_list, min_atom=0, min_bond=0, scale=None)[source]

Convert an iterable or sequence of molecules into a new PDBContainer instance.

Examples

>>> from typing import List
>>> from dataCAT import PDBContainer
>>> from scm.plams import readpdb, Molecule

>>> mol_list: List[Molecule] = [readpdb(...), ...]  
>>> PDBContainer.from_molecules(mol_list)
PDBContainer(
    atoms      = numpy.recarray(..., shape=(23, 76), dtype=...),
    bonds      = numpy.recarray(..., shape=(23, 75), dtype=...),
    atom_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    bond_count = numpy.ndarray(..., shape=(23,), dtype=int32),
    scale      = numpy.recarray(..., shape=(23,), dtype=...)
)
Parameters
  • mol_list (Iterable[Molecule]) – An iterable consisting of PLAMS molecules.

  • min_atom (int) – The minimum number of atoms which PDBContainer.atoms should accommodate.

  • min_bond (int) – The minimum number of bonds which PDBContainer.bonds should accommodate.

  • scale (array-like, optional) – An array-like object representing a user-specified index. Defaults to a simple range index if None (see numpy.arange()).

Returns

A pdb container.

Return type

dataCAT.PDBContainer
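One plausible reading of min_atom and min_bond is that they set a lower bound on the padded width of the respective array. The sketch below illustrates that reading together with the default range index. The helper padded_width() is hypothetical and its logic is an assumption, not the actual dataCAT implementation.

```python
import numpy as np

# Hypothetical atom counts of three molecules.
atom_count = np.array([3, 5, 2], dtype=np.int32)


def padded_width(counts, minimum=0):
    """Return the axis-1 length needed to hold every molecule (assumed logic)."""
    return max(int(minimum), int(counts.max()))


width_default = padded_width(atom_count)
width_min10 = padded_width(atom_count, minimum=10)
print(width_default, width_min10)

# The default index when `scale` is None: a simple range index.
scale = np.arange(len(atom_count))
print(scale)
```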

PDBContainer.to_molecules(index=None, mol=None)[source]

Create a molecule or list of molecules from this instance.

Examples

An example where one or more new molecules are created.

>>> from dataCAT import PDBContainer
>>> from scm.plams import Molecule

>>> pdb = PDBContainer(...)  

# Create a single new molecule from `pdb`
>>> pdb.to_molecules(index=0)  
<scm.plams.mol.molecule.Molecule object at ...>

# Create two new molecules from `pdb`
>>> pdb.to_molecules(index=[0, 1])  
[<scm.plams.mol.molecule.Molecule object at ...>,
 <scm.plams.mol.molecule.Molecule object at ...>]

An example where one or more existing molecules are updated in-place.

# Update `mol` with the info from `pdb`
>>> mol = Molecule(...)  # doctest: +SKIP
>>> mol_new = pdb.to_molecules(index=2, mol=mol)
>>> mol is mol_new
True

# Update all molecules in `mol_list` with info from `pdb`
>>> mol_list = [Molecule(...), Molecule(...), Molecule(...)]  # doctest: +SKIP
>>> mol_list_new = pdb.to_molecules(index=range(3), mol=mol_list)
>>> for m, m_new in zip(mol_list, mol_list_new):
...     print(m is m_new)
True
True
True
Parameters
  • index (int, Sequence[int] or slice, optional) – An object for slicing the arrays embedded within this instance. Follows the standard NumPy indexing rules (e.g. self.atoms[index]). If a scalar is provided (e.g. an integer) then a single molecule will be returned. If a sequence, range, slice, etc. is provided then a list of molecules will be returned.

  • mol (Molecule or Iterable[Molecule], optional) – A molecule or list of molecules. If one or more molecules are provided here then they will be updated in-place.

Returns

A molecule or list of molecules, depending on whether index is a scalar or a sequence / slice. Note that if mol is not None, then the to-be returned molecules won’t be copies.

Return type

Molecule or List[Molecule]

PDBContainer.to_rdkit(index=None, sanitize=True)[source]

Create an rdkit molecule or list of rdkit molecules from this instance.

Examples

An example where one or more new molecules are created.

>>> from dataCAT import PDBContainer
>>> from rdkit.Chem import Mol

>>> pdb = PDBContainer(...)  

# Create a single new molecule from `pdb`
>>> pdb.to_rdkit(index=0)  
<rdkit.Chem.rdchem.Mol object at ...>

# Create two new molecules from `pdb`
>>> pdb.to_rdkit(index=[0, 1])  
[<rdkit.Chem.rdchem.Mol object at ...>,
 <rdkit.Chem.rdchem.Mol object at ...>]
Parameters
  • index (int, Sequence[int] or slice, optional) – An object for slicing the arrays embedded within this instance. Follows the standard NumPy indexing rules (e.g. self.atoms[index]). If a scalar is provided (e.g. an integer) then a single molecule will be returned. If a sequence, range, slice, etc. is provided then a list of molecules will be returned.

  • sanitize (bool) – Whether to sanitize the molecule before returning or not.

Returns

A molecule or list of molecules, depending on whether or not index is a scalar or sequence / slice.

Return type

Mol or list[Mol]

classmethod PDBContainer.create_hdf5_group(file, name, *, scale=None, scale_dtype=None, **kwargs)[source]

Create a h5py Group for storing dataCAT.PDBContainer instances.

Notes

The scale and scale_dtype parameters are mutually exclusive.

Parameters
  • file (h5py.File or h5py.Group) – The h5py File or Group where the new Group will be created.

  • name (str) – The name of the to-be created Group.

Keyword Arguments
  • scale (h5py.Dataset, keyword-only) – A pre-existing dataset serving as dimensional scale. See scale_dtype to create a new one instead.

  • scale_dtype (dtype-like, keyword-only) – The datatype of the to-be created dimensional scale. See scale to use a pre-existing dataset for this purpose.

  • **kwargs (Any) – Further keyword arguments for the creation of each dataset. Arguments already specified by default are: name, shape, maxshape and dtype.

Returns

The newly created Group.

Return type

h5py.Group
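The mutual exclusivity of scale and scale_dtype can be pictured with a small keyword-only helper. This is a sketch only: resolve_scale() is a hypothetical function, and the exception type raised by the actual method is an assumption.

```python
def resolve_scale(*, scale=None, scale_dtype=None):
    """Hypothetical helper: decide how the dimensional scale is obtained."""
    if scale is not None and scale_dtype is not None:
        # Assumed behavior; the real method may raise a different exception.
        raise TypeError("'scale' and 'scale_dtype' are mutually exclusive")
    if scale is not None:
        return ("reuse", scale)
    return ("create", scale_dtype)


# Either keyword may be passed on its own.
print(resolve_scale(scale_dtype="int64"))
print(resolve_scale(scale=[0, 1, 2]))
```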

classmethod PDBContainer.validate_hdf5(group)[source]

Validate the passed hdf5 group, ensuring it is compatible with PDBContainer instances.

An AssertionError will be raised if group does not validate.

This method is called automatically when an exception is raised by to_hdf5() or from_hdf5().

Parameters

group (h5py.Group) – The to-be validated hdf5 Group.

Raises

AssertionError – Raised if the validation process fails.
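The "validate only after a failure" pattern described above can be sketched as follows. A plain dict stands in for the h5py Group, and both validate() and read() are hypothetical helpers rather than the dataCAT implementation; only the dataset names are taken from the class documentation.

```python
NAMES = ("atoms", "bonds", "atom_count", "bond_count")


def validate(group):
    """Hypothetical validator: require a fixed set of dataset names."""
    missing = set(NAMES) - set(group)
    assert not missing, f"missing datasets: {sorted(missing)}"


def read(group):
    """Hypothetical reader that validates `group` only after a failure."""
    try:
        return {name: group[name] for name in NAMES}
    except Exception:
        # Raises a more descriptive AssertionError if `group` is invalid.
        validate(group)
        # Otherwise re-raise the original exception unchanged.
        raise


valid = dict.fromkeys(NAMES, 0)
print(sorted(read(valid)))
```

Deferring validation to the failure path keeps the common, successful case free of the validation overhead.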

classmethod PDBContainer.from_hdf5(group, index=None)[source]

Construct a new PDBContainer from the passed hdf5 group.

Parameters
Returns

A new PDBContainer constructed from group.

Return type

dataCAT.PDBContainer

PDBContainer.to_hdf5(group, index, update_scale=True)[source]

Update all datasets in group positioned at index with their counterparts from this instance.

Follows the standard broadcasting rules as employed by h5py.

Important

If index is passed as a sequence of integers then, contrary to NumPy, they will have to be sorted.

Parameters
  • group (h5py.Group) – The to-be updated h5py group.

  • index (int, Sequence[int] or slice) – An object for slicing all datasets in group. Note that, contrary to NumPy, if a sequence of integers is provided then they’ll have to be ordered.

  • update_scale (bool) – If True, also export PDBContainer.scale to the dimensional scale in the passed group.
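Since a sequence of integers has to be sorted, an unsorted index (and the data belonging to it) can be reordered beforehand. A minimal NumPy sketch is shown below; the arrays are made-up stand-ins and no hdf5 file is touched.

```python
import numpy as np

# An unsorted index; data[i] is meant for position index[i].
index = np.array([7, 0, 5])
data = np.array([70.0, 0.0, 50.0])

# Sort the index and apply the same permutation to the data.
order = np.argsort(index)
index_sorted = index[order]
data_sorted = data[order]

print(index_sorted)
print(data_sorted)
```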

API: Set Operations

PDBContainer.intersection(value)[source]

Construct a new PDBContainer by the intersection of self and value.

Examples

An example where self is intersected with an array-like object.

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.intersection(range(4))
>>> print(pdb_new.scale)
[0 1 2 3]
Parameters

value (PDBContainer or array-like) – Another PDBContainer or an array-like object representing PDBContainer.scale. Note that both value and self.scale should consist of unique elements.

Returns

A new instance by intersecting self.scale and value.

Return type

PDBContainer

See also

set.intersection

Return the intersection of two sets as a new set.
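Conceptually, the intersection amounts to finding the positions in self.scale that are shared with value and then slicing every array with those positions. The sketch below shows this with numpy.intersect1d(); whether dataCAT uses this function internally is an assumption, and atom_count is a made-up payload.

```python
import numpy as np

# Stand-ins for `self.scale` and `value`; both must hold unique elements.
scale = np.arange(23)
value = np.arange(4)

# Hypothetical payload with one entry per element of `scale`.
atom_count = np.arange(23, dtype=np.int32) + 100

# `idx` holds the positions in `scale` of the elements shared with `value`.
_, idx, _ = np.intersect1d(
    scale, value, assume_unique=True, return_indices=True
)

print(scale[idx])
print(atom_count[idx])
```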

PDBContainer.difference(value)[source]

Construct a new PDBContainer by the difference of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.difference(range(10, 30))
>>> print(pdb_new.scale)
[0 1 2 3 4 5 6 7 8 9]
Parameters

value (PDBContainer or array-like) – Another PDBContainer or an array-like object representing PDBContainer.scale. Note that both value and self.scale should consist of unique elements.

Returns

A new instance as the difference of self.scale and value.

Return type

PDBContainer

See also

set.difference

Return the difference of two or more sets as a new set.
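The difference can be pictured as a boolean mask over self.scale that keeps every element absent from value. A minimal sketch using numpy.isin() follows; it reproduces the scale of the example above but is not necessarily how dataCAT implements the method.

```python
import numpy as np

# Stand-ins for `self.scale` and `value`.
scale = np.arange(23)
value = np.arange(10, 30)

# Keep every element of `scale` that does not occur in `value`.
mask = np.isin(scale, value, invert=True)
remaining = scale[mask]

print(remaining)
```

The same mask could be applied along axis 0 of every other array in the container.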

PDBContainer.symmetric_difference(value)[source]

Construct a new PDBContainer by the symmetric difference of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> pdb2 = PDBContainer(..., scale=range(10, 30))  

>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.symmetric_difference(pdb2)
>>> print(pdb_new.scale)
[ 0  1  2  3  4  5  6  7  8  9 23 24 25 26 27 28 29]
Parameters

value (PDBContainer) – Another PDBContainer. Note that both value.scale and self.scale should consist of unique elements.

Returns

A new instance as the symmetric difference of self.scale and value.

Return type

PDBContainer

See also

set.symmetric_difference

Return the symmetric difference of two sets as a new set.
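For two plain index arrays the symmetric difference is available as numpy.setxor1d(). The sketch below reproduces the scale of the example above; it operates on the indices only and makes no claim about the internals of this method.

```python
import numpy as np

# Stand-ins for `self.scale` and `value.scale`; both hold unique elements.
scale1 = np.arange(23)
scale2 = np.arange(10, 30)

# Elements that occur in exactly one of the two arrays (sorted).
sym_diff = np.setxor1d(scale1, scale2, assume_unique=True)

print(sym_diff)
```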

PDBContainer.union(value)[source]

Construct a new PDBContainer by the union of self and value.

Examples

>>> from dataCAT import PDBContainer

>>> pdb = PDBContainer(...)  
>>> pdb2 = PDBContainer(..., scale=range(10, 30))  

>>> print(pdb.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22]

>>> pdb_new = pdb.union(pdb2)
>>> print(pdb_new.scale)
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29]
Parameters

value (PDBContainer) – Another PDBContainer. Note that both value.scale and self.scale should consist of unique elements.

Returns

A new instance as the union of self.scale and value.

Return type

PDBContainer

See also

set.union

Return the union of sets as a new set.
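One way to picture the union is to keep all entries of self and append those entries of value whose index is not yet present. The sketch below does this for the indices alone. The resulting order (self first, new elements after) is an assumption that happens to match the example above.

```python
import numpy as np

# Stand-ins for `self.scale` and `value.scale`; both hold unique elements.
scale1 = np.arange(23)
scale2 = np.arange(10, 30)

# Select the elements of `scale2` that are not yet part of `scale1`.
is_new = ~np.isin(scale2, scale1)
union = np.concatenate([scale1, scale2[is_new]])

print(union)
```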