Subset Generation

Functions for creating distributions of atomic indices (i.e. core anchor atoms).

Index

`uniform_idx`(dist[, operation, cluster_size, ...])	Yield the column-indices of dist which yield a uniform or clustered distribution.
`distribute_idx`(core, idx, f[, mode])	Create a new distribution of atomic indices from idx of length `f * len(idx)`.

API

CAT.distribution.uniform_idx(dist, operation='min', cluster_size=1, start=None, randomness=None, weight=<function <lambda>>)

Yield the column-indices of dist which yield a uniform or clustered distribution.

Given the (symmetric) distance matrix \(\boldsymbol{D} \in \mathbb{R}^{n,n}\) and the vector \(\boldsymbol{a} \in \mathbb{N}^{m}\) (representing a subset of indices in \(D\)), the \(i\)’th element \(a_{i}\) is defined below. All elements of \(\boldsymbol{a}\) are furthermore constrained to be unique. Here, \(f(x)\) is an as yet unspecified function for weighting each individual distance.

Following the convention used in python, the \(\boldsymbol{X}[0:3, 1:5]\) notation is herein used to denote the submatrix created by intersecting rows \(0\) up to (but not including) \(3\) and columns \(1\) up to (but not including) \(5\).
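As a brief illustration of this slicing notation, the following NumPy sketch extracts such a submatrix and also shows indexing a row with an index vector, as in \(\boldsymbol{D}[k, \boldsymbol{a}[0:i]]\). The arrays `X` and `a` are hypothetical and chosen only for demonstration.

```python
import numpy as np

# Hypothetical 5 x 6 matrix filled with 0, 1, ..., 29
X = np.arange(30).reshape(5, 6)

# Rows 0 up to (not including) 3, columns 1 up to (not including) 5
sub = X[0:3, 1:5]
print(sub.shape)  # (3, 4)

# Indexing a single row with a slice of an index vector
a = np.array([4, 0, 2])
row_subset = X[1, a[0:2]]  # elements of row 1 in columns 4 and 0
print(row_subset)  # [10  6]
```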

\[\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} a_{i} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{a}[0:i]] \bigr) & \text{if} & i > 0 \end{cases}\end{split}\]

Default weighting function: \(f(x) = e^{-x}\).

The row in \(D\) corresponding to \(a_{0}\) can alternatively be specified by start.

The \(\text{argmin}\) operation can be exchanged for \(\text{argmax}\) by setting operation to "max", thus yielding a clustered rather than a uniform distribution.
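The recurrence above can be sketched as a greedy loop in NumPy. This is a minimal sketch under stated assumptions, not CAT's actual implementation: the function name `greedy_idx_sketch` is hypothetical, cluster_size and randomness are ignored, and uniqueness is enforced by masking indices that were already yielded.

```python
import numpy as np

def greedy_idx_sketch(dist, operation="min", start=None,
                      weight=lambda x: np.exp(-x)):
    """Hypothetical sketch of the recurrence for a_i; not CAT's own code."""
    arg_func = {"min": np.argmin, "max": np.argmax}[operation]
    w = weight(np.asarray(dist, dtype=float))  # f(D), applied element-wise
    n = len(w)

    # a_0: the row with the extremal weighted sum, unless `start` is given
    first = int(arg_func(w.sum(axis=1))) if start is None else int(start)
    chosen = [first]
    yield first

    # Running value of sum f(D[k, a[0:i]]) for every candidate row k
    score = w[:, first].astype(float)
    fill = np.inf if operation == "min" else -np.inf
    for _ in range(n - 1):
        masked = score.copy()
        masked[chosen] = fill  # assumption: uniqueness enforced by masking
        k = int(arg_func(masked))
        chosen.append(k)
        yield k
        score += w[:, k]

# Five hypothetical points on a line at x = 0, 1, 2, 3, 4
x = np.arange(5.0)
dist = np.abs(x[:, None] - x[None, :])
uniform = list(greedy_idx_sketch(dist, start=0))
clustered = list(greedy_idx_sketch(dist, operation="max", start=0))
print(uniform[:3])  # [0, 4, 2]
print(clustered)    # [0, 1, 2, 3, 4]
```

With the default weighting, "min" picks the far end of the line and then its midpoint, whereas "max" walks along nearest neighbours.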

The cluster_size parameter allows for the creation of uniformly distributed clusters of size \(r\). For bookkeeping purposes, the vector of indices \(\boldsymbol{a} \in \mathbb{N}^{m}\) is herein reshaped into the matrix \(\boldsymbol{A} \in \mathbb{N}^{q, r} \; \text{with} \; q*r = m\). All elements of \(\boldsymbol{A}\) are, again, constrained to be unique.

\[\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} A_{i,j} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0; \; j=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr) & \text{if} & i > 0; \; j = 0 \\ \argmin\limits_{k \in \mathbb{N}} \dfrac{\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr)} {\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[i, 0:j]] \bigr)} & \text{if} & j > 0 \end{cases}\end{split}\]
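The bookkeeping behind \(\boldsymbol{A}\) may be easier to follow with a small NumPy sketch. The index values are hypothetical, and the row-major fill order (one cluster per row) is an assumption made for illustration.

```python
import numpy as np

# Hypothetical index vector a with m = 6 elements and cluster size r = 2
a = np.array([7, 2, 9, 4, 0, 5])
r = 2
A = a.reshape(-1, r)  # shape (q, r) = (3, 2); one cluster per row

# While filling element A[i, j] with i = 2 and j = 1:
i, j = 2, 1
previous_clusters = A[0:i, 0:r]  # all completed clusters: [[7, 2], [9, 4]]
current_cluster = A[i, 0:j]      # members of the cluster being built: [0]
```

In the third case of the equation, the numerator sums over `previous_clusters` and the denominator over `current_cluster`.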

Examples

>>> import numpy as np
>>> from CAT.distribution import uniform_idx

>>> dist: np.ndarray = np.random.rand(10, 10)

>>> out1 = uniform_idx(dist)
>>> idx_ar1 = np.fromiter(out1, dtype=np.intp)

>>> out2 = uniform_idx(dist, operation="min")
>>> out3 = uniform_idx(dist, cluster_size=5)
>>> out4 = uniform_idx(dist, cluster_size=[1, 1, 1, 1, 2, 2, 4])
>>> out5 = uniform_idx(dist, start=5)
>>> out6 = uniform_idx(dist, randomness=0.75)
>>> out7 = uniform_idx(dist, weight=lambda x: x**-1)

Parameters:

dist (numpy.ndarray [float], shape \((n, n)\)) – A symmetric 2D NumPy array (\(D_{i,j} = D_{j,i}\)) representing the distance matrix \(D\).
operation (str) – Whether to use argmin() or argmax(). Accepted values are "min" and "max".
cluster_size (int or Iterable [int]) –
An integer or iterable of integers representing the size of clusters. Used in conjunction with operation = "max" for creating a uniform distribution of clusters. cluster_size = 1 is equivalent to a normal uniform distribution.

Providing cluster_size as an iterable of integers will create clusters of varying, user-specified sizes. For example, cluster_size = range(1, 4) will continuously create clusters of sizes 1, 2 and 3. The iteration process is repeated until all atoms represented by dist are exhausted.
start (int, optional) – The index of the starting row in dist. If None, start in whichever row contains the global minimum (\(\DeclareMathOperator*{\argmin}{\arg\!\min} \argmin\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}\)) or maximum (\(\DeclareMathOperator*{\argmax}{\arg\!\max} \argmax\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}\)). See operation.
randomness (float, optional) – If not None, represents the probability that a random index will be yielded rather than obeying operation. Should obey the following condition: \(0 \le randomness \le 1\).
weight (Callable) – A callable for applying weights to the distance; default: \(e^{-x}\). The callable should take an array as argument and return a new array, e.g. numpy.exp().
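The cycling behaviour described for an iterable cluster_size can be sketched as follows. The helper `cluster_sizes_sketch` is hypothetical and not part of CAT; in particular, truncating the final cluster to the number of remaining atoms is an assumption.

```python
from itertools import cycle

def cluster_sizes_sketch(n_atoms, cluster_size):
    """Hypothetical helper: repeat the given sizes until n_atoms are used up."""
    sizes = [cluster_size] if isinstance(cluster_size, int) else list(cluster_size)
    if not sizes or any(s < 1 for s in sizes):
        raise ValueError("cluster sizes must be positive integers")

    remaining = n_atoms
    out = []
    for size in cycle(sizes):
        if remaining <= 0:
            break
        # Assumption: the last cluster is truncated to the atoms that are left
        out.append(min(size, remaining))
        remaining -= out[-1]
    return out

print(cluster_sizes_sketch(10, range(1, 4)))  # [1, 2, 3, 1, 2, 1]
print(cluster_sizes_sketch(10, 5))            # [5, 5]
```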

Yields:

int – The column-indices specified in \(\boldsymbol{a}\).

CAT.distribution.distribute_idx(core, idx, f, mode='uniform', **kwargs)

Create a new distribution of atomic indices from idx of length f * len(idx).

Parameters:

core (array-like [float], shape \((m, 3)\)) – A 2D array-like object (such as a Molecule instance) consisting of Cartesian coordinates.
idx (int or Iterable [int], shape \((i,)\)) – An integer or iterable of unique integers representing the 0-based indices of all anchor atoms in core.
f (float) – A float obeying the following condition: \(0.0 < f \le 1.0\). Represents the fraction of idx that will be returned.
mode (str) –
How the subset of to-be-returned indices will be generated. Accepts one of the following values:
- "random": A random distribution.
- "uniform": A uniform distribution; the distance between each successive atom and all previous points is maximized.
- "cluster": A clustered distribution; the distance between each successive atom and all previous points is minimized.
**kwargs (Any) – Further keyword arguments for the mode-specific functions.

Returns:

A 1D array of atomic indices. If idx has \(i\) elements, then the length of the returned array is equal to \(\max(1, f*i)\).

Return type:

numpy.ndarray [int], shape \((f*i,)\)
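To make the relation between f and the output length concrete, here is a minimal sketch of what the "random" mode might look like. The function `random_subset_sketch` and its `seed` parameter are hypothetical and not part of CAT, and rounding \(f*i\) to the nearest integer is an assumption, since the text above only states \(\max(1, f*i)\).

```python
import numpy as np

def random_subset_sketch(idx, f, seed=None):
    """Hypothetical stand-in for mode="random"; not CAT's own code."""
    idx = np.atleast_1d(np.asarray(idx, dtype=np.intp))
    if not 0.0 < f <= 1.0:
        raise ValueError("f must obey 0.0 < f <= 1.0")

    # Assumption: f * i is rounded to the nearest integer, with a minimum of 1
    n_out = max(1, int(round(f * len(idx))))

    rng = np.random.default_rng(seed)
    return rng.choice(idx, size=n_out, replace=False)  # unique indices only

subset = random_subset_sketch(range(20), f=0.25, seed=0)
print(len(subset))  # 5
```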