Subset Generation
Functions for creating distributions of atomic indices (i.e. core anchor atoms).
Index
uniform_idx()
Yield the column-indices of dist which yield a uniform or clustered distribution.
distribute_idx()
Create a new distribution of atomic indices from idx of length f * len(idx).
API
- CAT.distribution.uniform_idx(dist, operation='min', cluster_size=1, start=None, randomness=None, weight=<function <lambda>>)
Yield the column-indices of dist which yield a uniform or clustered distribution.
Given the (symmetric) distance matrix \(\boldsymbol{D} \in \mathbb{R}^{n,n}\) and the vector \(\boldsymbol{a} \in \mathbb{N}^{m}\) (representing a subset of indices in \(D\)), then the \(i\)’th element \(a_{i}\) is defined below. All elements of \(\boldsymbol{a}\) are furthermore constrained to be unique. \(f(x)\) is herein a, as of yet unspecified, function for weighting each individual distance.
Following the convention used in python, the \(\boldsymbol{X}[0:3, 1:5]\) notation is herein used to denote the submatrix created by intersecting rows \(0\) up to (but not including) \(3\) and columns \(1\) up to (but not including) \(5\).
\[\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} a_{i} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{a}[0:i]] \bigr) & \text{if} & i > 0 \end{cases}\end{split}\]
Default weighting function: \(f(x) = e^{-x}\).
The row in \(D\) corresponding to \(a_{0}\) can alternatively be specified by start.
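The recursive definition of \(a_{i}\) can be read as a greedy selection loop. The block below is a minimal NumPy sketch of that reading, not CAT's actual implementation: the name `greedy_idx` is hypothetical, and masking already-selected rows with infinity is an assumption made here to enforce uniqueness.

```python
import numpy as np


def greedy_idx(dist, operation="min", start=None, weight=lambda x: np.exp(-x)):
    """Yield indices following the definition of a_i (illustrative sketch only)."""
    dist = np.asarray(dist, dtype=float)
    arg_func = {"min": np.argmin, "max": np.argmax}[operation]
    # Assumption: already-selected rows are masked so they are never picked twice
    mask_value = np.inf if operation == "min" else -np.inf

    w = weight(dist)  # f(D), evaluated once for all distances

    # i = 0: sum the weighted distances over each full row, unless start is given
    first = int(arg_func(w.sum(axis=1))) if start is None else int(start)
    yield first

    selected = [first]
    for _ in range(len(dist) - 1):
        # i > 0: sum the weighted distances to all previously selected indices
        score = w[:, selected].sum(axis=1)
        score[selected] = mask_value
        nxt = int(arg_func(score))
        selected.append(nxt)
        yield nxt


# Five points on a line; the last one lies far away from the others
coords = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
dist = np.abs(coords[:, None] - coords[None, :])
order = list(greedy_idx(dist))
```

With the default weight \(e^{-x}\) and `operation="min"`, the isolated point is selected first, because its row has the smallest sum of weighted distances.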
The \(\text{argmin}\) operation can be exchanged for \(\text{argmax}\) by setting operation to "max", thus yielding a clustered rather than a uniform distribution.
The cluster_size parameter allows for the creation of uniformly distributed clusters of size \(r\). For bookkeeping purposes, the vector of indices \(\boldsymbol{a} \in \mathbb{N}^{m}\) is herein reshaped into the matrix \(\boldsymbol{A} \in \mathbb{N}^{q, r} \; \text{with} \; q*r = m\). All elements of \(\boldsymbol{A}\) are, again, constrained to be unique.
\[\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} A_{i,j} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0; \; j=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr) & \text{if} & i > 0; \; j = 0 \\ \argmin\limits_{k \in \mathbb{N}} \dfrac{\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr)} {\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[i, 0:j]] \bigr)} & \text{if} & j > 0 \end{cases}\end{split}\]
Examples
>>> import numpy as np
>>> from CAT.distribution import uniform_idx
>>> dist: np.ndarray = np.random.rand(10, 10)
>>> out1 = uniform_idx(dist)
>>> idx_ar1 = np.fromiter(out1, dtype=np.intp)
>>> out2 = uniform_idx(dist, operation="min")
>>> out3 = uniform_idx(dist, cluster_size=5)
>>> out4 = uniform_idx(dist, cluster_size=[1, 1, 1, 1, 2, 2, 4])
>>> out5 = uniform_idx(dist, start=5)
>>> out6 = uniform_idx(dist, randomness=0.75)
>>> out7 = uniform_idx(dist, weight=lambda x: x**-1)
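A matrix filled with random numbers is in general not symmetric, whereas dist is described as a symmetric distance matrix. One way to obtain a matrix with \(D_{i,j} = D_{j,i}\) is to build it from Cartesian coordinates. The block below is a sketch using only NumPy; the coordinates and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
xyz = rng.random((10, 3))  # hypothetical Cartesian coordinates of 10 atoms

# Pairwise Euclidean distances via broadcasting; the result has shape (10, 10)
delta = xyz[:, None, :] - xyz[None, :, :]
dist = np.linalg.norm(delta, axis=-1)
```

The resulting array is symmetric with a zero diagonal, so it should be a suitable candidate for the dist argument.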
- Parameters
dist (numpy.ndarray [float], shape \((n, n)\)) – A symmetric 2D NumPy array (\(D_{i,j} = D_{j,i}\)) representing the distance matrix \(D\).
operation (str) – Whether to use argmin() or argmax(). Accepted values are "min" and "max".
cluster_size (int or Iterable [int]) – An integer or iterable of integers representing the size of clusters. Used in conjunction with operation = "max" for creating a uniform distribution of clusters. cluster_size = 1 is equivalent to a normal uniform distribution. Providing cluster_size as an iterable of integers will create clusters of varying, user-specified, sizes. For example, cluster_size = range(1, 4) will continuously create clusters of sizes 1, 2 and 3. The iteration process is repeated until all atoms represented by dist are exhausted.
start (int, optional) – The index of the starting row in dist. If None, start in whichever row contains the global minimum (\(\DeclareMathOperator*{\argmin}{\arg\!\min} \argmin\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}\)) or maximum (\(\DeclareMathOperator*{\argmax}{\arg\!\max} \argmax\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}\)). See operation.
randomness (float, optional) – If not None, represents the probability that a random index will be yielded rather than obeying operation. Should obey the following condition: \(0 \le \text{randomness} \le 1\).
weight (Callable) – A callable for applying weights to the distance; default: \(e^{-x}\). The callable should take an array as argument and return a new array, e.g. numpy.exp().
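The cycling behaviour described for an iterable cluster_size can be sketched with itertools.cycle. The helper name `iter_cluster_sizes` is hypothetical, and truncating the final cluster when too few atoms remain is an assumption made for this sketch; it is not taken from CAT itself.

```python
from itertools import cycle


def iter_cluster_sizes(cluster_size, n_atoms):
    """Yield cluster sizes until n_atoms atoms are assigned (hypothetical helper)."""
    sizes = [cluster_size] if isinstance(cluster_size, int) else list(cluster_size)
    if not sizes or any(s < 1 for s in sizes):
        raise ValueError("cluster sizes must be positive integers")

    remaining = n_atoms
    for size in cycle(sizes):
        if remaining <= 0:
            break
        size = min(size, remaining)  # assumption: truncate the final cluster
        yield size
        remaining -= size


sizes = list(iter_cluster_sizes(range(1, 4), 10))
print(sizes)  # [1, 2, 3, 1, 2, 1]
```

Under these assumptions, ten atoms and cluster_size = range(1, 4) give clusters of sizes 1, 2, 3, 1, 2 and a final truncated cluster of size 1.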
- Yields
int – Yield the column-indices specified in \(\boldsymbol{a}\).
- CAT.distribution.distribute_idx(core, idx, f, mode='uniform', **kwargs)
Create a new distribution of atomic indices from idx of length f * len(idx).
- Parameters
core (array-like [float], shape \((m, 3)\)) – A 2D array-like object (such as a Molecule instance) consisting of Cartesian coordinates.
idx (int or Iterable [int], shape \((i,)\)) – An integer or iterable of unique integers representing the 0-based indices of all anchor atoms in core.
f (float) – A float obeying the following condition: \(0.0 < f \le 1.0\). Represents the fraction of idx that will be returned.
mode (str) – How the subset of to-be returned indices will be generated. Accepts one of the following values:
"random": A random distribution.
"uniform": A uniform distribution; the distance between each successive atom and all previous points is maximized.
"cluster": A clustered distribution; the distance between each successive atom and all previous points is minimized.
**kwargs (Any) – Further keyword arguments for the mode-specific functions.
- Returns
A 1D array of atomic indices. If idx has \(i\) elements, then the length of the returned array is equal to \(\max(1, f*i)\).
- Return type
numpy.ndarray [int], shape \((f*i,)\)
See also
uniform_idx()
Yield the column-indices of dist which yield a uniform or clustered distribution.
cluster_idx()
Return the column-indices of dist which yield a clustered distribution.
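To illustrate the \(\max(1, f*i)\) length rule given for distribute_idx(), the block below sketches what a "random" mode could look like using NumPy alone. The function name `random_subset` is hypothetical, and rounding \(f*i\) to the nearest integer is an assumption; the documentation above does not state how fractional lengths are handled.

```python
import numpy as np


def random_subset(idx, f, seed=None):
    """Return a random subset of idx holding a fraction f of its elements (sketch)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("f must obey 0.0 < f <= 1.0")

    # Accept a single integer as well as an iterable of integers
    idx_ar = np.array([idx] if isinstance(idx, int) else list(idx), dtype=np.intp)

    # Assumption: f * i is rounded to the nearest integer, with a minimum of 1
    size = max(1, int(round(f * len(idx_ar))))

    rng = np.random.default_rng(seed)
    return rng.choice(idx_ar, size=size, replace=False)


subset = random_subset(range(20), f=0.25, seed=1)
```

Sampling without replacement keeps the returned indices unique, matching the requirement that idx consists of unique integers.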