# Subset Generation

Functions for creating distributions of atomic indices (i.e. core anchor atoms).

## Index

| Function | Description |
| --- | --- |
| uniform_idx(dist[, operation, cluster_size, ...]) | Yield the column-indices of dist which yield a uniform or clustered distribution. |
| distribute_idx(core, idx, f[, mode]) | Create a new distribution of atomic indices from idx of length f * len(idx). |

## API

CAT.distribution.uniform_idx(dist, operation='min', cluster_size=1, start=None, randomness=None, weight=<function <lambda>>)

Yield the column-indices of dist which yield a uniform or clustered distribution.

Given the (symmetric) distance matrix $$\boldsymbol{D} \in \mathbb{R}^{n,n}$$ and the vector $$\boldsymbol{a} \in \mathbb{N}^{m}$$ (representing a subset of indices in $$\boldsymbol{D}$$), the $$i$$’th element $$a_{i}$$ is defined below. All elements of $$\boldsymbol{a}$$ are furthermore constrained to be unique. Here $$f(x)$$ is an as yet unspecified function that weights each individual distance.

Following the convention used in Python, the notation $$\boldsymbol{X}[0:3, 1:5]$$ denotes the submatrix created by intersecting rows $$0$$ up to (but not including) $$3$$ and columns $$1$$ up to (but not including) $$5$$.
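The slicing notation above maps directly onto NumPy indexing. The following is a minimal sketch; the array `X` and the index vector `a` are made-up illustrations and are not part of the CAT API.

```python
import numpy as np

# A 4x6 example matrix; the values are arbitrary.
X = np.arange(24).reshape(4, 6)

# Rows 0, 1, 2 intersected with columns 1, 2, 3, 4.
sub = X[0:3, 1:5]
print(sub.shape)  # (3, 4)
print(sub[0])     # [1 2 3 4]

# The D[k, a[0:i]] pattern used in the definitions: row k,
# restricted to the columns listed in the first i elements of a.
a = np.array([2, 0, 3])
row = X[1, a[0:2]]
print(row)  # [8 6]
```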

$\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} a_{i} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{a}[0:i]] \bigr) & \text{if} & i > 0 \end{cases}\end{split}$

Default weighting function: $$f(x) = e^{-x}$$.

The row in $$D$$ corresponding to $$a_{0}$$ can alternatively be specified by start.

The $$\text{argmin}$$ operation can be exchanged for $$\text{argmax}$$ by setting operation to "max", thus yielding a clustered rather than a uniform distribution.
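The definition of $$a_{i}$$, together with the start and operation options, can be sketched as a small greedy loop. This is an illustration of the formula only and not CAT's implementation: the name `greedy_idx` is hypothetical, cluster_size and randomness are left out, and uniqueness is enforced by masking indices that were already chosen.

```python
import numpy as np


def greedy_idx(dist, operation="min", start=None, weight=lambda x: np.exp(-x)):
    """Hypothetical helper: yield indices following the a_i definition."""
    arg_func = {"min": np.argmin, "max": np.argmax}[operation]
    # Value that can never win the argmin/argmax; used to mask chosen indices.
    fill = np.inf if operation == "min" else -np.inf
    w = weight(np.asarray(dist, dtype=float))

    # a_0: the row with the extreme weighted sum, unless `start` is given.
    a = [int(arg_func(w.sum(axis=1))) if start is None else int(start)]
    yield a[0]

    for _ in range(len(w) - 1):
        # Sum of weighted distances from every candidate k to all chosen indices.
        score = w[:, a].sum(axis=1)
        # Enforce uniqueness: already chosen indices are excluded.
        score[a] = fill
        a.append(int(arg_func(score)))
        yield a[-1]


# Four points on a line; the last one lies far away from the others.
x = np.array([0.0, 1.0, 2.0, 10.0])
dist = np.abs(x[:, None] - x[None, :])
print(list(greedy_idx(dist)))  # [3, 0, 2, 1]
```

With the default weight $$e^{-x}$$ and "min", the isolated point is picked first and each later pick lies as far as possible from the points already chosen.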

The cluster_size parameter allows for the creation of uniformly distributed clusters of size $$r$$. For bookkeeping purposes, the vector of indices $$\boldsymbol{a} \in \mathbb{N}^{m}$$ is reshaped into the matrix $$\boldsymbol{A} \in \mathbb{N}^{q, r} \; \text{with} \; q*r = m$$. All elements of $$\boldsymbol{A}$$ are, again, constrained to be unique.

$\begin{split}\DeclareMathOperator*{\argmin}{\arg\!\min} A_{i,j} = \begin{cases} \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}_{k,:} \bigr) & \text{if} & i=0; \; j=0 \\ \argmin\limits_{k \in \mathbb{N}} \sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr) & \text{if} & i > 0; \; j = 0 \\ \argmin\limits_{k \in \mathbb{N}} \dfrac{\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[0:i, 0:r]] \bigr)} {\sum f \bigl( \boldsymbol{D}[k, \boldsymbol{A}[i, 0:j]] \bigr)} & \text{if} & j > 0 \end{cases}\end{split}$
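The bookkeeping behind $$\boldsymbol{A}$$ may be easier to picture with a concrete reshape. The sketch below uses a made-up index vector and only shows how the two slices in the cluster formula select earlier clusters and the current cluster; it does not reproduce the selection itself.

```python
import numpy as np

# Hypothetical index vector with m = 6 entries.
a = np.array([7, 2, 9, 4, 0, 5])

r = 2                 # cluster size
A = a.reshape(-1, r)  # shape (q, r) with q * r == m; one row per cluster
print(A.shape)        # (3, 2)

i, j = 1, 1
previous_clusters = A[0:i, 0:r].ravel()  # all members of completed clusters
current_cluster = A[i, 0:j]              # members already in cluster i
print(previous_clusters)  # [7 2]
print(current_cluster)    # [9]
```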

Examples

>>> import numpy as np
>>> from CAT.distribution import uniform_idx

>>> rand = np.random.rand(10, 10)
>>> dist: np.ndarray = (rand + rand.T) / 2  # symmetric, as required for dist

>>> out1 = uniform_idx(dist)
>>> idx_ar1 = np.fromiter(out1, dtype=np.intp)

>>> out2 = uniform_idx(dist, operation="min")
>>> out3 = uniform_idx(dist, cluster_size=5)
>>> out4 = uniform_idx(dist, cluster_size=[1, 1, 1, 1, 2, 2, 4])
>>> out5 = uniform_idx(dist, start=5)
>>> out6 = uniform_idx(dist, randomness=0.75)
>>> out7 = uniform_idx(dist, weight=lambda x: x**-1)

Parameters:
• dist (numpy.ndarray [float], shape $$(n, n)$$) – A symmetric 2D NumPy array ($$D_{i,j} = D_{j,i}$$) representing the distance matrix $$D$$.

• operation (str) – Whether to use argmin() or argmax(). Accepted values are "min" and "max".

• cluster_size (int or Iterable [int]) –

An integer or iterable of integers representing the size of clusters. Used in conjunction with operation = "max" for creating a uniform distribution of clusters. cluster_size = 1 is equivalent to a normal uniform distribution.

Providing cluster_size as an iterable of integers will create clusters of varying, user-specified sizes. For example, cluster_size = range(1, 4) will repeatedly create clusters of sizes 1, 2 and 3. The iteration process is repeated until all atoms represented by dist are exhausted.

• start (int, optional) – The index of the starting row in dist. If None, start in whichever row contains the global minimum ($$\DeclareMathOperator*{\argmin}{\arg\!\min} \argmin\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}$$) or maximum ($$\DeclareMathOperator*{\argmax}{\arg\!\max} \argmax\limits_{k \in \mathbb{N}} ||\boldsymbol{D}_{k, :}||_{p}$$). See operation.

• randomness (float, optional) – If not None, represents the probability that a random index will be yielded rather than obeying operation. Should obey the following condition: $$0 \le randomness \le 1$$.

• weight (Callable) – A callable for applying weights to the distance; default: $$e^{-x}$$. The callable should take an array as argument and return a new array, e.g. numpy.exp().
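The cycling behaviour described for an iterable cluster_size can be pictured with itertools.cycle. This is a rough sketch: `iter_cluster_sizes` is a hypothetical helper, positive cluster sizes are assumed, and truncating the final cluster when too few atoms remain is an assumption that the text above does not confirm.

```python
from itertools import cycle


def iter_cluster_sizes(cluster_size, n):
    """Hypothetical helper: yield cluster sizes until n atoms are used up."""
    # Treat a single integer as an iterable of length one.
    sizes = cycle([cluster_size] if isinstance(cluster_size, int) else cluster_size)
    remaining = n
    while remaining > 0:
        # Assumption: the last cluster is truncated to the atoms that are left.
        size = min(next(sizes), remaining)
        yield size
        remaining -= size


print(list(iter_cluster_sizes(range(1, 4), 10)))  # [1, 2, 3, 1, 2, 1]
print(list(iter_cluster_sizes(5, 10)))            # [5, 5]
```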

Yields:

int – The column-indices specified in $$\boldsymbol{a}$$.
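The effect of the randomness parameter can be sketched as a single selection step: with probability randomness a random candidate is returned, otherwise the regular argmin is used. The helper `pick`, its arguments and the seeded generator are all hypothetical and only illustrate the description of the parameter, not CAT's internals.

```python
import numpy as np

rng = np.random.default_rng(0)


def pick(score, available, randomness=None, rng=rng):
    """Hypothetical helper: choose one index from the boolean mask `available`."""
    candidates = np.flatnonzero(available)
    # With probability `randomness`, ignore the scores and pick at random.
    if randomness is not None and rng.random() < randomness:
        return int(rng.choice(candidates))
    # Otherwise follow the "min" operation among the remaining candidates.
    return int(candidates[np.argmin(score[candidates])])


score = np.array([0.3, 0.1, 0.9, 0.2])
available = np.array([True, False, True, True])  # index 1 was already used
print(pick(score, available))  # 3
```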

CAT.distribution.distribute_idx(core, idx, f, mode='uniform', **kwargs)

Create a new distribution of atomic indices from idx of length f * len(idx).

Parameters:
• core (array-like [float], shape $$(m, 3)$$) – A 2D array-like object (such as a Molecule instance) consisting of Cartesian coordinates.

• idx (int or Iterable [int], shape $$(i,)$$) – An integer or iterable of unique integers representing the 0-based indices of all anchor atoms in core.

• f (float) – A float obeying the following condition: $$0.0 < f \le 1.0$$. Represents the fraction of idx that will be returned.

• mode (str) –

How the subset of to-be returned indices will be generated. Accepts one of the following values:

• "random": A random distribution.

• "uniform": A uniform distribution; the distance between each successive atom and all previous points is maximized.

• "cluster": A clustered distribution; the distance between each successive atom and all previous points is minimized.

• **kwargs (Any) – Further keyword arguments for the mode-specific functions.

Returns:

A 1D array of atomic indices. If idx has $$i$$ elements, then the length of the returned array is equal to $$\max(1, f*i)$$.

Return type:

numpy.ndarray [int], shape $$(f*i,)$$

See also:

uniform_idx()
cluster_idx()