Filtering

Tools for training structure selection.

composition_select(composition_vector, composition, cell_sizes, num_samples, rng=None)

Structure selection based on composition multinomial probability.

Note

This function needs quite a bit of tweaking to produce good samples.

Parameters:

composition_vector (ndarray) – N (samples) by n (components) with the composition of the samples to select from.
composition (ndarray) – array for the center composition to sample around.
cell_sizes (int or Sequence) – number of unit cells or size of supercells used to set the number of variables in the multinomial distribution.
num_samples (int) – number of samples to return. If this number is too high relative to the total number of samples (or to their coverage of the composition space), the function may take very long to return, if it returns at all, and the selected samples will not be very representative of the multinomial distributions.
rng (np.Generator) – optional NumPy seed, Generator, SeedSequence, etc.

Returns:

list with indices of the composition vector corresponding to the selected samples.

Return type:

list of int
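The idea of selecting by multinomial probability can be illustrated with a small rejection-sampling sketch. This is not the library's implementation: the names `multinomial_log_pmf`, `composition_select_sketch`, `seed` and `max_attempts` are hypothetical, a single integer cell size is assumed, and the center composition is assumed to have strictly positive entries.

```python
import math

import numpy as np


def multinomial_log_pmf(counts, probs):
    """Log of the multinomial pmf for integer-valued counts (probs must be > 0)."""
    n = sum(counts)
    log_coeff = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    return log_coeff + sum(x * math.log(p) for x, p in zip(counts, probs))


def composition_select_sketch(composition_vector, composition, cell_size,
                              num_samples, seed=None, max_attempts=100_000):
    """Pick row indices, favoring compositions likely under the multinomial."""
    rng = np.random.default_rng(seed)
    comps = np.asarray(composition_vector, dtype=float)
    if num_samples > len(comps):
        raise ValueError("num_samples exceeds the number of available samples")

    # Convert fractional compositions to species counts in the cell.
    counts = np.rint(comps * cell_size)
    log_p = np.array([multinomial_log_pmf(c, composition) for c in counts])
    # Acceptance probability relative to the most likely sample (max is 1).
    accept = np.exp(log_p - log_p.max())

    remaining = list(range(len(comps)))
    selected = []
    attempts = 0
    # Hypothetical guard: give up after max_attempts proposals.
    while len(selected) < num_samples and attempts < max_attempts:
        attempts += 1
        pos = int(rng.integers(len(remaining)))
        if rng.random() < accept[remaining[pos]]:
            selected.append(remaining.pop(pos))
    return selected


comps = [[0.5, 0.5], [0.25, 0.75], [0.75, 0.25], [0.5, 0.5], [0.0, 1.0]]
picked = composition_select_sketch(comps, [0.5, 0.5], cell_size=4,
                                   num_samples=3, seed=0)
```

In this sketch, samples far from the center composition have small acceptance probabilities. That is one way to see why requesting many samples from a poorly covered space can take a long time.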

full_row_rank_select(feature_matrix, tol=1e-15, nrows=None)

Choose a (maximally) full rank subset of rows in a feature matrix.

This method is for underdetermined systems, i.e., where the number of columns (features) exceeds the number of rows (structures).

Parameters:

feature_matrix (ndarray) – feature matrix to select rows/structures from.
tol (float) – optional tolerance used to determine the pivots in the upper triangular matrix of the LU decomposition.
nrows (int) – optional number of rows to include. If None, will include the maximum possible number of rows.

Returns:

list with indices of rows that form a full rank system.

Return type:

list of int
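To make the notion of a full-rank row subset concrete, here is a minimal sketch. It uses greedy Gram-Schmidt projection instead of the LU pivots described above, so it illustrates the goal rather than the library's method. The function name `full_row_rank_sketch` and its relative-tolerance rule are assumptions.

```python
import numpy as np


def full_row_rank_sketch(feature_matrix, tol=1e-10, nrows=None):
    """Greedily keep rows that are linearly independent of those already kept."""
    A = np.asarray(feature_matrix, dtype=float)
    limit = min(A.shape) if nrows is None else nrows
    basis = []      # orthonormal basis of the span of the selected rows
    selected = []
    for i, row in enumerate(A):
        if len(selected) >= limit:
            break
        # Remove the components of the row that lie in the current span.
        residual = row.copy()
        for q in basis:
            residual -= (residual @ q) * q
        norm = np.linalg.norm(residual)
        # Keep the row only if something is left after projection.
        if norm > tol * max(1.0, np.linalg.norm(row)):
            basis.append(residual / norm)
            selected.append(i)
    return selected


# 4 structures, 5 features: rows 1 and 3 depend linearly on rows 0 and 2.
A = np.array([
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0, 0.0],
])
rows = full_row_rank_sketch(A)
```

The result depends on row order because the sketch keeps the first independent rows it encounters. A pivoted LU decomposition can choose a different, numerically better conditioned subset.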

gaussian_select(feature_matrix, num_samples, orthogonalize=False)

Sequentially picks samples with feature vectors that most closely align with a sampled random Gaussian vector on the unit sphere.

This works much better when the number of rows in the feature matrix is much larger than the number of samples requested.

Parameters:

feature_matrix (ndarray) – feature matrix to select rows/structures from.
num_samples (int) – number of samples/rows/structures to select.
orthogonalize (bool) – if True, will orthogonalize the generated random vectors.

Returns:

list with indices of rows that align with Gaussian samples

Return type:

list of int
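The selection procedure can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: alignment is measured by signed cosine similarity, each row is selected at most once, the `orthogonalize` option is omitted, and the names `gaussian_select_sketch` and `seed` are hypothetical.

```python
import numpy as np


def gaussian_select_sketch(feature_matrix, num_samples, seed=None):
    """Pick rows best aligned with random directions on the unit sphere."""
    rng = np.random.default_rng(seed)
    A = np.asarray(feature_matrix, dtype=float)
    if num_samples > len(A):
        raise ValueError("num_samples exceeds the number of rows")

    # Normalize rows so that a dot product is a cosine similarity.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit_rows = A / norms

    available = np.ones(len(A), dtype=bool)
    selected = []
    for _ in range(num_samples):
        # A normalized standard normal vector is uniform on the unit sphere.
        v = rng.standard_normal(A.shape[1])
        v /= np.linalg.norm(v)
        scores = unit_rows @ v
        scores[~available] = -np.inf   # never pick the same row twice
        idx = int(np.argmax(scores))
        selected.append(idx)
        available[idx] = False
    return selected


features = np.random.default_rng(1).standard_normal((50, 4))
picks = gaussian_select_sketch(features, 5, seed=0)
```

With many more rows than requested samples, each random direction is likely to have a well-aligned row still available. This is consistent with the note above about when the method works best.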