Structure Wrangler#
Implementation of a StructureWrangler.
A StructureWrangler is used to generate and organize training data to fit a cluster expansion using the terms defined in a ClusterSubspace. It takes care of computing the training features (correlations) to construct a feature matrix to be used along with a target property vector to obtain the coefficients for a cluster expansion using some linear regression model.
Includes functions used to preprocess and check (wrangling) fitting data of structures and properties.
- class StructureWrangler(cluster_subspace)[source]#
Bases: MSONable
Class to create fitting data to fit a cluster expansion.
A StructureWrangler handles (wrangles) input data structures and properties to fit in a cluster expansion. This class holds a ClusterSubspace used to compute correlation vectors and produce feature/design matrices used to fit the final ClusterExpansion.
This class is meant to take all input training data in the form of (structure, properties) where the properties represent the target material property for the given structure that will be used to train the cluster expansion, and returns the fitting data as a cluster correlation feature matrix (orbit basis function values).
This class also has methods to check/prepare/filter the data. A metadata dictionary is used to keep track of applied filters, but users can also use it to store any other pertinent information, which will be saved when using StructureWrangler.as_dict for future reference. Other preprocessing and filtering methods are available in tools.py and select.py.
Initialize a StructureWrangler.
- Parameters:
cluster_subspace (ClusterSubspace) – a ClusterSubspace object that will be used to fit a ClusterExpansion with the provided data.
- add_data_indices(key, indices)[source]#
Add a set of data indices.
For example, use this for saving test/training splits or separating duplicates.
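As an illustration, a train/test split can be prepared as two plain lists of row indices before being stored. The sketch below assumes 10 structures have been added; `split_indices` is a hypothetical helper, not part of the library, and the final calls appear only as comments because they need a populated StructureWrangler.

```python
import random


def split_indices(num_structures, test_fraction=0.2, seed=0):
    """Return (train, test) lists of row indices (hypothetical helper)."""
    indices = list(range(num_structures))
    rng = random.Random(seed)  # seeded so the split is reproducible
    rng.shuffle(indices)
    num_test = max(1, round(num_structures * test_fraction))
    test = sorted(indices[:num_test])
    train = sorted(indices[num_test:])
    return train, test


train, test = split_indices(10)

# With a populated StructureWrangler named `wrangler` (assumed), the index
# sets could then be stored under keys of your choosing:
# wrangler.add_data_indices("train", train)
# wrangler.add_data_indices("test", test)
```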
- add_entry(entry, properties=None, weights=None, supercell_matrix=None, site_mapping=None, verbose=True, raise_failed=False)[source]#
Add a structure and measured property to the StructureWrangler.
The energy and properties need to be extensive (i.e. not normalized per atom or unit cell, directly from DFT).
An attempt is made to compute the correlation vector; if it succeeds, the structure is added, otherwise the structure is ignored. Failures are usually caused by the StructureMatcher in the given ClusterSubspace failing to map the structure to the primitive structure.
- Parameters:
entry (ComputedStructureEntry) – A ComputedStructureEntry with a training structure, energy and properties
properties (dict) – Dictionary with a key describing each property and the target value for the corresponding structure. For example, a single property can be given as {'energy': value}, and more than one as {'total_energy': value1, 'formation_energy': value2}. You are free to make up the keys for each property, but make sure you are consistent for all structures that you add.
weights (dict) – the weight given to the structure when doing the fit. The key must match at least one of the given properties.
supercell_matrix (ndarray) – optional if the corresponding structure has already been matched to the ClusterSubspace prim structure, passing the supercell_matrix will use that instead of trying to re-match. If using this, the user is responsible for having the correct supercell_matrix. Here you are the cause of your own bugs.
site_mapping (list) – optional site mapping as obtained by StructureMatcher.get_mapping such that the elements of site_mapping represent the indices of the matching sites to the prim structure. If you pass this option, you are fully responsible that the mappings are correct!
verbose (bool) – optional if True, will raise a warning for structures that fail in StructureMatcher and for structures that have duplicate corr vectors.
raise_failed (bool) – optional if True, will raise the thrown error when adding a structure that fails. This can be helpful to keep a list of structures that fail for further inspection.
- add_properties(key, property_vector)[source]#
Add another property vector to structures already in the StructureWrangler.
The length of the property vector must match the number of structures contained, and should be in the same order so that the property corresponds to the correct structure.
- Parameters:
key (str) – name of property
property_vector (ndarray) – array with the property for each structure
- add_weights(key, weights)[source]#
Add weights to structures already in the StructureWrangler.
The length of the given weights must match the number of structures contained, and should be in the same order.
- Parameters:
key (str) – name describing weights
weights (ndarray) – array with the weight for each structure
- append_entries(entries)[source]#
Append a list of entries.
Each entry must have all necessary fields. An entry can be obtained using the process_entry method.
- Parameters:
entries (list of ComputedStructureEntry) – list of entries with all necessary information
- property available_indices#
Get list of available data index sets.
- property available_properties#
Get list of properties that have been added.
- property available_weights#
Get list of weights that have been added.
- change_subspace(cluster_subspace)[source]#
Change the underlying ClusterSubspace.
Swaps out the ClusterSubspace and updates the features accordingly. This is faster than creating a new StructureWrangler. It can also be useful to create a copy and then change the subspace of the copy.
- Parameters:
cluster_subspace – New ClusterSubspace to be used for determining features.
- property cluster_subspace#
Get the underlying ClusterSubspace used to compute features.
- property entries#
Get a list of the entry dictionaries.
- property feature_matrix#
Get feature matrix.
Each row corresponds to a structure, and each column to a feature (correlation function).
- classmethod from_dict(d, energy_key=None)[source]#
Create Structure Wrangler from an MSONable dict.
- Parameters:
d (dict) – MSON dict of StructureWrangler
energy_key (str) – optional energy property key, for legacy files
- Returns:
StructureWrangler
- get_condition_number(rows=None, cols=None, norm_p=2)[source]#
Compute the condition number for the feature matrix or submatrix.
The condition number is a measure of how sensitive the solution to the linear system is to perturbations in the sampled data. The larger the condition number the more ill-conditioned the linear problem is.
- Parameters:
rows (list) – indices of structures to include in feature matrix.
cols (list) – indices of features (correlations) to include in feature matrix
norm_p – (optional) the type of norm to use when computing condition number. See the numpy docs for np.linalg.cond for options.
- Returns:
matrix condition number
- Return type:
float
- get_constant_features()[source]#
Find indices of constant feature vectors (columns).
A constant feature vector means the corresponding correlation function evaluates to the exact same value for all included structures, meaning it does not really help much when fitting. Many constant feature vectors may be a sign of insufficient sampling of configuration space.
Excludes the empty cluster, which is by definition constant.
- Returns:
array of column indices.
- Return type:
ndarray
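The idea can be sketched on a plain list-of-lists feature matrix. This is not the library's implementation: `constant_columns` and its `tol` parameter are hypothetical, and column 0 is assumed to hold the empty cluster.

```python
def constant_columns(feature_matrix, tol=1e-12):
    """Indices of columns (excluding column 0) whose entries are all equal.

    Hypothetical helper; `tol` is an illustrative tolerance.
    """
    num_cols = len(feature_matrix[0])
    constant = []
    for j in range(1, num_cols):  # column 0 is assumed to be the empty cluster
        column = [row[j] for row in feature_matrix]
        if max(column) - min(column) <= tol:
            constant.append(j)
    return constant


# Three structures (rows), four features (columns).
X = [
    [1.0, 0.5, -1.0, 0.25],
    [1.0, 0.5, 0.0, 0.75],
    [1.0, 0.5, 1.0, 0.25],
]
constant = constant_columns(X)
print(constant)  # [1]
```

Column 1 takes the value 0.5 for every structure, so it cannot help distinguish between them in a fit.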
- get_duplicate_corr_indices(cutoffs=None, decimals=12, rm_external_terms=True)[source]#
Find indices of rows with duplicate corr vectors in feature matrix.
- Parameters:
cutoffs (dict) – optional dictionary with cluster diameter cutoffs for correlation functions to consider in correlation vectors.
decimals (int) – optional number of decimals to round correlations in order to allow some numerical tolerance for finding duplicates. If None, no rounding will be done. Beware that correlations computed with a ClusterSubspace that uses an orthogonal site basis will likely be off by some numerical tolerance, so rounding is recommended.
rm_external_terms (bool) – optional if True, will not consider external terms and only consider correlations proper when looking for duplicates.
- Returns:
list containing lists of indices of rows in feature_matrix where duplicates occur
- Return type:
list
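A minimal sketch of how rounding helps find duplicates, using a plain list-of-lists feature matrix. `duplicate_rows` is a hypothetical helper that only illustrates the role of `decimals`; it ignores the `cutoffs` and `rm_external_terms` options.

```python
from collections import defaultdict


def duplicate_rows(feature_matrix, decimals=12):
    """Group row indices whose (optionally rounded) rows are identical."""
    groups = defaultdict(list)
    for i, row in enumerate(feature_matrix):
        if decimals is None:
            key = tuple(row)  # exact comparison, no tolerance
        else:
            key = tuple(round(x, decimals) for x in row)
        groups[key].append(i)
    # Keep only groups with more than one member.
    return [indices for indices in groups.values() if len(indices) > 1]


X = [
    [1.0, 0.5, -0.25],
    [1.0, 0.0, 0.25],
    [1.0, 0.5 + 1e-14, -0.25],  # equal to row 0 up to numerical noise
    [1.0, 0.0, 0.25],
]
rounded = duplicate_rows(X, decimals=12)
exact = duplicate_rows(X, decimals=None)
print(rounded)  # [[0, 2], [1, 3]]
print(exact)    # [[1, 3]]
```

Without rounding, rows 0 and 2 are treated as distinct even though they differ only by numerical noise.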
- get_feature_matrix_orbit_rank(orbit_id, rows=None)[source]#
Get the rank of an orbit submatrix of the feature matrix.
- Parameters:
orbit_id (int) – Orbit id to obtain sub feature matrix rank of.
rows (list) – optional list of row indices corresponding to structures to include.
- Returns:
rank of orbit sub feature matrix
- Return type:
int
- get_feature_matrix_rank(rows=None, cols=None)[source]#
Get the rank of the feature matrix or a submatrix of it.
- Parameters:
rows (list) – indices of structures to include in feature matrix.
cols (list) – indices of features (correlations) to include in feature matrix
- Returns:
the rank of the matrix
- Return type:
int
- get_gram_matrix(rows=None, cols=None, normalize=True)[source]#
Compute the Gram matrix for the feature matrix or a submatrix.
The Gram matrix is \(G = X^T X\), where each entry \(G_{ij} = X_i \cdot X_j\) is the inner product of feature vectors (columns) \(i\) and \(j\) of the feature matrix \(X\). By default, each column (feature vector) is normalized before computing G, which ensures every entry satisfies \(-1 \le G_{ij} \le 1\) and makes it possible to compare Gram matrices for feature matrices of different sizes or using different basis sets.
- Parameters:
rows (list) – indices of structures to include in feature matrix.
cols (list) – indices of features (correlations) to include in feature matrix
normalize – if True (default), will normalize each feature vector in the feature matrix.
- Returns:
Gram matrix
- Return type:
ndarray
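The definition above can be sketched in plain Python. This is an illustration rather than the library's code: `gram_matrix` is a hypothetical stand-in that works on a list-of-lists matrix and assumes no column is identically zero.

```python
import math


def gram_matrix(feature_matrix, normalize=True):
    """Return G = X^T X, optionally with unit-norm columns (sketch)."""
    num_cols = len(feature_matrix[0])
    # Extract the feature vectors (columns) of X.
    columns = [[row[j] for row in feature_matrix] for j in range(num_cols)]
    if normalize:
        normed = []
        for col in columns:
            norm = math.sqrt(sum(x * x for x in col))
            normed.append([x / norm for x in col])  # assumes norm > 0
        columns = normed
    # G_ij is the inner product of columns i and j.
    return [
        [sum(a * b for a, b in zip(ci, cj)) for cj in columns]
        for ci in columns
    ]


X = [
    [1.0, 1.0],
    [1.0, -1.0],
    [1.0, 0.0],
]
G = gram_matrix(X)
G_raw = gram_matrix(X, normalize=False)
```

With normalization the diagonal entries are 1 up to floating point rounding, and an off-diagonal entry near 0 indicates that the two feature vectors are close to orthogonal over the sampled structures.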
- get_matching_corr_duplicate_indices(decimals=12, structure_matcher=None, **matcher_kwargs)[source]#
Find indices of equivalent structures.
- Parameters:
decimals (int) – optional number of decimals to round correlations in order to allow some numerical tolerance for finding duplicates.
structure_matcher (StructureMatcher) – optional StructureMatcher object to use for matching structures.
**matcher_kwargs – keyword arguments to use when initializing a StructureMatcher if not given
- Returns:
list of lists of equivalent structures (that match) and have duplicate correlation vectors.
- Return type:
list
- get_property_vector(key, normalize=True)[source]#
Get the property target vector.
The property target vector to be used to fit the corresponding correlation feature matrix to obtain coefficients for a cluster expansion. It should always be properly/consistently normalized when used for a fit.
- Parameters:
key (str) – name of the property
normalize (bool) – optional if True, normalizes by prim size. If the property sought is not already normalized, you need to normalize before fitting a CE.
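As a sketch of what normalizing by prim size means, the example below divides an extensive property by the size of each structure in number of prims. `normalize_by_size` and the numbers are hypothetical and only illustrate the arithmetic.

```python
def normalize_by_size(property_vector, sizes):
    """Divide each extensive property value by its structure size (sketch)."""
    if len(property_vector) != len(sizes):
        raise ValueError("property_vector and sizes must have equal length")
    return [value / size for value, size in zip(property_vector, sizes)]


# Made-up extensive energies and supercell sizes (number of prims).
total_energies = [-10.0, -42.0, -81.0]
sizes = [2, 8, 16]
per_prim = normalize_by_size(total_energies, sizes)
print(per_prim)  # [-5.0, -5.25, -5.0625]
```

After normalization, structures of different supercell sizes can be compared and fit on the same footing.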
- get_similarity_matrix(rows=None, cols=None, rtol=1e-05)[source]#
Get similarity matrix of correlation vectors.
Generate a matrix to compare the similarity of correlation feature vectors (columns) in the feature matrix. Matrix element a(i,j) represents the fraction of equivalent corresponding values in feature vectors i and j. This construction is analogous to the Gram matrix, but instead of an inner product, it counts the number of identical corresponding elements in feature vectors i and j.
- Parameters:
rows (list) – indices of structures to include in feature matrix.
cols (list) – indices of features (correlations) to include in feature matrix
rtol (float) – relative tolerance for comparing feature matrix column values
- Returns:
(n x n) similarity matrix
- Return type:
ndarray
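A minimal sketch of the construction described above, using a list-of-lists feature matrix. `similarity_matrix` here is a hypothetical stand-in that compares values with `math.isclose`, which may differ in detail from the comparison the library performs.

```python
import math


def similarity_matrix(feature_matrix, rtol=1e-5):
    """Fraction of matching entries between each pair of columns (sketch)."""
    num_rows = len(feature_matrix)
    num_cols = len(feature_matrix[0])
    columns = [[row[j] for row in feature_matrix] for j in range(num_cols)]
    sim = [[0.0] * num_cols for _ in range(num_cols)]
    for i in range(num_cols):
        for j in range(num_cols):
            # Count positions where both feature vectors hold the same value.
            matches = sum(
                math.isclose(a, b, rel_tol=rtol)
                for a, b in zip(columns[i], columns[j])
            )
            sim[i][j] = matches / num_rows
    return sim


# Four structures (rows), three features (columns).
X = [
    [1.0, 1.0, 0.5],
    [1.0, -1.0, 0.5],
    [1.0, 1.0, -0.5],
    [1.0, 0.0, 0.5],
]
S = similarity_matrix(X)
print(S[0][1])  # 0.5
```

Columns 0 and 1 agree for two of the four structures, giving 0.5; every diagonal entry is 1.0 because each feature vector matches itself everywhere.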
- get_weights(key)[source]#
Get the weights specified by the given key.
- Parameters:
key (str) – name of corresponding weights
- property metadata#
Get dictionary to save applied filters, etc.
- property num_features#
Get number of features for each added structure.
- property num_structures#
Get number of structures added (correctly matched to prim).
- property occupancy_strings#
Get occupancy strings for each structure in the StructureWrangler.
- process_entry(entry, properties=None, weights=None, supercell_matrix=None, site_mapping=None, verbose=False, raise_failed=False)[source]#
Process a ComputedStructureEntry to be added to StructureWrangler.
Checks if the structure for this entry can be matched to the ClusterSubspace prim structure to obtain its supercell matrix, correlation, and refined structure. If so, the entry will be updated by adding these to its data dictionary.
- Parameters:
entry (ComputedStructureEntry) – A ComputedStructureEntry corresponding to a training structure and properties
properties (dict) – optional A dictionary with keys describing each property and the target value for the corresponding structure. Energy and corrected energy should already be in the ComputedStructureEntry, so there is no need to pass them here. You are free to make up the keys for each property, but make sure you are consistent for all structures that you add.
weights (dict) – optional The weight given to the structure when doing the fit. The key must match at least one of the given properties.
supercell_matrix (ndarray) – optional if the corresponding structure has already been matched to the ClusterSubspace prim structure, passing the supercell_matrix will use that instead of trying to re-match. If using this, the user is responsible for having the correct supercell_matrix. Here you are the cause of your own bugs.
site_mapping (list) – optional site mapping as obtained by StructureMatcher.get_mapping such that the elements of site_mapping represent the indices of the matching sites to the prim structure. If you pass this option, you are fully responsible that the mappings are correct!
verbose (bool) – if True, will raise warning for structures that fail in StructureMatcher, and structures that have duplicate corr vectors.
raise_failed (bool) – optional if True, will raise the thrown error when adding a structure that fails. This can be helpful to keep a list of structures that fail for further inspection.
- Returns:
entry with CE pertinent properties
- Return type:
ComputedStructureEntry
- property refined_structures#
Get list of refined structures.
- remove_properties(*property_keys)[source]#
Remove properties with given keys.
- Parameters:
*property_keys (str) – names of properties to remove
- property sizes#
Get sizes of each structure in terms of number of prims.
- property structure_site_mappings#
Get list of site mappings for each structure to prim.
- property structures#
Get list of included structures.
- property supercell_matrices#
Get list of supercell matrices relating each structure to prim.
- update_features()[source]#
Update the features/feature matrix for the data held.
This is useful when something is changed in the ClusterSubspace after creating the StructureWrangler, for example when adding an Ewald term after the StructureWrangler has already been created. This will prevent having to re-match structures and such.