This page was generated from docs/src/notebooks/training-data-preparation.ipynb.

Training Data Preparation#

[1]:

import numpy as np
import json
from monty.serialization import loadfn
from pymatgen.core.structure import Structure
from smol.cofe import ClusterSubspace, StructureWrangler
from smol.cofe.space import get_species

1) Preparing a `StructureWrangler`#

Training structures and target data are handled by the StructureWrangler class. The class obtains the features and corresponding feature matrix based on the underlying ClusterSubspace provided.

In the simplest setting we just use the feature matrix and the supplied total energies from DFT to fit a cluster expansion. But in many cases we may want to improve the fit quality or reduce the model complexity by modifying the target property (e.g. using a reference energy or the energy of mixing) and/or by weighting structures based on some importance metric (e.g. energy above hull). Using the StructureWrangler we can create this modified fitting data.

[2]:

# Load the raw data
# load the prim structure
lno_prim = loadfn('data/lno_prim.json')

# load the fitting data
lno_entries = loadfn("data/lno_entries.json")

# create a cluster subspace
subspace = ClusterSubspace.from_cutoffs(
    lno_prim,
    cutoffs={2: 5, 3: 4.1},
    basis='sinusoid',
    supercell_size='O2-'
)

# create the structure wrangler
wrangler = StructureWrangler(subspace)

# add the raw data
for entry in lno_entries:
    wrangler.add_entry(entry, verbose=False)

print(f'\nTotal structures that match {wrangler.num_structures}/{len(lno_entries)}')


Total structures that match 27/31

2) Modifying and adding new target properties#

Now that we have access to the structures that match our cluster subspace, and to the raw and normalized target properties, we can easily create new modified target properties to fit to.

As a simple example, say we want to set the minimum energy in our data as a new reference point.

[3]:

# obtain the minimum energy. Calling get_property_vector
# will by default give you the property normalized per prim
# (you should always use consistently normalized data when fitting)
min_energy = min(wrangler.get_property_vector('energy'))

# create the new rereferenced energy
reref_energy_vect = wrangler.get_property_vector('energy') - min_energy

# add it as a new property to the wrangler.
# Note that reref_energy_vect is normalized per prim; check whether
# add_properties in your smol version expects normalized or
# extensive (per supercell) values.
wrangler.add_properties('rereferenced_energy', reref_energy_vect)

# Now we have two properties in the wrangler that we can
# use to fit a cluster expansion: the total energy
# and the rereferenced energy
print(wrangler.available_properties)

['rereferenced_energy']
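
As a minimal sketch of what the re-referencing does, independent of smol, the snippet below applies the same shift to a small made-up energy vector (the values are illustrative and are not taken from the LNO dataset). Shifting by the minimum leaves energy differences between structures unchanged and puts the lowest-energy structure at zero.

```python
import numpy as np

# hypothetical energies normalized per prim (eV/prim); not from the LNO dataset
energies = np.array([-10.5, -10.2, -9.8, -10.4])

# shift so that the lowest energy becomes the reference (zero)
reref = energies - energies.min()

# a constant shift does not change differences between structures
assert np.allclose(np.diff(reref), np.diff(energies))
print(reref)
```

Because only a constant is subtracted, a fit to the shifted vector should differ from a fit to the original one only in the constant term.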

2.1) Another example of modifying target properties#

We can do more complex modifications of the target data. For example, a very common target property for fitting a cluster expansion is the mixing energy.

For the current LNO dataset we don’t have a fully delithiated structure, but for the sake of illustration let's assume that we use Ni2O3. (Also, in this dataset the mixing energy is not very informative since it is almost linear in concentration.)

[4]:

e_Ni2O3 = -12.48

# we can obtain the fully lithiated structure in the dataset by searching
# through the occupancy strings.
ind = [i for i, s in enumerate(wrangler.occupancy_strings) if 'Vacancy' not in s]
e_LiNiO2 = wrangler.get_property_vector('energy')[ind[0]]

# Now we can calculate the Li/Vacancy mixing energy for the structures in our dataset.
# There are many ways to obtain concentrations/compositions; here we use the
# occupancy strings stored in the wrangler.
# If the proper end points are calculated we can also use pymatgen's PhaseDiagram
# with the entries in the wrangler, and obtain the mixing energy with much less effort!
mixing_energy = []
concentration = []
for size, energy, occu in zip(
    wrangler.sizes, wrangler.get_property_vector('energy'), wrangler.occupancy_strings):
    n_Li = sum(sp == get_species('Li+') for sp in occu)
    n_vac = sum(sp == get_species('Vacancy') for sp in occu)
    c_Li = n_Li/(n_Li + n_vac)
    mix_en = energy - c_Li*e_LiNiO2 - (1 - c_Li)*e_Ni2O3
    concentration.append(c_Li)
    # remember to use the "extensive" (per supercell) value
    mixing_energy.append(size * mix_en)

# add the properties to the wrangler
wrangler.add_properties('mixing_energy', mixing_energy)
wrangler.add_properties('li_concentration', concentration)
print(wrangler.available_properties)

['rereferenced_energy', 'li_concentration', 'mixing_energy']
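
The mixing energy used above is the energy relative to the linear interpolation between the two end members. The following stand-alone sketch restates that formula as a small function and checks it on made-up numbers; the function name `mixing_energy`, the end-member energies and the supercell size are all hypothetical and chosen only for illustration.

```python
def mixing_energy(energy, c_li, e_lithiated, e_delithiated):
    """Energy relative to the linear interpolation between two end members.

    All energies must share the same normalization (e.g. per prim).
    """
    return energy - c_li * e_lithiated - (1.0 - c_li) * e_delithiated


# hypothetical end-member energies (eV/prim), for illustration only
e_full, e_empty = -20.0, -12.0

# the end members have zero mixing energy by construction
assert mixing_energy(e_full, 1.0, e_full, e_empty) == 0.0
assert mixing_energy(e_empty, 0.0, e_full, e_empty) == 0.0

# a half-lithiated structure lying about 0.1 eV/prim below the tie line
size = 4  # number of prims in the supercell (hypothetical)
e_mix = mixing_energy(-16.1, 0.5, e_full, e_empty)
extensive_e_mix = size * e_mix  # per-supercell value
```

A negative value means the structure lies below the line connecting the end members, i.e. mixing is favorable relative to them.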

3) Obtaining and adding weights#

Using the structure wrangler it is also very easy to obtain fitting weights based on many things, such as composition, total energy or energy above hull. Currently the code provides functions to obtain weights by energy above hull or by energy above composition.

[5]:

from smol.cofe.wrangling import weights_energy_above_hull, weights_energy_above_composition

above_composition = weights_energy_above_composition(
    wrangler.structures, wrangler.get_property_vector('energy', normalize=False),
    temperature=1000)

above_hull = weights_energy_above_hull(
    wrangler.structures, wrangler.get_property_vector('energy', normalize=False),
    cs_structure=wrangler.cluster_subspace.structure,
    temperature=1000)

# add them to the wrangler
wrangler.add_weights('energy_above_comp', above_composition)
wrangler.add_weights('energy_above_hull', above_hull)

# to use weights in a fit you would simply pass them to
# the corresponding argument or keyword argument of
# the fitting function you are using.
# For example if you are using a regression class from
# scikit-learn,
from sklearn.linear_model import LinearRegression
estimator = LinearRegression(fit_intercept=False)
estimator.fit(
    wrangler.feature_matrix,
    wrangler.get_property_vector('energy'),
    sample_weight=wrangler.get_weights('energy_above_hull')
)

[5]:

LinearRegression(fit_intercept=False)
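
The weighting functions above take a temperature. A common choice, assumed here rather than taken from the smol source, is a Boltzmann-like factor exp(-E / (kB T)) applied to the energy above some reference. The sketch below computes such weights for made-up energies and shows how sample weights enter a least-squares fit: weighting each squared residual by w_i is equivalent to scaling row i of the feature matrix and of the target by sqrt(w_i). The helper names `boltzmann_weights` and `weighted_lstsq` and all data are hypothetical.

```python
import numpy as np

K_B = 8.617333262e-5  # Boltzmann constant in eV/K


def boltzmann_weights(e_above, temperature):
    """Weights exp(-E / (kB * T)) for energies above a reference, in eV.

    Assumed Boltzmann-like form; not taken from the smol source.
    """
    return np.exp(-np.asarray(e_above) / (K_B * temperature))


def weighted_lstsq(X, y, w):
    """Minimize sum_i w_i * (y_i - X_i @ coefs)**2 by scaling rows with sqrt(w_i)."""
    sw = np.sqrt(w)
    coefs, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coefs


# made-up data: 5 structures, 2 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(scale=0.01, size=5)
e_above = np.array([0.0, 0.02, 0.05, 0.10, 0.30])  # eV above the reference

w = boltzmann_weights(e_above, temperature=1000)
coefs = weighted_lstsq(X, y, w)
```

With this form, a structure on the reference gets weight 1 and the weight decays as the energy above the reference grows; a higher temperature flattens the weights toward an unweighted fit.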

4) Structure Selection#

The StructureWrangler class can also be used to ‘filter’ structures to use for a fit based on some criteria. To do so, we obtain the indices of all structures that satisfy the filtering criteria.

For example, here we obtain all the structures with electrostatic energy below a given cutoff.

[6]:

# filter by maximum ewald energy
# all structures with ewald energy above the cutoff
# will be removed
from smol.cofe.wrangling import max_ewald_energy_indices

# get the structure indices
indices = max_ewald_energy_indices(wrangler, max_relative_energy=2)
# save them in the structure wrangler
wrangler.add_data_indices('low_electrostat_energy', indices)

print(f'Included {len(indices)}/{wrangler.num_structures} structures with Ewald energies < 2 eV/prim.')
print(f'Saved indices are {wrangler.available_indices}')

Included 26/27 structures with Ewald energies < 2 eV/prim.
Saved indices are ['low_electrostat_energy']

[7]:

# you can use the indices for selected structures to
# obtain only the corresponding values for those structures
feature_matrix = wrangler.feature_matrix[indices]
prop_vector = wrangler.get_property_vector('energy')[indices]

print(f'Feature matrix shape: {feature_matrix.shape}')
print(f'Property vector shape {prop_vector.shape}')

Feature matrix shape: (26, 11)
Property vector shape (26,)
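
The selection above boils down to building an array of integer indices and using it to pick rows of the feature matrix and entries of the property vector. The sketch below illustrates that pattern with plain NumPy on made-up data; it uses a generic energy cutoff relative to the minimum and is not the criterion implemented in `max_ewald_energy_indices`.

```python
import numpy as np

# made-up data: 6 structures, 3 features
feature_matrix = np.arange(18, dtype=float).reshape(6, 3)
energies = np.array([-5.0, -4.8, -1.0, -4.9, -2.5, -5.1])  # eV/prim

# keep structures within `cutoff` of the lowest energy
cutoff = 2.0
indices = np.flatnonzero(energies - energies.min() <= cutoff)

# integer-array indexing selects the matching rows / entries
X_sel = feature_matrix[indices]
y_sel = energies[indices]

print(f'Kept {len(indices)}/{len(energies)} structures')
```

Using the same index array for both the features and the targets keeps them aligned, which is what a fit on the filtered data requires.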
