This page was generated from docs/src/notebooks/lmo-drx-ce-mc.ipynb.

LiMnO DRX Cluster Expansion#

This notebook walks through a more in-depth example of how to generate, fit, and sample a cluster expansion for a slightly more complex scenario.

0) Load all data#

Load the structures with oxidation states assigned and prepare the target vector for learning. The target vector is usually a vector of total energies.

[1]:
# load the training structures with oxidation state assigned
from monty.serialization import loadfn

entries = loadfn("data/lmo_drx_entries.json")
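
As a quick sanity check, one of the entries can be inspected directly. This is a minimal sketch assuming the entries are pymatgen ComputedStructureEntry objects, which carry both a structure and its total energy.

[ ]:
# inspect one training entry (assumed to be a ComputedStructureEntry)
entry = entries[0]
print(entry.composition, entry.energy)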

1) Define a cluster expansion#

Load and inspect the lattice primitive cell with multi-site disorder#

[2]:
prim = loadfn("data/lmo_drx_prim.json")

Note that the primitive cell is randomly disordered. When the occupancies on a site (#) do not sum to 1, vacancies are included on that site.

[3]:
import crystal_toolkit
print(prim)
prim
Full Formula (Li0.7 Mn1.1 O1)
Reduced Formula: Li0.7Mn1.1O1
abc   :   2.969850   2.969850   2.969850
angles:  60.000000  60.000000  60.000000
Overall Charge: +1.5
Sites (4)
  #  SP                                                a     b     c
---  ---------------------------------------------  ----  ----  ----
  0  Li+:0.250, Mn2+:0.250                          0.75  0.75  0.75
  1  Li+:0.250, Mn2+:0.250                          0.25  0.25  0.25
  2  Li+:0.200, Mn2+:0.200, Mn3+:0.200, Mn4+:0.200  0     0     0
  3  O2-                                            0.5   0.5   0.5
Structure Summary
Lattice
    abc : 2.96985 2.9698500000000005 2.96985
 angles : 60.00000000000001 59.99999999999999 60.00000000000001
 volume : 18.522028420882272
      A : 2.571965545429215 0.0 1.4849250000000003
      B : 0.8573218484764051 2.4248723708682074 1.4849250000000003
      C : 0.0 0.0 2.96985
Overall Charge: +1.5
PeriodicSite: Li+:0.250, Mn2+:0.250 (2.5720, 1.8187, 4.4548) [0.7500, 0.7500, 0.7500]
PeriodicSite: Li+:0.250, Mn2+:0.250 (0.8573, 0.6062, 1.4849) [0.2500, 0.2500, 0.2500]
PeriodicSite: Li+:0.200, Mn2+:0.200, Mn3+:0.200, Mn4+:0.200 (0.0000, 0.0000, 0.0000) [0.0000, 0.0000, 0.0000]
PeriodicSite: O2- (1.7146, 1.2124, 2.9699) [0.5000, 0.5000, 0.5000]
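
For reference, a disordered primitive cell like the one above can also be built directly with pymatgen; the following is a minimal sketch (the loaded prim file remains the source of truth for this example).

[ ]:
from pymatgen.core import Lattice, Structure

# species dicts give fractional occupancies; sites whose occupancies
# sum to less than 1 implicitly allow vacancies
prim_sketch = Structure(
    lattice=Lattice.from_parameters(2.96985, 2.96985, 2.96985, 60, 60, 60),
    species=[
        {"Li+": 0.25, "Mn2+": 0.25},
        {"Li+": 0.25, "Mn2+": 0.25},
        {"Li+": 0.2, "Mn2+": 0.2, "Mn3+": 0.2, "Mn4+": 0.2},
        {"O2-": 1.0},
    ],
    coords=[[0.75, 0.75, 0.75], [0.25, 0.25, 0.25],
            [0, 0, 0], [0.5, 0.5, 0.5]],
)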

Specify cluster subspace information#

We will include an electrostatic term in the subspace as well.

[4]:
from smol.cofe import ClusterSubspace
from smol.cofe.extern import EwaldTerm

CLUSTERCUTOFFS = {2: 6, 3: 4, 4: 2}
LTOL, STOL, ATOL = 0.15, 0.20, 15
BASIS = 'sinusoid'

cs = ClusterSubspace.from_cutoffs(structure=prim,
                                  cutoffs=CLUSTERCUTOFFS,
                                  ltol=LTOL,
                                  stol=STOL,
                                  angle_tol=ATOL,
                                  supercell_size='O2-',
                                  basis=BASIS,
                                  orthonormal=False)
cs.add_external_term(EwaldTerm())

Print some information about the subspace, such as the number of correlation functions \(\langle \Phi_\alpha \rangle_\beta\), the number of orbits \(\beta\), and the orbit summary, which shows how the correlation functions \(\alpha\) are grouped into orbits.

[5]:
print(cs)
Basis/Orthogonal/Orthonormal : sinusoid/True/False
       Unit Cell Composition : Li+0.7 Mn2+0.7 Mn3+0.2 Mn4+0.2 O2-1
            Number of Orbits : 40
No. of Correlation Functions : 386
             Cluster Cutoffs : 2: 5.94, 3: 3.64
              External Terms : [EwaldTerm(total)]
Orbit Summary
 ------------------------------------------------------------------------
 |  ID     Degree    Cluster Diameter    Multiplicity    No. Functions  |
 |   0       0             NA                 0                1        |
 |   1       1            0.0000              2                2        |
 |   2       1            0.0000              1                4        |
 |   3       2            1.8187              8                8        |
 |   4       2            2.1000              6                3        |
 |   5       2            2.9698              12               3        |
 |   6       2            2.9699              6               10        |
 |   7       2            3.4825              24               8        |
 |   8       2            3.6373              4                3        |
 |   9       2            3.6373              4                3        |
 |  10       2            4.2000              6                3        |
 |  11       2            4.2000              3               10        |
 |  12       2            4.5768              24               8        |
 |  13       2            4.6957              24               3        |
 |  14       2            5.1439              24               4        |
 |  15       2            5.1439              12              10        |
 |  16       2            5.4560              24               8        |
 |  17       2            5.4560              8                8        |
 |  18       2            5.9397              12               3        |
 |  19       2            5.9397              6               10        |
 |  20       3            2.1000              12              12        |
 |  21       3            2.9699              24               6        |
 |  22       3            2.9699              12              12        |
 |  23       3            2.9699              12              20        |
 |  24       3            2.9699              8                4        |
 |  25       3            2.9699              8                4        |
 |  26       3            2.9699              8               20        |
 |  27       3            3.4825              48              32        |
 |  28       3            3.4825              48              16        |
 |  29       3            3.4825              24              16        |
 |  30       3            3.4825              24              12        |
 |  31       3            3.4825              24              12        |
 |  32       3            3.4825              24              20        |
 |  33       3            3.4825              12              12        |
 |  34       3            3.4825              12              20        |
 |  35       3            3.6373              24              16        |
 |  36       3            3.6373              24               8        |
 |  37       3            3.6373              24               8        |
 |  38       3            3.6373              24              12        |
 |  39       3            3.6373              4               12        |
 ------------------------------------------------------------------------
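
The orbits can also be inspected programmatically; here is a short sketch using cs.orbits_by_size, the same attribute used further below to locate the first pair orbit.

[ ]:
# print the first few pair orbits
for orbit in cs.orbits_by_size[2][:3]:
    print(orbit)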

Load the structure data into the StructureWrangler, which is used later for fitting. This step may take a few minutes to complete, since each training structure must be matched to the primitive cell.

[6]:
from smol.cofe import StructureWrangler

sw = StructureWrangler(cs)
for entry in entries:
    sw.add_entry(entry, verbose=False)


Our final feature matrix for fitting has shape \(n \times p\), where \(n\) is the number of training structures and \(p\) is the number of correlation functions plus one column for the Ewald energy.

[7]:
print('Our feature matrix has the following dimensions:',
      sw.feature_matrix.shape)
Our feature matrix has the following dimensions: (238, 387)
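
As a sanity check, the column count can be compared against the subspace; num_corr_functions and external_terms are assumed attribute names of ClusterSubspace in this sketch.

[ ]:
# p = number of correlation functions + 1 column for the EwaldTerm
assert sw.feature_matrix.shape[1] == cs.num_corr_functions + len(cs.external_terms)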

2) Fit a cluster expansion#

Here we are fitting \(E(\sigma) = \sum_\beta m_\beta J_\beta \langle \Phi_\alpha(\sigma) \rangle_\beta + \frac{1}{\epsilon}E_{Ewald}(\sigma)\), and we will do so with compressive sensing techniques, since the system is under-determined (\(n < p\)). Our feature matrix \(\textbf{X}\) holds the correlation functions and Ewald energy for all structures. We are interested in finding \(m_\beta J_\beta\) and \(\frac{1}{\epsilon}\), which together form the vector \(\vec w\).

[8]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
import numpy as np
import copy

We will use L1-regularization, which minimizes the following objective function with a penalty parameter \(\lambda\) (note that sklearn calls this penalty \(\alpha\), but we have already used \(\alpha\) to index the function clusters). \begin{equation} \frac{1}{2n}||E-\textbf{X}\vec w||_2^2 + \lambda ||\vec w||_1 \end{equation}

We will optimize the hyperparameter using only a single fit per train/test split. However, improved results can be obtained by doing a proper cross-validation optimization, as sketched below.
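
For reference, such a cross-validation optimization could look like the following sketch using scikit-learn's LassoCV; the alpha grid and the 5-fold split are illustrative choices.

[ ]:
from sklearn.linear_model import LassoCV

# drop the constant correlation column since LassoCV fits an intercept
cv_model = LassoCV(alphas=np.logspace(-5, -2), cv=5, fit_intercept=True)
cv_model.fit(sw.feature_matrix[:, 1:], sw.get_property_vector(key='energy'))
print(f"best alpha from cross-validation: {cv_model.alpha_}")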

[9]:
import warnings
from sklearn.linear_model import Lasso
from sklearn.exceptions import ConvergenceWarning

TRIALS = 50
TEST_SIZE = 0.20
PROPERTY = 'energy'

# we will suppress convergence warnings for now
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=ConvergenceWarning)

    all_rmse = []
    for alpha in np.logspace(-5, -2):
        rmse_list = []
        for _ in range(TRIALS):
            X_train, X_test, y_train, y_test = train_test_split(
                sw.feature_matrix, sw.get_property_vector(key=PROPERTY),
                test_size=TEST_SIZE
            )
            model = Lasso(alpha=alpha, fit_intercept=True)
            # remove the constant correlation since we are fitting
            # the intercept
            model.fit(X_train[:, 1:], y_train)
            wvec = np.concatenate((np.array([model.intercept_]),
                                   model.coef_),
                                  axis=0)
            y_predict = np.dot(X_test, wvec)
            rmse = np.sqrt(mse(y_test, y_predict))
            rmse_list.append(rmse)
        all_rmse.append(np.mean(rmse_list))

Load plotting tools to examine how the fitting is going.

[11]:
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(np.logspace(-5,-2), all_rmse)
plt.xlim([0, 0.0009])
plt.axvline(x = 0.0002, ls = '--', color = 'black')
plt.xlabel(r'Penalty hyper-parameter $\alpha$')
plt.ylabel('Average RMSE (eV/prim) in 50 trials')
[11]:
Text(0, 0.5, 'Average RMSE (eV/prim) in 50 trials')
../_images/notebooks_lmo-drx-ce-mc_23_1.png

Although we sampled a large space of penalties, \(10^{-5}\) to \(10^{-2}\), it turns out that the relevant space of hyperparameters is \(10^{-5}\) to \(10^{-4}\).

We choose \(\lambda=0.0002\) as the regularization parameter, settle on our final model, and show the out-of-sample and in-sample RMSE.

The weights \(\vec w\) are also plotted.

[12]:
LAMBDA = 0.0002
X_train, X_test, y_train, y_test = train_test_split(
    sw.feature_matrix, sw.get_property_vector(key=PROPERTY),
    test_size=TEST_SIZE
)

model = Lasso(alpha=LAMBDA, fit_intercept=True)
model.fit(X_train[:, 1:], y_train)

wvec = np.concatenate(
    (np.array([model.intercept_]), model.coef_),
    axis=0
)

y_predict = np.dot(X_test, wvec)
y_train_predict = np.dot(X_train, wvec)
print(f'Out-of-sample RMSE is: {np.sqrt(mse(y_test, y_predict))} eV/prim')
print(f'In-sample RMSE is: {np.sqrt(mse(y_train, y_train_predict))} eV/prim')
print(f'Number of Features > 1E-5: {sum(np.abs(wvec) > 1E-5)}/{len(wvec)}')

first_pair = cs.orbits_by_size[2][0].bit_id
print(f'Point correlation coefficients: {wvec[:first_pair]}')
# plot the coefficients (excluding those for point terms)
plt.stem(range(len(wvec) - first_pair), wvec[first_pair:],
         linefmt='-', markerfmt=' ')
plt.xlabel('Coefficient index (i in $w_i$)')
plt.ylabel('Magnitude |$w_i$| eV/prim')
Out-of-sample RMSE is: 0.1542109191827612 eV/prim
In-sample RMSE is: 0.02426213156609966 eV/prim
Number of Features > 1E-5: 41/387
Point correlation coefficients: [-20.88771952  -2.04993158   0.           0.           1.18082107
  -1.78346928   0.        ]
/home/lbluque/miniconda3/envs/matx_dev/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:530: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.16306747608609712, tolerance: 0.05853817658209538
  model = cd_fast.enet_coordinate_descent(
[12]:
Text(0, 0.5, 'Magnitude |$w_i$| eV/prim')
../_images/notebooks_lmo-drx-ce-mc_25_3.png

Generate the Cluster Expansion object#

[13]:
from smol.cofe import RegressionData, ClusterExpansion
from random import choice

reg_data = RegressionData.from_sklearn(
    model, sw.feature_matrix, sw.get_property_vector(key=PROPERTY)
)

expansion = ClusterExpansion(
    cs, coefficients=wvec, regression_data=reg_data
)


structure = choice(sw.structures)
prediction = expansion.predict(structure, normalize=True)
print(f"Structure with composition {structure.composition} has predicted energy {prediction} eV/prim")
Structure with composition Li+10 Mn4+12 Mn3+2 O2-32 has predicted energy -18.959213889450854 eV/prim
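
Predictions can also be compared against the training targets for several structures; this sketch assumes get_property_vector returns energies normalized per primitive cell, consistent with predict(..., normalize=True).

[ ]:
y_all = sw.get_property_vector(key=PROPERTY)
# compare predictions with targets for a few random training structures
for i in np.random.choice(len(sw.structures), 3, replace=False):
    pred = expansion.predict(sw.structures[i], normalize=True)
    print(f"predicted {pred:.3f} vs. target {y_all[i]:.3f} eV/prim")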

3) Run Canonical Monte Carlo#

Now let’s run canonical MC to disorder spinel LiMn2O4.

First load the spinel structure#

[14]:
# load the spinel structure first and determine its supercell matrix

spinel = loadfn('data/LiMn2O4_drx_tutorial.json')
starting_matrix = sw.cluster_subspace.scmatrix_from_structure(spinel)

Generate an ensemble for a given transformation of the starting spinel structure and convert the spinel structure into an initial occupancy#

[15]:
from smol.moca import Ensemble, Sampler

transformation = np.array([[3, 0, 0], [0, 3, 0], [0, 0, 3]])

sc_matrix = starting_matrix @ transformation

# create an Ensemble for the supercell defined by sc_matrix
ensemble = Ensemble.from_cluster_expansion(expansion, sc_matrix)
# Since computing the Ewald matrix is very time consuming, make sure to
# save/load the ensemble or processor:
# dumpfn(ensemble, "data/lmo_drx_ensemble.json")

spinel = spinel * transformation
init_occu = ensemble.processor.occupancy_from_structure(spinel)

Generate a sampler for canonical MC disordering from \(T_0 = 300 K\) to \(T_f = 500 K\) in steps of \(50 K\) as a mini simulation#

Computing the Ewald matrix necessary for fast MC evaluation can take a long time for large supercells with multi-component structures such as this example. Since this is done lazily in the code, it will seem like the first step of MC is unbearably long. It is suggested to save the ensemble or processor used so that the Ewald matrix does not need to be reconstructed at runtime.
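
Following that suggestion, caching the ensemble could look like the sketch below, using monty's dumpfn/loadfn as elsewhere in this notebook (the filename is illustrative).

[ ]:
from monty.serialization import dumpfn, loadfn

# serialize the ensemble once so the Ewald matrix is not rebuilt each run
dumpfn(ensemble, "data/lmo_drx_ensemble.json")
# ... then in a later session ...
# ensemble = loadfn("data/lmo_drx_ensemble.json")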

[16]:
from smol.moca import Sampler

T0 = 300
TF = 500
STEP = 50
PROPOSALS = int(1E5)
current_occu = None
save_data = {}

for T in range(T0, TF + 1, STEP):
    sampler = Sampler.from_ensemble(ensemble=ensemble,
                                    kernel_type='Metropolis',
                                    temperature=T)
    if current_occu is None:
        current_occu = init_occu

    sampler.run(PROPOSALS,
                initial_occupancies=current_occu,
                thin_by=100,
                progress=True)
    current_occu = sampler.samples.get_occupancies()[-1]

    # now save some data such as occupancies
    save_data[int(T)] = {'occupancies': sampler.samples.get_occupancies(),
                         'energies_total': sampler.samples.get_energies(),
                         'features_unnormalized': sampler.samples.get_feature_vectors()}
Sampling 1 chain(s) from a cell with 3456 sites...: 100%|██████████| 100000/100000 [00:53<00:00, 1873.63it/s]
Sampling 1 chain(s) from a cell with 3456 sites...: 100%|██████████| 100000/100000 [00:51<00:00, 1947.93it/s]
Sampling 1 chain(s) from a cell with 3456 sites...: 100%|██████████| 100000/100000 [00:51<00:00, 1950.93it/s]
Sampling 1 chain(s) from a cell with 3456 sites...: 100%|██████████| 100000/100000 [00:51<00:00, 1955.61it/s]
Sampling 1 chain(s) from a cell with 3456 sites...: 100%|██████████| 100000/100000 [00:51<00:00, 1956.76it/s]

Now save the data#

[ ]:
from monty.serialization import dumpfn

filename = "canonical_MC_LiMn2O4_{}_{}_{}.json".format(T0, TF, STEP)
dumpfn(save_data, filename)
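
Finally, a short post-processing sketch: the average energy per site at each temperature, discarding the first half of the samples as equilibration (the 50% burn-in and the ensemble.num_sites attribute are assumptions).

[ ]:
for T, data in save_data.items():
    energies = data['energies_total']
    n_burn = len(energies) // 2  # assumed burn-in: first half of samples
    mean_e = np.mean(energies[n_burn:]) / ensemble.num_sites
    print(f"T = {T} K: mean energy = {mean_e:.4f} eV/site")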