Package Design#

Overview & Mission#

smol is intentionally designed to be easy to use, install and extend. To achieve these goals, the package has few dependencies [1] and a heavily object-oriented, modular design that closely follows mathematical and methodological abstractions. This enables flexible creation of complex workflows and hassle-free implementation of methodology extensions, which rarely need to be written from scratch.

Inheritance, polymorphism and composition are the key OOP design concepts in smol, and together they result in a modular design. This makes it possible to design complex calculation workflows and implement new functionality without having to write new classes and functions from scratch.

smol has been designed to enable efficient and open development of new methodology for fitting and sampling applied lattice models in a user-friendly way, and as a result to allow quick development-to-application turnaround in the study of configuration-dependent properties of inorganic materials.

Module Design#

smol is organized into two main submodules, smol.cofe and smol.moca:

smol.cofe implements the classes and functions to represent site spaces, configuration spaces and function subspaces over them, and the tools necessary to fit lattice models, in the spirit of cluster expansions and the structure inversion method.
smol.moca has implementations of classes to carry out Monte Carlo sampling of ensembles associated with a fitted lattice model Hamiltonian, as well as simple empirical models such as a point Coulomb potential.

Below is a detailed description of the classes and their relationships in the main modules. Before delving into these we recommend going over the simpler overview of the package in the User Guide page.

Diagrams showing the main objects and their relationships for smol.cofe and smol.moca are presented below. In these diagrams, base classes are depicted as filled (and unboxed) rectangles, and derived classes are depicted as boxed rectangles with a colored arrow pointing from the base class to the derived class.

Ownership relationships between distinct classes (i.e. when a class has another class as an attribute) are depicted as dark blue arrows. The arrow points from the class that is held as an attribute to the class that holds it.

`smol.cofe`#

The diagram below summarizes the overall design of the smol.cofe module. The main classes that users directly instantiate and interact with are the ClusterSubspace, StructureWrangler, and ClusterExpansion as detailed in the User Guide. All main classes are derived from monty MSONable and thus can be serialized and saved as JSON objects. In addition, several other classes are defined to enable the various features and functionality in the module.
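The serialization pattern mentioned above can be illustrated with a minimal sketch. It mimics the `as_dict` / `from_dict` round trip that MSONable classes follow, using only the standard library. `ToySiteSpace` is a hypothetical stand-in written for this example and is not a smol or monty class.

```python
import json


class ToySiteSpace:
    """Hypothetical MSONable-style class (illustration only, not part of smol)."""

    def __init__(self, composition):
        self.composition = dict(composition)

    def as_dict(self):
        # Store the class name next to the data so the object can be rebuilt.
        return {"@class": type(self).__name__, "composition": self.composition}

    @classmethod
    def from_dict(cls, d):
        return cls(d["composition"])


space = ToySiteSpace({"Li+": 0.5, "Vacancy": 0.5})
text = json.dumps(space.as_dict())                  # serialize to a JSON string
restored = ToySiteSpace.from_dict(json.loads(text))  # rebuild from the JSON string
```

Because every main class follows this pattern, a fitted model and the objects it holds can in principle be written to disk and reloaded without recomputation.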

Following the diagram above, the main purposes of the classes depicted are:

ClusterSubspace (and its derived class PottsSubspace) is the main workhorse to construct and represent configuration spaces and the function spaces over them. Its main purpose is to compute correlation function values given an ordered structure. To do so, it holds a list of Orbit objects generated for a particular disordered pymatgen Structure. Additionally, an external term representing a simple empirical potential model can be included. Currently only an EwaldTerm, which represents a point electrostatic potential, is available.
- An Orbit represents both the set of symmetrically equivalent clusters and the set of product functions that act over the configurations of those clusters. An Orbit holds a base Cluster and a list of SiteBasis objects, one associated with each site in the cluster.
  - A Cluster is a collection of sites (derived from SiteCollection in pymatgen). The two key features of a Cluster are that, compared to a Structure, it does not hold periodic sites, and, compared to a Molecule, it holds a lattice (that of the structure it is associated with).
  - A SiteBasis represents the basis set that spans the function space of the configurations for a single site. Several types of basis sets are included and implementing new ones is relatively straightforward. A SiteBasis holds a SiteSpace which represents its single site configuration space.
    - SiteSpace represents the possible configurations of a given site. It essentially holds a pymatgen Composition, with the addition of an explicit Vacancy species, for compositions that do not sum to 1.
- An external term, representing a simple empirical pair potential, can also be included in a ClusterSubspace to create a mixture model (i.e. a cluster expansion + an empirical potential). Currently an EwaldTerm is implemented to allow mixture models of cluster expansions with explicit electrostatics.
A StructureWrangler is the main object for training data preparation. Training data is held as a list of pymatgen ComputedStructureEntry objects. The StructureWrangler holds a given ClusterSubspace and makes sure that training structures can be correctly mapped to the disordered unit structure of the ClusterSubspace. The correlation vectors of the training structures that map correctly are computed to form the correlation matrix (feature matrix) necessary for training. In addition, the StructureWrangler has several methods to inspect and further prepare training data, such as checking for duplicates, obtaining correlation matrix properties, and, most importantly, obtaining a properly normalized property vector for training (the normalization is done per disordered unit cell of the ClusterSubspace).
A ClusterExpansion represents the final fitted lattice model. It holds a ClusterSubspace and a corresponding set of fitted coefficients. A ClusterExpansion can be used to predict the energy of new structures, as well as to obtain the effective cluster interactions (ECI) and prune unimportant terms.
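To make the roles described above concrete, here is a minimal toy sketch of the underlying idea. It does not use smol. It assumes a periodic one-dimensional chain with two species encoded as +1 and -1, three orbits (empty, point, and nearest-neighbor pair), and made-up coefficients. The function names `correlations` and `predict` are hypothetical.

```python
from fractions import Fraction


def correlations(occu):
    """Correlation vector [empty, point, NN pair] of a periodic +1/-1 chain.

    Each entry is the orbit's product function averaged over the chain,
    so the vector is normalized per unit cell (one site here).
    """
    n = len(occu)
    point = Fraction(sum(occu), n)
    pair = Fraction(sum(occu[i] * occu[(i + 1) % n] for i in range(n)), n)
    return [Fraction(1), point, pair]


def predict(coefs, occu):
    """Total property = (number of unit cells) * (coefs . correlations)."""
    per_cell = sum(c * x for c, x in zip(coefs, correlations(occu)))
    return len(occu) * per_cell


# Made-up per-unit-cell coefficients for the empty, point and pair terms.
coefs = [Fraction(-1), Fraction(1, 2), Fraction(1, 4)]

ordered = [1, -1, 1, -1]  # alternating chain
uniform = [1, 1, 1, 1]    # single-species chain

corr_ordered = correlations(ordered)  # -> [1, 0, -1]
corr_uniform = correlations(uniform)  # -> [1, 1, 1]
e_ordered = predict(coefs, ordered)   # -> 4 * (-1 + 0 - 1/4) = -5
e_uniform = predict(coefs, uniform)   # -> 4 * (-1 + 1/2 + 1/4) = -1
```

In this picture, stacking the correlation vectors of many training structures gives the feature matrix, and dividing each training energy by its number of unit cells gives the normalized property vector against which the coefficients are fitted.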

`smol.moca`#

The diagram below summarizes the overall design of the smol.moca module. The main classes in this module that are necessary to run Monte Carlo sampling using a ClusterExpansion are the Processor classes, the Ensemble, and the Sampler class. In addition, a number of helper classes are implemented to allow running and implementing a large variety of MC calculations.

The class descriptions are as follows:

An Ensemble represents the probability space associated with a particular lattice model and thermodynamic boundary conditions (i.e. chemical_potentials) over a finite simulation domain size, which is represented by a supercell matrix of an associated unit cell. An Ensemble holds a Processor and a list of Sublattice instances.
- Sublattice represents the set of sites of the defined supercell that have the same site space (set of allowed species). Sublattice instances can also be split according to a particular frozen configuration.
- Processor instances hold a ClusterSubspace and represent a particular sampling domain in the form of a supercell matrix (of the corresponding disordered unit cell). A Processor quickly and efficiently calculates the fitted property, and its differences from local updates, for a given configuration (over the represented supercell). A Processor can also generate occupancy strings given an ordered supercell structure, as well as generate the Sublattice instances for the different sites in the supercell domain.
A Sampler is the main class to run MC sampling given a particular Ensemble instance. Apart from the ensemble to be sampled, a Sampler holds an MCKernel that implements the particular sampling algorithm, and a SampleContainer to record the sampled configurations, correlation vectors, enthalpy, and any other associated state variable.
- MCKernel classes are implementations of particular MC algorithms, such as the Metropolis kernel. They take care of generating the Markov chain, sampling histograms, and any other configuration and state attribute. They include an MCUsher to propose steps, and may also include an optional MCBias to bias samples to particular configurations in phase space.
  - An MCUsher has the purpose of proposing the steps to carry out the random walk for an MC calculation. The simplest are a Flip for single site species changes, and a Swap for swapping the species at two sites. However, implementing a new MCUsher for more complex random walks should be relatively straightforward.
  - An MCBias serves as an additional term that can be included to bias acceptance probabilities in order to carry out sampling from extended or biased ensembles.
- MC samples are saved in a SampleContainer from sampled Trace objects. These will always include the configurations, correlations or features, and the energy or enthalpy. Additional values depending on the particular MCKernel used are also saved. A SampleContainer has functionality to obtain simple means and variances of sampled values. In addition, a SampleContainer can be saved as either a JSON file or an HDF5 container. During lengthy simulations, samples can be streamed into an HDF5 container to minimize memory requirements; further, using HDF5 containers in single-writer, multiple-reader mode allows users to begin looking at samples before a simulation has concluded.
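The division of labor described above can be sketched with a toy Metropolis loop. It does not use smol and only mirrors the roles in plain functions: `energy` and `flip_delta` stand in for a Processor (full property and local-update difference), `flip_usher` for an MCUsher, the accept/reject rule for an MCKernel, and the `samples` list for a SampleContainer. All names, the nearest-neighbor chain model, and the parameter values are assumptions made for illustration.

```python
import math
import random


def energy(occu, j=1.0):
    """Toy 'processor': nearest-neighbor energy of a periodic +1/-1 chain."""
    n = len(occu)
    return j * sum(occu[i] * occu[(i + 1) % n] for i in range(n))


def flip_delta(occu, site, j=1.0):
    """Energy change from flipping one site, using only its two neighbors."""
    n = len(occu)
    neighbors = occu[(site - 1) % n] + occu[(site + 1) % n]
    return -2.0 * j * occu[site] * neighbors


def flip_usher(occu, rng):
    """Toy 'usher': propose a single-site flip as (site, new_species)."""
    site = rng.randrange(len(occu))
    return site, -occu[site]


def metropolis_run(occu, nsteps, kT, rng, j=1.0):
    """Toy 'kernel' and 'sampler': Metropolis walk that records every step."""
    occu = list(occu)
    current = energy(occu, j)
    samples = []  # toy 'sample container'
    for _ in range(nsteps):
        site, new_species = flip_usher(occu, rng)
        delta = flip_delta(occu, site, j)
        accept = delta <= 0 or rng.random() < math.exp(-delta / kT)
        if accept:
            occu[site] = new_species
            current += delta
        samples.append({"occupancy": tuple(occu), "energy": current})
    return samples


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = metropolis_run([1] * 8, nsteps=200, kT=1.0, rng=rng)
mean_energy = sum(s["energy"] for s in samples) / len(samples)
```

Updating the running energy with `flip_delta` rather than recomputing `energy` at every step is the same design choice that motivates a Processor calculating differences from local updates.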
