This page was generated from
docs/src/notebooks/adding-structures-in-parallel.ipynb.
Adding Structures in Parallel#
[1]:
import numpy as np
import json
from monty.serialization import loadfn
from pymatgen.core.structure import Structure
from smol.cofe import ClusterSubspace, StructureWrangler
1) Preparing a StructureWrangler#
When adding large structures, or structures that underwent a considerable amount of relaxation (relative to the primitive structure), to a StructureWrangler, matching each structure to compute the correlation vectors for the feature matrix can be time consuming. In this case it can be very helpful (and easy!) to add the structures in a dataset in parallel.
First, we’ll prepare the cluster subspace and structure wrangler as before.
[2]:
# Load the raw data
# load the prim structure
lmof_prim = loadfn('data/lmof_prim.json')
# load the fitting data
lmof_entries = loadfn('data/lmof_entries.json')
# create a cluster subspace
subspace = ClusterSubspace.from_cutoffs(
lmof_prim, cutoffs={2: 7, 3: 5}, basis='sinusoid',
supercell_size=('O2-', 'F-'),
    ltol=0.15, stol=0.2, angle_tol=15)
# create the structure wrangler
wrangler = StructureWrangler(subspace)
2) Add structures in parallel#
Since adding structures is an embarrassingly parallel operation, all we need to do is run a parallel loop. There are a few ways to do this in Python. Here we will use the joblib library, but using multiprocessing would be very similar.
[3]:
from time import time
from joblib import Parallel, delayed, cpu_count
print(f'This computer has {cpu_count()} cpus.')
nprocs = cpu_count() # setting this to -1 also uses all cpus
# setting a batch size usually improves speed
batch_size = 'auto'  # or e.g. len(lmof_entries)//nprocs
start = time()
# we need to add the data a bit differently to avoid having to use
# shared memory between processes
with Parallel(n_jobs=nprocs, batch_size=batch_size, verbose=True) as parallel:
entries = parallel(delayed(wrangler.process_entry)(
entry, verbose=False) for entry in lmof_entries
)
# unpack the results and remove Nones from structures that failed to match
entries = [entry for entry in entries if entry is not None]
wrangler.append_entries(entries)
print(f'Parallel finished in {time()-start} seconds.')
print(f'Matched {wrangler.num_structures}/{len(lmof_entries)} structures.')
This computer has 16 cpus.
[Parallel(n_jobs=16)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=16)]: Done 22 out of 26 | elapsed: 9.3s remaining: 1.7s
Parallel finished in 12.731098890304565 seconds.
Matched 17/26 structures.
[Parallel(n_jobs=16)]: Done 26 out of 26 | elapsed: 12.7s finished
/home/lbluque/Develop/smol/smol/cofe/wrangling/wrangler.py:804: UserWarning: The following structures have duplicated correlation vectors:
Index 4 - Li+32 Mn3+32 O2-64 energy=-1352.3304
Index 9 - Li+16 Mn3+16 O2-32 energy=-676.1647
Consider adding more terms to the clustersubspace or filtering duplicates.
warnings.warn(
/home/lbluque/Develop/smol/smol/cofe/wrangling/wrangler.py:804: UserWarning: The following structures have duplicated correlation vectors:
Index 0 - Li+9 Mn3+5 Mn4+2 O2-16 energy=-321.98039
Index 16 - Li+9 Mn3+5 Mn4+2 O2-16 energy=-322.01631
Consider adding more terms to the clustersubspace or filtering duplicates.
warnings.warn(
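The warnings above suggest filtering out duplicated correlation vectors. A minimal sketch of how such duplicates could be located in a feature matrix with plain NumPy (this is not a smol API, just an illustration of the idea):

```python
import numpy as np

# toy feature matrix in which rows 0 and 2 are identical
X = np.array([[1.0, 0.5, 0.0],
              [1.0, 0.0, 0.5],
              [1.0, 0.5, 0.0]])

# np.unique over rows returns the index of the first occurrence
# of each distinct row; everything else is a duplicate
_, first_idx = np.unique(X, axis=0, return_index=True)
duplicates = sorted(set(range(len(X))) - set(first_idx))
print(duplicates)  # -> [2]
```

Depending on the dataset, one might drop the duplicates or keep whichever of each pair has the lower energy.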
2.1) Compare with serial code#
[4]:
wrangler.remove_all_data()
start = time()
for entry in lmof_entries:
wrangler.add_entry(entry, verbose=False)
print(f'Serial finished in {time()-start} seconds.')
print(f'Matched {wrangler.num_structures}/{len(lmof_entries)} structures.')
Serial finished in 40.25381565093994 seconds.
Matched 17/26 structures.