Ziff-Gulari-Barshad model: Phase Transitions and ML-based Surrogate Model.¶
Note
To follow this tutorial, either:
Download
PhaseTransitions-ADP.py
(run as$AMSBIN/amspython PhaseTransitions-ADP.py
).Download
PhaseTransitions-ADP.ipynb
(see also: how to install Jupyterlab)
We studied the effect of changing the composition of the gas phase, namely partial pressures for \(O_2\) and \(CO\), in the \(CO_2\) Turnover frequency (TOF), in some of the pyZacros tutorials for the Ziff-Gulari-Barshad model (see especially Phase Transitions in the ZGB model and Ziff-Gulari-Barshad model: Steady State Conditions). We used an evenly distributed grid on the \(CO\) gas phase molar fraction in those tutorials. From the results, it is possible to see that, typically, the lower the \(CO_2\) production, the more difficult it is to achieve a steady state (except when \(CO_2\) production is zero, likely because the surface got poisoned, which makes the calculation finish quickly). This means that most of the computational time is spent on points that, from a catalytic standpoint, are not as interesting as points with higher \(CO_2\) production. Thus, to reduce the overall computational cost, it would be ideal to generate more points in the more interesting areas automatically. This is the main goal of the Adaptive Design Procedure (ADP).
The ADP was created to generate training data for Machine Learning (ML) algorithms, with a particular emphasis on approximating computationally-intensive first-principles kinetic models in catalysis. The procedure is based on function topology and behavior, with the discrete gradient and relative importance of the independent variables calculated. See more details in the original publication Chem. Eng. J 400, (2020), 125469. The authors demonstrate that the procedure could design a dataset (achieving the same accuracy) using only between 60 and 80% fewer data points as are needed in an evenly distributed grid. Furthermore, the ADP generates a Surrogate Model of the data based on ML techniques, allowing interpolation of points not included in the original data set, which is critical for multiscale simulations of complex chemical reactors.
In this tutorial, we will likewise examine the effects of altering the gas phase’s composition in the \(CO_2\) Turnover frequency (TOF) in the ZGB model, but we will do so while utilizing the ADP to both suggest the values of the \(CO\) molar fractions to evaluate and generate a surrogate model for this solution. In practice, the surrogate model is a PKL file that contains all of the parameters of the ML model, allowing it to be regenerated and used subsequently. The main goal of this tutorial is to obtain this file.
First of all, we must install the package adaptiveDesignProcedure. You can either follow the procedure described in its GitHub repository https://github.com/mbracconi/adaptiveDesignProcedure, or if you are using AMS, you can do it as follows by typing in a terminal:
$ amspackages install adaptivedesignprocedure
Now, let’s start with the script. Foremost, we import all packages we need:
import multiprocessing
import numpy
import scm.plams
import scm.pyzacros as pz
import scm.pyzacros.models
import adaptiveDesignProcedure as adp
import warnings; warnings.simplefilter('ignore', UserWarning)
The import warning
line is just needed to get clean output messages
further down.
The ADP method needs a function that generates the data for the model;
we named it get_rate()
. This function accepts a single argument, a
2d-array containing the conditions to be computed with the shape
(number of conditions, number of input variables)
, and returns a
2d-array containing the calculated values with the shape
(number of conditions, number of output variables)
. The number of
conditions will be determined on the fly by the ADP. On the contrary, we
must decide on the number of input and output variables. In our example,
we have one input variable
, the molar fraction of CO, and
three output variables
, the average coverage for \(O*\) and
\(CO*\) and the \(CO_2\) TOF.
This get_rate()
function performs a ZacrosParametersScanJob
calculation. To follow the details, please refer to the example Phase
Transitions in the ZGB model. In a nutshell, it configures the
ZacrosJob
calculation with the ZGB predefined model. Then it
configures and runs the ZacrosParametersScanJob
, using the
ZacrosJob
defined before as the reference and setting the \(CO\)
molar fraction equal to the conditions established by the ADP. Finally,
it retrieves the results for each condition by calling the
turnover frequency()
and average coverage()
functions and
storing them in the output array in the correct order.
def get_rate( conditions ):
print("")
print(" Requesting:")
for cond in conditions:
print(" x_CO = ", cond[0])
print("")
#---------------------------------------
# Zacros calculation
#---------------------------------------
zgb = pz.models.ZiffGulariBarshad()
z_sett = pz.Settings()
z_sett.random_seed = 953129
z_sett.temperature = 500.0
z_sett.pressure = 1.0
z_sett.species_numbers = ('time', 0.1)
z_sett.max_time = 10.0
z_job = pz.ZacrosJob( settings=z_sett, lattice=zgb.lattice,
mechanism=zgb.mechanism,
cluster_expansion=zgb.cluster_expansion )
#---------------------------------------
# Parameters scan calculation
#---------------------------------------
ps_params = pz.ZacrosParametersScanJob.Parameters()
ps_params.add( 'x_CO', 'molar_fraction.CO', [ cond[0] for cond in conditions ] )
ps_params.add( 'x_O2', 'molar_fraction.O2', lambda params: 1.0-params['x_CO'] )
ps_job = pz.ZacrosParametersScanJob( reference=z_job, parameters=ps_params )
#---------------------------------------
# Running the calculations
#---------------------------------------
results = ps_job.run()
if not results.job.ok():
print('Something went wrong!')
#---------------------------------------
# Collecting the results
#---------------------------------------
data = numpy.nan*numpy.empty((len(conditions),3))
if( results.job.ok() ):
results_dict = results.turnover_frequency()
results_dict = results.average_coverage( last=20, update=results_dict )
for i in range(len(results_dict)):
data[i,0] = results_dict[i]['average_coverage']['O*']
data[i,1] = results_dict[i]['average_coverage']['CO*']
data[i,2] = results_dict[i]['turnover_frequency']['CO2']
return data
Now, let’s look at the script’s main part.
As usual, we initialize the pyZacros environment:
scm.pyzacros.init()
PLAMS working folder: /home/user/pyzacros/examples/ZiffGulariBarshad/plams_workdir
We’ll use the plams.JobRunner
class, which easily allows us to run
as many parallel instances as we request. In this case, we choose to use
the maximum number of simultaneous processes (maxjobs
) equal to the
number of processors in the machine. Additionally, by setting
nproc = 1
we establish that only one processor will be used for
each zacros instance.
maxjobs = multiprocessing.cpu_count()
scm.plams.config.default_jobrunner = scm.plams.JobRunner(parallel=True, maxjobs=maxjobs)
scm.plams.config.job.runscript.nproc = 1
print('Running up to {} jobs in parallel simultaneously'.format(maxjobs))
Running up to 8 jobs in parallel simultaneously
Firstly, we must define the input and output variables. As previously
stated, for the input variables, we only have the molar fraction of CO
(x_CO
). In terms of ADP, we must provide a name as well as a range
to cover during the execution, ranging from min
to max
and
beginning with num
evenly spaced samples. Regarding the output
variables, we have three: the average coverage for \(O*\) and
\(CO*\), as well as the \(CO_2\) TOF. For them, in ADP, we only
need to provide their names (ac_O
, ac_CO
, TOF_CO2
). Notice
these names are arbitrary, and they do not have any effect on the final
results. However, the number of input and output variables should be in
correspondence with the array sizes used in the get_rate()
function.
input_var = ( { 'name':'x_CO', 'min':0.2, 'max':0.8, 'num':5 }, )
output_var = ( {'name':'ac_O'},
{'name':'ac_CO'},
{'name':'TOF_CO2'}, )
Then, we create an adaptativeDesignProcedure
object by calling its
constructor, which needs the input and output variables, the function to
calculate the rates get_rates
. Additionally, for convenience, we set
the output directory as a subdirectory of the pyZacros working directory
and set the random seed (randomState
) to get reproducible results.
The last two parameters are optional. It is also possible to provide
several parameters to control the algorithm using the keyword
``algorithmParams’’. But we will get back to that later.
adpML = adp.adaptiveDesignProcedure( input_var, output_var, get_rate,
outputDir=scm.pyzacros.workdir()+'/adp.results',
randomState=10 )
------ Adaptive generation of Training Data for Machine Learning ------
Input parameters:
* Forest file: /home/user/pyzacros/examples/ZiffGulariBarshad/plams_workdir/adp.results/ml_ExtraTrees.pkl
* Training file: /home/user/pyzacros/examples/ZiffGulariBarshad/plams_workdir/adp.results/tmp/train.dat
* Figure path: /home/user/pyzacros/examples/ZiffGulariBarshad/plams_workdir/adp.results/figures
* Plotting enabled: False
* Boruta as feature selector: True
* Use Weak Support Var in Boruta:True
...
name: ac_CO
}
{
name: TOF_CO2
}
Now, we begin the calculation by invoking the method
createTrainingDataAndML()
, which, as the name implies, generates the
training data as well as the surrogate model (or ML model). The program
runs several cycles, and for each cycle, increases the number of points
and calls the “get rate()” function to evaluate them. When the Relative
Approximation Error and the Out-Of-Bag error are less than
algorithmParams['RADth']
(default=10%) and
algorithmParams['OOBth']
(default=0.05), respectively, the
calculation converges. See the original ADP documentation for more
details.
adpML.createTrainingDataAndML()
------------------ Iterative Species Points Addition ------------------
* Tabulation Variables: ac_O
* Iteration: 0
---------------------------
Points per species :5
---------------------------
Total number of points: 5
Requesting:
x_CO = 0.2
x_CO = 0.35000000000000003
x_CO = 0.5
x_CO = 0.6500000000000001
x_CO = 0.8
[02.02|22:51:12] JOB plamsjob STARTED
[02.02|22:51:12] Waiting for job plamsjob to finish
...
Total number of points: 29
Function loaded in 0.0016736984252929688
Approximation quality:
Out-Of-Bag error : 0.008866387919418402
Out-Of-Bag score : 0.9057607114823224
Accuracy constraints reached in 1 iterations
-------------------- Generating Final ExtraTrees ----------------------
----------------------- Model for CFD generated -----------------------
--------------------------- Procedure stats ---------------------------
* Training data size evolution: 5 8 12 19 29 29 29
--------------------------------- end ---------------------------------
If the execution got up to this point, everything worked as expected. Hooray!
The results are then collected by accessing the trainingData
attribute of the adpML
object, and they are presented nicely in a
table in the lines that follow.
x_CO,ac_O,ac_CO,TOF_CO2 = adpML.trainingData.T
print( "-------------------------------------------------" )
print( "%4s"%"cond", "%8s"%"x_CO", "%10s"%"ac_O", "%10s"%"ac_CO", "%12s"%"TOF_CO2" )
print( "-------------------------------------------------" )
for i in range(len(x_CO)):
print( "%4d"%i, "%8.3f"%x_CO[i], "%10.6f"%ac_O[i], "%10.6f"%ac_CO[i], "%12.6f"%TOF_CO2[i] )
-------------------------------------------------
cond x_CO ac_O ac_CO TOF_CO2
-------------------------------------------------
0 0.200 0.998060 0.000000 0.049895
1 0.350 0.974020 0.000200 0.273811
2 0.359 0.960800 0.000200 0.310947
3 0.369 0.938580 0.000360 0.403137
4 0.388 0.925940 0.000260 0.462316
5 0.397 0.887160 0.001060 0.629684
6 0.406 0.879640 0.000560 0.664653
7 0.416 0.854660 0.001520 0.759789
8 0.425 0.802920 0.002940 0.902253
9 0.434 0.796640 0.002460 0.975221
10 0.444 0.770360 0.003760 1.129684
11 0.453 0.772080 0.003540 1.167158
12 0.463 0.734200 0.004960 1.336484
13 0.472 0.701580 0.009360 1.555474
14 0.481 0.643280 0.011700 1.741516
15 0.491 0.586160 0.016680 1.960968
16 0.500 0.568160 0.018740 2.108442
17 0.509 0.552720 0.022160 2.306632
18 0.519 0.512560 0.032800 2.432063
19 0.528 0.391280 0.101060 2.705432
20 0.538 0.204580 0.342040 2.731326
21 0.547 0.004320 0.958380 1.596042
22 0.556 0.000000 1.000000 1.006589
23 0.575 0.000000 1.000000 0.581137
24 0.594 0.000000 1.000000 0.243600
25 0.613 0.000000 1.000000 0.169137
26 0.650 0.000000 1.000000 0.042695
27 0.725 0.000000 1.000000 0.004505
28 0.800 0.000000 1.000000 0.000589
The above results are the final aim of the calculation. However, we can take advantage of python libraries to visualize them. Here, we use matplotlib. Please check the matplotlib documentation for more details at matplotlib. The following lines of code allow visualizing the effect of changing the \(CO\) molar fraction on the average coverage of \(O*\) and \(CO*\) and the production rate of \(CO_2\). In the Figure, we show the training data generated by the Zacros model as points and the Surrogate Model’s prediction as lines.
import matplotlib.pyplot as plt
fig = plt.figure()
x_CO_model = numpy.linspace(0.2,0.8,201)
ac_O_model,ac_CO_model,TOF_CO2_model = adpML.predict( x_CO_model ).T
ax = plt.axes()
ax.set_xlabel('Molar Fraction CO', fontsize=14)
ax.set_ylabel("Coverage Fraction (%)", color="blue", fontsize=14)
ax.plot(x_CO_model, ac_O_model, color="blue", linestyle="-.", lw=2, zorder=1)
ax.plot(x_CO, ac_O, marker='$\u25EF$', color='blue', markersize=4, lw=0, zorder=1)
ax.plot(x_CO_model, ac_CO_model, color="blue", linestyle="-", lw=2, zorder=2)
ax.plot(x_CO, ac_CO, marker='$\u25EF$', color='blue', markersize=4, lw=0, zorder=1)
plt.text(0.3, 0.9, 'O', fontsize=18, color="blue")
plt.text(0.7, 0.9, 'CO', fontsize=18, color="blue")
ax2 = ax.twinx()
ax2.set_ylabel("TOF (mol/s/site)",color="red", fontsize=14)
ax2.plot(x_CO_model, TOF_CO2_model, color='red', linestyle='-', lw=2, zorder=0)
ax2.plot(x_CO, TOF_CO2, marker='$\u25EF$', color='red', markersize=4, lw=0, zorder=1)
plt.text(0.37, 1.5, 'CO$_2$', fontsize=18, color="red")
plt.show()
We also found three regions involving the two phase transitions, as
shown in the Figure, which are similar to the ones obtained in the
example Phase Transitions in the ZGB model. However, as you can see,
the points are not equidistant here. The ADP generated more points for
\(x_\text{CO}\) between 0.35 and 0.55, which is likely to represent
the small oscillations that occur there accurately. It is worth noting
that the Surrogate Model has some difficulty reproducing the narrow peak
at a maximum \(CO_2\) TOF. Certainly, the model could perform
better. Unfortunately, the only way to do so is by increasing the number
of points in the training set by tuning the ADP parameters’
(algorithmParams
). However, now that we have the surrogate model, we
can quickly obtain the average coverage for \(O*\) and \(CO*\)
and the \(CO_2\) TOF for any \(CO\) molar fraction.
Now, we can close the pyZacros environment:
scm.pyzacros.finish()
[02.02|22:52:41] PLAMS run finished. Goodbye
As a final note, you don’t have to recalculate any training points if
you load the surrogate model directly from the disk. As a result, you
can apply it to any other situation, such as e.g., a simulation of
reactive computational fluid dynamics (rCFD). The following code creates
the image above exactly, but it starts by reading the surrogate model
from the directory plams_workdir/adp.results/
. It can be used in a
separate script.
import numpy as np
import adaptiveDesignProcedure as adp
import matplotlib.pyplot as plt
path = "plams_workdir/adp.results/"
x_CO_model = np.linspace(0.2,0.8,201)
ac_O_model,ac_CO_model,TOF_CO2_model = adp.predict( x_CO_model, path ).T
ax = plt.axes()
ax.set_xlabel('Molar Fraction CO', fontsize=14)
ax.set_ylabel("Coverage Fraction (%)", color="blue", fontsize=14)
ax.plot(x_CO_model, ac_O_model, color="blue", linestyle="-.", lw=2, zorder=1)
ax.plot(x_CO_model, ac_CO_model, color="blue", linestyle="-", lw=2, zorder=2)
plt.text(0.3, 0.9, 'O', fontsize=18, color="blue")
plt.text(0.7, 0.9, 'CO', fontsize=18, color="blue")
ax2 = ax.twinx()
ax2.set_ylabel("TOF (mol/s/site)",color="red", fontsize=14)
ax2.plot(x_CO_model, TOF_CO2_model, color='red', linestyle='-', lw=2, zorder=0)
plt.text(0.37, 1.5, 'CO$_2$', fontsize=18, color="red")
plt.show()
By default, the Surrogate Model’s file is ml_ExtraTrees_forCFD.pkl
.
The option forestFile
in the ADP constructor allows you to alter the
prefix ml_ExtraTrees
. In the adp.predict()
method, you can
provide the complete path to this file, but if a directory is supplied
instead, it will try to discover the proper file inside, as shown in the
lines of code above.
To improve the Surrogate Model, rerun the calculation with the following parameters for the ADP:
adpML = adp.adaptiveDesignProcedure( input_var, output_var, get_rate,
algorithmParams={'dth':0.01,'d2th':0.10},
outputDir=scm.pyzacros.workdir()+'/adp.results',
randomState=10 )
As a result, you should obtain the following figure:
You can also execute the calculation but under steady-state conditions. Take a look at the example PhaseTransitions-SteadyState-ADP.py
to see the actual implementation. This can be accomplished simply by making the following changes to the get_rate()
function:
This calculation should not take more than 20 minutes. You should be able to obtain the following figure: