In this Masterclass, we will discuss how to report the results from molecular simulations.
We will emphasize that any result that we get from any simulation is a random variable.
To make the result reproducible, we must characterize the distribution
that has been sampled. It is not sufficient to report the averages.
Once you have completed this Masterclass you will be able to:
- Use PLUMED to calculate time averages and histograms from biased and unbiased simulation data.
- Use PLUMED to perform block averaging.
- Calculate error bars on-time averages computed form biased and unbiased simulation data using the central limit theorem and non-parametric bootstrap.
If you have not yet set up PLUMED, you can find information about installing it here. Please ensure that you have setup PLUMED on your machine before starting the exercises.
The data needed to execute the exercises of this Masterclass can be found on GitHub. You can clone this repository locally on your machine using the following command:
git clone https://github.com/plumed/masterclass-21-2.git
The data you need is in the folder called data
. You will find the following files in that folder:
uncorrelated_data
: you will analyze this data set in the first six exercises that followcorrelated_data
: you will analyze this data set in exercise 7weighted_data
: you will analyze this data in exercise 8
In exercise 9, you will pull everything together by generating a metadynamics trajectory. This exercise draws together all the ideas from exercises 1-8 by getting you to run a metadynamics simulation and extract the free energy surface. In the data
folder, you will thus also find the following files, which
are the input for the metadynamics simulation:
- in: The input file for simplemd that contains the parameters for the MD simulation.
- input.xyz: An initial configuration for the cluster that we are studying in this tutorial.
Notice that PLUMED input files have not been provided in the GitHub repository. You must prepare these input files yourself using the templates below.
We would recommend that you run each exercise in separate sub-directories inside the root directory masterclass-21-2
.
In other lessons you have seen how PLUMED can be used to calculate the value of a collective variables,
where
It is only possible to calculate the integrals in the quotient above exactly for elementary physical systems. For more complex systems we thus assume that we
can extract information on
Reporting only the average value we get is not sufficient as any averages we take are random.
Links to background information on statistical mechanics and statistics are provided throughout this tutorial. You can complete the tutorial without looking at the information at these links. However, we hope that the information we have provided through these links will prove useful if you want to do a more in-depth study of the topic.
We will start our study of averaging by estimating the ensemble average of the CV. The ensemble average for
In statistics, the quantity known as the ensemble average in statistical mechanics is referred to as the expectation of the random variable's
distribution.
The expectation of a random variable is often esimated by taking multiple identical samples from the distribution,
We can do the same with the data from our MD trajectory. We replace the
To calculate averages using PLUMED, you can use the input file below. This input calculates averages for the data in the uncorrelated_data
file you downloaded when you collected the GitHub repository.
#SOLUTIONFILE=work/plumed_ex1.dat
data: READ FILE=__FILL__ VALUES=__FILL__
av: AVERAGE ARG=__FILL__ STRIDE=1
PRINT ARG=av FILE=colvar
Copy the input to a file called plumed.dat, fill in the blanks and run the calculation by executing the following command:
> plumed driver --noatoms
The colvar
file that is output by PLUMED contains the average computed value from progressively larger and larger numbers of CV values. You should thus
be able to use the data in colvar
to produce a graph that shows the average as a function of the number of variables it is computed from as shown below.
The fluctuations in the average get smaller as this quantity is computed from larger numbers of random variables. We say that the average thus converges to the ensemble average, which is zero for the graph above.
We can estimate the distribution for our CV,
If we estimate uncorrelated_data
using PLUMED in this way we can use the input file:
#SOLUTIONFILE=work/plumed_ex2.dat
# We use natural units here so that kBT is set to 1
UNITS NATURAL
data: READ FILE=__FILL__ VALUES=__FILL__
hhh: HISTOGRAM ARG=__FILL__ STRIDE=1 __FILL__=-4.5 __FILL__=4.5 __FILL__=100 __FILL__=DISCRETE
fes: CONVERT_TO_FES GRID=__FILL__ TEMP=1 # This sets k_B T = 1
DUMPGRID GRID=__FILL__ FILE=fes.dat
Copy this input to a file called plumed.dat, fill in the blanks so that you have a grid that runs from -4 to +4 with 100 bins. Run the calculation by executing the following command:
> plumed driver --noatoms
The --noatoms
flag here is needed since plumed driver
is commonly used to analize a trajectory of atomic coordinates, but here we are using it to directly analyze a collective variable.
You should see that the free energy curve for this data set looks like this:
As you can see, there is a single minimum in this free energy surface.
Physical systems spend the majority of their time fluctuating around minima in the free energy landscape. Let's suppose that this minima is at
In this expression
We now recall that
The first term in the final product here is a constant that does not depend on
We can assume that the constant term in the product above normalizes the distribution. The derivation above, therefore, suggests that our CV values are all samples from a
normal distribution. We thus no longer need to estimate the histogram to get information on
Statistics tells us that if we have
We learned how to estimate uncorrelated_data
using PLUMED
and the expression above we can use the following input file:
#SOLUTIONFILE=work/plumed_ex3.dat
UNITS NATURAL
data: READ FILE=__FILL__ VALUES=__FILL__
# This line should calculate the square of the quantity read in from the file above
d2: CUSTOM ARG=__FILL__ FUNC=__FILL__ PERIODIC=NO
# Calculate the average from the read-in data
av: AVERAGE ARG=__FILL__ STRIDE=1
# Calculate the average of the squares of the read in data
av2: AVERAGE ARG=__FILL__ STRIDE=1
# Evaluate the variance using the expression above
var: CUSTOM ARG=__FILL__ FUNC=y-x*x PERIODIC=NO
# Print the variance
PRINT ARG=__FILL__ FILE=colvar
Copy this input to a file called plumed.dat, fill in the blanks and run the calculation by executing the following command:
> plumed driver --noatoms
Once you have run this calculation, you should be able to draw a graph showing how the estimate of
Truncation the Taylor series of the free energy at second order as we have done in this section is equivalent to assuming that a Harmonic Oscillator
can be used to describe the fluctuations along our CV.
The partition function, ensemble average and distribution for such systems can be calculated exactly, and there is no need for simulation.
Even when the system is not harmonic, calculating the quantity we have called
The following PLUMED input splits the CV values into blocks and calculates an average from each block of data separately. We can thus use it to get information on the distribution that is being sampled when we calculate an average from sets of 500 random variables using the ideas discussed in exercise 1.
#SOLUTIONFILE=work/plumed_ex4.dat
data: READ FILE=__FILL__ VALUES=__FILL__
av: AVERAGE ARG=__FILL__ STRIDE=1 CLEAR=500
PRINT ARG=__FILL__ STRIDE=__FILL__ FILE=colvar
Copy this input to a file called plumed.dat, fill in the blanks and run the calculation by executing the following command:
> plumed driver --noatoms
Plot a graph showing the original data from uncorrelated_data
and the averages in the colvar
file. You should be able to see something similar to this graph.
Notice how the distribution for both the black (original data) and yellow points (averages) in this graph are centred on the same quantity. Both of these quantities are thus accurate estimators for the expectation of the distribution. However, the block average that is shown in yellow is a more precise estimator for this quantity.
The reason the block average is a more precise estimator is connected to a well known result in statistics. If we compute a mean as follows:
where the
where
We can use the block averaging method introduced in exercise 4 to calculate error bars on the estimates of free energy. To generate 10 histograms
from the data in uncorrelated_data
with 100 bins starting at -4 and finishing at +4 using PLUMED we can use the input file:
#SOLUTIONFILE=work/plumed_ex5.dat
UNITS NATURAL
data: READ FILE=__FILL__ VALUES=__FILL__
hhh: HISTOGRAM ...
ARG=__FILL__ STRIDE=1
__FILL__=-4.5 __FILL__=4.5 __FILL__=100
CLEAR=__FILL__ __FILL__=DISCRETE
...
DUMPGRID GRID=__FILL__ FILE=hist.dat STRIDE=1000
Copy this input to a file called plumed.dat, fill in the blanks and run the calculation by executing the following command:
> plumed driver --noatoms
Running this command should generate several files containing histograms that will be called: analysis.0.hist.dat,
analysis.1.hist.dat etc. These files contain the histograms constructed from each of the blocks of data in your trajectory. You can merge
them all to get the final free energy surface, which can be calculated using the well-known relation between the histogram,
import matplotlib.pyplot as plt
import numpy as np
import glob
hist1 = np.loadtxt("../Exercises/Exercise_5/hist.dat")
N, average, average2 = 1, hist1[:,1], hist1[:,1]*hist1[:,1]
for filen in glob.glob( "../Exercises/Exercise_5/analysis.*.hist.dat") :
histn = np.loadtxt(filen)
N, average, average2 = N + 1, average + histn[:,1], average2 + histn[:,1]*histn[:,1]
# Final averages
average = average / N
# Final variances
var = (N/(N-1))*( average2 / N - average*average )
# Errors
error = np.sqrt( var / N )
# Convert to free energy
fes = -np.log( average )
# Convert to error in fes
ferr = error / average
# And draw graph of free energy surface
plt.fill_between( hist1[:,0], fes-ferr, fes+ferr )
plt.xlabel("CV value")
plt.ylabel('Free energy')
plt.show()
Copy this script to a cell in a python notebook and then run it on your data. You may need to adjust the names of the files that are being read to suit your machine's setup. The graph shown in the figure below shows the free energy surface generated from the python script.
The value of the free energy in the $i$th bin is calculated using:
The sum here runs over the
The error on the free energy, which is illustrated using the shaded region in the figure above, is calculated using:
The term in the square root here is the error on the average value of the histogram in bin
to get the expression above.
Lastly, notice that it is often useful to average the error over the grid using:
where the sum runs over the
Exercise 4 demonstrated one way of sampling the mean's distribution. We assumed that the
We can demonstrate how bootstrapping works in practice by using the following script, which works with the first 500 points in uncorrelated_data
. As you can see, we first calculate the average from all the data points. We then take new means by repeatedly sampling sets of 500 points with replacement from the data and calculating new means.
import numpy as np
ddd = np.loadtxt("../data/uncorrelated_data")
data = ddd[0:500,1]
bootstraps = np.zeros(200)
for i in range(200) :
av = 0
for j in range(500) : av = av + data[np.random.randint(0,500)]
bootstraps[i] = av / 500
f = open("bootstraps", "w")
f.write("#! FIELDS time boot \n")
for i in range(0,200):
f.write(str(i) + " " + str(bootstraps[i) + "\n" )
f.close()
When you run the script above, it generates a file called bootstraps
containing the averages that have been calculated by bootstrapping. If you now calculate the variance from all your bootstrapped averages you should see that it close to the value you got from the expression below which was introduced in exercise 4:
To use this expression you can insert the value of
You can also use bootstrapping to estimate the errors in the free energy surface. As an additional exercise, you can try to do this form of analysis on the data in uncorrelated_data
You should get
a result that is similar to the result you got in exercise 5
In this exercise, you will review everything you have done in the previous two exercises. You should:
- Calculate block averages for the data in
correlated_data
. Calculate the error on the average of the block average. - Calculate bootstrap averages for the data in
correlated_data
. Calculate the error from the bootstrap averages.
You will see that the variance you obtain from the bootstrap averages is less than the variance you get by block averaging.
These variances are different because there are correlations between the data points in correlated_data
. Such correlations were not present in the data set
you have examined in all the exercises that appeared previously to this one. The reason this matters is that the expression:
is only valid if the random variables that
as this expression holds as long as the random variables from which
Any data we get by computing CVs from a molecular dynamics trajectory is almost certain to contain correlations. It is thus essential to know how to handle correlated data. The block averaging technique that was introduced in exercise 4 resolves this problem. You can show that if the blocks are long enough, the averages you obtain are uncorrelated.
For the remainder of this exercise, you should use the data in correlated_data
and what you have learned in the previous exercises to calculate block averages for different block sizes.
For each block size
- Estimate the mean and variance for your
$N$ block averages (the$X_i$ ) using$\mu = \frac{1}{N} \sum X_i$ and$\sigma^2 = \frac{N}{N-1} \left[ \frac{1}{N} \sum_{i=1}^N X_i^2 - \left( \frac{1}{N}\sum_{i=1}^N X_i \right)^2 \right] $ - Insert your estimate of the variance into the following expression for the error bar on your estimate for
$\mu$ :$\epsilon = \sqrt{\frac{\sigma^2}{N}}$
You should be able to use your data to draw a graph showing the value of the average and the associated error barm
This graph shows that, when the data is correlated, the error bar is underestimated if each block average is computed from a small number of data points.
When sufficient data points are used to calculate each block average, however, the error bar settles on a constant value that is independent of the block size. When this has happened you
can be confident that your block averages are no longer correlated and the expression
Notice that you can also calculate the error bar using bootstrapping by selecting samples from your block averages. If you have time, you should try this. You should try to confirm that this method gives similar estimates for the error.
PLUMED is routinely used to run simulations using methods such as umbrella sampling and metadynamics. In these methods, a bias potential,
To get information on the unbiased distribution:
from such simulations we thus have to calculate weighted averages using:
$$ \overline{X}w = \frac{\sum{i=1}^N w_i X_i }{\sum_{i=1}^N w_i} $$
where the
In this exercise we are going to reweight the data in weighted_data
, which consists of samples from the following distribution:
with
$$ \overline{X}w = \frac{\sum{i=1}^N w_i X_i }{\sum_{i=1}^N w_i} \qquad \textrm{where} \qquad w_i = \frac{1}{P(X_i)} $$
should be an estimator for the expectation of a uniform random variable between 0 and 1. Furthermore, the expression below:
$$ \sigma^2_{X} = \frac{V_1}{V_1-1} \sum_{i=1}^N \frac{w_i}{V_1} (x_i - \overline{X}w)^2 \qquad \textrm{where} \qquad V_1 = \sum{i=1}^N w_i $$
gives an estimate for the variance of the distribution after reweighting (i.e. the variance of the uniform random variable).
Notice, finally, that the tools of statistics gives us expressions for the expectation and variance of the weighted average:
$$ \mathbb{E}(\overline{X}_w) = \overline{X} \qquad \textrm{and} \qquad \textrm{var}(\overline{X}w) = \frac{\sum{i=1}^N w_i^2 (X_i - \overline{X}w)^2 }{(\sum{i=1}^N w_i)^2} $$
These expressions hold if all the
In what follows we are thus going to try to extract the average and the fluctuations for the CV in the unbiased (in this case uniform) distribution as well as the unbiased free energy. We will calculate these quantities by computing weighted averages and weighted histograms from our simulation data. For the histogram we are also going to extract error bars by reweighting. To calculate these quantities using PLUMED we will use an input like this:
#SOLUTIONFILE=work/plumed_ex6.dat
UNITS NATURAL # This ensures that Boltzmann's constant is one
data: READ FILE=__FILL__ VALUES=__FILL__
# This restraint and the REWEIGHT_BIAS command after computes the weights in the formulas above.
mm: RESTRAINT ARG=data AT=0.6 KAPPA=4
rw: REWEIGHT_BIAS TEMP=1
wav: AVERAGE ARG=__FILL__ STRIDE=1 LOGWEIGHTS=__FILL__
# These lines compute the variance of the random variable
dd: CUSTOM ARG=data,wav FUNC=(x-y)(x-y) PERIODIC=NO
uvar: AVERAGE ARG=dd STRIDE=1 LOGWEIGHTS=rw NORMALIZATION=false
one: CONSTANT VALUE=1
wsum: AVERAGE ARG=one STRIDE=1 LOGWEIGHTS=rw NORMALIZATION=false
var: CUSTOM ARG=uvar,wsum FUNC=x/(y-1) PERIODIC=NO
# Print out the average and variance of the uniform random variable
PRINT ARG=__FILL__ STRIDE=1 FILE=colvar
# Construct the histogram
hhh: HISTOGRAM ...
ARG=__FILL__ LOGWEIGHTS=__FILL__
__FILL__=0 __FILL__=1 __FILL__=20
CLEAR=__FILL__ NORMALIZATION=true __FILL__=DISCRETE
...
DUMPGRID GRID=__FILL__ FILE=hist.dat STRIDE=1000
Copy this input to a file called plumed.dat, fill in the blanks and run the calculation by executing the following command:
> plumed driver --noatoms
I obtained the following graphs showing how the mean and variance change with sample size.
The free energy with errors from this data looks like this:
It is up to you to work out how to adapt the script given in exercise 5, so it works here. The script above works for unweighted means. Here, however, you have to calculate weighted means using the formulas above.
We can now bring all this together by running a metadynamics simulation and extracting the free energy surface by reweighting. The other exercises in this tutorial have shown us that when we do this, it is essential to:
- Quote error bars on the estimates of the free energy we obtain as the values we get from our simulation will be random variables. Quoting error bars is thus essential in terms of making our results reproducible.
- Calculate weighted averages using the ideas discussed in exercise 8 as we need to account for the effect the bias has on the sampling of phase space.
- Use the block averaging method discussed in exercise 4 to obtain multiple estimates for the free energy. Notice that we need to use block averaging because, as discussed in exercise 7, there are correlations between the CV values the system visits during the trajectory.
As these three issues have been the focus of this tutorial, we focus on them in the exercise that follows. The theory behind the metadynamics method that we are using is discussed in detail in other lessons if you are interested.
We will study a system of 7 Lennard Jones atoms in two dimensions in what follows using the MD code simplemd that is part of PLUMED. You can run this code by issuing the command:
plumed simplemd < in
where in here is the input file from the GitHub repository for this tutorial. This input file instructs PLUMED to perform 200000 steps of MD at a temperature of
The timestep in this simulation is 0.005
We want to investigate transitions between the four structures of Lennard Jones 7 that are shown below using metadynamics.
However, when we run the metadynamics, we will often find that the cluster evaporates and the seven atoms separate. To prevent this, we will thus add restraints to prevent the cluster from evaporating. The particular restraint we are going to use will
prevent all the atoms from moving more than
$$ \mathbf{x}\textrm{com} = \frac{1}{N} \sum{i=1}^N \mathbf{x}_i $$
where
as you can see, this potential does not affect the dynamics when these distances are less than 2
A metadynamics bias will be used to force the system to move between the four configurations shown in the figure of the four minima above. This bias will act on the second and third central moments of the distribution of coordination numbers.
The nth central moment of a set of numbers,
Furthermore, we can compute the coordination number of our Lennard Jones atoms using:
where
#SOLUTIONFILE=work/plumed_ex7.dat
# this optional command tells VIM that this is a PLUMED file and to colour the text accordingly
# vim: ft=plumed
# This tells PLUMED we are using Lennard Jones units
UNITS NATURAL
# Calculate the position of the centre of mass. We can then refer to this position later in the input using the label com.
com: COM __FILL__
# Add the restraint on the distance between com and the first atom
d1: DISTANCE __FILL__
UPPER_WALLS ARG=d1 __FILL__
# Add the restraint on the distance between com and the second atom
d2: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Add the restraint on the distance between com and the third atom
d3: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Add the restraint on the distance between com and the fourth atom
d4: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Add the restraint on the distance between com and the fifth atom
d5: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Add the restraint on the distance between com and the sixth atom
d6: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Add the restraint on the distance between com and the seventh atom
d7: DISTANCE __FILL__
UPPER_WALLS __FILL__
# Calculate the collective variables
c1: COORDINATIONNUMBER SPECIES=__FILL__ MOMENTS=__FILL__ SWITCH={RATIONAL __FILL__ }
# Do metadynamics
METAD ...
ARG=__FILL__ HEIGHT=__FILL__ PACE=__FILL__ SIGMA=__FILL__
GRID_MIN=-1.5,-1.5 GRID_MAX=2.5,2.5 GRID_BIN=500,500 BIASFACTOR=5
...
Copy this input to file called plumed.dat and modify it so that it instructs PLUMED to add Gaussian kernels with a bandwidth of 0.1 in both the second and third moment of the distribution of coordination numbers and a height of 0.05
plumed simplemd < in
Once you have run the metadynamics calculations, you can post-process the output trajectory using driver to extract the free energy by reweighting. Notice that to do block averaging, you will need to extract multiple estimates for the (weighted) histogram. You should thus use the following input file to extract estimates of the histogram:
#SOLUTIONFILE=work/plumed_ex8.dat
# this optional command tells VIM that this is a PLUMED file and to colour the text accordingly
# vim: ft=plumed
UNITS NATURAL
# We can delete the parts of the input that specified the walls and disregard these in our
# analysis. It is OK to do this as we are only interested in the value of the free energy
# in parts of phase space where the bias due to these walls are not acting.
c1: COORDINATIONNUMBER SPECIES=__FILL__ MOMENTS=__FILL__ SWITCH={RATIONAL __FILL__}
# The metadynamics bias is restarted here so we consider the final bias as a static bias
# in our calculations
METAD ...
ARG=__FILL__ HEIGHT=0.05 PACE=50000000 SIGMA=0.1,0.1
GRID_MIN=-1.5,-1.5 GRID_MAX=2.5,2.5 GRID_BIN=500,500
TEMP=0.1 BIASFACTOR=5 RESTART=YES
...
# This adjusts the weights of the sampled configurations and thereby accounts for the
# effect of the bias potential
rw: REWEIGHT_BIAS TEMP=0.1
# Calculate the histogram and output it to a file
hh: HISTOGRAM ...
ARG=c1.* GRID_MIN=-1.5,-1.5
GRID_MAX=2.5,2.5 GRID_BIN=200,200 BANDWIDTH=0.02,0.02
LOGWEIGHTS=__FILL__ CLEAR=__FILL__ NORMALIZATION=true
...
DUMPGRID GRID=hh FILE=my_histogram.dat STRIDE=2500
Once you have filled in the blanks in this input, you can then run the calculation by using the command:
> plumed driver --ixyz trajectory.xyz --initial-step 1
You must make sure that the HILLS file that was output by your metadynamics simulation is available in the directory where you run the above command.
For the rest of the exercise, you are on your own. You must use what you have learned in the other parts of the tutorial to generate plots similar to those shown below. The leftmost plot here is the free energy surface computed by taking the (weighted) average of the blocks, the right panel shows the size of the error bar on the free energy at each point of the free energy surface.
Notice that the data is correlated here so you should investigate how the error size depends on the lengths of the blocks as was discussed in exercise 7. When I did this analysis I found that the error does not have a strong dependence on the size of the blocks.
Finally, if you are struggling to plot the 2D free energy surface, you can generate free energy as a function of one CV only using the ideas from earlier exercises.
Hint: You are now calculating weighted averages so you will need to use the code you wrote for exercise 8 to merge the histograms
If you want to know more about good practise using PLUMED you can read this paper. We would also recommend learning about kernel density estimation, which will often give you smoother histograms. My full sets of notes are available here: