Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Enhanced sampling simulation cookbooks #21

Open
wants to merge 27 commits into
base: main
Choose a base branch
from

Conversation

sukritsingh
Copy link

Here's an initial draft of 4 cookbooks for running enhanced sampling methods (AMD, GAMD, MetaD, and Umbrella Sampling) using OpenMM. Each of them are located in a subdirectory inside cookbook/Enhanced sampling methods. A couple of sample log. files are included (which are used to set up hyper parameters for these simulations).

Any and all feedback is welcome! The notebook code itself runs but it may need cleanup/additional documentation.

(This PR is a re-opening after PR #16 ) to deal with merge conflicts.

This submission is response to a larger effort to add more tutorials to the OpenMM-cookbook (#12)

@peastman
Copy link
Member

These aren't really cookbook entries as such. Take a look at the existing entries to get a sense for how they're written. A cookbook entry should be short, to the point, and completely self contained. It should start by clearly stating the problem, then present the minimal amount of code needed to solve it, while clearly explaining the code so the user can adapt it to their own problem.

These notebooks are more like tutorials based on their size and complexity. And they don't have much explanation, just large blocks of code.

@sukritsingh
Copy link
Author

sukritsingh commented Apr 24, 2023

A cookbook entry should be short, to the point, and completely self contained. [...] And they don't have much explanation, just large blocks of code.

These notebooks are self-contained and run using the villin.pdb file in the main cookbook directory. It just requires a system/simulation object to be set up (and CVForces defined).

I was basing it off a discussion in another thread where you mentioned:

The cookbook is a reference. It contains small, self contained pieces of code for performing specific tasks. They can assume you already know what you want to do, and you're just trying to find out how to do it.

So I wrote it assuming someone would know what the theory behind any of these enhanced sampling methods are and just need the code to run them on any system of their choice. I figure that the "tutorials" would have more in depth discussion, but happy to add more documentation/problem statements if that would be preferred.

Admittedly these aren't super small notebooks, but I think they are still descriptively minimal examples? Partially because setting up and running customIntegrators can just take up a few cells (especially given that it starts from just the villin.pdb). Happy to also merge cells together to improve clarity if that's more inline with cookbooks.

Either way I definitely want to document these a lot better and flesh out the details more but I just wanted to get something out there that "runs" in case y'all had any thoughts. Appreciate this discussion!

These notebooks are more like tutorials based on their size and complexity.

I'm also happy to migrate these to the tutorials directly instead rather than having them as cookbooks, if that's the preference here as well!

Copy link
Member

@peastman peastman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I picked one notebook and added detailed comments on it. I didn't go through the others, because very similar comments apply to them too. Does this make it clearer what we're looking for in a cookbook entry?

Comment on lines 25 to 39
"from openmm.app import *\n",
"from openmm import * \n",
"from openmm.unit import *\n",
"from openmmtools import forces\n",
"import mdtraj as md\n",
"import numpy as np\n",
"import parmed as pmd\n",
"import bz2\n",
"import os\n",
"from openmm import CustomIntegrator\n",
"from openmm.unit import kilojoules_per_mole, is_quantity\n",
"from openmm.unit import *\n",
"import numpy as np\n",
"import pandas as pd \n",
"from matplotlib import pyplot as plt"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are a few problems with these imports.

  1. Many of them are never used! So why import them?
  2. Cookbook entries need to run in CI. That means if an external package doesn't get automatically installed with OpenMM, you need to explicitly install it as part of the notebook.
  3. For that reason, we should be very sparing about using external packages in the cookbook. They add complexity and should only be used when there's a really good reason.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - for clarity, I used the external packages pandas and matplotlib in other notebooks to demonstrate how one would choose hyperparameters for things like AMD or Metadynamics.

From a teaching standpoint do you think that's strictly necessary or would it be fine to just describe the process and skip ahead? If you think it's strictly necessary I can modify the notebook accordingly to install needed packages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It comes down to what you're trying to teach. If pandas is central to what you're teaching, then it's necessary. But if it's just incidental, a way of doing some setup, it's better to omit it.

In the Gaussian AMD tutorial, you use pandas to load a log file from an earlier simulation, then compute a few statistics of the energy from it. For a cookbook entry, I think it's fine to skip that and just hardcode them. Something like this:

Before running a Gaussian AMD simulation, you first need to select a strength for the boost potential. This is usually done by running a short simulation, recording the potential energy, and computing some simple statistics of it. The following are the minimum, maximum, mean, and standard deviation of the energies observed during an earlier simulation.

And then just provide hardcode values:

Vmin = -801890.7367725638,
Vmax = -794737.8288112064,
Vavg = -798419.040535957,
Vstd = 904.4602237453533

Comment on lines 48 to 56
"# define a function for creating RMSD restraints \n",
"def create_rmsd_restraint(positions, atom_indicies):\n",
" rmsd_cv = RMSDForce(positions, atom_indicies)\n",
" energy_expression = 'step(dRMSD) * (K_RMSD/2) * dRMSD^2; dRMSD = (RMSD-RMSD0);' \n",
" energy_expression += 'K_RMSD = %f;' % spring_constant.value_in_unit_system(md_unit_system) \n",
" energy_expression += 'RMSD0 = %f;' % restraint_distance.value_in_unit_system(md_unit_system) \n",
" restraint_force = CustomCVForce(energy_expression) \n",
" restraint_force.addCollectiveVariable('RMSD', rmsd_cv) \n",
" return restraint_force \n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never used.

Comment on lines 66 to 73
"# define a function to list force groups and modify system context \n",
"def forcegroupify(system): \n",
" forcegroups = {} \n",
" for i in range(system.getNumForces()): \n",
" force = system.getForce(i) \n",
" force.setForceGroup(i) \n",
" forcegroups[force] = i \n",
" return forcegroups"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is never used.

Comment on lines 90 to 102
"sim_temp = 300.0 * kelvin\n",
"H_mass = 4.0 * amu #Might need to be tuned to 3.5 amu \n",
"time_step = 0.002 * picosecond \n",
"nb_cutoff = 10.0 * angstrom \n",
"box_padding = 12.0 * angstrom\n",
"salt_conc = 0.15 * molar\n",
"receptor_path=\"../villin.pdb\"\n",
"current_file=\"villin-solvated\"\n",
"# Misc parameters \n",
"restraint_distance = 0.0 * angstroms \n",
"restart_freq = 10\n",
"log_freq = 1 \n",
"prd_steps = 100 "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many of these parameters are never used. Most of the others are only used once, so you can just put the value at the appropriate place in the code without defining a variable for it. Remember, the goal of a cookbook entry is to show how to do one specific thing with the smallest amount of code possible. It doesn't need to be a complete, robust simulation code.

" constraints=HBonds, \n",
" rigidWater=True,\n",
" hydrogenMass=H_mass)\n",
"system.addForce(MonteCarloBarostat(1*bar, sim_temp))\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a barostat is unnecessary. It's a fine thing to do in a real simulation, but it's unrelated to umbrella sampling. It doesn't contribute toward showing the user what we're trying to show them.

Comment on lines 325 to 327
"simulation.reporters.append(DCDReporter('./'+current_file+''+ '.dcd', restart_freq))\n",
"simulation.reporters.append(CheckpointReporter('./'+current_file+''+ '.chk', min(prd_steps, 10*restart_freq)))\n",
"simulation.reporters.append(StateDataReporter(open('./log.' + current_file+'', 'w'), log_freq, step=True, potentialEnergy=True, kineticEnergy=True, totalEnergy=True, temperature=True, volume=True, density=True, speed=True))"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unnecessary. Unless you specifically refer to the output of a reporter later on, it's relevant to what you're teaching.

Comment on lines 344 to 351
"## set force constant K for the biasing potential.\n",
"## the unit here is kJ*mol^{-1}*nm^{-2}, which is the default unit used in OpenMM\n",
"K = 100\n",
"simulation.context.setParameter(\"k\", K)\n",
"\n",
"## M centers of harmonic biasing potentials\n",
"M = 20\n",
"r0_range = np.linspace(0.3, 2.0, M, endpoint = False)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters will already have been set in the calls to addBond() above.

Comment on lines 360 to 361
"simulation.context.setParameter('r0_d1', r0_range[1])\n",
"simulation.context.setParameter('r0_d2', r0_range[2])"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Comment on lines 379 to 382
"simulation.context.setPositions(pdb.positions)\n",
"print(' initial : %s' % (simulation.context.getState(getEnergy=True).getPotentialEnergy()))\n",
"simulation.minimizeEnergy()\n",
"print(' final : %s' % (simulation.context.getState(getEnergy=True).getPotentialEnergy()))"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has nothing to do with umbrella sampling.

Comment on lines 391 to 392
"for i in range(5):\n",
" simulation.step(10)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explain what's going on. We've set up a simulation with restraints. Now we run some dynamics and collect statistics. What statistics do we need to collect? How do we collect them? What do we do with them once we've collected them?

@sukritsingh
Copy link
Author

sukritsingh commented Apr 24, 2023

These comments are super helpful - thanks so much! In looking through the examples, I didn't appreciate that it didn't have to be a proper "robust" simulation code so I included everything (and it's from an older "chicken scratch" notebook).

I didn't go through the others, because very similar comments apply to them too.

Indeed - initially this was actually one bigger notebook with all four separate approaches applying to the same system.

Thanks again! I'll work on resolving the comments you made, and apply them to the other notebooks as well. I'll try to ensure that these are much more lightweight examples.

@sukritsingh
Copy link
Author

@peastman Would you be willing to take a look at these cookbooks now and let me know if you think they've been sufficiently cleaned up? I've cleaned up all four notebooks to ensure it's just as minimal a simulation as possible.

For each notebook, I also tried to document how one could save outputs (without using reporters) in a way that I like to do it, but I wasn't sure it added too much complexity or not. It doesn't involve too much overhead and it's a nice way to only save what you need to track the sampling of your simulation or generate the free energy projections. Any and all comments are welcome!

@sukritsingh
Copy link
Author

@sef43 also wanted to tag you in case you may have had any thoughts/edits here!

@peastman
Copy link
Member

peastman commented Oct 3, 2023

Sorry this PR got forgotten! I think it just got lost in everyone's inboxes. Let's get back to it and see if we can finish cleaning it up. What you have now is closer to what we're looking for, but I think we can improve it. If you read through the existing cookbook entries, you'll see they tend to follow a formula.

  1. State what problem we're going to solve.
  2. State what mechanism we're going to use to solve it.
  3. Demonstrate how to do it with the smallest amount of code possible.
  4. Explain any pieces of the code that may not be clear and point out how the reader may need to modify it to suite their own needs.

For comparison, look at your umbrella sampling entry again. Instead of explaining what problem we're solving, it goes straight into code. The only explanation is vague titles like, "Create system + CV Forces with harmonic restraints." When the reader finally gets to some text, they don't have the context needed to understand it. "Ultimately we want to track the value of these distances..." But you haven't told them what these distances are or why we would want to track them.

My best suggestion is really just to read through the existing cookbook entries. I think that will make it clear. They tend to have at least as much text as code, and they never present code without first explaining what the code is for.

@sukritsingh
Copy link
Author

sukritsingh commented Oct 6, 2023

No worries @peastman - totally understand that things come up and/or get buried! Thanks for pointing out the additional notebooks and the guidance of the Four Parts.

I've tried out an "example" of the documenting on the Umbrella sampling notebook, since you brought it up - Could you take a look at it and let me know if this is enough/needs editing/additional fleshing out still? I tried adding more context without adding more math or anything and kept the code to the minimum. Hopefully this provides sufficient context? If not I can definitely keep expanding!

If it looks good here then I have a starting point to edit the other notebooks and can push changes to them accordingly - just want to make sure we converge on "one good notebook" first!

@peastman
Copy link
Member

That's a lot better. Actually, what you have now is more of a tutorial than a cookbook recipe. But that's ok, let's just call it a tutorial instead. Umbrella sampling is a big enough topic to justify one.

I'm not sure the dist_measurer force is a good idea. It computes the distances on every step, even when you don't need them. You set the energy expression to "0" so it won't affect the simulation, but the CustomCVForce still computes them before evaluating its energy expression. Of course, just computing two distances should take negligible time. But this is a really inefficient way of doing it. The CustomCVForce maintains its own internal Context. On each step it first copies over the current state from the main Context to the internal one, then does a separate energy evaluation for each CV. That's significant overhead, even if the actual CV computation is negligible. So you add significant cost to every step, just to have the distances available when you want them.

One solution is to put the CustomCVForce into its own force group, then call setIntegrationForceGroups() on the Integrator to prevent it from including that group. That will prevent the CustomCVForce from being evaluated except when you ask for it.

But all of this may be unnecessary complexity. What about adding getPositions=True to your getState() call, then computing the distances in Python? The code should be simpler and more obvious. This also has overhead, of course, but possibly not very much. It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

@sukritsingh
Copy link
Author

But that's ok, let's just call it a tutorial instead.

I think there's already an Umbrella sampling tutorial that goes into a lot more of the theoretical details of both the running and the analysis. I can reference it more directly if that's preferred but figured that this would just act as a "minimal code" example (which is what I thought the cookbook was). What would be the best way to resolve this then? Should I remove the umbrella sampling as a cookbook or make any other changes to make it more cookbook-like?

Of course, just computing two distances should take negligible time. But this is a really inefficient way of doing it. The CustomCVForce maintains its own internal Context. On each step it first copies over the current state from the main Context to the internal one, then does a separate energy evaluation for each CV.

Yea I was trying to figure out a way to do it while importing as few libraries as possible (maintaining the
"minimal" example idea), but I agree it's not the most efficient way to do it. One neat thing it offers is a way for a user to track the results of the simulation as it runs(which was my original objective).

One solution is to put the CustomCVForce into its own force group, then call setIntegrationForceGroups() on the Integrator to prevent it from including that group. That will prevent the CustomCVForce from being evaluated except when you ask for it.

I could try adding this into the notebook to help reduce computational cost! Would it just be as simple as setIntegrationForceGroup(integrator)?

What about adding getPositions=True to your getState() call, then computing the distances in Python? The code should be simpler and more obvious. This also has overhead, of course, but possibly not very much. It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

I think this is a great approach and was what I first thought of doing but
a) I wasn't sure if it would be more "minimal" than just having it print out the distances live (totally defer to you on that)
b) I think my philosophy here that I wanted to convey is that it's possible to track how much your simulation has sampled (in a relevant CV-space) as the simulation runs so you can know whether or not you've sampled your relevant states.

I think evaluating distances after the simulation as run (which would require a restart from the user) is just a different, but equally valid, approach to evaluating sampling quality.

I'm happy to include a separate code block to show how one would evaluate the distances after the simulation if you think that would be helpful!

@peastman
Copy link
Member

I think there's already an Umbrella sampling tutorial

Good point. Do you think this one adds anything beyond what's in the existing tutorial? I generally think of the cookbook as reference material for answering very specific questions that wouldn't justify a whole tutorial. Things like, "How do I change the temperature during a simulation?" or, "How do I decompose the total energy into contributions from different interactions?" The user already knows exactly what they want to do. They just want to find out what code to write to do it.

I think evaluating distances after the simulation as run (which would require a restart from the user) is just a different, but equally valid, approach to evaluating sampling quality.

I didn't mean to move the calculation to after the simulation is run. It would still be computed live, just by a different method. You would replace

d1,d2 = dist_measurer.getCollectiveVariableValues(simulation.context)

with

positions = state.getPositions(asNumpy=True)
d1 = norm(positions[d1_atom1_ind]-positions[d1_atom2_ind])
d2 = norm(positions[d2_atom1_ind]-positions[d2_atom2_ind])

@sukritsingh
Copy link
Author

Do you think this one adds anything beyond what's in the existing tutorial? I generally think of the cookbook as reference material for answering very specific questions that wouldn't justify a whole tutorial.

So I think mine is a lot more "straight to the point" as far as providing code to setup and run a single window for a user who knows Umbrella sampling but just wants the OpenMM syntax, if that makes sense? The other notebook has a lot more theoretical details and of running a full Umbrella sampling simulation and provides a lot of details on system setup and analysis. This notebook would be much more "Assuming you know what you want, here is the code to apply and run a single window."

I could link to specific codeblocks and/or include more details/example code on how to run using multiple windows?

Another option is I could remove this notebook but keep the additional other three notebooks (on accelerated MD, Gaussian accelerated MD, and Metadynamics)?

It would still be computed live, just by a different method.

Oh I see! Sorry I totally misinterpreted what you were meaning, but yes the way you described doing it is equally good! I hadn't done it this way because I avoided importing numpy in the notebook (and by norm I think you mean numpy.linalg.norm?) , but if you think that this way is "better" then I can do it this way instead!

@peastman
Copy link
Member

Actually norm() is openmm.unit.norm, which is already imported. It knows how to deal with positions that have units.

@sef43 do you have any thoughts on this, since you wrote the existing tutorial?

@sukritsingh
Copy link
Author

Actually norm() is openmm.unit.norm, which is already imported. It knows how to deal with positions that have units.

Oh! Phenomenal! I've gone ahead and updated the notebook to print it using norm and took out the whole dist_measurer thing entirely! This is a much simpler way of doing this for sure! Thanks for that pointer!

It depends how often you record the distances. In a real simulation would you actually do it every 10 steps, or would it probably be less frequent?

In a real simulation it would be much less frequent than recording it 10 steps, but the sample cookbook loop I have is only running a 100 step simulation anyways. I've made a point of noting that that the recording_frequency set here is super frequent and should be set at higher values for realistic systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants