SI2011 track 3 CADD clustering tutorial

From NBCRwiki

Jump to: navigation, search

Back to Track 3



On Monday, you learned how to run MD simulations to generate a ensemble of conformations. Here, we provide a short, 15 frame trajectory of the the ATP-dependent RNA editing ligase from Trypanosoma brucei and use it to explore RMSD clustering, a technique to reduce the size of the conformational ensemble. Typically, MD simulations run for several tens to several hundreds of nanoseconds, resulting in ensembles of thousands of conformations, many of which may be redundant. In practice, it's too computationally expensive to screen a virtual compound library against all of these conformations. RMSD clustering is a numerical method that attempts to circumvent this problem by distilling the conformational ensemble to a set of non-redundant, or representative conformations. While the tutorial focuses mainly on how to perform RMSD clustering, links to references describing its application to various biomolecular systems are provided.

Please download, which contains required inputs and example outputs.

RMSD Clustering

In this tutorial, we'll explore a RMSD-based clustering algorithm developed within Gromos. The algorithm proceeds in four steps. First, a least-squares alignment is performed between all unique conformational pairs, removing external translation and rotation. Second, RMSD values are calculated between all unique conformational pairs. RMSD means "Root Mean Square Deviation" and is a single number giving a measure of the spatial displacement of one conformation relative to another: the bigger the RMSD value, the more significantly the two conformations differ (see here for more detail). Third, each conformation is assigned neighbors. The number of neighbors is equal to the number of conformations whose RMSD is below some user-defined cutoff threshold. Fourth, clusters with decreasing conformer membership are generated by successively removing the conformation with the largest number of neighbors and all of its neighbors. The conformation with the largest number of neighbors is the central member of the cluster, often called the "cluster centroid," or just the "centroid."

That may have been somewhat confusing. The take home message is this:

  • A set of clusters is generated
  • Each cluster has fewer members than the last. i.e. the first cluster has the largest number of conformations, the second cluster the next largest, and so on.
  • Each cluster has a central member, or centroid, whose properties we assume represent all other cluster members.
    • Typically, in a virtual screening workflow utilizing RMSD clustering, we'll dock a compound library against each cluster centroid.

If you're interested in pondering the clustering algorithm and its application to conformational analysis, a good place to start is here, and here. Additionally, an example of integrating RMSD-based clustering into virtual screening can be found here and here.

Preparing the Files

RMSD clustering will be performed using the CADD pipeline. To perform the clustering, CADD pipeline calls g_cluster, a program within a suite of molecular dynamics simulation and analysis programs called Gromos.

Trajectory File

Gromos does not read in MDCRD or DCD files, but multi-frame PDB files. We start by converting our trajectory into the appropriate PDB format:

  • Open the Trajectory/TbREL1.psf file in VMD. Next, load the Trajectory/TbREL1.DCD file into the psf.

During a MD simulation, the enzyme translates and rotates. We need to remove these "external" motions so that the "internal" conformational fluctuations can be characterized.

  • In the VMD Main window, click on Extensions -> Analysis -> RMSD Trajectory Tool.

The large text box initially contains the selection “protein.” Keep this selection. Typically, we'll align the trajectory using either the protein, the alpha-carbon atoms, or the protein backbone, and then cluster using some other set of atoms. In this case, we'll align the trajectory using the entire protein.

  • Click on the "Align" button.

Your trajectory has now been aligned.

  • Right click on the trajectory name in the VMD Main window.
  • For the purposes of a virtual screen, we only need the protein, so select File -> Save Coordinates, and in the “Selected Atoms” field of the Save Trajectory window that opens, type:

Click on the “Save” button and save the file as trajectory.pdb in the SI_RMSD/Input directory. You've successfully prepared your PDB trajectory file.

Active Site Residue File

Now, we'll prepare a file called active_site.pdb that specifies those residues whose conformations should be clustered. During the drug discovery process, we're typically interested in finding a small molecule that binds more tightly to the enzyme active site than the native substrate does. So, we'll include only the active-site residues in active_site.pdb.

  • While the PSF and DCD are still loaded in VMD, right click on the protein name in the VMD Main window.
  • Select “Save Coordinates." In the “Selected Atoms” field, type:
protein and resid 87 to 90 155 to 162 207 to 209 283 to 287 305 to 308
  • In the Frames section, set First and Last to 0, and Stride to 1.
  • Click on the “Save” button and save the PDB file as active_site.pdb also in the SI_RMSD/Input directory.

You've just saved the coordinates of the residues that line the active site in the first frame of the trajectory to a PDB file named active_site.pdb.

Template GPF and DPF Files

Ultimately, we'll be using the conformations that we generate today to perform a virtual screen tomorrow. To perform the screen, affinity and electrostatic grids for all the atom types in the screen set, for each receptor conformation, are needed. Instead of actually creating the grid and docking parameter files, we just provide templates that will be processed in the pipeline to create the files. These template files contain key modified portions of the full GPF and DPF files required to run AutoDock. If we fail to create these templates, AutoDock will not be able to perform the docking.


You will need to know the gridcenter and size of the grid box you wish to create. Typically, we set the grid center to coincide with the center of the active site and the npts (number of points) to something that you will know will encompass the entire active site. If you are doing this for the first time, it's useful to switch to the ADT GUI to confirm correct box placement and size. If you have a native substrate, and you're developing a competitive inhibitor, it's useful to position the box on the substrate center. For example, here's an image of the TbREL1 grid box centered on ATP,


Note that the grid box is significantly bigger than ATP to accommodate larger compounds. The center of the grid box is at {40.34, 18.53, 13.02}, and with a 0.375 angstrom spacing, the npts in the x, y and z directions is {86 72 78}, respectively. We need to put this information into a template GPF file.

  • Open a text editor and type in:
 npts 86 72 78
 gridcenter 40.34 18.53 13.02
  • Now, save the file as TbREL1.gpf in the SI_RMSD/Input directory.


This file contains all the different docking options utilized in AD4. A lot of these you can probably safely use without changing, but you will want to modify / optimize some parameters in the docking parameter file in order to validate the program for your target. A good first place to start is by altering the default values for ga_num_evals (total number of energy evaluations), ga_pop_size (population size), and ga_num_generations (number of generations). The default ga_num_evals is about 2 orders of magnitude too small for most ligands. In particular, when you perform a virtual screen, you need to set the ga_num_evals high enough that it allows all ligands in your screening set to converge. Typically this is 5-10 million.

To improve the clustering of the docking results, you may need to do system-specific parameter optimization for AutoDock. We won’t have time to go through a full iteration of this.

For this example excercise, we want the population size to be 200, 10 total dockings, 50000 energy evaluations and an RMS tolerance of 2. For your real research projects, you will probably want to set ga_run to 50 or 100 (in order to generate better statistics).

  • Open a text editor and type in the following:
 ga_pop_size 200 
 ga_run 10                          
 ga_num_evals 50000 
 rmstol 2.0
  • Save this file in the SI_RMSD/Input directory, calling it TbREL1.dpf. You should now have TbREL1.dpf and TbREL1.gpf in the /SI_RMSD/Input directory.

Remember, these are just templates! They're not the actual grid and docking parameter files- those will be automatically made in the pipeline when we run the multi-receptor virtual screen tomorrow. On a related note, if you'd like more details about running AD4 and haven't done so already, check out the tutorials, which can be found here.

Running the Workflow

Now that we have all of the required input files prepared, we can perform the clustering.

  • Open the CADD pipeline.
  • Choose MD analysis -> Load workflow -> GromosClustering to open the clustering workflow. This is roughly what the workflow should look like:


  • Input
    • TrajectoryPDBFile specifies the PDB file that will be used as the trajectory. Choose the trajectory.pdb file that you generated and saved in SI_RMSD/Input
    • ActiveResidueFile is a PDB that contains the information from active site residues. Choose active_site.pdb that you generated and saved in SI_RMSD/Input directory.
    • RMSD is the cutoff used to cluster the trajectory frames. Enter 0.08 nanometers for this parameter. The number of structures that the clustering algorithm returns is sensitive to the RMSD value. As a rule of thumb, the larger the value, the fewer the structures.
    • OutputDirectory. Running the workflow generates PDB files of the cluster centroids. OutputDirectory specifies where these PDB files will be stored. Choose Output which is located in the SI_RMSD directory.
    • GPF/DPFTemplate Browser are the locations of the GPF and DPF files that we prepared. For the GPF file, chose SI_RMSD/Input/TbREL1.gpf. For the DPF file, chose SI_RMSD/Input/TbREL1.dpf.


Gromos1.png This GromosCluster node runs the GROMOS g_cluster program that performs the RMSD clustering. All the inputs are specified from the main workflow. The following is the inside of the GromosCluster node, which you can see for yourself by double clicking on the node. You can move your mouse pointer over the nodes for additionall information


To run the RMSD clustering, click on either the green lightning bolt or the yellow lightning bolt. The yellow lightning bolt is a "hard" run and will overwrite any preexisting files if you preformed the calculation before. The green lightning bolt is a "soft" run and will not overwrite preexisting files. After the GromosCluster node is finished running, there should be 8 files in the /SI_RMSD/Output directory

  • 2 PDB files 1frame_8.pdb and 2frame_3.pdb
  • 2 DPF files 1frame_8.dpf and 2frame_3.dpf
  • 2 GPF files 1frame_8.gpf and 2frame_3.gpf
  • 2 "book keeping" files, index.html and robots.txt. index.html lists the cluster centroids, which you can verify by pointing your browser to it, and robots.txt is blank. Neither of these files is important to us.

Note that the PDB, GPF, and DPF files share the same two basenames and that each basename has 2 numbers. The first number designates the cluster number while the second indicates the trajectory frame number. So, for example, "1frame_8" indicates that the centroid of the first cluster is frame 8 from the trajectory.pdb file.

Back to Track 3

Personal tools