SI2010 track4 CADD Pipeline Tutorials

From NBCRwiki

Jump to: navigation, search

Back to Track 4


  • Make sure you have the latest version of MGLTools (v 1.5.6 2010 Summer Institute Release)
  • Please download this CADD tarball which contains the workflow files, required inputs and example outputs for all three of the following tutorials.


Multi-Receptor Virtual Screening Tutorial

This tutorial describes how to use multiple receptors in a virtual screening workflow, implemented in the CADD pipeline. Use of multiple receptor structures in virtual screening experiments is sometimes called "ensemble docking," and there are a number of ways to select receptor structures. For example, multiple wild-type crystal structures can be used as they were here. Alternatively, wild-type and mutant crystal structures may be combined to search for ligands that interact well with each, circumventing resistance issues, as was done here. Another alternative makes use of multiple conformations generated during MD simulations, examples of which can be found here, here and here, see also here for a critical analysis. This tutorial focuses on utilizing conformations generated during MD simulations, and while this approach has a number of success stories (see references above for a handful of examples), you need not limit yourself to this method and are encouraged to think critically of the all the approaches.

Preparing the System

Process the PDBs in VMD

Today we will be working with the N1 neuraminidase protein. This first section will take place in a terminal on your desktop or laptop. To start, launch VMD. On your local terminal, type:


Load the structures into VMD (by now we assume you are familiar with VMD. If you need help, ask a proctor!)

  • In the VMD Main window, select File -> New Molecule
  • Download the proteins 2HU4 and 2HTY from the RCSB PDB databank: To do this, type 2HU4 into the Filename box and hit enter. Under Determine file type, it should say Web PDB Download. Click Load. Repeat the process with 2HTY.

Notice that the structures are tetramers. As the binding sites do not bridge across multiple domains (i.e. they only involve one subunit each), we only need to extract and work with one of the 8 chains resolved in the files.

  • In the graphics window, select "chain B" and set the display type to New Cartoon for both proteins. You may want to also change the coloring method to compare chains.
  • Make an additional selection for "resid 801" only for 2HU4 and set the display type to VDW

This shows you tamiflu (aka oseltamivir) bound in the active site of the N1 neuraminidase protein and also the apo (unbound) conformation. We want to make separate files for just the proteins. Although we can do this in many ways, we will do it here in VMD.

  • Open a TkCon window by selecting Extensions -> Tk Console
  • Navigate to the SI_multi/inputs directory
  • We will select chain B of the 2HU4 protein, write that selection to a PDB file, select tamiflu, and write that to a separate PDB:
set prot [atomselect 0 "chain B"]
$prot writepdb 2HU4_B.pdb
set tami [atomselect 0 "resid 801"]
$tami writepdb oseltamivir.pdb
  • We must do this again for the 2HTY protein, except this structure is apo. Type:
set prot2 [atomselect 1 "chain B"]
$prot2 writepdb 2HTY_B.pdb 

GPF/DPF templates

To run the screen, we need affinity and electrostatic grids for all the atom types in the screen set. To generate these files, we need to create templates for the GPF (grid parameter file) and DPF (docking parameter file) files that AutoGrid will use. Instead of actually creating the grid and docking parameter files, we will just provide templates that will be processed in the pipeline to create the files. These template files contain key modified portions of the full GPF and DPF files required to run AutoDock. If we fail to create these templates, AutoDock will not be able to perform the docking for the missing atoms types.


You will need to know the gridcenter and grid size of the grid box you wish to create. Typically, we change the gridcenter to the center of the active site and the npts (number of points) to something that you will know will encompass the entire active site. If you are doing this for the first time, it may be helpful to switch to the ADT GUI to confirm correct box placement, like you did in the AutoDock tutorials.

For this neuraminidase protein, the center of the active site is at {3.73 20.236 113.925} and the npts should be changed to {64 72 66} to be sure the grids will cover the entire binding site.

  • Open a text editor and copy & paste the following:
 npts 64 72 66
 gridcenter 3.73 20.236 113.925
  • Save this file as both 2HTY_B.gpf and 2HU4_B.gpf in the SI_multi/inputs directory.


This file contains all the different docking options utilized in AD4. A lot of these you can probably safely use without changing, but you will want to modify / optimize some parameters in the docking parameter file in order to validate the program for your target. A good first place to start is by altering the default values for ga_num_evals (total number of energy evaluations), ga_pop_size (population size), and ga_num_generations (number of generations). The default ga_num_evals is about 2 orders of magnitude too small for most ligands. In particular, when you perform a virtual screen, you need to set the ga_num_evals high enough that it allows all ligands in your screening set to converge. Typically this is 5-10 million.

To improve the clustering of the docking results, you may need to do system-specific parameter optimization for AutoDock. We don’t have time today to go through a full iteration of this.

Specifically for neuraminidase, we want the population size to be 200, 10 total dockings, 50000 energy evaluations and an RMS tolerance of 2. For your real research projects, you will probably want to set ga_run to 50 or 100 (in order to generate better statistics).

  • Open a text editor and copy & paste the following:
 ga_pop_size 200 
 ga_run 10                          
 ga_num_evals 50000 
 rmstol 2.0
  • Save this file as both 2HTY_B.dpf and 2HU4_B.dpf in SI_multi/inputs

Remember, these are just templates! They're not the actual grid and docking parameter files- those will be automatically made in the pipeline.

Running the Virtual Screen

Small molecule databases

In virtual screening, you will dock many ligands (from a library of small molecules) to a single receptor of choice. Ligand libraries can be prepared with a simple script but many come already prepared. There are a few well-curated, publicly available small molecule databases that can be used for a variety of virtual screen or docking applications.

The NCI diversity set is distributed from the National Cancer Institute and contains approximately 1900 chemically diverse compounds. It is a good place to start for efficient "search in the dark" type applications. The NCI makes these compounds (and those listed in the full NCI database) available for the experimentalists to test, so it is a good place to start. It has been processed for use in AutoDock4 and is available from NBCR for your docking experiments. The ZINC database is maintained by the Shoichet lab at UCSF and is an excellently curated very large small molecule database. A variety of file formats are maintained. See . The NBCR also houses an AutoDock4 curated version of one of the recent ZINC library distributions. The Available Chemicals Directory (ACD) and Hit2Lead are other popular small molecule databases.

We will perform a virtual screen of the crystal structure against a sample library containing 6 ligands. Note that we do not have the time today to use a larger library.

Running the Workflow

  • Open Vision 1.5.6 (part of the MGLTools 2010 SI Release).
  • Choose File -> Open to load the workflow file
  • Choose the file which is located in the SI_multi directory.

This opens up a network of nodes which are connected to each other. Running this pipeline will call a preset order of different applications and operations.

  • If you double click on certain nodes, they will lead you to another workflow that is contained within that node.
  • Mouse over any node or port to see specific parameters about that node or port.


The following is a description of the nodes in the network that you will need to change to specify parameters.

  • Inputs
    • GetStructuresFromDir will get a list of PDBs/PQRs/PDBQTs from the user as a list of input receptors and each one of the these receptor files will be later used to run virtual screening. This directory should also contain template GPFs and template DPFs for each receptor. The workflow assumes that the receptor PDB files and their corresponding template GPFs/DPFs have the same base filename. Browse for the SI_multi/inputs directory where you stored the PDB files and template GPFs and DPFs.
    • PublicServerLigandDB allows the user to choose a ligand library from our server. For neuraminidase, we will be using the sample library. Note that this node can be replaced by LocalLigandDirectory or UrlLigandDB.
    • PreserveCharges? is a checkButton node so that the charges are preserved for PrepareReceptor if checked. Make sure this is checked.
    • FilterLigandNode allows the user to filter the ligands. Advanced users may want to use a customized filter. The user can get the custom filter parameter panel pop up by right clicking on the FilterLigandNode and then select Parameter Panel. The user can then use the thumbwheels to adjust values for different parameters. Note that our server are unable to support ligand libraries that have more than 2500 ligands. This option will be left at the default for this tutorial. Here is an image of the custom filter parameter panel:
  • Other
    • Iterate will go through the PDB list and perform virtual screening on each receptor.
    • DownloadSaveDir downloads the virtual screening results to the parent directory that contains the input directory to the GetStructuresFromDir node. The output will be organized by receptor. There will be directories created that correspond to each individual receptor.
    • VirtualScreening Vs2.png is where all the information is sent and the screen is ran. Here's a break down of what is contained in that node:
      1. PrepareReceptor Pr1.png This node first runs the PDB2PQR web service, and then it runs the PrepareReceptor web service. If the directory that contains a .pqr or a .pdbqt file with the same base name, then PDB2PQR/PrepareReceptor will be skipped, respectively.
      • The receptor directory specified in the main workflow contains the input files. All of the PDBs in that directory are used as inputs here.
      • Prepared receptor objects
      • URL strings to the prepared receptor or a path to a pdbqt file on the local machine, if already exists.
      The user can double click on this macro PrepareReceptor to see how it is composed:
      The Pdb2pqrWS and PrepareReceptorWS nodes inside PrepareReceptor are also macro nodes composed of other nodes:
      1.1 Pdb2pqrWS Pdb1.png This node runs the PDB2PQR web service, which assigns the proper protonation states for the protein.
      Node composition:
      1.2 PrepareReceptorWS Prws1.png This node runs the PrepareReceptor web service, which prepares all the receptor pdbqt files that are required for AutoDock4.
      Node composition:
      2. ComputeGrids Cg1.png The ComputeGrids node first runs PrepareGPF, which prepares a GPF based on the ligand library atom types and then runs AutoGrid.
      • A LigandDB object, which may be an output from LocalLigandDirectory, PublicServerLigandDB , UrlLigandDB, or FilterLigandNode. In our case, this comes from the PublicServerLigandDB node, which was specified in the main workflow.
      • A prepared receptor object, which is the output from the previous step, PrepareReceptor.
      • A gpf_template object, which you created while setting up the system. This is specified by the GetStructuresFromDir input node from the main workflow.
      • AutoGrid results of the type autogrid_result
      • AutoGrid result as a string URL
      Here is the inside of the ComputeGrids macro:
      3. AutodockVS Ad1.png The AutodockVS node runs AutoDock based on the ligand library as a web service. The web service uses a ~200 processor processor cluster and executes AutoDock jobs in parallel.
      • LigandDB object, which is again specified from the PublicServerLigandDB node on the main workflow.
      • AutoGrid object, which is the output from the ComputeGrids step.
      • A dpf_template object, which was created when you set up the system.
      • URL to AutoDock virtual screening results. Note only the top 500 hits will be displayed by default.
      Here is the inside of the AutodockVS node:

Click the yellow lightening bolt button on the task bar to run the workflow. You know the workflow is running when nodes are outlined in bright red. When no nodes have the bright red border, the virtual screen is complete. For every receptor for which you run the virtual screen, a directory will be made with all the results for that particular receptor. In our case, these directories will appear in the SI_multi directory.


When the run is finished, there are several things we need to do in order to evaluate the results for each individual receptor:

  • Compare docking pose(s) to crystal structure
    • Read the DLG file into VMD as a PDB file
    • Read the PDB into VMD as a separate PDB file
    • How close does AutoDock get to finding the crystal binding pose?
  • Look at the clusters
    • They're listed at the top of the DLG file, depicted as a text-style histogram
    • How many clusters are there? Is the clustering good? The ideal case is that the lowest energy cluster is also the most populated. How did we do?
  • Look at the energy
    • Is AutoDock able to repeatedly find a low-energy docking pose?
    • What is the energy of the first cluster?
    • What is the energy of the biggest cluster?

Structure QR Tutorial

In the next two sections of the tutorial, we will explore two different ways to reduce the structural ensemble resulting from all-atom molecular dynamics (MD) simulations. The first method, which we will study here, is QR factorization, or structure QR. In the original RCS, the computational docking experiments were carried out using snapshots extracted at equal time intervals from the MD trajectories. As the simulations are carried out for several nanoseconds, this typically sums up to 10^4 to 10^5 receptor structures, many of which may be conformationally redundant. Siince it is computationally intractable to virtually screen every single structure produced by the MD simulations, we will use structure QR to distill the ensemble to an essential, representative set. Check out the following references for more information about the VMD tool MultiSeq and the QR factorization involved in evolutionary profiles. For this tutorial, we will be working in the SI_trajqr directory.

Structure QR

Preparing the Files

  • Open the N1.dcd file in VMD, which is located in the SI_trajqr/inputs directory
  • Load the N1.psf (protein structure file) into the DCD
  • Go to Extensions -> TkConsole. Make sure you are in the SI_trajqr/inputs directory. Type:
source extract-pdbs.tcl

This command will call the extract-pdbs.tcl file, which will extract the frame-by-frame PDBs from the trajectory needed for this tutorial.

Running the Workflow

  • Open Vision 1.5.6.
  • Choose File -> Open
  • Load, which is located in the SI_trajqr directory


  • Inputs
    • Directory_Containing_Input_PDB_Files is a node that specifies the input files. A snapshot of the protein was taken every 100 frames, and these are the PDB files from the MD. Choose the SI_trajqr/inputs directory.
    • RMSD provides the RMSD cutoff for the clustering simulation. We will be using a cutoff of 0.9 Angstroms.
    • Output_Directory_Path is where you enter the directory that the output frame PDBs from the simulation will be stored. Choose SI_trajqr/outputs.
    • DPFTemplateBrowser allows you to choose an existing DPF template file to be copied and renamed to be used in the virtual screen. The workflow automatically does this for every output frame PDB that is generated from the clustering. Here, choose the 2HTY_B.dpf file from the SI_multi/inputs directory. These template GPFs and DPFs will be put in the directory specified under Output_Directory_Path.
    • GPFTemplateBrowser is the same as the DPFTemplateBrowser, but allows users to choose a GPF template to be copied and renamed. Choose the 2HTY_B.gpf file created previously.
    • TrajQR Tqr1.png The TrajQR node is what runs the TrajQR program, which creates clusters of the PDB files based on the RMSD values specified in the main workflow. The program uses QR factorization to cluster the files. The inputs are from the main workflow. Here is the inside of the TrajQR node:
      Again, to run the workflow, click on the yellow lightening bolt on the taskbar. Wait until no nodes are outlined in bright red - this signifies that the job is done.

Virtual Screen

We will be using the same virtual screening workflow that we used in the multi-receptor tutorial ( Load this file into Vision 1.5.6.

  • Inputs
    • GetStructuresFromDir - browse for the SI_trajqr/outputs directory where the PDB files were stored from the TrajQR clustering. This time, the template GPFs and DPFs were automatically created from the QR clustering workflow.
    • PublicServerLigandDB - again, we will be using the sample library.
    • FilterLigandNode - leave this option at default
    • PreserverChargers? - make sure this is checked.

The directories created for each receptor after the virtual screen is finished will appear in the SI_trajqr directory.

RMSD Clustering Tutorial

In this tutorial, we will use a different method to distill the set of receptors to screen down to a manageable size. Specifically, we will use an RMSD-based clustering algorithm developed within Gromos. More information on this method can be found in articles here, here, here and here. This algorithm aligns all of the receptor snapshots extracted from the molecular dynamics simulation(s) by their alpha carbons, then computes a pairwise RMSD distance matrix for a subset of residues lining the binding site (in actuality you can choose any atoms you wish to compute the RMSD distance matrix). Afterwards, the program clusters the structures according to a user-defined cutoff, and we choose the "central member" structure, i.e. the structure that has the lowest RMSD to all the other structures within a particular cluster, to represent each cluster. This method also gives us population information about the structures. Commonly, we will choose the top few most dominant structures to perform virtual screens. An example of this application can be found here. Often, these structures exhibit differences in the binding site configurations that are advantageous to consider in virtual screening.

We will be working in the SI_gromos directory for this last tutorial.

RMSD Clustering

For our purposes, the MD simulation has already been ran. The prepared parameter file and output trajectory file are used in clustering.

Preparing the Files

Trajectory File

Gromacs does not read in MDCRD or DCD files, but multi-frame PDB files. We start by converting our trajectory into the appropriate PDB format:

  • Open the SI_gromos/inputs/N1.dcd file in VMD. Next, load the N1.psf file into the DCD.
  • In the VMD Main window, click on Extensions -> Analysis -> RMSD Trajectory Tool.

The large text box initially contains the selection “protein.” Keep this selection. It is typical to align the trajectory by certain atoms and then to do the clustering by another atom set. In our case, we will first align the trajectory by protein, and then cluster based on the positions of the residue atoms lining the active site.

  • Click on the "Align" button.

Your trajectory has now been aligned.

  • Right click on the trajectory name in the VMD Main window.
  • Select “Save Coordinates." In the “Selected Atoms” field, type:

Click on the “Save” button and save the file as trajectory.pdb in the SI_gromos/inputs directory. You've successfully prepared your PDB trajectory file.

Active Site Residue File

Typically when clustering protein trajectories for drug design purposes, you want to know about the various configurations of the protein active site. Thus, the residues that line the active site must be considered in clustering.

  • While the PSF and DCD are still loaded in VMD, right click on the protein name in the VMD Main window.
  • Select “Save Coordinates." In the “Selected Atoms” field, type:
resid 36 to 38 52 to 57 65 to 71 75 98 99 115 to 119 142 to 147 162 to 166 196 197 212 214 263 to 266 286 320 321 345 to 360
  • In the Frames section, set First and Last to 0, and Stride to 1.
  • Click on the “Save” button and save the PDB file as active_site.pdb also in the SI_gromos/inputs directory.

These steps save the residues that line the active site in the first frame of the trajectory to a PDB file named active_site.pdb.

Running the Workflow

  • Open Vision 1.5.6.
  • Choose File -> Open to load the workflow file which is located in the SI_gromos directory. This is what the workflow should look like:


  • Inputs
    • TrajectoryPDBFile specifies the PDB file that will be used as the trajectory. Choose trajectory.pdb (in SI_gromos/inputs)
    • ActiveResidueFile is a PDB that contains the information from active site residues. Choose active_site.pdb which is also located in the inputs directory.
    • RMSD is the cutoff used to cluster the trajectory frames. Enter 0.11 nanometers for this parameter.
    • OutputDirectory is the node that specifies where the output PDB files will be stored. Choose outputs which is located in the SI_gromos directory.
    • GPF/DPFTemplate Browser is the same for the QR. Use the same files, 2HTY_B.gpf and 2HTY_B.dpf.
    • GromosCluster Gromos3.png This GromosCluster node is what runs the GROMOS program, which will cluster the protein trajectories based on RMSD. All the inputs are specified from the main workflow. The following is the inside of the GromosCluster node:

Virtual Screen

We will be using the same virtual screening workflow that we used in the multi-receptor and QR clustering tutorials ( Load this file into Vision 1.5.6.

  • Inputs
    • GetStructuresFromDir - SI_gromos/outputs directory where the output PDB files were stored from the Gromos clustering. Again, the template GPFs and DPFs were automatically made in the Gromos clustering workflow.
    • PublicServerLigandDB - sample library.
    • FilterLigandNode - default
    • PreserverChargers? - checked.

Back to Track 4

Personal tools