SI2008 Relaxed Complex Scheme general

From NBCRwiki

Jump to: navigation, search

Back to Track3


Virtual Screening and RCS Rescoring

If you are doing this tutorial on your own, please download /data/archives/si/training/2008/track3/SI2008.tar.gz copy of the files.

We have written this tutorial assuming that you are familiar with VMD, AutoDock, and basic Unix commands.

We advise that you at least work through the VMD and AutoDock4 tutorials before attempting this experiment.

Users will need to have access to the web.

Users will also need to have VMD, AutoDock, and MGL Tools installed.

All the created outputs can be downloaded as output-tarball.

Preparing the system for the virtual screen

Get PDB from pdb databank

Today we will be working with the N1 neuraminidase protein. This first section will take place in a terminal on your desktop or laptop.

Download 2HU4 from the RCSB website, or you can do it with the following command:

wget -O 2HU4.pdb

Process in VMD, extract protein, extract tamiflu

Launch VMD. On your local terminal, type:


Load the 2HU4.pdb structure into VMD (by now we assume you are familiar with VMD. If you need help, ask a proctor!)

In the VMD Main window, select File --> New Molecule
Browse for 2HU4.pdb

Notice that the structure is a tetramer. As the binding site does not bridge across multiple domains (i.e. it only involves one subunit) we only need to extract and work with one of the 8 chains resolved in the file.

In the graphics window, select "'chain B"' and set the display type to New Cartoon
Make an additional selection for "'resid 801'" and set the display type to VDW

This shows you tamiflu (aka oseltamivir) bound in the active site of the N1 neuraminidase. We want to make separate files for the protein and the ligand. Although we can do this in many ways, we will do it here in VMD.

Open a Tkcon window by selecting Extensions --> Tk Console
At the prompt make the protein selection by typing: set prot [atomselect top "chain B"]
Then write the selection to a pdb file by typing: $prot writepdb 2HU4_B.pdb
Now make another selection for tamiflu by typing: set tami [atomselect top "resid 801"]
Then write the selection to another pdb file: $tami writepdb oseltamivir.pdb

Process/assign protonation states at pH 6.5 using PDB2PQR website

Before we try to redock the tamiflu ligand into the N1 crystal structure, we need to assign the proper protonation states for the protein. We want the protein to be protonated according to pH 6.5, which is the pH at which the experimentalists will run the inhibition assays. To do this, we will use a tool called PDB2PQR.

Go to
Scroll down and select Upload a PDB, then browse to select the 2HU4_B.pdb file you created above.
Select the Amber force field
Select Amber for the output naming scheme
For the available options, be sure the top three are selected.
In the box for PROPKA, type 6.5:
x Use PROPKA to assign protonation states at pH 6.5
Click Submit.

This job will take a couple of minutes.

Wait for the file to be processed and when it's finished, click on the first link, the file with the .pqr extension. Open and look at this file in a text editor. Notice that it has comments at the top. Scroll down and see the protonation states assigned to the various histidines and other residues. When you are finished checking it out, rename the file as 2HU4_pH65.pdb. You can do that in your text editor or with a simple "mv" on the command line (if you are unsure how to do this, ask your neighbor or a lab proctor).

Getting ready to run the tutorial

Grab the tarball at the top of the page, and unzip it:

tar xzvf SI2008.tar.gz

Now you need to move into the directory you just unzipped:

cd SI2008-day2

Process protein for AD4

First, we will prepare the receptor PDB file. Note that this file is the same as the one you just processed with PDB2PQR.

In your working directory either

copy '' from "MGLTools/MGLToolsPckgs/AutoDockTools/Utilities24" and execute
./ -r 2HU4_pH65.pdb -o 2hu4_ph65.pdbqt


use a full path to the directory where MGLTools is installed locally and execute
<local_install>/MGLTools/MGLToolsPckgs/AutoDockTools/Utilities24/ -r 2HU4_pH65.pdb -o 2hu4_ph65.pdbqt

Process ligand for AD4

To prepare the ligand file for AD4 we need to process with the prepare_ligand script in AutoDock. This script will merge the non-polar hydrogens so the ligand can be correctly docked in AD4. (Note: the -A option adds hydrogens using Babel, internal to AD; of course you can always use an external program to add hydrogens and in any case should check that Babel does it correctly.) Sometimes the automated programs do not do exactly what you want them to do, and in those cases, you will need to use other programs or manually edit your ligand file. For example, PRODRG can be used to add polar hydrogens.

Again, in your working directory either copy '' from "MGLTools/MGLToolsPckgs/AutoDockTools/Utilities24", or make sure your path includes 'local_install', the directory where MGLTools is installed locally. We assume below that you have copied it locally.

./ -l oseltamivir.pdb -o oseltamivir.pdbqt

Check the output. Look where AD4 put hydrogens. Are all hydrogens all present? Why or why not?

=Make the grid and docking parameter files (.gpf, .dpf)/span>

Next it’s time to prepare the grid and docking parameter files for AD4. First, for this receptor, you must prepare a grid parameter file (GPF).

Here you will need to know the gridcenter and grid size of the grid box you wish to create. Typically, we change the gridcenter to the center of the active site and the npts (number of points) to something that you will know will encompass the entire active site. If you are doing this for the first time, it may be helpful to switch to the ADT GUI to confirm correct box placement, like you did in the AutoDock tutorials. If you already know the parameters you wish to use, you may enter them in command line.

Again, in your working directory either copy '' from "MGLTools/MGLToolsPckgs/AutoDockTools/Utilities24", or make sure your path includes 'local_install', the directory where MGLTools is installed locally. We assume below that you have copied it locally.

For this neuraminidase protein, the center of the active site is at {3.73 20.236 113.925} and the npts should be changed to {64 72 66} to be sure the grids will cover the entire binding site. You can also modify the .gpf file as it is created, by using the following command-line options:

./ -l oseltamivir.pdbqt -r 2hu4_ph65.pdbqt -p gridcenter='3.73,20.236,113.925' -p npts='64,72,66' -o 2hu4_ph65.gpf

Double check the 2hu4_ph65.gpf file to be sure that the correct grid center coordinates and number of points (npts) are input correctly. If not input correctly, edit these points in by hand. Now we need to prepare the docking parameter file (DPF).

This file contains all the different docking options utilized in AD4. A lot of these you can probably safely use without changing, but you will want to modify / optimize some parameters in the docking parameter file in order to validate the program for your target. A good first place to start is by altering the default values for ga_num_evals (total number of energy evaluations), ga_pop_size (population size), and ga_num_generations (number of generations). The default ga_num_evals is about 2 orders of magnitude too small for most ligands. In particular, when you perform a virtual screen, you need to set the ga_num_evals high enough that it allows all ligands in your screening set to converge. Typically this is 5-10 million.

To improve the clustering of the docking results, you may need to do system-specific parameter optimization for AutoDock. We don’t have time today to go through a full iteration of this. Here we give you the optimized parameters for neuraminidase (a discussion of which was presented in the lecture). Note that in this example we will set ga_run to 15, which is the number of dockings we will actually perform. For your real research projects, you will probably want to set ga_run to 50 or 100 (in order to generate better statistics).

Again, in your working directory either copy '' from "MGLTools/MGLToolsPckgs/AutoDockTools/Utilities24", or make sure your path includes 'local_install', the directory where MGLTools is installed locally. We assume below that you have copied it locally.

For this docking, change ga_pop to 200, ga_num_evals to 3000000, and rms_tol to 2.0. We can do this in the command line as follows:

./ -l oseltamivir.pdbqt -r 2hu4_ph65.pdbqt -p ga_pop_size=200 -p ga_num_evals=3000000 -p rmstol=2.0 -o oseltamivir-2hu4_ph65.dpf

Control case: redocking oseltamivir into the N1 crystal

Now it’s finally time to run the docking! To do this, first we have to run AutoGrid, which will make all the atomic affinity and electrostatic grids for the receptor and all the different atoms types in your ligand. After the grids are generated, we call run AutoDock.

The .glg and .dlg are the grid log file and docking log file, respectively.

To run AutoDock and AutoGrid, you will need to know the path to these programs on your machine. An example may be:

/home/install/usr/apps/autodock4/bin/autogrid4 -p 2hu4_ph65.gpf -l 2hu4_ph65.glg
/home/install/usr/apps/autodock4/bin/autodock4 -p oseltamivir-2hu4_ph65.dpf -l oseltamivir-2hu4_ph65.dlg

Where you will need to replace "/home/install/usr/apps/autodock4/bin/" with the actual path to autogrid and autodock on your own machine.

The docking job will take approximately 15 minutes, depending on your machine.

In the meantime, get up and stretch your legs or take a coffee break!

When the run is finished, there are several things we need to do in order to evaluate the results. We suggest you do the following (note that you can also do a nice analysis in AutoDockTools):

  1. Compare docking pose(s) to crystal structure
    1. Read the oseltamivir-2hu4_ph65.dlg file into VMD as a PDB file
    2. Read the 2HU4_B.pdb into VMD as a separate PDB file
    3. Read the oseltamivir.pdb file into VMD
    4. How close does AutoDock get to finding the crystal binding pose?
  2. Look at the clusters
    1. Listed at the end of the .dlg file, depicted as a text-style histogram
    2. How many clusters are there? Is the clustering good? The ideal case is that the lowest energy cluster is also the most populated. How did we do?
  3. Look at the energy
    1. Is AutoDock able to repeatedly find a low-energy docking pose?
    2. What is the energy of the first cluster?
    3. What is the energy of the biggest cluster?
  4. Interpret: All the results of (a) – (c) need to be taken into account in the final interpretation and evaluation of the docking results. If the results are good, you can proceed to the virtual screen, feeling confident that AutoDock can work for your system. If the results are not good (i.e. poor clustering, high energies, large deviation from the crystal pose), you should consider modifying the AutoDock docking parameters.
  5. For the control case, are our parameters good?

Performing the virtual screen

Now that we are sure we have a good set of AutoDock parameters, we will start the virtual screening process. We will perform a virtual screen of the crystal structure against the NCI diversity set (NCIDS), which has about 2000 compounds. These compounds are typically available for free by the National Cancer Institute when investigating HIV, cancer, or any related diseases. It is a reduced and representative set of full NCI database, which contains approximately 250,000 compounds.

We have a version of the NCIDS preprocessed on kryptonite for use in AutoDock4. It takes many hours of computing time to screen even this reduced database with AD4, but a lot less time if you're using AutoDock VINA. Because our time today is limited, we will try to screen a small subset of this database.

Small molecule databases

In virtual screening, you will dock many ligands (from a library of small molecules) to a single receptor of choice. You will still need to prepare the receptor as previously shown. Ligand libraries can be prepared with a simple script but many come already prepared. There are a few well-curated, publicly available small molecule databases that can be used for a variety of virtual screen or docking applications.

The NCI diversity set 2 is distributed from the National Cancer Institute and contains approximately 1900 chemically diverse compounds. It is a good place to start for efficient "search in the dark" type applications. The NCI makes these compounds (and those listed in the full NCI database) available for the experimentalists to test, so it is a good place to start. It has been processed for use in AutoDock4 and is available from NBCR for your docking experiments. The ZINC database is maintained by the Shoichet lab at UCSF and is an excellently curated very large small molecule database. A variety of file formats are maintained. See . The NBCR also houses an AutoDock4 curated version of one of the recent ZINC library distributions. The Available Chemicals Directory (ACD) and Hit2Lead are other popular small molecule databases.

Prepare the grids for the ligand set

To run the screen, we need affinity and electrostatic grids for all the atom types in the screen set. To generate these files, we need to rerun AutoGrid with all the ligand atoms types contained in the screen set. If we fail to do this, AutoDock will not be able to perform the docking for the missing atoms types. To make all the necessary grid files, we need to modify the .gpf file that we generated above.

We will now make a new grid parameter file for the screen (be sure that you copy the entire command, which is quite long!):

./ -l oseltamivir.pdbqt -r 2hu4_ph65.pdbqt -p gridcenter="3.73 20.236 113.925" -p npts="64 72 66" -p ligand_types="A C HD N NA OA SA Fe F S P Mg Mn I Zn Br Cl" -o 2hu4_ph65-screen.gpf

Now, we need to rerun AutoGrid to generate all the new map files:

/home/install/usr/apps/autodock4/bin/autogrid4 -p 2hu4_ph65-screen.gpf -l 2hu4_ph65-screen.glg

Again, you need to insert the correct path to autogrid4 on your own machine.

It will take several minutes to generate these new files, depending on your machine.

Setting up the screening directories

Now, we will set up a directory of directories that will contain one ligand docking to the crystal structure (for the purpose of this tutorial, each person will generate 16 directories, one docking per directory). You don't have to set up the directory structure in this way, but we have found that it is easiest to keep things organized if we keep the "one-directory-one-docking" file structure. Typical screens will have thousands of ligands, and you will need to keep things very organized so as not to get lost in the data.

The script to prepare the VS docking directories is called "prepare_dpfs_set1.csh" and looks like this:


mkdir 2hu4_screen
cd 2hu4_screen

foreach f (`ls your_path/SI2008-day2/Set1/*.pdbqt`)
set name = `basename $f .pdbqt`
echo $name
mkdir "$name"_2hu4
cd "$name"_2hu4
cp "$f" .
ln -s ../../2hu4_ph65.pdbqt .
ln -s ../../2*map* .
ln -s ../../2hu4_ph65-screen.gpf . -l  `basename $f` -r 2hu4_ph65.pdbqt -p ga_pop_size=200 -p ga_num_evals=2500000 -p ga_run=10 -p rmstol=2.0
cd ..

We have already put a copy of this script in your directory, but you will need to edit the part that says "your_path". You can run the script by typing

source prepare_dpfs_set1.csh

Note that as each directory is created, it will list the associated ligand on the screen.

When the script is finished, you will see that it has created a new directory in your working directory, called 2hu4_screen.

Submitting the virtual screen

Now you will see that you have created a nested set of directories in one directory called 2hu4_screen, and you are ready to submit your screen. Unfortunately running virtual screens takes some time, especially using the AutoDock4 parameters that we have specified. It is best to run such screens on clusters, where multiple processors can be utilized. We have included an example screen submission script for a cluster in the directory, called "submit-screen-SI.csh". If you had access to a cluster, you could submit it using:

qsub submit-screen-SI.csh

To check on the status of your job, you can type either of the qstat commands listed here:

qstat –u username

The job will take ~15 minutes to run on a cluster. You could also run it on your machine locally, but it will take much longer, and you will likely need to edit the job submission script. For today, since we are pressed for time, we have also supplied you with the already completed output files in the output-tarball at the top of the page.

Run the scripts to extract the energies

Now that the virtual screen jobs have completed, we have a lot of data to look through! The easiest way to sort through all the data is with scripts that sort the results by lowest energy of the highest populated cluster for each ligand.

We have created some scripts to help you analyze the results of the screen. When you are ready to run it, type:

source summary-script.csh

This creates a new file called receptor.summary.txt, which summarizes the docking results for each ligand.

Now we will sort the file by binding energy results:

cat receptor.summary.txt | sort -k11g -t, > receptor.summary.sort.txt

Analyzing the results of the virtual screen

You can look at the results of your virtual screen by opening the receptor.summary.sort.csv output file in any text editor. Do this (look at the results) and report the top binders for your set, which can be found ranked from best to worst in the receptor.summary.sort.csv file.

Try loading the structures of the top 2-3 binders into VMD (load the .dlg output file into VMD as a PDB file, each cluster is represented as a frame that you can step through with the arrow buttons in the VMD Main window). Do you see any interesting features? Did you find any promising candidates?

On running additional virtual screens with MD-generated protein structures

One aspect of the relaxed complex scheme is to use the structures generated in the MD simulations in new virtual screens. This could, for example, allow you to assess the binding of compounds to newly revealed binding pockets found through the simulations.

You can perform additional virtual screens with any number of protein structures, so long as you have the structure in PDB format, using the steps outlined above. In this case we walked you through a screen of the crystal structure, but you could have also used one of the most dominant structures that you generated from the MD-tutorial yesterday.

Relaxed Complex Scheme Rescoring

In this section, we will walk you through a relaxed complex scheme (RCS) rescoring. The general idea behind the technique is to take advantage of the new structural information provided by molecular dynamics simulations in the binding energy estimates that we use for ranking the ligands. Instead of docking one ligand into one protein structure, we will dock one ligand into many structures extracted from the MD. Therefore, instead of one single binding energy estimate, you will actually compute many binding energy estimates, resulting in a "binding spectrum" for each compound. We can then take a weighted average of the spectrum values in order to determine a new mean binding energy estimate that accounts for receptor flexibility.

We have prepared a separate tutorial on setting up molecular dynamics simulations. Clustering the structures extracted from MD is an efficient way to reduce the number of structures we use for docking, without losing too much important structural information. Today we will use clustered structures from a thorough analysis of the neuraminidase simulations that we have previously performed. This RMSD-based clustering approach was performed with GROMOS and various validation protocols were employed. It has been fully described in a recent paper.

As you work through this part of the tutorial, you will find that it is very similar to the previous section. The only difference is that we will use scripts to automate the process of doing dockings to each receptor structure.

Preparing the receptor files

The aligned clustered pdbs are in a directory called receptor_pdbs. The RMSD-based clustering analysis that we performed for the neuraminidase simulations (with inhibitor bound) reduced the initial set of 40,000 structures to 28! We can further reduce this number by only redocking into 90% of the ensemble structures. In the case of these simulations, 90% of the holo ensemble can be represented by only 5 structures.

Take a look at the structures in VMD. How similar are they? Where are they different?

These 5 protein structures are what we will redock into for the RCS application. This will save us considerable time and resources for the relaxed complex scheme redocking.

Now we must prepare each of these pdb's for docking in AutoDock4.

source receptor-prep-all.csh

This last command will prepare all the receptor pdbqt files that are required for AutoDock4. You will see that as it runs it outputs which receptor file it is working on to the screen. When it is finished, it will return the prompt. Note that it created a new directory called receptor_pdbqts in your working directory.

Preparing the ligand files

The ligands we are interested to dock into the MD ensemble will also need to be prepared for use in AutoDock4. Be sure you have your ligand(s) in the directory "ligands" that you want to dock. Since our time is limited today, we will only do the relaxed complex scheme rescoring on oseltamivir. So if you list the contents of the "ligands" directory, you will see the oseltamivir.pdbqt file. The preparation of the ligand was the same as before, using the script that is distributed with AutoDock (since we've already done this earlier today, we will reuse this file).

Please note that no matter how many ligands you would like to rescore with the relaxed complex scheme, you would simply add the prepared ligand.pdbqt files into the "ligand" directory. You could have any number of ligands in the directory and the scripts below will process them.

Creating the grid and docking parameter files

Now we need to create the grid and docking parameter files (the .gpf and .dpf files). To do this, we will use another script. As usual, with this script, it is important to note the following if you are trying this at home:

  • change the grid center to the appropriate place (make sure all your receptor structures are aligned!)
  • change the number of grid points (npts) to the desired values
  • change all the paths so they point to the right places
  • consider adding a variable or two to allow the automated file creation for a directory of ligands  :-)

We have already changed these key parameters within our prepare_rcdocking_gpf_dpf.csh script. So all we have to do is run it now:

source prepare_rcdocking_gpf_dpf.csh

As this script runs, it will output which receptor file it is working on to the screen. When it is finished, it will return the prompt.

Running the RCS docking experiment

You are now ready to perform the rescoring. You will need to run autogrid for each receptor structure, and then autodock. You can do this using the submission script for the kryptonite cluster here at NBCR:

qsub submit-rc-n1.csh

This job will take about 15 minutes to run. Time to stretch and ask questions!

Analyzing the results from the RCS docking experiment

We have created scripts that will extract the predicted binding energies from the AutoDock docking log file. There are several ways to do the analysis. For example, you can extract the lowest binding energy found overall, or use the binding energy associated with the most populated cluster. Based on a handful of recent studies, we now think that using the most populated cluster is the best idea, however, success has been demonstrated using the overall lowest energy as well (regardless of cluster size).

To run the script, type:

qsub submit_summary-rc.sge

This creates a file called summary_osel.csv, which has extracted the lowest binding energy in the most populated cluster. Now we can take a weighted mean of these values, according to the population of each of the cluster structures.

You can do this in Excel, Matlab, or any other number of programs. We know the cluster size from our clustering analysis, and from that we can determine the weight we will apply. Below the cluster populations are listed next to the cluster number. You can see that by using the top 5 clusters, we are able to account for over 90% of the receptor ensemble that was sampled through the MD simulations.

Cluster     % Population
n1ho_cl1      37.8
n1ho_cl2      22.1
n1ho_cl3      14.5
n1ho_cl4       9.0

How would you calculate a weighted average? Think about it. Is your weighted average lower than, higher than, or the same as the single crystal structure - oseltamivir docking you performed? Dynamics changes things!

If time allows, take a look at some of the RCS DLG files and examine how AutoDock docks oseltamivir in these new structures. Do they all bind in the sialic acid pocket?

Back to Track3

Personal tools