SI2008 Relaxed Complex Scheme I

From NBCRwiki

Jump to: navigation, search

Back to Track3


Molecular Dynamics and Clustering


  • Dong Xu, Postdoc NBCR/McCammon Group/UCSD
  • Rommie Amaro, NIH Postdoctoral Fellow, McCammon Group, UCSD


Bold indented lines are the commands you need to run in this tutorial session!
Please copy/paste them to your terminal window in order to avoid input errors.

Setup Working Environment

  • on ieng6 Linux Desktop
    Set some environmental variables and copy files into your account. Open one Terminal window and type:
    source /home/linux/ieng6/nbcr08/nbcr08/SI2008/rcs_setup.bash
  • on Kryptonite Cluster
    Open another Terminal window and type:
    source /nas/dxu/SI2008/setup.bash

Topic 1: MD Simulation of H5N1

The macromolecule is H5N1 Avian Influenza Neuraminidase Apo protein in the "open" loop conformation. All the commands in section will be executed on kryptonite.

Prepare the macromolecule

Download 2HTY from the RCSB website, type:

wget -O 2HTY.pdb

The original PDB contains multiple tetramers of the neuraminidase enzyme, but we only want to simulate one tetramer. Therefore we need to extract chains A, B, C and D from the pdb file for use in our simulation. Do the following:

  • Open the 2HTY.pdb file in your favorite text editor
  • Delete all atoms except for chain A - D
  • Save it as a 2HTY_tetramer.pdb

Or you can do this:

grep ATOM 2HTY.pdb |egrep -v 'REMARK| E | F | G | H ' > 2HTY_tetramer.pdb

Protonate the protein

Now we need to add the hydrogen atoms to our system; in this tutorial, we will use the PDB2PQR webserver, which is a tool developed here at NBCR. Adding hydrogen atoms to most amino acids is trivial at pH 7.0, which is approximately what we want to simulate. However, histidines have a pKa right around this range and can have either their delta nitrogen or epsilon nitrogen protonated, or they can both be protonated, yielding an amino acid with a +1 charge. This webserver will optimize the hydrogen bonding network within the protein to determine where the hydrogens should go in the optimal case.

When setting up your system of interest, you should check the placement of the hydrogens as a sanity check to be sure the automated procedure worked effectively. In some special simulation cases, you may want to mimic a specific reaction step and therefore may want to manually edit the protonation state of several residues. This is part of getting to know your system and knowing what you want to simulate and accomplish through your work.

Go to the web service PDB2PQR website or use the command line: --ff=amber --ffout=amber --with-ph=6.5 --chain 2HTY_tetramer.pdb 2HTY_tetramer.pqr

We instead use sge to submit this job.

qsub -l urgent pdb2pqr.sge

This job takes ~5 minutes and when it finishes, it will create a few new files. The one we need is 2HTY_tetramer.pqr file.

Some amino acids are protonated or deprotonated and that changes the last letter of three letter amino acid code to N (deprotonated) or H (protonated). These name changes must be changed back before processing in AMBER. For example: any lysine with the three letter code 'LYN' must be changed to 'LYS'.

Search for lines containing "HE2 GLU" and remove them.

sed 's/GLH /GLU /g' 2HTY_tetramer.pqr | sed 's/LYN /LYS /g' | sed 's/ASH /ASP /g' | grep -v 'HE2 GLU' > 2HTY_tetramer.pqr.pdb

Add calcium

In the case of neuraminidase, there is a single calcium ion buried in each chain. These calcium ions are required for the proper function of the enzyme, even though they are not directly within the active site. We therefore want to keep them in our simulations.

  • We need to extract the 4 calcium ions from 2HTY.pdb and add them to the bottom of 2HTY_tetramer.pqr.
 HETATM23825 CA    CA   991      -7.518  89.817 108.496  1.00 21.57          CA
 HETATM23826 CA    CA   992      10.548  10.522 108.324  1.00 21.80          CA
 HETATM23827 CA    CA   993     -38.113  41.086 107.567  1.00 22.12          CA
 HETATM23828 CA    CA   994      40.997  59.357 109.508  1.00 23.65          CA
  • We also need to rename atom name CA --> Ca+ and resname CA --> CAL for AMBER convention. Make sure columns are aligned correctly.
 HETATM23825 Ca+  CAL   991      -7.518  89.817 108.496  1.00   21.57          CA
 HETATM23826 Ca+  CAL   992      10.548  10.522 108.324  1.00   21.80          CA
 HETATM23827 Ca+  CAL   993     -38.113  41.086 107.567  1.00   22.12          CA
 HETATM23828 Ca+  CAL   994      40.997  59.357 109.508  1.00   23.65          CA
  • We can do both of the above using the following commands:
    echo TER >> 2HTY_tetramer.pqr.pdb
    grep 'CA CA' 2HTY.pdb| head -4|sed 's/CA /Ca+ /g' |sed 's/CA /CAL /g' >> 2HTY_tetramer.pqr.pdb

Add crystal waters

Another important aspect of setting up a simulation is deciding which crystallized water molecules to keep. We generally advise to keep all of the water molecules, unless you have specific reasons to remove them. Of further note, it is a little "dangerous" to remove crystallized water molecules that are buried or within active site pockets, as many of them play important structural roles. In addition, removing these water molecules will create an empty space / vacuum in the system and this can subsequently disrupt the local structure.

  • Lauch VMD in text mode and run
vmd -dispdev text
  • Select all crystal waters within 5 angstroms of our tetramer.
mol load pdb 2HTY.pdb
set wat [atomselect top "water within 5 of (chain A to D)"]
$wat writepdb xtal_wat.pdb
  • Open 2HTY_tetramer.pqr and add xtal_wat.pdb at the end of the file.
echo TER >> 2HTY_tetramer.pqr.pdb
sed 's/ATOM /HETATM/g' xtal_wat.pdb|sed '1,1d' >> 2HTY_tetramer.pqr.pdb

You have now created a new PDB file called xtal_wat.pdb that contains all the crystallographic water molecules within five angstroms of the N1 tetramer that we are want to simulate.

Add TER cards

In order to comply with the formatting requirements of Amber's system preparation program Leap, we need to add "TER" between all protein chains and the ions.

You can do this in any text editor if you like. The following command will add a TER between protein A, B, C, D chains:

sed '/ATOM 5750/{p;s/.*/TER/;}' 2HTY_tetramer.pqr.pdb | sed '/11500/{p;s/.*/TER/;}' | sed '/17250/{p;s/.*/TER/;}' > 2HTY_tetramer.pqr.tmp

The following command will add TER between between the 4 Ca+ ions:

sed '/HETATM23825/{p;s/.*/TER/;}' 2HTY_tetramer.pqr.tmp | sed '/HETATM23826/{p;s/.*/TER/;}' | sed '/HETATM23827/{p;s/.*/TER/;}' > 2HTY_tetramer.pqr.pdb

Disulfide bonds

When setting up your system, you will also need to check which cysteine residues, if any, are forming disulfide bonds. In Amber, normal cysteine residues are called CYS, whereas cysteine residues participating in disulfide bonds are called CYX. This is part of getting to know your system - you will need to know all the nitty-gritty structural details in order to create valid simulations. By default, Amber will assume all cysteine residues it encounters in a PDB file are CYS. So if you have disulfide bonds, you need to address them separately.

We will now rename CYS to CYX for disulfide bonded CYS residues. To see which Cysteines are in disulfide bonds, check the crystal pdb file 2HTY.pdb, and search for 'SSBOND'. This will tell you the CYS involved in disulfide bonds and which residues interact with each other. Make note of the residue numbers.

The first pair of numbers is the residue numbers before AMBER re-numbering. The second pair of numbers is the residue numbers after loading into AMBER. The third set of numbers is after adding the solvent box.

92-417 > 92-417 > 10-335
124-129 > 124-129 > 42-47
182-230 > 184-231 > 102-149
232-237 > 233-238 > 151-156
278-291 > 279-292 > 197-210
280-289 > 281-290 > 199-208
318-336 > 318-335 > 236-253
421-447 > 421-446 > 339-364

Scripts have been prepared for identifying CYX and making the Disulfide bonds: chk_CYX and bond_CYX. You don't need to run these scripts separately; they will be run in a few moments when you launch the tleap script.

Calcium parameter file

In order to simulate the calcium ions, we need to add a few lines to our parameter file.

Ca+ parameter file named 'cal.prep' has been prepared before loading the macromolecule into AMBER.

Solvate, neutralize and add ions to the system in AMBER

Now we are ready to add explicit solvent to our system. In this case, we will add a water box of "TIP3P" water molecules. It is important to know which water model you choose for your simulations. TIP3P is a standard choice and works well with the Amber99ffSB that we will be using for the protein.

In addition to adding solvent, we need to neutralize our system. We need to neutralize our system because of the algorithm we use to compute the electrostatics, which is the Particle Mesh Ewald (PME) method. We won't go into a big description here, but if you're interested to know more details you can google it!

A final consideration is that you may also want change your simulation solvent conditions. For example, you may want to choose a salt bath that is similar to either a certain experimental setup (e.g. an inhibitor assay carried out at 150 mM NaCl) or one that mimics certain physiological or cellular conditions.

To do all of this, we will use the following commands in tleap:

source leaprc.ff99SB
loadamberprep cal.prep
a = loadpdb 2HTY_tetramer.pqr.pdb
source cyx
source bond
solvatebox a TIP3PBOX 10.0
# total ions 12 added, 4 Na+ used for neutralization
addions a Cl- 4
addions a Na+ 8
saveamberparm a A.prmtop A.inpcrd
savepdb a A.pdb

4 Na+ are used to neutralize the system. Additional 4 Na+ and 4 Cl- ions are added to mimic the physiological condition (20 mM NaCl). The number of ions is determined by calculating the number of waters (32476) in the system with the following command:

grep 'O WAT' A.pdb|wc -l

Then follow the equation:

N_ions = 6.022e23 * 3.105e-26 * ionConc * nwater = 0.0187 *ionConc * nWater
N_ions = 0.0187 * 0.02 * 32476 = ~ 12

A script 'go_tleap' has prepared. To execute the above commands, submit the sge script.

qsub -l urgent go_tleap.sge

Once it gets started, this job will take approximately 15 minutes to complete. This means you have time for a coffee break! Lucky for us, there is a coffee cart right on the other side of the teddy bear from where we are.

When the job is complete, check the output log files go_tleap.sge.out and leap.log for errors and CYX residues.

Visualize the solvated system on ieng6

Now it is time to visualize your system and admire all your hard work! We will do this on your local machine (on ieng6).

This is an important step in the verification process, to be sure that everything looks as it should and that you have built the system correctly. We will do the visualization with VMD, a freely available and powerful molecular viewer. A full tutorial all about VMD is here and very useful. If you are doing this tutorial at home, you will need to install VMD. It runs on any platform and will install easily nearly without exception.

After you have verified that your system is built properly, we will need to extract some important system parameters, such as cell box size and the system origin, for use in the molecular dynamics simulations.

To launch VMD, type the following at the command-line on your ieng6 desktop terminal:


In the VMD Main window, select File --> New Molecule and browse to ~/SI2008/N1_MD

Browse for A.prmtop, set the file type to Amber7 PARM

Browse for A.inpcrd, set the file type to Amber7 RESTART

Your system should now be visualized in VMD. You can rotate and investigate it.

Open VMD TK Console. In the VMD Main window, select Extensions --> Tk Console. At the prompt, type the following:

set sel [atomselect top all]
measure minmax $sel
measure center $sel

This last set of commands measures the system box center and cell dimensions for MD setup. In a longer tutorial (i.e. if we had 3 days instead of 2), we would walk you through the MD simulation process. Today, we have just shown you how to build your system. Now we will "cheat" a little bit and just show you the final MD trajectory.

When you are finished looking at the protein, quit the VMD session.

Visualize sample MD trajectory on ieng6

Launch VMD. To do this, type the following command in a terminal on your desktop at ieng6:


In the VMD Main window, select File --> New Molecule and browse to ~/SI2008/N1_Clustering

Browse for N1-apo-chA-fr1-renumbered.pdb (format PDB)

Browse for N1-apo-all-CA-aligned.dcd (format DCD) - be sure you load this file into the PDB you just loaded

Now the trajectory is loaded into VMD. You can play the dcd file (1572 frames) as a movie using the "play button" in the VMD Main window. Look at how the protein moves over time. Check out the loop motion and other features. Investigate and have fun!

When you're finished, quit VMD.

Topic 2: Clustering Analysis of MD Trajectory

For this clustering section, we will again be working on kryptonite. So open a terminal and login, if you do not already have a window open.

Goal: Extract representative structures from large amount of MD ensemble

Distance: RMSD/DRMS based (may be mass weighted)

Tools: ptraj (AMBER), g_cluster (GROMACS), GROMOS, Bio3D/R

Algorithm: Hierarchical Average Linkage Clustering (best performer with statistics support [1])

The ptraj clustering command format is listed here for your edification:

cluster out filename [representative format] [average format] [all format] \
algorithm ([clusters n] | [epsilon critical_distance]) [rms|dme] [sieve s [start start_frame | random]] [verbose verb] [mass] mask


Start with cluster=1:

trajin N1-apo-all-CA-aligned.dcd
cluster out avg-1 representative none average none all none averagelinkage clusters 1 mass rms
"(:35-37 | :51-56 | :64-70 | :74 | :97-98 | :114-118 | :141-146 | :161-165 | :195-196 | :211 | :213 | :262-265 | :286 | :319-320 | :344-359) & @CA,C,N,O"
cd ~/SI2008/N1_Clustering
qsub -l urgent avg-1.sge

Plot "ClusteringMerge.txt" for statistics:

Critical Distance: large drop

DBI (Davies-Bouldin Index): local minimum

pSF (pseudo F): local peak

SSR/SST (...)

Look for possible consensus and determine number of clusters

SI2008_Clustering.png SI2008_Clustering2.png

Run averagelinkage again for cluster=6, 8, 14 and calculate the average Silhoutte score for more quality control

submit_ptraj.csh avg-6 N1-apo-all-CA-aligned.dcd N1-apo-chA-fr1-renumbered.pdb 6
submit_ptraj.csh avg-8 N1-apo-all-CA-aligned.dcd N1-apo-chA-fr1-renumbered.pdb 8
submit_ptraj.csh avg-14 N1-apo-all-CA-aligned.dcd N1-apo-chA-fr1-renumbered.pdb 14

Final number of clusters is 6 as it has the highest average Silhoutte scores.

tail *.sil

Run avg-6 again to get the cluster representative structures

sed 's/representative none/representative pdb/g' avg-6.ptraj > avg-6.ptraj.2
sed 's/avg-6.ptraj/avg-6.ptraj.2/g' avg-6.sge > avg-6.sge.2
qsub -l urgent avg-6.sge.2

Look up the cluster membership avg-6.txt 6 > avg-6.clus

Visualize the cluster representatives on ieng6

Launch VMD

New Molecule, browse to ~/SI2008/N1_Clustering

Load avg-6.rep.c0 through avg-6.rep.c5 files and choose PDB as file type.

Rotate the structures and check how different they look around the binding site.

Back to Track3

Personal tools