Bioperl: 3D biomolecular structure handling for Bioperl

Andrew Dalke
Sat, 12 Dec 1998 22:23:46 -0800

> This matter was discussed several years ago (1995-96)

I recall those discussions.  :)

Here's some commentary I have on the page at

  First off, MMTK <>
is an excellent place to look for an existing Python framework for
molecular modeling.  It can be used to read and manulipate proteins
from PDB files, computes surfaces and does Amber dynamics.

  Should this structure information include methods relevant to small
molecule chemistry?  This includes bond order/type, cycle detection
and aromaticity.  The library I know best in this field is Daylight's
<> and I've got an "all but documented" Python
wrapper to it I can make available.  It provides an OO interface to
Daylight's C/Fortran style API.

>  Parse and output different structure file formats, at a minimum PDB
>  but could include mmCIF, CML/XML, and other formats.

  It may be useful to see how Babel stores a molecular representation
internally at it appears to handle most of the important information
stored in existing formats.  If you're on the PDB list, you've
probably seen my objections to the IUCr patent protection/licensing of
CIF, and thus mmCIF.

  Bond determination is important for reading various file formats.
Should they be determined, eg, from distances or from a residue name

By saying:
> support protein and nucleic acid crystallograpic and NMR structures
> and their heterogens

it's more likely the former, or a mixture of the two.

  I am becoming convinced there should be a core molecular
representation which is very lightweight (stores atom&bonds), and
"views" of the molecule which provide specific interfaces.  For
example from small molecules, some people like Kekule form and others
don't.  The Kekule view of a molecule would have bonds that are
single/double while the non-Kekule view would have "aromatic" bonds.

  Why is this important for larger molecules?  Consider 

> Energy minimize an altered structure. 

  Different force fields have different descriptions of a molecule.
(Eg, what is the "covalent bond" of MM3, which has several two-body
terms, if I recall.)  Also, angle and dihedral force field terms are
first class citizens, meaning they should be handled the same as bond
terms.  It's hard to justify having those terms in the primary data
structure, but easy to suppose in the (eg) "Amber" view of the

Having different views would address also

> Convert between alternate representations of the structure. 

by providing a translation layer to the alternate representation you
want.  (I can imagine the different representations of the molecule
acting as observers of the molecule and receive update notification
when core data is changed.)

There may be performance reasons for some alternate representations,
eg, if you have a "CHARMm dynamics engine" implemented in C to evalute
the force field you might want to convert the structure into something
more amniable to the compute engine.

> Provide methods for querying structure for statistics regarding
> atoms, hetatoms, connectivity, secondary structural elements,

I'm using this point to bring up VMD
<>, the visualization software
I've help develop.  It has a Tcl based way to access a lot of this
information and might be of some help.  Of course, MMTK supports this
as do other programs.

> Multi-chain and multi-model (NMR) structures should also be
> supported.

  Should dynamics/trajectory information be stored?  In the same
fashion as multi-model structures?  Should the bond information be
determined once and assumed to be invariant over the whole
conformation set or should it be recomputed for each conformation?

  I've pretty much decided that coordinates should be stored
independently from the atoms themselves.  It may be appropriate for
the atom to be able to access the "current" conformation somehow, but
I don't know the right way to do that in the face of multithreaded

> presence of alternate conformations

  If someone can tell me a sane way to handle that, I'll be very
interested.  I don't know the right way to handle bond detection
(covalent and hydrogen) when the alt. id is used in a PDB structure.

> structure quality factors (resolution, R factor), crystal packing
> group, and other data obtainable directly from a PDB file. 

  Probably the most relevant packages to look at for this type of work
are XPLOR, O and WhatIf/ WhatCheck.  I know the first and last let
others access that data via their respective scripting languages.  I'm
working in a company and not currently doing protein structures so
cannot justify paying for any of those packages.

> Specific examples: 
>     Get the start and end residue numbers for each chain in a structure. 
>     Get the chain association for a given hetero group. 

 I suggest also giving a more concrete definition of what it means to
be a chain.  Take for example a case where the PDB defines the chains
for a single mer of a multimer structure.  (Consider 2PLV which
contains a single copy of 60 needed for a virus capsid structure.)
The simplest way to generate the full structure is to apply the
transformation matricies to the coordinates and merging the resulting
60 structures into a single PDB.  (Of course, it doesn't fulfill the
PDB definition of a chain, but that's another discussion.)

  I finally decided to let "chain" be a (nearly) arbitrary attribute
of an atom and try to determine chain-like-ness algorithmatically.
Alas, that doesn't work when there's a gap in the structure (like
unresolved coordinates).  For lack of a better word, I called these
"pfrag" for protein fragments and "nfrag" for nucleic acid fragments.

  Luckily, for the most part it doesn't make a difference.  Since PDB
structures are ordered and chain is pretty well-defined, getting the
first and last residue with a given chain attribute is easy.

  I can show off the VMD way to get the first type of data from a PDB
file.  Any way in perl should be at least as direct:

  mol load pdb whatever.pdb
  set sel [atomselect top "protein and name CA"]
  set chains [luniq [$sel get chain]]
  foreach chain $chains {
    set chainsel [atomselect top "chain $chain and name CA"]
    set ids [$chainsel get resid]
    puts "chain $chain start [lindex $ids 0] end [lindex $ids end]"

Which for 2hhd prints:
  chain A start 1 end 141
  chain B start 1 end 146
  chain C start 1 end 141
  chain D start 1 end 146

Of course, for PDB files you'll also need the insertion code to be
fully correct.  Again, well-defined definitions on what's needed for
uniqueness is important.

In VMD, some of the residue information (like the residue id) is
located in the residue rather than the atom.  The atom contains an
association back to the residue so residue properties (like the
residue id) are available at the atom level.  This is easy in C/C++.
It's harder with Perl's memory management since a lot of cycles may
occur.  I don't know how to resolve that problem.  I've heard mention
of "weak references" in a couple other languages.  Seems relevant, but
I don't know what it means :)

> Display/select atoms or residues based on physical-chemical
> properties: acidic, basic, polar, non-polar amino acids, spcific
> types of residues or atoms (e.g., oxygen, glycine), hydrogen bonds,
> disulfide bonds, etc.

Display should not be the property of a library like this because you
may then restrict certain types of user interaction..  I point out a
difference between VMD and most other visualization programs.  Most
programs have the concept of an "active" selection.  All operations are
based with respect to that selection.  VMD, on the other hand, lets
you have multiple selections, each with its own independent method of
display.  I've found the latter very useful, especially when you want
to switch between several different selections.

I know it sounds like I'm tooting my own horn, but I spent some time
developing the atom selection mechanism in VMD.  While I think I know
some better ways to do things for a next implementation, I believe
its architecture is still worth examining.  The user level
documentation is available at:

I've been meaning for several years to strip out the code as an
indepent library and document/publish the results.  As it is, I'm not
going to do that here but you can get the source and look at
SymbolTable.C, AtomLexer.l, AtomParser.y and AtomSel.C.  The library
is quite general.  It does not, however, deal well with non-atom based
selections, like selections across bonds.

  Some query languages, like Daylight's SMARTS, do support queries
across specific bonds, and it may be possible to extend it to support
selections like "the atoms involved in a hydrogen bond."  I also once
came across an NMR program that support NMR specific queries, like
searching for atoms a given number of bonds away (for though-bond
effects).  I would appreciate if someone could point me out some other
selection languages.

> III.Analyzing & Editing Structures

  How tied should the implementation be to the existing Perl math/array
modules?  I don't know the status of those modules, but it's a huge
advantage to be able to use existing methods to compute vector
operations, matrix transforms and eigenvalues/vectors.  Especially if
there are implementations of some of those operations in C :)

With basic vector types, things like
> Calculate dihedral angles
> calculate the axes of secondary structural elements
are simple, as is minimizing RMSD between two vectors.

I regard:
> Add/remove/change residues, hydrogens, and water molecules. 

as one of the hardest things to do in the list.  As proof, I point out
there are very few programs available that do molecular editing, and
I'm not happy with any of them.

The core problem is what to do with items that depend on atoms that
have been deleted from the system or been changed.  Here's a few

  I mentioned atom selections earlier.  Suppose I've selected an atom
in one place then deleted the atom elsewhere.  What becomes of the
atom selection.  How does it determine that it's invalid?  Suppose you
delete a different atom.  Should the selection still be valid? (XPLOR
says "no" by saying certain data is "fragile" and some operations make
fragile data invalid.)

  When should certain values be (re)computed?  For example, if I
rotate part of the molecule around one of the backbone carbons, I can
break an alpha helix into two parts.  Suppose there we a selection for
"alpha helix."  Should it be recomputed? 
  (I think not in all cases; consider following the evolution of all
waters that started within 5A of an ion.  You wouldn't want that
selection to be updated for every timestep.)

  Suppose you have a 3D display of the system and a 2D graph of, oh,
the hydrophobicity plot of every protein chain in the system.  At the
command line interface to the system (I'm assuming that as well) the
user deletes a protein residue, which causes the protein to be broken
into two parts.  How should the two UIs be notified that they need to
redraw their displays?

> Map between residue numbers in the atomic coordinate set and the
> numbers in the sequence string (say from a Fasta representation).

This is only meant as a caution for someone attempting this.  We tried
doing that with the SCOP data, since the SEQRES data is unnumbered
while the ATOMs are referenced with sequence id+insertion code.  (I
added some insertion codes support to VMD to handle this uniqueness).
We came into problems when the structure was only partially known so
the depositor used a residue name that was different than that given
by the sequence.  The PDB documentation says:

| The residues presented on the SEQRES records must agree with those
| found in the ATOM records.

which is almost immediately contradicted by:

| SEQRES is compared to the ATOM records during processing, and both
| are checked against the sequence database. All discrepancies are
| either resolved or annotated in the entry.

Looks like "must" was a typo for "should."

> select an arbitrary set of residues on a structure

  Could this be regarded as a sequence based "view" (with the
association I gave before) of an atom selection?

> Get related structures and domains based on criteria such as
> sequence similarity, structural similarity, date deposited,
> resolution, organism name, molecule name, etc.

Yeww, nasty!  :) Last I heard about 20% of the PDB records weren't
fully parsable.  I do have a syntactic parser for each format card
defined in the PDB, but someone else will have to put the next level
of semantics about that.

> Get URLs for web-based information about the structure.

  This is mostly for my own curiousity.  How stable are the web
services that provide this sort of information?  From past experience
only a few stable (ftp sites are usually stable but http sites,
esp. with fancy CGIs for user interface are not).  We would have to
update our parsers for some 60-odd sites about once a week to reflect
changes on the remote site.

  Here's a challange to anyone working on the structure implementation.
Can you implement these to allow non-trivial multithreading?  (An
example of a trivial solution is to serialize all requests through a
single thread.)

  Now that *that's* done, let me address this part of the email.

>  - In what sort of program would you use such a module?
>  - How would you want to access the data, and what methods should be
>     available?

  Well, I'm a whole-hearted Python developer these days, so it won't
really be me.  However, here's some generally useful tasks (tasks
meaning they aren't quite full fledged programs):

  Viewer -- Load a structure (from various formats?)  and view the
atomic coordinates using OpenGL.  Allow a couple of simple
representations (wireframe, ball&stick, vdw and tube) and coloring
schemes.  Let this render into a Tk canvas so it can be used as a
plug-in to a larger applications.
  At its simplest it needs to read the atom name and position, residue
name and insertion code and determine bonds.  It can, of course, grow
forever, but that encapulates about half of what most people want, and
it's hard to get the other 50% :)

  Editor -- My first real perl program was "pdblang," which you can
still get from (it may need
a few changes for perl5).  It implemented a simple language that let
you manipulate PDB structures.  For example, I used it to:

  read a PDB structure (2plv)
  make 4 copies of the protomer
  rotate each copy by a transformation matrix (representing a
     turn of 1/5 of 360 degrees)
  rename the segment names of each copy (so they could be addressed
  merge the structures into one large structure
  renumber the atom ids
  save to a PDB file

  It could also do simple things, like mutate a residue (translating
the new residue's backbone to the same position, deleting the old
residue, inserting the new one, and renaming/renumbering the ids),
convert atom names to a more current definition (is the terminal
oxygen OT1, OCT1, OTX1, etc.; is it ILE HD3 or HD13?). or compute the
average of several coordinate sets.

  Aligner -- reimplement Andrew Martin's PROFIT.  That does RMSD
structure alignment using sorts of ways.  You'll want to be able to
align a set of atoms, a set of residues (backbone atoms) and do a
sequence aligment and superimpose to match terms in the alignment.

  So, who's still awake after reading all that?  :)

						Andrew Dalke
=========== Bioperl Project Mailing List Message Footer =======
Project URL:
For info about how to (un)subscribe, where messages are archived, etc: