[Bioperl-l] Getting values from a GenBank File

Brian Osborne brian_osborne at cognia.com
Fri Mar 5 07:52:56 EST 2004


This is good. I've also been talking to another Bioperl-er who's writing a
script that creates an output very much like your Tables, where the input is
any type of format (Genbank, Locuslink, and so on). So you could write your
own HOWTO, yes, but in my opinion the better idea is to marry the existing
HOWTO, your tables, and this incoming script somehow.

I'm convinced that having too much on a single topic ("you could read the
FAQ 2.6 or the HOWTO  or the module documentation for Bio::That or Bio::This
or the bptutorial section 3.4...") is not good. Bioperl is under-documented,
generally, but you'll also find places where there's documentation scattered
all over the place on a given topic. The advice is inconsistent, one section
gets outdated, the authors move on, these sorts of things happen.

My suggestion would be to wait a moment until my colleague's script appears,
then let's write something coherent that's easy to maintain.

Brian O.

-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Barry Moore
Sent: Thursday, March 04, 2004 3:31 PM
To: bioperl
Subject: [Bioperl-l] Getting values from a GenBank File

While learning BioPerl, I've spent a good bit of time trying to figure
out where some bit of information in a GenBank file might be tucked away
in that RichSeq object, and how the hell to get at it.  I finally
decided it was time to sift through that object and find out what's in
there and where it is.  Along the way I wrote down  what I found, and
then added a bit to it to try and turn it into something that I would
like to have found when I first started learning Bioperl.  The result is
a "GeneBank-file-centric" crib sheet for Bioperl.  The document is
centered around a "standard" GenBank file where each seperate item of
information is marked-up with  numbers and codes that refer to
information later in the document about how to get at that particular
value in using BioPerl.  It also includes a script at the end that can
be modified easily to get at all the major stuff in the GenBank file.
This document is aimed at beginners to BioPerl, but assumes a working
knowledge of Perl.  The file is, of course, free to be used by anyone in
anyway that they wish.  I'd appreciate any corrections or improvements
if you find errors or shortcomings.


Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT


                 How to get values in a GenBank file with BioPerl

This is a guide to accessing the value of each item in a GenBank file
with BioPerl.
Underneath each value in the sample GenBank file there is a number and a
code.  The 3-letter code represents the object where the associated
value is stored.
Table 1 at the end of this file gives the object for each code.  By
looking up the
numbers in Table 2 at the end of this file you will find the object
where this
information can be called from, and the method to get (and usually set)
the value.
Some values may be stored by, and accessible by methods in more than one
There is perl code at the end that you can use as a starting point to
get at all of
the objects and their methods.
LOCUS       NM_079543               1821 bp    mRNA    linear   INV
            1-BSR                   2-BPS      3-BSR   4-BSR  5-BSR 6-BSR
DEFINITION  Drosophila melanogaster alpha-Esterase-3 CG1257-PA (alpha-Est3)
            mRNA, complete cds.
ACCESSION   NM_079543 XX_123456
            8-BPS     9-BSR
VERSION     NM_079543.2  GI:24644853
                      |     |
                    10-BPS  11-BSR
KEYWORDS    Esterases
SOURCE      Drosophila melanogaster (fruit fly)
  ORGANISM  Drosophila melanogaster
            14-BSC     15-BSC
            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
            Ephydroidea; Drosophilidae; Drosophila.
REFERENCE   1  (bases 1 to 1821)
                    17-BAR 18-BAR
  AUTHORS   Adams,M.D. et al.(additional authors deleted for brevity)
  TITLE     The genome sequence of Drosophila melanogaster
  JOURNAL   Science 287 (5461), 2185-2195 (2000)
  MEDLINE   20196006
   PUBMED   10731132
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to
            NCBI review. This record is derived from an annotated genomic
            sequence (NT_033777). The reference sequence was derived from
            On Nov 6, 2002 this sequence version replaced gi:17737826.
FEATURES             Location/Qualifiers
     source          1..1821
                25-BLS  26-BLS
                     /organism="Drosophila melanogaster"
                     /note="genotype: y[1]; cn[1] bw[1] sp[1]; Rh6[1]"
     gene            1..1821
                     /note="alpha-Esterase-3; synonyms: B, aE3, CG1257,
                     alphaE3, alpha-Ests, fragment B;
                     go_function: carboxylesterase activity [goid 0004091]
                     [evidence NAS]"
     CDS             119..1750
                     /note="alpha-Est3 gene product"
                     /product="alpha-Esterase-3 CG1257-PA"









        1 gtactatggc aaatggtact atgggcagtc gcgataataa taataaatga ggctttaaat
       61 gtttttccac tacaagataa aaagttacga gtgcgcaccg agcatttagt agacgaacat
... (additional sequence deleted for brevity)
Table 1. Object Key
BPS - Bio::PrimarySeq
BSR - Bio::Seq::RichSeq
BSC - Bio::Species
BAC - Bio::Annotation::Comment
BAR - Bio::Annotation::Reference
BLS - Bio::Location::Simple
BSG - Bio::SeqFeature::Generic
Table 2. Usage key
1.   Bio::Seq::RichSeq           $scalar = $self->display_name() or
2.   Bio::PrimarySeq             $scalar = $self->length()
3.   Bio::Seq::RichSeq           $scalar = $self->molecule()
4.   Bio::PrimarySeq             $scalar = $self->is_circular
5.   Bio::Seq::RichSeq           $scalar = $self->division()
6.   Bio::seq::Richseq           @array  = $self->get_dates
7.   Bio::PrimarySeq             $scalar = $self->description()
8.   Bio::PrimarySeq             $scalar = $self->accession_number()
     Bio::Seq::RichSeq           $scalar = $self->accession()
9.   Bio::Seq::RichSeq           @array  = $self->get_secondary_accessions
10.  Bio::PrimarySeq             $scalar = $self->version()
     Bio::Seq::RichSeq           $scalar = $self->seq_version()
11.  Bio::PrimarySeq             $scalar = $self->primary_id()
12.  Bio::Seq::RichSeq           @array  = $self->get_keywords
13.  Bio::Species                $scalar = $self->common_name()
14.  Bio::Species                $scalar = $self->genus()
15.  Bio::Species                $scalar = $self->species()
16.  Bio::Species                @array  = $self->classification()
17.  Bio::Annotation::Reference  $scalar = $self->start()
18.  Bio::Annotation::Reference  $scalar = $self->end()
19.  Bio::Annotation::Reference  $scalar = $self->authors()
20.  Bio::Annotation::Reference  $scalar = $self->title()
21.  Bio::Annotation::Reference  $scalar = $self->location()
22.  Bio::Annotation::Reference  $scalar = $self->medline()
23.  Bio::Annotation::Reference  $scalar = $self->pubmed()
24.  Bio::Annotation::Comment    $scalar = $self->text() or ->as_text
25.  Bio::SeqFeature::Generic    $scalar = $self->start()
26.  Bio::SeqFeature::Generic    $scalar = $self->end()
27.  Bio::SeqFeature::Generic

Get all features from the features table using a variation on the code
You will need to vary the primary_tag (e.g. source, gene, CDS) and the
get_tag_values (e.g. organism, db_xref, chromosome, note, gene, locus_tag,
go_function, map, EC_number, codon_start, protein_id, etc.).  If there
is more
than one primary tag associated with a give value (e.g. more than one
gene) then
you will have to step through the @source_feats array to find what you
want rather
than simply shifting the first scalar off the top.

my @source_feats = grep { $_->primary_tag eq 'source' }
my $source_feat = shift @source_feats;
my @mol_type = $source_feat->get_tag_values('mol_type');

28.    Bio::PrimarySeq    $scalar =    $self->seq()

use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed

#Get some sequence IDs either like below, or read in from a file.  Note that
#this sample script works with the accession numbers below (at least at
the time
#it was written).  If you add different accession numbers, and you get
#you may be calling for something that the sequence doesn't have.
You'll have
#to add your own error trapping code to handle that.
my @ids = ('U59228', 'AB039327', 'BC035972');

#Create the GenBank database object to read from the database.
my $gb = new Bio::DB::GenBank();

#Create a sequence stream to pass the sequences from the database to the
my $seqio = $gb->get_Stream_by_id(\@ids);

#Loop over all of the sequences that you requested.
while (my $seq = $seqio->next_seq) {

  #Here is how you get methods directly from the RichSeq object.  Replace
  #'display_name' with any other method in Table 2. that can be called on
  #either the RichSeq object directly, or the PrimarySeq object which it has
  print $seq->display_name,"\n";

  #Here is how to access the classification data from the species object.
  my $species = $seq->species;
  print $species->common_name,"\n";
  my @class = $species->classification;
  print "@class\n";

  #Here is a general way to call things that are stored as a
  #Generic object.  Replace 'source' with any other of the "major"
headings in
  #the feature table (e.g gene, CDS, etc.) and replace 'organism' with
any of
  #the tag values found under that heading (mol_type, locus_tag, gene, etc.)
  my @source_feats = grep { $_->primary_tag eq 'source' }
  my $source_feat = shift @source_feats;
  my @mol_type = $source_feat->get_tag_values('mol_type');
  print "@mol_type\n";

  #Here is a general way to call things that are stored as some type of a
  #Bio::Annotation oject.  This includes reference information, and
  #Replace reference with 'comment' to get the comment, and replace
  #$ref->authors with $ref->title (or location, medline, etc.) to get other
  #reference categories
  my $ann = $seq->annotation();
  my @references = ($ann->get_Annotations('reference'));
  my $ref = shift @references;
  my ($title, $authors, $location, $pubmed, $reference);
  if (defined $ref) {
    $authors = $ref->authors;
    print "$authors\n";
  print "\n";
This file is without copyright.  It may be reproduced, edited, and
redistributed at
will without notice to the author, however I would appreciate it if any
or improvements were sent to barry.moore at genetics.utah.edu

Bioperl-l mailing list
Bioperl-l at portal.open-bio.org

More information about the Bioperl-l mailing list