[Bioperl-l] Getting values from a GenBank File

Barry Moore barry.moore at genetics.utah.edu
Fri Mar 5 09:30:06 EST 2004

O.K. Sounds good.


Brian Osborne wrote:

>This is good. I've also been talking to another Bioperl-er who's writing a
>script that creates an output very much like your Tables, where the input is
>any type of format (Genbank, Locuslink, and so on). So you could write your
>own HOWTO, yes, but in my opinion the better idea is to marry the existing
>HOWTO, your tables, and this incoming script somehow.
>I'm convinced that having too much on a single topic ("you could read the
>FAQ 2.6 or the HOWTO  or the module documentation for Bio::That or Bio::This
>or the bptutorial section 3.4...") is not good. Bioperl is under-documented,
>generally, but you'll also find places where there's documentation scattered
>all over the place on a given topic. The advice is inconsistent, one section
>gets outdated, the authors move on, these sorts of things happen.
>My suggestion would be to wait a moment until my colleague's script appears,
>then let's write something coherent that's easy to maintain.
>Brian O.
>-----Original Message-----
>From: bioperl-l-bounces at portal.open-bio.org
>[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Barry Moore
>Sent: Thursday, March 04, 2004 3:31 PM
>To: bioperl
>Subject: [Bioperl-l] Getting values from a GenBank File
>While learning BioPerl, I've spent a good bit of time trying to figure
>out where some bit of information in a GenBank file might be tucked away
>in that RichSeq object, and how the hell to get at it.  I finally
>decided it was time to sift through that object and find out what's in
>there and where it is.  Along the way I wrote down  what I found, and
>then added a bit to it to try and turn it into something that I would
>like to have found when I first started learning Bioperl.  The result is
>a "GeneBank-file-centric" crib sheet for Bioperl.  The document is
>centered around a "standard" GenBank file where each seperate item of
>information is marked-up with  numbers and codes that refer to
>information later in the document about how to get at that particular
>value in using BioPerl.  It also includes a script at the end that can
>be modified easily to get at all the major stuff in the GenBank file.
>This document is aimed at beginners to BioPerl, but assumes a working
>knowledge of Perl.  The file is, of course, free to be used by anyone in
>anyway that they wish.  I'd appreciate any corrections or improvements
>if you find errors or shortcomings.
>Barry Moore
>Dept. of Human Genetics
>University of Utah
>Salt Lake City, UT
>                 How to get values in a GenBank file with BioPerl
>This is a guide to accessing the value of each item in a GenBank file
>with BioPerl.
>Underneath each value in the sample GenBank file there is a number and a
>code.  The 3-letter code represents the object where the associated
>value is stored.
>Table 1 at the end of this file gives the object for each code.  By
>looking up the
>numbers in Table 2 at the end of this file you will find the object
>where this
>information can be called from, and the method to get (and usually set)
>the value.
>Some values may be stored by, and accessible by methods in more than one
>There is perl code at the end that you can use as a starting point to
>get at all of
>the objects and their methods.
>LOCUS       NM_079543               1821 bp    mRNA    linear   INV
>            1-BSR                   2-BPS      3-BSR   4-BSR  5-BSR 6-BSR
>DEFINITION  Drosophila melanogaster alpha-Esterase-3 CG1257-PA (alpha-Est3)
>            mRNA, complete cds.
>            7-BPS
>ACCESSION   NM_079543 XX_123456
>            8-BPS     9-BSR
>VERSION     NM_079543.2  GI:24644853
>                      |     |
>                    10-BPS  11-BSR
>KEYWORDS    Esterases
>            12-BSR
>SOURCE      Drosophila melanogaster (fruit fly)
>            13-BSC
>  ORGANISM  Drosophila melanogaster
>            14-BSC     15-BSC
>            Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
>            Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
>            Ephydroidea; Drosophilidae; Drosophila.
>            16-BSC
>REFERENCE   1  (bases 1 to 1821)
>                    17-BAR 18-BAR
>  AUTHORS   Adams,M.D. et al.(additional authors deleted for brevity)
>            19-BAR
>  TITLE     The genome sequence of Drosophila melanogaster
>            20-BAR
>  JOURNAL   Science 287 (5461), 2185-2195 (2000)
>            21-BAR
>  MEDLINE   20196006
>            22-BAR
>   PUBMED   10731132
>            23-BAR
>COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to
>            NCBI review. This record is derived from an annotated genomic
>            sequence (NT_033777). The reference sequence was derived from
>            CG1257-RA.
>            On Nov 6, 2002 this sequence version replaced gi:17737826.
>            24-BAC
>FEATURES             Location/Qualifiers
>     source          1..1821
>                25-BLS  26-BLS
>                     /organism="Drosophila melanogaster"
>                     27-BSG
>                     /mol_type="mRNA"
>                     /db_xref="taxon:7227"
>                     /chromosome="3R"
>                     /note="genotype: y[1]; cn[1] bw[1] sp[1]; Rh6[1]"
>     gene            1..1821
>                     /gene="alpha-Est3"
>                     /locus_tag="CG1257"
>                     /note="alpha-Esterase-3; synonyms: B, aE3, CG1257,
>                     alphaE3, alpha-Ests, fragment B;
>                     go_function: carboxylesterase activity [goid 0004091]
>                     [evidence NAS]"
>                     /map="84D9-84D9"
>                     /db_xref="FLYBASE:FBgn0015571"
>                     /db_xref="GeneID:40907"
>                     /db_xref="LocusID:40907"
>     CDS             119..1750
>                     /gene="alpha-Est3"
>                     /locus_tag="CG1257"
>                     /EC_number=""
>                     /note="alpha-Est3 gene product"
>                     /codon_start=1
>                     /product="alpha-Esterase-3 CG1257-PA"
>                     /protein_id="NP_524267.2"
>                     /db_xref="GI:24644854"
>                     /db_xref="FLYBASE:FBgn0015571"
>                     /db_xref="GeneID:40907"
>                     /db_xref="LocusID:40907"
>        1 gtactatggc aaatggtact atgggcagtc gcgataataa taataaatga ggctttaaat
>       61 gtttttccac tacaagataa aaagttacga gtgcgcaccg agcatttagt agacgaacat
>          28-BPS
>... (additional sequence deleted for brevity)
>Table 1. Object Key
>BPS - Bio::PrimarySeq
>BSR - Bio::Seq::RichSeq
>BSC - Bio::Species
>BAC - Bio::Annotation::Comment
>BAR - Bio::Annotation::Reference
>BLS - Bio::Location::Simple
>BSG - Bio::SeqFeature::Generic
>Table 2. Usage key
>1.   Bio::Seq::RichSeq           $scalar = $self->display_name() or
>2.   Bio::PrimarySeq             $scalar = $self->length()
>3.   Bio::Seq::RichSeq           $scalar = $self->molecule()
>4.   Bio::PrimarySeq             $scalar = $self->is_circular
>5.   Bio::Seq::RichSeq           $scalar = $self->division()
>6.   Bio::seq::Richseq           @array  = $self->get_dates
>7.   Bio::PrimarySeq             $scalar = $self->description()
>8.   Bio::PrimarySeq             $scalar = $self->accession_number()
>     Bio::Seq::RichSeq           $scalar = $self->accession()
>9.   Bio::Seq::RichSeq           @array  = $self->get_secondary_accessions
>10.  Bio::PrimarySeq             $scalar = $self->version()
>     Bio::Seq::RichSeq           $scalar = $self->seq_version()
>11.  Bio::PrimarySeq             $scalar = $self->primary_id()
>12.  Bio::Seq::RichSeq           @array  = $self->get_keywords
>13.  Bio::Species                $scalar = $self->common_name()
>14.  Bio::Species                $scalar = $self->genus()
>15.  Bio::Species                $scalar = $self->species()
>16.  Bio::Species                @array  = $self->classification()
>17.  Bio::Annotation::Reference  $scalar = $self->start()
>18.  Bio::Annotation::Reference  $scalar = $self->end()
>19.  Bio::Annotation::Reference  $scalar = $self->authors()
>20.  Bio::Annotation::Reference  $scalar = $self->title()
>21.  Bio::Annotation::Reference  $scalar = $self->location()
>22.  Bio::Annotation::Reference  $scalar = $self->medline()
>23.  Bio::Annotation::Reference  $scalar = $self->pubmed()
>24.  Bio::Annotation::Comment    $scalar = $self->text() or ->as_text
>25.  Bio::SeqFeature::Generic    $scalar = $self->start()
>26.  Bio::SeqFeature::Generic    $scalar = $self->end()
>27.  Bio::SeqFeature::Generic
>Get all features from the features table using a variation on the code
>You will need to vary the primary_tag (e.g. source, gene, CDS) and the
>get_tag_values (e.g. organism, db_xref, chromosome, note, gene, locus_tag,
>go_function, map, EC_number, codon_start, protein_id, etc.).  If there
>is more
>than one primary tag associated with a give value (e.g. more than one
>gene) then
>you will have to step through the @source_feats array to find what you
>want rather
>than simply shifting the first scalar off the top.
>my @source_feats = grep { $_->primary_tag eq 'source' }
>my $source_feat = shift @source_feats;
>my @mol_type = $source_feat->get_tag_values('mol_type');
>28.    Bio::PrimarySeq    $scalar =    $self->seq()
>use strict;
>use warnings;
>use Bio::SeqIO;
>use Bio::DB::GenBank; #use Bio::DB::GenPept or Bio::DB::RefSeq if needed
>#Get some sequence IDs either like below, or read in from a file.  Note that
>#this sample script works with the accession numbers below (at least at
>the time
>#it was written).  If you add different accession numbers, and you get
>#you may be calling for something that the sequence doesn't have.
>You'll have
>#to add your own error trapping code to handle that.
>my @ids = ('U59228', 'AB039327', 'BC035972');
>#Create the GenBank database object to read from the database.
>my $gb = new Bio::DB::GenBank();
>#Create a sequence stream to pass the sequences from the database to the
>my $seqio = $gb->get_Stream_by_id(\@ids);
>#Loop over all of the sequences that you requested.
>while (my $seq = $seqio->next_seq) {
>  #Here is how you get methods directly from the RichSeq object.  Replace
>  #'display_name' with any other method in Table 2. that can be called on
>  #either the RichSeq object directly, or the PrimarySeq object which it has
>  #inherited.
>  print $seq->display_name,"\n";
>  #Here is how to access the classification data from the species object.
>  my $species = $seq->species;
>  print $species->common_name,"\n";
>  my @class = $species->classification;
>  print "@class\n";
>  #Here is a general way to call things that are stored as a
>  #Generic object.  Replace 'source' with any other of the "major"
>headings in
>  #the feature table (e.g gene, CDS, etc.) and replace 'organism' with
>any of
>  #the tag values found under that heading (mol_type, locus_tag, gene, etc.)
>  my @source_feats = grep { $_->primary_tag eq 'source' }
>  my $source_feat = shift @source_feats;
>  my @mol_type = $source_feat->get_tag_values('mol_type');
>  print "@mol_type\n";
>  #Here is a general way to call things that are stored as some type of a
>  #Bio::Annotation oject.  This includes reference information, and
>  #Replace reference with 'comment' to get the comment, and replace
>  #$ref->authors with $ref->title (or location, medline, etc.) to get other
>  #reference categories
>  my $ann = $seq->annotation();
>  my @references = ($ann->get_Annotations('reference'));
>  my $ref = shift @references;
>  my ($title, $authors, $location, $pubmed, $reference);
>  if (defined $ref) {
>    $authors = $ref->authors;
>    print "$authors\n";
>  }
>  print "\n";
>This file is without copyright.  It may be reproduced, edited, and
>redistributed at
>will without notice to the author, however I would appreciate it if any
>or improvements were sent to barry.moore at genetics.utah.edu
>Bioperl-l mailing list
>Bioperl-l at portal.open-bio.org

Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT

More information about the Bioperl-l mailing list