[Bioperl-l] How to extract xref information from seq object that is fetched from GenBank?

Sang Chul Choi goshng at gmail.com
Wed Aug 30 16:50:01 EDT 2006


I emailed the wrong address twice; I hope this one gets through.
I have two questions, but they are basically the same. I would appreciate your help.

This is my first question:

I am trying to fetch a protein-coding DNA sequence from a public database.
I first used Bio::DB::GenBank to fetch the protein sequence by SwissProt ID
or GenBank ID. Next, I want to read the DBSOURCE block, which points to
where I can fetch the DNA sequence, but I don't know how to get at that
information. There are often many links in the 'xrefs' part of DBSOURCE.
For example,
DBSOURCE    swissprot: locus CYSJ_ECOLI, accession P38038;
            class: standard.
            extra accessions:P14782,Q2MA65,created: Oct 1, 1994.
            sequence updated: Jun 27, 2006.
            annotation updated: Jun 27, 2006.
            xrefs: M23008.1 , AAA23650.1, U29579.1, AAA69274.1, U00096.2,
            AAC75806.1, AP009048.1, BAE76841.1, H65057, 1DDGA, 1DDGB, 1DDIA,

I thought I could use the Annotation object like this to get the information,
but I am starting to think I am wrong, because I could not get the DBSOURCE
information from the Annotation object.

use Bio::DB::GenBank;
my $gb = Bio::DB::GenBank->new;

# Fetch the record, then print every annotation BioPerl attached to it.
my $seq = $gb->get_Seq_by_acc('P38038');
my $ann_coll = $seq->annotation;
for my $ann ($ann_coll->get_Annotations) {
   print $ann->tagname, " ", $ann->as_text, "\n";
}

How can I get this DBSOURCE information?

Thank you,

Sang Chul

This is my second question:
It turns out that I asked the same question Mira asked last December,
to which Hilmar answered:

GenBank has the protein-mRNA cross-reference in the feature table,
hence you would need to look into the tag/value pairs of a sequence's
features. DBSOURCE I believe is only present for those entries
originating from UniProt (i.e., not natively from GenBank).

On top of all that, the tags GenBank uses for their entry annotation
are not the ones BioPerl uses to tag its annotation objects - BioPerl
is not an API solely for GenBank. Consult the Bio::SeqIO::genbank POD
for documentation on what goes where in the BioPerl object model.

I totally agree with this point. I have another problem, though. For some
SwissProt proteins, I could use Bio::DB::SwissProt to fetch the protein
sequence and then read its 'dblink' annotation, which pointed to where
I could get the DNA sequence for that protein.
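
To make the 'dblink' route concrete, here is a minimal sketch of how those
links might be collected. The sub name dblink_pairs is made up for
illustration, and the sketch assumes the object is a BioPerl sequence whose
'dblink' annotations are Bio::Annotation::DBLink objects (with database and
primary_id methods). The fetch itself is left as a comment because it needs
BioPerl and network access.

```perl
use strict;
use warnings;

# Collect [database, primary_id] pairs from the 'dblink' annotations of a
# sequence object. $seq is expected to behave like a Bio::SeqI object, for
# example one returned by Bio::DB::SwissProt->get_Seq_by_acc.
# (dblink_pairs is a hypothetical helper name, not part of BioPerl.)
sub dblink_pairs {
    my ($seq) = @_;
    my @pairs;
    for my $link ($seq->annotation->get_Annotations('dblink')) {
        # Each $link is assumed to be a Bio::Annotation::DBLink.
        push @pairs, [ $link->database, $link->primary_id ];
    }
    return @pairs;
}

# Example use, commented out because it needs BioPerl and a network
# connection; 'SOME_ACCESSION' is a placeholder:
#   use Bio::DB::SwissProt;
#   my $seq = Bio::DB::SwissProt->new->get_Seq_by_acc('SOME_ACCESSION');
#   printf "%s\t%s\n", @$_ for dblink_pairs($seq);
```

Which database names show up (EMBL, PDB, and so on) depends on the entry, so
picking the nucleotide ones would still need a filter on the first element.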

So, there are two choices depending on the type of protein sequence:
Bio::DB::SwissProt for SwissProt proteins, and Bio::DB::GenBank for GenBank proteins.
But the SwissProt protein P38038 could not be fetched with
Bio::DB::SwissProt. I got these messages:
-------------------- WARNING ---------------------
MSG: id (P38038) does not exist
-------------------- WARNING ---------------------
MSG: acc (P38038) does not exist

Then I tried Bio::DB::GenBank to fetch this protein sequence, and
it worked. Since then I have been trying to get the DBSOURCE information for
the protein, which I think is the only way to find out where to get the DNA sequence.

So, I am sort of stuck. I'm using BioPerl 1.4. I'm wondering whether getting
the DBSOURCE information from a GenBank file is really hard, or whether there
is a way to do this.
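
As a stopgap, the xrefs could be read straight out of the flat-file text
with plain Perl, without going through the BioPerl object model. The sketch
below is only that: dbsource_xrefs is a made-up helper name, the sample
record is cut down from the DBSOURCE block quoted earlier, and it assumes
the layout shown there (a DBSOURCE line, indented continuation lines, and a
comma-separated list after "xrefs:").

```perl
use strict;
use warnings;

# Return the accession-like tokens listed after "xrefs:" in the DBSOURCE
# block of a GenPept flat-file record passed in as one string.
# (dbsource_xrefs is a hypothetical helper, not a BioPerl function.)
sub dbsource_xrefs {
    my ($record) = @_;

    # Grab the DBSOURCE line plus its indented continuation lines.
    return unless $record =~ /^DBSOURCE[ \t]+(.*(?:\n[ \t]+.*)*)/m;
    my $block = $1;

    # Keep only the text after the first "xrefs:" label.
    return unless $block =~ /xrefs:(.*)/s;
    my $list = $1;

    # Split on commas, trim whitespace, and keep tokens that look like
    # accessions (letters, digits, underscore, optional ".version").
    my @ids;
    for my $token (split /,/, $list) {
        $token =~ s/^\s+|\s+$//g;
        push @ids, $token if $token =~ /^[A-Z0-9_]+(?:\.\d+)?$/;
    }
    return @ids;
}

# Small demonstration on a shortened sample record.
my $sample = join "\n",
    'DBSOURCE    swissprot: locus CYSJ_ECOLI, accession P38038;',
    '            class: standard.',
    '            xrefs: M23008.1 , AAA23650.1, U29579.1, AAA69274.1, U00096.2,',
    '            AAC75806.1, AP009048.1',
    'KEYWORDS    .',
    '';
print "$_\n" for dbsource_xrefs($sample);
```

The list mixes nucleotide and protein accessions, so deciding which ones to
fetch as DNA would still need a further check on each accession.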

I think that this might be a basic question and I'm sorry for my lack of
knowledge of BioPerl. I will appreciate your help.

Thank you very much,

Sang Chul

Live, Learn, and Love!
E-mail : goshng at empal dot com
            goshng at gmail dot com
