[Bioperl-l] Entrez Gene ASN
skirov at utk.edu
Thu Mar 10 09:14:40 EST 2005
I have done some (mostly) serious thinking about ASN Entrez Gene parsing
and I propose we do my favorite thing: postpone everything we cannot
deal with right now. If you want it to sound better: take a gradual
approach where we store the data we can deal with in the existing
Bioperl objects and skip the rest for now.
An ASN gene record can be correctly represented as a tree. I have written a
simple parser for my own purposes which stores the following:
What I do then is select records by level and tag and build different
objects from them. So level 2 with parent EntrezGene (which is the root
level and carries no information) is the gene description and has tags
such as gene, name, etc.; at levels 3, 5 and 6 you can get the complete
species definition by looking for orgname and org as tags and for records
with parent mod (which is a value for orgname; descend down the branch).
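To make the level/tag/parent idea concrete, here is a minimal sketch in Python (not the parser described above, and not Bioperl code). It assumes the record has already been parsed into nested dictionaries; the Node fields, the flatten and select helpers, and the toy record are all hypothetical and do not reproduce the real Entrez Gene layout.

```python
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional


@dataclass
class Node:
    level: int             # depth in the tree, root = 1 (assumed convention)
    tag: str               # field name at this node
    parent: Optional[str]  # tag of the parent node, None for the root
    value: Any             # leaf value, or None for a branch node


def flatten(tag: str, value: Any, parent: Optional[str] = None,
            level: int = 1) -> Iterator[Node]:
    """Walk a nested dict depth-first and yield one Node per tag."""
    if isinstance(value, dict):
        yield Node(level, tag, parent, None)
        for child_tag, child_value in value.items():
            yield from flatten(child_tag, child_value, tag, level + 1)
    else:
        yield Node(level, tag, parent, value)


def select(nodes: List[Node], level: Optional[int] = None,
           tag: Optional[str] = None,
           parent: Optional[str] = None) -> List[Node]:
    """Return nodes matching every criterion that is not None."""
    return [n for n in nodes
            if (level is None or n.level == level)
            and (tag is None or n.tag == tag)
            and (parent is None or n.parent == parent)]


# Toy record for illustration only; not real Entrez Gene content.
record = {
    "gene": {"locus": "GENE1", "desc": "example gene"},
    "source": {"org": {"taxname": "Homo sapiens"}},
}
nodes = list(flatten("EntrezGene", record))
top_level_tags = [n.tag for n in select(nodes, level=2, parent="EntrezGene")]
species = select(nodes, tag="taxname")[0].value
```

Storing the parent tag on every node is what lets one pick out a branch (for example, everything under org) without walking the tree again.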
I am using this approach to store most of the data in a relational
database without going through Bioperl. What I ultimately want to do is
use standard Bioperl modules. However, I don't think we have an object
that can efficiently represent the structure (correct me if I am wrong).
I think it may be a good idea to have a container object, possibly
Bio::Gene, that can contain multiple Bio::Seq objects (with or without
real sequence). I believe we can borrow some structure and code from
the EnsEMBL gene representation (the way it holds multiple transcripts,
etc.; certainly not the database interactions).
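The shape of such a container could look roughly like the sketch below. It is written in Python purely for illustration and is not an existing Bioperl or EnsEMBL API: the Gene and SeqRecord classes, their fields, and the identifiers in the example are all hypothetical stand-ins for the proposed Bio::Gene and for Bio::Seq.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SeqRecord:
    """Hypothetical stand-in for a Bio::Seq object."""
    accession: str
    seq: Optional[str] = None  # None when no real sequence is attached


@dataclass
class Gene:
    """Hypothetical container: one gene holding several sequence records."""
    gene_id: str
    symbol: str
    products: List[SeqRecord] = field(default_factory=list)

    def add_product(self, record: SeqRecord) -> None:
        self.products.append(record)

    def with_sequence(self) -> List[SeqRecord]:
        """Only the records that actually carry sequence data."""
        return [p for p in self.products if p.seq is not None]


# Made-up identifiers, for illustration only.
gene = Gene(gene_id="1", symbol="GENE1")
gene.add_product(SeqRecord("TRANSCRIPT_1", seq="ATGGCC"))
gene.add_product(SeqRecord("TRANSCRIPT_2"))  # annotation only, no sequence
```

Keeping the sequence optional is one way to allow records "with or without real sequence" in the same container.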
Please let me know what you think.