[Bioperl-l] Entrez Gene and bioperl-db
Peter.Robinson at t-online.de
Sun Feb 6 12:17:07 EST 2005
On Fri, Feb 04, 2005 at 04:07:36PM -0700, Stephen L. Mathias wrote:
> Hi Peter,
(....) On Mon, 2005-01-17 at 04:06, Peter Robinson wrote:
> > Hi list,
> > 3) In the meantime I have also gotten a lex/yacc parser in C to parse
> > the species-specific Gene files (which is by far the most interesting
> > file in the Entrez gene system). In principle this approach could be
> > done in Perl -- straightforward but a lot of detail work. I will be
> > needing this kind of thing for my work, so I will continue to work on
> > this, and once it is bug-free in C I will think about ways of porting it
> > to Bioperl (this might take a while). As I mentioned before on this
> > list, if anybody else can do this more quickly please go ahead (but drop
> > me a line); on the other hand, collaborators who like the idea of
> > writing a grammar in the style of lex/yacc or ANTLR are also welcome.
> I've written a script that uses Parse::RecDescent and an associated
> grammar to parse the EntrezGene ASN.1 files. Actually, I've only tested
> it on the human file, but I assume it will work for the rest as well.
> According to my (admittedly shallow) understanding, Parse::RecDescent
> grammars work in a fundamentally different way than yacc grammars do.
> However, it is pure Perl.
> The script does not create bioperl objects; it simply converts the
> records into large data structures that more or less mirror the ASN.1.
> I take these and store the bits I want in a database. It would be easy
> enough to convert to bioperl objects. However, you may not want to take
> this approach as the parser itself is pretty slow (some examples
> below). My familiarity with the bioperl object model is a little rusty,
> but *a lot* of instantiation would need to be done to fully encapsulate
> the data represented in an EntrezGene record. I'm guessing that the
> additional time required would be considerable.
> The parser takes a second or two for most genes; however, this goes up
> dramatically for larger records. Here are some examples from a little
> test file run on a box with fairly fast processors (2.8GHz/1MB Cache):
> Parsing Record 1 (439656 bytes)
> Success for gene BRCA1 (LocusID 672). Time: 2 minutes and 6 seconds.
> Parsing Record 2 (224148 bytes)
> Success for gene CFTR (LocusID 1080). Time: 33 seconds.
> Parsing Record 3 (45261 bytes)
> Success for gene CNR1 (LocusID 1268). Time: 1 second.
> Parsing Record 4 (570419 bytes)
> Success for gene COX2 (LocusID 4513). Time: 5 minutes and 30 seconds.
> Parsing Record 5 (40860 bytes)
> Success for gene CYP1B1 (LocusID 1545). Time: 1 second.
> Parsing Record 6 (42362 bytes)
> Success for gene SRY (LocusID 6736). Time: 2 seconds.
> Parsing Record 7 (110754 bytes)
> Success for gene TRPV1 (LocusID 7442). Time: 7 seconds.
> It may very well be possible to speed this thing up. This was my first
> foray into Parse::RecDescent land, and it was somewhat, um, painful to
> get it working at all. At this point, I'm not inclined to spend any
> more time on it. It works for my purposes.
> At any rate, if you (or anyone else on the list) are interested I'd be
> happy to post the code.
> ( Stephen L. Mathias, Ph.D. ( s m a t h i a s (
> ) Office of Biocomputing ) @ p o b l a n o )
> ( UNM School of Medicine ( . h e a l t h . (
> ) ) u n m . e d u )
> ( http://poblano.health.unm.edu/ (
In the meantime I have started to set up a Java program using an ANTLR grammar to parse the gene ASN.1 files. It is about half of an alpha version, but it works more or less. Given the size of the species-specific ASN.1 files and the fact that many users (myself at any rate) would like to parse the entire files for various purposes, performance is an important issue. As I understand it, ANTLR also generates recursive-descent parsers, but the performance seems to be much better, at least so far.
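For readers unfamiliar with the approach, here is a minimal sketch of a hand-written recursive-descent parser for an ASN.1-like text format. It is written in Python purely for illustration; it is neither the Java/ANTLR parser nor the Parse::RecDescent script discussed in this thread. The sample record is a made-up, heavily simplified fragment, and the grammar, the field names and the helper names (tokenize, Parser, parse) are assumptions for the example, not the real Entrez Gene specification.

```python
import re

# One token per match: quoted string, '::=', brace/comma, integer, identifier.
TOKEN_RE = re.compile(r'''
    \s*(?:
        (?P<string>"(?:[^"]|"")*")
      | (?P<assign>::=)
      | (?P<punct>[{},])
      | (?P<number>-?\d+)
      | (?P<ident>[A-Za-z][A-Za-z0-9-]*)
    )''', re.VERBOSE)


def tokenize(text):
    """Split the input into (kind, text) pairs."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


class Parser:
    """One method per grammar rule; each consumes the tokens it recognises."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self, offset=0):
        i = self.pos + offset
        return self.tokens[i] if i < len(self.tokens) else ("eof", "")

    def expect(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.pos += 1
        return tok[1]

    def parse_record(self):
        # record := IDENT '::=' value
        type_name = self.expect("ident")
        self.expect("assign")
        return type_name, self.parse_value()

    def parse_value(self):
        # value := block | STRING | NUMBER | IDENT
        kind, text = self.peek()
        if (kind, text) == ("punct", "{"):
            return self.parse_block()
        if kind == "string":
            self.pos += 1
            return text[1:-1].replace('""', '"')  # doubled quote = literal quote
        if kind == "number":
            self.pos += 1
            return int(text)
        if kind == "ident":
            self.pos += 1
            return text
        raise SyntaxError(f"unexpected token {text!r}")

    def parse_block(self):
        # block := '{' [ element (',' element)* ] '}'
        self.expect("punct", "{")
        entries = []
        if self.peek() != ("punct", "}"):
            entries.append(self.parse_element())
            while self.peek() == ("punct", ","):
                self.pos += 1
                entries.append(self.parse_element())
        self.expect("punct", "}")
        return entries

    def parse_element(self):
        # element := IDENT value | value
        # One token of lookahead decides whether an identifier is a field name.
        kind, _ = self.peek()
        if kind == "ident" and self.peek(1) not in (("punct", ","), ("punct", "}")):
            name = self.expect("ident")
            return name, self.parse_value()
        return None, self.parse_value()


def parse(text):
    parser = Parser(tokenize(text))
    record = parser.parse_record()
    if parser.peek()[0] != "eof":
        raise SyntaxError("trailing input after record")
    return record


# Made-up, simplified record; not copied from a real Entrez Gene file.
SAMPLE = '''
Entrezgene ::= {
  track-info { geneid 672, status live },
  gene { locus "BRCA1", syn { "alias-one", "alias-two" } }
}
'''
```

Calling parse(SAMPLE) returns the type name together with nested lists of (name, value) pairs that mirror the braces in the input, much like the "large data structures" described in the quoted message. Lists of pairs are used rather than dictionaries so that repeated or unnamed elements are preserved.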
At this point it looks like it will take me some time to finish the Java parser (if I ever do), so if you would like to take the lead on this and see whether your parser can be adapted to BioPerl, please do so. I plan to continue slowly improving my ANTLR parser, and if it seems to work well enough, I will try to see whether a BioPerl version with reasonable performance could be made from it. Realistically speaking, that could be autumn or Christmas, though.
And yes, I would be interested in seeing your code.