[Bioperl-l] Entrez Gene ASN parsers
skirov at utk.edu
Sat Mar 12 17:59:16 EST 2005
I looked at the code (EntrezGene), and so far it seems to me that, as
you claim, it produces a fairly accurate and easy-to-understand data
structure (a few dead entries and some zero-size arrays, but nothing major).
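For what it's worth, dead entries and zero-size arrays like these can be stripped in a post-processing pass. The following is only a rough sketch in Python (chosen purely for illustration, not taken from either parser); the record layout and field names are invented and are not what the EntrezGene parser actually emits.

```python
def prune_empty(node):
    """Recursively drop None values and empty containers from nested dicts/lists.

    Returns None when nothing is left, so the caller can drop the entry too.
    Scalars such as 0 or "" are kept, since they may be real data.
    """
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            pruned = prune_empty(value)
            if pruned is not None:
                cleaned[key] = pruned
        return cleaned or None
    if isinstance(node, list):
        cleaned = [p for p in (prune_empty(v) for v in node) if p is not None]
        return cleaned or None
    return node


# Hypothetical parsed record: field names are made up for this example.
record = {
    "gene": {"locus": "GENE1", "syn": [], "db": [{"tag": None}]},
    "comments": [],
    "status": 0,
}
cleaned_record = prune_empty(record)
print(cleaned_record)
```

The pass costs one extra traversal per record, so it would need to be timed along with the parser if speed is the deciding factor.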
My only concern is the data structure itself. If you want to
achieve a better structure (non-redundant, two-level where possible, or a
collection of Bioperl objects), this will slow things down. I guess I
will benchmark the code I wrote against yours and choose the
faster one. I think this makes sense.
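The kind of comparison I have in mind could be sketched roughly as follows. This is a toy illustration in Python, not either of the actual parsers: the "key: value" record format, the sample values, and the names parse_flat, parse_structured and records_per_minute are all assumptions made up for the example. It only shows the idea of timing two parsers on the same input, where one does the extra work of building a non-redundant structure.

```python
import time

# Invented sample record in a simplified "key: value" format (not real ASN.1).
SAMPLE = "geneid: 1\nsymbol: GENE1\nsynonym: G1\nsynonym: G1\n"


def parse_flat(text):
    """Keep every (key, value) pair in order, duplicates included."""
    return [tuple(line.split(": ", 1)) for line in text.splitlines() if line]


def parse_structured(text):
    """Group values by key and drop duplicates (more work per record)."""
    grouped = {}
    for key, value in parse_flat(text):
        values = grouped.setdefault(key, [])
        if value not in values:
            values.append(value)
    return grouped


def records_per_minute(parser, records):
    """Time one pass of `parser` over `records` and scale to records/minute."""
    start = time.perf_counter()
    for record in records:
        parser(record)
    elapsed = time.perf_counter() - start
    return len(records) / elapsed * 60 if elapsed > 0 else float("inf")


records = [SAMPLE] * 20000
for parser in (parse_flat, parse_structured):
    rate = records_per_minute(parser, records)
    print(f"{parser.__name__}: {rate:,.0f} records/min")
```

A fair comparison would of course use the same real Entrez Gene file for both parsers and repeat the run a few times, since a single timing can be noisy.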
Liu, Mingyi wrote:
>I have just released a project on sourceforge that contains 4 different parsers for the Entrez Gene ASN file, based on regex, Parse::RecDescent, Parse::Yapp, and Perl-byacc. They differ in performance, and the regex-based parser is the best performer, processing over 13000 records a minute on average (it finishes the 900+ MB human annotation file in 11 minutes on one Intel Xeon 2.4 GHz CPU). The other parsers are at least several-fold slower, but I included them since they would be of interest to people learning to use those tools or choosing among them for a practical project. All parsers are short OO modules (<100 lines, not counting POD/YACC-generated code), so they are easy to use and understand.
>Right now my parsers do not assemble data into Bioperl objects, because for my project I only needed to put the data into a proprietary XML format. That format is not released; it is nothing special, just IP issues, and without those issues I could have released the parser code in Feb. The parsers behave like XML parsers: they parse Entrez Gene records and assemble the content into data structures only. But I hope they can serve as a base on which Bioperl objects can be built (the data structure is easy to use). Please feel free to use the code for any Bioperl or other project, as I released it under the GPL (thanks to the consent of my company and a collaborating company).
>Please also feel free to contact me if you have any suggestions or bug reports.
>The URL for the sourceforge project is http://sourceforge.net/projects/egparser/
>Dr. Mingyi Liu
>GPC Biotech Inc.
>610 Lincoln St.
>Waltham, MA 02451
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
5700 bldg, PO BOX 2008 MS6164
Oak Ridge TN 37831-6164
tel +865 576 5120
e-mail: skirov at utk.edu
sao at ornl.gov
"And the wars go on with brainwashed pride
For the love of God and our human rights
And all these things are swept aside"