[Bioperl-l] Entrez Gene ASN parsers

Liu, Mingyi Mingyi.Liu at gpc-biotech.com
Sat Mar 12 10:43:44 EST 2005


I have just released a project on SourceForge that contains four parsers for Entrez Gene ASN files, based on regex, Parse::RecDescent, Parse::Yapp, and Perl-byacc respectively.  They differ in performance.  The regex-based parser is the fastest, processing over 13,000 records a minute on average (it finishes the 900+ MB human annotation file in 11 minutes on one Intel Xeon 2.4 GHz CPU).  The other parsers are at least a few fold slower, but I included them since they may be of interest to people learning those tools or choosing among them for a practical project.  All parsers are short OO modules (<100 lines, not counting POD and YACC-generated code), so they are easy to use and understand.

Right now my parsers do not assemble data into Bioperl objects, because for my project I only needed to put the data into a proprietary XML format.  That format is not released (not that it's anything special, just IP issues; without the IP issues I could have released the parser code in February).  The parsers behave like XML parsers: they parse Entrez Gene records and assemble the content into data structures only.  I hope they can serve as a base on which Bioperl objects can be built (the data structure is easy to use).  Please feel free to use the code for any Bioperl or other project, as I released it under the GPL (thanks to the consent of my company and a collaborating company).
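
To give a rough idea of what assembling content into data structures means here, the following is a minimal sketch of the regex-tokenizer approach.  It is written in Python rather than Perl and is not taken from the released code.  The toy record, the function names (tokenize, parse_scalar, parse_value, parse_record) and the choice of dicts and lists as the output are all illustrative assumptions, and the sketch ignores ASN.1 CHOICE values and other constructs a real Entrez Gene file contains.

```python
import re

# One token is a quoted string (ASN.1 escapes a quote by doubling it),
# the '::=' assignment marker, a brace or comma, or a bare word/number.
TOKEN_RE = re.compile(r'"(?:[^"]|"")*"|::=|[{},]|[^\s{},"]+')


def tokenize(text):
    """Split ASN.1 value notation into a flat list of tokens."""
    return TOKEN_RE.findall(text)


def parse_scalar(tok):
    """Convert a single non-brace token into a Python value."""
    if tok.startswith('"'):
        return tok[1:-1].replace('""', '"')
    if re.fullmatch(r'-?\d+', tok):
        return int(tok)
    return tok  # bare identifier such as an enumerated value


def parse_value(tokens, pos):
    """Parse one value starting at tokens[pos]; return (value, next_pos)."""
    if tokens[pos] != '{':
        return parse_scalar(tokens[pos]), pos + 1
    pos += 1  # skip the opening brace
    fields, items = {}, []
    while tokens[pos] != '}':
        tok = tokens[pos]
        # A bare word followed by a value (not by ',' or '}') is a field name.
        is_name = (tok != '{' and not tok.startswith('"')
                   and tokens[pos + 1] not in (',', '}'))
        if is_name:
            fields[tok], pos = parse_value(tokens, pos + 1)
        else:
            value, pos = parse_value(tokens, pos)
            items.append(value)
        if tokens[pos] == ',':
            pos += 1
    # Named fields become a dict; unnamed elements become a list.
    return (fields if fields else items), pos + 1


def parse_record(text):
    """Parse 'TypeName ::= { ... }' into (type_name, nested structure)."""
    tokens = tokenize(text)
    value, _ = parse_value(tokens, 2)  # tokens[0] is the type, tokens[1] is '::='
    return tokens[0], value


# Hand-written toy record for illustration only, not real NCBI output.
SAMPLE = '''Entrezgene ::= {
  track-info {
    geneid 1,
    status live
  },
  type protein-coding,
  gene {
    locus "A1BG",
    syn { "A1B", "ABG" }
  }
}'''

record_type, record = parse_record(SAMPLE)
print(record_type, record['gene']['locus'], record['gene']['syn'])
```

With a structure like this, building objects on top is mostly a matter of walking nested dicts and lists, e.g. record['track-info']['geneid'] for the gene ID in the toy record.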

Please also feel free to contact me if you have any suggestions or bug reports.

The URL for the SourceForge project is http://sourceforge.net/projects/egparser/



Dr. Mingyi Liu
Computational Biologist
GPC Biotech Inc.
610 Lincoln St.
Waltham, MA 02451
