[Bioperl-l] Error reporting/Validation implemented
mingyi.liu at gpc-biotech.com
Tue Mar 15 17:01:36 EST 2005
Stefan Kirov wrote:
> Few things:
> I used your parser to produce Bioperl objects based on some of the
> high level features and compared it to what I have. Your parser is
> considerably faster (about twice), but it is still hard to tell as I
> am descending further in the hierarchy with mine. At the same time I
> don't think the difference will vanish, so I will start building over
> your parser to produce bioperl objects. I am not sure exactly how I am
> going to deal with the relationships that are necessary, but I'll deal
> with it when I finish everything else.
Thanks for the comparison result! That was fast! Please let me know if
you need some help using the data structure of my parser. I'll try to
provide some skeleton code for you tonight (or maybe in the next couple
of days, since you're away anyway), taken from my code that extracts all
data (as far as I can tell) from Entrez Gene. That way, although it
still does not construct objects for you, it will at least be easier to
find the stuff you want for object construction, which is definitely the
toughest step of creating a bioperl parser for Entrez Gene.
BTW, I just released version 1.04 with some simple improvements: it now
attempts (only on *NIX) to open files over 2 GB even if the Perl build in
use does not support large files (so that the file 'All_Data' works for me
without recompiling my Perl), the 'new' method accepts a 'file' option,
etc. It's more convenient to use (check "regex_parser_test.pl" in V1.04
for a usage example), somewhat like SeqIO's usage (pass 'file' to new()
and call next_seq to get the next record).
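
To illustrate the calling pattern described above (pass 'file' to the
constructor, then call next_seq repeatedly until nothing is left), here
is a rough, language-neutral sketch in Python. It is not the parser's
actual code or API: the class name RecordReader, the helper demo(), and
the assumption that every record begins with a line starting with
"Entrezgene ::=" are all hypothetical, chosen only for illustration. For
real usage, see "regex_parser_test.pl".

```python
import os
import tempfile


class RecordReader:
    """Hypothetical reader: returns one raw text record per next_seq() call."""

    # Assumption for this sketch: each record starts with this marker.
    RECORD_START = "Entrezgene ::="

    def __init__(self, file):
        self._handle = open(file, "r", encoding="utf-8", errors="replace")
        self._pending = None  # first line of the next record, read ahead

    def next_seq(self):
        """Return the next record as a string, or None at end of file."""
        if self._handle.closed:
            return None
        lines = [self._pending] if self._pending is not None else []
        self._pending = None
        for line in self._handle:
            # A new record marker ends the current record (if we have one).
            if line.startswith(self.RECORD_START) and lines:
                self._pending = line
                break
            lines.append(line)
        if not lines:
            self._handle.close()
            return None
        return "".join(lines)


def demo():
    """Write two tiny fake records to a temp file and read them back."""
    text = (
        "Entrezgene ::= {\n  geneid 1\n}\n"
        "Entrezgene ::= {\n  geneid 2\n}\n"
    )
    with tempfile.NamedTemporaryFile("w", suffix=".asn", delete=False) as tmp:
        tmp.write(text)
    reader = RecordReader(file=tmp.name)
    records = []
    while (rec := reader.next_seq()) is not None:
        records.append(rec)
    os.unlink(tmp.name)
    return records
```

Because only one record is held in memory at a time, a loop like the one
in demo() should behave the same on a small test file as on a
multi-gigabyte one; turning each raw record into a data structure would
happen inside the loop.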
> By the way it took 9 minutes on a 64 bit Xeon 3.4GHz even with
> Bioperl objects construction on the whole Homo_sapiens ASN file.
Thanks for sharing the benchmark! It's definitely faster than my Xeon
2.4 GHz. I just ran my parser V1.04 on the file All_Data that contains
all Entrez Gene genomes (about 7.4 GB) and it took the parser 98 minutes
to finish, with no errors found.
> The data that went inside the objects was: general desc of the genes
> (symbol, name, summary, etc.), organism descr. but none of the truly
> big parts. Unfortunately, I am leaving tomorrow for a conference, so I
> will have some more next week earliest. Thanks for sharing the code!
Glad to be of help!