[Bioperl-l] Entrez Gene ASN parsers
hlapp at gmx.net
Sat Mar 12 19:33:24 EST 2005
I kind of like this approach, i.e., have a general-purpose low-level
parser that you can be reasonably confident will never be the
bottleneck, and then build a bioperl parser on top of it that can
focus its code on assembling the desired data structure rather than on
the file format itself.
And of course assembling that data structure will slow things down a
lot but hey, either you want an object hierarchy in (bio-)perl or you
don't.
Also, given the thread and previous ones, that ominous bioperl data
structure may be very fluid initially, or even result in different
top-level parsers depending on how compatible the different visions are
for what to get out of that parser.
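To make the layering concrete, here is a minimal sketch in Python (not Perl, and not the parser discussed in this thread). It assumes a made-up brace-delimited toy format rather than real Entrez Gene ASN.1, and every name in it (parse_low_level, build_summary, build_location_string, GeneSummary, the sample record) is hypothetical. The low-level layer knows only the format and mirrors the input as nested dicts of lists; two independent top-level builders then pull different views out of the same generic structure.

```python
import re
from dataclasses import dataclass

# Toy format (an assumption for illustration, NOT real Entrez Gene ASN.1):
#   { name value, name value, name { ... } }
TOKEN = re.compile(r'\s*(?:([{},])|"([^"]*)"|([^\s{},"]+))')


def tokenize(text):
    """Yield (kind, value) pairs; knows nothing about gene data."""
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        pos = m.end()
        if m.group(1):
            yield ("punct", m.group(1))
        elif m.group(2) is not None:
            yield ("str", m.group(2))
        else:
            yield ("word", m.group(3))


def parse_block(tokens):
    """Parse items up to the matching '}' into a dict of lists."""
    node = {}
    while True:
        kind, name = next(tokens)
        if (kind, name) == ("punct", "}"):
            return node
        if (kind, name) == ("punct", ","):
            continue
        kind, value = next(tokens)
        if (kind, value) == ("punct", "{"):
            value = parse_block(tokens)
        # Repeated names are kept in file order, mirroring the input.
        node.setdefault(name, []).append(value)


def parse_low_level(text):
    """Format-only layer: returns a structure that mirrors the file."""
    tokens = tokenize(text)
    if next(tokens) != ("punct", "{"):
        raise ValueError("record must start with '{'")
    return parse_block(tokens)


@dataclass
class GeneSummary:
    gene_id: int
    symbol: str
    synonyms: list


def build_summary(raw):
    """One possible top-level view: a flat summary object."""
    return GeneSummary(int(raw["id"][0]), raw["symbol"][0],
                       list(raw.get("synonym", [])))


def build_location_string(raw):
    """A different top-level view built on the same raw structure."""
    loc = raw["location"][0]
    return loc["chromosome"][0] + loc["band"][0]


# Invented sample record, not real data.
RECORD = ('{ id 42, symbol "ABC1", synonym "X1", synonym "X2", '
          'location { chromosome "7", band "q21" } }')

raw = parse_low_level(RECORD)
summary = build_summary(raw)
location = build_location_string(raw)
```

Neither builder touches the tokenizer, so competing ideas about the final data structure can coexist as separate top-level layers without changing the format code.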
On Saturday, March 12, 2005, at 03:50 PM, Liu, Mingyi wrote:
> Hi, Stefan,
> Yes, the advantage and disadvantage of my approach are that my parsers
> do not take the underlying data into account. By totally ignoring the
> data content and focusing just on format, this approach ensures that
> no data will be left behind in parsing, that the development of the
> parsers is very fast, and that the parsers perform very well. In
> addition, even if NCBI changes the data content, the parser will most
> likely work just fine without any modifications.
> However, this does result in a data structure that is not consolidated
> into, for example, the two level type you'd want. The data structure
> generated merely reflects however NCBI chose to structure their Entrez
> Gene ASN files. Building Bioperl objects based on my parser would
> take some serious effort (1-2 weeks). It is definitely doable
> though, and the performance should not slow down much. The benchmark
> I gave included not just the time for parsing and data structure
> construction, but also data structure trimming, which traverses almost
> the entire data structure and makes changes. But the instantiation of
> Bioperl objects may slow the whole process down a few fold.
> Regardless, I totally agree that it would be best if you did a
> comparison and chose the most suitable approach.
> BTW, can you send me example entries for which there are dead entries
> or 0-sized arrays in my parser's output? I wonder if it's a problem
> with the Entrez Gene file or with my parser, since I simply let the
> data structure mirror the file. If it isn't in the file, then I would
> want to check whether it's a bug.
> I did process the full human genome into XML files and did not see any
> empty elements or attributes, and the parser runs on entire mouse and
> rat genomes without problem, which is expected.
>> -----Original Message-----
>> From: Stefan Kirov [mailto:skirov at utk.edu]
>> Sent: Saturday, March 12, 2005 5:59 PM
>> To: Liu, Mingyi
>> Cc: bioperl-l at portal.open-bio.org
>> Subject: Re: [Bioperl-l] Entrez Gene ASN parsers
>> I looked at the code (EntrezGene) and so far it seems to me that it
>> gives, as you claim, a pretty accurate and easy to understand data
>> structure (a few dead entries and some 0-size arrays, but nothing
>> major).
>> The only concern I have is the data structure. If you want to
>> achieve a better structure (non-redundant, two level where possible,
>> or a collection of Bioperl objects) this will slow things down. I
>> guess I will compare the code I wrote with yours and choose the
>> faster one. I think this makes sense.
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
Hilmar Lapp email: lapp at gnf.org
GNF, San Diego, Ca. 92121 phone: +1-858-812-1757