[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython, Biophp, even C++

Mingyi Liu mingyi.liu at gpc-biotech.com
Sun Mar 13 21:44:57 EST 2005

Andrew Dalke wrote:

> Python used to use pcre but that was replaced with sre some years
> back, in part to support Unicode-based regexps.
I see.  Doesn't matter anyway.  I do want to note that this \G /cg is 
purely for parser efficiency: s/// would work just fine, except it is at 
least an order of magnitude slower on large Entrez Gene records.  So, 
as I said, porting is fine, but performance will take a hit.  Then 
again, any parser relying on regex would need \G /cg for performance, 
and would take the same hit when ported.
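For anyone porting this, the idea behind \G with /cg can be sketched in Python (a plausible porting target in this thread): `Pattern.match(text, pos)` anchors each attempt at a given offset, so the tokenizer advances through the string without rescanning or copying consumed input.  The token pattern below is a simplified stand-in, not the actual grammar from the module:

```python
import re

# Simplified stand-in for the token pattern; the real grammar is richer.
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[{},]|[\w-]+)')

def tokenize(text):
    """Anchored tokenization: match() pinned at `pos` mirrors Perl's
    \\G with /cg, so consumed input is never rescanned or copied."""
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)  # anchored at pos, like \G
        if not m:
            break                   # trailing whitespace or unknown input
        tokens.append(m.group(1))
        pos = m.end()               # advance past the consumed token
    return tokens
```

The point is that each match starts exactly where the previous one ended, which is what keeps the regex approach fast on multi-megabyte records.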

> The code I looked at took a string and there was outer
> scaffolding to identify the record locations.
> The actual record extraction was not part of the EntrezGene
> library so I don't see what you could modify.  Perhaps add
> an "offset" field to the parse method?
It seems what you're looking for is a do-it-all text processor: it 
parses, it indexes, and it adapts (read on for my comment on that last 
one).  But I said explicitly that my parser is a parser only.  With 
that out of the way, let me address your question: yes, since my parser 
is a parser only, if you want to use it for indexing purposes you'd 
have to track positions in outer scaffolding or custom code, and make 
simple changes such as calling the pos function after token generation 
to record the position of each token in the input string (a truncated 
Entrez Gene record).  It's all doable, but I just wouldn't put indexing 
code inside a parser.
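As a sketch of what that outer scaffolding might look like, here is a hypothetical Python tokenizer that records each token's byte offset (the analogue of calling pos right after a Perl match); the pattern is illustrative, not the module's real grammar:

```python
import re

# Illustrative token pattern, not the actual module's grammar.
TOKEN = re.compile(r'\s*("(?:[^"\\]|\\.)*"|[{},]|[\w-]+)')

def tokenize_with_offsets(text):
    """Yield (token, byte_offset) pairs.  The offset is what an outer
    indexer would store for each record; the parser itself does not
    need to know or care about it."""
    pos = 0
    while True:
        m = TOKEN.match(text, pos)  # anchored match, as with \G /cg
        if not m:
            return
        yield m.group(1), m.start(1)  # offset of the token itself
        pos = m.end()
```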

> If you do get the byte positions of terms in the ASN.1
> (eg to report "syntax error at line 1234 column 56") then
> you would need to use the $` and $' fields, which perlvar
> warns is slow, so your timings would change.

Yeah, I know. If my parser tries to do more, sure it'd get slower. ;-)

> There are several layers to parsing.  ...
> But if you define that your parse tree returns the raw text
> representation then it is complete.  My question - which I
> haven't been able to resolve for Martel - is how should code
> like this, which tries to be cross-platform, handle what
> is semantically one item when it's represented as multiple
> components in the input format?
> Here are two examples to show how tricky that is
>      url "http://www.ncbi.nlm.nih.gov/sutils/evv.cgi?taxid=9606&conti
> g=NT_009714.16&gene=A2M&lid=2&from=1979284&to=2027463"
>       text "There is a significant genetic association of the 5 bp 
> deletion
>  and two novel polymorphisms in alpha-2-macroglobulin 
> alpha-2-macroglobulin
>  precursor with AD",
> In the first the "\n" should be removed while in the second
> it should be replaced with a space.
> It would be nice if this behavior was also the same cross-platform.
I think the phrase you were looking for, instead of "what is semantically 
one item when it's represented as multiple components in the input 
format?", is simply "context-sensitive rules".  Context-sensitivity can 
be cross-platform, but my parser does not need to deal with it (note 
that how to replace the "\n" really is the user's preference and none of 
the parser's business: you might want to replace the second one with a 
space, but someone else might want "<br>").  Even if you find a better 
example, I'd suggest you look at my Parse::RecDescent-based parser, 
since Parse::RecDescent allows context-sensitive grammars.  One should 
also know that coding context-sensitivity in regex is not that hard 
either, but you do need a well-defined set of scenarios and rules.
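To illustrate why this is a user preference rather than the parser's job, a post-processing step can take the replacement as a parameter.  This is a hypothetical Python helper; the field names and interface are assumptions, not anything from the actual module:

```python
def join_wrapped(field, raw, newline_repl=' '):
    """Rejoin a value that the pretty-printed ASN.1 wrapped across lines.
    How to replace the newline is the caller's choice, not the parser's:
    URLs must be rejoined with nothing (any inserted character corrupts
    them), while free text might want ' ' or '<br>'.  The 'url' field
    name here is an illustrative assumption."""
    if field == 'url':
        return raw.replace('\n', '')
    return raw.replace('\n', newline_repl)
```

The parser hands back the raw value; the caller picks the policy.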

> It's post tokenization and pre parse tree assembly.  For this
> case it's a simple regexp search/replace but 1) how is that handled
> in a cross platform manner 

My parser is regex based.  Any change in the Perl parser could be 
mirrored in the other languages (I still prefer "language" to 
"platform", since that is really the point: my parsers are already 
cross-platform, in that they run on any platform that supports Perl).  
Some changes might be needed, such as working around unsupported 
modifiers, but you wouldn't expect porting across languages to require 
no work from developers, right?  What needs to be done should be 
determined on a case-by-case basis; I can't think of a generic answer 
that is a panacea for all porting cases.

> and 2) for the general problem it's
> not as simple as a regexp.
Exactly.  If you read my comments on my parsers, I mentioned that when 
things get more complex, one should use grammar-based tools instead.  
Right now, for Entrez Gene, regex works and it works best; that's why I 
mostly talk about this one.  But you're very welcome to check out the 
other ones for completeness.

> Indeed some of the problems don't apply.  But speaking solely for
> myself and not for the Biopython project I would rather use a
> validating parser that reported at least imbalanced parens,
> roughly equivalent to checking for well-formed XML.

Of course.  Such checking can easily be added to my parser, with one 
variable tracking depth; that's all that's needed, since Entrez Gene 
has only one type of block delimiter.  I'll probably do it when I have 
time next week, since it's only three lines of code or so.  But then 
again, I'm starting to realize that you would rather use some other 
parser anyway.
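The depth-tracking check described above can be sketched in a few lines.  This is a hypothetical Python version for illustration, not the Perl code being discussed:

```python
def check_balanced(tokens):
    """Validate brace balance for an Entrez Gene record.  Entrez Gene
    ASN.1 uses a single block delimiter pair, so one depth counter is
    all the well-formedness check needs (hypothetical helper)."""
    depth = 0
    for tok in tokens:
        if tok == '{':
            depth += 1
        elif tok == '}':
            depth -= 1
            if depth < 0:
                raise ValueError('unexpected "}"')
    if depth != 0:
        raise ValueError('%d unclosed "{"' % depth)
    return True
```

This is roughly equivalent to checking for well-formed XML, at the level of the one delimiter type the format uses.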

> For example, in reading the ASN.1 spec at
>  http://asn1.elibel.tm.fr/en/standards/index.htm#x680
> I see that ASN.1 could include a real number but the
> Homo_sapiens file doesn't have one and your parser doesn't
> handle it (it looks for [\w-]).  Mmm, and there are many
> more data types in full ASN.1.
Mmm, you really tried hard to let me know that my parser cannot do it 
all.  ;-)  Well, read on for my response.

> As far as I can tell, if NCBI does add a new data type that
> your code doesn't support then it's very hard to tell that
> the code is ignoring problems.

Good point.  I'll add one line in the _parse function to do catch-all 
error reporting.
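A sketch of what such a catch-all might look like, in Python for illustration (the helper name and interface are assumptions, not the module's actual _parse code):

```python
def report_leftover(text, pos):
    """Catch-all error reporting: if tokenization stalls before the end
    of the record (say, NCBI introduces a data type the grammar doesn't
    know), raise with the offending bytes instead of silently ignoring
    them.  Hypothetical helper, not code from the actual module."""
    rest = text[pos:].strip()
    if rest:
        raise ValueError('unparsed input at byte %d: %r' % (pos, rest[:40]))
```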

> Consider a floating point date value (not legal according to
> NCBI but legal ASN.1 ... I think - just testing the idea)
>   ...
>       year 2003.43,
>  ...
> Your code converts that into
> ...
>                                     '2003' => [
>                                         undef
>                                         ]
> ...
> That doesn't seem like the behavior it should do.
Well, your point that my parser is not a general ASN.1 parser is well 
taken, especially since I never claimed it to be one.  If you're 
looking for an ASN.1 Perl parser, I heard on the mailing list that 
someone has already made one; it might be of help to you.

> BTW, looking at what you do, I don't understand why you handle
> the explicit types fields as you do.  Why does
>           tag id 9606
> turn into
>          'tag' => [
>             {
>               'id' => '11'
>             }
>           ],
> As far as I can tell there's only a single data type
> there so what about omitting the list reference?
>          'tag' => {
>               'id' => '11'
>             },
> But I don't know enough about ASN.1.
This has nothing to do with ASN.1.  It is all about how uniform the data 
structure can be.  In fact, consider what happens when NCBI decides to emit
  tag id 12345,
  tag str "whatever"
which is far more likely than the cases you considered in your earlier 
criticisms; then the data structure would need to become:

        'tag' => [
            {
              'id' => '12345'
            },
            {
              'str' => 'whatever'
            }
          ],
With your suggested approach, the user would be forced to test what 
type of reference $hash{'tag'} is before dealing with it as either a 
hash or an array.  With my approach, the user always knows to deal with 
it as an array.  This is also exactly why (I guess) XML::Simple has the 
'ForceArray' option, if you recall.
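A minimal sketch of that ForceArray-style convention in Python (a hypothetical helper, not part of any of the modules discussed):

```python
def add_field(record, key, value):
    """ForceArray-style accumulation (mirroring XML::Simple's ForceArray
    option): every key maps to a list even when it occurs only once, so
    consumers never have to test whether a field is a single mapping or
    a list before using it."""
    record.setdefault(key, []).append(value)
    return record
```

With this convention, `record['tag'][0]` works whether NCBI emitted one `tag` or several, which is the whole point of forcing the array.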

Now, the promised response to the criticisms that my parser doesn't do:

1. Indexing of the Entrez Gene file.
2. Adaptive behavior when a new format comes out.
3. (Semi-)automatic cross-language porting.
4. Full support for ASN.1 parsing.

It's really simple, in case it wasn't already clear: my parser is just 
an Entrez Gene parser.  It is not designed to do those things.  You 
really went out of your way to show that my parser doesn't do 
everything, but you never showed why it cannot be a reasonable Entrez 
Gene parser, which is your main point.  I also don't understand why you 
would dismiss my parser right away as a candidate for porting to other 
languages when I could address your valid concern next week with a few 
lines.  Why?  I can understand that you were possibly put off by my 
perhaps naive-seeming enthusiasm about the prospect of porting this 
fast parser to other languages.  But I was pretty happy with the parser 
I made, simply because:

1. Plenty of people have said that they have a working Entrez Gene 
parser, but, probably for various reasons such as IP issues or specific 
projects, no one had posted one yet (at least I couldn't find one after 
plenty of searching).  Mine is the first I could find that's in the 
public domain and in Perl.
2. My parser is short and not written in guru style (since I'm far 
from a Perl guru), so it's easy to understand.
3. It's OO, with POD and example scripts, so it's very easy to use.
4. Most importantly, it's freakishly fast without making mistakes on 
the NCBI Entrez Gene downloads.

My enthusiasm is based on the belief that there isn't a Perl parser out 
there that is better than mine overall when points 2-4 are considered, 
and point 1 is just a trump card.  I thought it would be helpful to the 
many who want a GPL-ed Entrez Gene parser.

Nonetheless, if you just don't want to use my parser, you can simply say 
so (or tell me why it doesn't work as a portable Entrez Gene parser).  
Frankly, reading your emails, I was initially glad that we were having 
a useful discussion on parsers, but the endless picking at 
progressively absurd tasks for an Entrez Gene parser (being unable to 
index, adapt to arbitrary changes, auto-port, or parse the full ASN.1 
specification) really changed my opinion, particularly because I doubt 
anyone, in any language, would look for those in an Entrez Gene parser.  
Again, FYI, it's only a parser, and I have said repeatedly that it only 
constructs a data structure.

But I certainly welcome good suggestions, and I'll add some basic error 
reporting next week.  I didn't think it was needed since, again, I have 
already parsed and checked the results for human, mouse and rat, but 
it's still a good idea; thanks for the suggestion!  If someday you work 
out a fast parser, and/or one that does it all, in either Python or 
Perl, I'd like to know too.  I'm always thrilled to learn useful things.



BTW, I realize my last email was a bit overly broad in criticizing the 
early attitude that users have to do extra work to use the software.  I 
should have said that only some of the early software gave that 
impression; even though it was only a few packages, the impression 
could be big.  If that's what threw you off, I apologize.
