[Bioperl-l] Porting Entrez Gene parser to Biojava, Biopython,
Biophp, even C++
mingyi.liu at gpc-biotech.com
Sun Mar 13 16:44:57 EST 2005
Andrew Dalke wrote:
> When I wrote my grammars I did so in strict mode, and reported
> a bunch of errors to the database providers. The advantage
> is that wrong formats aren't accidentally parsed. The disadvantage
> is that minor changes break the parser.
> I don't see any solution to this other than having someone
> track the file formats over time.
Sure. If there are arbitrary and drastic changes to the file format,
someone must be watching for them. But one of my points was that my
parser would likely stay valid even if NCBI changes their data
definitions, because it's very unlikely that NCBI will change their file
structure/format, even though they may change the data definitions
(recall that I said my parser doesn't care about data content?).
> I looked at the regexps. The ones that Python doesn't
> support are \G and the compilation flags /cg . They won't
> be in Python because the start/end positions are available
> as local variables and not as implicit globals. It
> uses a different stylism.
You're right. The /cg modifiers are exactly the ones I was talking
about. \G is actually supported by PCRE, so very likely in Python too
since Python uses PCRE (please check again). Nonetheless, without /cg,
\G means little. That's why I said there's gonna be a performance hit.
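
For anyone attempting the Python port, here is a minimal sketch of how the
\G with /cg idiom might be expressed. As far as I can tell, Python's re
module has no \G; the usual substitute is to pass a start position to a
compiled pattern's match() method, which anchors the match at that offset.
The token pattern and the tokenize() name are assumptions made for
illustration and are not taken from EntrezGene.pm.

```python
import re

# Hypothetical token pattern for an ASN.1-like text format: a brace or
# comma, a double-quoted string (with "" standing for an embedded quote),
# or a bare word. Leading whitespace is skipped.
TOKEN = re.compile(r'\s*([{},]|"(?:[^"]|"")*"|[^\s{},"]+)')

def tokenize(text):
    """Yield tokens, resuming each match where the previous one ended."""
    pos = 0
    while True:
        # match(text, pos) is anchored at pos, which plays the role that
        # \G plays in Perl when the /g and /c modifiers are in effect.
        m = TOKEN.match(text, pos)
        if m is None:
            # End of input, or text this sketch cannot tokenize
            # (for example an unterminated string); it stops silently.
            break
        yield m.group(1)
        pos = m.end()
```

Because the position is an explicit local variable, no copy of the
remaining text is made between matches, so the scan stays linear in the
input size.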
> The first of these lists some tasks that can't be done
> with your approach, like being able to index all the
> records in a file by byte position.
Not really. If you really want those, my parser code can be easily
modified to record the file byte position of each token.
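
A rough sketch of what that modification could look like in Python: the
tokenizer runs over bytes and yields each token together with its byte
offset, and an index of record starts is built from those offsets. The
pattern, the function names, and the assumption that every record begins
with the bare word Entrezgene are all illustrative, not taken from the
actual parser.

```python
import re

# Hypothetical bytes pattern: brace or comma, quoted string, or bare word.
TOKEN_B = re.compile(rb'\s*([{},]|"(?:[^"]|"")*"|[^\s{},"]+)')

def tokenize_with_offsets(data):
    """Yield (byte_offset, token) pairs from a bytes object."""
    pos = 0
    while True:
        m = TOKEN_B.match(data, pos)
        if m is None:
            break
        # start(1) is the offset of the token itself, after any whitespace.
        yield m.start(1), m.group(1)
        pos = m.end()

def record_offsets(data, record_name=b"Entrezgene"):
    """Return the byte offset of every token equal to record_name.

    Assumes (hypothetically) that this bare word marks a record start.
    """
    return [off for off, tok in tokenize_with_offsets(data)
            if tok == record_name]
```

If the data was read from a file opened in binary mode, each offset can be
passed to seek() to jump straight to that record.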
> Parsers can also get better performance by assuming the
> file format is correct. Eg, your EntrezGene.pm doesn't
> detect if the file was truncated (I fed it only the first
> 1000 lines of the human genome file) while the context-free
> parsers you have will at least generate an error that
> the parentheses are unbalanced.
Yeah, my parser does not give many warnings at the current stage. I
certainly wouldn't mind someone taking my code and adding exception
handling. But frankly many parsers do not excel in this department.
Even some XML parsers only warn when something breaks the parser.
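
As one example of the kind of exception handling someone could add, here
is a minimal sketch of a truncation check that counts brace depth over an
already tokenized stream. The check_balanced() name is hypothetical, and
the sketch assumes the tokenizer delivers each quoted string as a single
token, so braces inside strings are not counted.

```python
def check_balanced(tokens):
    """Raise ValueError if '{' and '}' tokens are not balanced."""
    depth = 0
    for tok in tokens:
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unexpected closing brace")
    if depth != 0:
        # Input ended while still inside one or more blocks.
        raise ValueError(
            f"input ended with {depth} unclosed brace(s); "
            "the file may be truncated"
        )
```

Feeding it the tokens of a file cut off after the first 1000 lines would
be expected to raise the second error, since the outer blocks never close.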
> One thing I note, investigating a question of Hilmar's,
> is that your tokenization of strings isn't quite complete.
> Double-quoted "strings" that contain a double quote are
> escaped ""with doubled"" double quotes. Your tokenizer
> doesn't convert the double quotes into single ones. My
> Martel code has the same problem. It needed another
> layer to describe how to unescape strings and handle
> word spilling.
You caught me. I was just being lazy - I noticed this a while ago, but
decided to delay a bit since I have 4 different parsers that need to be
modified. Then I forgot. (It's probably my fault: last night I actually
remembered this too, but I uploaded the files anyway 'cause it's so
simple for anybody to fix.)
I'd say you're really exaggerating when you say my tokenization of
strings isn't complete based on this. Not unescaping the "" escape has
nothing to do with tokenization (it's a post-processing step after
tokenization). It simply takes one simple regex to fix it, no other
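
A sketch of that post-processing step in Python, assuming the tokenizer
hands over string tokens with their surrounding double quotes still
attached. The unescape_string_token() name is hypothetical. The sketch
only undoes the doubled quotes; it does not deal with the word spilling
across lines that Andrew mentions.

```python
import re

def unescape_string_token(token):
    """Strip the outer quotes and turn each "" into a single "."""
    is_quoted = (len(token) >= 2
                 and token.startswith('"')
                 and token.endswith('"'))
    if not is_quoted:
        return token  # bare word or punctuation: leave untouched
    # The one regex: collapse every doubled quote inside the string.
    return re.sub(r'""', '"', token[1:-1])
```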
Thanks for your suggestions. I think problems specific to Martel might
not apply in this case since the Entrez Gene file structure/format is
really simple and likely to stay very stable. That's why I was
proposing sharing this code base across languages.