[Bioperl-l] genbank/embl format ebnf or other formal description

Dan Kortschak dan.kortschak at adelaide.edu.au
Tue Sep 11 18:26:16 EDT 2012


Hi Hilmar,

Yes, I plan to use ragel, which once again supports Go, at least in an unofficial fork that looks like it will be merged into Adrian's repo. (It might be a nice project for a student to implement a perl back end for ragel - though the absence of formal format descriptions makes its utility for bioperl somewhat moot ;).

I'd be interested to see the GSoC project if it's public yet - I'm in the process of writing a pure Go SAM/BAM package to replace my boom interface to libbam.

I can see why a generic parser is not appropriate for a perl-based parser (for the same reason I'm not using a generic Go parser, but rather a parser generator). The fact remains, a formal specification is beneficial for testing correctness and avoiding some of the problems bioperl has had in the past when NCBI has changed formats under bioperl's feet.

cheers
Dan

On 12/09/2012, at 7:32 AM, "Hilmar Lapp" <hlapp at drycafe.net> wrote:

> One of the problems in Perl with using a language-neutral definition of the format as a context-free grammar has been that RecDescent was just way too slow for this.
> 
> One of the Google Summer of Code students working on fast parsers (for SAM/BAM I think) used Ragel (http://www.complang.org/ragel/), which looks quite cool, but unfortunately doesn't support Perl (nor Go :-)
> 
>    -hilmar
> 
> On Sep 11, 2012, at 5:39 PM, Dan Kortschak wrote:
> 
>> Thanks Chris. It is related to both really, and more.
>> 
>> Second first, I continue to be amazed at the lack of specification or testing in a significant portion of software in the bioinformatics realm (bioperl is a nice counter example and one that I am grateful for having had as a training ground - and the work that has obviously gone into working through parsing and formatting un- or under-specified formats by the core and other developers is phenomenal).
>> 
>> But to the first point, I am unable to use bioperl to parse/format these formats for my project as it is a new project, not written in Perl - apologies for abusing the list - but rather in Go. I could go through the Perl to reimplement based on that, but I was hoping to use a parser generator from a spec, so that I can guarantee the parser/formatter is correct formally.
>> 
>> I asked here because I believe the developers of bioperl are some of the foremost experts in parsing the collection of "weakly defined, internally redundant, ambiguous, bulky fruit salad[s] of ... data format[s]" [1] that constitute the majority of the file formats out there (this is not a pejorative against the bioperl devs, but rather a testament to their fortitude and strength - I have only implemented the bare minimum of formats in my library so far).
>> 
>> thanks
>> Dan
>> 
>> 
>> [1]http://www.biostars.org/post/show/7126/what-are-the-most-common-stupid-mistakes-in-bioinformatics/#7136
>> 
>> On 12/09/2012, at 12:09 AM, "Fields, Christopher J" <cjfields at illinois.edu> wrote:
>> 
>>> Christopher,
>>> 
>>> I think Dan's question is orthogonal to actually parsing a file; it relates more to proper formatting for a particular format based on a specification as well as potential downstream validation.  Bio::SeqIO::genbank is geared for flexibility and can handle a lot of mis-formatted data; it can massage some data into the proper format if needed.  One must recognize that the primary purpose of the parsers is to get data into objects, not to act as a format converter (that just happens to be a nice useful side effect).
>>> 
>>> The problem is, like many formats, a formal specification for Genbank format doesn't exist outside of the NCBI example file (old and incomplete) and the FT definition as far as I know, so calling something 'official' Genbank format isn't possible outside of NCBI.
>>> 
>>> chris (f)
>>> 
>>> On Sep 11, 2012, at 9:10 AM, Christopher Bottoms <molecules at cpan.org> wrote:
>>> 
>>>> Dan,
>>>> 
>>>> Why not use BioPerl's Bio::SeqIO, which can parse GenBank files?
>>>> 
>>>> --Christopher Bottoms
>>>> 
>>>> On Fri, Sep 7, 2012 at 10:43 PM, Dan Kortschak
>>>> <dan.kortschak at adelaide.edu.au> wrote:
>>>>> Thanks Chris. That's remarkable, so many words and not an actual formal
>>>>> specification. I guess I have some work ahead of me. I found the
>>>>> example, but examples rarely contain all edges and corners.
>>>>> 
>>>>> Dan
>>>>> 
>>>>> On Sat, 2012-09-08 at 03:39 +0000, Fields, Christopher J wrote:
>>>>>> Re: Genbank, the only specification I know of is for the feature
>>>>>> table portion of the format as you have below.  They do have a
>>>>>> (possibly out of date) example file, note it isn't easily found unless
>>>>>> you search for it:
>>>>>> 
>>>>>> http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord
>>>>>> 
>>>>>> EMBL is better in this regard:
>>>>>> 
>>>>>> http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html
>>>>>> 
>>>>>> Note that UniProt Knowledgebase also has a user manual outlining the
>>>>>> similarities and differences with EMBL:
>>>>>> 
>>>>>> http://web.expasy.org/docs/userman.html
>>>>>> 
>>>>>> chris
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at lists.open-bio.org
>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>> 
>> 
>> 
> 
> -- 
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
> 


