[Bioperl-l] [BioPython] Cannot parse GenBank file
cjfields at uiuc.edu
Thu Jun 7 11:31:45 EDT 2007
On Jun 7, 2007, at 9:26 AM, Martin MOKREJŠ wrote:
> Chris Fields wrote:
>> One thing I missed which explains the biopython error: the LOCUS
>> line is missing the locus identifier (see the NCBI example record
>> link). This doesn't choke the bioperl parser but it appears to
>> stop the biopython parser in its tracks (maybe a feature instead
>> of a bug!).
>> You should try adding a unique identifier (maybe the name of the
>> file or record) to the LOCUS line to see if it works:
>> LOCUS testfile 6499 bp ds-DNA linear 02-AUG-2006
>> The bioperl parser in CVS writes out the correct alphabet when
>> this is added:
>> LOCUS testfile 6499 bp ds-DNA linear 02-
>> I'll try adding a warning to the bioperl parser for this.
> I have updated http://bugzilla.open-bio.org/show_bug.cgi?id=2305
> but let me emphasize that the LOCUS line now contains
> LOCUS pRL 5428 bp ds-DNA linear
> which still does not comply with the line you have proposed. But it
> can be parsed by bioperl-live from CVS. Is it still wrong? The test
> case is in Bugzilla record #2305.
That should work. There isn't a strict uniqueness test (that would
require caching and isn't worth the trouble IMHO), though you do need
a unique accession/locus for each record if you plan on indexing them
in the future.
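To illustrate the two points above (a LOCUS line with no name, and the uniqueness an index would later need), here is a minimal Python sketch. The helper names locus_name and duplicate_names are hypothetical and belong to neither bioperl nor biopython, and the token layout is assumed from the LOCUS lines quoted in this thread rather than taken from the full specification.

```python
def locus_name(line):
    """Return the locus name from a LOCUS line, or None if it is missing.

    Assumed layout (from the lines quoted in this thread):
    LOCUS <name> <length> bp <molecule> <topology> <date>
    """
    tokens = line.split()
    if not tokens or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    # The sequence length is the number right before the 'bp' unit
    # ('aa' for protein records); anything between LOCUS and that
    # number is treated as the name.
    for i, tok in enumerate(tokens):
        if tok in ("bp", "aa") and i >= 2 and tokens[i - 1].isdigit():
            name_tokens = tokens[1:i - 1]
            return name_tokens[0] if name_tokens else None
    raise ValueError("no sequence length found: %r" % line)


def duplicate_names(locus_lines):
    """List locus names that occur more than once (unnamed lines are skipped).

    This is the 'caching' a uniqueness test would need: every name
    seen so far has to be remembered.
    """
    seen, dups = set(), []
    for line in locus_lines:
        name = locus_name(line)
        if name is None:
            continue
        if name in seen:
            dups.append(name)
        seen.add(name)
    return dups
```

Running locus_name over each record before indexing would flag both the unnamed LOCUS line from the original report and any names that collide.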
Parsing GenBank data produced by third-party software is
problematic at best; some programs seem to follow no hard-and-fast
rule for their GenBank output, even though the specification is plainly
stated in the NCBI release notes. My take on that is to have a
stricter (read: follows the release notes) GenBank parser which passes
the data in the record to default handler methods. A user could then
replace the defined handlers with their own, either by subclassing the
default handler class and overriding the methods or by adding their own
code references directly.
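The handler idea described above could look roughly like the following Python sketch. It is not bioperl code: the class names StrictParser, DefaultHandler and LenientHandler, the handle_<KEYWORD> naming convention, and the callbacks argument are all assumptions made for illustration, and only top-level keyword lines are dispatched.

```python
class DefaultHandler:
    """Default handlers that follow the assumed strict layout."""

    def handle_LOCUS(self, data):
        tokens = data.split()
        # Strict rule: name, numeric length, then the 'bp' (or 'aa') unit.
        if (len(tokens) < 3 or not tokens[1].isdigit()
                or tokens[2] not in ("bp", "aa")):
            raise ValueError("malformed LOCUS line: %r" % data)
        return {"name": tokens[0], "length": int(tokens[1])}

    def handle_default(self, keyword, data):
        # Any keyword without a dedicated method is passed through as-is.
        return data


class StrictParser:
    """Splits a record into keyword lines and hands each to a handler."""

    def __init__(self, handler=None, callbacks=None):
        self.handler = handler if handler is not None else DefaultHandler()
        # Plain callables keyed by keyword; these take priority over methods.
        self.callbacks = dict(callbacks or {})

    def parse(self, lines):
        events = []
        for line in lines:
            # Sketch only: blank and indented continuation lines are skipped.
            if not line.strip() or line[0].isspace():
                continue
            keyword, _, data = line.partition(" ")
            data = data.strip()
            if keyword in self.callbacks:
                value = self.callbacks[keyword](data)
            else:
                method = getattr(self.handler, "handle_" + keyword, None)
                if method is not None:
                    value = method(data)
                else:
                    value = self.handler.handle_default(keyword, data)
            events.append((keyword, value))
        return events


class LenientHandler(DefaultHandler):
    """User subclass that tolerates a LOCUS line with no name."""

    def handle_LOCUS(self, data):
        tokens = data.split()
        if len(tokens) >= 2 and tokens[1] in ("bp", "aa"):
            tokens.insert(0, "unknown")  # placeholder name, an assumption
        return super().handle_LOCUS(" ".join(tokens))
```

With this layout, StrictParser() rejects the unnamed LOCUS line from the original report, StrictParser(handler=LenientHandler()) accepts it, and a one-off override can be passed as callbacks={"DEFINITION": some_function} without writing a subclass.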