Bioperl-guts: Embl parsing suggestions

Keith James kdj@sanger.ac.uk
19 Mar 2000 18:00:44 +0000


Hi all,

I've had a look at the parsing of features & qualifiers from EMBL
files and made some fixes (just on my local copy - nothing has been
committed to CVS yet).

Unquoted qualifiers (e.g. /codon_start=2) are now split into
qualifier/value correctly.
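
To illustrate the split, here is a minimal sketch in Python rather than
the actual Perl change in Bio::SeqIO::embl.pm. The pattern and the
function name split_qualifier are hypothetical, and it assumes the
qualifier has already been joined onto a single line.

```python
import re

# Hypothetical pattern: "/name", optionally followed by "=value".
QUALIFIER_RE = re.compile(r'^/(?P<name>[A-Za-z0-9_]+)(?:=(?P<value>.*))?$')


def split_qualifier(text):
    """Split one qualifier string into a (name, value) pair."""
    match = QUALIFIER_RE.match(text.strip())
    if match is None:
        raise ValueError("not a qualifier: %r" % text)
    name = match.group("name")
    value = match.group("value")
    if value is None:
        # Qualifier with no value at all, e.g. /pseudo.
        return name, None
    # Quoted values lose their surrounding quotes; unquoted values
    # such as the 2 in /codon_start=2 pass through unchanged.
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        value = value[1:-1].replace('""', '"')
    return name, value


print(split_qualifier("/codon_start=2"))
print(split_qualifier('/gene="abc"'))
```

The point is only that the quotes are optional in the match, so an
unquoted value is captured the same way as a quoted one.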

Locations like 1234^1235 etc. are still ignored.

Apparently Ace can output feature tables where a "" is split, leaving
a lone quote at the end of a line and causing premature termination of
a multiline qualifier. This is handled now.

Locations with < and/or > cause a roff_l and/or roff_r tag to be added
to the feature, indicating that it runs off that end. I was going to
indicate which end (5- or 3-prime) was affected, but couldn't rule out
strand 0 features where this wouldn't make sense.
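
A rough sketch of the idea, in Python and not the Perl that is in my
local copy: only the tag names roff_l and roff_r come from the
description above. The function name parse_simple_range is
hypothetical, and it assumes a plain start..end location with no
complement() or join() wrapper.

```python
import re

# Plain range with optional "<" before the start and ">" before the end.
RANGE_RE = re.compile(r'^(<?)(\d+)\.\.(>?)(\d+)$')


def parse_simple_range(location):
    """Return (start, end, tags) for a location such as '<1..>200'."""
    match = RANGE_RE.match(location)
    if match is None:
        raise ValueError("unsupported location: %r" % location)
    left_open, start, right_open, end = match.groups()
    tags = []
    if left_open:
        tags.append("roff_l")  # feature runs off the left end
    if right_open:
        tags.append("roff_r")  # feature runs off the right end
    return int(start), int(end), tags


print(parse_simple_range("<1..>200"))
print(parse_simple_range("10..20"))
```

The tags describe left and right in sequence coordinates, so they stay
meaningful for features with no strand.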

All the edits I've made are in Bio::SeqIO::embl.pm and Bio::SeqIO.pm. 
In order to accept feature qualifiers where a terminal quote doesn't
necessarily mean the end of the qualifier, I needed to buffer the
following line somewhere.

I've added a _pushback subroutine to Bio::SeqIO.pm where the line can
be stored in the object hash. Now _readline checks this buffer first,
before getting a new line from its filehandle.
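
The buffering can be sketched as follows. This is Python for
illustration only, not the code in Bio::SeqIO.pm. The class
PushbackReader and the helper read_quoted_value are hypothetical names.
The rule used to decide whether a trailing quote really ends the value
(an even number of quotes so far, and a next line that does not start
with a quote) is an assumption for the sketch, as is the input having
its FT line prefix already removed.

```python
import io


class PushbackReader:
    """Line reader holding at most one pushed-back line."""

    def __init__(self, handle):
        self._handle = handle
        self._buffer = None

    def pushback(self, line):
        # Store the line so that the next readline() returns it.
        self._buffer = line

    def readline(self):
        # Check the buffer first, then fall back to the filehandle.
        if self._buffer is not None:
            line, self._buffer = self._buffer, None
            return line
        return self._handle.readline()


def read_quoted_value(reader, first_line):
    """Collect a quoted value that may continue over several lines."""
    value = first_line.rstrip("\n")
    while True:
        nxt = reader.readline()
        if nxt == "":
            return value  # end of input
        closed = value.endswith('"') and value.count('"') % 2 == 0
        if closed and not nxt.startswith('"'):
            reader.pushback(nxt)  # belongs to the next qualifier
            return value
        value += nxt.rstrip("\n")


text = '"foo "\n"bar"" baz"\n/gene="x"\n'
reader = PushbackReader(io.StringIO(text))
value = read_quoted_value(reader, reader.readline())
print(value)
print(reader.readline().rstrip("\n"))
```

Because the peeked line goes back into the buffer, the caller that
reads the next qualifier needs no knowledge of the lookahead.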

make test still passes.

I've never written a Perl module and a lot of the code makes no sense
to me. I'm sure someone will intervene if I'm about to shoot someone
else in the foot.

Keith

-- 

Keith James  --  kdj@sanger.ac.uk  --  http://www.sanger.ac.uk/Users/kdj
The Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SA
=========== Bioperl Project Mailing List Message Footer =======
Project URL: http://bio.perl.org
For info about how to (un)subscribe, where messages are archived, etc:
http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl-guts.html
====================================================================