[Bioperl-l] Found bug in fasta.pm
Aaron J Mackey
Thu, 9 Nov 2000 09:02:39 -0500 (EST)
We have found the most reliable way to parse fasta format is via this
local $/ = "\n>";
chomp; # remove trailing "\n>"
my ($id, $desc, $seq) =
$_ =~ m/^>? # beginning >, only 1st seq.
(\S+)\s+ # identifier
([^\n]+)\n # description line
(.*)$ # sequence
/sox; # multiline, compile-once, ignore-whitespace
$seq =~ s/\W//g; # get rid of newlines, spaces, etc.
# ok, now do something with this ...
Note that no line-caching is necessary - the input-record-seperator $/
takes care of stopping before a new sequence is read. This works nicely
when your filehandle is not seekable (pipes, etc).
I'm not sure why SeqIO/fasta.pm was migrated to a read-line-by-line
architecture (thus necessitating the nitty-gritty mentioned below), but
I'd be happy to review any proposed solution (we've faced most of the
pathological cases before - '>'s in descriptions, in the sequence, on
lines by themselves (prefaced with whitespace, so not truly a new record),
etc. etc. - it's amazing what people will try to get away with).
On Wed, 8 Nov 2000, Elia Stupka wrote:
> I found a bug, which was introduced in main trunk, and then backported to
> branch-06. I have commented out these lines, and I leave it up to others
> if they want to do something about this.
> Elia Stupka (EnsEMBL)
> these are the lines:
> # Jason applying HL's patch from 25/05/2000
> # (on 02/10/2000)
> # a greater sign not preceded by a newline indicates that there is a
> # '>' within the description, so we need more to complete the record
> return unless defined($next_rec = $self->_readline());
> $entry .= $next_rec;
> What shall we do?
> tel: +44 1223 49 44 31
> mobile: +44 7971 59 03 69
> fax: +44 1223 49 44 68
> Bioperl-l mailing list
o ~ ~ ~ ~ ~ ~ o
/ Aaron J Mackey \
\ Dr. Pearson Laboratory /
\ University of Virginia \
/ (804) 924-2821 \
\ firstname.lastname@example.org /
o ~ ~ ~ ~ ~ ~ o