[Bioperl-l] "Be forgiving in what you accept" and Bio::Tools::GuessSeqFormat

George Hartzell hartzell at kestrel.alerce.com
Thu Jul 21 15:34:19 EDT 2005

There's a great "old" Internet maxim, "Be forgiving in what you accept
and strict in what you send".

The Bio::SeqIO modules seem to be able to cope with "fasta" formatted
files that have a space separating the ">" from the rest of the line
(e.g. "> ape") if a) you explicitly specify the format or b) you
have the sequence in a file whose name ends in "fa" (or generally
matches the list of patterns that correspond to fasta file names).

But if you happen to have the sequence in a file with a funny name
(e.g. /var/tmp/apreq23ZHis [aka a form upload]) then format guessing
fails.  It can't guess based on the filename, and the file content
test is strict: it wants to see the header line without the
whitespace (">ape").

Is there any reason not to extend the regexp a bit and relax that
constraint (since everything else seems to cope with it)?

Something like this:

*** /usr/local/lib/perl5/site_perl/5.8.6/Bio/Tools/GuessSeqFormat.pm.orig	Thu Jul 21 12:30:55 2005
--- /usr/local/lib/perl5/site_perl/5.8.6/Bio/Tools/GuessSeqFormat.pm	Thu Jul 21 12:31:45 2005
*** 591,595 ****
      my ($line, $lineno) = (shift, shift);
      return (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
!             $line =~ /^>\w/);
--- 591,595 ----
      my ($line, $lineno) = (shift, shift);
      return (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
!             $line =~ /^>\s*\w/);
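
To make the difference concrete, here is a small standalone sketch that
copies the two versions of the test out of the diff above into a pair
of subs.  The sub names (looks_like_fasta_strict and
looks_like_fasta_relaxed) are made up for illustration and are not the
names used inside GuessSeqFormat.pm; the script does not load BioPerl
at all, it only exercises the regexps.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the current test: the header must have a
# word character immediately after the ">".
sub looks_like_fasta_strict {
    my ($line, $lineno) = @_;
    my $ok = (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
              $line =~ /^>\w/);
    return $ok ? 1 : 0;
}

# Hypothetical stand-in for the patched test: optional whitespace is
# allowed between the ">" and the first word character.
sub looks_like_fasta_relaxed {
    my ($line, $lineno) = @_;
    my $ok = (($lineno != 1 && $line =~ /^[A-IK-NP-Z]+$/i) ||
              $line =~ /^>\s*\w/);
    return $ok ? 1 : 0;
}

# Try a few first lines of a file (line number 1).
for my $header ('>ape', '> ape', ">\tape", '>') {
    (my $shown = $header) =~ s/\t/\\t/g;   # make tabs visible
    printf "%-8s strict=%d relaxed=%d\n",
        "'$shown'",
        looks_like_fasta_strict($header, 1),
        looks_like_fasta_relaxed($header, 1);
}
```

Both versions accept ">ape" and reject a bare ">"; only the relaxed
version accepts "> ape".  The sequence-line half of the test is
unchanged by the patch.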
