[Bioperl-l] Bio::SeqIO::tab deletes gap characters when reading sequences, which is inconvenient

Tim White exceptlowang at gmail.com
Tue Apr 17 20:00:08 EDT 2012


Bio::SeqIO::tab (what you get when specifying -format => 'tab' to 
Bio::SeqIO->new()) is perfect for converting sequences into a 
one-per-line format, so that standard line-oriented UNIX tools (grep, 
comm etc.) work as expected.  Except...  I just discovered that it 
deletes gap ("-") characters when reading sequences, so it can't be used 
to round-trip any files that contain these.  This is a source of grief 
as I frequently work with FASTA files that contain aligned sequences, 
and thus gap characters.

This happens because the next_seq() method in Bio/SeqIO/tab.pm 
contains the line:

$seq =~ s/\W//g;

which removes every non-word character (anything other than a letter, 
digit or underscore) from the sequence data.  IMHO it would be *much* 
better if this were changed to:

$seq =~ s/\s//g;

which simply removes all whitespace characters (particularly including 
the \r that often appears at the ends of lines in text files that have 
visited Windows), so that gap characters (and, for example, periods and 
asterisks) are preserved.  Alternatively, you could simply get rid of 
this line of code and allow whitespace characters through.
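
To make the difference concrete, here is a minimal sketch in plain Perl 
(no BioPerl required) that applies each substitution to the same string.  
The two helper subs and the sample sequence are made up for illustration; 
they are not part of Bio::SeqIO::tab and only mimic the clean-up step 
discussed above.

```perl
use strict;
use warnings;

# Hypothetical helpers (not BioPerl code): each applies one of the two
# substitutions to a copy of its argument and returns the result.
sub strip_nonword {
    my ($seq) = @_;
    $seq =~ s/\W//g;    # drops everything except letters, digits and "_"
    return $seq;
}

sub strip_whitespace {
    my ($seq) = @_;
    $seq =~ s/\s//g;    # drops only whitespace, including \r and \n
    return $seq;
}

# Made-up aligned sequence: two gaps, a period, a trailing asterisk and
# a Windows-style line ending.
my $raw = "AC--GT.N*\r\n";

print "with \\W: ", strip_nonword($raw),    "\n";
print "with \\s: ", strip_whitespace($raw), "\n";
```

With \W the gaps, period and asterisk disappear along with the line 
ending; with \s only the line ending is removed.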

I'm not sure whether this counts as a "bug": a cursory search didn't 
turn up any docs explaining precisely which characters are and aren't 
preserved by classes implementing Bio::SeqIO.  But it's certainly 
inconsistent, since at least Bio::SeqIO::fasta and Bio::SeqIO::table 
(with columns and delimiters set up appropriately) allow round-tripping 
of files containing gap characters.  It's also extremely inconvenient 
for me personally, and I suspect for others.  Assuming no harm would be 
done by making the above change, what's the best way to get it made?  
I've simply edited my own local copy of tab.pm, but obviously if others 
agree I'd like to get the change made upstream.

