[Bioperl-l] alignIO::fasta bug

Chris Fields cjfields at illinois.edu
Thu Jan 22 22:54:09 EST 2009

On Jan 22, 2009, at 3:33 PM, Jason Stajich wrote:

> FYI - it appears that if the last sequence in a FASTA MSA is all  
> gaps (which happens in some whole genome synteny+multiple alignment  
> chunking like Mercator) no alignment is returned.
> yuck. It basically comes down to this bit of code where $end would  
> equal sequence length.
> 	#  If $end <= 0, we have either reached the end of
> 	#  file in <> or we have encountered some other error
> 	if ( $end <= 0 &&  ) {
> 		undef $aln;
> 		return $aln;
> 	}
> This start/end requirement of locatable seq is nice but kind of a  
> pain where I am managing the map of sequences outside of alignment  
> chunk.

I faced this issue with Mauve parsing (Bio::SeqIO::xmfa).  I set it up  
so LocatableSeqs can now have all gaps, but the alphabet needs to be  
set (get a warning otherwise) and start and end need to be initiated  
to 0, which is how I believe Mauve defined these.  However, should 0-0  
be a valid start/end for such a sequence?  Should we change that to  
automatically allow start = end = X (any position including 0) if a  
sequence is all gaps or empty?

If we come up with some rough ideas of how to handle this we can add  
some examples to the test suite and try getting LocatableSeq to do the  
right thing.  We can always mark them as TODO.

> Why not just check to see that the number of seq characters is 0 -  
> an all-gapped sequence as the last sequence of the file should still  
> be legal.
> Instead:
> 	if ( length($seqchar) == 0 ) {
> 		undef $aln;
> 		return $aln;
> 	}
> Although that would invalidate an empty alignment like this -- do we  
> want to still permit these?
> >A
> >B
> >C

These could be zero-length, empty seqs (start = end = undef).  I  
thought something was added to PrimarySeq recently for empty seqs.

> Also I've locally implemented possibility of parsing start/end from  
> the header line that is part of Mercator output and I think a  
> variant of UCSC headers that
> look like this for MFA.
> >Cp scaffold_1.3086:644980-660265+

Is there a way we can have a callback option for parsing out the  
data?  e.g. pass in everything after '>' and the LocatableSeq  
instance, the callback parses the string for whatever info and sets  
the LocatableSeq attributes accordingly.  We could default to some  
built-in coderefs for common regexes if a callback isn't defined.

> I can commit this to main-trunk so as to not interfere with branch  
> release.  Do any of the LocatableSeq/AlignIO::Fasta maintainers want  
> to comment?
> -jason

+1.  I think we should incorporate this into 1.6.1 along with any  
other LocatableSeq fixes and tests.


More information about the Bioperl-l mailing list