[Bioperl-l] Bio::AlignIO ignores questionmarks?

Chris Fields cjfields at uiuc.edu
Fri Apr 14 11:00:55 EDT 2006


This has been fixed in CVS a week or two ago.  It was reported as a bug
before:

http://bioperl.org/pipermail/bioperl-guts-l/2006-April/020836.html

If you either install from CVS or replace your local Bio::AlignIO::fasta
with the CVS version; it should work then.  

I don't adding '?' necessarily think it harms anything; the intent is to
maintain the nucleotide positions in the alignment.  If you used this
alignment (from the bug above):

>seq1
AAAAAAA?A?A
>seq2
AAAAAAA?AAA
>seq3
AAAAAAAAAAA
>seq4
AAAAAAAAAAA

then you got this when passing through AlignIO before the fix:

>seq1
AAAAAAAAA--
>seq2
AAAAAAAAAA-
>seq3
AAAAAAAAAAA
>seq4
AAAAAAAAAAA

which is not good since nucleotide positions in the alignment are modified
and gaps are introduced where there shouldn't be any.  After the fix you get
what you expect (the same as input).

A few proprietary sequence analysis programs use (or formerly used) '?' to
mean 'N or gap' or 'N' and actively exported this symbol to other formats.
I think older versions of Sequencher or Strider did this but I can't be
certain; I've slept since then.  Anyway, the intent of the user here is to
maintain the integrity of the alignment, so regardless of what '?' means the
nucleotide positions in the alignment are retained.  

OT here but another issue with Sequencher is that ':' is used for gaps as
well, something that Bio::AlignIO::fasta doesn't address.  I haven't tried
this out to see what it does, but I wouldn't be surprised if it breaks
again.

My question is, how many alignment formats don't retain positional
information?  In other words, don't all alignment formats (FASTA, clustal,
MSF) have sequences that have the same length with all gaps included
(internal and at both ends) so that they retain position regardless of what
symbols are used?

Chris

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 


> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Albert Vilella
> Sent: Friday, April 14, 2006 9:01 AM
> To: Kai Müller
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::AlignIO ignores questionmarks?
> 
> It seems like missing_char is more for SimpleAlign than for AlignIO. So
> in case of fasta files with '?' chars, they will be ignores in line 113
> of Bio::AlignIO::fasta.pm.
> 
> So you can add '\?' in that line of fasta.pm.
> 
> That will parse it correctly, although I am not sure whether fasta
> format should or shouldn't allow '?' chars in the file.
> 
> Anyone?
> 
> Cheers,
> 
>     Albert.
> 
> On Thu, 2006-04-13 at 20:38 -0400, Kai Müller wrote:
> > hi,
> >
> > I'm very new to BioPerl and have a maybe silly question.
> > when using Bio::AlignIO to load a set of sequences, the questionmarks
> are
> > simply lost (they refer to missing characters as opposed to gap
> characters
> > [-] or ambiguity [N]). I thought that 'missing_char()' might help, but
> it
> > didn't (I probably used it the wrong way).
> >
> > when $filename contains sequences with ????, the following snippet would
> > produce an alignment with ???? lost and downstream nucleotide just
> shifted
> > and the resulting length differnces filled by '---' @ 3' end:
> >
> >
> > my $aln_in = Bio::AlignIO->new(-file => "$filename", '-format' =>
> 'fasta');
> > 	my $aln = $aln_in->next_aln();
> > 	$aln->gap_char('-');
> > 	$aln->missing_char('?');
> >
> > 	my $testout = Bio::AlignIO->new(-fh => \*STDOUT , '-format' =>
> 'clustalw');
> > 	$testout->write_aln($aln);
> >
> >
> >
> > Can somebody give me a hint here?
> >
> > thanks and all the best,
> >
> > Kai Müller
> >
> >
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list