[Bioperl-l] Bio::AlignIO ignores questionmarks?
cjfields at uiuc.edu
Fri Apr 14 11:41:09 EDT 2006
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of David Messina
> Sent: Friday, April 14, 2006 12:14 AM
> To: Kai Müller
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Bio::AlignIO ignores questionmarks?
> Hi Kai,
> I'm by no means an expert with this module, but I'll take a shot.
> Running your code through a debugger, I'm seeing that
> Bio::AlignIO::fasta is gobbling the question marks:
> line 66: $MATCHPATTERN = '^A-Za-z\.\-';
> and then where $entry contains a line of sequence from the input file
> line 118: $entry =~ s/[$MATCHPATTERN]//g;
> As far as I can tell, a question mark is not a valid character for
> the FASTA format (see http://en.wikipedia.org/wiki/FASTA_format) --
> perhaps that's the reason Bio::AlignIO::fasta doesn't permit them?
I wouldn't trust wikipedia with that one. Check out the bioperl page:
The problem is, there is no really well-established universal rule for FASTA
format. These are three valid FASTA input sequences for some programs:
It all depends on how a program or web interface imports the sequence. You
don't need a description line; just '>' will do. Some don't even require a
sequence, though most filters will warn you. Even the rules for wrapping
the sequence over multiple lines differ (is it 60, 80, 100, or none?).
When I first started (early '90s), a quick and easy way to get
sequences ready for BLAST searches, which required FASTA, was to copy-paste and
add '>' and a CR on a line above, with no additional line breaks in the
sequence (all on one line). Still works AFAIK...
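To make the looseness described above concrete, here is a minimal sketch of a deliberately lenient FASTA reader in plain Perl. It is not how Bio::AlignIO::fasta or any BLAST front end parses its input; the sub name read_lenient_fasta and the sample input are made up for illustration. It accepts a bare '>' as a header, any wrapping width, and records with no sequence at all, and it removes only whitespace from sequence lines, so characters such as '?' survive.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): a deliberately lenient FASTA
# reader. Any line starting with '>' begins a record; every following line up
# to the next '>' is treated as sequence, whatever its width.
sub read_lenient_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        if ($line =~ /^>(.*)/) {
            # The description may be empty: a bare '>' is accepted.
            push @records, { desc => $1, seq => '' };
        }
        elsif (@records) {
            $line =~ s/\s+//g;            # drop whitespace only, keep everything else
            $records[-1]{seq} .= $line;   # join wrapped lines
        }
    }
    return @records;
}

# Made-up input: a bare '>' header, two wrapping widths, and an empty record.
my $input = ">\nACGT??AC\nGT\n>seq2 wrapped at 4\nACGT\nACGT\n>empty\n";
for my $rec (read_lenient_fasta($input)) {
    printf "desc='%s' seq='%s'\n", $rec->{desc}, $rec->{seq};
}
```

A stricter reader would reject or warn on the first and third records; which behavior is "right" depends entirely on the consuming program.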
> And then by the time missing_char() is applied, the question marks
> are already gone.
> What happens if you read in your sequence with question marks in a
> format that explicitly permits question marks?
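The stripping David describes can be reproduced outside BioPerl using only the two lines he quotes from Bio::AlignIO::fasta. In the sketch below, the sub strip_with and the sample sequence are hypothetical, and the extended pattern that also admits '?' is an assumption for illustration, not a BioPerl patch or supported option.

```perl
use strict;
use warnings;

# Pattern as quoted above from Bio::AlignIO::fasta (line 66 in the version
# David stepped through). Inside [...] the leading '^' negates the class.
my $MATCHPATTERN = '^A-Za-z\.\-';

# Hypothetical helper: applies the quoted substitution with a given pattern.
sub strip_with {
    my ($pattern, $entry) = @_;
    $entry =~ s/[$pattern]//g;   # delete every character NOT listed in the class
    return $entry;
}

my $line = 'ACGT????ACGT--';

# With the quoted pattern, the question marks are deleted and the rest shifts left.
print strip_with($MATCHPATTERN, $line), "\n";

# Assumption for illustration only: adding '?' to the class keeps them.
print strip_with($MATCHPATTERN . '\?', $line), "\n";
```

The first call returns a string four characters shorter than the input, which is consistent with the shifted nucleotides and the padding at the 3' end that Kai reports.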
> On Apr 13, 2006, at 7:38 PM, Kai Müller wrote:
> > hi,
> > I'm very new to BioPerl and have a maybe silly question.
> > When using Bio::AlignIO to load a set of sequences, the question marks are
> > simply lost (they refer to missing characters, as opposed to gap characters
> > [-] or ambiguity [N]). I thought that 'missing_char()' might help, but it
> > didn't (I probably used it the wrong way).
> > When $filename contains sequences with ????, the following snippet would
> > produce an alignment with the ???? lost, the downstream nucleotides simply
> > shifted, and the resulting length differences filled with '---' at the 3' end:
> > my $aln_in = Bio::AlignIO->new(-file => "$filename", '-format' => 'fasta');
> > my $aln = $aln_in->next_aln();
> > $aln->gap_char('-');
> > $aln->missing_char('?');
> > my $testout = Bio::AlignIO->new(-fh => \*STDOUT , '-format' => 'clustalw');
> > $testout->write_aln($aln);
> > Can somebody give me a hint here?
> > thanks and all the best,
> > Kai Müller
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l