[Bioperl-l] AlignIO formats

Bernd Web bernd.web at gmail.com
Tue Mar 30 16:10:09 EDT 2010


While using GuessSeqFormat and AlignIO I ran into some issues, and I am
now wondering whether the format definitions are actually correct,
especially for the pfam, selex and stockholm formats:

In BioPerl, pfam is like selex without any comment lines, but with
/start-end appended to the sequence id, as in myseq/1-111.
The EBI site (http://www.ebi.ac.uk/2can/tutorials/formats.html#pfam)
actually defines Pfam and Stockholm to be the same format. This makes
me wonder: is the Pfam format actually defined as Selex or as Stockholm?
Within BioPerl it is treated like Selex.
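
To make the /start-end convention concrete, here is a minimal sketch in plain Perl. The helper name split_label is hypothetical and is not part of BioPerl; it only illustrates how a label such as myseq/1-111 breaks down into an id and a coordinate range.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): split a label such as
# "myseq/1-111" into id, start and end. Returns an empty list when the
# label carries no /start-end suffix.
sub split_label {
    my ($label) = @_;
    return $label =~ m{^(\S+)/(\d+)-(\d+)$} ? ($1, $2, $3) : ();
}

my ($name, $start, $end) = split_label('myseq/1-111');
print "$name $start $end\n";    # prints "myseq 1 111"
```

A label without the suffix, such as seq2, yields an empty list, which is roughly the distinction between the pfam-style and plain selex-style ids discussed here.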

In addition, Selex (as used in HMMER 2.3.2) contains comment lines like
#=AC, #=RF or #=ID.
GuessSeqFormat uses these to detect Selex; however, they do not have to
be present. The test it applies is:

return (($lineno == 1 && $line =~ /^#=ID /) ||
        ($lineno == 2 && $line =~ /^#=AC /) ||
        ($line =~ /^#=SQ /));
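
The effect of that test on a file without annotation lines can be illustrated with a small stand-alone sketch. It copies only the quoted condition into a hypothetical sub, looks_like_selex, and applies it line by line; the surrounding loop and the sample data are assumptions made for illustration, not the actual GuessSeqFormat code.

```perl
use strict;
use warnings;

# Hypothetical stand-alone copy of the quoted condition.
sub looks_like_selex {
    my ($line, $lineno) = @_;
    return (($lineno == 1 && $line =~ /^#=ID /) ||
            ($lineno == 2 && $line =~ /^#=AC /) ||
            ($line =~ /^#=SQ /)) ? 1 : 0;
}

# Assumed driver: report a match as soon as any line passes the test.
sub guess_is_selex {
    my (@lines) = @_;
    my $lineno = 0;
    for my $line (@lines) {
        $lineno++;
        return 1 if looks_like_selex($line, $lineno);
    }
    return 0;
}

# Made-up sample data: one file with annotation lines, one without.
my @annotated = ('#=ID example', '#=AC XX00000', 'seq1  ACGU..ACGU');
my @bare      = ('seq2    ..GGGAAAGG.GA', 'seq3   UUU..AAAUUU.A');

printf "annotated: %d\n", guess_is_selex(@annotated);   # annotated: 1
printf "bare:      %d\n", guess_is_selex(@bare);        # bare:      0
```

The bare alignment never matches, because none of its lines starts with #=ID, #=AC or #=SQ.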

At the same time, the Selex reader does not seem to pick up the
alignment id or accession. It contains:

    if( $entry =~ /^\#=GS\s+(\S+)\s+AC\s+(\S+)/ ) {
        $accession{ $1 } = $2;
    }
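
A small sketch shows which lines the quoted pattern accepts. The sample lines are made up; the point is that "#=GS <seqname> AC <accession>" is the per-sequence annotation form used by Stockholm, while alignment-level lines such as #=ID or #=AC do not match it.

```perl
use strict;
use warnings;

# Made-up sample lines for illustration only.
my @entries = (
    '#=GS seq1/1-11 AC X00001',   # per-sequence accession, Stockholm-style
    '#=AC XX00000',               # alignment-level accession line
    '#=ID example',               # alignment-level identifier line
);

my %accession;
for my $entry (@entries) {
    # Same pattern as in the quoted reader fragment.
    if ( $entry =~ /^\#=GS\s+(\S+)\s+AC\s+(\S+)/ ) {
        $accession{$1} = $2;
    }
}

# Only the #=GS line is captured.
print join(', ', map { "$_ => $accession{$_}" } sort keys %accession), "\n";
# prints "seq1/1-11 => X00001"
```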

Also a Selex file like:
seq2    ..GGGAAAGG.GA
seq3   UUU..AAAUUU.A

is guessed to be phylip (whereas the same data with seq1/1-11 style ids
is guessed as pfam).

I am not sure whether the above is the desired behaviour, though all
sequences are read into the alignment object correctly. I was wondering whether
all Selex variations could be guessed as Selex, not as phylip, pfam or
selex (though in the selex case we can have more alignments in one

