[Bioperl-l] IUPAC support for DNA alignment

Alexie Papanicolaou apapanicolaou at ice.mpg.de
Fri Jun 27 06:02:08 EDT 2008


I'm the user who asked for it. I don't know of any conventions but 
perhaps people can help on this?

I'm not an expert at all but here is my opinion:
If you don't know the codon position (or even whether the sequence is 
coding) then you can't estimate the codon degeneracy. If you don't know 
the frequency of the bases represented in the degenerate site then you 
can't model it at the DNA level either. So any solution will be ad hoc.

Regarding 2-base degenerate positions: say the alignment is between a 
population that is polymorphic at that site and one that is not, and the 
user is interested in the distance between the populations. In that 
situation it would make sense to give the position the full match score.

Regarding 3 bases: I don't really know (see N below) but I'd go for a 
full match again, assuming the user built the consensus.

Regarding N:
I think this is more likely to be missing data. I doubt you can have a 
SNP occurring four times in the same position (three times are expected 
under infinite sites, too for that matter). Alternatively, the consensus 
is derived from very diverged sequences. I wouldn't score N therefore.

Regarding X:
One shouldn't find that in a DNA alignment unless it is a mask. I'd 
expect no score there as well.

My /practical/ suggestion would be to let the user define it, as you 
allow for the other options, perhaps even allowing 2-fold and 3-fold 
degenerate IUPAC codes to be given different scores. That might save you 
(the owner) some future work when a user wants it...
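
To make the suggestion concrete, here is a minimal sketch in Python (not 
dpAlign code, and not an established convention) of a pairwise score with 
user-definable values. The function name iupac_score and its parameters 
two_fold, three_fold, n_score and x_score are hypothetical names made up 
for this illustration. It assumes that two symbols "match" when the sets 
of bases they stand for overlap, that U is treated as T, and that N and X 
score zero by default, as proposed above.

```python
# IUPAC nucleotide codes mapped to the bases each one can stand for.
# U is treated as T (an assumption of this sketch).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_score(a, b, match=3, mismatch=-1,
                two_fold=None, three_fold=None,
                n_score=0, x_score=0):
    """Score one aligned pair of nucleotide symbols (hypothetical helper).

    two_fold / three_fold default to the full match score when left as None.
    Gap characters and unknown symbols are not handled and raise KeyError.
    """
    a, b = a.upper(), b.upper()
    # X (mask) and N (treated as missing data) get their own fixed scores.
    if "X" in (a, b):
        return x_score
    if "N" in (a, b):
        return n_score
    bases_a, bases_b = set(IUPAC[a]), set(IUPAC[b])
    # No shared base: an ordinary mismatch.
    if not (bases_a & bases_b):
        return mismatch
    # Compatible symbols: pick the score by the more degenerate of the two.
    degeneracy = max(len(bases_a), len(bases_b))
    if degeneracy == 3:
        return match if three_fold is None else three_fold
    if degeneracy == 2:
        return match if two_fold is None else two_fold
    return match


if __name__ == "__main__":
    # The pairs from the original question, with match=+3 and mismatch=-1.
    for pair in [("T", "U"), ("A", "W"), ("A", "D"), ("A", "N"), ("A", "X")]:
        print(pair, iupac_score(*pair))
```

With the defaults this gives the full match score to T/U, A/W and A/D and 
zero to A/N and A/X. Passing, for example, two_fold=2 or three_fold=1 
would give the degenerate codes their own scores.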

many thanks to anyone who can help,

PS. Yee Man had cleverly suggested a workaround: one can use the Protein 
Matrix to create a scoring matrix. It might require some caution, such as 
remembering to reset the alphabet, though?

Yee Man Chan wrote:
> Hi all
> 	I am the owner of Bio::Tools::dpAlign. A user emailed me to add
> support for IUPAC nucleotide codes. I am ok to add this feature but I
> would like to know what are the conventions to handle these IUPAC codes.
> 	Suppose match is +3 and mismatch is -1. Then what should be the
> score when T matches with U, A with W, A with D, A with N and A with X?
> Does anyone know the conventions?
> Thanks a lot.
> Yee Man

"You can't find a hermit to teach you herming, because of course that rather spoils the whole thing."

    -- (Terry Pratchett, Small Gods)

Alexie Papanicolaou
Department of Entomology,
Max Planck Institute for Chemical Ecology,
Hans-Knoell-Strasse 8,
D-07745 Jena, Germany.
