[Bioperl-l] IUPAC support for DNA alignment

Hilmar Lapp hlapp at gmx.net
Fri Jun 27 13:33:46 EDT 2008

So instead of the user choosing a special matrix you would like to  
have a simple argument (that would probably under the hood do exactly  

BTW Scenarios #1 and #3 sound more or less the same to me (i.e., you  
believe the degenerate code to reflect site polymorphism, not sequence  


On Jun 27, 2008, at 10:13 AM, Alexie Papanicolaou wrote:

> Hello
> I guess I didn't give enough info... (also sorry Yee Man, I forgot to  
> CC you before)
> Scenario 1 - polymorphic allele vs non-polymorphic one. e.g.
> Let [A/G] be a SNP with two alleles in population A, while the one  
> fixed allele [G] is in population B.
> In this scenario we want to calculate the distance at one locus  
> between two populations, thus a degenerate site is not the result of  
> uncertainty but of reality. Obviously the best method is to provide  
> a matrix (if the user can be bothered) but Yee Man already allows  
> this option. Personally, I wouldn't really use an alignment score  
> to measure distance though... The application here is: we first want  
> to align those two sequences and there should be no penalty because  
> there is a SNP in one population (then estimate distance with  
> another algorithm).
> Scenario 2 - uncertainty
> If the scenario is that [A/G] is the result of uncertainty then I  
> gladly agree with you! I'm also perplexed how to score IUPAC codes  
> allowing for three nucleotides (i.e. there might not be a SNP after  
> all... but then again infinite sites doesn't have to hold - in  
> some species less than others...)
> Scenario 3 - a type of profile alignment to a consensus
> In my particular case, I'm doing something different: I have the  
> consensus of an alignment of multiple sequences (dozens to hundreds  
> depending on dataset) with some mismatches including a SNP, say  
> [A/G]. A third sequence that I wish to align has A in that position. So  
> obviously, it shouldn't be penalized.
> So it really depends on application and the user should be able to  
> decide in the end...  (Yee Man already provides the option for a  
> protein substitution matrix). It would be nice if we had the option  
> of specifying it much more easily though (a simple switch) so I can  
> use it for scenario 3.
> a
> ps. sorry, my English is going down the drain...
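
The "simple switch" asked for in scenario 3 could be sketched roughly as follows. This is only an illustration in Python, not BioPerl code and not how Yee Man's module works: the function name compatible_score and the match/mismatch values are hypothetical. The idea is that two IUPAC codes score as a full match whenever the sets of bases they stand for overlap.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible_score(x, y, match=1, mismatch=-1):
    """Hypothetical scoring rule: full match if the two codes share a base.

    The match/mismatch values are placeholders, not taken from any
    existing substitution matrix.
    """
    shared = set(IUPAC[x.upper()]) & set(IUPAC[y.upper()])
    return match if shared else mismatch


if __name__ == "__main__":
    # Consensus has R ([A/G]); the third sequence has A at that position.
    print(compatible_score("R", "A"))  # not penalized under this rule
    print(compatible_score("R", "C"))  # penalized: C is not in [A/G]
```

Under such a rule N would match everything, which is probably only sensible for the consensus case of scenario 3 and not for the uncertainty case of scenario 2.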
> Hilmar Lapp wrote:
>> Hi Alexie,
>> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>>> Hello
>>> I'm the user who asked for it. I don't know of any conventions but  
>>> perhaps people can help on this?
>>> I'm not an expert at all but here is my opinion:
>>> If you don't know the codon position (or even if it is coding)  
>>> then you can't estimate the codon degeneracy. If you don't know  
>>> the frequency of the bases represented in the degenerate site  
>>> then you can't model it either on the DNA level. So any solution  
>>> will be ad-hoc.
>>> Regarding 2 base degenerate positions: My suggestion is that in a  
>>> situation of alignment between, say a polymorphic and non  
>>> polymorphic population for that site, and the user is interested  
>>> in the distance between the populations, it would make sense to  
>>> have it score as a full match.
>>> Regarding 3 bases: I don't really know (see N below) but I'd go  
>>> for a full match again, assuming the user built the consensus.
>> Are you suggesting that a determined and a degenerate site aligned  
>> pairwise should score as much as two determined sites?
>> My (possibly naive) default would be to average over all  
>> possibilities, each weighted by base frequency (if base frequencies  
>> are assumed unequal or independent), thus integrating out the  
>> uncertainty. (For standard matrices, I think this would also result  
>> in N receiving zero score.)
>> In the end though, maybe there should be an option for a user to  
>> just provide a substitution matrix?
>>    -hilmar
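
The averaging default described above could be sketched like this, again as a Python illustration and not an existing BioPerl API. The function name expected_score, the uniform default base frequencies, and the match/mismatch values are all assumptions. Each degenerate code is expanded into its possible bases, weighted by base frequency renormalized within the code, and the pair score is the expectation over both sites, treating the two sites as independent.

```python
# Standard IUPAC nucleotide ambiguity codes and the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expected_score(x, y, freqs=None, match=1.0, mismatch=-1.0):
    """Hypothetical expected score of aligning IUPAC codes x and y.

    freqs maps each of A, C, G, T to a background frequency; equal
    frequencies are assumed when it is omitted.
    """
    if freqs is None:
        freqs = {b: 0.25 for b in "ACGT"}

    def distribution(code):
        # Probability of each base allowed by the code, renormalized
        # so the allowed bases sum to 1.
        bases = IUPAC[code.upper()]
        total = sum(freqs[b] for b in bases)
        return {b: freqs[b] / total for b in bases}

    px, py = distribution(x), distribution(y)
    # Expectation over all base pairs, assuming the sites are independent.
    return sum(
        px[a] * py[b] * (match if a == b else mismatch)
        for a in px
        for b in py
    )


if __name__ == "__main__":
    print(expected_score("R", "A"))
    print(expected_score("N", "A", match=3.0, mismatch=-1.0))
```

Whether N ends up with a zero score depends on the match and mismatch values chosen: with equal base frequencies it is zero for match 3 and mismatch -1, but negative for match 1 and mismatch -1.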
> -- 
> --
> "Eppur si evolve" ("And yet it evolves")
> -Galileo Jr (ca 21st century)
> --
> Alexie Papanicolaou
> Entomology
> Max Planck Institute for Chemical Ecology
> Hans Knoell Str 8
> Jena 07745
> Germany
> Email apapanicolaou at ice.mpg.de
> Tel +493641571561

: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
