[Bioperl-l] IUPAC support for DNA alignment

Alexie Papanicolaou apapanicolaou at ice.mpg.de
Fri Jun 27 13:40:34 EDT 2008

Yes, I don't want to use a special scoring for what I'm doing now. The 
option would allow a C or A aligned with M to receive the same score as 
specified in -match. I guess it would be quicker if I just made my own 
matrix, but there is a TODO line on IUPAC codes so I thought I'd push it a bit.
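
To make the request concrete, here is a minimal sketch in Python (not BioPerl) of what such a switch might do under the hood. The function name iupac_score and its match/mismatch parameters are hypothetical and only mirror the idea of -match; this is not the module's actual code. Two codes get the full match score whenever the base sets they stand for overlap.

```python
# Standard IUPAC nucleotide codes mapped to the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_score(a, b, match=1, mismatch=-1):
    """Hypothetical scoring: full match score if the two codes share any base."""
    shared = set(IUPAC[a.upper()]) & set(IUPAC[b.upper()])
    return match if shared else mismatch


# M stands for A or C, so both score as a match; G does not.
print(iupac_score("M", "C"), iupac_score("M", "A"), iupac_score("M", "G"))
```

Filling a substitution matrix with these values for all pairs of codes would give the same result as a dedicated switch.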

Yes, from a computational point of view Scenarios 1 and 3 are the same.
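
For comparison, the averaging default that Hilmar describes in the quoted text below (more suited to Scenario 2) could be sketched as follows, again in Python. The name averaged_score, the +1/-1 scores and the equal base frequencies are all assumptions for illustration, not anything taken from BioPerl.

```python
from itertools import product

# Standard IUPAC nucleotide codes mapped to the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def averaged_score(a, b, match=1.0, mismatch=-1.0, freqs=None):
    """Expected score over all base pairs the two codes could stand for."""
    # Assumption: equal base frequencies unless the caller supplies others.
    freqs = freqs or {base: 0.25 for base in "ACGT"}
    bases_a, bases_b = IUPAC[a.upper()], IUPAC[b.upper()]
    # Renormalise the frequencies within each code's set of possible bases.
    total_a = sum(freqs[x] for x in bases_a)
    total_b = sum(freqs[y] for y in bases_b)
    expected = 0.0
    for x, y in product(bases_a, bases_b):
        weight = (freqs[x] / total_a) * (freqs[y] / total_b)
        expected += weight * (match if x == y else mismatch)
    return expected


# R (A or G) against G: half the time a match, half the time a mismatch.
print(averaged_score("R", "G"))
```

With these assumed scores, R against G averages to 0.0, whereas the full-match option would give it the same score as G against G.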


Hilmar Lapp wrote:
> So instead of the user choosing a special matrix you would like to 
> have a simple argument (that would probably under the hood do exactly 
> that)?
> BTW Scenarios #1 and #3 sound more or less the same to me (i.e., you 
> believe the degenerate code to reflect site polymorphism, not sequence 
> uncertainty).
>     -hilmar
> On Jun 27, 2008, at 10:13 AM, Alexie Papanicolaou wrote:
>> Hello
>> I guess I didn't give enough info... (also sorry Yee Man, I forgot to 
>> CC you before)
>> Scenario 1 - polymorphic allele vs non-polymorphic one. e.g.
>> Let [A/G] be a SNP with two alleles in population A, while the one fixed 
>> allele [G] is in population B.
>> In this scenario we want to calculate the distance at one locus 
>> between two populations, thus a degenerate site is not the result of 
>> uncertainty but of reality. Obviously the best method is to provide 
>> a matrix (if the user can be bothered) but Yee Man already allows 
>> this option. Personally, I wouldn't really use an alignment score 
>> to measure distance though... The application here is: we first want 
>> to align those two sequences and there should be no penalty because 
>> there is a SNP in one population (then estimate distance with another 
>> algorithm).
>> Scenario 2 - uncertainty
>> If the scenario is that [A/G] is the result of uncertainty then I 
>> gladly agree with you! I'm also perplexed how to score IUPAC codes 
>> allowing for three nucleotides (i.e. there might not be a SNP after 
>> all... but then again infinite sites doesn't have to hold - in some 
>> species less than others...)
>> Scenario 3 - a type of profile alignment to a consensus
>> In my particular case, I'm doing something different: I have the 
>> consensus of an alignment of multiple sequences (dozens to hundreds 
>> depending on dataset) with some mismatches including a SNP say [A/G]. 
>> A third sequence that I wish to align has A in that position. So 
>> obviously, it shouldn't be penalized.
>> So it really depends on application and the user should be able to 
>> decide in the end...  (Yee Man already provides the option for a 
>> protein substitution matrix). It would be nice if we had the option 
>> of specifying it much more easily, though (a simple switch), so I can 
>> use it for scenario 3.
>> a
>> ps. sorry, my English is going down the drain...
>> Hilmar Lapp wrote:
>>> Hi Alexie,
>>> On Jun 27, 2008, at 6:02 AM, Alexie Papanicolaou wrote:
>>>> Hello
>>>> I'm the user who asked for it. I don't know of any conventions but 
>>>> perhaps people can help on this?
>>>> I'm not an expert at all but here is my opinion:
>>>> If you don't know the codon position (or even if it is coding) then 
>>>> you can't estimate the codon degeneracy. If you don't know the 
>>>> frequency of the bases represented in the degenerate site then 
>>>> you can't model it either on the DNA level. So any solution will be 
>>>> ad-hoc.
>>>> Regarding 2 base degenerate positions: My suggestion is that in a 
>>>> situation of alignment between, say a polymorphic and non 
>>>> polymorphic population for that site, and the user is interested in 
>>>> the distance between the populations, it would make sense to have 
>>>> the score equal the full match.
>>>> Regarding 3 bases: I don't really know (see N below) but I'd go 
>>>> for a full match again, assuming the user built the consensus.
>>> are you suggesting that a determined and a degenerate site aligned 
>>> pairwise should score as much as two determined sites?
>>> My (possibly naive) default would be to average over all 
>>> possibilities, each weighted by base frequency (if base frequencies 
>>> are assumed unequal or independent), thus integrating out the 
>>> uncertainty. (For standard matrices, I think this would also result 
>>> in N receiving zero score.)
>>> In the end though, maybe there should be an option for a user to 
>>> just provide a substitution matrix?
>>>    -hilmar
>> -- 
>> "Eppur si evolve" ("And yet it evolves")
>> -Galileo Jr (ca 21st century)
>> -- 
>> Alexie Papanicolaou
>> Entomology
>> Max Planck Institute for Chemical Ecology
>> Hans Knoell Str 8
>> Jena 07745
>> Germany
>> Email apapanicolaou at ice.mpg.de
>> Tel +493641571561

"Eppur si evolve" ("And yet it evolves")
-Galileo Jr (ca 21st century)

Alexie Papanicolaou
Max Planck Institute for Chemical Ecology
Hans Knoell Str 8
Jena 07745
Email apapanicolaou at ice.mpg.de
Tel +493641571561
