[Bioperl-l] Bug in Contig.pm? How to compare two sequence objects?

Dan Bolser dan.bolser at gmail.com
Tue Jul 27 05:02:05 EDT 2010

On 26 July 2010 18:54, Robson de Souza <robfsouza at gmail.com> wrote:
> On Mon, Jul 26, 2010 at 12:37 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>> Thanks again Robson,
>> One thing I was going to email the list about is a lack of tests for
>> ace.pm and phrap.pm ... because of missing libs I'm not running the
>> sam or the maq tests, so any changes that I'm making to Contig.pm /
>> Scaffold.pm are not really being tested.
>> Do you think anyone on the list (or you) would be able to generate a
>> few hundred tests to help me start modifying code with confidence?
> I hope so. I was reading Chris's presentation at BOSC and he mentioned
> that people are working on the support for the samtools so I believe
> somebody must have a dataset they could share.

Bio::Assembly::IO::sam.pm is tested, but I don't run the tests because
I don't have samtools installed. Also, (quick impression) method
coverage is limited (Contig.pm has a lot of methods!).

I'll try to write some tests for ace.pm and phrap.pm.

>> One idea was to actually make Contig.pm implement SeqFeatureI, so that
>> you could call start and end and seq on the contig (consensus)
>> directly.
>> Here are some notes from IRC...
>> 04:00 < dbolser> I think "$contig_object->start" is totally reasonable
>> 04:01 < dbolser> otherwise its something like
>> "$contig_object->get_feature_collection->get_feature_by_id("_main_contig_feature:".$contig_object->id)->start"
>> 04:02 < dbolser> (untested ;-)
>> 09:01 < genehack> dbolser: i think you've got an argument that there should be;
>>                  it's probably going to be a lot easier to do that in
>>                  something like Biome (i.e., BioPerl in Moose)
> Never thought about it... but I can see that could be useful when
> contigs are part of higher level assemblies, like scaffolds or
> assembled chromosomes, but you could also achieve this by making
> Contig.pm a Bio::RangeI or, perhaps better, a Bio::LocatableSeq. Any
> reason to prefer the Bio::SeqFeatureI interface?

No. I don't have an overview of how the various objects relate to one
another, so any choice is more or less arbitrary. I'll try to write a
'Sequence HOWTO'.

Thanks again for help,

>> Any reason not to cc this to the list?
> Lack of attention. I was so concerned about the answer that I didn't
> notice I pressed "Reply to" :)
> Best,
> Robson
>> Thanks again.
>> All the best,
>> Dan.
>> On 26 July 2010 17:23, Robson de Souza <robfsouza at gmail.com> wrote:
>>> Hi,
>>> On Sun, Jul 25, 2010 at 12:35 PM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>> Cheers for the clarification Robson.
>>> You are welcome :). Thanks for improving my long abandoned child... :)
>>>> How come the 'SYNOPSIS' code does produce warnings about replacing the
>>>> seq? (the workaround is easy enough, don't add_seq, but still...)
>>> I don't know. I remember testing it while I develop Contig.pm in
>>> parallel to ace.pm and phrap.pm and not seeing unexpected warnings...
>>> Did you try ace.pm recently? It uses set_seq_coord and this method
>>> calls add_seq, while phrap.pm uses add_seq directly, but I'm unaware
>>> of previous bug reports on ace.pm regarding unexpected warnings...
>>> note that set_seq_coord calls remove_seq before adding but prints
>>> another warning. Compare set_seq_coord to add_seq alone and maybe you
>>> will figure out what is going on...
>>> Regarding why you can't just add Bio::LocatableSeq objects and have
>>> add_seq just set the coordinates by itself, I investigated my long
>>> forgotten code and just realized that the whole problem was that the
>>> coordinates of a Bio::LocatableSeq object represent the location of
>>> the object in the original sequence (i.e. unaligned read coordinates)
>>> instead of its coordinates in the aligned consensus sequence.
>>> Bio::LocatableSeq issues a warning whenever you try to set the object
>>> start and end coordinates to the gapped consensus coordinates because
>>> such coordinates lead to a length shorter than the actual read length:
>>> use Bio::LocatableSeq;
>>> use Bio::SeqFeature::Collection;
>>>  my $ls = Bio::LocatableSeq->new(-start=> 100500, -end => 100503, -seq
>>> =>"agggcACT-Gccc", -id=>"uga");
>>> my $sfc = Bio::SeqFeature::Collection->new();
>>> $sfc->add_features([ $ls ]);
>>> print $sfc->feature_count,"\t",$ls->length."\n";
>>> Additionally, Bio::LocatableSeq objects do not support the
>>> primary_tag() method because they are just Bio::RangeI and not
>>> Bio::SeqFeatureI objects. These two issues prevented me from using
>>> Bio::LocatableSeq coordinates to represent gapped consensus
>>> coordinates but the example code above suggests suppressing
>>> Bio::LocatableSeq warnings and using the object class
>>> ($feat->isa(Bio::LocatableSeq"")) instead of a regular expression on
>>> the primary tag ("_unaligned_coord:$readID") would be enough to avoid
>>> the need for extra Bio::SeqFeature::Generic objects to store read
>>> coordinates.
>>>> Since you said 'feel free to act', I have been faffing around here:
>>>> http://github.com/dbolser/bioperl-live
>>>> I'm not really sure if that is so useful, please advise.
>>> The comments I made above points to one area where you could have some
>>> improvements in both memory usage and usability: drop the need for
>>> additional Bio::SeqFeatureI objects to represent gapped consensus
>>> coordinates. It would require you to check all Contig.pm methods for
>>> the use of the "_aligned_coord:$readID" tags (I don't expect to see
>>> any references to this outside Contig.pm) and change set_seq_coord to
>>> a method that changes/sets the start and end values of the
>>> Bio::LocatableSeq objects.
>>> This might be a bit of work so just go for it if you want and have
>>> time to experiment.
>>> Another detail: could you please replace references in Contig.pm and
>>> other Bio::Assembly classes POD to Bio::Assembly::* classes that do
>>> not exists to the correct Bio::* classes? Somebody changed this to the
>>> wrong names long ago...
>>>> Thanks again for help,
>>> Well these are my thoughts on the matter. Write again if you decide to
>>> take this on and are unable to make progress.
>>> Best,
>>> Robson
>>>> Dan.
>>>> P.S.
>>>> How come you are not in irc://irc.freenode.net/#bioperl ;-)
>>>> On 25 July 2010 15:42, Robson de Souza <robfsouza at gmail.com> wrote:
>>>>> Hi Dan,
>>>>> It is been a long time since I last loooked at this but, if I remember
>>>>> correctly, the point is that Bio::
>>>>> On Sun, Jul 25, 2010 at 9:23 AM, Dan Bolser <dan.bolser at gmail.com> wrote:
>>>>>> The following bug report boils down to this question:
>>>>>> How should two sequence objects be compared for identity? Does the
>>>>>> object override 'eq' or implement an 'identical' method?
>>>>> I think an 'identical' or 'equal' method would be the best alternative
>>>>> since having a full method call would allow passing arguments like
>>>>> '-mode => "complete"' to check all sequence features and annotations
>>>>> if they exist and '-mode => "basic"' to check id() and seq() values.
>>>>> Bio::Assembly::Contig depends mostly on the last one, although only
>>>>> id() is tracked most of the time (because of the internal hashes).
>>>>>> I found the following apparent bug in Contig.pm while executing the
>>>>>> documented 'SYNOPSIS' code:
>>>>> [snip]
>>>>>> It seems to be a bug in the documented behaviour of set_seq_coord:
>>>>>>        "If the sequence was previously added using add_seq, its
>>>>>> coordinates are changed/set.  Otherwise, add_seq is called and the
>>>>>> sequence is added to the contig."
>>>>> In fact, it should not print warnings all the time....
>>>>>> The offending line in that function seems to be:
>>>>>>  if( ... &&
>>>>>>      ($seq ne $self->{'_elem'}{$seqID}{'_seq'}) ) {
>>>>>>          ... <spew warnings>
>>>>>>  }
>>>>>>  $self->add_seq($seq);
>>>>>> which compares the *passed* sequence object to the sequence string for
>>>>>> the *stored* sequence object of the same name. This comparison is
>>>>>> always fails if I understood correctly, therefore set_seq_coord always
>>>>>> spews warnings if called after add_seq.
>>>>> Not the sequence string, but the objects themselves, i.e. the string
>>>>> perl uses to represent Bio::LocatableSeq objects... it is a memory
>>>>> based version of identical() :)
>>>>>> Out of curiosity, how come I can't just say:
>>>>>> my $ls = Bio::LocatableSeq->
>>>>>>  new( -seq      => 'ACCG-T',
>>>>>>       -id       => 'r1',
>>>>>>       -alphabet => 'dna'
>>>>>>       -start    => 3,
>>>>>>       -end      => 8,
>>>>>>       -strand   => 1
>>>>>>     );
>>>>>> $c->add_seq( $ls );
>>>>> Oh, I don't remember but it was either a bad design decision I made 8
>>>>> years ago to acommodate the Bio::Align::AlignI interface or a problem
>>>>> with Bio::SeqFeature::Collection at that time. Whatever the case, it
>>>>> would be nice to change it... you just need to create a
>>>>> Bio::SeqFeature::Generic when
>>>>> add_seq is called. I just won't have time to do it myself so feel free to act...
>>>>> Best,
>>>>> Robson
>>>>>> I hope the above report can be of some use.
>>>>>> Sincerely,
>>>>>> Dan.
>>>>>> _______________________________________________
>>>>>> Bioperl-l mailing list
>>>>>> Bioperl-l at lists.open-bio.org
>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l

More information about the Bioperl-l mailing list