[Bioperl-guts-l] [Bug 2648] New: The results of "Bio::Assembly::Scaffold->get_all_seq_ids" does NOT behave as documented

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Mon Nov 10 10:53:52 EST 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2648

           Summary: The results of "Bio::Assembly::Scaffold-
                    >get_all_seq_ids" does NOT behave as documented
           Product: BioPerl
           Version: main-trunk
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: major
          Priority: P2
         Component: Core Components
        AssignedTo: bioperl-guts-l at bioperl.org
        ReportedBy: dan.bolser at gmail.com


The results of "Bio::Assembly::Scaffold->get_all_seq_ids" does NOT behave as
documented, specifically, 

 get_all_seq_ids
     Function: Get the ID of all sequences making up the scaffold
               (sequences from contigs and singlets, not consensus).
     Returns : array of strings


However, the method returns the per-contig consensus sequence IDs. In Phrap
that's 'Contig1', 'Contig2', etc.

Actually, of the 7 contigs in my ace file its only doing this for Contig1 and
Contig2, which is a bit unexpected. Let me know and I'll provide the specific
ace file for debugging.

Either there is something wrong in my sequence processing pipeline or the new
phrap.ace format, or I found a decent bug. The relevant part of my code looks
like this:

use Bio::Assembly::IO;

my $parser =
  new Bio::Assembly::IO( -file   => 'my.phrap.ace',
                         -format =>          'ace');

my $ass =
  $parser->next_assembly();

my @reads = $ass->get_all_seq_ids;

for my $read (@reads){
  print $ass->get_seq_by_id($read)->length(), "\n";

  unless ($read =~ /^(\d{3}[A-J]\d{2}X\d{5}).([bg]).abi$/){
    warn "is this a valid read : $read \n";
    print $ass->get_seq_by_id($read)->length(), "\n";
    next;
  }
  ...
}

my @contigs = $ass->all_contigs();

foreach my $contig (@contigs){
  print
    join("\t",
         $ass->get_nof_contig_seqs(),
         $ass->get_nof_contigs(),
         $ass->get_nof_singlets(),
         ##
         $contig->id(),
         $contig->get_consensus_length(),
         $contig->no_sequences(),
         $contig->no_residues(),
        ), "\n";
}


And the (surprising) output ...

is this a valid read : Contig1
1129
is this a valid read : Contig2
1136

1010    7       2       Contig9 46816   415     525403   11
1010    7       2       Contig8 45107   348     438352    9
1010    7       2       Contig7 33950   233     290728    8
1010    7       2       Contig5 2430    3       3912      1
1010    7       2       Contig4 404     2       2444      6
1010    7       2       Contig3 271     2       2815     10
1010    7       2       Contig6 167     7       7965     47


Ahhh... So I just worked it out... the two singlet reads/contigs are being
confused, and are appearing with read IDs as their (presumably) corresponding
Contig IDs.

i.e. Read 342K45X45984 is a singlet, and is therefore also a contig. When
calling get_all_seq_ids, the ID for this read is incorrectly being substituted
with the contig ID for the contig that constitutes the read... (Contig1).
Hopefully that description makes sense without identifying the exact point in
the source code where this goes wrong...


-- 
Configure bugmail: http://bugzilla.open-bio.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.


More information about the Bioperl-guts-l mailing list