[Bioperl-guts-l] [Bug 1378] New: Search IO fails to correctly parse GCG Blast files

bugzilla-daemon at cvs.open-bio.org
Thu Feb 6 09:04:40 EST 2003


http://bugzilla.bioperl.org/show_bug.cgi?id=1378

           Summary: Search IO fails to correctly parse GCG Blast files
           Product: Bioperl
           Version: 1.2 branch
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Bio::Search/Bio::SearchIO
        AssignedTo: bioperl-guts-l at bioperl.org
        ReportedBy: simon.andrews at bbsrc.ac.uk


The significant alignment list is formatted differently in NCBI Blast and 
GCG Blast, so SearchIO::blast.pm parses the Score and E-value information 
from GCG Blast files incorrectly.

NCBI Blast writes the alignment list with one entry per line, whereas in 
GCG Blast each entry takes two lines.  Since the list is parsed line by line, 
and the first line of each GCG entry carries no score, spurious values are 
collected from it.

A sample GCG alignment list looks like this:


                                                         Score    E
 Sequences producing significant alignments:             (bits)  Value ..

EM_RO:RNU94856  Begin: 713 End: 739 Strand:- 
!U94856 Rattus norvegicus paraoxonase mRNA, partia...        46  3e-04
EM_RO:AF162756  Begin: 3277 End: 3300 Strand:- 
!Af162756 Rattus norvegicus cenexin 2 mRNA, comple...        40  0.017
EM_RO:RRGFIL1  Begin: 2280 End: 2301 Strand:- 
!M15647 Rat insulin-like growth factor I gene, exon...       36  0.26
EM_RO:AC133000  Begin: 142704 End: 142720 
!Ac133000 Rattus Norvegicus Strain CH230-62B6 BAC,...        34  1.0
EM_RO:U95178  Begin: 1299 End: 1314 Strand:- 
!U95178 Rattus norvegicus DOC-2 p59 isoform mRNA, co...      32  4.1
EM_RO:U95177  Begin: 1953 End: 1968 Strand:- 
!U95177 Rattus norvegicus DOC-2 p82 isoform mRNA, co...      32  4.1
EM_RO:RNRS21  Begin: 1515 End: 1530 
!M13922 Rat repetitive sequence homologous to 3' end...      32  4.1
EM_RO:MMHCR  Begin: 1654 End: 1669 
!X13234 Woodchuck hcr gene. 5/1992                           32  4.1
EM_RO:D86383  Begin: 900 End: 915 Strand:- 
!D86383 Rattus norvegicus Hex mRNA, complete cds. 6/...      32  4.1
EM_RO:AF277901  Begin: 441 End: 456 Strand:- 
!Af277901 Rattus norvegicus zinc finger protein HI...        32  4.1
EM_RO:AF192505  Begin: 454 End: 469 Strand:- 
!Af192505 Mus spicilegus clone PTF5 LINE-1 repetit...        32  4.1
EM_RO:AF192504  Begin: 453 End: 468 Strand:- 
!Af192504 Mus spicilegus clone PTF2 LINE-1 repetit...        32  4.1
EM_RO:AF045659  Begin: 48 End: 63 Strand:- 
!Af045659 Rattus norvegicus mitogen-responsive pho...        32  4.1
\\End of List
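
To make the failure concrete, here is a minimal standalone sketch (not the 
actual blast.pm code) that applies the same "split and take the last two 
fields" idea to the two lines of the first entry above.  The variable names 
are only for illustration.

```perl
use strict;
use warnings;

# The two lines that make up one GCG entry, copied from the list above.
my @gcg_lines = (
    'EM_RO:RNU94856  Begin: 713 End: 739 Strand:- ',
    '!U94856 Rattus norvegicus paraoxonase mRNA, partia...        46  3e-04',
);

my @hit_signifs;
for my $line (@gcg_lines) {
    my @fields = split ' ', $line;
    # Treat the last two whitespace-separated fields as E-value and score,
    # whatever they happen to contain.
    push @hit_signifs, [ pop @fields, pop @fields ];
}

for my $pair (@hit_signifs) {
    print "evalue=$pair->[0] score=$pair->[1]\n";
}
# prints:
#   evalue=Strand:- score=739
#   evalue=3e-04 score=46
```

The first line of the entry yields a bogus pair, so the list ends up with 
twice as many entries as there are hits, half of them meaningless.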


A simple fix is to alter how blast.pm collects the Score and E-value from 
each line of this list.  The relevant code is around line 293 in blast.pm:

##########################################################
elsif( /Sequences producing significant alignments:/ ) {
	   # skip the next whitespace line
	   $_ = $self->_readline();
	   while( defined ($_ = $self->_readline() ) && 
		  ! /^\s+$/ ) {	       
	       my @line = split;
	       push @hit_signifs, [ pop @line, pop @line];
	   }
##########################################################

The split (which fails for GCG files) can be replaced with a regex.  The 
regex ignores lines that do not end in score information, but still collects 
the Score and E-value from the lines that do.

########################################################
elsif( /Sequences producing significant alignments:/ ) {
	   # skip the next whitespace line
	   $_ = $self->_readline();
	   while( defined ($_ = $self->_readline() ) && 
		  ! /^\s+$/ ) {	       
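
	     # Match a trailing integer bit score followed by the E-value.
	     # The '-' goes last in the class so it is a literal hyphen,
	     # not a range.
  	     if (/(\d+)\s+([\d.eE+-]+)\s*$/){
	       push @hit_signifs , [$2,$1];
	     }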
	   }
########################################################
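
As a rough check of the regex approach outside of blast.pm, the following 
standalone sketch runs it over a few lines.  The helper name 
parse_signif_line is hypothetical, and the last input line is a made-up 
one-line-per-entry example (the GCG description line with the leading '!' 
removed), not real NCBI output.  The character class is written as 
[\d.eE+-] so that the hyphen is literal rather than forming a range.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SearchIO: returns [ E-value, score ]
# for a line ending in "<integer score>  <E-value>", or undef otherwise.
sub parse_signif_line {
    my ($line) = @_;
    if ( $line =~ /(\d+)\s+([\d.eE+-]+)\s*$/ ) {
        return [ $2, $1 ];    # same order as the split-based code
    }
    return undef;
}

my @lines = (
    # GCG entry, first line (with and without a Strand field): no score.
    'EM_RO:RNU94856  Begin: 713 End: 739 Strand:- ',
    'EM_RO:AC133000  Begin: 142704 End: 142720 ',
    # GCG entry, second line: carries the score and E-value.
    '!U94856 Rattus norvegicus paraoxonase mRNA, partia...        46  3e-04',
    # Made-up one-line entry, for illustration only.
    'U94856 Rattus norvegicus paraoxonase mRNA, partia...        46  3e-04',
);

my @hit_signifs = grep { defined } map { parse_signif_line($_) } @lines;

for my $pair (@hit_signifs) {
    print "evalue=$pair->[0] score=$pair->[1]\n";
}
```

Only the two lines that end in a score and E-value produce an entry; the 
"Begin: ... End: ..." lines are skipped because the number at the end is 
preceded by "End:" rather than by another number.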

I've only checked this against blastn files so far, so additional fixes may be 
required for other variants.  From the limited testing I've done, this change 
doesn't seem to break the parsing of NCBI format files.


