[Bioperl-guts-l]
[Bug 1378] New: Search IO fails to correctly parse GCG Blast files
bugzilla-daemon at cvs.open-bio.org
Thu Feb 6 09:04:40 EST 2003
http://bugzilla.bioperl.org/show_bug.cgi?id=1378
Summary: Search IO fails to correctly parse GCG Blast files
Product: Bioperl
Version: 1.2 branch
Platform: All
OS/Version: All
Status: NEW
Severity: normal
Priority: P2
Component: Bio::Search/Bio::SearchIO
AssignedTo: bioperl-guts-l at bioperl.org
ReportedBy: simon.andrews at bbsrc.ac.uk
A difference in the format of the significant alignment list between NCBI BLAST
and GCG BLAST means that Bio::SearchIO::blast (blast.pm) incorrectly parses the
Score and E-value information from GCG BLAST files.
The difference is that NCBI BLAST writes the alignment list with one entry per
line, whereas in GCG BLAST each entry takes two lines. Since the list is parsed
line by line, spurious information is collected from the first line of each
GCG entry.
A sample GCG alignment list looks like this:
Score E
Sequences producing significant alignments: (bits) Value ..
EM_RO:RNU94856 Begin: 713 End: 739 Strand:-
!U94856 Rattus norvegicus paraoxonase mRNA, partia... 46 3e-04
EM_RO:AF162756 Begin: 3277 End: 3300 Strand:-
!Af162756 Rattus norvegicus cenexin 2 mRNA, comple... 40 0.017
EM_RO:RRGFIL1 Begin: 2280 End: 2301 Strand:-
!M15647 Rat insulin-like growth factor I gene, exon... 36 0.26
EM_RO:AC133000 Begin: 142704 End: 142720
!Ac133000 Rattus Norvegicus Strain CH230-62B6 BAC,... 34 1.0
EM_RO:U95178 Begin: 1299 End: 1314 Strand:-
!U95178 Rattus norvegicus DOC-2 p59 isoform mRNA, co... 32 4.1
EM_RO:U95177 Begin: 1953 End: 1968 Strand:-
!U95177 Rattus norvegicus DOC-2 p82 isoform mRNA, co... 32 4.1
EM_RO:RNRS21 Begin: 1515 End: 1530
!M13922 Rat repetitive sequence homologous to 3' end... 32 4.1
EM_RO:MMHCR Begin: 1654 End: 1669
!X13234 Woodchuck hcr gene. 5/1992 32 4.1
EM_RO:D86383 Begin: 900 End: 915 Strand:-
!D86383 Rattus norvegicus Hex mRNA, complete cds. 6/... 32 4.1
EM_RO:AF277901 Begin: 441 End: 456 Strand:-
!Af277901 Rattus norvegicus zinc finger protein HI... 32 4.1
EM_RO:AF192505 Begin: 454 End: 469 Strand:-
!Af192505 Mus spicilegus clone PTF5 LINE-1 repetit... 32 4.1
EM_RO:AF192504 Begin: 453 End: 468 Strand:-
!Af192504 Mus spicilegus clone PTF2 LINE-1 repetit... 32 4.1
EM_RO:AF045659 Begin: 48 End: 63 Strand:-
!Af045659 Rattus norvegicus mitogen-responsive pho... 32 4.1
\\End of List
A simple fix is to alter how blast.pm collects the Score and E-value from each
line of this list. The relevant code is around line 293 in blast.pm:
##########################################################
elsif( /Sequences producing significant alignments:/ ) {
    # skip the next whitespace line
    $_ = $self->_readline();
    while( defined ($_ = $self->_readline() ) &&
           ! /^\s+$/ ) {
        my @line = split;
        push @hit_signifs, [ pop @line, pop @line];
    }
##########################################################
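To see why the split-based extraction goes wrong on GCG output, here is a small
standalone sketch that applies the same "last two fields" logic to one two-line
GCG entry taken from the sample list. It is not code from blast.pm; the helper
name last_two_fields is hypothetical and exists only for this illustration.

```perl
use strict;
use warnings;

# One GCG entry from the sample list: two lines for a single hit.
my @gcg_entry = (
    "EM_RO:RNU94856 Begin: 713 End: 739 Strand:-\n",
    "!U94856 Rattus norvegicus paraoxonase mRNA, partia... 46 3e-04\n",
);

# Hypothetical helper mirroring the split-based loop: take the last two
# whitespace-separated fields of a line, last field first.
sub last_two_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return [ pop @fields, pop @fields ];
}

# Every line is treated as a hit, so one GCG entry yields two records.
my @hit_signifs = map { last_two_fields($_) } @gcg_entry;

for my $signif (@hit_signifs) {
    printf "evalue=%s score=%s\n", @$signif;
}
```

The first record comes from the "Begin:/End:" line and holds "Strand:-" and
"739" rather than an E-value and a score; only the second record is real.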
Instead of using split to get the information (which fails for GCG files), a
regex can be used. It ignores lines that lack the score information at the end,
but still collects the required information from the lines that contain it.
########################################################
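elsif( /Sequences producing significant alignments:/ ) {
    # skip the next whitespace line
    $_ = $self->_readline();
    while( defined ($_ = $self->_readline() ) &&
           ! /^\s+$/ ) {
        # Keep only lines that end in "<score> <E-value>". The hyphen is
        # last in the character class so that it is a literal '-'; written
        # as [\d\.-e] it forms the range '.' to 'e', which excludes '-' and
        # so skips E-values such as 3e-04.
        if (/(\d+)\s+([\d.e-]+)$/){
            push @hit_signifs , [$2,$1];   # [ E-value, score ]
        }
    }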
########################################################
I've only checked this against blastn files so far, so additional fixes may be
required for other variants. From the limited testing I've done this doesn't
seem to break the parsing of NCBI format files.
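As a quick check of the regex approach outside blast.pm, the sketch below runs
a variant of the pattern over a few lines from the sample list plus one
one-line entry. The helper name parse_signif is hypothetical, the one-line
entry is an invented stand-in for NCBI-style output rather than real program
output, and the hyphen is placed last in the character class so that it is
matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: return [ E-value, score ] for a line that ends in
# "<score> <E-value>", or an empty list for any other line.
sub parse_signif {
    my ($line) = @_;
    return unless $line =~ /(\d+)\s+([\d.e-]+)\s*$/;
    return [ $2, $1 ];
}

my @lines = (
    # GCG entries from the sample list (two lines per hit)
    "EM_RO:RNU94856 Begin: 713 End: 739 Strand:-\n",
    "!U94856 Rattus norvegicus paraoxonase mRNA, partia... 46 3e-04\n",
    "EM_RO:AC133000 Begin: 142704 End: 142720\n",
    "!Ac133000 Rattus Norvegicus Strain CH230-62B6 BAC,... 34 1.0\n",
    "EM_RO:MMHCR Begin: 1654 End: 1669\n",
    "!X13234 Woodchuck hcr gene. 5/1992 32 4.1\n",
    # Invented one-line entry standing in for NCBI-style output (assumption)
    "EXAMPLE1 one-line entry used only for illustration ... 46 3e-04\n",
);

# Lines without a trailing score and E-value contribute nothing.
my @hit_signifs = map { parse_signif($_) } @lines;

for my $signif (@hit_signifs) {
    printf "evalue=%s score=%s\n", @$signif;
}
```

The "Begin:/End:" lines produce no record, so each GCG hit contributes exactly
one [ E-value, score ] pair, as does the one-line entry. Scores with a decimal
point are not handled by this pattern.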