[Bioperl-guts-l] [BioPerl - Bug #3235] Blast.pm adding \x01 character inconsistently when parsing of some Hit Identifiers

redmine at redmine.open-bio.org
Thu Jun 16 17:26:27 EDT 2011


Issue #3235 has been updated by Chris Fields.


Fixing the regex binds us to NCBI-like IDs, which may not always be what we get (and the regex could conversely fall apart in other cases).  However, it's pretty easy to check whether the previous line has spaces just before the newline; I've implemented that check, and it seems to fix the problem.
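
For readers following along, the check described above might look roughly like the sketch below. It is not the committed Blast.pm code: "join_desc_lines" is a hypothetical helper, and it assumes that a wrapped double space always leaves one space at the end of the previous line, while a genuine new identifier follows a line with no trailing space.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Blast.pm): joins the wrapped lines of one
# hit description, inserting \x01 (SOH) only where a new identifier starts.
sub join_desc_lines {
    my @lines = @_;
    my $desc = '';
    my $prev_ended_in_space = 0;
    for my $line (@lines) {
        chomp(my $text = $line);
        # Remember the trailing space before the line is modified.
        my $ends_in_space = ($text =~ /\s$/) ? 1 : 0;
        # A single leading space marks a new identifier, unless the previous
        # line ended in a space (assumed here to be a wrapped double space).
        if ($text =~ /^\s(?!\s)/ && !$prev_ended_in_space) {
            $text =~ s/^\s/\x01/;
        }
        $prev_ended_in_space = $ends_in_space;
        $desc .= $text;
    }
    return $desc;
}
```

With this sketch, a line starting with " This model describes" that follows a line ending in ". " is kept as plain description text, while " gi|..." after a line without trailing space still gets the separator.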
----------------------------------------
Bug #3235: Blast.pm adding \x01 character inconsistently when parsing of some Hit Identifiers
https://redmine.open-bio.org/issues/3235

Author: Francisco J. Ossandon
Status: Closed
Priority: Normal
Assignee: Bioperl Guts
Category: Bio::Search/Bio::SearchIO
Target version: 
URL: 


Hello,
I've seen that "blast.pm" currently assumes that any Hit Description line that starts with a single space begins a new Identifier, which is the general case for the NR database (and quite convenient for separating the multiple identifiers and descriptions), as seen in line 872:

@s/^\s(?!\s)/\x01/; #new line to concatenate desc lines with <soh>@

However, I constantly use NCBI's Conserved Domain Database (CDD, http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml), where there is only 1 identifier and 1 description (sometimes quite long) for each domain. In these domain descriptions, it's common to have a short primary description followed by +a dot and then 2 empty spaces+ and then the longer explanation. So when the primary description is too long, the second empty space goes to the start of the next line in the blast output and causes Bioperl to add a \x01 character, even though the line is still part of the description, not a new identifier. For example, of the next 2 blast hit descriptions, the first one will not get a \x01 but the second one will (NOTE: since this page trims the double empty spaces, I have to represent them with a perlish "\s"):

@>gnl|CDD|162955 TIGR02630, xylose_isom_A, xylose isomerase. \s \s Members of this \s
family are the enzyme xylose isomerase (5.3.1.5), which interconverts \s
D-xylose and D-xylulose.@

@>gnl|CDD|131679 TIGR02631, xylA_Arthro, xylose isomerase, Arthrobacter type. \s
\s This model describes a D-xylose isomerase that is also active \s
as a D-glucose isomerase. It is tetrameric and dependent \s
on a divalent cation Mg2+, Co2+ or Mn2+ as characterized in \s
Arthrobacter. Members of this family differ substantially from \s
the D-xylose isomerases of family TIGR02630.@
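
To reproduce the difference in isolation, the substitution quoted from line 872 can be applied to the first continuation line of each hit. This is only a minimal sketch: "marks_new_id" is a hypothetical helper, not a Blast.pm function, and the "\s" placeholders are written back as real spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the line-872 substitution to one continuation
# line and reports whether a \x01 (SOH) was inserted.
sub marks_new_id {
    my ($line) = @_;
    return ($line =~ s/^\s(?!\s)/\x01/) ? 1 : 0;
}

# First continuation line of TIGR02630: no leading space.
print marks_new_id("family are the enzyme xylose isomerase (5.3.1.5), which interconverts "), "\n";  # prints 0

# First continuation line of TIGR02631: the wrapped second space leads the line.
print marks_new_id(" This model describes a D-xylose isomerase that is also active "), "\n";  # prints 1
```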

I think one way to fix this inconsistent behaviour is to change the regular expression on line 872 of "blast.pm" so that it must also match a pipe character (as in "gi|13488061|ref|NP_085643.1|") before assuming that a new identifier is present (as happens in NR), rather than just a continuation of the description. The new regular expression could look like this:

@s/^\s(?=\S+?\|)/\x01/; #new line to concatenate desc lines with <soh> when an identifier is present@
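
A quick way to see what the proposed expression accepts is to run it on a few continuation lines. The sketch below is illustrative only: "marks_new_id_proposed" is a hypothetical helper, and both the description text after the gi identifier and the pipe-free ID "seq42" are made up for the example. The last case shows the limitation of requiring a pipe: an identifier without one is no longer recognised.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the proposed substitution to one continuation
# line and reports whether a \x01 (SOH) was inserted.
sub marks_new_id_proposed {
    my ($line) = @_;
    return ($line =~ s/^\s(?=\S+?\|)/\x01/) ? 1 : 0;
}

# NR-style identifier containing pipes: still marked as a new identifier.
print marks_new_id_proposed(" gi|13488061|ref|NP_085643.1| example description"), "\n";  # prints 1

# Wrapped CDD description text: no longer marked.
print marks_new_id_proposed(" This model describes a D-xylose isomerase that is also active "), "\n";  # prints 0

# Made-up identifier without a pipe: also not marked.
print marks_new_id_proposed(" seq42 a locally named sequence"), "\n";  # prints 0
```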

I'm attaching a blast output with the above examples in their original state; you can compare the "$hit->description" output for both of them.

