[Bioperl-l] A perl regex query
stephane.teletchea at jouy.inra.fr
Tue Sep 18 09:48:05 EDT 2007
neeti somaiya wrote:
> My actual problem is a bit more complicated.
> It is not just one string, but lakhs of them; they are actually names of
> chemical compounds.
> The problem is that there are 2 different data sources. I need to match the
> compound names between them, but although the compound may be
> the same in the two, they use different naming formats for it.
> eg 1 : Glucose
> DB1 : D-glucose
> DB2 : alpha-D-Glucose
> eg2 : 2,3-bisphosphoglycerate
> DB1 : Cyclic-2,3-bisphospho-D-Glycerate
> DB2 : 2,3 bisphoshpglycerate
> And these are some simple examples; there are even more complicated ones,
> with many digits, alphas, betas, hyphens, S, R, cis, trans etc.
> I just want to see if the basic compound is the same, i.e. the first one will
> be glucose and the second one will be 2,3-bisphosphoglycerate (can't take just
> bisphosphoglycerate because 1,3-bisphosphoglycerate would mean something
> else).
> Does anyone have any suggestions on how to tackle this?
I would use a two-step approach:
1 - filter the entries using a convention: for instance, translate all '+'
into their literal equivalent 'plus', replace spaces with '_', replace all
'-' with '_' as well, etc.
2 - try matching the results; if the match does not work, try a looser
match (for instance, remove all non-alphabetical
characters and see if the resulting strings match).
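
The two steps above could be sketched roughly as follows. This is only an
illustration, written in Python for brevity (the same substitutions carry
over directly to Perl's s///). The function names normalize, letters_only
and same_compound are hypothetical, and lower-casing the names is an extra
assumption not stated above.

```python
import re

def normalize(name):
    """Step 1: apply one naming convention to every entry."""
    s = name.strip().lower()          # assumption: case is not significant
    s = s.replace("+", "plus")        # '+' becomes its literal equivalent
    s = re.sub(r"[\s\-]+", "_", s)    # runs of spaces/hyphens become one '_'
    return s

def letters_only(name):
    """Step 2 fallback: drop every non-alphabetical character."""
    return re.sub(r"[^a-z]", "", name.lower())

def same_compound(a, b):
    """Return 'exact' if the normalized names agree, 'loose' if only the
    letters-only forms agree, and None otherwise."""
    if normalize(a) == normalize(b):
        return "exact"
    if letters_only(a) == letters_only(b):
        return "loose"
    return None

print(same_compound("D-glucose", "d glucose"))
print(same_compound("2,3-bisphosphoglycerate", "2,3 bisphospho-glycerate"))
print(same_compound("D-glucose", "alpha-D-Glucose"))
```

Note that the letters-only fallback throws away the digits, so
1,3-bisphosphoglycerate and 2,3-bisphosphoglycerate come out as a "loose"
match; such hits are better treated as candidates to review than as
confirmed identities. Prefixes such as alpha- or Cyclic- are not handled
by this sketch at all.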
That's the theory. Now you will need some time for trial and error, but I
think there is no easy, one-shot solution, nor a BioPerl facility
for handling (bio)chemical compounds.
Stéphane Téletchéa, PhD. http://www.steletch.org
Unité Mathématique Informatique et Génome http://migale.jouy.inra.fr/mig
INRA, Domaine de Vilvert Tél : (33) 134 652 891
78352 Jouy-en-Josas cedex, France Fax : (33) 134 652 901