[Bioperl-l] A perl regex query
neetisomaiya at gmail.com
Tue Sep 18 08:47:18 EDT 2007
My actual problem is a bit more complicated.
It is not just one string, but lakhs of them; they are actually names of compounds.
The problem is that there are 2 different data sources and I need to match the
compound names between them, but though the compound may be
the same in the two, they use different naming formats for it.
eg 1 : Glucose
DB1 : D-glucose
DB2 : alpha-D-Glucose
eg2 : 2,3-bisphosphoglycerate
DB1 : Cyclic-2,3-bisphospho-D-Glycerate
DB2 : 2,3 bisphoshpglycerate
And these are some simple examples; there are even more complicated ones,
with many digits, alphas, betas, hyphens, S, R, cis, trans, etc.
I just want to see if the basic compound is the same, i.e. the first one will
be glucose and the second one will be 2,3-bisphosphoglycerate (can't take just
bisphosphoglycerate because 1,3-bisphosphoglycerate would mean something
different).
Does anyone have any suggestions on how to tackle this?
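
One possible starting point, as a rough sketch only: reduce each name to a comparison key by lower-casing it, splitting on hyphens, dropping stereo/ring descriptors, and keeping the numeric locants so that 2,3- and 1,3- stay distinct. The helper name compound_key and the descriptor list are assumptions made for illustration; the list will certainly need extending for real data, and misspelled names are not handled.

```perl
use strict;
use warnings;

# Assumed descriptor list (hypothetical, extend as needed for real data).
my @descriptors   = qw(alpha beta gamma cyclic cis trans d l r s);
my $descriptor_re = join '|', @descriptors;

# Hypothetical helper: turn a compound name into a key for comparison.
sub compound_key {
    my ($name) = @_;
    my $key = lc $name;

    # Treat whitespace like a hyphen, so "2,3 bis..." becomes "2,3-bis...".
    $key =~ s/\s+/-/g;

    # Split on hyphens and drop empty tokens and descriptor tokens.
    my @tokens = grep { length($_) && !/^(?:$descriptor_re)$/ } split /-/, $key;

    # Glue word tokens together, but keep a hyphen after a locant such as "2,3".
    my $out = '';
    for my $token (@tokens) {
        $out .= $token;
        $out .= '-' if $token =~ /^[\d,]+$/;
    }
    $out =~ s/-$//;    # no trailing hyphen if the name ended in a locant
    return $out;
}

print compound_key($_), "\n"
    for 'D-glucose', 'alpha-D-Glucose', 'Cyclic-2,3-bisphospho-D-Glycerate';
```

Names from the two sources could then be matched by storing the keys from one source in a hash and looking up the keys from the other.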
On 9/18/07, Spiros Denaxas <spiros at lokku.com> wrote:
> It's not impossible, you just have to use \b to denote the word boundaries:
> echo 'this-is-a_test-D-string-D' | perl -ne ' s/\b\-D\-\b//g ; print ;'
> It only gets rid of -D- , all other occurrences of D and - remain intact.
> On 9/18/07, neeti somaiya <neetisomaiya at gmail.com> wrote:
> > Thanks.
> > It might work, but not always, because the string could be something like
> > Cyclic-2,3-Bisphospho-D-Glycerate.
> > Here I will first convert the full thing to a lower case and would then
> > to get what I want.
> > Nothing seems to work: when I try to substitute -D- with nothing, "D" and
> > "-" when occurring separately also get substituted with nothing.
> > On 9/18/07, Roy Chaudhuri <rrc22 at cam.ac.uk> wrote:
> > >
> > > > This isn't really a bioperl query.
> > > > But does anyone know how I can substitute all special characters (+
> > > > other things) in a string with nothing in perl?
> > > > I mean if I have a string like Cyclic-2,3-bisphospho-D-glycerate and want
> > > > output as bisphosphoglycerate. I want to remove -D-, Cyclic-, 2,3-
> > > >
> > >
> > > A more general approach that might work is to keep lower case words (I
> > > don't know if that will be true for all your cases):
> > >
> > > $_='Cyclic-2,3-bisphospho-D-glycerate';
> > > print join '', /\b[a-z]+\b/g;
> > >
> > > Roy.
> > > --
> > > Dr. Roy Chaudhuri
> > > Department of Veterinary Medicine
> > > University of Cambridge, U.K.
> > >
> > --
> > -Neeti
> > Even my blood says, B positive
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
Even my blood says, B positive