[Bioperl-l] A perl regex query

Spiros Denaxas spiros at lokku.com
Tue Sep 18 10:41:20 EDT 2007


On 9/18/07, Benno Puetz <puetz at mpipsykl.mpg.de> wrote:
> James Smith wrote:
> >
> > Neeti,
> >
> > This isn't really a bioperl query - but I will try and explain a simple
> > solution...
> >
> > warn simplify( 'Cyclic-2,3-bisphospho-D-glycerate' );
> >
> > sub simplify {
> >   local $_ = "-$_[0]-";
> >         ## Quick hack: add "-" at start and end so we always match "-string-"
> >   s/-(
> >     Cyclic | # The prefix "cyclic"
> >     \d+    | # a single number between two "-"s
> >     \d+,\d+| # number,number between two "-"s
> >     \w       # a single word character, e.g. a letter, between two "-"s
> >   )(?=-)//ixg;  ## case-insensitive, commented, multiple matches!
> >         ## 0-width +ve lookahead assertion - so can match
> >         ## multiple consecutive -x- constructions in same regexp!
> >   s/-//g;
> >         ## remove remaining "-"s from string...
> >   return $_;
> >         ## return the string; otherwise the sub returns the count from s///
> > }
> >
> > Not sure what other test strings you may want - but most should be able
> > to fit in the () brackets in the first regexp of simplify
> >
> > James
> Along the same line
>
> # some test for most of the removals below
> my $string = "Alpha-Cyclic-2,3-bi-sphos-1,2,5-pho-D-beta-glycerate";
> my @ra_bad_terms = (  '-?(D|R|S)-',
>                       '-?([aA]lpha|[bB]eta|[gG]amma)-',
>                       '-?([cC]is|[tT]rans)-',
>                       '-?[cC]yclic-',
>                     # '-?\d+(,\d+)+-',   # uncomment to remove numbers, too
>                       '(?<!\d)-' );          # '-' not preceded by a digit
> print "$string\n";
> foreach ( @ra_bad_terms ){
>
>   eval { $string =~ s/$_//g; };
>   print "$_:$string\n";   # for feedback only
> }
> #$string =~ s/<@ra_bad_terms>//g;
>
> print lc($string),"\n";
>
>
> --
> Benno Pütz

In my opinion, you would be better off not using regular expressions
for this task. Instead, try to find an authoritative, centralized
source of information, be it a database of synonyms or some other
index of compound names. That gives your solution the domain knowledge
it needs. Stripping prefixes with regular expressions will almost
certainly lead to bugs that are very hard to resolve, since a pattern
cannot tell which parts of a name actually matter.
Should you decide to go ahead and treat the names purely as strings to
compare, I feel this becomes more of an NLP problem.
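
To make the lookup idea a bit more concrete, here is a rough sketch.
Everything in it is an assumption for illustration only: the hash
%canonical_name, the sub canonical() and the table entries are made up
and have not been checked against any chemical database. A real table
would be loaded from a curated source of synonyms.

```perl
use strict;
use warnings;

# Hypothetical synonym table (placeholder entries, not vetted chemistry):
# keys are lower-cased names as they might appear in the input,
# values are the name we want to compare on.
my %canonical_name = (
    'cyclic-2,3-bisphospho-d-glycerate' => 'bisphosphoglycerate',
    '2,3-bisphospho-d-glycerate'        => 'bisphosphoglycerate',
    'bisphosphoglycerate'               => 'bisphosphoglycerate',
);

# Look a name up case-insensitively. Returns undef for unknown names so
# the caller can flag them for manual review instead of guessing.
sub canonical {
    my ($name) = @_;
    return $canonical_name{ lc $name };
}

for my $name ( 'Cyclic-2,3-bisphospho-D-glycerate', 'Some-Unknown-Compound' ) {
    my $canon = canonical($name);
    print "$name => ", defined $canon ? $canon : '(not in table)', "\n";
}
```

The point of returning undef for unknown names is that a miss stays
visible, whereas a regex would silently produce some string either way.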

Spiros


