Tricking the perl regex engine to get suboptimal matches

From BioPerl
Jump to: navigation, search

The Hack

This trick is discussed more extensively in Mastering Regular Expressions by Jeffrey Friedl. This is something I raised on the mail list at one point; we can probably wrap this up into a method somewhere (probably something curried).

Essentially you have to embed code into the regex and trick the parser into backtracking using a negative lookahead. The match itself fails (i.e. no match is returned), but the embedded code is executed for each match attempt, so in essence you will get all matches, including suboptimal ones.

The following demo script is a slight modification of one I used which checks the consensus string from the input alignment (in aligned FASTA format here), extracts the alignment slice using that match, then spits the alignment out to STDOUT in clustalw format. This should work for perl 5.8 and up, but it's only been tested on perl 5.10. You should be able to use this to fit what you want.

How it works

The beginning regex match actually succeeds (matches the first 18-21 characters) thus setting $1. This gets passed to the code block that follows to be evaluated, passing the string out to the subroutine to be tested and returned if a match is present. Successful matches are then passed to the array with relevant information to be iterated over later. The last (?!) is the part that fails, causing the engine to backtrack and attempt the match again.

Note that we never use the full match, we only attempt matching once per iteration in the while loop.

The script

use strict;
use warnings;
use Bio::AlignIO;
 
my $file = shift; # or $ARGV[0]
 
# there is a bug, likely in Bio::Root::IO, where passing the filename directly
# to AlignIO causes parsing issues and (with some AlignIO formats) segfaults
 
my $in = Bio::AlignIO->new(-file => $file,
                            -format => 'fasta');
my $out = Bio::AlignIO->new(-fh => \*STDOUT,
                            -format => 'clustalw');
 
while (my $aln = $in->next_aln) {
    my $c = $aln->consensus_string(100);
    my @matches;
    # note this match actually fails
    $c =~ m/
        (.{18,21})
        (?{
           my $match = check_match($1);
           push @matches, [$match, pos(), length($match)] if $match;
           })
        (?!)
        /xig;
    for my $match (@matches) {
        my ($hit, $st, $end) = ($match->[0],
                                $match->[1] - $match->[2] + 1,
                                $match->[1]);
        my $newaln = $aln->slice($st, $end);
        $out->write_aln($newaln);
    }
}
 
sub check_match {
     my $match = shift;
     return unless $match;
     my $ct = $match =~ tr/?/?/;
     return $match if $ct <= 4;
}
[back to top]


Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox