[Bioperl-l] Sequence matching problem!
Sendu Bala
bix at sendu.me.uk
Fri Feb 23 08:55:24 EST 2007
James Smith wrote:
> On Fri, 23 Feb 2007, Albert Vilella wrote:
>
>> now that we are at this pattern matching thread, I was wondering if
>> any perl guru could enlighten me on the issue of matching exact
>> sequence patterns on a gapped target sequence. E.g.:
>>
>> my $seq = "CGATCAACGAATCGTACGTACTC";
>> my $gapped_seq =
>> "GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";
>>
>> and one would like to get as a result:
>>
>> "CG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTC"
>>
>> which is the match of $seq but in $gapped_seq.
>
> Try...
>
> my $seq = "CGATCAACGAATCGTACGTACTC";
> my $gapped_seq =
> "GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";
>
> my $regexp = '('.join('-*?',split//,$seq).')';
>
> if( $gapped_seq =~ /$regexp/ ) {
> print "Match is $1\n";
> } else {
> print "No match\n";
> }
That's great stuff. If you were matching thousands of different $seq
against the same very large $gapped_seq, and only needed the first match
of $seq in $gapped_seq, the alternative to the above approach (remove
the gaps from $gapped_seq and do index() matching) will be faster.
Here's one (overly long-winded) way of implementing it, that I found to
take ~2s vs ~22s for the above regex approach when doing the job on
999999 copies of $seq:
#!/usr/bin/perl -w
use strict;
use warnings;
my $gapped_seq =
"GGGGGGCG-------A---TC---AACGA-----ATC---GTA---CGTACTCTACTCGGGGG";
# note the total gap-length at position in gapless 0-based coords
my @gap_lengths;
my $gap_length = 0;
while ($gapped_seq =~ /(-+)/g) {
my $match = $1;
my $prev_length = $gap_length;
$gap_length += length($match);
my $end = pos($gapped_seq) - $gap_length - 1;
push(@gap_lengths, $prev_length) for (1..$end-$#gap_lengths);
}
push(@gap_lengths, $gap_length) for (1..(length($gapped_seq) -
@gap_lengths - $gap_length));
# remove the gaps
my $gapless_seq = $gapped_seq;
$gapless_seq =~ s/-//g;
# now for each of thousands of seqs...
my $seq = 'CGATCAACGAATCGTACGTACTC';
my @seqs;
for (1..999999) {
push(@seqs, $seq);
}
foreach my $seq (@seqs) {
my $start = index($gapless_seq, $seq);
if ($start == -1) {
print "No match found for seq '$seq'\n";
next;
}
my $end = $start + length($seq) - 1;
# calculate the coords in $gapped_seq
$start = $start + $gap_lengths[$start];
$end = $end + $gap_lengths[$end];
my $result = substr($gapped_seq, $start, ($end - $start + 1));
#print $result, "\n";
}
exit;
More information about the Bioperl-l
mailing list