Many of us are good at reusing code beyond a certain complexity, but when it comes to writing a “quick” string randomizer, or finding a hash key that corresponds to a maximum value, or traversing a hash structure, etc., we constantly reinvent the wheel. It’s worth doing, to keep computationally fit, but some days we could use that extra half an hour.
Next time you rewrite that algorithm, log it here in the scrapbook, and give the gift of time. Post your favorite solutions to common tasks, elegant algorithms worthy of admiration, or truly frightening code to pique the morbid curiosity of the BioPerl community.
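For instance, the hash-key-of-maximum-value chestnut mentioned above can be dispatched in a couple of lines (a minimal sketch, with made-up data):

```perl
use strict;
use warnings;

# find the hash key that corresponds to the maximum value (dummy data)
my %score = ( A => 3, B => 9, C => 5 );
my ($max_key) = sort { $score{$b} <=> $score{$a} } keys %score;
print "$max_key\n";   # B
```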
If you’re looking for an answer and don’t see it here, try the HOWTOs, search the BioPerl Archives, or post your question to the listserv (sign up here).
Note: These are donated scraps from various contributors. Please recognize that programming styles vary widely. If a particular scrap has a bug, feel free to fix it by modifying the wiki, but refrain from imposing your own programming style on others; the one exception is code that doesn't follow a Best Practice and may lead to problems down the road (we do have Talk pages for those purposes).
# get $tree somehow, e.g. from $treeio->next_tree (see Bio::TreeIO)
my @nodes = $tree->get_nodes;
my %clades;
foreach my $n (@nodes) {
    my @desc = $n->get_all_Descendents;
    if ($n->is_Leaf) {
        # degenerate clades...
        $clades{$n->id}++;
    }
    else {
        my @lvs = grep { $_->is_Leaf } @desc;
        $clades{ join(',', sort map { $_->id } @lvs) }++;
    }
}
return keys %clades;
If internal nodes are labeled in the input (i.e., they have an id() property), they will also appear in the output.
use Bio::TreeIO;
use Bio::Tree::Compatible;
my $tio = Bio::TreeIO->new(-format => 'newick', -file => $file);
while (my $t = $tio->next_tree) {
    print "(", join(',', @{$_}), ")\n"
        for values %{ Bio::Tree::Compatible::cluster_representation($t) };
}
File content:
(((A:5,B:5):2,(C:4,D:4):1):3,E:10);
(((A:5,B:5)x:2,(C:4,D:4)y:1)z:3,E:10)r;
–MAJ
(see the bioperl-l thread here)
Shalabh Sharma asks
I was just wondering, is there any module in bioperl that does subsampling?
from Mark Jensen:
If you trust rand()
…
# open your file into $my_infile, then
my @lines = <$my_infile>;
my $num_samps = 10;
my $sample_size_pc = 0.25;
my @samples;
for (1..$num_samps) {
    push @samples, [ map { int( @lines * rand ) } ( 0..int($sample_size_pc * @lines) ) ];
}
# now, do something, fr'instance
my @sample_pc;
foreach (@samples) {
    my $pct = 0;
    foreach my $line (@lines[ @$_ ]) {
        my @a = split(/\s+/, $line);
        $pct += $a[2];
    }
    $pct /= @$_;
    push @sample_pc, $pct;
}
# etc...
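If sampling without replacement is wanted instead, List::Util's shuffle (core Perl) keeps it to a couple of lines; a sketch with dummy stand-in data:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

my @lines = ('a' .. 'j');                  # stand-in for the file's lines
my $sample_size = int(0.25 * @lines) || 1; # 25% sample, at least 1
# shuffle the indices and take the first few: sampling without replacement
my @idx    = (shuffle 0 .. $#lines)[0 .. $sample_size - 1];
my @sample = @lines[@idx];
print scalar(@sample), "\n";               # 2
```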
(see bioperl-l discussion here).
Claudio Sampaio asks
How to get the entropy of each nucleotide of an alignment?
from Mark Jensen:
If you have a Bio::SimpleAlign object prepared
$alnio = Bio::AlignIO->new(-format=>'fasta', -file=>'your.fas');
$aln = $alnio->next_aln;
try the following function, as
$entropies = entropy_by_column( $aln )
=head2 entropy_by_column
Title : entropy_by_column
Usage : entropy_by_column( $aln )
Function: returns the Shannon entropy for each column in an alignment
Example :
Returns : hashref of the form { $column_number => $entropy, ... }
Args : a Bio::SimpleAlign object
=cut
sub entropy_by_column {
    my ($aln) = @_;
    my %ent;
    foreach my $col (1..$aln->length) {
        my %res;
        foreach my $seq ($aln->each_seq) {
            my $loc = $seq->location_from_column($col);
            next if (!defined($loc) || $loc->location_type eq 'IN-BETWEEN');
            $res{$seq->subseq($loc)}++;
        }
        $ent{$col} = entropy(values %res);
    }
    return \%ent;
}
=head2 entropy
Title : entropy
Usage : entropy( @numbers )
Function: returns the Shannon entropy of an array of numbers,
each number represents the count of a unique category
in a collection of items
Example : entropy ( 1, 1, 1 ) # returns 1.09861228866811 = log(3)
Returns : Shannon entropy or undef if entropy undefined;
Args : an array
=cut
use List::Util qw(sum);
sub entropy {
    my @a = grep { $_ } @_;
    return undef unless grep { $_ > 0 } @a;
    return undef if grep { $_ < 0 } @a;
    my $t = sum(@a);
    $_ /= $t foreach @a;
    return sum( map { -$_ * log($_) } grep { $_ } @a );
}
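As a standalone sanity check of the formula (H = -sum of p*log(p), natural log): three equally frequent categories should give log(3):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Shannon entropy of equal counts (1,1,1): each p = 1/3, so H = log(3)
my @counts = (1, 1, 1);
my $t = sum(@counts);
my $H = sum( map { my $p = $_ / $t; -$p * log($p) } @counts );
printf "%.4f\n", $H;   # 1.0986
```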
To get an array of hashes of hashes containing the frequencies of each residue at each site of an alignment:
$freqs = freqs_by_column($aln);
=head2 freqs_by_column
Title : freqs_by_column
Usage : freqs_by_column( $aln )
Function: returns the freqs for each column in an alignment
Example :
Returns : hashref of the form { $column_number => {'A'=>$afreq,...}, ... }
Args : a Bio::SimpleAlign object
=cut
sub freqs_by_column {
    my ($aln) = @_;
    my %freqs;
    foreach my $col (1..$aln->length) {
        my %res;
        foreach my $seq ($aln->each_seq) {
            my $loc = $seq->location_from_column($col);
            next if (!defined($loc) || $loc->location_type eq 'IN-BETWEEN');
            $res{$seq->subseq($loc)}++;
        }
        $freqs{$col} = freqs(\%res);
    }
    return \%freqs;
}
=head2 freqs
Title : freqs
Usage : freqs( \%ntct )
Function: returns the site frequencies of nts in an aligment
Returns : hashref of freqs: { A => 0.5, G => 0.5 }
Args : hashref of nt counts { A => 100, G => 100 }
=cut
sub freqs {
    my $ntct = shift;
    my @a = qw( A T G C );
    # normalize over A/T/G/C only, without clobbering the caller's counts
    # (uses List::Util::sum, imported above)
    my $t = sum( map { $_ || 0 } @{$ntct}{@a} );
    my %ret;
    $ret{$_} = ($ntct->{$_} || 0) / $t for @a;
    return \%ret;
}
but don’t expect to set any speed records.
(see bioperl-l thread here.)
Jarod Pardon asks:
I have some sequence databases such as RefSeq in flat GenPept/GenBank format. There is a list of GI numbers and I want to extract the sequence from the database according to the GI number.
Mark Jensen replies:
Bio::DB::Flat is nicely generalized to allow different ‘namespaces’ for the different identifiers used on different sequences. You can choose the type of identifier you want (gi, in your case) by using get_Seq_by_acc(), as follows (this actually works on my machine):
$db = Bio::DB::Flat->new(-directory => "$ENV{HOME}/scratch",
-dbname => 'mydb',
-format => 'genbank',
-index => 'bdb',
-write_flag => 1);
$db->build_index("$ENV{HOME}/scratch/plastid1.rna.gbff");
$seq = $db->get_Seq_by_acc('GI' => 71025988);
If you want to get by accession number, use get_Seq_by_acc('ACC' => $accno), etc.
(see the thread on bioperl-l)
from Abhi Pratap:
I am looking for some code to parse the ACE file format. I have big ACE files which I would like to trim based on the user defined Contig name and specific region and write out the output to another fresh ACE file.
Russell Smithies responds:
Here’s how I’ve been doing it:
my $infile = "454Contigs.ace";
my $parser = Bio::Assembly::IO->new(-file => $infile, -format => "ace") or die $!;
my $assembly = $parser->next_assembly;
# to work with a named contig
my @wanted_id = ("Contig100");
my ($contig) = $assembly->select_contigs(@wanted_id) or die $!;
#get the consensus
my $consensus = $contig->get_consensus_sequence();
#get the consensus qualities
my @quality_values = @{$contig->get_consensus_quality()->qual()};
Eds. note: ACE file writing is another matter, but solutions are around the corner. See the rest of the discussion.
BioSQL is a part of the OBDA standard and was developed as a common sequence database schema for the different language projects within the Open Bioinformatics Foundation.
The BioPerl client implementation is bioperl-db. It provides an Object-Relational Mapping (ORM) for various BioPerl objects, such as Bio::SeqI. Here is a simple example:
#!/usr/bin/perl
use strict;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::DB::BioDB;
# create a database adaptor, actual parameters depend on your local database installation, here postgresql
my $dbadp = Bio::DB::BioDB->new(-database => 'biosql',
-user => 'biosql',
-dbname => 'biosql',
-host => 'localhost',
-driver => 'Pg');
my $adp = $dbadp->get_object_adaptor("Bio::SeqI");
my $acc = "XP_000001"; # just a dummy
my $ver = 1;           # sequence version to look up
my $seq = Bio::Seq->new(-accession_number => $acc,
                        -namespace        => 'swissprot',
                        -version          => $ver);
my $dbseq = $adp->find_by_unique_key($seq);
my $feat = Bio::SeqFeature::Generic->new(-primary_tag => 'gene', # or whatever suits
                                         -strand      => 1,
                                         -start       => 100,
                                         -end         => 10000,
                                         -source_tag  => 'blat');
$dbseq->add_SeqFeature($feat);
$dbseq->store;
Bhakti Dwivedi wonders:
Does anyone know how to retrieve the “Source” or the “Species name” given the accession number?
The following scrap (with portions suspiciously reminiscent of the EUtilities HOWTO) demonstrates how you might do this:
use Bio::DB::EUtilities;
my (%taxa, @taxa);
my (%names, %idmap);
# these are protein ids; nuc ids will work by changing -dbfrom => 'nucleotide',
# (probably)
my @ids = qw(1621261 89318838 68536103 20807972 730439);
my $factory = Bio::DB::EUtilities->new(-eutil =>
'elink',
-db => 'taxonomy',
-dbfrom => 'protein',
-correspondence => 1,
-id => \@ids);
# iterate through the LinkSet objects
while (my $ds = $factory->next_LinkSet) {
$taxa{($ds->get_submitted_ids)[0]} = ($ds->get_ids)[0]
}
@taxa = @taxa{@ids};
$factory = Bio::DB::EUtilities->new(-eutil => 'esummary',
-db => 'taxonomy',
-id => \@taxa );
while (local $_ = $factory->next_DocSum) {
$names{($_->get_contents_by_name('TaxId'))[0]} =
($_->get_contents_by_name('ScientificName'))[0];
}
foreach (@ids) {
$idmap{$_} = $names{$taxa{$_}};
}
# %idmap is
# 1621261 => 'Mycobacterium tuberculosis H37Rv'
# 20807972 => 'Thermoanaerobacter tengcongensis MB4'
# 68536103 => 'Corynebacterium jeikeium K411'
# 730439 => 'Bacillus caldolyticus'
# 89318838 => undef (this record has been removed from the db)
1;
–maj
See HOWTO:Getting_Genomic_Sequences; replace get_genomic_sequence() with the second version below to get the exact solution. This requires [Bio::EnsEMBL](https://metacpan.org/pod/Bio::EnsEMBL) tools; see link below. –Ed.
Emanuele Osimo, after working hard with HOWTO Getting Genomic Sequences, says:
The script works fine, but gives me the wrong coordinates… there are two different sets of coordinates. The first is called “NC_000001.10 Genome Reference Consortium Human Build 37 (GRCh37), Primary_Assembly”, and is the one I need; the second is called just “NT_004610.19” and is the one the script prints. Do you know how to make the script print the “right” coordinates (at least, the ones I need)?
After solving this, the follow-up question:
Could you please tell me how to use the RefSeq coordinates in another script that fetches the sequence, for instance?
Below is a script that fetches chromosomal coordinates based on a geneID, and uses those to obtain chromosomal nucleotide sequence. It’s hacky, but (sort of) works today (23:12, 23 July 2009 (UTC)).
use strict;
use Bio::DB::EntrezGene;
use Bio::WebAgent;
use Bio::DB::EUtilities;
use Bio::DB::GenBank;
use Bio::SeqIO;
use Bio::Root::IO;
my $ua = Bio::WebAgent->new();
my $db = Bio::DB::EntrezGene->new;
my ($chr,$start, $end, $info);
my @geneids = (842, 843, 844);
for my $geneid ( @geneids ) {
($chr,$start, $end, $info) = genome_coords($geneid, $db, $ua);
my $seq_obj = get_genomic_sequence($chr, $start, $end);
print $seq_obj->seq; # or whatever you want
}
sub genome_coords {
my ($id, $db, $ua) = @_;
my $seq = $db->get_Seq_by_id($id);
my $ac = $seq->annotation;
for my $ann ($ac->get_Annotations('dblink')) {
if ($ann->database eq 'UCSC')
{
my $resp = $ua->get($ann->url);
my ($chr, $start, $end, $info) =
$resp->content =~ m{position=chr([0-9]+):([0-9]+)-([0-9]+)
\&.*\&refGene.*?\">(.*?)</A>}gx;
return unless $chr;
return ($chr, $start, $end, $info);
}
}
return; # parse error, no UCSC link on page
}
sub get_genomic_sequence {
my ($chr, $start, $end) = @_;
my $fetcher = Bio::DB::EUtilities->new(-eutil => 'efetch',
-db => 'nucleotide',
-rettype => 'gb');
my $strand = $start > $end ? 2 : 1;
my $acc = sprintf("NC_%06d", $chr); # standard accessions for the current human genome build
# NC_000023 is X, NC_000024 is Y. NCC-1701 is the Enterprise.
$fetcher->set_parameters(-id => $acc,
-seq_start => $start,
-seq_stop => $end,
-strand => $strand);
my $seq = $fetcher->get_Response->content;
# there must be a better way than below, but oh well...
if ($seq) {
my ($fh, $fn) = Bio::Root::IO::tempfile();
print $fh $seq;
close $fh;
my $seqio = Bio::SeqIO->new(-file=>$fn, -format=>'genbank');
return $seqio->next_seq;
}
return;
}
–Ed.
Emanuele submits:
I discovered that the coupling of the two subs that Mark posted doesn’t get the right results. I think this is because one gets the coordinates with RefSeq build 36.3, the other with build 37. I found that coupling the first sub, genome_coords, with the Bio::EnsEMBL::Registry fetch by region API is a lot better, and it actually generates sequences that contain the genes.
Here is a modified get_genomic_sequence, based on EO’s script. This requires Bio::EnsEMBL::*; get it here, and see the core API docs here:
use Bio::EnsEMBL::Slice;
use Bio::EnsEMBL::Registry;
# returns a Bio::EnsEMBL::Slice
sub get_genomic_sequence {
my ($chr, $start, $end) = @_;
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
-host => 'ensembldb.ensembl.org',
-user => 'anonymous'
);
my $slice_adaptor = $registry->get_adaptor( 'Human', 'Core', 'Slice' );
my $slice = $slice_adaptor->fetch_by_region( 'chromosome', $chr, $start, $end );
return $slice;
}
The hack
I thought this was helpful, since the only DNA track example documentation I could find was in the POD of bioperl-live (not on search.cpan.org yet). Thoughts? –Jhannah 02:48, 13 May 2008 (UTC)
The script
use strict;
use Bio::Graphics;
use Bio::SeqFeature::Generic;
# -------------------------------------
# Configure these... ( name start end )
my @stuff = qw(
thing1 2 6
thing2 14 10
thing3 12 16
thing4 18 22
);
my $seq = 'ACGGTCGATCGATCGATCGATCGTACGATCG';
# --------------------------------------
my $panel = Bio::Graphics::Panel->new(
-length => length($seq), # bp
-width => length($seq) * 7, # pixels
);
# Add the arrow at the top.
my $full_length = Bio::SeqFeature::Generic->new(
-start => 1,
-end => length($seq),
);
$panel->add_track($full_length,
-glyph => 'arrow',
-tick => 2,
-fgcolor => 'black',
-northeast => 1,
);
# Data track...
my $track = $panel->add_track(
-glyph => 'generic',
-stranded => 1,
-label => 1,
);
# Add each feature to the data track.
while (@stuff) {
my ($name, $start, $end) = splice @stuff, 0, 3;
my $feature = Bio::SeqFeature::Generic->new(
-strand => $start < $end ? 1 : -1,
-display_name => $name,
-start => $start,
-end => $end,
);
$track->add_feature($feature);
}
# Add the DNA track to the bottom.
my $dna = Bio::PrimarySeq->new( -seq => $seq );
my $feature = Bio::SeqFeature::Generic->new( -start => 1, -end => length($seq) );
$feature->attach_seq($dna);
$panel->add_track( $feature, -glyph => 'dna' );
# Send to a .png file.
print $panel->png;
Output:
(see the bioperl-l thread starting here)
Matthew Betts asks:
Has any one tried to draw secondary structure with Bio::Graphics, i.e. two different types of glyph with different colours on the same track?
Scott Cain replies:
If you want to put more than one glyph and/or more than one color in a track, it is fairly easy. You just need to provide a callback for each option when you create the track, like this:
$panel->add_track($features_array_ref,
    -glyph => sub {
        # code to set the glyph according to the attributes of the feature
    },
    -bgcolor => sub {
        # code to set the color
    },
    -fgcolor => 'black',
    # ...etc...
);
For more information, see the Graphics HOWTO
Matthew implements:
$panel->add_track(
'-bgcolor' => sub {
my($feature) = @_;
$feature->display_name eq 'strand' ? 'cyan' : 'magenta';
},
'-strand_arrow' => sub {
my($feature) = @_;
$feature->display_name eq 'strand' ? 1 : 0;
},
);
The hack
From Christopher Fields
A simple alignment overview plotter (requires GD::SVG, Bio::Graphics, and bioperl-live). This takes alignment data and prints out the sequences in blocks with dotted lines indicating gaps.
To change to PNG output, delete/comment the -image_class parameter passed to the Bio::Graphics::Panel instance and change the last line to print $panel->png.
The script
#!/usr/bin/perl
use strict;
use Bio::Graphics;
use Bio::AlignIO;
use Bio::SeqFeature::Generic;
my $file = shift or die "Usage: render_alignment.pl <clustalw file> ";
my $aio = Bio::AlignIO->new(-file => $file,
-format => 'clustalw') or die "parse failed";
my $aln = $aio->next_aln() or die "no alignment";
my $panel = Bio::Graphics::Panel->new(
-image_class => 'SVG',
-length => $aln->length,
-width => 800,
-pad_left => 150,
-pad_right => 10,
);
my $full_length = Bio::SeqFeature::Generic->new(
-start => 1,
-end => $aln->length,
-display_name => 'clustalw alignment',
);
$panel->add_track($full_length,
-glyph => 'arrow',
-tick => 2,
-fgcolor => 'black',
-double => 1,
-label => 1,
);
my $track = $panel->add_track(
-glyph => 'segments',
-label => 1,
-connector => 'dashed',
-bgcolor => 'blue',
-label_position => 'alignment_left',
-font2color => 'red'
);
for my $seqobj ($aln->each_seq) {
my $seq = $seqobj->seq;
my @seqs;
# get alignment positions for seqs
my $feature = Bio::SeqFeature::Generic->new(
-display_name => $seqobj->get_nse,
);
while ($seq =~ m{([^-]+)}g) {
# zero-based coords, must adjust accordingly
$feature->add_sub_SeqFeature(
Bio::SeqFeature::Generic->new(-start => pos($seq)-length($1)+1,
-end => pos($seq),
-sequence => $1), 'EXPAND');
}
$track->add_feature($feature);
}
print $panel->svg;
Fig. 1: Graphical alignment output
(This is the fruit of two scraps, Getting all k-mer combinations of residues and Sharing large arrays among multiple threads)
Marco Blanchette contributes the following scripts. Script 1 (countKmer.pl) processes a (possibly very) large sequence file using multiple threads, to return the frequency distribution of k-mers of arbitrary length k. Script 2 (sampleAndCountKmer.pl) does the same thing, if desired, or will randomly sample the sequences (with or without replacement) to return an empirical frequency distribution of k-mers. The user can also specify whether to count just presence/absence of a k-mer in a sequence, or to return a complete frequency distribution.
#!/usr/local/bin/perl
# countKmer.pl takes a series of fasta files and counts the occurrences of all
# possible kmers in each file. It's a multi-threaded script that processes
# multiple files asynchronously.
use strict;
use warnings;
use Getopt::Std;
use threads;
use Thread::Queue;
our ($opt_k,$opt_m,$opt_p);
our $SEP="\t";
MAIN:{
init();
my $q = Thread::Queue->new;
$q->enqueue(@ARGV);
my $num_workers = @ARGV < $opt_p ? @ARGV : $opt_p; # no need to be wasteful :)
for (1 .. $num_workers) {
threads->new(\&worker, $q);
}
$_->join for threads->list;
}
sub worker {
my $queue = shift;
while (my $filename = $queue->dequeue_nb) {
processFile($filename);
}
return(0);
}
sub processFile {
my $file = shift;
#generate all possible word for a k-mer (define at run time by $opt_k)
print STDERR "Generating all the possible sequences $opt_k long\n";
my $kmer_p = kmer_generator($opt_k);
print STDERR "Counting the number of $opt_k nt long kmers in $file\n";
loadSeq($file,$kmer_p);
##print out the hits
printHits($kmer_p,$file);
}
sub loadSeq {
my $file = shift;
my $kmer_p = shift;
open FH, $file or die "Can't open file $file\n";
my $f=0;
my $seq;
while(<FH>){
    if(/^>/){
        countKmer(\$seq,$kmer_p) if $seq;
        $seq = '';
        $f = 1;
    }elsif($f==1){
        chomp;
        next if /^\s*$/;   # skip blank lines
        $seq .= $_;
    }
}
countKmer(\$seq,$kmer_p) if $seq; # do not forget the last sequence
close FH;
return(0);
}
sub countKmer {
my $seq_p = shift;
my $kmer_p = shift;
my $k = $opt_k;
my %beenThere;
for (my $i=0;$i <= length(${$seq_p})-$k;$i++){
my $w = substr(${$seq_p},$i,$k);
unless ($opt_m){
#Count only one occurrence of a kmer per sequence
$kmer_p->{$w}++ if !exists $beenThere{$w} && exists $kmer_p->{$w};
$beenThere{$w}=1;
}else{
#Count all instances of a kmer per sequence
$kmer_p->{$w}++ if exists $kmer_p->{$w};
}
}
return(0);
}
sub printHits {
my $kmer_p=shift;
my $file = shift;
##print out the hits
my ($dir,$pre,$suf) = ($file =~ /(^.+\/|^)(.+)\.(.+$)/);
open OUT, ">$pre.hits" or die "Can't create file $pre.hits\n";
print OUT join($SEP, $_, $kmer_p->{$_}), "\n" for sort keys %{$kmer_p};
close OUT;
return(0);
}
sub kmer_generator {
my $k = shift;
my $kmer_p;
my @bases = ('A','C','G','T');
my @words = @bases;
for (my $i=1;$i<$k;$i++){
my @newwords;
foreach my $w (@words){
foreach my $b (@bases){
push (@newwords,$w.$b);
}
}
undef @words;
@words = @newwords;
}
map{ $kmer_p->{$_}=0} @words;
return($kmer_p);
}
sub init {
getopts("p:k:m");
unless (@ARGV){
print("\nUsage: countKmer.pl [-k 6 -p 4 -m] sequence_1.fa [sequence_2.fa ...]\n",
      "\tFor each possible word in a kmer of length -k,\n",
      "\tcount the number of times it is found in the fasta sequence file\n",
      "\t-k\tsize of the kmer to analyze. Default 6\n",
      "\t-m\twill count all occurrences of a kmer per sequence.\n",
      "\t\tDefault: only one kmer is counted per sequence entry\n",
      "\t-p\tThe number of jobs to process simultaneously. Normally, the number of available processors\n",
      "\t\tDefault: 4\n",
"\n",
);
exit(1)
}
$opt_k=6 unless $opt_k;
$opt_p=4 unless $opt_p;
return(0);
}
Script 2 (sampleAndCountKmer.pl)
#!/usr/local/bin/perl
# sampleAndCountKmer.pl is a bootstrapping procedure that samples a list of sequences a
# certain number of times and returns the count of each kmer in each sampling. Again, this
# is a multi-threaded script that will count multiple samples asynchronously.
use strict;
use warnings;
use Getopt::Std;
use threads;
use threads::shared;
use Thread::Queue;
our ($opt_f,$opt_k,$opt_m,$opt_p,$opt_s,$opt_n,$opt_w,$opt_v);
our $SEP="\t";
MAIN:{
init();
my $file = $opt_f;
my $seq_p = loadSeq($opt_f);
#create the final result hash with shared arrays at each kmers.
my $res = &share({});
# populate the shared hash in place (overwriting $res would discard its shared-ness)
%$res = %{ kmer_generator($opt_k) };
$res->{$_} = &share([]) for keys %{$res};
my $q = Thread::Queue->new;
$q->enqueue(1..$opt_n);
my $num_workers = $opt_p; # no need to be wasteful :)
for (1 .. $num_workers) {
threads->new(\&worker, $q, $seq_p, $res, $_);
}
for (threads->list){
$_->join;
}
##Now, need to deconvolute the arrays containing all the counts for all kmers
##and spit out a table of count for all the samples
print(join ("$SEP",$_,@{$res->{$_}}),"\n") for keys %$res;
print STDERR "all done!\n" if $opt_v;
}
sub worker {
my $queue = shift;
my $seq_p = shift;
my $res = shift;
my $t_id = shift;
while (my $q_id = $queue->dequeue_nb) {
processSeq($q_id,$seq_p,$res);
}
return(0);
}
sub processSeq {
my $q_id = shift;
my $seq_p = shift;
my $res = shift;
#generate all possible word for a k-mer (define at run time by $opt_k)
print STDERR "Generating all the possible $opt_k nucleotide long kmers for sample $q_id\n" if $opt_v;
my $kmer_p = kmer_generator($opt_k);
#randomly sample the sequences
my $sample = sampleSeq($seq_p,$q_id);
#counting the number of kmers in the sample seq
print STDERR "Counting the number of $opt_k nt long kmers in the $q_id sample\n" if $opt_v;
map{ countKmer($_,$kmer_p) } @{$sample};
#put the hits in the final result kmer;
map { push @{$res->{$_}}, $kmer_p->{$_} } keys %{$res};
return(0);
}
sub sampleSeq {
my $seq_p = shift;
my $q_id = shift;
my $sample_size = $opt_s;
#create a list of $sample_size items of randomly selected position in @seqs
my @lines;
unless ($opt_w){
##Sampling with replacement
while (scalar(@lines) < $sample_size){
push @lines, int(rand(@{$seq_p}));
}
}else{
##Sampling without replacement
my %lines;
while (scalar(keys %lines) < $sample_size){
$lines{int(rand(@{$seq_p}))}='';
}
@lines = keys %lines;
}
print STDERR "Sampling $sample_size items for the sample $q_id\n" if $opt_v;
my @sample = map { $seq_p->[$_] } @lines;
return(\@sample);
}
sub loadSeq {
my $file = shift;
my $seq_p;
open FH, $file or die "Can't open file $file\n";
my $f=0;
my $seq;
while(<FH>){
    if(/^>/){
        push @{$seq_p}, $seq if $seq;
        $seq = '';
        $f = 1;
    }elsif($f==1){
        chomp;
        next if /^\s*$/;   # skip blank lines
        $seq .= $_;
    }
}
push @{$seq_p}, $seq if $seq; # do not forget the last sequence
close FH;
return($seq_p);
}
sub countKmer {
my $seq = shift;
my $kmer_p = shift;
my $k = $opt_k;
my %beenThere;
for (my $i=0;$i <= length($seq)-$k;$i++){
my $w = substr($seq,$i,$k);
unless ($opt_m){
#Count only one occurrence of the kmer per sequence
$kmer_p->{$w}++ if !exists $beenThere{$w} && exists $kmer_p->{$w};
$beenThere{$w}=1;
}else{
#Count all instances of each kmers in a sequence
$kmer_p->{$w}++ if exists $kmer_p->{$w};
}
}
return(0);
}
sub printHits {
my $kmer_p=shift;
my $file = shift;
##print out the hits
my ($dir,$pre,$suf) = ($file =~ /(^.+\/|^)(.+)\.(.+$)/);
open OUT, ">$pre.hits" or die "Can't create file $pre.hits\n";
print OUT join($SEP, $_, $kmer_p->{$_}), "\n" for keys %{$kmer_p};
close OUT;
return(0);
}
sub kmer_generator {
my $k = shift;
my $q_id=shift;
my $kmer_p;
my @bases = ('A','C','G','T');
my @words = @bases;
for (my $i=1;$i<$k;$i++){
my @newwords;
foreach my $w (@words){
foreach my $b (@bases){
push (@newwords,$w.$b);
}
}
undef @words;
@words = @newwords;
}
map{ $kmer_p->{$_}=0} @words;
return($kmer_p);
}
sub init {
getopts("f:p:k:s:n:mwv");
unless ($opt_f){
print("\nUsage: sampleAndCountKmer.pl [-k 6 -p 4 -s 100000 -n 1000 -m -w] -f sequence.fa\n",
      "\twill return a table of counts for each k-mer of length -k\n",
      "\t-f\tFasta database. Required\n",
      "\t-k\tsize of the k-mer to analyze. Default 6\n",
      "\t-m\twill count all occurrences of a k-mer per sequence.\n",
      "\t\tDefault: only one k-mer is counted per sequence entry\n",
      "\t-s number of sequences to sample. Default: 100,000 sequences\n",
      "\t-n number of times to repeat the sampling/counting procedures. Default: 1000 times\n",
      "\t-w Sample without replacement. Default: with replacement\n",
      "\t-p\tThe number of jobs to process simultaneously. Normally, the number of available processors\n",
      "\t\tDefault: 4\n",
      "\t-v Run in verbose mode. Spits out a lot of information. Default: quiet\n",
"\n",
);
exit(1)
}
$opt_k=6 unless $opt_k;
$opt_p=4 unless $opt_p;
$opt_s=100000 unless $opt_s;
$opt_n=1000 unless $opt_n;
return(0);
}
(See the bioperl-l discussion http://lists.open-bio.org/pipermail/bioperl-l/2008-December/028777.html. See also Counting k-mers in large sets of large sequences.)
Marco Blanchette poses:
I am using the Perl threads utility to successfully multi-thread several of my computing jobs on my workstation. My current problem is that I need to perform multiple processes using the same humongous array (more than 2 × 10^6 items). The computing time for each iteration is not very long, but I have a lot of iterations to do, and every time a thread is created I am passing the huge array to the function and a fresh copy of the array is created. Thus, there is a huge amount of wasted resources (time and memory) used to create these data structures, which are used by each thread but not modified.
The logical alternative is to use shared memory, where all threads would have access to the same copy of the huge array. In principle Perl provides such a mechanism through the module threads::shared, but I am unable to understand how to use the shared variables. Does anyone have experience to share on threads::shared?
Jonathan Crabtree responds:
Here is a short test program, which runs correctly on perl 5.8.8 and may help to illustrate how the Perl threads::shared module expects you to create and share nested data structures. You have to manually share any nested references and I think that the order in which the sharing calls are made may also be significant:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use threads::shared;
# threads::shared test/demo program
# creates a shared 2-dimensional array and checks that it can be seen in a thread
# tested in perl v5.8.8 built for i486-linux-gnu-thread-multi
## ----------------------------------------
## globals
## ----------------------------------------
# set the width and height of the 2d array to this value:
my $ARRAY_SIZE = 10;
## ----------------------------------------
## main program
## ----------------------------------------
# calls to &share take place in here, so a shared value is returned
my $array = &make_shared_array();
# print array contents before running thread
print "shared array before running thread:\n";
&check_and_print_array($array);
# run thread
my $thr = threads->create(\&do_the_job, $array);
my $retval = $thr->join();
print "join() returned: $retval\n";
# print array contents after running thread
print "shared array after running thread:\n";
&check_and_print_array($array);
exit(0);
## ----------------------------------------
## subroutines
## ----------------------------------------
sub make_shared_array {
# outermost array object must be made shared first
my $a = &share([]);
for (my $i = 0;$i < $ARRAY_SIZE;++$i) {
# each of the rows must be explicitly shared
my $row = &share([]);
# and then added to the containing array
$a->[$i] = $row;
# assign each cell a unique integer for verification purposes
my $base = $i * $ARRAY_SIZE;
for (my $j = 0;$j < $ARRAY_SIZE;++$j) {
$row->[$j] = $base + $j;
}
}
return $a;
}
# print out the array, checking that its dimensions match what we expect
sub check_and_print_array {
my $arr = shift;
die "not an array" if ((ref $arr) ne 'ARRAY');
my $nr = scalar(@$arr);
die "wrong number of rows in array" if ($nr != $ARRAY_SIZE);
for (my $i = 0;$i < $nr;++$i) {
my $row = $arr->[$i];
die "row $i not an array" if ((ref $row) ne 'ARRAY');
my $nc = scalar(@$row);
die "wrong number of columns in row $i" if ($nc != $ARRAY_SIZE);
for (my $j = 0;$j < $nc;++$j) {
my $val = $row->[$j];
printf("%10s", $val);
}
print "\n";
}
}
# work to execute in the thread
sub do_the_job {
my $var = shift;
# print the array once more in the thread
print "shared array in thread:\n";
&check_and_print_array($var);
return "do_the_job returned ok";
}
When I run it (on Ubuntu) the output looks like this:
shared array before running thread:
 0  1  2  3  4  5  6  7  8  9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
shared array in thread:
 0  1  2  3  4  5  6  7  8  9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
join() returned: do_the_job returned ok
shared array after running thread:
 0  1  2  3  4  5  6  7  8  9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 52 53 54 55 56 57 58 59
60 61 62 63 64 65 66 67 68 69
70 71 72 73 74 75 76 77 78 79
80 81 82 83 84 85 86 87 88 89
90 91 92 93 94 95 96 97 98 99
I haven’t verified that doing this actually yields the memory savings you’re looking for, but I don’t see why it shouldn’t.
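(Eds. note: on newer perls, threads::shared also provides shared_clone(), which deep-shares a nested structure in one call and spares you the manual &share() bookkeeping. A minimal sketch:)

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# shared_clone() recursively shares every level of a nested structure
my $array = shared_clone([ [ 0, 1 ], [ 10, 11 ] ]);

# a thread sees (and can modify) the same underlying data
my $thr = threads->create( sub { $array->[1][1] = 99; return $array->[0][1]; } );
my $got = $thr->join;
print "$got $array->[1][1]\n";   # 1 99
```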
(see bioperl-l here http://lists.open-bio.org/pipermail/bioperl-l/2009-January/028884.html and here http://lists.open-bio.org/pipermail/bioperl-l/2009-January/028887.html)
The sound of one man coding…
Russell Smithies asks the rhetorical question:
Does anyone have a script for building chromosomes from an .agp file and a directory full of contigs?
and his answer:
use Bio::DB::Fasta;
use Bio::Seq;
use Bio::SeqIO;
open(AGP,"Mt2.0_pgp.agp") or die $!;
my @chr = ();
my $db = Bio::DB::Fasta->new("contigs.fa");
while(<AGP>){
    chomp;
    my @F = split /\s+/;   # implicit split to @_ is deprecated; use a named array
    # extend temp string if it's too short
    do{ $chr[$F[0]] .= ' ' x 1_000_000; }while length $chr[$F[0]] < $F[2];
    if($F[4] !~ m/N/){
        my ($start,$stop) = $F[8] eq '+' ? ($F[6], $F[7]) : ($F[7], $F[6]);
        substr $chr[$F[0]], $F[1], $F[9], $db->seq($F[5],$start,$stop);
    }else{
        substr $chr[$F[0]], $F[1], $F[5], "N" x $F[5];
    }
}
#remove any trailing whitespace
@chr = map{s/\s+//g;$_}@chr;
#print the sequence. chromosomes are chr0 -> chr8
foreach(0..$#chr){
my $seqobj = Bio::Seq->new( -display_id => "chr$_", -seq => $chr[$_]);
my $seq_out = Bio::SeqIO->new('-file' => ">chr$_.fa",'-format' => 'fasta');
$seq_out->write_seq($seqobj);
}
Russell adds: Please excuse my hacky use of substrings, but this .agp file had overlapping runs of ‘N’ and this was the easiest way to deal with it; e.g.
0 1 50000 1 N 50000 clone yes
0 50001 167645 2 F AC144644.3 1 117645 + 117645
0 167646 217645 3 N 50000 clone yes
0 217646 317645 4 N 100000 contig no
0 317646 367645 5 N 50000 clone yes
0 367646 411754 6 F AC146805.17 1 44109 + 44109
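For reference, here is how one of those component lines breaks down into AGP columns — a pure-Perl sketch where the variable names are mine, following the NCBI AGP column definitions (the sample line is the second one from the listing above, tab-separated):

```perl
use strict;
use warnings;

# One AGP component ('F') line: object, object_beg, object_end,
# part_number, component_type, then component_id, component_beg,
# component_end, orientation. Gap lines have 'N' in column 5 instead.
my $line = "0\t50001\t167645\t2\tF\tAC144644.3\t1\t117645\t+\t117645";
my @F = split /\t/, $line;

my ($object, $obj_start, $obj_end, $part_number, $type) = @F[0..4];
my ($comp_id, $comp_start, $comp_end, $orient);
if ($type ne 'N') {
    # sequence component: columns 6-9 name the contig and its span
    ($comp_id, $comp_start, $comp_end, $orient) = @F[5..8];
}
# $comp_id is 'AC144644.3', $orient is '+'
```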
This script will convert an alignment file to another alignment file format. I use it to convert Mauve xmfa output to a directory of fasta alignments, but it is useful for converting fasta to clw or other formats.
Sorry if someone already wrote this kind of script… I couldn’t find one like it right away, so I made it and thought I should contribute. –Lskatz 18:24, 21 February 2012 (UTC)
I updated my script. There are two major things in this that I think I updated over time which is error reporting and also the ability to concatenate an entire xmfa into a fasta alignment. Some people in my lab wanted to view a MAUVE alignment in MEGA, which prompted this. –Lskatz 20:10, 3 August 2012 (UTC)
The newest version of the script will be at my github page but I will leave this version here for posterity. –Lskatz 14:39, 29 December 2014 (UTC) https://github.com/lskatz/lskScripts
#!/usr/bin/env perl
# Converts an alignment to another alignment format
# Run with no arguments or with -h for help.
# author: Lee Katz <lkatz@cdc.gov>
use Bio::AlignIO;
use strict;
use warnings;
use Getopt::Long;
use Data::Dumper;
use List::Util qw/max/;
sub logmsg{print "@_\n";};
exit(main());
sub main{
my $settings={};
die usage() if(@ARGV<1);
GetOptions($settings,qw(infile=s ginformat=s outfile=s format=s concatenateAlignment linker=s help));
die usage() if($$settings{help});
my $infile=$$settings{infile} or die "Error: Need infile param:\n".usage();
my $outfile=$$settings{outfile} or die "Error: Need outfile param:\n".usage();
$$settings{outfileformat}=$$settings{format} || "fasta";
$$settings{linker}||="";
convertAln($infile,$outfile,$settings);
logmsg "Output is now in $outfile";
return 0;
}
sub convertAln{
my($infile,$outfile,$settings)=@_;
$$settings{infileFormat}=$$settings{ginformat}||guess_format($infile,$settings);
my $numAlns;
if($outfile=~/\/$/){
$numAlns=convertAlnToDir($infile,$outfile,$settings);
} else {
$numAlns=convertAlnToFile($infile,$outfile,$settings);
}
logmsg "Finished converting $infile to $outfile ($numAlns alignments)";
return $numAlns;
}
sub convertAlnToDir{
my($infile,$outdir,$settings)=@_;
$outdir=~s/\/+$//; # remove the trailing slash
logmsg "Converting $infile ($$settings{infileFormat}) to a directory $outdir ($$settings{outfileformat})";
mkdir($outdir) if(!-d $outdir);
my $in=Bio::AlignIO->new(-file=>$infile,-format=>$$settings{infileFormat});
my $alnCounter=0;
my $outExtension=$$settings{outfileformat};
while(my $aln=$in->next_aln){
my $out=Bio::AlignIO->new(-file=>">$outdir/".++$alnCounter.".$outExtension",-format=>$$settings{outfileformat},-flush=>0);
$out->write_aln($aln);
}
return $alnCounter;
}
sub convertAlnToFile{
my($infile,$outfile,$settings)=@_;
logmsg "Converting $infile ($$settings{infileFormat}) to file $outfile";
my $in=Bio::AlignIO->new(-file=>$infile,-format=>$$settings{infileFormat});
my $out=Bio::AlignIO->new(-file=>">$outfile",-format=>$$settings{outfileformat},-flush=>0);
my $alnCounter=0;
if($$settings{concatenateAlignment}){
my @expectedId=findAllUniqueIdsInAln($infile,$settings);
my %alnSequence;
while(my $aln=$in->next_aln){
# concatenate sequences
foreach my $seq($aln->each_seq){
$alnSequence{$seq->id}.=$seq->seq.$$settings{linker};
}
# keep the alignment flush, even if I have to add gaps
my $currentAlnLength=max(map(length($_),values(%alnSequence)));
for(@expectedId){
my $lackOfSequence=$currentAlnLength-length($alnSequence{$_});
$alnSequence{$_}.='-'x$lackOfSequence;
}
}
# write the complete alignment to a file
my $concatAln=Bio::SimpleAlign->new();
for(@expectedId){
my $seq=Bio::LocatableSeq->new(-seq=>$alnSequence{$_},-id=>$_); # must be a LocatableSeq to be added
$concatAln->add_seq($seq);
}
$out->write_aln($concatAln);
} else { # it's pretty simple to just do a conversion to a different file
while(my $aln=$in->next_aln){
$out->write_aln($aln);
$alnCounter++;
}
}
return $alnCounter;
}
sub findAllUniqueIdsInAln{
my($infile,$settings)=@_;
logmsg "Finding all expected sequence IDs";
my %genome;
my $alnin=Bio::AlignIO->new(-file=>$infile,-format=>$$settings{infileFormat});
while(my $aln=$alnin->next_aln){
foreach my $seq($aln->each_seq){
$genome{$seq->id}++;
}
}
return keys(%genome);
}
sub guess_format{
my ($filename,$settings)=@_;
for my $format (qw(pfam xmfa selex fasta stockholm prodom clustalw msf mase bl2seq nexus phylip)){
eval{
my $in=Bio::AlignIO->new(-file=>$filename,-format=>$format);
my $aln=$in->next_aln;
$aln->length;
};
if($@){
my $error=$@;
next;
}
return $format;
}
die "Could not guess the format of $filename\n";
}
sub usage{
"Converts an alignment to another alignment format.
$0 -i alignmentfile -o outputPrefix [-f outputformat]
-i the alignment file to input
-o the output alignment file or directory
Specify a directory by a trailing slash.
If a directory, the output format will be determined by -f. Default: fasta.
-f output format
possible values are derived from bioperl.
e.g. fasta, clustalw, phylip, xmfa
-g input format to force the correct input format, but it's ok to let it guess
-c to concatenate the alignment into one solid entry
-l linkerSequence, to be used with -c, to separate entries.
Default is no linker but a useful linker could be NNNNNNNNNN
-h this helpful menu
";
}
I came up with a script to accompany Minimo, which is a nice assembler that can be fed Illumina reads (but it doesn’t read FASTQ). Therefore there needs to be a fast way to convert FASTQ to FASTA/QUAL. This script is multithreaded and works best when the number of threads (-n) approaches the number of CPUs you have. It is untested on non-threaded installations of Perl.
There is a method to perform the opposite operation: Merging separate sequence and quality files to FASTQ
Usage: $0 -i inputFastqFile [-n numCpus -q outputQualfile -f outputFastaFile]
#!/usr/bin/env perl
# Author: Lee Katz
# Convert a fastq to a fasta/qual combo using BioPerl, with some Linux commands
use Bio::Perl;
use Data::Dumper;
use strict;
use warnings;
use threads;
use Thread::Queue;
use Getopt::Long;
my $settings={};
$|=1;
my %numSequences; # static for a subroutine
exit(main());
sub main{
die("Usage: $0 -i inputFastqFile [-n numCpus -q outputQualfile -f outputFastaFile]") if(@ARGV<1);
GetOptions($settings,('numCpus=s','input=s','qualOut=s','fastaOut=s'));
my $file=$$settings{input}||die("input parameter missing");
my $outfasta=$$settings{fastaOut}||"$file.fasta";
my $outqual=$$settings{qualOut}||"$file.qual";
my $numCpus=$$settings{numCpus}||1;
my @subfile=splitFastq($file,$numCpus);
for my $f(@subfile){
threads->create(\&convert,$f,"$f.fasta","$f.qual");
}
$_->join for (threads->list);
# join the sub files together
joinFastqFiles(\@subfile,$file);
return 0;
}
sub convert{
my($file,$outfasta,$outqual)=@_;
my $numSequences=numSequences($file);
my $reportEvery=int($numSequences/100) || 1;
print "$numSequences sequences to convert in $file\n";
my $in=Bio::SeqIO->new(-file=>$file,-format=>"fastq-illumina");
my $seqOut=Bio::SeqIO->new(-file=>">$outfasta",-format=>"fasta");
my $qualOut=Bio::SeqIO->new(-file=>">$outqual",-format=>"qual");
my $seqCount=0;
my $percentDone=0;
while(my $seq=$in->next_seq){
$seqOut->write_seq($seq);
$qualOut->write_seq($seq);
$seqCount++;
if($seqCount%$reportEvery == 0){
$percentDone++;
print "$percentDone%..";
}
}
print "Done with subfile $file.\n";
return 1;
}
sub joinFastqFiles{
my($subfile,$outfileBasename)=@_;
my($command,$subfasta,$subqual);
# fasta
$subfasta.="$_.fasta " for(@$subfile);
$command="cat $subfasta > $outfileBasename.fasta";
system($command);
# qual
$subqual.="$_.qual " for (@$subfile);
$command="cat $subqual > $outfileBasename.qual";
system($command);
return 1;
}
sub splitFastq{
my($file,$numCpus)=@_;
my $prefix="FQ"; # for fastq
my $numSequences=numSequences($file);
my $numSequencesPerFile=int($numSequences/$numCpus);
my $numSequencesPerFileRemainder=$numSequences % $numCpus;
my $numLinesPerFile=$numSequencesPerFile*4; # four lines per read; this could become incorrect if there is a really long read (not currently likely)
system("rm -r tmp;mkdir tmp;");
system("split -l $numLinesPerFile $file 'tmp/FQ'");
return glob("tmp/FQ*");
}
# use Linux to find the number of sequences quickly, but cache the value because it is still a slow process
# This should probably changed to `wc -l`/4 but I don't have time to test the change
# TODO for anyone reading this: please change this method to wc -l divided by 4.
sub numSequences{
my $file=shift;
return $numSequences{$file} if($numSequences{$file});
my $num=`grep -c '^\@' $file`;
chomp($num);
$numSequences{$file}=$num;
return $num;
}
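As the TODO in the script suggests, counting lines and dividing by four is safer than grepping for '@', since quality strings can also begin with '@'. A pure-Perl sketch of that replacement (the sub name is mine, and this version drops the caching):

```perl
use strict;
use warnings;

# Count FASTQ records by counting lines and dividing by four
# (one FASTQ record = four lines), rather than grepping for '^@'.
sub numSequencesByLineCount {
    my $file = shift;
    open my $fh, '<', $file or die "Could not open $file: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    return int($lines / 4);
}
```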
Some have whimpered, “How do I get a fasta file of the features in this GFF?”… Well, assuming your GFF file fits into memory (and assuming you have the appropriate fasta file to hand), here’s how:
A script I like to call gff2fasta
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
my $verbose = 0;
# read in gff
warn "reading GFF ";
my %gff;
open GFF, '<', 'my.gff3'
or die "fail\n";
while(<GFF>){
my ($seqid, undef, undef, $start, $end,
undef, undef, undef, $attrs) = split;
push @{$gff{$seqid}}, [$start, $end, $attrs];
}
warn "OK\n";
# Do the fasta
my $seqio = Bio::SeqIO->
new( -file => 'my.fa',
-format => 'fasta' )
or die "double fail\n";
while(my $sobj = $seqio->next_seq){
my $seqid = $sobj->id;
unless(defined($gff{$seqid})){
warn "no features for $seqid\n";
next;
}
my $seq = $sobj->seq;
for(@{$gff{$seqid}}){
my ($start, $end, $attrs) = @$_;
warn join("\t", $start, $end, $attrs), "\n"
if $verbose > 0;
my %attrs = split(/=|;/, $attrs);
print ">$seqid-". $attrs{"ID"}.
"/$start-$end (". ($end-$start+1). ")\n";
print substr($seq, $start-1, $end-$start+1), "\n"; # GFF coords are 1-based; substr is 0-based
}
#exit;
}
warn "OK\n";
(see the bioperl-l thread starting here , and also look at this for the hard way)
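One detail worth calling out: GFF coordinates are 1-based and inclusive, while Perl’s substr() is 0-based, so the offset passed to substr() should be $start - 1. A minimal illustration:

```perl
use strict;
use warnings;

# A feature spanning 1-based, inclusive coordinates 2..5 of this
# sequence covers the characters 'CGTA'.
my $seq = 'ACGTACGTAC';
my ($start, $end) = (2, 5);                       # 1-based feature coords
my $feature = substr($seq, $start - 1, $end - $start + 1);
# $feature is 'CGTA'
```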
Dan Bolser opines:
I have a ‘fasta quality file’ and a fasta file, and I would like to output a fastq file…I get the feeling that this should be a one-liner.
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
use Bio::Seq::Quality;
use Getopt::Long;
die "pass a fasta and a fasta-quality file "
unless @ARGV;
my ($seq_infile,$qual_infile)
= (scalar @ARGV == 1) ?($ARGV[0], "$ARGV[0].qual") : @ARGV;
# Create input objects for both a seq (fasta) and qual file
my $in_seq_obj =
Bio::SeqIO->new( -file => $seq_infile,
-format => 'fasta',
);
my $in_qual_obj =
Bio::SeqIO->new( -file => $qual_infile,
-format => 'qual',
);
my $out_fastq_obj =
Bio::SeqIO->new( -format => 'fastq'
);
while (1){
## create objects for both a seq and its associated qual
my $seq_obj = $in_seq_obj->next_seq || last;
my $qual_obj = $in_qual_obj->next_seq;
die "foo!\n"
unless
$seq_obj->id eq
$qual_obj->id;
## Here we use seq and qual object methods feed info for new BSQ
## object.
my $bsq_obj =
Bio::Seq::Quality->
new( -id => $seq_obj->id,
-seq => $seq_obj->seq,
-qual => $qual_obj->qual,
);
## and print it out.
$out_fastq_obj->write_fastq($bsq_obj);
}
For more information, see Bio::Seq::Quality.
(see bioperl-l thread here)
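Under the hood, the fastq writer turns each numeric phred quality into a single character (Sanger encoding: ASCII value = quality + 33). A tiny sketch of that encoding step, separate from any BioPerl object:

```perl
use strict;
use warnings;

# Sanger FASTQ encodes a phred quality q as chr(q + 33); this is what
# happens to the -qual values when the Bio::Seq::Quality object is
# written out as fastq.
my @qual       = (40, 30, 20, 10);
my $qualstring = join '', map { chr($_ + 33) } @qual;
# $qualstring is 'I?5+'
```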
The key to parsing any external program report is knowing where (i.e., in what BioPerl object) the stuff you want lives. This info can be tricky to find, though walking through the code to find it is an excellent exercise for learning how BioPerl is structured.
Below is some code to get BLAST high-scoring pairs from a BLAST report. Here is the basic BP-object structure:
As a professor (David Wollkind) of mine once said (while teaching the chain rule), you have to peel away the players until you find the one with the ball.
Note there are many accessors on the Bio::Search::HSP::GenericHSP object through which the interesting stuff can be obtained.
The scrap uses blastn to align two sequences. Get the standalone BLAST programs from NCBI (they don’t come with BioPerl!).
–MAJ
#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
my $query1_in = Bio::SeqIO->newFh ( -file => "mus-betaglobin-bh0.fas",
-format => 'fasta' );
my $query1 = <$query1_in>;
my $query2_in = Bio::SeqIO->newFh ( -file => "mus-betaglobin-bh3.fas",
-format => 'fasta' );
my $query2 = <$query2_in>;
my $factory = Bio::Tools::Run::StandAloneBlast->new('program' => 'blastn');
my $report = $factory->bl2seq($query1, $query2);
while (my $result = $report->next_result) {
print "Query: ".$result->query_name."\n";
while (my $hit = $result->next_hit) {
while (my $hsp = $hit->next_hsp) {
print $hsp->algorithm, ": identity ", 100*$hsp->frac_identical, "\%, rank ", $hsp->rank, " (E:", $hsp->evalue, ")\n";
printf("%7s: %s\n", "query", $hsp->query_string);
printf("%7s: %s\n", "", $hsp->homology_string);
printf("%7s: %s\n", "sbjct", $hsp->hit_string);
print "\n";
}
print "\n";
}
}
Tim Kohler asks:
When I use Bio::Tools::Run::StandAloneBlast to BLAST one fasta file including different sequences, I get a BLAST output with many queries, each having several hits / sbjcts. My problem is how to parse all hits of one query into a single new file. And this for all the queries I have in my BLAST output file.
If you have your multiple-query blast report file, you can sort and output separate files like so:
use Bio::Search::Result::BlastResult;
use Bio::SearchIO;
my $report = Bio::SearchIO->new( -file=>'full-report.bls', -format => 'blast');
my %hits_by_query;
while (my $result = $report->next_result) {
while (my $hit = $result->next_hit) {
push @{$hits_by_query{$result->query_name}}, $hit;
}
}
foreach my $qid ( keys %hits_by_query ) {
my $result = Bio::Search::Result::BlastResult->new();
$result->add_hit($_) for ( @{$hits_by_query{$qid}} );
my $blio = Bio::SearchIO->new( -file => ">$qid\.bls", -format=>'blast' );
$blio->write_result($result);
}
–Ed.
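The grouping trick in the scrap above — pushing onto an autovivified array-ref inside a hash — is worth having on its own. A standalone sketch with made-up query/hit names in place of the SearchIO objects:

```perl
use strict;
use warnings;

# Group values by key: pushing onto @{ $hash{$key} } autovivifies the
# array-ref on first use, so no initialization is needed.
my %hits_by_query;
my @pairs = ( ['q1', 'hitA'], ['q1', 'hitB'], ['q2', 'hitC'] );
for my $p (@pairs) {
    my ($query, $hit) = @$p;
    push @{ $hits_by_query{$query} }, $hit;
}
# $hits_by_query{q1} now holds ['hitA', 'hitB']; q2 holds ['hitC']
```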
Russell also had a solution, but we’ll let him pare it down for the scrapbook!
(see bioperl-l thread starting here)
Erich Schwarz discusses contig submissions sans Sequin:
I have newly sequenced contigs, with CDS predictions, loaded into a MySQL database via Bio::DB::GFF. I’d like to export each contig, with its annotated CDSes, into a single GenBank-formatted record for each contig.
and notes the following:
I came up with code that would let me export protein translations of the contigs’ CDSes in GenBank format…
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GFF;
my $query_database = $ARGV[0];
my $dna = q{};
my $db = Bio::DB::GFF->new( -dsn => $query_database);
my $gb_file = 'example.gb';
my $seq_out = Bio::SeqIO->new( -file => ">$gb_file", -format => 'genbank');
my @contigs = sort
map { $_->display_id }
$db->features( -types => 'contig:assembly' );
foreach my $contig (@contigs) {
my $segment1 = $db->segment($contig);
my @p_txs = $segment1->features('processed_transcript');
foreach my $p_t (sort @p_txs) {
$dna = q{};
my @CDSes = $p_t->CDS;
my $cds_name = $CDSes[0]->display_id();
foreach my $cds (@CDSes) {
# $cds->seq == Bio::PrimarySeq, *not* plain nt seq.!
$dna = $dna . $cds->seq->seq;
}
my $full_cds = Bio::Seq->new( -display_id => $cds_name,
-seq => $dna, );
my $prot = $full_cds->translate;
$seq_out->write_seq($prot);
}
}
Returning to this, I tried using $db->get_Seq_by_id($contig) to give me a Bio::Seq object for each contig (which I could then output into GenBank form), but that proved futile.
Jason Stajich provides some details:
If you want to get a specific segment you just do what you already have in your code:
my $segment = $db->segment($contig_name);
Or you can iterate through all the features - depends on how you named your segments/contigs/chromosomes, I named mine “contig:scaffold” for type:source:
my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
...
}
Now you should be able to pass this segment object to
$seqio->write_seq($segment);
However, Bio::DB::GFF::Feature doesn’t implement the whole Bio::SeqI API so you probably have to create your own sequence and move the features over:
my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
my $seq = Bio::Seq->new();
$seq->primary_seq($s->seq);
for my $feature ( $s->features('processed_transcript') ) {
my $f = Bio::SeqFeature::Generic->new(-location => $feature->location,
-primary_tag => $feature->primary_tag,
-source_tag => $feature->primary_tag,
-score => $feature->score,
-seq_id => $feature->seq_id);
$f->add_tag_value('locus_tag',$feature->name);
# might also add all other tag/value pairs from this feature like
# DBXREF, etc.
# can derive a CDS feature from this feature as well.
# or perhaps derive a gene feature that only has start/end for the feature
# or might add a translation tag/value pair for CDS features
$seq->add_SeqFeature($f);
}
$out->write_seq($seq);
}
I suspect you’ll have to edit the feature objects some to a) remove the ones you don’t want to output, b) add additional info like translation frame…, c) add in other annotation information that may or may not be encoded as tag/values that NCBI requires.
Don Gilbert suggests persisting in Chado rather than Bio::DB::GFF:
If the Bio::DB::GFF database to Genbank submission route doesn’t get you where you want, you can also look at storing your data in a GMOD Chado database, then using Bulkfiles to produce the Genbank Submission file set.
Find GenBank Submit output from Chado dbs in the tool release GMODTools-1.2b.zip (20-Jun-2008), which adds (in progress) the GenBank Submission table writer (bulkfiles -format=genbanktbl), with output suited to submission to NCBI per these specifications.
See also the GMODTools wiki and this test case with genbank-submit output.
(see bioperl-l thread here)
Govind Chandra raises the following issue, based on the code below ($ac is the Bio::Annotation::Collection property of a Bio::Seq object obtained from a Bio::SeqIO stream). The reply is here:
$ac is-a Bio::Annotation::Collection but does not actually contain any annotation from the feature. Is this how it should be? I cannot figure out what is wrong with the script. Earlier I used to use has_tag(), get_tag_values() etc. but the documentation says these are deprecated.
Sample code:
#use strict;
use Bio::SeqIO;
$infile='NC_000913.gbk';
my $seqio=Bio::SeqIO->new(-file => $infile);
my $seqobj=$seqio->next_seq();
my @features=$seqobj->all_SeqFeatures();
my $count=0;
foreach my $feature (@features) {
unless($feature->primary_tag() eq 'CDS') {next;}
print($feature->start()," ", $feature->end(), " ",$feature->strand(),"\n");
$ac=$feature->annotation();
$temp1=$ac->get_Annotations("locus_tag");
@temp2=$ac->get_Annotations();
print("$temp1 $temp2[0] @temp2\n");
if($count++ > 5) {last;}
}
print(ref($ac),"\n");
exit;
Output:
190 255 1
0
337 2799 1
0
2801 3733 1
0
3734 5020 1
0
5234 5530 1
0
5683 6459 -1
0
6529 7959 -1
0
Bio::Annotation::Collection
Features vs. Annotations
In imprecise “user-centric” terms, a feature is metadata attached to a particular section or fragment of a sequence, while an annotation is metadata attached to the sequence object itself, and so describes something about the entire sequence.
In more precise, “implementor-centric” terms, an annotation object is something that you would like to attach to an annotatable object. The reason you want to attach it (and any semantics implied by that reason) is in the tag. The implementation principle that captures the “user-centric” idea is:
Who accesses what, from where?
The typical use case is as follows:
For the user, the question arises, “How do I get at these furshlugginer annotations??” Perhaps it’s better to ask “How do I get at these furshlugginer metadata?? “, since separating features and annotations in the mind according to the ideas above will help the user look in the right place for the metadata desired. Expletives are optional.
The Bio::Seq object (call it $seq) contains an annotation property (a Bio::Annotation::Collection object) accessed via $seq->annotation(). It also contains feature properties, accessed by methods such as $seq->get_SeqFeatures, which return Bio::SeqFeatureI objects. These feature objects are generally instantiated in the Bio::SeqFeature::Generic class, where most of the accessors the average user desires reside.
The Short Answer
A user often wants to parse the features: start and end coordinates, strand, locus name, source, etc. So the short answer to the user’s question is often something like the following (lifted directly from the Bio::Seq POD):
# This block of code loops over sequence features in the sequence
# object, trying to find ones who have been tagged as 'exon'.
# Features have start and end attributes and can be outputted
# in Genbank Flat File format, GFF, a standarized format for sequence
# features.
foreach $feat ( $seqobj->get_SeqFeatures() ) {
if( $feat->primary_tag eq 'exon' ) {
print STDOUT "Location ",$feat->start,":",
$feat->end," GFF[",$feat->gff_string,"]\n";
}
}
Note that the features are obtained from the feature accessor of the sequence object, and the tags associated with each feature are obtained by using the tag accessors off the feature objects.
The Details
The key thing to notice here is that the feature objects and not the annotation objects are read. There was a time (see the History below) when the tag accession methods could be called off the annotation object. This behavior is now deprecated, in part to get people used to making the distinctions being discussed here.
However, remember that in “implementor-centric” terms, an annotation object is something that you would like to attach to an annotatable object. So… $feat->annotation() is a legitimate method call, since Bio::SeqFeatureI implements Bio::AnnotatableI.
As Hilmar states:
SeqFeatureI (which brings in the tag/value annotation) and AnnotatableI (which brings in annotation()). I agree this can be confusing from a user’s perspective. As a rule of thumb, SeqIO parsers will almost universally populate only the tag/value system, because typically they will (or should) assume no more than that the feature object they are dealing with is a SeqFeatureI.
Once you have the feature objects in your hands, you can add to either tag/values or annotation() to your heart’s content. Just be aware that nearly all SeqIO writers won’t use the annotation() collection when you pass the sequence back to them, since typically they won’t really know what to do with feature annotation that isn’t tag/value (unlike sequence annotation).
If in your code you want to treat tag/value annotation in the same way as (i.e., as if it were part of) the annotation that’s in the annotation collection, then use Bio::SeqFeature::AnnotationAdaptor. That’s in fact what BioPerl db does to ensure that all annotation gets serialized to the database no matter where it is.
However, unless expressly told otherwise in a parser’s documentation, a sequence you get back from one of the SeqIO parsers (which is where most people will get them from) will not have $feat->annotation() populated.
The Best of Both Worlds : Bio::SeqFeature::AnnotationAdaptor
If you want to get all the metadata for a sequence object, and don’t care (implementation-wise) where it comes from, use Bio::SeqFeature::AnnotationAdaptor:
# (of hilmar)
my $anncoll = Bio::SeqFeature::AnnotationAdaptor->new();
foreach my $feat ($seq->get_all_SeqFeatures) {
$anncoll->feature($feat);
@vals = $anncoll->get_Annotations('locus_tag');
# do something with @vals
}
The description of SeqFeature::AnnotationAdaptor from its POD summarizes its raison d'etre, and conveys another important distinction between Bio::SeqFeatureI and Bio::AnnotationCollectionI:
This adaptor provides a Bio::AnnotationCollectionI compliant, unified, and integrated view on the annotation of Bio::SeqFeatureI objects, including tag/value pairs, and annotation through the annotation() method, if the object supports it. Code using this adaptor does not need to worry about the different ways of possibly annotating a SeqFeatureI object, but can instead assume that it strictly follows the AnnotationCollectionI scheme. The price to pay is that retrieving and adding annotation will always use objects instead of light-weight tag/value pairs.
In other words, this adaptor allows us to keep the best of both worlds. If you create tens of thousands of feature objects, and your only annotation is tag/value pairs, you are best off using the features’ native tag/value system. If you create a smaller number of features, but with rich and typed annotation mixed with tag/value pairs, this adaptor may be for you. Since its implementation is by double-composition, you only need to create one instance of the adaptor. In order to transparently annotate a feature object, set the feature using the feature() method. Every annotation you add will be added to the feature object, and hence will not be lost when you set feature() to the next object.
See the thread for a nice extended metaphor involving Reese’s Peanut Butter Cups. This shouldn’t, of course, be construed as an endorsement for them. (Extended metaphors, that is.)
A History of BioPerl Annotation (Chris Fields)
To go over why things were set up this way (and then reverted) is a bit of a history lesson. I believe prior to 1.5.0, Bio::SeqFeature::Generic stored most second-class data (dbxrefs, simple secondary tags, etc.) as simple untyped text via tags but also allowed a Bio::Annotation::Collection. Therefore one effectively gets a mixed bag of first-class untyped data like display_id and primary_tag, untyped tagged text, and ‘typed’ Bio::AnnotationI objects.
Some of this was an attempt to somewhat ‘correct’ this for those who wanted a cohesive collection of typed data out-of-the-box. Essentially, everything becomes a Bio::AnnotationI. I believe Bio::SeqFeature::Annotated went a step further and made almost everything Bio::AnnotationI (including score, primary_tag, etc.) and type-checked tag data against SOFA.
As there were collisions between SeqFeature-like ‘tag’ methods and CollectionI-like methods, the design thought was to store all tag data as Bio::Annotation in a Bio::Annotation::Collection, then eventually deprecate the tag methods in favor of those available via the CollectionI. These deprecations were placed in Bio::AnnotatableI, so any future Bio::SeqFeatureI implementations would also get the deprecation. As noted, Bio::SeqFeature::Generic implements these methods so isn’t affected.
Now, layer in the fact that many of these (very dramatic) code changes were literally introduced just prior to the 1.5.0 release, AFAIK w/o much code review, and contained additional unwanted changes such as operator overloading and so on. Very little discussion about this occurred on list until after the changes were introduced (a good argument for small commits). Some very good arguments against this were made, including other lightweight implementations. Lots of angry devs!
Though the intentions were noble we ended up with a mess. I yanked these out a couple of years ago frankly out of frustration with the overloading issues.
Detecting Paired End reads. This doesn’t really use BioPerl but I had a hard time finding a subroutine for this and wanted to share with the community. Any additions are welcome. –Lskatz 19:45, 19 June 2014 (UTC)
Updated to streamline it and also because it was pointed out that in Casava1.8 there can be alternatives to the 1/2 pairing, e.g., 1/3, which would indicate that the multiplex tags are present somewhere (but not in the query fastq file) –Lskatz 18:05, 8 July 2014 (UTC)
# See whether a fastq file is paired end or not. It must be in a velvet-style shuffled format.
# In other words, the left and right sides of a pair follow each other in the file.
# params: fastq file and settings
# fastq file can be gzip'd
# settings: checkFirst is an integer to check the first X deflines
# TODO just extract IDs and send them to the other _sub()
sub is_fastqPE($;$){
my($fastq,$settings)=@_;
# if checkFirst is undef or 0, this will cause it to check at least the first 20 entries.
# 20 reads is probably enough to make sure that it's shuffled (1/2^10 chance I'm wrong)
$$settings{checkFirst}||=20;
$$settings{checkFirst}=20 if($$settings{checkFirst}<2);
# get the deflines
my @defline;
my $numEntries=0;
my $i=0;
my $fp;
if($fastq=~/\.gz$/){
open($fp,"gunzip -c '$fastq' |") or die "Could not open $fastq for reading: $!";
}else{
open($fp,"<",$fastq) or die "Could not open $fastq for reading: $!";
}
my $discard;
while(my $defline=<$fp>){
next if($i++ % 4 != 0);
chomp($defline);
$defline=~s/^@//;
push(@defline,$defline);
$numEntries++;
last if($numEntries > $$settings{checkFirst});
}
close $fp;
# it is paired end if it validates with any naming system
my $is_pairedEnd=_is_fastqPESra(\@defline,$settings) || _is_fastqPECasava18(\@defline,$settings) || _is_fastqPECasava17(\@defline,$settings);
return $is_pairedEnd;
}
sub _is_fastqPESra{
my($defline,$settings)=@_;
my @defline=@$defline; # don't overwrite $defline by mistake
for(my $i=0;$i<@defline-1;$i++){
my($genome,$info1,$info2)=split(/\s+/,$defline[$i]);
if(!$info2){
return 0;
}
my($instrument,$flowcellid,$lane,$x,$y,$X,$Y)=split(/:/,$info1);
my($genome2,$info3,$info4)=split(/\s+/,$defline[$i+1]);
my($instrument2,$flowcellid2,$lane2,$x2,$y2,$X2,$Y2)=split(/:/,$info3);
$_||="" for($X,$Y,$X2,$Y2); # these variables might not be present
if($instrument ne $instrument2 || $flowcellid ne $flowcellid2 || $lane ne $lane2 || $x ne $x2 || $y ne $y2 || $X ne $X2 || $Y ne $Y2){
return 0;
}
}
return 1;
}
sub _is_fastqPECasava18{
my($defline,$settings)=@_;
my @defline=@$defline;
for(my $i=0;$i<@defline-1;$i++){
my($instrument,$runid,$flowcellid,$lane,$tile,$x,$yandmember,$is_failedRead,$controlBits,$indexSequence)=split(/:/,$defline[$i]);
my($y,$member)=split(/\s+/,$yandmember);
my($inst2,$runid2,$fcid2,$lane2,$tile2,$x2,$yandmember2,$is_failedRead2,$controlBits2,$indexSequence2)=split(/:/,$defline[$i+1]);
my($y2,$member2)=split(/\s+/,$yandmember2);
# Instrument, etc must be the same.
# The member should be different, usually "1" and "2"
if($instrument ne $inst2 || $runid ne $runid2 || $flowcellid ne $fcid2 || $tile ne $tile2 || $member>=$member2){
return 0;
}
}
return 1;
}
# This format is basically whether the ends of the defline alternate 1 and 2.
sub _is_fastqPECasava17{
my($defline,$settings)=@_;
my @defline=@$defline;
for(my $i=0;$i<@defline-1;$i++){
# Get each member number but return false if it doesn't even exist.
my ($member1,$member2);
if($defline[$i] =~ m/(\d+)$/){
$member1=$1;
} else {
return 0;
}
if($defline[$i+1] =~ /(\d+)$/){
$member2=$1;
} else {
return 0;
}
# The test is whether member1 is less than member2.
# They can't be equal either.
if($member1 >= $member2){
return 0;
}
}
return 1;
}
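For reference, here is the anatomy of a Casava 1.8-style defline (the read ID below is invented). Note this sketch splits on whitespace first, whereas _is_fastqPECasava18() above splits the whole defline on ':' and untangles the fused y/member field afterwards:

```perl
use strict;
use warnings;

# A Casava 1.8 defline has two whitespace-separated halves:
#   instrument:run:flowcell:lane:tile:x:y   member:filtered:controlbits:index
my $defline = 'M00123:7:000000000-A1B2C:1:1101:15589:1332 1:N:0:ATCACG';
my ($coords, $info) = split /\s+/, $defline;

my ($instrument, $runid, $flowcell, $lane, $tile, $x, $y) = split /:/, $coords;
my ($member, $filtered, $controlbits, $index) = split /:/, $info;
# $member is '1' for the first read of a pair, '2' for the second;
# values like '3' indicate multiplex tags are involved (see the note above)
```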
(See the bioperl-l threads here and here for discussion on next-gen modules)
This is some example code for the removal of sequencing adapters from next generation sequence reads. This is useful for cleaning up sequence from Solexa/Illumina GA machines, and may also be relevant for the removal of adapter/primer sequence from other types of sequence machine.
These code excerpts are from a prototype sequence pre-processing pipeline, and the approach described may soon become obsolete. This example code is functional, but not optimal. The first stage of the pipeline is to parse the sequence (with quality) and extract the good quality sequence. This must be done prior to using this method for adapter removal.
Before we start with the Bioperl, we align all of the good quality sequence (some.fasta) with the sequencing adapter (adapter.fasta) and output the alignments.
needle -gapopen 100 -gapextend 10 -auto -asequence adapter.fasta -bsequence stdin -outfile stdout < some.fasta > lots_of.aln
We don’t want any gaps, and Needle doesn’t support ungapped alignments, so we set the gap penalties as high as possible. Note that we don’t use Water, because we want to extract the sequence before the adapter, and so need the full sequences in the alignment output.
Now we have our alignments we can pipe them into our adapter removal script.
use Bio::AlignIO;
my $aln_in = Bio::AlignIO->new(-fh => \*STDIN, -format => 'emboss'); # clustalw format is a handy alternative for readability
while ( my $aln = $aln_in->next_aln ) {
# Get the coordinates of the start and end of the sequence and adapter in the alignment
my $r1 = Bio::Range->new( -start=> $aln->column_from_residue_number( $aln->get_seq_by_pos(1)->display_id, "1" ),
-end=> $aln->column_from_residue_number( $aln->get_seq_by_pos(1)->display_id, $aln->select(1,1)->no_residues ),
-strand=> +1);
my $r2 = Bio::Range->new( -start=> $aln->column_from_residue_number( $aln->get_seq_by_pos(2)->display_id, "1" ),
-end=> $aln->column_from_residue_number( $aln->get_seq_by_pos(2)->display_id, $aln->select(2,2)->no_residues ),
-strand=> +1);
my ($start, $end, $strand) = $r1->intersection($r2); # Coordinates where the sequence and adapter align
my $aln2 = $aln->slice($start, $end); # Get the alignment slice
# Neither Emboss nor Bio::Align::PairwiseStatistics offer a suitable scoring method for adapter removal, so here is one I made earlier
my @seq1 = split ("", $aln2->get_seq_by_pos(1)->seq);
my @seq2 = split ("", $aln2->get_seq_by_pos(2)->seq);
my ($score, $mismatches, $gaps) = qw (0 0 0);
for (my $pos = 0; $pos < $aln2->length; $pos++)
{
if ($seq1[$pos] eq $seq2[$pos]) { $score ++; } # Matches
elsif ($seq1[$pos] ne "N" && $seq2[$pos] ne "N") # Mismatches
{
$score --;
$mismatches ++;
}
if ($seq1[$pos] eq "-" || $seq2[$pos] eq "-") { $gaps ++; } # Gaps
}
my $mismatch_rate = $mismatches / $aln2->length;
# Now we have some alignment stats we can do something with the sequence
if ($score >= 4 && $mismatch_rate <= 0.1 && $gaps <= 1 && $start >= 2) # If the adapter sequence match is good enough...
{
process_seq($aln->get_seq_by_pos(2)->trunc(1,($start -1))); # Extract the sequence before the adapter and do something with it
}
else # Or
{
process_seq($aln->get_seq_by_pos(2)); # Just do something with the sequence
}
}
Once we have trimmed the adapter from our sequences we can pass them on to the next stage of the processing pipeline.
The main issues with this approach are:
Issues aside, this approach does seem to work, and performance is adequate.
–Giles.weaver 13:23, 8 July 2009 (UTC)
This is a script for splitting a full lane from a multiplexed MiSeq run into individual tags.
#!/usr/bin/env perl
# Author: Lee Katz <lkatz@cdc.gov>
# Enteric Diseases Laboratory Branch, CDC
use LKUtils qw/logmsg/;
use strict;
use warnings;
use Data::Dumper;
use Getopt::Long;
my $settings={};
exit(main());
sub main{
die usage() if(@ARGV<1);
GetOptions($settings,qw(infile=s outdir=s));
my $infile=$$settings{infile} or die "Error: need input Illumina file\n".usage();
my $outdir=$$settings{outdir} or die "Error: need output directory \n".usage();
splitFastqFile($infile,$outdir,$settings);
return 0;
}
sub splitFastqFile{
my($infile,$outdir,$settings)=@_;
# open files
mkdir($outdir) or die "Could not mkdir $outdir because $!" if(!-d $outdir);
my %fh;
open($fh{$_}, '>', "$outdir/$_.fastq") or die "Cannot open a file in $outdir: $!" for(0..8);
my $i=0;
open(IN,'<',$infile) or die "Could not open fastq file $infile because $!";
while(my $fastqEntry=<IN>){
$fastqEntry.=<IN> for(1..3); # each entry is 4 lines total
my($currId,$seq,undef,$qlt)=split(/\n/,$fastqEntry);
my($part1,$part2)=split(/\s+/,$currId);
my($instrument,$runId,$flowCellId,$lane,$tile,$x,$y)=split(/:/,$part1);
my($ReadNum,$FilterFlag,undef,$SampleNumber)=split(/:/,$part2);
my $fh=$fh{$SampleNumber};
print $fh "$fastqEntry";
$i++;
logmsg "Finished with $i reads" if($i%100000==0);
}
close IN;
close($fh{$_}) for(keys(%fh));
}
sub usage{
"Given a fastq file of reads, split the fastq file into one file per sample.
Usage: perl $0 -i illumina.fastq -o outputdir
-i illumina.fastq
e.g. from a miseq
"
}
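The defline split at the heart of the script can be tried on its own. The header below is a made-up Casava 1.8-style example (not data from any thread); the field layout is instrument:run:flowcell:lane:tile:x:y, then read:filter:0:sample.

```perl
use strict;
use warnings;

# Splitting a Casava 1.8+ defline into its fields, as the script above does.
# The defline here is invented for illustration.
my $defline = '@M00123:42:000000000-A1B2C:1:1101:15589:1332 1:N:0:3';
my ($part1, $part2) = split /\s+/, $defline;
my ($instrument, $runId, $flowCellId, $lane, $tile, $x, $y) = split /:/, $part1;
my ($readNum, $filterFlag, undef, $sampleNumber) = split /:/, $part2;
print "sample=$sampleNumber lane=$lane read=$readNum\n";
```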
These code snippets demonstrate a way to calculate the melting temperature (Tm) of PCR primers (or other DNA). This code is designed to be called from an external package.
To calculate the Tm of degenerate primers, you can use this code in conjunction with the degenerate primers code. First get the constituent sequences of the degenerate primer, then pass them through these subroutines and take the mean of the returned temperatures.
The first subroutine takes a sequence string and passes it to the second or third subroutines as appropriate, depending on the length of the sequence.
The second subroutine calculates the Tm of short sequences by running the oligotm program, which is part of the Primer 3 package. This subroutine relies on IPC::Open3 so it isn’t Windows friendly.
The third subroutine calculates the Tm of longer sequences using a method derived from one of the test scripts in the Primer 3 package.
package Primer;
use IPC::Open3;
sub calc_oligo_tm {
my ($class, $sequence) = @_;
my ($tm, $errors);
if (length($sequence) >= 32) { ($tm, $errors) = $class->_calc_long_oligo_tm ($sequence); }
else { ($tm, $errors) = $class->_run_oligotm ($sequence); }
return ($tm, $errors);
}
sub _run_oligotm {
my ($class, $sequence) = @_;
my $command = "oligotm -tp 1 -sc 1 $sequence";
my $pid = open3(\*WTRFH, \*RDRFH, \*ERRFH, $command);
close (WTRFH);
my ($tm, $errors);
while (<RDRFH>) { $tm .= $_;}
while (<ERRFH>) { $errors .= $_;}
chomp $tm;
return ($tm, $errors);
}
sub _calc_long_oligo_tm {
my ($class, $sequence) = @_;
# This code based on Test script from primer3
my $GC_count = $sequence =~ tr/gGcC/gGcC/;
my $len = length $sequence;
my $salt=50;
my $divalent=0;
my $dntp=0;
$salt = $salt + 120*sqrt(($divalent-$dntp));
my $r = log($salt/1000.0)/log(10);
my $tm = 81.5 + 16.6*$r + 41*$GC_count/$len - 600./$len;
my $errors = "hopefully none";
return ($tm, $errors);
}
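The long-oligo branch can be exercised standalone. This is a restatement of the formula above with its defaults (50 mM monovalent salt, no divalents or dNTPs) pulled into a single free-standing subroutine:

```perl
use strict;
use warnings;

# Standalone sketch of the salt-adjusted GC-content Tm formula used in
# _calc_long_oligo_tm above, with the same default salt conditions.
sub long_oligo_tm {
    my ($sequence) = @_;
    my $gc   = $sequence =~ tr/gGcC//;        # count G+C without replacing
    my $len  = length $sequence;
    my $salt = 50;                            # mM monovalent cation
    my $r    = log($salt / 1000.0) / log(10); # log10 of molar salt
    return 81.5 + 16.6 * $r + 41 * $gc / $len - 600.0 / $len;
}

my $seq = 'ACGT' x 10;                        # a 40-mer at 50% GC
printf "Tm = %.1f\n", long_oligo_tm($seq);
```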
Primer 3 can be obtained here. If you are running a Debian based Linux distribution such as Ubuntu you can install it by typing: sudo apt-get install primer3
–Giles.weaver 19:14, 14 April 2010 (UTC)
These code snippets were designed to be called from an external package.
This subroutine demonstrates how to return each of the oligomers present in a degenerate primer. It takes the sequence string of the degenerate primer and returns the oligomers that match the sequence string in an array. This can be used in conjunction with the snippets here to calculate the Tm (melting temperature) mean and Tm range of degenerate primers.
package Primer;
use Bio::Seq;
use Bio::Tools::IUPAC;
sub find_degenerates {
my ($class, $sequence) = @_;
my $seq_obj = Bio::Seq->new(-seq => $sequence, -alphabet => 'dna');
my $stream = Bio::Tools::IUPAC->new(-seq => $seq_obj);
my @oligomers;
while (my $uniqueseq = $stream->next_seq()) { push @oligomers, $uniqueseq->seq; }
return @oligomers;
}
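If BioPerl isn't available, the same expansion is a few lines of plain Perl. This sketch hard-codes the IUPAC ambiguity table rather than using Bio::Tools::IUPAC, and builds the cross product of the per-position base sets:

```perl
use strict;
use warnings;

# Plain-Perl sketch of degenerate-primer expansion (no BioPerl needed).
my %iupac = (
    A=>['A'], C=>['C'], G=>['G'], T=>['T'],
    R=>['A','G'], Y=>['C','T'], S=>['C','G'], W=>['A','T'],
    K=>['G','T'], M=>['A','C'],
    B=>['C','G','T'], D=>['A','G','T'], H=>['A','C','T'], V=>['A','C','G'],
    N=>['A','C','G','T'],
);

sub expand_degenerate {
    my ($seq) = @_;
    my @oligos = ('');
    for my $code (split //, uc $seq) {
        my $bases = $iupac{$code} or die "unknown IUPAC code: $code";
        # extend every partial oligomer by every base this code allows
        @oligos = map { my $o = $_; map { $o . $_ } @$bases } @oligos;
    }
    return @oligos;
}

print join(' ', expand_degenerate('ARG')), "\n";   # AAG AGG
```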
This subroutine produces pattern strings that can be used to search within files or a database (such as BioSQL) for sequences that match a degenerate primer. It takes the primer as a sequence object and returns two patterns, one for each strand.
package Primer;
use Bio::Tools::SeqPattern;
sub generate_primer_search_patterns {
my ($class, $seqobj) = @_;
my $pattern = Bio::Tools::SeqPattern->new(-SEQ =>$seqobj->seq, -TYPE =>'Dna');
return ($pattern->expand, $pattern->revcom(1)->expand);
}
–Giles.weaver 19:36, 14 April 2010 (UTC)
(see bioperl-l thread here)
manni122 asks:
I am trying to read in a file with multiple pairwise alignments. Some IDs appear frequently. So if I am using this code below I get the error message:
--- MSG: Replacing one sequence xxx ---
Is there a way to read the data even with those similar names?
Chris Fields suggests:
The NSE (Name.version/start-end) is used to distinguish the sequences from one another, so if each sequence has one or more unique accession/version/start/end there should be no replacement (and no warning).
use Bio::LocatableSeq;
use Bio::SimpleAlign;
use Bio::AlignIO;
my $aln = Bio::SimpleAlign->new();
my $out = Bio::AlignIO->new(-format => 'clustalw');
for my $v (1..10) {
my $ls = Bio::LocatableSeq->new(-id => 'ABCD1234',
-version => $v,
-alphabet => 'dna',
-seq => '--atg---gta--');
$aln->add_seq($ls);
}
$out->write_aln($aln);
with output…
CLUSTAL W(1.81) multiple sequence alignment
ABCD1234.1/1-6 --atg---gta--
ABCD1234.2/1-6 --atg---gta--
ABCD1234.3/1-6 --atg---gta--
ABCD1234.4/1-6 --atg---gta--
ABCD1234.5/1-6 --atg---gta--
ABCD1234.6/1-6 --atg---gta--
ABCD1234.7/1-6 --atg---gta--
ABCD1234.8/1-6 --atg---gta--
ABCD1234.9/1-6 --atg---gta--
ABCD1234.10/1-6 --atg---gta--
*** ***
and comments:
If you think about it that’s a feature. Any single sequence that appears in an alignment more than once is either (1) matching multiple regions (i.e. repeats, motifs, etc) so the location varies, or (2) the sequence was modified so the version changes (the last one is fairly new). Beyond that one has to question the logic of including multiple copies of exactly the same sequence record in a multiple alignment, so unless additional information distinguishing the potential duplicates is provided we assume unintentional (and erroneous) duplication and punt.
Weighing the options I would rather have the warning indicating a problem than nothing at all.
(see the bioperl-l thread starting here)
Rich Green asks:
I am looking for a command that would extract 2000 base pairs upstream of the sequence. Is there a way to extract just the gene promoter region from genbank?
Kevin Brown responds:
Get the start point for a Sequence Feature and request a subseq of the main object that starts 2000 above that.
use Bio::DB::GenBank;
use Bio::SeqIO;
my $gb = new Bio::DB::GenBank;
my $entry = $gb->get_Seq_by_id($id);
my $out = Bio::SeqIO->new(-format => 'fasta'); # writes to STDOUT; $out was undefined in the original scrap
my $promotor;
foreach my $f ($entry->all_SeqFeatures()) {
if ($f->primary_tag eq 'CDS') {
if( $f->strand < 0 ) {
$promotor = $entry->trunc($f->end+1, $f->end + 2000)->revcom;
}
else {
$promotor = $entry->trunc($f->start - 2000, $f->start -1);
}
$out->write_seq($promotor);
}
}
See discussion thread here.
Albert Vilella sez: I basically want to start with something like this:
seq1.123 QWERTYUIOPASDFGHJKLZXCVBNM
seq2.234 QWERTYU-------------------
seq2.345 ----------ASDFGH----------
seq2.456 -------------------ZXCVBNM
and end with something like this:
seq1.123 QWERTYUIOPASDFGHJKLZXCVBNM
seq2.mrg QWERTYU---ASDFGH---ZXCVBNM
Here’s one of my favorite tricks for this: XOR mask on gap symbol. Fast! –ed.
use Bio::SeqIO;
use Bio::Seq;
use strict;
my $seqio = Bio::SeqIO->new( -fh => \*DATA );
my $first = $seqio->next_seq->seq;
# the gap mask must be as long as the rows: XOR-ing with a bare '-' would
# touch only the first column, since perl pads the shorter operand of a
# string ^ with zero bits
my $mask = '-' x length $first;
my $acc = $first ^ $mask;
while ($_ = $seqio->next_seq ) {
$acc ^= ($_->seq ^ $mask);
}
my $mrg = Bio::Seq->new( -id => 'merged',
-seq => $acc ^ $mask );
1;
__END__
>seq2.234
QWERTYU-------------------
>seq2.345
----------ASDFGH----------
>seq2.456
-------------------ZXCVBNM
See discussion thread here.
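The heart of the trick, on plain strings with no BioPerl at all: XOR with a full-length gap mask turns '-' into "\0" (the XOR identity), so sequences can be accumulated with ^= and un-masked at the end.

```perl
use strict;
use warnings;

# XOR-merge of gapped rows sharing a common coordinate system.
my @seqs = ('QWERTYU-------------------',
            '----------ASDFGH----------',
            '-------------------ZXCVBNM');
my $mask = '-' x length $seqs[0];        # gap mask spanning the alignment
my $acc  = "\0" x length $seqs[0];       # XOR identity to accumulate into
$acc ^= ( $_ ^ $mask ) for @seqs;        # gaps contribute nothing
my $merged = $acc ^ $mask;               # un-mask: "\0" columns back to '-'
print "$merged\n";                       # QWERTYU---ASDFGH---ZXCVBNM
```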
Roger Hall asks:
Is there a random generator for creating nucleotides (of length l with composition frequencies a, c, g, and t) in there somewhere?
Bruno Vecchi actually supplies some code (with a little cheat from List::Util) :
#!perl
use List::Util qw(shuffle);
use Bio::SeqIO;
my ($seqfile, $number) = @ARGV;
my $in = Bio::SeqIO->new(-file => $seqfile);
my $fh = Bio::SeqIO->newFh(-format => 'fasta');
my $seq = $in->next_seq;
my @chars = split '', $seq->seq;
for my $i (1 .. $number) {
@chars = shuffle @chars;
my $new_seq = Bio::Seq->new(-id => $i, -seq => join '', @chars);
print $fh $new_seq;
}
You can use it like this from the command line (assuming you want 20 output sequences):
$ shuffle.pl input_sequence.fasta 20 > random_sequences.fasta
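Shuffling preserves the composition of the input sequence; to generate sequence for arbitrary composition frequencies, as Roger originally asked, a weighted draw works too. This is a sketch (random_dna is a made-up helper, not a BioPerl method) that samples each base by cumulative frequency:

```perl
use strict;
use warnings;

# Generate a random DNA string of length $len with the given base
# composition (frequencies need not sum to 1; they are normalized).
sub random_dna {
    my ($len, %freq) = @_;
    my $total = 0;
    $total += $_ for values %freq;
    my $seq = '';
    for (1 .. $len) {
        my $r = rand($total);
        for my $base (sort keys %freq) {   # sorted for a stable draw order
            $r -= $freq{$base};
            if ($r < 0) { $seq .= $base; last; }
        }
    }
    return $seq;
}

my $dna = random_dna(60, A => 0.3, C => 0.2, G => 0.2, T => 0.3);
print "$dna\n";
```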
And a couple of -shudder- non-BioPerl (or at least, impure-BioPerl) solutions:
from Aidan Budd:
from “Big Dave” Messina:
See Counting_k-mers_in_large_sets_of_large_sequences
(see bioperl-l thread here, and the scrap Regular expressions and Repeats)
Abhi Pratap asks:
Is there a quick way to find the homopolymer stretches in the contigs and also report their base start and end positions?
from Heikki Lehvaslaiho:
If you can load the sequence strings into memory, I’d use a regular expression to detect the homopolymers and then use the pos function to find the location of hits:
$s = "AGGGGGGGAAAAACGATCGGGGGGGTGTGGGGGCCCCCGTG";
$min = 4;
while ( $s =~ /(A{$min,}|T{$min,}|G{$min,}|C{$min,})/g) {
$end = pos($s);
$start = $end - length($1) + 1;
print "$start, $end, $1 \n";
}
one-liner from Russell Smithies:
You can also use the built-in regex variables and back-references to get the positions of the matches:
print join(", ", $-[0], $+[0], $&),"\n" while ( $s =~ /([ACGT])\1{$min,}/g);
(See the bioperl-l thread beginning here. See also Counting k-mers in large sets of large sequences.)
Marco Blanchette asks:
Does anyone have a little function that I could use to generate all possible k-mer DNA sequences? For instance all possible 3-mers (AAA, AAT, AAC, AAG, etc…). I need something where I could input the value of k and get all possible sequences…
…and receives replies from Michael Eisen, Jamie Estill, Dave Messina, Mark Jensen, Chris Fields (#CF1, #CF2), Diego Riano-Pachon, Malcolm Cook, Russell Smithies, Heikki, and Jay Hannah:
#!/usr/bin/perl
#
$k = shift;
@bases = ('A','C','G','T');
@words = @bases;
for $i (1..$k-1)
{
undef @newwords;
foreach $w (@words)
{
foreach $b (@bases)
{
push (@newwords,$w.$b);
}
}
undef @words;
@words = @newwords;
}
foreach $w (@words)
{
print "$w\n";
}
SeqIO works great for this. I’ve used something like the following. This is part of a larger program, so some of this is not relevant to what you need …
$fasta_in = "your_file.fasta";
$k = 3;
my $in_seq_num = 0;
my $inseq = Bio::SeqIO->new( -file => "<$fasta_in",
-format => 'fasta');
while (my $seq = $inseq->next_seq) {
$in_seq_num++;
if ($in_seq_num == 2) {
print "\a";
die "Input file should be a single sequence record\n";
}
# Calculate base coordinate data
my $seq_len = $seq->length();
my $max_start = $seq->length() - $k;
# Print some summary data
print STDERR "\n==============================\n" if $verbose;
print STDERR "SEQ LEN: $seq_len\n" if $verbose;
print STDERR "MAX START: $max_start\n" if $verbose;
print STDERR "==============================\n" if $verbose;
# CREATE FASTA FILE OF ALL K LENGTH OLIGOS
# IN THE INPUT SEQUENCE
print STDERR "Creating oligo fasta file\n" if $verbose;
open (FASTAOUT, ">$temp_fasta") ||
die "Can not open temp fasta file:\n$temp_fasta\n";
for $i (0..$max_start) {
$start_pos = $i + 1;
$end_pos = $start_pos + $k - 1;
my $oligo = $seq->subseq($start_pos, $end_pos);
# Set counts array to zero
$counts[$i] = 0;
print FASTAOUT ">$start_pos\n";
print FASTAOUT "$oligo\n";
}
close (FASTAOUT);
}
Dave Messina writes:
Here’s some code to generate and print all possible nmers. I’m really just using the module Math::Combinatorics to do all the dirty work here, so probably won’t be as fast as if you wrote a custom recursive function as you suggest.
#!/usr/local/bin/perl
use strict;
use warnings;
use Math::Combinatorics;
# do all codons (3-mers) as an example
generate_possible_kmers(3);
=head2 generate_possible_kmers
Title : generate_possible_kmers
Usage : generate_possible_kmers(n)
Function: create and print the list of possible DNA kmers
Returns : none
Args : n - the length of the desired 'mer'
=cut
sub generate_possible_kmers {
my ($n) = @_;
my $alphabet = [ qw( A C G T ) ];
my $words_per_row = 10;
my $i=0;
my $o = Math::Combinatorics->new( count=>$n, data=>$alphabet,
frequency=>[$n,$n,$n,$n] );
while ( my @x = $o->next_multiset ) {
my $p = Math::Combinatorics->new( data=>\@x , frequency=>[map{1} @x] );
while ( my @y = $p->next_string ) {
print join('', @y), ' ';
$i++;
if (($i % $words_per_row) == 0) { print "\n";
}
}
}
}
Mark Jensen fires off: A little sloppy, but it recurses and is general—
# ex...
@combs = doit(3, [ qw( A T G C ) ]);
1;
# code
sub doit {
my ($n, $sym) = @_;
my $a = [];
doit_guts($n, $sym, $a, '');
return map {$_ || ()} @$a;
}
sub doit_guts {
my ($n, $sym, $store, $str) = @_;
if (!$n) {
return $str;
}
else {
foreach my $s (@$sym) {
push @$store, doit_guts($n-1, $sym, $store, $str.$s);
}
}
}
Chris Fields comments:
To add to the pile: Mark Jason Dominus (mjd) tackles this problem in Higher-Order Perl using iterators, which also allows other nifty bits like ‘give variants of A(CTG)T(TGA)’, where anything in parentheses is a wild-card. The nice advantage of the iterator approach is that you don’t tank memory for long strings. Furthermore, as a bonus, you can now download the book for free; the relevant chapter starts on p. 135.
Here is the example code:
#!/usr/bin/perl -w
use strict;
use warnings;
use 5.010;
my $string = 'A(CT)G(AC)';
my $it = make_genes($string);
while (my $new = NEXTVAL($it) ) {
say $new;
}
sub make_genes {
my $pat = shift;
my @tokens = split /[()]/, $pat;
for (my $i = 1; $i < @tokens; $i += 2) {
$tokens[$i] = [0, split(//, $tokens[$i])];
}
my $FINISHED = 0;
return sub {
return if $FINISHED;
my $finished_incrementing = 0;
my $result = "";
for my $token (@tokens) {
if (ref $token eq "") { # plain string
$result .= $token;
} else { # wildcard
my ($n, @c) = @$token;
$result .= $c[$n];
unless ($finished_incrementing) {
if ($n == $#c) { $token->[0] = 0 }
else { $token->[0]++; $finished_incrementing = 1 }
}
}
}
$FINISHED = 1 unless $finished_incrementing;
return $result;
}
}
sub NEXTVAL { $_[0]->() }
Diego Riano-Pachon offers:
Just another option: why not just count? Any DNA sequence is a number in base 4, isn’t it? So you have to count from zero up to the largest number (in base 4) that you want.
############################################################################################
##DNA word can be represented as base 4 numbers,
##this is an idea from Ingo Dreyer, I just implemented it in Perl
##The following hashes help in doing the transcoding from and to base 4
############################################################################################
use Math::BigInt;
use Math::BaseArith;
my %transcoded_nt;
$transcoded_nt{'A'}=0;
$transcoded_nt{'C'}=1;
$transcoded_nt{'G'}=2;
$transcoded_nt{'T'}=3;
my %transcode_base4;
$transcode_base4{0}='A';
$transcode_base4{1}='C';
$transcode_base4{2}='G';
$transcode_base4{3}='T';
my @alphabet=sort keys %transcoded_nt;
my $lengthalph=scalar(@alphabet);
my $x= Math::BigInt->new($lengthalph);
my $k=11; # word length; the original snippet took this from an external %matrix
my $number_of_words=$x->bpow($k);
foreach my $i (0..$number_of_words-1){
my @base_four_word= encode($i,[(4) x $k]);
my $transcoded_word=join('',@base_four_word);
while (my ($position,$val) = each %transcode_base4){
$transcoded_word=~s/$position/$val/g;
}
}
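Stripped of the helper modules, the counting idea fits in a dozen lines. This sketch prints all 2-mers by writing each integer in base 4 and mapping the digits to bases:

```perl
use strict;
use warnings;

# Counting in base 4: every k-mer is the base-4 representation of
# 0 .. 4**$k - 1, with digits 0..3 mapped to A, C, G, T.
my @base = qw(A C G T);
my $k    = 2;
my @words;
for my $n ( 0 .. 4**$k - 1 ) {
    my ( $word, $m ) = ( '', $n );
    for ( 1 .. $k ) {
        $word = $base[ $m % 4 ] . $word;   # least-significant digit first
        $m = int( $m / 4 );
    }
    push @words, $word;
}
print "@words\n";
```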
Malcolm Cook gives it a bash:
just to round things out…. here’s a quick bash script that does the same…
#!/bin/bash
k=$1
s=$( printf "%${k}s" ); # a string with $k blanks
s=${s// /{A,T,G,C}}; # substitute '{A,T,G,C}' for each of the k blanks
echo 'kmers using bash to expand:' $s > /dev/stderr
bash -c "echo $s"; # let brace expansion of inferior bash compute the cross product
oh… and… if you need the results in a perl array, and you’re running under some unix, try the even terser:
#!/usr/bin/env perl
my $k = shift;
my @kmer = split / /, `echo @{['{A,T,G,C}' x $k]}`;
finally (? | !), here’s a perl one-liner that outputs one kmer per line, with switches for the length, $k, and the alphabet, $a:
perl -se '$,=" "; print split / /, qx/echo @{["{$a}" x $k]}/;' -- -k=4 -a='A,T,G,C'
Russell Smithies parries with an elegant:
recursive use of map:
print "[", join(", ", @$_), "]\n" for
permute([ qw( A T G C ) ],[ qw( A T G C ) ],[ qw( A T G C ) ],[ qw( A T G C ) ]);
sub permute {
my $last = pop @_;
unless (@_) {
return map [$_], @$last;
}
return map { my $left = $_; map [@$left, $_], @$last } permute(@_);
}
Chris Fields delivers this coup de grace :
Perl 6 (20 random 20-mers):
use v6;
say [~] <A T G C>.pick(20, :repl) for 1..20;
By the way, this works with the latest Rakudo. – Chris Fields
Heikki Lehvaslaiho: Thank you to everyone for these entertaining entries. In my book, Michael Eisen’s version wins with sheer clarity. Recursion is always recommendable, too. Below are cleaned versions of these two. Feel free to improve them further.
#!/usr/bin/env perl
use warnings;
use strict;
sub kmers ($;$) {
my $k = shift;
my $alphabet = shift || [ qw( A T G C ) ];
my @bases = @$alphabet;
my @words = @bases;
for ( 1 .. --$k ) {
my @newwords;
foreach my $w (@words) {
foreach my $b (@bases) {
push (@newwords, $w.$b);
}
}
@words = @newwords;
}
return @words;
}
my $k = shift;
die "positive integer needed as the argument!"
unless $k > 0 and $k =~ /^\d+$/;
map {print "$_\n"} kmers($k);
#!/usr/bin/env perl
use warnings;
use strict;
sub kmers ($;$) {
my $n = shift;
my $sym = shift || [ qw( A T G C ) ];
sub kmers_guts {
my ($n, $sym, $store, $str) = @_;
if ($n) {
foreach my $s (@$sym) {
push @$store, kmers_guts($n-1, $sym, $store, $str.$s);
}
} else {
return $str;
}
}
my $a = [];
kmers_guts($n, $sym, $a, '');
return map {$_ || ()} @$a;
}
my $k = shift;
die "positive integer needed as the argument!"
unless $k =~ /^\d+$/;
map {print "$_\n"} kmers($k);
Here’s an alternate to version 5. I think versions 5 and 10 are the only ones that use iterators, which is mandatory for extremely large sets so you don’t waste / run out of memory. Using arrays “every possible DNA strand of lengths 1 to 14 would require 5GB of memory”! In contrast, iterators consume only a trivial amount of memory. –Jhannah 13:56, 20 February 2009 (UTC)
=head2 gen_permutate
An iterator for all possible DNA sequences between lengths $min and $max.
my @DNA = qw/A C T G/;
my $seq = gen_permutate(3, 4, @DNA);
while ( my $strand = $seq->() ) {
print "$strand\n";
}
# http://www.perl.com/pub/a/2005/06/16/iterators.html
# Modified by jhannah
=cut
sub gen_permutate {
my ($min, $max, @list) = @_;
my @curr = ($#list) x ($min - 1);
return sub {
if ( (join '', map { $list[ $_ ] } @curr) eq $list[ -1 ] x @curr ) {
@curr = (0) x (@curr + 1);
}
else {
for ( my $pos = $#curr; $pos >= 0; $pos-- ) {
++$curr[ $pos ], last if $curr[ $pos ] < $#list;
$curr[ $pos ] = 0;
}
}
return undef if @curr > $max;
return join '', map { $list[ $_ ] } @curr;
};
}
(from an email to Russell Smithies) This isn’t the prettiest way, but it might be the most efficient. The strategy is based on the idea that you can’t get much faster than a hard-coded loop.
foreach $c1 (@alphabet) {
foreach $c2 (@alphabet) {
foreach $c3 (@alphabet) {
$h{"$c1$c2$c3"} = 1;
}
}
}
The kmer() function below simply builds up the same code for any k and any alphabet, executes it, and returns a reference to a hash. I don’t use this method myself because when I work with k-mers, I tend to work in C, and then I use an iterative numeric method.
@alphabet = qw(A C G T);
$kmers = kmer(3, \@alphabet); # pass a reference: the sub dereferences @$a
sub kmer {
my ($k, $a) = @_;
my ($w, $code, %h); # lexicalize so repeated calls don't accumulate
for (1..$k) {$w .= "\$c$_"}
for (1..$k) {$code .= "for \$c$_ (\@\$a) {"}
eval $code . "\$h{\"$w\"} = 1" . "}" x $k;
\%h;
}
See Merge_gapped_sequences_across_a_common_region
By Brian –Ed.
How do I find an iteration of any sequence of a specific length?
So /(QA)+/ will match one or more iterations of QA, but what if you want to match any repeat of length 2? Try /(..)\1+/. Then $1 will tell you what the repeat was, and length($&)/2 will tell you the number of repeats.
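Putting the pieces together in a runnable snippet (the protein string is invented for the demo):

```perl
use strict;
use warnings;

# Find a tandem repeat of unit length 2 and report the unit and copy count.
my $seq = 'MKVQAQAQAQARST';
my ( $unit, $copies );
if ( $seq =~ /(..)\1+/ ) {
    $unit   = $1;                 # the repeat unit
    $copies = length($&) / 2;     # whole match length / unit length
}
print "unit=$unit copies=$copies\n";
```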
How do I find some sequence flanked by homopolymers of a given length?
For example, to find FAFCRCFCFAFAFCRF flanked by n number of Q, e.g.:
AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ
The regular expression would be something like
/(Q{$n,})([^Q]{$x,})(Q{$n,})/
Example:
perl -e '$n=5; $x=9; $_= "AGTWRWDFDQQQQQQQQFAFCRCFCFAFAFCRFQQQQQQQQQQQQQ"; print "$1|$2|$3\n" if /(Q{$n,})([^Q]{$x,})(Q{$n,})/;'
QQQQQQQQ|FAFCRCFCFAFAFCRF|QQQQQQQQQQQQQ
How do I find any homopolymer flanked on both sides by the same amino acid?
For example, HTTTTTTTTTTH or TGGGGGGGGGGGT.
/(.)(?:(?!\1).)+\1/ (a backreference doesn’t work inside a character class: [^\1] would match “anything but the character \x01”, so a negative lookahead does the job)
In action:
perl -e '$_ = "HTTH"; print "|$1|\n" if /((.)(?:(?!\2).)+\2)/;'
Note that the “homopolymer” could have a length of 1!
Here’s a beautiful use of the eval STRING flavor of eval: creating a suffix tree from a dictionary of words. The tree lookup seems like falling off a log as well; but there’s a hidden danger, read on.
I took the original routine and BioPerled it up a little:
The END value was too clever by half, as pointed out by Aaron. That leads to hammering of complete words by later suffixes, or of suffixes by later short words. So, I revert to Ian’s original… –Ed.
package Suftree;
use strict;
sub new {
my $class = shift;
my @words = @_;
my $self = {};
bless($self, $class);
$self->readDictionary(@words);
return $self;
}
sub readDictionary {
my $self = shift;
my @words = @_;
my %D;
my $i = 0;
for (@words) {
chomp;
my $a = $_;
my $string = join('', map { "\{'${_}'\}" } split('',$a)).'{END}';
eval "\$D$string = 1"; # <<=== cool way to make a suffix tree!
}
$self->dict(\%D);
}
sub dict {
my $self = shift;
return $self->{_dict} = shift if (@_);
return $self->{_dict};
}
# more to come...
$i=0 looks suspicious. Hmm. So could replace readDictionary above with the following:
sub readDictionary {
my $self = shift;
my @words = @_;
my %D;
for (@words) {
chomp;
my @a = split('', $_);
# note: the original "'*'.split('')" forced scalar context and yielded a
# field count rather than the character list
while (@a) {
my $string = join('', map { "\{'${_}'\}" } @a).'{END}';
eval "\$D$string = 1"; # <<=== cool way to make a suffix tree!
shift @a; # drop the leading character: next suffix
}
}
$self->dict(\%D);
}
Observe the method readDictionary for the trick. Here, Ian builds a hash of hash of … assignments as a string**, then evals it. He is really leveraging Perl’s automatic reference creation to cause Perl to build the entire data structure internally. Brilliant! But in that very thing lies dormant the seed of cruft…
Here are two search methods off the Suftree object:
package Suftree;
use strict;
# return -1 if $str DNE in dict; 0 if substring exists;
# 1 if whole word is present
sub search {
my $self = shift;
my $str = shift;
my $l = join('', map { "\{'${_}'\}" } split('',$str));
if (eval "exists \$self->dict->$l") {
return 1 if eval "exists \$self->dict->$l->\{END\}";
return 0;
}
return -1;
}
sub search2 {
my $self = shift;
my $str = shift;
my $d = $self->dict;
for (split('',$str)) {
last if !defined($d = $$d{$_});
}
for ($d) {
!defined && do {return -1};
grep(/END/,keys %$d) && do {return 1};
return 0; # substring present, but not a complete word
}
}
Both return the right thing. search() is cool (it uses an eval STRING trick too), while search2() looks like somebody’s C homework. But suppose we do
use Suftree;
use strict;
my $D = Suftree->new(<DATA>);
print $D->search('AGGA'); #returns 0
print $D->search('AAGGAGGGTAAAAATGA'); #returns 1
print $D->search('CGGA'); #returns -1, but look out!
print $D->search2('AGGA'); #returns 0
print $D->search2('AAGGAGGGTAAAAATGA'); #returns 1
print $D->search2('CGGA'); #returns -1
__DATA__
AAGGAGGTAAAAATGA
AAGGAGGTAAAATGA
AGGAGGTAAAAATGA
ATGGAGGTAAAAATGA
AAGGAGGGTAAAAATGA
In the failed search, $D->search('CGGA'), the eval’ed string in the lookup is
$D->dict->{C}{G}{G}{A}
What does Perl do? It creates
$D->dict->{C}
$D->dict->{C}->{G}
and
$D->dict->{C}->{G}->{G}
before it finds out that
$D->dict->{C}->{G}->{G}->{A}
does not exist. These extra refs are unfortunately now stored in the dictionary, and, e.g., $D->search('CG')
will be successful, but not reflect the data. search2()
avoids this by explicitly checking the existence of each branch, and jumping from the loop when the branch DNE.
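The autovivification at work, in miniature: a single rvalue lookup is enough to spring every intermediate level into existence (though not the final key).

```perl
use strict;
use warnings;

# Merely *reading* a deep hash path autovivifies the intermediate levels.
my %D;
my $probe = $D{C}{G}{G}{A};     # a lookup only -- no assignment anywhere
print exists $D{C}{G}{G}   ? "vivified\n"    : "clean\n";    # vivified
print exists $D{C}{G}{G}{A} ? "leaf exists\n" : "no leaf\n"; # no leaf
```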
Can use the second form of readDictionary above, and use search2(), or…
Courtesy of Russell –Ed.
Using a recursive approach, we can check for the presence of any substring in the dictionary.
$D->recurse_search('C'); # returns -1
$D->recurse_search('GGTA'); # returns 0
$D->recurse_search('ATGA'); # returns 1
package Suftree;
use strict;
# returns -1: not found; 0: internal substring; 1: right-end anchored substring
sub recurse_search{
my $self = shift;
my $str = shift;
_recurse_hash($self,$self->dict,0,$str);
}
sub _recurse_hash {
my ($self, $hash, $depth, $str) = @_;
my $ret;
$ret = $self->search2_hash($str, $hash);
return $ret unless $ret < 0;
for my $k (keys %$hash) {
$ret = _recurse_hash($self,$$hash{$k},$depth+1,$str);
return $ret unless $ret < 0;
}
return -1;
}
sub search2_hash {
my $self = shift;
my $str = shift;
my $d = shift;
for (split('',$str)) {
last if !defined($d = $$d{$_});
}
for ($d) {
!defined && do {return -1};
ref($d) && do { return 0 };
return 1;
}
}
** Footnote: the nested-subscript string is built up with join and map.
This trick is discussed more extensively in Mastering Regular Expressions by Jeffrey Friedl. This is something I raised on the mail list at one point; we can probably wrap this up into a method somewhere (probably something curried).
Essentially you have to embed code into the regex and trick the parser into backtracking using a negative lookahead. The match itself fails (i.e. no match is returned), but the embedded code is executed for each match attempt, so in essence you will get all matches, including suboptimal ones.
The following demo script is a slight modification of one I used which checks the consensus string from the input alignment (in aligned FASTA format here), extracts the alignment slice using that match, then spits the alignment out to STDOUT in clustalw format. This should work for perl 5.8 and up, but it’s only been tested on perl 5.10. You should be able to use this to fit what you want.
The beginning regex match actually succeeds (matches the first 18-21 characters), thus setting $1. This gets passed to the code block that follows to be evaluated, passing the string out to the subroutine to be tested and returned if a match is present. Successful matches are then passed to the array with relevant information to be iterated over later. The last (?!) is the part that fails, causing the engine to backtrack and attempt the match again.
Note that we never use the full match; we only attempt matching once per iteration in the while loop.
use strict;
use warnings;
use Bio::AlignIO;
my $file = shift; # or $ARGV[0]
# there is a bug, likely in Bio::Root::IO, where passing the filename directly
# to AlignIO causes parsing issues and (with some AlignIO formats) segfaults
my $in = Bio::AlignIO->new(-file => $file,
-format => 'fasta');
my $out = Bio::AlignIO->new(-fh => \*STDOUT,
-format => 'clustalw');
while (my $aln = $in->next_aln) {
my $c = $aln->consensus_string(100);
my @matches;
# note this match actually fails
$c =~ m/
(.{18,21})
(?{
my $match = check_match($1);
push @matches, [$match, pos(), length($match)] if $match;
})
(?!)
/xig;
for my $match (@matches) {
my ($hit, $st, $end) = ($match->[0],
$match->[1] - $match->[2] + 1,
$match->[1]);
my $newaln = $aln->slice($st, $end);
$out->write_aln($newaln);
}
}
sub check_match {
my $match = shift;
return unless $match;
my $ct = $match =~ tr/?/?/;
return $match if $ct <= 4;
}
NigelD destroys the competition with the clever use of splice to kick out a random member of the ever-shortening original index array, along with a judicious copy and reverse of the index array -Ed.
sub perm2 {
my @a = split('',shift);
my @i = (0..$#a);
my @j = @i;
my @r;
map {
push @r, splice(@i, rand($_+1),1);
} reverse @j;
return join('',@a[@r]);
}
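Here is perm2 verbatim, with a quick sanity check that the output is a permutation of the input (same residues, different order):

```perl
use strict;
use warnings;

# perm2 as above: draw indices without replacement via splice, then
# index the character array by the resulting random order.
sub perm2 {
    my @a = split('',shift);
    my @i = (0..$#a);
    my @j = @i;
    my @r;
    map {
        push @r, splice(@i, rand($_+1),1);
    } reverse @j;
    return join('',@a[@r]);
}

my $shuffled = perm2('ACGTACGT');
print "$shuffled\n";
print join('', sort split //, $shuffled), "\n";   # always AACCGGTT
```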
Here’s a useful little device for monitoring progress as a large number of things (sequences, e.g.) are being stepped through. It’s written as a configurator that returns an iterator. Call the iterator on every pass through the loop; it will know when to do the right thing. The callback allows you to get fancy if desired. –MAJ
# dodot() args --
# $every : hashref of { $modulus => $symbol, ... }
# (i.e., "every $modulus things, print $symbol...")
# $title : print (or whatever) this on the first call
# $callback : do this every $modulus things:
# $callback->($current_symbol, $current_modulus, $current_count_of_things
#
sub dodot {
my ($every, $title, $callback) = @_;
my $i=0;
my @every = sort {$b<=>$a} keys %$every;
$callback||= sub { print STDERR shift() };
return sub {
$i++;
$callback->("$title\n") if ($title && ($i==1));
for (@every) {
unless ($i % $_) {
$callback->($every->{$_},$_,$i);
last;
}
};
};
}
Use it like so:
my $dot = dodot( { 10 => '.',
100 => '|',
1000 => '@' },
'Enormo' );
while (my $seq = $seqio->next_seq) {
# doing doing doing...
$dot->();
}
to produce something like:
Enormo
.........|.........|.........|.........|.........|.........|.........|.........|.........|....
.....@.........|.........|.........|.........|.........|.........|.........|.........|........
.|.........@.........|.........|.........|.........|.........|
For nice linebreaks, try
my $dot = dodot( { 10 => '.',
100 => '|',
701 => "\n",
1000 => '@' },
'Enormo');
to get
Enormo
.........|.........|.........|.........|.........|.........|.........|
.........|.........|.........@.........|.........|.........|.........|
.........|.........|.........|.........|.........|.........@.........|
.........|.........|.........|.........|
To start over for a new routine, just create a new iterator:
sub validate {
my @seqs = @_;
my $vdot = dodot( { 500 => '.' }, 'validating' );
my @valid;
for my $seq (@seqs) {
$vdot->();
# validate validate validate
push @valid, $seq; # ...if it passes muster
}
return @valid;
}
sub preprocess {
my @seqs = @_;
my $pdot = dodot( { 500 => '.' }, 'preprocessing');
my @processed;
for my $seq (@seqs) {
$pdot->();
# preprocess preprocess preprocess
push @processed, $seq; # ...once it's been processed
}
return @processed;
}
## main
...
my $aln = $alnio->next_aln;
my @valid_processed = preprocess( validate($aln->each_seq) );
...
Here’s the monitor I use. It gives a “wget”-like progress display with the percentage of the job completed.
sub progress_bar {
my ( $got, $total, $width, $char ) = @_;
$width ||= 25; $char ||= '=';
my $num_width = length $total;
sprintf "|%-${width}s| Processed %${num_width}s rows of %s (%.2f%%)\r",
$char x (($width-1)*$got/$total). '>',
$got + 1, $total, 100*$got/$total;
}
# as an example of how it's used,
# here's what I did when I was processing
# a very large amount of mass-spec data
$mass_count = keys (%mass_data);
$mass_num = 0;
foreach $mass (sort {$a <=> $b} (keys %mass_data)){
# do stuff with your data
$t2mdata .= pack("f",$mass);
# show progress
print progress_bar( $mass_num++, $mass_count, 25, '=' );
}
Or use a CPAN module, which can be as simple or as complex as you want:
use Term::ProgressBar;
my $max = 10;
my $progressBar = Term::ProgressBar->new(
{
name => 'Running Test',
count => $max,
ETA => 'linear',
max_update_rate => 1,
remove => 1,
}
);
for my $i (1..$max){
$progressBar->update ($i);
sleep 1;
}
Key having max value:
(sort {$h{$b} <=> $h{$a}} keys %h)[0]
or
use List::Util qw(reduce);
reduce { $h{$a} > $h{$b} ? $a : $b } keys %h;
All keys having max value:
my $max = [sort {$b<=>$a} values %h]->[0];
grep { $h{$_} == $max } keys %h;
–MAJ
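To see these in action, here’s a sketch on a made-up hash, leaning on List::Util (reduce picks a single winner; max plus grep collects ties):

```perl
use strict;
use warnings;
use List::Util qw(reduce max);

my %h = (a => 3, b => 7, c => 7, d => 1);

# One key carrying the maximum value (ties broken arbitrarily):
my $top = reduce { $h{$a} >= $h{$b} ? $a : $b } keys %h;
print "$h{$top}\n";                              # prints 7

# All keys carrying the maximum value:
my $max = max values %h;
my @tops = sort grep { $h{$_} == $max } keys %h;
print "@tops\n";                                 # prints b c
```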
Recursively:
dohash( \%h, sub { print "$_[0] " } );
sub dohash {
my ($h, $callback) = @_;
map { ref $$h{$_} ? dohash($$h{$_},$callback) : $callback->($$h{$_}) } keys %$h ;
return;
}
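A quick check of the recursive walk on a small nested hash (dohash is repeated so the sketch stands alone; keys %$h come back in hash order, so the collected values are sorted before printing):

```perl
use strict;
use warnings;

# Recurse into hashref values; hand leaf values to the callback.
sub dohash {
    my ($h, $callback) = @_;
    map { ref $$h{$_} ? dohash($$h{$_}, $callback) : $callback->($$h{$_}) } keys %$h;
    return;
}

my %h = ( x => 1, y => { p => 2, q => { r => 3 } } );
my @seen;
dohash(\%h, sub { push @seen, $_[0] });
print join(',', sort { $a <=> $b } @seen), "\n";   # prints 1,2,3
```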
A depth-first iterator:
$it_d = make_depth_iterator( \%h, sub { print "$_[0]\n" } );
while ($it_d->()) {1;}
sub make_depth_iterator {
my ($h, $callback) = @_;
my @stack = ($h);
return sub {
while (@stack) {
my $v = pop @stack;
while (ref($v)) {
push @stack, values %$v;
$v = pop @stack;
}
return undef unless defined $v;
return $callback->($v);
}
return;
}
}
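And a check of the depth-first iterator on a small nested hash (the sub is repeated with the stack seeded explicitly; leaf order depends on hash order, so the harvest is sorted):

```perl
use strict;
use warnings;

# Closure over a stack: pop refs, expand them, return leaf values.
sub make_depth_iterator {
    my ($h, $callback) = @_;
    my @stack = ($h);
    return sub {
        while (@stack) {
            my $v = pop @stack;
            while (ref($v)) {
                push @stack, values %$v;
                $v = pop @stack;
            }
            return undef unless defined $v;
            return $callback->($v);
        }
        return;
    };
}

my %h = ( a => 10, b => { c => 20, d => 30 } );
my @seen;
my $it = make_depth_iterator(\%h, sub { push @seen, $_[0]; 1 });
1 while $it->();
print join(',', sort { $a <=> $b } @seen), "\n";   # prints 10,20,30
```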
A breadth-first iterator:
$it_b = make_breadth_iterator( \%h, sub { print "$_[0]\n" } );
while ($it_b->()) {1};
sub make_breadth_iterator {
my ($h, $callback) = @_;
my @queue = ($h);
return sub {
while (@queue) {
my $v = shift @queue;
while (ref($v)) {
push @queue, values %$v;
$v = shift @queue;
}
return undef unless defined $v;
return $callback->($v);
}
return;
}
}
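The breadth-first version can be checked the same way (sub repeated with the queue seeded explicitly; within a level the order follows hash order, so the harvest is sorted):

```perl
use strict;
use warnings;

# Closure over a queue: shift refs, enqueue their values, return leaves.
sub make_breadth_iterator {
    my ($h, $callback) = @_;
    my @queue = ($h);
    return sub {
        while (@queue) {
            my $v = shift @queue;
            while (ref($v)) {
                push @queue, values %$v;
                $v = shift @queue;
            }
            return undef unless defined $v;
            return $callback->($v);
        }
        return;
    };
}

my %h = ( top => 1, deep => { mid => 2, deeper => { leaf => 3 } } );
my @seen;
my $it = make_breadth_iterator(\%h, sub { push @seen, $_[0]; 1 });
1 while $it->();
print join(',', sort { $a <=> $b } @seen), "\n";   # prints 1,2,3
```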