FAQ

About this FAQ

What is this FAQ?

It is the list of Frequently Asked Questions about BioPerl.

What if my question isn't answered here?

In general, users should avoid contacting the BioPerl developers individually. Questions can be answered more efficiently via the following means:

  • One can initially search the mailing list archives. The archive is an excellent repository of information, and searching it may very well give you a quick answer.
  • If you can't find the answer there, a second course of action is to post the question directly to the main bioperl mailing list. Posting a question to the list may get an answer much faster and has even led to coding new functionality. Also, as noted above, any question/answer will be archived for future searching.

How is it maintained?

It is now maintained as a Wiki on this site. You can help maintain it by adding questions and answers.

How can I tell what version of BioPerl is installed?

We have a universal version number for a release. This is set in the module Bio::Root::Version which is universally applied to every module which inherits from Bio::Root::RootI. To check the version, just use the following one liner:

perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'

Now when we use version tuples that are not just decimal numbers, Perl converts these silently to Unicode representations. (Data::VString contains a helpful description of version tuples.) What that means is, to actually print the version number you have to use formatted printing like this

 perl -MBio::Root::Version -e 'printf "%vd\n", $Bio::Root::Version::VERSION'

The version number can be printed for any module in BioPerl (and should be consistent). For example, to print the version number of Bio::SeqIO:

perl -MBio::SeqIO -e 'printf "%vd\n", $Bio::SeqIO::VERSION'
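
As a minimal sketch of why the formatted print matters, the helper below returns a readable string for both plain decimal versions and version tuples. The name format_version is hypothetical and not part of BioPerl, and the sketch assumes that a plain version contains only digits, dots, and underscores.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: make any version value printable.
sub format_version {
    my ($version) = @_;
    # Plain decimal versions such as "1.4" or "1.006001" print as they are.
    return $version if $version =~ /^[0-9._]+$/;
    # Version tuples are strings of code points; %vd joins their ordinals with dots.
    return sprintf '%vd', $version;
}

print format_version('1.4'),  "\n";
print format_version(v1.5.2), "\n";
```

In a script you could pass $Bio::Root::Version::VERSION to such a helper instead of the literal values used here.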

BioPerl in General

What is BioPerl?

BioPerl is a toolkit of Perl modules useful in building bioinformatics solutions in Perl. It is built in an object-oriented manner so that many modules depend on each other to achieve a task. The collection of modules in the bioperl-live repository makes up the core functionality of BioPerl. Additionally, auxiliary modules for creating graphical interfaces (bioperl-gui), persistent storage in an RDBMS (bioperl-db), running and parsing the results from hundreds of bioinformatics applications (Run package), and software to automate bioinformatic analyses (bioperl-pipeline) are all available as Git modules in our repository.

Where do I go to get the latest release?

Official releases will be noted on the website http://bioperl.org . You can always get our releases from http://bioperl.org/DIST or via the wiki at Getting BioPerl.

What is the difference between 1.5.2 and 1.4.0? What do you mean developer release?

The 1.4.x series was released in 2004 and represented a stable release series. The 1.5.0 release was made in early 2005 but included several annoying bugs. The 1.5.1 release in October fixed those bugs and added a number of new modules. 1.5.2 was released in Dec 2006. See the Change log for more information.

Developer releases are odd numbered releases (1.3, 1.5, etc) not intended to be completely stable (although all tests should pass). Stable releases are even numbered (1.0, 1.2, 1.6) and intended to provide a stable API so that modules will continue to respect the API throughout a stable release series. We cannot guarantee that APIs are stable between releases (i.e. 1.6 may not be completely compatible with scripts written for 1.4), but we endeavor to keep the API stable so that upgrading is easy.

(historical text from old FAQ)

The 0.7.X series (0.7.0, 0.7.2) were all released in 2001 and were stable releases on the 0.7 branch. This means they had a set of functionality that is maintained throughout (no experimental modules) and were guaranteed to pass all tests, and subsequent bug fix releases with the 0.7 designation would not have any API changes.
The 0.9.X series was our first attempt at releasing so-called developer releases. These are snapshots of the actively developed code that at a minimum pass all our tests.

How can I learn how to use a module?

% perldoc MODULE

Careful - spelling and case count! If you are not sure about case you can use the -i switch with perldoc.

% perldoc -i SomeModule

The BioPerl Tutorials will provide a good introduction. You may also find useful documentation in the form of HOWTOs. There are links to tutorials off the bioperl website that may provide some additional help. There are also many scripts in the examples/ and scripts/ directories that may be useful - see the BioPerl scripts page for brief descriptions.

Additionally, we have written many tests for our modules. You can see test data and example usage of the modules in these tests - look in the test directory, called t/.

I'm interested in the bleeding edge version of the code, where can I get it?

The Using Git page details the instructions on how to get the latest code; use the option for anonymous access.

Who uses this toolkit?

Lots of people. Sanger Centre, EBI, many large and small academic laboratories, large and small pharmaceutical companies. We do not have a comprehensive list at this point in time but maybe it should be started: BioPerl Users

How should I cite BioPerl?

See the BioPerl publications page.

What are the license terms for BioPerl?

BioPerl is licensed under the same terms as Perl itself which is dually-licensed under the terms of the Perl Artistic License (see http://www.perl.com/pub/a/language/misc/Artistic.html or http://www.opensource.org/licenses/artistic-license.html) or the GNU GPL (http://www.gnu.org/licenses/gpl.html).

I want to help, where do I start?

BioPerl is a pretty diverse collection of modules which has grown from the direct needs of the developers participating in the project. So if you don't have a need for a specific module in the toolkit it becomes hard to just describe ways it needs to be expanded or adapted. One area, however is the development of stand-alone scripts which use BioPerl components for common tasks. Some starting points for scripts: find out what people in your institution do routinely that a shortcut can be developed for. Identify modules in BioPerl that need easy interfaces and write that wrapper - you'll learn how to use the module inside and out.

A great place to start looking for aspects of the project which need help is the Project priority list and Orphan modules list.

We always need people to help fix bugs - check the Bugzilla bug tracking system. Submitting bugs in the documentation and code is very helpful. As has been said about open source software, "Given enough eyeballs, all bugs are shallow".

I've got an idea for a module, how do I contribute it?

Post your idea on the mailing list. If you have written it already, or if you have been thinking about the API already, post the API, ideally with usage documentation, e.g., the POD that would normally go with each method, and some usage examples, e.g., what would otherwise go into the synopsis section of the module's POD.

Once you have finished gathering feedback and incorporating it into your module as appropriate, either post the module on Bugzilla or, if you already have a developer account, commit it once you have convinced yourself that the tests (yours and the pre-existing ones) pass.

If you do not have a developer account yet, posting ideas, APIs, examples, and code may quickly earn you one.

How do I submit a patch or enhancement to BioPerl?

We suggest the following. Post your idea to the appropriate mailing list. If it is a really new idea, consider taking us through your thought process. We'll help you tease out the necessary information such as what methods you'll want and how it can interact with other BioPerl modules. If it is a port of something you've already worked on, give us a summary of the current methods. Make sure there is an interface to the module, not just an implementation, and make sure there will be a set of tests in the t/ directory to ensure that your module is tested. If you have a suggested patch and/or code enhancement, the SubmitPatch HOWTO gives guidelines on how to properly submit them via Bugzilla. See also Advanced BioPerl for more information.

Why can't I easily get a list of all the methods an object can call?

This is a problem with Perl, not only with BioPerl. To list all the methods, you have to walk the inheritance tree, and standard Perl does not provide a ready-made command for it. As usual, help can be found in the CPAN. Install the CPAN module Class::Inspector, put the following script, perlmethods, into your path, and run it, e.g., >perlmethods Bio::Seq.

#!/usr/bin/perl
use strict;
use warnings;
use Class::Inspector;
my $class = shift or die "Usage: perlmethods perl_class_name\n";
eval "require $class; 1" or die "Cannot load $class: $@";
# 'full' gives fully qualified names; 'public' skips names starting with "_"
my $methods = Class::Inspector->methods($class, 'full', 'public')
    or die "No methods found for $class\n";
print join("\n", sort @$methods), "\n";

There is also a project called Deobfuscator developed during the 2005 Bioinformatics course at Cold Spring Harbor Labs. The Deobfuscator displays available methods for an object type and provides links to the return types of the methods. An older version can also be found here.
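
If installing a CPAN module is not an option, walking the inheritance tree can also be sketched with the core mro module (Perl 5.10 or later). This is only an illustration under assumptions: list_methods and the Toy:: classes are made-up names, names starting with an underscore are treated as private, and methods generated at run time (for example through AUTOLOAD) will not be found.

```perl
use strict;
use warnings;
use mro;

# Hypothetical helper: collect public method names for a class and its parents.
sub list_methods {
    my ($class) = @_;
    my %owner;    # method name => first package in the lookup order that defines it
    for my $pkg (@{ mro::get_linear_isa($class) }) {
        no strict 'refs';
        my $stash = \%{"${pkg}::"};          # symbol table of this package
        for my $name (keys %$stash) {
            next if $name =~ /^_/;                     # skip private-looking names
            next unless defined &{"${pkg}::${name}"};  # keep subroutines only
            $owner{$name} //= $pkg;
        }
    }
    return map { "$owner{$_}::$_" } sort keys %owner;
}

# Toy class hierarchy, used only to demonstrate the helper.
package Toy::Base;
sub new { return bless {}, shift }
sub id  { return 'base' }

package Toy::Seq;
our @ISA = ('Toy::Base');
sub seq      { return 'ACGT' }
sub _private { return 1 }

package main;
print "$_\n" for list_methods('Toy::Seq');
```

For a real BioPerl class you would load the module first (for example with require) and then pass its name to the helper.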

Can you explain the Object Model design and rationale?

There is no simple answer to this question. Simply put, this is a toolkit which has grown organically. The goals and user audience have evolved. Some decisions have been made and we have been forced to live by them rather than destroy backward compatibility. In addition there are different philosophies of software development. The major developers on the project have tried to impose a set of standards on the code so that the project can be coordinated without every commit being cleared by a few key individuals (see Eric S. Raymond's essay "The Cathedral and the Bazaar" for different styles of running an open source project - we are clearly on the Bazaar end). Advanced BioPerl talks more about specific design goals.

The clear consensus of the project developers is that BioPerl should be consistent. This may cause us to pay the price of some copy-and-paste of code, with the Get/Set accessor methods being a sore spot for some, and the lack of using AUTOLOAD. By being consistent we hope that someone can grok the gist of a module from the basic documentation, see example code, and get a set of methods from the API documentation. We aim to make the core object design easy to understand. This has not been realized by any stretch of the imagination as the toolkit has well over 1000 modules in bioperl-live and bioperl-run alone.

That said we do want to improve things. We want to experiment with newer modules which make Perl more object-oriented. We have high hopes for some of the promises of Perl6. To try and realize this goal we are encouraging developers to play with new object models in a bioperl-experimental project.

Some useful discussion on the mailing list can be found at this node http://bioperl.org/pipermail/bioperl-l/2003-December/014406.html. We encourage you to participate in the discussion and to join in the development process either on existing BioPerl code or the bioperl-experimental code if you have a particular interest in making the toolkit more object-oriented.

Sequences

How do I parse a sequence file?

Use the Bio::SeqIO system. This will create Bio::Seq objects for you. For more information see the BioPerl Tutorials, the SeqIO HOWTO, the Bio::SeqIO Wiki page, or the Bio::SeqIO POD documentation (or type perldoc Bio::SeqIO).

I can't get sequences with Bio::DB::GenBank any more, why not?

You are probably running an old BioPerl version. NCBI changed the web CGI script that provided this access, so you must use a modern version like 1.4.x or 1.5.x.

How can I get NT_ or NM_ or NP_ accessions from NCBI (Reference sequences)?

To retrieve GenBank reference sequences, or RefSeqs, use Bio::DB::RefSeq rather than Bio::DB::GenBank or Bio::DB::GenPept. This is still an area of active development because the data providers have not provided the best interface for us to query. EBI has provided a mirror with their dbfetch system, which is accessible through the Bio::DB::RefSeq object; however, there are cases where NT_ accession numbers will not be retrievable.

How can I use Bio::SeqIO to parse sequence data to or from a string?

Use this code to parse sequence records from a string:

use IO::String;
use Bio::SeqIO;
my $stringfh = IO::String->new($string);
my $seqio = Bio::SeqIO->new(-fh => $stringfh,
                            -format => 'fasta');
while( my $seq = $seqio->next_seq ) { 
 # process each seq
}

And here is how to write to a string:

use IO::String;
use Bio::SeqIO;
my $s;
my $io = IO::String->new(\$s);
my $seqOut = Bio::SeqIO->new(-format => 'swiss', -fh => $io);
$seqOut->write_seq($seq1);
print $s; # $s contains the record in swissprot format and is stored in the string
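
Core Perl can also open a filehandle directly on a scalar, which may be an alternative when IO::String is not installed. The sketch below uses made-up FASTA data and reads the header lines by hand so that it runs without BioPerl; the assumption is that the same handle could be passed as -fh to Bio::SeqIO in place of the IO::String object.

```perl
use strict;
use warnings;

# Made-up FASTA data held in a string.
my $string = ">seq1 first test sequence\nACGTACGT\n>seq2 second test sequence\nGGCC\n";

# Core Perl: open a read-only filehandle directly on the scalar.
open my $fh, '<', \$string or die "Cannot open in-memory handle: $!";

# For illustration only, collect the ids from the header lines by hand.
# With BioPerl installed, $fh could instead be given to Bio::SeqIO->new(-fh => $fh, ...).
my @ids;
while (my $line = <$fh>) {
    push @ids, $1 if $line =~ /^>(\S+)/;
}
close $fh;

print "Found ids: @ids\n";
```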

How do I use Bio::Index::Fasta and index on different ids?

I'm using Bio::Index::Fasta in order to retrieve sequences from my indexed fasta file but I keep seeing MSG: Did not provide a valid Bio::PrimarySeqI object when I call fetch followed by write_seq() on a Bio::SeqIO handle. Why?

It's likely that fetch didn't retrieve a Bio::Seq object. There are a few possible explanations, but the most common cause is that the id you're passing to fetch is not the key to that sequence in the index. For example, if the FASTA header is >gi|12366 and your id is 12366 then fetch won't find the sequence; it expects to see gi|12366. You need to use the id_parser method to supply a function (here called get_id) that extracts the key used in indexing, like this:

my $inx = Bio::Index::Fasta->new(-filename   => $indexname,
                                 -write_flag => 1); # needed to create the index
$inx->id_parser(\&get_id);
$inx->make_index($fastaname);
 
sub get_id {
  my $header = shift;
  $header =~ /^>gi\|(\d+)/;
  $1;
}

The same issue arises when you use Bio::DB::Fasta, but in that case the code might look like this:

$inx = Bio::DB::Fasta->new($fastaname, -makeid => \&get_id);
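
A callback of this kind is easy to check on its own before building an index. The following is a sketch with made-up header lines; the fallback to the first word after ">" is an assumption added here so that records without a gi number still receive a key.

```perl
use strict;
use warnings;

# Return the gi number from a FASTA header line, or fall back to the first
# word after ">" so that every record still gets a key.
sub get_id {
    my ($header) = @_;
    return $1 if $header =~ /^>gi\|(\d+)/;
    return $1 if $header =~ /^>(\S+)/;
    return;
}

# Made-up header lines for a quick check.
for my $header ('>gi|12366|gb|U05729.1| example', '>my_local_seq example') {
    print get_id($header), "\n";
}
```

Whatever string the callback returns is the id you later have to pass when fetching the sequence.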

Cannot get an accession from GenBank when I know it is there

I'm using Bio::DB::GenBank to query GenBank and I'm certain that the id is there but I'm seeing the error MSG: acc does not exist. This is a bug in versions 1.2 and 1.2.1, but it is fixed in 1.2.2. Either upgrade to 1.2.2 or higher, or edit the module Bio::DB::GenBank and change protein to nucleotide in the BEGIN block.

Also see http://bioperl.org/pipermail/bioperl-l/2004-February/014958.html

Accession numbers are not present for FASTA sequence files

If you parse a FASTA sequence format file with Bio::SeqIO the sequences won't have the accession number. What to do?

All the data is in $seq->display_id; it just needs to be parsed out. Here is some code to set the accession number.

my ($gi,$acc,$locus);
(undef,$gi,undef,$acc,$locus) = split(/\|/,$seq->display_id);
$seq->accession_number($acc);

Why don't we just go ahead and do this? For one, we don't make any assumptions about the format of the ID part of the sequence. Perhaps the parser code could try and detect if it is a GenBank formatted ID and go ahead and set the accession number field. It would be trivial to do, just no one has volunteered the time - put it on the Project priority list if you think it is important and better yet, volunteer the code patch!

Also see http://bioperl.org/pipermail/bioperl-l/2005-August/019579.html
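
Because no assumption can be made about the ID format in general, it may be safer to check the shape of the ID before trusting the split. A sketch in plain Perl, using a made-up identifier in the gi|number|database|accession|locus style:

```perl
use strict;
use warnings;

# Made-up identifier in the NCBI "gi|number|database|accession|locus" style.
my $display_id = 'gi|12366|gb|U05729.1|EXAMPLE';

my ($gi, $acc, $version);
if ($display_id =~ /^gi\|(\d+)\|[^|]+\|([^|.]+)(?:\.(\d+))?\|/) {
    # The ID has the expected shape: keep gi number, accession, and version.
    ($gi, $acc, $version) = ($1, $2, $3);
    print "gi=$gi accession=$acc version=", (defined $version ? $version : 'none'), "\n";
}
else {
    # Any other format is left alone rather than guessed at.
    print "Unrecognized ID format: $display_id\n";
}
```

With a sequence object at hand, the extracted value could then be stored with $seq->accession_number($acc), as in the code above.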

How do I get genomic sequences when all I have is a gene identifier or name?

This question has a few different answers, so it deserves its own page.

I would like to make my own custom fasta header - how do I do this?

You want to use the method preferred_id_type(). Here's some example code:

use Bio::SeqIO;
 
my $seqin = Bio::SeqIO->new(-file => $file,
                            -format => 'genbank');
 
my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,
                            -format => 'fasta');
 
# From Bio::SeqIO::fasta
$seqout->preferred_id_type('display');
 
my $count = 1;
 
while (my $seq = $seqin->next_seq) {
    # override the regular display_id with your own
    $seq->display_id('foo'.$count); 
    $seqout->write_seq($seq);
    $count++;
}

You can pass one of the following values to preferred_id_type: "accession", "accession.version", "display", "primary". The description line is automatically appended to the preferred id type but this can also be set, like so:

$seq->desc($some_string);

Report Parsing

I want to parse BLAST, how do I do this?

As of version 1.1, BioPerl only supports one approach - the Bio::SearchIO interface. There are other BLAST parsing modules in the package, but they remain just to support older legacy code. Bio::SearchIO supports several report formats, including text BLAST, FASTA, and NCBI BLAST XML output.

It is strongly recommended you read the HOWTO:SearchIO for more information.

What was wrong with Bio::Tools::Blast?

Bio::Tools::Blast* is no longer supported, as of BioPerl version 1.1. Nothing is really wrong with it; it has just been outgrown by a more generic approach to reports. This generic approach allows us to just write pluggable modules for FASTA and BLAST parsing while using the same framework. This is completely analogous to the Bio::SeqIO system of parsing sequence files. However, the objects produced are of the Bio::Search rather than Bio::Seq variety.

I want to parse FASTA or NCBI -m7 (XML) format, how do I do this?

It is as simple as parsing text BLAST results - you simply need to specify the format as fasta or blastxml and the parser will load the appropriate module for you. You can use the exact logic and code for all of these formats as we have generalized the modules for sequence database searching. The page describing Bio::SearchIO provides a table showing how the formats match up to particular modules. Note that, for parsing BLAST XML output, you will need XML::SAX and that XML::SAX::ExpatXS is recommended to speed up parsing.
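
The shared logic can be sketched as one routine whose only format-specific part is the -format argument. The subroutine name summarize_report and the choice of printed columns are assumptions for illustration. Bio::SearchIO is loaded at run time, so the script compiles without BioPerl, but BioPerl must be installed to actually parse a report.

```perl
use strict;
use warnings;

# Sketch: print one tab-separated line per HSP from a search report.
# $format is a Bio::SearchIO format name such as 'blast', 'fasta', or 'blastxml'.
sub summarize_report {
    my ($file, $format) = @_;
    require Bio::SearchIO;    # loaded only when the routine is called
    my $in = Bio::SearchIO->new(-file => $file, -format => $format);
    my $count = 0;
    while (my $result = $in->next_result) {        # one result per query
        while (my $hit = $result->next_hit) {      # one hit per database sequence
            while (my $hsp = $hit->next_hsp) {     # one HSP per aligned region
                print join("\t", $result->query_name, $hit->name,
                           $hsp->evalue, $hsp->percent_identity), "\n";
                $count++;
            }
        }
    }
    return $count;
}

# Usage (file name is a placeholder): perl summarize.pl report.xml blastxml
summarize_report(@ARGV) if @ARGV == 2;
```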

How can I generate a pairwise alignment of two sequences?

Look at Bio::Factory::EMBOSS to see how to use the water and needle alignment programs that are part of the EMBOSS suite. Bio::Factory::EMBOSS is part of the bioperl-run package.

Or you can use the pSW module for DNA alignments or the dpAlign module for protein alignments. These are part of the bioperl-ext package; download it via Getting BioPerl.

You can also use prss34 (part of FASTA package) to assess the significance of a pairwise alignment by shuffling the sequences.

How do I get the frame for a translated search?

I'm using Bio::Search* and its frame() to parse BLAST but I'm seeing 0, 1, or 2 instead of the expected -3, -2, -1, +1, +2, +3. Why am I seeing these different numbers and how do I get the frame according to BLAST?

These are GFF frames - so +1 is 0 in GFF, -3 will be encoded with a frame of 2 with the strand being set to -1.

Frames are relative to the hit or query sequence so you need to query it based on sequence you are interested in:

$hsp->hit->strand;
$hsp->hit->frame;

or

$hsp->query->strand;
$hsp->query->frame;

So the value according to a blast report of -3 can be constructed as:

my $blastframe = ($hsp->query->frame + 1) * $hsp->query->strand;
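
To see how the conversion behaves for every combination, here is a small plain-Perl sketch of the same formula. The subroutine name blast_frame is made up, and the sketch assumes the strand is 1 or -1 as in a translated search.

```perl
use strict;
use warnings;

# Convert a GFF-style frame (0, 1, 2) plus a strand (1 or -1) into the signed
# frame that BLAST prints (+1..+3 or -1..-3).
sub blast_frame {
    my ($frame, $strand) = @_;
    return ($frame + 1) * $strand;
}

for my $strand (1, -1) {
    for my $frame (0 .. 2) {
        printf "frame=%d strand=%2d -> BLAST frame %+d\n",
            $frame, $strand, blast_frame($frame, $strand);
    }
}
```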

Can I get domain number from hmmpfam or hmmsearch output?

For example:

 SH2_5: domain 2 of 2, from 349 to 432: score 104.4, E = 1.9e-26

Not directly but you can compute it since the domains are numbered by their order on the protein:

my @domains = $hit->domains;
my $domainnum = 1;
my $total = scalar @domains;
foreach my $domain ( sort { $a->start <=> $b->start } @domains ) {
  print "domain $domainnum of $total,\n";
  $domainnum++;
}

Does Bio::SearchIO parse the HTML output that BLAST creates using the -T option?

Yes, with a twist. You can modify Bio::SearchIO's _readline() method such that it reads in the HTML and strips it of tags using the HTML::Strip module.

Please note: We do not suggest parsing BLAST HTML output if it can be avoided. We actively support XML, tabular, and text output parsing of NCBI BLAST reports only; we have never supported parsing of NCBI BLAST HTML output directly through BioPerl and will not attempt to rectify problems where HTML output parsing post-stripping of the tags breaks but parsing text output works. Consider this fair warning.

use Bio::SearchIO;
use HTML::Strip;
my $hs = HTML::Strip->new();
# replace the blast parser's _readline method with one that
# auto-strips HTML:
package Bio::SearchIO::blast; 
 
sub Bio::SearchIO::blast::_readline {
  my ($self, @args) = @_;
  my $line = $self->SUPER::_readline(@args);
  return unless defined $line;
  return $hs->parse($line);
}
# switch back to the main package, then parse using the BLAST format module
package main;
my $in = Bio::SearchIO->new(-format => 'blast', -file => $file);

Annotations and Features

How do I retrieve all the features from a sequence?

How about all the features which are exons or have a /note field that contains a certain gene name?

To get all the features:

my @features = $seq->all_SeqFeatures();

To get all the features, filtering for only those which have the primary tag (i.e., feature type) exon:

my @exons = grep { $_->primary_tag eq 'exon' } $seq->all_SeqFeatures();

To get all the features, filtering for those which have the /note tag and contain the requested string $noteval within the note field:

my @f_with_note = grep {
    my @notes = $_->has_tag('note') ? $_->each_tag_value('note') : ();
    grep { /\Q$noteval\E/ } @notes;   # \Q...\E matches $noteval literally
} $seq->all_SeqFeatures();

How do I parse the CDS join or complement statements in GenBank or EMBL files to get the sub-locations?

For example, how can I get the coordinates 45 and 122 in join(45..122,233..267) ?

You could use primary_tag to find the CDS features and the Bio::Location::SplitLocationI object to get the coordinates:

 foreach my $feature ($seqobj->top_SeqFeatures){
   if ( $feature->location->isa('Bio::Location::SplitLocationI') and $feature->primary_tag eq 'CDS' ) {
      foreach my $location ( $feature->location->sub_Location ) {
        print $location->start , ".." , $location->end, "\n";
      } 
   }
 }

How do I retrieve a nucleotide coding sequence when I have a protein gi number?

You could go through the protein's feature table and find the coded_by value. The trick is to associate the coded_by nucleotide coordinates to the nucleotide entry, which you'll retrieve using the accession number from the same feature.

use Bio::Factory::FTLocationFactory;
use Bio::SeqFeature::Generic;
use Bio::DB::GenPept;
use Bio::DB::GenBank;
 
my $gp = Bio::DB::GenPept->new;
my $gb = Bio::DB::GenBank->new;
# factory to turn strings into Bio::Location objects
my $loc_factory = Bio::Factory::FTLocationFactory->new;
 
my $prot_obj = $gp->get_Seq_by_id($protein_gi);
foreach my $feat ( $prot_obj->top_SeqFeatures ) {
   if ( $feat->primary_tag eq 'CDS' ) {
      # example: 'coded_by="U05729.1:1..122"'
      my @coded_by = $feat->each_tag_value('coded_by');
      my ($nuc_acc,$loc_str) = split /\:/, $coded_by[0];
      my $nuc_obj = $gb->get_Seq_by_acc($nuc_acc);
      # create Bio::Location object from a string
      my $loc_object = $loc_factory->from_string($loc_str);
      # create a Feature object by using a Location
      my $feat_obj = Bio::SeqFeature::Generic->new(-location =>$loc_object);
      # associate the Feature object with the nucleotide Seq object
      $nuc_obj->add_SeqFeature($feat_obj);
      my $cds_obj = $feat_obj->spliced_seq;     
      print "CDS sequence is ",$cds_obj->seq,"\n";
   }
}

How do I get the complete spliced nucleotide sequence from the CDS section?

You can use the spliced_seq method. For example:

my $seq_obj = $db->get_Seq_by_id($gi);
foreach my $feat ( $seq_obj->top_SeqFeatures ) {
   if ( $feat->primary_tag eq 'CDS' ) {
      my $cds_obj = $feat->spliced_seq;
      print "CDS sequence is ",$cds_obj->seq,"\n";
   }
}

How do I get the complete spliced sequence when the coordinates refer to Genbank identifiers?

The problematic features have coordinates like this:

 join(complement(AY421753.1:1..6),complement(3813..5699))

To retrieve this, you need to pass a Genbank database handle to the spliced_seq method. For example:

my $db = Bio::DB::GenBank->new();
my $seq_in = Bio::SeqIO->new(-file=>'funnyfile.gb', -format=>'genbank');
while ( my $seq = $seq_in->next_seq ) {
  for my $feat ( $seq->get_SeqFeatures ) {
    if ( $feat->primary_tag eq 'CDS' ) {
       my $cds = $feat->spliced_seq(-db => $db, -nosort => 0);
       print $cds->translate->seq, "\n";
    }
  }
}

How do I get the reverse-complement of a sequence using the subseq method?

One way is to pass the location to subseq in the form of a Bio::LocationI object. This object holds strand information as well as coordinates.

use Bio::Location::Simple;
my $location = Bio::Location::Simple->new(-start  => $start,
                                          -end   => $end,
                                          -strand => "-1");
# assume we already have a sequence object
my $rev_comp_substr = $seq_obj->subseq($location);
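
To check what to expect from such a call, the same operation can be sketched in plain Perl on a string. This is an illustration rather than the BioPerl implementation: revcomp_region is a made-up name, coordinates are taken as 1-based and inclusive as in BioPerl locations, and only unambiguous DNA letters are complemented.

```perl
use strict;
use warnings;

# Plain-Perl illustration: reverse-complement of the region $start..$end,
# where coordinates are 1-based and inclusive.
sub revcomp_region {
    my ($seq, $start, $end) = @_;
    my $sub = substr $seq, $start - 1, $end - $start + 1;
    my $rc  = scalar reverse $sub;      # reverse the substring
    $rc =~ tr/ACGTacgt/TGCAtgca/;       # complement each base
    return $rc;
}

print revcomp_region('AAACCGTTT', 4, 6), "\n";   # region "CCG" gives "CGG"
```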

I get the warning (old style Annotation) on new style Annotation::Collection. What is wrong?

Wow, you're using an old version! You'll see this error because the modules and interface have changed starting with BioPerl 1.0. Before v1.0 there was a Bio::Annotation module with add_Comment, add_Reference, each_Comment, and each_Reference methods.

After v1.0 there is a Bio::Annotation::Collection module with add_Annotation('comment', $ann) and get_Annotations('comment').

Please update your code in order to avoid seeing these warning messages. In the future the Reference objects will likely be implemented by the Bio::Biblio system but we hope to maintain a compatible API for these.

Utilities

How do I find all the ORFs in a nucleotide sequence? Antigenic sites in a protein? Calculate nucleotide melting temperature? Find repeats?

In fact, none of these functions are built into BioPerl but they are all available in the EMBOSS package, as well as many others. The BioPerl developers created a simple interface to EMBOSS such that any and all EMBOSS programs can be run from within BioPerl. See Bio::Factory::EMBOSS for more information; it's in the bioperl-run package.

If you can't find the functionality you want in BioPerl then make sure to look for it in EMBOSS, since these packages integrate quite gracefully with BioPerl. Of course, you will have to install EMBOSS to get this functionality.

In addition, BioPerl after version 1.0.1 contains the Pise/Bioperl modules. The Pise package was designed to provide a uniform interface to bioinformatics applications, and currently provides wrappers to more than 250 such applications! Included amongst these wrapped apps are HMMER, PHYLIP, BLAST, GENSCAN, and the EMBOSS suite. Use of the Pise/BioPerl modules does not require installation of Pise locally as it runs through the HTTP protocol of the web. Also, see the BioMOBY project for information on running applications remotely.

How do I do motif searches with BioPerl? Can I do "find all sequences that are 75% identical" to a given motif?

There are a number of approaches. Within BioPerl take a look at Bio::Tools::SeqPattern. Or, take a look at the TFBS package. This BioPerl-compliant package specializes in pattern searching of nucleotide sequence using matrices.

It's also conceivable that the combination of BioPerl and Perl's regular expressions could do the trick. You might also consider the CPAN module String::Approx (this module addresses the percent match query), but experienced users question whether its distance estimates are correct; the Unix agrep command is thought to be faster and more accurate. Finally, you could use EMBOSS, as discussed in the previous question (or you could use Pise to run EMBOSS applications). The relevant programs would be fuzzpro or fuzznuc. Complex RNA sequence secondary structural 'motifs' can be searched with Tom Macke's RNAMotif available from the Case group at Scripps. A Bio::SearchIO-based parser for RNAMotif output (Bio::SearchIO::rnamotif) exists in BioPerl v 1.5.2 and in bioperl-live.

Can I query MEDLINE or other bibliographic repositories using BioPerl?

Yes! The solution lies in Bio::Biblio*, a set of modules that provide access to MEDLINE and OpenBQS-compliant servers using SOAP. See Bio::Biblio, Bio::DB::BiblioI, scripts/biblio.PLS, or examples/biblio/* for details and example code.

How do I merge a set of sequences along with their features and annotations?

Try the cat() method in Bio::SeqUtils:

  my $merged_seq = Bio::SeqUtils->cat(@seqs);

This method uses the first sequence in the array as a foundation and adds the subsequent sequences to it, along with their features and annotations.

Running external programs

How do I run BLAST from within BioPerl?

Use the module Bio::Tools::Run::StandAloneBlast. It will give you access to many of the search tools in the NCBI BLAST suite, including blastall, bl2seq, and blastpgp. The basic structure is like this:

use Bio::PrimarySeq;
use Bio::Tools::Run::StandAloneBlast;
my $factory = Bio::Tools::Run::StandAloneBlast->new(p => 'blastn',
                                                    d => 'nt',
                                                    e => '1e-5');
my $seq = Bio::PrimarySeq->new(-id => 'test1',
                               -seq => 'AGATCAGTAGATGATAGGGGTAGA');
my $report = $factory->blastall($seq); # get back a Bio::SearchIO report

How do I tell BLAST to search multiple databases using Bio::Tools::Run::StandAloneBlast?

Put the names of the databases in a variable, like so:

my $dbs = '"/dba/BMC.fsa /dba/ALC.fsa /dba/HCC.fsa"';
my @params = ( 
   d           => $dbs,
   program     => "BLASTN",
   _READMETHOD => "Blast",
   outfile     => "$dir/est.bls"
);
 
my $factory = Bio::Tools::Run::StandAloneBlast->new(@params);
my $seqio = Bio::SeqIO->new(-file=>'t/amino.fa',-format => 'Fasta' );
my $seqobj = $seqio->next_seq();
$factory->blastall($seqobj);
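
When the database names come from a list, the quoted string can be built instead of typed by hand. A small sketch in plain Perl, reusing the example paths from above (they are placeholders for your own formatted databases):

```perl
use strict;
use warnings;

# Example paths from the text above; replace with your own databases.
my @databases = ('/dba/BMC.fsa', '/dba/ALC.fsa', '/dba/HCC.fsa');

# Separate the names with spaces and wrap the whole list in double quotes,
# so that the list reaches BLAST as one database argument.
my $dbs = '"' . join(' ', @databases) . '"';

print "$dbs\n";
```

The resulting $dbs can be used as the value of the d parameter exactly as in the code above.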

Hey, I want to run Clustalw within BioPerl, I used Bio::Tools::Run::Alignment::Clustalw before - where did it go?

Most of the Bio/Tools/Run directory was moved to a new package, bioperl-run, to help make the size of the core code smaller and separate out the more specialized nature of application running from the rest of BioPerl. You can get these modules by installing the bioperl-run package. Download it via Getting BioPerl. This changeover began in the bioperl 1.1 developer release.

What does the future hold for running applications within BioPerl?

We are trying to build a standard starting point for analysis applications, which will probably look like Bio::Tools::Run::AnalysisFactory and will allow the user to request which type of remote or local server they want to use to run their analyses. This will connect to the Pasteur Institute's PISE server and the EBI's Novella server, as well as be aware of wrappers to run applications locally.

Additionally we suggest investigating the BioPipe project, also known as bioperl-pipeline, at http://www.biopipe.org. This is a sophisticated system to chain together sets of analyses and build rules for performing these computes.

I'm trying to run Bio::Tools::Run::StandAloneBlast and I'm seeing error messages like Can't locate Bio/Tools/Run/WrapperBase.pm - how do I fix this?

This file is missing in version 1.2. Two possible solutions: install version 1.2.1 or greater or retrieve and copy WrapperBase.pm to the proper location.

Other BioPerl packages

What is bioperl-ext?

bioperl-ext is a package of code for C-extensions (hence the 'ext') to BioPerl. These include an interface to the Staden IO library (the io_lib library) for reading in chromatogram files, and Bio::Ext::Align, which is a Smith-Waterman implementation.

It is likely that functionality within bioperl-ext will eventually be replaced by the BioLib initiative.

bioperl-ext won't compile the staden IO lib part - what do I do?

Make sure you read the README about copying files over. Some more items to check off before asking.

  1. Are you sure io_lib is installed where you think it is, and that the install path is seen by Perl (did you answer the questions during perl Makefile.PL ?)
  2. Did you copy the various missing .h files (os.h config.h if I remember right) from your io_lib source directory into the install include directory when installing io_lib?
  3. When you ran make for io_lib did you see any errors or messages about how you should probably run "ranlib" on the library object?
  4. Did you run "ranlib" on the installed libread file(s)?

Note that newer versions of io_lib no longer support ABI sequences unless the Staden Package is also installed.

What is bioperl-db?

The BioPerl db package contains interfaces and adaptors that work with a BioSQL database to serialize and de-serialize Bioperl objects. Hilmar Lapp strongly recommends you use the Git version with the latest biosql-schema.

What is bioperl-microarray?

The Microarray package provides some basic tools for microarray functionality. It was started by Allen Day and may need some more work before it is a mature product.

What is bioperl-gui?

The GUI package provides a Graphical User Interface for interacting with sequence and feature objects. It is used as part of the Genquire project.

What is bioperl-pedigree?

The Pedigree package was started by Jason Stajich and provides basic tools for interacting with pedigree data and rendering pedigree plots.

BioPerl-related questions

I am using Ensembl. How do I do XYZ?

Though BioPerl is used in Ensembl, the version used is rather old and most of the sequence parsing infrastructure has evolved beyond using BioPerl directly (see below for an explanation). The best place to look for answers to any Ensembl-related matter is the Ensembl mail list and web site.

Why is the version of BioPerl (v.1.2.3) used in Ensembl so old? Haven't there been bug fixes?

Ewan Birney has answered this a few times on the Ensembl mail list. In short (in this thread):

Ensembl doesn't make heavy use of Bioperl anymore - most of the critical things
we re-wrote, mainly due to speed/memory issues. I think the short answer is that
it _probably_ works with 1.5, but we don't have a strong desire to move up
as certainly there are no problems with the 1.2.3 release we are using.

The only true impediment in upgrading appears to be reported problems with blastview code, though this hasn't been confirmed using the latest BioPerl code.

Ewan mentions in another thread:

Ensembl has slowly migrated away from Bioperl, mainly due to speed
issues in the bioperl framework. One project kicked around (by me mainly) is
to make a thin "facade" across Ensembl objects to make them Bioperl compliant,
handling for example what the "get_Annotations()" call will actually do (if you
look into the Annotation objects in bioperl you will get a sense of why
we can't have these objects in the main Ensembl API- way too heavyweight).

If one were interested in this it may be something to bring up on the ensembl-dev or bioperl-l mail list to gauge whether there is enough interest...
