BioPerl Tutorial

From BioPerl
(Redirected from
Jump to: navigation, search



Bioperl Tutorial - a tutorial for Bioperl

Most of the content of the BioPerl tutorial was removed and put into HOWTOs in 2011. All that remains are these unrelated sections.


Written by Peter Schattner <schattner at>

Added to BioPerl Wiki by Jay Hannah (Pod::Simple::Wiki, format mediawiki)

Copyright Peter Schattner

Contributions, additions and corrections have been made to this document by the following individuals:

Identifying amino acid cleavage sites (Sigcleave)

For amino acid sequences we may be interested to know whether the amino acid sequence contains a cleavable signal sequence for directing the transport of the protein within the cell. SigCleave is a program (originally part of the EGCG molecular biology package) to predict signal sequences, and to identify the cleavage site based on the von Heijne algorithm.

The threshold setting controls the score reporting. If no value for threshold is passed in by the user, the code defaults to a reporting value of 3.5. SigCleave will only return score/position pairs which meet the threshold limit.

There are 2 accessor methods for this object. signals() will return a perl hash containing the sigcleave scores keyed by amino acid position. pretty_print() returns a formatted string similar to the output of the original sigcleave utility.

The syntax for using Sigcleave is as follows:

# create a Seq object, for example:
use Bio::Tools::Sigcleave;
$sigcleave_object = new Bio::Tools::Sigcleave
    ( -seq          => $seqobj,
      -threshold    => 3.5,
      -description  => 'test sigcleave protein seq',
%raw_results      = $sigcleave_object->signals;
$formatted_output = $sigcleave_object->pretty_print;

Please see Bio::Tools::Sigcleave for details.

Miscellaneous sequence utilities: OddCodes, SeqPattern


For some purposes it's useful to have a listing of an amino acid sequence showing where the hydrophobic amino acids are located or where the positively charged ones are. Bioperl provides this capability via the module Bio::Tools::OddCodes.

For example, to quickly see where the charged amino acids are located along the sequence we perform:

use Bio::Tools::OddCodes;
$oddcode_obj = Bio::Tools::OddCodes->new($amino_obj);
$output = $oddcode_obj->charge();

The sequence will be transformed into a three-letter sequence (A,C,N) for negative (acidic), positive (basic), and neutral amino acids. For example the ACDEFGH would become NNAANNC.

For a more complete chemical description of the sequence one can call the chemical() method which turns sequence into one with an 8-letter chemical alphabet { A (acidic), L (aliphatic), M (amide), R (aromatic), C (basic), H (hydroxyl), I (imino), S (sulfur) }:

$output = $oddcode_obj->chemical();

In this case the sample sequence ACDEFGH would become LSAARAC.

OddCodes also offers translation into alphabets showing alternate characteristics of the amino acid sequence such as hydrophobicity, "functionality" or grouping using Dayhoff's definitions. See the documentation in Bio::Tools::OddCodes for further details.


The SeqPattern object is used to manipulate sequences using Perl regular expressions. A key motivation for SeqPattern is to have a way of generating a reverse complement of a nucleic acid sequence pattern that includes ambiguous bases and/or regular expressions. This capability leads to significant performance gains when pattern matching on both the sense and anti-sense strands of a query sequence are required. Typical syntax for using SeqPattern is shown below. For more information, there are several interesting examples in the script in the examples/tools directory.

use Bio::Tools::SeqPattern;
$pattern     = '(CCCCT)N{1,200}(agggg)N{1,200}(agggg)';
$pattern_obj = new Bio::Tools::SeqPattern(-SEQ  => $pattern,
                                          -TYPE => 'dna');
$pattern_obj2 = $pattern_obj->revcom();
$pattern_obj->revcom(1); # returns expanded rev complement pattern.

More detail can be found in Bio::Tools::SeqPattern.

Converting coordinate systems (Coordinate::Pair, RelSegment)

Coordinate system conversion is a common requirement, for example, when one wants to look at the relative positions of sequence features to one another and convert those relative positions to absolute coordinates along a chromosome or contig. Although coordinate conversion sounds pretty trivial it can get fairly tricky when one includes the possibilities of switching to coordinates on negative (i.e. Crick) strands and/or having a coordinate system terminate because you have reached the end of a clone or contig. Bioperl has two different approaches to coordinate-system conversion (based on the modules Bio::Coordinate::Pair and Bio::DB::GFF::RelSegment, respectively).

The Coordinate::Pair approach is somewhat more "low level". With it, you define an input coordinate system and an output coordinate system, where in each case a coordinate system is a triple of a start position, end position and strand. The end position is especially important when dealing with unfinished assemblies where the coordinate system ends when one reaches the end of the sequence of a clone or contig. Once one has defined the two coordinate systems, one defines a Coordinate::Pair to map between them. Then one can map positions between the coordinates systems with code such as this:

$input_coordinates = Bio::Location::Simple->new  
(-seq_id => 'propeptide', -start => 1000, -end => 2000, -strand=>1 );
$output_coordinates = Bio::Location::Simple->new  
(-seq_id => 'peptide', -start => 1100, -end => 2100, -strand=>1 );
$pair = Bio::Coordinate::Pair->new
(-in => $input_coordinates , -out => $output_coordinates );
$pos = Bio::Location::Simple->new (-start => 500, -end => 500 );
$res = $pair->map($pos);
$converted_start = $res->start;

In this example $res is also a Bio::Location object, as you'd expect. See the documentation for Bio::Coordinate::Pair and Bio::Coordinate::GeneMapper for more details.

The Bio::DB::GFF::RelSegment approach is designed more for handling coordinate transformations of sequence features rather than for transforming arbitrary coordinate systems. With Bio::DB::GFF::RelSegment you define a coordinate system relative to a specific feature (called the "refseq"). You also have access to the absolute coordinate system (typically of the entire chromosome.) You can determine the position of a feature relative to some other feature simply by redefining the relevant reference feature (i.e. the "refseq") with code like this:

$db = Bio::DB::GFF->new(-dsn     => 'dbi:mysql:elegans',
                        -adaptor => 'dbi:mysqlopt');
$segment = $db->segment('ZK909');
$relative_start = $segment->start;  # $relative_start = 1;
# Now retrieve the start position of ZK909 relative to feature ZK337
$relative_start = $segment->start;
# Now retrieve the start position of ZK909 relative to the entire chromosome
$absolute_start =  $segment->abs_start;

This approach is convenient because you don't have to keep track of coordinates directly, you just keep track of the name of a feature which in turn marks the coordinate-system origin. However, this approach does require that you have stored all the sequence features in GFF format. Moreover, Bio::DB::GFF::RelSegment has been principally developed and tested for applications where all the sequence features are stored in a bioperl-db relational database. However, if one wants to use the Bio:DB::GFF machinery (including its coordinate transformation capabilities) without building a local relational database, this is possible by defining the 'database' as having an adaptor called 'memory', e.g.

$db = Bio::DB::GFF->new(-adaptor => 'memory' );

For more details on coordinate transformations and other GFF-related capabilities in Bioperl see Bio::DB::GFF::RelSegment, and Bio::DB::GFF.

Personal tools
Main Links