HOWTO:Multiple Sequence Alignment

From BioPerl
Jump to: navigation, search

Contents

Author

Jun Yin, [1]. Email:jun.yin-at-ucd.ie

Abstract

This HOWTO intends to show how to use the BioPerl Bio::SimpleAlign and Bio::DB::Align to manipulate Multiple Sequence Alignment (MSA) sequences and retrieve online alignment sequences.

Please notice that, the modules mentioned in this tutorial, Bio::AlignIO, Bio::SimpleAlign and Bio::DB::Align, read/write/edit alignment sequences, but not to do the alignment. If you want to align your sequences, you may consider clustalw2. The packages introduced here deal with the output file from clustalw2 and other alignment softwares.

In BioPerl, alignment sequence reading in and writing out are performed using Bio::AlignIO. Online alignement sequence can be retrieved using Bio::DB::Align. These two packages (actually bundle of packages) will save the alignment into Bio::SimpleAlign objects. Bio::SimpleAlign provides dozens of methods to manipulate the alignment, calculate consensus sequences and annotate the alignment.

Refactoring of Bio::SimpleAlign and implementation of Bio::DB::Align were funded as part of the Google Summer of Code project, under the guidance of Chris Fields in Open Bioinformatics Foundation.

Reading and Writing MSA sequences

Reading in and writing out MSA sequences in local file are performed using Bio::AlignIO. Here is an example code:

use Bio::AlignIO;
use Bio::SimpleAlign;
 
# Use Bio::AlignIO to read in the alignment
my $str = Bio::AlignIO->new(-file => 't/data/testaln.pfam');
my $aln = $str->next_aln();
 
 
# do something here
print $aln->length;
print $aln->num_residues;
print $aln->is_flush;
 
foreach my $seq ($aln->next_Seq) {
    #do something with $seq
    print $seq->get_nse;
}

Retrieving online alignment sequences

Bio::DB::Align uses RESTful service provided by the database to retrieve online alignment sequences. The alignment sequences are saved into Bio::SimpleAlign object. Now BioPerl supports Pfam, Rfam, Prosite and NCBI Protein Clusters (ProtClustDB).

Pfam

This package uses the RESTful service provided by Pfam. It retrieves online alignment sequences and save it into Bio::SimpleAlign object.

The retrieval can be based on protein domain id, e.g. "Piwi", or accession number, e.g. "PF02171".

use Bio::DB::Align;
use Bio::DB::Align::Pfam;
use Bio::SimpleAlign; 
 
my $dbobj  = Bio::DB::Align::Pfam->new(); #create a db object
my $dbobj2 = Bio::DB::Align->new(-db=>"Pfam"); #create a db object
                                           #the same with above
 
#retrieve a Bio::SimpleAlign object
my $aln=$dbobj->get_Aln_by_id("Piwi"); 
 
#do something here with the align object
print $aln->length,"\n";
print $aln->num_sequences,"\n";	
 
foreach my $seq ($aln->next_Seq) {
	#do something with $seq
}

Or you can do more complicated calling such as:

#parameter based calling
my $aln2 = $dbobj->get_Aln_by_acc(-accession=>"PF02171",
                                -alignment=>"full",
                                -format=>"stockholm",
                                -order=>"a",
                                -case=>"u",
                                -gap=>"dots");
print $aln2->id,"\n";
print $aln2->accession,"\n";

You can also convert the id into accession and vice versa.

#id accession conversion
my $acc = $dbobj->id2acc("Piwi");
my $id  = $dbobj->acc2id("PF02171");

Rfam

This package uses the RESTful service provided by Rfam. It retrieves online alignment sequences and save it into Bio::SimpleAlign object. Bio::DB::Align::Rfam only supports alignment sequence retrieval using RNA accession number, e.g. "RF00360".


use Bio::DB::Align;
use Bio::DB::Align::Rfam;
use Bio::SimpleAlign; 
 
my $dbobj=Bio::DB::Align::Rfam->new(); #create a db object
my $dbobj2=Bio::DB::Align->new(-db=>"Rfam"); #create a db object
                                           #the same with above
 
#retrieve a Bio::SimpleAlign object
my $aln=$dbobj->get_Aln_by_acc("RF00360"); 
 
#or parameter based calling
my $aln2=$dbobj->get_Aln_by_acc(-accession=>"RF00360",
                                -alignment=>"full",
                                -nselabel=>1, 
                                -format=>"stockholm",
                                -gap=>"dashes");

Prosite

This package uses the RESTful service provided by Prosite. It retrieves online alignment sequences and save it into Bio::SimpleAlign object. Bio::DB::Align::Prosite only supports alignment sequence retrieval using protein domain accession number, e.g. "RF00360".

use Bio::DB::Align;
use Bio::DB::Align::Prosite;
use Bio::SimpleAlign; 
 
my $dbobj=Bio::DB::Align::Prosite->new(); #create a db object
my $dbobj2=Bio::DB::Align->new(-db=>"Prosite"); #create a db object
                                           #the same with above
 
#retrieve a Bio::SimpleAlign object
my $aln=$dbobj->get_Aln_by_acc("PS51092"); 
 
#or parameter based calling
my $aln2=$dbobj->get_Aln_by_acc(-accession=>"PS00023",
                                -format=>"clustalw");

ProtClustDB

This package supports alignment sequence retrieval of protein clusters from NCBI Entrez Protein Clusters (ProtClustDB). It retrieves online alignment sequences and save it into Bio::SimpleAlign object.This package depends on Bio::DB::EUtilities, which is used to convert id and accession.

The retrieval can be based on protein cluster id, e.g. "2725839", or accession number, e.g. "CLSN2725839".

You may need to provide your email address to use this service, as per NCBI E-utilities policy.

use Bio::DB::Align;
use Bio::DB::Align::ProtClustDB;
use Bio::SimpleAlign; 
 
#create a db object, using either of the two methods below
my $dbobj=Bio::DB::Align::ProtClustDB->new(-email=>'bioperl-test@foo.bar'); 
my $dbobj2=Bio::DB::Align->new(-db=>"ProtClustDB",
                               -email=>'bioperl-test@foo.bar'); 
 
 
#retrieve a Bio::SimpleAlign object
my $aln=$dbobj->get_Aln_by_id("2725839"); 
 
#or parameter based calling
my $aln2=$dbobj->get_Aln_by_acc(-accession=>"CLSN2725839");
 
#id accession conversion
my $acc=$dbobj->id2acc("2725839");
my $id=$dbobj->acc2id("CLSN2725839");

Alignment description

After you read in the alignment, a Bio::SimpleAlign object will be built. You can use the methods below to check the attributes of the alignment. These may include general MSA attributes, e.g. id, accession, or sequence related descriptors, e.g. No. of residues, No. of sequences, and sequence percentage identity.

Here are some example codes returning alignment attributes.

my $id        = $aln->id;             #id of the alignment
my $accession = $aln->accession;      #accession of the alignment
my $desc      = $aln->description;    #description
my $source    = $aln->source;         #sequence source, usually the format of the sequence
my @symchars  = $aln->symbol_chars    #Returns all the seen symbols except gap/missing/match characters

Here are some example codes returning sequence based descriptors.

my $score     = $aln->score;           #score of the alignment
my $length    = $aln->length;          #length of the alignment block
my $num_res   = $aln->num_residues;    #number of residues in total in the alignment
my $num_seq   = $aln->num_sequences;   #number of sequences in the alignment

Bio::SimpleAlign provides several methods to calculate sequence percentage identity.

#This function calculates the percentage identity of the conserved columns
my $iden1          = $aln->average_percentage_identity;
 
#This function uses a fast method to calculate the average percentage identity of the alignment
my $iden2          = $aln->overall_percentage_identity;
 
#Returns pairwise percentage identity of each sequence to the reference sequence(first sequence as default)
my @pairwise_iden1 = $aln->pairwise_percentage_identity;
 
#Returns pairwise percentage identity of each sequence to the 3rd sequence 
my @pairwise_iden2 = $aln->pairwise_percentage_identity(3);

Selecting sequences or horizontal/vertical subsets of the current MSA

You may select horizontal (sequences) or vertical (columns) subtsets of the current MSA. Here is some example code of doing that. Beware that, in BioPerl sequence and columns are usually "1" based, e.g. the first sequence is 1, and the first column is also 1.

Selecting sequences in the alignment

Here are some codes to select horizontal subsets (sequences) of the alignment:

#selecting sequence object one by one
foreach my $seq ($aln->next_Seq) {
    #do something with $seq
    print $seq->seq();
}
 
#the sequence objects are returned in alphabetically sorted order
 
foreach my $seq ($aln->next_alphabetically) {
    #do something with $seq
    print $seq->seq();
}
 
 
#selecting a specific sequence by position in the alignment
 
my $seq3 = $aln->get_Seq_by_pos(3); #the third sequence
 
#selecting a specific sequence by id
 
my $seq1 = $aln->get_Seq_by_id("JAK_HUMAN");
 
 
#selecting multiple sequences by position, and save it into another alignment object
 
my $aln2 = $aln->select_Seqs([1..5,9,11]);
 
#parameter based way of doing that, and reverse the selection
#If you have 11 sequences in the alignment, this code below is the same with the code above
my $aln3 = $aln->select_Seqs(-selection=>[6..8,10],-toggle=>1);

Selecting columns in the alignment

Here are some codes to select/de-select columns in the alignment. The result will be saved into a new alignment object. The gap only sequences are usually removed after the selection, unless you defined -keepgaponly=>1 in the selection.

#For example, if an alignment has 35 columns, all the methods below are doing the same thing.
#these methods all select the 3rd to 5th, and 28th to 32nd columns and save them into a new alignment object
my $newaln  = $aln->select_columns([3...5,28..32]);
my $newaln2 = $aln->select_columns(-selection=>[1..2,6..27,33..35],-toggle=>1);
my $newaln3 = $aln->remove_columns([1..2,6..27,33..35]);
my $newaln4 = $aln->remove_columns([3...5,28..32],1);
my $newaln5 = $aln->remove_columns(-selection=>[3...5,28..32],-toggle=>1);
 
 
#You can also select the columns based on the type of these columns
my $aln2 = $aln->select_columns(['mismatch']);
my $aln2 = $aln->remove_columns(['weak']);
 
 
#keep gap only sequences
my $aln2 = $aln->select_columns(-selection=>['mismatch'],-keepgaponly=>1);
my $aln2 = $aln->select_columns(['mismatch'],0,1); # doing the same thing as above
 
#remove gaps
my $aln3 = $aln->remove_gaps;

Modifying the current alignment

These methods will modify the original alignment object, so use them with care.

Add or Remove sequences

#For example, we have three Bio::LocatableSeq objects, we can add them into an alignment object.
#Beware that, the sequences should be in the same length, otherwise they will be padded to the same length.
 
 
my $s1 = Bio::LocatableSeq->new(
    -id => 'AAA',
    -seq => 'aawtat-tn-',
    -start => 12,
    -end => 19,
    -alphabet => 'dna'
);
my $s2 = Bio::LocatableSeq->new(
    -id => 'BBB',
    -seq => '-aaaat-tt-',
    -start => 1,
    -end => 7,
    -alphabet => 'dna'
);
my $s3 = Bio::LocatableSeq->new(
    -id => 'BBB',
    -seq => '-aaaat-t--',
    -start => 31,
    -end => 36,
    -alphabet => 'dna'
);
 
 
$a = Bio::SimpleAlign->new();
$a->add_Seq($s1);
$a->add_Seq($s2);
$a->add_Seq($s3);
 
#Now $a has three sequence objects.
 
#We can also remove sequence objects from the alignment object.
 
$a->remove_LocatableSeq($s1);
 
#Now $a has two sequence objects

Sort sequences

$aln->sort_alphabetically;         #Sort the alignment by the alphabetical order of sequence ID
$aln->sort_by_list($list_file);    #Sort the alignmetn by the given list file
$aln->sort_by_pairwise_identity;   #Sort by the pairwise percentage identity of the first (default) or the given sequence
$aln->sort_by_length;              #Sort by the length the sequences
$aln->sort_by_start;               #Sort by the start of the sequences

Remove sequences

$aln->remove_Seqs([1..3,5..8]);    #Remove the selected sequences from the alignment
$aln->remove_redundant_Seqs(0.7);  #Remove redundant sequences by defined percentage
$aln->uniq_Seq;                    #Only retain the non-redundant sequences

Calculating consensus sequences

Bio::SimpleAlign provides several methods calculating consensus sequences. For details of these methods, please refer to the description of these methods in the POD file.

my $str     = $aln->consensus_string(0.5);#The consensus residue has to appear at least threshold 50%
my $str     = $aln->consensus_iupac; #Makes a consensus using IUPAC ambiguity codes
my $str     = $aln->bracket_string; #a string in BIC format. This is used for allelic comparisons.
my $str     = $aln->match_line; #Generates a match line
my $str     = $aln->gap_line;#Generates a gap line
my $str     = $aln->all_gap_line; #Generates a gap line, where '-' represents all-gap column
my %cigars  = $aln->cigar_line; #Generates a "cigar" (Compact Idiosyncratic Gapped Alignment Report) line for each sequence in the alignment.

Modifying alignment sequences

Change the letters in the sequences

These methods can be used to modify the sequences in the alignment. For example, you can substitute a specific character(map_chars), capitalize the letters (uppercase), or change the alignment into match characters (match). Here are some code examples:

$aln->map_chars( '\.', '-' );      #Substitute the '.' into '-'. Maybe useful to assign your own gap char
 
$aln->uppercase;                   #Change all the letters in the alignment to uppercase
$aln->lowercase;                   #Change all the letters in the alignment to lowercase
$aln->togglecase;                  #Change all the letters in the alignment to opposite case
 
$aln->match;                       #Goes through all columns and changes residues that are identical to residue in first sequence (reference sequence) to  match character
$aln->unmatch;                     #Undo the previous change

Set reference sequence

You can set the new reference sequence in the alignment, which is to move the assigned sequence to the first position. The reference sequence (the first sequence) is used as the default sequence in methods such as $aln->match, $aln->pairwise_percentage_identity, and $aln->remove_gaps.

$aln->set_new_reference(3);            #Select the 3rd sequence as the reference (1st) sequence
Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox