BioPerl | Home

AlignIO and SimpleAlign HOWTO

Authors

Brian Osborne briano at bioteam.net

Peter Schattner

Abstract

This is a HOWTO that talks about using Bio::AlignIO and Bio::SimpleAlign to create and analyze alignments. It also discusses how to run various applications that create alignment files. See the list of alignment formats for details on each format.

AlignIO

Data files storing multiple sequence alignments appear in varied formats and is the Bioperl object for conversion of alignment files. AlignIO is patterned on the object and its commands have many of the same names as the commands in . Just as in the object can be created with -file and -format options:

use Bio::AlignIO; my $io = Bio::AlignIO->new(
                         -file => "receptors.aln",
                         -format => "clustalw" );

If the -format argument isn’t used then Bioperl will try and determine the format based on the file’s suffix, in a case-insensitive manner. Here is the current set of input formats:

Format	Suffixes	Comment
bl2seq
ClustalW	`aln`
emboss	`water needle`
fasta	`fasta fast seq fa fsa nt aa`
maf	`maf`
mase		Seaview
mega	`meg ega`
meme	`meme`
metafasta
msf	`msf pileup gcg`	GCG
nexus	`nexus nex`
pfam	`pfam pfm`	Pfam
phylip	`phylip phlp phyl phy ph`	interleaved
po		POA
prodom
psi	`psi`	PSI-BLAST
selex	`selex slx selx slex sx`	HMMER
stockholm	`stk`	Rfam, Pfam
XMFA	`xmfa`
arp	`arp`	Arlequin

Table 1. AlignIO input formats

The emboss format refers to the output of the water, needle, matcher, stretcher, merger, and supermatcher applications. See http://emboss.sourceforge.net.

Bio::AlignIO reads many formats but does not write in every format (the same is true for Bio::SeqIO). AlignIO currently supports output in these formats:

fasta
mase
selex
clustalw
msf
phylip
po
stockholm
XMFA
metafasta

The basic syntax for AlignIO is similar to that of SeqIO:

use Bio::AlignIO;

$in = Bio::AlignIO->new(-file => "inputfilename" ,
                        -format => 'fasta');

$out = Bio::AlignIO->new(-file => ">outputfilename",
                         -format => 'pfam');

while ( my $aln = $in->next_aln ) {
    $out->write_aln($aln);
}

The returned object, $aln, is a Bio::SimpleAlign object rather than a Bio::Seq object.

Bio::AlignIO also supports the tied filehandle syntax described for Bio::SeqIO.

SimpleAlign

Once one has identified a set of similar sequences, one often needs to create an alignment of those sequences.

Bio::AlignIO objects can be produced by bioperl-run alignment creation objects (e.g. Clustalw.pm, BLAST’s bl2seq, TCoffee.pm, and Lagan.pm or they can be read in from files of multiple-sequence alignments in various formats using AlignIO.

Some of the manipulations possible with include:

slice() : Obtaining an alignment slice, that is, a subalignment inclusive of specified start and end columns. Sequences with no residues in the slice are excluded from the new alignment and a warning is printed.
column_from_residue_number() : Finding column in an alignment where a specified residue of a specified sequence is located.
consensus_string() : Making a consensus string. This method includes an optional threshold parameter, so that positions in the alignment with lower percent-identity than the threshold are marked by ? in the consensus
percentage_identity() : A fast method for calculating the average percentage identity of the alignment
consensus_iupac() : Making a consensus using IUPAC ambiguity codes from DNA and RNA.

Skeleton code for using some of these features is shown below. More detailed, working code is in examples/align_on_codons.pl, also in the examples/align directory. Additional documentation on methods can be found in Bio::SimpleAlign and Bio::LocatableSeq.

use Bio::SimpleAlign;

$aln = Bio::SimpleAlign->new('t/data/testaln.dna');

$threshold_percent = 60;

$consensus_with_threshold = $aln->consensus_string($threshold_percent);

# dna/rna alignments only
$iupac_consensus = $aln->consensus_iupac();

$percent_ident = $aln->percentage_identity;

$seqname = '1433_LYCES';

$pos = $aln->column_from_residue_number($seqname, 14);

Aligning 2 sequences with Blast using bl2seq and AlignIO

As an alternative to Smith-Waterman, two sequences can also be aligned in Bioperl using the bl2seq option of Blast within the object. To get an alignment - in the form of a object - using bl2seq, you need to parse the bl2seq report with the file format reader as follows:

$factory = Bio::Tools::Run::StandAloneBlast->new(-outfile => 'bl2seq.out');
$bl2seq_report = $factory->bl2seq($seq1, $seq2);

# Use AlignIO.pm to create a SimpleAlign object from the bl2seq report
$str = Bio::AlignIO->new(-file => 'bl2seq.out',
                        -format => 'bl2seq');
$aln = $str->next_aln();

Aligning multiple sequences with Clustalw.pm and TCoffee.pm

For aligning multiple sequences (i.e. two or more), Bioperl offers a perl interface to the clustalw, muscle, and tcoffee programs. TCoffee is a more recent program - derived from clustalw - which has been shown to produce better results for local MSA.

To use these capabilities, the clustalw, muscle, or tcoffee programs need to be installed on the host system. In addition, the environmental variables CLUSTALDIR and TCOFFEEDIR need to be set to the directories containg the executables.

From the user’s perspective, the Bioperl syntax for calling Bio::Tools::Run::Alignment::Clustalw, Bio::Tools::Run::Alignment::TCoffee, or Bio::Tools::Run::Alignment::Muscle or is almost identical. The only differences are the names of the modules themselves appearing in the initial use, and constructor statements and the names of the some of the individual program options and parameters.

In either case, initially, a factory object must be created. The factory may be passed most of the parameters or switches of the relevant program. In addition, alignment parameters can be changed or examined after the factory has been created. Any parameters not explicitly set will remain as the underlying program’s defaults. Application output is returned in the form of a object. It should be noted that some clustalw and TCoffee parameters and features (such as those corresponding to tree production) have not been implemented yet in the Perl interface.

Once the factory has been created and the appropriate parameters set, one can call the method align() to align a set of unaligned sequences, or profile_align() to add one or more sequences or a second alignment to an initial alignment. Input to align() consists of a set of unaligned sequences in the form of the name of file containing the sequences or a reference to an array of objects. Typical syntax is shown below. We illustrate with clustalw, but a similar syntax - except for the module name - would work for TCoffee, or muscle.

use Bio::Tools::Run::Alignment::Clustalw;

$factory = Bio::Tools::Run::Alignment::Clustalw->new(-matrix => 'BLOSUM');
$ktuple = 3;
$factory->ktuple($ktuple);

# @seq_array is an array of Bio::Seq objects
$seq_array_ref = @seq_array;

$aln = $factory->align($seq_array_ref);

Clustalw.pm and TCoffee.pm can also align two (sub)alignments to each other or add a sequence to a previously created alignment by using the profile_align() method. For further details on the required syntax and options for the profile_align() method, the user is referred to Bio::Tools::Run::Alignment::Clustalw and Bio::Tools::Run::Alignment::TCoffee. The user is also encouraged to examine the script clustalw.pl in the examples/align directory.

Manipulating clusters of sequences Cluster and ClusterIO

Sequence alignments are not the only examples in which one might want to manipulate a group of sequences together. Such groups of related sequences are generally referred to as clusters. Examples include Unigene clusters and gene clusters resulting from clustering algorithms being applied to microarray data.

The Bioperl Cluster and modules are available for handling sequence clusters. Code to read in a Unigene cluster (in the NCBI XML format) and then extract individual sequences for the cluster for manipulation might look like this:

my $stream = Bio::ClusterIO->new(-file => "Hs.data", -format => "unigene");

while ( my $in = $stream->next_cluster ) {
    print $in->unigene_id() . "\n";
    while ( my $sequence = $in->next_seq ) {
        print $sequence->accession_number . "\n";
    }
}

See Bio::Cluster::UniGene for more details.