A detailed description of Bio::Tools::Run::Maq, a wrapper for creating short-read assemblies with the Maq assembler.
Mark A. Jensen
maj -at- fortinbras -dot- us
# create an assembly $maq_fac = Bio::Tools::Run::Maq->new(); $maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas' ); # paired-end $maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas', 'paired-reads.fastq'); # be more strict $maq_fac->set_parameters( -c2q_min_map_quality => 60 ); $maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas', 'paired-reads.fastq'); # run maq commands separately $maq_fac = Bio::Tools::Run::Maq->new( -command => 'pileup', -single_end_quality => 1 ); $maq_fac->run_maq( -bfa => 'refseq.bfa', -map => 'maq_assy.map', -txt => 'maq_assy.pup.txt' );
Bio::Tools::Run::Maq and other Bioperl dependencies are found in the bioperl/bioperl-run repository. Other dependencies are found in bioperl/bioperl-live.
This module provides a wrapper interface for Heng Li’s reference-directed short read assembly suite
maq. The module won’t work unless you download and install the
maq software; see the Sourceforge site.
There are two modes of action.
The first is a simple pipeline through the
maq commands, taking your read data in and squirting out an assembly object of type Bio::Assembly::IO::maq. The pipeline is based on the one performed by
|data conversion to maq binary formats||
|map sequence reads to reference seq||
|assemble, creating consensus||
|convert map & cns files to plaintext||
Command-line options can be directed to the
cns2fq steps. See Specifying Options.
The second mode is direct access to
maq commands. To run a command, construct a run factory, specifying the desired command using the
-command argument in the factory constructor, along with options specific to that command (see Specifying Options):
$maqfac->Bio::Tools::Run::Maq->new( -command => 'fasta2bfa' );
To execute, use the
run_maq methods. Input and output files are specified in the arguments of
run_maq (see Specifying Files):
$maqfac->run_maq( -fas => "myref.fas", -bfa => "myref.bfa" );
maq is complex, with many subprograms (commands) and command-line options and file specs for each. This module attempts to provide commands and options comprehensively. You can browse the choices like so:
$maqfac = Bio::Tools::Run::Maq->new( -command => 'assemble' ); # all maq commands @all_commands = $maqfac->available_parameters('commands'); @all_commands = $maqfac->available_commands; # alias # just for assemble @assemble_params = $maqfac->available_parameters('params'); @assemble_switches = $maqfac->available_parameters('switches'); @assemble_all_options = $maqfac->available_parameters();
Reasonably mnemonic names have been assigned to the single-letter command line options. These are the names returned by
available_parameters, and can be used in the factory constructor like typical BioPerl named parameters.
Options can be directed to the
cns2fq components of the assembly pipeline implemented by the
run() method. Identify the desired options, for example
@map_params = Bio::Tools::Run::Maq->new(-command => 'map' )-> available_parameters(); # returns: # adaptor_file # first_read_length # max_hits # max_mismatches # max_outer_distance # max_outer_distance_rf # mismatch_dump # mismatch_posn_dump # mismatch_thr # mutation_rate # second_read_length # unmapped_dump
then in the factory construction, specify the desired parameters prefixed by
c2q_, as appropriate (note, no
$maqfac = Bio::Tools::Run::Maq->new( -map_max_mismatches => 1 ); $assy = $maqfac->run( "read1.fastq", "refseq.fas" );
maq manpage for many gory details.
When a command requires filenames, these are provided to the
run_maq method, not the constructor (
new()). To see the set of files required by a command, use
available_parameters('filespec') or the alias
$maqfac = Bio::Tools::Run::Maq->new( -command => 'map' ); @filespec = $maqfac->filespec;
This example returns the following array:
map bfa bfq1 #bfq2 2>#log
This indicates that map (
maq binary mapfile), bfa (
maq binary fasta), and bfq (
maq binary fastq) files must be specified, another bfq file may be specified, and a log file receiving STDERR also may be specified. Use these in the
run_maq call like so:
$maqfac->run_maq( -map => 'my.map', -bfa => 'myrefseq.bfa', -bfq1 => 'reads1.bfq', -bfq2 => 'reads2.bfq' );
-log parameter was unspecified. Therefore, the object will store the programs STDERR output for you in the
handle_map_warning($maqfac) if ($maqfac->stderr =~ /warning/);
STDOUT for a run is also saved, in
stdout(), unless a file is specified to slurp it according to the filespec.
maq STDOUT usually contains useful information on the run.
Bio::Tools::Run::Maq, like the other assembler run modules, stores the assembly in a Bio::Assembly::Scaffold
object by default. This object is built using Bio::Assembly::IO::maq
, a read-only IO module that parses the
maq consensus file (
.cns.fastq) and map file (
.maq) produced by the execution of
Bio::Tools::Run::Maq::run(). In the code:
$fac = Bio::Tools::Run::Maq->new(); $assy = $fac->run('reads.faq', 'refseq.fas');
$assy is the
Scaffold object. This code, if successful, will have produced two temporary files in the execution directory, with
To specify a name for the assembly output files, do
$fac = Bio::Tools::Run::Maq->new(); $fac->out_type('myassy.maq'); $assy = $fac->run('reads.faq', 'refseq.fas');
maq map file (converted by
maq mapview) will be called
myassy.maq in this example; the consensus file will be
myassy.cns.fastq. Both files are required to read in an assembly using Bio::Assembly::IO, but only the
.maq is specified in the parameters:
$assy = Bio::Assembly::IO->new( -file => 'myassy.maq' );
Since the reference sequence is integrated into the
maq always creates a single “contig” that includes all reads that match the reference sequence. Bio::Assembly::IO::maq
divides the assembly into separate contigs by examining the consensus quality, and returning contigs that represent contiguous regions of non-zero PHRED quality values. Singlets are contigs containing only one read; as required by
Bio::Assembly::Scaffold, these are treated specially and can be accessed with the
Standard contig features and attributes are accessible with standard accessors:
$contig = $assy->next_contig; $consensus = $contig->get_consensus_sequence; $cons_quality = $contig->get_consensus_quality; @cons_ids = $contig->get_seq_ids;
maq-specific features, present in the
.maq file, are accessible as follows.
The sequence IDs within the contigs are obtained as :
@contig_seqids = $contig->get_seq_ids;
To get the individual
maq features, you need to drill down:
$seq = $contig->get_seq_by_name( ($contig->get_seq_ids) ); $feat = $contig->get_seq_feat_by_tag($seq, "_aligned_coord:".$seq->id); $maq_feat = ($feat->sub_SeqFeature); print join(" ", $maq_feat->get_all_tags);
The tags are:
alt_map_qual chr insert_size map_qual num_mm_best_hit one_mm_hits paired_flag posn qualstr read_len read_name se_map_qual sum_qual_mm_best_hit zero_mm_hits
which are mnemonics for the contents of each
maq mapview line; see the
maq manpage under ‘mapview’ for details.