A detailed description of Bio::Tools::Run::Maq, a wrapper for creating short-read assemblies with the Maq assembler.
Mark A. Jensen
maj -at- fortinbras -dot- us
# create an assembly
$maq_fac = Bio::Tools::Run::Maq->new();
$maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas' );
# paired-end
$maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas',
'paired-reads.fastq');
# be more strict
$maq_fac->set_parameters( -c2q_min_map_quality => 60 );
$maq_assy = $maq_fac->run( 'reads.fastq', 'refseq.fas',
'paired-reads.fastq');
# run maq commands separately
$maq_fac = Bio::Tools::Run::Maq->new(
-command => 'pileup',
-single_end_quality => 1 );
$maq_fac->run_maq( -bfa => 'refseq.bfa',
-map => 'maq_assy.map',
-txt => 'maq_assy.pup.txt' );
Bio::Tools::Run::Maq and other Bioperl dependencies are found in the bioperl/bioperl-run repository. Other dependencies are found in bioperl/bioperl-live.
This module provides a wrapper interface for Heng Li’s reference-directed short read assembly suite maq
. The module won’t work unless you download and install the maq
software; see the Sourceforge site.
There are two modes of action.
The first is a simple pipeline through the maq
commands, taking your read data in and squirting out an assembly object of type Bio::Assembly::IO::maq. The pipeline is based on the one performed by maq.pl easyrun
:
Action | maq Commands |
---|---|
data conversion to maq binary formats | fasta2bfa, fastq2bfq |
map sequence reads to reference seq | map |
assemble, creating consensus | assemble |
convert map & cns files to plaintext | mapview, cns2fq |
Table 1. maq
commands
Command-line options can be directed to the map
, assemble
, and cns2fq
steps. See Specifying Options.
maq
componentsThe second mode is direct access to maq
commands. To run a command, construct a run factory, specifying the desired command using the -command
argument in the factory constructor, along with options specific to that command (see Specifying Options):
$maqfac->Bio::Tools::Run::Maq->new( -command => 'fasta2bfa' );
To execute, use the run_maq
methods. Input and output files are specified in the arguments of run_maq
(see Specifying Files):
$maqfac->run_maq( -fas => "myref.fas", -bfa => "myref.bfa" );
maq
is complex, with many subprograms (commands) and command-line options and file specs for each. This module attempts to provide commands and options comprehensively. You can browse the choices like so:
$maqfac = Bio::Tools::Run::Maq->new( -command => 'assemble' );
# all maq commands
@all_commands = $maqfac->available_parameters('commands');
@all_commands = $maqfac->available_commands; # alias
# just for assemble
@assemble_params = $maqfac->available_parameters('params');
@assemble_switches = $maqfac->available_parameters('switches');
@assemble_all_options = $maqfac->available_parameters();
Reasonably mnemonic names have been assigned to the single-letter command line options. These are the names returned by available_parameters
, and can be used in the factory constructor like typical BioPerl named parameters.
Options can be directed to the map
, assemble
and cns2fq
components of the assembly pipeline implemented by the run()
method. Identify the desired options, for example
@map_params = Bio::Tools::Run::Maq->new(-command => 'map' )->
available_parameters();
# returns:
# adaptor_file
# first_read_length
# max_hits
# max_mismatches
# max_outer_distance
# max_outer_distance_rf
# mismatch_dump
# mismatch_posn_dump
# mismatch_thr
# mutation_rate
# second_read_length
# unmapped_dump
then in the factory construction, specify the desired parameters prefixed by map_
, asm_
, or c2q_
, as appropriate (note, no -command
parameter):
$maqfac = Bio::Tools::Run::Maq->new( -map_max_mismatches => 1 );
$assy = $maqfac->run( "read1.fastq", "refseq.fas" );
See the maq
manpage for many gory details.
When a command requires filenames, these are provided to the run_maq
method, not the constructor (new()
). To see the set of files required by a command, use available_parameters('filespec')
or the alias filespec()
.
$maqfac = Bio::Tools::Run::Maq->new( -command => 'map' );
@filespec = $maqfac->filespec;
This example returns the following array:
map
bfa
bfq1
#bfq2
2>#log
This indicates that map (maq
binary mapfile), bfa (maq
binary fasta), and bfq (maq
binary fastq) files must be specified, another bfq file may be specified, and a log file receiving STDERR also may be specified. Use these in the run_maq
call like so:
$maqfac->run_maq( -map => 'my.map', -bfa => 'myrefseq.bfa',
-bfq1 => 'reads1.bfq', -bfq2 => 'reads2.bfq' );
Here, the -log
parameter was unspecified. Therefore, the object will store the programs STDERR output for you in the stderr()
attribute:
handle_map_warning($maqfac) if ($maqfac->stderr =~ /warning/);
STDOUT
for a run is also saved, in stdout()
, unless a file is specified to slurp it according to the filespec. maq
STDOUT usually contains useful information on the run.
Bio::Tools::Run::Maq
, like the other assembler run modules, stores the assembly in a Bio::Assembly::Scaffold
object by default. This object is built using Bio::Assembly::IO::maq
, a read-only IO module that parses the maq
consensus file (.cns.fastq
) and map file (.maq
) produced by the execution of Bio::Tools::Run::Maq::run()
. In the code:
$fac = Bio::Tools::Run::Maq->new();
$assy = $fac->run('reads.faq', 'refseq.fas');
$assy
is the Scaffold
object. This code, if successful, will have produced two temporary files in the execution directory, with .maq
and .cns.fastq
extensions.
To specify a name for the assembly output files, do
$fac = Bio::Tools::Run::Maq->new();
$fac->out_type('myassy.maq');
$assy = $fac->run('reads.faq', 'refseq.fas');
The maq
map file (converted by maq mapview
) will be called myassy.maq
in this example; the consensus file will be myassy.cns.fastq
. Both files are required to read in an assembly using Bio::Assembly::IO, but only the .maq
is specified in the parameters:
$assy = Bio::Assembly::IO->new( -file => 'myassy.maq' );
Since the reference sequence is integrated into the maq
assembly, maq
always creates a single “contig” that includes all reads that match the reference sequence. Bio::Assembly::IO::maq
divides the assembly into separate contigs by examining the consensus quality, and returning contigs that represent contiguous regions of non-zero PHRED quality values. Singlets are contigs containing only one read; as required by Bio::Assembly::Scaffold
, these are treated specially and can be accessed with the singlet
-associated iterators.
Standard contig features and attributes are accessible with standard accessors:
$contig = $assy->next_contig;
$consensus = $contig->get_consensus_sequence;
$cons_quality = $contig->get_consensus_quality;
@cons_ids = $contig->get_seq_ids;
maq
-specific features, present in the .maq
file, are accessible as follows.
The sequence IDs within the contigs are obtained as :
@contig_seqids = $contig->get_seq_ids;
To get the individual maq
features, you need to drill down:
$seq = $contig->get_seq_by_name( ($contig->get_seq_ids)[0] );
$feat = $contig->get_seq_feat_by_tag($seq, "_aligned_coord:".$seq->id);
$maq_feat = ($feat->sub_SeqFeature)[0];
print join(" ", $maq_feat->get_all_tags);
The tags are:
alt_map_qual
chr
insert_size
map_qual
num_mm_best_hit
one_mm_hits
paired_flag
posn
qualstr
read_len
read_name
se_map_qual
sum_qual_mm_best_hit
zero_mm_hits
which are mnemonics for the contents of each maq mapview
line; see the maq
manpage under ‘mapview’ for details.