Using BioPerl to create and manipulate short-read assemblies with the bwa
and samtools
suites.
Quicklink to synopsis.
Mark A. Jensen
maj -at- fortinbras -dot- us
bwa
is a suite of C programs that perform efficient alignments (based in part on the Burrows-Wheeler transform) of short (20-100bp) sequence reads, guided by a set of reference sequences provided in FASTA format. bwa
input and output complies with the Sequence/Alignment Map (SAM) binary (.bam
) and test (.sam
) formats. Alignment files in SAM formats can be converted, indexed, sliced and diced by the samtools
suite.
These tools are comprehensive and allow the user to tweak many different parameters, and outputs can be directed to inputs to create highly application-specific workflows. The BioPerl run wrappers Bio::Tools::Run::BWA and Bio::Tools::Run::Samtools are designed to help automate and manage such workflows, and help reduce the number of cryptic command-line options and file order inconsistencies the user must remember. Bio::Tools::Run::BWA can also act as a canned assembler, accepting input data and returning a Bio::Assembly::IO object in a single command. A new assembly IO object has been created for this purpose, Bio::Assembly::IO::sam.
Bio::Tools::Run::BWA and additional dependencies in the Bio::Tools::Run namespace can be found in the bioperl-run repository.
Like all run wrappers, these modules need the underlying programs to work; visit the links above to acquire them.
Also, the Bio::DB::Sam modules must be installed on your system. These are not BioPerl modules. (They were written by a core developer, though, so they’re all right.) You can get them on CPAN.
The run()
method of Bio::Tools::Run::BWA
creates an assembly in sorted .bam
format, using reads in FASTQ
format and a reference sequence database in FASTA
. Create a BWA
factory, then call run
from it with your files as arguments:
$bwa = Bio::Tools::Run::BWA->new()
$bwa->out_type('asm.sam'); # specify output file
# for single-end reads:
$bwa->run( 'read1.fastq', 'refdb.fas' )
# for paired-end reads:
$bwa->run( 'read1.fastq', 'refdb.fas', 'read2.fastq');
If you have Bio::Tools::Run::Samtools installed, the output file (asm.sam
) will already be sorted and converted to BAM.
The BWA::run()
method will also create a Bio::Assembly::Scaffold object, containing Bio::Assembly::Contig alignments with associated consensus sequence objects, quality data, and sequence features.
To do this, leave out_type
unset above, and capture the return value of run()
:
$bwa = Bio::Tools::Run::BWA->new()
# for single-end reads:
$aio = $bwa->run( 'read1.fastq', 'refdb.fas' )
# for paired-end reads:
$aio = $bwa->run( 'read1.fastq', 'refdb.fas', 'read2.fastq');
The Bio::Tools::Run::BWA::run()
method performs the following steps:
Action | Program | Commands |
---|---|---|
create a bwa index for ref seq | bwa | index |
map sequence reads to reference seq | bwa | aln |
assemble | bwa | samse/sampe |
sort on coordinates | samtools | sort |
create bam index | samtools | index |
Command-line options can be directed to the aln
and samse/sampe
steps using factory arguments. See Specifying Options).
A second mode for Bio::Tools::Run::BWA allows direct access to bwa
commands. To run a command, construct a run factory, specifying the desired command using the -command
argument in the factory constructor, along with options specific to that command (see Specifying Options):
$bwa = Bio::Tools::Run::BWA->new( -command => 'view' );
To execute, use the run_bwa()
method off the factory. Input and output files are specified in the arguments of run_maq
(see Specifying Files):
$bwa->run_bwa( -bam=>'mysam.bam' -out=>'mysam.sam' );
bwa
is complex, with many subprograms (commands) and command-line options and file specs for each. The wrapper module attempts to provide commands and options comprehensively. You can browse the choices like so:
$bwa = Bio::Tools::Run::BWA->new( -command => 'aln' );
# all bwa commands
@all_commands = $bwa->available_parameters('commands');
@all_commands = $bwa->available_commands; # alias
# just for aln
@assemble_params = $bwa->available_parameters('params');
@assemble_switches = $bwa->available_parameters('switches');
@assemble_all_options = $bwa->available_parameters();
Reasonably mnemonic names have been assigned to the single-letter command line options. These are the names returned by available_parameters
, and can be used in the factory constructor like typical BioPerl named parameters.
Options can be directed to the aln
and samse/sampe
components of the assembly pipeline implemented by the run()
method. Identify the desired options, for example
@map_params = Bio::Tools::Run::Maq->new(-command => 'aln')->available_parameters();
# returns:
# 'max_edit_dist'
# 'max_gap_opens'
# 'max_gap_extns'
# 'deln_protect_3p'
# 'deln_protect_ends'
# 'subseq_seed'
# 'max_edit_dist_seed'
# 'n_threads'
# 'mm_penalty'
# 'gap_open_penalty'
# 'gap_extn_penalty'
# 'subopt_hit_threshold'
# 'trim_parameter'
# 'reverse_no_comp'
# 'no_iter_search'
then in the factory construction, specify the desired parameters prefixed by aln_
, sms_
, or smp_
, as appropriate (note, no -command
parameter is specified in the run()
method):
$bwa = Bio::Tools::Run::BWA->new( -aln_n_threads => 3 );
$assy = $bwa->run( "read1.fastq", "refseq.fas" );
See the bwa
manpage for many gory details.
When a command requires filenames, these are provided to the run_bwa
method, not the constructor (new()
). To see the set of files required by a command, use available_parameters('filespec')
or the alias filespec()
.
$bwa = Bio::Tools::Run::BWA->new( -command => 'aln' );
@filespec = $bwa->filespec;
This example returns the following array:
fas
faq
>sai
This indicates that the FASTA database (fas) and the FASTQ reads (faq) MUST be specified, and the STDOUT of this program (SA coordinates) will be slurped into a file specified in the run_bwa
argument list:
$bwa->run_bwa( -fas => 'my.db.fas', -faq => 'reads.faq',
-sai => 'out.sai' );
If capture files are not specified per the filespec, text sent to STDOUT and STDERR is saved and is accessible with $bwa->stdout()
and $bwa->stderr()
.
By an odd coincidence, the Bio::Tools::Run::Samtools wrapper accesses the samtools
commands in a similar way. Create a Samtools
factory, specifying the command line options in the constructor argument:
$converter = Bio::Tools::Run::Samtools->new(
-command => 'view',
-sam_input => 1,
-bam_output => 1
);
and run using the run()
, specifying necessary files in its arguments:
$converter->run( -bam => 'mysam.sam', -out => 'mysam.bam' );
Parameters and filespecs can be browsed as described in Specifying options and Specifying files.
# create a single-read assembly object
$bwa = Bio::Tools::Run::BWA->new();
$assy = $bwa->run( 'read1.fastq', 'refseqs.fas' );
# create a paired-read assembly object
$assy_prd = $bwa->run( 'read1.fastq', 'refseqs.fas', 'read2.fastq' );
# create just the sorted, binary SAM file
$bwa->out_type( 'assy_prd.bam' );
$bwa->run( 'read1.fastq', 'refseqs.fas', 'read2.fastq' );
# extract regions from the assy
$start1 = 150000;
$end1 = 200000;
$start2 = 250000;
$end2 = 275000;
$samt = Bio::Tools::Run::Samtools->new(
-command => 'view',
-bam_output => 1
);
$samt->run( -bam => 'assy_prd.bam',
-rgn => [ "my_seqid:$start1-$end1",
"my_seqid:$start2-$end2" ],
-out => 'assy_rgns.bam'
);
# convert a text SAM to binary format
$samt = Bio::Tools::Run::Samtools->new(
-command => 'view',
-sam_input => 1,
-bam_output => 1
);
$samt->run( -bam => 'mysam.sam', -out => 'mysam.bam' );
# sort it
$samt = Bio::Tools::Run::Samtools->new(
-command => 'sort'
);
# creates 'mysam.srt.bam':
$samt->run( -bam => 'mysam.bam', -pfx => 'mysam.srt' );