HOWTO:Wrappers

Abstract

Creating DIY BioPerl run wrappers for your favorite external programs.

Author

Mark A. Jensen

Fortinbras Research

maj -at- fortinbras -dot- us

NOTE

Code and documentation in progress...

Introduction

BioPerl has a long tradition of providing wrapper objects for running external programs, mainly through the distribution called bioperl-run. Wrappers make it relatively easy to process data in highly customizable pipelines with the benefits of BioPerl objects and I/O. They also help to standardize the interfaces to typically idiosyncratic open-source utilities, reducing the burden on the developer.

With new bioinformatics tools being released almost daily, it can be difficult for the BioPerl regulars to maintain a stable of run wrappers for the latest and greatest tools. Bio::Tools::Run::WrapperBase and Bio::Tools::Run::WrapperBase::CommandExts attempt to reduce the activation energy of creating new wrappers by handling bookkeeping, file finding, and calling tasks. Bio::Tools::WrapperMaker allows the user to specify the interface of a command-line tool in a definitions file written in XML, and uses WrapperBase and WrapperBase::CommandExts under the hood to provide a fully functional BioPerl wrapper class.

Using these modules, users can write their own wrappers with a minimum of code, and can grow the wrapper library by submitting them as BioPerl contributions.

Overview

Bio::Tools::Run::WrapperBase is a class providing basic methods and properties for running external programs: methods for finding executables, parsing option strings, and writing command lines. Bio::Tools::Run::WrapperBase::CommandExts extends WrapperBase with methods for handling programs that act as frontends for multiple commands (like bwa) and logical groups of separate programs (like the blast+ suite). CommandExts will also run programs safely via IPC::Run and provides simple access to standard output and standard error streams. Bio::Tools::WrapperMaker, with the above modules, will produce a fully-functional BioPerl run wrapper for any command-line program, based on a wrapper definition file written in XML. Relatively complex wrappers can be created just by compiling XML, or WrapperMaker can be used as an initializer for more sophisticated wrappers.

WrapperBase and CommandExts

content to appear...

WrapperMaker

Bio::Tools::WrapperMaker has three goals:

  • to make the use and creation of program wrappers fast and easy;
  • to standardize access to the parameters and output of diverse external programs; and
  • to provide a standard but flexible jumping-off point for sophisticated wrappers and pipelines.

Here's fast and easy:

$fac = Bio::Tools::WrapperMaker->compile(-defs => 'samtools.xml');
$fac->set_parameters( -command => 'view', -sam_input => 1, -bam_output => 1);
$fac->run( -bam => 'my.sam', -out => 'my.bam');

Here's flexible:

# in your script: expose only one samtool...
use SamView;
$fac = SamView->new( -sam_input => 1, -bam_output => 1);
$fac->run( -bam => 'my.sam', -out => 'my.bam');
 
# in SamView.pm:
package SamView;
use Bio::Tools::WrapperMaker;
Bio::Tools::WrapperMaker->compile( -defs => 'samtools.xml',
                                   -namespace => "SamView::Factory" );
our @ISA = qw( SamView::Factory );
sub new {
    my $class = shift;
    my @args = @_;
    # call the compiled factory's constructor with the command fixed to 'view'
    return $class->SUPER::new( -command => 'view', @args );
}
1;

Standardization is accomplished via the wrapper definitions file validated against an XML Schema, Bio::ParameterBaseI compliance, and run methods and output accessors via Bio::Tools::Run::WrapperBase::CommandExts.

Making a wrapper

To produce a run wrapper factory, use the compile() method:

$lsfac = Bio::Tools::WrapperMaker->compile( -defs => 'ls.xml' );

or

$lsfac = Bio::Tools::WrapperMaker->compile( -defs => $ls_xml_string );

The wrapper definition XML will be validated each time a factory is compiled (if XML::LibXML is installed). To inhibit the validation step, set

$Bio::Tools::WrapperMaker::VALIDATE_DEFS = 0;

and to turn off all validation warnings, set

$Bio::Tools::WrapperMaker::VALIDATE_DEFS = -1;

Wrapper namespace

The run wrapper factory is placed in the Perl namespace MyWrapper by default. This namespace can be used to run any class method in Bio::Tools::Run::WrapperBase and Bio::Tools::Run::WrapperBase::CommandExts, and to set up any package globals you may desire. For example, the following code works:

Bio::Tools::WrapperMaker->compile( -defs => 'ls.xml' );
 
$lsfac = MyWrapper->new( -all => 1 );

Magic!

To create the wrapper in a different namespace, specify it with the -namespace parameter:

Bio::Tools::WrapperMaker->compile( -defs => 'ls.xml',
                                   -namespace => 'LinuxWrap::LS' );
$lsfac = LinuxWrap::LS->new();

Wrapper definition file

The wrapper definition file is an XML document that validates against the schema maker.xsd, found in the local installation directory $YOUR_INSTALL_ROOT/Bio/Tools/WrapperMaker or at (currently) [1].

The definition file defines:

  • the program name,
  • the commands (if any) that the program supports,
  • the parameters and switches associated with the program and/or individual commands,
  • the other items (typically filenames, but not always) that appear at the end of the command line.

Other useful elements can appear in the definitions file; see the documentation in maker.xsd itself for more detail.

If you want to create or edit a definitions file, but don't want to write XML, try the wrapper defs editor at Fortinbras.

Here is a brief overview of these components based on a simple example.

Example wrapper def for a familiar program:

  1  <defs xmlns="http://www.bioperl.org/wrappermaker/1.0">
  2    <program name="ls" dash-policy="mixed"/>
  3    <self name="_self">
  4      <options>
  5        <option name="all" type="switch"/>
  6        <option name="sort_by_size" type="switch" translation="S"/>
  7        <option name="sort_by_time" type="switch" translation="t"/>
  8        <option name="one_line_each" type="switch" translation="1"/>
  9      </options>
 10      <filespecs>
 11        <filespec token="pth" use="optional-multiple"/>
 12        <filespec token="out" use="optional-single" redirect="stdout"/>
 13      </filespecs>
 14    </self>
 15  </defs>

The root element of the schema is the defs element. The namespace definition as given is required.

The program element (line 2) defines the name of the program as typed on the command line. The dash-policy attribute indicates whether single or double dashes are used to set off the program parameters or switches. mixed indicates that single character options are set off with single dashes, and long options with a double dash.
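As a rough illustration of the mixed policy just described, the following sketch prefixes an option name with one or two dashes depending on its length. The helper name dash_mixed is hypothetical and is not part of WrapperMaker; it only mirrors the rule stated above.

```perl
use strict;
use warnings;

# Hypothetical helper (not a WrapperMaker function): apply the "mixed"
# dash policy, i.e. one dash for single-character options and two
# dashes for long options.
sub dash_mixed {
    my ($opt) = @_;
    return length($opt) == 1 ? "-$opt" : "--$opt";
}

# Command-line names taken from the ls example above.
print join( ' ', map { dash_mixed($_) } qw( S t 1 all ) ), "\n";
```

Under this rule, the alias sort_by_size (translation S) would reach the command line as -S, while all would become --all.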

The self element (line 3) encompasses the options and filespecs associated with the program itself, and not with program commands. For example, in

svn --version

version is a "self option", while in

svn update -r 16784

r is a command option, for the command update. Program commands, their options and filespecs are specified in a commands element.

The options element (line 4) specifies the options to make available to the wrapper, and can be used to create human-readable aliases for these options. If the name specified is an alias, the translation attribute gives the command-line equivalent (sans dashes); compare lines 5 and 6. The type attribute is either parameter, meaning the option takes a value (as the r option does in the Subversion client, above), or switch, meaning the option represents a boolean state expressed by its presence or absence on the command line.
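The difference between aliases, translations, and the two option types can be sketched in plain Perl. This is only an illustration of the rules described above, not WrapperMaker's internals: the %options table, the render_option helper, and the revision alias for Subversion's r option are all hypothetical.

```perl
use strict;
use warnings;

# Option table mirroring the attributes described above. The 'revision'
# alias for svn's r option is hypothetical, added to show a parameter.
my %options = (
    all          => { type => 'switch' },
    sort_by_size => { type => 'switch',    translation => 'S' },
    revision     => { type => 'parameter', translation => 'r' },
);

# Hypothetical helper: return the command-line tokens (sans dashes)
# that a single wrapper option contributes.
sub render_option {
    my ( $name, $value ) = @_;
    my $def = $options{$name} or die "unknown option '$name'\n";
    # Use the translation when the name is an alias, else the name itself.
    my $cli = defined $def->{translation} ? $def->{translation} : $name;
    if ( $def->{type} eq 'switch' ) {
        return $value ? ($cli) : ();    # a switch appears only when true
    }
    return ( $cli, $value );            # a parameter carries its value
}

print join( ' ', render_option( sort_by_size => 1 ) ), "\n";
print join( ' ', render_option( revision => 16784 ) ), "\n";
```

A switch set to a false value contributes nothing to the command line, which is how absence encodes the "off" state.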

The filespecs element (line 10) defines how files or paths are aliased, and also specifies stdin/stdout/stderr redirection. The filespec elements (lines 11 and 12) must appear in the definition file in the order in which they would appear on the command line. The token attribute becomes the wrapper parameter for this path. The use attribute indicates whether the filespec is optional or required, and whether multiple files or just a single file are allowed on the command line; its possible values are required-single, required-multiple, optional-single, and optional-multiple.
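The four use values can be read as minimum and maximum file counts. The sketch below checks a list of files against a use value under that reading; the %use_rules table and the filespec_ok helper are hypothetical names for illustration, and the actual checks performed by WrapperMaker may differ.

```perl
use strict;
use warnings;

# Each use value expressed as a (min, max) file count; an undefined
# max means "no upper limit". This table is an illustrative assumption.
my %use_rules = (
    'required-single'   => { min => 1, max => 1 },
    'required-multiple' => { min => 1, max => undef },
    'optional-single'   => { min => 0, max => 1 },
    'optional-multiple' => { min => 0, max => undef },
);

# Hypothetical helper: true if the number of files satisfies the use value.
sub filespec_ok {
    my ( $use, @files ) = @_;
    my $rule = $use_rules{$use} or die "unknown use value '$use'\n";
    my $n = scalar @files;
    return 0 if $n < $rule->{min};
    return 0 if defined $rule->{max} && $n > $rule->{max};
    return 1;
}

# The 'pth' filespec above is optional-multiple; 'out' is optional-single.
print filespec_ok( 'optional-multiple', '/tmp', '/var' ) ? "ok\n" : "rejected\n";
print filespec_ok( 'optional-single', 'a.txt', 'b.txt' ) ? "ok\n" : "rejected\n";
```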

This is a basic overview. The WrapperMaker/CommandExts system is designed to support complex programs and groups of programs, and provides many other features. See Wrapper Definition Files (to appear, one day) for more complex examples involving programs with multiple commands, and the representation of a group of related programs in a single wrapper.

Finding the executable

If the actual program executable does not appear in your $PATH, you can specify its location in an environment variable: the program name in upper case followed by 'DIR'. If the example above didn't work out of the box, you might do

$ export LSDIR=/usr/bin

or

#!/usr/bin/perl
$ENV{LSDIR} = "/usr/bin";
...

(Or you might have a little heart-to-heart with your sysadmin.)
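The naming convention for the environment variable can be written out as a short sketch. The helpers dir_env_var and path_from_env are hypothetical and only illustrate the convention described in this section; they do not reproduce how WrapperBase actually locates executables.

```perl
use strict;
use warnings;
use File::Spec;

# Name of the environment variable for a program, following the
# convention above: the program name in upper case followed by 'DIR'.
sub dir_env_var {
    my ($program) = @_;
    return uc($program) . 'DIR';
}

# Hypothetical helper: full path to the program if the variable is set,
# otherwise undef (meaning "fall back to searching $PATH").
sub path_from_env {
    my ($program) = @_;
    my $dir = $ENV{ dir_env_var($program) };
    return defined $dir ? File::Spec->catfile( $dir, $program ) : undef;
}

$ENV{LSDIR} = '/usr/bin';
print dir_env_var('ls'), "\n";
print path_from_env('ls'), "\n";
```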

Using the wrapper object directly

The wrapper object will manage the program according to the facilities in Bio::Tools::Run::WrapperBase::CommandExts. It will automatically be Bio::ParameterBaseI compliant, possessing set_parameters(), get_parameters(), reset_parameters(), available_parameters(), and parameters_changed(). The run() method will execute the program, unless a command named "run" was defined in the definitions file, in which case _run() will do the trick.

Some examples based on the definition above:

$lsfac = Bio::Tools::WrapperMaker->compile( -defs => 'ls.xml');
 
# list pwd and output to file
$lsfac->run( -out => "listing.txt" );
 
# list home directory, and collect output with stdout() 
# ( provided by CommandExts...)
$lsfac->set_parameters( -one_line_each => 1 );
$lsfac->run( -pth => "~" );
@myfiles = split("\n", $lsfac->stdout);

Using WrapperMaker in a new wrapper module

If you are designing a new wrapper module that requires more complex sanity checking or other computation than CommandExts provides, you can use WrapperMaker to "initialize" that module. Set the -namespace parameter to the qualified module name:

package My::Complex::Wrapper;
use strict;
use warnings;
use Bio::Tools::WrapperMaker;
my $WRAPPER_DEFS = "./complex_wrapper.xml";
 
Bio::Tools::WrapperMaker->compile( 
  -defs => $WRAPPER_DEFS,
  -namespace => __PACKAGE__ );
 
sub additional_computation {
  my $self = shift;
  ...
}
...
1;

In support of this usage, maker.xsd defines the lookups element. It can be used to define arbitrary lookup hashes, which WrapperMaker will import as package globals:

<defs xmlns="http://www.bioperl.org/wrappermaker/1.0">
  ...
  <lookups>
    <lookup name="accepted_types">
      <elt key="seq" value="fasta fastq raw crossbow"/>
      <elt key="seq2" value="fasta fastq raw"/>
      <elt key="ref" value="fasta"/>
    </lookup>
    ...
  </lookups>
</defs>

The value of the elt members is an arbitrary string.
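Because each value is just a string, the wrapper module decides how to interpret it. The sketch below treats the values of accepted_types as whitespace-separated lists. The %accepted_types hash here is a local stand-in: the name and scope of the variable that WrapperMaker actually imports are assumptions, and type_accepted is a hypothetical helper.

```perl
use strict;
use warnings;

# Local stand-in for the lookup defined in the XML above. The real
# imported variable's name and scope are assumptions of this sketch.
my %accepted_types = (
    seq  => 'fasta fastq raw crossbow',
    seq2 => 'fasta fastq raw',
    ref  => 'fasta',
);

# Hypothetical helper: is $type among the types listed for $slot?
sub type_accepted {
    my ( $slot, $type ) = @_;
    return 0 unless exists $accepted_types{$slot};
    # split ' ' breaks the value string on runs of whitespace
    my $hits = grep { $_ eq $type } split ' ', $accepted_types{$slot};
    return $hits ? 1 : 0;
}

print type_accepted( 'seq', 'crossbow' ) ? "accepted\n" : "rejected\n";
print type_accepted( 'ref', 'fastq' )    ? "accepted\n" : "rejected\n";
```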

Security notes

Because this module is designed to run commands outside Perl as directed by an external file, attention has been paid to taint-checking and input verification. Of course, we can't (and don't try to) keep you from overriding the checks or making wrappers for nasty programs.

Basic security is provided by validation of wrapper definition files against an XML Schema definition. Taint checks are built into the XSD to protect your command line against injections by naughty defs. If you have the XML::LibXML module installed, WrapperMaker will validate for you against either a local copy of the XSD or a hosted version at $SCHEMA_URL. If XML::LibXML is missing, a warning will be emitted. The warning can be turned off by setting

$Bio::Tools::WrapperMaker::VALIDATE_DEFS = -1;

Wrapper def files can be validated "by hand" at the wrapper defs editor.

The IPC::Run module is used to execute all processes, and three-argument open is used to open all files. Backticks and qx are not used.
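To show why three-argument open matters, here is a small core-Perl sketch; it covers only the file-opening half of the statement above and does not use IPC::Run. The file name is deliberately awkward: with two-argument open, a trailing '|' would be read as a request to run a command, whereas the three-argument form keeps the mode separate and treats the name purely as a file name. The name and the temporary directory are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec;

my $dir = tempdir( CLEANUP => 1 );

# A hostile-looking name; two-argument open would see the trailing '|'
# as a pipe request instead of part of the file name.
my $name = File::Spec->catfile( $dir, 'listing.txt |' );

# Three-argument open: the mode ('>') is a separate argument, so $name
# can never change how the file is opened.
open( my $out, '>', $name ) or die "Cannot write '$name': $!";
print {$out} "hello\n";
close($out) or die "Cannot close '$name': $!";

open( my $in, '<', $name ) or die "Cannot read '$name': $!";
my $line = <$in>;
close($in);

print "read back: $line";
```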

Synopsis

$samt = Bio::Tools::WrapperMaker->compile( -defs => "samtools.xml" );
$samt->set_parameters( -sam_input => 1, -bam_output => 1);
$samt->run( -bam => 'my.sam', -out => 'my_bam' );
 
$samt = Bio::Tools::WrapperMaker->compile( -defs => $xml_string,
                                           -xsd => 'maker_tweaked.xsd' );
 
Bio::Tools::WrapperMaker->compile( -defs => "samtools.xml",
                                   -namespace => "My::Samtools" );
$viewfac = My::Samtools->new_view( -sam_input => 1, -bam_output => 1);
 
$viewfac->run( -bam => 'my.sam', -out => 'my_bam');