HOWTO:SQLite for BioPerl indexing

From BioPerl
Jump to: navigation, search

Contents

Abstract

Idea: use the AnyDBM_File system to expand the choices of DBM for as many BioPerl instances of indexing as reasonable.

Author

Mark A. Jensen

Fortinbras Research

maj -at- fortinbras -dot- us

The Issue

Many of the BioPerl standard indexing modules use the DB_File module exclusively as a DBM to create tied hashes and arrays. While DB_File is part of Perl core, it depends on the external installation of Berkeley DB. On at least one platform (Windows/32 using ActiveState Perl), Berkeley DB is difficult to integrate with DB_File.

The Perl core AnyDBM_File module for ties allows the dev to provide a set of choices of DBMs over which to fail. This system by default includes the SDBM, which is included with Perl. One idea is to swap all DB_File for AnyDBM_File instances. However, DB_File also provides a direct API to Berkeley DB (put(), get(), seq(), etc. methods) that are used fairly extensively in BioPerl indexing code. SDBM does not provide these methods, and also has a record length limit (in the standard Perl build) that is too restrictive for many applications.

Being able to use SQLite for ties would provide an attractive alternative to both BDB and SDBM. The SQLite DBI (DBD::SQLite) contains the Perl DBI interface and the SQLite DBMS as XS all in one, obviating an external install. There are no record length restrictions.

Modules

Currently available in CPAN:

Other modules have also been modified on branch anydbm-branch.

Usage

BEGIN {
    @AnyDBM_File::ISA = qw( DB_File SQLite_File ) unless 
       @AnyDBM_File::ISA == 1; # single member indicates AnyDBM_File already loaded
}
use AnyDBM_File;
use vars qw( $DB_BTREE &R_DUP); # must declare the globals you expect to use
use AnyDBM_File::Importer qw(:bdb); # an import tag is REQUIRED
 
my %db;
$DB_BTREE->{'flags'} = R_DUP;
tie( %db, 'AnyDBM_File', O_CREAT | O_RDWR, 0644, $DB_BTREE);

Description

DB_File Emulation

The intention was to create a DBM that could almost completely substitute for DB_File, so that DB_File could be replaced everywhere in the code by AnyDBM_File, and things would just work. Currently, it is slightly more complicated than that, but not too much more.

Versions of $DB_HASH, $DB_BTREE, and $DB_RECNO, as well as the necessary flags (R_DUP, R_FIRST, R_NEXT, etc.) are imported by using the AnyDBM_File::Importer module. The desired constants need to be declared global in the calling program, as well as imported, to avoid compilation errors (at this point). See Converting from DB_File below.

Arguments to the tie function mirror those of DB_File, and all should work the same way. See Converting from DB_File.

All of DB_File's random and sequential access functions work:

get()
put()
del()
seq()

as well as the duplicate key handlers

get_dup()
del_dup()
find_dup()

seq() works by finding partial matches, like DB_File::seq(). The extra array functions ( shift(), pop(), etc. ) are not yet implemented as method calls, though all these functions (including splice are available on the tied arrays.

Some HASHINFO fields are functional:

$DB_BTREE->{'compare'} = sub { - shift cmp shift };

will provide sequential access in reverse lexographic order, for example.

$DB_HASH->{'cachesize'} = 20000;

will enforce

PRAGMA cache_size = 20000

in the underlying database.

Converting from DB_File

To failover to SQLite_File from DB_File, go from this:

use DB_File;
# ...
$DB_BTREE->{cachesize} = 100000;
$DB_BTREE->{flags} = R_DUP;
my %db;
my $obj = tie( %db, 'DB_File', 'my.db', $flags, 0666, $DB_BTREE);

to this:

 
use vars qw( $DB_HASH &R_DUP );
BEGIN {
  @AnyDBM_File::ISA = qw( DB_File SQLite_File )
    unless @AnyDBM_File::ISA == 1; # 
}
use AnyDBM_File;
use AnyDBM_File::Importer qw(:bdb);
# ...
$DB_BTREE->{cachesize} = 100000;
$DB_BTREE->{flags} = R_DUP;
my %db;
my $obj = tie( %db, 'AnyDBM_File', 'my.db', $flags, 0666, $DB_BTREE);

Implementation and Testing

Design and Motivation

Objective

Provide an SQLite-based DBM as a drop-in alternative for DB_File

Two things are required of such a module:

  1. The machinery for tying hashes and arrays (see perltie), and
  2. An emulation of the DB_File API to Berkeley DB.

Tests and Modifications

Unit Tests

Test Suites currently passing

These tests currently (r16252) pass under both DB_File and SQLite_File (for me under ActiveState 5.8 sans DB_File and Cygwin 5.10 with DB_File), with mods in the core modules at that revision and :

  • t/LocalDB/BioDBGFF.t
  • t/LocalDB/BlastIndex.t
  • t/LocalDB/DBFasta.t
  • t/LocalDB/DBQual.t
  • t/LocalDB/Index.t
  • t/LocalDB/transfac_pro.t
  • t/LocalDB/SeqFeature_mysql.t (created by ./Build)

Modified Modules in the Bio::DB and Bio::Index Namespaces

The following modules have a DB_File or related dependency:

Bio::DB::Flat has a BDB dependency that is very particular to its implementation and would not benefit from an AnyDBM_File-based conversion. The Bio::DB::SeqFeature::Store::DBI modules for MySQL and SQLite still rely on BDB-tied hashes for indexing. Other modules are there to provide a BDB option; these could benefit from a SQLite-based BDB emulation (as a workaround in the absence of BDB).

The following modules have been modified to make use of SQLite_File, either because they already use the AnyDBM_File system, or because DB_File was integrated in to tied hash and array indexes. By converting these to AnyDBM_File, DB_File remains available and the default choice, but the modules remain functional on systems that don't support DB_File for whatever reason.

Modifications were successful if all tests passed under both DB_File and SQLite_File.

Personal tools
Namespaces
Variants
Actions
Main Links
documentation
community
development
Toolbox