Local Databases HOWTO



Authors

Brian Osborne briano at bioteam.net

Peter Schattner

Abstract

This is a HOWTO that talks about using Bioperl to create local sequence databases for fast retrieval.

Introduction

Bioperl offers many ways to retrieve sequences from online databases like NCBI and Swissprot but there may be times when you want build local databases for fast, secure retrieval. This HOWTO discusses the different Bioperl modules you might use. Also see OBDA Flat databases HOWTO.

Bio::Index

The following sequence data formats are supported by Bio::Index*:

Once the set of sequences have been indexed using Bio::Index*, individual sequences can be accessed using syntax very similar to that used for accessing remote databases.

For example, if one wants to set up an indexed flat-file database of fasta files one could write a script like using Bio::Index::Fasta:

# script 1: create the index

# some users have reported that this is necessary
use strict;
use Bio::Index::Fasta; 

my $index_file_name = shift;

my $inx = Bio::Index::Fasta->new( -filename => $index_file_name,
                                  -write_flag => 1);

$inx->make_index(@sequence_files);

This script then retrieves sequences:


# script 2: retrieve some files

# some users have reported that this is necessary
use strict;
use Bio::Index::Fasta;

my $index_file_name = shift;

my $inx = Bio::Index::Fasta->new($index_file_name);

for my $id (@ARGV) {
    # Returns Bio::Seq object
    my $seq = $inx->fetch($id);
    # do something with the sequence
}

To facilitate the creation and use of more complex or flexible indexing systems, the Bioperl distribution includes two sample scripts in the scripts/index directory, bp_index.PLS and bp_fetch.PLS. These scripts can be used as templates to develop customized local data-file indexing systems.

Bio::DB::Fasta

Bioperl also supplies as a means to index and query Fasta format files. It’s similar in spirit to Bio::Index* but has additional methods and has the ability to retrieve subsequences, great for long sequences:


use strict;
use Bio::DB::Fasta;

my $id = 'CHROMOSOME_I';
my $file = "arabidopsis.fa"

my $db = Bio::DB::Fasta->new($file);
my @ids = $db->get_all_primary_ids;

# get a sequence as string 
my $seqstring = $db->seq($id);

# get a PrimarySeq obj 
my $seqobj = $db->get_Seq_by_id($id); 

# get the header, or description line
my $desc = $db->header($id);
my $alphabet = $db->alphabet($id);

# Get subsequences and length
my $seqstr   = $db->seq($id, 4_000_000 => 4_100_000);
my $revseq   = $db->seq($id, 4_100_000 => 4_000_000);
my $length   = $db->length($id);

See Bio::DB::Fasta for more information.

Indexing using a specific substring

Both modules also offer the user the ability to designate a specific string within the fasta header as the desired id, such as the gi number within the string *gi 4556644 gb X45555. Consider the following fasta-formatted sequence, in *test.fa:
>gi|523232|emb|AAC12345|sp|D12567 titin fragment
MHRHHRTGYSAAYGPLKJHGYVHFIMCVVVSWWASDVVTYIPLLLNNSSAGWKRWWWIIFGGE
GHGHHRTYSALWWPPLKJHGSKHFILCVKVSWLAKKERTYIPKKILLMMGGWWAAWWWI
By default and will use the first “word” they encounter in the fasta header as the retrieval key, in this case “gi 523232 emb AAC12345 sp D12567”. What would be more useful as a key would be a single id. The code below will index the “test.fa” file and create an index file called “test.fa.idx” where the keys are the Swissprot, or “sp”, identifiers.
$ENV{BIOPERL_INDEX_TYPE} = "SDBM_File";

#look for the index in the current directory

$ENV{BIOPERL_INDEX} = ".";

my $file_name = "test.fa";
my $inx = Bio::Index::Fasta->new( -filename => $file_name . ".idx",
                                  -write_flag => 1 );

# pass a reference to the critical function to the Bio::Index object
$inx->id_parser(&get_id);

# make the index
$inx->make_index($file_name);

# here is where the retrieval key is specified
sub get_id {
    my $header = shift;
    $header =~ /^>.*sp|([A-Z]d{5}b)/;
    $1;
}

Here is how you would retrieve the sequence, as a object:

my $seq = $inx->fetch("D12567");
print $seq->seq;

What if you wanted to retrieve a sequence using either a Swissprot id or a gi number and the fasta header was actually a concatenation of headers with multiple gi’s and Swissprots?

>gi|523232|emb|AAC12345|sp|D12567|gi|7744242|sp|V11223 titin fragment

Modify the function that’s passed to the id_parser() method:

sub get_id {
    my $header = shift;
    my (@sps) = $header =~ /^>.*sp|([A-Z]d{5})/g;
    my (@gis) = $header =~ /gi|(d+)/g;
    return (@sps,@gis);
}

The module uses the same principle, but the syntax is slightly different, for example:


my $db = Bio::DB::Fasta->new('test.fa', -makeid => &make_my_id);
my $seqobj = $db->get_Seq_by_id($id);

sub make_my_id {
    $description_line = shift;
    $description_line =~ /gi|(d+)|emb|(w+)/;
    ($1,$2);
}

Storing sequences in a relational database

The core Bioperl package does not support accessing sequences and data stored in relational databases but this capability is available in the Bioperl-db package.