[Bioperl-l] Output a subset of FASTA data from a single large file

Chris Fields cjfields at uiuc.edu
Mon Jun 12 16:35:10 EDT 2006


...
> Chris Fields wrote:
> > There's information in the HOWTOs:
> >
> > http://www.bioperl.org/wiki/HOWTO:Flat_databases
> >
> > http://www.bioperl.org/wiki/HOWTO:OBDA
> >
...
> As you later discovered, that was an Outlook problem. Just to make this
> thread relevant to bioperl, the bioperl solution is:

Agreed (stupid Outlook).  It might be much faster to use non-Bioperl-ish
ways, but it is easier to further manipulate sequences (convert format,
analyze sequences, etc) using Bioperl directly.  I haven't used flat
databases much but it should move very quickly, even in an OO environment.

The one problem with the proposed non-bioperl method is, if you wanted
100,000 sequences (based on ID's) in a FASTA database file containing
200,000 sequences, all ID's would need to be stored (1) in an array (which
gulped the data from the ID file) and then map the ID's to (2) a hash;
that's may be a pretty big memory footprint depending on your system.  

Sendu's BioPerl version indexes the FASTA file based on the ID, then (1)
reads the ID's in one at a time from the file, (2) retrieves the data, then
(3) prints it out.   The advantage of this approach is that the built index
can be used in other bioperl scripts as well w/o having to rebuild it again,
so if you wanted a different set of ID's later on you can access the
database using the prebuilt index.  More can be found in the
Bio::Index::Fasta POD.  

You can also use the ideas and code in the HOWTO (Flat Databases) I
mentioned, which focuses on the Bio::DB::Flat system and ODBA.  The
advantage of these is that you can use Sleepycat's Berkeley Database through
the Perl BerkeleyDB module (more functionality than DB_File) which is faster
than a standard flat database.  In the HOWTO, specifically look under
'Secondary or custom namespaces' for ideas on how to use your ID as a
primary or secondary key.

Chris

> use Bio::SeqIO;
> use Bio::Index::Fasta;
> my $inx = Bio::Index::Fasta->new(-write_flag => 1);
> $inx->id_parser(\&get_id);
> $inx->make_index(shift);
> 
> my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
> my $wanted_ids_file = shift;
> open(IDS, $wanted_ids_file);
> while (<IDS>) {
>    chomp;
>    my $seq = $inx->fetch($_);
>    $out->write_seq($seq);
> }
> 
> sub get_id {
>    my $line = shift;
>    $line =~ /^>probe:\S+?:(\S+?):/;
>    $1;
> }
> 
> It works for me on the sample sequence given by the OP.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l



More information about the Bioperl-l mailing list