[Bioperl-l] Output a subset of FASTA data from a single largefile

Cook, Malcolm MEC at stowers-institute.org
Mon Jun 12 17:15:35 EDT 2006

Yeah, good points...

... my recommendation of the one-liner was motivated based on a small
number of IDs and no other applications needing to index the entire
fasta database.

--Malcolm [At which point he bowed out of this fray]

>-----Original Message-----
>From: bioperl-l-bounces at lists.open-bio.org 
>[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
>Sent: Monday, June 12, 2006 3:35 PM
>To: 'Sendu Bala'; bioperl-l at lists.open-bio.org
>Subject: Re: [Bioperl-l] Output a subset of FASTA data from a 
>single largefile
>> Chris Fields wrote:
>> > There's information in the HOWTOs:
>> >
>> > http://www.bioperl.org/wiki/HOWTO:Flat_databases
>> >
>> > http://www.bioperl.org/wiki/HOWTO:OBDA
>> >
>> As you later discovered, that was an Outlook problem. Just 
>to make this
>> thread relevant to bioperl, the bioperl solution is:
>Agreed (stupid Outlook).  It might be much faster to use 
>ways, but it is easier to further manipulate sequences (convert format,
>analyze sequences, etc) using Bioperl directly.  I haven't used flat
>databases much but it should move very quickly, even in an OO 
>The one problem with the proposed non-bioperl method is, if you wanted
>100,000 sequences (based on ID's) in a FASTA database file containing
>200,000 sequences, all ID's would need to be stored (1) in an 
>array (which
>gulped the data from the ID file) and then map the ID's to (2) a hash;
>that's may be a pretty big memory footprint depending on your system.  
>Sendu's BioPerl version indexes the FASTA file based on the 
>ID, then (1)
>reads the ID's in one at a time from the file, (2) retrieves 
>the data, then
>(3) prints it out.   The advantage of this approach is that 
>the built index
>can be used in other bioperl scripts as well w/o having to 
>rebuild it again,
>so if you wanted a different set of ID's later on you can access the
>database using the prebuilt index.  More can be found in the
>Bio::Index::Fasta POD.  
>You can also use the ideas and code in the HOWTO (Flat Databases) I
>mentioned, which focuses on the Bio::DB::Flat system and ODBA.  The
>advantage of these is that you can use Sleepycat's Berkeley 
>Database through
>the Perl BerkeleyDB module (more functionality than DB_File) 
>which is faster
>than a standard flat database.  In the HOWTO, specifically look under
>'Secondary or custom namespaces' for ideas on how to use your ID as a
>primary or secondary key.
>> use Bio::SeqIO;
>> use Bio::Index::Fasta;
>> my $inx = Bio::Index::Fasta->new(-write_flag => 1);
>> $inx->id_parser(\&get_id);
>> $inx->make_index(shift);
>> my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
>> my $wanted_ids_file = shift;
>> open(IDS, $wanted_ids_file);
>> while (<IDS>) {
>>    chomp;
>>    my $seq = $inx->fetch($_);
>>    $out->write_seq($seq);
>> }
>> sub get_id {
>>    my $line = shift;
>>    $line =~ /^>probe:\S+?:(\S+?):/;
>>    $1;
>> }
>> It works for me on the sample sequence given by the OP.
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>Bioperl-l mailing list
>Bioperl-l at lists.open-bio.org

More information about the Bioperl-l mailing list