[Bioperl-l] Output a subset of FASTA data from a single largefile

Chris Fields cjfields at uiuc.edu
Mon Jun 12 17:20:55 EDT 2006


Sorry Malcolm.  I didn't want to imply that your way or the bioperl way was
best, just point out advantages/disadvantages.  

Oops, didn't point out the possible Bioperl disadvantage (too many objects
generated = slow slow slow).  

Chris

> -----Original Message-----
> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
> Sent: Monday, June 12, 2006 4:16 PM
> To: Chris Fields; Sendu Bala; bioperl-l at lists.open-bio.org
> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a single
> largefile
> 
> Yeah, good points...
> 
> ... my recommendation of the one-liner was motivated based on a small
> number of IDs and no other applications needing to index the entire
> fasta database.
> 
> 
> --Malcolm [At which point he bowed out of this fray]
> 
> >-----Original Message-----
> >From: bioperl-l-bounces at lists.open-bio.org
> >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields
> >Sent: Monday, June 12, 2006 3:35 PM
> >To: 'Sendu Bala'; bioperl-l at lists.open-bio.org
> >Subject: Re: [Bioperl-l] Output a subset of FASTA data from a
> >single largefile
> >
> >...
> >> Chris Fields wrote:
> >> > There's information in the HOWTOs:
> >> >
> >> > http://www.bioperl.org/wiki/HOWTO:Flat_databases
> >> >
> >> > http://www.bioperl.org/wiki/HOWTO:OBDA
> >> >
> >...
> >> As you later discovered, that was an Outlook problem. Just
> >to make this
> >> thread relevant to bioperl, the bioperl solution is:
> >
> >Agreed (stupid Outlook).  It might be much faster to use
> >non-Bioperl-ish
> >ways, but it is easier to further manipulate sequences (convert format,
> >analyze sequences, etc) using Bioperl directly.  I haven't used flat
> >databases much but it should move very quickly, even in an OO
> >environment.
> >
> >The one problem with the proposed non-bioperl method is, if you wanted
> >100,000 sequences (based on ID's) in a FASTA database file containing
> >200,000 sequences, all ID's would need to be stored (1) in an
> >array (which
> >gulped the data from the ID file) and then map the ID's to (2) a hash;
> >that's may be a pretty big memory footprint depending on your system.
> >
> >Sendu's BioPerl version indexes the FASTA file based on the
> >ID, then (1)
> >reads the ID's in one at a time from the file, (2) retrieves
> >the data, then
> >(3) prints it out.   The advantage of this approach is that
> >the built index
> >can be used in other bioperl scripts as well w/o having to
> >rebuild it again,
> >so if you wanted a different set of ID's later on you can access the
> >database using the prebuilt index.  More can be found in the
> >Bio::Index::Fasta POD.
> >
> >You can also use the ideas and code in the HOWTO (Flat Databases) I
> >mentioned, which focuses on the Bio::DB::Flat system and ODBA.  The
> >advantage of these is that you can use Sleepycat's Berkeley
> >Database through
> >the Perl BerkeleyDB module (more functionality than DB_File)
> >which is faster
> >than a standard flat database.  In the HOWTO, specifically look under
> >'Secondary or custom namespaces' for ideas on how to use your ID as a
> >primary or secondary key.
> >
> >Chris
> >
> >> use Bio::SeqIO;
> >> use Bio::Index::Fasta;
> >> my $inx = Bio::Index::Fasta->new(-write_flag => 1);
> >> $inx->id_parser(\&get_id);
> >> $inx->make_index(shift);
> >>
> >> my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => \*STDOUT);
> >> my $wanted_ids_file = shift;
> >> open(IDS, $wanted_ids_file);
> >> while (<IDS>) {
> >>    chomp;
> >>    my $seq = $inx->fetch($_);
> >>    $out->write_seq($seq);
> >> }
> >>
> >> sub get_id {
> >>    my $line = shift;
> >>    $line =~ /^>probe:\S+?:(\S+?):/;
> >>    $1;
> >> }
> >>
> >> It works for me on the sample sequence given by the OP.
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l at lists.open-bio.org
> >http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >



More information about the Bioperl-l mailing list