[Bioperl-l] retrieving by acc from a local multifasta file

Barry Moore barry.moore at genetics.utah.edu
Fri Aug 6 08:45:33 EDT 2004


If this is a one-off script, and you are doing something simple with 
your sequences once you extract them, then you may not need to use 
BioPerl at all.  You could read the complete uniprot_sprot.fasta file 
into a hash keyed on the accession to create a simple database in 
memory.  Then you can retrieve the sequences you need by accession.  It 
will take a while to build that hash even on a fairly good computer, so 
it's not an approach that you would want to use for a script that you 
will run a lot.  Try the following code.



use strict;
use warnings;

#Your list of SwissProt accessions.
my @accs = ('Q43495', 'P13813', 'P15455');

#Open and read your uniprot file.
open (my $in, '<', "uniprot_sprot.fasta")
  or die "Can't open uniprot_sprot.fasta: $!";
my $uniprot_data = join "", (<$in>);
close $in;

#Extract fasta sequences into a hash keyed on the accession.
#The regex expects the accession in parentheses in the header line.
my %seq_db;
while($uniprot_data =~ /^>.*?\(([\d\w]{6})\).*?\n(^(?!>).*\n)+/gm) {
  $seq_db{$1} = $&;
}

#Loop over your accessions, and do something with the sequence.
for my $acc (@accs) {
  if (exists $seq_db{$acc}) {
    print "$seq_db{$acc}\n\n";
  }
  else {
    warn "No sequence found for $acc\n";
  }
}

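
If holding the whole file in memory is a problem, a variation on the same idea is to read one record at a time and keep only the accessions you asked for.  The following is a minimal sketch, not a tested solution for the real uniprot_sprot.fasta: the helper name extract_by_accession is made up for illustration, the three sample records are invented placeholders, and it assumes the same header layout as the regex above (accession in parentheses on the header line).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): scan FASTA records from an open
# filehandle and keep only those whose accession is in the wanted list.
sub extract_by_accession {
    my ($fh, $wanted_accs) = @_;
    my %wanted = map { $_ => 1 } @$wanted_accs;
    my %found;

    # Read one FASTA record per iteration: records are separated by "\n>".
    local $/ = "\n>";
    while (my $record = <$fh>) {
        chomp $record;        # drop the trailing "\n>" separator, if any
        $record =~ s/^>//;    # only the first record still starts with ">"

        # Assumption: the accession sits in parentheses on the header line.
        my ($acc) = $record =~ /^[^\n]*?\(([A-Z0-9]{6})\)/ or next;
        next unless $wanted{$acc};

        $record .= "\n" unless $record =~ /\n\z/;
        $found{$acc} = ">$record";
    }
    return \%found;
}

# Invented sample data so the sketch runs without the real file.
my $sample = <<'FASTA';
>DEMO1_TEST (Q43495) made-up record one
MKTAYIAK
QRQISFVK
>DEMO2_TEST (Z99999) made-up record two
MSSHEGGK
>DEMO3_TEST (P15455) made-up record three
MAAALLLV
FASTA

my @accs = ('Q43495', 'P13813', 'P15455');

# An in-memory filehandle stands in for uniprot_sprot.fasta here.
open my $fh, '<', \$sample or die "Can't open in-memory sample: $!";
my $seqs = extract_by_accession($fh, \@accs);
close $fh;

for my $acc (@accs) {
    if (exists $seqs->{$acc}) {
        print $seqs->{$acc}, "\n";
    }
    else {
        warn "No sequence found for $acc\n";
    }
}
```

To run it against the real file, open uniprot_sprot.fasta instead of the in-memory sample and pass that filehandle to the helper.  Memory use then scales with the number of accessions requested rather than with the size of the file, though each run still reads the file from start to finish.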
Maria Persico wrote:

>Hi All,
>This may be a stupid problem but for me it's something difficult:
>I have a list of swissprot accessions (my_acc) and I want to extract from
>uniprot_sprot.fasta only the sequences in my list.
>How can I do this with bioperl?
>Maria Persico
>MINT database, Cesareni Group
>Universita' di Tor Vergata, via della Ricerca Scientifica
>00133 Roma, Italy
>Tel: +39 0672594315
>FAX: +39 0672594766
>e-mail: maria at cbm.bio.uniroma2.it

Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT
