[Bioperl-l] Finding seqs of given domain architecture
David.Messina at sbc.su.se
Fri Apr 18 05:31:59 EDT 2008
I talked about your question with a colleague of mine who has been working
in this area. Below is his reply.
[I'm reposting this *without* the attachment mentioned since the mailing
list wouldn't accept it otherwise. If anyone wants a copy of the code, just
> 3. Pfam has this capability, i.e. to show all domains with a given
> architecture, but it is difficult to get at the actual sequences or
> even a list of accession numbers.
First, this should be available right away in PfamAlyser:
although you might need to upgrade your browser to Java 1.6 to get it to
If this does not work as suggested (an upgraded version is coming
eventually), have a look at the file:
which contains the Pfam architectures for all UniProt sequences. You can
parse that to get a file of <accession number>-<list of domains>
correspondences and then filter that file to get the accession numbers.
(Please find attached a Perl script to do just that.)
Under UNIX, you can then just grep this for the domain IDs,
(like grep PF00008 domainArchitectureFile.txt | grep PF00456 >
but I am sure there are solutions under other operating systems as well.
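For anyone not on UNIX, the same filter can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes the parsed file has one protein per line, with the accession, a tab, and then the Pfam IDs separated by spaces. That layout, the function name and the sample accessions are all made up here and are not taken from the attached script.

```python
def accessions_with_domains(lines, required):
    """Return accessions whose domain list contains every ID in `required`.

    Assumed (hypothetical) line layout: "<accession>\t<PFxxxxx PFyyyyy ...>".
    """
    required = set(required)
    hits = []
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blank or malformed lines
        accession, domains = line.split("\t", 1)
        # Compare whole IDs, so PF00008 never matches PF000081.
        if required <= set(domains.split()):
            hits.append(accession)
    return hits


# Made-up example records, not real Pfam data.
sample = [
    "ACC0001\tPF00008 PF00456",
    "ACC0002\tPF00008",
    "ACC0003\tPF00456 PF00008 PF00008",
]
print(accessions_with_domains(sample, ["PF00008", "PF00456"]))
# prints ['ACC0001', 'ACC0003']
```

To run it on a real file, pass an open file handle instead of the sample list. Unlike a plain grep, this matches complete IDs rather than substrings.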
You could then write a script to parse out the corresponding sequences
from the UniProt fasta flatfile, if you wanted, or (again under UNIX) a
script to wget them off the web page.
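Pulling the matching sequences out of a local FASTA file could look roughly like the sketch below. It assumes headers of the form ">db|ACCESSION|NAME description" and falls back to the first word of the header otherwise; check this against the headers in your own copy of the file. The function names and the sample records are invented for illustration.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def accession_of(header):
    """Assumption: the header starts with 'db|ACCESSION|NAME'."""
    words = header.split()
    first = words[0] if words else ""
    parts = first.split("|")
    return parts[1] if len(parts) >= 3 else first


def select_sequences(lines, wanted):
    """Return {accession: sequence} for the accessions listed in `wanted`."""
    wanted = set(wanted)
    selected = {}
    for header, sequence in fasta_records(lines):
        accession = accession_of(header)
        if accession in wanted:
            selected[accession] = sequence
    return selected


# Made-up records, not real UniProt entries.
fasta = """>sp|ACC0001|EXAMPLE_ONE made-up protein
MKTAYIAK
QRQISFVK
>sp|ACC0002|EXAMPLE_TWO another made-up protein
MSDNE
""".splitlines()
print(select_sequences(fasta, ["ACC0001"]))
# prints {'ACC0001': 'MKTAYIAKQRQISFVK'}
```

The file is read line by line, so only the selected sequences are held in memory, which matters for a file the size of the full UniProt set.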
In case your sequences are not in UniProt, consider using HMMER and the
Pfam HMM files to assign domains to all sequences in your dataset. I
would then parse the HMMER output into the same format as the above, and
use the same approach following that.
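As a rough sketch of that conversion, the code below assumes the per-domain table that HMMER 3 writes with hmmscan --domtblout. HMMER 3 is newer than this message, so treat the format as an assumption and adapt the column positions if your output differs. It takes the Pfam accession from column 2, the sequence name from column 4, the independent E-value from column 13 and the alignment start from column 18. The cutoff value, the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict


def architectures_from_domtblout(lines, max_evalue=1e-3):
    """Map each sequence to its Pfam IDs, ordered along the sequence.

    Assumes whitespace-separated hmmscan --domtblout rows (HMMER 3).
    """
    hits = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        if len(fields) < 22:
            continue  # not a complete per-domain row
        pfam_id = fields[1].split(".")[0]  # drop the version suffix
        seq_id = fields[3]
        i_evalue = float(fields[12])
        ali_from = int(fields[17])
        if i_evalue <= max_evalue:
            hits[seq_id].append((ali_from, pfam_id))
    return {seq: [dom for _, dom in sorted(doms)] for seq, doms in hits.items()}


# Made-up rows in the assumed layout, not real HMMER output.
rows = [
    "# comment line",
    "DomY PF00456.1 100 seqA - 300 1e-20 70.0 0.0 1 1 1e-22 1e-20 70.0 0.0 1 100 150 250 149 251 0.97 made-up",
    "DomX PF00008.1 32 seqA - 300 1e-10 40.0 0.1 1 1 1e-12 1e-10 40.0 0.1 1 32 10 41 9 42 0.95 made-up",
    "DomX PF00008.1 32 seqB - 120 0.5 5.0 0.1 1 1 0.4 0.5 5.0 0.1 1 32 60 90 59 91 0.60 made-up",
]
for seq, domains in architectures_from_domtblout(rows).items():
    print(seq + "\t" + " ".join(domains))
# prints: seqA<TAB>PF00008 PF00456
```

The weak seqB hit is dropped by the E-value cutoff, and the seqA domains are reported in sequence order rather than file order.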
Hope this helps,
krifo at sbc.su.se