[Bioperl-l] Uniprot/Swiss accessions?

Smithies, Russell Russell.Smithies at agresearch.co.nz
Mon May 18 17:52:31 EDT 2009

Hi guys,
Thanks for your suggestions.

With the magic of awk and comm, I split the amalgamated accessions and created lists of swissprot IDs for both the file from NCBI and the file from Uniprot.
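For anyone who would rather not do the splitting in awk, here is a rough Python sketch of the same step. It rests on assumptions not stated above: that the amalgamated deflines join their entries with a Ctrl-A character, and that each Swiss-Prot entry carries an "sp|ACCESSION|" token. The function names and the sample deflines in the usage note are made up for illustration.

```python
import re

# Assumption: a Swiss-Prot entry looks like "...sp|P12345.1|NAME ..." or
# "sp|P12345|NAME ...". Any ".N" version suffix is dropped.
SP_TOKEN = re.compile(r"(?:^|\|)sp\|([A-Z0-9]+)(?:\.\d+)?\|")


def swissprot_accessions(defline):
    """Return the Swiss-Prot accessions found in one FASTA defline."""
    accessions = []
    # Assumption: merged entries are separated by Ctrl-A (\x01).
    for entry in defline.lstrip(">").split("\x01"):
        match = SP_TOKEN.search(entry)
        if match:
            accessions.append(match.group(1))
    return accessions


def accessions_from_fasta(path):
    """Collect the unique accessions from every '>' line of a FASTA file."""
    found = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                found.update(swissprot_accessions(line.rstrip("\n")))
    return found
```

Calling swissprot_accessions(">gi|1|sp|P00001.2|AAAA_TEST first\x01gi|2|sp|Q00002|BBBB_TEST second") on that invented defline gives one accession per merged entry, which can then be written out one per line.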

sp_ncbi_accessions.txt          458,377 ids
sp_uniprot_accessions.txt       466,739 ids

* The NCBI file has 95 ids that don't appear in the Uniprot list.
* The Uniprot file has 8,457 ids that don't appear in the NCBI list.
* There are 458,282 ids that appear on both lists.
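The three counts are consistent with the list sizes: 458,282 + 95 = 458,377 and 458,282 + 8,457 = 466,739. The comparison itself is what comm does on sorted files; a minimal Python equivalent, assuming one id per line in each file, might look like this (the function names are my own):

```python
def read_ids(path):
    """Read one id per line, skipping blank lines."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def compare_id_lists(ncbi_ids, uniprot_ids):
    """Split two id collections into NCBI-only, Uniprot-only and shared sets."""
    ncbi, uniprot = set(ncbi_ids), set(uniprot_ids)
    return {
        "ncbi_only": ncbi - uniprot,
        "uniprot_only": uniprot - ncbi,
        "shared": ncbi & uniprot,
    }
```

Usage would be along the lines of compare_id_lists(read_ids("sp_ncbi_accessions.txt"), read_ids("sp_uniprot_accessions.txt")), then taking len() of each set.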

I did a quick random sample of the 8,457 ids unique to Uniprot. None could be found in the "protein" database at NCBI, but all were in the "gene" database as "reference sequences that belong to a specific genome build", and all belonged to recently sequenced bacterial genomes. As none are in the "protein" database, they don't have GI numbers.

The 95 ids that were at NCBI but not in Uniprot were usually (random sample again) described as "putative protein" (or "very putative protein" in one case) and are the result of gene predictions, e.g. http://www.ncbi.nlm.nih.gov/protein/48429254

So what I'll do is use the NCBI database, add in the extra 8,457 ids unique to Uniprot, and assign them fake GI numbers so I can formatdb them with the "-o T" option.
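One way the fake GI assignment could be sketched in Python is below. This is only an illustration under my own assumptions: the placeholder numbers start just above the largest GI already in use (or at a chosen start value), and the deflines follow the usual "gi|NUMBER|sp|ACCESSION|" layout. The function name is hypothetical.

```python
def placeholder_gi_deflines(accessions, used_gis, start=None):
    """Yield (accession, defline prefix) pairs using GI numbers not in used_gis.

    Assumption: the placeholder numbers only need to be unique within this
    local database; they are not real NCBI GI numbers.
    """
    used = set(used_gis)
    next_gi = start if start is not None else max(used, default=0) + 1
    for accession in accessions:
        # Skip over any number that is already taken.
        while next_gi in used:
            next_gi += 1
        used.add(next_gi)
        yield accession, f">gi|{next_gi}|sp|{accession}|"
        next_gi += 1
```

It is probably worth saving the accession-to-placeholder mapping to a file, so the fake numbers can be recognised and stripped later if real GI numbers are ever assigned.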

Thanks again for your help,

