[Bioperl-l] Assistance with a BioPerl/Perl project

Colin Erdman cerdman2 at du.edu
Thu Mar 24 16:46:36 EST 2005

So in effect, this is just as good as taking the actual nucleotide sequences
(derived using a GenBank lookup) for my static accession number list and
running them against the 'member sequences' of my genes (clusters) of
interest in order to see if any new gene products or information have been
added for that sequence? And where would you suspect that BLASTN will then
fit into the scheme? I apologize for the redundancy; there is just so much
to take in!


-----Original Message-----
From: Sean Davis [mailto:sdavis2 at mail.nih.gov] 
Sent: Thursday, March 24, 2005 11:50 AM
To: Colin Erdman
Cc: bioperl-l at portal.open-bio.org
Subject: Re: [Bioperl-l] Assistance with a BioPerl/Perl project

If you are starting with GenBank accession numbers and want to get to 
Entrez Gene, the "standard" way to do that is to use UniGene.  If you 
go to the Entrez website and choose the UniGene database, you can type 
in your accession and you will be taken to a UniGene record.  If you 
click on the "links" section, you can then link to Entrez Gene.

To do this in batch mode, I download Hs.data.gz from NCBI at:


Then, you can use Bio::ClusterIO to parse the UniGene file.  Grab the 
accession_number part of each sequence (there is an example of doing 
this in the POD documentation).  You can then make a hash like:


which maps accessions to UniGene IDs.

Make a second hash that maps unigene to gene using the file:


which will map the unigene ids to gene.

Then, you have the information you need to map from accession to gene 
via UniGene.
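
The two-hash approach described above can be sketched roughly as follows. 
This is only an illustration in plain Python rather than Perl/BioPerl, 
and it rests on several assumptions that are not from the original 
message: that the UniGene flat file uses "ID" and "SEQUENCE ... ACC=" 
lines with "//" between records, that the second file is a two-column 
tab-delimited table (UniGene ID, then gene ID), and that accession 
versions (".1") should be dropped.  The function names and the sample 
data are hypothetical.  In Perl the same dictionaries would simply be 
hashes filled while looping over Bio::ClusterIO clusters.

```python
import io

def parse_unigene_accessions(handle):
    """Build the first hash: accession -> UniGene cluster ID.

    Assumes 'ID' and 'SEQUENCE ... ACC=...' lines, records ending in '//'.
    """
    acc_to_unigene = {}
    cluster_id = None
    for line in handle:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "//":            # end of the current cluster record
            cluster_id = None
        elif tokens[0] == "ID" and len(tokens) > 1:
            cluster_id = tokens[1]
        elif tokens[0] == "SEQUENCE" and cluster_id:
            fields = line.split(None, 1)[1].split(";")
            for field in fields:
                field = field.strip()
                if field.startswith("ACC="):
                    # Assumption: drop the version suffix (".1") so keys
                    # match a list of unversioned accessions.
                    accession = field[4:].split(".")[0]
                    acc_to_unigene[accession] = cluster_id
    return acc_to_unigene

def parse_unigene_to_gene(handle):
    """Build the second hash: UniGene ID -> gene ID.

    Assumes a hypothetical tab-delimited file: UniGene ID, then gene ID.
    """
    unigene_to_gene = {}
    for line in handle:
        if line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            unigene_to_gene[parts[0]] = parts[1]
    return unigene_to_gene

def accession_to_gene(acc_to_unigene, unigene_to_gene):
    """Chain the two hashes; accessions whose cluster has no gene are skipped."""
    return {
        acc: unigene_to_gene[ug]
        for acc, ug in acc_to_unigene.items()
        if ug in unigene_to_gene
    }

# Made-up sample data, for illustration only.
SAMPLE_UNIGENE = """ID          Hs.1
TITLE       hypothetical cluster one
SEQUENCE    ACC=AA000001.1; NID=g1; CLONE=x
SEQUENCE    ACC=BB000002.2; NID=g2
//
ID          Hs.2
SEQUENCE    ACC=CC000003.1; NID=g3
//
"""
SAMPLE_MAP = "Hs.1\t1001\n"

acc_map = parse_unigene_accessions(io.StringIO(SAMPLE_UNIGENE))
gene_map = parse_unigene_to_gene(io.StringIO(SAMPLE_MAP))
result = accession_to_gene(acc_map, gene_map)
print(result)
```

With real files you would pass open file handles instead of the 
io.StringIO samples.  Accessions that land in a cluster with no gene 
entry are left out of the final mapping, which makes them easy to spot 
by comparing against the original accession list.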

Just a note on Entrez Gene:  a Gene record does not represent a sequence, 
but instead a set of sequences.  The sequences are RefSeq sequences.  
So, you wouldn't be blasting against "Gene" per se, but against the 
one or several RefSeq sequences (if there are any) that represent the 

Hope this helps.  Standard disclaimer:  as with Perl AND 
bioinformatics, there is more than one way to do this.  And keep in 
mind that Entrez Gene is only one source of annotation; for chromosome 
21, there may be other sites that have more information, specifically 


On Mar 24, 2005, at 12:54 PM, Colin Erdman wrote:

> Hello list,
> I am a 22-year-old bioinformatics and molecular biology major at the
> University of Denver. I just accepted a position with a researcher here,
> and already have a first assignment. We are working on a comprehensive
> chromosome 21 gene database and map, and my first task is to update a
> list of known (and curated) human chromosome 21 genes. I have become
> rapidly familiar with BioPerl; however, my adviser needs me to use Entrez
> Gene to compare the currently known Chr 21 genes (from query: '21[CHR] AND
> Homo sapiens[ORGN] AND NOT Pseudogene') with a list of genes that she has
> provided in xls and xml format.
> The idea is to take the accession numbers in the provided files, pull the
> nucleotide sequences for them, and run those against the sequences for
> records found with the Entrez Gene query in order to find any newly
> annotated/(discovered/elucidated?) genes for that sequence. I am familiar
> with the current problem of BioPerl not directly being able to parse the
> EntrezGene object, but have played with the Bio::SeqIO::Gene2accession (&
> geneinfo) and the egparser. My programming skills are not completely up to
> par, so egparser is tough for me to grasp. Bio::SeqIO::Gene2accession is
> more intuitive; however, I am having a terrible time figuring out how to
> convert my desired entrezgene results into the legacy gene_info and
> gene2accession formats. Any suggestions are greatly appreciated. I am very
> new at this, so very simple coding examples and explanations help and are
> the best way for me to learn.
> Thanks all!
> colin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
