[Bioperl-l] Assistance with a BioPerl/Perl project
sdavis2 at mail.nih.gov
Thu Mar 24 13:49:40 EST 2005
If you are starting with GenBank accession numbers and want to get to
Entrez Gene, the "standard" way to do that is to use UniGene. If you
go to the Entrez website and choose the UniGene database, you can type
in your accession and you will be taken to a UniGene record. If you
click on the "links" section, you can then link to Entrez Gene.
To do this in batch mode, I download Hs.data.gz from NCBI at:
Then, you can use Bio::ClusterIO to parse UniGene. Grab the
accession_number part of each sequence (there is an example of doing
this in the POD documentation). You can then make a hash like:
which maps accessions to UniGene IDs.
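For what it's worth, here is a minimal sketch of that first hash. It assumes Hs.data.gz has already been decompressed to a plain Hs.data file and that BioPerl is installed. The sub name build_acc2ug and the script name in the comment are only illustrative. Accessions may or may not carry a version suffix depending on the file, so check a few keys before relying on exact matches.

```perl
use strict;
use warnings;

# Build a hash of accession => UniGene cluster ID from any stream object
# that offers next_cluster(), such as a Bio::ClusterIO object.
# (build_acc2ug is an illustrative name, not part of BioPerl.)
sub build_acc2ug {
    my ($stream) = @_;
    my %acc2ug;
    while ( my $cluster = $stream->next_cluster ) {
        my $ug_id = $cluster->unigene_id;    # cluster ID, e.g. "Hs.2"
        while ( my $seq = $cluster->next_seq ) {
            # If an accession shows up in more than one cluster,
            # the last one seen wins in this simple version.
            $acc2ug{ $seq->accession_number } = $ug_id;
        }
    }
    return \%acc2ug;
}

# Run as a script, e.g.:  perl acc2ug.pl Hs.data
# (the file must be the decompressed UniGene data file)
if (@ARGV) {
    require Bio::ClusterIO;
    my $stream = Bio::ClusterIO->new(
        -file   => $ARGV[0],
        -format => 'unigene',
    );
    my $acc2ug = build_acc2ug($stream);
    print "$_\t$acc2ug->{$_}\n" for sort keys %$acc2ug;
}
```

Keeping the loop in its own sub means you can try it on a small excerpt of the file before running it on the whole thing.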
Make a second hash that maps UniGene to gene using the file:
which will map the UniGene IDs to genes.
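A rough sketch of reading such a mapping file into the second hash. I am assuming a plain tab-delimited file with one UniGene ID and one gene ID per line. The column order is an assumption, so the column indexes are passed in, and read_two_column_map is just an illustrative name.

```perl
use strict;
use warnings;

# Read a tab-delimited mapping from an open filehandle into a hash.
# $key_col and $val_col are 0-based column indexes; which column holds
# the UniGene ID and which the gene ID is an assumption to check
# against the actual file.
sub read_two_column_map {
    my ( $fh, $key_col, $val_col ) = @_;
    my %map;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^#/ || $line !~ /\S/;    # skip comments, blanks
        my @fields = split /\t/, $line;
        next unless defined $fields[$key_col] && defined $fields[$val_col];
        # A repeated key keeps only the last value in this simple version.
        $map{ $fields[$key_col] } = $fields[$val_col];
    }
    return \%map;
}
```

You would call it with something like open(my $fh, '<', $file) followed by read_two_column_map($fh, 0, 1), swapping the two indexes if the gene column comes first.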
Then, you have the information you need to map from accession to gene.
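Chaining the two lookups could look something like this. It is only a sketch: accession_to_gene is an illustrative name, and both hashes are assumed to be plain hash references keyed as described above.

```perl
use strict;
use warnings;

# Chain accession => UniGene and UniGene => gene into accession => gene.
# Accessions with no UniGene cluster, or whose cluster has no gene,
# are kept in the result with an undef value so they are easy to spot.
sub accession_to_gene {
    my ( $acc2ug, $ug2gene, @accessions ) = @_;
    my %acc2gene;
    for my $acc (@accessions) {
        my $ug = $acc2ug->{$acc};
        $acc2gene{$acc} = defined $ug ? $ug2gene->{$ug} : undef;
    }
    return \%acc2gene;
}
```

The accessions that come back undef are the ones worth looking at by hand, since they did not resolve through UniGene.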
Just a note on Entrez Gene: a Gene record does not represent a single
sequence, but instead a set of sequences. The sequences are RefSeq sequences.
So, you wouldn't be blasting against "Gene" per se, but against the
one or several RefSeq sequences (if there are any) that represent the gene.
Hope this helps. Standard disclaimer: as with Perl AND
bioinformatics, there is more than one way to do this. And keep in
mind that Entrez Gene is only one source of annotation; for chromosome
21, there may be other sites that have more information, specifically
On Mar 24, 2005, at 12:54 PM, Colin Erdman wrote:
> Hello list,
> I am a 22 year old bioinformatics and molecular biology major at the
> University of Denver. I just accepted a position with a researcher
> here, and already have a first assignment. We are working on a
> comprehensive chromosome 21 gene database and map and my first task is
> to update a list of known (and curated) Human chromosome 21 genes. I
> have become rapidly familiar with BioPerl however my adviser needs me
> to use Entrez Gene to compare the currently known Chr 21 genes (from
> query: '21[CHR] AND Homo sapiens[ORGN] AND NOT Pseudogene' ) with a
> list of genes that she has provided in xls and xml format.
> The idea is to take the accession numbers in the provided files, pull
> nucleotide sequence from them, and run those against the sequences for
> records found with the Entrez Gene query in order to find any newly
> annotated/(discovered/elucidated?) genes for that sequence. I am faced
> with the current problem of BioPerl not directly being able to parse an
> EntrezGene object, but have played with the Bio::SeqIO::Gene2accession
> (and geneinfo) and the egparser. My programming skills are not completely
> up to par, so egparser is tough for me to grasp. Bio::SeqIO::Gene2accession
> is more intuitive; however, I am having a terrible time figuring out how to
> convert my desired entrezgene results into the legacy gene_info and
> gene2accession formats. Any suggestions are greatly appreciated. I am
> new at this, so very simple coding examples and explanations help and are
> the best way for me to learn.
> Thanks all!