[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Sean Davis sdavis2 at mail.nih.gov
Tue Feb 14 15:02:59 EST 2006


You can look get the upstream regions for genes via the table browser at
UCSC.  If you want to do it yourself, just download their refGene table (as
a tab-delimited text file) that includes the HUGO gene name.  Then, use the
method given by Brian to look up the locations.  The genome just isn't THAT
big to download and to store locally.  Note that most of the big sites (like
NCBI, for example) impose restrictions on the number and timing of hits, so
utilizing them for high-thoughput analysis (like for gene expression
studies) is not always feasible.  I have found that having the data locally
is almost always better.

Sean
 


On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:

> Hi Brian,
> 
> Thanks very much for the pointers and the speed of your reply and apologies
> for the speed of mine.
> 
> This looks good, but what I was looking for was a bioP approach for hooking to
> an API at NCBI or EBI so I could get this info and seqs from them.  In this
> case, speed of retrieval is not critical and I'd rather not download the
> entirety of the sequences to a local disk to hack at them.
> 
> I've determined a screen-scraping approach to get them and could script that,
> but I thought that bioP had a method for using NCBI's external API's, tho it
> may be that my memory is faulty or the approach is no longer supported due to
> overload.  
> 
> Does NCBI make such APIs available anymore?  I searched a bit for docs on them
> but couldn't find anything (unless it's buried in the NCBI tookit, which I
> haven't started to excavate).
> 
> Failing that, would SEALS provide such a service? Any PerlPinipeds listening?
> 
> Harry
> 
> 
> 
> 
> 
> 
> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
>> Harry,
>> 
>> Hope you're doing well. The approach could be based on Bio::DB::Fasta. So,
>> from its documentation:
>> 
>>   use Bio::DB::Fasta;
>> 
>>   # create database from directory of fasta files
>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>> 
>>   # simple access (for those without Bioperl)
>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>>   my @ids     = $db->ids;
>>   my $length   = $db->length('CHROMOSOME_I');
>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
>>   my $header   = $db->header('CHROMOSOME_I');
>> 
>>   # Bioperl-style access
>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
>> 
>>   my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>>   my $seq     = $obj->seq;
>>   my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>> 
>> Do you already have the offsets?
>> 
>> Brian O.
>> 
>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
>>> Hi All,
>>> 
>>> After perusing the tutorial and other docs for a an evening, I still
>>> can't find the answer to this.  Forgive me if I've missed something
>>> obvious.
>>> 
>>> This should not be a novel request, but I've not found it answered.  If
>>> bioperl isn't the best way to do this, I'd be grateful to a pointer to a
>>> better way, especially if it includes an illuminating bit of code.
>>> 
>>> The problem is to retrieve genomic sequences plus & minus some offset
>>> from a locus determined by HUGO keyword or GeneID.  This would be a
>>> common followup chore for some extra analysis from a gene expression
>>> expt.  Or maybe this is in the DBFetch routines, but I've missed the
>>> sequence type to specify...?
>>> 
>>> 
>>> TIA!




More information about the Bioperl-l mailing list