[Bioperl-l] Fetching genomic sequences based on HUGO names or GeneIDs

Harry Mangalam hjm at tacgi.com
Thu Feb 16 11:23:02 EST 2006


Yes, I'm going to try this first.  Also, the pointer to the NCBI eutils page
was helpful.  They describe the same thing and I think that API will give me
what I need.  I'll post back to report.
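
For anyone following along, here is a rough sketch of the kind of eutils
request that fetches a sub-range of a genomic record. It only builds the
EFetch URL and does not touch the network. The helper name
efetch_region_url and the example accession are illustrative assumptions;
seq_start, seq_stop and strand are the EFetch parameters for sequence
databases as I understand them, so check the eutils docs before relying on
this.

```perl
use strict;
use warnings;

# Base URL of NCBI's EFetch utility.
my $EFETCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';

# Build an EFetch URL for a sub-range of a nucleotide record.
# (Hypothetical helper, not part of BioPerl.)
#   $acc    - genomic accession, e.g. an NC_ record
#   $start  - 1-based first base to retrieve
#   $stop   - 1-based last base to retrieve
#   $strand - 1 for plus, 2 for minus (defaults to 1)
sub efetch_region_url {
    my ($acc, $start, $stop, $strand) = @_;
    $strand = 1 unless defined $strand;
    die "start must not exceed stop\n" if $start > $stop;

    my @params = (
        db        => 'nuccore',
        id        => $acc,
        seq_start => $start,
        seq_stop  => $stop,
        strand    => $strand,
        rettype   => 'fasta',
        retmode   => 'text',
    );

    # Join key/value pairs in order. Values are not URL-escaped here,
    # which is fine for plain accessions and integers.
    my @pairs;
    while (my ($key, $value) = splice(@params, 0, 2)) {
        push @pairs, "$key=$value";
    }
    return $EFETCH . '?' . join('&', @pairs);
}

# Usage (needs network access, so left commented out):
#   use HTTP::Tiny;
#   my $res = HTTP::Tiny->new->get(efetch_region_url('NC_000001.11', 1000, 2000));
#   print $res->{content} if $res->{success};
```

The gene coordinates still have to come from somewhere (Entrez Gene, for
instance); this only covers the retrieval step once start and stop are
known.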

Sorry for the delay in answering - this is a side project and as such is going 
slow.

Many thanks to you guys, especially Brian for the example code - much more 
than I had a right to expect.  Virtual Beers all round and real ones should 
we ever meet up.

Harry


On Thursday 16 February 2006 04:52, Chris Fields wrote:
> I think a method was recently implemented in Bio::DB::GenBank to
> retrieve a segment of DNA given start and end coordinates in GenBank
> format; that should contain the features you need.  I requested it
> ~Nov-Dec in the mailing list but didn't get a chance to test it.
> Would that help?
>
> On Feb 15, 2006, at 11:16 PM, Brian Osborne wrote:
> > Harry,
> >
> > It's not clear to me that NCBI's eutils offers this capability
> > directly. You can probably download Entrez Gene entries and parse them
> > for coordinates, but I know of no way to remotely retrieve genomic
> > sequences like this from NCBI (ENSEMBL API perhaps?). What I had in
> > mind uses the local approach that some of us favor, and to prove to
> > myself that this is simple to do I wrote a script that I just added to
> > examples/tools. It's called extract_genes.pl and it's based on
> > Bio::DB::Fasta. Download the sequence files for a given species to some
> > dir, download Entrez Gene's gene2accession file, and run. It creates
> > and stores a hash for lookups, so it won't read gene2accession each
> > time it runs.
> >
> > Brian O.
> >
> > On 2/14/06 12:15 PM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> >> Hi Brian,
> >>
> >> Thanks very much for the pointers and the speed of your reply and
> >> apologies for the speed of mine.
> >>
> >> This looks good, but what I was looking for was a bioP approach for
> >> hooking to an API at NCBI or EBI so I could get this info and seqs
> >> from them.  In this case, speed of retrieval is not critical and I'd
> >> rather not download the entirety of the sequences to a local disk to
> >> hack at them.
> >>
> >> I've determined a screen-scraping approach to get them and could
> >> script that, but I thought that bioP had a method for using NCBI's
> >> external APIs, tho it may be that my memory is faulty or the approach
> >> is no longer supported due to overload.
> >>
> >> Does NCBI make such APIs available anymore?  I searched a bit for docs
> >> on them but couldn't find anything (unless it's buried in the NCBI
> >> toolkit, which I haven't started to excavate).
> >>
> >> Failing that, would SEALS provide such a service? Any PerlPinnipeds
> >> listening?
> >>
> >> Harry
> >>
> >> On Sunday 12 February 2006 08:37, Brian Osborne wrote:
> >>> Harry,
> >>>
> >>> Hope you're doing well. The approach could be based on
> >>> Bio::DB::Fasta. So, from its documentation:
> >>>
> >>>   use Bio::DB::Fasta;
> >>>
> >>>   # create database from directory of fasta files
> >>>   my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> >>>
> >>>   # simple access (for those without Bioperl)
> >>>   my $seq      = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
> >>>   my $revseq   = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
> >>>   my @ids     = $db->ids;
> >>>   my $length   = $db->length('CHROMOSOME_I');
> >>>   my $alphabet = $db->alphabet('CHROMOSOME_I');
> >>>   my $header   = $db->header('CHROMOSOME_I');
> >>>
> >>>   # Bioperl-style access (reuses the $db handle created above, so
> >>>   # nothing is declared twice with "my" in the same scope)
> >>>   my $obj      = $db->get_Seq_by_id('CHROMOSOME_I');
> >>>   my $wholeseq = $obj->seq;
> >>>   my $subseq   = $obj->subseq(4_000_000 => 4_100_000);
> >>>
> >>> Do you already have the offsets?
> >>>
> >>> Brian O.
> >>>
> >>> On 2/12/06 1:46 AM, "Harry Mangalam" <hjm at tacgi.com> wrote:
> >>>> Hi All,
> >>>>
> >>>> After perusing the tutorial and other docs for an evening, I still
> >>>> can't find the answer to this.  Forgive me if I've missed something
> >>>> obvious.
> >>>>
> >>>> This should not be a novel request, but I've not found it answered.
> >>>> If bioperl isn't the best way to do this, I'd be grateful for a
> >>>> pointer to a better way, especially if it includes an illuminating
> >>>> bit of code.
> >>>>
> >>>> The problem is to retrieve genomic sequences plus & minus some
> >>>> offset from a locus determined by HUGO keyword or GeneID.  This
> >>>> would be a common followup chore for some extra analysis from a gene
> >>>> expression expt.  Or maybe this is in the DBFetch routines, but I've
> >>>> missed the sequence type to specify...?
> >>>>
> >>>>
> >>>> TIA!
> >
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
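
For the local approach discussed above, the lookup and the "plus & minus
some offset" arithmetic can be sketched in plain Perl, with no BioPerl
needed for this part. This is not the extract_genes.pl code. The function
names build_lookup and flank are made up for illustration, and the
gene2accession column positions and 0-based coordinates are assumptions to
verify against the header of the downloaded file.

```perl
use strict;
use warnings;

# Parse gene2accession-style lines into a GeneID => location lookup.
# ASSUMPTIONS (check your copy of the file): tab-delimited, GeneID in
# column 2, genomic accession in column 8, 0-based start and end in
# columns 10 and 11, orientation in column 12, '-' meaning "no value".
# Only the first usable row per GeneID is kept.
sub build_lookup {
    my ($fh) = @_;
    my %loc;
    while (my $line = <$fh>) {
        next if $line =~ /^#/;    # skip the header line
        chomp $line;
        my @f = split /\t/, $line;
        my ($gene_id, $acc, $start, $end, $strand) = @f[1, 7, 9, 10, 11];
        next unless defined $strand;    # short or malformed line
        next if $acc eq '-' or $start eq '-' or $end eq '-';
        next if exists $loc{$gene_id};
        $loc{$gene_id} = {
            acc    => $acc,
            start  => $start + 1,       # convert to 1-based
            end    => $end + 1,
            strand => $strand,
        };
    }
    return \%loc;
}

# Widen a 1-based interval by $offset on each side, clamped to the
# sequence bounds [1, $length].
sub flank {
    my ($start, $end, $offset, $length) = @_;
    my $lo = $start - $offset;
    $lo = 1 if $lo < 1;
    my $hi = $end + $offset;
    $hi = $length if $hi > $length;
    return ($lo, $hi);
}
```

With a Bio::DB::Fasta handle as in Brian's example, the result would plug
in roughly as $db->seq($acc, $lo => $hi), using $db->length($acc) for the
clamp and swapping the two coordinates for a minus-strand gene. That
assumes the IDs in the indexed FASTA files match the accessions in
gene2accession, which may not hold without some munging.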

-- 
Cheers, Harry
Harry J Mangalam - 949 856 2847 (vox; email for fax) - hjm at tacgi.com 
            <<plain text preferred>>

