[Bioperl-l] UCSC database backend

Caleb Davis cdavis at bcm.tmc.edu
Thu Aug 31 19:53:51 EDT 2006


Hi folks, first time caller here.  Love the show!

I just started going through the archive and saw this thread.  I vote in 
favor of this interface, for what it's worth.  What about doing it this 
way?:

$objSeqIO  = Bio::SeqIO->new(-file => '~/seq/myseqCustomTrack.bed',
                         -format => 'bed',
                         -assembly => 'hg18',
                         -track => 'hg18_myfavgenes');    #see example below

If this worked you could have any genomic sequence from any assembly at 
your fingertips.  The custom track format is a common way to specify 
arbitrary genomic coordinates and region identifiers.  There are a few 
different formats, but BED is my current favorite 
(http://genome.ucsc.edu/goldenPath/help/customTrack.html#format).  The 
nice thing about custom tracks is that you can use them as hooks to 
retrieve assembly sequence, as well as to visualize the regions of 
interest directly in the browser.

A brief tutorial for using the table browser to retrieve 
sequence...Here's an example of a plausible myCustomTrack.bed.  In this 
example the coordinates happen to correspond to genes, but they could 
point to arbitrary genomic segments:
---------
browser position chr1:1940702-1952050
track name=hg18_myfavgenes description="my favorite genes"
chr1    1940702    1952050    GABRD|NM_000815    1000    +    1940722   
 1951581    0,0,0    9    88,113,68,221,83,138,156,212,769   
 0,5538,5930,6114,8173,8751,9707,10147,10579
chr1    6008966    6083110    KCNAB2|NM_003636    1000    +    6023215   
 6081221    0,0,0    16   
 126,129,42,44,38,80,45,45,44,87,86,121,95,121,89,1979   
 0,14197,15511,46435,47413,55875,58884,61147,62688,64069,68080,69003,69210,70316,70949,72165
track name=hg18_othergenes description="some other genes"
chr1    11788880    11825772    CLCN6|NM_001286    1000    +   
 11788906    11822867    0,0,0    23   
 113,60,66,66,67,107,127,68,59,133,114,167,127,124,154,160,107,187,158,157,108,126,2986   
 0,894,9609,10378,13251,16458,17470,18249,19919,20852,21869,22221,22959,27278,27640,27999,28247,29730,30762,31106,32098,32298,33906
chr1    16221696    16233131    CLCNKA|NM_004070    1000    +   
 16221701    16232740    0,0,0    19   
 105,129,129,140,78,79,126,85,102,85,174,70,111,214,134,89,84,87,439   
 0,1185,2148,3493,3921,4082,4695,5206,5403,6146,6511,7116,7350,7846,9095,9588,9828,10555,10996
---------
Submit the example into the form here: 
http://genome.ucsc.edu/cgi-bin/hgTracks?clade=vertebrate&org=Human&db=hg18&position=chrX%3A151%2C073%2C054-151%2C383%2C976&pix=620&hgsid=77004600&customTrackPage=add+your+own+custom+tracks

To retrieve the sequence you go to the table browser, select the custom 
track, and select sequence as the output format:
http://genome.ucsc.edu/cgi-bin/hgTables?clade=vertebrate&org=Human&db=hg18&hgta_group=user&hgta_track=ct_hg18myfavgenes&hgta_table=0&hgta_regionType=genome&position=chr1%3A1940702-1952050&hgta_outputType=primaryTable&hgta_outFileName=

Maybe assemblies are interesting enough that they would get their own 
objects, with a custom track query to an objAssembly generating an 
objSeqIO.  You folks rock!
--Caleb

 >Sean Davis wrote:
 >>
 >> Before we get too far down this line of thought, keep in mind that 
this will
 >> be dozens of Gb of sequence and database tables.  See here for details:
 >>
 >> http://genome.ucsc.edu/admin/mirror.html
 >>
 >> The sequences include all of genbank, essentially.  The mysql tables 
ALONE
 >> (no sequence) for only ONE human assembly is on the order of 
10Gb--not the
 >> kind of thing you can download in a few minutes (or even hours).  
Just to
 >> keep in mind....
 >
 >I think if someone needs heavy-duty access to genomic data, they'll find
 >the discspace. That wouldn't be the problem. The problem would be
 >finding an easy way of getting the data, which is where I hoped
 >something like a UCSC frontend would come in.
 >
 >
 >> On another point, the strength of UCSC is not in obtaining sequence, 
but in
 >> mapping to the genome.  I think getting actual sequence should be 
secondary
 >> here, if for no other reason than there are trivially easy ways of 
getting
 >> sequence information from elsewhere given an accession or ID.  There is
 >> simply too much information to be stored locally for most people and 
getting
 >> the data remotely from UCSC doesn't seem possible currently.
 >
 >The work would certainly be highly valuable even if it didn't allow for
 >sequence retrieval, but from my own point of view my main interest was
 >exactly the retrieval of arbitrary bits of genomic sequence - for which
 >there is no accession or ID that can be used to query some other database.
 >
 >How does the website table browser frontend allow retrieval of sequence
 >data?



More information about the Bioperl-l mailing list