[Bioperl-l] BioSQL, bioperl-db and UniGene

Sean Davis sdavis2 at mail.nih.gov
Thu Jan 5 09:32:04 EST 2006

Hi Marc.  

I currently do something similar for all our arrays (about 7 different
platforms, three species, depending on need) at NHGRI/NIMH/NINDS.  I use
blat to map oligo sequences to refseq, ensembl, unigene_unique (the single
"best" sequence for a unigene cluster), UCSC known genes, and Human
Invitational as well as to several genome builds for each species.  I run
blat locally and then load all the blat results into one large database
table (about 5 million rows in the current build).  I also have an
annotation database that includes Entrez Gene, refseq, ensembl, unigene,
Human Invitational, UCSC knownGene, gene ontology, homologene, and a few
other things.  After doing the blats, I then choose the best hit for each
transcript database and map that to an associated gene model using the
annotation database.  I end up with oligos mapped to zero to many
transcripts for all large transcript databases, oligos mapped to zero to
many genes (and local storage of all the gene objects and associated
information for easy access), as well as mappings to multiple sources of
metadata.  Doing the blats for all of these is quite fast, but DO NOT plan
on using bioperl to parse the 5M blat results; doing so will take DAYS.
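
As an illustration of the "choose the best hit" step, here is a minimal
sketch in plain Python that reads blat's tab-separated PSL output directly.
It assumes the standard 21-column PSL layout (matches in column 1,
misMatches in column 2, qName in column 10, tName in column 14). The
function name best_hits and the score (matches minus mismatches) are
assumptions made for this example, not the scoring used in the pipeline
described above.

```python
def best_hits(psl_lines):
    """Return {query_name: (score, target_name)} with the top hit per query.

    psl_lines is any iterable of text lines, e.g. an open PSL file.
    """
    best = {}
    for line in psl_lines:
        cols = line.rstrip("\n").split("\t")
        # Skip the PSL header block, blank lines, and malformed rows.
        if len(cols) < 21 or not cols[0].isdigit():
            continue
        matches, mismatches = int(cols[0]), int(cols[1])
        qname, tname = cols[9], cols[13]
        # Assumed, simplified score; swap in your own rule as needed.
        score = matches - mismatches
        # Keep the first hit seen on ties.
        if qname not in best or score > best[qname][0]:
            best[qname] = (score, tname)
    return best
```

Called as best_hits(open("oligos_vs_refseq.psl")) (a hypothetical file
name), this streams the file once and keeps only one entry per oligo in
memory.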

Note that the process does not include storing all the sequences in the
database--there isn't a need for doing so if you are just blatting.  Also, I
do not use biosql in this situation because I found it rather slow for
mapping between different entities.  It did require building a database of
my own, but doing so makes it fairly easy to add tables as needed to support
another public database or to support a website, for example.  If you don't
want to build your own annotation database (the largest part of doing what I
have been doing), you can use one of several available including GeneKeyDB
(by our own Stefan Kirov) or Dragon DB.
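
To make the "database of my own" idea concrete, here is a minimal sketch
using Python's sqlite3 module: one table of blat hits and one annotation
table linking transcripts to genes, joined to get the zero-to-many genes
for an oligo. Every table name, column name and function name here is
hypothetical and chosen for illustration; the schema described above is
larger and is not reproduced here.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical hit/annotation tables."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE blat_hit (
            oligo_id   TEXT NOT NULL,
            source_db  TEXT NOT NULL,   -- e.g. 'refseq', 'unigene_unique'
            transcript TEXT NOT NULL,
            score      INTEGER NOT NULL
        );
        CREATE TABLE transcript2gene (
            transcript TEXT NOT NULL,
            gene_id    TEXT NOT NULL
        );
        CREATE INDEX blat_hit_transcript ON blat_hit (transcript);
        CREATE INDEX t2g_transcript ON transcript2gene (transcript);
    """)
    return con


def genes_for_oligo(con, oligo_id):
    """Return the sorted, distinct gene ids reached via an oligo's hits."""
    rows = con.execute(
        """SELECT DISTINCT g.gene_id
           FROM blat_hit h
           JOIN transcript2gene g ON g.transcript = h.transcript
           WHERE h.oligo_id = ?
           ORDER BY g.gene_id""",
        (oligo_id,),
    )
    return [r[0] for r in rows]
```

Supporting another public database then amounts to adding one more mapping
table next to transcript2gene and joining through it in the same way.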

Let me know if I can be of more help.


On 1/5/06 8:26 AM, "Marc Saric" <marc.saric at gmx.de> wrote:

> Hi all,
> I've got some questions regarding BioSQL I would like to ask here:
> I am currently writing an app which should map microarray probe
> sequences to target sequences. It should do so in a generalized manner
> (i.e. any microarray against an arbitrary sequence-database). Currently
> I need UniGene for Zebrafish (Dr.*) and several Oligonucleotide libs,
> among them an Affymetrix array.
> Because UniGene is a moving target (especially for unfinished
> genomes), it would be good to do the mapping in a fully automated way.
> I am thinking about doing sequence-based mapping of probe-sequences with
> BLAT or GMAP  (like ProbeLynx does for Ensembl/TIGR-based data, but
> unfortunately that tool is quite hard to port/extend for other databases).
> In addition I would like to have annotation-based mapping (i.e. take the
> accession from the vendor-provided mapping and look up which
> UniGene cluster it maps to) as a fallback/second option for microarrays
> whose probe sequences are not published.
> I have installed/setup Bioperl 1.5.1 and the CVS-versions of biosql and
> bioperl-db with MySQL 4.1.12/Mac OS X and was able to load Taxon- and
> UniGene-data from flatfiles, at least the Cluster-IDs and Accessions as
> available from the *.data file.
> I was also able to rewrite microarray probes from various tab-delimited
> formats or FASTA to Genbank, which worked ok for loading (albeit slow,
> but...).
> (I hope you are still with me after this lengthy intro... :-) )
> 1st question:
> Since the loader does not like raw FASTA files, what would be the most
> elegant/efficient way of loading all sequence files for the UniGene
> build as well (normally provided in a FASTA file called *.seq.all,
> Dr.seq.all in my case)? And how do I associate them with the cluster
> data (i.e. there are already entries in bioentry for all sequences, but
> they are missing the sequence data and most of their detailed
> annotation, so this might be some kind of update)?
> 2nd question:
> What would be the best way of integrating BLAT/GMAP (same format as
> BLAT) results? I'm thinking about parsing the file and writing the
> mapping results as an annotation into the database, linked to each
> probe sequence. Data would include the hit(s) found for each probe,
> whether it hits more than one cluster, and possibly some additional notes.
> From there I would write out a report or custom sequence file for use in
> other tools.
> If possible I would also like to accumulate annotations (like mapping
> against different UniGene builds over time).
> 3rd question:
> Because UniGene changes frequently, I would like to have
> some kind of versioning, so that I can keep old versions of UniGene as a
> backup and add new ones (i.e. not only keeping the mapping results but
> also keeping all the source sequences).
> If I understand it right, the load_seqdatabase script does not support
> this and has no (command-line) option for overriding the "database" name
> (i.e. for UniGene it will always be set to "UniGene" in biodatabase and
> thus overwrite old versions)?
> Do you see any fundamental problems here for versioning the data (except
> storage space)?
> Thanks in advance.
> Links:
> ProbeLynx http://koch.pathogenomics.ca/probelynx/
> D.rerio UniGene: http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=7955
