[Bioperl-l] getting proteins matching GO

Chris Mungall cjm at fruitfly.org
Wed Nov 10 20:03:47 EST 2004


There are a number of approaches you can take. You can download the GO
MySQL database and either query that directly via SQL or query it via the
GO DB perl API.

You can find the database build, API code and example queries here:


Click on "example SQL" in the navigation bar to see both SQL queries and
API calls for doing what you want. Note that you probably want to do
queries over the transitive closure, so that you get both "cell signaling"
and all terms beneath this one in the GO graph.
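
To make the transitive-closure point concrete, here is a minimal sketch in Python, using an in-memory SQLite database rather than the real GO MySQL build. The table and column names (term, graph_path, association, gene_product), the TOY: accessions and the gene symbols are all assumptions made up for illustration, so check them against the actual schema before reusing the query.

```python
import sqlite3

# Toy schema. Table and column names are assumptions for illustration only,
# not the real GO DB DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (id INTEGER PRIMARY KEY, acc TEXT, name TEXT);
CREATE TABLE graph_path (term1_id INTEGER, term2_id INTEGER, distance INTEGER);
CREATE TABLE gene_product (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE association (term_id INTEGER, gene_product_id INTEGER);
""")

# Four made-up terms: 1 is the parent of 2, 2 is the parent of 3, 4 is unrelated.
conn.executemany("INSERT INTO term VALUES (?, ?, ?)", [
    (1, "TOY:0001", "cell signaling"),
    (2, "TOY:0002", "child process"),
    (3, "TOY:0003", "grandchild process"),
    (4, "TOY:0004", "unrelated process"),
])

# Precomputed closure: one row per (ancestor, descendant) pair,
# including each term paired with itself at distance 0.
conn.executemany("INSERT INTO graph_path VALUES (?, ?, ?)", [
    (1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0),
    (1, 2, 1), (2, 3, 1), (1, 3, 2),
])

conn.executemany("INSERT INTO gene_product VALUES (?, ?)", [
    (1, "geneA"), (2, "geneB"), (3, "geneC"), (4, "geneD"),
])
# Each product is annotated directly to exactly one term.
conn.executemany("INSERT INTO association VALUES (?, ?)", [
    (1, 1), (2, 2), (3, 3), (4, 4),
])


def products_for_term(db, term_name):
    """Symbols annotated to the named term or to any term beneath it."""
    rows = db.execute(
        """
        SELECT DISTINCT gp.symbol
        FROM term AS t
        JOIN graph_path AS p ON p.term1_id = t.id
        JOIN association AS a ON a.term_id = p.term2_id
        JOIN gene_product AS gp ON gp.id = a.gene_product_id
        WHERE t.name = ?
        ORDER BY gp.symbol
        """,
        (term_name,),
    ).fetchall()
    return [symbol for (symbol,) in rows]


print(products_for_term(conn, "cell signaling"))  # ['geneA', 'geneB', 'geneC']
```

Joining through the closure table is what picks up geneB and geneC here; joining association straight to term would return only geneA.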

You can do something similar with EnsMart, as Stefan mentions below.
However, I am unsure whether it takes the transitive closure into
account, how many organisms EnsMart covers, or how regularly they update
their copy of GO.

If you don't want to download a database and want to do this in pure perl,
then you can either use bioperl, or the go-perl classes (the latter also
available from the site above). You will also have to download the
gene_association files for the organism you are interested in from here:


(This is the fileset used to build the GO DB, so you don't gain anything
by this approach[*])

You can then use either bioperl or go-perl to parse the GO file to obtain
a graph object. You'll then want to parse through all the association
files and check every entry against the graph to see if your term of
interest subsumes the term in the association file. This approach will be
slow: there are several million associations in total. I don't believe
bioperl has an association file parser. One is easy to write yourself,
but there are a few gotchas, so you should read the format documentation
carefully, or use the go-perl parser.
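
The file-based approach can be sketched roughly as follows, in Python rather than bioperl or go-perl. The sketch assumes the graph has already been reduced to a child-to-parents dictionary, and that the association file is tab-delimited with the object ID, symbol, qualifier and term ID in columns 2 to 5 and comment lines starting with "!". Verify that layout against the format documentation. The TOY: identifiers and sample lines are invented.

```python
def ancestors(term, parents, cache):
    """Return the set holding term and everything reachable via parent links."""
    if term not in cache:
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        cache[term] = seen
    return cache[term]


def matching_products(assoc_lines, target, parents):
    """Collect (object id, symbol) pairs annotated to target or a descendant."""
    cache = {}
    hits = set()
    for line in assoc_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip comment/header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue  # malformed line
        obj_id, symbol, qualifier, term = cols[1], cols[2], cols[3], cols[4]
        if "NOT" in qualifier.split("|"):
            continue  # one gotcha: NOT annotations are negative statements
        if target in ancestors(term, parents, cache):
            hits.add((obj_id, symbol))
    return hits


# Hypothetical toy graph: child -> list of direct parents.
PARENTS = {
    "TOY:child": ["TOY:signaling"],
    "TOY:grandchild": ["TOY:child"],
    "TOY:other": [],
}

# Hypothetical association lines in the assumed column layout.
SAMPLE = (
    "!toy association file\n"
    "DB\tP1\tgeneA\t\tTOY:signaling\n"
    "DB\tP2\tgeneB\t\tTOY:grandchild\n"
    "DB\tP3\tgeneC\tNOT\tTOY:child\n"
    "DB\tP4\tgeneD\t\tTOY:other\n"
)

print(sorted(matching_products(SAMPLE.splitlines(), "TOY:signaling", PARENTS)))
# [('P1', 'geneA'), ('P2', 'geneB')]
```

Caching the ancestor set per term matters at this scale, since the same few thousand terms recur across millions of association lines.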

If you're not comfortable with OO programming, you can also use the
go2path script (part of go-perl) to generate the path-to-list route for
all GO terms and use this when filtering the association files.
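
As a rough illustration of the non-OO route, here is a Python sketch that enumerates, for one term, every chain of terms from it up to a root, and then filters on whether the term of interest appears in any chain. This is my reading of "path" and not a description of go2path's actual output format, which I have not reproduced. The graph and the TOY: identifiers are hypothetical.

```python
def paths_to_root(term, parents):
    """Return every path from term up to a root, each as a list of terms."""
    ups = parents.get(term, [])
    if not ups:
        return [[term]]  # a term with no parents is a root
    return [[term] + rest for up in ups for rest in paths_to_root(up, parents)]


def is_under(term, target, parents):
    """True if target appears on any path from term up to a root."""
    return any(target in path for path in paths_to_root(term, parents))


# Hypothetical toy graph in which the leaf has two parents.
PARENTS = {
    "TOY:leaf": ["TOY:mid1", "TOY:mid2"],
    "TOY:mid1": ["TOY:root"],
    "TOY:mid2": ["TOY:root"],
}

for path in paths_to_root("TOY:leaf", PARENTS):
    print(" > ".join(path))
# TOY:leaf > TOY:mid1 > TOY:root
# TOY:leaf > TOY:mid2 > TOY:root
```

Once the paths are written out to a flat file, filtering the association files becomes a plain lookup on the term ID with no graph object needed.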

Yet another way is to use the AmiGO browser, which queries GO DB -
however, I assume you are after a programmatic solution.

I would recommend the database solution. Let me know if you have any
problems with the GO DB or the go-perl code.


[*] The GO DB actually has some of the older redundant associations
filtered out, so you do gain something by going straight to the
associations file, but not much.

Stefan Kirov wrote:

> Pedro,
> You may want to check Bio::Ontology and especially Bio::OntologyIO.
> These are pretty cool modules, but you will have to install bioperl-live
> or wait for bioperl 1.5 (which as I understand should be released soon).
> You will have to download the GO DB locally and parse it with
> Bio::OntologyIO. I am not sure if somebody is working on remote access
> (I don't know whether it is possible at the moment). By the way, if you
> are not familiar with mysql and you are OK with perl, Bio::OntologyIO
> might be easiest for you. It will also include anything you are able to
> get from the GO website. But you will have to keep a local database (or
> flat file). Hope this helps.
> Stefan
> Pedro Antonio Reche wrote:
> > Dear Stefan, thanks a lot for your e-mail. Actually, I am interested
> > in getting all proteins from all organisms that are tagged with, let's
> > say, the go_process cell signaling. I will try the sites that you
> > indicate to see if they can do the job. Do you know if Bioperl can
> > also do this?
> > Regards,
> >
> > pdro
> > On Nov 5, 2004, at 12:27 PM, Stefan Kirov wrote:
> >
> >> What organism? You can use either EnsMart (for example for human
> >> there is a table called hsapiens_gene_ensembl__xref_go__dm) or you
> >> can use GeneKeyDB if you install it locally (genereg.ornl.gov/gkdb),
> >> there is a table called ll_go, which you can search for the gene
> >> identifier (locuslink) associated with a particular GO term and then
> >> get the protein accession from another table (something like:
> >> "select r.np_accn from ll_go g, ll_refseq_nm r where r.ll_id=g.ll_id
> >> and g.go_term=?") and fetch the seq from RefSeq, etc. Both Ensembl
> >> and GeneKeyDB are restricted to certain eukaryotes. So it all depends
> >> on what kind of organisms you are expected to work with.
> >> Stefan
> >>
> >> Pedro Antonio Reche wrote:
> >>
> >>> Hi,
> >>> I am interested in getting all the protein sequences matching a
> >>> specific GO term and I wonder if someone would know how to do this.
> >>> Thanks in advance for any help.
> >>> Cheers
> >>>
> >>> pdro
> >>>
> >>> _______________________________________________
> >>> Bioperl-l mailing list
> >>> Bioperl-l at portal.open-bio.org
> >>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >>
> >>
> >> --
> >> Stefan Kirov, Ph.D.
> >> University of Tennessee/Oak Ridge National Laboratory
> >> 5700 bldg, PO BOX 2008 MS6164
> >> Oak Ridge TN 37831-6164
> >> USA
> >> tel +865 576 5120
> >> fax +865-576-5332
> >> e-mail: skirov at utk.edu
> >> sao at ornl.gov
> >>
> >> "And the wars go on with brainwashed pride
> >> For the love of God and our human rights
> >> And all these things are swept aside"
> >>
> >
