[Bioperl-l] Reciprocal best hits using Bioperl?

Chris Fields cjfields at illinois.edu
Mon Jan 18 11:12:08 EST 2010

(my small rant on this)

On Jan 18, 2010, at 4:20 AM, Robert Bradbury wrote:

> My comment might be that the problem with OrthoMCL is that it is
> primarily lower organisms.  The problem with Ensembl (and some other
> databases) is that it is primarily higher organisms (though they do
> include Drosophila, C. elegans and Yeast).

OrthoMCL v2 handles both lower and higher organisms; I've used it for both, with decent success.  Most other ortholog tools do as well (if I'm not mistaken, Ensembl also uses MCL under the hood, unless that's changed).  I don't believe one should be completely bound to one toolset, particularly in this case (there are lots of nice ortholog clustering tools out there using various means of comparison), but I do think OrthoMCL is very good as an initial pass.  If anything, I would like a set of (possibly BioPerl-based, definitely DB-based) modules that can deal with this information.

The more pressing issue, in my opinion, is that one is a prisoner of the gene models for the specific organisms of interest, and their quality may vary widely depending on the source of those gene models (Ensembl, UCSC, NCBI, EBI, centralized MODs like FlyBase, etc).  For instance, if gene models are poorly curated or rarely updated, the comparisons may be significantly flawed.  Some of these issues may also be (somewhat) alleviated once more transcriptome data is available to help clear up gene model ambiguities, but that won't be true for all organisms, at least initially.

Note this isn't meant as a slam on any specific DBs or on MODs in general; the problem is born of the fact that there isn't a single, centralized, trusted, consistently updated source for this data, specifically something that will handle moderated third-party annotation.  That's a very difficult problem to solve effectively.  Some of these very issues came up at the GMOD conference, and there appears to be consensus that a real attempt is needed to address this.

I don't know, maybe it's just unicorns and rainbows.  Personally I do think the situation will improve, as there seems to be great demand for it, but it requires time, resources, manpower, money, cat herding, etc.

> The problem arises when one wants to cross those boundaries.  For
> example the 5-10 antioxidant proteins, the ~150 DNA repair proteins,
> many of the mitochondrial (ETC) proteins, the ribosomal rRNA's &
> tRNAs, and the fundamental biochemistry (EC) proteins are homologous
> all the way from the most ancient bacteria through H. sapiens.  The
> only way to play in the mixed arena of prokaryotes and eukaryotes
> involving fundamental vectors in evolution is to either construct one's
> own databases (which presumably means getting involved with MySQL, and
> probably spending some $$$ on hardware) or to develop some BioPerl
> modules that can do the  SpeciesX vs. SpeciesY comparisons on demand
> using some part of the cloud.  This problem isn't going to get smaller;
> it's only going to get larger, now that the cost of sequencing
> (pseudo-resequencing) a vertebrate genome is starting to come in under
> $10,000 and people are starting to seriously talk about 10,000
> vertebrate genomes.  10,000 x 10,000 x 20,000 (genes) isn't something
> people are going to undertake very soon.
> Robert

People are already undertaking it now using a broad range of organisms, in and out of the cloud.  In most cases one can amend a prior reciprocal comparative analysis with new data fairly easily, if one takes care to plan for it early on (i.e. set up the BLAST databases with a fixed, explicitly specified database size, so that the statistics are comparable between separate analyses).  OrthoMCL v2 describes a procedure to do this, and I believe others have similar methodology.
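
For readers who want to see what the reciprocal comparison itself boils down to, here is a minimal sketch in Python (not BioPerl, and not OrthoMCL's procedure).  It assumes two all-vs-all BLAST runs (species A against B, and B against A) saved in the standard 12-column tabular format, where column 1 is the query ID, column 2 the subject ID, and column 12 the bit score.  The function names are hypothetical, and ranking by bit score with ties kept is just one reasonable choice; real pipelines usually add E-value and coverage filters.

```python
def parse_tabular(lines):
    """Yield (query, subject, bitscore) from 12-column BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        yield cols[0], cols[1], float(cols[11])


def best_hits(rows):
    """Map each query ID to the set of subject IDs tied for its top bit score."""
    best = {}
    for query, subject, bitscore in rows:
        top = best.get(query)
        if top is None or bitscore > top[0]:
            best[query] = (bitscore, {subject})
        elif bitscore == top[0]:
            top[1].add(subject)
    return {query: subjects for query, (_, subjects) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is a best hit of the other."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    pairs = set()
    for a, subjects in forward.items():
        for b in subjects:
            if a in backward.get(b, ()):
                pairs.add((a, b))
    return sorted(pairs)
```

With files on disk, something like `reciprocal_best_hits(parse_tabular(open("a_vs_b.tsv")), parse_tabular(open("b_vs_a.tsv")))` would do; the file names are placeholders.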

I could also see possible ways one can further optimize this, for instance in cases where two very closely-related organisms are compared, where translated seqs are 100% identical, etc.  IIRC, the OrthoMCL DB site already has a way to upload custom sets of protein data for mapping to (already pre-run) clusters.  Just the fact that the tools are available as open source, are semi-automated, and can be generically applied to data of personal interest is a great boon.  Not sure I see the downside of that, and I'm pretty confident the scalability issues will be addressed in some way.
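
As one illustration of the "100% identical translated seqs" shortcut, a rough sketch (my own assumption of how it might look, not something OrthoMCL is known to do): collapse identical protein sequences to a single representative before running the comparison, then expand the results back to all members afterwards.  The function name and the dict-based input are hypothetical.

```python
from collections import defaultdict


def collapse_identical(proteins):
    """Collapse identical protein sequences to one representative each.

    proteins: dict mapping sequence ID -> amino acid sequence.
    Returns (representatives, members), where representatives maps the
    representative ID -> normalized sequence and members maps the
    representative ID -> list of all IDs sharing that sequence.
    """
    groups = defaultdict(list)
    for seq_id, seq in proteins.items():
        # Normalize case and drop trailing stop symbols before comparing.
        groups[seq.upper().rstrip("*")].append(seq_id)

    representatives = {}
    members = {}
    for seq, ids in groups.items():
        rep = ids[0]  # first ID encountered stands in for the group
        representatives[rep] = seq
        members[rep] = ids
    return representatives, members
```

Only the representatives would then need to go through BLAST; hits to a representative can be copied to the other IDs in its group.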

