[Bioperl-l] Packages retrieving online alignment sequences

Gregory Jordan greg at ebi.ac.uk
Sun Aug 8 02:12:41 EDT 2010

On 7 August 2010 23:07, Chris Fields <cjfields at illinois.edu> wrote:

> A simpler method could be introduced, but I can see that being potentially
> brittle in the long run.  A naked alphanumeric string doesn't reveal much
> about what it is at face value w/o knowing database/service-specific
> behavior.  And then we're reliant on that behavior not changing, which we
> can't guarantee (this has bitten us in the past).  What would one do if NCBI
> (for instance) allowed accessions derived completely of digits, or
> conversely a unique ID with mixed alphanumerics?
> Using methods specific for ID/acc at least guarantees a behavior on the
> backend w/o guessing, and if there is no danger of overlap (a service
> accepts either/or) one could simply be an alias of the other.

Thanks for the clarification on IDs vs accessions. As long as the behavior
and distinction are well-documented, I'm sure it won't make too much of a
difference.
My main concern was just that having two similar methods -- with no clearly
laid out distinction between the two and one of them only supported by half
of the implementing subclasses -- might confuse potential users.

As a point of reference: both Rfam and Pfam allow either an ID or an
accession in their front-page search interface (http://www.pfam.org /
http://www.rfam.org/). In fact, they seem to entirely hide the distinction
between ID and Accession from the end user; nowhere on the Rfam page for an
individual result is it clear which string is the accession and which is the
ID (http://rfam.sanger.ac.uk/family/snoZ107_R87).

Thus, a potential user of the Rfam module wouldn't know whether to call the
get_by_ID or get_by_Accession method, even after looking at the Rfam page
for his / her desired alignment!

As you can probably tell, I'm all in favor of a unified search whenever
feasible / possible. :-)
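
As a rough illustration of the unified search idea, here is a minimal Python
sketch (not BioPerl code and not any real Rfam client). It assumes that
accessions follow the "RF" plus five digits pattern Rfam uses, and it falls
back to the other lookup when the first guess misses, so the pattern is only
a hint rather than something the backend depends on. The class name
ToyRfamStore, its methods, and the record "RF99999" / "toy_family" are all
hypothetical placeholders.

```python
import re

# Assumption: accessions look like "RF" followed by five digits.
# The pattern only decides which lookup is tried first.
ACCESSION_PATTERN = re.compile(r"^RF\d{5}$")


class ToyRfamStore:
    """Hypothetical in-memory stand-in for a remote alignment service."""

    def __init__(self, records):
        self._by_acc = {r["acc"]: r for r in records}
        self._by_id = {r["id"]: r for r in records}

    def get_by_accession(self, acc):
        return self._by_acc.get(acc)

    def get_by_id(self, name):
        return self._by_id.get(name)

    def get(self, query):
        """Unified lookup: try the likelier method first, then the other."""
        if ACCESSION_PATTERN.match(query):
            lookups = (self.get_by_accession, self.get_by_id)
        else:
            lookups = (self.get_by_id, self.get_by_accession)
        for lookup in lookups:
            record = lookup(query)
            if record is not None:
                return record
        raise KeyError(f"no alignment found for {query!r}")


# Made-up record for illustration only.
store = ToyRfamStore([{"acc": "RF99999", "id": "toy_family"}])
same_record = store.get("RF99999") is store.get("toy_family")
```

The explicit get_by_accession and get_by_id methods stay available for
callers who know what they have, which keeps the backend behaviour
guaranteed while the unified get covers the common case.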

> As for writing up an adaptor to ensembl outside of its API, overall I
> don't think it's a bad idea, but if it's possible maybe start without
> reinventing things, then move to direct SQL.  Unless it's easier to use SQL.

For fetching Ensembl's gene family alignments, using the SQL will be
easiest. They don't tend to get unreasonably large in terms of memory -- I
think the biggest tend to be ~700 sequences with a few thousand alignment
columns or so -- and it's a simple table join or two to get both the tree
and alignment from the database.
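
To show the shape of such a query, here is a minimal Python sketch using an
in-memory SQLite database. The schema is a deliberately simplified,
hypothetical stand-in: the tables family and family_member, their columns,
and the sample rows are invented for illustration and are not the actual
Ensembl tables.

```python
import sqlite3


def fetch_family(conn, stable_id):
    """Return (newick_tree, {seq_name: aligned_seq}) for one family."""
    rows = conn.execute(
        """
        SELECT f.newick, m.seq_name, m.aligned_seq
        FROM family AS f
        JOIN family_member AS m ON m.family_id = f.family_id
        WHERE f.stable_id = ?
        ORDER BY m.seq_name
        """,
        (stable_id,),
    ).fetchall()
    if not rows:
        raise KeyError(stable_id)
    newick = rows[0][0]  # the tree is repeated on every joined row
    alignment = {name: seq for _, name, seq in rows}
    return newick, alignment


# Hypothetical toy schema and data, not the real Ensembl layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE family (
        family_id INTEGER PRIMARY KEY, stable_id TEXT, newick TEXT);
    CREATE TABLE family_member (
        family_id INTEGER, seq_name TEXT, aligned_seq TEXT);
    INSERT INTO family VALUES (1, 'FAM_TOY_1', '(seqA,seqB);');
    INSERT INTO family_member VALUES (1, 'seqA', 'ACG-T');
    INSERT INTO family_member VALUES (1, 'seqB', 'AC-GT');
    """
)
tree, aln = fetch_family(conn, "FAM_TOY_1")
```

Loading everything with fetchall is reasonable only because these families
are small; the genomic alignments discussed next would need a lazier
approach.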

For genomic alignments, I agree that a more memory-efficient and/or lazy
backend would be necessary. And it's pretty much impossible to get those
things out of the Ensembl tables without using their API.

