[Bioperl-l] genpept/swiss

Ewan Birney birney@ebi.ac.uk
Mon, 4 Sep 2000 17:45:01 +0100 (GMT)

On Mon, 4 Sep 2000 hilmar.lapp@pharma.Novartis.com wrote:

> I didn't follow what you said, so I don't know.  Part of the problem
> may be that I don't know much about how complete the bioperl parsers are.
>      Unfortunately, there are other very sad points, for instance some
>      types of location (compound locations with cross-references, fuzzy
>      locations) cannot be handled because the data model is not yet
>      prepared for them. (This means that you e.g. lose the translation tag
>      for those sequences, and since the CDS coordinates are not handled
>      either, you basically cannot tell the correct translation.)

There are two questions here:

	(a) Is our data/object model rich enough to cope with what we want
to do? Possibly not completely yet, but it is heading that way.

	(b) Is that data model compatible with the
EMBL/GenBank/Swissprot/Whatever data model?

(a) is the work we have to do. (b) is a decision we have to make.

Both are open to people arguing one way or the other and, more importantly,
providing *code*.

>      Maybe it's a good time to bring up this painful discussion again: What
>      do people think about a rewrite of the SeqIO parsers? What should the
>      re-design provide for? Given the current maturity of XML
>      representations of the major databanks (can anyone comment on this,
>      that is, what is the maturity?), does it make sense to go directly for
>      an XML mapping?  Do the advantages of such an approach justify the
>      price in overhead (performance-wise)? Would it be realistic to limit
>      future support (meaning maintenance) in BioPerl to XML dumps provided
>      by the major database providers?

I *don't* think a re-write of the SeqIO system is warranted yet.

I would be supportive of someone who wanted to *reorganise* the
EMBL/GenBank/Swissprot parsing to be more flexible, if they so wished.

I think XML read/write fits in perfectly fine with the current SeqIO
system at the moment, but we have to bootstrap ourselves into this problem
- learning to read the XML format of others, dumping XML formats to be

>      And last not least: who would be volunteering to do what?


>      Have the Ensembl people done some work in this direction that could be
>      back-ported?

Ensembl makes heavy use of EMBL/GenBank dumping "with all the bells and
whistles". This goes via the Bio::SeqI interface (Ensembl "sequences" are
Bio::SeqI compliant).

We also have a GAME dumper in development, mainly waiting for a bunch of
people wanting to use it. The GAME dumper works directly off Ensembl, not
via the Bio::SeqI interface, as both Ensembl and GAME are richer than the
basic sequence objects (Bio::Seq etc.), for example in carrying supporting
evidence.

>      I guess Ewan wants to comment on these questions...

Have done...

>           Hilmar

Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420