[Bioperl-l] genpept/swiss

Ewan Birney birney@ebi.ac.uk
Mon, 4 Sep 2000 08:08:23 +0100 (BST)

On Mon, 4 Sep 2000, Hilmar Lapp wrote:

> Jason recently reported problems with a certain sequence record fetched
> from Entrez using Bio::DB::GenPept (see bugs 838 and 839 in fixed-bugs).

Great. I hope this is fixed on both the branch and the main trunk? Or do
we need to merge somewhere?

> I've fixed these: qualifiers not satisfying the /tag=value syntax are now
> reported through warn() instead of throw() in the genbank parser. Some of
> you may object to this, but I'm myself tired of being thrown out of loops
> over chunks of entries just because of one single misformatting. This
> raises again the issue of a switch that can be user-enabled and causes
> such things to throw() again. As I'm not sure about the (error)
> notification levels available in the Bio/Root classes, could someone who
> does know comment on this? That is, is it possible to set the reporting
> level such that warn() actually becomes equivalent to throw()?

It is possible on Bio::Root::Object, but not that many people do it.
Either we need to revisit this on the Root object or put something specific
into SeqIO.pm. I would be happier with the latter, as I want to see the
Root usage shrink, not grow.

Your mileage may vary, of course...
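
To make the "something specific into SeqIO.pm" idea concrete, here is a
rough sketch in plain Perl of a per-parser switch that turns format
warnings back into exceptions. Nothing in it is existing bioperl code:
the package name My::StrictSwitch, the -strict argument and the
complain() method are all made up for illustration, and die/warn stand
in for throw()/warn().

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical parser helper; NOT part of bioperl, all names are made up.
package My::StrictSwitch;

sub new {
    my ($class, %args) = @_;
    # -strict => 1 makes format complaints fatal again
    my $self = { strict => $args{-strict} ? 1 : 0, complaints => [] };
    return bless $self, $class;
}

# Report a format problem: die in strict mode, warn (and remember) otherwise.
sub complain {
    my ($self, $msg) = @_;
    die "format error: $msg\n" if $self->{strict};
    push @{ $self->{complaints} }, $msg;
    warn "format warning: $msg\n";
    return;
}

# Parse qualifier lines of the form /tag=value; malformed lines are reported
# through complain() and otherwise skipped.
sub parse_qualifiers {
    my ($self, @lines) = @_;
    my %qual;
    for my $line (@lines) {
        if ($line =~ m{^\s*/(\w+)=(.*)$}) {
            my ($tag, $value) = ($1, $2);
            $value =~ s/^"//;    # strip surrounding double quotes, if any
            $value =~ s/"$//;
            $qual{$tag} = $value;
        }
        else {
            $self->complain("qualifier does not match /tag=value: $line");
        }
    }
    return \%qual;
}

package main;

my @lines = ('/site_type="active"', 'garbage here', '/note="BY SIMILARITY."');

# Default: one warning on STDERR, the loop keeps going.
my $lenient = My::StrictSwitch->new();
my $q       = $lenient->parse_qualifiers(@lines);

# User-enabled strictness: the same input now throws.
my $strict = My::StrictSwitch->new(-strict => 1);
my $ok     = eval { $strict->parse_qualifiers(@lines); 1 };
print "strict run ", ($ok ? "survived\n" : "threw: $@");
```

The point of keeping the flag on the parser object is that someone
looping over chunks of entries gets the forgiving behaviour by default,
and anyone who wants the old behaviour asks for it explicitly.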

> For the second fix I replaced primary_id() by display_id() in swiss.pm
> ID-line generation. This should be the safest alternative since 1)
> swiss.pm seq-parser sets display_id() itself, and 2) _every_ sequence
> object is supposed to have a display_id(), as far as I understand.
> The whole subject was raised by entry O18919; this illustrates a general
> problem one should be aware of when interconverting between rich sequence
> formats. There are several remarkable differences between this entry
> fetched from GenPept and written out in SwissProt format as opposed to
> the entry one can obtain from SwissProt (www.expasy.ch). E.g., the
> feature table is rendered almost useless in terms of information content:
>      Site            270
>                      /site_type="active"
>                      /note="BY SIMILARITY."
>      Site            272
>                      /site_type="metal-binding"
>                      /note="MAGNESIUM (POTENTIAL)."
> in GenPept becomes in SwissProt format
> FT   Site        270    270
> FT   Site        272    272
> instead of
> FT   ACT_SITE    269    269       BY SIMILARITY.
> FT   METAL       271    271       MAGNESIUM (POTENTIAL).

Hmmm. Annoying. The SwissProt idea of a feature table is considerably
different to the EMBL idea, right?
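
For what it is worth, the two Site features quoted above could be
rescued by a small lookup when writing. This is only a sketch in plain
Perl: the %FT_KEY table holds just the two pairs visible in your
example, the fallback key and the site_to_ft() helper are my own
invention, and I am assuming the coordinate shift is something the
caller has to supply per entry.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed mapping from a GenPept /site_type value to a SwissProt FT key.
# Only the two pairs from the quoted example are listed.
my %FT_KEY = (
    'active'        => 'ACT_SITE',
    'metal-binding' => 'METAL',
);

# Turn one GenPept-style Site feature (a plain hash here) into an FT line.
# $offset is subtracted from the position (1 for the entry quoted above).
sub site_to_ft {
    my ($feat, $offset) = @_;
    my $type = defined $feat->{site_type} ? $feat->{site_type} : '';
    my $key  = exists $FT_KEY{$type} ? $FT_KEY{$type} : 'SITE';  # assumed fallback
    my $pos  = $feat->{position} - $offset;
    my $line = sprintf "FT   %-8s %6d %6d", $key, $pos, $pos;
    $line .= "       $feat->{note}" if defined $feat->{note};
    return $line;
}

my @sites = (
    { position => 270, site_type => 'active',        note => 'BY SIMILARITY.' },
    { position => 272, site_type => 'metal-binding', note => 'MAGNESIUM (POTENTIAL).' },
);

my @ft = map { site_to_ft($_, 1) } @sites;
print "$_\n" for @ft;
```

Anything not in the table still loses its meaning, which is rather the
point you are making.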

> in the SwissProt original. (In addition you may notice the offset of
> coordinates by 1, which is due to the methionine being omitted in
> SwissProt.)
> There are other things, some of which can be healed (like the CRC64
> instead of CRC32 now being used by SwissProt), while others probably
> cannot (like comments getting screwed).
> The point I'd like to make may be best illustrated by comparing with
> automated language translators that are around (like babelfish;
> babelfish.altavista.com). Try to translate an only slightly complicated
> sentence from one language into another, which already screws it up
> half-way, and then translate the result into a third. I think it is
> pointless for BioPerl to aim at clean and complete conversion from any
> rich format into another rich format for sequences.

I agree entirely. I have always wanted bioperl to stay focused on decent
objects, and not be tied to formats, and this is a clear case of why.
However, sometimes we need the flexibility to at least *write*
specific things in aspects of the format: EMBL/GenBank dumping has a lot
of hooks for this.

One suggestion for the reading of formats is that we have specific
sub-classes for the specific parts to each format. This could be quite
clean and would at least mean we are not discarding stuff.
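
Roughly what I mean, as a sketch in plain Perl. The classes My::Feature
and My::Feature::Swiss are hypothetical, not existing bioperl ones: the
base class carries what every format shares, and the format-specific
sub-class keeps the extra bits a SwissProt reader found, so that a
SwissProt writer can put them back.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical generic feature: what every format can supply.
package My::Feature;

sub new {
    my ($class, %args) = @_;
    my $self = {
        primary_tag => $args{-primary_tag},
        start       => $args{-start},
        end         => $args{-end},
    };
    return bless $self, $class;
}

sub primary_tag { return $_[0]->{primary_tag} }
sub start       { return $_[0]->{start} }
sub end         { return $_[0]->{end} }

# Hypothetical SwissProt-specific feature: keeps the free-text description
# that the generic object has no slot for.
package My::Feature::Swiss;
our @ISA = ('My::Feature');

sub new {
    my ($class, %args) = @_;
    my $self = $class->SUPER::new(%args);
    $self->{description} = $args{-description};
    return $self;
}

sub description { return $_[0]->{description} }

package main;

my $f = My::Feature::Swiss->new(
    -primary_tag => 'ACT_SITE',
    -start       => 269,
    -end         => 269,
    -description => 'BY SIMILARITY.',
);

# A format-agnostic writer only uses the generic methods...
my $generic = sprintf "%s %d..%d", $f->primary_tag, $f->start, $f->end;

# ...while a SwissProt writer can ask for the extra part when it is there.
my $rich = $generic;
$rich .= " " . $f->description if $f->isa('My::Feature::Swiss');

print "$generic\n$rich\n";
```

Generic code never needs to know the sub-class exists, which is why I
think this could stay fairly clean.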

> The only way this could be achieved with a reasonable effort is by
> mapping languages to a common meta-representation, like XML or ASN.1 (and
> anything the meta-format doesn't cover will still be lost).

That is so true. Let's hope this gets sorted out ... ;)

> So, you should be aware of this whenever you convert between sequence
> formats using BioPerl.
> If people disagree please post.

I'm 100% in agreement with you...

> 	Hilmar
> -- 
> -----------------------------------------------------------------
> Hilmar Lapp                                email: hlapp@gmx.net
> NFI Vienna, IFD/Bioinformatics             phone: +43 1 86634 631
> A-1235 Vienna                                fax: +43 1 86634 727
> -----------------------------------------------------------------

Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420