[Bioperl-l] Proposal for Meta data

Chris Fields cjfields at uiuc.edu
Fri Dec 15 12:23:27 EST 2006

On Dec 15, 2006, at 8:28 AM, Jason Stajich wrote:

> On Dec 14, 2006, at 9:21 PM, Chris Fields wrote:
>> On Dec 14, 2006, at 7:45 PM, David Messina wrote:
>>> Hey Chris,
>>> My thoughts below.
>>>> [Chris]
>>>> This could be used to annotate any
>>>> PrimarySeq, LocatableSeq, SimpleAlign, SeqFeature, or what-have-you,
>>>> maybe in a collection (similar to AnnotationCollection).  I thought
>>>> something like this may be of general use for any PrimarySeq
>>>> (quality, structure), alignments like NEXUS and Stockholm,
>>>> SeqFeatures where structure could be stored (tRNA or riboswitches),
>>>> etc.
>>>> However, this also seems to fall into the category of sequence
>>>> annotation.  So, would it be better to have a set of Bio::Annotation
>>>> classes used for this purpose?
>>> To me, all meta data is equal. That is, your classic Genbank feature
>>> annotation and a user's arbitrary meta-tag like "Bob thinks this is a
>>> kinase domain" aren't different in kind even if they are different in
>>> content.
>>> As resequencing projects multiply, the ability to create arbitrary
>>> meta tags, attach them to different types of objects, and use those
>>> tags to link them together will become desirable, if not essential.
>>> Keeping a common interface to all of these meta data types would be
>>> advantageous, plus new users won't have to determine whether they
>>> need to use Bio::Meta objects or Bio::Annotation objects.
>>> So I would argue for all of the meta data types to live "under one
>>> roof". Which roof isn't as important. Bio::Annotation, since it
>>> already exists for today's meta data, seems like a reasonable choice.
>>> (assuming Annotation objects are flexible enough to be extended as
>>> you propose)
>>> There, and no flames or jibes even. :)
>> I guess what I want to know is whether there should be a
>> distinction between 'normal' sequence annotation (comments,
>> references, and so on) and annotation that could be best described as
>> position-specific (like RNA or protein structural annotation).  The
>> current meta implementation is for sequence data only; I felt it
>> would be nice to have a generic implementation that would be
>> applicable to any object data.
> my stream-of-consciousness for right now:
> I was thinking Bio::Annotation is where this should go - that  
> system doesn't have anything about it that makes it explicitly  
> sequence related. What we're trying to hammer out here on the  
> Alignment side - which fits with your RNA example - is have  
> features, basically SeqFeatures - associated with alignments so  
> columns can be annotated to cover things like character sets and  
> partitions for phylogenetic analyses.  As for data which annotates
> non-contiguous things like RNA stems, we may have to be more
> creative about that or model it with a splitLocation.
> So currently we've added code so that an Alignment is-a  
> Bio::AnnotatableI and is-a Bio::FeatureHolderI to move towards this
> end, with the goal of being able to capture more of the data that  
> can be represented in a NEXUS file.
> It feels more like a hack than an elegant Meta-data solution, but I
> am not totally sure what data you are thinking about handling at
> this point; perhaps I need to spend more time thinking about it.
> Or are you worried about the idea of whether the semantic mapping  
> of the data into features or annotations is confusing users?

Sorry in advance for the longish response here...

My original thought was to have a generic abstract class capable of  
positionally describing data in any other class, similar to
Heikki's Bio::Seq::MetaI but not constrained to sequence data only.   
Implementing classes would be capable of having different data  
structures based on their use (simple string, array, AoA, AoH, AoO).   
One MetaCollection class to contain them all in a tag-like system, so  
you could have mixed data types describe the same object.  The latter  
Collection class is so similar to AnnotationCollection that I agree  
Bio::Annotation would be the best place for this.
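
To make the idea a bit more concrete, here is a rough sketch of a
position-based meta object plus a tag-keyed collection.  It is written in
Python rather than Perl purely for illustration, and PositionalMeta,
MetaCollection and all of their methods are hypothetical names, not
existing BioPerl classes or the API of Bio::Seq::MetaI.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Sequence


@dataclass
class PositionalMeta:
    """Hypothetical per-position annotation: values[i] describes position i + 1."""
    name: str
    values: Sequence[Any]  # a plain string, a list, a list of dicts, ...

    def sub(self, start: int, end: int) -> Sequence[Any]:
        """Return the values for the 1-based, inclusive range start..end."""
        if start < 1 or end < start or end > len(self.values):
            raise IndexError(f"range {start}..{end} outside 1..{len(self.values)}")
        return self.values[start - 1:end]


@dataclass
class MetaCollection:
    """Hypothetical tag-keyed container, so mixed data types can describe one object."""
    _by_tag: Dict[str, List[PositionalMeta]] = field(default_factory=dict)

    def add(self, tag: str, meta: PositionalMeta) -> None:
        self._by_tag.setdefault(tag, []).append(meta)

    def get(self, tag: str) -> List[PositionalMeta]:
        # Return a copy so callers cannot modify the internal list.
        return list(self._by_tag.get(tag, []))

    def tags(self) -> List[str]:
        return sorted(self._by_tag)


# A string-backed and a list-backed meta object describing the same positions.
collection = MetaCollection()
collection.add("structure", PositionalMeta("SS_cons", "HHHHTT"))
collection.add("quality", PositionalMeta("phred", [30, 31, 28, 40, 12, 9]))
```

Nothing in the sketch refers to a sequence, which is the point: the same
collection could hang off an alignment, a feature, or a sequence.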

The way I reconfigured Stockholm alignment parsing/writing is to use  
Bio::Seq::Meta objects (which are LocatableSeq).  Each Seq::Meta is  
capable of holding a sequence and several meta strings, stored as  
tags or 'names'.  However, there is no Meta object for alignments  
(for RNA/protein structure consensus and other Rfam/Pfam markup); I  
hacked around this by using a Bio::Seq::Meta w/o a seq, but I would  
rather have a generic Meta object independent of the sequence cruft.

So for this partial Pfam alignment,

#=GR Q92SV1_RHIME/122-299 pAS .........................
#=GR Q8ZXP5_PYRAE/91-262 SA  00000000000...120030X..474
#=GC SS_cons                 HHHHHHHHTTH...HHHHHHH..HTT
#=GC SA_cons                 03002200312...1312414..676
#=GC seq_cons                luhhLuhsRpl...hthppth..+pG

'#=GC' lines would be in generic meta string objects in the  
alignment, while '#=GR' tags would be in similar meta objects in the  
relevant sequences.  As long as both aren't AnnotatableI this isn't  
an issue.
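
As an illustration of that split, the sketch below sorts the markup lines
from the example above into alignment-level ('#=GC') and per-sequence
('#=GR') tables.  It is Python rather than Perl, split_markup is a
hypothetical helper, and it is not how the Stockholm parser in BioPerl is
actually written.

```python
from collections import defaultdict

STOCKHOLM_LINES = """\
#=GR Q92SV1_RHIME/122-299 pAS .........................
#=GR Q8ZXP5_PYRAE/91-262 SA  00000000000...120030X..474
#=GC SS_cons                 HHHHHHHHTTH...HHHHHHH..HTT
#=GC SA_cons                 03002200312...1312414..676
#=GC seq_cons                luhhLuhsRpl...hthppth..+pG
""".splitlines()


def split_markup(lines):
    """Return (alignment_meta, sequence_meta) built from #=GC and #=GR lines.

    alignment_meta maps tag -> markup string.
    sequence_meta maps sequence id -> {tag -> markup string}.
    Markup for the same tag is concatenated, so interleaved blocks also work.
    """
    alignment_meta = {}
    sequence_meta = defaultdict(dict)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "#=GC" and len(fields) == 3:
            _, tag, markup = fields
            alignment_meta[tag] = alignment_meta.get(tag, "") + markup
        elif fields[0] == "#=GR" and len(fields) == 4:
            _, seq_id, tag, markup = fields
            per_seq = sequence_meta[seq_id]
            per_seq[tag] = per_seq.get(tag, "") + markup
    return alignment_meta, dict(sequence_meta)


alignment_meta, sequence_meta = split_markup(STOCKHOLM_LINES)
```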

Similarly, NEXUS files which contained any position-based values  
could hold a meta string/array object in a similar tag.

The basic scheme is:


Then I started thinking about where this could be applied, and  
whether a true Meta object needs to be constrained only to describing  
position-based data.  This somewhat relates to this bug:


which seems to need a simple but unconstrained hash-of-arrays-based  
meta object.
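
For what it's worth, the unconstrained version is about as simple as it
gets.  A minimal sketch, again in Python for illustration only, with
TagValueMeta and its methods being made-up names rather than anything in
BioPerl:

```python
from collections import defaultdict


class TagValueMeta:
    """Hypothetical hash-of-arrays meta object: each tag maps to a list of values."""

    def __init__(self):
        self._values = defaultdict(list)

    def add(self, tag, *values):
        """Append one or more values under a tag, keeping insertion order."""
        self._values[tag].extend(values)

    def get(self, tag):
        """Return a copy of the values for a tag (empty list if the tag is unknown)."""
        return list(self._values.get(tag, []))

    def tags(self):
        return sorted(self._values)


meta = TagValueMeta()
meta.add("note", "Bob thinks this is a kinase domain")
meta.add("xref", "example:1", "example:2")  # placeholder values, not real identifiers
```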

Then my head appropriately exploded...

Hope everything is going well at the hackathon!  Looks like some  
interesting stuff coming out of it.

