Features vs. Annotations

From BioPerl

(see bioperl-l thread here)

  • This scrap tries to codify a discussion on the differences between SeqFeatures and Annotations in the BioPerl world. The brains behind it are those of Chris Fields and Hilmar Lapp. --Ed.

The Issue

Govind Chandra raises the following issue, based on the code below ($ac is the Bio::Annotation::Collection property of a Bio::Seq object obtained from a Bio::SeqIO stream). The reply is here:

$ac is-a Bio::Annotation::Collection but does not actually contain any annotation from the feature. Is this how it should be? I cannot figure out what is wrong with the script. Earlier I used to use has_tag(), get_tag_values() etc. but the documentation says these are deprecated.

Sample code:

#use strict;
use Bio::SeqIO;
 
$infile='NC_000913.gbk';
my $seqio=Bio::SeqIO->new(-file => $infile);
my $seqobj=$seqio->next_seq();
my @features=$seqobj->all_SeqFeatures();
my $count=0;
foreach my $feature (@features) {
  unless($feature->primary_tag() eq 'CDS') {next;}
  print($feature->start(),"   ", $feature->end(), "   ",$feature->strand(),"\n");
  $ac=$feature->annotation();
  $temp1=$ac->get_Annotations("locus_tag");
  @temp2=$ac->get_Annotations();
  print("$temp1   $temp2[0] @temp2\n");
  if($count++ > 5) {last;}
}
 
print(ref($ac),"\n");
exit;

Output:

190   255   1
0    
337   2799   1
0    
2801   3733   1
0    
3734   5020   1
0    
5234   5530   1
0    
5683   6459   -1
0    
6529   7959   -1
0    
Bio::Annotation::Collection

The Response

  • This is a mash-up of the responses from Chris and Hilmar, with some expansion and elision by your humble scribe. Please follow the thread for the verbatim responses in context. Also, see the much more detailed Feature-Annotation HOWTO.

Features vs. Annotations

In imprecise "user-centric" terms, a feature is metadata attached to a particular section or fragment of a sequence, while an annotation is metadata attached to the sequence object itself, and so describes something about the entire sequence.

In more precise, "implementor-centric" terms, an annotation object is something that you would like to attach to an annotatable object. The reason you want to attach it (and any semantics implied by that reason) is in the tag. The implementation principle that captures the "user-centric" idea is:

A feature is (and in the BioPerl-way of looking at the world, should be) locatable, whereas annotation is not.

Who accesses what, from where?

The typical use case is as follows:

  • Download a GenBank file
  • Access the file as a stream using Bio::SeqIO
  • Pull a Bio::Seq object from the stream using Bio::SeqIO::next_seq
  • Read GenBank features/annotations from the Bio::Seq object using object methods

For the user, the question arises, "How do I get at these furshlugginer annotations??" Perhaps it's better to ask "How do I get at these furshlugginer metadata??", since separating features and annotations in the mind according to the ideas above will help the user look in the right place for the metadata desired. Expletives are optional.

The Bio::Seq object (call it $seq) contains an annotation property (a Bio::Annotation::Collection object) accessed via $seq->annotation(). It also contains features, accessed by methods such as $seq->get_SeqFeatures, which return Bio::SeqFeatureI objects. These feature objects are generally instances of the Bio::SeqFeature::Generic class, where most of the accessors the average user wants reside.
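As a minimal sketch of the two access paths, the following builds a small Bio::Seq object in memory rather than reading a GenBank file. The sequence, the comment text, and the feature coordinates are made up for illustration; only the method calls reflect the BioPerl API described above.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Annotation::Comment;

# Hypothetical 18-bp sequence standing in for a record pulled from Bio::SeqIO.
my $seq = Bio::Seq->new(-display_id => 'demo', -seq => 'ATGAAACCCGGGTTTTAA');

# Annotation: metadata about the whole sequence. It has a tag but no coordinates.
$seq->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'hypothetical demo record'));

# Feature: metadata about a located stretch of the sequence.
$seq->add_SeqFeature(Bio::SeqFeature::Generic->new(
    -start => 4, -end => 12, -strand => 1, -primary_tag => 'misc_feature'));

# Sequence-level metadata comes from the annotation collection ...
my @comments = map { $_->text } $seq->annotation->get_Annotations('comment');
# ... while located metadata comes from the feature accessor.
my @located  = map { $_->start . '..' . $_->end } $seq->get_SeqFeatures;

print "sequence annotation: @comments\n";
print "feature locations:   @located\n";
```

Only the feature carries start and end coordinates; the comment describes the sequence as a whole.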

The Short Answer

A user often wants to parse the features: start and end coordinates, strand, locus name, source, etc. So the short answer to the user's question is often something like the following (lifted directly from the Bio::Seq POD):

# This block of code loops over sequence features in the sequence
# object, trying to find ones that have been tagged as 'exon'.
# Features have start and end attributes and can be output
# in GFF, a standardized format for sequence features.
 
  foreach my $feat ( $seqobj->get_SeqFeatures() ) {
      if( $feat->primary_tag eq 'exon' ) {
          print STDOUT "Location ",$feat->start,":",
             $feat->end," GFF[",$feat->gff_string,"]\n";
      }
  }

Note that the features are obtained from the feature accessor of the sequence object, and the tags associated with each feature are obtained by using the tag accessors off the feature objects.
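Applied to the original question, the tag accessors on the feature object are the place to look for locus_tag. The sketch below assumes a sequence built in memory with two hypothetical CDS features (one with a locus_tag, one without) so that it does not depend on NC_000913.gbk; with a real file, the loop body would stay the same.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

# Hypothetical stand-in for a sequence object read from a GenBank file.
my $seq = Bio::Seq->new(-display_id => 'demo', -seq => 'ATGAAACCCGGGTTTTAA' x 2);
$seq->add_SeqFeature(Bio::SeqFeature::Generic->new(
    -start => 1, -end => 18, -strand => 1, -primary_tag => 'CDS',
    -tag   => { locus_tag => 'demo_0001' }));
$seq->add_SeqFeature(Bio::SeqFeature::Generic->new(
    -start => 19, -end => 36, -strand => -1, -primary_tag => 'CDS'));  # no locus_tag

my @rows;
foreach my $feat ($seq->get_SeqFeatures) {
    next unless $feat->primary_tag eq 'CDS';
    # get_tag_values() throws if the tag is absent, so test with has_tag() first.
    my @locus = $feat->has_tag('locus_tag')
              ? $feat->get_tag_values('locus_tag')
              : ('-');
    push @rows, join "\t", $feat->start, $feat->end, $feat->strand, @locus;
}
print "$_\n" for @rows;
```

Checking has_tag() before get_tag_values() keeps the loop from dying on features that lack the tag.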

The Details

The key thing to notice here is that the feature objects and not the annotation objects are read. There was a time (see the History below) when the tag accessor methods could be called off the annotation object. This behavior is now deprecated, in part to get people used to making the distinctions being discussed here.

However, remember that in "implementor-centric" terms, an annotation object is something that you would like to attach to an annotatable object. So... $feat->annotation() is a legitimate method call, since Bio::SeqFeature::Generic implements Bio::AnnotatableI.

As Hilmar states:

SeqFeature::Generic has indeed two mechanisms to store annotation, the tag system and the annotation collection. This is because it inherits from SeqFeatureI (which brings in the tag/value annotation) and from AnnotatableI (which brings in annotation()).

I agree this can be confusing from a user's perspective. As a rule of thumb, SeqIO parsers will almost universally populate only the tag/value system, because typically they will (or should) assume not more than that the feature object they are dealing with is a SeqFeatureI.

Once you have the feature objects in your hands, you can add to either tag/values or annotation() to your heart's content. Just be aware that nearly all SeqIO writers won't use the annotation() collection when you pass the sequence back to them since typically they won't really know what to do with feature annotation that isn't tag/value (unlike as for sequence annotation).

If in your code you want to treat tag/value annotation in the same way as (i.e., as if it were part of) the annotation that's in the annotation collection then use Bio::SeqFeature::AnnotationAdaptor. That's in fact what bioperl-db does to ensure that all annotation gets serialized to the database no matter where it is.

[However,] unless expressly told otherwise in a parser documentation, a sequence you get back from one of the SeqIO parsers (which is where most people will get them from) will not have $feat->annotation() populated.
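The two storage mechanisms can be seen side by side in a small sketch. It assumes a BioPerl release after the rollback described in the History below (1.6 or later), where Bio::SeqFeature::Generic keeps tag/value pairs separate from its annotation collection. The feature and its locus_tag value are hypothetical.

```perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;

# A feature populated the way a SeqIO parser typically would: tags only.
my $feat = Bio::SeqFeature::Generic->new(
    -start => 190, -end => 255, -strand => 1, -primary_tag => 'CDS',
    -tag   => { locus_tag => 'demo_0001' });   # hypothetical tag value

# Mechanism 1: the tag/value system inherited from Bio::SeqFeatureI.
my @from_tags = $feat->get_tag_values('locus_tag');

# Mechanism 2: the annotation collection from Bio::AnnotatableI.
# Nothing was added to it, so it holds no 'locus_tag' entries.
my @from_coll = $feat->annotation->get_Annotations('locus_tag');

printf "tag/value store:       %d value(s)\n",  scalar @from_tags;
printf "annotation collection: %d object(s)\n", scalar @from_coll;
```

This is consistent with the 0 printed by the script in the question: get_Annotations() called in scalar context on an empty collection yields a count of zero.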

The Best of Both Worlds : Bio::SeqFeature::AnnotationAdaptor

If you want to get all the metadata for a sequence object, and don't care (implementation-wise) where it comes from, use Bio::SeqFeature::AnnotationAdaptor:

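# (from Hilmar)
use Bio::SeqFeature::AnnotationAdaptor;

# One adaptor instance is reused for every feature
my $anncoll = Bio::SeqFeature::AnnotationAdaptor->new();
foreach my $feat ($seq->get_all_SeqFeatures) {
    # Point the adaptor at the current feature
    $anncoll->feature($feat);
    # Tag values come back as Bio::AnnotationI objects
    my @vals = $anncoll->get_Annotations('locus_tag');
    # do something with @vals
}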

The description of SeqFeature::AnnotationAdaptor from its POD summarizes its raison d'être, and conveys another important distinction between Bio::SeqFeatureI and Bio::AnnotationCollectionI:

Bio::SeqFeatureI defines light-weight annotation of features through tag/value pairs. Conversely, Bio::AnnotationCollectionI together with Bio::AnnotationI defines an annotation bag, which is better typed, but more heavy-weight because it contains every single piece of annotation as objects. The frequently used base implementation of Bio::SeqFeatureI, Bio::SeqFeature::Generic, defines an additional slot for AnnotationCollectionI-compliant annotation.

This adaptor provides a Bio::AnnotationCollectionI compliant, unified, and integrated view on the annotation of Bio::SeqFeatureI objects, including tag/value pairs, and annotation through the annotation() method, if the object supports it. Code using this adaptor does not need to worry about the different ways of possibly annotating a SeqFeatureI object, but can instead assume that it strictly follows the AnnotationCollectionI scheme. The price to pay is that retrieving and adding annotation will always use objects instead of light-weight tag/value pairs.

In other words, this adaptor allows us to keep the best of both worlds. If you create tens of thousands of feature objects, and your only annotation is tag/value pairs, you are best off using the features' native tag/value system. If you create a smaller number of features, but with rich and typed annotation mixed with tag/value pairs, this adaptor may be for you. Since its implementation is by double-composition, you only need to create one instance of the adaptor. In order to transparently annotate a feature object, set the feature using the feature() method. Every annotation you add will be added to the feature object, and hence will not be lost when you set feature() to the next object.
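A short sketch of the unified view and the write-through behavior described in the POD follows. The feature, the tag values, and the comment text are hypothetical; the calls follow the adaptor's documented synopsis, and the separation of tags from the collection again assumes a post-rollback BioPerl.

```perl
use strict;
use warnings;
use Bio::SeqFeature::Generic;
use Bio::SeqFeature::AnnotationAdaptor;
use Bio::Annotation::Comment;
use Bio::Annotation::SimpleValue;

# A feature carrying one tag/value pair and one typed annotation object.
my $feat = Bio::SeqFeature::Generic->new(
    -start => 1, -end => 18, -primary_tag => 'CDS',
    -tag   => { locus_tag => 'demo_0001' });
$feat->annotation->add_Annotation('comment',
    Bio::Annotation::Comment->new(-text => 'curated by hand'));

my $adaptor = Bio::SeqFeature::AnnotationAdaptor->new(-feature => $feat);

# Reading: both stores are visible through one get_Annotations() call.
# Tag values are wrapped as Bio::Annotation::SimpleValue objects.
my @locus = map { $_->value } $adaptor->get_Annotations('locus_tag');
my @notes = map { $_->text  } $adaptor->get_Annotations('comment');

# Writing: a SimpleValue added through the adaptor lands in the
# feature's own tag/value system.
$adaptor->add_Annotation(
    Bio::Annotation::SimpleValue->new(-tagname => 'gene', -value => 'demoA'));
my @gene = $feat->get_tag_values('gene');

print "locus_tag: @locus\ncomment:   @notes\ngene:      @gene\n";
```

Because added annotation is stored on the feature itself, pointing the same adaptor at the next feature with feature() does not discard it.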

See the thread for a nice extended metaphor involving Reese's Peanut Butter Cups. This shouldn't, of course, be construed as an endorsement for them. (Extended metaphors, that is.)

A History of BioPerl Annotation (Chris Fields)

Explaining why things were set up this way (and then reverted) takes a bit of a history lesson. I believe prior to 1.5.0, Bio::SeqFeature::Generic stored most second-class data (dbxrefs, simple secondary tags, etc.) as simple untyped text via tags but also allowed a Bio::Annotation::Collection. Therefore one effectively gets a mixed bag of first-class untyped data like display_id and primary_tag, untyped tagged text, and 'typed' Bio::AnnotationI objects.

Some of this was an attempt to somewhat 'correct' this for those who wanted a cohesive collection of typed data out-of-the-box. Essentially, everything becomes a Bio::AnnotationI. I believe Bio::SeqFeature::Annotated went a step further and made almost everything Bio::AnnotationI (including score, primary_tag, etc.) and type-checked tag data against SOFA.

As there were collisions between SeqFeature-like 'tag' methods and CollectionI-like methods, the design thought was to store all tag data as Bio::Annotation in a Bio::Annotation::Collection, then eventually deprecate the tag methods in favor of those available via the CollectionI. These deprecations were placed in Bio::AnnotatableI, so any future Bio::SeqFeatureI implementations would also get the deprecation. As noted, Bio::SeqFeature::Generic implements these methods so isn't affected.

Now, layer in the fact that many of these (very dramatic) code changes were literally introduced just prior to the 1.5.0 release, AFAIK w/o much code review, and contained additional unwanted changes such as operator overloading and so on. Very little discussion about this occurred on list until after the changes were introduced (a good argument for small commits). Some very good arguments against this were made, including other lightweight implementations. Lots of angry devs!

Though the intentions were noble we ended up with a mess. I yanked these out a couple of years ago frankly out of frustration with the overloading issues: see Feature Annotation rollback.
