[Bioperl-l] Bio::DB::SeqFeature treamtent of tags and annotations

Cook, Malcolm MEC at stowers-institute.org
Fri Feb 2 16:47:08 EST 2007

I don't think that adding this directive is a good idea after all
But, I see that you remap the ID= to a load_id attribute which is
preserved in the Bio::DB::SeqFeatureStore database.
And then it gets squelched during GFF production by
However, if ID is prone to clashes, then certainly simply renaming the
attribute to be load_id does not preclude clashes from happening, and
only courts disaster.  Don't you think?
I'm a little blurry on the GFF3Loader, but it looks like you're using
load_id to facilitate loading parent/child features out of order.  Is
that right?  If so, I suggest you delete all load_id features
immediately after performing a load.  It has not further use.
Or, you might consider instead of `round-trip-ids` directive, rather,
give the GFF3Loader  an IDAttribute option which would allow the use of
the loader to preserve the ID values, but to use a named
In my case, munging flybase gff,  I would then use it like this:
bp_seqfeature_load.PLS --fast --IDAttribute flybaseID
which would preserve the ID values in the database but under the
FlybaseID attribute for features so loaded.
On a related topic:
I just committed this patch to Bio::DB::SeqFeature::NormalizedFeature

_create_subfeatures : ensure that subfeatures get the `source` of their

While doing this I wonder: what is the -class that subfeatures are
getting from their parent...??? I left it in place. Please advise! Fix
my thinking....


Further, I observe that Bio::Graphics::FeatureBase::new handles the
-segments option is to call add_segment.  So, when I create a new
Bio::DB::SeqFeature with -segments [[ 100,200 ] [300,400]], the
-segments option gets handled by Bio::Graphics::FeatureBase::new, which,
as mentioned, calls add_segment. The surprising thing to me when thrying
to trace through the class modules and understand what is going on is
that what gets run at this point is not
Bio::Graphics::FeatureBase::add_segment, but rather
Bio::DB::SeqFeature::add_segment, whose semantics is different in at
least one regard, namely, that it does not set the start and stop of the
parent feature from the min and max of the segments.

I have committed a patch to Bio::Graphics::FeatureBase with a comment to
this effect, and have also patched it's add_segment method to copy the
parent's source into the segment.

I hope my commits and suggestions further the cause.  Let me know if
-- Malcolm


	From: lincoln.stein at gmail.com [mailto:lincoln.stein at gmail.com]
On Behalf Of Lincoln Stein
	Sent: Tuesday, January 30, 2007 4:46 PM
	To: Cook, Malcolm
	Cc: bioperl list; lstein at cshl.org
	Subject: Re: Bio::DB::SeqFeature treamtent of tags and
	I've fixed the first issue in CVS. Sorry for the inconsistency.
add_tag_value(), delete_tag_value() and get_Annotations() now all work
as expected.
	The problem with the ID column is that it is supposed to be
LOCAL to the GFF3 file and is not intended to be stored in the database.
In contrast, Name can survive roundtripping. Perhaps the thing to do is
to add a flag to the GFF3 file that turns on ID round-tripping, e.g.
	##round-trip-ids: 1
	If you like this idea, I can implement it.
	On 1/29/07, Cook, Malcolm < MEC at stowers-institute.org
<mailto:MEC at stowers-institute.org> > wrote: 

		Thanks for your suggestions on approach to my problems
augmenting Flybase annotation.  I am trying to follow them and finding
the following oddities
		The first issue relates to the intermix of 'annotations'
and 'tag values'.  I find that Bio::DB::SeqFeature implements some of
the 'tag' methods and some of the 'Annotation' methods.  Here is a perl
one-liner that shows values stored using add_tag_value are not retreived
with get_tag_values, but rather with get_Annotations.
		> perl -MBio::DB::SeqFeature -e 'my
$f=Bio::DB::SeqFeature->new; $f->add_tag_value("x",666); print
"get_tag_values:\t" . $f->get_tag_values("x") . "\nget_Annotations:\t" .
		whose output is:
		get_Annotations:    666
		Tracing this shows me that this results from the fact
		Bio::DB::SeqFeature uses of Bio::Graphics::FeatureBase
(via Bio::DB::SeqFeature::NormalizedFeature) which does not support
-tags in ->new but rather -attributes, viz:
		  -attributes   a hashref of tag value attributes, in
which the key is the tag
		                  and the value is an array reference of
		And though Bio::Graphics::FeatureBase purports to
implement Bio::SeqFeatureI, it only partially implements the  'tag'
methods (now deprecated and relegated to Bio::AnnotatableI).  In
particular, the '*' methods Bio::SeqFeatureI are not implemented in

		*  add_tag_value
		*  remove_tag

		As a result, add_tag_value and remove_tag are inherited
from different modules whose understanding of tags is not the same!

		This one-liner :

		>perl -MClass::ISA -MClass::Inspector
-MBio::DB::SeqFeature -e 'my @c =
Class::ISA::self_and_super_path("Bio::DB::SeqFeature"); foreach my $fn
qw(add_tag_value get_tag_values) {print "\n$fn:\t", join "\t", (grep
{Class::Inspector->function_exists($_, $fn)} @c)}'

		confirms that they are defined in different packages,

		add_tag_value: Bio::AnnotatableI 
		get_tag_values: Bio::Graphics::FeatureBase

		Proposed solution...  hmmmm ..... I dunno.... maybe the
following patch to Bio::Graphics::FeatureBase->add_tag_value :
		sub add_tag_value {
		  my ($self,$tag, at vals) = @_;
		  push @{$self->{attributes}{$tag}}, @vals;
		It fixes my use case for now but I'm still concerned and
confused about this variety of methods.  


		Also, I think that any "ID" in column 9 of GFF3 float
file should be preserved through a round-trip through a
Bio::DB::SeqFeature store, but this is not yet possible since any ID
attribute in GFF3 column 9 is being lost by GFF3Loader, causing me to
locally patch GFF3Loader::handle_feature method to add the following:

		  # mec at stowers-institute.org
<mailto:mec at stowers-institute.org>  , wondering why not all attributes
		  # carried forward, adds ID tag in particular service
		  # round-tripping ID, which, though present in database
as load_id
		  # attribute, was getting lost as itself
		  $unreserved->{ID}= $reserved->{ID}     if exists

		Poised to patch.... what d'you think?

		Malcolm Cook
		Stowers Institute for Medical Research - Kansas City,



			From: lincoln.stein at gmail.com [mailto:
lincoln.stein at gmail.com <mailto:lincoln.stein at gmail.com> ] On Behalf Of
Lincoln Stein
			Sent: Tuesday, December 19, 2006 3:58 PM
			To: Cook, Malcolm
			Cc: bioperl list; lstein at cshl.org
			Subject: Re: bp_seqfeature_load /
Bio::DB::SeqFeature::Store::GFF3Loader problems augmenting Flybase
			Hi Malcom,
			Your second guess was right. The use case of
augmenting an existing gene with additional splice forms isn't provided
for. You can get the functionality by making direct calls to
Bio::DB::SeqFeature::Store methods:
			my @genes =
			@genes == 1 or die "Didn't get exactly one
			my $parent = $genes[0];
			my $parent = $genes[0];
			my $chr    = $parent->seq_id;
			my $start  = $parent->start;
			my $end    = $parent->end;
			my $strand = $parent->strand;
			my $new_splice_form =
$db->new_feature(-primary_tag => 'mRNA',
			                       -source      => 'added',
			                       -seq_id   => '4r',
			                       -strand   => $strand,
			                       -start    => $start+10,
			                       -end      => $end,
			for my $pos
([$start+10,$start+100],[$start+200,$end]) {
			  my ($e_start,$e_end) = @$pos;
			  my $exon =
Bio::DB::SeqFeature->new(-primary_tag => 'exon',
=> $db,
			                      -seq_id      => '4r',
			                      -strand     => $strand,
			                      -start       => $e_start,
			                      -end         => $e_end);
			I found a bug in updating the seqfeature
database when I wrote this script, so you'll have to get the latest
biperl live. I think you can use this to write a splice form updating
			In order to support the idea of adding new
splice forms to an existing gene using the GFF3 format, I will have to
either modify the loader, or write a separate script (probably better to
do the latter). It shouldn't be hard if you'd like to give it a try.
			On 12/19/06, Cook, Malcolm
<MEC at stowers-institute.org <mailto:MEC at stowers-institute.org>  > wrote: 

				Lincoln and fellow Bio::DB::SeqFeature
				I find that using bp_seqfeature_load.PLS
to load subfeatures of genes
				already loaded using
bp_seqfeature_load.PLS fails with
				------------- EXCEPTION  ------------- 
				MSG: FBgn0017545 doesn't have a primary
				STACK toplevel

				Where FBgn0017545 is the ID of a gene
previously loaded.
				I am unsure how to remedy my situation
and welcome any advise on correct
				or improved approach to my problem.
				Here's some detail if it helps.  I am
developing a pipeline to design a 
				microarray probes capable of
distinguishing among splice variants in
				drosophila (using latest Flybase
dmel_r5.1 annotation).  So I
				1) load a filtered selection of Flybase
annotation using
				bp_seqfeature_load.  (for testing
purposes, I am using a single gene's 
				worth of annotation, FBgn0017545.gff,
attached).  This is done as
				        > bp_seqfeature_load.PLS
--create FBgn0017545.gff
				2) analyze all the genes in the
database, and create GFF3 output each 
				feature of which has a 'Parent' that is
a previously loaded gene (i.e.
				FBgn0017545).  (These features represent
the unique introns, splice
				sites, and exonic design targets. Output
of this analysis,
				FBgn0017545_matd.gff, is also attached) 
				3) load these analysis results into the
same database, as follows:
				        > bp_seqfeature_load.PLS
				It is at this point that I get the above
				However, I don't get any error and the
data loads fine if I load the two
				files together, as follows: 
				        > bp_seqfeature_load.PLS
--create <(cat FBgn0017545.gff
				So, I suspect that either I am
misunderstanding when/how to use
				bp_seqfeature_load.PLS or else this use
case has not yet arisen and must 
				be provided for somehow.
				I am running against bioperl-live
				Thanks for your thoughts and assistance,
				Malcolm Cook
				Database Applications Manager -
				Stowers Institute for Medical Research -
Kansas City, Missouri 

			Lincoln D. Stein
			Cold Spring Harbor Laboratory
			1 Bungtown Road
			Cold Spring Harbor, NY 11724
			(516) 367-8380 (voice)
			(516) 367-8389 (fax)
			SANDRA MICHELSEN, AT michelse at cshl.edu
<mailto:michelse at cshl.edu>  

	Lincoln D. Stein
	Cold Spring Harbor Laboratory
	1 Bungtown Road
	Cold Spring Harbor, NY 11724
	(516) 367-8380 (voice)
	(516) 367-8389 (fax)
	SANDRA MICHELSEN, AT michelse at cshl.edu 

More information about the Bioperl-l mailing list