[Bioperl-l] bioperl + GFF3 audit

Chris Fields cjfields at uiuc.edu
Tue Sep 18 23:37:30 EDT 2007


On Sep 18, 2007, at 7:04 PM, Jason Stajich wrote:

> Something to throw out there for discussion with GFF3 gurus.  Maybe
> we can have a little STATE-OF-GFF3 and compliance at the GMOD
> workshop after Genome Informatics in Nov?
>
> I propose after we get the next stable release out we consider doing
> a systematic code audit to ensure that we can really generate proper
> GFF3 compliant data from all of our parsers.  This would include both
> good ID/Parent as well as .  I'd be happy to also think about making
> sure we can generate proper GTF/GFF2.5 - whether this means we have a
> translator that works on these objects or we have to code this into
> the parser software that creates the sequence features, not sure.
> The whole Bio::Tools mishmash is a little unsettling when trying to
> generate standardized output.  I'm not really clear if Bio::FeatureIO
> actually tries to do this properly, but 'gene_id'/'transcript_id' for
> GTF and ID/Parent 3-level Features for gene->transcript->exon/CDS
> doesn't really come out properly and I end up writing workarounds on
> the downstream data.
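
To make the two conventions concrete, here is a rough sketch of how column 9 differs between GFF3 (ID/Parent) and GTF (gene_id/transcript_id), and how the GTF ids could be recovered by walking Parent links. It is plain Python rather than BioPerl, and every function name and identifier in it is hypothetical; it is not what Bio::FeatureIO or Bio::Tools::GFF currently does.

```python
# Hypothetical helpers for illustration only; none of these names exist in BioPerl.

def gff3_attributes(feature_id, parent_id=None):
    """GFF3 column 9: semicolon-separated tag=value pairs."""
    pairs = [("ID", feature_id)]
    if parent_id is not None:
        pairs.append(("Parent", parent_id))
    return ";".join(f"{tag}={value}" for tag, value in pairs)


def gtf_attributes(gene_id, transcript_id):
    """GTF column 9: 'tag "value";' pairs, gene_id first, then transcript_id."""
    return f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'


def gtf_ids_for(feature_id, parents):
    """Follow the ID -> Parent map two levels up (exon -> transcript -> gene)."""
    transcript_id = parents[feature_id]
    gene_id = parents[transcript_id]
    return gene_id, transcript_id


# Made-up identifiers: exon1 belongs to mRNA1, which belongs to gene1.
parents = {"mRNA1": "gene1", "exon1": "mRNA1"}

print(gff3_attributes("exon1", parents["exon1"]))
print(gtf_attributes(*gtf_ids_for("exon1", parents)))
```

The point is only that GTF output needs both ancestors of an exon/CDS at once, so a translator has to be able to see the whole gene -> transcript -> exon chain, not one feature at a time.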

This suggests we should try to get a stable release out fairly quickly
and start work on the next dev release straight away.  I'm okay with
that, though it would be nice to finish up a few loose ends first, the
svn move foremost.

The Feature/Annotation stuff has been pretty much rolled back so  
maybe a stable release can be done fairly quickly.  My main concern  
was that any rollback would break FeatureIO or SF::Annotated, but so  
far FeatureIO and SF::Annotated both pass tests.  However, I think  
both also need better documentation and possibly more/better test  
coverage.

> One aspect that is biting is the flat versus multi-level features
> (genes -> transcripts -> exons) and how we handle them.  I think this
> ought to get fleshed out better so we can really support .  A lot of
> the Bio::Tools parsers are generally pretty laissez-faire here about
> things and we have a variety of non-standard and non-compliant  
> aspects.

Agreed.

> For example, I am playing with tRNA parsing and I assume that proper
> GFF3 here is three levels:
> gene -> tRNA -> exon
> with those being the primary_tag names that correspond to the
> Sequence Ontology.
>
> I have modified the code locally to report generic features but which
> have sub-features that must be extracted.  In addition the ID/Parent
> fields are explicitly filled in and I wonder if we want to do a
> better job ensuring these are meaningfully entered?
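
For reference, this is roughly what I would expect the three-level output to look like, with ID/Parent filled in systematically from one base name. It is a minimal Python sketch, not BioPerl code: the function names, the "example_source" value, the ID naming scheme and the coordinates are all assumptions made up for illustration.

```python
# Illustrative only: names, source label, ID scheme and coordinates are invented.

def gff3_line(seqid, source, ftype, start, end, strand, attrs):
    """Build one nine-column GFF3 line; score and phase are left as '.'."""
    col9 = ";".join(f"{tag}={value}" for tag, value in attrs)
    return "\t".join(
        [seqid, source, ftype, str(start), str(end), ".", strand, ".", col9]
    )


def trna_to_gff3(seqid, name, start, end, strand, exons, source="example_source"):
    """Emit gene -> tRNA -> exon lines with IDs derived from one base name."""
    gene_id = f"{name}.gene"
    trna_id = f"{name}.tRNA"
    lines = [
        "##gff-version 3",
        gff3_line(seqid, source, "gene", start, end, strand, [("ID", gene_id)]),
        gff3_line(seqid, source, "tRNA", start, end, strand,
                  [("ID", trna_id), ("Parent", gene_id)]),
    ]
    # Each exon points at the tRNA, which in turn points at the gene.
    for number, (exon_start, exon_end) in enumerate(exons, start=1):
        lines.append(
            gff3_line(seqid, source, "exon", exon_start, exon_end, strand,
                      [("ID", f"{name}.exon{number}"), ("Parent", trna_id)])
        )
    return lines


for line in trna_to_gff3("chr1", "trna1", 100, 200, "+", [(100, 135), (160, 200)]):
    print(line)
```

Deriving every ID from a single base name is just one way to keep ID/Parent values meaningful and unique within a file; whatever scheme we settle on, the parsers should probably share it.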

Would a factory approach work here?  For instance, have a Factory  
which generates the SeqFeature type you want on the fly if passed  
appropriate parameters and location, say flattened vs unflattened,  
strictly typed vs lightweight, etc.  For that matter, maybe we could  
reimplement FTHelper in SeqIO to do the same...
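
To show the sort of thing I mean, here is a very rough sketch of such a factory. It is Python rather than Perl, and the Feature and FeatureFactory classes, the flatten and allowed_tags options and the nested-tuple input are all hypothetical; nothing like this exists in BioPerl as written here.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical lightweight feature; not a BioPerl class."""
    primary_tag: str
    start: int
    end: int
    sub_features: list = field(default_factory=list)


class FeatureFactory:
    """Builds features from (tag, start, end, children) tuples.

    flatten=True returns one flat list; flatten=False nests children.
    allowed_tags=None skips type checking (the 'lightweight' mode).
    """

    def __init__(self, flatten=False, allowed_tags=None):
        self.flatten = flatten
        self.allowed_tags = allowed_tags

    def build(self, tag, start, end, children=()):
        if self.allowed_tags is not None and tag not in self.allowed_tags:
            raise ValueError(f"primary_tag {tag!r} is not an allowed type")
        feature = Feature(tag, start, end)
        result = [feature]
        for child in children:
            built = self.build(*child)
            if self.flatten:
                result.extend(built)               # siblings in one flat list
            else:
                feature.sub_features.extend(built)  # keep the hierarchy
        return result


# Made-up coordinates for a gene -> tRNA -> exon hierarchy.
spec = ("gene", 100, 200,
        [("tRNA", 100, 200, [("exon", 100, 135, []), ("exon", 160, 200, [])])])

nested = FeatureFactory().build(*spec)
flat = FeatureFactory(flatten=True).build(*spec)
print(len(nested), [f.primary_tag for f in flat])
```

The parsers would then only describe what they found, and the caller would choose flattened vs unflattened and strict vs lightweight when constructing the factory.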

> So if there are interested people out there we can try and hammer out
> a todo list on the wiki and see if we're generating proper GFF3 in
> the first place and trying to make sure all the features that get fed
> out to Bio::FeatureIO or Bio::Tools::GFF can get properly transformed
> into GFF3 and GTF output.
>
> Comments/Volunteers?
>
> -jason
>
> --
> Jason Stajich
> jason at bioperl.org

I'll be busy 'til mid-Oct but I'll chip in.  I'll keep tabs on the wiki.

chris

