[Bioperl-l] GFF3 Validator

Andrew Dalke dalke at dalkescientific.com
Mon Apr 10 01:33:31 EDT 2006


Lincoln:
> I just wanted to draw your attention to a GFF3 validator put up by the
> NIAID BRC program. It checks that the GFF3 file is well formed and that
> only SO terms are used in the feature type column. It doesn't check that
> parent/child relationships follow the spec, but it is a real good start.
>
> The documentation: http://iowg.brcdevel.org/gff3.html
> Standalone script: http://iowg.brcdevel.org/gff3validator/
> Web-based script:
> 	http://www.tigr.org/tigr-scripts/prok_manatee/brc-central/gff_validation.cgi

As part of the DAS2 project, I wrote a basic GFF3 parser in Python a
couple of weeks ago.  If anyone is interested, I've put a snapshot at
   http://www.dalkescientific.com/PyGFF3-0.5.tar.gz

There are a few performance tricks that might be translatable to the
bioperl or gff3validator code, mostly ways to optimize for the
common case.

I bring it up because I've implemented, but not rigorously tested,
validation code that checks that the features in a set contain no
cycles.  Once I have found all of the features in a set, I do a
topological sort on them.  If no topological sort is possible, there's
a cycle and hence an error.

The code for toposort.py is very simple, with about 15 lines for
the initialization and another 15 for the recursive part.  It should
be easy to convert just that part without having to go through the
rest of the Python code.
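
For readers who want to see the idea without downloading the snapshot,
the following is a minimal sketch of cycle detection by recursive
topological sort.  It is not the toposort.py from PyGFF3; the function
name, the CycleError class, and the dict-of-parent-IDs input are all
assumptions made for illustration.

```python
class CycleError(ValueError):
    """Raised when the parent/child graph contains a cycle (hypothetical name)."""


def toposort(parents):
    """Return feature IDs ordered so that every parent precedes its children.

    `parents` maps a feature ID to a list of its parent IDs (an assumed
    input shape, not the data structure used by any real parser).
    """
    # Initialization: mark every ID unvisited, including IDs that
    # appear only as someone's parent.
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    for child, parent_ids in parents.items():
        state.setdefault(child, UNVISITED)
        for parent_id in parent_ids:
            state.setdefault(parent_id, UNVISITED)
    order = []

    # Recursive part: depth-first walk from each feature up to its parents.
    def visit(node, path):
        if state[node] == DONE:
            return
        if state[node] == IN_PROGRESS:
            # We came back to a node that is still on the current path.
            cycle = path[path.index(node):] + [node]
            raise CycleError("cycle: " + " -> ".join(cycle))
        state[node] = IN_PROGRESS
        path.append(node)
        for parent_id in parents.get(node, ()):
            visit(parent_id, path)
        path.pop()
        state[node] = DONE
        order.append(node)  # all parents are already in `order`

    for node in list(state):
        visit(node, [])
    return order
```

A valid gene/mRNA/exon hierarchy sorts with parents first, while a pair
of features that name each other as Parent raises CycleError.  The
recursion depth equals the depth of the feature hierarchy, which is
assumed here to be shallow.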

Please note that so far I have parsed only two GFF files with this
parser.  It is incomplete and untested.  I wanted to get an idea
of how hard it is to handle GFF-like complex features: hard enough
that I'm going to propose a change to DAS2 to make it easier.

					Andrew
					dalke at dalkescientific.com


