[Bioperl-l] [Gmod-gbrowse] is this a bp_genbank2gff3.pl bug?

Scott Cain cain.cshl at gmail.com
Tue Jun 19 14:41:52 EDT 2007


Hi Alessandra,

I cc'ed your message to the bioperl and sequence ontology mailing lists,
since your question is relevant to both.

Converting genbank files to GFF3 is excruciatingly difficult; I
generally find that I can use the genbank2gff3 script to get me most of
the way there, but then I need to do some manual fixing to make it
'right'.

I am using bioperl-live, since there have been several fixes to the
script since bioperl 1.5.2 was released, including the most recent fixes
from me today (when I started working on this); I would suggest you use
bioperl-live as well.  I ran the script on chrY.

Most (perhaps all) of the errors fit into a few categories:

  - CDS features don't have a phase, whereas the GFF3 spec requires every
CDS to have one.  Since it can be a bit of a hassle to calculate, I
understand why it was left out, but I'll submit a bug report to have
those calculated.  If you are planning on loading the GFF file into
Chado, you can use the --noCDS option to get exons instead of CDSes,
which makes the problem go away.  (The validator has a bug here, though:
it reports the polypeptide derives_from mRNA relationship as invalid,
but it is correct; I'm reporting that directly to the author.)  Here's
the bioperl bug report:

  http://bugzilla.open-bio.org/show_bug.cgi?id=2322

  - "invalid type pair" is caused by the genbank file using feature
types in a way that conflicts with the Sequence Ontology.  For example,
it has STS features that are part_of a gene, and pseudogenic_region
features that are part_of a pseudogene.  I don't know if there would be
an easy way to catch this in the conversion script.  You may need to fix
these by hand.  If
the problems occur for features that you don't care about, you can use
the --filter option to leave them out of the resulting GFF file (for
example, adding '--filter STS' would leave all STS features out of the
file).  Also, if you don't plan on loading these into Chado (which does
require SO-compliance) but instead plan on using a Bio::DB::SeqFeature
database, these errors may not be a problem.

  - "invalid type" is caused by feature types that are not in SOFA
(Sequence Ontology for Feature Annotation), though the terms probably
are in SO.  I thought at one point we discussed allowing any SO type to
appear in the GFF3 type column, but that is not what the spec says now.
I don't see this type of error as causing a problem for either
Bio::DB::SeqFeature or Chado.  Chado allows features to be typed with
anything that is in SO and does not restrict to SOFA.
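
To illustrate the phase calculation mentioned in the first item, here is
a minimal Python sketch.  It is not the code in bp_genbank2gff3.pl, and
the function name cds_phases is made up for this example.  It assumes
1-based inclusive coordinates, that all segments belong to one coding
sequence, and that the first segment starts on a codon boundary (phase
0, i.e. a GenBank /codon_start of 1); pass a different first_phase
otherwise.

```python
def cds_phases(segments, strand="+", first_phase=0):
    """Return (start, end, phase) for each CDS segment in transcription order.

    segments: list of (start, end) tuples, 1-based inclusive (an assumption
    of this sketch).  GFF3 phase is the number of bases to skip at the
    5' end of a segment to reach the first base of the next full codon.
    """
    # On the minus strand, translation begins at the highest coordinate.
    ordered = sorted(segments, key=lambda s: s[0], reverse=(strand == "-"))
    result = []
    phase = first_phase
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after the last complete codon must be
        # completed by the start of the next segment.
        phase = (3 - (length - phase) % 3) % 3
    return result


if __name__ == "__main__":
    for start, end, phase in cds_phases([(1, 10), (21, 30), (41, 48)]):
        print(start, end, phase)
```

The phases returned would go into column 8 of the corresponding CDS
lines in the GFF3 file.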

Scott




On Tue, 2007-06-19 at 16:56 +0200, Alessandra Bilardi wrote:
> Hi all,
> 
> I used bp_genbank2gff3.pl with CVS bioperl to create GFF3 from a
> human genbank file. I ran the online validate_gff3 on human.gff and
> it reports non-unique IDs, so loading into the gbrowse database fails.
> 
> I attach the error files for hs_ref_chrY.gbk and hs_ref_chr1.gbk, which
> I downloaded from ftp://ftp.ncbi.nih.gov/genomes/H_sapiens
> The elements with non-unique IDs are:
> - CDS or pseudo*exon without mRNA and parent 
> - STS with equal start and end
> - tRNA with equal name
> 
> If this is a bp_genbank2gff3.pl bug, can you rectify bp_genbank2gff3.pl?
> If I'm mistaken, can you help me?
> 
> Thanks very much for the help in advance,
> 
> Alessandra.
> 
-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                         cain at cshl.edu
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory