[Bioperl-l] bp_genbank2gff3.pl

Scott Cain scott at scottcain.net
Sat Sep 18 07:07:13 EDT 2010


Hi Dave,

A fresh "pull" of the bioperl git repository shows that
bp_genbank2gff3.pl already does this.  It writes a locus_tag attribute for
every feature that has a locus_tag, and uses the locus_tag for the ID when
it can (it can't blindly use the locus tag for the ID since both the
gene and the CDS have the same tag).
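
For anyone who wants to check their own output, here is a minimal sketch
(in Python rather than Perl) of a post-hoc check on a GFF3 file. It is not
part of bp_genbank2gff3.pl and does not reproduce its ID rules. It assumes
standard nine-column, tab-separated GFF3 with key=value attributes in the
last column, and the function names are hypothetical. It lists features
that lack a locus_tag attribute and the locus tags shared by more than one
feature, which is the gene/CDS situation that prevents using the tag
directly as a unique ID.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            # GFF3 percent-encodes reserved characters in keys and values.
            attrs[unquote(key)] = unquote(value)
    return attrs


def summarize_locus_tags(gff3_lines):
    """Return (features lacking locus_tag, locus_tags used by several features)."""
    missing = []
    tag_counts = Counter()
    for line in gff3_lines:
        if line.startswith("##FASTA"):
            break  # sequence data follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(columns[8])
        tag = attrs.get("locus_tag")
        if tag is None:
            # Record the feature type and its ID (None if it has no ID).
            missing.append((columns[2], attrs.get("ID")))
        else:
            tag_counts[tag] += 1
    shared = {tag: n for tag, n in tag_counts.items() if n > 1}
    return missing, shared
```

Passing an open file handle for the converted GFF3 to
summarize_locus_tags() returns the two collections; an empty first list
means every feature line carries a locus_tag attribute.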

Scott


On Sat, Sep 18, 2010 at 11:20 AM, David Breimann
<david.breimann at gmail.com> wrote:
> Hi Scott,
>
> Here is a very short GenBank file:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_E24377A/NC_009789.gbk
>
> Note that all genes in the GenBank file have locus tags. In the resulting
> GFF3, however, only the last gene (EcE24377A_B0005) gets a locus_tag. I
> have no idea why it deserves special treatment... :)
>
> P.S. Making this change (i.e., copying locus_tag to the last GFF3 column
> whenever available) would really make my life easier.
>
> Thank you,
> Dave
>
> On Sat, Sep 18, 2010 at 12:08 PM, Scott Cain <scott at scottcain.net> wrote:
>>
>> Hi Dave,
>>
>> That seems perfectly reasonable.  If you could point out a GenBank
>> entry for which that does not happen, I could try to figure out why
>> not.
>>
>> Scott
>>
>>
>> On Sat, Sep 18, 2010 at 10:20 AM, David Breimann
>> <david.breimann at gmail.com> wrote:
>> > Since locus_tag is an essential tag in GenBank, I suggest that
>> > locus_tag always be added to the last GFF column if it exists in the
>> > GenBank file, whether or not it is used as the ID in the GFF.
>> >
>> > On Sat, Sep 18, 2010 at 11:17 AM, Scott Cain <scott at scottcain.net>
>> > wrote:
>> >>
>> >> Hi Dave,
>> >>
>> >> bp_genbank2gff3.pl suffers from the fact that it has to deal with
>> >> GenBank files :-)  It was designed initially to work on whole genome
>> >> refseqs, and contains several ad hoc rules for trying to make it "do
>> >> the right thing."  In practice, it is not unusual for a
>> >> post-processing step (either by hand or a quick Perl script) to be
>> >> required to really get it right.  I don't recall the specifics (if I
>> >> ever knew :-) for when and how the locus tag is used, but I do know
>> >> that there is a list of things that it will try to use for the ID, and
>> >> while the locus is on the list, I don't know where it comes in the
>> >> list, so it's possible that other items might supersede it.
>> >>
>> >> Scott
>> >>
>> >>
>> >> On Sat, Sep 18, 2010 at 10:05 AM, David Breimann
>> >> <david.breimann at gmail.com> wrote:
>> >> > Hello,
>> >> >
>> >> > I'm not sure how bp_genbank2gff3.pl works. Sometimes it adds a
>> >> > `locus_tag` to the attributes and sometimes it doesn't, even
>> >> > though the GenBank file has a locus tag.
>> >> > Also, is the ID always equivalent to the locus tag?
>> >> >
>> >> > Thanks,
>> >> > Dave
>> >> > _______________________________________________
>> >> > Bioperl-l mailing list
>> >> > Bioperl-l at lists.open-bio.org
>> >> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >>
>> >> ------------------------------------------------------------------------
>> >> Scott Cain, Ph. D.                                   scott at scottcain dot net
>> >> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>> >> Ontario Institute for Cancer Research
>> >
>> >
>>
>>
>>
>> --
>> ------------------------------------------------------------------------
>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>> Ontario Institute for Cancer Research
>
>



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


