[Bioperl-l] [Gmod-schema] Loading NCBI/GenBank bacteria into CHADO: Chromosome/Plasmid gene name conflicts

Scott Cain scott at scottcain.net
Mon Mar 15 10:55:17 EDT 2010


Hi Leighton,

Thanks for the feedback both on getting chado installed from svn and
on the genbank2gff3 converter.  About installing Chado from svn, I
thought I'd modified the Makefile.PL script to gracefully survive not
having the GMODtools directory present; I guess I'll have to revisit
that.  Since I probably won't get to it today, I created a bug report
for it:

  https://sourceforge.net/tracker/?func=detail&aid=2970687&group_id=27707&atid=391291

About the genbank2gff3 script, I'm cc'ing Nathan to make sure he sees
your comments.

Thanks,
Scott



On Mon, Mar 15, 2010 at 7:55 AM, Leighton Pritchard <lpritc at scri.ac.uk> wrote:
> Hi Scott,
>
> Thanks for the reply.  I tried your suggestions on a clean VM of CentOS 5.4
> and the equally wordy <grin> outcome is below...
>
> On Tuesday, March 2, 2010, 16:11, "Scott Cain" <scott at scottcain.net>
> wrote:
>
>> First, I am working on the 1.1 release of gmod/chado, and it
>> may fix some of the problems you are describing.  Certainly, ID
>> collisions between GFF files should not be a problem (I didn't think
>> they were in the 1.0 release, but that was a long time ago).  Please
>> try a checkout of the schema trunk in the gmod svn:
>>
>>   http://gmod.org/wiki/SVN
>
> As a note for anyone following this, when I downloaded the trunk/chado files
> only, my build failed with
>
> """
> $ make
> [...]
> Manifying ../blib/man3/Bio::Chaos::ChaosGraph.3pm
> Manifying ../blib/man3/Bio::Chaos::FeatureUtil.3pm
> Manifying ../blib/man3/Bio::Chaos::XSLTHelper.3pm
> Manifying ../blib/man3/Bio::Chaos::Root.3pm
> make[1]: Leaving directory `/home/lpritc/Desktop/chado/chaos-xml'
> make: *** No rule to make target `bin/gmod_gff2biomart5.pl', needed by
> `blib/script/gmod_gff2biomart5.pl'.  Stop.
> """
>
> I had to download the whole trunk for the installation to work.  I came
> across this thread:
> http://old.nabble.com/Minor-Makefile.PL-changes-td26272744.html
>
> while I was looking for a solution; someone else has had a similar problem.
>
>> Another thing you may want to look at is that just last week, a
>> developer at Texas A&M, Nathan Liles, contributed code to the
>> bioperl-live trunk for the genbank2gff3.pl script that will do a much
>> better job of converting bacterial genbank files to GFF3; perhaps that
>> will help too.  Working with a svn checkout of bioperl-live shouldn't
>> be too scary either; the pieces you are interested in (that work with
>> Chado and GBrowse) are quite stable.
>
> I also checked out BioPerl-live.  The svn server at code.open-bio.org was
> unresponsive for a couple of days, but Peter pointed me to GitHub at
> http://github.com/bioperl/bioperl-live so I went from there.  The process
> isn't quite as clean as using the latest stable version of BioPerl, however.
>
> When I attempt to use the bp_genbank2gff3.pl script, I get the following
> error message:
>
> """
> [lpritc at localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
> Can't locate object method "FT_SO_map" via package
> "Bio::SeqFeature::Tools::TypeMapper" at /usr/bin/bp_genbank2gff3.pl line
> 374.
> """
>
> This appears to be associated with the following code (line 207 onwards) in
> TypeMapper:
>
> """
> =head2 map_types_to_SO
>
> [...]
>
> hardcodes the genbank to SO mapping
>
> [...]
> dgg: separated out FT_SO_map for caller changes. Update with:
>
>  open(FTSO,"curl -s http://sequenceontology.org/resources/mapping/FT_SO.txt|");
>  while(<FTSO>){
>    chomp; ($ft,$so,$sid,$ftdef,$sodef)= split"\t";
>    print "     '$ft' => '$so',\n" if($ft && $so && $ftdef);
>  }
>
> =cut
>
> sub ft_so_map  {
>  # $self= shift;
> """
>
> The capitalisation of the sub name matters (Perl method names are
> case-sensitive), as changing it back to "sub FT_SO_map" lets the script work:
>
> """
> [lpritc at localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
> # Input: NC_004547.gbk
> # working on region:NC_004547, Erwinia carotovora subsp. atroseptica
> SCRI1043, 03-DEC-2007, Erwinia carotovora subsp. atroseptica SCRI1043,
> complete genome.
> # GFF3 saved to ./NC_004547.gbk.gff
> # Summary:
> # Feature       Count
> # -------       -----
> # repeat_region  19
> # sequence_variant  2
> # repeat_unit  2
> # gene  4614
> # region  17387
> # exon  4597
> # RESIDUES  5064019
> #
> """
>
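
A quick way to see which spelling an installed copy of TypeMapper.pm
declares, before editing anything, is to grep the module source.  What
follows is only a sketch: the function name ft_so_spelling is made up for
illustration (it is not part of BioPerl), and it assumes the sub is
declared on a single line, as in the excerpt quoted above.

```shell
# Hypothetical helper (not part of BioPerl): print the spelling(s) of the
# FT_SO_map / ft_so_map sub that a Perl module file declares.
# Usage: ft_so_spelling /path/to/TypeMapper.pm
ft_so_spelling() {
    # -i matches any capitalisation, -o prints only the matched text,
    # and awk keeps the second word, i.e. the sub name as written.
    grep -ioE '^[[:space:]]*sub[[:space:]]+ft_so_map' "$1" | awk '{ print $2 }'
}
```

With BioPerl installed, the module path can be looked up with perldoc -l:

    ft_so_spelling "$(perldoc -l Bio::SeqFeature::Tools::TypeMapper)"

The printed name can then be compared with the one bp_genbank2gff3.pl
calls at the line given in the error message.
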
> Obviously, this is another unsatisfactory, sucky, ad hoc post-install hack; I
> hope I'm doing the right sort of thing there.  I'm not familiar with
> BioPerl, so I'm not clear on why this change was made to the interface (it's
> part of the recent changes by Nathan Liles you referred to in your post:
> http://github.com/bioperl/bioperl-live/commit/18dae5436130c7c77e31120af1a37ddcd8a77a03),
> but it also seems to break bp_genbank2gff3.pl.  Also, the --noCDS flag
> appears to have no effect at all when using the new version of
> bp_genbank2gff3.pl.
>
> The old version of bp_genbank2gff3.pl appears to recognise more feature
> types in the summary:
>
> """
> [lpritc at localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
> # Input: NC_004547.gbk
> # working on region:NC_004547, Erwinia carotovora subsp. atroseptica
> SCRI1043, 03-DEC-2007, Erwinia carotovora subsp. atroseptica SCRI1043,
> complete genome.
> # GFF3 saved to ./NC_004547.gbk.gff
> # Summary:
> # Feature       Count
> # -------       -----
> # mRNA  4472
> # sequence_variant  2
> # gene  4594
> # region  8275
> # pseudogene  20
> # CDS  4472
> # RESIDUES(tr)  1433791
> # RESIDUES  5064019
> # rRNA  22
> # processed_transcript  24
> # repeat_region  19
> # pseudogenic_region  46
> # repeat_unit  2
> # exon  4597
> # tRNA  76
> #
> """
>
> This is reflected in a substantial difference in the GFF3 output: issuing
> exactly the same command with BioPerl 1.6.1 and with bioperl-live produces
> GFF3 that represents a different gene model.  I wasn't expecting so radical
> a change, but at least the IDs are based on the locus_tag with the new
> script, and this appears to solve my problem with clashing feature IDs on
> the files I was using.
>
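
The script summaries can also be cross-checked against the GFF3 files
themselves by tallying the type column directly.  This is a minimal
sketch, assuming standard nine-column, tab-separated GFF3 with the
feature type in column 3; the function name gff3_type_counts is
hypothetical.

```shell
# Hypothetical helper: count features per type in a GFF3 file.
# Prints "type<TAB>count", one type per line, sorted by type name.
gff3_type_counts() {
    # Skip comment/directive lines, and require nine tab-separated fields
    # so that any embedded ##FASTA sequence lines are ignored.
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" |
        LC_ALL=C sort
}
```

To compare the output of the two script versions (the file names here
are placeholders):

    gff3_type_counts old.gff > old.counts
    gff3_type_counts new.gff > new.counts
    diff old.counts new.counts
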
> Many thanks for your help,
>
> L.
>
> --
> Dr Leighton Pritchard MRSC
> D131, Plant Pathology Programme, SCRI
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
> gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



-- 
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   scott at scottcain dot net
GMOD Coordinator (http://gmod.org/)                     216-392-3087
Ontario Institute for Cancer Research


