[Bioperl-l] [Gmod-schema] Loading NCBI/GenBank bacteria into CHADO: Chromosome/Plasmid gene name conflicts

Leighton Pritchard lpritc at scri.ac.uk
Mon Mar 15 07:55:52 EDT 2010


Hi Scott,

Thanks for the reply.  I tried your suggestions on a clean VM of CentOS 5.4
and the equally wordy <grin> outcome is below...

On Tuesday, March 2, 2010, 16:11, "Scott Cain" <scott at scottcain.net>
wrote:

> First, I am working on the 1.1 release of gmod/chado, and it
> may fix some of the problems you are describing.  Certainly, ID
> collisions between GFF files should not be a problem (I didn't think
> they were in the 1.0 release, but that was a long time ago).  Please
> try a checkout of the schema trunk in the gmod svn:
> 
>   http://gmod.org/wiki/SVN

As a note for anyone following this: when I downloaded only the trunk/chado
files, my build failed with:

"""
$ make
[...]
Manifying ../blib/man3/Bio::Chaos::ChaosGraph.3pm
Manifying ../blib/man3/Bio::Chaos::FeatureUtil.3pm
Manifying ../blib/man3/Bio::Chaos::XSLTHelper.3pm
Manifying ../blib/man3/Bio::Chaos::Root.3pm
make[1]: Leaving directory `/home/lpritc/Desktop/chado/chaos-xml'
make: *** No rule to make target `bin/gmod_gff2biomart5.pl', needed by `blib/script/gmod_gff2biomart5.pl'.  Stop.
"""

I had to download the whole trunk for the installation to work.  While I was
looking for a solution I came across this thread, where someone else had a
similar problem:

http://old.nabble.com/Minor-Makefile.PL-changes-td26272744.html

> Another thing you may want to look at is that just last week, a
> developer at Texas A&M, Nathan Liles, contributed code to the
> bioperl-live trunk for the genbank2gff3.pl script that will do a much
> better job of converting bacterial genbank files to GFF3; perhaps that
> will help too.  Working with a svn checkout of bioperl-live shouldn't
> be too scary either; the pieces you are interested in (that work with
> Chado and GBrowse) are quite stable.

I also checked out bioperl-live.  The svn server at code.open-bio.org was
unresponsive for a couple of days, but Peter pointed me to GitHub at
http://github.com/bioperl/bioperl-live so I went from there.  The process
isn't quite as clean as using the latest stable version of BioPerl, however.

When I attempt to use the bp_genbank2gff3.pl script, I get the following
error message:

"""
[lpritc@localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
Can't locate object method "FT_SO_map" via package "Bio::SeqFeature::Tools::TypeMapper" at /usr/bin/bp_genbank2gff3.pl line 374.
"""

This appears to be associated with the following code (line 207 onwards) in
TypeMapper:

"""
=head2 map_types_to_SO

[...]

hardcodes the genbank to SO mapping

[...]
dgg: separated out FT_SO_map for caller changes. Update with:

  open(FTSO,"curl -s http://sequenceontology.org/resources/mapping/FT_SO.txt|");
  while(<FTSO>){
    chomp; ($ft,$so,$sid,$ftdef,$sodef)= split"\t";
    print "     '$ft' => '$so',\n" if($ft && $so && $ftdef);
  }

=cut

sub ft_so_map  {
  # $self= shift;
"""

The case of the sub name matters here: the script calls FT_SO_map, but the
module now declares ft_so_map.  Changing the declaration back to
"sub FT_SO_map" lets the script work:

"""
[lpritc@localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
# Input: NC_004547.gbk
# working on region:NC_004547, Erwinia carotovora subsp. atroseptica
SCRI1043, 03-DEC-2007, Erwinia carotovora subsp. atroseptica SCRI1043,
complete genome.
# GFF3 saved to ./NC_004547.gbk.gff
# Summary:
# Feature       Count
# -------       -----
# repeat_region  19
# sequence_variant  2
# repeat_unit  2
# gene  4614
# region  17387
# exon  4597
# RESIDUES  5064019
# 
"""

Obviously, this is another unsatisfactory, sucky ad hoc post-install hack; I
hope I'm doing the right sort of thing there.  I'm not familiar with
BioPerl, so I'm not clear on why this change was made to the interface.  It's
part of the recent changes by Nathan Liles that you referred to in your post:

http://github.com/bioperl/bioperl-live/commit/18dae5436130c7c77e31120af1a37ddcd8a77a03

but it also seems to break bp_genbank2gff3.pl.  Also, the --noCDS flag
appears to have no effect at all when using the new version of
bp_genbank2gff3.pl.
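
Perl sub names are case-sensitive, which is why a call to FT_SO_map fails
once the module declares ft_so_map instead.  As a rough sketch (in Python, and
not part of BioPerl), the check below scans Perl source text for subs whose
names differ from the called method only by letter case.  The function name
find_case_variants and the simple "sub NAME" regex are assumptions made for
illustration; the sample text is the declaration quoted from TypeMapper above.

```python
import re

# Matches simple declarations such as "sub ft_so_map  {" at the start of a line.
# Assumption: anonymous subs and unusual formatting are out of scope here.
SUB_DECL = re.compile(r"^\s*sub\s+([A-Za-z_]\w*)", re.MULTILINE)


def find_case_variants(perl_source, called_name):
    """Return declared sub names equal to called_name except for letter case."""
    declared = SUB_DECL.findall(perl_source)
    return [
        name
        for name in declared
        if name != called_name and name.lower() == called_name.lower()
    ]


# Declaration as quoted from TypeMapper earlier in this message.
sample = "sub ft_so_map  {\n  # $self= shift;\n"
print(find_case_variants(sample, "FT_SO_map"))
```

Run against the whole module file rather than the two-line sample, the same
check would flag any other sub that had been renamed by case only.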

The old version of bp_genbank2gff3.pl (from BioPerl 1.6.1) appears to
recognise more feature types in the summary:

"""
[lpritc@localhost ~]$ bp_genbank2gff3.pl -s NC_004547.gbk
# Input: NC_004547.gbk
# working on region:NC_004547, Erwinia carotovora subsp. atroseptica
SCRI1043, 03-DEC-2007, Erwinia carotovora subsp. atroseptica SCRI1043,
complete genome.
# GFF3 saved to ./NC_004547.gbk.gff
# Summary:
# Feature       Count
# -------       -----
# mRNA  4472
# sequence_variant  2
# gene  4594
# region  8275
# pseudogene  20
# CDS  4472
# RESIDUES(tr)  1433791
# RESIDUES  5064019
# rRNA  22
# processed_transcript  24
# repeat_region  19
# pseudogenic_region  46
# repeat_unit  2
# exon  4597
# tRNA  76
# 
"""

This is reflected in a substantial difference in the GFF3 output: issuing
exactly the same command under BioPerl 1.6.1 and under bioperl-live gives
GFF3 output that represents a different gene model.  I wasn't expecting so
radical a change, but at least the new script bases its IDs on the
locus_tag, and this appears to solve my problem with clashing feature IDs on
the files I was using.
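
To make the difference between the two summaries easier to see, here is a
small Python sketch that parses the "# Feature  Count" blocks printed by the
script and reports which feature types disappeared, appeared, or changed
count.  The counts are copied from the two runs above; the function names
parse_summary and compare_summaries are made up for this illustration, and
the parser assumes each row looks like "# <type>  <count>".

```python
import re

# Summary rows from the bioperl-live run, as shown above.
NEW_SUMMARY = """\
# Feature       Count
# -------       -----
# repeat_region  19
# sequence_variant  2
# repeat_unit  2
# gene  4614
# region  17387
# exon  4597
# RESIDUES  5064019
"""

# Summary rows from the BioPerl 1.6.1 run, as shown above.
OLD_SUMMARY = """\
# Feature       Count
# -------       -----
# mRNA  4472
# sequence_variant  2
# gene  4594
# region  8275
# pseudogene  20
# CDS  4472
# RESIDUES(tr)  1433791
# RESIDUES  5064019
# rRNA  22
# processed_transcript  24
# repeat_region  19
# pseudogenic_region  46
# repeat_unit  2
# exon  4597
# tRNA  76
"""

# A data row is "#", a type name without spaces, then an integer count.
ROW = re.compile(r"^#\s+(\S+)\s+(\d+)\s*$")


def parse_summary(text):
    """Map feature type -> count; header and separator rows are skipped."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def compare_summaries(old, new):
    """Return (types only in old, types only in new, {type: (old, new)})."""
    only_old = sorted(set(old) - set(new))
    only_new = sorted(set(new) - set(old))
    changed = {
        key: (old[key], new[key])
        for key in sorted(set(old) & set(new))
        if old[key] != new[key]
    }
    return only_old, only_new, changed


old_counts = parse_summary(OLD_SUMMARY)
new_counts = parse_summary(NEW_SUMMARY)
only_old, only_new, changed = compare_summaries(old_counts, new_counts)
print("only in old:", only_old)
print("only in new:", only_new)
print("changed:", changed)
```

On these two summaries the comparison shows mRNA, CDS, rRNA, tRNA and the
pseudogene types present only in the old output, with the gene and region
counts differing between the two runs.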

Many thanks for your help,

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405



