[Bioperl-l] Fixing bioperl [was Re: [GMOD-devel] Re: [Gmod-gbrowse] Analysis features (Re: Final alpha release of gmod (chado))]

Ewan Birney birney at ebi.ac.uk
Thu Jul 28 19:20:39 EDT 2005

Just my $0.02 on this....

Chris - this seems bang on the money and what we should
do (roll back out the changes, extend the interface and then
in the extended interface have the "scruffy" types delegate
to the short_name or whatever in the main types).

So - for what it is worth, this is the way to go for me.

Chris Mungall wrote:
> [sorry for the cross-posting, but I think it's really important to have a
> gmod to bioperl chit chat on this. I've removed gmod-gbrowse from the cc
> list]
> On Thu, 28 Jul 2005, Scott Cain wrote:
>>Hi Cyril,
>>I think Bio::Tools::GFF is somewhat hacky and not a tool I would use to
>>produce 'safe' GFF3.  On the other hand Bio::FeatureIO is still a little
>>immature, but it is what I used for the chado GFF3 bulk loader, so it
>>does handle (parse) Target features.  So my suggestion would be to use
>>BFIO::gff, but be prepared for some problems; when you find them
>>complain loudly on the bioperl mailing list or fix the problems and
>>commit them (or both!).
> I think the answer may be even more complicated than this.
> Lurkers and contributors to the bioperl mailing list may have noticed that
> there have been some major obstacles to progress lately, particularly in
> getting a stable release of the code out. bp1.4 is fairly old, 1.5 is a
> developer's release, though this is the one required by GMOD.
> My understanding is that this bottleneck can be traced back to changes in
> the SeqFeature and Annotation model. These changes appear to be required
> by Bio::SeqFeature::Annotated which is produced by Bio::FeatureIO::gff
> (which in turn is used by the GMOD bulk loader, which is the main reason
> GMOD requires 1.5, I believe?). Unfortunately, these changes also break
> existing code and have a severe negative impact on memory usage.
> Before advising Cyril and others to switch to BFIO::gff I think it's
> important to make sure there is a clear path forward with bioperl. My
> impression is that there is something of a stalemate here. The bioperl
> developers would like to retract the aforementioned changes, but they
> believe they cannot do this without breaking GMOD code.  They are also
> extremely uncomfortable about leaving these changes in. Everyone gives up
> and starts coding around bioperl.
> Here is why the changes were introduced:
> BioPerl has a 'scruffy' typing model, whereby feature types (primary_tag
> in bioperl) and featureprop types (tags in bioperl) are labels or strings.
> In contrast, Chado forces all types to be some class or relation in an
> ontology.
> Now obviously I'm rather partial to the Chado model, but that doesn't mean
> I think it should be forced upon bioperl. I often use bioperl in scruffy
> mode (on scruffy data); or in some combination whereby I map the scruffy
> types to ontologies in some non-bioperl code. When using bioperl as a
> middleware component over a nicely organised database, ontology-typed mode
> is definitely best. However, the majority of bioperl users (including
> myself) spend a large proportion of their time working with scruffy data,
> in which case lightweight scruffy types are more appropriate.
> It seems that there is a perfectly simple way of reconciling both
> approaches. We revert bioperl back to the simpler scruffy model. The
> majority of users and developers breathe a sigh of relief. We then extend
> SeqFeatureI with something like SeqFeatureAnnotatedI. This forces types to
> be stored as OntologyTerms (and I haven't even touched on some of the
> problems here, but at least we are insulating the standard bioperl layer
> that 99% of users use from these issues). All classes implementing SFAI
> will necessarily implement SFI, and the primary_tag and tag_values methods
> will be supported (not deprecated) as simple delegations to the
> OntologyTerm objects.
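
A minimal sketch of the delegation described above, assuming nothing about the real BioPerl classes: My::Term, My::ScruffyFeature and My::AnnotatedFeature are hypothetical stand-ins for an OntologyTerm, an SFI implementation and an SFAI implementation. The point is only that primary_tag keeps working on the ontology-typed object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an ontology term (not Bio::Ontology::Term).
package My::Term;
sub new  { my ($class, %args) = @_; return bless {%args}, $class }
sub name { $_[0]{name} }

# Hypothetical lightweight feature: the type is just a string.
package My::ScruffyFeature;
sub new {
    my ($class, %args) = @_;
    return bless { primary_tag => $args{primary_tag} }, $class;
}
sub primary_tag { $_[0]{primary_tag} }

# Hypothetical ontology-typed feature: stores a term object, but still
# answers primary_tag() by delegating to the term's name.
package My::AnnotatedFeature;
our @ISA = ('My::ScruffyFeature');
sub new {
    my ($class, %args) = @_;
    return bless { type => $args{type} }, $class;
}
sub type        { $_[0]{type} }
sub primary_tag { $_[0]->type->name }

package main;

my $plain = My::ScruffyFeature->new(primary_tag => 'exon');
my $typed = My::AnnotatedFeature->new(
    type => My::Term->new(name => 'exon', identifier => 'SO:0000147'),
);

# Code written against the simple interface works with either object.
print $_->primary_tag, "\n" for ($plain, $typed);
```

Callers that only know the simple interface never see the term object, so the scruffy layer needs no ontology dependency.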
> We can then modify BFIO::gff (which is an incredibly useful piece of code)
> and get rid of all the dependencies on SO and Bio::Ontology* and instead
> allow the user of this module to plug in their own resolver/validator - so
> they can choose whether they just want fast scruffy lightweight SFI
> features, or whether they want ontology-typed SFAI features. If the
> latter, then they can choose their own resolver strategy - by a user
> supplied hash, by a copy of SO auto-downloaded from sourceforge, by a
> local chado db, by the genbank->SO mapping table, during parsing vs
> post-parsing, whatever. In fact there is already
> Bio::SeqFeature::Tools::TypeMapper, but currently this is mostly concerned
> with helping Bio::SeqFeature::Tools::Unflattener convert scruffy genbank
> to something sensible.
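
The pluggable resolver could look roughly like this. It is a sketch under assumptions, not the BFIO::gff API: make_hash_resolver and build_feature are hypothetical names, and the resolver shown is the "user supplied hash" strategy only.

```perl
use strict;
use warnings;

# Hypothetical resolver built from a user-supplied hash: maps a scruffy
# type string to an ontology identifier, or returns undef if unknown.
sub make_hash_resolver {
    my (%map) = @_;
    return sub { my ($type) = @_; return $map{$type} };
}

# Hypothetical parser hook: with no resolver the type stays a plain
# string; with one, the resolved identifier is attached as well.
sub build_feature {
    my ($type, $resolver) = @_;
    my %feature = (primary_tag => $type);
    if ($resolver) {
        my $id = $resolver->($type);
        die "cannot resolve type '$type'\n" unless defined $id;
        $feature{type_id} = $id;
    }
    return \%feature;
}

my $resolver = make_hash_resolver(gene => 'SO:0000704', exon => 'SO:0000147');

my $scruffy = build_feature('exon');              # no validation, no ontology
my $typed   = build_feature('exon', $resolver);   # ontology-typed

print "$scruffy->{primary_tag}\n";
print "$typed->{primary_tag} => $typed->{type_id}\n";
```

Because the resolver is just a code reference, a database-backed or download-backed strategy could be swapped in without touching the parser.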
> GMOD (and perhaps biosql) would use SFAI, everyone else would use the
> simpler SFI. Someone can even get a stable 1.6 release out before all the
> SFAI details such as how the resolver would work are finalised. I'd really
> like to see 1.6 include a simpler BFIO::gff that can optionally produce
> features that aren't SeqFeature::Annotateds, but that's negotiable.
> There's vast swathes of both GMOD and BioPerl code I'm not familiar with,
> so it's possible my analysis above is flawed in some way. If it is, then
> it's up to someone from either camp to speak up! If not, then there's no
> excuse for the relevant people not to start sorting out this mess by
> commencing with the solution outlined above.
> Cheers
> Chris
>>On Thu, 2005-07-28 at 18:37 +0200, Cyril Pommier wrote:
>>>We are going to store analysis results in chado, and we are of course
>>>very interested in these future developments of GFF3/chado.
>>>So we would like to make sure that the parsers and conversion programs
>>>we are writing now will be compatible with the future GFF3.
>>>We are using Bio::SeqFeature::Generic objects that we write with
>>>Do you think that Bio::Tools::GFF will be able to handle the new 'type'
>>>column or is it better to switch to Bio::FeatureIO::gff ?
>>>Thanks in advance for any advice.
>>>Don Gilbert wrote:
>>>>Your notes in gmod_bulk_load_gff3.pl suggest it is headed in
>>>>same direction I suggest below. More about these todo points
>>>>>- address flybase's use of analysisfeature combined with feature to
>>>>>give source-type information (in GFF terms). This will need to
>>>>>be addressed in the GBrowse adaptor.
>>>>>- modify the bulk loader to allow "mixed" GFF3 files (that is,
>>>>>both analysis results and annotations). See perldoc
>>>>>for more info
>>>>Use of chado's analysisfeature table is something others who know
>>>>it better can comment on. But after working with it for a while
>>>>it makes sense to me to use in this way:
>>>>For a future GFF -> Chado loader, treat analysis features such as
>>>>gene finding results, BLAST, sim4 as 'analysisfeature type' rather
>>>>than feature CV term type (the ones that now end up with a generic
>>>>'match' cvterm). In these cases the Analysis table is populated with
>>>>as the basis of this 'analysisfeature type', such as
>>>>match:genie:dummy (or maybe exon:genie)
>>>>The program:database fits neatly in GFF source field, as
>>>>#ref source type start stop ...
>>>>chr1 blastx:na_pe.dros match 1 100 ...
>>>>chr1 sim4:DGC match 1 100 ...
>>>>These can be treated in database adaptor analogously to the CVterm
>>>>table feature types. See at end a list of current GFF feature
>>>>type:source from worm, rice, yeast, fly MODs. Fly and rice use a
>>>>syntax like above and worm gff uses BLAT_EMBL_BEST, instead of
>>>>From POD of your bulk_load_gff3.pl
>>>>>If you are loading analysis results (ie, BLAT results, gene
>>>>>predictions), you should specify the -a flag. If no arguments are
>>>>>supplied with the -a, then the loader will assume that the results
>>>>>belong to an analysis set with a name that is the concatenation of
>>>>>the source (column 2) and the method (column 3) with an underscore
>>>>>in between.
>>>>"... then the loader will assume that the results belong to an
>>>>analysis table row with a program name and database source name
>>>>taken from Source (column 2, colon separated program:sourcename),
>>>>with a SOFA feature type taken from Method (column 3). If
>>>>sourcename doesn't apply, e.g. genefinder, don't add or use 'dummy'.
>>>>Use the generic 'match' SOFA type if others don't apply."
>>>>[see also http://song.sourceforge.net/gff3-jan04.shtml#ALIGNMENTS]
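
The proposed reading of columns 2 and 3 can be sketched as follows. analysis_key is a hypothetical helper, and the genscan row is an illustrative case of a source with no database name; neither comes from the loader itself.

```perl
use strict;
use warnings;

# Split a GFF line into the pieces the proposed rule needs.
# Returns (program, sourcename, type); sourcename is undef when the
# source column has no colon (e.g. a gene finder with no database).
sub analysis_key {
    my ($gff_line) = @_;
    my (undef, $source, $type) = split /\t/, $gff_line;
    my ($program, $sourcename) = split /:/, $source, 2;
    return ($program, $sourcename, $type);
}

my @lines = (
    join("\t", 'chr1', 'blastx:na_pe.dros', 'match', 1, 100),
    join("\t", 'chr1', 'sim4:DGC',          'match', 1, 100),
    join("\t", 'chr1', 'genscan',           'mRNA',  1, 100),
);

for my $line (@lines) {
    my ($program, $sourcename, $type) = analysis_key($line);
    printf "%s\t%s\t%s\n", $program, $sourcename // '(none)', $type;
}
```

Splitting on the first colon only keeps any later colons inside the sourcename, which matters if database names ever contain one.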
>>>>Note that sourcename of database is a common attribute (all those
>>>>blasts, blats, sim4, ... are run on several different databases).
>>>>For that underscore between method and source, where does that go into
>>>>database? It is used as parts of program or database sourcename names,
>>>>so it may be problematic to add one if not needed.
>>>>Oh, I see now from bulk_load_gff3.PLS, you are creating a 'Name' entry
>>>>for analysis table. This probably is less useful than using Program
>>>>and Sourcename fields as flybase does, which comes from the common
>>>>usage where people run various programs, with various database sources
>>>>and want to plop the results into a database easily. These go into those
>>>>two fields directly, no need to create or parse a Name entry
>>>>(which can be and is null in flybase data).
>>>>>my $search_analysis
>>>>>= $db->prepare("SELECT analysis_id FROM analysis WHERE name=?");
>>>>I think it would be better as
>>>>my $search_analysis
>>>>= $db->prepare("SELECT analysis_id FROM analysis WHERE program=? and
>>>>>Otherwise, the argument provided with -a will be taken
>>>>>as the name of the analysis set. Either way, the analysis set must
>>>>>already be in the analysis table. The easiest way to do this is to
>>>>>insert it directly in the psql shell:
>>>>>INSERT INTO analysis (name, program, programversion)
>>>>>VALUES ('genscan 2005-2-28','genscan','5.4');
>>>>My choice would be to populate the analysis table from GFF data, rather
>>>>than expect preparation by the user (or as another option).
>>>>INSERT INTO analysis (program, sourcename)
>>>>VALUES ('tblastx','na_baylorf1_scfchunk.dpse');
>>>>INSERT INTO analysis (program, sourcename)
>>>>VALUES ('sim4','na_gb.dmel');
>>>>INSERT INTO analysis (program, sourcename, programversion)
>>>>VALUES ('genie_masked','dummy', '1.0');
>>>>>There are other columns in the analysis table that are optional; see
>>>>>the schema documentation and '\d analysis' in psql for more
>>>>>A planned addition to the functionality of handling analysis results
>>>>>is to allow "mixed" GFF files, where some lines are analysis results
>>>>>and some are not.
>>>>This is the case for drosophila GFF now (see others also below). If
>>>>you make the default assumption that if ($method =~ /.*match/) and
>>>>($source =~ m/([^:]+):(.+)/), you should get all/most of
>>>>analysisfeature types, and probably not anything else.
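
A sketch of that default assumption applied to a mixed file. analysis_parts is a hypothetical name, the source pattern is anchored here for strictness (an assumption beyond the regex quoted above), and the FlyBase gene row is made up to stand for an ordinary annotation.

```perl
use strict;
use warnings;

# Heuristic from the paragraph above: a row is an analysis result when
# its type (method) contains "match" and its source looks like
# program:sourcename. Returns (program, sourcename) or an empty list.
sub analysis_parts {
    my ($source, $method) = @_;
    return unless $method =~ /match/;
    return unless $source =~ m/^([^:]+):(.+)$/;
    return ($1, $2);
}

my @rows = (
    [ 'sim4:na_gb.dmel',     'match_part' ],
    [ 'blastx:aa_SPTR.worm', 'match'      ],
    [ 'FlyBase',             'gene'       ],   # hypothetical annotation row
    [ 'genscan',             'match'      ],   # no colon, so not matched
);

for my $row (@rows) {
    my ($source, $method) = @$row;
    if (my ($program, $sourcename) = analysis_parts($source, $method)) {
        print "analysis:   $method program=$program sourcename=$sourcename\n";
    }
    else {
        print "annotation: $method source=$source\n";
    }
}
```

The last row shows the limit of the heuristic: a program run without a database name falls through unless the source is written with a placeholder sourcename.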
>>>>>Additionally, one will be able to supply lists of
>>>>>types (optionally with sources) and their associated entry in the
>>>>>analysis table. The format will probably be tag value pairs:
>>>>>--analysis match:Rice_est=rice_est_blast, \
>>>>>match:Maize_cDNA=maize_cdna_blast, \
>>>>My suggestion for this (as per GFF source,type columns) would be
>>>>--analysis match:program:sourcename ...
>>>>--analysis match:blast:Rice_est,match:blast:Maize_cDNA,\
>>>>mRNA:genscan:dummy, exon:genscan:dummy
>>>>I guess the 'dummy' data sourcename need not be added; flybase uses it
>>>>to keep that field not-null, but it isn't required by the schema.
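
Parsing the suggested option value might look like this. It is a sketch only: parse_analysis_option is a hypothetical helper, not part of the bulk loader, and the last entry omits the sourcename to show the optional case.

```perl
use strict;
use warnings;

# Parse a comma-separated list of type:program[:sourcename] entries,
# as in the suggested --analysis syntax. Whitespace after commas is
# tolerated; a missing sourcename is left undef.
sub parse_analysis_option {
    my ($value) = @_;
    my @entries;
    for my $item (split /\s*,\s*/, $value) {
        next unless length $item;
        my ($type, $program, $sourcename) = split /:/, $item, 3;
        die "bad --analysis entry '$item'\n" unless defined $program;
        push @entries,
            { type => $type, program => $program, sourcename => $sourcename };
    }
    return @entries;
}

my $opt = 'match:blast:Rice_est,match:blast:Maize_cDNA,'
        . 'mRNA:genscan:dummy, exon:genscan';

for my $e (parse_analysis_option($opt)) {
    printf "%-6s %-8s %s\n",
        $e->{type}, $e->{program}, $e->{sourcename} // '(none)';
}
```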
>>>>Here are some snippets from the ChadoFC adaptor I modified
>>>>from yours (will get into cvs.sf.net 'real soon'), showing that
>>>>it isn't much work to add this as an analog to how cvterm types
>>>>are used.
>>>>-- Don
>>>>## Bio::DB::Das::ChadoFC.pm, part of new() - load analysis types
>>>>## treat similar to CV table types
>>>>sub getAnalysisFeatureHash
>>>>my $self= shift;
>>>>my $dbh= $self->dbh();
>>>>my $sth = $dbh->prepare("select analysis_id,program,sourcename from
>>>>or warn "unable to prepare select cvterms";
>>>>$sth->execute or $self->throw("unable to select cvterms");
>>>>my (%term2name, %name2term);
>>>>while (my $hashref = $sth->fetchrow_hashref) {
>>>>## this is dgg syntax of analysis feature names for GFF
>>>>## all have generic 'match' method and program:source as 'source'
>>>>## a problem, want other main types: EST_match:xxx, mRNA:genie .. etc.
>>>>my $anfeat= "match:".$hashref->{program}.":".$hashref->{sourcename};
>>>>$term2name{ $hashref->{analysis_id} } = $anfeat;
>>>>$name2term{ $anfeat } = $hashref->{analysis_id};
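
The lookup-building loop above, sketched against in-memory rows instead of a DBI statement handle so it runs on its own. analysis_feature_hashes is a hypothetical name and the analysis_id values are made up.

```perl
use strict;
use warnings;

# Build id->name and name->id lookups for analysis feature types.
# $rows is an array of hashrefs shaped like fetchrow_hashref results.
sub analysis_feature_hashes {
    my ($rows) = @_;
    my (%term2name, %name2term);
    for my $row (@$rows) {
        my $anfeat = join ':', 'match', $row->{program}, $row->{sourcename};
        $term2name{ $row->{analysis_id} } = $anfeat;
        $name2term{$anfeat} = $row->{analysis_id};
    }
    return (\%term2name, \%name2term);
}

# Made-up rows for illustration; the ids are arbitrary.
my @rows = (
    { analysis_id => 1, program => 'tblastx', sourcename => 'na_baylorf1_scfchunk.dpse' },
    { analysis_id => 2, program => 'sim4',    sourcename => 'na_gb.dmel' },
);

my ($term2name, $name2term) = analysis_feature_hashes(\@rows);
print "$_ => $term2name->{$_}\n" for sort keys %$term2name;
print "match:sim4:na_gb.dmel => $name2term->{'match:sim4:na_gb.dmel'}\n";
```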
>>>>## Das::ChadoFC::Segment snippets
>>>>sub features {
>>>>my $sql_range = '';
>>>>my ($interbase_start,$rend,$srcfeature_id,$sql_types);
>>>>unless ($feature_id) {
>>>>$sql_range = $self->sql_range($rangetype);
>>>>$sql_types = $self->sql_types($types, -1); # dgg
>>>>$srcfeature_id = $self->{srcfeature_id};
>>>>elsif($self->{has_anatype}) {
>>>>$from_part .= "left join analysisfeature af using (feature_id) ";
>>>>sub sql_types
>>>>$valid_type = $factory->name2term($temp_type);
>>>>$is_anatype= 0;
>>>>unless ($valid_type) {
>>>>$valid_type = $factory->an_name2term($temp_type);
>>>>$self->{has_anatype}= $is_anatype= 1 if ($valid_type);
>>>>## leave out extra invalid types
>>>>if (!$valid_type) {
>>>>### skip
>>>>} elsif ($temp_dbxref) {
>>>>$sql_types .= $orsql."(f.type_id = $valid_type and fd.dbxref_id =
>>>>} elsif($is_anatype) {
>>>>$sql_types .= $orsql."(af.analysis_id = $valid_type)"; #<<<
>>>>} else {
>>>>$sql_types .= $orsql."(f.type_id = $valid_type)";
>>>>Lists of GFF feature type:source from some current MOD data
>>>>where * are probably analysisfeature types (program:database)
>>>>rice gff type:source
>>>>EST_match:Barley (? might be EST_match:someprogram:Barley)
>>>>* exon:FgenesH:Monocot
>>>>* mRNA:FgenesH:Monocot
>>>>worm gff type:source
>>>>* CDS:Genefinder
>>>>* CDS:twinscan
>>>>* EST_match:BLAT_EST_BEST (~ EST_match:BLAT:EST_BEST)
>>>>* EST_match:BLAT_EST_OTHER
>>>>* cDNA_match:BLAT_mRNA_BEST (~ cDNA_match:BLAT:mRNA_BEST )
>>>>* cDNA_match:BLAT_mRNA_OTHER
>>>>complex_substitution :Allele
>>>>* exon:Genefinder
>>>>* exon:tRNAscan-SE-1.23
>>>>* exon:twinscan
>>>>* expressed_sequence_match:BLAT_OST_BEST (~
>>>>expressed_sequence_match:BLAT:OST_BEST )
>>>>* expressed_sequence_match:BLAT_OST_OTHER
>>>>* mRNA:Genefinder
>>>>* mRNA:twinscan
>>>>* nucleotide_match:BLAT_EMBL_BEST (~ nucleotide_match:BLAT:EMBL_BEST )
>>>>* nucleotide_match:BLAT_EMBL_OTHER
>>>>* nucleotide_match:BLAT_TC1_BEST
>>>>* nucleotide_match:BLAT_TC1_OTHER
>>>>* nucleotide_match:BLAT_ncRNA_BEST
>>>>* nucleotide_match:BLAT_ncRNA_OTHER
>>>>* nucleotide_match:TEC_RED
>>>>* nucleotide_match:waba_coding
>>>>* nucleotide_match:waba_strong
>>>>* nucleotide_match:waba_weak
>>>>* protein_match:wublastx
>>>>* repeat_region:RepeatMasker
>>>>* tRNA:tRNAscan-SE-1.23
>>>>* translated_nucleotide_match:BLAT_NEMATODE (~
>>>>translated_nucleotide_match:BLAT:NEMATODE )
>>>>fly gff type:source
>>>>* match:RNAiHDP
>>>>* match:assembly:path
>>>>* match:blastx:aa_SPTR.dmel
>>>>* match:blastx:aa_SPTR.insect
>>>>* match:blastx:aa_SPTR.othinv
>>>>* match:blastx:aa_SPTR.othvert
>>>>* match:blastx:aa_SPTR.plant
>>>>* match:blastx:aa_SPTR.primate
>>>>* match:blastx:aa_SPTR.rodent
>>>>* match:blastx:aa_SPTR.worm
>>>>* match:blastx:aa_SPTR.yeast
>>>>* match:genscan
>>>>* match:repeatmasker
>>>>* match:sim4:na_ARGs.dros
>>>>* match:sim4:na_ARGsCDS.dros
>>>>* match:sim4:na_DGC_dros
>>>>* match:sim4:na_dbEST.diff.dmel
>>>>* match:sim4:na_dbEST.same.dmel
>>>>* match:sim4:na_gadfly_dmel_r2
>>>>* match:sim4:na_gb.dmel
>>>>* match:sim4:na_gb.tpa.dmel
>>>>* match:sim4:na_smallRNA.dros
>>>>* match:sim4:na_transcript_dmel_r31
>>>>* match:sim4:na_transcript_dmel_r32
>>>>* match:tRNAscan-SE:.
>>>>* match:tblastx:na_agambiae
>>>>* match:tblastx:na_dbEST.insect
>>>>* match:tblastx:na_dpse
>>>>* match_part:RNAiHDP
>>>>* match_part:assembly:path
>>>>* match_part:blastx:aa_SPTR.dmel
>>>>* match_part:blastx:aa_SPTR.insect
>>>>* match_part:blastx:aa_SPTR.othinv
>>>>* match_part:blastx:aa_SPTR.othvert
>>>>* match_part:blastx:aa_SPTR.plant
>>>>* match_part:blastx:aa_SPTR.primate
>>>>* match_part:blastx:aa_SPTR.rodent
>>>>* match_part:blastx:aa_SPTR.worm
>>>>* match_part:blastx:aa_SPTR.yeast
>>>>* match_part:genscan
>>>>* match_part:repeatmasker
>>>>* match_part:sim4:na_ARGs.dros
>>>>* match_part:sim4:na_ARGsCDS.dros
>>>>* match_part:sim4:na_DGC_dros
>>>>* match_part:sim4:na_dbEST.diff.dmel
>>>>* match_part:sim4:na_dbEST.same.dmel
>>>>* match_part:sim4:na_gadfly_dmel_r2
>>>>* match_part:sim4:na_gb.dmel
>>>>* match_part:sim4:na_gb.tpa.dmel
>>>>* match_part:sim4:na_smallRNA.dros
>>>>* match_part:sim4:na_transcript_dmel_r31
>>>>* match_part:sim4:na_transcript_dmel_r32
>>>>* match_part:tRNAscan-SE:.
>>>>* match_part:tblastx:na_agambiae
>>>>* match_part:tblastx:na_dbEST.insect
>>>>* match_part:tblastx:na_dpse
>>>>transposable_element_insertion_site:. 3116
>>>>yeast gff type:source count
>>>>-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
>>>>-- gilbertd at indiana.edu -- http://marmot.bio.indiana.edu/
>>Scott Cain, Ph. D.                                         cain at cshl.edu
>>GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
>>Cold Spring Harbor Laboratory