Talk:GFF code audit


BLAT and PSL to GFF3

See Don Gilbert's message on Gmod-Gbrowse (it isn't available in the archive right now; the SF mailing list archival system is really not very good).

From Don:

From: Don Gilbert <gilbertd@>
Date: October 21, 2007 12:50:41 AM EDT
To: gmod-gbrowse@lists.sourceforge.net
Subject: Re: [Gmod-gbrowse] GFF and PSL

I also was looking for a blat psl to gff conversion
program, but failed to find one that handled the exon
structure and distinct matches to same query that blat
produces (i.e. tandem genes).

Find now this fairly simple program (1 page):
http://iubio.bio.indiana.edu/gmod/tandy/blat2gff.pl

This brings up a topic of genome analyses using Bioperl's
SearchIO conversions (whether blat, blast or other).  That
is the details of distinct gene matches are lost in the
conversions.  Duplicate genes are very common, and tandem
duplicates tend to confuse the heck out of many genome
analysis programs.

Blat computes and writes these explicitly, each row is a
distinct match (say of EST x genome), with the exon
structure in the array of Q,T-starts on a match row. 
Bioperl unfortunately smooshes these together into one match
when they are the same query EST.

One can parse the same distinct match detail from BLAST
(e.g. tabular output 8,9), by looking at query and source
HSP locations, but it is more programming effort.

-- Don

# example tandem match pair
  grep EB634440 dgri-gnoest.blat  

  670   ..    +  EB634440  753  0  683     scaffold_14830  6267026 2489511 2490194 2   398,272,     0,411,  2489511,2489922,
  683   ..    -  EB634440  753  0  683     scaffold_14830  6267026 2484805 2485488 1   683,    70,  2484805,
  646   ..    -  EB634440  753  0  683     scaffold_14830  6267026 2480511 2481194 3   272,71,305,  70,355,448, 2480511,2480796,2480889,

  grep EB634440 dgri-gnoest.blat | blat2gff.pl -match EST_match

  ##gff-version 3
  scaffold_14830  BLAT    EST_match    2489512 2490194 670  +  .  ID=EB634440_mid1;Target=EB634440 1 683
  scaffold_14830  BLAT    match_part   2489512 2489909 670  +  .  Parent=EB634440_mid1;Target=EB634440 1 398
  scaffold_14830  BLAT    match_part   2489923 2490194 670  +  .  Parent=EB634440_mid1;Target=EB634440 412 683
  scaffold_14830  BLAT    EST_match    2484806 2485488 683  -  .  ID=EB634440_mid2;Target=EB634440 1 683
  scaffold_14830  BLAT    match_part   2484806 2485488 683  -  .  Parent=EB634440_mid2;Target=EB634440 71 753
  scaffold_14830  BLAT    EST_match    2480512 2481194 646  -  .  ID=EB634440_mid3;Target=EB634440 1 683
  scaffold_14830  BLAT    match_part   2480512 2480783 646  -  .  Parent=EB634440_mid3;Target=EB634440 71 342
  scaffold_14830  BLAT    match_part   2480797 2480867 646  -  .  Parent=EB634440_mid3;Target=EB634440 356 426
  scaffold_14830  BLAT    match_part   2480890 2481194 646  -  .  Parent=EB634440_mid3;Target=EB634440 449 753

View these EST matches at http://insects.eugenes.org/species/cgi-bin/gbrowse/dgri/?name=scaffold_14830:2479354-2492663;label=hsgDM-EST-NCBI_GNO

# equivalent Bioperl  bp_search2gff3.pl
# turns distinct 3 gene matches into one, including reversed one, and ignores exon detail ...

  grep EB634440 dgri-gnoest.blat | lib/Bio/script/bp_search2gff3.pl  -f psl -m -ver 3 -t hit -i -

  ##gff-version 3
  scaffold_14830  BLAT    match_part   2489512 2490194 .  +   0   Parent=EB634440;Target=Sequence:EB634440 1 683
  scaffold_14830  BLAT    match_part   2484806 2485488 .  -   0   Parent=EB634440;Target=Sequence:EB634440 1 683
  scaffold_14830  BLAT    match_part   2480512 2481194 .  -   0   Parent=EB634440;Target=Sequence:EB634440 1 683
  scaffold_14830  BLAT    match        2480512 2490194 .  -   .   ID=EB634440

-- d.gilbert--bioinformatics--indiana-u--bloomington-in-47405
-- gilbertd_AT_indiana.edu--http://marmot.bio.indiana.edu/
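
The row-per-match conversion Don describes can be sketched as follows. This is a minimal Python sketch, not his blat2gff.pl (which is Perl and not reproduced here). It emits one parent feature per PSL row and one match_part per alignment block. It assumes full 21-column, tab-separated nucleotide PSL with a single-character strand. The function name `psl_to_gff3` and the `_midN` ID suffix (imitating the IDs in the output above) are illustrative only. For minus-strand rows it converts qStarts from reverse-complement to forward query coordinates, as the PSL format describes them, so match_part Target values on "-" rows may differ from those in the quoted output.

```python
from collections import defaultdict


def _int_list(field):
    """Parse a PSL comma-terminated list such as '398,272,' into [398, 272]."""
    return [int(x) for x in field.rstrip(",").split(",")]


def psl_to_gff3(psl_lines, source="BLAT", match_type="EST_match"):
    """Yield GFF3 lines: one parent per PSL row, one match_part per block.

    Assumes 21-column nucleotide PSL; header lines and short rows are skipped.
    """
    seen = defaultdict(int)  # per-query counter, so tandem hits get distinct IDs
    for line in psl_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 21 or not cols[0].isdigit():
            continue  # psLayout header or malformed row
        matches, strand = cols[0], cols[8]
        q_name, q_size = cols[9], int(cols[10])
        q_start, q_end = int(cols[11]), int(cols[12])
        t_name, t_start, t_end = cols[13], int(cols[15]), int(cols[16])
        sizes = _int_list(cols[18])
        q_starts = _int_list(cols[19])
        t_starts = _int_list(cols[20])

        seen[q_name] += 1
        match_id = f"{q_name}_mid{seen[q_name]}"
        # PSL is 0-based half-open; GFF3 is 1-based inclusive, so only starts shift.
        yield "\t".join([t_name, source, match_type, str(t_start + 1), str(t_end),
                         matches, strand, ".",
                         f"ID={match_id};Target={q_name} {q_start + 1} {q_end}"])
        for size, qs, ts in zip(sizes, q_starts, t_starts):
            if strand == "-":
                # qStarts on '-' rows count from the end of the query.
                qs = q_size - (qs + size)
            yield "\t".join([t_name, source, "match_part", str(ts + 1), str(ts + size),
                             matches, strand, ".",
                             f"Parent={match_id};Target={q_name} {qs + 1} {qs + size}"])
```

Calling `print("##gff-version 3")` and then printing each line from `psl_to_gff3(open("dgri-gnoest.blat"))` would give output in the same shape as the blat2gff.pl example. Because each row gets its own ID, tandem matches to the same query stay separate.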

GFF-related Bugs

  • bp_search2gff.pl - two bugs (enhancement requests, really) have been reported on Bugzilla which we should take note of:
  • bp_genbank2gff3.pl - doesn't calculate phase:
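
For reference on the phase item, here is a minimal sketch of how GFF3 phase can be derived for the segments of a CDS. It is written in Python for brevity and is not code from bp_genbank2gff3.pl; the function name `cds_phases` and its arguments are hypothetical. It assumes 1-based inclusive segment coordinates and that the first segment in translation order starts at `first_phase` (0 for a complete CDS, or the GenBank `/codon_start` value minus 1).

```python
def cds_phases(segments, strand, first_phase=0):
    """Return [(start, end, phase), ...] in translation order.

    segments: iterable of (start, end) pairs, 1-based inclusive.
    strand:   '+' or '-'; on '-' translation runs from the highest coordinate down.
    Phase is the number of bases to skip at the 5' end of a segment
    to reach the first base of the next complete codon.
    """
    ordered = sorted(segments, reverse=(strand == "-"))
    result = []
    phase = first_phase
    for start, end in ordered:
        result.append((start, end, phase))
        length = end - start + 1
        # Bases left over after whole codons decide how many the next segment must skip.
        phase = (3 - (length - phase) % 3) % 3
    return result
```

For example, three 100-base segments on the plus strand get phases 0, 2 and 1 in that order, since each 100-base segment leaves one more base over than the last.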