[Bioperl-l] GFF file and load_gff.pl

Richard Harrison richard.harrison at edinburgh.ac.uk
Wed Jan 28 15:59:02 EST 2009


Thanks Scott,
You're being a great help. Unfortunately, I am still struggling. There
was no "##gff-version 3" line at the top of the GFF file. I added one, but
it made little difference. I had a look at Bio::DB::SeqFeature::Store, but
as far as I can make out it handles the GFF file worse than Bio::DB::GFF
does. I tried another GFF3 file from a different source and it made no
difference at all.
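
A minimal shell sketch of the header check Scott suggested, for anyone repeating it. It assumes the pragma has to be the very first line of the file; the helper names has_gff3_header and with_gff3_header and the one-line sample record are made up for illustration, so substitute the real genome.gff.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the first line of the file is the
# GFF3 version pragma ("##gff-version 3").
has_gff3_header() {
    head -n 1 "$1" | grep -q '^##gff-version 3'
}

# Hypothetical helper: writes the file to standard output, prepending the
# pragma only when it is missing. The input file is left untouched.
with_gff3_header() {
    if ! has_gff3_header "$1"; then
        printf '##gff-version 3\n'
    fi
    cat "$1"
}

# Made-up one-line sample standing in for genome.gff (no pragma on top)
sample=$(mktemp)
printf 'chr01\tSGD\tgene\t33449\t34702\t.\t+\t.\tID=YAL061W;Name=YAL061W\n' > "$sample"

if has_gff3_header "$sample"; then
    echo "header present"
else
    echo "header missing"
fi
```

Redirecting with_gff3_header genome.gff into a new file gives a copy that can be passed to the loader without editing the original.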


These are the commands that I'm using to populate two different  
databases, so I can work out which method is best:

./bp_seqfeature_load.pl -d cere_seqfeat -s Bio::DB::SeqFeature -a DBI::mysql -user root -pass pwd -v --verbose -c genome.gff

./bp_load_gff.pl -d cere_gffdb -user root -pass pwd --adaptor=dbi::mysql --create --gff3_munge genome.gff

Both databases seem to load the data OK, and neither loader gives any error messages.


Then in BioPerl:

use Bio::DB::SeqFeature::Store;

# Open the feature database
my $db = Bio::DB::SeqFeature::Store->new(
    -adaptor => 'DBI::mysql',
    -dsn     => 'dbi:mysql:cere_seqfeat',
    -user    => 'root',
    -pass    => 'pwd',
    -create  => 1,
);

# Print every feature type stored in the database
my @types = $db->types;
foreach (@types) {
    print "$_\n";
}


I get no output.

Alternatively:
use Bio::DB::GFF;

# Open the feature database
my $db = Bio::DB::GFF->new(
    -adaptor => 'dbi::mysql',
    -dsn     => 'dbi:mysql:cere_gffdb',
    -user    => 'root',
    -pass    => 'pwd',
);

# Print every feature type stored in the database
my @types = $db->types;
foreach (@types) {
    print "$_\n";
}

I get:

telomere:SGD
intron:SGD
insertion:SGD
chromosome:SGD
region:landmark
ncRNA:SGD
transposable_element_gene:SGD
region:SGD
ARS:SGD
snRNA:SGD
snoRNA:SGD
nc_primary_transcript:SGD
rRNA:SGD
transposable_element:SGD
gene:SGD
CDS:SGD
repeat_family:SGD
transcript_region:SGD
pseudogene:SGD
nucleotide_match:SGD
tRNA:SGD
binding_site:SGD
repeat_region:SGD
centromere:SGD
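
As a cross-check, the same type:source pairs can be derived straight from the GFF file and compared with the list above. This is only a sketch, assuming the usual nine tab-separated columns with the source in column 2 and the type in column 3; the helper name gff_types and the sample records (including the gene ID YAL999W) are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list the distinct "type:source" pairs in a GFF file.
# Pragma/comment lines (starting with '#') and any line with fewer than
# nine tab-separated fields (blank lines, trailing FASTA) are skipped.
gff_types() {
    awk -F '\t' '!/^#/ && NF >= 9 { print $3 ":" $2 }' "$1" | sort -u
}

# Made-up sample standing in for genome.gff
sample=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'chr01\tSGD\tgene\t33449\t34702\t.\t+\t.\tID=YAL061W\n'
    printf 'chr01\tSGD\tCDS\t33449\t34702\t.\t+\t0\tParent=YAL061W\n'
    printf 'chr01\tSGD\tCDS\t35000\t35100\t.\t+\t0\tParent=YAL999W\n'
} > "$sample"

gff_types "$sample"
```

If a pair shows up here but not in the output of $db->types (or the other way round), that points at the load step rather than at the query.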


Any ideas what is going on here? I'm struggling to comprehend where  
I'm going wrong.
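
As a stopgap for the lookup described in my earlier message (quoted below), the CDS coordinates for one gene can be read directly from the GFF file without any database. A rough sketch under several assumptions: records are tab-separated and not commented out with a leading '#', each CDS names its gene in a single-valued Parent= attribute, and the helper name cds_coords and the YAL061W-A sample row are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "start<TAB>end<TAB>strand" for every CDS row
# whose ninth column carries exactly Parent=<gene id>.
# Comma-separated multi-parent values are not handled in this sketch.
cds_coords() {
    awk -F '\t' -v id="$2" '
        !/^#/ && $3 == "CDS" {
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] == "Parent=" id) {
                    print $4 "\t" $5 "\t" $7
                    break
                }
            }
        }
    ' "$1"
}

# Made-up sample standing in for genome.gff
sample=$(mktemp)
{
    printf '##gff-version 3\n'
    printf 'chr01\tSGD\tgene\t33449\t34702\t.\t+\t.\tID=YAL061W;Name=YAL061W\n'
    printf 'chr01\tSGD\tCDS\t33449\t34702\t.\t+\t0\tParent=YAL061W;Name=YAL061W\n'
    printf 'chr01\tSGD\tCDS\t40000\t40500\t.\t-\t0\tParent=YAL061W-A;Name=YAL061W-A\n'
} > "$sample"

cds_coords "$sample" YAL061W
```

The attribute is compared as a whole string, so asking for YAL061W does not also return the rows for YAL061W-A.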

Best wishes,
Richard



On 28 Jan 2009, at 18:51, Scott Cain wrote:

> Hi Richard,
>
> A few items:
>
> * It looks as though the loader didn't know that it was loading GFF3
> (you can tell it's GFF3 by the = between the tags and values in the
> ninth column; in GFF2, there would be a space).  As a result, the
> classes weren't created properly.  Check that there is a line at the
> top of your GFF file that looks like "##gff-version 3"
>
> * You may not want to use a Bio::DB::GFF database anyway.  Since you
> are just getting started and have GFF3, you might be better off using
> a Bio::DB::SeqFeature::Store database, which was designed to work with
> GFF3 data (Bio::DB::GFF works better with GFF2).  The loader for a
> SeqFeature::Store database is called bp_seqfeature_load.pl.
>
> Scott
>
>
> On Wed, Jan 28, 2009 at 12:36 PM, Richard Harrison
> <richard.harrison at ed.ac.uk> wrote:
>> Thank you Chris, Scott and Adam,
>> You are right, I was confused. I have now managed to create a Bio::DB::GFF
>> database with my genome annotation loaded into it. One further question:
>> I am having trouble retrieving the desired info from the database. Shown
>> below is a typical entry in the GFF file for a gene:
>>
>>
>> #chr01  SGD     gene    33449   34702   .       +       .       ID=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>
>> #chr01  SGD     CDS     33449   34702   .       +       0       Parent=YAL061W;Name=YAL061W;gene=BDH2;Alias=BDH2;Ontology_term=GO:0008150,GO:0005634,GO:0005737,GO:0016616,GO:0008270,GO:0016491,GO:0046872;Note=Putative%20medium-chain%20alcohol%20dehydrogenase%20with%20similarity%20to%20BDH1%3B%20transcription%20induced%20by%20constitutively%20active%20PDR1%20and%20PDR3%3B%20BDH2%20is%20an%20essential%20gene;dbxref=SGD:S000000057;orf_classification=Uncharacterized
>>
>>
>> I would like to search the database for YAL061W and retrieve the CDS
>> coordinates, details about introns, etc. I don't need the sequence, as I
>> have separate multiple genome alignments.
>>
>>
>> At present, all I can work out how to do is get all the feature types and
>> classes in the database (see code below).
>>
>>
>> my $db      = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
>>                                  -dsn     => 'dbi:mysql:biosql',
>>                                  -user    => 'root',
>>                                  -pass    => '*******'
>>                                );
>>       #get types
>>       my @types = $db->types;
>>
>> EG:
>> #telomere:SGDintron:SGDinsertion:SGDchromosome:SGDregion:landmarkncRNA:SGDtransposable_element_gene:SGDregion:SGDARS:SGDsnRNA:SGDsnoRNA:SGDnc_primary_transcript:SGDrRNA 
>>  etc...
>>
>>
>>
>>       #get classes
>>       my @classes = $db->classes;
>>
>> ID=YKR067W
>> ID=YKR068C
>> ID=YKR069W
>> ID=YKR070W
>> ID=YKR071C
>> ID=YKR072C
>> ID=YKR073C
>> ID=YKR074W
>>
>> etc...
>>
>> Could someone point me towards some useful examples for this? I've tried
>> reading the documentation, but it doesn't seem to illustrate what I want
>> to do.
>>
>> Best wishes and thanks for the help so far,
>>
>> Richard
>>
>>
>>
>>
>>
>>
>>
>> On 28 Jan 2009, at 16:15, Scott Cain wrote:
>>
>>> Hi Richard,
>>>
>>> You're mixing up two database schemas.  Do you want to use a BioSQL
>>> database (bioperl-db) or a Bio::DB::GFF database?  I'm guessing that
>>> you want the latter, so I'll answer that question (as it's the easier
>>> one anyway).  You need to add the "-c" flag (for --create) to the
>>> load_gff.pl command to create the Bio::DB::GFF schema.
>>>
>>> If you really wanted a BioSQL database, you'll have to wait for help
>>> from someone else more knowledgeable about it.
>>>
>>> Scott
>>>
>>>
>>>
>>>
>>> On Wed, Jan 28, 2009 at 10:22 AM, Richard Harrison
>>> <richard.harrison at ed.ac.uk> wrote:
>>>>
>>>> Dear all,
>>>>
>>>> I am running BioPerl 1.6 on OS X Leopard on a MacBook Pro.
>>>>
>>>> I have installed mysql-5.1.30-osx10.5-x86, DBD-mysql-4.010, the
>>>> biosql-schema for MySQL and bioperl-db.  As per the instructions, I have
>>>> a database called biosql, which I set up with the SQL file
>>>> biosqldb-mysql.sql.
>>>>
>>>> After much fannying, the install seems fine, although I can't be sure
>>>> (I have never used MySQL before).
>>>>
>>>> I am having problems with the script load_gff.pl.
>>>>
>>>> I want to load a database with the data from a genome.gff file (for
>>>> Saccharomyces cerevisiae). I don't want to add sequence to it, as all I
>>>> need is the annotation.
>>>>
>>>> I have tried the following command(s):
>>>>
>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword genome.gff
>>>> ./bp_load_gff.pl -d biosql -user root -pass mypassword --adaptor=dbi::mysql genome.gff
>>>>
>>>> With both I get the following error:
>>>>
>>>> No ftype id for CDS:SGD Table 'biosql.ftype' doesn't exist Record skipped.
>>>> (then another few thousand of these)
>>>> then:
>>>>
>>>> genome.gff: 16379 records loaded
>>>>
>>>>
>>>> Any ideas where I'm going wrong?
>>>>
>>>> Thanks,
>>>>
>>>> Richard
>>>>
>>>> ____________________________
>>>> Dr Richard Harrison
>>>> 127 Ashworth Labs
>>>> Institutes of Evolutionary Biology
>>>> King's Buildings
>>>> West Mains Road
>>>> Edinburgh EH9 3JT
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>>
>>>
>>> --
>>> ------------------------------------------------------------------------
>>> Scott Cain, Ph. D.                                   scott at scottcain dot net
>>> GMOD Coordinator (http://gmod.org/)                     216-392-3087
>>> Ontario Institute for Cancer Research
>>>
>>
>>
>>
>>
>
>
>
>




