From bix at sendu.me.uk Tue Aug 1 02:49:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 01 Aug 2006 07:49:54 +0100
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com>
References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com>
Message-ID: <44CEF992.8090006@sendu.me.uk>

Andreo Beck wrote:
> Hi,
>
> Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query give outputs > 1 ?
> I get some > 1 values.

That might depend on what $hit_object is (Bio::Search::Hit::GenericHit ?),
but I'd say it's probably a bug if you get over 1. Can you give an
example where you got over 1? Provide the code and the input data.

> Does using the parentheses (e.g. $hit_object->frac_aligned_hit()) make any difference?

No, empty parentheses are only needed to make it clear to the perl
interpreter you are calling a subroutine in ambiguous cases;
$obj->method isn't ambiguous.

From y.itan at ucl.ac.uk Tue Aug 1 09:36:20 2006
From: y.itan at ucl.ac.uk (Yuval Itan)
Date: Tue, 1 Aug 2006 14:36:20 +0100
Subject: [Bioperl-l] Getting sequences by base pair locations
In-Reply-To: <44CA118C.7010401@mail.nih.gov>
References: <05a8f85bdc36dc7e370d0935b6394f9d@ucl.ac.uk> <44CA118C.7010401@mail.nih.gov>
Message-ID: <322788807dd336dc1b51b9f89b6e93cf@ucl.ac.uk>

Thank you all for all the helpful answers!
Malcolm- I've used the UCSC server to do the BLAT search (because I
couldn't run it locally due to memory problems)- so I could not get the
chimp sequences in a convenient way. I have the results also in a
normal Blat output including all usual fields: chromosome number etc.
Wade- thanks a lot for your offer, that would be great. The chimp
genome is just one large fasta format file.
Cheers,
Yuval

On 28 Jul 2006, at 14:30, Sean Davis wrote:

> Yuval Itan wrote:
>> Hello all,
>> I was BLATing a few hundred human genes against the chimp genome, and
>> kept the best chimp hits for every human gene.
>> I have the base pair start and end location for every chimp hit, and
>> I need to get the sequence for each of these chimp hits. Here is an
>> example for a few chimp hits bp locations:
>> Start   End
>> 142854  144504
>> 154479  155198
>> 153066  167370
>> 163146  163559
>> I have one chimp genome file (about 3GB) including all chromosomes,
>> but I could also get one file per chromosome if that would make
>> things easier. Does anyone have a script or a link for an interface
>> that can do the job?

From MEC at stowers-institute.org Tue Aug 1 11:12:08 2006
From: MEC at stowers-institute.org (Cook, Malcolm)
Date: Tue, 1 Aug 2006 10:12:08 -0500
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID:

Yuval,

Glad to help. Given that you are not running the blat suite locally,
but at UCSC, you should try this approach:

Upload/paste your blat results (in blat's native output format, psl) as
a custom track in the genome browser, named, say, myhumanhits
(i.e. just give the blat results a new first line like: `track
name="myhumanhits" description="myhumanhits from my favorite human
genes" visibility=2`),
then go to the table browser and configure it:

group = 'custom tracks'
track = 'myhumanhits'
region = genome
output format = sequence
output file = myhumanhits.fasta

Submit it.

When prompted, save myhumanhits.fasta to your computer and take it
from there.

I'm not sure how many hits this will work for, but I just did this on a
small track and it works just fine. The only problem is that the first
word in the fasta defline is always the same for all sequences. You'll
probably have to 'uniqify' these names somehow (depending of course on
your application).
Let us know & good luck. For good email support on the UCSC genome
browser, subscribe to
http://www.soe.ucsc.edu/mailman/listinfo/genome-announce

Malcolm Cook
Database Applications Manager, Bioinformatics
Stowers Institute for Medical Research
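
For readers who would rather cut the hits out of a local FASTA file than
go through the table browser, here is a minimal pure-Perl sketch of
coordinate-based extraction. It is only an illustration, not code from
this thread: the helper names read_fasta and subseq_1based and the toy
record chrDemo are made up, and it assumes the start/end pairs are
1-based and inclusive (adjust the arithmetic if your coordinates follow
another convention). It loads whole records into memory, so it fits
the one-file-per-chromosome option better than a single 3GB file.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from a file handle into a
# hash of id => sequence. Everything is held in memory.
sub read_fasta {
    my ($fh) = @_;
    my (%seq, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            # First word of the defline is used as the record id.
            $id = $1;
            $seq{$id} = '';
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;
            $seq{$id} .= $line;
        }
    }
    return \%seq;
}

# Hypothetical helper: subsequence for 1-based, inclusive coordinates
# (an assumption about how the start/end columns are defined).
sub subseq_1based {
    my ($seq, $start, $end) = @_;
    die "bad range $start..$end\n"
        if $start < 1 || $end < $start || $end > length($seq);
    return substr($seq, $start - 1, $end - $start + 1);
}

# Toy demonstration using an in-memory FASTA record.
my $fasta = ">chrDemo toy sequence\nACGTACGTAC\nGGTTAACC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory FASTA: $!";
my $seqs = read_fasta($fh);
close $fh;

print subseq_1based($seqs->{chrDemo}, 3, 6), "\n";   # prints GTAC
```

For a genome-sized file, an indexed reader such as BioPerl's
Bio::DB::Fasta (which offers a seq($id, $start, $end) call) is likely
the better fit than slurping sequences as done here.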
From andreo_beck at yahoo.com Tue Aug 1 13:51:32 2006
From: andreo_beck at yahoo.com (Andreo Beck)
Date: Tue, 1 Aug 2006 10:51:32 -0700 (PDT)
Subject: [Bioperl-l] (no subject)
In-Reply-To: <44CF6923.5040304@sendu.me.uk>
Message-ID: <20060801175132.20433.qmail@web55709.mail.re3.yahoo.com>

I attach 1 "erroneous" part (record 9) and 1 correct part (record
8)...the entire file is ~500mb...

Sendu Bala wrote:
Andreo Beck wrote:
> Thanks Sendu. I get this when I try parsing a WUBLASTP report (thats too
> huge to upload).

Unfortunately that's the most important thing for figuring out what's
going on. Could you compress it (zip or tar.gz) along with the fasta
file you used to do the blast? If its only a few MB you can email it to
me directly, or create a bug report at
http://bugzilla.bioperl.org/index.cgi and upload the files that way.

(You could also try cutting both files down to just the sequence and
hits that give results > 1, with a few < 1 for comparison - just make
sure it still works after you edit the files)

FYI, at a glance your code looked fine.

Cheers,
Sendu.

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: erroneous_report.txt
Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060801/151ebca3/attachment-0001.txt

From cjfields at uiuc.edu Tue Aug 1 18:52:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 1 Aug 2006 17:52:05 -0500
Subject: [Bioperl-l] (no subject)
In-Reply-To: <20060801175132.20433.qmail@web55709.mail.re3.yahoo.com>
References: <20060801175132.20433.qmail@web55709.mail.re3.yahoo.com>
Message-ID:

Andreo,

Okay, not cool to email an entire BLAST file to the entire Bioperl-l
list! Sendu wanted it emailed to him, not everybody!

Chris

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From Kevin.M.Brown at asu.edu Tue Aug 1 18:43:00 2006
From: Kevin.M.Brown at asu.edu (Kevin Brown)
Date: Tue, 1 Aug 2006 15:43:00 -0700
Subject: [Bioperl-l] Getting sequences by base pair locations
Message-ID: <1A4207F8295607498283FE9E93B775B401C5C74D@EX02.asurite.ad.asu.edu>

Perl Mechanize is a great way to submit web forms repeatedly. I do it
for things like MHC epitope prediction sites as well as a way to grab
things like journal articles matching certain keywords.
http://www.perl.com/pub/a/2003/01/22/mechanize.html
http://search.cpan.org/dist/WWW-Mechanize/lib/WWW/Mechanize.pm
From pengchy1981 at yahoo.com.cn Wed Aug 2 01:51:47 2006
From: pengchy1981 at yahoo.com.cn (=?gb2312?q?=D1=EE=20=C5=F4=B3=CC?=)
Date: Wed, 2 Aug 2006 13:51:47 +0800 (CST)
Subject: [Bioperl-l] Develop with perl on oracle?
Message-ID: <20060802055147.82637.qmail@web15201.mail.cnb.yahoo.com>

Hi,

I am a novice with databases. I want to build a database about
esophageal carcinoma, or something similar, using Oracle and develop it
with Perl. In the end I want to publish my database through CGI scripts
written in Perl. If anyone has had the same idea or the same
experience, could you give me some advice?

Thanks a lot!

Yang PCh from China

From bix at sendu.me.uk Wed Aug 2 07:23:01 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 02 Aug 2006 12:23:01 +0100
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com>
References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com>
Message-ID: <44D08B15.9000408@sendu.me.uk>

Andreo Beck wrote:
> Can $hit_object->frac_aligned_hit or $hit_object->frac_aligned_query
> give outputs > 1 ? I get some > 1 values.

This should now be fixed in CVS. You should be able to grab just
Bio/Search/SearchUtils.pm if you don't want to install all of
bioperl-live.
( http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/ )

It should be noted that all frac_* statistics and probably others from
hit objects have had a high chance of being wrong in the past, and
frac_identical and frac_conserved can still be very wrong*.
It would be a good idea if someone were to make these methods return
slightly more sane numbers.

[*] which is to say, more wrong than you might reasonably expect, given
the limitations with gapped or alignment-free blasts

From hlapp at gmx.net Wed Aug 2 08:32:35 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 2 Aug 2006 08:32:35 -0400
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <44D08B15.9000408@sendu.me.uk>
References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> <44D08B15.9000408@sendu.me.uk>
Message-ID: <11E90B37-0A62-4A54-90BD-E98B2CA6E683@gmx.net>

On Aug 2, 2006, at 7:23 AM, Sendu Bala wrote:

> It should be noted that all frac_* statistics and probably others from
> hit objects have had a high chance of being wrong in the past, and
> frac_identical and frac_conserved can still be very wrong*. It
> would be
> a good idea if someone were to make these methods return slightly more
> sane numbers.
>
> [*] which is to say, more wrong than you might reasonably expect,
> given
> the limitations with gapped or alignment-free blasts

Can you elaborate? Specifically, can you put in tests that show the
wrong results?
	-hilmar
--
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

From bix at sendu.me.uk Wed Aug 2 09:03:36 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 02 Aug 2006 14:03:36 +0100
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <11E90B37-0A62-4A54-90BD-E98B2CA6E683@gmx.net>
References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> <44D08B15.9000408@sendu.me.uk> <11E90B37-0A62-4A54-90BD-E98B2CA6E683@gmx.net>
Message-ID: <44D0A2A8.6040000@sendu.me.uk>

Hilmar Lapp wrote:
>
> On Aug 2, 2006, at 7:23 AM, Sendu Bala wrote:
>
>> It should be noted that all frac_* statistics and probably others from
>> hit objects have had a high chance of being wrong in the past, and
>> frac_identical and frac_conserved can still be very wrong*. It would be
>> a good idea if someone were to make these methods return slightly more
>> sane numbers.
>>
>> [*] which is to say, more wrong than you might reasonably expect, given
>> the limitations with gapped or alignment-free blasts
>
> Can you elaborate? Specifically, can you put in tests that show the
> wrong results?

I've added some new tests based on Andreo's blast result, but atm I've
left the tests commented out. See line 882 of t/SearchIO.t revision 1.94
- a sane result would be less than 1.

I think it ought to be possible to get better answers, but I got the
feeling the fix wouldn't be completely trivial so I let it go, not
having the time to spare right now.

The reason I don't just call this a bug and make a bug report is that
the documentation acknowledges that you won't always get a good answer,
so it needs to be investigated if the current answer really is the best
that can reasonably be given, or if there is some bug making the answer
worse than it needs to be (as was the case with frac_aligned_hit and
frac_aligned_query).
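
As an illustration of why a coverage-style fraction such as
frac_aligned_hit or frac_aligned_query should never exceed 1, here is a
small pure-Perl sketch. It is not the Bio::Search::SearchUtils code and
says nothing about what the actual bug was; the helper frac_covered and
the example ranges are made up, and ranges are assumed to be 1-based and
inclusive. The point it shows is only that summing HSP lengths naively
can overshoot when HSPs overlap, whereas merging the ranges first
cannot.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: fraction of a sequence of
# length $len covered by a list of [start, end] ranges (1-based,
# inclusive). Overlapping or adjacent ranges are merged first, so the
# result cannot exceed 1 as long as the ranges lie within 1..$len.
sub frac_covered {
    my ($len, @ranges) = @_;
    my ($covered, $cur_start, $cur_end) = (0);
    for my $r (sort { $a->[0] <=> $b->[0] } @ranges) {
        my ($s, $e) = @$r;
        if (defined $cur_end && $s <= $cur_end + 1) {
            # Overlapping or adjacent: extend the current merged range.
            $cur_end = $e if $e > $cur_end;
        }
        else {
            # Disjoint: close the previous merged range and start anew.
            $covered += $cur_end - $cur_start + 1 if defined $cur_end;
            ($cur_start, $cur_end) = ($s, $e);
        }
    }
    $covered += $cur_end - $cur_start + 1 if defined $cur_end;
    return $covered / $len;
}

# Two overlapping HSP ranges on a 100-residue sequence.
my @hsps = ([1, 80], [41, 100]);

# Naive approach: add up the individual lengths (80 + 60 = 140).
my $naive = 0;
$naive += $_->[1] - $_->[0] + 1 for @hsps;

printf "naive sum: %.2f\n", $naive / 100;               # 1.40
printf "merged:    %.2f\n", frac_covered(100, @hsps);   # 1.00
```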
From bernd.brandt at gmail.com Wed Aug 2 14:49:20 2006
From: bernd.brandt at gmail.com (Bernd Brandt)
Date: Wed, 2 Aug 2006 20:49:20 +0200
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <44D0A2A8.6040000@sendu.me.uk>
References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> <44D08B15.9000408@sendu.me.uk> <11E90B37-0A62-4A54-90BD-E98B2CA6E683@gmx.net> <44D0A2A8.6040000@sendu.me.uk>
Message-ID:

Hi,

"frac_identical and frac_conserved can still be very wrong"

They were wrong with hmm report parsing (hmmsearch).
The bioperl-live (CVS 14 July 2006) returned 0 for both fractions.
I will check it with the newest CVS and send a small test script.

Regards,
Bernd

From deep.shingan at gmail.com Thu Aug 3 02:25:07 2006
From: deep.shingan at gmail.com (deep shingan)
Date: Thu, 3 Aug 2006 11:55:07 +0530
Subject: [Bioperl-l] Problem in Bio-Perl parser , while running on XML-RPC environment
In-Reply-To: <908f93490608022319q1d63805co8646b1dcd19e6a56@mail.gmail.com>
References: <908f93490608022319q1d63805co8646b1dcd19e6a56@mail.gmail.com>
Message-ID: <908f93490608022325t5c9a7e23u7e47c00d183f3afc@mail.gmail.com>

Hi all,

I have written a BioPerl parser which parses a blast result file and
returns an output array containing all the separate blast values.

This parser runs fine on the local machine, and the resulting output
array contains all the details I have.

The problem appears when I run the method in an XML-RPC environment,
where an XML-RPC C++ client sends a request to the Perl XML-RPC server
and calls this method.

For the very first request, the parser method executes and assigns
values to the output array but does not execute the return statement,
and at the same time I can see the content of the output array on the
client side. But I can't see that the return statement is executed (I
checked this several times through the log file).

The interesting thing is that when I send another request to the
server, the server returns a blank output array to the client and then
calls the blast parser method.
I am badly stuck here. I checked each and every statement I have
written to the log file, for both the Perl server and the C++ client,
and I am not getting any clue on this.

So please, if anyone has encountered the same problem and has some idea
where the problem lies, please help me.

I am sending the source code here.

Thanks
Deepak

#rintable form.
# This script was used to create the table in the SearchIO HOWTO,
# found at http://bioperl.open-bio.org/wiki/HOWTO:SearchIO

use strict;
use Bio::SearchIO;
use Bio::SimpleAlign;
use Bio::AlignIO;
use Error qw(:try);
use Frontier::Daemon;

#logic
#we are taking number of hits that user want to look as input parameter
#to the blastParser method. and parsing the file that is copied by
#ftp, to the directory in which this script is running.
# we are returning the output array which contains all parsed data.

use lib 'lib/perl5/site_perl/5.8.5/';
use Config::Simple;
use Log::Log4perl;

Log::Log4perl::init('log4perl.conf');
my $logger = Log::Log4perl->get_logger('rootLogger');
$logger->debug("Logger Initialised");

#&blastParser();
sub blastParser
{
    try{
        print "\nInside...";
        $logger->debug("\nInside Method blastParser");
        my @outputArray;
        my $arrayCntr = 0;
        #This is the file that is transfered by ftp to the current working
        #directory
        my $file = "tempBlastFile";
        print "\n$file";
        my $recordCounts = 10;
        my $in = new Bio::SearchIO(-format => 'blast',
                                   # comment out the next line to read STDIN
                                   -file => $file );
        while ( my $result = $in->next_result ) {
            print "\nInside result..";
            $logger->debug("\nAnalysing result...");
            my @stats = $result->available_statistics;
            my @params = $result->available_parameters;
            while ( my $hit = $result->next_hit and $recordCounts) {
                print "\n\nRecordcount ::\t$recordCounts\n\n";
                $logger->debug("\nAnalysing hit");
                $logger->debug("\nRecordCount $recordCounts");
                $recordCounts--;
                my $id = $hit->matches('id');
                my $cons = $hit->matches('cons');
                my @accs = $hit->each_accession_number;
                my @qidentical = $hit->seq_inds('query','identical');
                my @qconserved = $hit->seq_inds('query','conserved');
                my @hidentical = $hit->seq_inds('hit','identical');
                my @hconserved = $hit->seq_inds('hit','conserved');

                $outputArray[$arrayCntr] = $hit->name;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->accession;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->raw_score;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->bits;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->gaps;
                $arrayCntr++;

                $logger->debug("\nHit Name : ".$hit->name);
                $logger->debug("\nHit Accession".$hit->accession);
                $logger->debug("\nHit Row score :".$hit->raw_score);
                $logger->debug("\nHit Bits :".$hit->bits);
                $logger->debug("\nHit Gaps :".$hit->gaps);

                #while ( my $hsp = $hit->next_hsp )
                my $hsp = $hit->next_hsp;
                # {
                $logger->debug("Analysing Hsp");
                my ($qid,$qcons) = $hsp->matches('hit');
                my ($id,$cons) = $hsp->matches('query');
                @qidentical = $hsp->seq_inds('query','identical');
                @qconserved = $hsp->seq_inds('query','conserved');
                @hidentical = $hsp->seq_inds('hit','identical');
                @hconserved = $hsp->seq_inds('hit','conserved');
                my @hrange = $hsp->range('hit');
                my @qrange = $hsp->range('query');
                my $aln = $hsp->get_aln;
                my $alnIO = Bio::AlignIO->new(-format=>"clustalw",-file=>'>tempHitFile');

                $outputArray[$arrayCntr] = $hsp->evalue;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hsp->percent_identity;
                $arrayCntr++;
                $logger->debug("Evalue".$hsp->evalue);
                $logger->debug("Percent Identity".$hsp->percent_identity);

                $alnIO->write_aln($aln);
                open hitFile, "tempHitFile" or die "Can't read file";
                undef $/;
                my $allignMent = <hitFile>;
                $outputArray[$arrayCntr] = $allignMent;
                $arrayCntr++;
                close hitFile;
                $logger->debug("Allignment :",$allignMent);
                # }#hsp while ends
            }#hit while ends
        }#result while end
        print "\nReturning Output Array...";
        return \@outputArray;
    }
    catch Error with {
        my $ex = shift;
        print "Exception...!";
    }
}

my $methods = {'blastParser' => \&blastParser};
Frontier::Daemon->new(LocalPort => 9012, methods => $methods)or die "Couldn't
start HTTP server: $!";

From staffa at niehs.nih.gov Thu Aug 3 12:53:04 2006
From: staffa at niehs.nih.gov (staffa)
Date: Thu, 3 Aug 2006 12:53:04 -0400
Subject: [Bioperl-l] Pattern finding and mismatches
Message-ID: <1fe2ea8669ab142911aae00a2825689c@niehs.nih.gov>

I have seen bioperl modules for finding restriction sites and
fragment lengths, and one could create his own pattern to search for,
but is there anything like GCG's findpatterns that would allow me
to search for a 21bp pattern but allow 2 mismatches anywhere?

Nick Staffa
Telephone: 919-316-4569 (NIEHS: 6-4569)
Scientific Computing Support Group
NIEHS Information Technology Support Services Contract
(Science Task Monitor: Jack L. Field( field1 at niehs.nih.gov )
National Institute of Environmental Health Sciences
National Institutes of Health
Research Triangle Park, North Carolina

From andreo_beck at yahoo.com Thu Aug 3 14:14:34 2006
From: andreo_beck at yahoo.com (Andreo Beck)
Date: Thu, 3 Aug 2006 11:14:34 -0700 (PDT)
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <44D08B15.9000408@sendu.me.uk>
Message-ID: <20060803181434.29904.qmail@web57202.mail.re3.yahoo.com>

It still gives more than 1. Can you tell me how good approximation it
does? I mean if its 1.3 can I assume it 1 or something like that?

From bix at sendu.me.uk Thu Aug 3 14:34:01 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 03 Aug 2006 19:34:01 +0100
Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query
In-Reply-To: <20060803181434.29904.qmail@web57202.mail.re3.yahoo.com>
References: <20060803181434.29904.qmail@web57202.mail.re3.yahoo.com>
Message-ID: <44D24199.7060806@sendu.me.uk>

Andreo Beck wrote:
[re: frac_aligned_query, frac_aligned_hit]
> It still gives more than 1. Can you tell me how good approximation it
> does? I mean if its 1.3 can I assume it 1 or something like that?

The only 'approximation' is to assume that none of the hsp alignments
have gaps. It is still impossible for it to get more than 1. When you
have gaps these methods are still useful because they give you a good
idea of how much of the hit/query was involved in the alignment - not
base for base but in terms of sequence coverage. So when they are
working without any bug, the number returned is valid and correct from a
certain viewpoint and shouldn't be 'fudged' like you suggest (treating
one number as another (1.3 != 1)).

If it still does get more than 1 it is a bug that needs to be fixed. For
the test data you sent me I get less than 1 now. Are you sure you
managed to get and install v.1.16 or higher of
Bio::Search::SearchUtils.pm ?
Are you sure you are actually using the new module when you run your
script?

From lzhtom at hotmail.com Thu Aug 3 19:31:57 2006
From: lzhtom at hotmail.com (zhihua li)
Date: Thu, 03 Aug 2006 23:31:57 +0000
Subject: [Bioperl-l] install DBD::mysql, can't execute mysql_config
Message-ID:

Hi, netters,

Recently I've been trying to install the Perl package DBD::mysql,
because I need it to use the Ensembl BioPerl API. According to the
installation file of DBD::mysql, I have to install mysql first, so I
downloaded the binary mysql file and unpacked it into /home/mysql.

When I tried to run the makefile step for DBD::mysql, it always said:
can't find file mysql_config. Actually I found the file mysql_config
under the directory /home/mysql/bin, so I added the full path to the
PATH environment. Yet the setup file of DBD::mysql still kept reporting
that it can't find mysql_config.

Does anyone have a clue about this?

From mthon at tamu.edu Thu Aug 3 21:11:08 2006
From: mthon at tamu.edu (Michael Thon)
Date: Thu, 3 Aug 2006 20:11:08 -0500
Subject: [Bioperl-l] Pattern finding and mismatches
In-Reply-To: <1fe2ea8669ab142911aae00a2825689c@niehs.nih.gov>
References: <1fe2ea8669ab142911aae00a2825689c@niehs.nih.gov>
Message-ID: <587F1568-F5BF-465C-80CB-0F495DA783F8@tamu.edu>

Hi Nick - I don't know if you can do this directly with bioperl, but
I think the EMBOSS program fuzznuc will do what you want.
emboss.sourceforge.net

M

From torsten.seemann at infotech.monash.edu.au Fri Aug 4 00:04:48 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Fri, 04 Aug 2006 14:04:48 +1000
Subject: [Bioperl-l] install DBD::mysql, can't execute mysql_config
In-Reply-To:
References:
Message-ID: <44D2C760.80803@infotech.monash.edu.au>

zhihua li,

> recently i've been trying to install the perl package of DBD::mysql,
> 'cause I need this to use ensembl bioperl API. according to the
> installation file of DBD::mysql, I have to install mysql first. so i
> download the binary mysql file and unpack it into /home/mysql.
>
> when i tried to do make file for DBD::mysql, it always said: can't find
> file mysql_config. actually i found the file mysql_config under the
> directory /home/mysql/bin, so i added the full path to the PATH
> environment. yet the setup file of DBD::mysql still kept reporting that
> it can't find mysql_config.
>
> does anyone have a clue about this?

This list is for solving problems with BioPerl, not MySQL or the MySQL
DBD driver.

I assume you have read this document:
http://search.cpan.org/src/CAPTTOFU/DBD-mysql-3.0006_1/INSTALL.html

It gives good advice; most Linux distributions have already packaged
MySQL and DBD-MySQL for you, e.g. in Fedora/Redhat, a "yum install
perl-DBD-mysql" would probably do everything for you.
--Torsten From osborne1 at optonline.net Fri Aug 4 19:56:15 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Fri, 04 Aug 2006 19:56:15 -0400 Subject: [Bioperl-l] Pattern finding and mismatches In-Reply-To: <587F1568-F5BF-465C-80CB-0F495DA783F8@tamu.edu> Message-ID: Michael and Nick, This question is also addressed in the FAQ: http://www.bioperl.org/wiki/FAQ#How_do_I_do_motif_searches_with_BioPerl.3F_Can_I_do_.22find_all_sequences_that_are_75.25_identical.22_to_a_given_motif.3F Brian O. On 8/3/06 9:11 PM, "Michael Thon" wrote: > Hi Nick - I don't know if you can do this directly with bioperl, but > I think the EMBOSS program fuzznuc will do what you want. > emboss.sourceforge.net > > M > > On Aug 3, 2006, at 11:53 AM, staffa wrote: > >> I have seen bioperl modules for finding restriction sites and >> fragment lengths, and one could create his own pattern to search for, >> but is there anything like GCG's findpatterns that would allow me >> to search for a 21bp pattern but allow 2 mismatches anywhere? >> >> >> Nick Staffa >> Telephone: 919-316-4569 (NIEHS: 6-4569) >> Scientific Computing Support Group >> NIEHS Information Technology Support Services Contract >> (Science Task Monitor: Jack L.
Field( field1 at niehs.nih.gov ) >> National Institute of Environmental Health Sciences >> National Institutes of Health >> Research Triangle Park, North Carolina >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Sat Aug 5 11:42:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 05 Aug 2006 16:42:10 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul Message-ID: <44D4BC52.30203@sendu.me.uk> After the initial round of changes to Taxonomy described at http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed), further changes will allow for the transition of Bio::Species to Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be fully usable without external database access. In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove all Bio::Species-related-backward-compatible methods from Bio::Taxon, create Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al. The following is the set of changes that have been made (with all relevant tests passing), but not committed. Feedback is encouraged. These notes are also available at http://bugzilla.open-bio.org/show_bug.cgi?id=2061 for easier reference later. (in the following notes, use of the name-case word 'Taxon' refers to the module Bio::Taxon or instance of that class, while 'taxon' refers to the concept of a taxonomic unit) Bio::DB::Taxonomy, ::* ---------------------- # API-CHANGES get_Taxonomy_Node() renamed get_taxon(). get_Taxonomy_Node() is a synonym of get_taxon(), eventually to be deprecated. 
New methods ancestor() and each_Descendent() correspond to similar methods in Bio::Taxon and Bio::Tree::NodeI, freeing up the need to store parent_id on each Taxon. New internal method _handle_internal_id(). See Implementation notes below. # Implementation changes Normally when you create a Bio::Taxon it automatically receives a new unique internal id. However when you request the same Taxon from a database more than once you always get an object with the same internal id (allows get_lca to work, allows you to modify one copy of a returned object but still compare it to another copy and see they are supposed to be the same taxon). This even applies across different databases. The Taxon objects returned will still have different memory locations. Bio::DB::Taxonomy::flatfile --------------------------- # API-CHANGES get_Children_Taxids is deprecated - method no longer part of the DB::Taxonomy interface, and superseded by each_Descendent (which is actually implemented by all databases). # Implementation changes No longer includes the fake root node 'root'; there are multiple roots now (10239, 12884, 12908, 29384 and 131567). This means when getting the lineage you no longer have to remove the root node. This is now consistent with the results possible with entrez. NB: You have to delete your current indexes before you will notice the change. Bio::DB::Taxonomy::entrez ------------------------- # API-CHANGES get_node has new option -full that tells it to retrieve full details on a taxon from the website. (Otherwise, it may return a Taxon with minimal information if only minimal information had previously been cached.) # Implementation changes Caches the data it gets from the website and tries to minimise the number of website accesses it does. Bio::DB::Taxonomy::list ----------------------- # NEW An implementation of Bio::DB::Taxonomy that accepts lists of words to build a database. 
Used especially by Bio::Species for backward compatibility purposes, but also useful generally to quickly and easily create a lineage of Bio::Taxon objects/ a Tree. Bio::Tree::TreeI ---------------- # BUG-FIXES number_nodes() returned the number of descendants belonging to the root node, but forgot to count the root node itself. Now number_nodes() == scalar(get_nodes()). Bio::Tree::Tree --------------- # API-CHANGES Added -node option to new() which will call get_lineage_nodes() on the supplied NodeI and set the tree root that way. This is so you can easily make a tree from a Bio::Taxon. In order that the Tree resulting from a Bio::Taxon with a db_handle doesn't end up pulling in the entire database, in the process of finding the root from the -node, ancestor() / add_Descendent() is set for each member of the lineage, which means the database will no longer be asked what the ancestor or descendents of the taxa are. Bio::Tree::TreeFunctionsI ------------------------- # API-CHANGES New method get_lineage_nodes(). Returns all the ancestors of a particular node, up to the tree's defined root node. get_lca() can now also accept just a list of nodes, and also more than 2 nodes. Removed _check_two_nodes() since no longer necessary. New method splice(). Removes requested nodes from a tree, making the ancestors of the removed node's descendants the removed node's ancestor (ie. remove nodes without making the tree fall apart). New method contract_linear_paths(). Splices out all nodes in the tree that have an ancestor and only one descendant. New method merge_lineage(). Merges a lineage of nodes with an existing Tree. # Implementation changes get_lca() uses get_lineage_nodes(), and is the correct implementation; previously not guaranteed to give correct answer. Can get the lca of more than 2 nodes. reroot() uses get_lineage_nodes(). Methods distance(), is_monophyletic() and is_paraphyletic() reimplemented with the new get_lca(). 
find_node() no longer warns about an unknown search type (allowing you to search on -rank and any other thing in the future). Bio::Tools::Phylo::PAML ----------------------- # Implementation changes Methods that make use of get_lca() reimplemented with the new get_lca(). (otherwise, PAML tests no longer passed) Bio::Tree::Node --------------- # Implementation changes ancestor() now correctly removes and adds descendant from previous/new ancestor when changing ancestor. t/Node.t -------- Added tests for setting ancestor() Bio::Taxonomy::Node ------------------- # DEPRECATED (name change) isa Bio::Taxon # Implementation changes No code; delegates to Bio::Taxon Bio::Taxon ---------- # NEW (name change from Bio::Taxonomy::Node) Changes below relate to changes to Bio::Taxonomy::Node # API-CHANGES Removed the following options from new(): -classification, -sub_species, -variant and -organelle. The corresponding methods are no longer present. New option to new(): -id. For Tree::Node compatibility. -object_id and -ncbi_taxid are no longer mentioned in docs but still work. The -dbh option to new() no longer defaults to any database. A Bio::Taxon is now fully usable without ever setting a database handle. Removed the methods binomial(), species(), genus(), sub_species(), variant(), classification() and show_all(). Not appropriate to have rank-specific methods in a class that models any single rank. Definitely not appropriate to store information about other taxons in a Taxon. These questions can be answered using Tree* methods, or with Bio::Species. Removed method organelle(). Organelle isn't part of a taxonomy. Other modules like SeqIO should have their own storage of organelle information as necessary (But Bio::Species retains organelle() in the mean time). Removed methods get_Lineage_Nodes() and get_LCA_Node(). For these kinds of methods you should now use Bio::Tree::TreeFunctionsI methods. You can no longer set parent_id(). 
The id of your parent is determined by the Taxon that is your ancestor. This method is no longer needed (previously it was central to the workings of the object), so is now deprecated. It issues a warning if you try and set its value. get_Parent_Node() eventually to be deprecated, is now a synonym of new method ancestor(). (For Tree::Node compatibility.) get_Children_Nodes() eventually to be deprecated, is now a synonym of new method each_Descendent(). (For Tree::Node compatibility.) object_id() eventually to be deprecated, is now a synonym of new method id(). (For Tree::Node compatibility.) # Implementation changes is(also)a Bio::Tree::Node. division() was implemented via $self->name('division', at _). Now name('division') will only allow one value to be set, and division() only ever returns a single scalar or undef, never an array. common_names() returns the last common_name in scalar context (instead of first), so set/get/set/get works as expected with common_name(). db_handle() similar to before when getting, but now setting the handle will locate $self in the new database (by id or name) and merge data (eg. if rank was 'no rank' and new database node has rank 'species', $self->rank() will become 'species'). get_Parent_Node() (ne ancestor()) and get_Children_Nodes() (ne each_Descendent()) now use the Bio::Tree::Node implementation. ancestor() falls back to asking the database for the ancestor if one had not been manually set by the user. each_Descendent does NOT fall back to the database, preventing the whole database being pulled into a Tree object made with a Bio::Taxon. parent_id() now gets the ancestor Taxon with ancestor() and returns $ancestor->id(). Had to remove the clean up methods from Bio::Tree::Node since they were in a CODE ref, preventing Bio::Species objects from being frozen with Storable. Will come up with a better solution in the future. 
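Two of the ideas in these notes can be sketched in a few lines of plain Perl: parent_id() being derived from ancestor() rather than stored, and a lowest common ancestor computed from lineages (the approach the notes describe for get_lca() via get_lineage_nodes()). This is an illustration only, under stated assumptions: SketchNode, lineage() and lca() are hypothetical names invented here and are NOT the Bio::Taxon or Bio::Tree::TreeFunctionsI code. Nodes are compared by id rather than by memory location, mirroring the remark above that two copies of the same taxon share an internal id.

```perl
use strict;
use warnings;

# Hypothetical minimal node class for illustration; not Bio::Taxon.
package SketchNode;

sub new {
    my ($class, %args) = @_;
    return bless { id => $args{id}, ancestor => undef }, $class;
}

sub id { $_[0]{id} }

# Get/set the ancestor node.
sub ancestor {
    my ($self, $new) = @_;
    $self->{ancestor} = $new if @_ > 1;
    return $self->{ancestor};
}

# The parent id is not stored: it is whatever id the ancestor has.
sub parent_id {
    my $self = shift;
    my $ancestor = $self->ancestor;
    return defined $ancestor ? $ancestor->id : undef;
}

# Lineage from the root down to (and including) this node.
sub lineage {
    my $self = shift;
    my @lineage;
    for (my $node = $self; defined $node; $node = $node->ancestor) {
        unshift @lineage, $node;
    }
    return @lineage;
}

package main;

# Lowest common ancestor of any number of nodes: the last node of the
# longest root-first prefix shared by all lineages.
sub lca {
    my @nodes = @_;
    return unless @nodes;
    my @common = (shift @nodes)->lineage;
    for my $node (@nodes) {
        my @lineage = $node->lineage;
        my $i = 0;
        $i++ while $i < @common
                && $i < @lineage
                && $common[$i]->id eq $lineage[$i]->id;
        splice @common, $i;    # keep only the shared prefix
    }
    return @common ? $common[-1] : undef;
}

my %n = map { $_ => SketchNode->new(id => $_) }
        qw(Hominidae Homininae Homo Pan Gorilla Pongo);
$n{Homininae}->ancestor($n{Hominidae});
$n{Pongo}->ancestor($n{Hominidae});
$n{$_}->ancestor($n{Homininae}) for qw(Homo Pan Gorilla);

print lca(@n{qw(Homo Pan Gorilla)})->id, "\n";    # prints Homininae
print lca(@n{qw(Homo Pongo)})->id, "\n";          # prints Hominidae
print $n{Homo}->parent_id, "\n";                  # prints Homininae
```

Because each node holds only a reference to its ancestor, nothing has to be kept in sync when a node is moved: parent_id() always reflects the current ancestor.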
Bio::Taxonomy ------------- # DEPRECATED Redundant Bio::Taxonomy::Taxon -------------------- # DEPRECATED Redundant Bio::Taxonomy::Tree ------------------- # DEPRECATED Redundant Bio::Taxonomy::FactoryI ----------------------- # DEPRECATED Redundant Bio::Species ------------ # Implementation changes Bio::Species isa Bio::Taxon. No method uses validate_species_name() any more. (but the method remains unaltered, as does validate_name() which just returns 1 - no change). classification() set implemented as: Set db_handle() to a new Bio::DB::Taxonomy::list with the supplied classification array and make a Bio::Tree::Tree of self, stored in self. Getting the classification implemented as: Return the scientific_name() of each Taxon returned by our tree->get_lineage_nodes. Methods ncbi_taxid(), division() and common_name() implemented by Taxon. Methods species(), genus(), subspecies() and variant() no longer get/set elements in the classification array or store direct values. They are implemented like: Ask our tree for the taxon with rank() eq method name and set/get the scientific_name of that. Otherwise, for methods species() and genus() assume we are rank() 'species', our parent taxon is rank() 'genus' and try again. For subspecies() and variant(), fall back to old implementation (store data directly on self). binomial() prefers to simply return scientific_name() if we are a Taxon with rank() 'species' and the scientific_name is at least a 2 word scalar. It interprets the 'FULL' option as wanting the trinomial name and prefers to simply return scientific_name() if we have rank() 'subspecies' or 'variant' and at least 3 word scalar. Failing these two cases, it falls back on the old implementation (build 'genus species' from the classification), but with a little more intelligence to try and not duplicate names. # Behaviour changes An indirect new behaviour is that the SeqIO modules will probably return ->species() as the real species name (eg. 
'Homo sapiens'), not the previously (and sometimes incorrectly) munged name (eg. 'sapiens'). # Notes Stores a Bio::Tree::Tree on itself, had to remove its clean up methods since they were in a CODE ref, preventing us from being frozen with Storable. Will come up with a better solution in the future. Bio::SeqIO::* ------------- A number of these modules make use of Bio::Species when parsing taxonomic information. They probably all have/had problems. I've only investigated genbank to any significant depth; the others need to be properly tested to see if when they read taxonomic data in they can output it again identically to the input file. It is probably the case that some fail at this currently. (I simply don't have time myself to make all these modules perfect.) Bio::SeqIO::bsml_sax -------------------- # BUG-FIXES It used to include the genus twice in the classification array of Bio::Species object. Now it doesn't. Bio::SeqIO::embl ---------------- # BUG-FIXES When the OC lines include the species name, the Bio::Species classification array included the true species name as a rank above genus and the real genus duplicated as a rank above that. Now it doesn't. Bio::SeqIO::genbank ------------------- # BUG-FIXES Now that Bio::Species isa Bio::Taxon, it is possible to ensure that output of input matches the input (in the SOURCE and ORGANISM lines at least). Usage of Bio::Species re-implemented to get all tests in t/genbank.t to pass. t/genbank.t ----------- Modified some tests to expect the correct answer, ie. $bio_species_obj->species now expects 'Mus musculus', not 'musculus'. t/Index.t --------- Modified some tests to expect the correct answer, ie. $bio_species_obj->species now expects 'Homo sapiens', not 'sapiens'. scripts/taxa/taxonomy2tree.PLS ------------------------------ Added some extra options to define the location of the database indexes and files, or use the entrez on-line database instead. (Note how entrez and flatfile are now truly interchangeable.) 
Reimplemented using the new Bio::Taxon system. Now much simpler. You also get the correct answer, eg. instead of (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)root; you now get (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo sapiens")"Homo/Pan/Gorilla group")Hominidae)"cellular organisms"; From osborne1 at optonline.net Sun Aug 6 11:38:11 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Sun, 06 Aug 2006 11:38:11 -0400 Subject: [Bioperl-l] FW: A bio-perl networking problem In-Reply-To: <20060806080943.62302.qmail@web51708.mail.yahoo.com> Message-ID:

------ Forwarded Message
From: deepak shingan
Date: Sun, 06 Aug 2006 01:09:42 -0700 (PDT)
To:
Subject: A bio-perl networking problem

Hello Sir,

I am in great trouble and I really want your help. I am using SearchIO for parsing the blast result file in an xml-rpc environment. I have a client and a server written in perl.

The very first request that I send gets executed, and I am able to see the output result at the client side. But the second request I send does not get executed successfully. It does not even throw an exception. I think the problem is with the "->next_result" statement, which cannot collect a result on the second request even though the file contains many hits.

Can you shed some light on this? I am really sorry for disturbing you but I am not able to make out anything. So please help me.

Here I am sending the source code for testClient and testParser and also a temporary Blast result file for testing.

Thanks, and eagerly waiting for your response.
Deepak

#fileName :: TestServer.pl
use warnings;
use strict;
use Frontier::Daemon;
use Bio::SimpleAlign;
use Bio::SearchIO;
use Bio::AlignIO;
use Error qw(:try);

#*******************************************************************************
# start the daemon
Frontier::Daemon->new(
    LocalPort => 8901,
    methods   => {
        'blastParser' => \&blastParser,
    }
) or die "Failed to start daemon: $!\n";

#*******************************************************************************
# Methods
#*******************************************************************************
sub blastParser {
    try {
        print "\nInside...";
        my @outputArray;
        my $arrayCntr = 0;

        #This is the file that is transferred by ftp to the current working
        #directory
        #but for testing just hardcoded
        my $file = "tempBlastFile";
        print "\n$file";

        #hardcoded just for testing
        my $recordCounts = 10;

        my $in = new Bio::SearchIO(-format => 'blast',
                                   # comment out the next line to read STDIN
                                   -file   => $file );

        #for the second request its not entering into the while loop
        #even though the file has many hits
        while ( my $result = $in->next_result ) {
            print "\nInside result..";
            my @stats = $result->available_statistics;
            my @params = $result->available_parameters;
            while ( my $hit = $result->next_hit and $recordCounts) {
                print "\n\nRecordcount ::\t$recordCounts\n\n";
                $recordCounts--;
                my $id = $hit->matches('id');
                my $cons = $hit->matches('cons');
                my @accs = $hit->each_accession_number;
                my @qidentical = $hit->seq_inds('query','identical');
                my @qconserved = $hit->seq_inds('query','conserved');
                my @hidentical = $hit->seq_inds('hit','identical');
                my @hconserved = $hit->seq_inds('hit','conserved');
                $outputArray[$arrayCntr] = $hit->name;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->accession;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->raw_score;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->bits;
                $arrayCntr++;
                $outputArray[$arrayCntr] = $hit->gaps;
                $arrayCntr++;
                while ( my $hsp = $hit->next_hsp ) {
                    my ($qid,$qcons) = $hsp->matches('hit');
                    my ($id,$cons) = $hsp->matches('query');
                    @qidentical = $hsp->seq_inds('query','identical');
                    @qconserved = $hsp->seq_inds('query','conserved');
                    @hidentical = $hsp->seq_inds('hit','identical');
                    @hconserved = $hsp->seq_inds('hit','conserved');
                    my @hrange = $hsp->range('hit');
                    my @qrange = $hsp->range('query');
                    my $aln = $hsp->get_aln;
                    my $alnIO = Bio::AlignIO->new(-format=>"clustalw",-file=>'>tempHitFile');
                    $outputArray[$arrayCntr] = $hsp->evalue;
                    $arrayCntr++;
                    $outputArray[$arrayCntr] = $hsp->percent_identity;
                    $arrayCntr++;
                    $alnIO->write_aln($aln);
                    open hitFile, "tempHitFile" or die "Can't read file";
                    undef $/;
                    my $allignMent = <hitFile>;
                    $outputArray[$arrayCntr] = $allignMent;
                    $arrayCntr++;
                    close hitFile;
                    $logger->debug("Allignment :",$allignMent);
                    undef $hsp;
                }#hsp while ends
                undef $hit;
            }#hit while ends
        }#result while end
        print "\nReturning Output Array...\n";
        undef $in;
        return \@outputArray;
    }
    catch Error with {
        my $ex = shift;
        print "Exception...!: $ex";
    }
}

#fileName : testClient
use Frontier::Client;

# Make an object to represent the XML-RPC server.
$server_url = 'http://localhost:8901/RPC2';
$server = Frontier::Client->new(url => $server_url);

# Call the remote server and get our result.
$result = $server->call('blastParser');
print "Got the result back\n". @$result ;

Behind every successful man, there is a woman
And behind every unsuccessful man, there are two...!!!

BLASTN 2.2.11 [Jun-05-2005]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.
Query= 1_636_39 (844 letters) Database: Chimp_Genome_With_GeneID_coordinates_BUILD1.fasta 40,517 sequences; 3,089,631,664 total letters Searching...................................................done Score E Sequences producing significant alignments: (bits) Value 112104-112105: 22707394-23965543 224 7e-56 113895-113896: 10715500-10725880 176 1e-41 110427-110428: 29765264-29902644 145 5e-32 108622-108623: 129300340-130310706 141 8e-31 111131: 144643645-144703036 111 7e-22 122818-122819: 145994494-147531331 88 1e-14 >112104-112105: 22707394-23965543 Length = 1258150 Score = 224 bits (113), Expect = 7e-56 Identities = 237/278 (85%), Gaps = 4/278 (1%) Strand = Plus / Minus Query: 20 ggccttgttcaattaagcactctattcttaatttactgctaaatcctccttggactctca 79 |||| ||||||| ||||||||||| ||||| ||||||||||||||| |||| ||| || | Sbjct: 44216 ggccctgttcaactaagcactctactcttagtttactgctaaatccaccttcgaccctta 44157 Query: 80 ggtttcatgagagttatcgtgggttttctaggatagaaaatgtagcccatctttttccgt 139 |||||||| || |||||||| ||| | || ||||||||||||||||| | || || Sbjct: 44156 ggtttcataagggttatcgtaattttcatggggtagaaaatgtagcccatttcttgccac 44097 Query: 140 ctcataggctacaccttgacctaacgtttttgcgaaaaggggtacttgcgctcactttgt 199 ||||| ||||||||||||||||||||| ||| || ||||||||||||| ||||||| Sbjct: 44096 ctcatgggctacaccttgacctaacgtctttacgt----gggtacttgcgcttactttgt 44041 Query: 200 gacctttatcagggtttgctgaagatggcggtatataggctgagcaagagagggtagggt 259 ||||| ||||||||||||||||||||||||||||||||||||||||||| ||| ||| Sbjct: 44040 aaccttcatcagggtttgctgaagatggcggtatataggctgagcaagaggtggtgaggt 43981 Query: 260 ggatcggggtttatcgattatgggacaggctcctctag 297 ||||||||||||||||||| | |||||||||||||| Sbjct: 43980 tgatcggggtttatcgattacagaacaggctcctctag 43943 >113895-113896: 10715500-10725880 Length = 10381 Score = 176 bits (89), Expect = 1e-41 Identities = 230/274 (83%), Gaps = 6/274 (2%) Strand = Plus / Plus Query: 25 tgttcaattaagcactctattcttaatttactgctaaatcctccttggactctcaggttt 84 ||||||| ||||||||||| ||||| ||||||||||||||| |||| ||| || | ||| Sbjct: 4798 
tgttcaactaagcactctactcttagtttactgctaaatccaccttcgacccttaaattt 4857 Query: 85 catgagagttatcgtgggttttctaggatagaaaatgtagcccatctttttccgtctcat 144 ||| || |||||||| | |||||| ||||||||||||||||| | || || ||||| Sbjct: 4858 cataagggttatcgtag-ttttctgaagtagaaaatgtagcccatttcttgccacctcat 4916 Query: 145 aggctacaccttgacctaacgtttttgcgaaaaggggtacttgcgctcactttgtgacct 204 ||||||||||||||||||||| ||| || ||||||||||||| |||||| | ||| Sbjct: 4917 gggctacaccttgacctaacgtctttacgt----gggtacttgcgcttactttgcggcct 4972 Query: 205 ttatcagggtttgctgaagatggcggtatataggctgagcaagagag-ggtagggtggat 263 | |||||||||||||||||||||||||||||||||||||||||| | ||| ||| ||| Sbjct: 4973 tcgtcagggtttgctgaagatggcggtatataggctgagcaagagggtggtgaggttgat 5032 Query: 264 cggggtttatcgattatgggacaggctcctctag 297 |||||||||||||||| | |||||| ||||||| Sbjct: 5033 cggggtttatcgattacagaacaggcccctctag 5066 >110427-110428: 29765264-29902644 Length = 137381 Score = 145 bits (73), Expect = 5e-32 Identities = 137/158 (86%), Gaps = 4/158 (2%) Strand = Plus / Minus Query: 140 ctcataggctacaccttgacctaacgtttttgcgaaaaggggtacttgcgctcactttgt 199 ||||| ||||||||||||||||||||| ||| || ||||||||||||| |||||| Sbjct: 126023 ctcatgggctacaccttgacctaacgtctttacgt----gggtacttgcgcttactttgc 125968 Query: 200 gacctttatcagggtttgctgaagatggcggtatataggctgagcaagagagggtagggt 259 |||| ||||||||||||||||||||||||||||||||||||||||||| ||| ||| Sbjct: 125967 agccttcatcagggtttgctgaagatggcggtatataggctgagcaagaggtggtgaggt 125908 Query: 260 ggatcggggtttatcgattatgggacaggctcctctag 297 ||||||||||||||||||| | |||||||||||||| Sbjct: 125907 tgatcggggtttatcgattacagaacaggctcctctag 125870 >108622-108623: 129300340-130310706 Length = 1010367 Score = 141 bits (71), Expect = 8e-31 Identities = 195/232 (84%), Gaps = 7/232 (3%) Strand = Plus / Minus Query: 20 ggccttgttcaattaagcactctattcttaatttactgctaaatcctccttggactctca 79 |||||| |||||||||||||||| || ||||||||||||||||||||||||| | || | Sbjct: 110910 ggccttattcaattaagcactctgctcctaatttactgctaaatcctccttgggccctta 110851 Query: 80 
ggtttcatgagagttatcgtg-ggttttctagga-tagaaaatgtagcccatctttttcc 137 |||||||| || ||| | ||| | |||||||||| |||||||| |||||||| | || || Sbjct: 110850 ggtttcataagggttgttgtgagattttctaggagtagaaaatatagcccatttcttacc 110791 Query: 138 gtctcataggctacaccttgacctaacgtttttgcgaaaaggggtacttgcgctcacttt 197 ||||| ||||||| ||||||||||||||||| || | | |||||||||| ||||| Sbjct: 110790 acctcatgggctacaacttgacctaacgtttttacgta----gatacttgcgcttacttt 110735 Query: 198 gtgacctttatcagggtttgctgaagatggcggtatataggctgagcaagag 249 | | | ||| ||||||||||||||||||| |||||||||||||||||||| Sbjct: 110734 g-cagccttactagggtttgctgaagatggcagtatataggctgagcaagag 110684 >111131: 144643645-144703036 Length = 59392 Score = 111 bits (56), Expect = 7e-22 Identities = 153/185 (82%), Gaps = 4/185 (2%) Strand = Plus / Minus Query: 113 tagaaaatgtagcccatctttttccgtctcataggctacaccttgacctaacgtttttgc 172 ||||||||||||||||| | || || ||||| |||||||||||||||||| |||||| Sbjct: 9254 tagaaaatgtagcccatttcttaccacctcatgggctacaccttgacctaatgtttttat 9195 Query: 173 gaaaaggggtacttgcgctcactttgtgacctttatcagggtttgctgaagatggcggta 232 | | | |||||| ||| |||||||||| |||| ||||||||||||||||| |||| Sbjct: 9194 gtaga----tacttgtgcttactttgtgacttttactagggtttgctgaagatgatggta 9139 Query: 233 tataggctgagcaagagagggtagggtggatcggggtttatcgattatgggacaggctcc 292 ||||||||||||||||| ||| ||| || |||||||||| ||||| | ||||||||| Sbjct: 9138 tataggctgagcaagaggtggtgaggtaaattggggtttatccattatagaacaggctcc 9079 Query: 293 tctag 297 ||||| Sbjct: 9078 tctag 9074 >122818-122819: 145994494-147531331 Length = 1536838 Score = 87.7 bits (44), Expect = 1e-14 Identities = 111/132 (84%), Gaps = 1/132 (0%) Strand = Plus / Minus Query: 39 ctctattcttaatttactgctaaatcctccttggactctcaggtttcatgagagttatcg 98 |||||||||| ||||||| ||||||||||||| ||| | | |||||||||| ||| || Sbjct: 1324540 ctctattcttgatttactactaaatcctcctttgacctttaagtttcatgagggttgtca 1324481 Query: 99 tgggttttctaggat-agaaaatgtagcccatctttttccgtctcataggctacaccttg 157 ||||| |||||| | ||||||||| |||||| | ||||| |||||| |||||||||||| Sbjct: 1324480 
tgggtgttctagatttagaaaatgtggcccatttctttccatctcatgggctacaccttg 1324421 Query: 158 acctaacgtttt 169 | ||||| |||| Sbjct: 1324420 atctaacatttt 1324409 Database: Chimp_Genome_With_GeneID_coordinates_BUILD1.fasta Posted date: Nov 16, 2005 5:29 PM Number of letters in database: 3,089,631,664 Number of sequences in database: 40,517 Lambda K H 1.37 0.711 1.31 Gapped Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 Number of Hits to DB: 1,543,047 Number of Sequences: 40517 Number of extensions: 1543047 Number of successful extensions: 4846 Number of sequences better than 1.0e-10: 6 Number of HSP's better than 0.0 without gapping: 5 Number of HSP's successfully gapped in prelim test: 1 Number of HSP's that attempted gapping in prelim test: 4821 Number of HSP's gapped (non-prelim): 20 length of query: 844 length of database: 3,089,631,664 effective HSP length: 21 effective length of query: 823 effective length of database: 3,088,780,807 effective search space: 2542066604161 effective search space used: 2542066604161 T: 0 A: 0 X1: 11 (21.8 bits) X2: 15 (29.7 bits) S1: 12 (24.3 bits) S2: 38 (75.8 bits) ------ End of Forwarded Message From cjfields at uiuc.edu Sun Aug 6 19:44:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 6 Aug 2006 18:44:14 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D4BC52.30203@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> Message-ID: <4D709D39-0340-4184-8581-3468F29A83D0@uiuc.edu> Sendu, I feel this needs to be posted to the main list for further responses from anyone interested in making a point, one way or another. I'm dropping out of this; you can have the last word. This is in response to Sendu's proposal to have $species->species return the binomial name for that rank, as documented on Bugzilla. Any other responses would be appreciated. 
(In reply to comment #5) > (In reply to comment #4) > See also http://en.wikipedia.org/wiki/Species and > http://en.wikipedia.org/wiki/Binomial_nomenclature. "The name of the species is > the whole binomial, not just the second term (which may be called specific > epithet, for plants, or specific name, for animals)". > > We can't have a method for the 'specific name' because we have no way of always > correctly working out what that is. The NCBI taxonomy database doesn't tell us, > and neither do the various sequence file formats. Let's say, for instance, that the single definition of 'species,' as you have shown, was the only correct definition. But in your response quoting the Wikipedia articles you leave out a plethora of other definitions, including one used by taxonomists: the second name in a binomial nomenclature, aka the species descriptor or what you have as the 'specific epithet'. This is also explicitly stated in the second link you provide, for 'binomial nomenclature': "As the word "binomial" suggests, the scientific name of a species is formed by the combination of two terms: the genus name and the species descriptor." The previous use of species() in Bio::Species fits that definition, in that the species() method originally gave only the species descriptor (one name), NOT the binomial name, which is given by binomial(). Similarly, genus() gave only the genus name. Why have a genus() or binomial() at all if you get the entire name via species()? So, is there a correct definition of 'species'? The same wikipedia pages you use to bolster your case for using a binomial species name actually indicates otherwise: "Since the advent of the theory of evolution, the conception of species has undergone vast changes in biology; however no consensus on the definition of the word has yet been reached." Seems ambiguous to me. Is there another way? 
Our proposal (actually Hilmar's) was to let Bio::Species hold the data as parsed in the SeqIO modules as is, but also have the same data contained in a Bio::Taxon object for I/O. Then, slowly deprecate Bio::Species in favor of Bio::Taxon. No confusion as to the data returned, no redundant methods, and the change is gradual, not sudden. So, you could get the name ('Homo sapiens') as a Bio::Taxon object scientific name: # returns NCBI TaxID scientific name from Bio::Taxon object $seq->taxon->scientific_name(); which doesn't carry the ambiguity of what would be returned like # returns species name from Bio::Species object $seq->species->species(); # what is it? Is it a single name? The binomial? Both definitions could be correct (but only the first one is used). At least with the first version (again proposed by Hilmar), you can state that this explicitly returns the scientific name as defined by NCBI (and have something from the NCBI server to point to). No tainting of Bio::Taxon with odd useless methods which can be misconstrued five ways. I'm not going to get drawn into another long-winded argument about this. My point is made. It's your baby. I feel that we sometimes get too impassioned trying to defend our views when coding is the best course of action. And I feel that not making concise arguments can be wasteful and, ultimately, pointless. It's my firm belief, though, using species() in this way will generate more confusion than it's worth. I'll leave it to you to answer the confused emails from bioperl users who don't expect this. Chris From hlapp at gmx.net Sun Aug 6 15:38:17 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 6 Aug 2006 15:38:17 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D4BC52.30203@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> Message-ID: <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> Wow! This is quite a number of changes to digest. Thanks for the detailed documentation. 
I have three comments. 1) It sounds a bit that you changed the behavior of get_lca() such that users may have to adjust their code? If this is true, then this needs to be made clear in the 1.6 release as that part will not be backward compatible. If this is not true, then why did you have to change the implementation of Bio::Tools::Phylo::PAML to make tests pass? I.e., to what extent can what broke Bio::Tools::Phylo::PAML also break someone's script? 2) I can't find object_id() on Tree::Node or Taxonomy::Taxon. Where is/was it? The reason I am asking is that this method is part of the Bio::IdentifiableI API and therefore if you want to deprecate it you are suggesting to deprecate implementing Bio::IdentifiableI, and the rest of those methods need to be deprecated along. 3) Your whole email should probably go on the wiki, linked somewhere under documentation or release notes. Or somebody has a better idea? -hilmar On Aug 5, 2006, at 12:42 PM, Sendu Bala wrote: > After the initial round of changes to Taxonomy described at > http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed), > further changes will allow for the transition of Bio::Species to > Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be fully > usable without external database access. > > In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon > implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove all > Bio::Species-related-backward-compatible methods from Bio::Taxon, > create > Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al. > > The following is the set of changes that have been made (with all > relevant tests passing), but not committed. Feedback is encouraged. > These notes are also available at > http://bugzilla.open-bio.org/show_bug.cgi?id=2061 for easier reference > later. 
> > > (in the following notes, use of the name-case word 'Taxon' refers > to the > module Bio::Taxon or instance of that class, while 'taxon' refers > to the > concept of a taxonomic unit) > > > Bio::DB::Taxonomy, ::* > ---------------------- > > # API-CHANGES > get_Taxonomy_Node() renamed get_taxon(). get_Taxonomy_Node() is a > synonym of get_taxon(), eventually to be deprecated. > > New methods ancestor() and each_Descendent() correspond to similar > methods in Bio::Taxon and Bio::Tree::NodeI, freeing up the need to > store > parent_id on each Taxon. > > New internal method _handle_internal_id(). See Implementation notes > below. > > # Implementation changes > Normally when you create a Bio::Taxon it automatically receives a new > unique internal id. However when you request the same Taxon from a > database more than once you always get an object with the same > internal > id (allows get_lca to work, allows you to modify one copy of a > returned > object but still compare it to another copy and see they are > supposed to > be the same taxon). This even applies across different databases. The > Taxon objects returned will still have different memory locations. > > > Bio::DB::Taxonomy::flatfile > --------------------------- > > # API-CHANGES > get_Children_Taxids is deprecated - method no longer part of the > DB::Taxonomy interface, and superseded by each_Descendent (which is > actually implemented by all databases). > > # Implementation changes > No longer includes the fake root node 'root'; there are multiple roots > now (10239, 12884, 12908, 29384 and 131567). This means when > getting the > lineage you no longer have to remove the root node. This is now > consistent with the results possible with entrez. > NB: You have to delete your current indexes before you will notice the > change. 
> > > Bio::DB::Taxonomy::entrez > ------------------------- > > # API-CHANGES > get_node has new option -full that tells it to retrieve full > details on > a taxon from the website. (Otherwise, it may return a Taxon with > minimal > information if only minimal information had previously been cached.) > > # Implementation changes > Caches the data it gets from the website and tries to minimise the > number of website accesses it does. > > > Bio::DB::Taxonomy::list > ----------------------- > > # NEW > An implementation of Bio::DB::Taxonomy that accepts lists of words to > build a database. Used especially by Bio::Species for backward > compatibility purposes, but also useful generally to quickly and > easily > create a lineage of Bio::Taxon objects/ a Tree. > > > Bio::Tree::TreeI > ---------------- > > # BUG-FIXES > number_nodes() returned the number of descendants belonging to the > root > node, but forgot to count the root node itself. Now number_nodes() == > scalar(get_nodes()). > > > Bio::Tree::Tree > --------------- > > # API-CHANGES > Added -node option to new() which will call get_lineage_nodes() on the > supplied NodeI and set the tree root that way. This is so you can > easily > make a tree from a Bio::Taxon. In order that the Tree resulting from a > Bio::Taxon with a db_handle doesn't end up pulling in the entire > database, in the process of finding the root from the -node, > ancestor() > / add_Descendent() is set for each member of the lineage, which means > the database will no longer be asked what the ancestor or > descendents of > the taxa are. > > > Bio::Tree::TreeFunctionsI > ------------------------- > > # API-CHANGES > New method get_lineage_nodes(). Returns all the ancestors of a > particular node, up to the tree's defined root node. > > get_lca() can now also accept just a list of nodes, and also more > than 2 > nodes. > > Removed _check_two_nodes() since no longer necessary. > > New method splice(). 
Removes requested nodes from a tree, making the > ancestors of the removed node's descendants the removed node's > ancestor > (ie. remove nodes without making the tree fall apart). > > New method contract_linear_paths(). Splices out all nodes in the tree > that have an ancestor and only one descendant. > > New method merge_lineage(). Merges a lineage of nodes with an > existing Tree. > > # Implementation changes > get_lca() uses get_lineage_nodes(), and is the correct implementation; > previously not guaranteed to give correct answer. Can get the lca of > more than 2 nodes. > > reroot() uses get_lineage_nodes(). > > Methods distance(), is_monophyletic() and is_paraphyletic() > reimplemented with the new get_lca(). > > find_node() no longer warns about an unknown search type (allowing you > to search on -rank and any other thing in the future). > > > Bio::Tools::Phylo::PAML > ----------------------- > > # Implementation changes > Methods that make use of get_lca() reimplemented with the new > get_lca(). > (otherwise, PAML tests no longer passed) > > > Bio::Tree::Node > --------------- > > # Implementation changes > ancestor() now correctly removes and adds descendant from previous/new > ancestor when changing ancestor. > > > t/Node.t > -------- > Added tests for setting ancestor() > > > Bio::Taxonomy::Node > ------------------- > > # DEPRECATED (name change) > isa Bio::Taxon > > # Implementation changes > No code; delegates to Bio::Taxon > > > Bio::Taxon > ---------- > > # NEW (name change from Bio::Taxonomy::Node) > Changes below relate to changes to Bio::Taxonomy::Node > > # API-CHANGES > Removed the following options from new(): -classification, > -sub_species, -variant and -organelle. The corresponding methods > are no > longer present. > > New option to new(): -id. For Tree::Node compatibility. -object_id and > -ncbi_taxid are no longer mentioned in docs but still work. > > The -dbh option to new() no longer defaults to any database. 
A > Bio::Taxon is now fully usable without ever setting a database handle. > > Removed the methods binomial(), species(), genus(), sub_species(), > variant(), classification() and show_all(). Not appropriate to have > rank-specific methods in a class that models any single rank. > Definitely > not appropriate to store information about other taxons in a Taxon. > These questions can be answered using Tree* methods, or with > Bio::Species. > > Removed method organelle(). Organelle isn't part of a taxonomy. Other > modules like SeqIO should have their own storage of organelle > information as necessary (But Bio::Species retains organelle() in the > mean time). > > Removed methods get_Lineage_Nodes() and get_LCA_Node(). For these > kinds > of methods you should now use Bio::Tree::TreeFunctionsI methods. > > You can no longer set parent_id(). The id of your parent is determined > by the Taxon that is your ancestor. This method is no longer needed > (previously it was central to the workings of the object), so is now > deprecated. It issues a warning if you try and set its value. > > get_Parent_Node() eventually to be deprecated, is now a synonym of new > method ancestor(). (For Tree::Node compatibility.) > > get_Children_Nodes() eventually to be deprecated, is now a synonym of > new method each_Descendent(). (For Tree::Node compatibility.) > > object_id() eventually to be deprecated, is now a synonym of new > method > id(). (For Tree::Node compatibility.) > > # Implementation changes > is(also)a Bio::Tree::Node. > > division() was implemented via $self->name('division', @_). Now > name('division') will only allow one value to be set, and division() > only ever returns a single scalar or undef, never an array. > > common_names() returns the last common_name in scalar context (instead > of first), so set/get/set/get works as expected with common_name().
> > db_handle() similar to before when getting, but now setting the handle > will locate $self in the new database (by id or name) and merge data > (eg. if rank was 'no rank' and new database node has rank 'species', > $self->rank() will become 'species'). > > get_Parent_Node() (ne ancestor()) and get_Children_Nodes() (ne > each_Descendent()) now use the Bio::Tree::Node implementation. > ancestor() falls back to asking the database for the ancestor if > one had > not been manually set by the user. each_Descendent does NOT fall > back to > the database, preventing the whole database being pulled into a Tree > object made with a Bio::Taxon. > > parent_id() now gets the ancestor Taxon with ancestor() and returns > $ancestor->id(). > > Had to remove the clean up methods from Bio::Tree::Node since they > were > in a CODE ref, preventing Bio::Species objects from being frozen with > Storable. Will come up with a better solution in the future. > > > Bio::Taxonomy > ------------- > > # DEPRECATED > Redundant > > > Bio::Taxonomy::Taxon > -------------------- > > # DEPRECATED > Redundant > > > Bio::Taxonomy::Tree > ------------------- > > # DEPRECATED > Redundant > > > Bio::Taxonomy::FactoryI > ----------------------- > > # DEPRECATED > Redundant > > > Bio::Species > ------------ > > # Implementation changes > Bio::Species isa Bio::Taxon. > > No method uses validate_species_name() any more. (but the method > remains > unaltered, as does validate_name() which just returns 1 - no change). > > classification() set implemented as: > Set db_handle() to a new Bio::DB::Taxonomy::list with the supplied > classification array and make a Bio::Tree::Tree of self, stored in > self. > Getting the classification implemented as: > Return the scientific_name() of each Taxon returned by our > tree->get_lineage_nodes. > > Methods ncbi_taxid(), division() and common_name() implemented by > Taxon. 
> > Methods species(), genus(), subspecies() and variant() no longer > get/set > elements in the classification array or store direct values. They are > implemented like: > Ask our tree for the taxon with rank() eq method name and set/get > the scientific_name of that. > Otherwise, for methods species() and genus() assume we are rank() > 'species', our parent taxon is rank() 'genus' and try again. For > subspecies() and variant(), fall back to old implementation (store > data > directly on self). > > binomial() prefers to simply return scientific_name() if we are a > Taxon > with rank() 'species' and the scientific_name is at least a 2 word > scalar. It interprets the 'FULL' option as wanting the trinomial name > and prefers to simply return scientific_name() if we have rank() > 'subspecies' or 'variant' and at least 3 word scalar. Failing these > two > cases, it falls back on the old implementation (build 'genus species' > from the classification), but with a little more intelligence to > try and > not duplicate names. > > # Behaviour changes > An indirect new behaviour is that the SeqIO modules will probably > return > ->species() as the real species name (eg. 'Homo sapiens'), not the > previously (and sometimes incorrectly) munged name (eg. 'sapiens'). > > # Notes > Stores a Bio::Tree::Tree on itself, had to remove its clean up methods > since they were in a CODE ref, preventing us from being frozen with > Storable. Will come up with a better solution in the future. > > > Bio::SeqIO::* > ------------- > A number of these modules make use of Bio::Species when parsing > taxonomic information. They probably all have/had problems. I've only > investigated genbank to any significant depth; the others need > to be properly tested to see if when they read taxonomic data in they > can output it again identically to the input file. It is probably the > case that some fail at this currently. (I simply don't have time > myself > to make all these modules perfect.) 
> > > Bio::SeqIO::bsml_sax > -------------------- > > # BUG-FIXES > It used to include the genus twice in the classification array of > Bio::Species object. Now it doesn't. > > > Bio::SeqIO::embl > ---------------- > > # BUG-FIXES > When the OC lines include the species name, the Bio::Species > classification array included the true species name as a rank above > genus and the real genus duplicated as a rank above that. Now it > doesn't. > > > Bio::SeqIO::genbank > ------------------- > > # BUG-FIXES > Now that Bio::Species isa Bio::Taxon, it is possible to ensure that > output of input matches the input (in the SOURCE and ORGANISM lines at > least). Usage of Bio::Species re-implemented to get all tests in > t/genbank.t to pass. > > > t/genbank.t > ----------- > Modified some tests to expect the correct answer, ie. > $bio_species_obj->species now expects 'Mus musculus', not 'musculus'. > > > t/Index.t > --------- > Modified some tests to expect the correct answer, ie. > $bio_species_obj->species now expects 'Homo sapiens', not 'sapiens'. > > > scripts/taxa/taxonomy2tree.PLS > ------------------------------ > Added some extra options to define the location of the database > indexes > and files, or use the entrez on-line database instead. (Note how > entrez > and flatfile are now truly interchangeable.) > > Reimplemented using the new Bio::Taxon system. Now much simpler. You > also get the correct answer, eg. 
instead of > (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo > sapiens")"Homo/Pan/Gorilla group")Hominidae)root; > you now get > (("Pongo pygmaeus",(Gorilla,"Pan troglodytes","Homo > sapiens")"Homo/Pan/Gorilla group")Hominidae)"cellular organisms"; > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Aug 7 03:01:35 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 08:01:35 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <4D709D39-0340-4184-8581-3468F29A83D0@uiuc.edu> References: <44D4BC52.30203@sendu.me.uk> <4D709D39-0340-4184-8581-3468F29A83D0@uiuc.edu> Message-ID: <44D6E54F.8090201@sendu.me.uk> Chris Fields wrote: >> http://en.wikipedia.org/wiki/Species and >> http://en.wikipedia.org/wiki/Binomial_nomenclature. "The name of >> the species is the whole binomial, not just the second term (which >> may be called specific epithet, for plants, or specific name, for >> animals)". >> >> We can't have a method for the 'specific name' because we have no >> way of always correctly working out what that is. The NCBI taxonomy >> database doesn't tell us, and neither do the various sequence file >> formats. > > The previous use of species() in Bio::Species fits that definition, > in that the species() method originally gave only the species > descriptor (one name), NOT the binomial name, which is given by > binomial(). Similarly, genus() gave only the genus name. Why have a > genus() or binomial() at all if you get the entire name via > species()? We don't need species() either. This is about backward compatibility. Though... 
from that point of view I suppose it makes sense to have species() behave the same way as it used to.

> So, is there a correct definition of 'species'?

Yes, I should think it's the one Wikipedia gives.

[...]
> "Since the advent of the theory of evolution, the conception of
> species has undergone vast changes in biology; however no consensus
> on the definition of the word has yet been reached."
>
> Seems ambiguous to me.

No, the concept of species has undergone vast changes with respect to how you group different organisms into the same species (or not), not in how you define the name of a species.

> It's my firm belief, though, using species() in this way will
> generate more confusion than it's worth. I'll leave it to you to
> answer the confused emails from bioperl users who don't expect this.

Well, that would be fine.

It's a simple choice, then, between a) having species() always return the correct answer, even though that is something quite different from before (the binomial), or guessing what the descriptor is and sometimes getting it wrong, which is the old way. If we pick the latter, we have the further choice between b) guessing in exactly the same way as before for pure backward compatibility, or c) guessing in a new way so that we're wrong less of the time.

All 3 ways are fine by me - I'll go with the consensus choice. Chris, your vote would be for b)?

From bix at sendu.me.uk Mon Aug 7 03:26:09 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 07 Aug 2006 08:26:09 +0100
Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul
In-Reply-To: <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net>
References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net>
Message-ID: <44D6EB11.30903@sendu.me.uk>

Hilmar Lapp wrote:
> 1) It sounds a bit that you changed the behavior of get_lca() such
> that users may have to adjust their code?
If this is true, then this
> needs to be made clear in the 1.6 release as that part will not be
> backward compatible. If this is not true, then why did you have to
> change the implementation of Bio::Tools::Phylo::PAML to make tests
> pass? I.e., to what extent can what broke Bio::Tools::Phylo::PAML
> also break someone's script?

I can say that it /should/ have given the same results, but clearly it didn't. What I had to change in PAML was the way in which PAML found the lca of multiple nodes; it had its own algorithm for that, that used get_lca 2 nodes at a time. Now it just calls get_lca once, supplying all the nodes in one go.

I don't think I spent any time trying to figure out the problem, I just made the change:

< while( @nodes_L > 1 ) {
<     my $lca = $tree->get_lca
<         (-nodes => [shift @nodes_L,
<                     shift @nodes_L]);
<     push @nodes_L, $lca;
< }
< my $n = shift @nodes_L;
---
> my $n = @nodes_L < 2 ? shift(@nodes_L) : $tree->get_lca(@nodes_L);

I'll look into it and see if I can avoid any behaviour change.

> 2) I can't find object_id() on Tree::Node or Taxonomy::Taxon. Where
> is/was it? The reason I am asking is that this method is part of the
> Bio::IdentifiableI API and therefore if you want to deprecate it you
> are suggesting to deprecate implementing Bio::IdentifiableI, and the
> rest of those methods need to be deprecated along.

Ah, I didn't notice that. Well there's no need to deprecate it then; it can remain a permanent synonym of id().

(Though, when the DB modules create Bio::Taxon objects, they don't actually use any of the other IdentifiableI methods.)

> 3) Your whole email should probably go on the wiki, linked somewhere
> under documentation or release notes. Or somebody has a better idea?

I'm not really sure how to go about that in any case. Is there a wiki page that gives advice on making new wiki pages? (Both technically and in terms of what should be on the page, its style, what links to it, where it should live etc.)
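
An aside on the change quoted above, which swaps a pairwise reduction for a single get_lca() call. What follows is a minimal sketch in plain Perl, not BioPerl: the toy tree and the helpers lineage(), lca() and lca_pairwise() are hypothetical and only stand in for the Bio::Tree::TreeFunctionsI methods. It assumes, as the follow-up message in this thread suggests, that the old and new get_lca() differ in whether a node counts as its own ancestor, and it shows why a pairwise loop would only agree with the all-at-once call under the old, inclusive rule.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy tree as child => parent pointers (hypothetical data, not BioPerl objects):
#
#        root
#       /    \
#      A      E
#    / | \
#   B  C  D
my %parent = (A => 'root', E => 'root', B => 'A', C => 'A', D => 'A');

# Path from a node up to the root. When $inclusive is true the node itself
# heads the path, i.e. a node counts as its own ancestor.
sub lineage {
    my ($inclusive, $node) = @_;
    my @path = $inclusive ? ($node) : ();
    while (defined(my $up = $parent{$node})) {
        push @path, $up;
        $node = $up;
    }
    return @path;
}

# Lowest common ancestor of any number of nodes: the first entry on the
# first node's path that also lies on every other node's path.
sub lca {
    my ($inclusive, $first, @rest) = @_;
    my @others;
    for my $node (@rest) {
        my %on_path = map { $_ => 1 } lineage($inclusive, $node);
        push @others, \%on_path;
    }
    for my $candidate (lineage($inclusive, $first)) {
        my $missing = grep { !$_->{$candidate} } @others;
        return $candidate unless $missing;
    }
    return;
}

# Pairwise reduction in the style of the removed loop: repeatedly replace
# two nodes by their lca until a single node is left.
sub lca_pairwise {
    my ($inclusive, @nodes) = @_;
    while (@nodes > 1) {
        my $lca = lca($inclusive, shift @nodes, shift @nodes);
        push @nodes, $lca;
    }
    return shift @nodes;
}

# Compare both strategies for the three siblings B, C and D.
my @leaves = qw(B C D);
for my $inclusive (1, 0) {
    printf "%-9s all-at-once=%s pairwise=%s\n",
        $inclusive ? 'inclusive' : 'strict',
        lca($inclusive, @leaves),
        lca_pairwise($inclusive, @leaves);
}
```

With the inclusive rule both strategies return A. With the strict rule the all-at-once call still returns A, but the pairwise loop returns root: it first reduces B and C to A, then asks for the lca of D and A, and A is no longer allowed to be its own ancestor. The sketch is only meant to make that difference concrete; it takes no position on which rule get_lca() ought to follow.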
From bix at sendu.me.uk Mon Aug 7 04:38:58 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 09:38:58 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D6EB11.30903@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> Message-ID: <44D6FC22.6030102@sendu.me.uk> Sendu Bala wrote: > Hilmar Lapp wrote: >> 1) It sounds a bit that you changed the behavior of get_lca() such >> that users may have to adjust their code? If this is true, then this >> needs to be made clear in the 1.6 release as that part will not be >> backward compatible. If this is not true, then why did you have to >> change the implementation of Bio::Tools::Phylo::PAML to make tests >> pass? I.e., to what extent can what broke Bio::Tools::Phylo::PAML >> also break someone's script? > > I can say that it /should/ have given the same results, but clearly it > didn't. What I had to change in PAML was the way in which PAML found the > lca of multiple nodes; it had its own algorithm for that, that used > get_lca 2 nodes at a time. Now it just calls get_lca once, supplying all > the nodes in one go. > > I don't think I spent any time trying to figure out the problem, I just > made the change: > > < while( @nodes_L > 1 ) { > < my $lca = $tree->get_lca > < (-nodes => [shift @nodes_L, > < shift @nodes_L]); > < push @nodes_L, $lca; > < } > < my $n = shift @nodes_L; > --- > > my $n = @nodes_L < 2 ? shift(@nodes_L) : $tree->get_lca(@nodes_L); > > I'll look into it and see if I can avoid any behaviour change. Oh yes, I remember now. get_lca() used to consider an input node as a possible ancestor of itself, which is how the algorithm in PAML worked. So there will be a behaviour change - now get_lca really does only get the lowest common ancestor of input nodes, which necessarily can't be any of the input nodes themselves. I'd call the old behaviour a bug that has now been fixed. 
(Though the code had a comment to the effect that it was a quite deliberate choice on the part of the author.)

Ah, I just realised that the PAML algorithm is on the wiki, so many people may have used it:
http://www.bioperl.org/wiki/HOWTO:Trees#Bio::Tree::TreeFunctionsI
The old get_lca behaviour was probably there purely to allow this convergence to work. I'll have to edit that page along the lines of 'to get the lca of multiple nodes you used to have to do ..., but now you do ...'. When would I make that edit? After I commit, or when 1.6 comes out?

From bix at sendu.me.uk Mon Aug 7 05:31:50 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 07 Aug 2006 10:31:50 +0100
Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO
Message-ID: <44D70886.2080603@sendu.me.uk>

Hi,
I was previously trying to tidy up the test suite to remove all failures, skips and warnings. I managed to reduce it quite a bit, but am left with 3 main problem scripts.

Failed Test    Total  Fail  Failed  List of Failed
------------------------------------------------
t/protgraph.t     66    23  34.85%  11 13 20-21 26 33 36-37 45 48-56 59-61 65-66

I don't know enough about it to know why it's failing or what the answers are really supposed to be. Does it fail for other people? Has it ever worked in the past?

t/entrezgene.................ok 3/1003Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469.
t/entrezgene.................ok 509/1003Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469.
Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469.
t/entrezgene.................ok 824/1003Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469.

Does anyone have the time to re-implement entrezgene.pm to not use pseudo-hashes?
t/FeatureIO..................ok 2/22 -------------------- WARNING --------------------- MSG: '##feature-ontology' directive handling not yet implemented --------------------------------------------------- -------------------- WARNING --------------------- MSG: '##attribute-ontology' directive handling not yet implemented --------------------------------------------------- -------------------- WARNING --------------------- MSG: '##source-ontology' directive handling not yet implemented --------------------------------------------------- Is anyone planning to implement those things? Is it at least possible to do so? Cheers, Sendu. From cjfields at uiuc.edu Mon Aug 7 08:41:22 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 07:41:22 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> Message-ID: <0F07040D-74EE-4E07-91FD-A392E8B995B5@uiuc.edu> Sendu's been documenting this under Bugzilla (Bug 2061), but the wiki may be a better place. The changes are extensive. The only part I don't agree with is the use of species() to return a binomial name, which I already responded to (and don't plan dragging out). I don't use Bio::Tools::Phylo::PAML or the Tree modules so I can't give much input there, but I would suggest that changes there need to be listed in a separate post. Lots of people use Bio::Tools::Phylo::PAML so there may be some who might not agree. Chris On Aug 6, 2006, at 2:38 PM, Hilmar Lapp wrote: > I have three comments. > > 1) It sounds a bit that you changed the behavior of get_lca() such > that users may have to adjust their code? If this is true, then this > needs to be made clear in the 1.6 release as that part will not be > backward compatible. If this is not true, then why did you have to > change the implementation of Bio::Tools::Phylo::PAML to make tests > pass? 
I.e., to what extent can what broke Bio::Tools::Phylo::PAML > also break someone's script? > > 2) I can't find object_id() on Tree::Node or Taxonomy::Taxon. Where > is/was it? The reason I am asking is that this method is part of the > Bio::IdentifiableI API and therefore if you want to deprecate it you > are suggesting to deprecate implementing Bio::IdentifiableI, and the > rest of those methods need to be deprecated along. > > 3) Your whole email should probably go on the wiki, linked somewhere > under documentation or release notes. Or somebody has a better idea? > > -hilmar Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Aug 7 08:47:09 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 08:47:09 -0400 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <44D70886.2080603@sendu.me.uk> References: <44D70886.2080603@sendu.me.uk> Message-ID: Don't worry too much about the warnings. They are annoying and contribute to screen clutter, and especially the entrezgene one raises concerns. The FeatureIO warnings just point out what they say I believe, and yes those things are implementable I suppose. -hilmar On Aug 7, 2006, at 5:31 AM, Sendu Bala wrote: > Hi, > I was previously trying to tidy up the test suite to remove all > failures, skips and warnings. I managed to reduce it quite a bit, > but am > left with 3 main problem scripts. > > > Failed Test Total Fail Failed List of Failed > ------------------------------------------------ > t/protgraph.t 66 23 34.85% 11 13 20-21 26 33 36-37 45 48-56 > 59-61 65-66 > > I don't know enough about it to know why its failing or what the > answers > are really supposed to be. Does it fail for other people? Has it ever > worked in the past? > > > t/entrezgene.................ok 3/1003Pseudo-hashes are deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. 
> t/entrezgene.................ok 509/1003Pseudo-hashes are > deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. > Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469. > t/entrezgene.................ok 824/1003Pseudo-hashes are > deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. > > Does anyone have the time to re-implement entrezgene.pm to not use > pseudo-hashes? > > > t/FeatureIO..................ok 2/22 > > -------------------- WARNING --------------------- > MSG: '##feature-ontology' directive handling not yet implemented > --------------------------------------------------- > > -------------------- WARNING --------------------- > MSG: '##attribute-ontology' directive handling not yet implemented > --------------------------------------------------- > > -------------------- WARNING --------------------- > MSG: '##source-ontology' directive handling not yet implemented > --------------------------------------------------- > > Is anyone planning to implement those things? Is it at least > possible to > do so? > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Aug 7 08:43:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 08:43:01 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D6EB11.30903@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> Message-ID: <7A51C10E-73B4-4131-B89D-1E9357A9E99F@gmx.net> On Aug 7, 2006, at 3:26 AM, Sendu Bala wrote: > Though, when the DB modules create Bio::Taxon objects, they don't > actually use any of the other IdentifiableI methods. 
The Bio::IdentifiableI methods are typically used by client programs for reading, seldom by bioperl core modules for setting a value; they're almost always synonyms to the module's 'native' (but inconsistently named) identifiability-related methods.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Mon Aug 7 08:52:38 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 7 Aug 2006 08:52:38 -0400
Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul
In-Reply-To: <44D6FC22.6030102@sendu.me.uk>
References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk>
Message-ID:

On Aug 7, 2006, at 4:38 AM, Sendu Bala wrote:

> So there will be a behaviour change - now get_lca really does only get
> the lowest common ancestor of input nodes, which necessarily can't be
> any of the input nodes themselves.
>
> I'd call the old behaviour a bug that has now been fixed. (Though the
> code had a comment to the effect that it was a quite deliberate choice
> on the part of the author.)

I'm inclined to call the new behavior a bug. Why would the lca between node A and node B not be defined, or be an ancestor node of A instead of A itself? Likewise, why would the lca between a node A and its child node B not be A but instead an ancestor node of A?

Or am I missing something?
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Aug 7 08:51:54 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 07:51:54 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D6FC22.6030102@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk> Message-ID: <020AF7B9-C7A8-4EC3-B4D1-6993D72BE06B@uiuc.edu> Any changes to the modules can be posted as updates or addenda on the Module page, and the sooner the better. An example is the continual comments left for the RemoteBlast tool: http://www.bioperl.org/wiki/Module:Bio::Tools::Run::RemoteBlast There are a few users who seem to use the wiki extensively and don't use the list. Of course there are those who would rather just post a question on the list w/o bothering to look up the info on the wiki. Jason has a style list for adding links and other 'pretty stuff' (markup): http://www.bioperl.org/wiki/BioPerl:Style_guide Also note the project priority list, make changes as needed there as well: http://www.bioperl.org/wiki/Project_priority_list Chris On Aug 7, 2006, at 3:38 AM, Sendu Bala wrote: > Sendu Bala wrote: >> Hilmar Lapp wrote: >>> 1) It sounds a bit that you changed the behavior of get_lca() such >>> that users may have to adjust their code? If this is true, then this >>> needs to be made clear in the 1.6 release as that part will not be >>> backward compatible. If this is not true, then why did you have to >>> change the implementation of Bio::Tools::Phylo::PAML to make tests >>> pass? I.e., to what extent can what broke Bio::Tools::Phylo::PAML >>> also break someone's script? >> >> I can say that it /should/ have given the same results, but >> clearly it >> didn't. 
What I had to change in PAML was the way in which PAML >> found the >> lca of multiple nodes; it had its own algorithm for that, that used >> get_lca 2 nodes at a time. Now it just calls get_lca once, >> supplying all >> the nodes in one go. >> >> I don't think I spent any time trying to figure out the problem, I >> just >> made the change: >> >> < while( @nodes_L > 1 ) { >> < my $lca = $tree->get_lca >> < (-nodes => [shift @nodes_L, >> < shift @nodes_L]); >> < push @nodes_L, $lca; >> < } >> < my $n = shift @nodes_L; >> --- >>> my $n = @nodes_L < 2 ? shift(@nodes_L) : $tree->get_lca(@nodes_L); >> >> I'll look into it and see if I can avoid any behaviour change. > > Oh yes, I remember now. get_lca() used to consider an input node as a > possible ancestor of itself, which is how the algorithm in PAML > worked. > > So there will be a behaviour change - now get_lca really does only get > the lowest common ancestor of input nodes, which necessarily can't be > any of the input nodes themselves. > > I'd call the old behaviour a bug that has now been fixed. (Though the > code had a comment to the effect that it was a quite deliberate choice > on the part of the author.) > > > Ah, I just realised that the PAML algorithm is on the wiki, so many > people may have use it: > http://www.bioperl.org/wiki/HOWTO:Trees#Bio::Tree::TreeFunctionsI > The old get_lca behaviour was probably there purely to allow this > convergence to work. I'll have to edit that page along the lines of > 'to > get the lca of multiple nodes you used to have to do ..., but now > you do > ...'. When would I make that edit? After I commit, or when 1.6 > comes out? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Aug 7 08:54:33 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 08:54:33 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <0F07040D-74EE-4E07-91FD-A392E8B995B5@uiuc.edu> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <0F07040D-74EE-4E07-91FD-A392E8B995B5@uiuc.edu> Message-ID: <667DC275-D22C-4AAB-88FE-5173813660E4@gmx.net> On Aug 7, 2006, at 8:41 AM, Chris Fields wrote: > The only part I don't agree with is the use of species() to return a > binomial name, which I already responded to (and don't plan dragging > out). Maybe I missed that? This is not about $seq->species() but a species () method on another object? If it's on a sequence object it's clearly a bad idea due to the API change, but I thought that's not the plan. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Aug 7 09:04:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 08:04:25 -0500 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <44D70886.2080603@sendu.me.uk> References: <44D70886.2080603@sendu.me.uk> Message-ID: <6BFD6C0A-31ED-44A5-AE98-EB1C3A64061A@uiuc.edu> Brian already knows about the 'pseudohash' issue with protgraph. I think it's a weird ref call or construct that tripping the perl warnings. I don't think perl 5.6 does this; the pseudohash warning was added in perl 5.8. I really wouldn't worry about the other tests throwing warnings. You have your hands full with Taxonomy now; focus on getting that up and running before you move on to other ventures. 
Chris On Aug 7, 2006, at 4:31 AM, Sendu Bala wrote: > Hi, > I was previously trying to tidy up the test suite to remove all > failures, skips and warnings. I managed to reduce it quite a bit, > but am > left with 3 main problem scripts. > > > Failed Test Total Fail Failed List of Failed > ------------------------------------------------ > t/protgraph.t 66 23 34.85% 11 13 20-21 26 33 36-37 45 48-56 > 59-61 65-66 > > I don't know enough about it to know why its failing or what the > answers > are really supposed to be. Does it fail for other people? Has it ever > worked in the past? > > > t/entrezgene.................ok 3/1003Pseudo-hashes are deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. > t/entrezgene.................ok 509/1003Pseudo-hashes are > deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. > Pseudo-hashes are deprecated at /.../Bio/SeqIO/entrezgene.pm line 469. > t/entrezgene.................ok 824/1003Pseudo-hashes are > deprecated at > /.../Bio/SeqIO/entrezgene.pm line 469. > > Does anyone have the time to re-implement entrezgene.pm to not use > pseudo-hashes? > > > t/FeatureIO..................ok 2/22 > > -------------------- WARNING --------------------- > MSG: '##feature-ontology' directive handling not yet implemented > --------------------------------------------------- > > -------------------- WARNING --------------------- > MSG: '##attribute-ontology' directive handling not yet implemented > --------------------------------------------------- > > -------------------- WARNING --------------------- > MSG: '##source-ontology' directive handling not yet implemented > --------------------------------------------------- > > Is anyone planning to implement those things? Is it at least > possible to > do so? > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Aug 7 09:22:47 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 08:22:47 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <667DC275-D22C-4AAB-88FE-5173813660E4@gmx.net> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <0F07040D-74EE-4E07-91FD-A392E8B995B5@uiuc.edu> <667DC275-D22C-4AAB-88FE-5173813660E4@gmx.net> Message-ID: That was about $seq->species->species() (i.e. Bio::Species). The old method returns the species name (e.g. 'subtilis'). The genus() method returns 'Bacillus', and binomial() returns 'Bacillus subtilis'. Even if we ended up removing genus() and binomial(), the use of species() here will be tainted by its past use for returning the species descriptor, so changing the way it behaves (returning the binomial) is really an API change. My thought was to tie genus(), species(), and binomial() to DB lookups and have them return the previous information. But that's also really an API change, and is it really necessary? I thought the point of all this was to get rid of Bio::Species and have a much cleaner container object take its place (Bio::Taxon). The lineage would be stored in Sendu's Bio::DB::Taxonomy::list object, the organelle() moved to RichSeq (it's not involved in taxonomic data), and the common names and scientific name added to Bio::Taxon. So, Bio::Taxon could be used for writing output. Bio::Species would be retained for the API and gradually phased out.
So, using your suggestion, a Bio::Taxon object would be returned here: $seq->taxon->scientific_name(); and a (soon to be deprecated) Bio::Species object would be returned here: $seq->species->common_name(); Chris On Aug 7, 2006, at 7:54 AM, Hilmar Lapp wrote: > > On Aug 7, 2006, at 8:41 AM, Chris Fields wrote: > >> The only part I don't agree with is the use of species() to return a >> binomial name, which I already responded to (and don't plan dragging >> out). > > Maybe I missed that? This is not about $seq->species() but a species > () method on another object? If it's on a sequence object it's > clearly a bad idea due to the API change, but I thought that's not > the plan. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Aug 7 09:57:19 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 14:57:19 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk> Message-ID: <44D746BF.4070908@sendu.me.uk> Hilmar Lapp wrote: > > On Aug 7, 2006, at 4:38 AM, Sendu Bala wrote: > >> So there will be a behaviour change - now get_lca really does only get >> the lowest common ancestor of input nodes, which necessarily can't be >> any of the input nodes themselves. >> >> I'd call the old behaviour a bug that has now been fixed. (Though the >> code had a comment to the effect that it was a quite deliberate choice >> on the part of the author.) > > I inclined to call the new behavior a bug. 
Why would the lca between > node A and node B not be defined, or be an ancestor node of A instead of > A itself? [...] > > Or am I missing something? Well, it's the 'lowest common ancestor', isn't it? How can the ancestor of something be itself? I'm interested that you think that the LCA has to be defined; the original implementation makes the same assumption in its comments. Consider two lineages:
A---B---C---D---E
X---Y---Z---D---F
The old implementation would not only expect that E and F have an lca, but return the answer D, which is wrong. E and F do not have a common ancestor; their direct ancestors just happen to have the same descriptor. (In typical usage it was probably never wrong, since the descriptors used were script-unique, and unchangeable without using an internal method.) Or more obviously:
A--B
X--C
B and C do not have an lca. I think there is the assumption that both nodes being compared belong to the same properly constructed tree, but you don't even need a Tree object to use get_lca().
my $lca = Bio::Tree::TreeFunctionsI->get_lca(@nodes);
(Not that I'd suggest anyone do that.) > Likewise, why would the lca between a node A and its child > node B not be A but instead an ancestor node of A? I could certainly be wrong, I just can't find anything authoritative that explicitly states the correct answer to that either way. For example http://66.102.9.104/search?q=cache:5cDyg4Um8GEJ:dept-info.labri.fr/~gavoille/article/AGKR02 defines the nearest common ancestor (== lowest common ancestor) like: Let T be a rooted tree. A node x ∈ T is an ancestor of a node y ∈ T if the path from the root of T to y goes through x. A node v ∈ T is a common ancestor of x and y if it is an ancestor of both x and y. The nearest common ancestor, nca, of two nodes x, y is the common ancestor of x and y whose distance to x (and to y) is smaller than the distance to x of any other common ancestor of x and y. According to that, v cannot be x or y.
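The two readings of 'lowest common ancestor' under discussion in this thread can be compared with a minimal sketch in core Perl. This is not the Bio::Tree::TreeFunctionsI implementation of get_lca(); the %parent map, the node names, and the lineage() and lca() helpers are all hypothetical and exist only for illustration. The first argument to lca() selects whether a node counts as its own ancestor.

```perl
use strict;
use warnings;

# Hypothetical toy tree stored as a child => parent map (not a Bio::Tree object):
#        root
#       /    \
#      A      X
#     / \
#    B   C
my %parent = (A => 'root', X => 'root', B => 'A', C => 'A');

# Path from a node up to the root, nearest ancestor first.
# With $include_self true, the node is treated as an ancestor of itself.
sub lineage {
    my ($node, $include_self) = @_;
    my @path = $include_self ? ($node) : ();
    while (defined(my $p = $parent{$node})) {
        push @path, $p;
        $node = $p;
    }
    return @path;
}

# Lowest common ancestor of one or more nodes under either definition.
# Returns undef when the nodes share no ancestor.
sub lca {
    my ($include_self, @nodes) = @_;
    my @candidates = lineage(shift @nodes, $include_self);
    for my $node (@nodes) {
        my %seen = map { $_ => 1 } lineage($node, $include_self);
        @candidates = grep { $seen{$_} } @candidates;
    }
    return $candidates[0];    # nearest surviving candidate, or undef
}

# A and its child B: the answer depends on the definition chosen.
print "inclusive: ", lca(1, 'A', 'B'), "\n";    # inclusive: A
print "strict:    ", lca(0, 'A', 'B'), "\n";    # strict:    root

# The root and any other node: no answer under the strict reading.
print "root case: ", (defined lca(0, 'root', 'B') ? 'defined' : 'undefined'), "\n";
```

For siblings such as B and C both readings give A; they differ only when one input node lies on the lineage of another, which is the case the PAML loop relied on.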
From bix at sendu.me.uk Mon Aug 7 10:04:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 15:04:10 +0100 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <6BFD6C0A-31ED-44A5-AE98-EB1C3A64061A@uiuc.edu> References: <44D70886.2080603@sendu.me.uk> <6BFD6C0A-31ED-44A5-AE98-EB1C3A64061A@uiuc.edu> Message-ID: <44D7485A.6070209@sendu.me.uk> Chris Fields wrote: > Brian already knows about the 'pseudohash' issue with protgraph. With entrezgene you mean? Do you get the same test fails for protgraph on Windows that I get using linux? > I think it's a weird ref call or construct that tripping the perl > warnings. I don't think perl 5.6 does this; the pseudohash warning > was added in perl 5.8. > > I really wouldn't worry about the other tests throwing warnings. You > have your hands full with Taxonomy now; focus on getting that up and > running before you move on to other ventures. As far as I'm concerned I'm more or less finished with Taxonomy. I'm just waiting for your confirmed 'vote' on what species() should return, and at least one other vote. Might also be waiting to find out how get_lca should really work as well. But in any case, I'm just waiting, which isn't that hard ;) FYI I'm actually moving on to the previously mooted hmmpfam plugin for SearchIO. From osborne1 at optonline.net Mon Aug 7 10:04:17 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 07 Aug 2006 10:04:17 -0400 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <44D70886.2080603@sendu.me.uk> Message-ID: Sendu, This one was discussed in bioperl-l a couple of months ago. All tests pass for me on Mac OS X 10.4.6, perl 5.8.6, bioperl-live. I'll take a closer look at the individual tests, see if I can find a pattern... Brian O. 
On 8/7/06 5:31 AM, "Sendu Bala" wrote: > Failed Test Total Fail Failed List of Failed > ------------------------------------------------ > t/protgraph.t 66 23 34.85% 11 13 20-21 26 33 36-37 45 48-56 > 59-61 65-66 > > I don't know enough about it to know why its failing or what the answers > are really supposed to be. Does it fail for other people? Has it ever > worked in the past? From osborne1 at optonline.net Mon Aug 7 10:08:05 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 07 Aug 2006 10:08:05 -0400 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <6BFD6C0A-31ED-44A5-AE98-EB1C3A64061A@uiuc.edu> Message-ID: Chris and Sendu, Right. The author of entrezgene.pm is _not_ using pseudohashes but perl thinks they're being used. I suppose one could hack something here but it's not entrezgene.pm that's at fault. Brian O. On 8/7/06 9:04 AM, "Chris Fields" wrote: > Brian already knows about the 'pseudohash' issue From RScavett at uni-koeln.de Mon Aug 7 09:58:39 2006 From: RScavett at uni-koeln.de (Rick Scavetta) Date: Mon, 07 Aug 2006 15:58:39 +0200 Subject: [Bioperl-l] Database Retrieval Message-ID: Hello Programmers, I have a list of mouse GeneIDs for which I have extracted the RefSeqs. With these accession numbers I want to know the three closest upstream and downstream genes (and their orientation, if possible) for my gene of interest. Is there some way of finding this out? Any suggestions? I would also like to know something about the expression of a particular gene of interest, e.g. by querying the Novartis Gene Atlas (http://symatlas.gnf.org/SymAtlas/). Is there a module for handling submissions and retrieving results of this sort? Thanks! Rick Scavetta -- Rick Scavetta Department of Genetics Evolutionary Genetics Rm 2.07 Zuelpicher Str.
47 50674 Cologne Germany http://www.genetik.uni-koeln.de/groups/Tautz/meg/ Tel.: +49-221-470-3402 Fax: +49-221-470-5975 From cjfields at uiuc.edu Mon Aug 7 10:35:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 09:35:25 -0500 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: <44D7485A.6070209@sendu.me.uk> References: <44D70886.2080603@sendu.me.uk> <6BFD6C0A-31ED-44A5-AE98-EB1C3A64061A@uiuc.edu> <44D7485A.6070209@sendu.me.uk> Message-ID: <3DC61611-4825-4CA3-B5AB-A9D81CC7A065@uiuc.edu> On Aug 7, 2006, at 9:04 AM, Sendu Bala wrote: > Chris Fields wrote: >> Brian already knows about the 'pseudohash' issue with protgraph. > > With entrezgene you mean? Yes, sorry. Lack of coffee. > Do you get the same test fails for protgraph on Windows that I get > using > linux? The tests all pass on Mac OS X; I'll try Windows today, but if memory serves there were problems with this a few months back. Okay, found it on the gmane lists: http://thread.gmane.org/gmane.comp.lang.perl.bio.general/11137/focus=11311 This part of the thread details the issues. Linux and WinXP fail similar tests, Mac OS X (Tiger) passes (Heikki and I ran the tests). Never resolved. My feeling: could be OS-specific, could be Clone.pm, could be perl version. > >> I think it's a weird ref call or construct that tripping the perl >> warnings. I don't think perl 5.6 does this; the pseudohash warning >> was added in perl 5.8. >> >> I really wouldn't worry about the other tests throwing warnings. You >> have your hands full with Taxonomy now; focus on getting that up and >> running before you move on to other ventures. > > As far as I'm concerned I'm more or less finished with Taxonomy. I'm > just waiting for your confirmed 'vote' on what species() should > return, > and at least one other vote. > > Might also be waiting to find out how get_lca should really work as > well.
But in any case, I'm just waiting, which isn't that hard ;) > > FYI I'm actually moving on to the previously mooted hmmpfam plugin for > SearchIO. Cool! The only problem I saw with the way you wanted to go about it was you would have to preparse the report since everything is reported as 1 hsp/hit. They usually aren't long, so that shouldn't be a problem. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From sdavis2 at mail.nih.gov Mon Aug 7 10:41:02 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 07 Aug 2006 10:41:02 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: Message-ID: On 8/7/06 9:58 AM, "Rick Scavetta" wrote: > Hello Programmers, > > I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. > With these accession numbers I want to know what are the three closest > upstream and downstream genes (and orientation, if possible) to my gene of > interest. Is these some way of finding this out? Any suggestions? I would look at using the UCSC genome browser data. You can download the refGene table, order it by chromosome and start, then do whatever manipulation you like. Your problem can probably be solved with perl, but if you have experience using R (the statistics software), using it may be more straightforward. You could also potentially do this using SQL queries of the UCSC MySQL database. Is there a reason to use "number of genes"? If you can simply restrict to a number of base pairs, the problem can (probably) be solved using only the UCSC table browser. > Also, I would also like to know something about the expression of a > particular gene of interest. e.g. ba querying the Novartis Gene Atlas > (http://symatlas.gnf.org/SymAtlas/). Is there a module for handling > submissions and retrieving results of this sort? 
No, but, again, UCSC hosts these data and they have nice query tools, a publicly-accessible MySQL database containing the information, and downloadable tables for the data. From cjfields at uiuc.edu Mon Aug 7 10:44:25 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 09:44:25 -0500 Subject: [Bioperl-l] t/entrezgene, t/protgraph and t/FeatureIO In-Reply-To: References: Message-ID: It's something to do with perl 5.8. I think this is the version where the pseudohash deprecation warning was added. The only problem I see here is a potential compatibility issue with entrezgene.pm and the next major perl release, which will not run code which it thinks contains pseudohashes. I ran into a similar issue not too long ago when dereferencing something and perl complained about pseudohashes. Can't remember what I did to make it stop complaining, but it had something to do with the way I dereferenced the data... Chris On Aug 7, 2006, at 9:08 AM, Brian Osborne wrote: > Chris and Sendu, > > Right. The author of entrezgene.pm is _not_ using pseudohashes but > perl > thinks they're being used. I suppose one could hack something here > but it's > not entrezgene.pm that's at fault. > > Brian O. > > > On 8/7/06 9:04 AM, "Chris Fields" wrote: > >> Brian already knows about the 'pseudohash' issue > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From sdavis2 at mail.nih.gov Mon Aug 7 11:16:45 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 07 Aug 2006 11:16:45 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: Message-ID: On 8/7/06 10:41 AM, "Sean Davis" wrote: > > > > On 8/7/06 9:58 AM, "Rick Scavetta" wrote: > >> Hello Programmers, >> >> I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. >> With these accession numbers I want to know what are the three closest >> upstream and downstream genes (and orientation, if possible) to my gene of >> interest. Is these some way of finding this out? Any suggestions? > > I would look at using the UCSC genome browser data. You can download the > refGene table, order it by chromosome and start, then do whatever > manipulation you like. Your problem can probably be solved with perl, but > if you have experience using R (the statistics software), using it may be > more straightforward. You could also potentially do this using SQL queries > of the UCSC MySQL database. > > Is there a reason to use "number of genes"? If you can simply restrict to a > number of base pairs, the problem can (probably) be solved using only the > UCSC table browser. > >> Also, I would also like to know something about the expression of a >> particular gene of interest. e.g. ba querying the Novartis Gene Atlas >> (http://symatlas.gnf.org/SymAtlas/). Is there a module for handling >> submissions and retrieving results of this sort? > > No, but, again, UCSC hosts these data and they have nice query tools, a > publicly-accessible MySQL database containing the information, and > downloadable tables for the data. 
Oh, and I should have mentioned that NCBI GEO hosts these data, which can be downloaded as a "spreadsheet" from here: http://www.ncbi.nlm.nih.gov/geo/gds/gds_browse.cgi?gds=596 Sean From cuiw at ncbi.nlm.nih.gov Mon Aug 7 11:25:48 2006 From: cuiw at ncbi.nlm.nih.gov (Cui, Wenwu (NIH/NLM/NCBI) [C]) Date: Mon, 7 Aug 2006 11:25:48 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: Message-ID: <18C407FD4FFB424292D769FBD68C1987C7C394@NIHCESMLBX8.nih.gov> Hello, Rick: Novartis does provide a flat text file for these data. You can simply load it to either R or MySQL. If you want, I can share my scripts with you. Wenwu -----Original Message----- From: Davis, Sean (NIH/NCI) [E] Sent: Monday, August 07, 2006 10:41 AM To: Rick Scavetta; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] Database Retrieval On 8/7/06 9:58 AM, "Rick Scavetta" wrote: > Hello Programmers, > > I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. > With these accession numbers I want to know what are the three closest > upstream and downstream genes (and orientation, if possible) to my gene of > interest. Is these some way of finding this out? Any suggestions? I would look at using the UCSC genome browser data. You can download the refGene table, order it by chromosome and start, then do whatever manipulation you like. Your problem can probably be solved with perl, but if you have experience using R (the statistics software), using it may be more straightforward. You could also potentially do this using SQL queries of the UCSC MySQL database. Is there a reason to use "number of genes"? If you can simply restrict to a number of base pairs, the problem can (probably) be solved using only the UCSC table browser. > Also, I would also like to know something about the expression of a > particular gene of interest. e.g. ba querying the Novartis Gene Atlas > (http://symatlas.gnf.org/SymAtlas/). 
Is there a module for handling > submissions and retrieving results of this sort? No, but, again, UCSC hosts these data and they have nice query tools, a publicly-accessible MySQL database containing the information, and downloadable tables for the data. _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Aug 7 11:51:19 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 16:51:19 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <44D76177.1090901@sendu.me.uk> Rick Scavetta wrote: > Hello Programmers, > > I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. > With these accession numbers I want to know what are the three closest > upstream and downstream genes (and orientation, if possible) to my gene of > interest. Is these some way of finding this out? Any suggestions? One possible way would be with the Ensembl Perl API: http://www.ensembl.org/info/software/core/core_tutorial.html You'd get a gene or transcript adaptor and use fetch_all_by_external_name() iirc. Then probably your slice adaptor would be used to get the upstream and downstream regions and iirc there's some easy way to ask for all the genes in a slice. Sean's solution (use UCSC) seems a lot easier, and since you just want a list of names as a result, not bioperl objects, I'd go for that. Actually, would there be any interest in a bioperl interface to the UCSC tables? It could probably be done using their DAS server, and end up as a very easy-to-use alternative to the Ensembl API for limited or one-off queries.
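The table-based approach suggested in this thread (order a gene table by chromosome and start, then read off the neighbors) can be sketched in core Perl. This is only an illustration under stated assumptions: the @genes rows, their column order, and the neighbors() helper are hypothetical, and a real run would read a downloaded gene table instead of the inline data.

```perl
use strict;
use warnings;

# Hypothetical rows: [name, chrom, strand, start, end].
# The column order is an assumption made for this sketch.
my @genes = (
    ['g1', 'chr1', '+', 100, 200],
    ['g2', 'chr1', '-', 300, 400],
    ['g3', 'chr1', '+', 500, 600],
    ['g4', 'chr1', '+', 700, 800],
    ['g5', 'chr1', '-', 900, 1000],
    ['h1', 'chr2', '+', 100, 200],
);

# Return up to $k genes on each side of $target as two array refs
# (lower coordinates, higher coordinates), each labelled name(strand).
# Returns an empty list if $target is not in the table.
sub neighbors {
    my ($target, $k, @table) = @_;
    my ($hit) = grep { $_->[0] eq $target } @table;
    return unless $hit;

    # Keep only genes on the same chromosome, ordered by start position.
    my @chrom = sort { $a->[3] <=> $b->[3] }
                grep { $_->[1] eq $hit->[1] } @table;
    my ($i) = grep { $chrom[$_][0] eq $target } 0 .. $#chrom;

    # Clamp the window to the ends of the chromosome.
    my $lo = $i - $k < 0        ? 0        : $i - $k;
    my $hi = $i + $k > $#chrom  ? $#chrom  : $i + $k;

    my $label = sub { sprintf '%s(%s)', $_[0][0], $_[0][2] };
    return (
        [ map { $label->($_) } @chrom[$lo .. $i - 1] ],
        [ map { $label->($_) } @chrom[$i + 1 .. $hi] ],
    );
}

my ($before, $after) = neighbors('g3', 3, @genes);
print "before: @$before\n";    # before: g1(+) g2(-)
print "after:  @$after\n";     # after:  g4(+) g5(-)
```

The two lists are in coordinate order, so for a gene on the minus strand the 'before' list is biologically downstream and the two would need to be swapped.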
From cjfields at uiuc.edu Mon Aug 7 12:24:13 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 11:24:13 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D76177.1090901@sendu.me.uk> References: <44D76177.1090901@sendu.me.uk> Message-ID: <19426B4B-178D-4685-8851-8048502DEDA0@uiuc.edu> This was brought up previously: http://article.gmane.org/gmane.comp.lang.perl.bio.general/11808/match=ucsc Would be nice to get something up and running here, even if it just returned raw data (not objects). The more access the better. In a similar vein, I developed Bio::DB::EUtilities to fetch raw data, like sequences, XML, etc., from any NCBI Entrez database as an alternative to using Bioperl objects; it could be used via an I/O module (like SeqIO) to convert said data into Bioperl objects or just return it unmodified by using it directly. Still needs some work, just haven't had time lately. BTW, you can use a 'slice adaptor' via GenBank files as well, both via Bio::DB::GenBank and Bio::DB::EUtilities. They both accept the parameters seq_start, seq_stop, and strand, and both return the sequence slice containing relevant seqfeatures, etc. for the region. Chris On Aug 7, 2006, at 10:51 AM, Sendu Bala wrote: > Rick Scavetta wrote: >> Hello Programmers, >> >> I have a list of mouse GeneIDs for which I have extracted the >> RefSeqs for. >> With these accession numbers I want to know what are the three >> closest >> upstream and downstream genes (and orientation, if possible) to my >> gene of >> interest. Is these some way of finding this out? Any suggestions? > > One possible way would be with the Ensembl Perl API: > http://www.ensembl.org/info/software/core/core_tutorial.html > > You'd get a gene or transcript adapator and use > fetch_all_by_external_name() iirc. > > Then probably your slice adaptor would be used to get the up and down > stream regions and iirc there's some easy way to ask for all the genes > in a slice.
> > Sean's solution (use UCSC) seems a lot easier, and since you just > want a > list of names as a result, not bioperl objects, I'd go for that. > > Actually, would there be any interest in a bioperl interface to the > UCSC > tables? It could probably be done using their DAS server, and end > up as > a very easy-to-use alternative to the Ensembl API for limited or > one-off > queries. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From lalithaviswanath at yahoo.com Mon Aug 7 12:20:30 2006 From: lalithaviswanath at yahoo.com (lalitha viswanath) Date: Mon, 7 Aug 2006 09:20:30 -0700 (PDT) Subject: [Bioperl-l] (no subject) Message-ID: <20060807162030.602.qmail@web34114.mail.mud.yahoo.com> __________________________________________________ Do You Yahoo!? Tired of spam? Yahoo! Mail has the best spam protection around http://mail.yahoo.com From akozik at atgc.org Mon Aug 7 12:13:39 2006 From: akozik at atgc.org (Alexander Kozik) Date: Mon, 07 Aug 2006 09:13:39 -0700 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D76177.1090901@sendu.me.uk> References: <44D76177.1090901@sendu.me.uk> Message-ID: <44D766B3.1030009@atgc.org> If a sorted list (table) of all genes is available for each orientation, then a simple 'grep' could help, e.g.:
$ grep -A 2 -B 2 -f list_of_ids table_with_genes
where:
-A 2 prints the two lines after each match
-B 2 prints the two lines before each match
list_of_ids is a file with the list of genes of interest
table_with_genes is a table of all genes, sorted by genome position.
This assumes that the genes in the table are sorted according to their genome positions. It would be better to have two tables of sorted genes, one table per orientation (strand).
Alexander Kozik Bioinformatics Specialist Genome and Biomedical Sciences Facility 451 East Health Sciences Drive University of California Davis, CA 95616-8816 Phone: (530) 754-9127 email#1: akozik at atgc.org email#2: akozik at gmail.com web: http://www.atgc.org/ Sendu Bala wrote: > Rick Scavetta wrote: >> Hello Programmers, >> >> I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. >> With these accession numbers I want to know what are the three closest >> upstream and downstream genes (and orientation, if possible) to my gene of >> interest. Is these some way of finding this out? Any suggestions? > > One possible way would be with the Ensembl Perl API: > http://www.ensembl.org/info/software/core/core_tutorial.html > > You'd get a gene or transcript adapator and use > fetch_all_by_external_name() iirc. > > Then probably your slice adaptor would be used to get the up and down > stream regions and iirc there's some easy way to ask for all the genes > in a slice. > > Sean's solution (use UCSC) seems a lot easier, and since you just want a > list of names as a result, not bioperl objects, I'd go for that. > > Actually, would there be any interest in a bioperl interface to the UCSC > tables? It could probably be done using their DAS server, and end up as > a very easy-to-use alternative to the Ensembl API for limited or one-off > queries. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Mon Aug 7 12:36:30 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 12:36:30 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D746BF.4070908@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net> <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk> <44D746BF.4070908@sendu.me.uk> Message-ID: <274892B4-1906-4746-BE8A-487BA4FFFF05@gmx.net> On Aug 7, 2006, at 9:57 AM, Sendu Bala wrote: > > Let T be a rooted tree. A node x ∈ T is an ancestor of a > node y ∈ T if the path from the root of T to y goes through > x. A node v ∈ T is a common ancestor of x and y if it is > an ancestor of both x and y. The nearest common ancestor, > nca, of two nodes x, y is the common ancestor of x and y > whose distance to x (and to y) is smaller than the distance > to x of any other common ancestor of x and y. > > According to that, v cannot be x or y. Why not? The path from the root node to node x certainly goes through x, so x is an ancestor of x. Also, there is no other ancestor of x whose distance to x is smaller than the distance of x to x (namely zero). Hence, x is the nearest common ancestor of nodes x and x. More generally, node x is the nearest common ancestor of node x and any node z for which node x is an ancestor. Otherwise, as an example, for a rooted tree what is the nearest common ancestor between the root node and any node in the tree? Do you claim that in this case there is no common ancestor? The above definition (intentionally) does not say that x != y.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Aug 7 12:46:27 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 7 Aug 2006 12:46:27 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <9F5EACE5-DD31-4DA1-858B-FF727978AFB8@gmx.net> On Aug 7, 2006, at 10:41 AM, Sean Davis wrote: >> Also, I would also like to know something about the expression of a >> particular gene of interest. e.g. ba querying the Novartis Gene Atlas >> (http://symatlas.gnf.org/SymAtlas/). Is there a module for handling >> submissions and retrieving results of this sort? > > No, but, again, UCSC hosts these data and they have nice query > tools, a > publicly-accessible MySQL database containing the information, and > downloadable tables for the data. They actually don't host all of them, and not necessarily the latest versions as samples do get added, also to the public datasets. Having said that, you may be just fine with the version that UCSC hosts. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sdavis2 at mail.nih.gov Mon Aug 7 12:47:22 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 07 Aug 2006 12:47:22 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D76177.1090901@sendu.me.uk> Message-ID: On 8/7/06 11:51 AM, "Sendu Bala" wrote: > Rick Scavetta wrote: >> Hello Programmers, >> >> I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. >> With these accession numbers I want to know what are the three closest >> upstream and downstream genes (and orientation, if possible) to my gene of >> interest. Is these some way of finding this out? Any suggestions? 
>
> One possible way would be with the Ensembl Perl API:
> http://www.ensembl.org/info/software/core/core_tutorial.html
>
> You'd get a gene or transcript adaptor and use
> fetch_all_by_external_name() iirc.
>
> Then probably your slice adaptor would be used to get the up and down
> stream regions and iirc there's some easy way to ask for all the genes
> in a slice.
>
> Sean's solution (use UCSC) seems a lot easier, and since you just want a
> list of names as a result, not bioperl objects, I'd go for that.
>
> Actually, would there be any interest in a bioperl interface to the UCSC
> tables? It could probably be done using their DAS server, and end up as
> a very easy-to-use alternative to the Ensembl API for limited or one-off
> queries.

Sendu,

I haven't used their DAS server in a couple of years--it is probably worth
my trying it again at some point. I have gone the route of maintaining a
local UCSC mirror of the database. There are obvious advantages to doing
this, including speed of access, no access limits, and the power and
flexibility of SQL. With the simplicity of MySQL, one can even rsync
directly from UCSC's MySQL data directory, eliminating the need for a
table import step. I haven't gone so far as to implement a perl-based
wrapper, but one could envision doing so for the most commonly-used tables,
perhaps using Rose::DB::Object or DBIx::Class as a base with other methods
added as needed based on the data contained in the table.

If I get a chance, I could probably extract the parts from the UCSC website
necessary to generate such a database-only mirror and put them on the wiki,
but the information is all on the UCSC website under the "mirroring
instructions" heading.
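Once gene coordinates are in hand, from a mirror or a downloaded table, the "three closest upstream and downstream genes" part of the original question is a small sorting problem. The following is only a sketch under stated assumptions: the rows are made-up data shaped like (name, chrom, strand, txStart, txEnd), the sub names are hypothetical, no database is queried, and genes overlapping the target are simply ignored. Upstream and downstream are taken relative to the target's own strand.

```perl
use strict;
use warnings;

# Hypothetical rows of (name, chrom, strand, txStart, txEnd), the kind of
# columns one might select from a gene table in a local mirror.
my @genes = (
    [ 'g1',     'chr1', '+',  100,  200 ],
    [ 'g2',     'chr1', '-',  300,  400 ],
    [ 'g3',     'chr1', '+',  500,  600 ],
    [ 'target', 'chr1', '-',  700,  800 ],
    [ 'g4',     'chr1', '-',  900, 1000 ],
    [ 'g5',     'chr1', '+', 1100, 1200 ],
    [ 'g6',     'chr1', '+', 1300, 1400 ],
    [ 'g7',     'chr1', '-', 1500, 1600 ],
    [ 'other',  'chr2', '+',  700,  800 ],
);

# Return the $n nearest non-overlapping genes on each side of the target,
# nearest first, as { upstream => [...], downstream => [...] }.
sub neighbours {
    my ($target_name, $n, @rows) = @_;
    my ($target) = grep { $_->[0] eq $target_name } @rows;
    return unless $target;
    my (undef, $chrom, $strand, $start, $end) = @$target;

    my @same_chrom = grep { $_->[1] eq $chrom && $_->[0] ne $target_name } @rows;

    # Genes ending before the target starts, nearest (largest end) first.
    my @left  = sort { $b->[4] <=> $a->[4] } grep { $_->[4] <= $start } @same_chrom;
    # Genes starting after the target ends, nearest (smallest start) first.
    my @right = sort { $a->[3] <=> $b->[3] } grep { $_->[3] >= $end } @same_chrom;

    splice @left,  $n if @left  > $n;
    splice @right, $n if @right > $n;

    # On the minus strand, upstream lies at higher coordinates.
    my ($up, $down) = $strand eq '-' ? (\@right, \@left) : (\@left, \@right);
    return { upstream => $up, downstream => $down };
}

# Name plus orientation, e.g. "g4 (-)".
sub describe {
    my ($gene) = @_;
    return "$gene->[0] ($gene->[2])";
}

my $near = neighbours('target', 3, @genes);
for my $side (qw(upstream downstream)) {
    print "$side: ", join(', ', map { describe($_) } @{ $near->{$side} }), "\n";
}
```

With real data the same sort could of course be pushed into the SQL query itself; the in-memory version is shown only because it runs without a database.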
Sean From sdavis2 at mail.nih.gov Mon Aug 7 12:54:12 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 07 Aug 2006 12:54:12 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <19426B4B-178D-4685-8851-8048502DEDA0@uiuc.edu> Message-ID: On 8/7/06 12:24 PM, "Chris Fields" wrote: > This was brought up previously: > > http://article.gmane.org/gmane.comp.lang.perl.bio.general/11808/ > match=ucsc > > Would be nice to get something up and running here, even if it just > returned raw data (not objects). The more access the better. Sorry, Chris. I probably sound like a broken record--UCSC, UCSC, UCSC.... As I hinted at in the last email, simple Rose::DB::Object (my preference, faster) or DBIx::Class based classes could be set up for the various tables of interest, with the possibility of adding functionality by simply adding methods to the generic ORM classes. This would be relatively quick to develop I would think and pretty easy to maintain. If I get a chance in the next few days, I can put together a quick set of classes for accessing some of the common tables. Sean From cjfields at uiuc.edu Mon Aug 7 12:47:55 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 11:47:55 -0500 Subject: [Bioperl-l] (no subject) In-Reply-To: <20060807162030.602.qmail@web34114.mail.mud.yahoo.com> References: <20060807162030.602.qmail@web34114.mail.mud.yahoo.com> Message-ID: Pardon? Didn't catch that... On Aug 7, 2006, at 11:20 AM, lalitha viswanath wrote: > > > > __________________________________________________ > Do You Yahoo!? > Tired of spam? Yahoo! Mail has the best spam protection around > http://mail.yahoo.com > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Aug 7 13:13:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 12:13:18 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <8FF6DC53-4A29-4C9A-BB62-F96419876842@uiuc.edu> On Aug 7, 2006, at 11:54 AM, Sean Davis wrote: > > On 8/7/06 12:24 PM, "Chris Fields" wrote: > >> This was brought up previously: >> >> http://article.gmane.org/gmane.comp.lang.perl.bio.general/11808/ >> match=ucsc >> >> Would be nice to get something up and running here, even if it just >> returned raw data (not objects). The more access the better. > > Sorry, Chris. I probably sound like a broken record--UCSC, UCSC, > UCSC.... Never hurts to reiterate here. Well, to a certain degree... > As I hinted at in the last email, simple Rose::DB::Object (my > preference, > faster) or DBIx::Class based classes could be set up for the > various tables > of interest, with the possibility of adding functionality by simply > adding > methods to the generic ORM classes. This would be relatively quick to > develop I would think and pretty easy to maintain. If I get a > chance in the > next few days, I can put together a quick set of classes for > accessing some > of the common tables. > > Sean I guess these could go under the Bio::DB* namespace unless someone disagrees. I don't think it's absolutely necessary get any raw data into bioperlish objects immediately. Just setting up access to UCSC would be a great start, and the classes could evolve from there to slowly get 'assimilated' into Bioperl. Hmm... makes Bioperl sound like the Borg... Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From charles.tilford at bms.com Mon Aug 7 12:05:56 2006
From: charles.tilford at bms.com (Charles Tilford)
Date: Mon, 07 Aug 2006 12:05:56 -0400
Subject: [Bioperl-l] Bio::DB::Fasta fails for files over 4GB
Message-ID: <44D764E4.4000706@bms.com>

I just found out that Bio::DB::Fasta has an inherent 4GB file size limit
in it. This is due to how indexing information is stored. The module
pack()s information using this format:

  use constant STRUCT =>'NNnnCa*';

... where the first token is the file offset. N = 32-bit unsigned
integer, and rolls over when the file position passes the 4GB mark,
resulting in garbage out for those entries. Changing the packing format to:

  use constant STRUCT =>'QNnnCa*';

...solves the problem (Q = 64-bit unsigned int). We have several genomic
files (ensembl dumps) where this is an issue:

-rw-rw-r-- 1 kirovs bioinfo 7.2G Jul 13 12:28 pan_troglodytes.genome.CHIMP1A.fa
-rw-rw-r-- 1 kirovs bioinfo 6.8G Jul 13 12:25 monodelphis_domestica.genome.BROADO3.fa
-rw-rw-r-- 1 kirovs bioinfo 5.0G Jul 13 12:26 mus_musculus.genome.NCBIM36.fa
-rw-rw-r-- 1 kirovs bioinfo 4.6G Aug  2 15:31 bos_taurus.genome.Btau2.fa
-rw-rw-r-- 1 kirovs bioinfo 4.1G Jul 13 12:22 danio_rerio.genome.ZFISH6.fa

These are not really large genomes, but have a fair number of
unassembled (duplicated) fragments in them, which bump up the file
size. Some fully assembled genomes will probably eventually top the 4GB
mark, anyway.

Unfortunately, this raises a backward compatibility issue, since an
index packed with 'N' will fail when unpacked with 'Q'. Perhaps the
module could dynamically bifurcate the packing structure based on a file
size test?

The second token is for the sequence length. I can't imagine a single
sequence exceeding 4Gb, so it's probably safe - yes? Should it also be Q
in the event that biology someday exceeds our current imagination?
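The rollover is easy to reproduce outside the module. The snippet below is a stand-alone sketch, not Bio::DB::Fasta code: the variable names are made up, and it assumes a perl built with 64-bit integers, since 'Q' is not available otherwise. It round-trips an offset just past the 4GB mark through both template letters and compares the fixed part of the two record layouts.

```perl
use strict;
use warnings;
use Config;

# 'Q' only exists when perl was built with 64-bit integer support.
die "this perl lacks 64-bit integers\n" unless $Config{ivsize} >= 8;

my $offset = (1 << 32) + 5;    # a file position just past the 4GB mark

# 'N' stores only the low 32 bits, so the offset does not survive.
my ($from_n) = unpack('N', pack('N', $offset));

# 'Q' stores the full 64-bit value.
my ($from_q) = unpack('Q', pack('Q', $offset));

printf "original: %d\nvia N:    %d\nvia Q:    %d\n", $offset, $from_n, $from_q;

# Size of the fixed part of each record (empty trailing string).
my $small = length pack('NNnnCa*', 0, 0, 0, 0, 0, '');
my $big   = length pack('QNnnCa*', 0, 0, 0, 0, 0, '');
print "record bytes: N-first $small, Q-first $big\n";
```

One design point worth keeping in mind: 'N' is always big-endian, whereas 'Q' uses the machine's native byte order, so an index written with 'Q' may not be readable on a machine of the other endianness. Newer perls (5.10 and later) accept 'Q>' to force big-endian order.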
Thanks, CAT -- Charles Tilford, Bioinformatics-Applied Genomics Bristol-Myers Squibb PRI, Hopewell 3A039 P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213 charles.tilford at bms.com From bix at sendu.me.uk Mon Aug 7 13:53:36 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 07 Aug 2006 18:53:36 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <44D77E20.1040604@sendu.me.uk> Sean Davis wrote: > > On 8/7/06 11:51 AM, "Sendu Bala" wrote: > >> Actually, would there be any interest in a bioperl interface to the UCSC >> tables? It could probably be done using their DAS server, and end up as >> a very easy-to-use alternative to the Ensembl API for limited or one-off >> queries. > > I haven't used their DAS server in a couple of years--it is probably worth > my trying it again at some point. I have gone the route of maintaining a > local ucsc mirror of the database. There are obvious advantages to doing > this, including speed of access, no access limits, and the power and > flexibility of SQL. With the simplicity of MySQL, one can even rsync > directory from UCSC's mysql data directory directly, eliminating the need to > do a table import step. I haven't gone so far as to implement a perl-based > wrapper, but one could envision doing so for the most commonly-used tables, > perhaps using Rose::DB::Object or DBIx::Class as a base with other methods > added as needed based on the data contained in the table. Yes, using MySQL would probably be better. Any interface would ideally use the UCSC MySQL server by default (with enforced X second waits), with an option to set the address to your own server. Do you want to go ahead and look into making those classes for accessing the common tables? 
It's in my plan to make various aspects of genomic data retrieval a strength of bioperl as opposed to a surprising missing link (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get to that in a few weeks but if you lay the ground work or better yet complete everything before then that would be great! :) > If I get a chance, I could probably extract the parts from the UCSC website > necessary to generate such a database-only mirror and put them on the wiki, > but the information is all on the UCSC website under the "mirroring > instructions" heading. That would no doubt be useful. Cheers, Sendu. From cjfields at uiuc.edu Mon Aug 7 13:43:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 12:43:01 -0500 Subject: [Bioperl-l] Bio::DB::Fasta fails for files over 4GB In-Reply-To: <44D764E4.4000706@bms.com> References: <44D764E4.4000706@bms.com> Message-ID: Dynamically determining the packing based on file size is probably the way to go; it would be nice to see how this affects speed. Chris On Aug 7, 2006, at 11:05 AM, Charles Tilford wrote: > I just found out that Bio::DB::Fasta has an inherit 4GB file size > limit > in it. This is due to how indexing information is stored. The module > pack()s information using this format: > > use constant STRUCT =>'NNnnCa*'; > > ... where the first token is the file offset. N = 32-bit unsigned > integer, and rolls-over when the file position passes the 4GB mark, > resulting in garbage out for those entries. Changing the packing > format to: > > use constant STRUCT =>'QNnnCa*'; > > ...solves the problem (Q = 64-bit unsigned int). 
We have several > genomic > files (ensembl dumps) where this is an issue: > > -rw-rw-r-- 1 kirovs bioinfo 7.2G Jul 13 12:28 > pan_troglodytes.genome.CHIMP1A.fa > -rw-rw-r-- 1 kirovs bioinfo 6.8G Jul 13 12:25 > monodelphis_domestica.genome.BROADO3.fa > -rw-rw-r-- 1 kirovs bioinfo 5.0G Jul 13 12:26 > mus_musculus.genome.NCBIM36.fa > -rw-rw-r-- 1 kirovs bioinfo 4.6G Aug 2 15:31 > bos_taurus.genome.Btau2.fa > -rw-rw-r-- 1 kirovs bioinfo 4.1G Jul 13 12:22 > danio_rerio.genome.ZFISH6.fa > > These are not really large genomes, but have a fair number of > unassembled (duplicitous) fragments in them, which bump up the file > size. Some fully assembled genomes will probably eventually top the > 4GB > mark, anyway. > > Unfortunately, this raises a backward compatibility issue, since an > index packed with 'N' will fail when unpacked with 'Q'. Perhaps the > module could dynamically bifurcate the packing structure based on a > file > size test? > > The second token is for the sequence length, I can't imagine a single > sequence exceeding 4Gb, so it's probably safe - yes? Should it also > be Q > in the event that biology someday exceeds our current imagination? > > Thanks, > CAT > > -- > Charles Tilford, Bioinformatics-Applied Genomics > Bristol-Myers Squibb PRI, Hopewell 3A039 > P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213 > charles.tilford at bms.com > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Mon Aug 7 14:09:12 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 07 Aug 2006 19:09:12 +0100
Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul
In-Reply-To: <274892B4-1906-4746-BE8A-487BA4FFFF05@gmx.net>
References: <44D4BC52.30203@sendu.me.uk>
 <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net>
 <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk>
 <44D746BF.4070908@sendu.me.uk>
 <274892B4-1906-4746-BE8A-487BA4FFFF05@gmx.net>
Message-ID: <44D781C8.1050606@sendu.me.uk>

Hilmar Lapp wrote:
> On Aug 7, 2006, at 9:57 AM, Sendu Bala wrote:
>
>> Let T be a rooted tree. A node x ∈ T is an ancestor of a
>> node y ∈ T if the path from the root of T to y goes through
>> x. A node v ∈ T is a common ancestor of x and y if it is
>> an ancestor of both x and y. The nearest common ancestor,
>> nca, of two nodes x, y is the common ancestor of x and y
>> whose distance to x (and to y) is smaller than the distance
>> to x of any other common ancestor of x and y.
>>
>> According to that, v cannot be x or y.
>
> Why not? The path from the root node to node x certainly goes through
> x, so x is an ancestor of x.

Well, 'through' implied to me 'does not end at'. Maybe it's not supposed
to imply that, in which case what you say seems fair.

In any case, since it's debatable I'll go with the old expected
behaviour. I'll still leave the changes in PAML et al. that use the
single call to get_lca, as that will be faster.

The other point of contention was with what Bio::Species::species should
return, and again since it's debatable I'll go with the old expected
behaviour (guess badly what the specific name is).

I'll wait till this weekend for any more comments and then commit and
add to the wiki.

From charles.tilford at bms.com Mon Aug 7 14:30:03 2006
From: charles.tilford at bms.com (Charles Tilford)
Date: Mon, 07 Aug 2006 14:30:03 -0400
Subject: [Bioperl-l] Bio::DB::Fasta fails for files over 4GB
In-Reply-To:
References: <44D764E4.4000706@bms.com>
Message-ID: <44D786AB.3040401@bms.com>

Chris Fields wrote:
> Dynamically determining the packing based on file size is probably the
> way to go; it would be nice to see how this affects speed.
>
> Chris

Ok, I seem to have a functioning patch. I was also concerned about
performance; I assume Lincoln's using a constant for the pack format
because it optimizes compilation of the _pack() and _unpack() methods.
So rather than make the format a variable, I made the methods themselves
variant. The code looks at the file(s) being indexed, and if any file
exceeds 4GB it will use 64-bit packing (for both file offset and
sequence length - being paranoid on the latter, in case we get Martian
genomes with chromosomes over 4Gb in length). I have not tested it for a
directory of multiple files, but it should still work. Two packing
formats are defined as constants, and the pack / unpack methods are
bifurcated to use one or the other. I do not have a good feeling for the
performance difference between calling a method directly and
dereferencing a code reference, but I assume it is minuscule.

-s is used for the file size test - I assume that is nuclear-hard
portable across platforms?

I didn't figure out a good reason for the _pack/_unpack calls to remain
object methods, so in the patch they no longer are.
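Stripped of the Bio::DB::Fasta specifics, the dispatch idea can be sketched as follows. This is a simplified stand-alone illustration, not the patch itself: choose_packers, its hash keys and FOUR_GB are made-up names, file sizes are passed in as plain numbers rather than taken from -s, and a perl with 64-bit integers is assumed for the 'Q' fields.

```perl
use strict;
use warnings;

use constant STRUCT    => 'NNnnCa*';    # 32-bit file offset and seq length
use constant STRUCTBIG => 'QQnnCa*';    # 64-bit file offset and seq length
use constant FOUR_GB   => 2**32 - 1;    # largest value an 'N' field can hold

# Pick the pack/unpack code refs once, based on the largest file size,
# so that each routine still uses a compile-time constant template.
sub choose_packers {
    my @sizes = @_;
    my $max = 0;
    for my $size (@sizes) {
        $max = $size if $size > $max;
    }
    if ($max > FOUR_GB) {
        return +{
            template => STRUCTBIG,
            packer   => sub { pack(STRUCTBIG, @_) },
            unpacker => sub { unpack(STRUCTBIG, $_[0]) },
        };
    }
    return +{
        template => STRUCT,
        packer   => sub { pack(STRUCT, @_) },
        unpacker => sub { unpack(STRUCT, $_[0]) },
    };
}

# Two hypothetical file sizes in bytes; the second one exceeds 4GB.
my $p = choose_packers(4.1e9, 7.2e9);

# offset, seq length, line length, header length, type, file name
my $record = $p->{packer}->(5_000_000_000, 1234, 60, 12, 1, 'chimp.fa');
my ($offset, $seqlength) = $p->{unpacker}->($record);

print "$p->{template}: offset=$offset length=$seqlength\n";
```

The sketch does not address the backward compatibility question raised earlier in the thread: an index written with one template still has to be read back with the same one, so the choice would need to be recorded with the index or re-derived in the same way when it is reopened.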

The patch below is against:
# $Id: Fasta.pm,v 1.44 2006/07/17 10:39:37 sendu Exp $

--- /net/thegeneral/home/tilfordc/Fasta.pm 2006-08-07 14:01:27.000000000 -0400
+++ /stf/biocgi/tilfordc/patch_lib/Bio/DB/Fasta.pm 2006-08-07 14:07:52.033844532 -0400
@@ -418,7 +418,8 @@
 *ids = \&get_all_ids;
 *get_seq_by_primary_id = *get_Seq_by_acc = \&get_Seq_by_id;
 
-use constant STRUCT =>'NNnnCa*';
+use constant STRUCT =>'NNnnCa*';
+use constant STRUCTBIG =>'QQnnCa*'; # 64-bit file offset and seq length
 use constant DNA => 1;
 use constant RNA => 2;
 use constant PROTEIN => 3;
@@ -568,6 +569,7 @@
   # get the most recent modification time of any of the contents
   my $modtime = 0;
   my %modtime;
+  $self->set_pack_method( @files );
   foreach (@files) {
     my $m = (stat($_))[9];
     $modtime{$_} = $m;
@@ -612,6 +614,32 @@
   return Bio::PrimarySeq::Fasta->new($self,$id);
 }
 
+=head2 set_pack_method
+
+ Title : set_pack_method
+ Usage : $db->set_pack_method( @files )
+ Function: Determines whether data packing uses 32 or 64 bit integers
+ Returns :
+ Args : one or more file paths
+
+=cut
+
+sub set_pack_method {
+  my $self = shift;
+  # Find the maximum file size:
+  my ($maxsize) = sort { $b <=> $a } map { -s $_ } @_;
+  my $fourGB = (2 ** 32) - 1;
+
+  if ($maxsize > $fourGB) {
+    # At least one file exceeds 4Gb - we will need to use 64 bit ints
+    $self->{packmeth} = \&_packBig;
+    $self->{unpackmeth} = \&_unpackBig;
+  } else {
+    $self->{packmeth} = \&_pack;
+    $self->{unpackmeth} = \&_unpack;
+  }
+}
+
 =head2 index_file
 
  Title : index_file
@@ -629,6 +657,7 @@
   my $file = shift;
   my $force_reindex = shift;
 
+  $self->set_pack_method( $file );
   my $index = $self->index_name($file);
   # if caller has requested reindexing, then unlink the index
   unlink $index if $force_reindex;
@@ -716,9 +745,9 @@
     if ($id) {
       my $seqlength = $pos - $offset - length($_);
       $seqlength -= $termination_length * $seq_lines;
-      $offsets->{$id} = $self->_pack($offset,$seqlength,
-                                     $linelength,$firstline,
-                                     $type,$base);
+      $offsets->{$id} = &{$self->{packmeth}}($offset,$seqlength,
+                                             $linelength,$firstline,
+                                             $type,$base);
     }
     $id = ref($self->{makeid}) eq 'CODE' ? $self->{makeid}->($_) : $1;
     ($offset,$firstline,$linelength) = ($pos,length($_),0);
@@ -746,10 +775,10 @@
     }
     $seqlength -= $termination_length * $seq_lines;
   };
-  $offsets->{$id} = $self->_pack($offset,$seqlength,
-                                 $linelength,$firstline,
-                                 $type,$base);
-  }
+  $offsets->{$id} = &{$self->{packmeth}}($offset,$seqlength,
+                                         $linelength,$firstline,
+                                         $type,$base);
+}
   $offsets->{__termination_length} = $termination_length;
   return \%offsets;
 }
@@ -770,35 +799,35 @@
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  ($self->_unpack($offset))[0];
+  (&{$self->{unpackmeth}}($offset))[0];
 }
 
 sub length {
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  ($self->_unpack($offset))[1];
+  (&{$self->{unpackmeth}}($offset))[1];
 }
 
 sub linelen {
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  ($self->_unpack($offset))[2];
+  (&{$self->{unpackmeth}}($offset))[2];
 }
 
 sub headerlen {
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  ($self->_unpack($offset))[3];
+  (&{$self->{unpackmeth}}($offset))[3];
 }
 
 sub alphabet {
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  my $type = ($self->_unpack($offset))[4];
+  my $type = (&{$self->{unpackmeth}}($offset))[4];
   return $type == DNA ? 'dna'
          : $type == RNA ? 'rna'
          : 'protein';
@@ -818,7 +847,7 @@
   my $self = shift;
   my $id = shift;
   my $offset = $self->{offsets}{$id} or return;
-  $self->fileno2path(($self->_unpack($offset))[5]);
+  $self->fileno2path((&{$self->{unpackmeth}}($offset))[5]);
 }
 
 sub fileno2path {
@@ -899,7 +928,7 @@
   my $self = shift;
   my $id = shift;
   my ($offset,$seqlength,$linelength,$firstline,$type,$file)
-    = $self->_unpack($self->{offsets}{$id}) or return;
+    = &{$self->{unpackmeth}}($self->{offsets}{$id}) or return;
   $offset -= $firstline;
   my $data;
   my $fh = $self->fh($id) or return;
@@ -914,7 +943,7 @@
   my $self = shift;
   my $id = shift;
   my $a = shift()-1;
-  my ($offset,$seqlength,$linelength,$firstline,$type,$file) = $self->_unpack($self->{offsets}{$id});
+  my ($offset,$seqlength,$linelength,$firstline,$type,$file) = &{$self->{unpackmeth}}($self->{offsets}{$id});
   $a = 0 if $a < 0;
   $a = $seqlength-1 if $a >= $seqlength;
   my $tl = $self->{offsets}{__termination_length};
@@ -940,15 +969,21 @@
 }
 
 sub _pack {
-  shift;
   pack STRUCT,@_;
 }
 
+sub _packBig {
+  pack STRUCTBIG,@_;
+}
+
 sub _unpack {
-  shift;
   unpack STRUCT,shift;
 }
 
+sub _unpackBig {
+  unpack STRUCTBIG,shift;
+}
+
 sub _type {
   shift;
   local $_ = shift;

>
> On Aug 7, 2006, at 11:05 AM, Charles Tilford wrote:
>
>> I just found out that Bio::DB::Fasta has an inherit 4GB file size limit
>> in it. This is due to how indexing information is stored. The module
>> pack()s information using this format:
>>
>> use constant STRUCT =>'NNnnCa*';
>>
>> ... where the first token is the file offset. N = 32-bit unsigned
>> integer, and rolls-over when the file position passes the 4GB mark,
>> resulting in garbage out for those entries. Changing the packing
>> format to:
>>
>> use constant STRUCT =>'QNnnCa*';
>>
>> ...solves the problem (Q = 64-bit unsigned int). We have several genomic
>> files (ensembl dumps) where this is an issue:
>>
>> -rw-rw-r-- 1 kirovs bioinfo 7.2G Jul 13 12:28
>> pan_troglodytes.genome.CHIMP1A.fa
>> -rw-rw-r-- 1 kirovs bioinfo 6.8G Jul 13 12:25
>> monodelphis_domestica.genome.BROADO3.fa
>> -rw-rw-r-- 1 kirovs bioinfo 5.0G Jul 13 12:26
>> mus_musculus.genome.NCBIM36.fa
>> -rw-rw-r-- 1 kirovs bioinfo 4.6G Aug 2 15:31
>> bos_taurus.genome.Btau2.fa
>> -rw-rw-r-- 1 kirovs bioinfo 4.1G Jul 13 12:22
>> danio_rerio.genome.ZFISH6.fa
>>
>> These are not really large genomes, but have a fair number of
>> unassembled (duplicitous) fragments in them, which bump up the file
>> size. Some fully assembled genomes will probably eventually top the 4GB
>> mark, anyway.
>>
>> Unfortunately, this raises a backward compatibility issue, since an
>> index packed with 'N' will fail when unpacked with 'Q'. Perhaps the
>> module could dynamically bifurcate the packing structure based on a file
>> size test?
>>
>> The second token is for the sequence length, I can't imagine a single
>> sequence exceeding 4Gb, so it's probably safe - yes? Should it also be Q
>> in the event that biology someday exceeds our current imagination?
>>
>> Thanks,
>> CAT
>>
>> --
>> Charles Tilford, Bioinformatics-Applied Genomics
>> Bristol-Myers Squibb PRI, Hopewell 3A039
>> P.O. Box 5400, Princeton, NJ 08543-5400, (609) 818-3213
>> charles.tilford at bms.com
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>

--
Charles Tilford, Bioinformatics-Applied Genomics
Bristol-Myers Squibb PRI, Hopewell 3A039
P.O.
Box 5400, Princeton, NJ 08543-5400, (609) 818-3213 charles.tilford at bms.com From amanda.na at gmail.com Mon Aug 7 14:31:36 2006 From: amanda.na at gmail.com (Amanda n/a) Date: Mon, 7 Aug 2006 14:31:36 -0400 Subject: [Bioperl-l] Parsing Clustalw alignment Message-ID: <19344f3c0608071131g150ff883yeb28630ae509599e@mail.gmail.com> Hi, Has anyone had luck in using Clustalw.pm to parse through clustalw .aln and/or output files? Specifically, I'm looking to search for high scoring alignments between specific pairs of sequences. If found, I then need to extract the two sequence ids, sequences, alignment score and positions. If anyone has had experience with this or knows of a resource/doc that runs through this, I would be grateful if you would not mind sharing. Thanks From cjfields at uiuc.edu Mon Aug 7 14:46:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 7 Aug 2006 13:46:37 -0500 Subject: [Bioperl-l] Bio::DB::Fasta fails for files over 4GB In-Reply-To: <44D786AB.3040401@bms.com> References: <44D764E4.4000706@bms.com> <44D786AB.3040401@bms.com> Message-ID: <6C1831E0-3D32-4139-8A8C-C095502BB32A@uiuc.edu> Why don't you submit your patch to Bugzilla? http://bugzilla.open-bio.org/ http://www.bioperl.org/wiki/HOWTO:SubmitPatch Lincoln could take a look at it when he gets back from vacation and comment on it. He may have some other possibilities we haven't thought of. Chris On Aug 7, 2006, at 1:30 PM, Charles Tilford wrote: > Chris Fields wrote: >> Dynamically determining the packing based on file size is probably >> the way to go; it would be nice to see how this affects speed. >> Chris > Ok, I seem to have a functioning patch. I was also concerned about > performance; I assume Lincoln's using a constant for the pack > format because it optimizes compilation of the _pack() and _unpack > () methods. So rather than make the format a variable, I made the > methods themselves variant. 

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Mon Aug 7 15:10:48 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 7 Aug 2006 14:10:48 -0500
Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul
In-Reply-To: <44D781C8.1050606@sendu.me.uk>
References: <44D4BC52.30203@sendu.me.uk>
 <16F03035-6AC0-48A6-BB76-EC811FAC0C5B@gmx.net>
 <44D6EB11.30903@sendu.me.uk> <44D6FC22.6030102@sendu.me.uk>
 <44D746BF.4070908@sendu.me.uk>
 <274892B4-1906-4746-BE8A-487BA4FFFF05@gmx.net>
 <44D781C8.1050606@sendu.me.uk>
Message-ID: <98C04787-DBCA-4314-A775-6D07B7A63831@uiuc.edu>

On Aug 7, 2006, at 1:09 PM, Sendu Bala wrote:

> The other point of contention was with what Bio::Species::species
> should
> return, and again since its debatable I'll go with the old expected
> behaviour (guess badly what the specific name is).

This 'guessing badly' already occurs in SeqIO, so you wouldn't have
to do anything about it. My initial idea was to snip out all
dependencies that SeqIO has on Bio::Species, replacing them with
streamlined dependencies (i.e. no 'guessing') using
Bio::Taxonomy::Node. You can do the same with Bio::Taxon,
Bio::DB::Taxonomy::list, and the RichSeq organelle() method for I/O.

Although redundant, this would allow for the gradual transition
Hilmar suggested (and the data is small; not like you're storing a
copy of the sequence or features). Therefore you wouldn't have to
worry about weird API issues with species() etc.

1) Make the critical replacement allowing read/write using Bio::Taxon
   instead of Bio::Species
2) Notify the group about changes, but indicate that Bio::Species
   will be supported until v 1.6.
3) At bioperl v1.6, add warnings but continue allowing use of
   Bio::Species
4) Remove Bio::Species, have the RichSeq species() throw or make it
   act as an alias of taxon().
    # lineage stored here or in a separate Bio::DB::Taxonomy::list
    $taxon = $seq->taxon;         # get Bio::Taxon object
    # use Bio::Taxon specific methods here

    $species = $seq->species;     # get Bio::Species object
    # use Bio::Species methods here
    # eventually add warnings then deprecate

    # both the below return the same info; latter will be deprecated
    $seq->organelle();            # new
    $seq->species->organelle();   # old

    # eventually deprecate Bio::Species and species()

Chris

> I'll wait till this weekend for any more comments and then commit and
> add to the wiki.
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From amanda.na at gmail.com Mon Aug 7 14:35:26 2006
From: amanda.na at gmail.com (Amanda n/a)
Date: Mon, 7 Aug 2006 14:35:26 -0400
Subject: [Bioperl-l] Parsing output with Clustalw.pm
Message-ID: <19344f3c0608071135q695ce8a6s5fca81471931a863@mail.gmail.com>

Hi,
Has anyone had luck in using Clustalw.pm to parse through clustalw .aln
and/or output files? Specifically, I'm looking to search for high scoring
alignments between specific pairs of sequences. If found, I then need to
extract the two sequence ids, sequences, alignment score and positions.

If anyone has had experience with this or knows of a resource/doc that runs
through this, I would be grateful if you would not mind sharing.
Thanks

From bix at sendu.me.uk Mon Aug 7 15:40:25 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Mon, 07 Aug 2006 20:40:25 +0100
Subject: [Bioperl-l] Parsing Clustalw alignment
In-Reply-To: <19344f3c0608071131g150ff883yeb28630ae509599e@mail.gmail.com>
References: <19344f3c0608071131g150ff883yeb28630ae509599e@mail.gmail.com>
Message-ID: <44D79729.705@sendu.me.uk>

Amanda n/a wrote:
> Hi,
> Has anyone had luck in using Clustalw.pm to parse through clustalw .aln
> and/or output files? Specifically, I'm looking to search for high scoring
> alignments between specific pairs of sequences. If found, I then need to
> extract the two sequence ids, sequences, alignment score and positions.

You'll want to use the Bio::AlignIO module for clustalw.

You read your file with AlignIO:
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/AlignIO.html

and then get your answers from an AlignI:
http://doc.bioperl.org/releases/bioperl-current/bioperl-live/Bio/Align/AlignI.html

That's perhaps not the easiest thing to get to grips with, but the
AlignIO modules are kind of similar in concept to the SeqIO ones, for
which there is a more friendly explanation:
http://www.bioperl.org/wiki/HOWTO:SeqIO

Also, http://www.bioperl.org/wiki/HOWTO:Beginners.

From cjfields at uiuc.edu Mon Aug 7 16:35:16 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 7 Aug 2006 15:35:16 -0500
Subject: [Bioperl-l] Ideas for next Bioperl release
Message-ID: <000a01c6ba60$fdf2c390$15327e82@pyrimidine>

All,

We are interested in ideas for what should be included in future releases
of Bioperl, including the next developer release (1.5.5), the eventual next
stable release (1.6), and beyond (?!?). I think we should at least try
getting a developer point release out soon as there are major changes
looming for Taxonomy, Feature/Annotation, etc. Along the way we can
determine the release pumpkin, etc.
The direct link:

http://www.bioperl.org/wiki/Bioperl_Release

Chris

Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign

From charles.tilford at bms.com Mon Aug 7 15:40:13 2006
From: charles.tilford at bms.com (Charles Tilford)
Date: Mon, 07 Aug 2006 15:40:13 -0400
Subject: [Bioperl-l] Bio::DB::Fasta fails for files over 4GB
In-Reply-To: <6C1831E0-3D32-4139-8A8C-C095502BB32A@uiuc.edu>
References: <44D764E4.4000706@bms.com> <44D786AB.3040401@bms.com> <6C1831E0-3D32-4139-8A8C-C095502BB32A@uiuc.edu>
Message-ID: <44D7971D.2010708@bms.com>

Chris Fields wrote:
> Why don't you submit your patch to Bugzilla?
>
Thanks, good idea; done:

http://bugzilla.bioperl.org/show_bug.cgi?id=2063

From jdw at ou.edu Mon Aug 7 17:11:47 2006
From: jdw at ou.edu (James D. White)
Date: Mon, 07 Aug 2006 16:11:47 -0500
Subject: [Bioperl-l] t/protgraph errors - was t/entrezgene, t/protgraph and t/FeatureIO
In-Reply-To:
References:
Message-ID: <44D7AC93.60701@ou.edu>

I recently updated to bioperl-live on Perl 5.8.5 under Solaris 9 and got
the following:

Failed Test    Total  Fail  Failed  List of Failed
------------------------------------------------
t/protgraph.t     66    24  36.36%  10-13 20-21 26 33 36-37 45 48-56 59-60 65-66

I did notice during my first pass at installing it, that Makefile.PL
mentioned that I did not have the Clone module installed, so Makefile.PL
was skipping the testing of ProteinGraph. (I think that was the message.)
I took care of another problem that I had and installed the Clone module,
then I got the error above. This is an error, not a warning and inhibits
"make install" IIRC. I briefly looked at t/protgraph.t, but I didn't know
what it did, and I did not need ProteinGraph, so to get around the error I
renamed t/protgraph.t so that "make test" did not try to run it. Then
"make test" and "make install" worked correctly.
There was a note in t/protgraph.t around the area of tests 10-13 where someone else who did not understand the module made corrections to get around the errors he was having. I took that as a sign that the errors were probably in either the module or the test and not just my installation. If protgraph.t does not give an error during "make test", then either (1) it worked, or (2) it was skipped. If you don't have Clone.pm installed, then it was skipped. The error summary at the end of the output from "make test" does not list the test programs that were skipped. You need to check the full "make test" output to be sure. There should be a "t/protgraph.t...........OK" message. On 8/7/06 10:04 AM, "Brian Osborne" wrote: >Sendu, > >This one was discussed in bioperl-l a couple of months ago. All tests pass >for me on Mac OS X 10.4.6, perl 5.8.6, bioperl-live. I'll take a closer look >at the individual tests, see if I can find a pattern... > >Brian O. > > >On 8/7/06 5:31 AM, "Sendu Bala" wrote: > > > >>> Failed Test Total Fail Failed List of Failed >>> ------------------------------------------------ >>> t/protgraph.t 66 23 34.85% 11 13 20-21 26 33 36-37 45 48-56 >>> 59-61 65-66 >>> >>> I don't know enough about it to know why its failing or what the answers >>> are really supposed to be. Does it fail for other people? Has it ever >>> worked in the past? >> >> > > > > >------------------------------ > > > From sdavis2 at mail.nih.gov Mon Aug 7 18:44:54 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 07 Aug 2006 18:44:54 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D77E20.1040604@sendu.me.uk> Message-ID: On 8/7/06 1:53 PM, "Sendu Bala" wrote: > Sean Davis wrote: >> >> On 8/7/06 11:51 AM, "Sendu Bala" wrote: >> >>> Actually, would there be any interest in a bioperl interface to the UCSC >>> tables? 
It could probably be done using their DAS server, and end up as >>> a very easy-to-use alternative to the Ensembl API for limited or one-off >>> queries. >> >> I haven't used their DAS server in a couple of years--it is probably worth >> my trying it again at some point. I have gone the route of maintaining a >> local ucsc mirror of the database. There are obvious advantages to doing >> this, including speed of access, no access limits, and the power and >> flexibility of SQL. With the simplicity of MySQL, one can even rsync >> directory from UCSC's mysql data directory directly, eliminating the need to >> do a table import step. I haven't gone so far as to implement a perl-based >> wrapper, but one could envision doing so for the most commonly-used tables, >> perhaps using Rose::DB::Object or DBIx::Class as a base with other methods >> added as needed based on the data contained in the table. > > Yes, using MySQL would probably be better. Any interface would ideally > use the UCSC MySQL server by default (with enforced X second waits), > with an option to set the address to your own server. This is the route I have gone. > Do you want to go ahead and look into making those classes for accessing > the common tables? It's in my plan to make various aspects of genomic > data retrieval a strength of bioperl as opposed to a surprising missing > link (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get > to that in a few weeks but if you lay the ground work or better yet > complete everything before then that would be great! :) So, there is a sketch of what things would look like here: http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz You can install it using the usual: perl Makefile.PL make test make install You will need to have installed: Rose::DB::Object As this is what ORM I am using. It is very crude and only includes the refLink and refFlat tables so far, but adding other tables is pretty straightforward, as you can see from the code. 
I would love to hear comments. Basically, to use it, you can do something
like what is shown in the synopsis; the output is given below:

NAME
    Bio::DB::UCSC - Access UCSC MySQL tables nicely

SYNOPSIS
    use Bio::DB::UCSC::RefLink::Manager;

    my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks(
        query => [
            mrnaAcc => {like => 'NM_00002%'},
        ],
    );

    foreach my $reflink (@$reflinks) {
        print "Accession: ",$reflink->mrnaAcc,"\n";
        print "  Gene ID: ",$reflink->locusLinkId,"\n";
        print "  Locations: \n";
        # and get all locations for each reflink
        # reflink table is related to refflat table
        my $refflats = $reflink->refflats;
        foreach my $refflat (@$refflats) {
            print "    Chrom: ",$refflat->chrom,
                  "  Transcription Start: ",$refflat->txStart,"\n";
        }
    }

OUTPUT:
    Accession: NM_000020
      Gene ID: 94
      Locations:
        Chrom: chr12  Transcription Start: 50587468
    Accession: NM_000026
      Gene ID: 158
      Locations:
        Chrom: chr22  Transcription Start: 39072508
    Accession: NM_000022
      Gene ID: 100
      Locations:
        Chrom: chr20  Transcription Start: 42681577
    Accession: NM_000027
      Gene ID: 175
      Locations:
        Chrom: chr4  Transcription Start: 178588917
    Accession: NM_000028
      Gene ID: 178
      Locations:
        Chrom: chr1  Transcription Start: 100088632
    Accession: NM_000023
      Gene ID: 6442
      Locations:
        Chrom: chr17  Transcription Start: 45598389
    Accession: NM_000029
      Gene ID: 183
      Locations:
        Chrom: chr1  Transcription Start: 228904891
    Accession: NM_000025
      Gene ID: 155
      Locations:
        Chrom: chr8  Transcription Start: 37939672
    Accession: NM_000021
      Gene ID: 5663
      Locations:
        Chrom: chr14  Transcription Start: 72672931
    Accession: NM_000024
      Gene ID: 154
      Locations:
        Chrom: chr5  Transcription Start: 148186368

From hlapp at gmx.net Mon Aug 7 23:54:16 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 7 Aug 2006 23:54:16 -0400
Subject: [Bioperl-l] Database Retrieval
In-Reply-To:
References:
Message-ID: <9CE694B6-3227-4400-96B2-AD5759467F60@gmx.net>

Sean, would it be possible to return standard bioperl objects, like
Bio::SeqI objects, or Bio::Annotation::Reference, Bio::LocationI,
etc? -hilmar On Aug 7, 2006, at 6:44 PM, Sean Davis wrote: > > > > On 8/7/06 1:53 PM, "Sendu Bala" wrote: > >> Sean Davis wrote: >>> >>> On 8/7/06 11:51 AM, "Sendu Bala" wrote: >>> >>>> Actually, would there be any interest in a bioperl interface to >>>> the UCSC >>>> tables? It could probably be done using their DAS server, and >>>> end up as >>>> a very easy-to-use alternative to the Ensembl API for limited or >>>> one-off >>>> queries. >>> >>> I haven't used their DAS server in a couple of years--it is >>> probably worth >>> my trying it again at some point. I have gone the route of >>> maintaining a >>> local ucsc mirror of the database. There are obvious advantages >>> to doing >>> this, including speed of access, no access limits, and the power and >>> flexibility of SQL. With the simplicity of MySQL, one can even >>> rsync >>> directory from UCSC's mysql data directory directly, eliminating >>> the need to >>> do a table import step. I haven't gone so far as to implement a >>> perl-based >>> wrapper, but one could envision doing so for the most commonly- >>> used tables, >>> perhaps using Rose::DB::Object or DBIx::Class as a base with >>> other methods >>> added as needed based on the data contained in the table. >> >> Yes, using MySQL would probably be better. Any interface would >> ideally >> use the UCSC MySQL server by default (with enforced X second waits), >> with an option to set the address to your own server. > > This is the route I have gone. > >> Do you want to go ahead and look into making those classes for >> accessing >> the common tables? It's in my plan to make various aspects of genomic >> data retrieval a strength of bioperl as opposed to a surprising >> missing >> link (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll >> get >> to that in a few weeks but if you lay the ground work or better yet >> complete everything before then that would be great! 
:) > > So, there is a sketch of what things would look like here: > > http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz > > You can install it using the usual: > > perl Makefile.PL > make test > make install > > You will need to have installed: > > Rose::DB::Object > > As this is what ORM I am using. It is very crude and only includes > the > refLink and refFlat tables so far, but adding other tables is pretty > straightforward, as you can see from the code. I would love to hear > comments. Basically, to use, you can do something like that shown > in the > synopsis and output is given below: > > NAME > Bio::DB::UCSC - Access UCSC MySQL tables nicely > > SYNOPSIS > use Bio::DB::UCSC::RefLink::Manager; > > my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks( > query => [ > mrnaAcc => {like => 'NM_00002%'}, > ], > ); > > foreach my $reflink (@$reflinks) { > print "Accession: ",$reflink->mrnaAcc,"\n"; > print " Gene ID: ",$reflink->locusLinkId,"\n"; > print " Locations: \n"; > # and get all locations for each reflink > # reflink table is related to refflat table > my $refflats = $reflink->refflats; > foreach my $refflat (@$refflats) { > print " Chrom: ",$refflat->chrom, > " Transcription Start: ",$refflat- > >txStart,"\n"; > } > } > > OUTPUT: > Accession: NM_000020 > Gene ID: 94 > Locations: > Chrom: chr12 Transcription Start: 50587468 > Accession: NM_000026 > Gene ID: 158 > Locations: > Chrom: chr22 Transcription Start: 39072508 > Accession: NM_000022 > Gene ID: 100 > Locations: > Chrom: chr20 Transcription Start: 42681577 > Accession: NM_000027 > Gene ID: 175 > Locations: > Chrom: chr4 Transcription Start: 178588917 > Accession: NM_000028 > Gene ID: 178 > Locations: > Chrom: chr1 Transcription Start: 100088632 > Accession: NM_000023 > Gene ID: 6442 > Locations: > Chrom: chr17 Transcription Start: 45598389 > Accession: NM_000029 > Gene ID: 183 > Locations: > Chrom: chr1 Transcription Start: 228904891 > Accession: NM_000025 > Gene ID: 155 > Locations: > 
Chrom: chr8 Transcription Start: 37939672 > Accession: NM_000021 > Gene ID: 5663 > Locations: > Chrom: chr14 Transcription Start: 72672931 > Accession: NM_000024 > Gene ID: 154 > Locations: > Chrom: chr5 Transcription Start: 148186368 > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From deep_ans at yahoo.com Tue Aug 8 00:10:23 2006 From: deep_ans at yahoo.com (deepak shingan) Date: Mon, 7 Aug 2006 21:10:23 -0700 (PDT) Subject: [Bioperl-l] A blast result file parsing exception Message-ID: <20060808041023.76470.qmail@web51715.mail.yahoo.com> Hi All, I have a bio-perl parser which parse a blast result file. It works fine for some files but for some files it throws following exception ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Can't get identical or conserved data: no data. STACK: Error::throw STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 STACK: Bio::Search::Hit::GenericHit::matches /usr/lib/perl5/site_perl/5.8.5/Bio/Search/Hit/GenericHit.pm:852 STACK: parserMethod.pl:56 ----------------------------------------------------------- I am sending the parser code and a temporary blast file on which this exception is generated . Please throw some light and please help me. Thanks Deepak --------------------------------- Yahoo! Music Unlimited - Access over 1 million songs.Try it free. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: blastParser.zip Type: application/x-zip-compressed Size: 56041 bytes Desc: 4066357048-blastParser.zip Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060807/0518701c/attachment-0001.bin From bix at sendu.me.uk Tue Aug 8 05:21:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 08 Aug 2006 10:21:38 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <44D857A2.80907@sendu.me.uk> Sean Davis wrote: > > On 8/7/06 1:53 PM, "Sendu Bala" wrote: > >> Do you want to go ahead and look into making those classes for >> accessing the common tables? It's in my plan to make various >> aspects of genomic data retrieval a strength of bioperl as opposed >> to a surprising missing link >> (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get >> to that in a few weeks but if you lay the ground work or better yet >> complete everything before then that would be great! :) > > So, there is a sketch of what things would look like here: > > http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz Thanks for that. > only includes the refLink and refFlat tables so far, but adding other > tables is pretty straightforward, as you can see from the code. I > would love to hear comments. Basically, to use, you can do something > like that shown in the synopsis and output is given below: > > NAME Bio::DB::UCSC - Access UCSC MySQL tables nicely > > SYNOPSIS use Bio::DB::UCSC::RefLink::Manager; > > my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks( query > => [ mrnaAcc => {like => 'NM_00002%'}, ], ); I appreciate that this is due to the way Rose::DB works, but is it possible to hide the SQL nature of what we're doing? Is it possible to hide even the table names? Ideally the interface API would survive a complete change in UCSC's table structures. The implementation would have to change, but user code would not. Are you willing to take this on from your outline and develop a set of more bioperlish modules? 
Even if you don't have time your contribution so far is certainly valuable, so thank you. I envisage that Bio::DB::UCSC.pm would be the easy-to-use starting point, presenting a code interface similar to the UCSC table browsing web interface. And while it would implement using various submodules, even UCSC.pm would be protected from SQL and table changes. From sdavis2 at mail.nih.gov Tue Aug 8 07:41:57 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 08 Aug 2006 07:41:57 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D857A2.80907@sendu.me.uk> Message-ID: On 8/8/06 5:21 AM, "Sendu Bala" wrote: > Sean Davis wrote: >> >> On 8/7/06 1:53 PM, "Sendu Bala" wrote: >> >>> Do you want to go ahead and look into making those classes for >>> accessing the common tables? It's in my plan to make various >>> aspects of genomic data retrieval a strength of bioperl as opposed >>> to a surprising missing link >>> (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get >>> to that in a few weeks but if you lay the ground work or better yet >>> complete everything before then that would be great! :) >> >> So, there is a sketch of what things would look like here: >> >> http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz > > Thanks for that. > > >> only includes the refLink and refFlat tables so far, but adding other >> tables is pretty straightforward, as you can see from the code. I >> would love to hear comments. Basically, to use, you can do something >> like that shown in the synopsis and output is given below: >> >> NAME Bio::DB::UCSC - Access UCSC MySQL tables nicely >> >> SYNOPSIS use Bio::DB::UCSC::RefLink::Manager; >> >> my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks( query >> => [ mrnaAcc => {like => 'NM_00002%'}, ], ); > > I appreciate that this is due to the way Rose::DB works, but is it > possible to hide the SQL nature of what we're doing? Is it possible to > hide even the table names? 
> > Ideally the interface API would survive a complete change in UCSC's > table structures. The implementation would have to change, but user code > would not. > > Are you willing to take this on from your outline and develop a set of > more bioperlish modules? Even if you don't have time your contribution > so far is certainly valuable, so thank you. > > I envisage that Bio::DB::UCSC.pm would be the easy-to-use starting > point, presenting a code interface similar to the UCSC table browsing > web interface. And while it would implement using various submodules, > even UCSC.pm would be protected from SQL and table changes. That is certainly possible--this is perl, right? I'll think about it, but I doubt that I have the time to put together a satisfactory "grand" solution that allows arbitrary queries without specifying SQL, returns bioperl objects, and doesn't reflect some of the underlying schema. If one settles on a set of objects that one wants to return, the process will be easier, but that limits the information that one can get from the database. Practically, to have a "table-browser-like" code interface will require exposing some of the SQL schema, as column names and table names will need to come into it. Taking such an approach, either based on RDBO or with hand-coded SQL management, precludes returning bioperl-type objects. On the other hand, if one wants only bioperl-type objects returned, the information that can be returned is quite limited and the query structure (from a perl point of view) will need to be limited to a set of fields that can ultimately be used to look up only the information associated with bioperl objects. I think the table-browser-like approach is the better way to go to start; let the user deal with making bioperl objects as he/she sees fit once the data is back. 
As a second round of development, one could certainly build a compatibility layer that uses the primary query engine to pull out information for constructing key bioperl objects, but I don't think that should be the primary goal, but a secondary one. All that said, I think some more discussion with some judicious code examples (even if WAY off track, as mine probably is) is probably needed before settling on a path forward. Sean From bix at sendu.me.uk Tue Aug 8 08:44:13 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 08 Aug 2006 13:44:13 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <44D8871D.5010007@sendu.me.uk> Sean Davis wrote: > That is certainly possible--this is perl, right? I'll think about it, but I > doubt that I have the time to put together a satisfactory "grand" solution > that allows arbitrary queries without specifying SQL, returns bioperl > objects, and doesn't reflect some of the underlying schema. If one settles > on a set of objects that one wants to return, the process will be easier, > but that limits the information that one can get from the database. > > Practically, to have a "table-browser-like" code interface will require > exposing some of the SQL schema, as column names and table names will need > to come into it. Not necessarily. You only have to have a mapping from the conceptual purpose of the table to its current name (and likewise for columns). So instead of a module called 'refLink' because there is a table called 'refLink', you might have something called refseq_mrna_links which maps to 'refLink'. Oh, and given the sheer number of tables, I don't think it would be appropriate to have a module per table. How about some single module that does the selection of the relevant database and table given $db and $table_concept? 
Perhaps:

    # map 'human' to the possible human databases, default 'hgXX'
    my $db = Bio::DB::UCSC::Databases('human');

    # map 'refseq_mrna_links' to 'refLink' and return a
    # Bio::DB::UCSC::Queryable
    my $queryable = new Bio::DB::UCSC::Table($db, 'refseq_mrna_links');

    # map mrna_accession method and its args to
    # query => [mrnaAcc => {like => 'NM_00002%'}]
    my $row_data = $queryable->mrna_accession(-like => 'NM_00002%');

Even that's not so hot; you still have to know some massive list of
inflexible table-concept names like 'refseq_mrna_links'. Perhaps it would
be even better if it was truly concept based. You say what you want and it
figures out the correct table:

    my $queryable = new Bio::DB::UCSC::Table($db, 'mrna_accession',
                                             'genomic_coordinates');

Sane? Reasonable? Desirable? Possible? I'm just throwing ideas out; you
may see a better way of achieving similar ends.

> Taking such an approach, either based on RDBO or with
> hand-coded SQL management, precludes returning bioperl-type objects. On the
> other hand, if one wants only bioperl-type objects returned, the information
> that can be returned is quite limited and the query structure (from a perl
> point of view) will need to be limited to a set of fields that can
> ultimately be used to look up only the information associated with bioperl
> objects. I think the table-browser-like approach is the better way to go to
> start; let the user deal with making bioperl objects as he/she sees fit once
> the data is back. As a second round of development, one could certainly
> build a compatibility layer that uses the primary query engine to pull out
> information for constructing key bioperl objects, but I don't think that
> should be the primary goal, but a secondary one.

Yes, that's the way it should be done, but the interface for the primary
query engine ought still be independent of the table structure.
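
[Editor's sketch] The concept-to-table mapping proposed above could look roughly like the following. This is only a minimal illustration under stated assumptions: the %CONCEPTS hash, the concept names 'refseq_mrna_links', 'mrna_accession' and 'gene_id', and the build_query() helper are all hypothetical. Only the table name 'refLink' and the column names 'mrnaAcc' and 'locusLinkId' appear in the thread. No database is contacted; the sketch just produces the query => [...] structure from the synopsis.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical mapping from concept names to the current UCSC table and
# column names. If UCSC renamed a table, only this hash would change.
my %CONCEPTS = (
    refseq_mrna_links => {
        table   => 'refLink',
        columns => {
            mrna_accession => 'mrnaAcc',
            gene_id        => 'locusLinkId',
        },
    },
);

# Translate (table concept, column concept, operator, value) into the
# query => [ column => { op => value } ] structure used in the synopsis.
sub build_query {
    my ($concept, $column_concept, $op, $value) = @_;
    my $entry = $CONCEPTS{$concept}
        or die "Unknown table concept '$concept'\n";
    my $column = $entry->{columns}{$column_concept}
        or die "Unknown column concept '$column_concept' for '$concept'\n";
    return {
        table => $entry->{table},
        query => [ $column => { $op => $value } ],
    };
}

my $q = build_query('refseq_mrna_links', 'mrna_accession',
                    like => 'NM_00002%');
print "table:  $q->{table}\n";
print "column: $q->{query}[0]\n";
print "like:   $q->{query}[1]{like}\n";
```

User code would then only ever mention the concept names, which is the property being asked for; whether a flat hash scales to the full set of UCSC tables is a separate question.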
From cjfields at uiuc.edu Tue Aug 8 08:49:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 07:49:23 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <76F6EE38-AD2D-4ABA-B5E9-C2741A3D6269@uiuc.edu> Most of the Bio::DB::* classes implement Bio::DB::RandomAccessI, which is the origin of the get_Seq_by* methods that Bio::DB::GenBank and others use. You could create a set of modules which implements an interface like RandomAccessI, grab the raw data on the backend using a UCSC-specific DB handle (using MySQL or whatever) or web agent, and get them into Bio* objects. This is what Bio::DB::GenBank does. It inherits from Bio::DB::NCBIHelper and Bio::DB::WebDBSeqI. WebDBSeqI implements methods from RandomAccessI and adds a web agent; NCBIHelper inherits from WebDBSeqI and adds NCBI-specific parameters for remote access of the Entrez protein and nucleotide databases. If you have the critical backend class made (remote or local access to the database), an interface could be designed similar to Bio::DB::GenBank. Chris On Aug 8, 2006, at 6:41 AM, Sean Davis wrote: > > > > On 8/8/06 5:21 AM, "Sendu Bala" wrote: > >> Sean Davis wrote: >>> >>> On 8/7/06 1:53 PM, "Sendu Bala" wrote: >>> >>>> Do you want to go ahead and look into making those classes for >>>> accessing the common tables? It's in my plan to make various >>>> aspects of genomic data retrieval a strength of bioperl as opposed >>>> to a surprising missing link >>>> (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get >>>> to that in a few weeks but if you lay the ground work or better yet >>>> complete everything before then that would be great! :) >>> >>> So, there is a sketch of what things would look like here: >>> >>> http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz >> >> Thanks for that. >> >> >>> only includes the refLink and refFlat tables so far, but adding >>> other >>> tables is pretty straightforward, as you can see from the code. 
I >>> would love to hear comments. Basically, to use, you can do >>> something >>> like that shown in the synopsis and output is given below: >>> >>> NAME Bio::DB::UCSC - Access UCSC MySQL tables nicely >>> >>> SYNOPSIS use Bio::DB::UCSC::RefLink::Manager; >>> >>> my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks( query >>> => [ mrnaAcc => {like => 'NM_00002%'}, ], ); >> >> I appreciate that this is due to the way Rose::DB works, but is it >> possible to hide the SQL nature of what we're doing? Is it >> possible to >> hide even the table names? >> >> Ideally the interface API would survive a complete change in UCSC's >> table structures. The implementation would have to change, but >> user code >> would not. >> >> Are you willing to take this on from your outline and develop a >> set of >> more bioperlish modules? Even if you don't have time your >> contribution >> so far is certainly valuable, so thank you. >> >> I envisage that Bio::DB::UCSC.pm would be the easy-to-use starting >> point, presenting a code interface similar to the UCSC table browsing >> web interface. And while it would implement using various submodules, >> even UCSC.pm would be protected from SQL and table changes. > > That is certainly possible--this is perl, right? I'll think about > it, but I > doubt that I have the time to put together a satisfactory "grand" > solution > that allows arbitrary queries without specifying SQL, returns bioperl > objects, and doesn't reflect some of the underlying schema. If one > settles > on a set of objects that one wants to return, the process will be > easier, > but that limits the information that one can get from the database. > > Practically, to have a "table-browser-like" code interface will > require > exposing some of the SQL schema, as column names and table names > will need > to come into it. Taking such an approach, either based on RDBO or > with > hand-coded SQL management, precludes returning bioperl-type > objects. 
On the > other hand, if one wants only bioperl-type objects returned, the > information > that can be returned is quite limited and the query structure (from > a perl > point of view) will need to be limited to a set of fields that can > ultimately be used to look up only the information associated with > bioperl > objects. I think the table-browser-like approach is the better way > to go to > start; let the user deal with making bioperl objects as he/she sees > fit once > the data is back. As a second round of development, one could > certainly > build a compatibility layer that uses the primary query engine to > pull out > information for constructing key bioperl objects, but I don't think > that > should be the primary goal, but a secondary one. > > All that said, I think some more discussion with some judicious code > examples (even if WAY off track, as mine probably is) is probably > needed > before settling on a path forward. > > Sean > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From sdavis2 at mail.nih.gov Tue Aug 8 09:09:35 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 08 Aug 2006 09:09:35 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <76F6EE38-AD2D-4ABA-B5E9-C2741A3D6269@uiuc.edu> Message-ID: On 8/8/06 8:49 AM, "Chris Fields" wrote: > Most of the Bio::DB::* classes implement Bio::DB::RandomAccessI, > which is the origin of the get_Seq_by* methods that Bio::DB::GenBank > and others use. You could create a set of modules which implements > an interface like RandomAccessI, grab the raw data on the backend > using a UCSC-specific DB handle (using MySQL or whatever) or web > agent, and get them into Bio* objects. I can look into this as a limited solution. 
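A minimal sketch of the layering described in the quoted paragraph, assuming a front-end that exposes the get_Seq_by_* method names and hands the real work to a backend. The package name Sketch::DB::UCSC, the 'fetch' argument, the in-memory store and the accession in it are all made up for illustration; a real module would presumably inherit from Bio::DB::RandomAccessI and return Bio::SeqI objects rather than plain hashes.

```perl
use strict;
use warnings;

# Hypothetical front-end class (not part of BioPerl): it exposes the
# get_Seq_by_* method names and delegates fetching to a backend code
# reference. A plain hash stands in for a Bio::SeqI object.
package Sketch::DB::UCSC;

sub new {
    my ($class, %args) = @_;
    my $fetch = $args{fetch} or die "a 'fetch' code reference is required\n";
    return bless { fetch => $fetch }, $class;
}

sub get_Seq_by_id      { my ($self, $id)  = @_; return $self->_get(id      => $id)  }
sub get_Seq_by_acc     { my ($self, $acc) = @_; return $self->_get(acc     => $acc) }
sub get_Seq_by_version { my ($self, $ver) = @_; return $self->_get(version => $ver) }

# Ask the backend for raw data and wrap it; returns nothing if not found.
sub _get {
    my ($self, $key, $value) = @_;
    my $raw = $self->{fetch}->($key, $value);
    return unless defined $raw;
    return { display_id => $value, seq => $raw };
}

package main;

# Stand-in backend: an in-memory lookup instead of MySQL or a web agent.
# The accession and sequence are invented test data.
my %fake_store = ('NM_000001' => 'ACGTACGT');
my $db = Sketch::DB::UCSC->new(
    fetch => sub { my ($key, $value) = @_; return $fake_store{$value} },
);

my $seq = $db->get_Seq_by_acc('NM_000001');
print "$seq->{display_id} $seq->{seq}\n";
```

Swapping the code reference for a MySQL handle or a web agent would leave callers of get_Seq_by_acc untouched, which is the point of putting the interface in front of the backend.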
> This is what Bio::DB::GenBank does. It inherits from > Bio::DB::NCBIHelper and Bio::DB::WebDBSeqI. WebDBSeqI implements > methods from RandomAccessI and adds a web agent; NCBIHelper inherits > from WebDBSeqI and adds NCBI-specific parameters for remote access of > the Entrez protein and nucleotide databases. These have relatively clean, well-defined APIs; UCSC does not. If you have access to the UCSC source code, just take a look at joiner.doc to see the mess. Accessing NCBI is quite a different matter than accessing UCSC, I think. > If you have the critical backend class made (remote or local access > to the database), an interface could be designed similar to > Bio::DB::GenBank. That critical backend is not straightforward, as noted above, but I'll think about it more. Unlike Genbank where each "object" is the same, there is no such single entity at UCSC, so returning data from UCSC is potentially much more complicated, with special cases for refSeq, knownGene, ESTs, mRNAs, BACS, SNPs, cpg islands, etc. All I'm saying is that the design of UCSC places some constraints on at least the implementation of the interface, if not also on the design of the API. Sean From cjfields at uiuc.edu Tue Aug 8 09:26:46 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 08:26:46 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D8871D.5010007@sendu.me.uk> References: <44D8871D.5010007@sendu.me.uk> Message-ID: It's important to initially build something capable of returning everything UCSC has to offer, initially as just raw data in XML or text. Open the floodgates, so to speak. That was why I designed EUtilities. It returns literally anything from Entrez accessible by parameters in the format (text or XML) it will likely be used in; it's not limited to only sequences, pubmed, etc. And I can access all the EUtilities (elink, efetch, epost, and so on). Why? (Evil laugh.....) 
Because access to the data, IMHO, is more important to get set up first, even if it only returns raw data. Then the critical infrastructure is there in the DB class to get anything you want from the database. You can then use your DB class as a DB handle or web agent inside another class which has a consistent API, like that for RandomAccessI (sequence-specific DB access), to get the data into the appropriate objects. Bio::Taxonomy::Node attempted a similar thing, correct? Hilmar wanted to know the following, which indicates going beyond just sequences: > would it be possible to return standard bioperl objects, like > Bio:SeqI objects, or Bio::Annotation::Reference, Bio::LocationI, etc? 'Front-end' classes that return appropriate objects (SeqI, LocationI, etc) could be built around the DB class; the key is the consistent interface. So we would need a RandomAccessI-like interface for LocationI, Annotation::References, etc. If someone really wants references, they could build a class to get them into the appropriate objects using your DB class as the 'backend' to get the raw data. Chris On Aug 8, 2006, at 7:44 AM, Sendu Bala wrote: > Sean Davis wrote: >> That is certainly possible--this is perl, right? I'll think about >> it, but I >> doubt that I have the time to put together a satisfactory "grand" >> solution >> that allows arbitrary queries without specifying SQL, returns bioperl >> objects, and doesn't reflect some of the underlying schema. If >> one settles >> on a set of objects that one wants to return, the process will be >> easier, >> but that limits the information that one can get from the database. >> >> Practically, to have a "table-browser-like" code interface will >> require >> exposing some of the SQL schema, as column names and table names >> will need >> to come into it. > > Not necessarily. You only have to have a mapping from the conceptual > purpose of the table to its current name (and likewise for > columns).
So > instead of a module called 'refLink' because there is a table called > 'refLink', you might have something called refseq_mrna_links which > maps > to 'refLink'. Oh, and given the sheer number of tables, I don't > think it > would be appropriate to have a module per table. > > How about some single module that does the selection of the relevant > database and table given $db and $table_concept? Perhaps: > > # map 'human' to the possible human databases, default 'hgXX' > my $db = Bio::DB::UCSC::Databases('human'); > > # map 'refseq_mrna_links' to 'refLink' and return a > # Bio::DB::UCSC::Queryable > my $queryable = new Bio::DB::UCSC::Table($db, 'refseq_mrna_links'); > > # map mrna_accession method and its args to > # query => [mrnaAcc => {like => 'NM_00002%'}] > my $row_data = $queryable->mrna_accession(-like => 'NM_00002%'); > > > Even that's not so hot; you still have to know some massive list of > inflexible table-concept names like 'refseq_mrna_links'. Perhaps it > would be even better if it was truly concept based. You say what you > want and it figures out the correct table: > > my $queryable = new Bio::DB::UCSC::Table($db, 'mrna_accession', > 'genomic_coordinates'); > > > Sane? Reasonable? Desirable? Possible? I'm just throwing ideas out; > you > may see a better way of achieving similar ends. > > >> Taking such an approach, either based on RDBO or with >> hand-coded SQL management, precludes returning bioperl-type >> objects. On the >> other hand, if one wants only bioperl-type objects returned, the >> information >> that can be returned is quite limited and the query structure >> (from a perl >> point of view) will need to be limited to a set of fields that can >> ultimately be used to look up only the information associated with >> bioperl >> objects. I think the table-browser-like approach is the better >> way to go to >> start; let the user deal with making bioperl objects as he/she >> sees fit once >> the data is back. 
As a second round of development, one could >> certainly >> build a compatibility layer that uses the primary query engine to >> pull out >> information for constructing key bioperl objects, but I don't >> think that >> should be the primary goal, but a secondary one. > > Yes, that's the way it should be done, but the interface for the > primary > query engine ought still be independent of the table structure. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Tue Aug 8 09:28:51 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 8 Aug 2006 09:28:51 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: Hi Sean, the module you whipped up is great. My rationale behind asking whether bioperl objects could be returned too was that my bias is that once somebody goes through the trouble of installing bioperl one probably does so with the expectation that most if not all of the modules can work with each other in some way. The second rationale was that converting the data to bioperl-compliant objects would probably be a repetitive task from one situation to the next and so perfectly eligible for provision by the library as a reusable block of code. -hilmar On Aug 8, 2006, at 7:41 AM, Sean Davis wrote: > > > > On 8/8/06 5:21 AM, "Sendu Bala" wrote: > >> Sean Davis wrote: >>> >>> On 8/7/06 1:53 PM, "Sendu Bala" wrote: >>> >>>> Do you want to go ahead and look into making those classes for >>>> accessing the common tables? 
It's in my plan to make various >>>> aspects of genomic data retrieval a strength of bioperl as opposed >>>> to a surprising missing link >>>> (http://www.bioperl.org/wiki/Getting_Genomic_Sequences); I'll get >>>> to that in a few weeks but if you lay the ground work or better yet >>>> complete everything before then that would be great! :) >>> >>> So, there is a sketch of what things would look like here: >>> >>> http://watson.nci.nih.gov/~sdavis/Bio-DB-UCSC.tar.gz >> >> Thanks for that. >> >> >>> only includes the refLink and refFlat tables so far, but adding >>> other >>> tables is pretty straightforward, as you can see from the code. I >>> would love to hear comments. Basically, to use, you can do >>> something >>> like that shown in the synopsis and output is given below: >>> >>> NAME Bio::DB::UCSC - Access UCSC MySQL tables nicely >>> >>> SYNOPSIS use Bio::DB::UCSC::RefLink::Manager; >>> >>> my $reflinks = Bio::DB::UCSC::RefLink::Manager->get_reflinks( query >>> => [ mrnaAcc => {like => 'NM_00002%'}, ], ); >> >> I appreciate that this is due to the way Rose::DB works, but is it >> possible to hide the SQL nature of what we're doing? Is it >> possible to >> hide even the table names? >> >> Ideally the interface API would survive a complete change in UCSC's >> table structures. The implementation would have to change, but >> user code >> would not. >> >> Are you willing to take this on from your outline and develop a >> set of >> more bioperlish modules? Even if you don't have time your >> contribution >> so far is certainly valuable, so thank you. >> >> I envisage that Bio::DB::UCSC.pm would be the easy-to-use starting >> point, presenting a code interface similar to the UCSC table browsing >> web interface. And while it would implement using various submodules, >> even UCSC.pm would be protected from SQL and table changes. > > That is certainly possible--this is perl, right? 
I'll think about > it, but I > doubt that I have the time to put together a satisfactory "grand" > solution > that allows arbitrary queries without specifying SQL, returns bioperl > objects, and doesn't reflect some of the underlying schema. If one > settles > on a set of objects that one wants to return, the process will be > easier, > but that limits the information that one can get from the database. > > Practically, to have a "table-browser-like" code interface will > require > exposing some of the SQL schema, as column names and table names > will need > to come into it. Taking such an approach, either based on RDBO or > with > hand-coded SQL management, precludes returning bioperl-type > objects. On the > other hand, if one wants only bioperl-type objects returned, the > information > that can be returned is quite limited and the query structure (from > a perl > point of view) will need to be limited to a set of fields that can > ultimately be used to look up only the information associated with > bioperl > objects. I think the table-browser-like approach is the better way > to go to > start; let the user deal with making bioperl objects as he/she sees > fit once > the data is back. As a second round of development, one could > certainly > build a compatibility layer that uses the primary query engine to > pull out > information for constructing key bioperl objects, but I don't think > that > should be the primary goal, but a secondary one. > > All that said, I think some more discussion with some judicious code > examples (even if WAY off track, as mine probably is) is probably > needed > before settling on a path forward. 
> > Sean > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sdavis2 at mail.nih.gov Tue Aug 8 09:35:50 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 08 Aug 2006 09:35:50 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: Message-ID: On 8/8/06 9:28 AM, "Hilmar Lapp" wrote: > Hi Sean, the module you whipped up is great. My rationale behind > asking whether bioperl objects could be returned too was that my bias > is that once somebody goes through the trouble of installing bioperl > one probably does so with the expectation that most if not all of the > modules can work with each other in some way. The second rationale > was that converting the data to bioperl-compliant objects would > probably be a repetitive task from one situation to the next and so > perfectly eligible for provision by the library as a reusable block > of code. Hilmar, This makes perfect sense and should be a goal, I agree. Sean From bix at sendu.me.uk Tue Aug 8 09:37:07 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 08 Aug 2006 14:37:07 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: <44D8871D.5010007@sendu.me.uk> Message-ID: <44D89383.2060102@sendu.me.uk> Chris Fields wrote: > It's important to initially build something capable of returning > everything UCSC has to offer Yes, I agree with that. I think it's also important, however, that the backend interface be stable so future front-end modules and user code don't have to change when the tables do. 
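A minimal sketch of what a table-independent backend interface could look like, assuming a lookup from concept-style names to the current table and column names. The names refseq_mrna_links, mrna_accession, refLink and mrnaAcc come from earlier in the thread; %CONCEPT_MAP and translate_query are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical mapping from stable, concept-style names to the current
# table and column names. Only this hash would need editing if the
# underlying schema changed.
my %CONCEPT_MAP = (
    refseq_mrna_links => {
        table   => 'refLink',
        columns => { mrna_accession => 'mrnaAcc' },
    },
);

# Translate a concept-level query into the table/column form that a
# backend would understand.
sub translate_query {
    my ($concept, %criteria) = @_;
    my $entry = $CONCEPT_MAP{$concept}
        or die "unknown table concept '$concept'\n";
    my %translated;
    for my $field (sort keys %criteria) {
        my $column = $entry->{columns}{$field}
            or die "unknown field '$field' for '$concept'\n";
        $translated{$column} = $criteria{$field};
    }
    return { table => $entry->{table}, where => \%translated };
}

my $q = translate_query('refseq_mrna_links', mrna_accession => 'NM_00002%');
print "$q->{table}: mrnaAcc like $q->{where}{mrnaAcc}\n";
```

User code written against the concept names would keep working after a table rename; only the map changes.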
From hlapp at gmx.net Tue Aug 8 09:42:06 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 8 Aug 2006 09:42:06 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: <44D8871D.5010007@sendu.me.uk> Message-ID: <9FB18F3F-786F-40B5-ABB8-FE2642DB30D4@gmx.net> On Aug 8, 2006, at 9:26 AM, Chris Fields wrote: > >> would it be possible to return standard bioperl objects, like >> Bio:SeqI objects, or Bio::Annotation::Reference, Bio::LocationI, etc? > > 'Front-end' classes that return appropriate objects (SeqI, LocationI, > etc) could be built around the DB class; the key is the consistent > interface. So we would need a RandomAccessI-like interface for > LocationI, Annotation::References, etc. Sounds like a good idea. Do you feel like coding up prototypes? That'd be great. > If someone really wants references, they could build a class to get > them into the appropriate objects using your DB class as the > 'backend' to get the raw data. > Yes, sure - but: one should keep in mind that bioperl as a library and as a project offers certain promises in return of the hassle to install and learn it, one of which is to offer implementations of common tasks as reusable code in well-defined and consistent APIs. Just keep that in mind as the long term picture ... -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Aug 8 09:42:53 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 08:42:53 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <3FD7FA41-A0CF-45A9-BDBB-9DAC08FC20BC@uiuc.edu> ... > These have relatively clean, well-defined APIs; UCSC does not. If > you have > access to the UCSC source code, just take a look at joiner.doc to > see the > mess. Accessing NCBI is quite a different matter than accessing > UCSC, I > think. 
Yes, I think every database is different. The critical thing is to get the data flowing first, then worry about getting it into the appropriate objects. So, how would you design a generic interface to access anything in UCSC, either remotely or locally (MySQL)? That would be the start. It can be modified from there. >> If you have the critical backend class made (remote or local access >> to the database), an interface could be designed similar to >> Bio::DB::GenBank. > > That critical backend is not straightforward, as noted above, but > I'll think > about it more. > Unlike Genbank where each "object" is the same, there is no such > single > entity at UCSC, so returning data from UCSC is potentially much more > complicated, with special cases for refSeq, knownGene, ESTs, mRNAs, > BACS, > SNPs, cpg islands, etc. All I'm saying is that the design of UCSC > places > some constraints on at least the implementation of the interface, > if not > also on the design of the API. NCBI has the same issue; dbSNP returns several different formats, only XML clusters are recognized in Bioperl. Taxonomy access also returns several formats (XML is used in Bioperl). The key would be to map those special cases to return the data in a format you expect Bioperl to eventually use, normally XML or text. There are a few exceptions (EntrezGene uses ASN1). You could also have an override allowed; EUtilities allows the use of the parameter 'retmode' so you can override the return mode specified by the mapped databases. As an example, here's a small bit from EUtilities in the BEGIN block: %DATABASE = ('pubmed' => 'xml', 'protein' => 'text', 'nucleotide' => 'text', 'nuccore' => 'text', 'nucgss' => 'text', 'nucest' => 'text', 'structure' => 'text', 'genome' => 'text', 'books' => 'xml', 'cancerchromosomes'=> 'xml', 'cdd' => 'xml', 'domains' => 'xml', 'gene' => 'asn1', ... 
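The override described above can be sketched roughly as follows. This is not the EUtilities code itself: %DEFAULT_RETMODE repeats only a few of the entries quoted above, and resolve_retmode is a hypothetical helper showing how an explicit 'retmode' argument could take precedence over the per-database default.

```perl
use strict;
use warnings;

# Hypothetical default return modes per database; a subset of the
# %DATABASE entries quoted above, not the full table.
my %DEFAULT_RETMODE = (
    pubmed     => 'xml',
    protein    => 'text',
    nucleotide => 'text',
    gene       => 'asn1',
);

# Pick the return mode for a request: an explicit 'retmode' argument
# wins, otherwise fall back to the per-database default.
sub resolve_retmode {
    my (%args) = @_;
    my $db = $args{db};
    die "no database given\n" unless defined $db;
    return $args{retmode} if defined $args{retmode};
    die "unknown database '$db'\n" unless exists $DEFAULT_RETMODE{$db};
    return $DEFAULT_RETMODE{$db};
}

print resolve_retmode(db => 'pubmed'), "\n";                 # default for pubmed
print resolve_retmode(db => 'gene', retmode => 'xml'), "\n"; # explicit override
```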
Chris > Sean Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From osborne1 at optonline.net Tue Aug 8 09:38:15 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Tue, 08 Aug 2006 09:38:15 -0400 Subject: [Bioperl-l] A blast result file parsing exception In-Reply-To: <20060808041023.76470.qmail@web51715.mail.yahoo.com> Message-ID: Deepak, What Bioperl version are you using? Brian O. On 8/8/06 12:10 AM, "deepak shingan" wrote: > Hi All, > I have a bio-perl parser which parse a blast result file. It works fine for > some files but for some files it throws following exception > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Can't get identical or conserved data: no data. > STACK: Error::throw > STACK: Bio::Root::Root::throw > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > STACK: Bio::Search::Hit::GenericHit::matches > /usr/lib/perl5/site_perl/5.8.5/Bio/Search/Hit/GenericHit.pm:852 > STACK: parserMethod.pl:56 > ----------------------------------------------------------- > I am sending the parser code and a temporary blast file on which this > exception is generated . > Please throw some light and please help me. > > Thanks > Deepak
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Tue Aug 8 09:48:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 08:48:01 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D89383.2060102@sendu.me.uk> References: <44D8871D.5010007@sendu.me.uk> <44D89383.2060102@sendu.me.uk> Message-ID: <20F6A514-394D-4654-8541-B91F869561CE@uiuc.edu> Agreed. The backend would have to be maintained based on regular database updates/changes. Unless there are specific parameter changes, that shouldn't be a problem. Chris On Aug 8, 2006, at 8:37 AM, Sendu Bala wrote: > Chris Fields wrote: >> It's important to initially build something capable of returning >> everything UCSC has to offer > > Yes, I agree with that. I think it's also important, however, that the > backend interface be stable so future front-end modules and user code > don't have to change when the tables do. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Aug 8 10:16:20 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 09:16:20 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <9FB18F3F-786F-40B5-ABB8-FE2642DB30D4@gmx.net> References: <44D8871D.5010007@sendu.me.uk> <9FB18F3F-786F-40B5-ABB8-FE2642DB30D4@gmx.net> Message-ID: <4D374A57-F8DD-430E-B883-3CB2B538ECAA@uiuc.edu> On Aug 8, 2006, at 8:42 AM, Hilmar Lapp wrote: > >> ... >> 'Front-end' classes that return appropriate objects (SeqI, LocationI, >> etc) could be built around the DB class; the key is the consistent >> interface. 
So we would need a RandomAccessI-like interface for >> LocationI, Annotation::References, etc. > > Sounds like a good idea. Do you feel like coding up prototypes? > That'd be great. I could work on that. RandomAccessI is a pretty simple interface class with a few abstract methods: get_Seq_by_id get_Seq_by_acc get_Seq_by_version Other RandomAccessI-implementing modules add 'get_Stream*' methods, including 'get_Stream_by_query'. Using that, we could have get_Location(s)*, get_Ref(s)*, get_Taxon/ Taxa*, etc. Most of these would only use a unique ID or query, though. Would we want to lump all these together in one non-specific interface class or split them up into several specific interfaces? Many of the latter will likely have only a few methods, but at least they would be consistent, so I think the latter. > ... > Yes, sure - but: one should keep in mind that bioperl as a library > and as a project offers certain promises in return of the hassle to > install and learn it, one of which is to offer implementations of > common tasks as reusable code in well-defined and consistent APIs. > Just keep that in mind as the long term picture ... Agreed. The only caveat: I think the 'backend' DB-specific class would, by necessity, have to use a DB-specific interface that allowed maximum access to the database. The 'front-end' class which gets the raw data into Bioperlish objects must have a consistent Bioperl interface; this is something we really need to enforce. Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Tue Aug 8 11:21:11 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 8 Aug 2006 11:21:11 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <4D374A57-F8DD-430E-B883-3CB2B538ECAA@uiuc.edu> References: <44D8871D.5010007@sendu.me.uk> <9FB18F3F-786F-40B5-ABB8-FE2642DB30D4@gmx.net> <4D374A57-F8DD-430E-B883-3CB2B538ECAA@uiuc.edu> Message-ID: <30EE71A4-79E8-44B4-811D-ACCFF3BB3324@gmx.net> On Aug 8, 2006, at 10:16 AM, Chris Fields wrote: > Using that, we could have get_Location(s)*, get_Ref(s)*, get_Taxon/ > Taxa*, etc. Most of these would only use a unique ID or query, > though. > > Would we want to lump all these together in one non-specific > interface class or split them up into several specific interfaces? > Many of the latter will likely have only a few methods, but at > least they would be consistent, so I think the latter. Right. It is easy for an implementation class to implement multiple interfaces, but it is messy to split up interfaces later that were lumped together earlier without a clear need to do so. "do one task only but do it well" - that's the motto for the interface, where doing it well also means to keep it as simple as possible. thanks for taking a stab. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Tue Aug 8 11:46:39 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 08 Aug 2006 16:46:39 +0100 Subject: [Bioperl-l] Wiki edits Message-ID: <44D8B1DF.7080407@sendu.me.uk> Hi all, Can I make a request that everyone take care about the status of their edits with regard to 'minor'? http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page#Minor_edits http://en.wikipedia.org/wiki/Wikipedia:Minor_edit Salient quotes being: "minor edit... 
implies trivial changes only, such as typo corrections, formatting and presentational changes and rearranging of text without changing any content" "any change that affects the meaning of an article is not minor, even if it involves one word" Cheers, Sendu. From benoit at ebi.ac.uk Tue Aug 8 11:37:46 2006 From: benoit at ebi.ac.uk (Benoit Ballester) Date: Tue, 08 Aug 2006 16:37:46 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: References: Message-ID: <44D8AFCA.5020000@ebi.ac.uk> Dear Rick > I have a list of mouse GeneIDs for which I have extracted the RefSeqs for. > With these accession numbers I want to know what are the three closest > upstream and downstream genes (and orientation, if possible) to my gene of > interest. Is there some way of finding this out? Any suggestions? You can search the Ensembl database with your RefSeq IDs. You can either use the Ensembl Genome Browser (www.ensembl.org) or the Ensembl Perl API. I have attached a simple script that extracts the upstream and downstream genes for a given human RefSeq ID. You can start from there and have a look at the Ensembl Perl API (http://www.ensembl.org/info/software/index.html). You will find tutorials and documentation for each API (Core, Compara, Variation) at the address above. > Also, I would also like to know something about the expression of a > particular gene of interest. e.g. by querying the Novartis Gene Atlas > (http://symatlas.gnf.org/SymAtlas/). Is there a module for handling > submissions and retrieving results of this sort? You can get expression data from ArrayExpress and GEO (http://www.ebi.ac.uk/arrayexpress/ http://www.ncbi.nlm.nih.gov/geo/). ArrayExpress and GEO are public repositories for microarray data (MIAME compliant). Hope this helps. -- Benoit Ballester -------------- next part -------------- An embedded and charset-unspecified text was scrubbed...
Name: registry_ensembl_init.txt Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060808/f8f5e873/attachment.txt -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: Example.pl Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060808/f8f5e873/attachment.pl From cjfields at uiuc.edu Tue Aug 8 12:36:12 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 11:36:12 -0500 Subject: [Bioperl-l] Wiki edits In-Reply-To: <44D8B1DF.7080407@sendu.me.uk> References: <44D8B1DF.7080407@sendu.me.uk> Message-ID: We will have to either add something to the Style Guide or create a new wiki page outlining proper editing and other rules we should promote (no spam, etc). I think everyone who has contributed to the wiki is guilty to varying degrees on using 'minor' edits for other purposes (small additions, etc). Chris On Aug 8, 2006, at 10:46 AM, Sendu Bala wrote: > Hi all, > Can I make a request that everyone take care about the status of their > edits with regard to 'minor'? > > http://en.wikipedia.org/wiki/Wikipedia:How_to_edit_a_page#Minor_edits > http://en.wikipedia.org/wiki/Wikipedia:Minor_edit > > Salient quotes being: > > "minor edit... implies trivial changes only, such as typo corrections, > formatting and presentational changes and rearranging of text without > changing any content" > > "any change that affects the meaning of an article is not minor, > even if > it involves one word" > > > Cheers, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Tue Aug 8 12:53:28 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 11:53:28 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <30EE71A4-79E8-44B4-811D-ACCFF3BB3324@gmx.net> References: <44D8871D.5010007@sendu.me.uk> <9FB18F3F-786F-40B5-ABB8-FE2642DB30D4@gmx.net> <4D374A57-F8DD-430E-B883-3CB2B538ECAA@uiuc.edu> <30EE71A4-79E8-44B4-811D-ACCFF3BB3324@gmx.net> Message-ID: Sounds good. I guess first things first. We should get the basic UCSC DB backend up and running first, then implement RandomAccessI for simple sequence retrieval. I'll try getting interfaces for Location, Refs, others along the way, which can be implemented as needed. This will be pretty simple So, we'll probably have something along the lines of Bio::DB::RandomAccessI - Bio::SeqI retrieval interface Bio::DB::LocationI - Bio::LocationI retrieval interface Bio::DB::ReferenceI - Bio::Annotation::Reference (or similar) retrieval interface Bio::DB::ClusterI - Bio::ClusterI retrieval interface (SNP) ....and so on Should Sean get an CVS account or should we just pass everything via a proxy, Sendu maybe? Might be easier to give him CVS access so Sendu can focus on Taxonomy, I can focus on the interfaces and getting through my benchwork (?!?), and Hilmar can focus on $job. Chris On Aug 8, 2006, at 10:21 AM, Hilmar Lapp wrote: > > On Aug 8, 2006, at 10:16 AM, Chris Fields wrote: > >> Using that, we could have get_Location(s)*, get_Ref(s)*, get_Taxon/ >> Taxa*, etc. Most of these would only use a unique ID or query, >> though. >> >> Would we want to lump all these together in one non-specific >> interface class or split them up into several specific interfaces? >> Many of the latter will likely have only a few methods, but at >> least they would be consistent, so I think the latter. > > Right. 
It is easy for an implementation class to implement multiple > interfaces, but it is messy to split up interfaces later that were > lumped together earlier without a clear need to do so. > > "do one task only but do it well" - that's the motto for the > interface, where doing it well also means to keep it as simple as > possible. > > thanks for taking a stab. > > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From sdavis2 at mail.nih.gov Tue Aug 8 13:14:04 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 08 Aug 2006 13:14:04 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: Message-ID: On 8/8/06 12:53 PM, "Chris Fields" wrote: > Sounds good. I guess first things first. We should get the basic > UCSC DB backend up and running first, then implement RandomAccessI > for simple sequence retrieval. I'll try getting interfaces for > Location, Refs, others along the way, which can be implemented as > needed. This will be pretty simple > > So, we'll probably have something along the lines of > > Bio::DB::RandomAccessI - Bio::SeqI retrieval interface > Bio::DB::LocationI - Bio::LocationI retrieval interface > Bio::DB::ReferenceI - Bio::Annotation::Reference (or similar) > retrieval interface > Bio::DB::ClusterI - Bio::ClusterI retrieval interface (SNP) > ....and so on > > Should Sean get an CVS account or should we just pass everything via > a proxy, Sendu maybe? 
Might be easier to give him CVS access so
> Sendu can focus on Taxonomy, I can focus on the interfaces and
> getting through my benchwork (?!?), and Hilmar can focus on $job.

Whatever works for me, although I should probably run stuff by folks first, as I'm not used to the bioperl conventions--errors, etc....

Sean

From bix at sendu.me.uk Tue Aug 8 14:24:41 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 08 Aug 2006 19:24:41 +0100
Subject: [Bioperl-l] Wiki edits
In-Reply-To:
References: <44D8B1DF.7080407@sendu.me.uk>
Message-ID: <44D8D6E9.20603@sendu.me.uk>

Chris Fields wrote:
> We will have to either add something to the Style Guide or create a
> new wiki page outlining proper editing and other rules we should
> promote (no spam, etc).

I don't think it's really necessary, given that it's already there. There's a link on every edit page that tells you how to edit a page ('editing help'). If people haven't read that, there's no reason to assume they've read the Style Guide either, or would read some new page ;)

Of course it's fair that people haven't read the friendly manual, so I think the occasional reminder on this list is harmless.

From bix at sendu.me.uk Tue Aug 8 14:41:21 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Tue, 08 Aug 2006 19:41:21 +0100
Subject: [Bioperl-l] Database Retrieval
In-Reply-To:
References:
Message-ID: <44D8DAD1.5010402@sendu.me.uk>

Sean Davis wrote:
>
> On 8/8/06 12:53 PM, "Chris Fields" wrote:
>
>> Should Sean get an CVS account or should we just pass everything via
>> a proxy, Sendu maybe? Might be easier to give him CVS access so
>> Sendu can focus on Taxonomy, I can focus on the interfaces and
>> getting through my benchwork (?!?), and Hilmar can focus on $job.
>
> Whatever works for me, although I should probably run stuff by folks first,
> as I'm not used to the bioperl conventions--errors, etc....

I've placed this 'project' as a point for the 1.5.5 bioperl release currently planned for 'fall 2006'.
But so you both have a better idea of time-lines, here's a very aggressive one for you: I'm going to want to start using, or start writing, a front-end module(s) on September 4th. So ideally I'd like to see the backend and interface work done before then. As we get closer to the end of the month, if it doesn't seem like that's going to happen, let me know and I can dive in and start helping out directly on those areas. Thank you, Sendu. From cjfields at uiuc.edu Tue Aug 8 15:47:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 14:47:49 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D8DAD1.5010402@sendu.me.uk> Message-ID: <000101c6bb23$8750b260$15327e82@pyrimidine> Sendu, I hate to say this, but rethink that deadline a little. Sept 4 falls at the beginning of the academic year so I will be flooded with tons of work, let alone work on $job (postdoc, bench scientist) and try to get EUtilities in decent shape. So I'm doing some heavy prioritizing now; this falls further down on the list. It's probably the same on Sean's end as well. Hence my thought that getting Sean a CVS account would speed things along. It would be nice to get 1.5.5 out by late Sept-mid Oct. But we also need to be realistic. I think a basic UCSC sequence retrieval implementation using RandomAccessI and Sean's code is doable by early to mid-September. Again, the backend data retrieval component is the critical part here. I'm working on the interfaces for other data, but that will take shape once I have a better idea of the data that's available. Hilmar suggested LocationI and Annotation::Reference as a start; any other suggestions? Chris > Sean Davis wrote: > > > > On 8/8/06 12:53 PM, "Chris Fields" wrote: > > > >> Should Sean get an CVS account or should we just pass everything via > >> a proxy, Sendu maybe? 
Might be easier to give him CVS access so > >> Sendu can focus on Taxonomy, I can focus on the interfaces and > >> getting through my benchwork (?!?), and Hilmar can focus on $job. > > > > Whatever works for me, although I should probably run stuff by folks > first, > > as I'm not used to the bioperl conventions--errors, etc.... > > I've placed this 'project' as a point for the 1.5.5 bioperl release > currently planned for 'fall 2006'. But so you both have a better idea of > time-lines, here's a very aggressive one for you: > > I'm going to want to start using, or start writing, a front-end > module(s) on September 4th. > > So ideally I'd like to see the backend and interface work done before > then. As we get closer to the end of the month, if it doesn't seem like > that's going to happen, let me know and I can dive in and start helping > out directly on those areas. > > > Thank you, > Sendu. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From sdavis2 at mail.nih.gov Tue Aug 8 16:22:22 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 08 Aug 2006 16:22:22 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D8DAD1.5010402@sendu.me.uk> Message-ID: On 8/8/06 2:41 PM, "Sendu Bala" wrote: > Sean Davis wrote: >> >> On 8/8/06 12:53 PM, "Chris Fields" wrote: >> >>> Should Sean get an CVS account or should we just pass everything via >>> a proxy, Sendu maybe? Might be easier to give him CVS access so >>> Sendu can focus on Taxonomy, I can focus on the interfaces and >>> getting through my benchwork (?!?), and Hilmar can focus on $job. >> >> Whatever works for me, although I should probably run stuff by folks first, >> as I'm not used to the bioperl conventions--errors, etc.... > > I've placed this 'project' as a point for the 1.5.5 bioperl release > currently planned for 'fall 2006'. 
But so you both have a better idea of > time-lines, here's a very aggressive one for you: > > I'm going to want to start using, or start writing, a front-end > module(s) on September 4th. > > So ideally I'd like to see the backend and interface work done before > then. As we get closer to the end of the month, if it doesn't seem like > that's going to happen, let me know and I can dive in and start helping > out directly on those areas. Sendu, I, like Chris said (in another email), have to keep things pretty vague as far as time-frame. This is a totally side-project for me, so will take a back-seat most of the time. Also, I think there are some very large issues that we have been ignoring with regard to the underlying table structure, but these will come out as time moves along. We can shoot for a deadline, but I think that "planning" on completion by a certain time is probably a bit premature before we are a bit further along in the process. Sean From bix at sendu.me.uk Tue Aug 8 16:06:07 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 08 Aug 2006 21:06:07 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <000101c6bb23$8750b260$15327e82@pyrimidine> References: <000101c6bb23$8750b260$15327e82@pyrimidine> Message-ID: <44D8EEAF.6030606@sendu.me.uk> Chris Fields wrote: > Sendu, > > I hate to say this, but rethink that deadline a little. Sept 4 falls at the > beginning of the academic year so I will be flooded with tons of work, let > alone work on $job (postdoc, bench scientist) and try to get EUtilities in > decent shape. So I'm doing some heavy prioritizing now; this falls further > down on the list. I'm quite serious about the deadline; like I say, if you can't have it finished by then that's fine. You can commit what you have and I can take over toward the end of this month. I was planning to do this all by myself, so any ground work that gets done is a bonus for me. I don't, however, want it to be a hindrance (having to wait for it). 
For the DB interfaces, just putting out ideas and code snippets in this thread to promote discussion over the course of this month is probably most of the 'work' needed. Actually creating the modules after that is somewhat trivial. From cjfields at uiuc.edu Tue Aug 8 17:13:40 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 16:13:40 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D8EEAF.6030606@sendu.me.uk> Message-ID: <000101c6bb2f$867b4380$15327e82@pyrimidine> Sendu, As Sean and I both point out, we have time constraints during that period. We do have other priorities outside of Bioperl, as do most other bioperl contributors. I'm glad that you have a lot of time to dedicate here; Bioperl needs it. But don't let that ambition drive away others who want to help. This is a community project; don't forget that. My opinion: let Sean handle this for now. You have plenty on your plate as it is. Besides the huge Taxonomy issues to worry about, there is also the issue of a new developer point release that either you or I will likely be heading up. Let that be your main focus at this point. Sean can handle getting the basics going, then he'll likely ask for help when it comes time to get data into the right objects. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Tuesday, August 08, 2006 3:06 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Database Retrieval > > Chris Fields wrote: > > Sendu, > > > > I hate to say this, but rethink that deadline a little. Sept 4 falls at > the > > beginning of the academic year so I will be flooded with tons of work, > let > > alone work on $job (postdoc, bench scientist) and try to get EUtilities > in > > decent shape. So I'm doing some heavy prioritizing now; this falls > further > > down on the list. 
> > I'm quite serious about the deadline; like I say, if you can't have it > finished by then that's fine. You can commit what you have and I can > take over toward the end of this month. I was planning to do this all by > myself, so any ground work that gets done is a bonus for me. I don't, > however, want it to be a hindrance (having to wait for it). > > For the DB interfaces, just putting out ideas and code snippets in this > thread to promote discussion over the course of this month is probably > most of the 'work' needed. Actually creating the modules after that is > somewhat trivial. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Tue Aug 8 17:32:05 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 8 Aug 2006 17:32:05 -0400 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <000101c6bb2f$867b4380$15327e82@pyrimidine> References: <000101c6bb2f$867b4380$15327e82@pyrimidine> Message-ID: <6EB92616-9FEE-44B4-9CCC-18003BA5ADDE@gmx.net> Guys - we all remember the rule, don't we? "Don't get into the way of someone who threatens to code." Sendu may have his own time constraints under which he may have to deliver something. He wants to make that happen using and extending bioperl, building on the input and work of others. I can't see anything wrong with that, in fact I find it laudable, and I can't see why I would want to stop him or slow him down. I'm also sure that Sean will gladly welcome if somebody else does his work for him. At least I would. So please everybody take it easy. -hilmar On Aug 8, 2006, at 5:13 PM, Chris Fields wrote: > Sendu, > > As Sean and I both point out, we have time constraints during that > period. > We do have other priorities outside of Bioperl, as do most other > bioperl > contributors. I'm glad that you have a lot of time to dedicate here; > Bioperl needs it. 
But don't let that ambition drive away others > who want to > help. This is a community project; don't forget that. > > My opinion: let Sean handle this for now. You have plenty on your > plate as > it is. Besides the huge Taxonomy issues to worry about, there is > also the > issue of a new developer point release that either you or I will > likely be > heading up. Let that be your main focus at this point. Sean can > handle > getting the basics going, then he'll likely ask for help when it > comes time > to get data into the right objects. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Sendu Bala >> Sent: Tuesday, August 08, 2006 3:06 PM >> To: bioperl-l at bioperl.org >> Subject: Re: [Bioperl-l] Database Retrieval >> >> Chris Fields wrote: >>> Sendu, >>> >>> I hate to say this, but rethink that deadline a little. Sept 4 >>> falls at >> the >>> beginning of the academic year so I will be flooded with tons of >>> work, >> let >>> alone work on $job (postdoc, bench scientist) and try to get >>> EUtilities >> in >>> decent shape. So I'm doing some heavy prioritizing now; this falls >> further >>> down on the list. >> >> I'm quite serious about the deadline; like I say, if you can't >> have it >> finished by then that's fine. You can commit what you have and I can >> take over toward the end of this month. I was planning to do this >> all by >> myself, so any ground work that gets done is a bonus for me. I don't, >> however, want it to be a hindrance (having to wait for it). >> >> For the DB interfaces, just putting out ideas and code snippets in >> this >> thread to promote discussion over the course of this month is >> probably >> most of the 'work' needed. Actually creating the modules after >> that is >> somewhat trivial. 
>> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Tue Aug 8 18:23:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 17:23:49 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <6EB92616-9FEE-44B4-9CCC-18003BA5ADDE@gmx.net> Message-ID: <000201c6bb39$529def90$15327e82@pyrimidine> Hilmar, I agree with "Don't get into the way of someone who threatens to code." Not my point by the response. We are all willing to code here. Sean has donated preliminary code; I am willing to work on interfaces. If Sendu wants to work on this as well, fine, but indicate why he needs it by a certain deadline. I feel that forcing a deadline may be premature if the sole reason (as Sendu indicated) is to include it in a possible 1.5.5 release, which, BTW, no one has stepped forward to make happen yet. My main concern is 'turning people off' who want to contribute to Bioperl by having a Bioperl developer (1) impose a deadline, and (2) indicate that they will take things over if the deadline isn't reached. That is, in essence, what Sendu has done. I had proposed making the deadline a few weeks later as a compromise, to get Sean and I past the initial rush of the fall semester. I think this is perfectly reasonable. Sean's response also indicates he is under time constraints as well. And he indicates potential problems, so any input (and code) he gives would be priceless. 
Sendu never indicates why this can't wait a few extra weeks, just that he'll dive in and do it himself anyway, regardless of what we think. That's not very considerate of Sean's efforts here, nor mine. This is all completely unintentional, I'm sure, (the medium of email does remove a certain emotional quotient) but that's how it comes across, regardless. At this point, when I see things like this it makes me want to throw in the towel and take a back seat to Bioperl development. I appreciate Sendu's intentions here, and I agree with him 99% of the time. I listen to his opinions, and I hope he listens to mine. And I understand some of his reasoning in wanting to get this in ASAP; a new Bioperl release is long LONG overdue. I just don't want anyone's toes getting stepped on in the process, especially someone willing to donate their own code, time, and effort to help out. The more hands the merrier (my point about this being a community project), and Bioperl needs any help it can get right now. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Tuesday, August 08, 2006 4:32 PM > To: Chris Fields; Sendu Bala > Cc: bioperl-l at lists.open-bio.org list > Subject: Re: [Bioperl-l] Database Retrieval > > Guys - > > we all remember the rule, don't we? "Don't get into the way of > someone who threatens to code." > > Sendu may have his own time constraints under which he may have to > deliver something. He wants to make that happen using and extending > bioperl, building on the input and work of others. I can't see > anything wrong with that, in fact I find it laudable, and I can't see > why I would want to stop him or slow him down. > > I'm also sure that Sean will gladly welcome if somebody else does his > work for him. At least I would. > > So please everybody take it easy. > > -hilmar > > On Aug 8, 2006, at 5:13 PM, Chris Fields wrote: > > > Sendu, > > > > As Sean and I both point out, we have time constraints during that > > period. 
> > We do have other priorities outside of Bioperl, as do most other > > bioperl > > contributors. I'm glad that you have a lot of time to dedicate here; > > Bioperl needs it. But don't let that ambition drive away others > > who want to > > help. This is a community project; don't forget that. > > > > My opinion: let Sean handle this for now. You have plenty on your > > plate as > > it is. Besides the huge Taxonomy issues to worry about, there is > > also the > > issue of a new developer point release that either you or I will > > likely be > > heading up. Let that be your main focus at this point. Sean can > > handle > > getting the basics going, then he'll likely ask for help when it > > comes time > > to get data into the right objects. > > > > Chris > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Sendu Bala > >> Sent: Tuesday, August 08, 2006 3:06 PM > >> To: bioperl-l at bioperl.org > >> Subject: Re: [Bioperl-l] Database Retrieval > >> > >> Chris Fields wrote: > >>> Sendu, > >>> > >>> I hate to say this, but rethink that deadline a little. Sept 4 > >>> falls at > >> the > >>> beginning of the academic year so I will be flooded with tons of > >>> work, > >> let > >>> alone work on $job (postdoc, bench scientist) and try to get > >>> EUtilities > >> in > >>> decent shape. So I'm doing some heavy prioritizing now; this falls > >> further > >>> down on the list. > >> > >> I'm quite serious about the deadline; like I say, if you can't > >> have it > >> finished by then that's fine. You can commit what you have and I can > >> take over toward the end of this month. I was planning to do this > >> all by > >> myself, so any ground work that gets done is a bonus for me. I don't, > >> however, want it to be a hindrance (having to wait for it). 
> >> > >> For the DB interfaces, just putting out ideas and code snippets in > >> this > >> thread to promote discussion over the course of this month is > >> probably > >> most of the 'work' needed. Actually creating the modules after > >> that is > >> somewhat trivial. > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From bix at sendu.me.uk Tue Aug 8 20:03:45 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 09 Aug 2006 01:03:45 +0100 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <000201c6bb39$529def90$15327e82@pyrimidine> References: <000201c6bb39$529def90$15327e82@pyrimidine> Message-ID: <44D92661.2090705@sendu.me.uk> Chris Fields wrote: > My main concern is 'turning people off' who want to contribute to Bioperl by > having a Bioperl developer (1) impose a deadline, and (2) indicate that they > will take things over if the deadline isn't reached. That is, in essence, > what Sendu has done. My intent was to make clear you didn't have to do everything yourself, that help was available. I offered a solution to the deadline. If you're more comfortable working solo, that's entirely understandable. In which case another solution to the deadline is the one you already offered: to push it back or not have one. You offered your solution, I offered mine, we pick one and move on. Simple :) > Sendu never indicates why this can't wait a few extra weeks, just that he'll > dive in and do it himself anyway, regardless of what we think. 
As Hilmar supposed, I have my own deadlines; tying in with getting 1.5.5 promptly out the door and inclusive of this proposal was icing on the cake. It's not the end of the world; if you both are interested in the work and want to see it through to the end without help, that's super. I look forward to being able to use the code in the future. It just means I'll have to make other plans, so would like to know one way or the other toward the end of this month. Cheers, Sendu. From cjfields at uiuc.edu Tue Aug 8 21:09:56 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Tue, 8 Aug 2006 20:09:56 -0500 Subject: [Bioperl-l] Database Retrieval In-Reply-To: <44D92661.2090705@sendu.me.uk> Message-ID: <000001c6bb50$87d99800$15327e82@pyrimidine> ... > My intent was to make clear you didn't have to do everything yourself, > that help was available. I offered a solution to the deadline. If you're > more comfortable working solo, that's entirely understandable. In which > case another solution to the deadline is the one you already offered: to > push it back or not have one. > > You offered your solution, I offered mine, we pick one and move on. > Simple :) That's not the way it came across, frankly. In your own words: "I'm quite serious about the deadline; like I say, if you can't have it finished by then that's fine. You can commit what you have and I can take over toward the end of this month." This doesn't sound like you listened to my compromise, Sendu. I proposed moving the deadline back a few weeks (where I mentioned early to mid Sept) to allow a little more breathing room for Sean and I; that's the compromise, and I want to stick with that. If it gets done early, fine. If not, give Sean a bit more time, until mid-Sept. Then have at it. I was hoping an initial 1.5.5 release candidate would go out mid-Sept, a second a week later, and the final one end-Sept. 
This could be pushed back a week to accommodate this if it's important to everybody (and, judging by the level of interest, it is). Anyway, by any means, a remedial UCSC DB class should be up-and-running by mid-Sept at the latest, implementing Bio::DB::RandomAccessI to get Bio::SeqI objects. We could add in support for LocationI, ClusterI, and others along the way if needed. An extra few weeks allows Sean time to add everything he wants and get the general feel of Bioperl, so maybe he will have a few more ideas along the way. He has already indicated possible problems in his previous posts. My contribution here (interfaces) will be minor in comparison; I already have a proto-LocationI interface, so others will take very little time and effort. And, wouldn't it be great to have someone else contributing here? Even if it is small? That's how a number of us started... > As Hilmar supposed, I have my own deadlines; tying in with getting 1.5.5 > promptly out the door and inclusive of this proposal was icing on the > cake. It's not the end of the world; if you both are interested in the You need to actually state this when you demand a deadline. We can't just assume this! It's a little off-putting to have someone you don't know very well giving you deadlines to meet, esp. when they aren't your boss. I already have a P.I., thank you. > work and want to see it through to the end without help, that's super. I > look forward to being able to use the code in the future. It just means > I'll have to make other plans, so would like to know one way or the > other toward the end of this month. Give Sean some time. Have a bit of patience! An extended deadline allows more input for 1.5.5 overall. And anyone that wants to add ideas, have an opinion, contribute code, make a suggestion along the way, with Sean's contribution or with anything else, that's great! That's what this is all about. Not everybody here knows Bioperl as well as you, I, or Hilmar. 
I believe in the work you have contributed to Bioperl, Sendu. I personally think you should be the next Release Pumpkin, quite frankly; I just don't have time right now. But have some patience, man!

Chris

From prachi at stanford.edu Tue Aug 8 20:18:08 2006
From: prachi at stanford.edu (Prachi Shah)
Date: Tue, 8 Aug 2006 17:18:08 -0700
Subject: [Bioperl-l] Bio::Tools::pSW stop codon bug?
Message-ID: <8684cf960608081718h40d74571wbbe2448ed29cff7d@mail.gmail.com>

Hi,

I am trying to align very similar protein sequences with the Bio::Tools::pSW modules but running into an issue which seems like a bug. One of the two sequences is extended considerably with gaps so that an Amino acid residue matches the stop codon (*). I know there should not be any internal stop codons but we are working with a new assembly of the candida genome and we want to pick out such inconsistent cases. In any case, the alignment should match the two sequences (because they are the same) up until the stop codon is encountered in the new sequence. Instead it artificially extends the old sequence and matches the Alanine with the stop codon. Any help on this is appreciated.

Thanks
Prachi

Here is an example set of two sequences I am trying to align:

>orf19.6264.3
MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQRVGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFNTSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V
>orf19.6264.3_old
MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQRVGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFNTSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV

and below is the part of code that generates the alignments --

################
my $new_translatedSeqObj = Bio::Seq->new(-display_id => $gene,
                                         -seq => $new_translatedSeq);

my $old_translatedSeqObj = Bio::Seq->new(-display_id => $gene."_old",
                                         -seq => $old_translatedSeq);

# do alignments
my $align_factory = new Bio::Tools::pSW( '-matrix' =>
    '/tools/perl/5.8.8/lib/site_perl/5.8.8/Bio/Ext/Align/blosum62.bla',
                                         '-gap' => 12,
                                         '-ext' => 2
                                       );

my $aln = $align_factory->pairwise_alignment( $old_translatedSeqObj,
                                              $new_translatedSeqObj );

my $alnout = new Bio::AlignIO(-format => 'clustalw',
                              -fh => \*STDOUT);
##################

The alignment --

CLUSTAL W(1.81) multiple sequence alignment


orf19.6264.3_old/1-162    MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQR
orf19.6264.3/1-177        MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQR
                          ************************************************************


orf19.6264.3_old/1-162    VGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFN
orf19.6264.3/1-177        VGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFN
                          ************************************************************


orf19.6264.3_old/1-162    TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHA---------------AL
orf19.6264.3/1-177        TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V
                          ****************************************                :

From cjfields at uiuc.edu Wed Aug 9 00:31:55 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Tue, 8 Aug 2006 23:31:55 -0500
Subject: [Bioperl-l] Bio::Tools::pSW stop codon bug?
In-Reply-To: <8684cf960608081718h40d74571wbbe2448ed29cff7d@mail.gmail.com>
References: <8684cf960608081718h40d74571wbbe2448ed29cff7d@mail.gmail.com>
Message-ID: <59AB30B4-0F18-465C-BEFB-A406031772F2@uiuc.edu>

You should submit this as a bug via Bugzilla. The first link gives some info on submitting bugs, the second is the actual Bugzilla link.

http://www.bioperl.org/wiki/Bugs
http://bugzilla.open-bio.org/

Make sure to create attachments for the script and data (no cut-and-paste).

I personally don't use pSW much, but we at least can test what's going on. It may just be the way the local alignment behaves. Maybe the algorithm doesn't like end gaps!
Chris

On Aug 8, 2006, at 7:18 PM, Prachi Shah wrote:

> Hi,
>
> I am trying to align very similar protein sequences with the
> Bio::Tools::pSW modules but running into an issue which seems like a
> bug. One of the two sequences is extended considerably with gaps so
> that an Amino acid residue matches the stop codon (*). I know there
> should not be any internal stop codons but we are working with a new
> assembly of the candida genome and we want to pick out such
> inconsistent cases. In any case, the alignment should match the two
> sequences (because they are the same) up until the stop codon is
> encountered in the new sequence. Instead it artificially extends the
> old sequence and matches the Alanine with the stop codon. Any help on
> this is appreciated.
>
> Thanks
> Prachi
>
> Here is an example set of two sequences I am trying to align:
>
>> orf19.6264.3
> MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQRVGYNLGYFSA
> NYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFNTSQLYTGLLIVAVPLGFLAS
> PISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V
>> orf19.6264.3_old
> MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQRVGYNLGYFSA
> NYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFNTSQLYTGLLIVAVPLGFLAS
> PISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV
>
> and below is the part of code that generates the alignments --
>
> ################
> my $new_translatedSeqObj = Bio::Seq->new(-display_id => $gene,
>                                          -seq => $new_translatedSeq);
>
> my $old_translatedSeqObj = Bio::Seq->new(-display_id => $gene."_old",
>                                          -seq => $old_translatedSeq);
>
> # do alignments
> my $align_factory = new Bio::Tools::pSW( '-matrix' =>
> '/tools/perl/5.8.8/lib/site_perl/5.8.8/Bio/Ext/Align/blosum62.bla',
>                                          '-gap' => 12,
>                                          '-ext' => 2
>                                        );
>
> my $aln = $align_factory->pairwise_alignment( $old_translatedSeqObj,
>                                               $new_translatedSeqObj );
>
> my $alnout = new Bio::AlignIO(-format => 'clustalw',
>                               -fh => \*STDOUT);
> ##################
>
> The alignment --
>
> CLUSTAL W(1.81) multiple sequence alignment
>
>
> orf19.6264.3_old/1-162
> MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQR
> orf19.6264.3/1-177
> MSNYLNLAQFSGVTDRFNLERIKSDFSSVQSTISKLRPPQEFFDFRRLSKPANFGEIQQR
>
> ************************************************************
>
>
> orf19.6264.3_old/1-162
> VGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFN
> orf19.6264.3/1-177
> VGYNLGYFSANYITIVLGLSIYALITNFLLLFVTIFVLGGIYGINKLNGEDLVLPVGRFN
>
> ************************************************************
>
>
> orf19.6264.3_old/1-162
> TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHA---------------AL
> orf19.6264.3/1-177
> TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V
>
> ****************************************                :

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Wed Aug 9 03:48:54 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Wed, 09 Aug 2006 08:48:54 +0100
Subject: [Bioperl-l] Database Retrieval
In-Reply-To: <000001c6bb50$87d99800$15327e82@pyrimidine>
References: <000001c6bb50$87d99800$15327e82@pyrimidine>
Message-ID: <44D99366.9050609@sendu.me.uk>

Chris Fields wrote:
> ...
>> My intent was to make clear you didn't have to do everything
>> yourself, that help was available. I offered a solution to the
>> deadline.
If you're more comfortable working solo, that's entirely >> understandable. In which case another solution to the deadline is >> the one you already offered: to push it back or not have one. >> >> You offered your solution, I offered mine, we pick one and move on. >> Simple :) > > That's not the way it came across, frankly. I appreciate that. I keep putting my foot in it, don't I? I apologise once more. > I was hoping an initial 1.5.5 release candidate would go out > mid-Sept, a second a week later, and the final one end-Sept. This > could be pushed back a week to accommodate this if it's important to > everybody (and, judging by the level of interest, it is). [...] > I personally think you should be the next Release Pumpkin, quite > frankly; I just don't have time right now. How soon before the desired rc1 release have pumpkins previously 'emerged'? From bix at sendu.me.uk Wed Aug 9 04:43:58 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 09 Aug 2006 09:43:58 +0100 Subject: [Bioperl-l] $hit_object->frac_aligned_hit/$hit_object->frac_aligned_query In-Reply-To: References: <20060801025930.96806.qmail@web55705.mail.re3.yahoo.com> <44D08B15.9000408@sendu.me.uk> <11E90B37-0A62-4A54-90BD-E98B2CA6E683@gmx.net> <44D0A2A8.6040000@sendu.me.uk> Message-ID: <44D9A04E.6010200@sendu.me.uk> Bernd Brandt wrote: > Hi, > > "frac_identical and frac_conserved can still be very wrong" > They were wrong with hmm report parsing (hmmsearch). The bioperl-live > (CVS 14 July 2006) returned 0 for both fractions. I will check it with > the newest CVS and send a small test script. For both hmmsearch and hmmpfam parsing, those are deliberately set to 0. They needn't be; there is a midline to the hsp-like-thing that shows identical and conserved positions. I suppose the reason they're set to 0 currently is that it's not a real alignment against a set sequence, but against a model.
Still, I think it's reasonable to give an answer in both cases if you can get an answer from the original output files. I'm in the process of writing new hmmer parsers, and will allow non-0 answers for those methods. (It's not really possible to alter the existing parser, because it only allows one hsp per hit, which would prevent a correct answer even if it didn't set them to 0.) From sdavis2 at mail.nih.gov Wed Aug 9 09:10:15 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 09 Aug 2006 09:10:15 -0400 Subject: [Bioperl-l] UCSC database backend Message-ID: I have put together a variation of the database backend. It is based on DBIx::Abstract and exposes a bit more of the SQL and DBI. Since UCSC uses cross-database queries, it might be a better fit for the problem than an ORM. It is pretty simple, but I don't know that we need much more here. The harder problem, as I mentioned earlier, is to determine what to return, not how to return it. I am showing only a couple of the DBIx::Abstract methods here; there are a number of others for fetching data. In particular, pretty much any of the fetch_* are available. Here is the basic POD: NAME Bio::DB::UCSC::DB - database abstraction for UCSC SYNOPSIS use Bio::DB::UCSC::DB; # By default, connect to MySQL server at UCSC, hg18 database my $db = Bio::DB::UCSC::DB->new(); if ($db->select('*','refGene')->rows) { while (my $data = $db->fetchrow_hashref) { .... } } #get table database descriptions from hgcentral database my $db_descriptions = $db->db_descriptions(); #arrayref of hashrefs #get full listing of tables (and attributes) my $dbi_table_info = $db->dbi_table_info(); #arrayref of hashrefs #get full column information for the "tissue" table my $dbi_column_info = $db->dbi_column_info('tissue'); #arrayref of hashrefs #get table descriptions from UCSC tableDescriptions table #Still needs a bit of cleanup, but....
my $table_descriptions = $db->table_descriptions(); #arrayref of hashrefs DESCRIPTION This module provides some database abstraction via DBIx::Abstract. The connection parameters are currently passed directly to DBIx::Abstract->connect(). All the methods of DBIx::Abstract are available, with the addition of a ->dbh() method to get at the DBI database handle and the database introspection methods noted above. TODO A fair bit of work on the connection end. In particular, I will probably make a "Bio::DB::SQL" class that encapsulates some methods for working with SQL databases and some kind of abstraction for connection information, making it easier to switch from local to remote versions of a database. See Also L,L Author Sean Davis From cjfields at uiuc.edu Wed Aug 9 10:41:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 09:41:02 -0500 Subject: [Bioperl-l] UCSC database backend In-Reply-To: Message-ID: <000601c6bbc1$d9e0abe0$15327e82@pyrimidine> Sean, If you have your CVS account set up you could go ahead and add it in. I think the plan is to try and include this in the next dev release (1.5.5), which we are trying to get out by end-Sept at the latest. I think a few RCs may be made beforehand, but that's really up to the pumpkin. As RandomAccessI is already available, we could use that as a start to implement sequence retrieval. Other interfaces would be added over time to round out getting data into the proper Bio* objects. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sean Davis > Sent: Wednesday, August 09, 2006 8:10 AM > To: bioperl-l at lists.open-bio.org list > Subject: [Bioperl-l] UCSC database backend > > I have put together a variation of the database backend. It is based on > DBIx::Abstract and exposes a bit more of the SQL and DBI. Since UCSC uses > cross-database queries, it might be a better fit for the problem than an > ORM.
It is pretty simple, but I don't know that we need much more here. > The harder problem, as I mentioned earlier, is to determine what to > return, > not how to return it. I am showing only a couple of the DBIx::Abstract > methods here; there are a number of others for fetching data. In > particular, pretty much any of the fetch_* are available. > > Here is the basic POD: > > > > NAME > Bio::DB::UCSC::DB - database abstraction for UCSC > > SYNOPSIS > use Bio::DB::UCSC::DB; > > # By default, connect to MySQL server at UCSC, hg18 database > my $db = Bio::DB::UCSC::DB->new(); > > if ($db->select('*','refGene')->rows) { > while (my $data = $db->fetchrow_hashref) { > .... > } > } > > #get table database descriptions from hgcentral database > my $db_descriptions = $db->db_descriptions(); #arrayref of > hashrefs > > #get full listing of tables (and attributes) > my $dbi_table_info = $db->dbi_table_info(); #arrayref of > hashrefs > > #get full column information for the "tissue" table > my $dbi_column_info = $db->dbi_column_info('tissue'); #arrayref > of > hashrefs > > #get table descriptions from UCSC tableDescriptions table > #Still needs a bit of cleanup, but.... > my $table_descriptions = $db->table_descriptions(); #arrayref of > hashrefs > > DESCRIPTION > This module provides some database abstraction via DBIx::Abstract. > The > connection parameters are currently passed directly to > DBIx::Abstract->connect(). All the methods of DBIx::Abstract are > available, with the addition of a ->dbh() method to get at the DBI > database handle and the database introspection methods noted above. > > TODO > A fair bit of work on the connection end. In particular, I will > proba- > bly make a "Bio::DB::SQL" class that encapsulates some methods for > working with SQL databases and some kind of abstraction for > connection > information, making it easier to switch from local to remote > versions > of a database. 
> > See Also > L,L > > Author > Sean Davis > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From sdavis2 at mail.nih.gov Wed Aug 9 10:50:53 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 09 Aug 2006 10:50:53 -0400 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <000601c6bbc1$d9e0abe0$15327e82@pyrimidine> Message-ID: On 8/9/06 10:41 AM, "Chris Fields" wrote: > Sean, > > If you have your CVS account set up you could go ahead and add it in. I > think the plan is to try and include this in the next dev release (1.5.5), > which we are trying to get out by end-Sept at the latest. I think a few RCs > may be made beforehand, but that's really up to the pumpkin. > > As RandomAccessI is already available, we could use that as a start to > implement sequence retrieval. Other interfaces would be added over time to > round out getting data into the proper Bio* objects. Chris, Once I get CVS access, I will commit what I have done (as long as it "works"). Now for the details. Keep in mind that for many of the "sequences" available from UCSC, there is no actual "sequence" stored in the database; rather they are stored in flat files not accessible directly via SQL. Therefore, a sequence would be "abstract" in the sense of being a "join location" on the chromosome, and even that isn't quite right, as the mRNA sequence != genomic alignment sequence. Also, there are many different tables that maintain "sequence" information. So, implementing RandomAccessI is not going to be straightforward and will require some assumptions about what will be searched. In fact, since the same "sequence" can be in many different tables, there may need to be a way of specifying where the search is done (what table(s)). 
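A rough sketch of what "specifying where the search is done" might look like, written as plain Perl that only builds SQL strings. The helper name location_queries and the %TABLE_COLUMNS mapping are hypothetical and not part of Bio::DB::UCSC::DB; the table and column names (refGene, knownGene, all_mrna and their id/coordinate columns) are assumptions based on the public UCSC schema and should be checked against the assembly in use.

```perl
use strict;
use warnings;

# Assumed mapping: table => [id column, chrom column, start column, end column].
# Gene-prediction tables and PSL-style alignment tables name these differently.
my %TABLE_COLUMNS = (
    refGene   => [qw(name  chrom txStart txEnd)],
    knownGene => [qw(name  chrom txStart txEnd)],
    all_mrna  => [qw(qName tName tStart  tEnd)],
);

# Hypothetical helper: build one parameterised SELECT per requested table,
# so the caller decides which table(s) an identifier is looked up in.
sub location_queries {
    my @tables = @_;
    my @queries;
    for my $table (@tables) {
        my $cols = $TABLE_COLUMNS{$table}
            or die "no column mapping for table '$table'\n";
        my ($id, $chrom, $start, $end) = @$cols;
        push @queries,
            "SELECT $chrom AS chrom, $start AS chromStart, $end AS chromEnd "
          . "FROM $table WHERE $id = ?";
    }
    return @queries;
}

# Print the statements that would be run for a two-table search
print "$_\n" for location_queries(qw(refGene all_mrna));
```

Each statement could then be prepared on the DBI handle returned by ->dbh() and executed with the accession bound to the placeholder. UCSC tables store zero-based, half-open coordinates, so chromStart would need 1 added before being used as a BioPerl location start.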
Sean From cjfields at uiuc.edu Wed Aug 9 13:58:07 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 12:58:07 -0500 Subject: [Bioperl-l] Bio::Tools::pSW stop codon bug? In-Reply-To: Message-ID: <000901c6bbdd$6168c320$15327e82@pyrimidine> Aaron, How well is Bio::Tools::pSW maintained? I haven't seen many using it (most tend to use the other alternatives you mention). Chris > -----Original Message----- > From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com] > Sent: Wednesday, August 09, 2006 12:43 PM > To: Chris Fields > Cc: bioperl-l at lists.open-bio.org; Prachi Shah > Subject: Re: [Bioperl-l] Bio::Tools::pSW stop codon bug? > > > I personally don't use pSW much, but we at least can test what's > > going on. It may just be the way the local alignment behaves. Maybe > > the algorithm doesn't like end gaps! > > by definition, local (as opposed to global) alignments don't have end > gaps. > > > > orf19.6264.3_old/1-162 > > > TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHA---------------AL > > > orf19.6264.3/1-177 > > > TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V > > yes, this looks to be an alignment bug in pSW; if you remove the *, does > the alignment end naturally at the AL? > > is there some reason you are wedded to pSW as opposed to, say, bl2seq > (BLAST-based pairwise alignment) or ssearch (SmithWaterman pairwise > alignment) (both also accessible via BioPerl's SearchIO->HSP->AlignI > pipeline)? > > -Aaron From cjfields at uiuc.edu Wed Aug 9 14:11:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 13:11:37 -0500 Subject: [Bioperl-l] UCSC database backend In-Reply-To: Message-ID: <000a01c6bbdf$419961b0$15327e82@pyrimidine> > Chris, > > Once I get CVS access, I will commit what I have done (as long as it > "works"). > > Now for the details. 
Keep in mind that for many of the "sequences" > available from UCSC, there is no actual "sequence" stored in the database; > rather they are stored in flat files not accessible directly via SQL. > Therefore, a sequence would be "abstract" in the sense of being a "join > location" on the chromosome, and even that isn't quite right, as the mRNA > sequence != genomic alignment sequence. Also, there are many different > tables that maintain "sequence" information. So, implementing > RandomAccessI > is not going to be straightforward and will require some assumptions about > what will be searched. In fact, since the same "sequence" can be in many > different tables, there may need to be a way of specifying where the > search > is done (what table(s)). > > Sean Sean, Okay, makes sense. So, the MySQL database holds the sequence information (location, etc) and the actual sequences (mRNA, EST, genomic) are in various flat files. Seems like this calls for a helper set-up script to index the appropriate sequence flat files and possibly load the MySQL database table information. Bio::DB::Fasta could be used for indexing the sequence files as it's pretty fast. So, if I were to retrieve a particular sequence (region of scaffold of genomic DNA for instance), I would need: 1) unique ID or name for the sequence 2) start-end coordinates (in UCSC terms, I suppose; UCSC starts with 0, if I remember correctly?) 
3) table to retrieve data from 4) either the location of indexed sequence files or a flat-file db handler These could all be set upon instantiation for sequence retrieval: $factory = Bio::DB::UCSC::Sequence->new(-table => $table, -seq_start => $start, -seq_end => $end, -db => $handler,); # returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB Handler $seq = $factory->get_Seq_by_id($id); If you just want the sequence associated with an ID, the location info (whether it is Simple, Split, Fuzzy, etc) could be used to retrieve the subsequence from the appropriate flatfile, depending on the table used. $factory = Bio::DB::UCSC::Sequence->new(-table => $table, -db => $handler,); # returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB handler $seq = $factory->get_Seq_by_id($id); Would something like that be appropriate? Not sure if I'm missing something. Sendu may have other suggestions/additions; I'm letting the coffee talk now. Chris From aaron.j.mackey at gsk.com Wed Aug 9 13:42:32 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Wed, 9 Aug 2006 13:42:32 -0400 Subject: [Bioperl-l] Bio::Tools::pSW stop codon bug? In-Reply-To: <59AB30B4-0F18-465C-BEFB-A406031772F2@uiuc.edu> Message-ID: > I personally don't use pSW much, but we at least can test what's > going on. It may just be the way the local alignment behaves. Maybe > the algorithm doesn't like end gaps! by definition, local (as opposed to global) alignments don't have end gaps. > > orf19.6264.3_old/1-162 > > TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHA---------------AL > > orf19.6264.3/1-177 > > TSQLYTGLLIVAVPLGFLASPISTMMWLIGSSGVTVGAHAALMEKPIETVFEEEV*V yes, this looks to be an alignment bug in pSW; if you remove the *, does the alignment end naturally at the AL? is there some reason you are wedded to pSW as opposed to, say, bl2seq (BLAST-based pairwise alignment) or ssearch (SmithWaterman pairwise alignment) (both also accessible via BioPerl's SearchIO->HSP->AlignI pipeline)?
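A quick sanity check along those lines, independent of pSW, might be to cut the new translation at its first * and see how far the two sequences agree. This is only a sketch in plain Perl: common_prefix_length is a hypothetical helper, not a BioPerl method, and the two strings are just the tails of the translations from the original report.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): length of the longest
# common prefix of two strings.
sub common_prefix_length {
    my ($x, $y) = @_;
    my $max = length($x) < length($y) ? length($x) : length($y);
    my $i = 0;
    $i++ while $i < $max && substr($x, $i, 1) eq substr($y, $i, 1);
    return $i;
}

# Tails of the two translations from the original report
my $old = 'GAHAALMEKPIETVFEEEV';
my $new = 'GAHAALMEKPIETVFEEEV*V';

# 0-based position of the first stop symbol in the new translation, or -1
my $stop_pos = index($new, '*');

# Everything before the first stop (the whole sequence if there is none)
my $new_before_stop = $stop_pos >= 0 ? substr($new, 0, $stop_pos) : $new;

my $shared = common_prefix_length($old, $new_before_stop);
printf "first stop at %d, shared prefix %d of %d residues\n",
    $stop_pos, $shared, length($old);
```

If the shared prefix covers the whole old sequence, the two translations agree up to the stop, and any gap run an aligner inserts before it comes from how the aligner scores the * rather than from the sequences themselves.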
-Aaron From MEC at stowers-institute.org Wed Aug 9 14:38:15 2006 From: MEC at stowers-institute.org (Cook, Malcolm) Date: Wed, 9 Aug 2006 13:38:15 -0500 Subject: [Bioperl-l] UCSC database backend Message-ID: Just a thought (chiming in).... Both blast and blat indices databases have ways of retrieving sequence using identifiers and coordinates. If you're building these indices for local copies of these files anyway, they can do double duty. It is pretty easy to write tied hash interfaces to blast/blat formated databases which could be wrapped Bio::DB::fasta like. Might save some time.... --Malcolm >-----Original Message----- >From: bioperl-l-bounces at lists.open-bio.org >[mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Chris Fields >Sent: Wednesday, August 09, 2006 1:12 PM >To: 'Sean Davis'; bioperl-l at lists.open-bio.org >Subject: Re: [Bioperl-l] UCSC database backend > >> Chris, >> >> Once I get CVS access, I will commit what I have done (as long as it >> "works"). >> >> Now for the details. Keep in mind that for many of the "sequences" >> available from UCSC, there is no actual "sequence" stored in >the database; >> rather they are stored in flat files not accessible directly via SQL. >> Therefore, a sequence would be "abstract" in the sense of >being a "join >> location" on the chromosome, and even that isn't quite >right, as the mRNA >> sequence != genomic alignment sequence. Also, there are >many different >> tables that maintain "sequence" information. So, implementing >> RandomAccessI >> is not going to be straightforward and will require some >assumptions about >> what will be searched. In fact, since the same "sequence" >can be in many >> different tables, there may need to be a way of specifying where the >> search >> is done (what table(s)). >> >> Sean > >Sean, > >Okay, makes sense. So, the MySQL database holds the sequence >information >(location, etc) and the actual sequences (mRNA, EST, genomic) >are in various >flat files. 
Seems like this calls for a helper set-up script >to index the >appropriate sequence flat files and possibly load the MySQL >database table >information. Bio::DB::Fasta could be used for indexing the >sequence files >as it's pretty fast. > >So, if I were to retrieve a particular sequence (region of scaffold of >genomic DNA for instance), I would need: > >1) unique ID or name for the sequence >2) start-end coordinates (in UCSC terms, I suppose; UCSC >starts with 0, if >I remember correctly?) >3) table to retrieve data from >4) either the location of indexed sequence files or a >flat-file db handler > >These could be all set upon instantiation for sequence retrieval : > >$factory = Bio::DB::UCSC::Sequence(-table => $table, > -seq_start => $start, > -seq_end => $end, > -db => $handler,); > ># returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB Handler > >$seq = $factory->get_Seq_by_id($id); > >If you just want the sequence associated with an ID, the location info >(whether it is Simple, Split, Fuzzy, etc) could be used to retrieve the >subsequence from the appropriate flatfile dependent on the table used. > >$factory = Bio::DB::UCSC::Sequence(-table => $table, > -db => $handler,); > ># returns Bio::PrimarySeq::Fasta via Bio::DB::Fasta DB handler > >$seq = $factory->get_Seq_by_id($id); > >Would something like that be appropriate? Not sure if I'm missing >something. Sendu may have other suggestions/additions; I'm letting the >coffee talk now. 
> >Chris > >_______________________________________________ >Bioperl-l mailing list >Bioperl-l at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/bioperl-l > From sdavis2 at mail.nih.gov Wed Aug 9 15:02:42 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Wed, 09 Aug 2006 15:02:42 -0400 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <000a01c6bbdf$419961b0$15327e82@pyrimidine> Message-ID: On 8/9/06 2:11 PM, "Chris Fields" wrote: >> Chris, >> >> Once I get CVS access, I will commit what I have done (as long as it >> "works"). >> >> Now for the details. Keep in mind that for many of the "sequences" >> available from UCSC, there is no actual "sequence" stored in the database; >> rather they are stored in flat files not accessible directly via SQL. >> Therefore, a sequence would be "abstract" in the sense of being a "join >> location" on the chromosome, and even that isn't quite right, as the mRNA >> sequence != genomic alignment sequence. Also, there are many different >> tables that maintain "sequence" information. So, implementing >> RandomAccessI >> is not going to be straightforward and will require some assumptions about >> what will be searched. In fact, since the same "sequence" can be in many >> different tables, there may need to be a way of specifying where the >> search >> is done (what table(s)). >> >> Sean > > Sean, > > Okay, makes sense. So, the MySQL database holds the sequence information > (location, etc) and the actual sequences (mRNA, EST, genomic) are in various > flat files. Seems like this calls for a helper set-up script to index the > appropriate sequence flat files and possibly load the MySQL database table > information. Bio::DB::Fasta could be used for indexing the sequence files > as it's pretty fast. Before we get too far down this line of thought, keep in mind that this will be dozens of Gb of sequence and database tables. 
See here for details: http://genome.ucsc.edu/admin/mirror.html The sequences include all of GenBank, essentially. The MySQL tables ALONE (no sequence) for only ONE human assembly are on the order of 10Gb--not the kind of thing you can download in a few minutes (or even hours). Just something to keep in mind.... On another point, the strength of UCSC is not in obtaining sequence, but in mapping to the genome. I think getting actual sequence should be secondary here, if for no other reason than there are trivially easy ways of getting sequence information from elsewhere given an accession or ID. There is simply too much information to be stored locally for most people, and getting the data remotely from UCSC doesn't seem possible currently. Sean From cjfields at uiuc.edu Wed Aug 9 15:21:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 14:21:57 -0500 Subject: [Bioperl-l] UCSC database backend In-Reply-To: Message-ID: <000001c6bbe9$15154f50$15327e82@pyrimidine> ... > Before we get too far down this line of thought, keep in mind that this > will > be dozens of Gb of sequence and database tables. See here for details: > > http://genome.ucsc.edu/admin/mirror.html > > The sequences include all of genbank, essentially. The mysql tables ALONE > (no sequence) for only ONE human assembly is on the order of 10Gb--not the > kind of thing you can download in a few minutes (or even hours). Just to > keep in mind.... Yes, there was a recent bug related to the packing order for very large files (>4 GB, I believe). I'm hoping Lincoln takes a look at it soon for further suggestions as the proposed changes would require reindexing everything. However, the proposed fix did work well for the submitter. > On another point, the strength of UCSC is not in obtaining sequence, but > in > mapping to the genome.
I think getting actual sequence should be > secondary > here, if for no other reason than there are trivially easy ways of getting > sequence information from elsewhere given an accession or ID. There is > simply too much information to be stored locally for most people and > getting > the data remotely from UCSC doesn't seem possible currently. > > Sean Then we could use this to primarily return location and other information instead. Anyone interested in sequence can use the location info to retrieve sequences remotely (via Bio::DB::GenBank or similar) or locally (Bio::DB::Fasta). The key is to get this set up in some basic way that people could start using it, make suggestions, etc. Sendu, any suggestions? Chris From cjfields at uiuc.edu Wed Aug 9 17:47:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 16:47:09 -0500 Subject: [Bioperl-l] XML parser preference? Message-ID: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> All, I am finishing up the EUtilities modules in bioperl-live. I'm using XML::Simple to grab the IDs and other information from XML returned from NCBI via esearch/elink/epost queries, but I noticed that no other Bioperl modules use this particular module. It comes with ActiveState Perl by default (the reason I use it) but I found, after the fact, other perl distributions do not include this (Mac OS X was one). I don't necessarily want to lump another XML parser requirement for bioperl users on top of the four or so already present, so I'm considering changing. I have a preference for SAX (hehe) but XML::Twig might also be an option. Any thoughts? Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From arareko at campus.iztacala.unam.mx Wed Aug 9 18:15:06 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 09 Aug 2006 17:15:06 -0500 Subject: [Bioperl-l] XML parser preference? 
In-Reply-To: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> References: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> Message-ID: <44DA5E6A.50405@campus.iztacala.unam.mx> Hi Chris, As I've mentioned in a previous thread, it depends a lot on what your module/interface needs to parse from a certain document: http://lists.open-bio.org/pipermail/bioperl-l/2006-July/022151.html I'm more into tree-based parsing rather than stream-based, but again, it depends a lot on personal needs and preferences. If it's possible to adapt your modules to use any of the already present parsers, that would be great and will avoid adding more pre-requisites to BioPerl. Regards, Mauricio. Chris Fields wrote: > All, > > I am finishing up the EUtilities modules in bioperl-live. I'm using > XML::Simple to grab the IDs and other information from XML returned from > NCBI via esearch/elink/epost queries, but I noticed that no other Bioperl > modules use this particular module. > > It comes with ActiveState Perl by default (the reason I use it) but I found, > after the fact, other perl distributions do not include this (Mac OS X was > one). I don't necessarily want to lump another XML parser requirement for > bioperl users on top of the four or so already present, so I'm considering > changing. > > I have a preference for SAX (hehe) but XML::Twig might also be an option. > Any thoughts? > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. 
of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Wed Aug 9 18:25:30 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 17:25:30 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <44DA5E6A.50405@campus.iztacala.unam.mx> Message-ID: <000001c6bc02$b99a6e20$15327e82@pyrimidine> Mauricio, Thanks for reminding me. I completely forgot about that! I think I'll give XML::Twig a look (somewhat tree-based). I may play around with both stream and tree-based to see what would work the best; my guess is tree-based from the XML returned, but SAX may be faster in the end. Thanks again! Chris > -----Original Message----- > From: Mauricio Herrera Cuadra [mailto:arareko at campus.iztacala.unam.mx] > Sent: Wednesday, August 09, 2006 5:15 PM > To: Chris Fields > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] XML parser preference? > > Hi Chris, > > As I've mentioned in a previous thread, it depends a lot on what your > module/interface needs to parse from a certain document: > > http://lists.open-bio.org/pipermail/bioperl-l/2006-July/022151.html > > I'm more into tree-based parsing rather than stream-based, but again, it > depends a lot on personal needs and preferences. If it's possible to > adapt your modules to use any of the already present parsers, that would > be great and will avoid adding more pre-requisites to BioPerl. > > Regards, > Mauricio. From osborne1 at optonline.net Wed Aug 9 19:04:42 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 09 Aug 2006 19:04:42 -0400 Subject: [Bioperl-l] XML parser preference?
In-Reply-To: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> Message-ID: Chris, I don't think there's any reason not to use XML::Simple if it's guaranteed that the chunks of XML are small. Brian O. On 8/9/06 5:47 PM, "Chris Fields" wrote: > I am finishing up the EUtilities modules in bioperl-live. I'm using > XML::Simple to grab the IDs and other information from XML returned from > NCBI via esearch/elink/epost queries, but I noticed that no other Bioperl > modules use this particular module. From cjfields at uiuc.edu Wed Aug 9 19:21:39 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 18:21:39 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: Message-ID: <000001c6bc0a$910cfc40$15327e82@pyrimidine> Brian, The main reason I chose XML::Simple is that I wanted something up-and-running ASAP; the chunks of XML normally are small and not too complex (but for esearch/elink they can be quite long). The problem I foresee is that we currently have prereqs for the following XML parsers: XML::SAX XML::Parser XML::DOM XML::Twig which various modules use (this doesn't include the XML writers, either). I didn't think it was a good idea to lump yet another prereq on top of those. Anyway, I can always replace the backend XML parser along the way as long as the front-end API is the same. Which reminds me, I need to work on that... Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Brian Osborne > Sent: Wednesday, August 09, 2006 6:05 PM > To: Chris Fields; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] XML parser preference? > > Chris, > > I don't think there's any reason not to use XML::Simple if it's guaranteed > that the chunks of XML are small. > > Brian O. > > > On 8/9/06 5:47 PM, "Chris Fields" wrote: > > > I am finishing up the EUtilities modules in bioperl-live.
I'm using > > XML::Simple to grab the IDs and other information from XML returned from > > NCBI via esearch/elink/epost queries, but I noticed that no other > Bioperl > > modules use this particular module. > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Wed Aug 9 19:29:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 18:29:36 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <44DA642E.9030102@cornell.edu> Message-ID: <000101c6bc0b$ad72c580$15327e82@pyrimidine> Rob, There seems to be a general shift away from using the older XML::Parser and XML::DOM parsers towards XML::SAX and XML::Twig as the former two are not under active development. For SAX parsing, we seem to be moving in the direction of XML::SAX (the recent transition of SearchIO::blastxml was the start). However, nothing has been done for tree-like (DOM) parsing. In fact, both the XML::DOM and XML::Twig docs recommend XML::LibXML over XML::DOM. However, XML::LibXML isn't used AFAIK in Bioperl, and I think it's more of a burden to use that. Grr...I wish I had checked bioperl dependencies before I started! Chris > -----Original Message----- > From: Robert Buels [mailto:rmb32 at cornell.edu] > Sent: Wednesday, August 09, 2006 5:40 PM > To: Chris Fields > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] XML parser preference? > > I don't think it really matters. Every parser has its own strengths. > > If you've written something that already works well, but are concerned > about adding yet another XML parser to bioperl's external dependencies, > pick a parser that is a.) already being used somewhere else in bioperl > and b.) requires the fewest changes to your already-working code. 
> > Since you're already using XML::Simple, which is basically a DOM parser, > I would say go with another DOM parser that's already being used in > bioperl. How about XML::DOM? > > Rob > > Chris Fields wrote: > > All, > > > > I am finishing up the EUtilities modules in bioperl-live. I'm using > > XML::Simple to grab the IDs and other information from XML returned from > > NCBI via esearch/elink/epost queries, but I noticed that no other > Bioperl > > modules use this particular module. > > > > It comes with ActiveState Perl by default (the reason I use it) but I > found, > > after the fact, other perl distributions do not include this (Mac OS X > was > > one). I don't necessarily want to lump another XML parser requirement > for > > bioperl users on top of the four or so already present, so I'm > considering > > changing. > > > > I have a preference for SAX (hehe) but XML::Twig might also be an > option. > > Any thoughts? > > > > Christopher Fields > > Postdoctoral Researcher - Switzer Lab > > Dept. of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > Robert Buels > SGN Bioinformatics Analyst > 252A Emerson Hall, Cornell University > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu From arareko at campus.iztacala.unam.mx Wed Aug 9 20:11:51 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Wed, 09 Aug 2006 19:11:51 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <000101c6bc0b$ad72c580$15327e82@pyrimidine> References: <000101c6bc0b$ad72c580$15327e82@pyrimidine> Message-ID: <44DA79C7.6000303@campus.iztacala.unam.mx> Robert & Chris, I have no doubt that XML::LibXML is a great parser (I've used it a few times), the problem with it is that it runs on top of libxml2's C library. 
On *nix systems it's fairly simple to have this dependency compiled and running, but what about having it under other OS's (e.g. Windows)? Introducing XML::LibXML as a dependency into the toolkit will probably place EUtilities as a module not usable by everyone, especially those who use BioPerl in a OS where installing/compiling C dependencies can be a headache. Mauricio. Chris Fields wrote: > Rob, > > There seems to be a general shift away from using the older XML::Parser and > XML::DOM parsers towards XML::SAX and XML::Twig as the former two are not > under active development. For SAX parsing, we seem to be moving in the > direction of XML::SAX (the recent transition of SearchIO::blastxml was the > start). However, nothing has been done for tree-like (DOM) parsing. > > In fact, both the XML::DOM and XML::Twig docs recommend XML::LibXML over > XML::DOM. However, XML::LibXML isn't used AFAIK in Bioperl, and I think > it's more of a burden to use that. > > Grr...I wish I had checked bioperl dependencies before I started! > > Chris > >> -----Original Message----- >> From: Robert Buels [mailto:rmb32 at cornell.edu] >> Sent: Wednesday, August 09, 2006 5:40 PM >> To: Chris Fields >> Cc: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] XML parser preference? >> >> I don't think it really matters. Every parser has its own strengths. >> >> If you've written something that already works well, but are concerned >> about adding yet another XML parser to bioperl's external dependencies, >> pick a parser that is a.) already being used somewhere else in bioperl >> and b.) requires the fewest changes to your already-working code. >> >> Since you're already using XML::Simple, which is basically a DOM parser, >> I would say go with another DOM parser that's already being used in >> bioperl. How about XML::DOM? >> >> Rob >> >> Chris Fields wrote: >>> All, >>> >>> I am finishing up the EUtilities modules in bioperl-live. 
I'm using >>> XML::Simple to grab the IDs and other information from XML returned from >>> NCBI via esearch/elink/epost queries, but I noticed that no other >> Bioperl >>> modules use this particular module. >>> >>> It comes with ActiveState Perl by default (the reason I use it) but I >> found, >>> after the fact, other perl distributions do not include this (Mac OS X >> was >>> one). I don't necessarily want to lump another XML parser requirement >> for >>> bioperl users on top of the four or so already present, so I'm >> considering >>> changing. >>> >>> I have a preference for SAX (hehe) but XML::Twig might also be an >> option. >>> Any thoughts? >>> >>> Christopher Fields >>> Postdoctoral Researcher - Switzer Lab >>> Dept. of Biochemistry >>> University of Illinois Urbana-Champaign >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> -- >> Robert Buels >> SGN Bioinformatics Analyst >> 252A Emerson Hall, Cornell University >> Ithaca, NY 14853 >> Tel: 503-889-8539 >> rmb32 at cornell.edu >> http://www.sgn.cornell.edu > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Wed Aug 9 22:14:59 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 9 Aug 2006 21:14:59 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <44DA79C7.6000303@campus.iztacala.unam.mx> References: <000101c6bc0b$ad72c580$15327e82@pyrimidine> <44DA79C7.6000303@campus.iztacala.unam.mx> Message-ID: Mauricio, Sorry, didn't mean to imply I want to use XML::LibXML.
Just indicating that what most Perl-XML users who use DOM-like parsing seem to migrate towards XML::LibXML(not in Bioperl) or XML::Twig (which Bioperl uses) for large files and XML::Simple for small stuff. Seems fewer people use XML::DOM these days. XML::Twig is nice because I can process 'chunks' of XML at a time, but it may be overkill with some of the smaller XML data returned from NCBI via eutils. I'll need to tax EUtilities to try and maximize the returned XML to get an idea of just how much XML data is returned for esearch/elink (epost XML is always very small, so no worries there). XML::Simple and XML::Twig are both available for pretty much all OS's (Win, *nix) so I'll stick with one of those. I was actually quite surprised that XML::Simple isn't used anywhere in Bioperl. It's very easy to use and utilizes XML::SAX or XML::Parser on the back end, so having expat around speeds things up quite a bit. Chris On Aug 9, 2006, at 7:11 PM, Mauricio Herrera Cuadra wrote: > Robert & Chris, > > I have no doubt that XML::LibXML is a great parser (I've used it a > few times), the problem with it is that it runs on top of libxml2's > C library. On *nix systems it's fairly simple to have this > dependency compiled and running, but what about having it under > other OS's (e.g. Windows)? > > Introducing XML::LibXML as a dependency into the toolkit will > probably place EUtilities as a module not usable by everyone, > especially those who use BioPerl in a OS where installing/compiling > C dependencies can be a headache. > > Mauricio. > > Chris Fields wrote: >> Rob, There seems to be a general shift away from using the older >> XML::Parser and >> XML::DOM parsers towards XML::SAX and XML::Twig as the former two >> are not >> under active development. For SAX parsing, we seem to be moving >> in the >> direction of XML::SAX (the recent transition of SearchIO::blastxml >> was the >> start). However, nothing has been done for tree-like (DOM) parsing. 
>> In fact, both the XML::DOM and XML::Twig docs recommend >> XML::LibXML over >> XML::DOM. However, XML::LibXML isn't used AFAIK in Bioperl, and I >> think >> it's more of a burden to use that. >> Grr...I wish I had checked bioperl dependencies before I started! >> Chris >>> -----Original Message----- >>> From: Robert Buels [mailto:rmb32 at cornell.edu] >>> Sent: Wednesday, August 09, 2006 5:40 PM >>> To: Chris Fields >>> Cc: bioperl-l at lists.open-bio.org >>> Subject: Re: [Bioperl-l] XML parser preference? >>> >>> I don't think it really matters. Every parser has its own strengths. >>> >>> If you've written something that already works well, but are >>> concerned >>> about adding yet another XML parser to bioperl's external >>> dependencies, >>> pick a parser that is a.) already being used somewhere else in >>> bioperl >>> and b.) requires the fewest changes to your already-working code. >>> >>> Since you're already using XML::Simple, which is basically a DOM >>> parser, >>> I would say go with another DOM parser that's already being used in >>> bioperl. How about XML::DOM? >>> >>> Rob >>> >>> Chris Fields wrote: >>>> All, >>>> >>>> I am finishing up the EUtilities modules in bioperl-live. I'm >>>> using >>>> XML::Simple to grab the IDs and other information from XML >>>> returned from >>>> NCBI via esearch/elink/epost queries, but I noticed that no other >>> Bioperl >>>> modules use this particular module. >>>> >>>> It comes with ActiveState Perl by default (the reason I use it) >>>> but I >>> found, >>>> after the fact, other perl distributions do not include this >>>> (Mac OS X >>> was >>>> one). I don't necessarily want to lump another XML parser >>>> requirement >>> for >>>> bioperl users on top of the four or so already present, so I'm >>> considering >>>> changing. >>>> >>>> I have a preference for SAX (hehe) but XML::Twig might also be an >>> option. >>>> Any thoughts? >>>> >>>> Christopher Fields >>>> Postdoctoral Researcher - Switzer Lab >>>> Dept. 
of Biochemistry >>>> University of Illinois Urbana-Champaign >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>> -- >>> Robert Buels >>> SGN Bioinformatics Analyst >>> 252A Emerson Hall, Cornell University >>> Ithaca, NY 14853 >>> Tel: 503-889-8539 >>> rmb32 at cornell.edu >>> http://www.sgn.cornell.edu >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > MAURICIO HERRERA CUADRA > arareko at campus.iztacala.unam.mx > Laboratorio de Genética > Unidad de Morfofisiología y Función > Facultad de Estudios Superiores Iztacala, UNAM > Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Thu Aug 10 02:56:59 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 07:56:59 +0100 Subject: [Bioperl-l] UCSC database backend In-Reply-To: References: Message-ID: <44DAD8BB.9080001@sendu.me.uk> Sean Davis wrote: > I have put together a variation of the database backend. It is based on > DBIx::Abstract and exposes a bit more of the SQL and DBI. Since UCSC uses > cross-database queries, it might be a better fit for the problem than an > ORM. It is pretty simple, but I don't know that we need much more here. > The harder problem, as I mentioned earlier, is to determine what to return, > not how to return it. I am showing only a couple of the DBIx::Abstract > methods here; there are a number of others for fetching data. In > particular, pretty much any of the fetch_* are available. Seems to be reasonable so far. You could probably use the 'introspection' calls to build other access methods on-the-fly. > TODO > A fair bit of work on the connection end.
In particular, I will > probably make a "Bio::DB::SQL" class that encapsulates some methods for > working with SQL databases and some kind of abstraction for > connection > information, making it easier to switch from local to remote versions > of a database. That sounds really interesting. From bix at sendu.me.uk Thu Aug 10 03:14:03 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 08:14:03 +0100 Subject: [Bioperl-l] UCSC database backend In-Reply-To: References: Message-ID: <44DADCBB.8080908@sendu.me.uk> Sean Davis wrote: > > Before we get too far down this line of thought, keep in mind that this will > be dozens of Gb of sequence and database tables. See here for details: > > http://genome.ucsc.edu/admin/mirror.html > > The sequences include all of genbank, essentially. The mysql tables ALONE > (no sequence) for only ONE human assembly is on the order of 10Gb--not the > kind of thing you can download in a few minutes (or even hours). Just to > keep in mind.... I think if someone needs heavy-duty access to genomic data, they'll find the disc space. That wouldn't be the problem. The problem would be finding an easy way of getting the data, which is where I hoped something like a UCSC frontend would come in. > On another point, the strength of UCSC is not in obtaining sequence, but in > mapping to the genome. I think getting actual sequence should be secondary > here, if for no other reason than there are trivially easy ways of getting > sequence information from elsewhere given an accession or ID. There is > simply too much information to be stored locally for most people and getting > the data remotely from UCSC doesn't seem possible currently. The work would certainly be highly valuable even if it didn't allow for sequence retrieval, but from my own point of view my main interest was exactly the retrieval of arbitrary bits of genomic sequence - for which there is no accession or ID that can be used to query some other database.
How does the website table browser frontend allow retrieval of sequence data? From j_martin at lbl.gov Thu Aug 10 06:38:39 2006 From: j_martin at lbl.gov (Joel Martin) Date: Thu, 10 Aug 2006 03:38:39 -0700 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <44DADCBB.8080908@sendu.me.uk> References: <44DADCBB.8080908@sendu.me.uk> Message-ID: <20060810103839.GA4900@eniac.jgi-psf.org> Sendu Bala wrote: > Sean Davis wrote: > > On another point, the strength of UCSC is not in obtaining sequence, but in > > mapping to the genome. I think getting actual sequence should be secondary > > here, if for no other reason than there are trivially easy ways of getting > > sequence information from elsewhere given an accession or ID. There is > > simply too much information to be stored locally for most people and getting > > the data remotely from UCSC doesn't seem possible currently. > > The work would certainly be highly valuable even if it didn't allow for > sequence retrieval, but from my own point of view my main interest was > exactly the retrieval of arbitrary bits of genomic sequence - for which > there is no accession or ID that can be used to query some other database. Piping in as a user: retrieving sequence based on chromosomal coordinates, or offsets from the coordinates of an accession/model, is a large part of what I use UCSC for. I'm not using it enough to store all that data, but it's a nicely straightforward place to ask 'give me genomic seq for transcript ABC and 2kb off either end'. I wouldn't know how to ask that at NCBI. Joel From sdavis2 at mail.nih.gov Thu Aug 10 07:00:17 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 10 Aug 2006 07:00:17 -0400 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <44DADCBB.8080908@sendu.me.uk> Message-ID: On 8/10/06 3:14 AM, "Sendu Bala" wrote: > Sean Davis wrote: >> >> Before we get too far down this line of thought, keep in mind that this will >> be dozens of Gb of sequence and database tables.
See here for details: >> >> http://genome.ucsc.edu/admin/mirror.html >> >> The sequences include all of genbank, essentially. The mysql tables ALONE >> (no sequence) for only ONE human assembly is on the order of 10Gb--not the >> kind of thing you can download in a few minutes (or even hours). Just to >> keep in mind.... > > I think if someone needs heavy-duty access to genomic data, they'll find > the discspace. That wouldn't be the problem. The problem would be > finding an easy way of getting the data, which is where I hoped > something like a UCSC frontend would come in. If you look into the code that underlies the UCSC browser, they use a piece of software called the "autojoiner". It describes the relationships between databases and their tables and how they relate to each other. They don't have the strict concept of a foreign key, but rather "join" rules that can include things like the key in one table being used in a join to a key in another table, but perhaps with "fuzzy" matching or with an arbitrary prefix or suffix. In order to reproduce what UCSC does, we need to recreate the autojoiner. I've looked at it a bit, but it is not a trivially easy and is probably not a task that I can complete in the space of a few weeks, but one never knows. Here is a link to the table/database autojoiner description file and accompanying documentation, just to give you a sense of what it looks like (extracted from the UCSC source tree): http://watson.nci.nih.gov/~sdavis/all.joiner http://watson.nci.nih.gov/~sdavis/joiner.doc A laudable goal would be to parse and use this file, and this is quite doable, I suppose. If one wanted to make a table-browser-featured interface, it would include parsing this file and then having the appropriate introspection methods and a means to use them to design a query of interest. Getting the information into a bioperl format, after doing this, is another matter, of course. 
> >> On another point, the strength of UCSC is not in obtaining sequence, but in >> mapping to the genome. I think getting actual sequence should be secondary >> here, if for no other reason than there are trivially easy ways of getting >> sequence information from elsewhere given an accession or ID. There is >> simply too much information to be stored locally for most people and getting >> the data remotely from UCSC doesn't seem possible currently. > > The work would certainly be highly valuable even if it didn't allow for > sequence retrieval, but from my own point of view my main interest was > exactly the retrieval of arbitrary bits of genomic sequence - for which > there is no accession or ID that can be used to query some other database. For this purpose, their DAS server works just fine. Try this: http://genome.ucsc.edu/cgi-bin/das/hg16/dna?segment=7:50001,51000&segment=8:1,100000 There are a number of other alternatives, including working with .nib or .2bit files, as Malcolm mentioned. Sean From sdavis2 at mail.nih.gov Thu Aug 10 07:10:52 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 10 Aug 2006 07:10:52 -0400 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <20060810103839.GA4900@eniac.jgi-psf.org> Message-ID: On 8/10/06 6:38 AM, "Joel Martin" wrote: > Sendu Bala wrote: >> Sean Davis wrote: >>> On another point, the strength of UCSC is not in obtaining sequence, but in >>> mapping to the genome. I think getting actual sequence should be secondary >>> here, if for no other reason than there are trivially easy ways of getting >>> sequence information from elsewhere given an accession or ID. There is >>> simply too much information to be stored locally for most people and getting >>> the data remotely from UCSC doesn't seem possible currently.
>> >> The work would certainly be highly valuable even if it didn't allow for >> sequence retrieval, but from my own point of view my main interest was >> exactly the retrieval of arbitrary bits of genomic sequence - for which >> there is no accession or ID that can be used to query some other database. > > piping in as a user, retrieving sequence based on chromosomal coordinates, > or offsets from coordinates of an acceession/model is a large amount of what > I use ucsc for. I'm not using it enough to store all that data but it's a > nicely straightforward place to ask 'give me genomic seq for transcript ABC > and 2kb off either end. I wouldn't know how to ask that at ncbi. Thanks, Joel. I think little "use cases" like this are important to hear about. Sean From rmb32 at cornell.edu Wed Aug 9 18:39:42 2006 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 09 Aug 2006 15:39:42 -0700 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> References: <000801c6bbfd$5d5d6680$15327e82@pyrimidine> Message-ID: <44DA642E.9030102@cornell.edu> I don't think it really matters. Every parser has its own strengths. If you've written something that already works well, but are concerned about adding yet another XML parser to bioperl's external dependencies, pick a parser that is a.) already being used somewhere else in bioperl and b.) requires the fewest changes to your already-working code. Since you're already using XML::Simple, which is basically a DOM parser, I would say go with another DOM parser that's already being used in bioperl. How about XML::DOM? Rob Chris Fields wrote: > All, > > I am finishing up the EUtilities modules in bioperl-live. I'm using > XML::Simple to grab the IDs and other information from XML returned from > NCBI via esearch/elink/epost queries, but I noticed that no other Bioperl > modules use this particular module. 
> > It comes with ActiveState Perl by default (the reason I use it) but I found, > after the fact, other perl distributions do not include this (Mac OS X was > one). I don't necessarily want to lump another XML parser requirement for > bioperl users on top of the four or so already present, so I'm considering > changing. > > I have a preference for SAX (hehe) but XML::Twig might also be an option. > Any thoughts? > > Christopher Fields > Postdoctoral Researcher - Switzer Lab > Dept. of Biochemistry > University of Illinois Urbana-Champaign > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Robert Buels SGN Bioinformatics Analyst 252A Emerson Hall, Cornell University Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From jurgen.pletinckx at algonomics.com Thu Aug 10 08:29:57 2006 From: jurgen.pletinckx at algonomics.com (Jurgen Pletinckx) Date: Thu, 10 Aug 2006 14:29:57 +0200 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <44DA79C7.6000303@campus.iztacala.unam.mx> Message-ID: <20060810123157.D9C9483BB@sienna.algonomics.com> | I have no doubt that XML::LibXML is a great parser (I've used | it a few | times), the problem with it is that it runs on top of libxml2's C | library. On *nix systems it's fairly simple to have this dependency | compiled and running, but what about having it under other OS's (e.g. | Windows)? | | Introducing XML::LibXML as a dependency into the toolkit will | probably | place EUtilities as a module not usable by everyone, especially those | who use BioPerl in a OS where installing/compiling C | dependencies can be | a headache. Regarding XML::LibXML, there does appear to be an up-to-date ppm package (which fetches libxml2.dll) at http://theoryx5.uwinnipeg.ca/ppms/XML-LibXML.ppd (and less than a week since the release of the corresponding version to cpan, too.) 
So the threshold for distribution to Windows, at least, is less high than it might have been. -- Jurgen Pletinckx AlgoNomics NV From cjfields at uiuc.edu Thu Aug 10 09:35:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 08:35:21 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: <20060810123157.D9C9483BB@sienna.algonomics.com> References: <20060810123157.D9C9483BB@sienna.algonomics.com> Message-ID: Jurgen, Thanks for pointing that out! However, the problem is we want to keep the number of dependencies down; there are already four XML parser dependencies for Bioperl (XML::Twig is one, but XML::LibXML isn't). Maybe new modules that require XML parsing should stick with four XML parsers. However, not the current four (XML::DOM, XML::Twig, XML::Parser, XML::SAX). Maybe we should pick four XML parsers, each with their own particular strengths:

1) XML::SAX (SAX parsing; flexible, can use pure Perl, ExpatXS, etc)
   Switch using XML::Parser to XML::SAX (done for Bio::SearchIO::blastxml)
2) XML::LibXML (DOM parsing; maintained, up to date, fast)
   Switch using XML::DOM to XML::LibXML
3) XML::Twig (DOM-like, SAX-based) - great for processing 'chunks' of XML
   Used in Bio::DB::Taxonomy::entrez
4) XML::Simple (small XML) - very easy to use XML parser

Since they are currently available for most (all?) OS's, shouldn't be a problem. What do you think, Mauricio? Chris On Aug 10, 2006, at 7:29 AM, Jurgen Pletinckx wrote: > | I have no doubt that XML::LibXML is a great parser (I've used > | it a few > | times), the problem with it is that it runs on top of libxml2's C > | library. On *nix systems it's fairly simple to have this dependency > | compiled and running, but what about having it under other OS's > (e.g. > | Windows)?
> | > | Introducing XML::LibXML as a dependency into the toolkit will > | probably > | place EUtilities as a module not usable by everyone, especially > those > | who use BioPerl in a OS where installing/compiling C > | dependencies can be > | a headache. > > Regarding XML::LibXML, there does appear to be an up-to-date ppm > package (which fetches libxml2.dll) at > > http://theoryx5.uwinnipeg.ca/ppms/XML-LibXML.ppd > > (and less than a week since the release of the corresponding > version to cpan, too.) > > So the threshold for distribution to Windows, at least, is less > high than it might have been. > > -- > Jurgen Pletinckx > AlgoNomics NV > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Thu Aug 10 10:21:04 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 09:21:04 -0500 Subject: [Bioperl-l] UCSC database backend In-Reply-To: <44DADCBB.8080908@sendu.me.uk> References: <44DADCBB.8080908@sendu.me.uk> Message-ID: Sendu, Sean indicates that the sequences would be held in flatfiles. The trick would be grabbing location information from a particular MySQL table, then using that to retrieve the sequence slice from the indexed flatfile. MySQL table-->SeqFeatureI(?)--> Bio::LocationI(Simple/Split/Fuzzy etc)-->sequence slice from Indexed file Would be relatively easy if the MySQL table contains information about which flatfile is used; that I don't know. If not, maybe use an .ini file to map the tables to flatfiles? 
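The table-to-slice step outlined above can be sketched in plain Perl. This is only an illustration under stated assumptions, not UCSC or BioPerl code: the `%genome` hash stands in for an indexed flatfile, the `%row` hash stands in for one location row fetched from a MySQL table, the coordinates are assumed to be 0-based and half-open, and the helper names `to_one_based` and `slice_seq` are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for an indexed flatfile: sequence ID => sequence string.
# (Assumption: a real setup would use an on-disk index instead.)
my %genome = ( chr7 => 'ACGTACGTTTGACCA' );

# Convert an assumed 0-based, half-open interval into the
# 1-based, inclusive coordinates that BioPerl locations use.
sub to_one_based {
    my ($start0, $end0) = @_;
    return ($start0 + 1, $end0);
}

# Return the 1-based inclusive slice; reverse-complement on the minus strand.
sub slice_seq {
    my ($seqs, $id, $start, $end, $strand) = @_;
    my $seq = $seqs->{$id};
    die "unknown sequence '$id'\n" unless defined $seq;
    die "bad range $start..$end\n"
        if $start < 1 || $end > length($seq) || $start > $end;
    my $sub = substr($seq, $start - 1, $end - $start + 1);
    if (defined $strand && $strand eq '-') {
        $sub = scalar reverse $sub;      # reverse the string
        $sub =~ tr/ACGTacgt/TGCAtgca/;   # then complement each base
    }
    return $sub;
}

# Hypothetical location row, as it might come back from a table query.
my %row = ( chrom => 'chr7', start => 4, end => 9, strand => '-' );
my ($start, $end) = to_one_based($row{start}, $row{end});
print slice_seq(\%genome, $row{chrom}, $start, $end, $row{strand}), "\n";
```

Doing the coordinate conversion in one small function keeps the off-by-one handling in a single place, whichever backend ends up supplying the sequence.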
If you wanted something from GenBank:

MySQL table-->SeqFeatureI(?)--> Bio::LocationI(Simple/Split/Fuzzy etc)-->sequence slice from GenBank file

The GenBank file slice could be retrieved remotely via Bio::DB::GenBank if you didn't want a local GenBank installation:

    my $ncbi = Bio::DB::GenBank->new(-format => 'fasta');
    # later...
    $ncbi->seq_start($start);
    $ncbi->seq_stop($end);
    $ncbi->strand($strand);
    my $seq = $ncbi->get_Seq_by_id($id);

Bio::DB::Fasta and Bio::DB::GenBank both implement Bio::DB::RandomAccessI. A requirement for sequence retrieval could be a DB handle that is-a Bio::DB::RandomAccessI. Bio::SeqFeatureI's spliced_seq() uses a similar idea: using an optional DB handle, piece together sequence slices based on location information from a seqfeature. One possible issue: lack of correspondence between the local MySQL database and the remote GenBank database. This would require the user to automate updating their local databases once a week or so. There are a few problems which should be easily worked around:

1) Bio::DB::Fasta can't handle very large files (http://bugzilla.open-bio.org/show_bug.cgi?id=2063). There is a proposed fix in Bugzilla, but I'm not sure about the idea of dynamically determining the packing/unpacking (32-bit vs 64-bit) based on file size.
2) I think sequences in UCSC start with 0; in bioperl sequences start with 1. Easy enough, but something to keep in mind.

Chris On Aug 10, 2006, at 2:14 AM, Sendu Bala wrote: > Sean Davis wrote: >> >> Before we get too far down this line of thought, keep in mind that >> this will >> be dozens of Gb of sequence and database tables. See here for >> details: >> >> http://genome.ucsc.edu/admin/mirror.html >> >> The sequences include all of genbank, essentially. The mysql >> tables ALONE >> (no sequence) for only ONE human assembly is on the order of 10Gb-- >> not the >> kind of thing you can download in a few minutes (or even hours). >> Just to >> keep in mind....
> > I think if someone needs heavy-duty access to genomic data, they'll > find > the discspace. That wouldn't be the problem. The problem would be > finding an easy way of getting the data, which is where I hoped > something like a UCSC frontend would come in. > > >> On another point, the strength of UCSC is not in obtaining >> sequence, but in >> mapping to the genome. I think getting actual sequence should be >> secondary >> here, if for no other reason than there are trivially easy ways of >> getting >> sequence information from elsewhere given an accession or ID. >> There is >> simply too much information to be stored locally for most people >> and getting >> the data remotely from UCSC doesn't seem possible currently. > > The work would certainly be highly valuable even if it didn't allow > for > sequence retrieval, but from my own point of view my main interest was > exactly the retrieval of arbitrary bits of genomic sequence - for > which > there is no accession or ID that can be used to query some other > database. > > How does the website table browser frontend allow retrieval of > sequence > data? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From sdavis2 at mail.nih.gov Thu Aug 10 10:37:46 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 10 Aug 2006 10:37:46 -0400 Subject: [Bioperl-l] UCSC database backend In-Reply-To: Message-ID: On 8/10/06 10:21 AM, "Chris Fields" wrote: > Sendu, > > Sean indicates that the sequences would be held in flatfiles. The > trick would be grabbing location information from a particular MySQL > table, then using that to retrieve the sequence slice from the > indexed flatfile. 
> > MySQL table-->SeqFeatureI(?)--> > Bio::LocationI(Simple/Split/Fuzzy etc)-->sequence slice from Indexed > file For genomic information, that can be done relatively easily, either using DAS or local flat files indexed by whatever means. Data at UCSC is stored relative to the genome, so this may be enough, as long as one does not care about having the "original" sequence that generated the alignment that UCSC is reporting. > Would be relatively easy if the MySQL table contains information > about which flatfile is used; that I don't know. If not, maybe use > an .ini file to map the tables to flatfiles? I don't think maintaining an additional file of flatfiles is reasonable, given the complexity of the system at UCSC, but it is certainly worth mentioning as a possibility. Sean From bix at sendu.me.uk Thu Aug 10 11:19:52 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 16:19:52 +0100 Subject: [Bioperl-l] SearchIO speed up Message-ID: <44DB4E98.70703@sendu.me.uk> I am aiming to solve Project priority list item 1.2.1 "Improve Bio::SearchIO speed...". I have made changes that result in a certain speed improvement (up to 5x) in parsing BLAST results (and most/all? other SearchIO parsers). The changes made were as minimal as possible. I did not touch the parsing code itself, did not make any API changes except for gain-of-function, and did not make any behaviour changes (except for speed!). For this reason I don't think there would be any contention about the changes so I will just commit them... ...Except I need to know if the community considers the speed problem solved or not. More radical changes will make SearchIO even faster, eg. Chris Fields and Jason (if I interpret the Project priority list item correctly) have suggested an end to individual Hit and HSP objects, which become just data members of a Result-like object. 
Ideally I don't want to go down that route because we lose quite a bit of OO power; HSP objects in particular make important use of inheritance and we'd end up stuffing tens of duplicated-code methods into the Result object replacement. Ugh. Or having the Result-like object implement 4 different interfaces. Ugh. That said, judging by the results below, having hsps as pure hashes would result in a ~5x speedup all the time instead of only a 1.5x speedup in worst case. So can people do some of their own speed tests on typical, realistic blast-parsing jobs before and after I commit the changes? Then everyone can decide if more needs to be done. I don't think everyone should do the same test - more variety is good. Obviously each individual should do the same test before and after. Post your before tests here and I'll commit sometime this weekend. Thanks for your help!

My own speed tests
------------------

#Files used:
'medium.blast' = 670k plain-text format blastn with 1 result
'large.blast' = 8.2MB plain-text format blastn with 5 results

# Best case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => "large.blast"); while ($result = $sio->next_result) { }'

Before:  repeat 1    repeat 2    repeat 3    average
real     1m28.065s   1m28.305s   1m28.271s   1m28.214s
user     1m27.610s   1m27.850s   1m27.860s   1m27.773s
sys      0m0.440s    0m0.450s    0m0.410s    0m0.433s

After:   repeat 1    repeat 2    repeat 3    average     speed up:
real     0m16.653s   0m16.711s   0m16.685s   0m16.683s   5.3x
user     0m16.510s   0m16.570s   0m16.580s   0m16.553s   5.3x
sys      0m0.130s    0m0.140s    0m0.110s    0m0.127s    3.4x

perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => "medium.blast"); while ($result = $sio->next_result) { }'

dprofpp -I before:
Total Elapsed Time = 5.309818 Seconds
User+System Time = 5.209818 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 131.   0.473  6.834      2   0.2365 3.4169  Bio::SearchIO::blast::next_result
 110.   0.333  5.753  36512   0.0000 0.0002  Bio::SearchIO::blast::end_element
 102.
0.315 5.326 2388 0.0001 0.0022 Bio::SearchIO::SearchResultEventBuilder::end_hsp 92.2 0.026 4.804 7266 0.0000 0.0007 Bio::Factory::ObjectFactory::create_object 91.4 0.250 4.765 2388 0.0001 0.0020 Bio::Search::HSP::GenericHSP::new 40.4 0.162 2.105 2388 0.0001 0.0009 Bio::SeqFeature::SimilarityPair::new 39.0 0.173 2.034 9552 0.0000 0.0002 Bio::SeqFeature::Similarity::new dprofpp -I after: Total Elapsed Time = 1.090717 Seconds User+System Time = 1.490717 Seconds Inclusive Times %Time ExclSec CumulS #Calls sec/call Csec/c Name 98.2 0.333 1.465 2 0.1665 0.7324 Bio::SearchIO::blast::next_result 40.5 0.213 0.604 36512 0.0000 0.0000 Bio::SearchIO::blast::end_element 22.8 - 0.341 34022 - 0.0000 Bio::SearchIO::blast::element 19.8 0.296 0.296 2388 0.0001 0.0001 Bio::SearchIO::SearchResultEventBuilder::end_hsp 15.7 0.156 0.235 41186 0.0000 0.0000 Bio::SearchIO::blast::characters 12.6 0.188 0.188 141495 0.0000 0.0000 Bio::SearchIO::blast::in_element 10.3 0.133 0.154 25795 0.0000 0.0000 Bio::Root::IO::_readline Improvement: most of our time now spent in the SAX-like bits of the parser code, instead of creating HSPI objects. (I might streamline the SAX-like bits as well.) 
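For anyone posting their own before/after numbers, the averages and speed-up factors in the best case tables above can be recomputed with a few lines of Perl. This is only a sketch: the helper names to_seconds and mean_seconds are made up for illustration and are not part of BioPerl, and the input is assumed to be in the "XmY.YYYs" form printed by the shell's time builtin.

```perl
use strict;
use warnings;

# Convert a `time`-style duration such as "1m28.065s" into seconds.
# (Hypothetical helper, not a BioPerl function.)
sub to_seconds {
    my ($t) = @_;
    my ($min, $sec) = $t =~ /^(\d+)m([\d.]+)s$/
        or die "unrecognised time format: $t";
    return $min * 60 + $sec;
}

# Mean of a list of durations, returned in seconds.
sub mean_seconds {
    my @secs = map { to_seconds($_) } @_;
    my $sum = 0;
    $sum += $_ for @secs;
    return $sum / @secs;
}

# "real" timings from the best case scenario above.
my $before = mean_seconds(qw(1m28.065s 1m28.305s 1m28.271s));
my $after  = mean_seconds(qw(0m16.653s 0m16.711s 0m16.685s));

# The speed-up is simply the ratio of the two averages.
printf "before %.3fs, after %.3fs, speed up %.1fx\n",
    $before, $after, $before / $after;
# prints: before 88.214s, after 16.683s, speed up 5.3x
```

Reporting the three repeats as well as the average, as done above, makes it easier to see whether a difference is larger than the run-to-run noise.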

# Medium case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => "large.blast"); while ($result = $sio->next_result) { while ($hit = $result->next_hit) { while ($hsp = $hit->next_hsp) { } } }'

Before:
       repeat 1   repeat 2   repeat 3   average
real   1m27.771s  1m27.996s  1m28.953s  1m28.240s
user   1m27.200s  1m27.540s  1m28.500s  1m27.747s
sys    0m0.560s   0m0.450s   0m0.450s   0m0.487s

After:
       repeat 1   repeat 2   repeat 3   average    speed up:
real   0m25.984s  0m25.996s  0m25.714s  0m25.898s  3.4x
user   0m25.780s  0m25.750s  0m25.510s  0m25.680s  3.4x
sys    0m0.210s   0m0.240s   0m0.200s   0m0.217s   2.2x


# Worst case scenario:
time perl -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => "large.blast"); while ($result = $sio->next_result) { while ($hit = $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query; $hsp->hit; } } }'

Before:
       repeat 1   repeat 2   repeat 3   average
real   1m28.104s  1m27.861s  1m27.970s  1m27.978s
user   1m27.600s  1m27.420s  1m27.420s  1m27.480s
sys    0m0.490s   0m0.430s   0m0.550s   0m0.490s

After:
       repeat 1   repeat 2   repeat 3   average    speed up:
real   0m57.167s  0m57.080s  0m56.903s  0m57.050s  1.5x
user   0m56.700s  0m56.770s  0m56.530s  0m56.667s  1.5x
sys    0m0.470s   0m0.310s   0m0.370s   0m0.383s   1.3x


perl -d:DProf -MBio::SearchIO -e '$sio = new Bio::SearchIO(-file => "medium.blast"); while ($result = $sio->next_result) { while ($hit = $result->next_hit) { while ($hsp = $hit->next_hsp) { $hsp->query; $hsp->hit; } } }'

dprofpp -I before:
Total Elapsed Time = 9.767702 Seconds
  User+System Time = 9.577702 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 115.   0.568 11.031      2   0.2838 5.5156  Bio::SearchIO::blast::next_result

dprofpp -I after:
Total Elapsed Time = 6.521398 Seconds
  User+System Time = 6.571398 Seconds
Inclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c  Name
 61.2   0.087  4.023  11940   0.0000 0.0003  Bio::Search::HSP::GenericHSP::query

Improvement: now we spend most of our time dealing with the thing we
actually wanted from the parse, not dealing with the parsing process
itself.



--------------------------------------------------------

Here are the changes involved in achieving the speed up:


Bio::Search::HSP::GenericHSP
----------------------------

# Implementation changes
Call to new() now calls no methods of its own; no work is done simply to
create a GenericHSP. The code that was previously in new() has been
moved to private methods that are called just-in-time, as the user
desires to know certain information by manually calling HSPI methods.

Added new options to new() -hit_desc and -query_desc for setting the
description text for the sequences.


Bio::Search::Hit::GenericHit
----------------------------

# API-CHANGES
new() has extra option -hsp_factory.

New method hsp_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hsp() can now accept a hash ref instead of just a HSPI.

# Implementation changes
next_hsp() and hsps() convert hash ref hsp data to HSPI objects using
the hsp_factory() as necessary.

# Notes
num_hsps() claimed to throw if there were no HSPs, but returns '-'.
Updated docs.


Bio::Search::Result::GenericResult
----------------------------------

# API-CHANGES
new() has extra option -hit_factory.

New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hit() can now accept a hash ref instead of just a HitI.

# Implementation changes
next_hit() and hits() convert hash ref hit data to HitI objects using
the hit_factory() as necessary.


Bio::Search::Iteration::GenericIteration
----------------------------------------

# API-CHANGES
new() has extra option -hit_factory.

New method hit_factory() which gets/sets a Bio::Factory::ObjectFactoryI.

add_hit() can now accept a hash ref instead of just a HitI.

# Implementation changes
Various methods convert hash ref hit data to HitI objects using the
hit_factory() as necessary.


Bio::SearchIO::SearchResultEventBuilder
---------------------------------------

# Implementation changes
Methods end_hsp() and end_hit() return hash refs containing data
suitable for creating HSPI and HitI objects respectively.


Bio::SearchIO::IteratedSearchResultEventBuilder
-----------------------------------------------

# Implementation changes
_add_hit() deals with the new way hit information is stored.

end_iteration() supplies the hit factory to created and returned
iteration factories.


Bio::SeqFeature::SimilarityPair
-------------------------------

# Notes
new() makes an object factory if not supplied -feature_factory as arg,
but then does nothing with its freshly created factory, and ignores any
that are supplied with -feature_factory. Now it just doesn't make a
factory at all, to save time.

new() has an option -feature1, but when supplied this is never used or
set. I've left this alone.

From Joseph.Travaglini at FMR.com Thu Aug 10 11:27:03 2006
From: Joseph.Travaglini at FMR.com (Travaglini, Joseph)
Date: Thu, 10 Aug 2006 11:27:03 -0400
Subject: [Bioperl-l] RSS feed weirdness from the wiki
Message-ID: <7FC9739191F8124B990E025C98EB9BCE010C5A97@MSGMROCLN2WIN.DMN1.FMR.COM>

Nevermind, I figured it out. Didn't realize this was being sent to the
whole mailing list either - my apologies!

Joe Travaglini
FISC-FCAT
w: 617.563.3811
c: 978.210.1580

From cjfields at uiuc.edu Thu Aug 10 11:39:34 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 10:39:34 -0500
Subject: [Bioperl-l] UCSC database backend
In-Reply-To:
References:
Message-ID: <64F81741-734F-4CF4-8288-20E2945CC8BC@uiuc.edu>

On Aug 10, 2006, at 9:37 AM, Sean Davis wrote:
...
>
> For genomic information, that can be done relatively easily, either using
> DAS or local flat files indexed by whatever means. Data at UCSC is stored
> relative to the genome, so this may be enough, as long as one does not care
> about having the "original" sequence that generated the alignment that UCSC
> is reporting.

The caveat being the location info (sequence, strand, coordinates) from
the local UCSC database has to correspond to the requested remote
sequence. That shouldn't be a problem if a local UCSC installation is
updated periodically. DAS would definitely fit in here.

>> Would be relatively easy if the MySQL table contains information
>> about which flatfile is used; that I don't know. If not, maybe use
>> an .ini file to map the tables to flatfiles?
>
> I don't think maintaining an additional file of flatfiles is reasonable,
> given the complexity of the system at UCSC, but it is certainly worth
> mentioning as a possibility.

Only reason to bring it up, actually, is as a last resort.

> Sean
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From sdavis2 at mail.nih.gov Thu Aug 10 11:43:19 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 10 Aug 2006 11:43:19 -0400
Subject: [Bioperl-l] UCSC database backend
In-Reply-To: <64F81741-734F-4CF4-8288-20E2945CC8BC@uiuc.edu>
Message-ID:

On 8/10/06 11:39 AM, "Chris Fields" wrote:

>
> On Aug 10, 2006, at 9:37 AM, Sean Davis wrote:
> ...
>>
>> For genomic information, that can be done relatively easily, either using
>> DAS or local flat files indexed by whatever means. Data at UCSC is stored
>> relative to the genome, so this may be enough, as long as one does not care
>> about having the "original" sequence that generated the alignment that UCSC
>> is reporting.
>
> The caveat being the location info (sequence, strand, coordinates)
> from the local UCSC database has to correspond to the requested
> remote sequence. That shouldn't be a problem if a local UCSC
> installation is updated periodically.

The genome files are not updated at UCSC. They use an assembly as the
basis of each database and those sequences do not change, so having a
local mirror of the genomic sequence is probably not necessary except
for speed issues.

Sean

From sdavis2 at mail.nih.gov Thu Aug 10 11:59:04 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 10 Aug 2006 11:59:04 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DB4E98.70703@sendu.me.uk>
Message-ID:

On 8/10/06 11:19 AM, "Sendu Bala" wrote:

> ...Except I need to know if the community considers the speed problem
> solved or not. More radical changes will make SearchIO even faster, eg.
> Chris Fields and Jason (if I interpret the Project priority list item
> correctly) have suggested an end to individual Hit and HSP objects,
> which become just data members of a Result-like object. Ideally I don't
> want to go down that route because we lose quite a bit of OO power; HSP
> objects in particular make important use of inheritance and we'd end up
> stuffing tens of duplicated-code methods into the Result object
> replacement. Ugh. Or having the Result-like object implement 4 different
> interfaces. Ugh. That said, judging by the results below, having hsps as
> pure hashes would result in a ~5x speedup all the time instead of only a
> 1.5x speedup in worst case.

Just curious, but is there a possibility of making "lazy" instantiation of
objects like HSP and HIT objects? Things like parsing and output could be
accomplished without these objects?

Sean

From bix at sendu.me.uk Thu Aug 10 12:04:51 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 10 Aug 2006 17:04:51 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
References:
Message-ID: <44DB5923.3010700@sendu.me.uk>

Sean Davis wrote:
>
> On 8/10/06 11:19 AM, "Sendu Bala" wrote:
>
>> ...Except I need to know if the community considers the speed problem
>> solved or not. More radical changes will make SearchIO even faster, eg.
>> Chris Fields and Jason (if I interpret the Project priority list item
>> correctly) have suggested an end to individual Hit and HSP objects,
>> which become just data members of a Result-like object. Ideally I don't
>> want to go down that route because we lose quite a bit of OO power; HSP
>> objects in particular make important use of inheritance and we'd end up
>> stuffing tens of duplicated-code methods into the Result object
>> replacement. Ugh. Or having the Result-like object implement 4 different
>> interfaces. Ugh. That said, judging by the results below, having hsps as
>> pure hashes would result in a ~5x speedup all the time instead of only a
>> 1.5x speedup in worst case.
>
> Just curious, but is there a possibility of making "lazy" instantiation of
> objects like HSP and HIT objects? Things like parsing and output could be
> accomplished without these objects?

That's what I've done actually, which is why performance varies between
5x and 1.5x (lower performance when the instantiation is forced).

But, things like 'parsing and output' do need to force the instantiation
unless, say, an output module knew about the hash structure of the thing
stored inside a Result object. Which is too horrible a situation to
comprehend. :O

Or is it? What specifically did you have in mind?

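The lazy instantiation being discussed here can be illustrated with a small stand-alone sketch. It is not the code that was committed: Sketch::HSP and Sketch::Hit are hypothetical stand-ins for the real HSP and Hit classes, and the real modules go through an object factory rather than calling a constructor directly. The idea is only that the container stores plain hash refs and turns each one into an object the first time a caller asks for it.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an HSP class; not the BioPerl implementation.
package Sketch::HSP;
my $built = 0;    # counts how many HSP objects have been constructed
sub new { my ($class, %args) = @_; $built++; return bless { %args }, $class }
sub score       { return $_[0]{score} }
sub built_count { return $built }

# Hypothetical stand-in for a Hit that keeps raw hash refs and only
# converts them to objects on demand.
package Sketch::Hit;
sub new { my ($class) = @_; return bless { hsps => [], cursor => 0 }, $class }

# Accept either a ready-made object or a hash ref of constructor arguments.
sub add_hsp { my ($self, $hsp) = @_; push @{ $self->{hsps} }, $hsp; return }

# Counting needs no objects at all.
sub num_hsps { return scalar @{ $_[0]{hsps} } }

sub next_hsp {
    my ($self) = @_;
    my $i = $self->{cursor};
    return if $i >= @{ $self->{hsps} };
    $self->{cursor}++;
    # Convert in place on first access so the work is done at most once.
    if (ref($self->{hsps}[$i]) eq 'HASH') {
        $self->{hsps}[$i] = Sketch::HSP->new(%{ $self->{hsps}[$i] });
    }
    return $self->{hsps}[$i];
}

package main;

my $hit = Sketch::Hit->new;
$hit->add_hsp({ score => $_ }) for (50, 40, 30);
print "stored ", $hit->num_hsps(), ", built ", Sketch::HSP->built_count(), "\n";
# prints: stored 3, built 0

my $first = $hit->next_hsp();
print "first score ", $first->score(), ", built ", Sketch::HSP->built_count(), "\n";
# prints: first score 50, built 1
```

A caller that only counts HSPs pays for no object construction, while a caller that walks every HSP pays the full cost, which is consistent with the spread between the best and worst case timings reported earlier in the thread.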
From cjfields at uiuc.edu Thu Aug 10 12:10:57 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 11:10:57 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DB4E98.70703@sendu.me.uk>
References: <44DB4E98.70703@sendu.me.uk>
Message-ID:

Sendu,

You've been busy!

I think that anything that speeds up SearchIO would help tremendously.
We still need to be wary of API issues, but I have no problem
whatsoever adding this in. You might need to give the community time
to digest these changes before committing to HEAD, though. Here's a
couple of suggestions to get around that if you want to get the code
out there for testing:

Could this be CVS-tagged to an experimental bioperl branch instead?
It could be merged back to the main branch once everybody gets to try
it out, and you could commit changes to the branch (tests, scripts,
etc) along the way based on suggestions. Think of this as a test-drive
for a new Bioperl release.

Alternatively, add the modified modules or patches to Bugzilla for
people to test out. You might want to add a script and test data as
well. These could be updated along the way much like CVS, but
probably a lot noisier.

I agree with the thought of retaining some degree of OO. I still
wonder how much object instantiation really affects speed vs. all
those method calls.

Chris

On Aug 10, 2006, at 10:19 AM, Sendu Bala wrote:

> I am aiming to solve Project priority list item 1.2.1 "Improve
> Bio::SearchIO speed...".
>
> I have made changes that result in a certain speed improvement (up to
> 5x) in parsing BLAST results (and most/all? other SearchIO
> parsers). The
> changes made were as minimal as possible. I did not touch the parsing
> code itself, did not make any API changes except for gain-of-function,
> and did not make any behaviour changes (except for speed!).
> For this reason I don't think there would be any contention about the
> changes so I will just commit them...

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Thu Aug 10 12:25:59 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 11:25:59 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DB5923.3010700@sendu.me.uk>
References: <44DB5923.3010700@sendu.me.uk>
Message-ID:

On Aug 10, 2006, at 11:04 AM, Sendu Bala wrote:

>> Just curious, but is there a possibility of making "lazy"
>> instantiation of objects like HSP and HIT objects? Things like
>> parsing and output could be accomplished without these objects?
>
> That's what I've done actually, which is why performance varies
> between 5x and 1.5x (lower performance when the instantiation is
> forced).
>
> But, things like 'parsing and output' do need to force the
> instantiation unless, say, an output module knew about the hash
> structure of the thing stored inside a Result object. Which is too
> horrible a situation to comprehend. :O
>
> Or is it? What specifically did you have in mind?

The nice thing about SearchIO is the ability to attach a Handler to
return specific objects. For instance, if you didn't want HSPs then
they could be 'junked' by using SearchIO::FastResultEventBuilder,
which just returns hits. I don't know how the other SearchIO modules
(hmmer, etc) deal with this though, but it works for blast and (I
think) blastxml.

You might use this same strategy to have the handler return simple
hashes instead of objects, or create a new set of simpler
Result/Hit/HSP classes to deal with the data.

Alternatively, create a new SearchIO class (call it fastblast; okay,
terrible name) that doesn't use a handler and just returns hashes. I
think Jason pointed out previously that the handler isn't required.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From Joseph.Travaglini at FMR.com Thu Aug 10 10:05:40 2006
From: Joseph.Travaglini at FMR.com (Travaglini, Joseph)
Date: Thu, 10 Aug 2006 10:05:40 -0400
Subject: [Bioperl-l] RSS feed weirdness from the wiki
Message-ID: <7FC9739191F8124B990E025C98EB9BCE010C5A96@MSGMROCLN2WIN.DMN1.FMR.COM>

Hi Jason

Just wondering what script you changed to fix this -- I have the same
problem and can't figure out where the extra newline is.

http://lists.open-bio.org/pipermail/bioperl-l/2006-April/021355.html

Thanks

Joe Travaglini
FISC-FCAT
w: 617.563.3811
c: 978.210.1580

From xianjun.dong at bccs.uib.no Thu Aug 10 12:02:39 2006
From: xianjun.dong at bccs.uib.no (Xianjun Dong)
Date: Thu, 10 Aug 2006 18:02:39 +0200
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1>
References: <027201c6b4b4$ddc201f0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <1155225760.4343.16.camel@lauvtre.ii.uib.no>

Hi, Ryan

Thanks for your reply! But I still have two questions about the
sample code:

1. The translate() function of the Bio::Seq object uses the generic
codon table, but for mitochondrial DNA (mtDNA) we should use a
different codon table.
So, if we take the human transcript ENST00000361390 as an example:

>ENST00000361390 _cDNA
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA

After translating with the above function, the amino acid sequence
looks like this, and contains * (stop codon) within the sequence (and
also at the end of the sequence). But this is actually mtDNA; if we use
a different codon table, the * within the sequence will change to 'W'
(Trp). (Because in vertebrate mitochondria 'AGA' and 'AGG' are also
stop codons, but not 'UGA', which codes for tryptophan instead.)

>ENST00000361390 aa_beforefilter
IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHLL*KNFLPLTLALLI*YVSIPITISSIPPQT*

2.
My second question is: If there are * both in the middle and end of the translated sequence (with pattern AAAAAA*AAAAAAAAAAAAAAA*AAA*), like above case, after the two checks for stop codon, all * will be filtered out. So, when translate back from aa_aln to dna_aln, there should be no stop codon included. But actually, when I track the program, it display that there are still stop codon included. Here is the DNA alignment after recalling the aa_to_dna_aln function. How to explain this? >ENST00000361390 aa_to_dna_aln ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT I attached my script for two ortholog transcripts demo and the output (including the error msg) here. Could you kindly check for me? Thanks! 
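
On the first question, the effect of the codon table can be checked without BioPerl. The sketch below is an illustration only: translate_with_table is a made-up helper, and the two strings are the amino acid rows of NCBI genetic code tables 1 (standard) and 2 (vertebrate mitochondrial), with codons ordered T, C, A, G at each position. BioPerl's own translate() also accepts a codon table id, but the exact calling convention depends on the installed version, so check its documentation rather than relying on this sketch.

```perl
use strict;
use warnings;

# Amino acids for the 64 codons in NCBI order (bases T, C, A, G).
my %TABLE = (
    1 => 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG',  # standard
    2 => 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG',  # vertebrate mitochondrial
);
my %BASE_INDEX = (T => 0, C => 1, A => 2, G => 3);

# Hypothetical helper: translate frame 1 of a DNA string with the given table.
sub translate_with_table {
    my ($dna, $table_id) = @_;
    my $aas = $TABLE{$table_id} or die "unknown codon table $table_id";
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my @idx = map { $BASE_INDEX{$_} } split //, uc(substr($dna, $i, 3));
        # Codons containing anything other than A, C, G or T become 'X'.
        if (grep { !defined $_ } @idx) { $protein .= 'X'; next }
        $protein .= substr($aas, 16 * $idx[0] + 4 * $idx[1] + $idx[2], 1);
    }
    return $protein;
}

# Codons CTA CTA TGA ACC, taken from around the first internal '*' of the
# human transcript quoted above.
my $fragment = 'CTACTATGAACC';
print "standard:      ", translate_with_table($fragment, 1), "\n";   # LL*T
print "mitochondrial: ", translate_with_table($fragment, 2), "\n";   # LLWT
```

With table 2 the internal TGA codons read as W rather than as stops, which matches the expectation stated in the first question.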
-Xianjun ///////////////////////////////////////////////////////////////////// /////////////////////////////// output ////////////////////////////// ///////////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl calculator.pl >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCCAGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCGGGAGTACCACCATACATATAG Calculate the Ka/Ks for ENSG00000198888 : ENSMUSG00000064341 ... 
>ENSMUST00000082392 aa_beforefilter VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFIIAPTLSLTLALSL*VPLPIPHPLINLNLGILFILATSSLSVYSIL*SG*ASNSKYSLFGALRAVAQTISYEVTIAIILLSVLLINGSYSLQTLITTQEHI*LLLPA*PIAII*FISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLL*KNFLPLTLALCM*HISLPIFTAGVPPYI* >ENSMUST00000082392 aa_afterfilter VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFIIAPTLSLTLALSLVPLPIPHPLINLNLGILFILATSSLSVYSILSGASNSKYSLFGALRAVAQTISYEVTIAIILLSVLLINGSYSLQTLITTQEHILLLPAPIAIIFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLLKNFLPLTLALCMHISLPIFTAGVPPYI >ENST00000361390 aa_beforefilter IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHLL*KNFLPLTLALLI*YVSIPITISSIPPQT* >ENST00000361390 aa_afterfilter IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPLTLALLIYVSIPITISSIPPQT Print out the DNA sequences translated back from aa_to_dna function: >ENSMUST00000082392 aa_to_dna_aln 
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAACGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCATTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATTATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATTAATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGATGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTAACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACCCAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAAACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCAGCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATTATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTACTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTTCTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTT >ENST00000361390 aa_to_dna_aln ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCCTTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATCACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTCAACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGGTGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTCACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACACAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAGACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCCGCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTACTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT -------------------- WARNING --------------------- MSG: There was an error - see error_string for the 
program output
---------------------------------------------------

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116

/////////////////////////////////////////////////////////////////////
/////////////////////////////// script //////////////////////////////
/////////////////////////////////////////////////////////////////////
sub kaks_calculate {
    my %seqs=@_;
    #my %seqs = %$seqs_ref;
    my @prots;
    my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);

    # process each sequence
    for my $seqid (keys %seqs) {
        my $seq = $seqs{$seqid};
        my $protein = $seq->translate();
        my $pseq = $protein->seq();
        print ">$seqid aa_beforefilter \n$pseq\n";
        if( $pseq =~ /\*/ &&
            $pseq !~ /\*$/ ) {
            warn("provided a CDS sequence with a stop codon, PAML will choke!");
            exit(0);
        }
        # Tcoffee can't handle '*' even if it is trailing
        $pseq =~ s/\*//g;
        print ">$seqid aa_afterfilter \n$pseq\n";
        $protein->seq($pseq);
        push @prots, $protein;
    }

    if( @prots < 2 ) {
        warn("Need at least 2 CDS sequences to proceed");
        exit(0);
    }

    # open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
    # Align the sequences with clustalw
    my $aa_aln = $aln_factory->align(\@prots);
    # project the protein alignment back to CDS coordinates
    my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);

    my @each = $dna_aln->each_seq();

    print "\nPrint out the DNA sequences translated back from aa_to_dna function:\n\n";
    foreach my $s ( $dna_aln->each_seq() ) {
        print ">".$s->display_id." aa_to_dna_aln\n".$s->seq()."\n";
    }

    my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
        ( -params => { 'runmode' => -2,
                       'seqtype' => 1,
                     } );

    # set the alignment object
    $kaks_factory->alignment($dna_aln);

    # run the KaKs analysis
    my ($rc,$parser) = $kaks_factory->run();
    my $result = $parser->next_result;
    my $MLmatrix = $result->get_MLmatrix();

    my @otus = $result->get_seqs();
    # this gives us a mapping from the PAML order of sequences back to
    # the input order (since names get truncated)
    my @pos = map {
        my $c= 1;
        foreach my $s ( @each ) {
            last if( $s->display_id eq $_->display_id );
            $c++;
        }
        $c;
    } @otus;

    # print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
    print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
    for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
        for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
            my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
            my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
            # print OUT join("\t", $otus[$i]->display_id,
            print join("\t", $otus[$i]->display_id,
                       $otus[$j]->display_id,
                       $MLmatrix->[$i]->[$j]->{'dN'},
                       $MLmatrix->[$i]->[$j]->{'dS'},
                       $MLmatrix->[$i]->[$j]->{'omega'},
                       sprintf("%.2f",$sub_aa_aln->percentage_identity),
                       sprintf("%.2f",$sub_dna_aln->percentage_identity),
                       ), "\n";
        }
    }
}

-------------------- WARNING ---------------------
MSG: There was an error - see error_string for the program output
---------------------------------------------------

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116
----------------------------------------------------------------

On Mon, 2006-07-31 at 11:20 -0400, Ryan Golhar wrote:
> Hi Xianjun,
>
> I just did some work on this module including the example.
>
> >> it does not occur in the codon position
> >>(say, the third codon's position is not a times of 3).
> >>Why it effect the result?
>
> If I'm interpreting your question correctly, the stop codons in your
> sequence occur in-frame. This is why it is choking.
>
> >>So, when translate back from aa_aln to dna_aln, there should be no
> stop codon included. SO, why it can not pass?
>
> The Ka and Ks statistics are not calculated based on the protein
> sequence; they are calculated based on the DNA sequence. The protein
> sequence is used to provide an alignment for the codons of the DNA
> sequence. Checking the protein sequence for * is an easier way to
> identify in-frame stop codons than scanning the DNA sequence.
>
> The two checks for stop codons you mentioned are there to check for stop
> codons within the sequence without worrying about the last amino acid. The
> second part removes the * at the end of the sequence (not the middle).
>
> If you want to remove the in-frame stop codons, you can, but do so
> before translating it to protein sequences.
>
> Ryan
>
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun Dong
> Sent: Monday, July 31, 2006 7:56 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] PAML + Codeml problem..
> > > Hi, > > I have a problem during running the Codeml Wiki-HOWTO code: > > Here is the error message: > //////////////////////////////////////////////////////////////// > [xianjund at lauvtre kaks]$ perl paml.pl test.fa > > -------------------- WARNING --------------------- > MSG: There was an error - see error_string for the program output STACK > Bio::Tools::Run::Phylo::PAML::Codeml::run > /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML/C > odeml.pm:581 > STACK toplevel paml.pl:61 > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > MSG: Unknown format of PAML output > STACK: Error::throw > STACK: > Bio::Root::Root::throw > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > STACK: > Bio::Tools::Phylo::PAML::_parse_summary > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > STACK: > Bio::Tools::Phylo::PAML::next_result > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > STACK: paml.pl:62 > ---------------------------------------------------------------- > //////////////////////////////////////////////////////////////// > > My test sequence is: > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > 
CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > >ENSMUST00000082392 > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATATAG > > Sure, I checked it. There is some stop codon in it. If I replace it with > non-stop codon, it works. 
> > For example, > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCGAA > CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC > TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG > caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC > ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA > CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA > ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCcaa > >ENSMUST00000082392 > GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaaAA > CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc > aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA > caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca > aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA > ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA > GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT > ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > 
CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATAcaa > > But my question is: it does not occur in the codon position (say, the > third codon's position is not a times of 3). Why it effect the result? > > And also there is code to filter out the stop codon in the sample code > (as the following shown) /////////////////////////////// > if( $pseq =~ /\*/ && > $pseq !~ /\*$/ ) { > warn("provided a CDS sequence with a stop codon, PAML will > choke!"); > exit(0); > } > # Tcoffee can't handle '*' even if it is trailing > $pseq =~ s/\*//g; > ///////////////////////////// > > So, when translate back from aa_aln to dna_aln, there should be no stop > codon included. SO, why it can not pass? > > Thanks for answer! > > P.S: attach my code here: > ///////////////////////////////////////////////////////// > #!/usr/bin/perl -w > use strict; > use Bio::Tools::Run::Phylo::PAML::Codeml; > use Bio::Tools::Run::Alignment::Clustalw; > > # for projecting alignments from protein to R/DNA space > use Bio::Align::Utilities qw(aa_to_dna_aln); > # for input of the sequence data > use Bio::SeqIO; > use Bio::AlignIO; > > my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1); > my $seqdata = shift || 'test.fa'; > > my $seqio = new Bio::SeqIO(-file => $seqdata, > -format => 'fasta'); > my %seqs; > my @prots; > # process each sequence > while ( my $seq = $seqio->next_seq ) { > $seqs{$seq->display_id} = $seq; > # translate them into protein > my $protein = $seq->translate(); > my $pseq = $protein->seq(); > if( $pseq =~ /\*/ && > $pseq !~ /\*$/ ) { > warn("provided a CDS sequence with a stop codon, PAML will > choke!"); > exit(0); > } > # Tcoffee can't handle '*' even if it is trailing > $pseq =~ s/\*//g; > > $protein->seq($pseq); > push @prots, $protein; > } > > if( @prots < 2 ) { > warn("Need at least 2 CDS sequences to proceed"); > exit(0); > } > > # open(OUT, ">align_output.txt") || die("cannot open output > align_output for 
writing"); # Align the sequences with clustalw my > $aa_aln = $aln_factory->align(\@prots); # project the protein alignment > back to CDS coordinates my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs); > > my @each = $dna_aln->each_seq(); > > my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new > ( -params => { 'runmode' => -2, > 'seqtype' => 1, > }, > -save_tempfiles => 1, > -verbose => 1); > > # set the alignment object > $kaks_factory->alignment($dna_aln); > > # run the KaKs analysis > my ($rc,$parser) = $kaks_factory->run(); > my $result = $parser->next_result; > my $MLmatrix = $result->get_MLmatrix(); > > my @otus = $result->get_seqs(); > # this gives us a mapping from the PAML order of sequences back to # the > input order (since names get truncated) my @pos = map { > my $c= 1; > foreach my $s ( @each ) { > last if( $s->display_id eq $_->display_id ); > $c++; > } > $c; > } @otus; > > print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID > CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) { > for( my $j = $i+1; $j < (scalar @otus); $j++ ) { > my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); > my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); > print join("\t", $otus[$i]->display_id, > $otus[$j]->display_id,$MLmatrix->[$i]->[$j]- > >{'dN'}, > $MLmatrix->[$i]->[$j]->{'dS'}, > $MLmatrix->[$i]->[$j]->{'omega'}, > sprintf("%.2f",$sub_aa_aln- > >percentage_identity), > sprintf("%.2f",$sub_dna_aln- > >percentage_identity), > ), "\n"; > } > } > From sdavis2 at mail.nih.gov Thu Aug 10 12:41:08 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 10 Aug 2006 12:41:08 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DB5923.3010700@sendu.me.uk> Message-ID: On 8/10/06 12:04 PM, "Sendu Bala" wrote: > Sean Davis wrote: >> >> On 8/10/06 11:19 AM, "Sendu Bala" wrote: >> >>> ...Except I need to know if the community considers the speed problem >>> solved or not. 
More radical changes will make SearchIO even faster, eg. >>> Chris Fields and Jason (if I interpret the Project priority list item >>> correctly) have suggested an end to individual Hit and HSP objects, >>> which become just data members of a Result-like object. Ideally I don't >>> want to go down that route because we lose quite a bit of OO power; HSP >>> objects in particular make important use of inheritance and we'd end up >>> stuffing tens of duplicated-code methods into the Result object >>> replacement. Ugh. Or having the Result-like object implement 4 different >>> interfaces. Ugh. That said, judging by the results below, having hsps as >>> pure hashes would result in a ~5x speedup all the time instead of only a >>> 1.5x speedup in worst case. >> >> Just curious, but is there a possibility of making "lazy" instantiation of >> objects like HSP and HIT objects? Things like parsing and output could be >> accomplished without these objects? > > That's what I've done actually, which is why performance varies between > 5x and 1.5x (lower performance when the instantiation is forced). > > But, things like 'parsing and output' do need to force the instantiation > unless, say, an output module knew about the hash structure of the thing > stored inside a Result object. Which is too horrible a situation to > comprehend. :O I was just asking--I have no idea what is possible. But what you have already accomplished is QUITE impressive! Sean From cjfields at uiuc.edu Thu Aug 10 13:18:08 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 12:18:08 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <97BFE9D4-A485-46CE-BE04-9FC3C24FDD6F@uiuc.edu> On Aug 10, 2006, at 11:41 AM, Sean Davis wrote: > > On 8/10/06 12:04 PM, "Sendu Bala" wrote: >> ... >> That's what I've done actually, which is why performance varies >> between >> 5x and 1.5x (lower performance when the instantiation is forced). 
>> >> But, things like 'parsing and output' do need to force the >> instantiation >> unless, say, an output module knew about the hash structure of the >> thing >> stored inside a Result object. Which is too horrible a situation to >> comprehend. :O > > I was just asking--I have no idea what is possible. But what you have > already accomplished is QUITE impressive! > > Sean Agreed! I say get the code out there (CVS, Bugzilla) so people can start testing, coding, make suggestions, etc. Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From arareko at campus.iztacala.unam.mx Thu Aug 10 13:25:58 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Thu, 10 Aug 2006 12:25:58 -0500 Subject: [Bioperl-l] XML parser preference? In-Reply-To: References: <20060810123157.D9C9483BB@sienna.algonomics.com> Message-ID: <44DB6C26.8080607@campus.iztacala.unam.mx> As long as we don't complicate things for new users, picking those parsers is fine for me. Thanks Jurgen (and Rutger) for giving us advice on libxml2's availability :) Mauricio. Chris Fields wrote: > Jurgen, > > Thanks for pointing that out! However, the problem is we want to > keep the number of dependencies down; there are already four XML > parser dependencies for Bioperl (XML::Twig is one, but XML::LibXML > isn't). > > Maybe new modules which require XML parsing stick with four XML > parsers. However, not the current four (XML::DOM, XML::Twig, > XML::Parser, XML::SAX). 
>
> Maybe we should pick four XML parsers, each with their own particular
> strengths:
>
> 1) XML::SAX (SAX parsing; flexible, can use pure Perl, ExpatXS, etc)
>    Switch using XML::Parser to XML::SAX (done for
>    Bio::SearchIO::blastxml)
> 2) XML::LibXML (DOM parsing; maintained, up to date, fast)
>    Switch using XML::DOM to XML::LibXML
> 3) XML::Twig (DOM-like, SAX-based) - great for processing 'chunks'
>    of XML
>    Used in Bio::DB::Taxonomy::entrez
> 4) XML::Simple (small XML) - very easy to use XML parser
>
> Since they are currently available for most (all?) OS's, shouldn't be
> a problem. What do you think Mauricio?
>
> Chris
>
>
> On Aug 10, 2006, at 7:29 AM, Jurgen Pletinckx wrote:
>
>> | I have no doubt that XML::LibXML is a great parser (I've used
>> | it a few times), the problem with it is that it runs on top of
>> | libxml2's C library. On *nix systems it's fairly simple to have
>> | this dependency compiled and running, but what about having it
>> | under other OS's (e.g. Windows)?
>> |
>> | Introducing XML::LibXML as a dependency into the toolkit will
>> | probably place EUtilities as a module not usable by everyone,
>> | especially those who use BioPerl in a OS where
>> | installing/compiling C dependencies can be a headache.
>>
>> Regarding XML::LibXML, there does appear to be an up-to-date ppm
>> package (which fetches libxml2.dll) at
>>
>> http://theoryx5.uwinnipeg.ca/ppms/XML-LibXML.ppd
>>
>> (and less than a week since the release of the corresponding
>> version to cpan, too.)
>>
>> So the threshold for distribution to Windows, at least, is less
>> high than it might have been.
>>
>> --
>> Jurgen Pletinckx
>> AlgoNomics NV
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
MAURICIO HERRERA CUADRA
arareko at campus.iztacala.unam.mx
Laboratorio de Genética
Unidad de Morfofisiología y Función
Facultad de Estudios Superiores Iztacala, UNAM

From aaron.j.mackey at gsk.com Thu Aug 10 13:39:59 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Thu, 10 Aug 2006 13:39:59 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DB4E98.70703@sendu.me.uk>
Message-ID:

> ...Except I need to know if the community considers the speed problem
> solved or not. More radical changes will make SearchIO even faster, eg.
> Chris Fields and Jason (if I interpret the Project priority list item
> correctly) have suggested an end to individual Hit and HSP objects,
> which become just data members of a Result-like object. Ideally I don't
> want to go down that route because we lose quite a bit of OO power;

As already mentioned, a lazy-evaluation approach would also work.

Jason and I did once talk about an entirely new parsing/object-building
framework, based on nested grammars; in essence, the "top-level" parser
simply "chunks" the input into blobs of (minimally parsed) text that
correspond to the top level result object. This chunk/blob is the input
to the next-level parser for Hits, which in turn has chunks for HSPs.
Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same
Generic*I-implementing objects we're already using. Thus, if HSPs are
never interrogated, they're never parsed; as soon as one is interrogated,
it gets parsed, and so on. In such an environment, you can imagine
flyweight objects that are built very quickly/easily (recall that many
previous analyses of BioPerl speed problems are not related to parsing, so
much as heavy-weight object creation).
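The parse-on-first-access idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the LazyResult class, its toy "name<TAB>score" chunk format and the builds counter are all hypothetical and are not part of Bio::SearchIO or any existing BioPerl module.

```perl
use strict;
use warnings;

package LazyResult;    # hypothetical class, not the Bio::Search API

sub new {
    my ($class, @raw_hits) = @_;
    # keep only the cheap, unparsed text chunks at construction time
    return bless { raw => [@raw_hits], parsed => [], builds => 0 }, $class;
}

sub hit {
    my ($self, $i) = @_;
    # parse a chunk the first time it is asked for, then reuse the result
    $self->{parsed}[$i] //= $self->_parse_hit($self->{raw}[$i]);
    return $self->{parsed}[$i];
}

sub _parse_hit {
    my ($self, $chunk) = @_;
    $self->{builds}++;                          # count the expensive step
    my ($name, $score) = split /\t/, $chunk;    # toy "name<TAB>score" format
    return { name => $name, score => $score };
}

package main;

my $result = LazyResult->new("hitA\t42", "hitB\t17");
print $result->hit(1)->{name}, "\n";    # only the second chunk is parsed here
```

The same pattern could nest one level further, with each parsed hit holding unparsed HSP chunks of its own, which is where the saving would come from when most HSPs are never looked at.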
I happen to have such a nested parser lying around for Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C parser backend (yet another experiment in trying to get SearchIO to run faster), so really isn't ready for prime time (being entirely untested, and probably not even finished). -Aaron From cjfields at uiuc.edu Thu Aug 10 14:54:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 13:54:18 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: Message-ID: <002e01c6bcae$62223c20$15327e82@pyrimidine> > > ...Except I need to know if the community considers the speed problem > > solved or not. More radical changes will make SearchIO even faster, eg. > > Chris Fields and Jason (if I interpret the Project priority list item > > correctly) have suggested an end to individual Hit and HSP objects, > > which become just data members of a Result-like object. Ideally I don't > > want to go down that route because we lose quite a bit of OO power; > > As already mentioned, a lazy-evaluation approach would also work. > > Jason and I did once talk about an entirely new parsing/object-building > framework, based on nested grammars; in essence, the "top-level" parser, > simply "chunks" the input into blobs of (minimally parsed) text that > correspond to the top level result object. This chunk/blob is the input > to the next-level parser for Hits, which in return has chunk for HSPs. > Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same > Generic*I-implementing objects we're already using. Thus, if HSPs are > never interrogated, they're never parsed; as soon as one is interrogated, > it gets parsed, and so on. In such an environment, you can imagine > flyweight objects that are built very quickly/easily (recall that many > previous analyses of BioPerl speed problems are not related to parsing, so > much as heavy-weight object creation). 
> > I happen to have such a nested parser lying around for > Bio::SearchIO::fasta.pm, but it also uses an Inline::C, yacc-generated C > parser backend (yet another experiment in trying to get SearchIO to run > faster), so really isn't ready for prime time (being entirely untested, > and probably not even finished). > > -Aaron The 'nested parsers' idea sounds like a good approach as well though, like you indicate, it would be outside of SearchIO. How well does it scale i.e. very large reports? Chris From bix at sendu.me.uk Thu Aug 10 15:06:33 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 20:06:33 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44DB83B9.6050507@sendu.me.uk> aaron.j.mackey at gsk.com wrote: >> ...Except I need to know if the community considers the speed problem >> solved or not. More radical changes will make SearchIO even faster, eg. >> Chris Fields and Jason (if I interpret the Project priority list item >> correctly) have suggested an end to individual Hit and HSP objects, >> which become just data members of a Result-like object. Ideally I don't >> want to go down that route because we lose quite a bit of OO power; > > As already mentioned, a lazy-evaluation approach would also work. > > Jason and I did once talk about an entirely new parsing/object-building > framework, based on nested grammars; in essence, the "top-level" parser, > simply "chunks" the input into blobs of (minimally parsed) text that > correspond to the top level result object. This chunk/blob is the input > to the next-level parser for Hits, which in return has chunk for HSPs. > Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* the same > Generic*I-implementing objects we're already using. Thus, if HSPs are > never interrogated, they're never parsed; as soon as one is interrogated, > it gets parsed, and so on. As I understand your description, this is exactly what I do. 
My 'chunks' are the hashes that are normally used to create a new Hit/HSP
object. The initial parse of the data file results in a small number of
objects (Results) that contain all the data: HSP data nested in Hit data
nested in the Result objects. When you actually want to do something with
a certain hit or HSP it becomes an object, allowing you to call its
methods like normal.

Or are you suggesting something that would be even better than that? If
so, please elucidate! :)

From golharam at umdnj.edu Thu Aug 10 14:53:51 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Thu, 10 Aug 2006 14:53:51 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <1155225760.4343.16.camel@lauvtre.ii.uib.no>
Message-ID: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1>

Hi Xianjun,

1. The Bio::Seq::translate function (to my knowledge) only uses the
generic codon table. So, you will need to translate the DNA sequence
using some other method. In any case, even removing the *'s from the
protein sequence still leaves the stop codons in the DNA sequence, which
must be removed.

2. The checks were written to assume that the sequences provided are
full-length coding sequences. That means the start and stop codons are
present as well. When the translate function is called, the stop codon is
translated as a '*'. The script initially just removed the * from the end
of the sequence and continued on. I added a check to see if there is a
'*' in the middle of the sequence because I found in some of my genes
that there are in fact in-frame stop codons which actually code for
selenocysteine. I see the warning check isn't working for some reason -
odd, it worked when I wrote it.

You can remove the *'s from the protein sequence, but you must also be
sure to remove the corresponding codons from the DNA sequence before
invoking run() on the Codeml package. I suppose someone could add a check
to the script to remove the in-frame stop codons.
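A minimal sketch of such a check, assuming the CDS is available as a plain string that starts in frame 0. The helper name strip_inframe_stops is hypothetical; it is not part of BioPerl or of the pairwise_kaks script. The stop codons are passed in explicitly because the right set depends on the genetic code in use.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): drop every in-frame stop codon
# from a CDS string. Assumes the sequence starts in frame 0.
sub strip_inframe_stops {
    my ($dna, @stops) = @_;
    @stops = qw(TAA TAG TGA) unless @stops;    # standard-code stops by default
    my %is_stop = map { uc($_) => 1 } @stops;

    my $clean = '';
    # walk the sequence one codon at a time; a trailing partial codon is kept
    for (my $i = 0; $i < length $dna; $i += 3) {
        my $codon = substr($dna, $i, 3);
        next if $is_stop{ uc $codon };
        $clean .= $codon;
    }
    return $clean;
}

# standard code: TGA and the trailing TAA are both dropped
print strip_inframe_stops("ATGTGAAAATAA"), "\n";                        # ATGAAA
# caller-supplied stop set without TGA: TGA is kept
print strip_inframe_stops("ATGTGAAAATAA", qw(TAA TAG AGA AGG)), "\n";   # ATGTGAAAA
```

The cleaned string would then have to replace the original CDS before translation and alignment, so that the protein and DNA sequences stay in register. Whether dropping codons is appropriate at all is a separate question: for genes where the "stop" really encodes an amino acid, translating with the matching codon table may be the better route.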
Remember, the pairwise_kaks script is just a starting point (tutorial) to show you how you can go about performing this type of an analysis. In fact, I've since switched from PAML to using a different method PBL which a colleuge coded. I found that PAML tends to overestimate synonymous rates in some cases. Let me know if this helps. If not, I'll try to clarify. Ryan -----Original Message----- From: Xianjun Dong [mailto:xianjun.dong at bccs.uib.no] Sent: Thursday, August 10, 2006 12:03 PM To: golharam at umdnj.edu Cc: bioperl-l at lists.open-bio.org Subject: RE: [Bioperl-l] PAML + Codeml problem.. Hi, Ryan Thanks for your reply! But here I still have two questions about the sample code: 1. the translate() function of Bio::Seq object use generic codon table, but for Mitochondrial DNA (mtDNA), we should use different codon table. So, if we take the human transcript ENST00000361390 as example, >ENST00000361390 _cDNA ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCTAA After translating with above function, the amino acid sequence is like this, which contain *(stop 
codon) within the sequence(also at the end of the sequence). But actually, this is a mtDNA, if we use different codon table, the * within the sequence will change to 'W'(Trp). (Because in vertebrate mitochondria "AGA" and "AGG" are also stop codons, but not "UGA", which codes for tryptophan instead.) >ENST00000361390 aa_beforefilter IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL L*KNFLPLTLALLI*YVSIPITISSIPPQT* 2. My second question is: If there are * both in the middle and end of the translated sequence (with pattern AAAAAA*AAAAAAAAAAAAAAA*AAA*), like above case, after the two checks for stop codon, all * will be filtered out. So, when translate back from aa_aln to dna_aln, there should be no stop codon included. But actually, when I track the program, it display that there are still stop codon included. Here is the DNA alignment after recalling the aa_to_dna_aln function. How to explain this? 
>ENST00000361390 aa_to_dna_aln ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT I attached my script for two ortholog transcripts demo and the output (including the error msg) here. Could you kindly check for me? Thanks! 
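To make the first question above concrete, here is a minimal, self-contained sketch in plain Perl (not BioPerl's own code) of how the choice of codon table changes the translation. The two table strings follow the NCBI genetic code layout (bases ordered T, C, A, G at each codon position) for table 1 (standard) and table 2 (vertebrate mitochondrial). The helper name translate_with_table is hypothetical. The test fragment CTA TGA ACC is taken in frame from the ENST00000361390 sequence quoted above.

```perl
use strict;
use warnings;

# NCBI genetic codes; codons are enumerated with bases in T, C, A, G order.
my %TABLE = (
    1 => 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG', # standard
    2 => 'FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG', # vertebrate mitochondrial
);
my %BASE_INDEX = (T => 0, C => 1, A => 2, G => 3);

# Hypothetical helper: translate frame 1 of a DNA string with the given table.
sub translate_with_table {
    my ($dna, $table_id) = @_;
    my $aas = $TABLE{$table_id} or die "unknown codon table $table_id";
    $dna = uc $dna;
    $dna =~ tr/U/T/;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length $dna; $i += 3) {
        my @b = split //, substr($dna, $i, 3);
        # Ambiguous or unexpected bases are reported as X.
        if (grep { !exists $BASE_INDEX{$_} } @b) {
            $protein .= 'X';
            next;
        }
        my $idx = 16 * $BASE_INDEX{$b[0]} + 4 * $BASE_INDEX{$b[1]} + $BASE_INDEX{$b[2]};
        $protein .= substr($aas, $idx, 1);
    }
    return $protein;
}

my $fragment = 'CTATGAACC';                       # CTA TGA ACC
print translate_with_table($fragment, 1), "\n";   # prints L*T
print translate_with_table($fragment, 2), "\n";   # prints LWT
```

In BioPerl itself the equivalent step would be to ask translate() for codon table 2. Recent releases document a -codontable_id argument for this, but whether the release used in this thread accepts it in that form is an assumption to check against the installed Bio::PrimarySeqI documentation.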
-Xianjun ///////////////////////////////////////////////////////////////////// /////////////////////////////// output ////////////////////////////// ///////////////////////////////////////////////////////////////////// [xianjund at lauvtre kaks]$ perl calculator.pl >ENST00000361390 ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC AGCATTCCCCCTCAAACCTAA >ENSMUST00000082392 GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA 
GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG GGAGTACCACCATACATATAG Calculate the Ka/Ks for ENSG00000198888 : ENSMUSG00000064341 ... >ENSMUST00000082392 aa_beforefilter VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFI IAPTLSLTLALSL*VPLPIPHPLINLNLGILFILATSSLSVYSIL*SG*ASNSKYSLFGALRAVAQTISYEV TIAIILLSVLLINGSYSLQTLITTQEHI*LLLPA*PIAII*FISTLAETNRAPFDLTEGESELVSGFNVEYA AGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHL L*KNFLPLTLALCM*HISLPIFTAGVPPYI* >ENSMUST00000082392 aa_afterfilter VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFI IAPTLSLTLALSLVPLPIPHPLINLNLGILFILATSSLSVYSILSGASNSKYSLFGALRAVAQTISYEVTIA IILLSVLLINGSYSLQTLITTQEHILLLPAPIAIIFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFAL FFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLLKNFLP LTLALCMHISLPIFTAGVPPYI >ENST00000361390 aa_beforefilter IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL L*KNFLPLTLALLI*YVSIPITISSIPPQT* >ENST00000361390 aa_afterfilter IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI TAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLA IILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFAL FFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPL TLALLIYVSIPITISSIPPQT Print out the DNA sequences translated back from aa_to_dna function: >ENSMUST00000082392 aa_to_dna_aln GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA 
CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTT >ENST00000361390 aa_to_dna_aln ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT -------------------- WARNING --------------------- MSG: There was an error - see error_string for the program output 
---------------------------------------------------

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116

/////////////////////////////////////////////////////////////////////
/////////////////////////////// script //////////////////////////////
/////////////////////////////////////////////////////////////////////

sub kaks_calculate {
    my %seqs=@_;
    #my %seqs = %$seqs_ref;
    my @prots;
    my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);

    # process each sequence
    for my $seqid (keys %seqs) {
        my $seq = $seqs{$seqid};
        my $protein = $seq->translate();
        my $pseq = $protein->seq();
        print ">$seqid aa_beforefilter \n$pseq\n";
        if( $pseq =~ /\*/ &&
            $pseq !~ /\*$/ ) {
            warn("provided a CDS sequence with a stop codon, PAML will choke!");
            exit(0);
        }
        # Tcoffee can't handle '*' even if it is trailing
        $pseq =~ s/\*//g;
        print ">$seqid aa_afterfilter \n$pseq\n";
        $protein->seq($pseq);
        push @prots, $protein;
    }

    if( @prots < 2 ) {
        warn("Need at least 2 CDS sequences to proceed");
        exit(0);
    }

    # open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
    # Align the sequences with clustalw
    my $aa_aln = $aln_factory->align(\@prots);
    # project the protein alignment back to CDS coordinates
    my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);

    my @each = $dna_aln->each_seq();

    print "\nPrint out the DNA sequences translated back from aa_to_dna function:\n\n";
    foreach my $s ( $dna_aln->each_seq() ) {
        print ">".$s->display_id." aa_to_dna_aln\n".$s->seq()."\n";
    }

    my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
        ( -params => { 'runmode' => -2,
                       'seqtype' => 1,
                     } );

    # set the alignment object
    $kaks_factory->alignment($dna_aln);

    # run the KaKs analysis
    my ($rc,$parser) = $kaks_factory->run();
    my $result = $parser->next_result;
    my $MLmatrix = $result->get_MLmatrix();

    my @otus = $result->get_seqs();
    # this gives us a mapping from the PAML order of sequences back to
    # the input order (since names get truncated)
    my @pos = map {
        my $c= 1;
        foreach my $s ( @each ) {
            last if( $s->display_id eq $_->display_id );
            $c++;
        }
        $c;
    } @otus;

    # print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
    print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
    for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
        for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
            my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
            my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
            # print OUT join("\t", $otus[$i]->display_id,
            print join("\t", $otus[$i]->display_id,
                       $otus[$j]->display_id,
                       $MLmatrix->[$i]->[$j]->{'dN'},
                       $MLmatrix->[$i]->[$j]->{'dS'},
                       $MLmatrix->[$i]->[$j]->{'omega'},
                       sprintf("%.2f",$sub_aa_aln->percentage_identity),
                       sprintf("%.2f",$sub_dna_aln->percentage_identity),
                       ), "\n";
        }
    }
}

-------------------- WARNING ---------------------
MSG: There was an error - see error_string for the program output
---------------------------------------------------

------------- EXCEPTION: Bio::Root::NotImplemented -------------
MSG: Unknown format of PAML output
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
STACK: main::kaks_calculate calculator.pl:176
STACK: calculator.pl:116
---------------------------------------------------------------- On Mon, 2006-07-31 at 11:20 -0400, Ryan Golhar wrote: > Hi Xianjun, > > I just did some work on this module including the example. > > >> it does not occur in the codon position > >>(say, the third codon's position is not a times of 3). > >>Why it effect the result? > > If I'm interpreting your question correctly, the stop codons in your > sequence occur in-frame. This is why it is choking. > > >>So, when translate back from aa_aln to dna_aln, there should be no > stop codon included. SO, why it can not pass? > > The Ka and Ks statistics are not calculated based on the protein > sequence, they are calculated based on the DNA sequence. The protein > sequence is used to provide a alignment for the codons of the DNA > sequence. Checking the protein sequence for * is easier to identify > in-frame stop codons than scanning the DNA sequence. > > The two checks for stop codons you mentioned are to check for stop > codons within the sequence without worry for the last amino acid. The > second part remove the * at the end of the sequence (not the middle). > > If you want to remove the in-frame stop codons, you can, but do so > before translating it to protein sequences. > > Ryan > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun > Dong > Sent: Monday, July 31, 2006 7:56 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] PAML + Codeml problem.. 
> > > Hi, > > I have a problem during running the Codeml Wiki-HOWTO code: > > Here is the error message: > //////////////////////////////////////////////////////////////// > [xianjund at lauvtre kaks]$ perl paml.pl test.fa > > -------------------- WARNING --------------------- > MSG: There was an error - see error_string for the program output > STACK Bio::Tools::Run::Phylo::PAML::Codeml::run > /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML > /C > odeml.pm:581 > STACK toplevel paml.pl:61 > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > MSG: Unknown format of PAML output > STACK: Error::throw > STACK: > Bio::Root::Root::throw > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > STACK: > Bio::Tools::Phylo::PAML::_parse_summary > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > STACK: > Bio::Tools::Phylo::PAML::next_result > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > STACK: paml.pl:62 > ---------------------------------------------------------------- > //////////////////////////////////////////////////////////////// > > My test sequence is: > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > AA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > 
CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > >ENSMUST00000082392 > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAG > AA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATATAG > > Sure, I checked it. There is some stop codon in it. If I replace it > with non-stop codon, it works. 
> > For example, > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCG > AA > CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC > TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG > caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC > ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA > CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA > ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCcaa > >ENSMUST00000082392 > GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaa > AA > CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc > aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA > caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca > aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA > ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA > GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT > ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > 
CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATAcaa > > But my question is: it does not occur in the codon position (say, the > third codon's position is not a times of 3). Why it effect the result? > > And also there is code to filter out the stop codon in the sample code > (as the following shown) /////////////////////////////// > if( $pseq =~ /\*/ && > $pseq !~ /\*$/ ) { > warn("provided a CDS sequence with a stop codon, PAML will > choke!"); > exit(0); > } > # Tcoffee can't handle '*' even if it is trailing > $pseq =~ s/\*//g; > ///////////////////////////// > > So, when translate back from aa_aln to dna_aln, there should be no > stop codon included. SO, why it can not pass? > > Thanks for answer! > > P.S: attach my code here: > ///////////////////////////////////////////////////////// > #!/usr/bin/perl -w > use strict; > use Bio::Tools::Run::Phylo::PAML::Codeml; > use Bio::Tools::Run::Alignment::Clustalw; > > # for projecting alignments from protein to R/DNA space > use Bio::Align::Utilities qw(aa_to_dna_aln); > # for input of the sequence data > use Bio::SeqIO; > use Bio::AlignIO; > > my $aln_factory = > Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1); > my $seqdata = shift || 'test.fa'; > > my $seqio = new Bio::SeqIO(-file => $seqdata, > -format => 'fasta'); > my %seqs; > my @prots; > # process each sequence > while ( my $seq = $seqio->next_seq ) { > $seqs{$seq->display_id} = $seq; > # translate them into protein > my $protein = $seq->translate(); > my $pseq = $protein->seq(); > if( $pseq =~ /\*/ && > $pseq !~ /\*$/ ) { > warn("provided a CDS sequence with a stop codon, PAML will > choke!"); > exit(0); > } > # Tcoffee can't handle '*' even if it is trailing > $pseq =~ s/\*//g; > > $protein->seq($pseq); > push @prots, $protein; > } > > if( @prots < 2 ) { > warn("Need at least 2 CDS sequences to proceed"); > exit(0); > } > > # open(OUT, ">align_output.txt") || die("cannot open output > align_output for 
writing"); # Align the sequences with clustalw my > $aa_aln = $aln_factory->align(\@prots); # project the protein > alignment back to CDS coordinates my $dna_aln = aa_to_dna_aln($aa_aln, > \%seqs); > > my @each = $dna_aln->each_seq(); > > my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new > ( -params => { 'runmode' => -2, > 'seqtype' => 1, > }, > -save_tempfiles => 1, > -verbose => 1); > > # set the alignment object $kaks_factory->alignment($dna_aln); > > # run the KaKs analysis > my ($rc,$parser) = $kaks_factory->run(); > my $result = $parser->next_result; > my $MLmatrix = $result->get_MLmatrix(); > > my @otus = $result->get_seqs(); > # this gives us a mapping from the PAML order of sequences back to # > the input order (since names get truncated) my @pos = map { > my $c= 1; > foreach my $s ( @each ) { > last if( $s->display_id eq $_->display_id ); > $c++; > } > $c; > } @otus; > > print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID > CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) { > for( my $j = $i+1; $j < (scalar @otus); $j++ ) { > my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); > my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); > print join("\t", $otus[$i]->display_id, > $otus[$j]->display_id,$MLmatrix->[$i]->[$j]- > >{'dN'}, > $MLmatrix->[$i]->[$j]->{'dS'}, > $MLmatrix->[$i]->[$j]->{'omega'}, > sprintf("%.2f",$sub_aa_aln- > >percentage_identity), > sprintf("%.2f",$sub_dna_aln- > >percentage_identity), > ), "\n"; > } > } > From bix at sendu.me.uk Thu Aug 10 15:11:24 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 20:11:24 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: <44DB5923.3010700@sendu.me.uk> Message-ID: <44DB84DC.70705@sendu.me.uk> Chris Fields wrote: > > On Aug 10, 2006, at 11:04 AM, Sendu Bala wrote: > >>> Just curious, but is there a possibility of making "lazy" >>> instantiation of >>> objects like HSP and HIT objects? 
>>> Things like parsing and output could be accomplished without these
>>> objects?
>>
>> That's what I've done actually, which is why performance varies between
>> 5x and 1.5x (lower performance when the instantiation is forced).
>>
>> But, things like 'parsing and output' do need to force the instantiation
>> unless, say, an output module knew about the hash structure of the thing
>> stored inside a Result object. Which is too horrible a situation to
>> comprehend. :O
>>
>> Or is it? What specifically did you have in mind?
>
> The nice thing about SearchIO is the ability to attach a Handler to
> return specific objects. For instance, if you didn't want HSP's then
> they could be 'junked' by using SearchIO::FastResultEventBuilder, which
> just returns hits. I don't know how the other SearchIO modules (hmmer,
> etc) deal with this though, but it works for blast and (I think) blastxml.
>
> You might use this same strategy have the handler return simple hashes
> instead of objects,

Yes, the main change I have made that provides the speed increase is to
make the handler (SearchResultEventBuilder) return hashes instead of
objects.

It's a transparent change when combined with the lazy instantiation.

> Alternatively, create a new SearchIO class (call it fastblast; okay,
> terrible name) that doesn't use a handler and just returns hashes. I
> think Jason pointed out previously that the handler isn't required.

But I didn't see any particular harm in keeping them. Not having a
handler might shave a percent or two off run times, but you need to
balance speed with power and flexibility. I don't know where that
balance lies, hence my question to the community.
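A minimal sketch of the lazy-instantiation idea described in the message above, in plain Perl. The class names LazyResult and LazyHit are hypothetical and are not BioPerl modules; the real SearchIO handlers and Generic* objects are considerably more involved. The point is only that the parser stores plain hashes, and a hash is promoted to an object the first time someone asks for it.

```perl
use strict;
use warnings;

package LazyHit;

my $built = 0;    # counts constructed objects, for demonstration only

sub new {
    my ($class, %args) = @_;
    $built++;
    return bless {%args}, $class;
}
sub name         { $_[0]{name} }
sub significance { $_[0]{significance} }
sub built_count  { $built }

package LazyResult;
use Scalar::Util qw(blessed);

# The parser hands over plain hashes; no Hit objects exist yet.
sub new {
    my ($class, @hit_data) = @_;
    return bless { hits => [@hit_data], next => 0 }, $class;
}

sub num_hits { scalar @{ $_[0]{hits} } }

sub next_hit {
    my $self = shift;
    return if $self->{next} >= @{ $self->{hits} };
    my $i   = $self->{next}++;
    my $hit = $self->{hits}[$i];
    # Promote the stored hash to an object on first access only.
    unless (blessed $hit) {
        $hit = $self->{hits}[$i] = LazyHit->new(%$hit);
    }
    return $hit;
}

package main;

my $result = LazyResult->new(
    { name => 'hitA', significance => 1e-50 },
    { name => 'hitB', significance => 0.002 },
);
print "hits stored: ", $result->num_hits,
      ", objects built: ", LazyHit->built_count, "\n";   # 2 stored, 0 built
my $hit = $result->next_hit;
print $hit->name, ", objects built: ", LazyHit->built_count, "\n";   # hitA, 1 built
```

Code that only counts hits never pays for object construction, while code that calls next_hit sees ordinary objects, which is why the change can be transparent to callers.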
From bix at sendu.me.uk Thu Aug 10 15:23:38 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 20:23:38 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: <44DB4E98.70703@sendu.me.uk> Message-ID: <44DB87BA.7020902@sendu.me.uk> Chris Fields wrote: > I agree with the thought of retaining some degree of OO. I still > wonder how much object instantiation really affects speed vs. all > those method calls. All the SAX-like calls you mean? I'll investigate that with my hmmpfam parser, since I have to come up with a new system anyway. Those methods are certainly the limiting step remaining, but I don't envisage another 5x speed up is possible, so don't get your hopes up! ;) From birney at ebi.ac.uk Thu Aug 10 14:55:46 2006 From: birney at ebi.ac.uk (Ewan Birney) Date: Thu, 10 Aug 2006 19:55:46 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <1322DD89-8E98-4403-8599-D151A079331E@ebi.ac.uk> On 10 Aug 2006, at 18:39, aaron.j.mackey at gsk.com wrote: >> ...Except I need to know if the community considers the speed problem >> solved or not. More radical changes will make SearchIO even >> faster, eg. >> Chris Fields and Jason (if I interpret the Project priority list item >> correctly) have suggested an end to individual Hit and HSP objects, >> which become just data members of a Result-like object. Ideally I >> don't >> want to go down that route because we lose quite a bit of OO power; > > As already mentioned, a lazy-evaluation approach would also work. > > Jason and I did once talk about an entirely new parsing/object- > building > framework, based on nested grammars; in essence, the "top-level" > parser, > simply "chunks" the input into blobs of (minimally parsed) text that > correspond to the top level result object. This chunk/blob is the > input > to the next-level parser for Hits, which in return has chunk for HSPs. > Note that the Result/Hit/HSP "chunks" are "fat", i.e. 
they *are* > the same > Generic*I-implementing objects we're already using. Thus, if HSPs are > never interrogated, they're never parsed; as soon as one is > interrogated, > it gets parsed, and so on. In such an environment, you can imagine > flyweight objects that are built very quickly/easily (recall that many > previous analyses of BioPerl speed problems are not related to > parsing, so > much as heavy-weight object creation). > for people's interest, this is what the SwissKnife package does as well for swissprot (which has a trivially top level chunking strategy) (ewan returns to his hectic life of too many balls in the air :)). From bix at sendu.me.uk Thu Aug 10 15:32:43 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 10 Aug 2006 20:32:43 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DB83B9.6050507@sendu.me.uk> References: <44DB83B9.6050507@sendu.me.uk> Message-ID: <44DB89DB.7050503@sendu.me.uk> Sendu Bala wrote: > aaron.j.mackey at gsk.com wrote: >>> ...Except I need to know if the community considers the speed problem >>> solved or not. More radical changes will make SearchIO even faster, >>> eg. Chris Fields and Jason (if I interpret the Project priority list >>> item correctly) have suggested an end to individual Hit and HSP >>> objects, which become just data members of a Result-like object. >>> Ideally I don't want to go down that route because we lose quite a >>> bit of OO power; >> >> As already mentioned, a lazy-evaluation approach would also work. >> >> Jason and I did once talk about an entirely new >> parsing/object-building framework, based on nested grammars; in >> essence, the "top-level" parser, simply "chunks" the input into blobs >> of (minimally parsed) text that correspond to the top level result >> object. [...] > Or are you suggesting something that would be even better than that? If > so, please elucidate! :) Oh, I guess the difference is the 'minimally parsed' bit, ie. 
the hsp chunks could virtually be raw lines from the input file? I don't think parsing the lines into data stored in hashes is any kind of significant burden, but it is certainly worthy of investigation if we're really really hungry for speed. Remember that anyway, we have to do a significant amount of parsing to discover where the chunks start and end. ... Though, with that approach we might also get a memory saving: assuming we can rely on the input file sticking around, store a pointer to the position and length of each 'chunk' of lines, instead of the line data itself. (I don't think that's a serious suggestion, just throwing ideas out.) From cjfields at uiuc.edu Thu Aug 10 17:04:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 16:04:29 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DB84DC.70705@sendu.me.uk> Message-ID: <000001c6bcc0$92183300$15327e82@pyrimidine> ... > > You might use this same strategy have the handler return simple hashes > > instead of objects, > > Yes, the main change I have made that provides the speed increase is to > make the handler (SearchResultEventBuilder) return hashes instead of > objects. > > It's a transparent change when combined with the lazy instantiation. I agree, and may be the best way to proceed initially. There are other ways to optimize. I personally like Aaron's 'chunk' idea using nested parsers, which should fly; I could envision a way to take advantage of that with Perl6's regex objects. > > Alternatively, create a new SearchIO class (call it fastblast; okay, > > terrible name) that doesn't use a handler and just returns hashes. I > > think Jason pointed out previously that the handler isn't required. > > But I didn't see any particular harm in keeping them. Not having a > handler might shave a percent or two off run times, but you need to > balance speed with power and flexibility. I don't know where that > balance lies, hence my question to the community. 
Depends on the person, hence flexibility is probably the best way to go. I'm like you in that I prefer using the various objects.

The cool thing about SearchIO is you could design a module to your liking. The tools are there (SearchIO module, Generic* Search objects, the handlers), you just have to know how they work together and where to optimize. It's up to the user.

If someone wants a streamlined BLAST parser, they can build a specialized SearchIO module that returns hashes straight out with no handler and no internal caching (my fastblast suggestion). Or use a specialized handler to dole out hashes (your method). Or use full-blown interleaved objects (current implementation).

The learning curve is somewhat high if, like me (the molecular microbiologist), you don't have a strong computer science background. You have to grok how the system works, how the Handler works, the various Search* objects that are returned, how they are implemented, etc. But... The system is flexible if you know how to use it.

Chris

From cjfields at uiuc.edu Thu Aug 10 17:11:00 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 16:11:00 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <1322DD89-8E98-4403-8599-D151A079331E@ebi.ac.uk>
Message-ID: <000101c6bcc1$7ee47810$15327e82@pyrimidine>

And he comes down from the mount and speaks to the masses...then disappears back into the mist...

Kidding aside, this strategy may be something to think about for other parsers in Bioperl (such as SeqIO).
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Ewan Birney > Sent: Thursday, August 10, 2006 1:56 PM > To: aaron.j.mackey at gsk.com > Cc: bioperl-l at lists.open-bio.org; Sendu Bala > Subject: Re: [Bioperl-l] SearchIO speed up > > > On 10 Aug 2006, at 18:39, aaron.j.mackey at gsk.com wrote: > > >> ...Except I need to know if the community considers the speed problem > >> solved or not. More radical changes will make SearchIO even > >> faster, eg. > >> Chris Fields and Jason (if I interpret the Project priority list item > >> correctly) have suggested an end to individual Hit and HSP objects, > >> which become just data members of a Result-like object. Ideally I > >> don't > >> want to go down that route because we lose quite a bit of OO power; > > > > As already mentioned, a lazy-evaluation approach would also work. > > > > Jason and I did once talk about an entirely new parsing/object- > > building > > framework, based on nested grammars; in essence, the "top-level" > > parser, > > simply "chunks" the input into blobs of (minimally parsed) text that > > correspond to the top level result object. This chunk/blob is the > > input > > to the next-level parser for Hits, which in return has chunk for HSPs. > > Note that the Result/Hit/HSP "chunks" are "fat", i.e. they *are* > > the same > > Generic*I-implementing objects we're already using. Thus, if HSPs are > > never interrogated, they're never parsed; as soon as one is > > interrogated, > > it gets parsed, and so on. In such an environment, you can imagine > > flyweight objects that are built very quickly/easily (recall that many > > previous analyses of BioPerl speed problems are not related to > > parsing, so > > much as heavy-weight object creation). 
> > > > for people's interest, this is what the SwissKnife package does as well > for swissprot (which has a trivially top level chunking strategy) > > (ewan returns to his hectic life of too many balls in the air :)). > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Aug 10 18:06:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 17:06:18 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DB89DB.7050503@sendu.me.uk> Message-ID: <000401c6bcc9$34ba70c0$15327e82@pyrimidine> ... > > Or are you suggesting something that would be even better than that? If > > so, please elucidate! :) > > Oh, I guess the difference is the 'minimally parsed' bit, ie. the hsp > chunks could virtually be raw lines from the input file? I don't think > parsing the lines into data stored in hashes is any kind of significant > burden, but it is certainly worthy of investigation if we're really > really hungry for speed. Remember that anyway, we have to do a > significant amount of parsing to discover where the chunks start and end. You would just look for start/end 'elements' for Result/Hit/HSPs. Though SearchIO does this, I probably wouldn't use exactly the same approach. This is my take on it. I could be completely off here... Carve out each result chunk, parse out the data for the ResultI, then carve out the hits into 'chunks' based on start/end events only (minimal parsing). These are passed onto the next parser, which processes the chunk for HitI, then carves out HSP chunks based on start/end events. This is passed on to a third parser for grabbing HSPI data. Sound about right? > ... Though, with that approach we might also get a memory saving: > assuming we can rely on the input file sticking around, store a pointer > to the position and length of each 'chunk' of lines, instead of the line > data itself. 
>
> (I don't think that's a serious suggestion, just throwing ideas out.)

That's what this forum is for.

Chris

From birney at ebi.ac.uk Thu Aug 10 18:19:06 2006
From: birney at ebi.ac.uk (Ewan Birney)
Date: Thu, 10 Aug 2006 23:19:06 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <000101c6bcc1$7ee47810$15327e82@pyrimidine>
References: <000101c6bcc1$7ee47810$15327e82@pyrimidine>
Message-ID:

On 10 Aug 2006, at 22:11, Chris Fields wrote:

> And he comes down from the mount and speaks to the masses...then
> disappears
> back into the mist...
>

I'm not sure if this makes me some sort of gorilla or ... someone who is starting a new cult. Or both (?). One day I will return ;) I keep wanting to write a bioperl-ensembl bridge so Ensembl can appear transparently as a Bio::DB::RandomAccessI etc for all the datasets it has internally...

One day, one day...

(back to my hectic life...)

From aaron.j.mackey at gsk.com Thu Aug 10 16:43:52 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Thu, 10 Aug 2006 16:43:52 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DB83B9.6050507@sendu.me.uk>
Message-ID:

> As I understand your description, this is exactly what I do. My 'chunks'
> are the hashes that are normally used to create a new Hit/HSP object.
>
> The initial parse of the data file results in a small number of objects
> (Results) that contain all the data: HSP data nested in Hit data nested
> in the Result objects. When you actually want to do something with a
> certain hit or HSP it becomes an object, allowing you to call its
> methods like normal.
>
> Or are you suggesting something that would be even better than that? If
> so, please elucidate! :)

So the only laziness you invoke is the object instantiation (but you've already done all the parsing).
My proposal involves the "chunks" being unparsed, raw text "blobs", that are essentially blessed into a package that does the parsing only when necessary (and even then, might choose different parsing strategies, based on what's been asked for). Thus a potentially large amount of parsing and storage is skipped. Additionally, you now have the option of not even storing the blobs in memory, just file seek pointers (requiring temp. storage for streaming pipe data sources), and thus can process very large reports without consuming memory (currently a problem).

Just to reiterate, here's some "user level" code with comments describing what's happened behind the scenes:

  use Bio::SearchIO;

  my $io = Bio::SearchIO->new(-format => "blast", -file => "myresult.blast");

  # when next_result is called, $io has to do the top-level parse
  # to figure out the start/stop of the next result
  while (my $result = $io->next_result()) {
    # $result is now a "blessed" blob

    my $query = $result->query(); # blob got (minimally) lazily parsed
                                  # to extract the requested bit, nothing more

    # first time next_hit is called, $result has to do the next-level parse
    # to figure out the start(s)/stop(s) of each hit; for BLAST reports, this
    # info is in two places, the hit table and the alignment info
    while (my $hit = $result->next_hit()) {
      # etc.
    }
  }

note also that the current "push" event model can still work with this architecture, but a "pull" model would speed up initial access even more (preventing the need to parse/store the entire enumeration of blobs to get the first/next), and lower the memory footprint even further.

-Aaron

From bix at sendu.me.uk Thu Aug 10 18:28:49 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 10 Aug 2006 23:28:49 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
References:
Message-ID: <44DBB321.3090701@sendu.me.uk>

aaron.j.mackey at gsk.com wrote:
>> As I understand your description, this is exactly what I do.
My 'chunks' >> are the hashes that are normally used to create a new Hit/HSP object. >> >> The initial parse of the data file results in a small number of objects >> (Results) that contain all the data: HSP data nested in Hit data nested >> in the Result objects. When you actually want to do something with a >> certain hit or HSP it becomes an object, allowing you to call its >> methods like normal. >> >> Or are you suggesting something that would be even better than that? If >> so, please elucidate! :) > > So the only lazyness you invoke is the object instantiation (but you've > already done all the parsing). > > My proposal involves the "chunks" being unparsed, raw text "blobs", that > are essentially blessed into a package that does the parsing only when > necessary (and even then, might choose different parsing strategies, based > on what's been asked for). Thus a potentially large amount of parsing and > storage is skipped. Additionally, you now have the option of not even > storing the blobs in memory, just file seek pointers (requiring temp. > storage for streaming pipe data sources), and thus can process very large > reports without consuming memory (currently a problem). Thanks, I might try out something along those lines. The problem I see is with piped input; I wouldn't want to require temp. storage because the user may deliberately be trying to gain speed by doing as little disc io as possible. Then you'd have to special-case it; pointers if we have a file on disc, stored-in-memory if piped. Maybe that special-case wouldn't be so bad. 
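
The special case discussed in this thread (seek pointers when the report is a file on disc, in-memory blobs when it is piped) could be sketched roughly as follows. This is only an illustration under stated assumptions: the RESULT/HIT report format is a toy stand-in for a real BLAST report, and `LazyChunk` and `chunk_results` are hypothetical names that do not exist in BioPerl. An in-memory filehandle stands in for a pipe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

{
    # Hypothetical class, not part of BioPerl: one "blob" of report text that
    # is read and parsed only when an accessor is first called.
    package LazyChunk;

    # Holds either text => $string (in-memory blob, e.g. for piped input)
    # or fh/offset/len (a seek pointer into a file on disc).
    sub new {
        my ($class, %args) = @_;
        return bless { parse_count => 0, %args }, $class;
    }

    # Return the raw text, seeking into the file if only a pointer is held.
    sub _raw {
        my $self = shift;
        return $self->{text} if defined $self->{text};
        seek($self->{fh}, $self->{offset}, 0) or die "seek failed: $!";
        my $got = read($self->{fh}, my $buf, $self->{len});
        die "short read" unless defined $got && $got == $self->{len};
        return $buf;
    }

    # Parse on first use and cache the result.
    sub _parse {
        my $self = shift;
        return $self->{parsed} if $self->{parsed};
        $self->{parse_count}++;
        my %data = (hits => []);
        for my $line (split /\n/, $self->_raw) {
            if ($line =~ /^RESULT\s+(\S+)/) {
                $data{query} = $1;
            }
            elsif ($line =~ /^HIT\s+(\S+)\s+score=(\d+)/) {
                push @{ $data{hits} }, { name => $1, score => $2 };
            }
        }
        return $self->{parsed} = \%data;
    }

    sub query       { $_[0]->_parse->{query} }
    sub hits        { @{ $_[0]->_parse->{hits} } }
    sub parse_count { $_[0]{parse_count} }
}

# Top-level pass: only finds where each RESULT starts and stops.
# $on_disc defaults to a -f test on the handle; pass 0 to force in-memory blobs.
sub chunk_results {
    my ($fh, $on_disc) = @_;
    $on_disc = (-f $fh) ? 1 : 0 unless defined $on_disc;
    my (@chunks, $start, $text);
    my $flush = sub {
        my ($end) = @_;
        return unless defined $start;
        push @chunks, $on_disc
            ? LazyChunk->new(fh => $fh, offset => $start, len => $end - $start)
            : LazyChunk->new(text => $text);
    };
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        if ($line =~ /^RESULT\b/) {
            $flush->($pos);              # close the previous chunk, if any
            ($start, $text) = ($pos, '');
        }
        $text .= $line if defined $start && !$on_disc;
    }
    $flush->(tell($fh));
    return @chunks;
}

my $report = join '', map { "$_\n" }
    'RESULT q1', 'HIT a score=10', 'HIT b score=20',
    'RESULT q2', 'HIT c score=5';

# Case 1: report is a file on disc, so chunks are just offset/length pairs.
my ($tfh, $tname) = tempfile(UNLINK => 1);
print {$tfh} $report;
close $tfh or die "close failed: $!";
open my $disc_fh, '<', $tname or die "open failed: $!";
my @from_disc = chunk_results($disc_fh);

# Case 2: simulated pipe (in-memory handle), so chunks keep their raw text.
open my $mem_fh, '<', \$report or die "open failed: $!";
my @from_pipe = chunk_results($mem_fh, 0);

# Only the first result is interrogated here; the second is never parsed.
printf "%s: %d hit(s)\n", $from_disc[0]->query, scalar($from_disc[0]->hits);
printf "%s: %d hit(s)\n", $from_pipe[0]->query, scalar($from_pipe[0]->hits);
```

The design choice is the one argued over above: the disc-backed chunk costs two integers per result until it is used, while the piped variant trades memory for avoiding temp files. A real implementation would still have to deal with report formats whose per-hit data is spread across several sections.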
From bix at sendu.me.uk Thu Aug 10 18:30:53 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 10 Aug 2006 23:30:53 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
References: <000101c6bcc1$7ee47810$15327e82@pyrimidine>
Message-ID: <44DBB39D.40605@sendu.me.uk>

Ewan Birney wrote:
>
> On 10 Aug 2006, at 22:11, Chris Fields wrote:
>
>> And he comes down from the mount and speaks to the masses...then
>> disappears back into the mist...
>>
>
> I'm not sure if this makes me some sort gorilla or ... someone who is
> starting a new cult. Or both (?). One day I will return ;) I keep wanting
> to write a bioperl-ensembl bridge so Ensembl can appear transparently
> as a Bio::DB::RandomAccessI etc for all the datasets it has internally...
> One day, one day...

Soon I hope! :)

From cjfields at uiuc.edu Thu Aug 10 18:36:05 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 17:36:05 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
Message-ID: <000001c6bccd$61b5a690$15327e82@pyrimidine>

> On 10 Aug 2006, at 22:11, Chris Fields wrote:
>
> > And he comes down from the mount and speaks to the masses...then
> > disappears
> > back into the mist...
> >
>
> I'm not sure if this makes me some sort gorilla or ... someone who is
> starting a new cult. Or both (?). One day I will return ;) I keep
> wanting
> to write a bioperl-ensembl bridge so Ensembl can appear transparently
> as a Bio::DB::RandomAccessI etc for all the datasets it has
> internally...
>
>
> One day, one day...

Sorry, got my 'Gorillas in the Mist' mixed up with 'Passion of the Christ.' Long day...

I agree with Sendu on the Ensembl bridge (though I'll try to refrain from salivating). One day...

Chris

From osborne1 at optonline.net Thu Aug 10 18:40:23 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Thu, 10 Aug 2006 18:40:23 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1> Message-ID: Xianjun and Ryan, translate() can use any of NCBI's codon tables: http://www.bioperl.org/wiki/Bptutorial.pl#III.3.2_Translating Brian O. On 8/10/06 2:53 PM, "Ryan Golhar" wrote: > 1. The Bio::Seq::translate function (to my knowledge) only uses the > generic codon table. From cjfields at uiuc.edu Thu Aug 10 18:51:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 10 Aug 2006 17:51:33 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: Message-ID: <000201c6bccf$8941f090$15327e82@pyrimidine> ... > So the only lazyness you invoke is the object instantiation (but you've > already done all the parsing). > > My proposal involves the "chunks" being unparsed, raw text "blobs", that > are essentially blessed into a package that does the parsing only when > necessary (and even then, might choose different parsing strategies, based > on what's been asked for). Thus a potentially large amount of parsing and > storage is skipped. Additionally, you now have the option of not even > storing the blobs in memory, just file seek pointers (requiring temp. > storage for streaming pipe data sources), and thus can process very large > reports without consuming memory (currently a problem). 
> > Just to reiterate, here's some "user level" code with comments describing > what's happened behind the scenes: > > use Bio::SearchIO; > > my $io = Bio::SearchIO->new(-format => "blast", -file => > "myresult.blast"); > > # when next_result is called, $io has to do the top-level parse > # to figure out the start/stop of the next result > while (my $result = $io->next_result()) { > # $result is now a "blessed" blob > > my $query = $result->query(); # blob got (minimally) lazily parsed > # to extract the requested bit, nothing > more > > # first time next_hit is called, $result has to do the next-level parse > # to figure out the start(s)/stop(s) of each hit; for BLAST reports, > this > # info is in two places, the hit table and the alignment info > while (my $hit = $result->next_hit()) { > # etc. > } > } > > note also that the current "push" event model can still work with this > architecture, but a "pull" model would speed up initial access even more > (preventing the need to parse/store the entire enumeration of blobs to get > the first/next), and lower the memory footprint even further. > > -Aaron Using file pointers is a great touch. Sendu has a slight aversion to temp files but he has already indicated other ways around this. Would be nice to see this to fruition. Okay, really have to get back to work! 
Chris From torsten.seemann at infotech.monash.edu.au Thu Aug 10 19:47:27 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Fri, 11 Aug 2006 09:47:27 +1000 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <1322DD89-8E98-4403-8599-D151A079331E@ebi.ac.uk> References: <1322DD89-8E98-4403-8599-D151A079331E@ebi.ac.uk> Message-ID: <44DBC58F.3040302@infotech.monash.edu.au> > for people's interest, this is what the SwissKnife package does as well > for swissprot (which has a trivially top level chunking strategy) For those interested, here are references for SwissKnife: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=retrieve&db=pubmed&list_uids=10498781&dopt=Abstract http://bioinformatics.oxfordjournals.org/cgi/content/abstract/15/9/771 -- Torsten Seemann Victorian Bioinformatics Consortium, Monash University, Australia From torsten.seemann at infotech.monash.edu.au Thu Aug 10 19:45:33 2006 From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann) Date: Fri, 11 Aug 2006 09:45:33 +1000 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44DBC51D.9010004@infotech.monash.edu.au> > So the only lazyness you invoke is the object instantiation (but you've > already done all the parsing). > > My proposal involves the "chunks" being unparsed, raw text "blobs", that > are essentially blessed into a package that does the parsing only when > necessary (and even then, might choose different parsing strategies, based > on what's been asked for). Thus a potentially large amount of parsing and > storage is skipped. Additionally, you now have the option of not even > storing the blobs in memory, just file seek pointers (requiring temp. > storage for streaming pipe data sources), and thus can process very large > reports without consuming memory (currently a problem). This approach is an excellent one, but not all file formats lend themselves to it. 
BLAST results have a semantically hierarchical layout, and the BLAST XML report syntax matches that layout, so the approach is well suited. Traditional BLAST reports are pretty similar too, i.e. most of the data for a low-level object is encapsulated within a certain part of the input file.

However, this may not be true for other formats, perhaps HMMER reports, where "HSP"-related info may be spread across multiple sections of the file.

But of course, this doesn't prevent us using the approach where suitable, and using the "slow" method otherwise.

--
Torsten Seemann
Victorian Bioinformatics Consortium, Monash University, Australia

From cjfields at uiuc.edu Thu Aug 10 21:32:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 10 Aug 2006 20:32:01 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44DBC51D.9010004@infotech.monash.edu.au>
Message-ID: <000001c6bce5$f2212930$15327e82@pyrimidine>

I took a quick gander at the SwissKnife code; very nice, but quite long:

http://swissknife.sourceforge.net/docs/

Perl 6 uses parsing expression grammars and rules, so you could build up your own custom grammars for parsing files. That would come in very handy here. I don't know how much of this is implemented or available in Pugs, but I may give it a try sometime.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Torsten Seemann
> Sent: Thursday, August 10, 2006 6:46 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] SearchIO speed up
>
> > So the only lazyness you invoke is the object instantiation (but you've
> > already done all the parsing).
> >
> > My proposal involves the "chunks" being unparsed, raw text "blobs", that
> > are essentially blessed into a package that does the parsing only when
> > necessary (and even then, might choose different parsing strategies,
> based
> > on what's been asked for).
Thus a potentially large amount of parsing > and > > storage is skipped. Additionally, you now have the option of not even > > storing the blobs in memory, just file seek pointers (requiring temp. > > storage for streaming pipe data sources), and thus can process very > large > > reports without consuming memory (currently a problem). > > This approach is an excellent one, but not all file formats lend > themselves to > it. BLAST results have a semantically hierarchial layout, and the BLAST > XML > report syntax matches that layout, so the approach is well suited. > Traditional > BLAST reports are pretty similar too. ie. most of the data for a low-level > object is encapsulated within a certain part of the input file. > > However, this may not be true for other formats, perhaps HMMER reports, > where > "HSP"-related info may be spread across multiple sections of the file. > > But of course, this doesn't prevent us using the approach where suitable, > and > using the "slow" method otherwise. > > -- > Torsten Seemann > Victorian Bioinformatics Consortium, Monash University, Australia > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From avilella at gmail.com Fri Aug 11 03:18:29 2006 From: avilella at gmail.com (Albert Vilella) Date: Fri, 11 Aug 2006 08:18:29 +0100 Subject: [Bioperl-l] mumsa in bioperl-run Message-ID: <1155280709.6590.7.camel@localhost> Hi all, Time permits, I intend to write a wrapper for MUMSA in bioperl-run. Timo Lassmann and Erik L. L. Sonnhammer (2005) Automatic assessment of alignment quality. Nucleic Acids Research Vol.33(22) pp.7120-7128 MUMSA compares multiple sequence alignments, so my idea would be to put the module in Bio/Tools/Run/Alignment/. http://bugzilla.open-bio.org/show_bug.cgi?id=2070 What you reckon? Albert. 
From akarger at CGR.Harvard.edu Fri Aug 11 09:06:06 2006 From: akarger at CGR.Harvard.edu (Amir Karger) Date: Fri, 11 Aug 2006 09:06:06 -0400 Subject: [Bioperl-l] SearchIO speed up Message-ID: Let me add my voice to the adulation here. IMO, the two main reasons Bioperl hasn't achieved world domination are (a) it's so huge that it's hard to find what you want, which the HOWTOs help with, and (b) it's so darn slow. Speedup is most definitely a Good Thing, and I'm sure that the vast majority of BLAST hits are ignored in the vast majority of cases, where you're just looking for hits where some criterion meets a certain threshold or something. It's unlikely that people want the full alignment for all 100k or whatever hits. (This is why I just use blast -m8: no parser required, and all you lose is the alignment.) Anyway, in your spare time, maybe you do similar speedups for other pieces of Bioperl? My personal favorite would be the GenBank/EMBL parsers. The fungal genome ORF files I'm working with are only 20M or so, but using Bioperl to work with them takes so much longer than with non-Bioperl on the 6M FASTA files for other genomes. I have to imagine it's mostly creating objects for the gazillion tags, 90% of which I never peek at. I know, you folks are busy, and I should be volunteering to do it myself. But you can at least consider it a user request. - Amir Karger Research Computing Bauer Center for Genomics Research Harvard University > -----Original Message----- > From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com] > Sent: Thursday, August 10, 2006 1:40 PM > To: Sendu Bala > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] SearchIO speed up > > > ...Except I need to know if the community considers the > speed problem > > solved or not. More radical changes will make SearchIO even > faster, eg. 
> > Chris Fields and Jason (if I interpret the Project priority > list item > > correctly) have suggested an end to individual Hit and HSP objects, > > which become just data members of a Result-like object. > Ideally I don't > > want to go down that route because we lose quite a bit of OO power; > > As already mentioned, a lazy-evaluation approach would also work. > > Jason and I did once talk about an entirely new > parsing/object-building > framework, based on nested grammars; in essence, the > "top-level" parser, > simply "chunks" the input into blobs of (minimally parsed) text that > correspond to the top level result object. This chunk/blob > is the input > to the next-level parser for Hits, which in return has chunk > for HSPs. > Note that the Result/Hit/HSP "chunks" are "fat", i.e. they > *are* the same > Generic*I-implementing objects we're already using. Thus, if > HSPs are > never interrogated, they're never parsed; as soon as one is > interrogated, > it gets parsed, and so on. In such an environment, you can imagine > flyweight objects that are built very quickly/easily (recall > that many > previous analyses of BioPerl speed problems are not related > to parsing, so > much as heavy-weight object creation). > > I happen to have such a nested parser lying around for > Bio::SearchIO::fasta.pm, but it also uses an Inline::C, > yacc-generated C > parser backend (yet another experiment in trying to get > SearchIO to run > faster), so really isn't ready for prime time (being entirely > untested, > and probably not even finished). > > -Aaron > > > From cjfields at uiuc.edu Fri Aug 11 09:48:59 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 11 Aug 2006 08:48:59 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: Message-ID: <001b01c6bd4c$e8887920$15327e82@pyrimidine> > Anyway, in your spare time, maybe you do similar speedups for other > pieces of Bioperl? My personal favorite would be the GenBank/EMBL > parsers. 
The fungal genome ORF files I'm working with are only 20M or
> so, but using Bioperl to work with them takes so much longer than with
> non-Bioperl on the 6M FASTA files for other genomes. I have to imagine
> it's mostly creating objects for the gazillion tags, 90% of which I
> never peek at.

I agree completely. Swissknife (lazy parsing of Swiss-Prot) was mentioned here yesterday. We could use something similar for GenBank/EMBL. The code for Swissknife was quite extensive but, really, so is SeqIO::genbank! I also wanted to see how much using bioperl's _readline() method slows things down (my guess is not too dramatically, but for 20 MB files it may be a problem).

> I know, you folks are busy, and I should be volunteering to do it
> myself. But you can at least consider it a user request.

We can't promise anything! If you want, add a bit to the Bioperl release page:

http://www.bioperl.org/wiki/Bioperl_Release

I would hold that request off until post-1.6. Lots of other priorities popping up.

Chris

> - Amir Karger
> Research Computing
> Bauer Center for Genomics Research
> Harvard University
...

From torsten.seemann at infotech.monash.edu.au Thu Aug 10 18:11:15 2006
From: torsten.seemann at infotech.monash.edu.au (Torsten Seemann)
Date: Fri, 11 Aug 2006 08:11:15 +1000
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1>
References: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <44DBAF03.5010707@infotech.monash.edu.au>

> 1. The Bio::Seq::translate function (to my knowledge) only uses the
> generic codon table. So, you will need to translate the DNA sequence
> using some other method.

Actually, the translate() method has a (-codontable_id => ??)
parameter: http://doc.bioperl.org/bioperl-live/Bio/PrimarySeqI.html#POD10 Here are the table IDs: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c -- Torsten Seemann Victorian Bioinformatics Consortium, Monash University, Australia http://www.vicbioinformatics.com/ From cjfields at uiuc.edu Fri Aug 11 11:42:04 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 11 Aug 2006 10:42:04 -0500 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: <44DBAF03.5010707@infotech.monash.edu.au> Message-ID: <001c01c6bd5c$b27dd630$15327e82@pyrimidine> Just so everyone knows, (and it's probably unrelated to all this), but several enhancement requests were added recently to Bio::Tools::Phylo::PAML in CVS. The notes for these are in Bugzilla and have to do with parsing models in codeml output: http://bugzilla.open-bio.org/show_bug.cgi?id=1883 http://bugzilla.open-bio.org/show_bug.cgi?id=2054 http://bugzilla.open-bio.org/show_bug.cgi?id=2055 Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Torsten Seemann > Sent: Thursday, August 10, 2006 5:11 PM > To: golharam at umdnj.edu > Cc: bioperl-l at lists.open-bio.org; 'Xianjun Dong' > Subject: Re: [Bioperl-l] PAML + Codeml problem.. > > > 1. The Bio::Seq::translate function (to my knowledge) only uses the > > generic codon table. So, you will need to translate the DNA sequence > > using some other method. > > Actually, the translate() method has a (-codontable_id => ??) 
parameter: > > http://doc.bioperl.org/bioperl-live/Bio/PrimarySeqI.html#POD10 > > Here are the table IDs: > > http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c > > -- > Torsten Seemann > Victorian Bioinformatics Consortium, Monash University, Australia > http://www.vicbioinformatics.com/ > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Fri Aug 11 12:33:52 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 11 Aug 2006 11:33:52 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DBB321.3090701@sendu.me.uk> Message-ID: <002f01c6bd63$f146aa70$15327e82@pyrimidine> Sendu, If we go the route of flexibility (so one could use full-blown objects, hashes, lazy parsing, etc.), maybe we should initially have custom Result*, Hit*, HSP* Bio::Search objects returned via the Handler initially. This would allow you to commit everything and get people testing it on various OS's. You could also develop a custom handler but that isn't absolutely necessary (see below). The various Handlers apparently are set up for allowing one to create a custom Factory for each Search object type (such as BLAST*). These are added to the Handler upon instantiation or by using register_factory(). The modified Handler can then be added using SearchIO's attach_EventHandler(). 
So I guess one could do something like this:

  use Bio::SearchIO;
  use Bio::Factory::ObjectFactory;
  use Bio::SearchIO::SearchResultEventBuilder;

  # one factory per Search object type; the Lazy* classes are the
  # proposed (not yet written) lazy implementations
  my $resfac = Bio::Factory::ObjectFactory->new(
      -type      => 'Bio::Search::Result::LazyResult',
      -interface => 'Bio::Search::Result::ResultI');

  my $hitfac = Bio::Factory::ObjectFactory->new(
      -type      => 'Bio::Search::Hit::LazyHit',
      -interface => 'Bio::Search::Hit::HitI');

  my $hspfac = Bio::Factory::ObjectFactory->new(
      -type      => 'Bio::Search::HSP::LazyHSP',
      -interface => 'Bio::Search::HSP::HSPI');

  # hand the factories to the handler on instantiation
  my $handler = Bio::SearchIO::SearchResultEventBuilder->new(
      -result_factory => $resfac,
      -hit_factory    => $hitfac,
      -hsp_factory    => $hspfac);

  # 'lazyblast' is likewise a hypothetical format name
  my $parser = Bio::SearchIO->new(-file => $file, -format => 'lazyblast');

  $parser->attach_EventHandler($handler);

  # proceed with parsing...

Of course I haven't tried this out... ;>

Would be nice to add a parameter that allows one to add a modified handler upon SearchIO object instantiation. Oh well...

Most users neither know about nor use the various handlers or the Search objects, which is a shame. Maybe the HOWTO needs to be written more explicitly?

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Sendu Bala
> Sent: Thursday, August 10, 2006 5:29 PM
> To: bioperl-l at bioperl.org
> Subject: Re: [Bioperl-l] SearchIO speed up
>
> aaron.j.mackey at gsk.com wrote:
> >> As I understand your description, this is exactly what I do. My
> 'chunks'
> >> are the hashes that are normally used to create a new Hit/HSP object.
> >>
> >> The initial parse of the data file results in a small number of objects
> >> (Results) that contain all the data: HSP data nested in Hit data nested
> >> in the Result objects. When you actually want to do something with a
> >> certain hit or HSP it becomes an object, allowing you to call its
> >> methods like normal.
> >> > >> Or are you suggesting something that would be even better than that? If > >> so, please elucidate! :) > > > > So the only lazyness you invoke is the object instantiation (but you've > > already done all the parsing). > > > > My proposal involves the "chunks" being unparsed, raw text "blobs", that > > are essentially blessed into a package that does the parsing only when > > necessary (and even then, might choose different parsing strategies, > based > > on what's been asked for). Thus a potentially large amount of parsing > and > > storage is skipped. Additionally, you now have the option of not even > > storing the blobs in memory, just file seek pointers (requiring temp. > > storage for streaming pipe data sources), and thus can process very > large > > reports without consuming memory (currently a problem). > > Thanks, I might try out something along those lines. The problem I see > is with piped input; I wouldn't want to require temp. storage because > the user may deliberately be trying to gain speed by doing as little > disc io as possible. Then you'd have to special-case it; pointers if we > have a file on disc, stored-in-memory if piped. Maybe that special-case > wouldn't be so bad. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From muratem at eng.uah.edu Fri Aug 11 12:10:30 2006 From: muratem at eng.uah.edu (Mike Muratet) Date: Fri, 11 Aug 2006 11:10:30 -0500 (CDT) Subject: [Bioperl-l] load_seqdatabase fails when loading refseq plant files Message-ID: Hello all I am using biosql-schema/bioperl-db to load Refseq entries into a biosql database. I don't see any version info in the files, but I downloaded everything in the last month or so and everything passed all the tests when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, DBD-mysql-3.006. 
I was loading plant files from Refseq rel 18:

  load_seqdatabase.pl --dbname biosql --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz

and it crashed after about 30K of 60K records:

at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl line 633

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (01-JUL-2004) National Center for Biotechnology Information, National Institutes of Health, Bethesda 20894, United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs ()
Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
---------------------------------------------------
Could not store XM_472403:
------------- EXCEPTION -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216
t

I traced the error back through the source and database and found that XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, but only the last one crashed the script (in spite of --safe).

Should there be more info included in the CRC field? I am weak when it comes to RDBMSs, but looking at the schema, I would guess that the CRC field was added to make an otherwise degenerate key unique.
Would it help to add more fields to the CRC, or another key? The former might be done without having to change a lot of code.

Thanks

Mike

From mblanche at berkeley.edu Fri Aug 11 14:30:54 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Fri, 11 Aug 2006 11:30:54 -0700
Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
Message-ID:

Dear all,

I used to use this very simple script to extract the gene sequence as a fasta flat file from a Bio::DB::GFF database containing the GadFly 4.3 annotations:

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Bio::DB::GFF;
  use Bio::SeqIO;

  my $out = Bio::SeqIO->new( -fh     => \*STDOUT,
                             -format => 'fasta');

  my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
                              -dsn     => 'dbi:mysql:database=dmel_43_LS');

  while (<>){
      chomp;
      my @feat = $db->get_feature_by_name($_);
      $out->write_seq($_) for @feat;
  }

Somehow I now get the following output instead of the actual sequences:

  >FBgn0024988 gene:.(FBgn0024988)
  Bio::PrimarySeq=HASH(0x19fd3d8)
  >FBgn0041184 gene:.(FBgn0041184)
  Bio::PrimarySeq=HASH(0x19fa684)
  >FBgn0033636 gene:.(FBgn0033636)
  Bio::PrimarySeq=HASH(0x19e1908)

What changed, and what would be the right way to get what I want?

Many thanks

Marco
______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--

From angel at mail.med.upenn.edu Fri Aug 11 14:57:35 2006
From: angel at mail.med.upenn.edu (Angel Pizarro)
Date: Fri, 11 Aug 2006 14:57:35 -0400
Subject: [Bioperl-l] [BioSQL-l] load_seqdatabase fails when loading refseq plant files
In-Reply-To:
References:
Message-ID: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>

Glad I am not the only one that ran into this problem! Mike, I had reported this issue a few emails back and have provided the list with an example file for testing, so it should be resolved soon.
FYI, you are correct that the CRC is computed on load to determine whether two publication references are in fact the same. This is a feature to save database space. The expected behaviour is that subsequent entries with the same reference CRC get an FK to the originating reference entry, rather than inserting a duplicate row into the reference table.

FYI #2, the --safe option explicitly states that it will continue to process records after errors BUT do a roll-back at the end of the run. This is to gather all of your errors in one shot, as opposed to fixing a record, restarting, hitting the next error, fixing that, etc.

If you are impatient and do not care about references, you have three choices:

1) drop the unique constraint on reference.crc (this will cause dups in reference, and you cannot go back to a unique CRC without some major SQL data migration routine to fix FKs and delete the dups).
2) filter your records to not contain reference information
3) alter load_seqdatabase to not enter reference information. This would be done on the Bio::Annotation::Collection object:

   $seq->annotation()->remove_Annotations('reference');

   The above command, inserted someplace in the script around line ~575, should do the trick. Obviously this means that no reference information is loaded into the DB at all.

-angel

On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote:
> Hello all
>
> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql
> database. I don't see any version info in the files, but I downloaded
> everything in the last month or so and everything passed all the tests
> when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1,
> DBD-mysql-3.006.
I was loading plant file from Refseq rel 18: > > load_seqdatabase.pl --dbname biosql > --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz > > and it crashed after about 30K of 60K records: > > at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl > line 633 > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values > were ("","Direct Submission","Submitted (01-JUL-2004) National Center for > Biotechnology Information, National Institutes of Health, Bethesda 20894, > United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs > () > Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3 > --------------------------------------------------- > Could not store XM_472403: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or to be > found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254 > STACK Bio::DB::Persistent::PersistentObject::store > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272 > STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216 > t > > I traced the error back through the source and database and found that > XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, > but only the last one crashed the script (in spite of --safe). > > Should there be more info included in the CRC field? 
I am weak when > it comes to RDBMs, but looking at the schema, I would guess that the CRC field > was added to make an otherwise degenerate key unique. Would it help to add > more fields to the CRC, or another key? The former might be done without > have to change a lot of code. > > Thanks > > Mike > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From cjfields at uiuc.edu Fri Aug 11 15:27:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 11 Aug 2006 14:27:21 -0500 Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF In-Reply-To: Message-ID: <000301c6bd7c$2a84f9f0$15327e82@pyrimidine> Marco, I guess Bio::DB::GFF::Feature objects are PrimarySeqI? Looking at SeqIO::fasta write_seq(), the descriptor is based on the preferred_id_type and the sequence is obtained from the seq() method. Does 'ref($_->seq())' give you Bio::PrimarySeq? If it does, maybe you should be using '$out->write_seq($_->seq())' instead, though I don't know if that will work either. 
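The check Chris describes can be tried out without a database. The snippet below is only a minimal sketch, not BioPerl code: `MockSeq` and `MockFeature` are hypothetical stand-ins for a sequence object and for a feature whose seq() returns an object rather than a string, and `writable_seq` is an assumed helper name. It unwraps seq() when the result is a reference, which is the object one would then hand to write_seq().

```perl
use strict;
use warnings;

# Hypothetical stand-ins (NOT BioPerl classes): MockSeq plays the role of a
# sequence object, MockFeature a feature whose seq() returns such an object.
package MockSeq;
sub new { my ($class, $str) = @_; return bless { str => $str }, $class }
sub seq { return $_[0]{str} }    # returns the raw sequence string

package MockFeature;
sub new { my ($class, $seq_obj) = @_; return bless { seq => $seq_obj }, $class }
sub seq { return $_[0]{seq} }    # returns an object, not a string

package main;

# Assumed helper: if seq() yields a reference (an object), pass that object
# on; otherwise seq() already returned a plain string and the original
# object can be used as it is.
sub writable_seq {
    my ($obj) = @_;
    my $inner = $obj->seq;
    return ref($inner) ? $inner : $obj;
}

my $feature = MockFeature->new( MockSeq->new('ACGTACGT') );
my $target  = writable_seq($feature);

print ref( $feature->seq ), "\n";    # class of what seq() returned
print $target->seq, "\n";            # the sequence string itself
```

With real objects the same test would be `ref($_->seq())` on each feature, as suggested above.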
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Marco Blanchette > Sent: Friday, August 11, 2006 1:31 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF > > Dear all, > > I used to use this very simple script to extract the gene sequence as a > fasta flat file from a Bio::DB::GFF database containing the GadFly 4.3 > annotations > > #!/usr/bin/perl > > use strict; > use warnings; > use Bio::DB::GFF; > use Bio::SeqIO; > > my $out = Bio::SeqIO->new( -fh => \*STDOUT, > -format => 'fasta'); > > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => 'dbi:mysql:database=dmel_43_LS'); > > while (<>){ > chomp; > my @feat = $db->get_feature_by_name($_); > $out->write_seq($_) for @feat; > } > > Somehow I now get the following output instead of the actual sequences: > >FBgn0024988 gene:.(FBgn0024988) > Bio::PrimarySeq=HASH(0x19fd3d8) > >FBgn0041184 gene:.(FBgn0041184) > Bio::PrimarySeq=HASH(0x19fa684) > >FBgn0033636 gene:.(FBgn0033636) > Bio::PrimarySeq=HASH(0x19e1908) > > What change and what would be the right way to get what I want? > > Many thanks > > Marco > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. 
Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
>
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
> --
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From mblanche at berkeley.edu Fri Aug 11 17:36:27 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Fri, 11 Aug 2006 14:36:27 -0700
Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
In-Reply-To: <1155323155.2616.65.camel@localhost.localdomain>
Message-ID:

Many thanks Scott,

At the same time I got your email I was coming to the same conclusion as you.

Now I have a stranger problem on my hands... My goal is quite simple: I am trying to get the sequences of the genes back from the Bio::DB::GFF database loaded on MySQL. The gene list comes from a file with one gene id per line. When I run the following script:

use Bio::DB::GFF;
use Bio::SeqIO;
my $out = Bio::SeqIO->new( -fh     => \*STDOUT,
                           -format => 'fasta');
my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
                            -dsn     => 'dbi:mysql:database=dmel_43_new');

while (<>){
    chomp;
    my $id = $_;
    my @feats = $db->get_feature_by_name($id);
    for my $f (@feats){
        $out->write_seq( $f->seq ) if $f->type =~/gene/;
    }
}

I get more sequences back than the number of genes in my input file; I double-checked that. Some of the duplicated entries are the same, some are not! I also double-checked that none of the duplicated entries in the output has a duplicated "gene" entry in the original GFF files that I just loaded into the MySQL database using bp_bulk_load_gff.pl.

Does anyone have an idea as to what is going on with my script?

Marco

On 8/11/06 12:05, "Scott Cain" wrote:

> Hi Marco,
>
> What you are getting from get_feature_by_name is a list of
> Bio::DB::GFF::Feature objects, which are Bio::SeqFeatureI objects. What
> you need are Bio::PrimarySeq objects.
Fortunately, > Bio::DB::GFF::Feature has a method to get a PrimarySeq out of it; that > method is called seq. > > So, you should be able to change your line to > > $out->write_seq( $_->seq() ) for @feat; > > and it should work. Of course, I haven't test that to make sure that it > does :-) > > Scott > > > On Fri, 2006-08-11 at 11:30 -0700, Marco Blanchette wrote: >> Dear all, >> >> I used to use this very simple script to extract the gene sequence as a >> fasta flat file from a Bio::DB::GFF database containing the GadFly 4.3 >> annotations >> >> #!/usr/bin/perl >> >> use strict; >> use warnings; >> use Bio::DB::GFF; >> use Bio::SeqIO; >> >> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >> -format => 'fasta'); >> >> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >> -dsn => 'dbi:mysql:database=dmel_43_LS'); >> >> while (<>){ >> chomp; >> my @feat = $db->get_feature_by_name($_); >> $out->write_seq($_) for @feat; >> } >> >> Somehow I now get the following output instead of the actual sequences: >>> FBgn0024988 gene:.(FBgn0024988) >> Bio::PrimarySeq=HASH(0x19fd3d8) >>> FBgn0041184 gene:.(FBgn0041184) >> Bio::PrimarySeq=HASH(0x19fa684) >>> FBgn0033636 gene:.(FBgn0033636) >> Bio::PrimarySeq=HASH(0x19e1908) >> >> What change and what would be the right way to get what I want? >> >> Many thanks >> >> Marco >> ______________________________ >> Marco Blanchette, Ph.D. >> >> mblanche at uclink.berkeley.edu >> >> Donald C. Rio's lab >> Department of Molecular and Cell Biology >> 16 Barker Hall >> University of California >> Berkeley, CA 94720-3204 >> >> Tel: (510) 642-1084 >> Cell: (510) 847-0996 >> Fax: (510) 642-6062 ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. 
Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--

From cjfields at uiuc.edu Fri Aug 11 18:19:09 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 11 Aug 2006 17:19:09 -0500
Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
In-Reply-To:
Message-ID: <000001c6bd94$2d872980$15327e82@pyrimidine>

Marco,

Do you mean you get duplicates of sequences back, or that you get more than one chunk of the same sequence back?

Is it possible that each query using an ID could return more than one feature? That might explain it (you could check by testing the size of the array @feats).

I'm not sure how split locations are handled within Bio::DB::GFF, but do the specific features have split locations?

Chris

> Many thanks Scott,
>
> At the same time I got your email I was coming to the same conclusion as
> you.
>
> Now I have a stranger problem in my hands... My goal is quite simple, I
> try
> to get the sequence of the genes back from the Bio::DB::GFF database
> loaded
> on MySQL. The gene list is from a file with one gene id per per line. When
> I
> run the following script:
>
>
>
> use Bio::DB::GFF;
> use Bio::SeqIO;
> my $out = Bio::SeqIO->new( -fh => \*STDOUT,
> -format => 'fasta');
> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> -dsn => 'dbi:mysql:database=dmel_43_new');
>
> while (<>){
> chomp;
> my $id = $_;
> my @feats = $db->get_feature_by_name($id);
> for my $f (@feats){
> $out->write_seq( $f->seq ) if $f->type =~/gene/;
> }
> }
>
>
> I get more sequence back than the number of gene in my input file. I
> double
> check there. Some of the duplicated entries are the same, some are not!

...
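The suggestion to test the size of @feats can be taken one step further by tallying what each lookup returns. The following is a minimal sketch under stated assumptions: `tally_types` is a hypothetical helper name, and the features are plain hash references with a `type` key standing in for real feature objects (with real features one would read `$f->type`, as in the script quoted above). The example data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many features of each type one lookup
# returned. Features are plain hash refs here, not real feature objects.
sub tally_types {
    my (@feats) = @_;
    my %count;
    $count{ $_->{type} }++ for @feats;
    return \%count;
}

# Made-up result of a single lookup, for illustration only.
my @feats = (
    { type => 'gene' },
    { type => 'point_mutation' },
    { type => 'gene' },
);

my $count = tally_types(@feats);
printf "%d features returned\n", scalar @feats;
printf "%-16s %d\n", $_, $count->{$_} for sort keys %$count;
```

An ID whose tally shows more than one entry of the same type would be a candidate for the duplicated output.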
From osborne1 at optonline.net Fri Aug 11 19:39:35 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 11 Aug 2006 19:39:35 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
Message-ID:

Amir,

The ability to customize your Sequence objects when parsing Genbank files is already available:

http://www.bioperl.org/wiki/HOWTO:Feature-Annotation#Customizing_Sequence_Object_Construction

Not available for the 'embl' format, however.

Brian O.

On 8/11/06 9:06 AM, "Amir Karger" wrote:

> Let me add my voice to the adulation here. IMO, the two main reasons
> Bioperl hasn't achieved world domination are (a) it's so huge that it's
> hard to find what you want, which the HOWTOs help with, and (b) it's so
> darn slow. Speedup is most definitely a Good Thing, and I'm sure that
> the vast majority of BLAST hits are ignored in the vast majority of
> cases, where you're just looking for hits where some criterion meets a
> certain threshold or something. It's unlikely that people want the full
> alignment for all 100k or whatever hits. (This is why I just use blast
> -m8: no parser required, and all you lose is the alignment.)
>
> Anyway, in your spare time, maybe you do similar speedups for other
> pieces of Bioperl? My personal favorite would be the GenBank/EMBL
> parsers. The fungal genome ORF files I'm working with are only 20M or
> so, but using Bioperl to work with them takes so much longer than with
> non-Bioperl on the 6M FASTA files for other genomes. I have to imagine
> it's mostly creating objects for the gazillion tags, 90% of which I
> never peek at.
>
> I know, you folks are busy, and I should be volunteering to do it
> myself. But you can at least consider it a user request.
> > - Amir Karger > Research Computing > Bauer Center for Genomics Research > Harvard University > >> -----Original Message----- >> From: aaron.j.mackey at gsk.com [mailto:aaron.j.mackey at gsk.com] >> Sent: Thursday, August 10, 2006 1:40 PM >> To: Sendu Bala >> Cc: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] SearchIO speed up >> >>> ...Except I need to know if the community considers the >> speed problem >>> solved or not. More radical changes will make SearchIO even >> faster, eg. >>> Chris Fields and Jason (if I interpret the Project priority >> list item >>> correctly) have suggested an end to individual Hit and HSP objects, >>> which become just data members of a Result-like object. >> Ideally I don't >>> want to go down that route because we lose quite a bit of OO power; >> >> As already mentioned, a lazy-evaluation approach would also work. >> >> Jason and I did once talk about an entirely new >> parsing/object-building >> framework, based on nested grammars; in essence, the >> "top-level" parser, >> simply "chunks" the input into blobs of (minimally parsed) text that >> correspond to the top level result object. This chunk/blob >> is the input >> to the next-level parser for Hits, which in return has chunk >> for HSPs. >> Note that the Result/Hit/HSP "chunks" are "fat", i.e. they >> *are* the same >> Generic*I-implementing objects we're already using. Thus, if >> HSPs are >> never interrogated, they're never parsed; as soon as one is >> interrogated, >> it gets parsed, and so on. In such an environment, you can imagine >> flyweight objects that are built very quickly/easily (recall >> that many >> previous analyses of BioPerl speed problems are not related >> to parsing, so >> much as heavy-weight object creation). 
>>
>> I happen to have such a nested parser lying around for
>> Bio::SearchIO::fasta.pm, but it also uses an Inline::C,
>> yacc-generated C
>> parser backend (yet another experiment in trying to get
>> SearchIO to run
>> faster), so really isn't ready for prime time (being entirely
>> untested,
>> and probably not even finished).
>>
>> -Aaron
>>
>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From mblanche at berkeley.edu Fri Aug 11 19:59:26 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Fri, 11 Aug 2006 16:59:26 -0700
Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF
In-Reply-To: <000001c6bd94$2d872980$15327e82@pyrimidine>
Message-ID:

Chris,

> Do you mean you get duplicates of sequences back, or that you get more than
> one chunk of the same sequence back?

I sometimes get duplicated sequences and sometimes overlapping regions (see below).

>
> Is it possible that each query using an ID could contain more than one
> feature? That might explain it (you could check by testing the size of the
> array @feats).

Most ids return more than one feature, of various types (point_mutation, insertion_site, processed_transcript, etc.). That's why I restrict the output to type "gene" using the regexp /gene/ on $f->type.

>
> I'm not sure how split locations are handled within Bio:DB::GFF, but do the
> specific features have split locations?
>
> Chris
>

Not sure what you mean exactly but have a look at the following script, it gives the location and the group id of the feature being reported:

use Bio::DB::GFF;

my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
                            -dsn     => 'dbi:mysql:database=dmel_43_new');

my %dups;
while (<>){
    chomp;
    my $id = $_;
    my @feat = $db->get_feature_by_name($id);
    for my $f (@feat){
        if (exists $dups{$f->group} && $f->type =~/gene/){
            print "Calling >>>", $f->group, "\n";
            print "Chr: ", $f->refseq, " Strand: ", $f->strand, " Start: ", $f->start, " End: ", $f->end, "\n";
            print "Offending >>>", $dups{$f->group}->group, "\n";
            print "Chr: ", $dups{$f->group}->refseq, " Strand: ", $dups{$f->group}->strand, " Start: ", $dups{$f->group}->start, " End: ", $dups{$f->group}->end;
            print "\n\n";
        } else {
            $dups{$f->group} = $f;
        }
    }
}

Here is the output:

Calling >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22201102 End: 22207587
Offending >>>FBgn0004179
Chr: 3L Strand: 1 Start: 22200575 End: 22200575

Calling >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614
Offending >>>FBgn0025681
Chr: 2L Strand: 1 Start: 2992964 End: 2998614

Calling >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413
Offending >>>FBgn0025803
Chr: 3R Strand: 1 Start: 16966463 End: 17038413

Calling >>>FBgn0000117
Chr: X Strand: -1 Start: 1756796 End: 1747557
Offending >>>FBgn0000117
Chr: X Strand: -1 Start: 1757776 End: 1747182

Calling >>>FBgn0005427
Chr: X Strand: -1 Start: 136456 End: 125343
Offending >>>FBgn0005427
Chr: X Strand: -1 Start: 133199 End: 124949

Calling >>>FBgn0000042
Chr: X Strand: 1 Start: 5746100 End: 5750026
Offending >>>FBgn0000042
Chr: X Strand: 1 Start: 5746096 End: 5746106

Calling >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19443485 End: 19434556
Offending >>>FBgn0004551
Chr: 2R Strand: -1 Start: 19445155 End: 19429977

Do you have any suggestions?? Is the procedure I am using to retrieve the genes right?
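One way to narrow this down, offered as a sketch under assumptions rather than a confirmed fix, is to match the feature type exactly and to keep only one feature per group and location. Two things in the script above are worth noting: the pattern /gene/ also matches longer names such as "pseudogene", and the else branch stores features of every type in %dups, so an "Offending" entry need not be a gene. In the code below, `unique_genes` is a hypothetical helper, the features are plain hash references standing in for feature objects, the coordinates are copied from the output above, and the `method` values are assumed for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep features whose method is exactly 'gene', and only
# the first one seen for each (group, refseq, start, end) combination.
sub unique_genes {
    my (@feats) = @_;
    my %seen;
    return grep {
        $_->{method} eq 'gene'
            && !$seen{ join '|', @{$_}{qw(group refseq start end)} }++
    } @feats;
}

# Plain hashes standing in for feature objects. Coordinates come from the
# output above; the 'method' values are assumptions, not taken from the data.
my @feats = (
    { group => 'FBgn0025681', method => 'gene',           refseq => '2L', start => 2992964,  end => 2998614 },
    { group => 'FBgn0025681', method => 'gene',           refseq => '2L', start => 2992964,  end => 2998614 },
    { group => 'FBgn0004179', method => 'point_mutation', refseq => '3L', start => 22200575, end => 22200575 },
    { group => 'FBgn0004179', method => 'gene',           refseq => '3L', start => 22201102, end => 22207587 },
);

for my $f ( unique_genes(@feats) ) {
    print join( "\t", @{$f}{qw(group refseq start end)} ), "\n";
}
```

With real feature objects the hash lookups would be replaced by the accessors already used in the script (group, refseq, start, end) plus an exact comparison on the feature type; which accessor gives the bare type name depends on the BioPerl version in use.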
Many thanks Marco >> Many thanks Scott, >> >> At the same time I got your email I was coming to the same conclusion as >> you. >> >> Now I have a stranger problem in my hands... My goal is quite simple, I >> try >> to get the sequence of the genes back from the Bio::DB::GFF database >> loaded >> on MySQL. The gene list is from a file with one gene id per per line. When >> I >> run the following script: >> >> >> >> use Bio::DB::GFF; >> use Bio::SeqIO; >> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >> -format => 'fasta'); >> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >> -dsn => 'dbi:mysql:database=dmel_43_new'); >> >> while (<>){ >> chomp; >> my $id = $_; >> my @feats = $db->get_feature_by_name($id); >> for my $f (@feats){ >> $out->write_seq( $f->seq ) if $f->type =~/gene/; >> } >> } >> >> >> I get more sequence back than the number of gene in my input file. I >> double >> check there. Some of the duplicated entries are the same, some are not! > > > ... > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. 
Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From bix at sendu.me.uk Sat Aug 12 08:06:32 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 12 Aug 2006 13:06:32 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44D4BC52.30203@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> Message-ID: <44DDC448.7050607@sendu.me.uk> Sendu Bala wrote: > After the initial round of changes to Taxonomy described at > http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed), > further changes will allow for the transition of Bio::Species to > Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be fully > usable without external database access. > > In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon > implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove all > Bio::Species-related-backward-compatible methods from Bio::Taxon, create > Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al. These changes have now been committed. From hlapp at gmx.net Sat Aug 12 09:55:55 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 12 Aug 2006 09:55:55 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DDC448.7050607@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> Message-ID: Just to confirm - you have posted your summary email enumerating the changes on the wiki? Also, did you add a (brief!) list of the changes to the Changes file? Thanks for the work. 
-hilmar On Aug 12, 2006, at 8:06 AM, Sendu Bala wrote: > Sendu Bala wrote: >> After the initial round of changes to Taxonomy described at >> http://bugzilla.open-bio.org/show_bug.cgi?id=2047 (now committed), >> further changes will allow for the transition of Bio::Species to >> Bio::Taxonomy::Node (renamed to Bio::Taxon), and for Taxon to be >> fully >> usable without external database access. >> >> In brief: rename Bio::Taxonomy::Node to Bio::Taxon, make Bio::Taxon >> implement Bio::Tree::NodeI, make Bio::Species a Bio::Taxon, remove >> all >> Bio::Species-related-backward-compatible methods from Bio::Taxon, >> create >> Bio::DB::Taxonomy::list, update Bio::SeqIO::genbank et al. > > These changes have now been committed. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Sat Aug 12 10:07:58 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 12 Aug 2006 15:07:58 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> Message-ID: <44DDE0BE.2050907@sendu.me.uk> Hilmar Lapp wrote: > Just to confirm - you have posted your summary email enumerating the > changes on the wiki? I added a small note to: http://www.bioperl.org/wiki/Module:Bio::Species Do you think more is warranted? Should I make a dedicated page somewhere regarding this? What should the page be called, where should it go? > Also, did you add a (brief!) list of the changes to > the Changes file? No, will do. Where would I put the changes? Add them into the 1.5.1 section, or does the 1.5.1 section contain only the changes that were present at the time of its first release? 
Should I make a 1.5.5 section instead? And should I move the Main trunk points to the new 1.5.5 section as well? From hlapp at gmx.net Sat Aug 12 10:30:18 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 12 Aug 2006 10:30:18 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DDE0BE.2050907@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> Message-ID: On Aug 12, 2006, at 10:07 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> Just to confirm - you have posted your summary email enumerating the >> changes on the wiki? > > I added a small note to: > http://www.bioperl.org/wiki/Module:Bio::Species > > Do you think more is warranted? The thing that should be somewhere is an authoritative list of changes and current behavior. It looks like the bug documentation has that in one post but then others (and quite lengthy ones) follow, so it's not clear what still holds and nobody should be required to read through the discussion (BTW Chris/Sendu: don't discuss such things on bugzilla - it's by all means not meant for that purpose). So - maybe just add a page to the wiki? It really doesn't matter that much how it's called unless somebody chimes in here; the page can be renamed and the content be migrated if deemed necessary later. > Should I make a dedicated page somewhere > regarding this? What should the page be called, where should it go? > > >> Also, did you add a (brief!) list of the changes to >> the Changes file? > > No, will do. Where would I put the changes? Add them into the 1.5.1 > section, or does the 1.5.1 section contain only the changes that were > present at the time of its first release? Should I make a 1.5.5 > section > instead? And should I move the Main trunk points to the new 1.5.5 > section as well? I'm still confused as to why we are jumping from 1.5.1 to 1.5.5. 
Also, I'm confused as to why some of the pre-1.5.1 changes are under Main Trunk, and not under 1.5.1. So I guess I'm the wrong person to answer ... -hilmar > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Sat Aug 12 10:33:28 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 12 Aug 2006 15:33:28 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> Message-ID: <44DDE6B8.70809@sendu.me.uk> Hilmar Lapp wrote: > > On Aug 12, 2006, at 10:07 AM, Sendu Bala wrote: > >> No, will do. Where would I put the changes? Add them into the 1.5.1 >> section, or does the 1.5.1 section contain only the changes that were >> present at the time of its first release? Should I make a 1.5.5 section >> instead? And should I move the Main trunk points to the new 1.5.5 >> section as well? > > I'm still confused as to why we are jumping from 1.5.1 to 1.5.5. Also, > I'm confused as to why some of the pre-1.5.1 changes are under Main > Trunk, and not under 1.5.1. So I guess I'm the wrong person to answer ... Well, Chris seemed to like 1.5.5, but 1.5.2 makes more sense to me. Shall we make it 1.5.2 Chris? From cjfields at uiuc.edu Sat Aug 12 10:46:22 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 09:46:22 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DDE0BE.2050907@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> Message-ID: <2110B7B0-03F0-415E-9490-8163739A8FDF@uiuc.edu> Sendu, > Do you think more is warranted? 
Should I make a dedicated page > somewhere > regarding this? What should the page be called, where should it go? You could add a page detailing the changeover using the notes you have from Bugzilla. You have already made an announcement on the mail list, so that's taken care of. The only thing that is left is possibly announcing it on the weblog: http://bioperl.org/news/ If you can't post, let me or Hilmar know and we'll post for you. >> Also, did you add a (brief!) list of the changes to >> the Changes file? > > No, will do. Where would I put the changes? Add them into the 1.5.1 > section, or does the 1.5.1 section contain only the changes that were > present at the time of its first release? Should I make a 1.5.5 > section > instead? And should I move the Main trunk points to the new 1.5.5 > section as well? I think the sections represent accumulated changes for each release (changes between releases), so you could accumulate the changes made since '1.5.1' in a '1.5.5' section. Great work Sendu! Amazing amount of work! Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Sat Aug 12 14:16:38 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 13:16:38 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DDE6B8.70809@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <44DDE6B8.70809@sendu.me.uk> Message-ID: <5B4497F5-4D73-4F59-9577-097B7F68803F@uiuc.edu> On Aug 12, 2006, at 9:33 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Aug 12, 2006, at 10:07 AM, Sendu Bala wrote: >> >>> No, will do. Where would I put the changes? Add them into the 1.5.1 >>> section, or does the 1.5.1 section contain only the changes that >>> were >>> present at the time of its first release? Should I make a 1.5.5 >>> section >>> instead? 
And should I move the Main trunk points to the new 1.5.5 >>> section as well? >> >> I'm still confused as to why we are jumping from 1.5.1 to 1.5.5. >> Also, >> I'm confused as to why some of the pre-1.5.1 changes are under Main >> Trunk, and not under 1.5.1. So I guess I'm the wrong person to >> answer ... > > Well, Chris seemed to like 1.5.5, but 1.5.2 makes more sense to me. > Shall we make it 1.5.2 Chris?' I believe '1.5.5' originally came from Brian as suggestion for an intermediate developer release prior to 1.6, that's why I brought it up (I thought it was already decided upon). This came up sometime in the spring when Jason was prepping for his defense and we were thinking about future Bioperl releases. We could easily change that to 1.5.2, 1.5.3 etc. (stick with point releases), and have a few more point releases prior to 1.6. I have no problem with that; makes more sense to me. These are developer point releases anyway. Release 1.5 had bugs; v 1.5.1 fixed those. 1.5.2 and beyond can address bugs that pop up but also introduce new modules, new functionality (UCSC), etc, all the time working on the project priority list, tests, POD, etc. Once we reach a particular point, we need to work towards making the next stable release; i.e. stabilizing the API, completing other unfinished business, and focus less on introducing new stuff. At least that's the way I have understood the process from other projects. Sound about right Hilmar? From the FAQ: "Developer releases are odd numbered releases (1.3, 1.5, etc) not intended to be completely stable (although all tests should pass). Stable releases are even numbered (1.0, 1.2, 1.6) and intended to provide a stable API so that modules will continue to respect the API throught a stable release series. We cannot guarantee that APIs are stable between releases (i.e. 1.6 may not be completely compatible with scripts written for 1.4), but we endeavor to keep the API stable so that upgrading is easy." 
Hilmar is also right in suggesting there is a problem with making commits to Main w/o also including the tagged branches. I am as guilty of this as everyone else, but I think much of this stems from the lack of a new 1.5.* branch to commit to. This was a problem that Fernan Aguero pointed out which has effectively 'hobbled' much code in Bioperl 1.4, but we're now beyond the point of updating that now; v 1.4 was released Dec. 2003 and way too much has been added since then. Fernan's suggestion was to have someone (not the Release pumpkin) maintain regular point releases for stable versions and suggested himself for the job. These would be primarily to fix bugs (no API changes allowed). I think it's a great idea; it frees up the developers to think about the future by plotting a course for the next developer release and the following stable release. Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Sat Aug 12 14:36:57 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 13:36:57 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> Message-ID: <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> ... >> Do you think more is warranted? > > The thing that should be somewhere is an authoritative list of > changes and current behavior. It looks like the bug documentation has > that in one post but then others (and quite lengthy ones) follow, so > it's not clear what still holds and nobody should be required to read > through the discussion (BTW Chris/Sendu: don't discuss such things on > bugzilla - it's by all means not meant for that purpose). > > So - maybe just add a page to the wiki? 
It really doesn't matter that > much how it's called unless somebody chimes in here; the page can be > renamed and the content be migrated if deemed necessary later. > ... Hilmar, Agreed! Using Bugzilla was my suggestion based on its use for suggested code enhancements. This was also suggested to me by someone else, though I can't remember now who that was. It's clogging up the works for Bugzilla and detracts from its primary function, which should be bug reports. I think the mailing list and wiki are the best places to document upcoming changes (these could be linked to from the project priority list and the Bioperl release page). Make an announcement on the mailing list/new page, have a back-and-forth on the specifics, refine the wiki page as we go along. It works for the Bioperl Release page. An advantage of using the wiki is that it holds the recent history of the page edits. Sendu, as Hilmar suggests, try keeping it succinct. The wiki allows you to mark up and organize information very efficiently using bullets and links. If anyone has a question, point to the wiki page. Instead of talking about the tons of changes in Taxonomy in bioperl-live/CHANGES, point to the wiki page (though you should provide a few sentences summarizing what was done in CHANGES). Also, if these changes impact the other documentation or scripts (HOWTO, FAQ, tutorial, etc) make sure to modify those accordingly. It would be nice, for instance, to have a demonstration script or similar working code outlining what your changes accomplish. Proper working code speaks volumes, impresses your friends, gets you dates... well, maybe not the latter ;> Christopher Fields Postdoctoral Researcher Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sat Aug 12 15:34:05 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 12 Aug 2006 15:34:05 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> Message-ID: <756E540A-19B3-4C2D-AF21-1E405A7F0A5C@gmx.net> On Aug 12, 2006, at 2:36 PM, Chris Fields wrote: > you should provide a few sentences summarizing what was done in > CHANGES). This file should only state the effect of what was changed (i.e., bug fix, speed-up, API addition, API change, behavior change) with a short explanation where warranted (i.e., if API or behavior changed). > [...] It would be nice, for instance, to have a demonstration > script or similar working code outlining what your changes accomplish. Ideally this would be in the synopsis of the module, and I believe it is actually. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sat Aug 12 16:28:19 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 15:28:19 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <756E540A-19B3-4C2D-AF21-1E405A7F0A5C@gmx.net> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <756E540A-19B3-4C2D-AF21-1E405A7F0A5C@gmx.net> Message-ID: <8500DEB2-4490-40FA-833F-3698063A2F97@uiuc.edu> > This file should only state the effect of what was changed (i.e., bug > fix, speed-up, API addition, API change, behavior change) with a > short explanation where warranted (i.e., if API or behavior changed). ... 
> Ideally this would be in the synopsis of the module, and I believe it > is actually. > > -hilmar Sounds good enough to me! I was originally thinking of something to place in scripts/ or examples/ (or even something in the HOWTO for Trees), but that's not necessary. Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Sat Aug 12 17:32:23 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 12 Aug 2006 22:32:23 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> Message-ID: <44DE48E7.6040302@sendu.me.uk> Chris Fields wrote: > ... >>> Do you think more is warranted? >> >> The thing that should be somewhere is an authoritative list of >> changes and current behavior. It looks like the bug documentation has >> that in one post but then others (and quite lengthy ones) follow, so >> it's not clear what still holds and nobody should be required to read >> through the discussion (BTW Chris/Sendu: don't discuss such things on >> bugzilla - it's by all means not meant for that purpose). >> >> So - maybe just add a page to the wiki? It really doesn't matter that >> much how it's called unless somebody chimes in here; the page can be >> renamed and the content be migrated if deemed necessary later. > > Agreed! Using Bugzilla was my suggestion based on its use for suggested > code enhancements. This was also suggested to me by someone else, > though I can remember now who that was. It's clogging up the works for > Bugzilla and detracts from its primary function, which should be > primarily for bug reports. I think the issue was the discussion, not the initial 'bug report'. Bug 2061 is a combined bug report and enhancement 'request'. 
The wiki doesn't provide a stable, unique-id, uneditable page that I can refer to when I make a large set of related commits. And still no one answers my question where it would be appropriate to make this kind of page on the wiki. 'Anywhere' doesn't help me much ;) From hlapp at gmx.net Sat Aug 12 18:01:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 12 Aug 2006 18:01:54 -0400 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DE48E7.6040302@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <44DE48E7.6040302@sendu.me.uk> Message-ID: On Aug 12, 2006, at 5:32 PM, Sendu Bala wrote: > And still no one answers my question where it would be appropriate to > make this kind of page on the wiki. 'Anywhere' doesn't help me much ;) If no-one suggests a name it most likely means no-one feels less clueless than you feel so you might as well just go ahead. This is a wiki. If you feel the insuppressible urge to name it 'Anywhere' then do so :-) If subsequently somebody wants to rename the page it's pretty easy to do that. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sat Aug 12 18:12:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 17:12:36 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DE48E7.6040302@sendu.me.uk> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <44DE48E7.6040302@sendu.me.uk> Message-ID: <563814DA-5A73-43B3-BB86-858A54BE42BE@uiuc.edu> Sendu, I mentioned making a link from the Bioperl Release or Project Priority List pages. 
Just create a link to a nonexistent page, click on it, then add what you need. Optionally, start a Code Improvements page with sections on what's actively being worked on by the developers. I think we need to start a full page devoted to developers on the Bioperl wiki. There is lots of information for developers but it is a bit scattered. The only thing is a small section on the main page and a few links in the sidebar which doesn't really detail everything we have. Really a minor complaint, though. Chris On Aug 12, 2006, at 4:32 PM, Sendu Bala wrote: > > I think the issue was the discussion, not the initial 'bug report'. > Bug > 2061 is a combined bug report and enhancement 'request'. The wiki > doesn't provide a stable, unique-id, uneditable page that I can > refer to > when I make a large set of related commits. > > And still no one answers my question where it would be appropriate to > make this kind of page on the wiki. 'Anywhere' doesn't help me much ;) Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From arareko at campus.iztacala.unam.mx Sat Aug 12 18:35:33 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Sat, 12 Aug 2006 17:35:33 -0500 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <563814DA-5A73-43B3-BB86-858A54BE42BE@uiuc.edu> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <44DE48E7.6040302@sendu.me.uk> <563814DA-5A73-43B3-BB86-858A54BE42BE@uiuc.edu> Message-ID: <44DE57B5.5090808@campus.iztacala.unam.mx> Everything is supposed to be gathered in the "Developer Resources" category: http://bioperl.org/wiki/Category:Developer_resources If you feel like some existent (or new) pages should be included there, just add the appropriate tag at the bottom or top of the desired page(s): [[Category:Developer resources]] Mauricio. Chris Fields wrote: > Sendu, > > I mentioned making a link from the Bioperl Release or Project > Priority List pages. Just create a link to a nonexistent page, click > on it, then add what you need. > > Optionally, start a Code Improvements page with sections on what's > actively being worked on by the developers. > > I think we need to start a full page devoted to developers on the > Bioperl wiki. There is lots of information for developers but it is > a bit scattered. The only thing is a small section on the main page > and a few links in the sidebar which doesn't really detail everything > we have. Really a minor complaint, though. > > Chris > > On Aug 12, 2006, at 4:32 PM, Sendu Bala wrote: > >> I think the issue was the discussion, not the initial 'bug report'. >> Bug >> 2061 is a combined bug report and enhancement 'request'. The wiki >> doesn't provide a stable, unique-id, uneditable page that I can >> refer to >> when I make a large set of related commits. 
>> >> And still no one answers my question where it would be appropriate to >> make this kind of page on the wiki. 'Anywhere' doesn't help me much ;) > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Sat Aug 12 19:25:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 18:25:41 -0500 Subject: [Bioperl-l] Wiki Stuff, was: Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <44DE57B5.5090808@campus.iztacala.unam.mx> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <44DE48E7.6040302@sendu.me.uk> <563814DA-5A73-43B3-BB86-858A54BE42BE@uiuc.edu> <44DE57B5.5090808@campus.iztacala.unam.mx> Message-ID: Mauricio, Will make sure this is added as I go along. Almost forgot there was a category marker for developers! I'll go ahead and remove the developer section from the main page, but I'll leave a noticeable link to the new developer page. It'll reduce the size of the main page and make it easier to add dev-specific information. Chris P.S. Noticed Jason lurking about on the wiki. Hello Dr. Stajich if you're out there!
On Aug 12, 2006, at 5:35 PM, Mauricio Herrera Cuadra wrote: > Everything is supposed to be gathered in the "Developer Resources" > category: > > http://bioperl.org/wiki/Category:Developer_resources > > If you feel like some existent (or new) pages should be included > there, just add the appropriate tag at the bottom or top of the > desired page(s): > > [[Category:Developer resources]] > > Mauricio. > > Chris Fields wrote: >> Sendu, >> I mentioned making a link from the Bioperl Release or Project >> Priority List pages. Just create a link to a nonexistent page, >> click on it, then add what you need. >> Optionally, start a Code Improvements page with sections on >> what's actively being worked on by the developers. >> I think we need to start a full page devoted to developers on the >> Bioperl wiki. There is lots of information for developers but it >> is a bit scattered. The only thing is a small section on the >> main page and a few links in the sidebar which doesn't really >> detail everything we have. Really a minor complaint, though. >> Chris >> On Aug 12, 2006, at 4:32 PM, Sendu Bala wrote: >>> I think the issue was the discussion, not the initial 'bug >>> report'. Bug >>> 2061 is a combined bug report and enhancement 'request'. The wiki >>> doesn't provide a stable, unique-id, uneditable page that I can >>> refer to >>> when I make a large set of related commits. >>> >>> And still no one answers my question where it would be >>> appropriate to >>> make this kind of page on the wiki. 'Anywhere' doesn't help me >>> much ;) >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. 
Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > MAURICIO HERRERA CUADRA > arareko at campus.iztacala.unam.mx > Laboratorio de Genética > Unidad de Morfofisiología y Función > Facultad de Estudios Superiores Iztacala, UNAM > Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cain at cshl.edu Fri Aug 11 15:05:54 2006 From: cain at cshl.edu (Scott Cain) Date: Fri, 11 Aug 2006 15:05:54 -0400 Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF In-Reply-To: References: Message-ID: <1155323155.2616.65.camel@localhost.localdomain> Hi Marco, What you are getting from get_feature_by_name is a list of Bio::DB::GFF::Feature objects, which are Bio::SeqFeatureI objects. What you need are Bio::PrimarySeq objects. Fortunately, Bio::DB::GFF::Feature has a method to get a PrimarySeq out of it; that method is called seq. So, you should be able to change your line to $out->write_seq( $_->seq() ) for @feat; and it should work.
Of course, I haven't tested that to make sure that it does :-) Scott On Fri, 2006-08-11 at 11:30 -0700, Marco Blanchette wrote: > Dear all, > > I used to use this very simple script to extract the gene sequence as a > fasta flat file from a Bio::DB::GFF database containing the GadFly 4.3 > annotations > > #!/usr/bin/perl > > use strict; > use warnings; > use Bio::DB::GFF; > use Bio::SeqIO; > > my $out = Bio::SeqIO->new( -fh => \*STDOUT, > -format => 'fasta'); > > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => 'dbi:mysql:database=dmel_43_LS'); > > while (<>){ > chomp; > my @feat = $db->get_feature_by_name($_); > $out->write_seq($_) for @feat; > } > > Somehow I now get the following output instead of the actual sequences: > >FBgn0024988 gene:.(FBgn0024988) > Bio::PrimarySeq=HASH(0x19fd3d8) > >FBgn0041184 gene:.(FBgn0041184) > Bio::PrimarySeq=HASH(0x19fa684) > >FBgn0033636 gene:.(FBgn0033636) > Bio::PrimarySeq=HASH(0x19e1908) > > What changed and what would be the right way to get what I want? > > Many thanks > > Marco > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain at cshl.edu GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory
From cjfields at uiuc.edu Sun Aug 13 20:45:02 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 13 Aug 2006 19:45:02 -0500 Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF In-Reply-To: References: Message-ID: Marco, Did you figure out what the problem was? I was curious; the issue you were having was rather odd. I wanted to see if it was an issue with the GFF data or with the database itself. Chris On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: > Chris, > >> Do you mean you get duplicates of sequences back, or that you get >> more than >> one chunk of the same sequence back? > > I sometimes get duplicated sequences and sometimes overlapping > regions (see > below) > >> >> Is it possible that each query using an ID could contain more than >> one >> feature? That might explain it (you could check by testing the >> size of the >> array @feats). > Most IDs return more than one feature, of various types > ( point_mutation, > insertion_site, processed_transcript, etc...). That's why I > restrict the > output to type "gene" using regexp /gene/ on $f->type. > >> >> I'm not sure how split locations are handled within Bio::DB::GFF, >> but do the >> specific features have split locations?
>> >> Chris >> > Not sure what you mean exactly but have a look at the following > script, it > gives the location and the group id of the feature being reported: > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => > 'dbi:mysql:database=dmel_43_new'); > my %dups; > while (<>){ > chomp; > my $id = $_; > my @feat = $db->get_feature_by_name($id); > > for my $f (@feat){ > if (exists $dups{$f->group} && $f->type =~/gene/){ > print "Calling >>>", $f->group, "\n"; > print "Chr: ", $f->refseq, > " Strand: ", $f->strand, > " Start: ", $f->start, > " End: ", $f->end, > "\n"; > print "Offending >>>", $dups{$f->group}->group, "\n"; > print "Chr: ", $dups{$f->group}->refseq, > " Strand: ", $dups{$f->group}->strand, > " Start: ", $dups{$f->group}->start, > " End: ", $dups{$f->group}->end; > print "\n\n"; > } else { > $dups{$f->group} = $f; > } > } > } > > Here is the output: > Calling >>>FBgn0004179 > Chr: 3L Strand: 1 Start: 22201102 End: 22207587 > Offending >>>FBgn0004179 > Chr: 3L Strand: 1 Start: 22200575 End: 22200575 > > Calling >>>FBgn0025681 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > Offending >>>FBgn0025681 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > Calling >>>FBgn0025803 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > Offending >>>FBgn0025803 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > Calling >>>FBgn0000117 > Chr: X Strand: -1 Start: 1756796 End: 1747557 > Offending >>>FBgn0000117 > Chr: X Strand: -1 Start: 1757776 End: 1747182 > > Calling >>>FBgn0005427 > Chr: X Strand: -1 Start: 136456 End: 125343 > Offending >>>FBgn0005427 > Chr: X Strand: -1 Start: 133199 End: 124949 > > Calling >>>FBgn0000042 > Chr: X Strand: 1 Start: 5746100 End: 5750026 > Offending >>>FBgn0000042 > Chr: X Strand: 1 Start: 5746096 End: 5746106 > > Calling >>>FBgn0004551 > Chr: 2R Strand: -1 Start: 19443485 End: 19434556 > Offending >>>FBgn0004551 > Chr: 2R Strand: -1 Start: 19445155 End: 19429977 > > Do you have any suggestions?? 
Is the procedure I am using to > retrieve the > genes right? > > Many thanks > > Marco > > > >>> Many thanks Scott, >>> >>> At the same time I got your email I was coming to the same >>> conclusion as >>> you. >>> >>> Now I have a stranger problem in my hands... My goal is quite >>> simple, I >>> try >>> to get the sequence of the genes back from the Bio::DB::GFF database >>> loaded >>> on MySQL. The gene list is from a file with one gene id per per >>> line. When >>> I >>> run the following script: >>> >>> >>> >>> use Bio::DB::GFF; >>> use Bio::SeqIO; >>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>> -format => 'fasta'); >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>> -dsn => >>> 'dbi:mysql:database=dmel_43_new'); >>> >>> while (<>){ >>> chomp; >>> my $id = $_; >>> my @feats = $db->get_feature_by_name($id); >>> for my $f (@feats){ >>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>> } >>> } >>> >>> >>> I get more sequence back than the number of gene in my input file. I >>> double >>> check there. Some of the duplicated entries are the same, some >>> are not! >> >> >> ... >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 > -- > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Aug 14 04:51:22 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 09:51:22 +0100 Subject: [Bioperl-l] Release 1.5.2 In-Reply-To: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> References: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> Message-ID: <44E0398A.2060108@sendu.me.uk> Chris Fields wrote: > All, > > We are interested in ideas, for what should be included in future releases > of Bioperl, including the next developer release (1.5.5), the eventual next > stable release (1,.6), and beyond (?!?). I think we should at least try > getting a developer point release out soon as there are major changes > looming for Taxonomy, Feature/Annotation, etc. Along the way we can > determine the release pumpkin, etc. > > The direct link: > > http://www.bioperl.org/wiki/Bioperl_Release I think the list has stabilized a little now. We ought to get the show on the road, so unless someone with more experience has the time, I'll offer to be release pumpkin for 1.5.2. Once the pumpkin has been determined we can press on. From bix at sendu.me.uk Mon Aug 14 05:13:30 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 10:13:30 +0100 Subject: [Bioperl-l] Bio::Species, Bio::Taxonomy::Node overhaul In-Reply-To: <756E540A-19B3-4C2D-AF21-1E405A7F0A5C@gmx.net> References: <44D4BC52.30203@sendu.me.uk> <44DDC448.7050607@sendu.me.uk> <44DDE0BE.2050907@sendu.me.uk> <16FDE165-7A20-4894-A352-529F34BE194F@uiuc.edu> <756E540A-19B3-4C2D-AF21-1E405A7F0A5C@gmx.net> Message-ID: <44E03EBA.50903@sendu.me.uk> Hilmar Lapp wrote: > > On Aug 12, 2006, at 2:36 PM, Chris Fields wrote: > >> [...] It would be nice, for instance, to have a demonstration script >> or similar working code outlining what your changes accomplish. > > Ideally this would be in the synopsis of the module, and I believe it is > actually. 
Yes; additionally, there is scripts/taxa/taxonomy2tree.PLS, which shows off some fancier stuff. From bix at sendu.me.uk Mon Aug 14 06:02:30 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 11:02:30 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <000201c6bccf$8941f090$15327e82@pyrimidine> References: <000201c6bccf$8941f090$15327e82@pyrimidine> Message-ID: <44E04A36.6090403@sendu.me.uk> Chris Fields wrote: > ... >> My proposal involves the "chunks" being unparsed, raw text "blobs", that >> are essentially blessed into a package that does the parsing only when >> necessary (and even then, might choose different parsing strategies, based >> on what's been asked for). Thus a potentially large amount of parsing and >> storage is skipped. Additionally, you now have the option of not even >> storing the blobs in memory, just file seek pointers (requiring temp. >> storage for streaming pipe data sources), and thus can process very large >> reports without consuming memory (currently a problem). > > Using file pointers is a great touch. Sendu has a slight aversion to temp > files but he has already indicated other ways around this. I'm in the midst of implementing an 'Aaron'-style pull-parser which I have called PullParserI. My current solution for piped input is: '... The other thing you will need to decide when making a chunk is how to handle piped input. A PullParser needs seekable data to parse, so if your data is piped in and unseekable, you must decide between creating a temp file or reading the input into memory, which will be done before the chunk becomes usable and you can begin any parsing.' I don't think it's really possible to avoid this initial 'read everything in first' step, unless anyone has any bright ideas?
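For readers following the thread, the "file seek pointers" idea quoted above can be sketched in core Perl. This is only a minimal sketch under stated assumptions, not the PullParserI code or any Bio::SearchIO API: the LazyChunk package and index_chunks are hypothetical names, the input is assumed to be seekable, and each chunk is assumed to start at a line beginning with "Query=".

```perl
use strict;
use warnings;

# Hypothetical class for illustration only (not a BioPerl module).
# A chunk stores byte offsets, not text; nothing is parsed until asked for.
package LazyChunk;

sub new {
    my ($class, %args) = @_;
    return bless {
        fh    => $args{fh},       # seekable filehandle shared by all chunks
        start => $args{start},    # byte offset where this chunk begins
        end   => $args{end},      # byte offset just past its last byte
    }, $class;
}

# Fetch the raw text of this chunk from the file on demand.
sub raw {
    my $self = shift;
    seek($self->{fh}, $self->{start}, 0) or die "seek failed: $!";
    my $buf = '';
    my $got = read($self->{fh}, $buf, $self->{end} - $self->{start});
    die "read failed: $!" unless defined $got;
    return $buf;
}

# Parse the query name the first time it is requested, then cache it.
sub name {
    my $self = shift;
    unless (exists $self->{name}) {
        ($self->{name}) = $self->raw =~ /^Query=\s*(\S+)/m;
    }
    return $self->{name};
}

package main;

# One pass over the file that records only where each chunk starts.
# Assumption: a chunk begins at every line that starts with "Query=".
sub index_chunks {
    my ($fh) = @_;
    seek($fh, 0, 0) or die "seek failed: $!";
    my @starts;
    while (1) {
        my $pos  = tell($fh);
        my $line = <$fh>;
        last unless defined $line;
        push @starts, $pos if $line =~ /^Query=/;
    }
    my $eof = tell($fh);
    my @chunks;
    for my $i (0 .. $#starts) {
        my $end = $i < $#starts ? $starts[ $i + 1 ] : $eof;
        push @chunks,
            LazyChunk->new(fh => $fh, start => $starts[$i], end => $end);
    }
    return @chunks;
}

# Demo on an in-memory filehandle, which supports seek and tell.
my $report = "Query= seqA\nhit1\nhit2\nQuery= seqB\nhit3\n";
open my $fh, '<', \$report or die "cannot open in-memory handle: $!";
my @chunks = index_chunks($fh);
print scalar(@chunks), " chunks; second is ", $chunks[1]->name, "\n";
```

The index costs one array entry per chunk instead of a parsed object per chunk, which is the memory saving the proposal is after. It still needs a seekable handle, which is exactly the piped-input problem raised in this message.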
From bix at sendu.me.uk Mon Aug 14 07:43:39 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 12:43:39 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: <44DB4E98.70703@sendu.me.uk> Message-ID: <44E061EB.9050605@sendu.me.uk> Chris Fields wrote: > Here's a couple of suggestions to get around that if you want to get > the code out there for testing: > > Could this be CVS-tagged to an experimental bioperl branch instead? > It could be merged back to the main branch once everybody gets to try > it out, and you could commit changes to the branch (tests, scripts, > etc) along the way based on suggestions. Think of this as a test- > drive for a new Bioperl release. I have created branch 'branch-experimental' and committed the changes there. Please test by checking out the experimental branch: cvs co -d experimental -r branch-experimental bioperl-live I'll probably end up writing a new pull/chunk parser for BLAST, but these changes will still speed up the other SearchIO modules. So test the speed-up on different kinds of report as well. The experimental branch should be used for trying out major implementation changes that have the potential to break important and substantial parts of bioperl. Everything else should continue to be committed to HEAD until the 1.6 branch emerges (sometime next year). From aaron.j.mackey at gsk.com Mon Aug 14 08:33:30 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Mon, 14 Aug 2006 08:33:30 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E04A36.6090403@sendu.me.uk> Message-ID: A "pull parser" need not read everything (i.e. the entire file) into memory, just the current/next chunk, right? 
It was the current "push parser" architecture that had me thinking about file pointers: if we're forced to make an initial pass through the entire file to build up all the top-level objects before being able to access the first one (as the current SearchIO does), then it would be advantageous to minimize the memory impact of all those top-level objects with file pointers rather than in-memory blobs. But in a "pull" architecture, that consideration is no longer so important. Please forgive me if I've misunderstood what you're describing below. -Aaron bioperl-l-bounces at lists.open-bio.org wrote on 08/14/2006 06:02:30 AM: > Chris Fields wrote: > > ... > >> My proposal involves the "chunks" being unparsed, raw text "blobs", that > >> are essentially blessed into a package that does the parsing only when > >> necessary (and even then, might choose different parsing strategies, based > >> on what's been asked for). Thus a potentially large amount of parsing and > >> storage is skipped. Additionally, you now have the option of not even > >> storing the blobs in memory, just file seek pointers (requiring temp. > >> storage for streaming pipe data sources), and thus can process very large > >> reports without consuming memory (currently a problem). > > > > Using file pointers is a great touch. Sendu has a slight aversion to temp > > files but he has already indicated other ways around this. > > I'm in the midst of implementing an 'Aaron'-style pull-parser which I > have called PullParserI. My current solution for piped input is: > > '... The other thing you will need to decide when making a chunk is how > to handle piped input. A PullParser needs seekable data to parse, so if > your data is piped in and unseekable, you must decide between creating a > temp file or reading the input into memory, which will be done before > the chunk becomes usable and you can begin any parsing.' 
> > I don't think it's really possible to avoid this initial 'read everything > in first' step, unless anyone has any bright ideas? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From hlapp at gmx.net Mon Aug 14 08:38:40 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 14 Aug 2006 08:38:40 -0400 Subject: [Bioperl-l] Release 1.5.2 In-Reply-To: <44E0398A.2060108@sendu.me.uk> References: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> <44E0398A.2060108@sendu.me.uk> Message-ID: <68BA6E9B-06E1-4340-A829-8BF724D1193F@gmx.net> On Aug 14, 2006, at 4:51 AM, Sendu Bala wrote: > I'll offer to be release pumpkin for 1.5.2. Hooray - we have someone to head the release! (You didn't expect somebody else to push you aside, did you? :-) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Aug 14 09:04:19 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 14:04:19 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44E074D3.7010903@sendu.me.uk> aaron.j.mackey at gsk.com wrote: > A "pull parser" need not read everything (i.e. the entire file) into > memory, just the current/next chunk, right? The problem arises when you need random access to the input data in order to do what you need to do, like get just the next chunk or bit of information. So I don't see a way for a generalized pull-parser to cope with piped input, because most operations are going to have to use seek() to work, and you can't seek piped input. What I do at the moment, then, is on detecting piped input, I'm forced to read in all the input data in one go and spit it out into seekable memory or a temp file. After that, normal behaviour resumes: you don't read everything, just the bit you want.
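The "detect piped input, then spool it somewhere seekable" step described above might look roughly like the following. It is a hedged sketch in core Perl, not the actual PullParserI implementation: seekable_handle and its 'memory'/'tempfile' argument are made-up names, and using -f as the "already seekable" test is an assumption chosen for simplicity.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a handle that supports seek() for any input.
# $where selects the spool target for unseekable input: 'memory' or 'tempfile'.
sub seekable_handle {
    my ($in, $where) = @_;
    $where ||= 'tempfile';

    # A regular file can be seeked directly, so use it as it is.
    return $in if -f $in;

    if ($where eq 'memory') {
        # Slurp everything, then reopen the scalar as an in-memory file.
        my $buffer = do { local $/; <$in> };
        $buffer = '' unless defined $buffer;
        open my $mem, '<', \$buffer
            or die "cannot open in-memory handle: $!";
        return $mem;
    }

    # Otherwise copy the stream into an anonymous temporary file.
    my $tmp = tempfile();
    while (read($in, my $block, 65536)) {
        print {$tmp} $block or die "write to temp file failed: $!";
    }
    seek($tmp, 0, 0) or die "seek failed: $!";
    return $tmp;
}

# Demo: a pipe stands in for piped STDIN; the data is small enough
# to fit in the pipe buffer, so writing before reading does not block.
pipe(my $reader, my $writer) or die "pipe failed: $!";
print {$writer} "Query= seqA\nsome hits\n";
close $writer;

my $fh    = seekable_handle($reader, 'tempfile');
my $first = <$fh>;
seek($fh, 0, 0) or die "seek failed: $!";
my $again = <$fh>;
print "re-read after seek: ", ($first eq $again ? "ok" : "failed"), "\n";
```

Either branch pays the full read up front, which is the cost being discussed; the temp file keeps memory use flat for very large reports, while the in-memory branch avoids touching disk.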
From cjfields at uiuc.edu Mon Aug 14 09:54:14 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 08:54:14 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E074D3.7010903@sendu.me.uk> References: <44E074D3.7010903@sendu.me.uk> Message-ID: On Aug 14, 2006, at 8:04 AM, Sendu Bala wrote: > aaron.j.mackey at gsk.com wrote: >> A "pull parser" need not read everything (i.e. the entire file) into >> memory, just the current/next chunk, right? > > The problem arises when you need random-access to the input data in > order to do what you need to do, like get just the next chunk or > bit of > information. > > So I don't see a way for a generalized pull-parser to cope with piped > input, because most operations are going to have use seek() to > work, and > you can't seek piped input. > > What I do at the moment, then, is on detecting piped input, I'm forced > to read all the input data in in one go and spit it out into seekable > memory or a temp file. After which normal behaviour resumes - you > don't > read everything, just the bit you want. The traditional route has been using a tempfile. Bio::Root::IO has several methods for creating tempdirs/tempfiles. I would have the option available for a tempfile, at least, for the guys who deal with large BLAST files. I think the XML files can also be quite long. Speaking of XML, is the current idea to get this running on text- based BLAST initially? Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Aug 14 10:01:52 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 09:01:52 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E061EB.9050605@sendu.me.uk> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> Message-ID: <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> Sendu, Sounds good. 
We need to make sure that commits to bioperl-live also get committed to the experimental branch, correct? Or at least make sure bioperl-live commits are merged into experimental (and not vice versa)? Chris On Aug 14, 2006, at 6:43 AM, Sendu Bala wrote: > Chris Fields wrote: >> Here's a couple of suggestions to get around that if you want to get >> the code out there for testing: >> >> Could this be CVS-tagged to an experimental bioperl branch instead? >> It could be merged back to the main branch once everybody gets to try >> it out, and you could commit changes to the branch (tests, scripts, >> etc) along the way based on suggestions. Think of this as a test- >> drive for a new Bioperl release. > > I have created branch 'branch-experimental' and committed the > changes there. > > Please test by checking out the experimental branch: > cvs co -d experimental -r branch-experimental bioperl-live > > I'll probably end up writing a new pull/chunk parser for BLAST, but > these changes will still speed up the other SearchIO modules. So test > the speed-up on different kinds of report as well. > > > The experimental branch should be used for trying out major > implementation changes that have the potential to break important and > substantial parts of bioperl. Everything else should continue to be > committed to HEAD until the 1.6 branch emerges (sometime next year). Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From aaron.j.mackey at gsk.com Mon Aug 14 10:00:01 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Mon, 14 Aug 2006 10:00:01 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E074D3.7010903@sendu.me.uk> Message-ID: I'm failing to understand, sorry.
The UNIX utility "more" (or "less" if you prefer) is a pull parser; it reads the stream as much as it needs to satisfy the current iteration (the next iteration occurring when the user asks for an additional screen or line). It does not copy data from a pipe into temp storage. That said, you can't use "more" to page backwards in piped content (unless your "more" is keeping a buffer, which some do). So, I agree that you will need some form of storage for the *current* information to be parsed (and must process all of the stream necessary to obtain all such information), but not for any of the information yet to be accessed. -Aaron bioperl-l-bounces at lists.open-bio.org wrote on 08/14/2006 09:04:19 AM: > aaron.j.mackey at gsk.com wrote: > > A "pull parser" need not read everything (i.e. the entire file) into > > memory, just the current/next chunk, right? > > The problem arises when you need random-access to the input data in > order to do what you need to do, like get just the next chunk or bit of > information. > > So I don't see a way for a generalized pull-parser to cope with piped > input, because most operations are going to have use seek() to work, and > you can't seek piped input. > > What I do at the moment, then, is on detecting piped input, I'm forced > to read all the input data in in one go and spit it out into seekable > memory or a temp file. After which normal behaviour resumes - you don't > read everything, just the bit you want. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at uiuc.edu Mon Aug 14 10:21:23 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 09:21:23 -0500 Subject: [Bioperl-l] Release 1.5.2 In-Reply-To: <68BA6E9B-06E1-4340-A829-8BF724D1193F@gmx.net> References: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> <44E0398A.2060108@sendu.me.uk> <68BA6E9B-06E1-4340-A829-8BF724D1193F@gmx.net> Message-ID: <0E58504C-FF95-493E-924B-0D95A78FE022@uiuc.edu> Hehehe, I didn't have anything to do with that! Anything at all. Nope. Not me... ;> By the way, on the BioPerl Release discussion page (not the regular BioPerl Release page) I copied the 1.5.2 list and added a few notes. A check list, sort-of, but also a way for people to communicate their thoughts. I'll add a few more things later (pages on unimplemented methods, etc). chris On Aug 14, 2006, at 7:38 AM, Hilmar Lapp wrote: > > On Aug 14, 2006, at 4:51 AM, Sendu Bala wrote: > >> I'll offer to be release pumpkin for 1.5.2. > > Hooray - we have someone to head the release! (You didn't expect > somebody else to push you aside, did you? :-) > > -hilmar Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Mon Aug 14 10:34:02 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 15:34:02 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44E089DA.2020406@sendu.me.uk> aaron.j.mackey at gsk.com wrote: > I'm failing to understand, sorry. > > The UNIX utility "more" (or "less" if you prefer) is a pull parser; it > reads the stream as much as it needs to satisfy the current iteration (the > next iteration occurring when the user asks for an additional screen or > line). It does not copy data from a pipe into temp storage. 
> > That said, you can't use "more" to page backwards in piped content (unless > your "more" is keeping a buffer, which some do). Exactly, 'more' can work like this because it only ever has to read chunks in linear file order, and when you want a different order it has to store everything read in memory. (Which is something we'd like to avoid doing.) > So, I agree that you will need some form of storage for the *current* > information to be parsed (and must process all of the stream necessary to > obtain all such information), but not for any of the information yet to be > accessed. Think of this: User creates a new SearchIO for a foobar report. Ideally no significant work is done. User requests report-statistic Y, which is found on the last line of the report. We want to avoid reading, storing and parsing the entire file just to find Y, so we seek to the last line, parse Y out and return it. Yay, super fast. Now the user requests the next_result(). Let's say the first result begins 5 lines into the file after the header. We quickly seek() there and... Oops, our input file was piped so we can't seek. There are two solutions to the problem: # Don't allow seeking around, read and cache all data as you pass it in search of the information you need. This is slower and more memory hungry than necessary for all parsing cases where the user does not request 100% of the information in the file. or # Allow seeking around. This adds an initial, possibly trivial, burden for piped input only. I'm going for the latter solution, and my question is, is there some magical way to avoid reading the whole piped input before we can begin work? I'm thinking no, but I thought I'd put the question out there in case someone had dealt with something similar and found a solution.
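The "seek to the last line, parse Y out" step above can be sketched in plain Perl as follows. This is an illustration under stated assumptions, not the SearchIO implementation: the helper name last_line is hypothetical, and the block size is assumed to be larger than the final line of the report.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last non-blank line of a seekable handle by
# reading only a small block from the end, not the whole file.
sub last_line {
    my ($fh, $block) = @_;
    $block ||= 4096;    # assumed to be longer than the last line

    seek($fh, 0, 2) or die "Input is not seekable: $!";    # 2 = from end of file
    my $size  = tell($fh);
    my $start = $size > $block ? $size - $block : 0;

    seek($fh, $start, 0) or die "Cannot seek: $!";          # 0 = from start of file
    my $tail = '';
    read($fh, $tail, $size - $start);

    my @lines = grep { /\S/ } split /\n/, $tail;
    return @lines ? $lines[-1] : undef;
}
```

On a piped handle the first seek() fails, which is exactly the "Oops" case in the message above: the caller then has to either cache what it reads or spool the stream somewhere seekable first.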
From hlapp at gmx.net Mon Aug 14 10:35:36 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 14 Aug 2006 10:35:36 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> Message-ID: <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> Don't complicate things more than they need to, please. The experimental branch should be solely for things for which you aren't sure whether they are going to work at all, and for API changes for which the consequences across the board may be significant and difficult to fully anticipate (e.g. the late-breaking and still haunting SeqFeatureI changes before 1.5.0 should have gone on an experimental branch first to see how they will behave). It should never be a requirement to merge commits from the main trunk to an experimental branch (or commit twice for developers). Quite frankly the changes Sendu described for me wouldn't have warranted an experimental branch, they didn't sound like changing API signature or API behavior. -hilmar On Aug 14, 2006, at 10:01 AM, Chris Fields wrote: > Sendu, > > Sounds good. We need to make sure that commits to bioperl-live also > get committed to the experimental branch, correct? Or at lease make > sure bioperl-live commits are merged into experimental (and not vice- > versa)? > > Chris > > > On Aug 14, 2006, at 6:43 AM, Sendu Bala wrote: > >> Chris Fields wrote: >>> Here's a couple of suggestions to get around that if you want to >>> get >>> the code out there for testing: >>> >>> Could this be CVS-tagged to an experimental bioperl branch instead? >>> It could be merged back to the main branch once everybody gets to >>> try >>> it out, and you could commit changes to the branch (tests, scripts, >>> etc) along the way based on suggestions. Think of this as a test- >>> drive for a new Bioperl release. 
>> >> I have created branch 'branch-experimental' and committed the >> changes there. >> >> Please test by checking out the experimental branch: >> cvs co -d experimental -r branch-experimental bioperl-live >> >> I'll probably end up writing a new pull/chunk parser for BLAST, but >> these changes will still speed up the other SearchIO modules. So test >> the speed-up on different kinds of report as well. >> >> >> The experimental branch should be used for trying out major >> implementation changes that have the potential to break important and >> substantial parts of bioperl. Everything else should continue to be >> committed to HEAD until the 1.6 branch emerges (sometime next year). >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From sdavis2 at mail.nih.gov Mon Aug 14 10:36:26 2006 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 14 Aug 2006 10:36:26 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: Message-ID: On 8/14/06 10:00 AM, "aaron.j.mackey at gsk.com" wrote: > I'm failing to understand, sorry. > > The UNIX utility "more" (or "less" if you prefer) is a pull parser; it > reads the stream as much as it needs to satisfy the current iteration (the > next iteration occurring when the user asks for an additional screen or > line). It does not copy data from a pipe into temp storage. 
> > That said, you can't use "more" to page backwards in piped content (unless > your "more" is keeping a buffer, which some do). > > So, I agree that you will need some form of storage for the *current* > information to be parsed (and must process all of the stream necessary to > obtain all such information), but not for any of the information yet to be > accessed. I hesitate to try to "clarify", but this is as much for my own good as for that of others. I think the distinction here is between "random access", which is probably not necessary for Blast parsing, and "pull parsing", which only needs sequential, chunk-based parsing. Is this the source of some confusion? > bioperl-l-bounces at lists.open-bio.org wrote on 08/14/2006 09:04:19 AM: > >> aaron.j.mackey at gsk.com wrote: >>> A "pull parser" need not read everything (i.e. the entire file) into >>> memory, just the current/next chunk, right? >> >> The problem arises when you need random-access to the input data in >> order to do what you need to do, like get just the next chunk or bit of >> information. >> >> So I don't see a way for a generalized pull-parser to cope with piped >> input, because most operations are going to have use seek() to work, and > >> you can't seek piped input. >> >> What I do at the moment, then, is on detecting piped input, I'm forced >> to read all the input data in in one go and spit it out into seekable >> memory or a temp file. After which normal behaviour resumes - you don't >> read everything, just the bit you want. 
>> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Mon Aug 14 10:41:54 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 15:41:54 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: <44E074D3.7010903@sendu.me.uk> Message-ID: <44E08BB2.9010201@sendu.me.uk> Chris Fields wrote: > On Aug 14, 2006, at 8:04 AM, Sendu Bala wrote: > >> What I do at the moment, then, is on detecting piped input, I'm forced >> to read all the input data in in one go and spit it out into seekable >> memory or a temp file. After which normal behaviour resumes - you >> don't read everything, just the bit you want. > > The traditional route has been using a tempfile. Bio::Root::IO has > several methods for creating tempdirs/tempfiles. > > I would have the option available for a tempfile, at least, for the > guys who deal with large BLAST files. I think the XML files can also > be quite long. Yes, as I stated, you have the option of creating a tempfile (and I use Bio::Root::IO to do it). My question was can we avoid the need for doing any such thing for piped data whilst still retaining all the advantages of a pull-parser (speed, low memory)? I appreciate it's all very hard to imagine what on earth I'm trying to say; perhaps the discussion is better left until I make some code available. > Speaking of XML, is the current idea to get this running on text- > based BLAST initially? I'm using it first for my hmmpfam parser, then I'll try it for text blastn as proof-of-concept and move on from there. blast.pm is a bit of a nightmare to move over to a new system; that's another thing the pull-parser will solve - make code more manageable. 
From bix at sendu.me.uk Mon Aug 14 10:47:14 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 15:47:14 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44E08CF2.2010705@sendu.me.uk> Sean Davis wrote: > > > On 8/14/06 10:00 AM, "aaron.j.mackey at gsk.com" > wrote: > >> I'm failing to understand, sorry. >> >> The UNIX utility "more" (or "less" if you prefer) is a pull parser; it >> reads the stream as much as it needs to satisfy the current iteration (the >> next iteration occurring when the user asks for an additional screen or >> line). It does not copy data from a pipe into temp storage. >> >> That said, you can't use "more" to page backwards in piped content (unless >> your "more" is keeping a buffer, which some do). >> >> So, I agree that you will need some form of storage for the *current* >> information to be parsed (and must process all of the stream necessary to >> obtain all such information), but not for any of the information yet to be >> accessed. > > I hesitate to try to "clarify", but this is as much for my own good as for > that of others. I think the distinction here is between "random access", > which is probably not necessary for Blast parsing Not 'necessary', but if you don't make use of random access you lose a lot of the possible advantages of having a pull-parser (ie. will be forced to parse and store more data than the user strictly needs). It'll become clearer when I make some code available, can continue discussion then. 
From bix at sendu.me.uk Mon Aug 14 10:58:52 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 15:58:52 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> Message-ID: <44E08FAC.5010203@sendu.me.uk> Hilmar Lapp wrote: > Don't complicate things more than they need to, please. The experimental > branch should be solely for things for which you aren't sure whether > they are going to work at all, and for API changes for which the > consequences across the board may be significant and difficult to fully > anticipate (e.g. the late-breaking and still haunting SeqFeatureI > changes before 1.5.0 should have gone on an experimental branch first to > see how they will behave). Agree. > It should never be a requirement to merge commits from the main trunk to > an experimental branch (or commit twice for developers). This wouldn't be done on a regular basis, but any time someone wants to test if (eg. newly added) experimental code would work with the latest 'normal' code, an update of experimental could be done. Given that there won't be much experimental code anyway, one would expect the experimental branch to almost never be updated or committed to. > Quite frankly the changes Sendu described for me wouldn't have warranted > an experimental branch, they didn't sound like changing API signature or > API behavior. Gain-of-function only, but like I say, any bugs there have the potential to do major (but perhaps not immediately obvious) damage to a lot of different modules. 
From arareko at campus.iztacala.unam.mx Mon Aug 14 11:08:22 2006 From: arareko at campus.iztacala.unam.mx (Mauricio Herrera Cuadra) Date: Mon, 14 Aug 2006 10:08:22 -0500 Subject: [Bioperl-l] Release 1.5.2 In-Reply-To: <0E58504C-FF95-493E-924B-0D95A78FE022@uiuc.edu> References: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> <44E0398A.2060108@sendu.me.uk> <68BA6E9B-06E1-4340-A829-8BF724D1193F@gmx.net> <0E58504C-FF95-493E-924B-0D95A78FE022@uiuc.edu> Message-ID: <44E091E6.6080300@campus.iztacala.unam.mx> Great news Sendu! We are all happy to see you jump to the front line. A great release is coming... ;) Mauricio. Chris Fields wrote: > Hehehe, I didn't have anything to do with that! Anything at all. > Nope. Not me... ;> > > By the way, on the BioPerl Release discussion page (not the regular > BioPerl Release page) I copied the 1.5.2 list and added a few notes. > A check list, sort-of, but also a way for people to communicate their > thoughts. I'll add a few more things later (pages on unimplemented > methods, etc). > > chris > > On Aug 14, 2006, at 7:38 AM, Hilmar Lapp wrote: > >> On Aug 14, 2006, at 4:51 AM, Sendu Bala wrote: >> >>> I'll offer to be release pumpkin for 1.5.2. >> Hooray - we have someone to head the release! (You didn't expect >> somebody else to push you aside, did you? :-) >> >> -hilmar > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. 
Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > -- MAURICIO HERRERA CUADRA arareko at campus.iztacala.unam.mx Laboratorio de Genética Unidad de Morfofisiología y Función Facultad de Estudios Superiores Iztacala, UNAM From cjfields at uiuc.edu Mon Aug 14 11:26:48 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 10:26:48 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> Message-ID: <3B38CBE9-05C5-47FE-91F4-D7A4BB3CD3A2@uiuc.edu> Hilmar, I originally mentioned using this branch as a possibility, which no one seemed to oppose at the time (I also mentioned using Bugzilla, dropping in patches or modules as code enhancements). I felt it would be a way for everyone to try it out on different OS's (one of Sendu's main concerns) and still have the ability to update back to the main branch if it doesn't work. I agree with your assessment of only using this for API changes, etc. But it's my opinion Sendu is using it for the right purpose (he, and we, aren't sure whether these will work at all or to what degree they will work). So far everything looks very promising, however we don't know how these changes will affect SearchIO and its API, esp. his and Aaron's proposed 'pull' parser using file pointers (which I quite like the idea of; something to think about for SeqIO if it pans out). At least his code is available to be tested now! If you think we should just have various experimental branches for each 'experiment' that could have API issues or not work, that would work for me as well.
I originally thought it would be a good idea to keep code in the experimental branch in line with bioperl-live because it would allow its continued use as a testing ground, up to 1.6 and beyond if needed. Otherwise I could see an experimental branch stagnating over time without bug fixes, code updates, etc. Then half-baked 'experiments' that didn't work would be tossed, ones that did would be merged back or committed to MAIN. Of course, the long-term danger with a single experimental branch would be having too many experiments running on the same branch, which could make merging back a possible issue. Chris On Aug 14, 2006, at 9:35 AM, Hilmar Lapp wrote: > Don't complicate things more than they need to, please. The > experimental branch should be solely for things for which you > aren't sure whether they are going to work at all, and for API > changes for which the consequences across the board may be > significant and difficult to fully anticipate (e.g. the late-breaking > and still haunting SeqFeatureI changes before 1.5.0 should > have gone on an experimental branch first to see how they will > behave). > > It should never be a requirement to merge commits from the main > trunk to an experimental branch (or commit twice for developers). > > Quite frankly the changes Sendu described for me wouldn't have > warranted an experimental branch, they didn't sound like changing > API signature or API behavior. > > -hilmar > > On Aug 14, 2006, at 10:01 AM, Chris Fields wrote: > >> Sendu, >> >> Sounds good. We need to make sure that commits to bioperl-live also >> get committed to the experimental branch, correct? Or at least make >> sure bioperl-live commits are merged into experimental (and not >> vice versa)?
>> >> Chris >> >> >> On Aug 14, 2006, at 6:43 AM, Sendu Bala wrote: >> >>> Chris Fields wrote: >>>> Here's a couple of suggestions to get around that if you want >>>> to get >>>> the code out there for testing: >>>> >>>> Could this be CVS-tagged to an experimental bioperl branch instead? >>>> It could be merged back to the main branch once everybody gets >>>> to try >>>> it out, and you could commit changes to the branch (tests, scripts, >>>> etc) along the way based on suggestions. Think of this as a test- >>>> drive for a new Bioperl release. >>> >>> I have created branch 'branch-experimental' and committed the >>> changes there. >>> >>> Please test by checking out the experimental branch: >>> cvs co -d experimental -r branch-experimental bioperl-live >>> >>> I'll probably end up writing a new pull/chunk parser for BLAST, but >>> these changes will still speed up the other SearchIO modules. So >>> test >>> the speed-up on different kinds of report as well. >>> >>> >>> The experimental branch should be used for trying out major >>> implementation changes that have the potential to break important >>> and >>> substantial parts of bioperl. Everything else should continue to be >>> committed to HEAD until the 1.6 branch emerges (sometime next year). >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. 
Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Mon Aug 14 11:30:34 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 10:30:34 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E08CF2.2010705@sendu.me.uk> References: <44E08CF2.2010705@sendu.me.uk> Message-ID: <1F456929-324D-42BD-B5ED-33DE3E082B18@uiuc.edu> >> ...I hesitate to try to "clarify", but this is as much for my own >> good as for >> that of others. I think the distinction here is between "random >> access", >> which is probably not necessary for Blast parsing > > Not 'necessary', but if you don't make use of random access you lose a > lot of the possible advantages of having a pull-parser (ie. will be > forced to parse and store more data than the user strictly needs). > > It'll become clearer when I make some code available, can continue > discussion then. 'Don't stand in the way of someone who threatens to code!' Where have I heard that before?? Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Aug 14 11:40:47 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 14 Aug 2006 11:40:47 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <3B38CBE9-05C5-47FE-91F4-D7A4BB3CD3A2@uiuc.edu> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> <3B38CBE9-05C5-47FE-91F4-D7A4BB3CD3A2@uiuc.edu> Message-ID: <2E9781DA-25B5-4DBB-9B72-2A79E1458E4C@gmx.net> On Aug 14, 2006, at 11:26 AM, Chris Fields wrote: > I originally thought it would be a good idea to keep code in the > experimental branch in line with bioperl-live because it would > allow its continued use as a testing ground, up to 1.6 and beyond > if needed. Well, that's exactly not what you should be using it for. Use it to trial your changes when you're not sure whether they may mess things up, and when, if they do, you're not ready to clean up throughout the toolkit and might instead decide to discard the entire idea - which would then amount to no more than discontinuing the branch. > Otherwise I could see an experimental branch stagnating over time > without bug fixes, code updates, etc. Yeah, after a while once you've convinced yourself your 'experimental' changes work, you should port them over to the main trunk and then the experimental one should go into oblivion. Otherwise you create a maintenance nightmare.
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Aug 14 11:45:49 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 10:45:49 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <2E9781DA-25B5-4DBB-9B72-2A79E1458E4C@gmx.net> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> <3B38CBE9-05C5-47FE-91F4-D7A4BB3CD3A2@uiuc.edu> <2E9781DA-25B5-4DBB-9B72-2A79E1458E4C@gmx.net> Message-ID: <9DC69E4E-E851-4A7E-963D-A9589AB73CFF@uiuc.edu> > On Aug 14, 2006, at 11:26 AM, Chris Fields wrote: > >> I originally thought it would be a good idea to keep code in the >> experimental branch in line with bioperl-live because it would >> allow it's continued use as a testing ground, up to 1.6 and beyond >> if needed. > > Well, that's exactly not what you should be using it for. > > Use to test / trial your stuff if you're not sure it may just mess > up things and if it does you're not ready to clean up throughout > the toolkit but you might as well decide to discard the entire idea > - which would then amount to no more than discontinuing the branch. > >> Otherwise I could see an experimental branch stagnating over time >> without bug fixes, code updates, etc. > > Yeah, after a while once you've convinced yourself your > 'experimental' changes work, you should port them over to the main > trunk and then the experimental one should go into oblivion. > > Otherwise you create a maintenance nightmare. Agreed. So basically have individual experimental branches for each change, which is fine by me. Christopher Fields Postdoctoral Researcher Lab of Dr. 
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Mon Aug 14 12:03:37 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 14 Aug 2006 12:03:37 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <9DC69E4E-E851-4A7E-963D-A9589AB73CFF@uiuc.edu> References: <44DB4E98.70703@sendu.me.uk> <44E061EB.9050605@sendu.me.uk> <8D298D41-FED2-48B3-8F0F-BD86E5BA59C0@uiuc.edu> <862A9618-E1DA-49E2-9ABF-7FD86948E575@gmx.net> <3B38CBE9-05C5-47FE-91F4-D7A4BB3CD3A2@uiuc.edu> <2E9781DA-25B5-4DBB-9B72-2A79E1458E4C@gmx.net> <9DC69E4E-E851-4A7E-963D-A9589AB73CFF@uiuc.edu> Message-ID: On Aug 14, 2006, at 11:45 AM, Chris Fields wrote: > So basically have individual experimental branches for each change Yes, just use common sense as for what might count as 'experimental' and what doesn't need to. The rule shouldn't be to open a branch for every commit ... (which I'm sure you didn't want to suggest either) -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Mon Aug 14 12:24:58 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 17:24:58 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: Message-ID: <44E0A3DA.40601@sendu.me.uk> aaron.j.mackey at gsk.com wrote: >> User requests report-statistic Y, which is found on the last line of the > >> report. We want to avoid reading, storing and parsing the entire file >> just to find Y, so we seek to the last line, parse Y out and return it. >> Yay, super fast. > > This was the bit I was missing, thanks; to be honest, I never knew we had > a get_result(Y) method, I thought we only had next_result() iterators. Oh > wait, we don't, but you're proposing we should extend the API to offer > one? It's subtle. 
There are no explicit methods defined at the SearchIO level, but currently you have to parse data (or not - we want to pull) to find out things that all result (or even hit, hsp) objects need. You may need to do some internal, optional parsing depending on the specific file format variation you discover you are parsing. And then of course the idea is that this is nested, so the parser for the result data is a Bio::Search::Result::ResultI but also a pull-parser in its own right (and so on for HitI and HSPI) with a need for random-access to the various bits of data needed to answer all the various methods of ResultI. > The reason I'm being so fussy about this is that a primary motivation for > a shockingly-fast parser is shockingly large datasets that we keep only as > compressed files, uncompressing them en route to the parser; thus your > simple "I'll just copy the stream to tempfile and proceed as normal" > solution is not so trivial. Right, that's helpful. I'll keep that in mind. > Here's a compromise: assume that users won't need random access to their > results, only sequential; also, provide a new parameter to the searchIO > constructor to specify the desired access mode as random; then, if the > input stream is not seekable (which is testable), you can perform your > memory/file caching. If get_result(X) is called without the access mode > being set to random on an unseekable stream, throw an (informative) error. I currently have a -piped_behaviour argument that accepts 'memory' or 'temp_file'. How about a third (non-default) option of 'linear' to avoid any attempt at a seek and just use the data as it is piped? The trouble is that you'd virtually need to implement the methods of a parser module twice: once where the methods can seek, and once where they can't. Or maybe not; I'll have to try and see if some sane compromise implementation is possible.
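One way to picture the compromise discussed above is the sketch below. It is not the real Bio::SearchIO code: prepare_input and seek_to are hypothetical names, only the three behaviour values ('memory', 'temp_file', 'linear') come from the thread, and the default used here is an arbitrary assumption.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not the real Bio::SearchIO code. Given an input handle
# and a behaviour ('memory', 'temp_file' or 'linear'), return a two-element
# list: the handle to parse from, and a flag saying whether seek() may be used.
sub prepare_input {
    my ($fh, $behaviour) = @_;
    $behaviour = 'temp_file' unless defined $behaviour;    # assumed default
    die "Unknown piped behaviour '$behaviour'\n"
        unless $behaviour =~ /^(?:memory|temp_file|linear)$/;

    # Seekable input is used directly, whatever the behaviour setting.
    my $is_pipe = -p $fh;
    return ($fh, 1) if !$is_pipe && seek($fh, 0, 1);

    # 'linear': use the stream as it arrives and never seek.
    return ($fh, 0) if $behaviour eq 'linear';

    if ($behaviour eq 'memory') {
        my $data = do { local $/; <$fh> };                  # slurp the stream
        $data = '' unless defined $data;
        open(my $mem, '<', \$data) or die "Cannot open in-memory handle: $!";
        return ($mem, 1);
    }

    # 'temp_file': copy the stream to an anonymous temp file and rewind it.
    my $tmp = tempfile();
    while (my $line = <$fh>) {
        print {$tmp} $line;
    }
    seek($tmp, 0, 0) or die "Cannot rewind temp file: $!";
    return ($tmp, 1);
}

# Random-access entry point: fails informatively when seeking is not allowed.
sub seek_to {
    my ($fh, $can_seek, $pos) = @_;
    die "Random access requested on an unseekable stream; "
      . "use 'memory' or 'temp_file' instead of 'linear'\n"
        unless $can_seek;
    seek($fh, $pos, 0) or die "seek failed: $!";
    return 1;
}
```

Routing every seek through a single function like seek_to is one possible way to avoid writing each parser method twice: the methods stay the same and only the seek step decides whether to proceed or throw. Whether that holds up for a real parser is exactly the open question in the message above.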
From cjfields at uiuc.edu Mon Aug 14 12:32:36 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 14 Aug 2006 11:32:36 -0500
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
References:
Message-ID: <2E111197-651B-4F01-B15B-418CA48F2A25@uiuc.edu>

On Aug 14, 2006, at 11:00 AM, aaron.j.mackey at gsk.com wrote:

>>> Otherwise I could see an experimental branch stagnating over time without bug fixes, code updates, etc.
>>
>> Yeah, after a while once you've convinced yourself your 'experimental' changes work, you should port them over to the main trunk and then the experimental one should go into oblivion.
>
> To reemphasize the point, there's no reason why there should be only one experimental branch; rather, when a (named) experiment is to be performed, you branch to perform that experiment, the endpoint of which is either experimental success (merge back to trunk) or failure (branch is relegated to oblivion). The best way to ensure that your experimental branch is reflective of the current trunk is to make new experimental branches from the trunk whenever you need one.
>
> Don't be afraid of branching, that's what it's there for! Branch early, branch often; don't pollute the trunk!
>
> -Aaron

Yeah, I agree. Any of Sendu's commits to the current experimental branch would eventually merge back to trunk once they prove successful and work on that experimental branch would stop. Seems using branches for experimental code hasn't been taken advantage of nearly enough judging by Hilmar's statement about Bio::SeqFeatureI (I guess re: the original 1.5 release).

BTW, I really like the 'lazy parsing' used for SwissKnife and the use of seek() processing chunks. This is definitely something to think about in the future for Bio::SeqIO. Would be nice to have the capability of handling and writing very large sequences w/o memory issues.

Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From hlapp at gmx.net Mon Aug 14 12:54:40 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 14 Aug 2006 12:54:40 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44E0A3DA.40601@sendu.me.uk>
References: <44E0A3DA.40601@sendu.me.uk>
Message-ID:

On Aug 14, 2006, at 12:24 PM, Sendu Bala wrote:

> How about a third (non-default) option of 'linear' to avoid any attempt at a seek and just use the data as it is piped?

I'd call it 'sequential' or better 'sequential_read'.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From muratem at eng.uah.edu Mon Aug 14 12:55:45 2006
From: muratem at eng.uah.edu (Mike Muratet)
Date: Mon, 14 Aug 2006 11:55:45 -0500 (CDT)
Subject: [Bioperl-l] [BioSQL-l] load_seqdatabase fails when loading refseq plant files
In-Reply-To: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>
References: <1155322655.4837.25.camel@gort.gcrc.upenn.edu>
Message-ID:

On Fri, 11 Aug 2006, Angel Pizarro wrote:

> Date: Fri, 11 Aug 2006 14:57:35 -0400
> From: Angel Pizarro
> To: BioSQL , Bioperl
> Subject: Re: [BioSQL-l] load_seqdatabase fails when loading refseq plant files
>
> Glad I am not the only one that ran into this problem! Mike, I had reported this issue a few emails back and have provided the list with an example file for testing, so it should be resolved soon.
>

I must have missed it. Sorry.

> FYI, you are correct that CRC is computed on load to determine if two pub references are in fact the same. This is a feature to save database space. The expected behaviour would be for the subsequent entries with the same CRC reference to have an FK to the originating reference entry, and not insert a duplicate row into the reference table.
>
> FYI #2, the --safe option explicitly states that it will continue to process records after errors BUT do a roll-back at the end of the run. This is to gather all of your errors in one shot, as opposed to fixing a record, starting, error, fix, etc.
>
> If you are impatient and do not care about references, you have three choices.
> 1) drop the unique constraint on reference.crc (this will cause dups in reference and you can not go back to a unique CRC without some major SQL data migration routine to fix FK's and delete the dups).
>
> 2) filter your records to not contain reference information
>
> 3) alter load_seqdatabase to not enter reference information. This would be in the Bio::AnnotationCollection object:
>
> $seq->annotation()->remove_Annotations('reference');
>
> The above command inserted someplace in the script line ~575 should do the trick. Obviously this means that all reference information is not loaded into the DB at all.
>

I do need to get something working, and the references are not critical to the application, so I will probably alter load_seqdatabase.

Thanks for the help!

Cheers

Mike

> -angel
>
> On Fri, 2006-08-11 at 11:10 -0500, Mike Muratet wrote:
>> Hello all
>>
>> I am using biosql-schema/bioperl-db to load Refseq entries into a biosql database. I don't see any version info in the files, but I downloaded everything in the last month or so and everything passed all the tests when installed. I am using perl 5.8.5, mysql 5.0.22, DBI-1.5.1, DBD-mysql-3.006.
I was loading plant file from Refseq rel 18:
>>
>> load_seqdatabase.pl --dbname biosql --lookup --u --namespace plant --format genbank --safe plant*.rna.gbff.gz
>>
>> and it crashed after about 30K of 60K records:
>>
>> at /usr/lib/perl5/site_perl/5.8.5/Bio/biosql-schema/sql/bioperl-db/scripts/biosql/load_seqdatabase.pl line 633
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (01-JUL-2004) National Center for Biotechnology Information, National Institutes of Health, Bethesda 20894, United States of America","CRC-6F1453182E2BAC3F","1","786","") FKs ()
>> Duplicate entry 'CRC-6F1453182E2BAC3F' for key 3
>> ---------------------------------------------------
>> Could not store XM_472403:
>> ------------- EXCEPTION -------------
>> MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:208
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
>> STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
>> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:219
>> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:216
>> t
>>
>> I traced the error back through the source and database and found that XM_472403 has the same CRC value as XM_473880. I actually got many errors of this type, but only the last one crashed the script (in spite of --safe).
>>
>> Should there be more info included in the CRC field?
I am weak when it comes to RDBMs, but looking at the schema, I would guess that the CRC field was added to make an otherwise degenerate key unique. Would it help to add more fields to the CRC, or another key? The former might be done without having to change a lot of code.
>>
>> Thanks
>>
>> Mike
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

From aaron.j.mackey at gsk.com Mon Aug 14 11:56:01 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Mon, 14 Aug 2006 11:56:01 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44E089DA.2020406@sendu.me.uk>
Message-ID:

> User requests report-statistic Y, which is found on the last line of the report. We want to avoid reading, storing and parsing the entire file just to find Y, so we seek to the last line, parse Y out and return it. Yay, super fast.

This was the bit I was missing, thanks; to be honest, I never knew we had a get_result(Y) method, I thought we only had next_result() iterators. Oh wait, we don't, but you're proposing we should extend the API to offer one? The only thing we do have is a "result_count" method that is defined as returning the number of results that "have been parsed" (which, to me, could differ from the number of results that "have already been, or could yet be, parsed")

> Now the user requests the next_result(). Let's say the first result begins 5 lines into the file after the header. We quickly seek() there and...

Yes, I understand that pipes aren't seekable. I didn't understand the non-streaming context in which you wanted to seek back up the stream.

> # Allow seeking around. This adds an initial, possibly trivial, burden for piped input only.
OK, if you insist on the need for "get_result(Y)" functionality, then (as you say) you must use a buffer/cache mechanism (switching from in-memory to tempfile above some threshold is another wrinkle to consider). But, consider emulating XML::Twig's "purge_up_to" mechanism, whereby after I call "get_result(Y)", I can also call "purge_upto(Y)" to release/minimize the buffer contents.

The reason I'm being so fussy about this is that a primary motivation for a shockingly-fast parser is shockingly large datasets that we keep only as compressed files, uncompressing them en route to the parser; thus your simple "I'll just copy the stream to tempfile and proceed as normal" solution is not so trivial.

Here's a compromise: assume that users won't need random access to their results, only sequential; also, provide a new parameter to the searchIO constructor to specify the desired access mode as random; then, if the input stream is not seekable (which is testable), you can perform your memory/file caching. If get_result(X) is called without the access mode being set to random on an unseekable stream, throw an (informative) error.

Yes, I realize this is a bit more work; but the result could actually be usable!

-Aaron

From aaron.j.mackey at gsk.com Mon Aug 14 12:00:52 2006
From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com)
Date: Mon, 14 Aug 2006 12:00:52 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <2E9781DA-25B5-4DBB-9B72-2A79E1458E4C@gmx.net>
Message-ID:

> > Otherwise I could see an experimental branch stagnating over time without bug fixes, code updates, etc.
>
> Yeah, after a while once you've convinced yourself your 'experimental' changes work, you should port them over to the main trunk and then the experimental one should go into oblivion.
To reemphasize the point, there's no reason why there should be only one experimental branch; rather, when a (named) experiment is to be performed, you branch to perform that experiment, the endpoint of which is either experimental success (merge back to trunk) or failure (branch is relegated to oblivion). The best way to ensure that your experimental branch is reflective of the current trunk is to make new experimental branches from the trunk whenever you need one.

Don't be afraid of branching, that's what it's there for! Branch early, branch often; don't pollute the trunk!

-Aaron

From xianjun.dong at bccs.uib.no Mon Aug 14 11:57:14 2006
From: xianjun.dong at bccs.uib.no (Xianjun Dong)
Date: Mon, 14 Aug 2006 17:57:14 +0200
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1>
References: <005201c6bcae$526c2bb0$2f01a8c0@GOLHARMOBILE1>
Message-ID: <1155571035.4343.129.camel@lauvtre.ii.uib.no>

Hi, Ryan and all other helpers,

I finally got my script to run and solved the codon table problem (I now check the DNA type, mtDNA or nuclear DNA, before I call translate). Thanks a lot for your help. But I still have some questions:

1. For a case where an in-frame stop codon codes for selenocysteine ('U'), as in the transcript ENSMUST00000094469, it should be translated into 'U', not '*', since IUPAC/IUBMB has officially recommended it. But when I use codontable_id=1 (the generic codon table), it is still '*'. Is it because the package (Bio::Tools::CodonTable) is not as up to date as the IUPAC rules?

2. Ryan, I still want to confirm one point about your sample code: can I just directly remove the in-frame stop codons (both in the middle and at the tail) from the CDS sequence, then get dna_aln by Clustalw, and then invoke run() on the Codeml package? I don't think the filter procedure in the sample code works very well.

3. What's more, there are two ways to get Ka/Ks through the PAML package:

    my $yn = new Bio::Tools::Run::Phylo::PAML::Yn00();

and

    my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
        ( -params => { 'runmode' => -2,
                       'seqtype' => 1,
                     } );

I checked the PODs for both modules. The default setting for Yn00() should be the same as the above Codeml setting, but the Ks output for the same sequences is very different. For example, here is the output for the sequences below:

[xianjund at lauvtre kaks]$ perl paml.pl seq.fa
Yn00:
Ka = 0.6267 Ks = 0.9160 Ka/Ks = 0.6841
Codeml:
SEQ1	SEQ2	Ka	Ks	Ka/Ks	PROT_PERCENTID	CDNA_PERCENTID
ENSMUST00000094469	ENST00000361918	0.7419	1.6483	0.4501	47.62	55.16

Sequences are here:
>ENSMUST00000094469
ATGAGCATCCTACTGTCGCCGCCGTCGCTGCTGCTGCTTCTTGCAGCCCTTGTGGCTCCA
GCCACCTCCACCACCAACTACCGACCGGATTGGAACCGTCTTCGAGGCCTGGCCAGGGGG
CGGGTGGAGACCTGTGGAGGACAGTTGAATCGCCTAAAGGAGGTGAAGGCCTTTGTCAAA
GAAGCTCAGGTGCCCCCCGAGTACCTGTGGGCGCCCGCTAAGCCCCCCGAGGAAGCTTCA
GAACACGACTGGCTGTGA
>ENST00000361918
ATGAGCCTCCTGTTGCCTCCGCTGGCGCTGCTGCTGCTTCTCGCGGCGCTTGTGGCCCCA
GAGCTCGTGCTGCTGGGCCGCCGCTACGAGGAACTAGAGCGCATCCCACTCAGTGAAATG
ACCCGCGAAGAGATCAATGCGCTAGTGCAGGAGCTCGGCTTCTACCGCAAGGCGGCGCCC
GACGCGCAGGTGCCCCCCGAGTACGTGTGGGCGCCCGCGAAGCCCCCAGAGGAAACTTCG
GACCACGCTGACCTGTAG

4. BTW, could you share your PBL method with me? I want to learn more about how to overcome the cases where synonymous rates are overestimated.

Thanks!

-Xianjun

On Thu, 2006-08-10 at 14:53 -0400, Ryan Golhar wrote:
> Hi Xianjun,
>
> 1. The Bio::Seq::translate function (to my knowledge) only uses the generic codon table. So, you will need to translate the DNA sequence using some other method. In any case, even removing the *'s from the protein sequence still leaves the stop codons in the DNA sequence which must be removed.
>
> 2. The checks were written to assume that the sequences provided are full-length coding sequences. That means the start and stop codon are present as well.
When the translate function is called, the stop codon is translated as a '*'. The script initially just removed the * from the end of the sequence and continued on.
>
> I added a check to see if there is a '*' in the middle of the sequence because I found in some of my genes that there are in fact in-frame stop codons which actually code for selenocysteine. I see the warning check isn't working for some reason - odd, it worked when I wrote it.
>
> You can remove the *'s from the protein sequence, but you must also be sure to remove the corresponding codons from the DNA sequence as well before invoking run() on the Codeml package. I suppose someone could add a check to the script to remove the in-frame stop codons. Remember, the pairwise_kaks script is just a starting point (tutorial) to show you how you can go about performing this type of an analysis.
>
> In fact, I've since switched from PAML to using a different method PBL which a colleague coded. I found that PAML tends to overestimate synonymous rates in some cases.
>
> Let me know if this helps. If not, I'll try to clarify.
>
> Ryan
>
> -----Original Message-----
> From: Xianjun Dong [mailto:xianjun.dong at bccs.uib.no]
> Sent: Thursday, August 10, 2006 12:03 PM
> To: golharam at umdnj.edu
> Cc: bioperl-l at lists.open-bio.org
> Subject: RE: [Bioperl-l] PAML + Codeml problem..
>
>
> Hi, Ryan
>
> Thanks for your reply!
>
> But here I still have two questions about the sample code:
> 1. The translate() function of the Bio::Seq object uses the generic codon table, but for mitochondrial DNA (mtDNA) we should use a different codon table.
> So, if we take the human transcript ENST00000361390 as example, > > >ENST00000361390 _cDNA > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > > After translating with above function, the amino acid sequence is like > this, which contain *(stop codon) within the sequence(also at the end of > the sequence). But actually, this is a mtDNA, if we use different codon > table, the * within the sequence will change to 'W'(Trp). (Because in > vertebrate mitochondria "AGA" and "AGG" are also stop codons, but not > "UGA", which codes for tryptophan instead.) > >ENST00000361390 aa_beforefilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI > TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV > TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA > AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL > L*KNFLPLTLALLI*YVSIPITISSIPPQT* > > 2. 
My second question is: > If there are * both in the middle and end of the translated sequence > (with pattern AAAAAA*AAAAAAAAAAAAAAA*AAA*), like above case, after the > two checks for stop codon, all * will be filtered out. So, when > translate back from aa_aln to dna_aln, there should be no stop codon > included. But actually, when I track the program, it display that there > are still stop codon included. Here is the DNA alignment after recalling > the aa_to_dna_aln function. How to explain this? > > >ENST00000361390 aa_to_dna_aln > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC > CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT > > > I attached my script for two ortholog transcripts demo and the output > (including the error msg) here. Could you kindly check for me? > > Thanks! 
> > -Xianjun > > ///////////////////////////////////////////////////////////////////// > /////////////////////////////// output ////////////////////////////// > ///////////////////////////////////////////////////////////////////// > > [xianjund at lauvtre kaks]$ perl calculator.pl > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > >ENSMUST00000082392 > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > 
GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATATAG > > Calculate the Ka/Ks for ENSG00000198888 : ENSMUSG00000064341 ... > >ENSMUST00000082392 aa_beforefilter > VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFI > IAPTLSLTLALSL*VPLPIPHPLINLNLGILFILATSSLSVYSIL*SG*ASNSKYSLFGALRAVAQTISYEV > TIAIILLSVLLINGSYSLQTLITTQEHI*LLLPA*PIAII*FISTLAETNRAPFDLTEGESELVSGFNVEYA > AGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHL > L*KNFLPLTLALCM*HISLPIFTAGVPPYI* > >ENSMUST00000082392 aa_afterfilter > VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISLFI > IAPTLSLTLALSLVPLPIPHPLINLNLGILFILATSSLSVYSILSGASNSKYSLFGALRAVAQTISYEVTIA > IILLSVLLINGSYSLQTLITTQEHILLLPAPIAIIFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFAL > FFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLLKNFLP > LTLALCMHISLPIFTAGVPPYI > >ENST00000361390 aa_beforefilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI > TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV > TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA > AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL > L*KNFLPLTLALLI*YVSIPITISSIPPQT* > >ENST00000361390 aa_afterfilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYI > TAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLA > IILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFAL > FFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPL > TLALLIYVSIPITISSIPPQT > > Print out the DNA sequences translated back from aa_to_dna function: > >ENSMUST00000082392 aa_to_dna_aln > 
GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAGAA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTT > >ENST00000361390 aa_to_dna_aln > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCGAA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC > CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT > > -------------------- 
WARNING ---------------------
> MSG: There was an error - see error_string for the program output
> ---------------------------------------------------
>
> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
> MSG: Unknown format of PAML output
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
> STACK: Bio::Tools::Phylo::PAML::_parse_summary /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359
> STACK: Bio::Tools::Phylo::PAML::next_result /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224
> STACK: main::kaks_calculate calculator.pl:176
> STACK: calculator.pl:116
>
> /////////////////////////////////////////////////////////////////////
> /////////////////////////////// script //////////////////////////////
> /////////////////////////////////////////////////////////////////////
> sub kaks_calculate
> {
>     my %seqs=@_;
>     #my %seqs = %$seqs_ref;
>     my @prots;
>
>     my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1);
>
>     # process each sequence
>     for my $seqid (keys %seqs)
>     {
>         my $seq = $seqs{$seqid};
>         my $protein =$seq->translate();
>         my $pseq = $protein->seq();
>         print ">$seqid aa_beforefilter \n$pseq\n";
>         if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) {
>             warn("provided a CDS sequence with a stop codon, PAML will choke!");
>             exit(0);
>         }
>         # Tcoffee can't handle '*' even if it is trailing
>         $pseq =~ s/\*//g;
>         print ">$seqid aa_afterfilter \n$pseq\n";
>         $protein->seq($pseq);
>         push @prots, $protein;
>     }
>
>     if( @prots < 2 ) {
>         warn("Need at least 2 CDS sequences to proceed");
>         exit(0);
>     }
>
>     # open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
>     # Align the sequences with clustalw
>     my $aa_aln = $aln_factory->align(\@prots);
>     # project the protein alignment back to CDS coordinates
>     my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
>
>     my @each = $dna_aln->each_seq();
>
>     print "\nPrint out the DNA sequences translated back from aa_to_dna function:\n\n";
>     foreach my $s ( $dna_aln->each_seq() ) {
>         print ">".$s->display_id." aa_to_dna_aln\n".$s->seq()."\n";
>     }
>
>     my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
>         ( -params => { 'runmode' => -2,
>                        'seqtype' => 1,
>                      } );
>
>     # set the alignment object
>     $kaks_factory->alignment($dna_aln);
>
>     # run the KaKs analysis
>     my ($rc,$parser) = $kaks_factory->run();
>     my $result = $parser->next_result;
>     my $MLmatrix = $result->get_MLmatrix();
>
>     my @otus = $result->get_seqs();
>     # this gives us a mapping from the PAML order of sequences back to
>     # the input order (since names get truncated)
>     my @pos = map {
>         my $c= 1;
>         foreach my $s ( @each ) {
>             last if( $s->display_id eq $_->display_id );
>             $c++;
>         }
>         $c;
>     } @otus;
>
>     # print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
>     print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
>     for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
>         for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
>             my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
>             my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
>             # print OUT join("\t", $otus[$i]->display_id,
>             print join("\t", $otus[$i]->display_id,
>                        $otus[$j]->display_id,$MLmatrix->[$i]->[$j]->{'dN'},
>                        $MLmatrix->[$i]->[$j]->{'dS'},
>                        $MLmatrix->[$i]->[$j]->{'omega'},
>                        sprintf("%.2f",$sub_aa_aln->percentage_identity),
>                        sprintf("%.2f",$sub_dna_aln->percentage_identity),
>                        ), "\n";
>         }
>     }
>
> }
>
>
> -------------------- WARNING ---------------------
> MSG: There was an error - see error_string for the program output
> ---------------------------------------------------
>
> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
> MSG: Unknown format of PAML output
> STACK: Error::throw
> STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
> STACK: Bio::Tools::Phylo::PAML::_parse_summary
/usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > STACK: > Bio::Tools::Phylo::PAML::next_result > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > STACK: main::kaks_calculate calculator.pl:176 > STACK: calculator.pl:116 > ---------------------------------------------------------------- > > > > > On Mon, 2006-07-31 at 11:20 -0400, Ryan Golhar wrote: > > Hi Xianjun, > > > > I just did some work on this module including the example. > > > > >> it does not occur in the codon position > > >>(say, the third codon's position is not a times of 3). > > >>Why it effect the result? > > > > If I'm interpreting your question correctly, the stop codons in your > > sequence occur in-frame. This is why it is choking. > > > > >>So, when translate back from aa_aln to dna_aln, there should be no > > stop codon included. SO, why it can not pass? > > > > The Ka and Ks statistics are not calculated based on the protein > > sequence, they are calculated based on the DNA sequence. The protein > > sequence is used to provide a alignment for the codons of the DNA > > sequence. Checking the protein sequence for * is easier to identify > > in-frame stop codons than scanning the DNA sequence. > > > > The two checks for stop codons you mentioned are to check for stop > > codons within the sequence without worry for the last amino acid. The > > > second part remove the * at the end of the sequence (not the middle). > > > > If you want to remove the in-frame stop codons, you can, but do so > > before translating it to protein sequences. > > > > Ryan > > > > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun > > Dong > > Sent: Monday, July 31, 2006 7:56 AM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] PAML + Codeml problem.. 
> > > > > > Hi, > > > > I have a problem during running the Codeml Wiki-HOWTO code: > > > > Here is the error message: > > //////////////////////////////////////////////////////////////// > > [xianjund at lauvtre kaks]$ perl paml.pl test.fa > > > > -------------------- WARNING --------------------- > > MSG: There was an error - see error_string for the program output > > STACK Bio::Tools::Run::Phylo::PAML::Codeml::run > > /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML > > /C > > odeml.pm:581 > > STACK toplevel paml.pl:61 > > > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > > MSG: Unknown format of PAML output > > STACK: Error::throw > > STACK: > > Bio::Root::Root::throw > > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > > STACK: > > Bio::Tools::Phylo::PAML::_parse_summary > > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > > STACK: > > Bio::Tools::Phylo::PAML::next_result > > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > > STACK: paml.pl:62 > > ---------------------------------------------------------------- > > //////////////////////////////////////////////////////////////// > > > > My test sequence is: > > >ENST00000361390 > > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > > AA > > > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > > > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > > > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > > > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > > > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > > > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > > > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > > > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > > > 
GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > > > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > > > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > > > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > > AGCATTCCCCCTCAAACCTAA > > >ENSMUST00000082392 > > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAG > > AA > > > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > > > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > > > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > > > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > > > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > > > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > > > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > > > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > > > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > > > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > > > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > > > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG > > GGAGTACCACCATACATATAG > > > > Sure, I checked it. There is some stop codon in it. If I replace it > > with non-stop codon, it works. 
> > > > For example, > > >ENST00000361390 > > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTACCG > > AA > > > CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAACCC > > > TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACATC > > > ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGGTC > > > AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAGGG > > > caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAGTC > > > ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAACA > > > CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAGAG > > > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACGCC > > > GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTACA > > > ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCCTA > > > CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > > > CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCTCC > > AGCATTCCCCCTCAAACCcaa > > >ENSMUST00000082392 > > GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaacaa > > AA > > > CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAACCA > > > TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTATT > > > ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaaTc > > > aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAGGA > > > caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAGca > > > aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAACC > > > CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAGAA > > > ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACGCA > > > GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTATT > > > ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTCTA > > > 
CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > > > CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAGCG > > GGAGTACCACCATACATAcaa > > > > But my question is: it does not occur in the codon position (say, the > > third codon's position is not a times of 3). Why it effect the result? > > > > And also there is code to filter out the stop codon in the sample code > > > (as the following shown) /////////////////////////////// > > if( $pseq =~ /\*/ && > > $pseq !~ /\*$/ ) { > > warn("provided a CDS sequence with a stop codon, PAML will > > choke!"); > > exit(0); > > } > > # Tcoffee can't handle '*' even if it is trailing > > $pseq =~ s/\*//g; > > ///////////////////////////// > > > > So, when translate back from aa_aln to dna_aln, there should be no > > stop codon included. SO, why it can not pass? > > > > Thanks for answer! > > > > P.S: attach my code here: > > ///////////////////////////////////////////////////////// > > #!/usr/bin/perl -w > > use strict; > > use Bio::Tools::Run::Phylo::PAML::Codeml; > > use Bio::Tools::Run::Alignment::Clustalw; > > > > # for projecting alignments from protein to R/DNA space > > use Bio::Align::Utilities qw(aa_to_dna_aln); > > # for input of the sequence data > > use Bio::SeqIO; > > use Bio::AlignIO; > > > > my $aln_factory = > > Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1); > > my $seqdata = shift || 'test.fa'; > > > > my $seqio = new Bio::SeqIO(-file => $seqdata, > > -format => 'fasta'); > > my %seqs; > > my @prots; > > # process each sequence > > while ( my $seq = $seqio->next_seq ) { > > $seqs{$seq->display_id} = $seq; > > # translate them into protein > > my $protein = $seq->translate(); > > my $pseq = $protein->seq(); > > if( $pseq =~ /\*/ && > > $pseq !~ /\*$/ ) { > > warn("provided a CDS sequence with a stop codon, PAML will > > choke!"); > > exit(0); > > } > > # Tcoffee can't handle '*' even if it is trailing > > $pseq =~ s/\*//g; > > > > $protein->seq($pseq); > > push 
@prots, $protein; > > } > > > > if( @prots < 2 ) { > > warn("Need at least 2 CDS sequences to proceed"); > > exit(0); > > } > > > > # open(OUT, ">align_output.txt") || die("cannot open output > > align_output for writing"); # Align the sequences with clustalw my > > $aa_aln = $aln_factory->align(\@prots); # project the protein > > alignment back to CDS coordinates my $dna_aln = aa_to_dna_aln($aa_aln, > > > \%seqs); > > > > my @each = $dna_aln->each_seq(); > > > > my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new > > ( -params => { 'runmode' => -2, > > 'seqtype' => 1, > > }, > > -save_tempfiles => 1, > > -verbose => 1); > > > > # set the alignment object $kaks_factory->alignment($dna_aln); > > > > # run the KaKs analysis > > my ($rc,$parser) = $kaks_factory->run(); > > my $result = $parser->next_result; > > my $MLmatrix = $result->get_MLmatrix(); > > > > my @otus = $result->get_seqs(); > > # this gives us a mapping from the PAML order of sequences back to # > > the input order (since names get truncated) my @pos = map { > > my $c= 1; > > foreach my $s ( @each ) { > > last if( $s->display_id eq $_->display_id ); > > $c++; > > } > > $c; > > } @otus; > > > > print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID > > CDNA_PERCENTID)),"\n"; for( my $i = 0; $i < (scalar @otus -1) ; $i++) > { > > for( my $j = $i+1; $j < (scalar @otus); $j++ ) { > > my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); > > my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); > > print join("\t", $otus[$i]->display_id, > > $otus[$j]->display_id,$MLmatrix->[$i]->[$j]- > > >{'dN'}, > > $MLmatrix->[$i]->[$j]->{'dS'}, > > $MLmatrix->[$i]->[$j]->{'omega'}, > > sprintf("%.2f",$sub_aa_aln- > > >percentage_identity), > > sprintf("%.2f",$sub_dna_aln- > > >percentage_identity), > > ), "\n"; > > } > > } > > > From bix at sendu.me.uk Mon Aug 14 13:57:37 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Mon, 14 Aug 2006 18:57:37 +0100 Subject: [Bioperl-l] SearchIO speed up 
In-Reply-To: References: Message-ID: <44E0B991.6060807@sendu.me.uk> aaron.j.mackey at GSK.COM wrote: >> And then of course the idea is that this is nested, so the parser for >> the result data is a Bio::Search::Result::ResultI but also a pull-parser >> in its own right (and so on for HitI and HSPI) with a need for >> random-access to the various bits of data needed to answer all the >> various methods of ResultI. > > the second- (and third- and so on) level parsers can work on in-memory > "blobs" (if seeking is unavailable), as these will be minute in > comparison; it's only the top-level SearchIO parser that need fuss about > streaming pipes and seekability. Oh, I'd disagree with that. A file given to SearchIO may only have 1 result in it, but that single result could be 99.999% of the 1000MB file. That result might have only one hit, taking 99.99% of the file. And then the user might only be interested in the first hsp, which takes 0.001 % of the file. You don't want to go around chucking in-memory blobs like those to your Result and Hit objects if you can avoid it. >> I currently have a -piped_behaviour argument that accepts 'memory' or >> 'temp_file'. > > does it default to memory? Yes, but the acceptable options and the defaults could vary for different pull-parser-based SearchIO modules. Since the goal here is increased speed of SearchIO, I'm tempted to say that even for a BLAST parser the default should be 'memory' (read everything in first). > fundamentally, parsing occurs when regular expressions operate on > in-memory blobs; so while you can keep lots of file pointers around to > define many largish blobs with minimal memory footprint, at some point > they need to become memory-resident for the parser to take effect. I try to keep a good balance here. 
I also throw away a blob as soon as I've parsed all the information I want out of it (which could be another irksome thing for a sequential_read of piped data; you either have to keep all blobs indefinitely, or do all your parsing sequentially, making us more like a push parser). From cjfields at uiuc.edu Mon Aug 14 14:04:04 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 13:04:04 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E0A3DA.40601@sendu.me.uk> References: <44E0A3DA.40601@sendu.me.uk> Message-ID: <5084B8EC-5DF4-49E8-B81A-9E43C9BB4BB5@uiuc.edu> On Aug 14, 2006, at 11:24 AM, Sendu Bala wrote: > aaron.j.mackey at gsk.com wrote: >>> User requests report-statistic Y, which is found on the last line >>> of the >> >>> report. We want to avoid reading, storing and parsing the entire >>> file >>> just to find Y, so we seek to the last line, parse Y out and >>> return it. >>> Yay, super fast. >> >> This was the bit I was missing, thanks; to be honest, I never knew >> we had >> a get_result(Y) method, I thought we only had next_result() >> iterators. Oh >> wait, we don't, but you're proposing we should extend the API to >> offer >> one? > > It's subtle. There's no explicit methods defined at the SearchIO > level, > but currently you have to parse data (or not - we want to pull) to > find > out things that all result (or even hit, hsp) objects need. You may > need > to do some internal, optional parsing depending on the specific file > format variation you discover you are parsing. > > And then of course the idea is that this is nested, so the parser for > the result data is a Bio::Search::Result::ResultI but also a pull- > parser > in its own right (and so on for HitI and HSPI) with a need for > random-access to the various bits of data needed to answer all the > various methods of ResultI. 
>
>> The reason I'm being so fussy about this is that a primary motivation for
>> a shockingly-fast parser is shockingly large datasets that we keep only as
>> compressed files, uncompressing them en route to the parser; thus your
>> simple "I'll just copy the stream to tempfile and proceed as normal"
>> solution is not so trivial.
>
> Right, that's helpful. I'll keep that in mind.
>
>> Here's a compromise: assume that users won't need random access to their
>> results, only sequential; also, provide a new parameter to the searchIO
>> constructor to specify the desired access mode as random; then, if the
>> input stream is not seekable (which is testable), you can perform your
>> memory/file caching. If get_result(X) is called without the access mode
>> being set to random on an unseekable stream, throw an (informative) error.
>
> I currently have a -piped_behaviour argument that accepts 'memory' or
> 'temp_file'. How about a third (non-default) option of 'linear' to avoid
> any attempt at a seek and just use the data as it is piped? The trouble
> is that you'd need to virtually implement the methods of a parser module
> twice, once where the methods can seek, second where they can't. Or
> maybe not; I'll have to try and see if some sane compromise
> implementation is possible.

My worry: would it obfuscate/compromise code having both sequential and random access available in the same module? This is something you also seem concerned about. I would focus on getting one implementation running (whichever is furthest along, which sounds like 'random access') with the knowledge of adding sequential access at some point. If it's too hard to fit in sequential without compromising your code then maybe have a separate set of classes specifically handle sequential access. Once the basic code is out anyone interested can test it out; then we can offer suggestions, add code, etc.
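
To illustrate the point quoted above that seekability of the input stream "is testable", here is a minimal sketch in plain Perl. It is not the pull-parser code under discussion: the helpers `is_seekable` and `seekable_handle` are hypothetical names, and the 'memory' / 'temp_file' values only mimic the idea behind the -piped_behaviour argument. It assumes a Unix-like system where a list-form pipe open is available.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET SEEK_CUR);
use File::Temp qw(tempfile);

# Hypothetical helper: true if random access works on this handle.
# A zero-byte seek relative to the current position succeeds on regular
# files and fails on pipes and sockets.
sub is_seekable {
    my ($fh) = @_;
    return seek($fh, 0, SEEK_CUR) ? 1 : 0;
}

# Hypothetical helper: hand back a handle that supports seek, caching
# piped input in memory or in a temporary file first if necessary.
sub seekable_handle {
    my ($fh, $piped_behaviour) = @_;
    $piped_behaviour ||= 'memory';
    return $fh if is_seekable($fh);

    if ($piped_behaviour eq 'memory') {
        my $data = do { local $/; <$fh> };     # slurp the whole stream
        $data = '' unless defined $data;
        open(my $mem, '<', \$data) or die "cannot open in-memory handle: $!";
        return $mem;
    }
    if ($piped_behaviour eq 'temp_file') {
        my $tmp = tempfile();                  # anonymous read/write temp file
        print {$tmp} $_ while <$fh>;
        seek($tmp, 0, SEEK_SET) or die "cannot rewind temp file: $!";
        return $tmp;
    }
    die "unknown piped behaviour '$piped_behaviour'\n";
}

# Demonstration: output piped from a child perl process is not seekable,
# but the cached copy is.
open(my $pipe, '-|', $^X, '-e', 'print "Query= q1\nQuery= q2\n"')
    or die "cannot start child process: $!";
print "pipe seekable: ", (is_seekable($pipe) ? "yes" : "no"), "\n";

my $in    = seekable_handle($pipe, 'memory');
my $first = <$in>;
seek($in, 0, SEEK_SET) or die "cannot rewind: $!";
my $again = <$in>;
print "re-read after seek matches: ", ($first eq $again ? "yes" : "no"), "\n";
```

A 'linear' mode as proposed in the thread would simply skip the caching step and never call seek, which is why the parser methods would then have to consume the data strictly in order.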
One suggestion: Bio::DB::WebDBSeqI-implementing classes have a parameter, retrieval_type(), for setting how the data stream is processed from a server (io_string, tempfile, pipeline). You could have a similar get/set with expanded arguments (using parameters if you want) based on the input stream (tempfile, piped) and how you want to process it (random, sequential). $parser->retrieval_type( -stream => 'tempfile', -access => 'random'); # or similar The options could be sorted out in the method using _rearrange(), which adds some flexibility. Of course, you wouldn't need 'access' parameter if you split these into two classes. Another thing also to keep in mind is interoperability. There are more BioPerl Windows users now than in previous years (I was one but now I'm Mac-tified). I don't think it will be a problem except with piping/forking (and that's only 'maybe') but you never know! If anything throws a wrench into the works it'll be DOS/Windows. Once you have some test code committed I'll try it out on WinXP. Mac OS X shouldn't be a problem but I'll try it there as well. No pressure Sendu! Chris Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From aaron.j.mackey at gsk.com Mon Aug 14 13:01:47 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Mon, 14 Aug 2006 13:01:47 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E0A3DA.40601@sendu.me.uk> Message-ID: > And then of course the idea is that this is nested, so the parser for > the result data is a Bio::Search::Result::ResultI but also a pull-parser > in its own right (and so on for HitI and HSPI) with a need for > random-access to the various bits of data needed to answer all the > various methods of ResultI. 
the second- (and third- and so on) level parsers can work on in-memory "blobs" (if seeking is unavailable), as these will be minute in comparison; it's only the top-level SearchIO parser that need fuss about streaming pipes and seekability. > I currently have a -piped_behaviour argument that accepts 'memory' or > 'temp_file'. does it default to memory? > How about a third (non-default) option of 'linear' to avoid > any attempt at a seek and just use the data as it is piped? fine; we can quibble about stylistic API issues later. > The trouble > is that you'd need to virtually implement the methods of a parser module > twice, once where the methods can seek, second where they can't. Or > maybe not; I'll have to try and see if some sane compromise > implementation is possible. fundamentally, parsing occurs when regular expressions operate on in-memory blobs; so while you can keep lots of file pointers around to define many largish blobs with minimal memory footprint, at some point they need to become memory-resident for the parser to take effect. Conversely, if you spend too much time finding out the fine-grained locations of every parsable bit, and saving the pointers then you're recapitulating Perl's own variable storage mechanisms. -Aaron From osborne1 at optonline.net Mon Aug 14 14:55:35 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 14 Aug 2006 14:55:35 -0400 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: <1155571035.4343.129.camel@lauvtre.ii.uib.no> Message-ID: Xianjun, You're going to have to answer these questions yourself by examining CodonTable.pm and then selecting the appropriate table. Brian O. On 8/14/06 11:57 AM, "Xianjun Dong" wrote: > But I still have some questions: > 1. For the case which in-frame stop codon codes for selenocysteine('U'), > like the transcript ENSMUST00000094469, it should be translated into > 'U', not '*' since the IUPAC/IUBMB has officially recommended it. 
But > when I use the codontable_id=1(generic codon table), it still was '*'. > Is it because the package(Bio::Tools::CodonTable) is not so updated as > the IUPAC rules? From aaron.j.mackey at gsk.com Mon Aug 14 15:11:21 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Mon, 14 Aug 2006 15:11:21 -0400 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: Message-ID: > > 1. For the case which in-frame stop codon codes for selenocysteine('U'), > > like the transcript ENSMUST00000094469, it should be translated into > > 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But > > when I use the codontable_id=1(generic codon table), it still was '*'. > > Is it because the package(Bio::Tools::CodonTable) is not so updated as > > the IUPAC rules? The translation of TGA into Selenocysteine (U) is not "universal", it only occurs when the downstream UTR contains a SECIS RNA element; Bio::Tools::CodonTable is unable to differentiate such selenocysteine-encoding TGA codons from "normal" TGA stop codons, regardless of the translation table in use. GenBank/EMBL-formatted records will typically have /transl_except entries in the feature table, but the BioPerl "translate" method does not (yet) recognize these (someone correct me if I'm wrong). -Aaron From cjfields at uiuc.edu Mon Aug 14 16:26:37 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 15:26:37 -0500 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: Message-ID: <001501c6bfdf$f4d91f00$15327e82@pyrimidine> You are correct, sir! (sorry, bad Ed McMahon impression) I don't know of any translation program that recognizes TGA as selenocysteine as you would have to run a concurrent search for a SECIS element using RNA motif software. 
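
To make the /transl_except idea concrete, here is a minimal sketch in plain Perl of translating a CDS with per-codon exceptions. It does not use BioPerl and is not how Bio::Tools::CodonTable or the translate method work; `translate_with_exceptions` is a hypothetical helper, and the codon numbers to override are assumed to be known already (for instance from a /transl_except entry or a SECIS search).

```perl
use strict;
use warnings;

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
my @bases = qw(T C A G);
my $aas   = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %codon2aa;
my $idx = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon2aa{"$b1$b2$b3"} = substr($aas, $idx++, 1);
        }
    }
}

# Hypothetical helper (not a BioPerl method): translate a CDS in frame 0.
# %$except maps 1-based codon numbers to the residue to emit at that codon,
# loosely modelled on a /transl_except qualifier. Unknown codons become 'X'.
sub translate_with_exceptions {
    my ($cds, $except) = @_;
    $except ||= {};
    $cds = uc $cds;
    $cds =~ tr/U/T/;                       # accept RNA input as well
    my $protein = '';
    my $ncodons = int(length($cds) / 3);   # ignore a trailing partial codon
    for my $n (1 .. $ncodons) {
        my $codon = substr($cds, 3 * ($n - 1), 3);
        my $aa = exists $except->{$n}       ? $except->{$n}
               : exists $codon2aa{$codon}   ? $codon2aa{$codon}
               :                              'X';
        $protein .= $aa;
    }
    return $protein;
}

my $cds = 'ATGTGAGGCTGA';                                    # ATG TGA GGC TGA
print translate_with_exceptions($cds), "\n";                # M*G*
print translate_with_exceptions($cds, { 2 => 'U' }), "\n";  # MUG*
```

Only the codon named in the exception hash changes; the terminal TGA is still read as a stop, which is the distinction a single codon table cannot express.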
Most of the time when I have seen 'U' present for a protein, it has been added for one of three reasons:

1) there is biochemical evidence to support the presence of selenocysteine,
2) the protein is homologous to one that is known to have 'U' (most common), or
3) the protein is suspected to have one based on prior searches for SECIS elements using various RNA motif search programs (http://en.wikipedia.org/wiki/SECIS_element). RNAMotif comes to mind, actually...

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of aaron.j.mackey at gsk.com
> Sent: Monday, August 14, 2006 2:11 PM
> To: Brian Osborne
> Cc: bioperl-l at lists.open-bio.org; Xianjun Dong; golharam at umdnj.edu
> Subject: Re: [Bioperl-l] PAML + Codeml problem..
>
> > > 1. For the case which in-frame stop codon codes for selenocysteine('U'),
> > > like the transcript ENSMUST00000094469, it should be translated into
> > > 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But
> > > when I use the codontable_id=1(generic codon table), it still was '*'.
> > > Is it because the package(Bio::Tools::CodonTable) is not so updated as
> > > the IUPAC rules?
>
> The translation of TGA into Selenocysteine (U) is not "universal", it only
> occurs when the downstream UTR contains a SECIS RNA element;
> Bio::Tools::CodonTable is unable to differentiate such
> selenocysteine-encoding TGA codons from "normal" TGA stop codons,
> regardless of the translation table in use. GenBank/EMBL-formatted
> records will typically have /transl_except entries in the feature table,
> but the BioPerl "translate" method does not (yet) recognize these (someone
> correct me if I'm wrong).
> > -Aaron > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Aug 14 16:58:36 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 15:58:36 -0500 Subject: [Bioperl-l] Planned New BioPerl release and Bio::DB::SeqFeature Message-ID: <001e01c6bfe4$69431310$15327e82@pyrimidine> Scott, Lincoln, et al, We are gearing up for another developer release (v. 1.5.2, with a target date for RC1 by Sept 15th). Sendu Bala is taking up the helm of Release Pumpkin for this one. We hope to continue points releases up to the next stable release (v1.6, which we would like to get out by summer 2007). We had some questions about Bio::DB::SeqFeature (for GFF3 support). Is the current implementation sufficiently stable for this release? We haven't heard much about it (besides the commit messages via Bioperl-guts) and didn't know what test files and test cases were available. Thanks! Christopher Fields Postdoctoral Researcher - Switzer Lab Dept. of Biochemistry University of Illinois Urbana-Champaign From osborne1 at optonline.net Mon Aug 14 17:48:31 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Mon, 14 Aug 2006 17:48:31 -0400 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: Message-ID: Xianjun, I spoke too soon. I'd assumed that NCBI had a table to handle selenocysteine, but it does not: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c These tables are the basis for the Bio::Tools::CodonTable module, and the CodonTable module looks to be up-to-date with respect to NCBI's page. You can solve your problem by making a custom table using the add_table() method, see t/CodonTable.t for a nice example. Your custom table will look something like the Euplotid Nuclear Code table, which translates TGA to C. 
You should be able to translate TGA to U since the amino acid codes that CodonTable inherits from Bio::SeqUtils contain "U" and "Sec". This is an issue that's independent of the issue raised by Aaron, I'm assuming you know whether or not your sequences should be translated this way. Brian O. On 8/14/06 3:11 PM, "aaron.j.mackey at gsk.com" wrote: >>> 1. For the case which in-frame stop codon codes for > selenocysteine('U'), >>> like the transcript ENSMUST00000094469, it should be translated into >>> 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But >>> when I use the codontable_id=1(generic codon table), it still was '*'. >>> Is it because the package(Bio::Tools::CodonTable) is not so updated as >>> the IUPAC rules? > > The translation of TGA into Selenocysteine (U) is not "universal", it only > occurs when the downstream UTR contains a SECIS RNA element; > Bio::Tools::CodonTable is unable to differentiate such > selenocysteine-encoding TGA codons from "normal" TGA stop codons, > regardless of the translation table in use. GenBank/EMBL-formatted > records will typically have /transl_except entries in the feature table, > but the BioPerl "translate" method does not (yet) recognize these (someone > correct me if I'm wrong). > > -Aaron > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Mon Aug 14 18:22:55 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Mon, 14 Aug 2006 17:22:55 -0500 Subject: [Bioperl-l] PAML + Codeml problem.. In-Reply-To: Message-ID: <000001c6bff0$343f2e90$15327e82@pyrimidine> Brian, Would having a custom codon table work? Since TGA->'U' requires a nearby SECIS element, theoretically a gene could have one 'TGA' codon that codes for 'U' (nearby SECIS element) and another 'TGA' codon that codes for the actual stop (no SECIS element). 
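
Since each TGA would have to be judged individually, a first step is simply to list where the in-frame stop codons sit in a CDS. The following is a minimal sketch in plain Perl; `inframe_stops` is a hypothetical helper, it assumes the standard-code stops (TAA, TAG, TGA) and reading frame 0, and it says nothing about whether a given TGA is really a selenocysteine codon.

```perl
use strict;
use warnings;

# Hypothetical helper: return [codon_number, codon] pairs for every
# in-frame stop codon (TAA, TAG, TGA) of a CDS read in frame 0.
# Codon numbers are 1-based; a trailing partial codon is ignored.
sub inframe_stops {
    my ($cds) = @_;
    $cds = uc $cds;
    my @hits;
    my $ncodons = int(length($cds) / 3);
    for my $n (1 .. $ncodons) {
        my $codon = substr($cds, 3 * ($n - 1), 3);
        push @hits, [$n, $codon] if $codon =~ /^(?:TAA|TAG|TGA)$/;
    }
    return @hits;
}

my $cds  = 'ATGTGAGGCTAA';                 # ATG TGA GGC TAA
my $last = int(length($cds) / 3);
for my $hit (inframe_stops($cds)) {
    my ($n, $codon) = @$hit;
    my $where = ($n == $last) ? 'terminal' : 'internal';
    print "codon $n: $codon ($where)\n";
}
```

A stop triplet that straddles two codons (for example the TGA inside ATT GAC) is not reported, because only whole codons in frame 0 are examined.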
I don't think there is a way to have position-specific TGA->U based on user-input either (a flag, perhaps). That's the only work-around for it I can think of. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Brian Osborne > Sent: Monday, August 14, 2006 4:49 PM > To: Xianjun Dong > Cc: bioperl-l at lists.open-bio.org; aaron.j.mackey at gsk.com; > golharam at umdnj.edu > Subject: Re: [Bioperl-l] PAML + Codeml problem.. > > Xianjun, > > I spoke too soon. I'd assumed that NCBI had a table to handle > selenocysteine, but it does not: > > http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c > > These tables are the basis for the Bio::Tools::CodonTable module, and the > CodonTable module looks to be up-to-date with respect to NCBI's page. You > can solve your problem by making a custom table using the add_table() > method, see t/CodonTable.t for a nice example. Your custom table will look > something like the Euplotid Nuclear Code table, which translates TGA to C. > You should be able to translate TGA to U since the amino acid codes that > CodonTable inherits from Bio::SeqUtils contain "U" and "Sec". > > This is an issue that's independent of the issue raised by Aaron, I'm > assuming you know whether or not your sequences should be translated this > way. > > Brian O. > > > On 8/14/06 3:11 PM, "aaron.j.mackey at gsk.com" > wrote: > > >>> 1. For the case which in-frame stop codon codes for > > selenocysteine('U'), > >>> like the transcript ENSMUST00000094469, it should be translated into > >>> 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But > >>> when I use the codontable_id=1(generic codon table), it still was '*'. > >>> Is it because the package(Bio::Tools::CodonTable) is not so updated as > >>> the IUPAC rules? 
> >
> > The translation of TGA into Selenocysteine (U) is not "universal", it only
> > occurs when the downstream UTR contains a SECIS RNA element;
> > Bio::Tools::CodonTable is unable to differentiate such
> > selenocysteine-encoding TGA codons from "normal" TGA stop codons,
> > regardless of the translation table in use. GenBank/EMBL-formatted
> > records will typically have /transl_except entries in the feature table,
> > but the BioPerl "translate" method does not (yet) recognize these (someone
> > correct me if I'm wrong).
> >
> > -Aaron

From osborne1 at optonline.net Mon Aug 14 19:00:41 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Mon, 14 Aug 2006 19:00:41 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <000001c6bff0$343f2e90$15327e82@pyrimidine>
Message-ID:

Chris,

As I said, I'm assuming Xianjun knows whether or not to translate a given ORF using his custom codon table. He may know this using any of the methods you mentioned previously, or he may have his own approach.

So yes, using CodonTable will work for the purpose that it's designed for. Finding your TGAs with and without adjacent SECIS elements, or pruning the protein sequence after a mistaken termination suppression, is a separate task.

Brian O.

On 8/14/06 6:22 PM, "Chris Fields" wrote:

> Brian,
>
> Would having a custom codon table work? Since TGA->'U' requires a nearby
> SECIS element, theoretically a gene could have one 'TGA' codon that codes
> for 'U' (nearby SECIS element) and another 'TGA' codon that codes for the
> actual stop (no SECIS element).
> > I don't think there is a way to have position-specific TGA->U based on > user-input either (a flag, perhaps). That's the only work-around for it I > can think of. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Brian Osborne >> Sent: Monday, August 14, 2006 4:49 PM >> To: Xianjun Dong >> Cc: bioperl-l at lists.open-bio.org; aaron.j.mackey at gsk.com; >> golharam at umdnj.edu >> Subject: Re: [Bioperl-l] PAML + Codeml problem.. >> >> Xianjun, >> >> I spoke too soon. I'd assumed that NCBI had a table to handle >> selenocysteine, but it does not: >> >> http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c >> >> These tables are the basis for the Bio::Tools::CodonTable module, and the >> CodonTable module looks to be up-to-date with respect to NCBI's page. You >> can solve your problem by making a custom table using the add_table() >> method, see t/CodonTable.t for a nice example. Your custom table will look >> something like the Euplotid Nuclear Code table, which translates TGA to C. >> You should be able to translate TGA to U since the amino acid codes that >> CodonTable inherits from Bio::SeqUtils contain "U" and "Sec". >> >> This is an issue that's independent of the issue raised by Aaron, I'm >> assuming you know whether or not your sequences should be translated this >> way. >> >> Brian O. >> >> >> On 8/14/06 3:11 PM, "aaron.j.mackey at gsk.com" >> wrote: >> >>>>> 1. For the case which in-frame stop codon codes for >>> selenocysteine('U'), >>>>> like the transcript ENSMUST00000094469, it should be translated into >>>>> 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But >>>>> when I use the codontable_id=1(generic codon table), it still was '*'. >>>>> Is it because the package(Bio::Tools::CodonTable) is not so updated as >>>>> the IUPAC rules? 
>>>
>>> The translation of TGA into Selenocysteine (U) is not "universal", it only
>>> occurs when the downstream UTR contains a SECIS RNA element;
>>> Bio::Tools::CodonTable is unable to differentiate such
>>> selenocysteine-encoding TGA codons from "normal" TGA stop codons,
>>> regardless of the translation table in use. GenBank/EMBL-formatted
>>> records will typically have /transl_except entries in the feature table,
>>> but the BioPerl "translate" method does not (yet) recognize these (someone
>>> correct me if I'm wrong).
>>>
>>> -Aaron
>

From golharam at umdnj.edu Mon Aug 14 20:24:42 2006
From: golharam at umdnj.edu (Ryan Golhar)
Date: Mon, 14 Aug 2006 20:24:42 -0400
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <1155571035.4343.129.camel@lauvtre.ii.uib.no>
Message-ID: <00b601c6c001$34837ad0$2f01a8c0@GOLHARMOBILE1>

Hi Xianjun,

1. It looks like a number of people have already responded to this one. My two cents - I think if you created a custom codon table and used that, it would be okay, but that doesn't solve your underlying problem. PAML will still complain because it does not know selenocysteine. It will see the generic stop codon and halt because of that. Unfortunately, there is nothing bioperl can do about it as PAML is an external package to bioperl.

2. You can remove the stop codons from the CDS sequence. Then, when you run PAML it should be okay. The filter procedure in the code is not designed to remove the in-frame stop codons; it is only designed to warn you about their presence. It will only remove the trailing stop codon, as PAML seems to be okay with this.

3. BioPerl (as I've learned) has Ka/Ks calculation routines built into the DNAStatistics module. You could use those. The PBL tool I am using was written by a colleague, and I will need to get permission from them before I can send it to you.

Ryan

-----Original Message-----
From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun Dong
Sent: Monday, August 14, 2006 11:57 AM
To: golharam at umdnj.edu
Cc: bioperl-l at lists.open-bio.org
Subject: Re: [Bioperl-l] PAML + Codeml problem..

Hi, Ryan and all other helpers,

I finally could run my script and solved the problem of codonTable. (I checked the DNA type -- mtDNA or nucleotide DNA -- first before I call translate). Thanks a lot for your help.

But I still have some questions:

1. For the case which in-frame stop codon codes for selenocysteine('U'), like the transcript ENSMUST00000094469, it should be translated into 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But when I use the codontable_id=1(generic codon table), it still was '*'. Is it because the package(Bio::Tools::CodonTable) is not so updated as the IUPAC rules?

2. Ryan, I still want to confirm one point for your sample code: Can I just directly remove the in-frame stop codons (both in the middle and in the tail) from the CDS sequence, and then get dna_aln by Clustalw, and then invoke run() on the Codeml package? I don't think the filter procedure in the sample code works very well.

3. What's more, there are two ways to get Ka/Ks through the PAML package:

    my $yn = new Bio::Tools::Run::Phylo::PAML::Yn00();

and

    my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
        ( -params => { 'runmode' => -2,
                       'seqtype' => 1,
                     } );

I checked the PODs for both of these modules. The default setting for Yn00() should be the same as the above Codeml setting. But the Ks output for the same sequences is much different.
For example, here is the output for the sequences below: [xianjund at lauvtre kaks]$ perl paml.pl seq.fa Yn00: Ka = 0.6267 Ks = 0.9160 Ka/Ks = 0.6841 Codeml: SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID ENSMUST00000094469 ENST00000361918 0.7419 1.6483 0.4501 47.62 55.16 Sequences are here: >ENSMUST00000094469 ATGAGCATCCTACTGTCGCCGCCGTCGCTGCTGCTGCTTCTTGCAGCCCTTGTGGCTCCA GCCACCTCCACCACCAACTACCGACCGGATTGGAACCGTCTTCGAGGCCTGGCCAGGGGG CGGGTGGAGACCTGTGGAGGACAGTTGAATCGCCTAAAGGAGGTGAAGGCCTTTGTCAAA GAAGCTCAGGTGCCCCCCGAGTACCTGTGGGCGCCCGCTAAGCCCCCCGAGGAAGCTTCA GAACACGACTGGCTGTGA >ENST00000361918 ATGAGCCTCCTGTTGCCTCCGCTGGCGCTGCTGCTGCTTCTCGCGGCGCTTGTGGCCCCA GAGCTCGTGCTGCTGGGCCGCCGCTACGAGGAACTAGAGCGCATCCCACTCAGTGAAATG ACCCGCGAAGAGATCAATGCGCTAGTGCAGGAGCTCGGCTTCTACCGCAAGGCGGCGCCC GACGCGCAGGTGCCCCCCGAGTACGTGTGGGCGCCCGCGAAGCCCCCAGAGGAAACTTCG GACCACGCTGACCTGTAG 4. BTW, could you share your method PBL with me? I want to learn more on how to overcome the overestimate synonymous rates cases. Thanks! -Xianjun On Thu, 2006-08-10 at 14:53 -0400, Ryan Golhar wrote': > Hi Xianjun, > > 1. The Bio::Seq::translate function (to my knowledge) only uses the > generic codon table. So, you will need to translate the DNA sequence > using some other method. In any case, even removing the *'s from the > protein sequence still leaves the stop codons in the DNA sequence > which must be removed. > > 2. The checks were written to assume that the sequences provided are > full-length coding sequences. That means the start and stop codon are > present as well. When the translate function is called, the stop > codon is translated as a '*'. The script initally just remove the * > from the end of the sequence and continued on. > > I added a check to see if there is a '*' in the middle of the sequence > because I found in some of my genes that there is in fact in-frame > stop codons which actually codes for selenocysteine. 
I see the warning check isn't working for some reason - odd, it worked when I wrote it.
>
> You can remove the *'s from the protein sequence, but you must also be sure to remove the corresponding codons from the DNA sequence before invoking run() on the Codeml package. I suppose someone could add a check to the script to remove the in-frame stop codons. Remember, the pairwise_kaks script is just a starting point (tutorial) to show you how you can go about performing this type of analysis.
>
> In fact, I've since switched from PAML to a different method, PBL, which a colleague coded. I found that PAML tends to overestimate synonymous rates in some cases.
>
> Let me know if this helps. If not, I'll try to clarify.
>
> Ryan
>
> -----Original Message-----
> From: Xianjun Dong [mailto:xianjun.dong at bccs.uib.no]
> Sent: Thursday, August 10, 2006 12:03 PM
> To: golharam at umdnj.edu
> Cc: bioperl-l at lists.open-bio.org
> Subject: RE: [Bioperl-l] PAML + Codeml problem..
>
> Hi, Ryan
>
> Thanks for your reply!
>
> But here I still have two questions about the sample code:
> 1. The translate() function of the Bio::Seq object uses the generic codon table, but for mitochondrial DNA (mtDNA) we should use a different codon table.
So, if we take the human transcript ENST00000361390 as > example, > > >ENST00000361390 _cDNA > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > AA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > > After translating with above function, the amino acid sequence is like > this, which contain *(stop codon) within the sequence(also at the end > of the sequence). But actually, this is a mtDNA, if we use different > codon table, the * within the sequence will change to 'W'(Trp). > (Because in vertebrate mitochondria "AGA" and "AGG" are also stop > codons, but not "UGA", which codes for tryptophan instead.) > >ENST00000361390 aa_beforefilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITL > YI > TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV > TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA > AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL > L*KNFLPLTLALLI*YVSIPITISSIPPQT* > > 2. 
My second question is: > If there are * both in the middle and end of the translated sequence > (with pattern AAAAAA*AAAAAAAAAAAAAAA*AAA*), like above case, after the > two checks for stop codon, all * will be filtered out. So, when > translate back from aa_aln to dna_aln, there should be no stop codon > included. But actually, when I track the program, it display that there > are still stop codon included. Here is the DNA alignment after recalling > the aa_to_dna_aln function. How to explain this? > > >ENST00000361390 aa_to_dna_aln > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > AA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC > CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT > > > I attached my script for two ortholog transcripts demo and the output > (including the error msg) here. Could you kindly check for me? > > Thanks! 
> > -Xianjun > > ///////////////////////////////////////////////////////////////////// > /////////////////////////////// output ////////////////////////////// > ///////////////////////////////////////////////////////////////////// > > [xianjund at lauvtre kaks]$ perl calculator.pl > >ENST00000361390 > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > AA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACCTC > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCTCC > AGCATTCCCCCTCAAACCTAA > >ENSMUST00000082392 > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAG > AA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > 
ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAGCG > GGAGTACCACCATACATATAG > > Calculate the Ka/Ks for ENSG00000198888 : ENSMUSG00000064341 ... > >ENSMUST00000082392 aa_beforefilter > VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISL > FI > IAPTLSLTLALSL*VPLPIPHPLINLNLGILFILATSSLSVYSIL*SG*ASNSKYSLFGALRAVAQTISYEV > TIAIILLSVLLINGSYSLQTLITTQEHI*LLLPA*PIAII*FISTLAETNRAPFDLTEGESELVSGFNVEYA > AGPFALFFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHL > L*KNFLPLTLALCM*HISLPIFTAGVPPYI* > >ENSMUST00000082392 aa_afterfilter > VFFINILTLLVPILIAIAFLTLVERKILGYIQLRKGPNIVGPYGILQPFADAIKLFIKEPIRPLTTSISL > FI > IAPTLSLTLALSLVPLPIPHPLINLNLGILFILATSSLSVYSILSGASNSKYSLFGALRAVAQTISYEVTIA > IILLSVLLINGSYSLQTLITTQEHILLLPAPIAIIFISTLAETNRAPFDLTEGESELVSGFNVEYAAGPFAL > FFIAEYTNIILINALTTIIFLGPLYYINLPELYSTNFIIEALLLSSTFLWIRASYPRFRYDQLIHLLKNFLP > LTLALCMHISLPIFTAGVPPYI > >ENST00000361390 aa_beforefilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITL > YI > TAPTLALTIALLL*TPLPIPNPLVNLNLGLLFILATSSLAVYSIL*SG*ASNSNYALIGALRAVAQTISYEV > TLAIILLSTLLISGSFNLSTLITTQEHL*LLLPS*PLAII*FISTLAETNRTPFDLAEGESELVSGFNIEYA > AGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFL*IRTAYPRFRYDQLIHL > L*KNFLPLTLALLI*YVSIPITISSIPPQT* > >ENST00000361390 aa_afterfilter > IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITL > YI > TAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLA > IILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFAL > FFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPL > TLALLIYVSIPITISSIPPQT > > Print out the DNA sequences translated back from 
aa_to_dna function: > >ENSMUST00000082392 aa_to_dna_aln > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGTAG > AA > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAACCA > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTATT > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAATT > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAGGA > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAGTA > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAACC > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAGAA > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACGCA > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTATT > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTCTA > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATCTT > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTT > >ENST00000361390 aa_to_dna_aln > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTACCG > AA > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAACCC > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACATC > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGGTC > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAGGG > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAGTC > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAACA > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAGAG > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACGCC > GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTACA > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCCTA > CTT---CTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACAC > 
CTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATT > > -------------------- WARNING --------------------- > MSG: There was an error - see error_string for the program output > --------------------------------------------------- > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > MSG: Unknown format of PAML output > STACK: Error::throw > STACK: > Bio::Root::Root::throw > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > STACK: > Bio::Tools::Phylo::PAML::_parse_summary > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > STACK: > Bio::Tools::Phylo::PAML::next_result > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > STACK: main::kaks_calculate calculator.pl:176 > STACK: calculator.pl:116 > > ///////////////////////////////////////////////////////////////////// > /////////////////////////////// script ////////////////////////////// > ///////////////////////////////////////////////////////////////////// > sub kaks_calculate > { > my %seqs=@_; > #my %seqs = %$seqs_ref; > my @prots; > > my $aln_factory = Bio::Tools::Run::Alignment::Clustalw->new > ('quiet'=>1); > > # process each sequence > for my $seqid (keys %seqs) > { > my $seq = $seqs{$seqid}; > my $protein =$seq->translate(); > my $pseq = $protein->seq(); > print ">$seqid aa_beforefilter \n$pseq\n"; > if( $pseq =~ /\*/ && $pseq !~ /\*$/ ) { > warn("provided a CDS sequence with a stop codon, PAML will > choke!"); > exit(0); > } > # Tcoffee can't handle '*' even if it is trailing > $pseq =~ s/\*//g; > print ">$seqid aa_afterfilter \n$pseq\n"; > $protein->seq($pseq); > push @prots, $protein; > } > > if( @prots < 2 ) { > warn("Need at least 2 CDS sequences to proceed"); > exit(0); > } > > # open(OUT, ">align_output.txt") || die("cannot open output > align_output for writing"); > # Align the sequences with clustalw > my $aa_aln = $aln_factory->align(\@prots); > # project the protein alignment back to CDS coordinates > my $dna_aln = aa_to_dna_aln($aa_aln, 
\%seqs); > > my @each = $dna_aln->each_seq(); > > print "\nPrint out the DNA sequences translated back from aa_to_dna > function:\n\n"; > foreach my $s ( $dna_aln->each_seq() ) { > print ">".$s->display_id." aa_to_dna_aln\n".$s->seq()."\n"; > } > > my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new > ( -params => { 'runmode' => -2, > 'seqtype' => 1, > } ); > > # set the alignment object > $kaks_factory->alignment($dna_aln); > > # run the KaKs analysis > my ($rc,$parser) = $kaks_factory->run(); > my $result = $parser->next_result; > my $MLmatrix = $result->get_MLmatrix(); > > my @otus = $result->get_seqs(); > # this gives us a mapping from the PAML order of sequences back to > # the input order (since names get truncated) > my @pos = map { > my $c= 1; > foreach my $s ( @each ) { > last if( $s->display_id eq $_->display_id ); > $c++; > } > $c; > } @otus; > > # print OUT join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID > CDNA_PERCENTID)),"\n"; > print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID > CDNA_PERCENTID)),"\n"; > for( my $i = 0; $i < (scalar @otus -1) ; $i++) { > for( my $j = $i+1; $j < (scalar @otus); $j++ ) { > my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]); > my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]); > # print OUT join("\t", $otus[$i]->display_id, > print join("\t", $otus[$i]->display_id, > $otus[$j]->display_id,$MLmatrix->[$i]->[$j]- > >{'dN'}, > $MLmatrix->[$i]->[$j]->{'dS'}, > $MLmatrix->[$i]->[$j]->{'omega'}, > sprintf("%.2f",$sub_aa_aln- > >percentage_identity), > sprintf("%.2f",$sub_dna_aln- > >percentage_identity), > ), "\n"; > } > } > > } > > > -------------------- WARNING --------------------- > MSG: There was an error - see error_string for the program output > --------------------------------------------------- > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > MSG: Unknown format of PAML output > STACK: Error::throw > STACK: > Bio::Root::Root::throw > 
/usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > STACK: > Bio::Tools::Phylo::PAML::_parse_summary > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > STACK: > Bio::Tools::Phylo::PAML::next_result > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > STACK: main::kaks_calculate calculator.pl:176 > STACK: calculator.pl:116 > ---------------------------------------------------------------- > > > > > On Mon, 2006-07-31 at 11:20 -0400, Ryan Golhar wrote: > > Hi Xianjun, > > > > I just did some work on this module including the example. > > > > >> it does not occur in the codon position > > >>(say, the third codon's position is not a times of 3). > > >>Why it effect the result? > > > > If I'm interpreting your question correctly, the stop codons in your > > sequence occur in-frame. This is why it is choking. > > > > >>So, when translate back from aa_aln to dna_aln, there should be no > > stop codon included. SO, why it can not pass? > > > > The Ka and Ks statistics are not calculated based on the protein > > sequence, they are calculated based on the DNA sequence. The protein > > sequence is used to provide a alignment for the codons of the DNA > > sequence. Checking the protein sequence for * is easier to identify > > in-frame stop codons than scanning the DNA sequence. > > > > The two checks for stop codons you mentioned are to check for stop > > codons within the sequence without worry for the last amino acid. The > > > second part remove the * at the end of the sequence (not the > > middle). > > > > If you want to remove the in-frame stop codons, you can, but do so > > before translating it to protein sequences. > > > > Ryan > > > > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Xianjun > > Dong > > Sent: Monday, July 31, 2006 7:56 AM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] PAML + Codeml problem.. 
> > > > > > Hi, > > > > I have a problem during running the Codeml Wiki-HOWTO code: > > > > Here is the error message: > > //////////////////////////////////////////////////////////////// > > [xianjund at lauvtre kaks]$ perl paml.pl test.fa > > > > -------------------- WARNING --------------------- > > MSG: There was an error - see error_string for the program output > > STACK Bio::Tools::Run::Phylo::PAML::Codeml::run > > /Home/extern/xianjund/src/bioperl/bioperl-run/Bio/Tools/Run/Phylo/PAML > > /C > > odeml.pm:581 > > STACK toplevel paml.pl:61 > > > > ------------- EXCEPTION: Bio::Root::NotImplemented ------------- > > MSG: Unknown format of PAML output > > STACK: Error::throw > > STACK: > > Bio::Root::Root::throw > > /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328 > > STACK: > > Bio::Tools::Phylo::PAML::_parse_summary > > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:359 > > STACK: > > Bio::Tools::Phylo::PAML::next_result > > /usr/lib/perl5/site_perl/5.8.5/Bio/Tools/Phylo/PAML.pm:224 > > STACK: paml.pl:62 > > ---------------------------------------------------------------- > > //////////////////////////////////////////////////////////////// > > > > My test sequence is: > > >ENST00000361390 > > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTTAC > > CG > > AA > > > CGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACAAC > CC > > > TTCGCTGACGCCATAAAACTCTTCACCAAAGAGCCCCTAAAACCCGCCACATCTACCATCACCCTCTACA > TC > > > ACCGCCCCGACCTTAGCTCTCACCATCGCTCTTCTACTATGAACCCCCCTCCCCATACCCAACCCCCTGG > TC > > > AACCTCAACCTAGGCCTCCTATTTATTCTAGCCACCTCTAGCCTAGCCGTTTACTCAATCCTCTGATCAG > GG > > > TGAGCATCAAACTCAAACTACGCCCTGATCGGCGCACTGCGAGCAGTAGCCCAAACAATCTCATATGAAG > TC > > > ACCCTAGCCATCATTCTACTATCAACATTACTAATAAGTGGCTCCTTTAACCTCTCCACCCTTATCACAA > CA > > > CAAGAACACCTCTGATTACTCCTGCCATCATGACCCTTGGCCATAATATGATTTATCTCCACACTAGCAG > AG > > > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACTAGTCTCAGGCTTCAACATCGAATACG > CC > > > 
GCAGGCCCCTTCGCCCTATTCTTCATAGCCGAATACACAAACATTATTATAATAAACACCCTCACCACTA > CA > > > ATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACATATTTTGTCACCAAGACCC > TA > > > CTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACC > TC > > > CTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATATGTCTCCATACCCATTACAATCT > CC > > AGCATTCCCCCTCAAACCTAA > > >ENSMUST00000082392 > > GTGTTCTTTATTAATATCCTAACACTCCTCGTCCCCATTCTAATCGCCATAGCCTTCCTAACATTAGT > > AG > > AA > > > CGCAAAATCTTAGGGTACATACAACTACGAAAAGGCCCTAACATTGTTGGTCCATACGGCATTTTACAAC > CA > > > TTTGCAGACGCCATAAAATTATTTATAAAAGAACCAATACGCCCTTTAACAACCTCTATATCCTTATTTA > TT > > > ATTGCACCTACCCTATCACTCACACTAGCATTAAGTCTATGAGTTCCCCTACCAATACCACACCCATTAA > TT > > > AATTTAAACCTAGGGATTTTATTTATTTTAGCAACATCTAGCCTATCAGTTTACTCCATTCTATGATCAG > GA > > > TGAGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGTAGCCCAAACAATTTCATATGAAG > TA > > > ACCATAGCTATTATCCTTTTATCAGTTCTATTAATAAATGGATCCTACTCTCTACAAACACTTATTACAA > CC > > > CAAGAACACATATGATTACTTCTGCCAGCCTGACCCATAGCCATAATATGATTTATCTCAACCCTAGCAG > AA > > > ACAAACCGGGCCCCCTTCGACCTGACAGAAGGAGAATCAGAATTAGTATCAGGGTTTAACGTAGAATACG > CA > > > GCCGGCCCATTCGCGTTATTCTTTATAGCAGAGTACACTAACATTATTCTAATAAACGCCCTAACAACTA > TT > > > ATCTTCCTAGGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACTAACTTCATAATAGAAGCTC > TA > > > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATC > TT > > > CTATGAAAAAACTTTCTACCCCTAACACTAGCATTATGTATGTGACATATTTCTTTACCAATTTTTACAG > CG > > GGAGTACCACCATACATATAG > > > > Sure, I checked it. There is some stop codon in it. If I replace it > > with non-stop codon, it works. 
> > > > For example, > > >ENST00000361390 > > ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCcaaTCGCAATGGCATTCCcaaTGCTTAC > > CG > > AA > > > CGAAAAATTCcaaGCTATATACAACTACGCAAAGGCCCCAACGTTGcaaGCCCCTACGGGCTACTACAAC > CC > > > TTCGCcaaCGCCAcaaAACTCTTCACCAAAGAGCCCCcaaAACCCGCCACATCTACCATCACCCTCTACA > TC > > > ACCGCCCCGACCTcaaCTCTCACCATCGCTCTTCTACTAcaaACCCCCCTCCCCATACCCAACCCCCTGG > TC > > > AACCTCAACCcaaGCCTCCTATTTATTCcaaCCACCTCcaaCCcaaCCGTTTACTCAATCCTCcaaTCAG > GG > > > caaGCATCAAACTCAAACTACGCCCcaaTCGGCGCACTGCGAGCAGcaaCCCAAACAATCTCATAcaaAG > TC > > > ACCCcaaCCATCATTCTACTATCAACATTACcaacaaGTGGCTCCTTcaaCCTCTCCACCCTTATCACAA > CA > > > CAAGAACACCTCcaaTTACTCCTGCCATCAcaaCCCTTGGCCAcaaTAcaaTTTATCTCCACACcaaCAG > AG > > > ACCAACCGAACCCCCTTCGACCTTGCCGAAGGGGAGTCCGAACcaaTCTCAGGCTTCAACATCGAATACG > CC > > > GCAGGCCCCTTCGCCCTATTCTTCAcaaCCGAATACACAAACATTATTAcaacaaACACCCTCACCACTA > CA > > > ATCTTCCcaaGAACAACATAcaaCGCACTCTCCCCcaaACTCTACACAACATATTTTGTCACCAAGACCC > TA > > > CTTCcaaCCTCCCTGTTCTTAcaaATTCGAACAGCATACCCCCGATTCCGCTACGACCAACTCATACACC > TC > > > CTAcaaAAAAACTTCCTACCACTCACCCcaaCATTACTTATAcaaTATGTCTCCATACCCATTACAATCT > CC > > AGCATTCCCCCTCAAACCcaa > > >ENSMUST00000082392 > > GTGTTCTTTATcaaTATCCcaaCACTCCTCGTCCCCATTCcaaTCGCCAcaaCCTTCCcaaCATcaac > > aa > > AA > > > CGCAAAATCTcaaGGTACATACAACTACGAAAAGGCCCcaaCATTGTTGGTCCATACGGCATTTTACAAC > CA > > > TTTGCAGACGCCAcaaAATTATTTAcaaAAGAACCAATACGCCCTTcaaCAACCTCTATATCCTTATTTA > TT > > > ATTGCACCTACCCTATCACTCACACcaaCATcaaGTCTAcaaGTTCCCCTACCAATACCACACCCATcaa > Tc > > > aaTTcaaACCcaaGGATTTTATTTATTTcaaCAACATCcaaCCTATCAGTTTACTCCATTCTAcaaTCAG > GA > > > caaGCCTCAAACTCCAAATACTCACTATTCGGAGCTTTACGAGCCGcaaCCCAAACAATTTCATAcaaAG > ca > > > aCCAcaaCTATTATCCTTTTATCAGTTCTATcaacaaATGGATCCTACTCTCTACAAACACTTATTACAA > CC > > > CAAGAACACATAcaaTTACTTCTGCCAGCCcaaCCCAcaaCCAcaaTAcaaTTTATCTCAACCCcaaCAG > AA > > > ACAAACCGGGCCCCCTTCGACCcaaCAGAAGGAGAATCAGAATcaaTATCAGGGTTcaaCGcaaAATACG > CA > > > GCCGGCCCATTCGCGTTATTCTTTAcaaCAGAGTACACcaaCATTATTCcaacaaACGCCCcaaCAACTA > TT > > > 
ATCTTCCcaaGACCCCTATACTATATCAATTTACCAGAACTCTACTCAACcaaCTTCAcaacaaAAGCTC > TA > > > CTACTATCATCAACATTCCTATGGATCCGAGCATCTTATCCACGCTTCCGTTACGATCAACTTATACATC > TT > > > CTAcaaAAAAACTTTCTACCCCcaaCACcaaCATTATGTATGcaaCATATTTCTTTACCAATTTTTACAG > CG > > GGAGTACCACCATACATAcaa > > > > But my question is: it does not occur in the codon position (say, > > the > > third codon's position is not a times of 3). Why it effect the result? > > > > And also there is code to filter out the stop codon in the sample > > code > > > (as the following shown) /////////////////////////////// > > if( $pseq =~ /\*/ && > > $pseq !~ /\*$/ ) { > > warn("provided a CDS sequence with a stop codon, PAML will > > choke!"); > > exit(0); > > } > > # Tcoffee can't handle '*' even if it is trailing > > $pseq =~ s/\*//g; > > ///////////////////////////// > > > > So, when translate back from aa_aln to dna_aln, there should be no > > stop codon included. SO, why it can not pass? > > > > Thanks for answer! > > > > P.S: attach my code here: > > ///////////////////////////////////////////////////////// > > #!/usr/bin/perl -w > > use strict; > > use Bio::Tools::Run::Phylo::PAML::Codeml; > > use Bio::Tools::Run::Alignment::Clustalw; > > > > # for projecting alignments from protein to R/DNA space > > use Bio::Align::Utilities qw(aa_to_dna_aln); > > # for input of the sequence data > > use Bio::SeqIO; > > use Bio::AlignIO; > > > > my $aln_factory = > > Bio::Tools::Run::Alignment::Clustalw->new('quiet'=>1); > > my $seqdata = shift || 'test.fa'; > > > > my $seqio = new Bio::SeqIO(-file => $seqdata, > > -format => 'fasta'); > > my %seqs; > > my @prots; > > # process each sequence > > while ( my $seq = $seqio->next_seq ) { > > $seqs{$seq->display_id} = $seq; > > # translate them into protein > > my $protein = $seq->translate(); > > my $pseq = $protein->seq(); > > if( $pseq =~ /\*/ && > > $pseq !~ /\*$/ ) { > > warn("provided a CDS sequence with a stop codon, PAML will > > choke!"); > > exit(0); > > } > > # Tcoffee can't 
handle '*' even if it is trailing
> >     $pseq =~ s/\*//g;
> >
> >     $protein->seq($pseq);
> >     push @prots, $protein;
> > }
> >
> > if( @prots < 2 ) {
> >     warn("Need at least 2 CDS sequences to proceed");
> >     exit(0);
> > }
> >
> > # open(OUT, ">align_output.txt") || die("cannot open output align_output for writing");
> > # Align the sequences with clustalw
> > my $aa_aln = $aln_factory->align(\@prots);
> > # project the protein alignment back to CDS coordinates
> > my $dna_aln = aa_to_dna_aln($aa_aln, \%seqs);
> >
> > my @each = $dna_aln->each_seq();
> >
> > my $kaks_factory = Bio::Tools::Run::Phylo::PAML::Codeml->new
> >     ( -params => { 'runmode' => -2,
> >                    'seqtype' => 1,
> >                  },
> >       -save_tempfiles => 1,
> >       -verbose => 1);
> >
> > # set the alignment object
> > $kaks_factory->alignment($dna_aln);
> >
> > # run the KaKs analysis
> > my ($rc,$parser) = $kaks_factory->run();
> > my $result = $parser->next_result;
> > my $MLmatrix = $result->get_MLmatrix();
> >
> > my @otus = $result->get_seqs();
> > # this gives us a mapping from the PAML order of sequences back to
> > # the input order (since names get truncated)
> > my @pos = map {
> >     my $c= 1;
> >     foreach my $s ( @each ) {
> >         last if( $s->display_id eq $_->display_id );
> >         $c++;
> >     }
> >     $c;
> > } @otus;
> >
> > print join("\t", qw(SEQ1 SEQ2 Ka Ks Ka/Ks PROT_PERCENTID CDNA_PERCENTID)),"\n";
> > for( my $i = 0; $i < (scalar @otus -1) ; $i++) {
> >     for( my $j = $i+1; $j < (scalar @otus); $j++ ) {
> >         my $sub_aa_aln = $aa_aln->select_noncont($pos[$i],$pos[$j]);
> >         my $sub_dna_aln = $dna_aln->select_noncont($pos[$i],$pos[$j]);
> >         print join("\t", $otus[$i]->display_id,
> >                    $otus[$j]->display_id,$MLmatrix->[$i]->[$j]->{'dN'},
> >                    $MLmatrix->[$i]->[$j]->{'dS'},
> >                    $MLmatrix->[$i]->[$j]->{'omega'},
> >                    sprintf("%.2f",$sub_aa_aln->percentage_identity),
> >                    sprintf("%.2f",$sub_dna_aln->percentage_identity),
> >                    ), "\n";
> >     }
> > }
> >
>
_______________________________________________ Bioperl-l mailing
list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l

From n.saunders at uq.edu.au Mon Aug 14 21:54:51 2006
From: n.saunders at uq.edu.au (Neil Saunders)
Date: Tue, 15 Aug 2006 11:54:51 +1000
Subject: [Bioperl-l] PAML + Codeml problem
Message-ID: <44E1296B.2030109@uq.edu.au>

Not a solution to the original post, but I thought it worth pointing out that as IUPAC have now adopted "O" for amino acid 22, pyrrolysine, any alphabetical character is now valid in a protein sequence:

B=D/N, J=I/L, O=pyrrolysine, U=selenocysteine, X=unknown, Z=E/Q

Leaving aside issues like how other software deals with this and how databases deal with translated stop codons (or not), I just wondered if any BioPerl modules require changes to reflect this?

Neil
--
School of Molecular and Microbial Sciences
University of Queensland
Brisbane 4072 Australia

http://nsaunders.wordpress.com

From cjfields at uiuc.edu Tue Aug 15 00:21:38 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 14 Aug 2006 23:21:38 -0500
Subject: [Bioperl-l] PAML + Codeml problem
In-Reply-To: <44E1296B.2030109@uq.edu.au>
Message-ID: <000001c6c022$4ebce0a0$15327e82@pyrimidine>

Neil,

In a previous post, Brian indicated that Bio::Tools::CodonTable inherits its amino acid codes from Bio::SeqUtils, and these include selenocysteine. They do not include 'J' or 'O', but the other ambiguous codes are all present.

'J' is an odd one; haven't seen that one used before. Is there a valid three-letter code for that?
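
For anyone wanting to check their own protein sequences against a tool's alphabet, a minimal pure-Perl sketch follows. It does not use the Bio::SeqUtils API; the helper name unsupported_residues and the alphabet string (the 20 standard residues plus B, U, X and Z, i.e. the set described above as lacking 'J' and 'O') are assumptions made purely for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return the one-letter residue
# codes in $protein that are missing from $supported, each listed once and
# sorted alphabetically. Stop ('*') and gap ('-') symbols are ignored.
sub unsupported_residues {
    my ($protein, $supported) = @_;
    my %ok = map { $_ => 1 } split //, uc $supported;
    my %seen;
    for my $aa (split //, uc $protein) {
        next if $aa eq '*' || $aa eq '-';
        $seen{$aa} = 1 unless $ok{$aa};
    }
    return sort keys %seen;
}

# Assumed alphabet: 20 standard residues plus B, U, X and Z (no J, no O).
my $alphabet = 'ACDEFGHIKLMNPQRSTVWY' . 'BUXZ';

# Made-up test sequence containing U, J and O.
for my $aa (unsupported_residues('MKUJLOPW*', $alphabet)) {
    print "$aa is not in the assumed alphabet\n";
}
```

With the assumed alphabet, the made-up sequence reports J and O; appending 'JO' to $alphabet makes the list empty.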
Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Neil Saunders
> Sent: Monday, August 14, 2006 8:55 PM
> To: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] PAML + Codeml problem
>
> Not a solution to the original post, but I thought it worth pointing out that as IUPAC have now adopted "O" for amino acid 22, pyrrolysine, any alphabetical character is now valid in a protein sequence:
>
> B=D/N, J=I/L, O=pyrrolysine, U=selenocysteine, X=unknown, Z=E/Q
>
> Leaving aside issues like how other software deals with this and how databases deal with translated stop codons (or not), I just wondered if any BioPerl modules require changes to reflect this?
>
> Neil
> --
> School of Molecular and Microbial Sciences
> University of Queensland
> Brisbane 4072 Australia
>
> http://nsaunders.wordpress.com
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From n.saunders at uq.edu.au Tue Aug 15 00:31:28 2006
From: n.saunders at uq.edu.au (Neil Saunders)
Date: Tue, 15 Aug 2006 14:31:28 +1000
Subject: [Bioperl-l] PAML + Codeml problem
In-Reply-To: <000001c6c022$4ebce0a0$15327e82@pyrimidine>
References: <000001c6c022$4ebce0a0$15327e82@pyrimidine>
Message-ID: <44E14E20.5050707@uq.edu.au>

Chris Fields wrote:
> 'J' is an odd one; haven't seen that one used before. Is there a valid three-letter code for that?

It was new to me too. Apparently it's used where I/L are indistinguishable in mass spectrometry. The 3-letter code is Xle.
As of October 2006, GenBank are making it legal:

http://www.bio.net/bionet/mm/genbankb/2006-June/000241.html

Neil
--
School of Molecular and Microbial Sciences
University of Queensland
Brisbane 4072 Australia

http://nsaunders.wordpress.com

From cjfields at uiuc.edu Tue Aug 15 00:48:18 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Mon, 14 Aug 2006 23:48:18 -0500
Subject: [Bioperl-l] PAML + Codeml problem
In-Reply-To: <44E14E20.5050707@uq.edu.au>
Message-ID: <000001c6c026$0735e2a0$15327e82@pyrimidine>

...And pyrrolysine is 'Pyl' (found it on the web). Wonder what they'll use when more naturally occurring amino acids are found! Maybe they'll do what NOAA does for hurricanes and revert to the Greek alphabet. ;>

I'll go ahead and add these to Bio::SeqUtils. We'll beat the GenBank incorporation date!

Chris

> -----Original Message-----
> From: Neil Saunders [mailto:n.saunders at uq.edu.au]
> Sent: Monday, August 14, 2006 11:31 PM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] PAML + Codeml problem
>
> Chris Fields wrote:
> > 'J' is an odd one; haven't seen that one used before. Is there a valid three-letter code for that?
>
> It was new to me too. Apparently it's used where I/L are indistinguishable in mass spectrometry. The 3-letter code is Xle. As of October 2006, GenBank are making it legal:
>
> http://www.bio.net/bionet/mm/genbankb/2006-June/000241.html
>
> Neil
> --
> School of Molecular and Microbial Sciences
> University of Queensland
> Brisbane 4072 Australia
>
> http://nsaunders.wordpress.com

From avilella at gmail.com Tue Aug 15 07:11:25 2006
From: avilella at gmail.com (Albert Vilella)
Date: Tue, 15 Aug 2006 12:11:25 +0100
Subject: [Bioperl-l] PAML + Codeml problem..
In-Reply-To: <000001c6bff0$343f2e90$15327e82@pyrimidine> References: <000001c6bff0$343f2e90$15327e82@pyrimidine> Message-ID: <1155640285.6019.37.camel@localhost> I added a couple of custom tables with selenocysteine in my local copy a while ago, but never committed the change to the CVS. Again, this doesn't solve how codeml deals with it, but one can always replace the selenocysteine triplets with "NNN" before running codeml... 207,209c207 < '', '', '', < 'Bacterial with selenocystein', # 19 < 'Standard with selenocystein', # 20 --- > '', '', '', '', 215,220d212 < # Bases at each position are: < < #-- Base1 TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG < #-- Base2 TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG < #-- Base3 TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG < 238,241c230,231 < '' '' < FFLLSSSSYY**CCUWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG < FFLLSSSSYY**CCUWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG < FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG --- > '' '' '' '' > FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNNKSSSSVVVVAAAADDEEGGGG 264,266c254 < '' '' < ---M---------------M------------MMMM---------------M------------ < ---M---------------M---------------M---------------------------- --- > '' '' '' '' 355,377d342 On Mon, 2006-08-14 at 17:22 -0500, Chris Fields wrote: > Brian, > > Would having a custom codon table work? Since TGA->'U' requires a nearby > SECIS element, theoretically a gene could have one 'TGA' codon that codes > for 'U' (nearby SECIS element) and another 'TGA' codon that codes for the > actual stop (no SECIS element). > > I don't think there is a way to have position-specific TGA->U based on > user-input either (a flag, perhaps). That's the only work-around for it I > can think of.
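The "NNN" workaround mentioned above can be sketched in plain Perl. This is only an illustration under loud assumptions: the helper name mask_internal_tga is hypothetical, the sequence is assumed to start in frame, and every in-frame TGA before the final codon is assumed to encode selenocysteine, which holds only for sequences already known to be selenoproteins.

```perl
use strict;
use warnings;

# Replace every in-frame TGA codon except the last complete codon with NNN.
# ASSUMPTION: the CDS starts in frame and all internal TGA codons are
# selenocysteine codons; a genuine premature stop would be masked too.
sub mask_internal_tga {
    my ($cds) = @_;
    my $len = length $cds;

    # Start offset of the last complete codon (the presumed real stop).
    my $last_codon_start = $len - ( $len % 3 ) - 3;

    my $masked = '';
    for ( my $i = 0 ; $i < $len ; $i += 3 ) {
        my $codon = substr( $cds, $i, 3 );
        if (   length($codon) == 3
            && uc($codon) eq 'TGA'
            && $i < $last_codon_start )
        {
            $codon = 'NNN';
        }
        $masked .= $codon;
    }
    return $masked;
}

# ATG TGA CCC TGA: the internal TGA is masked, the terminal one is kept.
print mask_internal_tga('ATGTGACCCTGA'), "\n";    # prints ATGNNNCCCTGA
```

Masking hides the codon from the analysis rather than translating it to U, so it sidesteps the SECIS question instead of answering it.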
> > Chris > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Brian Osborne > > Sent: Monday, August 14, 2006 4:49 PM > > To: Xianjun Dong > > Cc: bioperl-l at lists.open-bio.org; aaron.j.mackey at gsk.com; > > golharam at umdnj.edu > > Subject: Re: [Bioperl-l] PAML + Codeml problem.. > > > > Xianjun, > > > > I spoke too soon. I'd assumed that NCBI had a table to handle > > selenocysteine, but it does not: > > > > http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c > > > > These tables are the basis for the Bio::Tools::CodonTable module, and the > > CodonTable module looks to be up-to-date with respect to NCBI's page. You > > can solve your problem by making a custom table using the add_table() > > method, see t/CodonTable.t for a nice example. Your custom table will look > > something like the Euplotid Nuclear Code table, which translates TGA to C. > > You should be able to translate TGA to U since the amino acid codes that > > CodonTable inherits from Bio::SeqUtils contain "U" and "Sec". > > > > This is an issue that's independent of the issue raised by Aaron, I'm > > assuming you know whether or not your sequences should be translated this > > way. > > > > Brian O. > > > > > > On 8/14/06 3:11 PM, "aaron.j.mackey at gsk.com" > > wrote: > > > > >>> 1. For the case which in-frame stop codon codes for > > > selenocysteine('U'), > > >>> like the transcript ENSMUST00000094469, it should be translated into > > >>> 'U', not '*' since the IUPAC/IUBMB has officially recommended it. But > > >>> when I use the codontable_id=1(generic codon table), it still was '*'. > > >>> Is it because the package(Bio::Tools::CodonTable) is not so updated as > > >>> the IUPAC rules? 
> > > > > > The translation of TGA into Selenocysteine (U) is not "universal", it > > only > > > occurs when the downstream UTR contains a SECIS RNA element; > > > Bio::Tools::CodonTable is unable to differentiate such > > > selenocysteine-encoding TGA codons from "normal" TGA stop codons, > > > regardless of the translation table in use. GenBank/EMBL-formatted > > > records will typically have /transl_except entries in the feature table, > > > but the BioPerl "translate" method does not (yet) recognize these > > (someone > > > correct me if I'm wrong). > > > > > > -Aaron > > > > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From aaron.j.mackey at gsk.com Tue Aug 15 09:42:38 2006 From: aaron.j.mackey at gsk.com (aaron.j.mackey at gsk.com) Date: Tue, 15 Aug 2006 09:42:38 -0400 Subject: [Bioperl-l] PAML + Codeml problem In-Reply-To: <44E14E20.5050707@uq.edu.au> Message-ID: > > 'J' is an odd one; haven't seen that one used before. Is there a valid > > three-letter code for that? > > It was new to me too. Apparently it's used where I/L are > indistinguishable in > mass spectrometry. The 3-letter code is Xle. 
As of October 2006, > GenBank are > making it legal: > > http://www.bio.net/bionet/mm/genbankb/2006-June/000241.html For the curious, the appearance of J was first noted in IUPAC nomenclature committees in 1999: http://www.blackwell-synergy.com/doi/pdf/10.1046/j.1432-1327.1999.news99.x and had to do with ambiguous NMR (not mass-spec) signals (which is why Selenocysteine was awarded the U instead of the J). But yes, regardless of why I and L can't be distinguished, J is the ambiguity code for the pair. -Aaron From bix at sendu.me.uk Wed Aug 16 05:24:18 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 16 Aug 2006 10:24:18 +0100 Subject: [Bioperl-l] Release 1.5.2 In-Reply-To: <44E0398A.2060108@sendu.me.uk> References: <000a01c6ba60$fdf2c390$15327e82@pyrimidine> <44E0398A.2060108@sendu.me.uk> Message-ID: <44E2E442.30102@sendu.me.uk> Sendu Bala wrote: http://www.bioperl.org/wiki/Release_Schedule > I think the list has stabilized a little now. We ought to get the show > on the road, so unless someone with more experience has the time, I'll > offer to be release pumpkin for 1.5.2. > > Once the pumpkin has been determined we can press on. Well it seems like I'm the pumpkin :) I've set up a wiki page specifically to track what's going on with 1.5.2. Please use this mailing list or the discussion part of that wiki page for questions and discussion, whilst keeping the main part of the page for actual proposals and decisions. http://www.bioperl.org/wiki/Release_1.5.2 I'm very interested in having a wide range of testers. Please 'sign up' on the wiki page (or by contacting me directly) if you're willing to help out with that, or anything else. I'd also like to specifically call for some volunteers to look through the module documentation; you don't even need to improve/correct it yourself: just having people identify weak points would be extremely valuable.
Thank you, Sendu From freimuth at pathology.wustl.edu Wed Aug 16 15:56:37 2006 From: freimuth at pathology.wustl.edu (Freimuth, Robert) Date: Wed, 16 Aug 2006 14:56:37 -0500 Subject: [Bioperl-l] Error parsing BLAST report Message-ID: <71AE766382153B47AAB638DC83ED7F49014CE862@pathexch1.wusm-path.wustl.edu> Hello, I'm trying to parse a BLAST report using the following code: use warnings; use strict; use Bio::SearchIO; my $file = 'NP_006065_blast.out'; my $searchio = new Bio::SearchIO( -format => 'blast', -file => $file ); while( my $result = $searchio->next_result() ) { while( my $hit = $result->next_hit ) { my $hit_acc_num = $hit->accession(); # get the total length of the aligned region for query or sbjct seq # (includes all HSPs, calculated after tiling) my $align_len = $hit->length_aln( 'query' ); print "Alignment length for $hit_acc_num is $align_len\n"; } } There are 104 one-line descriptions in the report, and alignments for each one of them (the blast report was created using b_num_alignments_shown => 500 and v_num_descriptions_shown => 500). However, when I run the above code I get 14 errors like the following: -------------------- WARNING --------------------- MSG: There is no HSP data for hit 'ENSP00000327738'. You have called a method (Bio::Search::Hit::GenericHit::length_aln) that requires HSP data and there was no HSP data for this hit, most likely because it was absent from the BLAST report. Note that by default, BLAST lists alignments for the first 250 hits, but it lists descriptions for 500 hits. If this is the case, and you care about these hits, you should re-run BLAST using the -b option (or equivalent if not using blastall) to increase the number of alignments. --------------------------------------------------- There is an alignment for this (and the other 13 sequences) in the report. In fact, if I edit the report and delete all but the description and the alignment for ENSP00000327738, it parses fine (no error). 
I continued editing the report and produced the following minimal test case that reproduces the error. Note that the description for ENSP00000350182 appears twice, BUT THE ERROR IS FOR ENSP00000327738. *********** BLAST REPORT FOR TEST CASE *********** BLASTP 2.2.11 [Jun-05-2005] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= NP_006065 (442 letters) Database: Homo_sapiens.NCBI36.apr.pep.fa 48,851 sequences; 23,910,368 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... 120 3e-27 ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... 120 3e-27 ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189... 
115 8e-26 >ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:ENSG00000137397 transcript:ENST00000357569 Length = 425 Score = 120 bits (301), Expect = 3e-27 Identities = 76/261 (29%), Positives = 140/261 (53%), Gaps = 21/261 (8%) Query: 9 IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL 68 +++EV CPICL++L +P+++DCGH+FC CIT +I E+ S G CP+C+T + + Sbjct: 10 LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI 65 Query: 69 RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG 127 R N L N+VE+++ ++ S Q +K C H + FC++DGK +C+VC S++H+ Sbjct: 66 RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS 125 Query: 128 HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK 181 H I E + Q ++Q +Q L ++++E +++ D+ ++ + E+Q+IL Sbjct: 126 HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT 183 Query: 182 GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL 237 F + +L+ E+ L ++ EG +A+ QL D L+ L+ + Sbjct: 184 EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ 239 Query: 238 TGSSVEMLQDVIDVMKRSESW 258 ++L+ + RSE + Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260 >ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189305643:1 gene:ENSG00000184108 transcript:ENST00000332517 CCDS3851.1 Length = 468 Score = 115 bits (289), Expect = 8e-26 Identities = 101/410 (24%), Positives = 180/410 (43%), Gaps = 39/410 (9%) Query: 8 DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN 67 ++ +E+TC ICL+ + P++ +CGHSFC C+ +E SCP C + + Sbjct: 9 NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH 64 Query: 68 LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS 122 + N L + ++++ Q Q D +G+ K ++ G ++ Sbjct: 65 FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA 117 Query: 123 QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER 176 Q H ++ +E + +EKLQ L L +EA EK + QE T K + Sbjct: 118 QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK 175 Query: 177 
QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR 236 Q ++ + +M L EEQ +LQ LE+ E + L +L QQ + S +I+ ++ Sbjct: 176 QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS 235 Query: 237 LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW 296 S+ E L++V ++RSE + ++GM ++L++ + Sbjct: 236 SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------ 285 Query: 297 VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV 355 ++ L+P +A + + +S D + VK + NP F + V G Q F+SG++YWEV Sbjct: 286 TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV 345 Query: 356 DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405 +V K W +G+ S + P +S + + Y WV Sbjct: 346 EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391 Database: Homo_sapiens.NCBI36.apr.pep.fa Posted date: Jun 15, 2006 8:56 PM Number of letters in database: 23,910,368 Number of sequences in database: 48,851 Lambda K H 0.319 0.133 0.398 Gapped Lambda K H 0.267 0.0410 0.140 Matrix: BLOSUM62 Gap Penalties: Existence: 11, Extension: 1 Number of Hits to DB: 20,900,506 Number of Sequences: 48851 Number of extensions: 899179 Number of successful extensions: 6075 Number of sequences better than 1.0e-25: 105 Number of HSP's better than 0.0 without gapping: 18 Number of HSP's successfully gapped in prelim test: 87 Number of HSP's that attempted gapping in prelim test: 5632 Number of HSP's gapped (non-prelim): 157 length of query: 442 length of database: 23,910,368 effective HSP length: 107 effective length of query: 335 effective length of database: 18,683,311 effective search space: 6258909185 effective search space used: 6258909185 T: 11 A: 40 X1: 16 ( 7.4 bits) X2: 38 (14.6 bits) X3: 64 (24.7 bits) S1: 41 (21.8 bits) S2: 289 (115.9 bits) *********** END BLAST REPORT FOR TEST CASE *********** Any ideas? 
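Since the test case hinges on a description line that appears twice, a quick way to find such reports is to scan the one-line description table for repeated hit IDs. The following is a minimal sketch in plain Perl that does not use Bio::SearchIO; the function name duplicate_description_ids is hypothetical, and it assumes the plain-text blastall layout shown above (a "Sequences producing significant alignments" header, then one description per line, then alignment blocks starting with ">"). It only detects the duplication; it does not establish that the duplication is what confuses the parser.

```perl
use strict;
use warnings;

# Return the hit IDs that occur more than once in the one-line
# description table of a plain-text BLAST report.
sub duplicate_description_ids {
    my (@lines) = @_;
    my ( %seen, @dups );
    my $in_table = 0;
    for my $line (@lines) {
        if ( $line =~ /^Sequences producing significant alignments/ ) {
            $in_table = 1;
            next;
        }
        next unless $in_table;
        last if $line =~ /^>/;                 # first alignment block ends the table
        next unless $line =~ /^(\S+)\s/;       # skip blank lines
        push @dups, $1 if ++$seen{$1} == 2;    # report each duplicate once
    }
    return @dups;
}

# Abbreviated copy of the description table from the test case.
my @report = (
    'Sequences producing significant alignments:   (bits) Value',
    '',
    'ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... 120 3e-27',
    'ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... 120 3e-27',
    'ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189... 115 8e-26',
    '',
    '>ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1',
);

print "Duplicated description: $_\n" for duplicate_description_ids(@report);
```

To run it on a real report, pass the lines read from the file instead of the inline @report array.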
Thanks, Bob From osborne1 at optonline.net Wed Aug 16 19:20:56 2006 From: osborne1 at optonline.net (Brian Osborne) Date: Wed, 16 Aug 2006 19:20:56 -0400 Subject: [Bioperl-l] Error parsing BLAST report In-Reply-To: <71AE766382153B47AAB638DC83ED7F49014CE862@pathexch1.wusm-path.wustl.edu> Message-ID: Robert, The standard answer to a complaint about SearchIO these days is to upgrade to version 1.5.1 - what Bioperl version are you using? Brian O. On 8/16/06 3:56 PM, "Freimuth, Robert" wrote: > Hello, > > I'm trying to parse a BLAST report using the following code: > > use warnings; > use strict; > > use Bio::SearchIO; > > my $file = 'NP_006065_blast.out'; > > my $searchio = new Bio::SearchIO( -format => 'blast', > -file => $file ); > > while( my $result = $searchio->next_result() ) > { > while( my $hit = $result->next_hit ) > { > my $hit_acc_num = $hit->accession(); > > # get the total length of the aligned region for query or > sbjct seq > # (includes all HSPs, calculated after tiling) > > my $align_len = $hit->length_aln( 'query' ); > > print "Alignment length for $hit_acc_num is $align_len\n"; > } > } > > There are 104 one-line descriptions in the report, and alignments for > each one of them (the blast report was created using > b_num_alignments_shown => 500 and v_num_descriptions_shown => 500). > However, when I run the above code I get 14 errors like the following: > > -------------------- WARNING --------------------- > MSG: There is no HSP data for hit 'ENSP00000327738'. > You have called a method (Bio::Search::Hit::GenericHit::length_aln) > that requires HSP data and there was no HSP data for this hit, > most likely because it was absent from the BLAST report. > Note that by default, BLAST lists alignments for the first 250 hits, > but it lists descriptions for 500 hits. If this is the case, > and you care about these hits, you should re-run BLAST using the > -b option (or equivalent if not using blastall) to increase the number > of alignments. 
> > [...] > >
Any ideas? > > Thanks, > > Bob > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From mblanche at berkeley.edu Wed Aug 16 22:59:04 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Wed, 16 Aug 2006 19:59:04 -0700 Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF Message-ID: Dear all, I am desperately trying to get a list of gene coordinates from a MySQL database version of the fly genome populated using the Bio::DB::GFF module. I have a list of 277 IDs in a text file that, when parsed through the following script, returns 279 entries (2 more entries than the number of genes in the starting list). Here is the script: use Bio::DB::GFF; my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', -dsn => 'dbi:mysql:database=dmel_43_new'); while (<>){ chomp; my @feat = $db->get_feature_by_name($_); for my $f (@feat){ if ($f->type->method eq 'gene'){ print "Name: ", $f->name, " Strand: ", $f->strand, " Start: ", $f->start, " End: ", $f->end, "\n"; } } } I totally don't understand where the 2 extra entries are coming from. Nothing differentiates them from each other. Moreover, when I double check the MySQL database, both genes have only a single 'gene' entry in the fdata table. Is there a bug in the way I am trying to fetch the individual genes, or is something wrong with the latest Bio::DB::GFF module from the CVS repository? Here is a test script and its output that I am using to try to track down what the problem is.
Hope this could help: use Bio::DB::GFF; my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', -dsn => 'dbi:mysql:database=dmel_43_new'); my %dups; my ($j, $i) =0; while (<>){ chomp; my $id = $_; my @feat = $db->get_feature_by_name($id); my $feat_size = $#feat; $j++ if $feat_size == 2; for my $f (@feat){ $i++; if (exists $dups{$f->group} && $f->type->method eq 'gene'){ print "Calling >>>", $f->group, " ID=", $i, " from \@feat of size $feat_size", "\n"; print "Chr: ", $f->refseq, " Strand: ", $f->strand, " Start: ", $f->start, " End: ", $f->end, "\n"; print "Offending >>>", $dups{$f->group}->[0]->group, " ID=", $dups{$f->group}->[1], "\n"; print "Chr: ", $dups{$f->group}->[0]->refseq, " Strand: ", $dups{$f->group}->[0]->strand, " Start: ", $dups{$f->group}->[0]->start, " End: ", $dups{$f->group}->[0]->end; print "\n\n"; } elsif ($f->type->method eq 'gene') { $dups{$f->group} = [$f, $i]; } } } print "#### there was $j \@feat with only 2 features\n"; Output of the test script: $ perl test.pl hrp36_targets.txt Calling >>>FBgn0025803 ID=98 from @feat of size 2 Chr: 3R Strand: 1 Start: 16966463 End: 17038413 Offending >>>FBgn0025803 ID=97 Chr: 3R Strand: 1 Start: 16966463 End: 17038413 Calling >>>FBgn0025681 ID=304 from @feat of size 2 Chr: 2L Strand: 1 Start: 2992964 End: 2998614 Offending >>>FBgn0025681 ID=303 Chr: 2L Strand: 1 Start: 2992964 End: 2998614 #### there was 11 @feat with only 2 features With the hope someone can find out the problem... Cheers, Marco ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. 
Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From cjfields at uiuc.edu Wed Aug 16 23:12:46 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 16 Aug 2006 22:12:46 -0500 Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF In-Reply-To: References: Message-ID: Marco, I got your earlier email as well. Have you posted this over on the GMOD-Gbrowse group as well? I can't work out the problem myself. It's very odd. Chris On Aug 16, 2006, at 9:59 PM, Marco Blanchette wrote: > Dear all, > > I am desperately trying to get a list of gene coordinates from a MySQL > database version of the fly genome populated using the Bio::DB::GFF > module. > I have a list of 277 IDs in a text file that, when parsed through the > following script, returns 279 entries (2 more entries than the number > of genes > in the starting list). > > Here is the script: > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => > 'dbi:mysql:database=dmel_43_new'); > while (<>){ > chomp; > my @feat = $db->get_feature_by_name($_); > for my $f (@feat){ > if ($f->type->method eq 'gene'){ > print "Name: ", $f->name, > " Strand: ", $f->strand, > " Start: ", $f->start, > " End: ", $f->end, > "\n"; > } > } > } > > I totally don't understand where the 2 extra entries are coming from. > Nothing differentiates them from each other. Moreover, when I double > check > the MySQL database, both genes have only a single 'gene' > entry in the > fdata table. > > Is there a bug in the way I am trying to fetch the individual genes, or > is something wrong with the latest Bio::DB::GFF module from the CVS > repository? > > Here is a test script and its output that I am using to try to > track down > what the problem is.
Hope this could help: > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => > 'dbi:mysql:database=dmel_43_new'); > my %dups; > my ($j, $i) =0; > while (<>){ > chomp; > my $id = $_; > my @feat = $db->get_feature_by_name($id); > my $feat_size = $#feat; > $j++ if $feat_size == 2; > > for my $f (@feat){ > $i++; > > if (exists $dups{$f->group} && $f->type->method eq 'gene'){ > print "Calling >>>", $f->group, > " ID=", $i, > " from \@feat of size $feat_size", > "\n"; > print "Chr: ", $f->refseq, > " Strand: ", $f->strand, > " Start: ", $f->start, > " End: ", $f->end, > "\n"; > print "Offending >>>", $dups{$f->group}->[0]->group, > " ID=", $dups{$f->group}->[1], "\n"; > print "Chr: ", $dups{$f->group}->[0]->refseq, > " Strand: ", $dups{$f->group}->[0]->strand, > " Start: ", $dups{$f->group}->[0]->start, > " End: ", $dups{$f->group}->[0]->end; > print "\n\n"; > } elsif ($f->type->method eq 'gene') { > $dups{$f->group} = [$f, $i]; > } > } > } > > print "#### there was $j \@feat with only 2 features\n"; > > Output of the test script: > > $ perl test.pl hrp36_targets.txt > Calling >>>FBgn0025803 ID=98 from @feat of size 2 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > Offending >>>FBgn0025803 ID=97 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > Calling >>>FBgn0025681 ID=304 from @feat of size 2 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > Offending >>>FBgn0025681 ID=303 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > #### there was 11 @feat with only 2 features > > With the hope someone can find out the problem... > > Cheers, > > Marco > > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. 
Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 > -- > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cain.cshl at gmail.com Wed Aug 16 23:30:47 2006 From: cain.cshl at gmail.com (Scott Cain) Date: Wed, 16 Aug 2006 23:30:47 -0400 Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF In-Reply-To: References: Message-ID: <1155785447.2596.8.camel@dhcpvisitor217149.slac.stanford.edu> Hi Marco, I'm working on it right now--my first guess (without doing any real work), I'm betting on the problem being an incompatibility between the GFF3 file and the Bio::DB::GFF schema. Scott On Wed, 2006-08-16 at 19:59 -0700, Marco Blanchette wrote: > Dear all, > > I am desperately trying to get a list of gene coordinates from a MySQL > database version of the fly genome populated using the Bio::DB::GFF module. > I have a list of 277 IDs in a text file that, when parsed through the > following script, returns 279 entries (2 more entries than the number of genes > in the starting list). > > Here is the script: > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => 'dbi:mysql:database=dmel_43_new'); > while (<>){ > chomp; > my @feat = $db->get_feature_by_name($_); > for my $f (@feat){ > if ($f->type->method eq 'gene'){ > print "Name: ", $f->name, > " Strand: ", $f->strand, > " Start: ", $f->start, > " End: ", $f->end, > "\n"; > } > } > } > > I totally don't understand where the 2 extra entries are coming from. > Nothing differentiates them from each other. Moreover, when I double check > the MySQL database, both genes have only a single 'gene'
entry in the > fdata table. > > Is there a bug in the way I am trying to fetch the individual genes, or > is something wrong with the latest Bio::DB::GFF module from the CVS > repository? > > Here is a test script and its output that I am using to try to track down > what the problem is. Hope this could help: > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > -dsn => 'dbi:mysql:database=dmel_43_new'); > my %dups; > my ($j, $i) =0; > while (<>){ > chomp; > my $id = $_; > my @feat = $db->get_feature_by_name($id); > my $feat_size = $#feat; > $j++ if $feat_size == 2; > > for my $f (@feat){ > $i++; > > if (exists $dups{$f->group} && $f->type->method eq 'gene'){ > print "Calling >>>", $f->group, > " ID=", $i, > " from \@feat of size $feat_size", > "\n"; > print "Chr: ", $f->refseq, > " Strand: ", $f->strand, > " Start: ", $f->start, > " End: ", $f->end, > "\n"; > print "Offending >>>", $dups{$f->group}->[0]->group, > " ID=", $dups{$f->group}->[1], "\n"; > print "Chr: ", $dups{$f->group}->[0]->refseq, > " Strand: ", $dups{$f->group}->[0]->strand, > " Start: ", $dups{$f->group}->[0]->start, > " End: ", $dups{$f->group}->[0]->end; > print "\n\n"; > } elsif ($f->type->method eq 'gene') { > $dups{$f->group} = [$f, $i]; > } > } > } > > print "#### there was $j \@feat with only 2 features\n"; > > Output of the test script: > > $ perl test.pl hrp36_targets.txt > Calling >>>FBgn0025803 ID=98 from @feat of size 2 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > Offending >>>FBgn0025803 ID=97 > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > Calling >>>FBgn0025681 ID=304 from @feat of size 2 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > Offending >>>FBgn0025681 ID=303 > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > #### there was 11 @feat with only 2 features > > With the hope someone can find out the problem... > > Cheers, > > Marco > > ______________________________ > Marco Blanchette, Ph.D.
> > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 -- ------------------------------------------------------------------------ Scott Cain, Ph. D. cain.cshl at gmail.com GMOD Coordinator (http://www.gmod.org/) 216-392-3087 Cold Spring Harbor Laboratory -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: This is a digitally signed message part Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060816/779d9754/attachment-0001.bin From freimuth at pathology.wustl.edu Thu Aug 17 00:42:19 2006 From: freimuth at pathology.wustl.edu (Freimuth, Robert) Date: Wed, 16 Aug 2006 23:42:19 -0500 Subject: [Bioperl-l] Error parsing BLAST report Message-ID: <71AE766382153B47AAB638DC83ED7F49014CE8CA@pathexch1.wusm-path.wustl.edu> Hi, Thank you for your reply. I downloaded bioperl-1.5.1 from http://bioperl.org/DIST/ and installed it (which appeared successful), but the one-liner: perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION, "\n"' prints 1.5 (I expected 1.5.1). When I run the test case that I reported earlier, I get the following output: -------------------- WARNING --------------------- MSG: There is no HSP data for hit 'ENSP00000327738'. You have called a method (Bio::Search::Hit::GenericHit::length_aln) that requires HSP data and there was no HSP data for this hit, most likely because it was absent from the BLAST report. Note that by default, BLAST lists alignments for the first 250 hits, but it lists descriptions for 500 hits. If this is the case, and you care about these hits, you should re-run BLAST using the -b option (or equivalent if not using blastall) to increase the number of alignments. 
--------------------------------------------------- Alignment length for ENSP00000327738 is - Alignment length for ENSP00000350182 is 250 Alignment length for ENSP00000327738 is 398 Could someone that is running 1.5.1 please verify the output of the one-liner above (did I somehow get the wrong file from the ftp site?) and try to reproduce the error with the test case? Thanks for the help. I'm stumped. Bob > -----Original Message----- > From: Brian Osborne [mailto:osborne1 at optonline.net] > Sent: Wednesday, August 16, 2006 6:21 PM > To: Freimuth, Robert; bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Error parsing BLAST report > > Robert, > > The standard answer to a complaint about SearchIO these days > is to upgrade > to version 1.5.1 - what Bioperl version are you using? > > Brian O. > > > On 8/16/06 3:56 PM, "Freimuth, Robert" > wrote: > > > Hello, > > > > I'm trying to parse a BLAST report using the following code: > > > > use warnings; > > use strict; > > > > use Bio::SearchIO; > > > > my $file = 'NP_006065_blast.out'; > > > > my $searchio = new Bio::SearchIO( -format => 'blast', > > -file => $file ); > > > > while( my $result = $searchio->next_result() ) > > { > > while( my $hit = $result->next_hit ) > > { > > my $hit_acc_num = $hit->accession(); > > > > # get the total length of the aligned region > for query or > > sbjct seq > > # (includes all HSPs, calculated after tiling) > > > > my $align_len = $hit->length_aln( 'query' ); > > > > print "Alignment length for $hit_acc_num is > $align_len\n"; > > } > > } > > > > There are 104 one-line descriptions in the report, and > alignments for > > each one of them (the blast report was created using > > b_num_alignments_shown => 500 and v_num_descriptions_shown => 500). > > However, when I run the above code I get 14 errors like the > following: > > > > -------------------- WARNING --------------------- > > MSG: There is no HSP data for hit 'ENSP00000327738'. 
> > You have called a method (Bio::Search::Hit::GenericHit::length_aln) > > that requires HSP data and there was no HSP data for this hit, > > most likely because it was absent from the BLAST report. > > Note that by default, BLAST lists alignments for the first 250 hits, > > but it lists descriptions for 500 hits. If this is the case, > > and you care about these hits, you should re-run BLAST using the > > -b option (or equivalent if not using blastall) to increase > the number > > of alignments. > > > > --------------------------------------------------- > > > > There is an alignment for this (and the other 13 sequences) in the > > report. In fact, if I edit the report and delete all but the > > description and the alignment for ENSP00000327738, it > parses fine (no > > error). > > > > I continued editing the report and produced the following > minimal test > > case that reproduces the error. Note that the description for > > ENSP00000350182 appears twice, BUT THE ERROR IS FOR ENSP00000327738. > > > > *********** BLAST REPORT FOR TEST CASE *********** > > > > BLASTP 2.2.11 [Jun-05-2005] > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. > > Schaffer, > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), > > "Gapped BLAST and PSI-BLAST: a new generation of protein > database search > > programs", Nucleic Acids Res. 25:3389-3402. > > > > Query= NP_006065 > > (442 letters) > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > 48,851 sequences; 23,910,368 total letters > > > > Searching..................................................done > > > > > Score > > E > > Sequences producing significant alignments: > (bits) > > Value > > > > ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... > > 120 3e-27 > > ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... > > 120 3e-27 > > ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189... 
> > 115 8e-26 > > > >> ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 > > gene:ENSG00000137397 > > transcript:ENST00000357569 > > Length = 425 > > > > Score = 120 bits (301), Expect = 3e-27 > > Identities = 76/261 (29%), Positives = 140/261 (53%), Gaps = 21/261 > > (8%) > > > > Query: 9 > IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL > > 68 > > +++EV CPICL++L +P+++DCGH+FC CIT +I E+ S G > CP+C+T + + > > Sbjct: 10 > LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI > > 65 > > > > Query: 69 > RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG > > 127 > > R N L N+VE+++ ++ S Q +K C H + FC++DGK > +C+VC S++H+ > > Sbjct: 66 > RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS > > 125 > > > > Query: 128 > HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK > > 181 > > H I E + Q ++Q +Q L ++++E +++ D+ ++ > + E+Q+IL > > Sbjct: 126 > HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT > > 183 > > > > Query: 182 > GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL > > 237 > > F + +L+ E+ L ++ EG +A+ QL D > L+ L+ + > > Sbjct: 184 > EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ > > 239 > > > > Query: 238 TGSSVEMLQDVIDVMKRSESW 258 > > ++L+ + RSE + > > Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260 > > > > > >> ENSP00000327738 pep:known-ccds > > chromosome:NCBI36:4:189297592:189305643:1 > > gene:ENSG00000184108 transcript:ENST00000332517 > > CCDS3851.1 > > Length = 468 > > > > Score = 115 bits (289), Expect = 8e-26 > > Identities = 101/410 (24%), Positives = 180/410 (43%), > Gaps = 39/410 > > (9%) > > > > Query: 8 > DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN > > 67 > > ++ +E+TC ICL+ + P++ +CGHSFC C+ +E > SCP C + + > > Sbjct: 9 > NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH > > 64 > > > > Query: 68 > LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS > > 122 > > + N L + ++++ Q Q D +G+ K ++ > G ++ > > Sbjct: 65 > 
FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA > > 117 > > > > Query: 123 > QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER > > 176 > > Q H ++ +E + +EKLQ L L +EA EK > + QE T K + > > Sbjct: 118 > QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK > > 175 > > > > Query: 177 > QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR > > 236 > > Q ++ + +M L EEQ +LQ LE+ E + L +L QQ + > S +I+ ++ > > Sbjct: 176 > QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS > > 235 > > > > Query: 237 > LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW > > 296 > > S+ E L++V ++RSE + ++GM ++L++ + > > Sbjct: 236 > SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------ > > 285 > > > > Query: 297 > VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV > > 355 > > ++ L+P +A + + +S D + VK + NP F + V G > Q F+SG++YWEV > > Sbjct: 286 > TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV > > 345 > > > > Query: 356 DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405 > > +V K W +G+ S + P +S + + Y WV > > Sbjct: 346 EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391 > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > Posted date: Jun 15, 2006 8:56 PM > > Number of letters in database: 23,910,368 > > Number of sequences in database: 48,851 > > > > Lambda K H > > 0.319 0.133 0.398 > > > > Gapped > > Lambda K H > > 0.267 0.0410 0.140 > > > > > > Matrix: BLOSUM62 > > Gap Penalties: Existence: 11, Extension: 1 > > Number of Hits to DB: 20,900,506 > > Number of Sequences: 48851 > > Number of extensions: 899179 > > Number of successful extensions: 6075 > > Number of sequences better than 1.0e-25: 105 > > Number of HSP's better than 0.0 without gapping: 18 > > Number of HSP's successfully gapped in prelim test: 87 > > Number of HSP's that attempted gapping in prelim test: 5632 > > Number of HSP's gapped (non-prelim): 157 > > length of query: 442 > > length of database: 23,910,368 > > effective HSP length: 107 > 
> effective length of query: 335 > > effective length of database: 18,683,311 > > effective search space: 6258909185 > > effective search space used: 6258909185 > > T: 11 > > A: 40 > > X1: 16 ( 7.4 bits) > > X2: 38 (14.6 bits) > > X3: 64 (24.7 bits) > > S1: 41 (21.8 bits) > > S2: 289 (115.9 bits) > > > > *********** END BLAST REPORT FOR TEST CASE *********** > > > > Any ideas? > > > > Thanks, > > > > Bob > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From mblanche at berkeley.edu Thu Aug 17 01:20:45 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Wed, 16 Aug 2006 22:20:45 -0700 Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF In-Reply-To: <1155791512.2596.19.camel@dhcpvisitor217149.slac.stanford.edu> Message-ID: Many thanks Scott, I will probably follow your suggestion and start using PostGres. Beside being a different database engine, is their any big difference between using PostGres and MySQL? Many thanks for the help, I was starting to doubt my ability to code!! Cheers, Marco On 8/16/06 10:11 PM, "Scott Cain" wrote: > Hi Marco, > > Well, it works for me :-) > > I ran this script: > > #!/usr/bin/perl -w > use strict; > > use Bio::DB::GFF; > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::pg', > -dsn => 'dbi:Pg:dbname=flybase'); > > my @feat = $db->get_feature_by_name('FBgn0025803'); > > for (@feat) { > print "$_\n" if ($_->method eq 'gene'); > } > > and got one line: > > gene:.(FBgn0025803) > > The only real difference is that this in a PostgreSQL database and not > MySQL. I used Pg since I have that installed. I'll blow away this > database, install MySQL and see if that makes a difference (of course, > it shouldn't, but you never know...) > > Gaah! 
I ran the exact same script with a mysql Bio::DB::GFF and got > this out: > > gene:.(FBgn0025803) > gene:.(FBgn0025803) > > Looks like a bug in the mysql adaptor. I'll see if I can track it down; > in the mean time, you could switch to a real database :-) > > Scott > > > > On Wed, 2006-08-16 at 23:30 -0400, Scott Cain wrote: >> Hi Marco, >> >> I'm working on it right now--my first guess (without doing any real >> work), I'm betting on the problem being an incompatibility between the >> GFF3 file and the Bio::DB::GFF schema. >> >> Scott >> >> >> On Wed, 2006-08-16 at 19:59 -0700, Marco Blanchette wrote: >>> Dear all, >>> >>> I am desperately trying to get a list of gene coordinates from a MySQL >>> database version of the fly genome populated using the Bio::DB::GFF module. >>> I have a list of 277 id in a text file that when parsed through the >>> following script return 279 entries (2 more entries then the number of genes >>> in the starting list). >>> >>> Here is the script: >>> >>> use Bio::DB::GFF; >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>> -dsn => 'dbi:mysql:database=dmel_43_new'); >>> while (<>){ >>> chomp; >>> my @feat = $db->get_feature_by_name($_); >>> for my $f (@feat){ >>> if ($f->type->method eq 'gene'){ >>> print "Name: ", $f->name, >>> " Strand: ", $f->strand, >>> " Start: ", $f->start, >>> " End: ", $f->end, >>> "\n"; >>> } >>> } >>> } >>> >>> I totally don?t understand where the 2 extra entries are coming from. >>> Nothing differentiate them from each other. Moreover, when I double check >>> the MySQL database, both genes are having only a single ?gene? entry in the >>> fdata table. >>> >>> Is there a bug in the way I am trying to fetch the individual genes or >>> something is wrong with the latest Bio::DB::GFF module from the CVS >>> repository? >>> >>> Here is a test script and it?s output that I am using to try to tract down >>> what the problem is. 
Hope this could help: >>> >>> use Bio::DB::GFF; >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>> -dsn => 'dbi:mysql:database=dmel_43_new'); >>> my %dups; >>> my ($j, $i) =0; >>> while (<>){ >>> chomp; >>> my $id = $_; >>> my @feat = $db->get_feature_by_name($id); >>> my $feat_size = $#feat; >>> $j++ if $feat_size == 2; >>> >>> for my $f (@feat){ >>> $i++; >>> >>> if (exists $dups{$f->group} && $f->type->method eq 'gene'){ >>> print "Calling >>>", $f->group, >>> " ID=", $i, >>> " from \@feat of size $feat_size", >>> "\n"; >>> print "Chr: ", $f->refseq, >>> " Strand: ", $f->strand, >>> " Start: ", $f->start, >>> " End: ", $f->end, >>> "\n"; >>> print "Offending >>>", $dups{$f->group}->[0]->group, >>> " ID=", $dups{$f->group}->[1], "\n"; >>> print "Chr: ", $dups{$f->group}->[0]->refseq, >>> " Strand: ", $dups{$f->group}->[0]->strand, >>> " Start: ", $dups{$f->group}->[0]->start, >>> " End: ", $dups{$f->group}->[0]->end; >>> print "\n\n"; >>> } elsif ($f->type->method eq 'gene') { >>> $dups{$f->group} = [$f, $i]; >>> } >>> } >>> } >>> >>> print "#### there was $j \@feat with only 2 features\n"; >>> >>> Output of the test script: >>> >>> $ perl test.pl hrp36_targets.txt >>> Calling >>>FBgn0025803 ID=98 from @feat of size 2 >>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>> Offending >>>FBgn0025803 ID=97 >>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>> >>> Calling >>>FBgn0025681 ID=304 from @feat of size 2 >>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>> Offending >>>FBgn0025681 ID=303 >>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>> >>> #### there was 11 @feat with only 2 features >>> >>> With the hope someone can find out the problem... >>> >>> Cheers, >>> >>> Marco >>> >>> ______________________________ >>> Marco Blanchette, Ph.D. >>> >>> mblanche at uclink.berkeley.edu >>> >>> Donald C. 
Rio's lab >>> Department of Molecular and Cell Biology >>> 16 Barker Hall >>> University of California >>> Berkeley, CA 94720-3204 >>> >>> Tel: (510) 642-1084 >>> Cell: (510) 847-0996 >>> Fax: (510) 642-6062 >> -- >> ------------------------------------------------------------------------ >> Scott Cain, Ph. D. cain.cshl at gmail.com >> GMOD Coordinator (http://www.gmod.org/) 216-392-3087 >> Cold Spring Harbor Laboratory >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l Marco Blanchette, Ph.D. mblanche at berkeley.edu Donald C. Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 From cain.cshl at gmail.com Thu Aug 17 01:54:01 2006 From: cain.cshl at gmail.com (Scott Cain) Date: Thu, 17 Aug 2006 01:54:01 -0400 Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF In-Reply-To: References: Message-ID: <1155794042.2596.32.camel@dhcpvisitor217149.slac.stanford.edu> Marco, After stepping my script through the debugger, I am pretty sure that this really does come down to the incompatibilities between the Bio::DB::GFF schema and some GFF3 files. In this case, amusingly enough, Lincoln's efforts to make the Bio::DB::GFF mysql adaptor compatible with GFF3 has lead to this bug, whereas I didn't do the same for the Postgres adaptor. Unfortunately, I can't guarantee you that if you were to switch to Postgres that it would work because it may miss cases that the MySQL adaptor is getting. You could try Bio::DB::SeqFeature (loaded with bp_seqfeature_load.pl) which was designed to work with GFF3 files. Welcome to the bleeding edge :-) Scott On Wed, 2006-08-16 at 22:20 -0700, Marco Blanchette wrote: > Many thanks Scott, > I will probably follow your suggestion and start using PostGres. 
Beside being a different database engine, is their any big difference between using PostGres and MySQL?
> Many thanks for the help, I was starting to doubt my ability to code!!
> Cheers,
>
> Marco
>
> On 8/16/06 10:11 PM, "Scott Cain" wrote:
> > Hi Marco,
> >
> > Well, it works for me :-)
> >
> > I ran this script:
> >
> > #!/usr/bin/perl -w
> > use strict;
> >
> > use Bio::DB::GFF;
> > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::pg',
> >                             -dsn => 'dbi:Pg:dbname=flybase');
> >
> > my @feat = $db->get_feature_by_name('FBgn0025803');
> >
> > for (@feat) {
> >     print "$_\n" if ($_->method eq 'gene');
> > }
> >
> > and got one line:
> >
> > gene:.(FBgn0025803)
> >
> > The only real difference is that this in a PostgreSQL database and not
> > MySQL. I used Pg since I have that installed. I'll blow away this
> > database, install MySQL and see if that makes a difference (of course,
> > it shouldn't, but you never know...)
> >
> > Gaah! I ran the exact same script with a mysql Bio::DB::GFF and got
> > this out:
> >
> > gene:.(FBgn0025803)
> > gene:.(FBgn0025803)
> >
> > Looks like a bug in the mysql adaptor. I'll see if I can track it down;
> > in the mean time, you could switch to a real database :-)
> >
> > Scott
>
> Marco Blanchette, Ph.D.
> mblanche at berkeley.edu
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
------------------------------------------------------------------------
Scott Cain, Ph. D.                                   cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/)                     216-392-3087
Cold Spring Harbor Laboratory

From mblanche at berkeley.edu Thu Aug 17 02:09:25 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Wed, 16 Aug 2006 23:09:25 -0700
Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF
In-Reply-To: <1155794042.2596.32.camel@dhcpvisitor217149.slac.stanford.edu>
Message-ID:

Gnarl... my problem with the Bio::DB::SeqFeature (and its bp_seqfeature_load.pl script) is that it doesn't integrate the fasta sequences database yet (at least from what I can tell from the version installed on my workstation).
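For the duplicated 'gene' features reported earlier in this thread, one possible stopgap until the adaptor is sorted out is to drop features whose group, method, reference sequence, strand, start and end are all identical. The following is only a sketch under assumptions: `unique_features` and the accessor callback are hypothetical names introduced here, and the records are plain hashes filled in from the test output above so that the example runs without a database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Keep the first feature for each distinct combination of fields.
# $get is a code ref that extracts one named field from one feature, so
# the same routine can serve plain hashes (as here) or feature objects.
sub unique_features {
    my ($get, @features) = @_;
    my %seen;
    return grep {
        my $f   = $_;    # save the feature; the inner map rebinds $_
        my $key = join "\t",
            map { $get->($f, $_) } qw(group method refseq strand start end);
        !$seen{$key}++;  # true only the first time a key is met
    } @features;
}

# Stand-in records using the coordinates printed by the test script.
my @feat = (
    { group => 'FBgn0025803', method => 'gene', refseq => '3R',
      strand => 1, start => 16966463, end => 17038413 },
    { group => 'FBgn0025803', method => 'gene', refseq => '3R',
      strand => 1, start => 16966463, end => 17038413 },
);

# For hashes the accessor is a simple key lookup. With real feature
# objects one would presumably pass sub { my ($f, $m) = @_; $f->$m }
# instead (an assumption, not tested against Bio::DB::GFF).
my @unique = unique_features(sub { $_[0]{ $_[1] } }, @feat);
print scalar(@unique), " unique gene feature(s)\n";
```

This only hides the symptom in the calling script; it does not change what the adaptor returns, and two genuinely distinct features that share all six fields would be merged as well.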
What is the expected timeline for Bio::DB::SeqFeature, and how stable and reliable is the latest version?

Many thanks for your help Scott,

Marco

On 8/16/06 10:54 PM, "Scott Cain" wrote:
> Marco,
>
> After stepping my script through the debugger, I am pretty sure that
> this really does come down to the incompatibilities between the
> Bio::DB::GFF schema and some GFF3 files. In this case, amusingly
> enough, Lincoln's efforts to make the Bio::DB::GFF mysql adaptor
> compatible with GFF3 has lead to this bug, whereas I didn't do the same
> for the Postgres adaptor. Unfortunately, I can't guarantee you that if
> you were to switch to Postgres that it would work because it may miss
> cases that the MySQL adaptor is getting.
>
> You could try Bio::DB::SeqFeature (loaded with bp_seqfeature_load.pl)
> which was designed to work with GFF3 files. Welcome to the bleeding
> edge :-)
>
> Scott
>
>
> On Wed, 2006-08-16 at 22:20 -0700, Marco Blanchette wrote:
>> Many thanks Scott,
>> I will probably follow your suggestion and start using PostGres. Beside being
>> a different database engine, is their any big difference between
>> using PostGres and MySQL?
>> Many thanks for the help, I was starting to doubt my ability to code!!
>> Cheers,
>>
>> Marco

Marco Blanchette, Ph.D.
mblanche at berkeley.edu
Donald C.
Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062

From cain.cshl at gmail.com Thu Aug 17 01:11:52 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Thu, 17 Aug 2006 01:11:52 -0400
Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF
In-Reply-To: <1155785447.2596.8.camel@dhcpvisitor217149.slac.stanford.edu>
References: <1155785447.2596.8.camel@dhcpvisitor217149.slac.stanford.edu>
Message-ID: <1155791512.2596.19.camel@dhcpvisitor217149.slac.stanford.edu>

Hi Marco,

Well, it works for me :-)

I ran this script:

#!/usr/bin/perl -w
use strict;

use Bio::DB::GFF;
my $db = Bio::DB::GFF->new( -adaptor => 'dbi::pg',
                            -dsn     => 'dbi:Pg:dbname=flybase');

my @feat = $db->get_feature_by_name('FBgn0025803');

for (@feat) {
    print "$_\n" if ($_->method eq 'gene');
}

and got one line:

gene:.(FBgn0025803)

The only real difference is that this is in a PostgreSQL database and not
MySQL. I used Pg since I have that installed. I'll blow away this
database, install MySQL and see if that makes a difference (of course, it
shouldn't, but you never know...)

Gaah! I ran the exact same script with a mysql Bio::DB::GFF and got
this out:

gene:.(FBgn0025803)
gene:.(FBgn0025803)

Looks like a bug in the mysql adaptor. I'll see if I can track it down;
in the mean time, you could switch to a real database :-)

Scott

On Wed, 2006-08-16 at 23:30 -0400, Scott Cain wrote:
> Hi Marco,
>
> I'm working on it right now--my first guess (without doing any real
> work), I'm betting on the problem being an incompatibility between the
> GFF3 file and the Bio::DB::GFF schema.
>
> Scott
>
>
> On Wed, 2006-08-16 at 19:59 -0700, Marco Blanchette wrote:
> > Dear all,
> >
> > I am desperately trying to get a list of gene coordinates from a MySQL
> > database version of the fly genome populated using the Bio::DB::GFF module.
> > I have a list of 277 IDs in a text file that, when parsed through the
> > following script, returns 279 entries (2 more entries than the number of genes
> > in the starting list).
> >
> > Here is the script:
> >
> > use Bio::DB::GFF;
> > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >                             -dsn     => 'dbi:mysql:database=dmel_43_new');
> > while (<>){
> >     chomp;
> >     my @feat = $db->get_feature_by_name($_);
> >     for my $f (@feat){
> >         if ($f->type->method eq 'gene'){
> >             print "Name: ", $f->name,
> >                   " Strand: ", $f->strand,
> >                   " Start: ", $f->start,
> >                   " End: ", $f->end,
> >                   "\n";
> >         }
> >     }
> > }
> >
> > I totally don't understand where the 2 extra entries are coming from.
> > Nothing differentiates them from each other. Moreover, when I double check
> > the MySQL database, both genes have only a single 'gene' entry in the
> > fdata table.
> >
> > Is there a bug in the way I am trying to fetch the individual genes, or
> > is something wrong with the latest Bio::DB::GFF module from the CVS
> > repository?
> >
> > Here is a test script and its output that I am using to try to track down
> > what the problem is. Hope this could help:
> >
> > use Bio::DB::GFF;
> > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >                             -dsn     => 'dbi:mysql:database=dmel_43_new');
> > my %dups;
> > my ($j, $i) = (0, 0);
> > while (<>){
> >     chomp;
> >     my $id = $_;
> >     my @feat = $db->get_feature_by_name($id);
> >     my $feat_size = $#feat;
> >     $j++ if $feat_size == 2;
> >
> >     for my $f (@feat){
> >         $i++;
> >
> >         if (exists $dups{$f->group} && $f->type->method eq 'gene'){
> >             print "Calling >>>", $f->group,
> >                   " ID=", $i,
> >                   " from \@feat of size $feat_size",
> >                   "\n";
> >             print "Chr: ", $f->refseq,
> >                   " Strand: ", $f->strand,
> >                   " Start: ", $f->start,
> >                   " End: ", $f->end,
> >                   "\n";
> >             print "Offending >>>", $dups{$f->group}->[0]->group,
> >                   " ID=", $dups{$f->group}->[1], "\n";
> >             print "Chr: ", $dups{$f->group}->[0]->refseq,
> >                   " Strand: ", $dups{$f->group}->[0]->strand,
> >                   " Start: ", $dups{$f->group}->[0]->start,
> >                   " End: ", $dups{$f->group}->[0]->end;
> >             print "\n\n";
> >         } elsif ($f->type->method eq 'gene') {
> >             $dups{$f->group} = [$f, $i];
> >         }
> >     }
> > }
> >
> > print "#### there was $j \@feat with only 2 features\n";
> >
> > Output of the test script:
> >
> > $ perl test.pl hrp36_targets.txt
> > Calling >>>FBgn0025803 ID=98 from @feat of size 2
> > Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> > Offending >>>FBgn0025803 ID=97
> > Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >
> > Calling >>>FBgn0025681 ID=304 from @feat of size 2
> > Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> > Offending >>>FBgn0025681 ID=303
> > Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >
> > #### there was 11 @feat with only 2 features
> >
> > With the hope someone can find out the problem...
> >
> > Cheers,
> >
> > Marco
> >
> > ______________________________
> > Marco Blanchette, Ph.D.
> >
> > mblanche at uclink.berkeley.edu
> >
> > Donald C. Rio's lab
> > Department of Molecular and Cell Biology
> > 16 Barker Hall
> > University of California
> > Berkeley, CA 94720-3204
> >
> > Tel: (510) 642-1084
> > Cell: (510) 847-0996
> > Fax: (510) 642-6062
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D. cain.cshl at gmail.com
> GMOD Coordinator (http://www.gmod.org/) 216-392-3087
> Cold Spring Harbor Laboratory
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
--
------------------------------------------------------------------------
Scott Cain, Ph. D. cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060817/93c9027b/attachment.bin

From reenayadav at gmail.com Thu Aug 17 03:00:50 2006
From: reenayadav at gmail.com (Reena Yadav)
Date: Thu, 17 Aug 2006 07:00:50 +0000
Subject: [Bioperl-l] starter..help run the first script
Message-ID: <76f897dd0608170000v4b6f0f01qeab1cbbe9050e0c0@mail.gmail.com>

Hi I am a novice and am seriously starting today.
I tried the following script, put in a text editor and saved it as .pl as ext.
Internet is connected, Web proxy is set. But is not recognised by the script.
Could someone walk me through how exactly to write, save, and run the script
in Linux or win, preferably linux.
Reena Yadav

SCRIPT
----------
#!/usr/bin/perl -w
use Bio::Perl;

# this script will only work if you have an internet connection on the
# computer you're using, the databases you can get sequences from
# are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'

$seq_object = get_sequence('swiss',"ROA1_HUMAN");

write_sequence(">roa1.fasta",'fasta',$seq_object);

REPLY ON THE PAGE
------------------------
------------- EXCEPTION -------------
MSG: WebDBSeqI Request Error:
HTTP/1.1 403 Forbidden
Connection: close
Date: Thursday, 17-Aug-06 06:31:28 GMT
Server: Web_Proxy
Content-Type: text/html
Expires: Thursday, 17-Aug-06 06:31:28 GMT
Client-Date: Thu, 17 Aug 2006 12:01:10 GMT
Client-Peer: 193.62.197.151:80
Client-Response-Num: 1
Title: Web Proxy

Web Proxy
This site is protected by firewall 126.

[ryadav at inbgsc0125 ~/ry_files]$ perl bt2.pl
Name "main::sequence_as_a_string" used only once: possible typo at bt2.pl line 22.
Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 169.
read_sequence() - usage incorrect at /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 168
Bio::Perl::read_sequence('undef') called at bt2.pl line 5
[ryadav at inbgsc0125 ~/ry_files]$ perl bt1.pl
------------- EXCEPTION -------------
MSG: WebDBSeqI Request Error:
HTTP/1.1 403 Forbidden
Connection: close
Date: Thursday, 17-Aug-06 06:43:03 GMT
Server: Web_Proxy
Content-Type: text/html
Expires: Thursday, 17-Aug-06 06:43:03 GMT
Client-Date: Thu, 17 Aug 2006 12:12:45 GMT
Client-Peer: 193.62.197.151:80
Client-Response-Num: 1
Title: Web Proxy

Web Proxy
This site is protected by firewall 126. All requests are screened and logged.
You are not permitted to access the requested URL http://www.ebi.ac.uk/cgi-bin/dbfetch.
For further information contact:
TP&S Firewall Admin, Wilmington

STACK Bio::DB::WebDBSeqI::_stream_request /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:728
STACK Bio::DB::WebDBSeqI::get_seq_stream /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:460
STACK Bio::DB::WebDBSeqI::get_Stream_by_id /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:287
STACK Bio::DB::WebDBSeqI::get_Seq_by_id /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:153
STACK Bio::Perl::get_sequence /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm:511
STACK toplevel bt1.pl:8
--------------------------------------

-------------------- WARNING ---------------------
MSG: id (ROA1_HUMAN) does not exist
---------------------------------------------------
Use of uninitialized value in length at /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 271.
Use of uninitialized value in concatenation (.) or string at /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 283.
You have a non object [] passed to write_sequence. It maybe that you want to use new_sequence to make this string into a sequence object? at /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 283
Bio::Perl::write_sequence('>roa1.fasta', 'fasta', 'undef') called at bt1.pl line 10

From bix at sendu.me.uk Thu Aug 17 04:56:47 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Thu, 17 Aug 2006 09:56:47 +0100
Subject: [Bioperl-l] starter..help run the first script
In-Reply-To: <76f897dd0608170000v4b6f0f01qeab1cbbe9050e0c0@mail.gmail.com>
References: <76f897dd0608170000v4b6f0f01qeab1cbbe9050e0c0@mail.gmail.com>
Message-ID: <44E42F4F.1080809@sendu.me.uk>

Reena Yadav wrote:
> Hi I am a novice and am seriously starting today.
> I tried the following script, put in a text editor and saved it as .pl as
> ext.
> Internet is connected, Web proxy is set. But is not recognised by the
> script.
> Could someone walk me through how exactly to write, save, and run the script
> in Linux or win, preferably linux.
> Reena Yadav
> SCRIPT
> ----------
> #!/usr/bin/perl -w
> use Bio::Perl;
>
> # this script will only work if you have an internet connection on the
> # computer you're using, the databases you can get sequences from
> # are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq'
>
> $seq_object = get_sequence('swiss',"ROA1_HUMAN");
>
> write_sequence(">roa1.fasta",'fasta',$seq_object);
[...]
> This site is protected by firewall 126. All requests are screened and logged.
> You are not permitted to access the requested URL
> http://www.ebi.ac.uk/cgi-bin/dbfetch.
> For further information contact: TP&S Firewall Admin, Wilmington

Your script is running fine. As you can see from the error message, your
web proxy firewall is not allowing you to access the EBI website (can you
visit http://www.ebi.ac.uk/cgi-bin/dbfetch in a web browser on the same
machine?). You'll need to get in touch with the 'firewall admin', or
change web proxy.

From avilella at gmail.com Thu Aug 17 07:37:20 2006
From: avilella at gmail.com (Albert Vilella)
Date: Thu, 17 Aug 2006 12:37:20 +0100
Subject: [Bioperl-l] informative codons method for kaks -- Bio/Align/DNAStatistics.pm
Message-ID: <1155814640.11572.2.camel@localhost>

Hi all,

I think it would be nice to have a method in
Bio/Align/DNAStatistics.pm that gives the number of informative codons
for kaks in a MSA. That is, the codons that are used in the
calculation of kaks. This is, AFAICS, more or less what codeml calls
"patterns".

I often find myself in the situation of wanting to know how big is the
CDS alignment not in terms of sequence length, but of the number of
codons that are going to be used in the kaks statistics. I guess this
method would help in that.

The method could:

return the number of informative codons?
maybe return a new seqarray with only the informative codons?

What do you think? Jason? Chris?

http://bugzilla.open-bio.org/show_bug.cgi?id=2078

Bests,

Albert.

From cjfields at uiuc.edu Thu Aug 17 10:03:31 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 17 Aug 2006 09:03:31 -0500
Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF
In-Reply-To: <1155794042.2596.32.camel@dhcpvisitor217149.slac.stanford.edu>
References: <1155794042.2596.32.camel@dhcpvisitor217149.slac.stanford.edu>
Message-ID:

On Aug 17, 2006, at 12:54 AM, Scott Cain wrote:

> ...
> You could try Bio::DB::SeqFeature (loaded with bp_seqfeature_load.pl)
> which was designed to work with GFF3 files. Welcome to the bleeding
> edge :-)
>
> Scott

Speaking of, I had recently posted that we're planning on a new
developer release (1.5.2).
Is Bio::DB::SeqFeature stable enough (i.e. not too bleeding-edge) to include?

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cain.cshl at gmail.com Thu Aug 17 02:17:08 2006
From: cain.cshl at gmail.com (Scott Cain)
Date: Thu, 17 Aug 2006 02:17:08 -0400
Subject: [Bioperl-l] Genes from MySQL database using Bio::DB::GFF
In-Reply-To:
References:
Message-ID: <1155795429.2596.35.camel@dhcpvisitor217149.slac.stanford.edu>

Um, no idea. I haven't even tried it myself yet (which is why I didn't
answer Chris' question about it a few days ago). Sorry.

On Wed, 2006-08-16 at 23:09 -0700, Marco Blanchette wrote:
> Gnarl... my problem with the Bio::DB::SeqFeature (and its
> bp_seqfeature_load.pl script) is that it doesn't integrate the fasta
> sequences database yet (at least from what I can tell from the version
> installed on my workstation).
> What is the expected timeline on the Bio::DB::SeqFeature and how stable
> and reliable is the latest version?
> Many thanks for your help Scott,
> Marco
>
> On 8/16/06 10:54 PM, "Scott Cain" wrote:
> > Marco,
> >
> > After stepping my script through the debugger, I am pretty sure that
> > this really does come down to the incompatibilities between the
> > Bio::DB::GFF schema and some GFF3 files. In this case, amusingly
> > enough, Lincoln's efforts to make the Bio::DB::GFF mysql adaptor
> > compatible with GFF3 have led to this bug, whereas I didn't do the same
> > for the Postgres adaptor. Unfortunately, I can't guarantee you that if
> > you were to switch to Postgres that it would work because it may miss
> > cases that the MySQL adaptor is getting.
> >
> > You could try Bio::DB::SeqFeature (loaded with bp_seqfeature_load.pl)
> > which was designed to work with GFF3 files. Welcome to the bleeding
> > edge :-)
> >
> > Scott
> >
> >
> > On Wed, 2006-08-16 at 22:20 -0700, Marco Blanchette wrote:
> >> Many thanks Scott,
> >> I will probably follow your suggestion and start using PostGres.
> >> Beside being a different database engine, is there any big difference
> >> between using PostGres and MySQL?
> >> Many thanks for the help, I was starting to doubt my ability to code!!
> >> Cheers,
> >>
> >> Marco
> >>
> >> On 8/16/06 10:11 PM, "Scott Cain" wrote:
> >>> Hi Marco,
> >>>
> >>> Well, it works for me :-)
> >>>
> >>> I ran this script:
> >>>
> >>> #!/usr/bin/perl -w
> >>> use strict;
> >>>
> >>> use Bio::DB::GFF;
> >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::pg',
> >>>                             -dsn     => 'dbi:Pg:dbname=flybase');
> >>>
> >>> my @feat = $db->get_feature_by_name('FBgn0025803');
> >>>
> >>> for (@feat) {
> >>>     print "$_\n" if ($_->method eq 'gene');
> >>> }
> >>>
> >>> and got one line:
> >>>
> >>> gene:.(FBgn0025803)
> >>>
> >>> The only real difference is that this is in a PostgreSQL database and not
> >>> MySQL. I used Pg since I have that installed. I'll blow away this
> >>> database, install MySQL and see if that makes a difference (of course, it
> >>> shouldn't, but you never know...)
> >>>
> >>> Gaah! I ran the exact same script with a mysql Bio::DB::GFF and got
> >>> this out:
> >>>
> >>> gene:.(FBgn0025803)
> >>> gene:.(FBgn0025803)
> >>>
> >>> Looks like a bug in the mysql adaptor. I'll see if I can track it down;
> >>> in the mean time, you could switch to a real database :-)
> >>>
> >>> Scott
> >>>
> >>> On Wed, 2006-08-16 at 23:30 -0400, Scott Cain wrote:
> >>>> Hi Marco,
> >>>>
> >>>> I'm working on it right now--my first guess (without doing any real
> >>>> work), I'm betting on the problem being an incompatibility between the
> >>>> GFF3 file and the Bio::DB::GFF schema.
> >>>>
> >>>> Scott
> >>>>
> >>>> On Wed, 2006-08-16 at 19:59 -0700, Marco Blanchette wrote:
> >>>>> Dear all,
> >>>>>
> >>>>> I am desperately trying to get a list of gene coordinates from a MySQL
> >>>>> database version of the fly genome populated using the Bio::DB::GFF module.
> >>>>> I have a list of 277 IDs in a text file that, when parsed through the
> >>>>> following script, returns 279 entries (2 more entries than the number
> >>>>> of genes in the starting list).
> >>>>>
> >>>>> Here is the script:
> >>>>>
> >>>>> use Bio::DB::GFF;
> >>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >>>>>                             -dsn     => 'dbi:mysql:database=dmel_43_new');
> >>>>> while (<>){
> >>>>>     chomp;
> >>>>>     my @feat = $db->get_feature_by_name($_);
> >>>>>     for my $f (@feat){
> >>>>>         if ($f->type->method eq 'gene'){
> >>>>>             print "Name: ", $f->name,
> >>>>>                   " Strand: ", $f->strand,
> >>>>>                   " Start: ", $f->start,
> >>>>>                   " End: ", $f->end,
> >>>>>                   "\n";
> >>>>>         }
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> I totally don't understand where the 2 extra entries are coming from.
> >>>>> Nothing differentiates them from each other. Moreover, when I double check
> >>>>> the MySQL database, both genes have only a single 'gene' entry in the
> >>>>> fdata table.
> >>>>>
> >>>>> Is there a bug in the way I am trying to fetch the individual genes, or
> >>>>> is something wrong with the latest Bio::DB::GFF module from the CVS
> >>>>> repository?
> >>>>>
> >>>>> Here is a test script and its output that I am using to try to track down
> >>>>> what the problem is. Hope this could help:
> >>>>>
> >>>>> use Bio::DB::GFF;
> >>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql',
> >>>>>                             -dsn     => 'dbi:mysql:database=dmel_43_new');
> >>>>> my %dups;
> >>>>> my ($j, $i) = (0, 0);
> >>>>> while (<>){
> >>>>>     chomp;
> >>>>>     my $id = $_;
> >>>>>     my @feat = $db->get_feature_by_name($id);
> >>>>>     my $feat_size = $#feat;
> >>>>>     $j++ if $feat_size == 2;
> >>>>>
> >>>>>     for my $f (@feat){
> >>>>>         $i++;
> >>>>>
> >>>>>         if (exists $dups{$f->group} && $f->type->method eq 'gene'){
> >>>>>             print "Calling >>>", $f->group,
> >>>>>                   " ID=", $i,
> >>>>>                   " from \@feat of size $feat_size",
> >>>>>                   "\n";
> >>>>>             print "Chr: ", $f->refseq,
> >>>>>                   " Strand: ", $f->strand,
> >>>>>                   " Start: ", $f->start,
> >>>>>                   " End: ", $f->end,
> >>>>>                   "\n";
> >>>>>             print "Offending >>>", $dups{$f->group}->[0]->group,
> >>>>>                   " ID=", $dups{$f->group}->[1], "\n";
> >>>>>             print "Chr: ", $dups{$f->group}->[0]->refseq,
> >>>>>                   " Strand: ", $dups{$f->group}->[0]->strand,
> >>>>>                   " Start: ", $dups{$f->group}->[0]->start,
> >>>>>                   " End: ", $dups{$f->group}->[0]->end;
> >>>>>             print "\n\n";
> >>>>>         } elsif ($f->type->method eq 'gene') {
> >>>>>             $dups{$f->group} = [$f, $i];
> >>>>>         }
> >>>>>     }
> >>>>> }
> >>>>>
> >>>>> print "#### there was $j \@feat with only 2 features\n";
> >>>>>
> >>>>> Output of the test script:
> >>>>>
> >>>>> $ perl test.pl hrp36_targets.txt
> >>>>> Calling >>>FBgn0025803 ID=98 from @feat of size 2
> >>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >>>>> Offending >>>FBgn0025803 ID=97
> >>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413
> >>>>>
> >>>>> Calling >>>FBgn0025681 ID=304 from @feat of size 2
> >>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >>>>> Offending >>>FBgn0025681 ID=303
> >>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614
> >>>>>
> >>>>> #### there was 11 @feat with only 2 features
> >>>>>
> >>>>> With the hope someone can find out the problem...
> >>>>>
> >>>>> Cheers,
> >>>>>
> >>>>> Marco
> >>>>>
> >>>>> ______________________________
> >>>>> Marco Blanchette, Ph.D.
> >>>>>
> >>>>> mblanche at uclink.berkeley.edu
> >>>>>
> >>>>> Donald C. Rio's lab
> >>>>> Department of Molecular and Cell Biology
> >>>>> 16 Barker Hall
> >>>>> University of California
> >>>>> Berkeley, CA 94720-3204
> >>>>>
> >>>>> Tel: (510) 642-1084
> >>>>> Cell: (510) 847-0996
> >>>>> Fax: (510) 642-6062
> >>>> --
> >>>> ------------------------------------------------------------------------
> >>>> Scott Cain, Ph. D. cain.cshl at gmail.com
> >>>> GMOD Coordinator (http://www.gmod.org/) 216-392-3087
> >>>> Cold Spring Harbor Laboratory
> >>>>
> >>>> _______________________________________________
> >>>> Bioperl-l mailing list
> >>>> Bioperl-l at lists.open-bio.org
> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
> >> Marco Blanchette, Ph.D.
> >> mblanche at berkeley.edu
> >> Donald C. Rio's lab
> >> Department of Molecular and Cell Biology
> >> 16 Barker Hall
> >> University of California
> >> Berkeley, CA 94720-3204
> >> Tel: (510) 642-1084
> >> Cell: (510) 847-0996
> >> Fax: (510) 642-6062
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >>
>
> Marco Blanchette, Ph.D.
> mblanche at berkeley.edu
> Donald C. Rio's lab
> Department of Molecular and Cell Biology
> 16 Barker Hall
> University of California
> Berkeley, CA 94720-3204
> Tel: (510) 642-1084
> Cell: (510) 847-0996
> Fax: (510) 642-6062
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
--
------------------------------------------------------------------------
Scott Cain, Ph. D. cain.cshl at gmail.com
GMOD Coordinator (http://www.gmod.org/) 216-392-3087
Cold Spring Harbor Laboratory
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060817/b27831da/attachment-0001.bin

From deep_ans at yahoo.com Thu Aug 17 08:03:28 2006
From: deep_ans at yahoo.com (deepak shingan)
Date: Thu, 17 Aug 2006 05:03:28 -0700 (PDT)
Subject: [Bioperl-l] Error while parsing a BLAST Report
Message-ID: <20060817120328.37822.qmail@web51709.mail.yahoo.com>

Hi all,

I am getting the following exception while parsing a BLAST report using
SearchIO.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Undefined sub-sequence (400,401). Valid range = 285 - 401
STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/lib/perl5/site_perl/5.8.5/Bio/Root/Root.pm:328
STACK: Bio::Search::HSP::HSPI::matches /usr/lib/perl5/site_perl/5.8.5/Bio/Search/HSP/HSPI.pm:711
STACK: Bio::Search::SearchUtils::_adjust_contigs /usr/lib/perl5/site_perl/5.8.5/Bio/Search/SearchUtils.pm:365
STACK: Bio::Search::SearchUtils::tile_hsps /usr/lib/perl5/site_perl/5.8.5/Bio/Search/SearchUtils.pm:176
STACK: Bio::Search::Hit::GenericHit::matches /usr/lib/perl5/site_perl/5.8.5/Bio/Search/Hit/GenericHit.pm:830
STACK: parserMethod.pl:62

I am using bio-perl version 1.5.1.
The same parserMethod works fine with all other BLAST report files, but
tblastx output files are not getting parsed correctly.

If anybody knows something about this, please reply.

Regards...
Deepak

[Note: I am sending the source code and the sample BLAST report file in
which the exception is occurring as attachments with this mail]

-------------- next part --------------
A non-text attachment was scrubbed...
Name: parserMethod.pl
Type: application/octet-stream
Size: 4038 bytes
Desc: 3421527925-parserMethod.pl
Url : http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060817/783957d6/attachment-0001.obj
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: tempBlastFile
Url: http://lists.open-bio.org/pipermail/bioperl-l/attachments/20060817/783957d6/attachment-0001.pl

From cjfields at uiuc.edu Thu Aug 17 10:26:33 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 17 Aug 2006 09:26:33 -0500
Subject: [Bioperl-l] informative codons method for kaks -- Bio/Align/DNAStatistics.pm
In-Reply-To: <1155814640.11572.2.camel@localhost>
References: <1155814640.11572.2.camel@localhost>
Message-ID:

Sure, why not? If you (or someone) can add one in, I don't see how it
could hurt. Make sure to add tests for this in the proper test suite.

Chris

On Aug 17, 2006, at 6:37 AM, Albert Vilella wrote:

> Hi all,
>
> I think it would be nice to have a method in
> Bio/Align/DNAStatistics.pm that gives the number of informative codons
> for kaks in a MSA. That is, the codons that are used in the
> calculation of kaks. This is, AFAICS, more or less what codeml calls
> "patterns".
>
> I often find myself in the situation of wanting to know how big is the
> CDS alignment not in terms of sequence length, but of the number of
> codons that are going to be used in the kaks statistics. I guess this
> method would help in that.
>
> The method could:
>
> return the number of informative codons?
> maybe return a new seqarray with only the informative codons?
>
> What do you think? Jason? Chris?
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2078
>
> Bests,
>
> Albert.
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Thu Aug 17 11:58:17 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 17 Aug 2006 10:58:17 -0500
Subject: [Bioperl-l] Error parsing BLAST report
In-Reply-To: <71AE766382153B47AAB638DC83ED7F49014CE8CA@pathexch1.wusm-path.wustl.edu>
Message-ID: <006c01c6c215$f492c900$15327e82@pyrimidine>

Robert,

This sounds like a possible bug; the error you are getting is from BLAST
2.2.11 output, so it should work. The BLAST parsing errors fixed in CVS had
to do with parsing BLAST 2.2.13 (and later) output. Could you add this as a
bug to Bugzilla with your test script and test case data that generates the
error?

How to post the bug:
http://www.bioperl.org/wiki/Bugs

Bugzilla:
http://bugzilla.open-bio.org/

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Freimuth, Robert
> Sent: Wednesday, August 16, 2006 11:42 PM
> To: Brian Osborne; bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Error parsing BLAST report
>
> Hi,
>
> Thank you for your reply. I downloaded bioperl-1.5.1 from
> http://bioperl.org/DIST/ and installed it (which appeared successful),
> but the one-liner:
>
> perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION, "\n"'
>
> prints 1.5 (I expected 1.5.1).
>
> When I run the test case that I reported earlier, I get the following
> output:
>
> -------------------- WARNING ---------------------
> MSG: There is no HSP data for hit 'ENSP00000327738'.
> You have called a method (Bio::Search::Hit::GenericHit::length_aln)
> that requires HSP data and there was no HSP data for this hit,
> most likely because it was absent from the BLAST report.
> Note that by default, BLAST lists alignments for the first 250 hits,
> but it lists descriptions for 500 hits.
If this is the case, > and you care about these hits, you should re-run BLAST using the > -b option (or equivalent if not using blastall) to increase the number > of alignments. > > --------------------------------------------------- > Alignment length for ENSP00000327738 is - > Alignment length for ENSP00000350182 is 250 > Alignment length for ENSP00000327738 is 398 > > Could someone that is running 1.5.1 please verify the output of the > one-liner above (did I somehow get the wrong file from the ftp site?) > and try to reproduce the error with the test case? > > Thanks for the help. I'm stumped. > > Bob > > > -----Original Message----- > > From: Brian Osborne [mailto:osborne1 at optonline.net] > > Sent: Wednesday, August 16, 2006 6:21 PM > > To: Freimuth, Robert; bioperl-l at lists.open-bio.org > > Subject: Re: [Bioperl-l] Error parsing BLAST report > > > > Robert, > > > > The standard answer to a complaint about SearchIO these days > > is to upgrade > > to version 1.5.1 - what Bioperl version are you using? > > > > Brian O. 
> > > > > > On 8/16/06 3:56 PM, "Freimuth, Robert" > > wrote: > > > > > Hello, > > > > > > I'm trying to parse a BLAST report using the following code: > > > > > > use warnings; > > > use strict; > > > > > > use Bio::SearchIO; > > > > > > my $file = 'NP_006065_blast.out'; > > > > > > my $searchio = new Bio::SearchIO( -format => 'blast', > > > -file => $file ); > > > > > > while( my $result = $searchio->next_result() ) > > > { > > > while( my $hit = $result->next_hit ) > > > { > > > my $hit_acc_num = $hit->accession(); > > > > > > # get the total length of the aligned region > > for query or > > > sbjct seq > > > # (includes all HSPs, calculated after tiling) > > > > > > my $align_len = $hit->length_aln( 'query' ); > > > > > > print "Alignment length for $hit_acc_num is > > $align_len\n"; > > > } > > > } > > > > > > There are 104 one-line descriptions in the report, and > > alignments for > > > each one of them (the blast report was created using > > > b_num_alignments_shown => 500 and v_num_descriptions_shown => 500). > > > However, when I run the above code I get 14 errors like the > > following: > > > > > > -------------------- WARNING --------------------- > > > MSG: There is no HSP data for hit 'ENSP00000327738'. > > > You have called a method (Bio::Search::Hit::GenericHit::length_aln) > > > that requires HSP data and there was no HSP data for this hit, > > > most likely because it was absent from the BLAST report. > > > Note that by default, BLAST lists alignments for the first 250 hits, > > > but it lists descriptions for 500 hits. If this is the case, > > > and you care about these hits, you should re-run BLAST using the > > > -b option (or equivalent if not using blastall) to increase > > the number > > > of alignments. > > > > > > --------------------------------------------------- > > > > > > There is an alignment for this (and the other 13 sequences) in the > > > report. 
In fact, if I edit the report and delete all but the > > > description and the alignment for ENSP00000327738, it > > parses fine (no > > > error). > > > > > > I continued editing the report and produced the following > > minimal test > > > case that reproduces the error. Note that the description for > > > ENSP00000350182 appears twice, BUT THE ERROR IS FOR ENSP00000327738. > > > > > > *********** BLAST REPORT FOR TEST CASE *********** > > > > > > BLASTP 2.2.11 [Jun-05-2005] > > > > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. > > > Schaffer, > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), > > > "Gapped BLAST and PSI-BLAST: a new generation of protein > > database search > > > programs", Nucleic Acids Res. 25:3389-3402. > > > > > > Query= NP_006065 > > > (442 letters) > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > 48,851 sequences; 23,910,368 total letters > > > > > > Searching..................................................done > > > > > > > > Score > > > E > > > Sequences producing significant alignments: > > (bits) > > > Value > > > > > > ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... > > > 120 3e-27 > > > ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E... > > > 120 3e-27 > > > ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189... 
> > > 115 8e-26 > > > > > >> ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 > > > gene:ENSG00000137397 > > > transcript:ENST00000357569 > > > Length = 425 > > > > > > Score = 120 bits (301), Expect = 3e-27 > > > Identities = 76/261 (29%), Positives = 140/261 (53%), Gaps = 21/261 > > > (8%) > > > > > > Query: 9 > > IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL > > > 68 > > > +++EV CPICL++L +P+++DCGH+FC CIT +I E+ S G > > CP+C+T + + > > > Sbjct: 10 > > LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI > > > 65 > > > > > > Query: 69 > > RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG > > > 127 > > > R N L N+VE+++ ++ S Q +K C H + FC++DGK > > +C+VC S++H+ > > > Sbjct: 66 > > RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS > > > 125 > > > > > > Query: 128 > > HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK > > > 181 > > > H I E + Q ++Q +Q L ++++E +++ D+ ++ > > + E+Q+IL > > > Sbjct: 126 > > HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT > > > 183 > > > > > > Query: 182 > > GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL > > > 237 > > > F + +L+ E+ L ++ EG +A+ QL D > > L+ L+ + > > > Sbjct: 184 > > EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ > > > 239 > > > > > > Query: 238 TGSSVEMLQDVIDVMKRSESW 258 > > > ++L+ + RSE + > > > Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260 > > > > > > > > >> ENSP00000327738 pep:known-ccds > > > chromosome:NCBI36:4:189297592:189305643:1 > > > gene:ENSG00000184108 transcript:ENST00000332517 > > > CCDS3851.1 > > > Length = 468 > > > > > > Score = 115 bits (289), Expect = 8e-26 > > > Identities = 101/410 (24%), Positives = 180/410 (43%), > > Gaps = 39/410 > > > (9%) > > > > > > Query: 8 > > DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN > > > 67 > > > ++ +E+TC ICL+ + P++ +CGHSFC C+ +E > > SCP C + + > > > Sbjct: 9 > > NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH > > > 64 > > > > > > Query: 68 > > 
LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS > > > 122 > > > + N L + ++++ Q Q D +G+ K ++ > > G ++ > > > Sbjct: 65 > > FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA > > > 117 > > > > > > Query: 123 > > QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER > > > 176 > > > Q H ++ +E + +EKLQ L L +EA EK > > + QE T K + > > > Sbjct: 118 > > QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK > > > 175 > > > > > > Query: 177 > > QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR > > > 236 > > > Q ++ + +M L EEQ +LQ LE+ E + L +L QQ + > > S +I+ ++ > > > Sbjct: 176 > > QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS > > > 235 > > > > > > Query: 237 > > LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW > > > 296 > > > S+ E L++V ++RSE + ++GM ++L++ + > > > Sbjct: 236 > > SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------ > > > 285 > > > > > > Query: 297 > > VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV > > > 355 > > > ++ L+P +A + + +S D + VK + NP F + V G > > Q F+SG++YWEV > > > Sbjct: 286 > > TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV > > > 345 > > > > > > Query: 356 DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405 > > > +V K W +G+ S + P +S + + Y WV > > > Sbjct: 346 EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391 > > > > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > Posted date: Jun 15, 2006 8:56 PM > > > Number of letters in database: 23,910,368 > > > Number of sequences in database: 48,851 > > > > > > Lambda K H > > > 0.319 0.133 0.398 > > > > > > Gapped > > > Lambda K H > > > 0.267 0.0410 0.140 > > > > > > > > > Matrix: BLOSUM62 > > > Gap Penalties: Existence: 11, Extension: 1 > > > Number of Hits to DB: 20,900,506 > > > Number of Sequences: 48851 > > > Number of extensions: 899179 > > > Number of successful extensions: 6075 > > > Number of sequences better than 1.0e-25: 105 > > > Number of HSP's better than 0.0 without gapping: 18 
> > > Number of HSP's successfully gapped in prelim test: 87 > > > Number of HSP's that attempted gapping in prelim test: 5632 > > > Number of HSP's gapped (non-prelim): 157 > > > length of query: 442 > > > length of database: 23,910,368 > > > effective HSP length: 107 > > > effective length of query: 335 > > > effective length of database: 18,683,311 > > > effective search space: 6258909185 > > > effective search space used: 6258909185 > > > T: 11 > > > A: 40 > > > X1: 16 ( 7.4 bits) > > > X2: 38 (14.6 bits) > > > X3: 64 (24.7 bits) > > > S1: 41 (21.8 bits) > > > S2: 289 (115.9 bits) > > > > > > *********** END BLAST REPORT FOR TEST CASE *********** > > > > > > Any ideas? > > > > > > Thanks, > > > > > > Bob > > > > > > > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Aug 17 12:10:33 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 11:10:33 -0500 Subject: [Bioperl-l] starter..help run the first script In-Reply-To: <76f897dd0608170000v4b6f0f01qeab1cbbe9050e0c0@mail.gmail.com> Message-ID: <006d01c6c217$aeb80650$15327e82@pyrimidine> If you set up environmental variables for your web proxy it should work. This is one of the areas with BioPerl that really needs further testing: http://www.bioperl.org/wiki/Project_priority_list#Test_and_fix_Bioperl.27s_u se_of_proxies There is also this from the INSTALL file and the wiki: "HTTP_PROXY : If you access the internet via a proxy server then you can tell the Bioperl modules which require network access about this by using the HTTP_PROXY environment variable. The value set includes the proxy address and the port used (e.g. http://www.cache.example.com:8080)." 
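A minimal sketch of the HTTP_PROXY advice quoted above, assuming the variable only needs to be present in the environment before the first network request is made. The helper name ensure_http_proxy and the proxy address are placeholders for illustration and are not part of BioPerl.

```perl
use strict;
use warnings;

# Return the proxy URL the script will see, setting HTTP_PROXY from a
# fallback value only when the environment does not already provide one.
# ASSUMPTION: ensure_http_proxy() is a hypothetical helper, and the URL
# passed to it below is a placeholder; substitute your site's real proxy.
sub ensure_http_proxy {
    my ($fallback) = @_;
    $ENV{HTTP_PROXY} = $fallback
        unless defined $ENV{HTTP_PROXY} && length $ENV{HTTP_PROXY};
    return $ENV{HTTP_PROXY};
}

# Call this near the top of the script, before any module makes a
# network request, so the value is in place when it is looked up.
my $proxy = ensure_http_proxy('http://www.cache.example.com:8080');
print "HTTP_PROXY is $proxy\n";
```

Setting the variable in the shell before starting perl has the same effect; doing it inside the script just keeps the setting next to the code that depends on it.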
If the module uses LWP, there is an additional problem (from perldoc LWP::UserAgent): "On systems with case insensitive environment variables there exists a name clash between the CGI environment variables and the HTTP_PROXY environment variable normally picked up by env_proxy(). Because of this HTTP_PROXY is not honored for CGI scripts. The CGI_HTTP_PROXY environment variable can be used instead." So try CGI_HTTP_PROXY if HTTP_PROXY doesn't work. I don't know which system you are using (Windows or Linux), so I can't say which variable will work best for you. Let us know how it works out! Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Reena Yadav > Sent: Thursday, August 17, 2006 2:01 AM > Cc: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] starter..help run the first script > > Hi, I am a novice and am seriously starting today. > I tried the following script: I put it in a text editor and saved it with a .pl > extension. > The Internet is connected and the Web proxy is set, but the proxy is not recognised by the > script. > Could someone walk me through exactly how to write, save, and run the > script > on Linux or Windows, preferably Linux.
> Reena Yadav > SCRIPT > ---------- > #!/usr/bin/perl -w > use Bio::Perl; > > # this script will only work if you have an internet connection on the > # computer you're using, the databases you can get sequences from > # are 'swiss', 'genbank', 'genpept', 'embl', and 'refseq' > > $seq_object = get_sequence('swiss',"ROA1_HUMAN"); > > write_sequence(">roa1.fasta",'fasta',$seq_object); > > REPLY ON THE PAGE > ------------------------ > > ------------- EXCEPTION ------------- > MSG: WebDBSeqI Request Error: > HTTP/1.1 403 Forbidden > Connection: close > Date: Thursday, 17-Aug-06 06:31:28 GMT > Server: Web_Proxy > Content-Type: text/html > Expires: Thursday, 17-Aug-06 06:31:28 GMT > Client-Date: Thu, 17 Aug 2006 12:01:10 GMT > Client-Peer: 193.62.197.151:80 > Client-Response-Num: 1 > Title: Web Proxy > > > Web Proxy >
This site is protected by firewall 126. > [ryadav at inbgsc0125 ~/ry_files]$ perl bt2.pl > Name "main::sequence_as_a_string" used only once: possible typo at > bt2.plline 22. > Use of uninitialized value in concatenation (.) or string at > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 169. > read_sequence() - usage incorrect at > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 168 > Bio::Perl::read_sequence('undef') called at bt2.pl line 5 > [ryadav at inbgsc0125 ~/ry_files]$ perl bt1.pl > > ------------- EXCEPTION ------------- > MSG: WebDBSeqI Request Error: > HTTP/1.1 403 Forbidden > Connection: close > Date: Thursday, 17-Aug-06 06:43:03 GMT > Server: Web_Proxy > Content-Type: text/html > Expires: Thursday, 17-Aug-06 06:43:03 GMT > Client-Date: Thu, 17 Aug 2006 12:12:45 GMT > Client-Peer: 193.62.197.151:80 > Client-Response-Num: 1 > Title: Web Proxy > > > Web Proxy >
> This site is protected by firewall 126. All requests are screened and logged.
> You are not permitted to access the requested URL http://www.ebi.ac.uk/cgi-bin/dbfetch.
> For further information contact: TP&S Firewall Admin, Wilmington
> > > STACK Bio::DB::WebDBSeqI::_stream_request > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:728 > STACK Bio::DB::WebDBSeqI::get_seq_stream > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:460 > STACK Bio::DB::WebDBSeqI::get_Stream_by_id > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:287 > STACK Bio::DB::WebDBSeqI::get_Seq_by_id > /usr/lib/perl5/site_perl/5.8.5/Bio/DB/WebDBSeqI.pm:153 > STACK Bio::Perl::get_sequence > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm:511 > STACK toplevel bt1.pl:8 > > -------------------------------------- > > -------------------- WARNING --------------------- > MSG: id (ROA1_HUMAN) does not exist > --------------------------------------------------- > Use of uninitialized value in length at > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 271. > Use of uninitialized value in concatenation (.) or string at > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 283. > You have a non object [] passed to write_sequence. It maybe that you want > to > use new_sequence to make this string into a sequence object? at > /usr/lib/perl5/site_perl/5.8.5/Bio/Perl.pm line 283 > Bio::Perl::write_sequence('>roa1.fasta', 'fasta', 'undef') called > at > bt1.pl line 10 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From avilella at gmail.com Thu Aug 17 12:17:25 2006 From: avilella at gmail.com (Albert Vilella) Date: Thu, 17 Aug 2006 17:17:25 +0100 Subject: [Bioperl-l] informative codons method for kaks -- Bio/Align/DNAStatistics.pm In-Reply-To: References: <1155814640.11572.2.camel@localhost> Message-ID: <1155831445.12146.24.camel@localhost> I will opt for the "return the number of informative codons" then, which is the easiest :) I don't know when is this going to be there if it depends on me, though... my list of 'enh' bug tickets is growing shamefully fast :p Cheers, Albert. 
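As a rough illustration of the "return the number of informative codons" option, here is a sketch that counts codon columns in an aligned set of coding sequences. It assumes, purely for illustration, that a codon column counts when every sequence has three unambiguous bases (no gaps, no ambiguity codes) at that position; this is not necessarily the definition used by Bio::Align::DNAStatistics or by codeml, and count_complete_codons is a hypothetical name. It works on plain strings rather than a Bio::SimpleAlign object so that it runs without BioPerl.

```perl
use strict;
use warnings;

# Count codon columns in which every aligned sequence has three
# unambiguous bases.
# ASSUMPTION: "informative" is taken to mean "no gaps and no ambiguity
# codes in any sequence"; this is an illustrative definition only.
sub count_complete_codons {
    my @seqs = @_;
    return 0 unless @seqs;

    my $len = length $seqs[0];
    for my $s (@seqs) {
        die "aligned sequences must have equal length\n"
            unless length($s) == $len;
    }

    my $count = 0;
    # Step through the alignment one codon (three columns) at a time,
    # ignoring any trailing partial codon.
    for (my $pos = 0; $pos + 3 <= $len; $pos += 3) {
        my $ok = 1;
        for my $s (@seqs) {
            my $codon = uc substr($s, $pos, 3);
            if ($codon !~ /^[ACGT]{3}$/) {
                $ok = 0;
                last;
            }
        }
        $count++ if $ok;
    }
    return $count;
}

# Codon 2 contains an N and codon 3 contains gaps, so 2 of 4 are counted.
my @aln = ('ATGAAA---TTT', 'ATGAANCCCTTT');
print count_complete_codons(@aln), "\n";
```

A method inside DNAStatistics would presumably take the alignment object and pull the sequence strings from it first; that wiring is left out here because it depends on how the module is structured.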
On Thu, 2006-08-17 at 09:26 -0500, Chris Fields wrote: > Sure, why not? If you (or someone) can add one in, I don't see how > it could hurt. > > Make sure to add tests for this in the proper test suite. > > Chris > > On Aug 17, 2006, at 6:37 AM, Albert Vilella wrote: > > > Hi all, > > > > I think it would be nice to have a method in > > Bio/Align/DNAStatistics.pm that gives the number of informative codons > > for kaks in a MSA. That is, the codons that are used in the > > calculation of kaks. This, AFAICS, more or less what codeml calls > > "patterns". > > > > I often find myself in the situation of wanting to know how big is the > > CDS alignment not in terms of sequence length, but of the number of > > codons that are going to be used in the kaks statistics. I guess this > > method would help in that. > > > > The method could: > > > > return the number of informative codons? > > maybe return a new seqarray with only the informative codons? > > > > What do you think? Jason? Chris? > > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2078 > > > > Bests, > > > > Albert. > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at uiuc.edu Thu Aug 17 12:39:19 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 11:39:19 -0500 Subject: [Bioperl-l] informative codons method for kaks --Bio/Align/DNAStatistics.pm In-Reply-To: <1155831445.12146.24.camel@localhost> Message-ID: <007701c6c21b$b08d65c0$15327e82@pyrimidine> Albert, Might be a good idea to start working on those! 
Nice to use Bugzilla as a repository for ideas, but we're heading towards using the Bioperl wiki for that more now. Chris > -----Original Message----- > From: Albert Vilella [mailto:avilella at gmail.com] > Sent: Thursday, August 17, 2006 11:17 AM > To: Chris Fields > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] informative codons method for kaks -- > Bio/Align/DNAStatistics.pm > > I will opt for the "return the number of informative codons" then, which > is the easiest :) > > I don't know when is this going to be there if it depends on me, > though... my list of 'enh' bug tickets is growing shamefully fast :p > > Cheers, > > Albert. > > On Thu, 2006-08-17 at 09:26 -0500, Chris Fields wrote: > > Sure, why not? If you (or someone) can add one in, I don't see how > > it could hurt. > > > > Make sure to add tests for this in the proper test suite. > > > > Chris > > > > On Aug 17, 2006, at 6:37 AM, Albert Vilella wrote: > > > > > Hi all, > > > > > > I think it would be nice to have a method in > > > Bio/Align/DNAStatistics.pm that gives the number of informative codons > > > for kaks in a MSA. That is, the codons that are used in the > > > calculation of kaks. This, AFAICS, more or less what codeml calls > > > "patterns". > > > > > > I often find myself in the situation of wanting to know how big is the > > > CDS alignment not in terms of sequence length, but of the number of > > > codons that are going to be used in the kaks statistics. I guess this > > > method would help in that. > > > > > > The method could: > > > > > > return the number of informative codons? > > > maybe return a new seqarray with only the informative codons? > > > > > > What do you think? Jason? Chris? > > > > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2078 > > > > > > Bests, > > > > > > Albert. 
> > > > > > > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Christopher Fields > > Postdoctoral Researcher > > Lab of Dr. Robert Switzer > > Dept of Biochemistry > > University of Illinois Urbana-Champaign > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lincoln.stein at gmail.com Thu Aug 17 12:26:07 2006 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Thu, 17 Aug 2006 12:26:07 -0400 Subject: [Bioperl-l] Extracting gene seq from Bio::DB::GFF In-Reply-To: References: Message-ID: <6dce9a0b0608170926x173b23ebwc5cb5010b5354d1e@mail.gmail.com> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to load the GFF3-format Fly data? I think you're probably getting confused by overlapping mRNA splice forms, an issue that won't occur with the full GFF3-formatted data. On 8/13/06, Chris Fields wrote: > > Marco, > > Did you figure out what the problem was? I was curious; the issue > you were having was rather odd. I wanted to see if it was an issue > with the GFF data or with the database itself. > > Chris > > On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: > > > Chris, > > > >> Do you mean you get duplicates of sequences back, or that you get > >> more than > >> one chunk of the same sequence back? > > > > I sometimes get duplicated sequences and sometimes overlapping > > regions (see > > bellow) > > > >> > >> Is it possible that each query using an ID could contain more than > >> one > >> feature? That might explain it (you could check by testing the > >> size of the > >> array @feats). > > Most id return more than one features from various type > > ( point_mutation, > > insertion_site, processed_transcript, etc...). 
That's why I > > restirct the > > output to type "gene" using regexp /gene/ on $f->type. > > > >> > >> I'm not sure how split locations are handled within Bio:DB::GFF, > >> but do the > >> specific features have split locations? > >> > >> Chris > >> > > Not sure what you mean exactly but have a look at the following > > script, it > > gives the location and the group id of the feature being reported: > > > > use Bio::DB::GFF; > > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > > -dsn => > > 'dbi:mysql:database=dmel_43_new'); > > my %dups; > > while (<>){ > > chomp; > > my $id = $_; > > my @feat = $db->get_feature_by_name($id); > > > > for my $f (@feat){ > > if (exists $dups{$f->group} && $f->type =~/gene/){ > > print "Calling >>>", $f->group, "\n"; > > print "Chr: ", $f->refseq, > > " Strand: ", $f->strand, > > " Start: ", $f->start, > > " End: ", $f->end, > > "\n"; > > print "Offending >>>", $dups{$f->group}->group, "\n"; > > print "Chr: ", $dups{$f->group}->refseq, > > " Strand: ", $dups{$f->group}->strand, > > " Start: ", $dups{$f->group}->start, > > " End: ", $dups{$f->group}->end; > > print "\n\n"; > > } else { > > $dups{$f->group} = $f; > > } > > } > > } > > > > Here is the output: > > Calling >>>FBgn0004179 > > Chr: 3L Strand: 1 Start: 22201102 End: 22207587 > > Offending >>>FBgn0004179 > > Chr: 3L Strand: 1 Start: 22200575 End: 22200575 > > > > Calling >>>FBgn0025681 > > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > Offending >>>FBgn0025681 > > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > > > Calling >>>FBgn0025803 > > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > Offending >>>FBgn0025803 > > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > > > Calling >>>FBgn0000117 > > Chr: X Strand: -1 Start: 1756796 End: 1747557 > > Offending >>>FBgn0000117 > > Chr: X Strand: -1 Start: 1757776 End: 1747182 > > > > Calling >>>FBgn0005427 > > Chr: X Strand: -1 Start: 136456 End: 125343 > > Offending >>>FBgn0005427 > > Chr: X Strand: -1 Start: 
133199 End: 124949 > > > > Calling >>>FBgn0000042 > > Chr: X Strand: 1 Start: 5746100 End: 5750026 > > Offending >>>FBgn0000042 > > Chr: X Strand: 1 Start: 5746096 End: 5746106 > > > > Calling >>>FBgn0004551 > > Chr: 2R Strand: -1 Start: 19443485 End: 19434556 > > Offending >>>FBgn0004551 > > Chr: 2R Strand: -1 Start: 19445155 End: 19429977 > > > > Do you have any suggestions?? Is the procedure I am using to > > retrieve the > > genes right? > > > > Many thanks > > > > Marco > > > > > > > >>> Many thanks Scott, > >>> > >>> At the same time I got your email I was coming to the same > >>> conclusion as > >>> you. > >>> > >>> Now I have a stranger problem in my hands... My goal is quite > >>> simple, I > >>> try > >>> to get the sequence of the genes back from the Bio::DB::GFF database > >>> loaded > >>> on MySQL. The gene list is from a file with one gene id per per > >>> line. When > >>> I > >>> run the following script: > >>> > >>> > >>> > >>> use Bio::DB::GFF; > >>> use Bio::SeqIO; > >>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, > >>> -format => 'fasta'); > >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > >>> -dsn => > >>> 'dbi:mysql:database=dmel_43_new'); > >>> > >>> while (<>){ > >>> chomp; > >>> my $id = $_; > >>> my @feats = $db->get_feature_by_name($id); > >>> for my $f (@feats){ > >>> $out->write_seq( $f->seq ) if $f->type =~/gene/; > >>> } > >>> } > >>> > >>> > >>> I get more sequence back than the number of gene in my input file. I > >>> double > >>> check there. Some of the duplicated entries are the same, some > >>> are not! > >> > >> > >> ... > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > ______________________________ > > Marco Blanchette, Ph.D. > > > > mblanche at uclink.berkeley.edu > > > > Donald C. 
Rio's lab > > Department of Molecular and Cell Biology > > 16 Barker Hall > > University of California > > Berkeley, CA 94720-3204 > > > > Tel: (510) 642-1084 > > Cell: (510) 847-0996 > > Fax: (510) 642-6062 > > -- > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From lstein at cshl.edu Thu Aug 17 13:27:33 2006 From: lstein at cshl.edu (Lincoln Stein) Date: Thu, 17 Aug 2006 13:27:33 -0400 Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF In-Reply-To: <6dce9a0b0608170926x173b23ebwc5cb5010b5354d1e@mail.gmail.com> References: <6dce9a0b0608170926x173b23ebwc5cb5010b5354d1e@mail.gmail.com> Message-ID: <6dce9a0b0608171027k36bb06c3pb3b49eed411bcfac@mail.gmail.com> Hi, This message bounced because I tried to send it from my gmail account and so I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it finds a file that contains DNA data, it simply loads it. There is no special command line switch. Also you can include the DNA in the GFF3 file. Lincoln ---------- Forwarded message ---------- From: Lincoln Stein Date: Aug 17, 2006 12:26 PM Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF To: Chris Fields Cc: Marco Blanchette , "bioperl-l at lists.open-bio.org" , cain.cshl at gmail.com I'm curious. 
Could you try using the Bio::DB::SeqFeature::Store class to load the GFF3-format Fly data? I think you're probably getting confused by overlapping mRNA splice forms, an issue that won't occur with the full GFF3-formatted data. On 8/13/06, Chris Fields wrote: > > Marco, > > Did you figure out what the problem was? I was curious; the issue > you were having was rather odd. I wanted to see if it was an issue > with the GFF data or with the database itself. > > Chris > > On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: > > > Chris, > > > >> Do you mean you get duplicates of sequences back, or that you get > >> more than > >> one chunk of the same sequence back? > > > > I sometimes get duplicated sequences and sometimes overlapping > > regions (see > > bellow) > > > >> > >> Is it possible that each query using an ID could contain more than > >> one > >> feature? That might explain it (you could check by testing the > >> size of the > >> array @feats). > > Most id return more than one features from various type > > ( point_mutation, > > insertion_site, processed_transcript, etc...). That's why I > > restirct the > > output to type "gene" using regexp /gene/ on $f->type. > > > >> > >> I'm not sure how split locations are handled within Bio:DB::GFF, > >> but do the > >> specific features have split locations? 
> >> > >> Chris > >> > > Not sure what you mean exactly but have a look at the following > > script, it > > gives the location and the group id of the feature being reported: > > > > use Bio::DB::GFF; > > my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > > -dsn => > > 'dbi:mysql:database=dmel_43_new'); > > my %dups; > > while (<>){ > > chomp; > > my $id = $_; > > my @feat = $db->get_feature_by_name($id); > > > > for my $f (@feat){ > > if (exists $dups{$f->group} && $f->type =~/gene/){ > > print "Calling >>>", $f->group, "\n"; > > print "Chr: ", $f->refseq, > > " Strand: ", $f->strand, > > " Start: ", $f->start, > > " End: ", $f->end, > > "\n"; > > print "Offending >>>", $dups{$f->group}->group, "\n"; > > print "Chr: ", $dups{$f->group}->refseq, > > " Strand: ", $dups{$f->group}->strand, > > " Start: ", $dups{$f->group}->start, > > " End: ", $dups{$f->group}->end; > > print "\n\n"; > > } else { > > $dups{$f->group} = $f; > > } > > } > > } > > > > Here is the output: > > Calling >>>FBgn0004179 > > Chr: 3L Strand: 1 Start: 22201102 End: 22207587 > > Offending >>>FBgn0004179 > > Chr: 3L Strand: 1 Start: 22200575 End: 22200575 > > > > Calling >>>FBgn0025681 > > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > Offending >>>FBgn0025681 > > Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > > > > Calling >>>FBgn0025803 > > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > Offending >>>FBgn0025803 > > Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > > > > Calling >>>FBgn0000117 > > Chr: X Strand: -1 Start: 1756796 End: 1747557 > > Offending >>>FBgn0000117 > > Chr: X Strand: -1 Start: 1757776 End: 1747182 > > > > Calling >>>FBgn0005427 > > Chr: X Strand: -1 Start: 136456 End: 125343 > > Offending >>>FBgn0005427 > > Chr: X Strand: -1 Start: 133199 End: 124949 > > > > Calling >>>FBgn0000042 > > Chr: X Strand: 1 Start: 5746100 End: 5750026 > > Offending >>>FBgn0000042 > > Chr: X Strand: 1 Start: 5746096 End: 5746106 > > > > Calling >>>FBgn0004551 > > Chr: 2R Strand: -1 
Start: 19443485 End: 19434556 > > Offending >>>FBgn0004551 > > Chr: 2R Strand: -1 Start: 19445155 End: 19429977 > > > > Do you have any suggestions?? Is the procedure I am using to > > retrieve the > > genes right? > > > > Many thanks > > > > Marco > > > > > > > >>> Many thanks Scott, > >>> > >>> At the same time I got your email I was coming to the same > >>> conclusion as > >>> you. > >>> > >>> Now I have a stranger problem in my hands... My goal is quite > >>> simple, I > >>> try > >>> to get the sequence of the genes back from the Bio::DB::GFF database > >>> loaded > >>> on MySQL. The gene list is from a file with one gene id per per > >>> line. When > >>> I > >>> run the following script: > >>> > >>> > >>> > >>> use Bio::DB::GFF; > >>> use Bio::SeqIO; > >>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, > >>> -format => 'fasta'); > >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > >>> -dsn => > >>> 'dbi:mysql:database=dmel_43_new'); > >>> > >>> while (<>){ > >>> chomp; > >>> my $id = $_; > >>> my @feats = $db->get_feature_by_name($id); > >>> for my $f (@feats){ > >>> $out->write_seq( $f->seq ) if $f->type =~/gene/; > >>> } > >>> } > >>> > >>> > >>> I get more sequence back than the number of gene in my input file. I > >>> double > >>> check there. Some of the duplicated entries are the same, some > >>> are not! > >> > >> > >> ... > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > ______________________________ > > Marco Blanchette, Ph.D. > > > > mblanche at uclink.berkeley.edu > > > > Donald C. 
Rio's lab > > Department of Molecular and Cell Biology > > 16 Barker Hall > > University of California > > Berkeley, CA 94720-3204 > > > > Tel: (510) 642-1084 > > Cell: (510) 847-0996 > > Fax: (510) 642-6062 > > -- > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu -- Lincoln D. Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From skumagai at life.bio.sunysb.edu Thu Aug 17 13:35:17 2006 From: skumagai at life.bio.sunysb.edu (Seiji Kumagai) Date: Thu, 17 Aug 2006 13:35:17 -0400 (EDT) Subject: [Bioperl-l] No space between parameters in codeml output. Message-ID: Hi, I have just reported a bug in PAML.pm (#2080) with patches. They solve an issue that parsing fails if codeml reported omega (dN/dS) in NSsite results without \s+ between two estimates like '0.12345678.90123'. The patches also include a fix to a cosmetic issue I introduced in previous report (#2054 or #2055 can't remember). Would someone care to take a look? 
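The failure described above (two estimates printed with no whitespace between them) can be illustrated with a short sketch. It assumes every estimate is printed with a fixed number of digits after the decimal point, five by default, so that a fused string can be cut at those boundaries. This is an illustrative approach and is not taken from the patches attached to the bug report; split_fused_decimals is a hypothetical name.

```perl
use strict;
use warnings;

# Split fixed-precision decimals that were printed without separating
# whitespace, e.g. '0.12345678.90123'.
# ASSUMPTION: each value has exactly $places digits after the decimal
# point (5 by default). Not the code from the actual PAML.pm patch.
sub split_fused_decimals {
    my ($text, $places) = @_;
    $places = 5 unless defined $places;

    # Each match is: one or more digits, a dot, then exactly $places digits.
    my @values = $text =~ /(\d+\.\d{$places})/g;
    return @values;
}

my @omegas = split_fused_decimals('0.12345678.90123');
print join(' ', @omegas), "\n";
```

With the example string from the report this prints "0.12345 678.90123". Input that already has whitespace between the values is split the same way, so the pattern can replace a plain split on \s+ when the fixed-precision assumption holds.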
Thanks From cjfields at uiuc.edu Thu Aug 17 14:00:00 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 13:00:00 -0500 Subject: [Bioperl-l] No space between parameters in codeml output. In-Reply-To: Message-ID: <008501c6c226$f5eb9280$15327e82@pyrimidine> I'm working on it now. Thanks! Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Seiji Kumagai > Sent: Thursday, August 17, 2006 12:35 PM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] No space between parameters in codeml output. > > Hi, > > I have just reported a bug in PAML.pm (#2080) with patches. They solve an > issue that parsing fails if codeml reported omega (dN/dS) in NSsite > results without \s+ between two estimates like '0.12345678.90123'. The > patches also include a fix to a cosmetic issue I introduced in previous > report (#2054 or #2055 can't remember). > > Would someone care to take a look? > > Thanks > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu Aug 17 14:02:59 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 17 Aug 2006 19:02:59 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44DB4E98.70703@sendu.me.uk> References: <44DB4E98.70703@sendu.me.uk> Message-ID: <44E4AF53.2010402@sendu.me.uk> Sendu Bala wrote: > I am aiming to solve Project priority list item 1.2.1 "Improve > Bio::SearchIO speed...". [...] > More radical changes will make SearchIO even faster, eg. > Chris Fields and Jason (if I interpret the Project priority list item > correctly) have suggested an end to individual Hit and HSP objects, > which become just data members of a Result-like object. 
Ideally I don't > want to go down that route because we lose quite a bit of OO power; HSP > objects in particular make important use of inheritance The most significant cause of slow-down is HSPI objects being Bio::SeqFeature::SimilarityPair objects. The main reason for that inheritance seems to be so we can have methods hit() and query() which give back Bio::SeqFeature::Similarity objects (which are Bio::SeqFeature::Generic). Does anyone feel it is vital for HSPIs to be like this, or could they be simpler (eg. just return Bio::LocatableSeq objects for hit() and query(), with all other information available via direct HSPI methods)? In one test case I can get a 3.5x speed up from that change alone. From mblanche at berkeley.edu Thu Aug 17 14:20:00 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Thu, 17 Aug 2006 11:20:00 -0700 Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF In-Reply-To: <6dce9a0b0608171027k36bb06c3pb3b49eed411bcfac@mail.gmail.com> Message-ID: Lincoln, thanks for the precision. I just could not find any reference on how to load the DNA (nowhere in bp_seqfeature_load.pl or in Bio::DB::SeqFeature::Store does it say how to load the DNA sequences). Right now the gff files are loaded into mysql using: /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff I tried the --fast option but got a bunch of warnings (see below). The DNA file (a single fasta database file containing all chromosome sequences) was in a different location from the gff files and was not loaded together with the gff files (the sequence table is empty in the database). Can I load the DNA sequence after the gff files have been loaded? Many thanks Marco -------------------- WARNING --------------------- MSG: ID=ortho:2825 has been used more than once, but it cannot be found in the database. This can happen if you have specified fast loading, but features sharing the same ID are not contiguous in the GFF file. This will be loaded as a separate feature.
Line 483681: "X . orthologous_region 19477824 19478027 . + . ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 STACK Bio::DB::SeqFeature::Store::GFF3Loader::load /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 On 8/17/06 10:27, "Lincoln Stein" wrote: > Hi, > > This message bounced because I tried to send it from my gmail account and so > I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it > finds a file that contains DNA data, it simply loads it. There is no special > command line switch. Also you can include the DNA in the GFF3 file. > > Lincoln > > ---------- Forwarded message ---------- > From: Lincoln Stein > Date: Aug 17, 2006 12:26 PM > Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF > To: Chris Fields > Cc: Marco Blanchette , "bioperl-l at lists.open-bio.org" > , cain.cshl at gmail.com > > I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to > load the GFF3-format Fly data? I think you're probably getting confused by > overlapping mRNA splice forms, an issue that won't occur with the full > GFF3-formatted data. > > > On 8/13/06, Chris Fields wrote: >> >> Marco, >> >> Did you figure out what the problem was? I was curious; the issue >> you were having was rather odd. I wanted to see if it was an issue >> with the GFF data or with the database itself. >> >> Chris >> >> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: >> >>> Chris, >>> >>>> Do you mean you get duplicates of sequences back, or that you get >>>> more than >>>> one chunk of the same sequence back? 
>>> >>> I sometimes get duplicated sequences and sometimes overlapping >>> regions (see >>> bellow) >>> >>>> >>>> Is it possible that each query using an ID could contain more than >>>> one >>>> feature? That might explain it (you could check by testing the >>>> size of the >>>> array @feats). >>> Most id return more than one features from various type >>> ( point_mutation, >>> insertion_site, processed_transcript, etc...). That's why I >>> restirct the >>> output to type "gene" using regexp /gene/ on $f->type. >>> >>>> >>>> I'm not sure how split locations are handled within Bio:DB::GFF, >>>> but do the >>>> specific features have split locations? >>>> >>>> Chris >>>> >>> Not sure what you mean exactly but have a look at the following >>> script, it >>> gives the location and the group id of the feature being reported: >>> >>> use Bio::DB::GFF; >>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>> -dsn => >>> 'dbi:mysql:database=dmel_43_new'); >>> my %dups; >>> while (<>){ >>> chomp; >>> my $id = $_; >>> my @feat = $db->get_feature_by_name($id); >>> >>> for my $f (@feat){ >>> if (exists $dups{$f->group} && $f->type =~/gene/){ >>> print "Calling >>>", $f->group, "\n"; >>> print "Chr: ", $f->refseq, >>> " Strand: ", $f->strand, >>> " Start: ", $f->start, >>> " End: ", $f->end, >>> "\n"; >>> print "Offending >>>", $dups{$f->group}->group, "\n"; >>> print "Chr: ", $dups{$f->group}->refseq, >>> " Strand: ", $dups{$f->group}->strand, >>> " Start: ", $dups{$f->group}->start, >>> " End: ", $dups{$f->group}->end; >>> print "\n\n"; >>> } else { >>> $dups{$f->group} = $f; >>> } >>> } >>> } >>> >>> Here is the output: >>> Calling >>>FBgn0004179 >>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 >>> Offending >>>FBgn0004179 >>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 >>> >>> Calling >>>FBgn0025681 >>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>> Offending >>>FBgn0025681 >>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>> >>> Calling >>>FBgn0025803 >>> 
Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>> Offending >>>FBgn0025803 >>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>> >>> Calling >>>FBgn0000117 >>> Chr: X Strand: -1 Start: 1756796 End: 1747557 >>> Offending >>>FBgn0000117 >>> Chr: X Strand: -1 Start: 1757776 End: 1747182 >>> >>> Calling >>>FBgn0005427 >>> Chr: X Strand: -1 Start: 136456 End: 125343 >>> Offending >>>FBgn0005427 >>> Chr: X Strand: -1 Start: 133199 End: 124949 >>> >>> Calling >>>FBgn0000042 >>> Chr: X Strand: 1 Start: 5746100 End: 5750026 >>> Offending >>>FBgn0000042 >>> Chr: X Strand: 1 Start: 5746096 End: 5746106 >>> >>> Calling >>>FBgn0004551 >>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 >>> Offending >>>FBgn0004551 >>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 >>> >>> Do you have any suggestions?? Is the procedure I am using to >>> retrieve the >>> genes right? >>> >>> Many thanks >>> >>> Marco >>> >>> >>> >>>>> Many thanks Scott, >>>>> >>>>> At the same time I got your email I was coming to the same >>>>> conclusion as >>>>> you. >>>>> >>>>> Now I have a stranger problem in my hands... My goal is quite >>>>> simple, I >>>>> try >>>>> to get the sequence of the genes back from the Bio::DB::GFF database >>>>> loaded >>>>> on MySQL. The gene list is from a file with one gene id per per >>>>> line. When >>>>> I >>>>> run the following script: >>>>> >>>>> >>>>> >>>>> use Bio::DB::GFF; >>>>> use Bio::SeqIO; >>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>>>> -format => 'fasta'); >>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>> -dsn => >>>>> 'dbi:mysql:database=dmel_43_new'); >>>>> >>>>> while (<>){ >>>>> chomp; >>>>> my $id = $_; >>>>> my @feats = $db->get_feature_by_name($id); >>>>> for my $f (@feats){ >>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>>>> } >>>>> } >>>>> >>>>> >>>>> I get more sequence back than the number of gene in my input file. I >>>>> double >>>>> check there. 
Some of the duplicated entries are the same, some >>>>> are not! >>>> >>>> >>>> ... >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> ______________________________ >>> Marco Blanchette, Ph.D. >>> >>> mblanche at uclink.berkeley.edu >>> >>> Donald C. Rio's lab >>> Department of Molecular and Cell Biology >>> 16 Barker Hall >>> University of California >>> Berkeley, CA 94720-3204 >>> >>> Tel: (510) 642-1084 >>> Cell: (510) 847-0996 >>> Fax: (510) 642-6062 >>> -- >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> Christopher Fields >> Postdoctoral Researcher >> Lab of Dr. Robert Switzer >> Dept of Biochemistry >> University of Illinois Urbana-Champaign >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From mblanche at berkeley.edu Thu Aug 17 14:40:50 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Thu, 17 Aug 2006 11:40:50 -0700 Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF In-Reply-To: Message-ID: I will answer my own question... Yes, one can load the fasta file after having loaded the gff file by doing: bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta Marco On 8/17/06 11:20, "Marco Blanchette" wrote: > Lincoln, thanks for the precision. 
I just could not find any references to > how to load the DNA (no where in bp_seqfeature_load.pl or in the > Bio::DB::SeqFeature::Store it says how load the DNA sequences). > > So right now the gff files were loaded in mysql using: > /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff > > I tried the --fast options but got a bunch of warning (see below). > > The DNA file (a single fasta database file containing all chromosome > sequences) was in a different location from the gff files and was not loaded > together with the gff files (the sequence table is empty in the database). > > Can I load the DNA sequence after the gff files were loaded? > > Many thanks > > Marco > > > -------------------- WARNING --------------------- > MSG: ID=ortho:2825 has been used more than once, but it cannot be found in > the database. > This can happen if you have specified fast loading, but features sharing the > same ID > are not contiguous in the GFF file. This will be loaded as a separate > feature. > Line 483681: "X . orthologous_region 19477824 19478027 > . + . > ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 > STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 > STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 > > > On 8/17/06 10:27, "Lincoln Stein" wrote: > >> Hi, >> >> This message bounced because I tried to send it from my gmail account and so >> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it >> finds a file that contains DNA data, it simply loads it. There is no special >> command line switch. 
Also you can include the DNA in the GFF3 file. >> >> Lincoln >> >> ---------- Forwarded message ---------- >> From: Lincoln Stein >> Date: Aug 17, 2006 12:26 PM >> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF >> To: Chris Fields >> Cc: Marco Blanchette , "bioperl-l at lists.open-bio.org" >> , cain.cshl at gmail.com >> >> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to >> load the GFF3-format Fly data? I think you're probably getting confused by >> overlapping mRNA splice forms, an issue that won't occur with the full >> GFF3-formatted data. >> >> >> On 8/13/06, Chris Fields wrote: >>> >>> Marco, >>> >>> Did you figure out what the problem was? I was curious; the issue >>> you were having was rather odd. I wanted to see if it was an issue >>> with the GFF data or with the database itself. >>> >>> Chris >>> >>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: >>> >>>> Chris, >>>> >>>>> Do you mean you get duplicates of sequences back, or that you get >>>>> more than >>>>> one chunk of the same sequence back? >>>> >>>> I sometimes get duplicated sequences and sometimes overlapping >>>> regions (see >>>> bellow) >>>> >>>>> >>>>> Is it possible that each query using an ID could contain more than >>>>> one >>>>> feature? That might explain it (you could check by testing the >>>>> size of the >>>>> array @feats). >>>> Most id return more than one features from various type >>>> ( point_mutation, >>>> insertion_site, processed_transcript, etc...). That's why I >>>> restirct the >>>> output to type "gene" using regexp /gene/ on $f->type. >>>> >>>>> >>>>> I'm not sure how split locations are handled within Bio:DB::GFF, >>>>> but do the >>>>> specific features have split locations? 
>>>>> >>>>> Chris >>>>> >>>> Not sure what you mean exactly but have a look at the following >>>> script, it >>>> gives the location and the group id of the feature being reported: >>>> >>>> use Bio::DB::GFF; >>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>> -dsn => >>>> 'dbi:mysql:database=dmel_43_new'); >>>> my %dups; >>>> while (<>){ >>>> chomp; >>>> my $id = $_; >>>> my @feat = $db->get_feature_by_name($id); >>>> >>>> for my $f (@feat){ >>>> if (exists $dups{$f->group} && $f->type =~/gene/){ >>>> print "Calling >>>", $f->group, "\n"; >>>> print "Chr: ", $f->refseq, >>>> " Strand: ", $f->strand, >>>> " Start: ", $f->start, >>>> " End: ", $f->end, >>>> "\n"; >>>> print "Offending >>>", $dups{$f->group}->group, "\n"; >>>> print "Chr: ", $dups{$f->group}->refseq, >>>> " Strand: ", $dups{$f->group}->strand, >>>> " Start: ", $dups{$f->group}->start, >>>> " End: ", $dups{$f->group}->end; >>>> print "\n\n"; >>>> } else { >>>> $dups{$f->group} = $f; >>>> } >>>> } >>>> } >>>> >>>> Here is the output: >>>> Calling >>>FBgn0004179 >>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 >>>> Offending >>>FBgn0004179 >>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 >>>> >>>> Calling >>>FBgn0025681 >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>> Offending >>>FBgn0025681 >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>> >>>> Calling >>>FBgn0025803 >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>> Offending >>>FBgn0025803 >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>> >>>> Calling >>>FBgn0000117 >>>> Chr: X Strand: -1 Start: 1756796 End: 1747557 >>>> Offending >>>FBgn0000117 >>>> Chr: X Strand: -1 Start: 1757776 End: 1747182 >>>> >>>> Calling >>>FBgn0005427 >>>> Chr: X Strand: -1 Start: 136456 End: 125343 >>>> Offending >>>FBgn0005427 >>>> Chr: X Strand: -1 Start: 133199 End: 124949 >>>> >>>> Calling >>>FBgn0000042 >>>> Chr: X Strand: 1 Start: 5746100 End: 5750026 >>>> Offending >>>FBgn0000042 >>>> Chr: X Strand: 1 Start: 5746096 
End: 5746106 >>>> >>>> Calling >>>FBgn0004551 >>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 >>>> Offending >>>FBgn0004551 >>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 >>>> >>>> Do you have any suggestions?? Is the procedure I am using to >>>> retrieve the >>>> genes right? >>>> >>>> Many thanks >>>> >>>> Marco >>>> >>>> >>>> >>>>>> Many thanks Scott, >>>>>> >>>>>> At the same time I got your email I was coming to the same >>>>>> conclusion as >>>>>> you. >>>>>> >>>>>> Now I have a stranger problem in my hands... My goal is quite >>>>>> simple, I >>>>>> try >>>>>> to get the sequence of the genes back from the Bio::DB::GFF database >>>>>> loaded >>>>>> on MySQL. The gene list is from a file with one gene id per per >>>>>> line. When >>>>>> I >>>>>> run the following script: >>>>>> >>>>>> >>>>>> >>>>>> use Bio::DB::GFF; >>>>>> use Bio::SeqIO; >>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>>>>> -format => 'fasta'); >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>> -dsn => >>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>> >>>>>> while (<>){ >>>>>> chomp; >>>>>> my $id = $_; >>>>>> my @feats = $db->get_feature_by_name($id); >>>>>> for my $f (@feats){ >>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>>>>> } >>>>>> } >>>>>> >>>>>> >>>>>> I get more sequence back than the number of gene in my input file. I >>>>>> double >>>>>> check there. Some of the duplicated entries are the same, some >>>>>> are not! >>>>> >>>>> >>>>> ... >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> ______________________________ >>>> Marco Blanchette, Ph.D. >>>> >>>> mblanche at uclink.berkeley.edu >>>> >>>> Donald C. 
Rio's lab >>>> Department of Molecular and Cell Biology >>>> 16 Barker Hall >>>> University of California >>>> Berkeley, CA 94720-3204 >>>> >>>> Tel: (510) 642-1084 >>>> Cell: (510) 847-0996 >>>> Fax: (510) 642-6062 >>>> -- >>>> >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> Christopher Fields >>> Postdoctoral Researcher >>> Lab of Dr. Robert Switzer >>> Dept of Biochemistry >>> University of Illinois Urbana-Champaign >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> > > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From cjfields at uiuc.edu Thu Aug 17 14:55:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 13:55:21 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E4AF53.2010402@sendu.me.uk> Message-ID: <000701c6c22e$b08fb510$15327e82@pyrimidine> I don't feel a pressing nature to keep it, but others will likely disagree. How hard would it be to implement hit(), query(), and other methods in HSPI to return Bio::SeqFeature::Similarity objects directly instead of just inheriting the methods? In other words, only build and return the objects when the user calls hit() or query()? 
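A minimal sketch of the "build only on demand" idea raised above, assuming nothing about how BioPerl would actually implement it. My::Feature and My::LazyHSP are hypothetical stand-ins (the first plays the role that Bio::SeqFeature::Similarity has in the real code), and the coordinates are only illustrative. The parser would store plain scalars, and the heavier object is constructed on the first call to hit() or query(), then cached:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an expensive feature object
# (in BioPerl this role would be played by Bio::SeqFeature::Similarity).
package My::Feature;
our $built = 0;    # counts constructions, for illustration only
sub new {
    my ($class, %args) = @_;
    $built++;
    return bless {%args}, $class;
}
sub start { $_[0]{start} }
sub end   { $_[0]{end} }

# Hypothetical lightweight HSP: keeps raw coordinates at parse time and
# builds feature objects only when hit() or query() is first called.
package My::LazyHSP;
sub new {
    my ($class, %args) = @_;
    return bless { _raw => {%args} }, $class;
}
sub _lazy {
    my ($self, $which) = @_;
    # Build once, then reuse the cached object on later calls.
    $self->{"_$which"} ||= My::Feature->new(
        start => $self->{_raw}{"${which}_start"},
        end   => $self->{_raw}{"${which}_end"},
    );
    return $self->{"_$which"};
}
sub hit   { $_[0]->_lazy('hit') }
sub query { $_[0]->_lazy('query') }

package main;

my $hsp = My::LazyHSP->new(
    query_start => 9,  query_end => 258,
    hit_start   => 10, hit_end   => 260,
);
print "features built after parsing: $My::Feature::built\n";

my $hit = $hsp->hit;    # first call constructs the object
$hsp->hit;              # second call reuses it
print "features built after hit():   $My::Feature::built\n";
printf "hit spans %d-%d\n", $hit->start, $hit->end;
```

The trade-off is that callers who never ask for hit() or query() pay nothing for them, while callers who do still get a full object back.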
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Thursday, August 17, 2006 1:03 PM > To: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] SearchIO speed up > > Sendu Bala wrote: > > I am aiming to solve Project priority list item 1.2.1 "Improve > > Bio::SearchIO speed...". > [...] > > More radical changes will make SearchIO even faster, eg. > > Chris Fields and Jason (if I interpret the Project priority list item > > correctly) have suggested an end to individual Hit and HSP objects, > > which become just data members of a Result-like object. Ideally I don't > > want to go down that route because we lose quite a bit of OO power; HSP > > objects in particular make important use of inheritance > > The most significant cause of slow-down is HSPI objects being > Bio::SeqFeature::SimilarityPair objects. The main reason for that > inheritance seems to be so we can have methods hit() and query() which > give back Bio::SeqFeature::Similarity objects (which are > Bio::SeqFeature::Generic). > > Does anyone feel it is vital for HSPIs to be like this, or could they be > simpler (eg. just return Bio::LocatableSeq objects for hit() and > query(), with all other information available via direct HSPI methods)? > > In one test case I can get a 3.5x speed up from that change alone. > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From freimuth at pathology.wustl.edu Thu Aug 17 15:53:04 2006 From: freimuth at pathology.wustl.edu (Freimuth, Robert) Date: Thu, 17 Aug 2006 14:53:04 -0500 Subject: [Bioperl-l] Error parsing BLAST report Message-ID: <71AE766382153B47AAB638DC83ED7F49014CEA17@pathexch1.wusm-path.wustl.edu> Hi, Thanks for the suggestion. 
I have entered it as bug 2081 (http://bugzilla.open-bio.org/show_bug.cgi?id=2081) and uploaded both the test code and BLAST report as attachments. Thanks for the help, Bob > -----Original Message----- > From: Chris Fields [mailto:cjfields at uiuc.edu] > Sent: Thursday, August 17, 2006 10:58 AM > To: Freimuth, Robert; 'Brian Osborne'; bioperl-l at lists.open-bio.org > Subject: RE: [Bioperl-l] Error parsing BLAST report > > Robert, > > This sounds like a possible bug; the error you are getting is > from BLAST > 2.2.11 output, so it should work. The BLAST parsing errors > fixed in CVS had > to do with parsing BLAST 2.2.13 (and later) output. > > Could you add this as a bug to Bugzilla with your test script > and test case > data that generates the error? > > How to post the bug: > > http://www.bioperl.org/wiki/Bugs > > Bugzilla: > > http://bugzilla.open-bio.org/ > > Chris > > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of Freimuth, Robert > > Sent: Wednesday, August 16, 2006 11:42 PM > > To: Brian Osborne; bioperl-l at lists.open-bio.org > > Subject: Re: [Bioperl-l] Error parsing BLAST report > > > > Hi, > > > > Thank you for your reply. I downloaded bioperl-1.5.1 from > > http://bioperl.org/DIST/ and installed it (which appeared > successful), > > but the one-liner: > > > > perl -MBio::Root::Version -e 'print > $Bio::Root::Version::VERSION, "\n"' > > > > prints 1.5 (I expected 1.5.1). > > > > When I run the test case that I reported earlier, I get the > following > > output: > > > > -------------------- WARNING --------------------- > > MSG: There is no HSP data for hit 'ENSP00000327738'. > > You have called a method (Bio::Search::Hit::GenericHit::length_aln) > > that requires HSP data and there was no HSP data for this hit, > > most likely because it was absent from the BLAST report. 
> > Note that by default, BLAST lists alignments for the first 250 hits, > > but it lists descriptions for 500 hits. If this is the case, > > and you care about these hits, you should re-run BLAST using the > > -b option (or equivalent if not using blastall) to increase > the number > > of alignments. > > > > --------------------------------------------------- > > Alignment length for ENSP00000327738 is - > > Alignment length for ENSP00000350182 is 250 > > Alignment length for ENSP00000327738 is 398 > > > > Could someone that is running 1.5.1 please verify the output of the > > one-liner above (did I somehow get the wrong file from the > ftp site?) > > and try to reproduce the error with the test case? > > > > Thanks for the help. I'm stumped. > > > > Bob > > > > > -----Original Message----- > > > From: Brian Osborne [mailto:osborne1 at optonline.net] > > > Sent: Wednesday, August 16, 2006 6:21 PM > > > To: Freimuth, Robert; bioperl-l at lists.open-bio.org > > > Subject: Re: [Bioperl-l] Error parsing BLAST report > > > > > > Robert, > > > > > > The standard answer to a complaint about SearchIO these days > > > is to upgrade > > > to version 1.5.1 - what Bioperl version are you using? > > > > > > Brian O. 
> > > > > > > > > On 8/16/06 3:56 PM, "Freimuth, Robert" > > > wrote: > > > > > > > Hello, > > > > > > > > I'm trying to parse a BLAST report using the following code: > > > > > > > > use warnings; > > > > use strict; > > > > > > > > use Bio::SearchIO; > > > > > > > > my $file = 'NP_006065_blast.out'; > > > > > > > > my $searchio = new Bio::SearchIO( -format => 'blast', > > > > -file => $file ); > > > > > > > > while( my $result = $searchio->next_result() ) > > > > { > > > > while( my $hit = $result->next_hit ) > > > > { > > > > my $hit_acc_num = $hit->accession(); > > > > > > > > # get the total length of the aligned region > > > for query or > > > > sbjct seq > > > > # (includes all HSPs, calculated after tiling) > > > > > > > > my $align_len = $hit->length_aln( 'query' ); > > > > > > > > print "Alignment length for $hit_acc_num is > > > $align_len\n"; > > > > } > > > > } > > > > > > > > There are 104 one-line descriptions in the report, and > > > alignments for > > > > each one of them (the blast report was created using > > > > b_num_alignments_shown => 500 and > v_num_descriptions_shown => 500). > > > > However, when I run the above code I get 14 errors like the > > > following: > > > > > > > > -------------------- WARNING --------------------- > > > > MSG: There is no HSP data for hit 'ENSP00000327738'. > > > > You have called a method > (Bio::Search::Hit::GenericHit::length_aln) > > > > that requires HSP data and there was no HSP data for this hit, > > > > most likely because it was absent from the BLAST report. > > > > Note that by default, BLAST lists alignments for the > first 250 hits, > > > > but it lists descriptions for 500 hits. If this is the case, > > > > and you care about these hits, you should re-run BLAST using the > > > > -b option (or equivalent if not using blastall) to increase > > > the number > > > > of alignments. 
> > > > > > > > --------------------------------------------------- > > > > > > > > There is an alignment for this (and the other 13 > sequences) in the > > > > report. In fact, if I edit the report and delete all but the > > > > description and the alignment for ENSP00000327738, it > > > parses fine (no > > > > error). > > > > > > > > I continued editing the report and produced the following > > > minimal test > > > > case that reproduces the error. Note that the description for > > > > ENSP00000350182 appears twice, BUT THE ERROR IS FOR > ENSP00000327738. > > > > > > > > *********** BLAST REPORT FOR TEST CASE *********** > > > > > > > > BLASTP 2.2.11 [Jun-05-2005] > > > > > > > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. > > > > Schaffer, > > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. > Lipman (1997), > > > > "Gapped BLAST and PSI-BLAST: a new generation of protein > > > database search > > > > programs", Nucleic Acids Res. 25:3389-3402. > > > > > > > > Query= NP_006065 > > > > (442 letters) > > > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > > 48,851 sequences; 23,910,368 total letters > > > > > > > > Searching..................................................done > > > > > > > > > > > Score > > > > E > > > > Sequences producing significant alignments: > > > (bits) > > > > Value > > > > > > > > ENSP00000350182 pep:novel > clone::BX322644.8:4905:15090:-1 gene:E... > > > > 120 3e-27 > > > > ENSP00000350182 pep:novel > clone::BX322644.8:4905:15090:-1 gene:E... > > > > 120 3e-27 > > > > ENSP00000327738 pep:known-ccds > chromosome:NCBI36:4:189297592:189... 
> > > > 115 8e-26 > > > > > > > >> ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 > > > > gene:ENSG00000137397 > > > > transcript:ENST00000357569 > > > > Length = 425 > > > > > > > > Score = 120 bits (301), Expect = 3e-27 > > > > Identities = 76/261 (29%), Positives = 140/261 (53%), > Gaps = 21/261 > > > > (8%) > > > > > > > > Query: 9 > > > IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL > > > > 68 > > > > +++EV CPICL++L +P+++DCGH+FC CIT +I E+ S G > > > CP+C+T + + > > > > Sbjct: 10 > > > LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI > > > > 65 > > > > > > > > Query: 69 > > > RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG > > > > 127 > > > > R N L N+VE+++ ++ S Q +K C H + FC++DGK > > > +C+VC S++H+ > > > > Sbjct: 66 > > > RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS > > > > 125 > > > > > > > > Query: 128 > > > HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK > > > > 181 > > > > H I E + Q ++Q +Q L ++++E +++ D+ ++ > > > + E+Q+IL > > > > Sbjct: 126 > > > HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT > > > > 183 > > > > > > > > Query: 182 > > > GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL > > > > 237 > > > > F + +L+ E+ L ++ EG +A+ QL D > > > L+ L+ + > > > > Sbjct: 184 > > > EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ > > > > 239 > > > > > > > > Query: 238 TGSSVEMLQDVIDVMKRSESW 258 > > > > ++L+ + RSE + > > > > Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260 > > > > > > > > > > > >> ENSP00000327738 pep:known-ccds > > > > chromosome:NCBI36:4:189297592:189305643:1 > > > > gene:ENSG00000184108 transcript:ENST00000332517 > > > > CCDS3851.1 > > > > Length = 468 > > > > > > > > Score = 115 bits (289), Expect = 8e-26 > > > > Identities = 101/410 (24%), Positives = 180/410 (43%), > > > Gaps = 39/410 > > > > (9%) > > > > > > > > Query: 8 > > > DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN > > > > 67 > > > > ++ +E+TC ICL+ + P++ +CGHSFC C+ +E > > > SCP 
C + + > > > > Sbjct: 9 > > > NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH > > > > 64 > > > > > > > > Query: 68 > > > LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS > > > > 122 > > > > + N L + ++++ Q Q D +G+ K ++ > > > G ++ > > > > Sbjct: 65 > > > FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA > > > > 117 > > > > > > > > Query: 123 > > > QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER > > > > 176 > > > > Q H ++ +E + +EKLQ L L +EA EK > > > + QE T K + > > > > Sbjct: 118 > > > QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK > > > > 175 > > > > > > > > Query: 177 > > > QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR > > > > 236 > > > > Q ++ + +M L EEQ +LQ LE+ E + L +L QQ + > > > S +I+ ++ > > > > Sbjct: 176 > > > QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS > > > > 235 > > > > > > > > Query: 237 > > > LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW > > > > 296 > > > > S+ E L++V ++RSE + > ++GM ++L++ + > > > > Sbjct: 236 > > > SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------ > > > > 285 > > > > > > > > Query: 297 > > > VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV > > > > 355 > > > > ++ L+P +A + + +S D + VK + NP F + V G > > > Q F+SG++YWEV > > > > Sbjct: 286 > > > TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV > > > > 345 > > > > > > > > Query: 356 > DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405 > > > > +V K W +G+ S + P +S + + Y WV > > > > Sbjct: 346 > EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391 > > > > > > > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > > Posted date: Jun 15, 2006 8:56 PM > > > > Number of letters in database: 23,910,368 > > > > Number of sequences in database: 48,851 > > > > > > > > Lambda K H > > > > 0.319 0.133 0.398 > > > > > > > > Gapped > > > > Lambda K H > > > > 0.267 0.0410 0.140 > > > > > > > > > > > > Matrix: BLOSUM62 > > > > Gap Penalties: Existence: 11, Extension: 
1 > > > > Number of Hits to DB: 20,900,506 > > > > Number of Sequences: 48851 > > > > Number of extensions: 899179 > > > > Number of successful extensions: 6075 > > > > Number of sequences better than 1.0e-25: 105 > > > > Number of HSP's better than 0.0 without gapping: 18 > > > > Number of HSP's successfully gapped in prelim test: 87 > > > > Number of HSP's that attempted gapping in prelim test: 5632 > > > > Number of HSP's gapped (non-prelim): 157 > > > > length of query: 442 > > > > length of database: 23,910,368 > > > > effective HSP length: 107 > > > > effective length of query: 335 > > > > effective length of database: 18,683,311 > > > > effective search space: 6258909185 > > > > effective search space used: 6258909185 > > > > T: 11 > > > > A: 40 > > > > X1: 16 ( 7.4 bits) > > > > X2: 38 (14.6 bits) > > > > X3: 64 (24.7 bits) > > > > S1: 41 (21.8 bits) > > > > S2: 289 (115.9 bits) > > > > > > > > *********** END BLAST REPORT FOR TEST CASE *********** > > > > > > > > Any ideas? > > > > > > > > Thanks, > > > > > > > > Bob > > > > > > > > > > > > _______________________________________________ > > > > Bioperl-l mailing list > > > > Bioperl-l at lists.open-bio.org > > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From bix at sendu.me.uk Thu Aug 17 16:07:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 17 Aug 2006 21:07:10 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <000701c6c22e$b08fb510$15327e82@pyrimidine> References: <000701c6c22e$b08fb510$15327e82@pyrimidine> Message-ID: <44E4CC6E.3080300@sendu.me.uk> Chris Fields wrote: > I don't feel a pressing nature to keep it, but others will likely disagree. 
>
> How hard would it be to implement hit(), query(), and other methods in HSPI
> to return Bio::SeqFeature::Similarity objects directly instead of just
> inheriting the methods? In other words, only build and return the objects
> when the user calls hit() or query()?

I can make HSPI inherit only from Bio::Root::Root and then add the
following methods to GenericHSP (with no other changes):

sub score {
    my $self = shift;
    if (@_) { $self->{_score} = shift; }
    return $self->{_score};
}

sub bits {
    my $self = shift;
    if (@_) { $self->{_bits} = shift; }
    return $self->{_bits};
}

sub query {
    my $self = shift;
    if (@_) { $self->{_query} = shift; }
    return $self->{_query};
}

sub hit {
    my $self = shift;
    if (@_) { $self->{_hit} = shift; }
    return $self->{_hit};
}
*subject = \&hit;

The entire test suite is happy except for t/SimilarityPair (it tests
that a HSP is a SimilarityPair, which it obviously isn't anymore) and
t/WABA, since WABAHSP wants to $self->add_tag_value. Changing that to
$self->hit->add_tag_value solves the problem.

So, it's trivial to make the change, but of course the test suite
doesn't use HSPs in all the ways users out there may be using them.

The gain? We see a ~1.5x speed up on worst case scenario (see first post
to this thread), for just the changes given above.

From cjfields at uiuc.edu Thu Aug 17 16:17:03 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Thu, 17 Aug 2006 15:17:03 -0500
Subject: [Bioperl-l] Error parsing BLAST report
In-Reply-To: <71AE766382153B47AAB638DC83ED7F49014CEA17@pathexch1.wusm-path.wustl.edu>
Message-ID: <001701c6c23a$1a770fe0$15327e82@pyrimidine>

Robert,

I have taken this one up. This is similar to a previous bug that was
reported (bug 1986) which came from duplicate names in the hit table.
That one is still unresolved at this time.

Could you attach the BLAST report to an email to me (don't CC the group!)
or attach it to the bug report? I would like to get this one resolved.
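For anyone wanting to check a report for the duplicate-name condition mentioned above before parsing it, here is a rough sketch. It is plain Perl, does not use or describe any BioPerl internals, and assumes the one-line descriptions have already been pulled out of the "Sequences producing significant alignments" table; the sample lines are abbreviated from the test case in this thread.

```perl
use strict;
use warnings;

# Description lines as they would appear in the hit table
# (abbreviated from the test case quoted in this thread).
my @table = (
    'ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E...  120 3e-27',
    'ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 gene:E...  120 3e-27',
    'ENSP00000327738 pep:known-ccds chromosome:NCBI36:4:189297592:189...  115 8e-26',
);

# Count how often each hit name (the first whitespace-separated field) occurs.
my %seen;
$seen{ (split ' ', $_)[0] }++ for @table;

# Report every name that is listed more than once.
my @dups = sort grep { $seen{$_} > 1 } keys %seen;
print "duplicate hit name: $_ ($seen{$_} times)\n" for @dups;
```

If this prints anything for a report, the "no HSP data" warning for a later hit may be related to the duplicated entry and not to the hit named in the warning.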
Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Freimuth, Robert > Sent: Thursday, August 17, 2006 2:53 PM > To: Chris Fields; Brian Osborne; bioperl-l at lists.open-bio.org > Cc: deepak shingan > Subject: Re: [Bioperl-l] Error parsing BLAST report > > Hi, > > Thanks for the suggestion. I have entered it as bug 2081 > (http://bugzilla.open-bio.org/show_bug.cgi?id=2081) and uploaded both > the test code and BLAST report as attachments. > > Thanks for the help, > Bob > > > > > -----Original Message----- > > From: Chris Fields [mailto:cjfields at uiuc.edu] > > Sent: Thursday, August 17, 2006 10:58 AM > > To: Freimuth, Robert; 'Brian Osborne'; bioperl-l at lists.open-bio.org > > Subject: RE: [Bioperl-l] Error parsing BLAST report > > > > Robert, > > > > This sounds like a possible bug; the error you are getting is > > from BLAST > > 2.2.11 output, so it should work. The BLAST parsing errors > > fixed in CVS had > > to do with parsing BLAST 2.2.13 (and later) output. > > > > Could you add this as a bug to Bugzilla with your test script > > and test case > > data that generates the error? > > > > How to post the bug: > > > > http://www.bioperl.org/wiki/Bugs > > > > Bugzilla: > > > > http://bugzilla.open-bio.org/ > > > > Chris > > > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of Freimuth, Robert > > > Sent: Wednesday, August 16, 2006 11:42 PM > > > To: Brian Osborne; bioperl-l at lists.open-bio.org > > > Subject: Re: [Bioperl-l] Error parsing BLAST report > > > > > > Hi, > > > > > > Thank you for your reply. 
I downloaded bioperl-1.5.1 from > > > http://bioperl.org/DIST/ and installed it (which appeared > > successful), > > > but the one-liner: > > > > > > perl -MBio::Root::Version -e 'print > > $Bio::Root::Version::VERSION, "\n"' > > > > > > prints 1.5 (I expected 1.5.1). > > > > > > When I run the test case that I reported earlier, I get the > > following > > > output: > > > > > > -------------------- WARNING --------------------- > > > MSG: There is no HSP data for hit 'ENSP00000327738'. > > > You have called a method (Bio::Search::Hit::GenericHit::length_aln) > > > that requires HSP data and there was no HSP data for this hit, > > > most likely because it was absent from the BLAST report. > > > Note that by default, BLAST lists alignments for the first 250 hits, > > > but it lists descriptions for 500 hits. If this is the case, > > > and you care about these hits, you should re-run BLAST using the > > > -b option (or equivalent if not using blastall) to increase > > the number > > > of alignments. > > > > > > --------------------------------------------------- > > > Alignment length for ENSP00000327738 is - > > > Alignment length for ENSP00000350182 is 250 > > > Alignment length for ENSP00000327738 is 398 > > > > > > Could someone that is running 1.5.1 please verify the output of the > > > one-liner above (did I somehow get the wrong file from the > > ftp site?) > > > and try to reproduce the error with the test case? > > > > > > Thanks for the help. I'm stumped. > > > > > > Bob > > > > > > > -----Original Message----- > > > > From: Brian Osborne [mailto:osborne1 at optonline.net] > > > > Sent: Wednesday, August 16, 2006 6:21 PM > > > > To: Freimuth, Robert; bioperl-l at lists.open-bio.org > > > > Subject: Re: [Bioperl-l] Error parsing BLAST report > > > > > > > > Robert, > > > > > > > > The standard answer to a complaint about SearchIO these days > > > > is to upgrade > > > > to version 1.5.1 - what Bioperl version are you using? > > > > > > > > Brian O. 
> > > > > > > > > > > > On 8/16/06 3:56 PM, "Freimuth, Robert" > > > > wrote: > > > > > > > > > Hello, > > > > > > > > > > I'm trying to parse a BLAST report using the following code: > > > > > > > > > > use warnings; > > > > > use strict; > > > > > > > > > > use Bio::SearchIO; > > > > > > > > > > my $file = 'NP_006065_blast.out'; > > > > > > > > > > my $searchio = new Bio::SearchIO( -format => 'blast', > > > > > -file => $file ); > > > > > > > > > > while( my $result = $searchio->next_result() ) > > > > > { > > > > > while( my $hit = $result->next_hit ) > > > > > { > > > > > my $hit_acc_num = $hit->accession(); > > > > > > > > > > # get the total length of the aligned region > > > > for query or > > > > > sbjct seq > > > > > # (includes all HSPs, calculated after tiling) > > > > > > > > > > my $align_len = $hit->length_aln( 'query' ); > > > > > > > > > > print "Alignment length for $hit_acc_num is > > > > $align_len\n"; > > > > > } > > > > > } > > > > > > > > > > There are 104 one-line descriptions in the report, and > > > > alignments for > > > > > each one of them (the blast report was created using > > > > > b_num_alignments_shown => 500 and > > v_num_descriptions_shown => 500). > > > > > However, when I run the above code I get 14 errors like the > > > > following: > > > > > > > > > > -------------------- WARNING --------------------- > > > > > MSG: There is no HSP data for hit 'ENSP00000327738'. > > > > > You have called a method > > (Bio::Search::Hit::GenericHit::length_aln) > > > > > that requires HSP data and there was no HSP data for this hit, > > > > > most likely because it was absent from the BLAST report. > > > > > Note that by default, BLAST lists alignments for the > > first 250 hits, > > > > > but it lists descriptions for 500 hits. If this is the case, > > > > > and you care about these hits, you should re-run BLAST using the > > > > > -b option (or equivalent if not using blastall) to increase > > > > the number > > > > > of alignments. 
> > > > > > > > > > --------------------------------------------------- > > > > > > > > > > There is an alignment for this (and the other 13 > > sequences) in the > > > > > report. In fact, if I edit the report and delete all but the > > > > > description and the alignment for ENSP00000327738, it > > > > parses fine (no > > > > > error). > > > > > > > > > > I continued editing the report and produced the following > > > > minimal test > > > > > case that reproduces the error. Note that the description for > > > > > ENSP00000350182 appears twice, BUT THE ERROR IS FOR > > ENSP00000327738. > > > > > > > > > > *********** BLAST REPORT FOR TEST CASE *********** > > > > > > > > > > BLASTP 2.2.11 [Jun-05-2005] > > > > > > > > > > > > > > > Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. > > > > > Schaffer, > > > > > Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. > > Lipman (1997), > > > > > "Gapped BLAST and PSI-BLAST: a new generation of protein > > > > database search > > > > > programs", Nucleic Acids Res. 25:3389-3402. > > > > > > > > > > Query= NP_006065 > > > > > (442 letters) > > > > > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > > > 48,851 sequences; 23,910,368 total letters > > > > > > > > > > Searching..................................................done > > > > > > > > > > > > > > Score > > > > > E > > > > > Sequences producing significant alignments: > > > > (bits) > > > > > Value > > > > > > > > > > ENSP00000350182 pep:novel > > clone::BX322644.8:4905:15090:-1 gene:E... > > > > > 120 3e-27 > > > > > ENSP00000350182 pep:novel > > clone::BX322644.8:4905:15090:-1 gene:E... > > > > > 120 3e-27 > > > > > ENSP00000327738 pep:known-ccds > > chromosome:NCBI36:4:189297592:189... 
> > > > > 115 8e-26 > > > > > > > > > >> ENSP00000350182 pep:novel clone::BX322644.8:4905:15090:-1 > > > > > gene:ENSG00000137397 > > > > > transcript:ENST00000357569 > > > > > Length = 425 > > > > > > > > > > Score = 120 bits (301), Expect = 3e-27 > > > > > Identities = 76/261 (29%), Positives = 140/261 (53%), > > Gaps = 21/261 > > > > > (8%) > > > > > > > > > > Query: 9 > > > > IEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGNL > > > > > 68 > > > > > +++EV CPICL++L +P+++DCGH+FC CIT +I E+ S G > > > > CP+C+T + + > > > > > Sbjct: 10 > > > > LQEEVICPICLDILQKPVTIDCGHNFCLKCIT-QIGET---SCGFFKCPLCKTSVRKNAI > > > > > 65 > > > > > > > > > > Query: 69 > > > > RPNRHLANIVERVKEVKMSP-QEGQKRDVCEHHGKKLQIFCKEDGKVICWVCELSQEHQG > > > > > 127 > > > > > R N L N+VE+++ ++ S Q +K C H + FC++DGK > > > > +C+VC S++H+ > > > > > Sbjct: 66 > > > > RFNSLLRNLVEKIQALQASEVQSKRKEATCPRHQEMFHYFCEDDGKFLCFVCCESKDHKS > > > > > 125 > > > > > > > > > > Query: 128 > > > > HQTFRINEVVKECQEKLQVALQRLIKEDQEAEKLED------DIRQERTAWKIERQKILK > > > > > 181 > > > > > H I E + Q ++Q +Q L ++++E +++ D+ ++ > > > > + E+Q+IL > > > > > Sbjct: 126 > > > > HNVSLIEEAAQNYQGQIQEQIQVLQQKEKETVQVKAQGVHRVDVFTDQV--EHEKQRILT > > > > > 183 > > > > > > > > > > Query: 182 > > > > GFNEMRVILDNEEQRELQKL----EEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRRL > > > > > 237 > > > > > F + +L+ E+ L ++ EG +A+ QL D > > > > L+ L+ + > > > > > Sbjct: 184 > > > > EFELLHQVLEEEKNFLLSRIYWLGHEGTEAGKHYVASTEPQL----NDLKKLVDSLKTKQ > > > > > 239 > > > > > > > > > > Query: 238 TGSSVEMLQDVIDVMKRSESW 258 > > > > > ++L+ + RSE + > > > > > Sbjct: 240 NMPPRQLLEVTQPHLPRSEEF 260 > > > > > > > > > > > > > > >> ENSP00000327738 pep:known-ccds > > > > > chromosome:NCBI36:4:189297592:189305643:1 > > > > > gene:ENSG00000184108 transcript:ENST00000332517 > > > > > CCDS3851.1 > > > > > Length = 468 > > > > > > > > > > Score = 115 bits (289), Expect = 8e-26 > > > > > Identities = 101/410 (24%), Positives = 180/410 (43%), > > > > Gaps = 39/410 > > > > > (9%) > > > > > > > > > > Query: 8 
> > > > DIEKEVTCPICLELLTEPLSLDCGHSFCQACITAKIKESVIISRGESSCPVCQTRFQPGN > > > > > 67 > > > > > ++ +E+TC ICL+ + P++ +CGHSFC C+ +E > > > > SCP C + + > > > > > Sbjct: 9 > > > > NLREELTCFICLDYFSSPVTTECGHSFCLVCLLRSWEE----HNTPLSCPECWRTLEGPH > > > > > 64 > > > > > > > > > > Query: 68 > > > > LRPNRHLANIVERVKEVKMSPQEGQKRDVCEHHGK-----KLQIFCKEDGKVICWVCELS > > > > > 122 > > > > > + N L + ++++ Q Q D +G+ K ++ > > > > G ++ > > > > > Sbjct: 65 > > > > FQSNERLGRLASIARQLR--SQVLQSEDEQGSYGRMPTTAKALSDDEQGGSAF-----VA > > > > > 117 > > > > > > > > > > Query: 123 > > > > QEHQGHQTFRINEVVKECQEKLQVALQRLIKEDQEA------EKLEDDIRQERTAWKIER > > > > > 176 > > > > > Q H ++ +E + +EKLQ L L +EA EK > > > > + QE T K + > > > > > Sbjct: 118 > > > > QSHGANRVHLSSEAEEHHREKLQEILNLLRVRRKEAQAVLTHEKERVKLCQEET--KTCK > > > > > 175 > > > > > > > > > > Query: 177 > > > > QKILKGFNEMRVILDNEEQRELQKLEEGEVNVLDNLAAATDQLVQQRQDASTLISDLQRR > > > > > 236 > > > > > Q ++ + +M L EEQ +LQ LE+ E + L +L QQ + > > > > S +I+ ++ > > > > > Sbjct: 176 > > > > QVVVSEYMKMHQFLKEEEQLQLQLLEQEEKENMRKLRNNEIKLTQQIRSLSKMIAQIESS > > > > > 235 > > > > > > > > > > Query: 237 > > > > LTGSSVEMLQDVIDVMKRSESWTXXXXXXXXXXXXXXFRVPDLSGMLQVLKELTDVQYYW > > > > > 296 > > > > > S+ E L++V ++RSE + > > ++GM ++L++ + > > > > > Sbjct: 236 > > > > SQSSAFESLEEVRGALERSE----PLLLQCPEATTTELSLCRITGMKEMLRKFS------ > > > > > 285 > > > > > > > > > > Query: 297 > > > > VDVMLNPGSATSNVAISVDQRQVKTVRTCTFKNSNPCDF-SAFGVFGCQYFSSGKYYWEV > > > > > 355 > > > > > ++ L+P +A + + +S D + VK + NP F + V G > > > > Q F+SG++YWEV > > > > > Sbjct: 286 > > > > TEITLDPATANAYLVLSEDLKSVKYGGSRQQLPDNPERFDQSATVLGTQIFTSGRHYWEV > > > > > 345 > > > > > > > > > > Query: 356 > > DVSGKIAWILGVHSKISSLNKRKSSGFAFDPSVNYSKVYSRYRPQYGYWV 405 > > > > > +V K W +G+ S + P +S + + Y WV > > > > > Sbjct: 346 > > EVGNKTEWEVGICKDSVS----RKGNLPKPPGDLFSLIGLKIGDDYSLWV 391 > > > > > > > > > > > > > > > Database: Homo_sapiens.NCBI36.apr.pep.fa > > > > > Posted date: Jun 15, 2006 8:56 PM > > > > > Number of letters in database: 23,910,368 > > 
> > > Number of sequences in database: 48,851 > > > > > > > > > > Lambda K H > > > > > 0.319 0.133 0.398 > > > > > > > > > > Gapped > > > > > Lambda K H > > > > > 0.267 0.0410 0.140 > > > > > > > > > > > > > > > Matrix: BLOSUM62 > > > > > Gap Penalties: Existence: 11, Extension: 1 > > > > > Number of Hits to DB: 20,900,506 > > > > > Number of Sequences: 48851 > > > > > Number of extensions: 899179 > > > > > Number of successful extensions: 6075 > > > > > Number of sequences better than 1.0e-25: 105 > > > > > Number of HSP's better than 0.0 without gapping: 18 > > > > > Number of HSP's successfully gapped in prelim test: 87 > > > > > Number of HSP's that attempted gapping in prelim test: 5632 > > > > > Number of HSP's gapped (non-prelim): 157 > > > > > length of query: 442 > > > > > length of database: 23,910,368 > > > > > effective HSP length: 107 > > > > > effective length of query: 335 > > > > > effective length of database: 18,683,311 > > > > > effective search space: 6258909185 > > > > > effective search space used: 6258909185 > > > > > T: 11 > > > > > A: 40 > > > > > X1: 16 ( 7.4 bits) > > > > > X2: 38 (14.6 bits) > > > > > X3: 64 (24.7 bits) > > > > > S1: 41 (21.8 bits) > > > > > S2: 289 (115.9 bits) > > > > > > > > > > *********** END BLAST REPORT FOR TEST CASE *********** > > > > > > > > > > Any ideas? 
> > > > > > > > > > Thanks, > > > > > > > > > > Bob > > > > > > > > > > > > > > > _______________________________________________ > > > > > Bioperl-l mailing list > > > > > Bioperl-l at lists.open-bio.org > > > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > > > > > > > > > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Aug 17 16:20:23 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 17 Aug 2006 16:20:23 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E4CC6E.3080300@sendu.me.uk> References: <000701c6c22e$b08fb510$15327e82@pyrimidine> <44E4CC6E.3080300@sendu.me.uk> Message-ID: <539B565C-8872-43BE-AA86-F33E781FD356@gmx.net> On Aug 17, 2006, at 4:07 PM, Sendu Bala wrote: > The entire test suite is happy except for t/SimilarityPair (it tests > that a HSP is a SimilarityPair, which it obviously isn't anymore) This is a sure sign that the change should not be made. There are numerous possible applications that may need to rely on this being a SeqFeature::SimilarityPair, for instance OR mappers. A change like this is pretty drastic. Making it in a well-traveled interface like SearchIO and friends would need to be justified by the previous design being clearly very poor or too limiting (or both), or by an at least equally drastic speed-up. Any speed-up would have to be more than an order of magnitude to count as drastic. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Aug 17 16:49:15 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 15:49:15 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <539B565C-8872-43BE-AA86-F33E781FD356@gmx.net> Message-ID: <001b01c6c23e$9aa67940$15327e82@pyrimidine> I have to agree. If there was a way to get around this by having the change behind the scenes in HSPI then I wouldn't see a problem. Hence my suggestion of implementing hit() and other SeqFeature::SimilarityPair methods directly in Bio::Search::HSP::HSPI (i.e. no SimilarityPair inheritance) to return Bio::SeqFeature::Similarity objects directly. Chris > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > Sent: Thursday, August 17, 2006 3:20 PM > To: Sendu Bala > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] SearchIO speed up > > > On Aug 17, 2006, at 4:07 PM, Sendu Bala wrote: > > > The entire test suite is happy except for t/SimilarityPair (it tests > > that a HSP is a SimilarityPair, which it obviously isn't anymore) > > This is a sure sign that the change should not be made. There are > numerous possible applications that may need to rely on this being a > SeqFeature::SimilarityPair, for instance OR mappers. > > A change like this is pretty drastic. Making it in a well-traveled > interface like SearchIO and friends would need to be justified by the > previous design being clearly very poor or too limiting (or both), or > by an at least equally drastic speed-up. Any speed-up would have to > be more than an order of magnitude to count as drastic. 
> > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu Aug 17 17:16:44 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 17 Aug 2006 17:16:44 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <001b01c6c23e$9aa67940$15327e82@pyrimidine> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> Message-ID: <133CE1D6-3773-44A0-BE64-62EF744BD327@gmx.net> You mean you maintain inheritance but re-implement the methods to override the inherited ones? On Aug 17, 2006, at 4:49 PM, Chris Fields wrote: > I have to agree. If there was a way to get around this by having > the change > behind the scenes in HSPI then I wouldn't see a problem. > > Hence my suggestion of implementing hit() and other > SeqFeature::SimilarityPair methods directly in > Bio::Search::HSP::HSPI (i.e. > no SimilarityPair inheritance) to return > Bio::SeqFeature::Similarity objects > directly. > > Chris > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp >> Sent: Thursday, August 17, 2006 3:20 PM >> To: Sendu Bala >> Cc: bioperl-l at bioperl.org >> Subject: Re: [Bioperl-l] SearchIO speed up >> >> >> On Aug 17, 2006, at 4:07 PM, Sendu Bala wrote: >> >>> The entire test suite is happy except for t/SimilarityPair (it tests >>> that a HSP is a SimilarityPair, which it obviously isn't anymore) >> >> This is a sure sign that the change should not be made. There are >> numerous possible applications that may need to rely on this being a >> SeqFeature::SimilarityPair, for instance OR mappers. >> >> A change like this is pretty drastic. 
Making it in a well-traveled >> interface like SearchIO and friends would need to be justified by the >> previous design being clearly very poor or too limiting (or both), or >> by an at least equally drastic speed-up. Any speed-up would have to >> be more than an order of magnitude to count as drastic. >> >> -hilmar >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Thu Aug 17 17:34:52 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 16:34:52 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <133CE1D6-3773-44A0-BE64-62EF744BD327@gmx.net> Message-ID: <000401c6c244$f9da5d40$15327e82@pyrimidine> I suppose you could do it that way. What I was thinking would be having hit() and query() methods directly in HSPI (no inheritance), so I guess you wouldn't really be 'implementing' them. They would return Bio::SeqFeature::Similarity objects directly. Though I don't really know if it's possible, and also don't really see the purpose here if the speedup is marginal. Chris > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Thursday, August 17, 2006 4:17 PM > To: Chris Fields > Cc: 'Sendu Bala'; bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] SearchIO speed up > > You mean you maintain inheritance but re-implement the methods to > override the inherited ones? > > On Aug 17, 2006, at 4:49 PM, Chris Fields wrote: > > > I have to agree. 
If there was a way to get around this by having > > the change > > behind the scenes in HSPI then I wouldn't see a problem. > > > > Hence my suggestion of implementing hit() and other > > SeqFeature::SimilarityPair methods directly in > > Bio::Search::HSP::HSPI (i.e. > > no SimilarityPair inheritance) to return > > Bio::SeqFeature::Similarity objects > > directly. > > > > Chris > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > >> bounces at lists.open-bio.org] On Behalf Of Hilmar Lapp > >> Sent: Thursday, August 17, 2006 3:20 PM > >> To: Sendu Bala > >> Cc: bioperl-l at bioperl.org > >> Subject: Re: [Bioperl-l] SearchIO speed up > >> > >> > >> On Aug 17, 2006, at 4:07 PM, Sendu Bala wrote: > >> > >>> The entire test suite is happy except for t/SimilarityPair (it tests > >>> that a HSP is a SimilarityPair, which it obviously isn't anymore) > >> > >> This is a sure sign that the change should not be made. There are > >> numerous possible applications that may need to rely on this being a > >> SeqFeature::SimilarityPair, for instance OR mappers. > >> > >> A change like this is pretty drastic. Making it in a well-traveled > >> interface like SearchIO and friends would need to be justified by the > >> previous design being clearly very poor or too limiting (or both), or > >> by an at least equally drastic speed-up. Any speed-up would have to > >> be more than an order of magnitude to count as drastic. 
> >> > >> -hilmar > >> > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From bix at sendu.me.uk Thu Aug 17 17:53:52 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 17 Aug 2006 22:53:52 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <001b01c6c23e$9aa67940$15327e82@pyrimidine> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> Message-ID: <44E4E570.2040503@sendu.me.uk> Chris Fields wrote: > I have to agree. If there was a way to get around this by having the change > behind the scenes in HSPI then I wouldn't see a problem. > > Hence my suggestion of implementing hit() and other > SeqFeature::SimilarityPair methods directly in Bio::Search::HSP::HSPI (i.e. > no SimilarityPair inheritance) to return Bio::SeqFeature::Similarity objects > directly. That is exactly what I did (on your suggestion). The problem that Hilmar points out is that HSPI should continue being a SimilarityPair in case anything checks that it is a SimilarityPair. Would there be any problem with leaving HSPI as a SimilarityPair and having GenericHSP::new as: sub new { my($class, @args) = @_; my $self = $class->Bio::Root::Root::new(@args); #...
# one change I forgot to mention before: the Similarity objects # created for query() and hit() can no longer have # '-primary' => $self->primary_tag set unless we also override # primary_tag, but I've no idea what primary_tag is supposed to do } # overridden methods, as before This gives a 1.43x speedup. (Simply overriding methods gives only a 1.14x speedup.) From bix at sendu.me.uk Thu Aug 17 18:05:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 17 Aug 2006 23:05:10 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <000401c6c244$f9da5d40$15327e82@pyrimidine> References: <000401c6c244$f9da5d40$15327e82@pyrimidine> Message-ID: <44E4E816.1050002@sendu.me.uk> Chris Fields wrote: > I suppose you could do it that way. What I was thinking would be having > hit() and query() methods directly in HSPI (no inheritance) Hilmar already suggested HSPIs would be best left as SimilarityPairs, so inheritance is required. > They would return Bio::SeqFeature::Similarity objects directly. Though I > don't really know if it's possible, and also don't really see the purpose > here if the speedup is marginal. It's the rate-limiting step after all my other speedups (not considered in the present discussion). And if we can have a 50% speedup for 'free', why would we say no? My changes in experimental branch already do something similar: new() calls SUPER but just doesn't pass it any args. That's a little slower than this (calling Bio::Root::Root::new directly), and you miss out on passing -verbose et al. From hlapp at gmx.net Thu Aug 17 18:15:32 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 17 Aug 2006 18:15:32 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E4E570.2040503@sendu.me.uk> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> Message-ID: I wouldn't do any of this. It is at best unexpected code with expected behavior, and from there it only gets worse. 
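
A minimal sketch of the constructor idea discussed in this thread: keep the inheritance, but call the root constructor by its fully qualified name so that the intermediate constructor is skipped. The packages My::Root, My::SimilarityPair and My::HSP are hypothetical stand-ins, not the real BioPerl classes, and the counter only marks where the skipped set-up work would happen.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-ins for illustration only; NOT the real BioPerl classes.
package My::Root;
sub new {
    my ($class, @args) = @_;
    my %opts = @args;
    return bless { verbose => $opts{-verbose} || 0 }, $class;
}

package My::SimilarityPair;
our @ISA = ('My::Root');
our $expensive_calls = 0;
sub new {
    my ($class, @args) = @_;
    my $self = $class->SUPER::new(@args);
    $expensive_calls++;    # stands in for costly sub-object set-up
    return $self;
}

package My::HSP;
our @ISA = ('My::SimilarityPair');    # inheritance is kept
sub new {
    my ($class, @args) = @_;
    # Fully qualified method name: lookup starts at My::Root::new,
    # so My::SimilarityPair::new is never run.
    my $self = $class->My::Root::new(@args);
    return $self;
}

package main;
my $hsp = My::HSP->new(-verbose => 1);
print "isa SimilarityPair: ",
    ($hsp->isa('My::SimilarityPair') ? "yes" : "no"), "\n";
print "expensive set-up calls: $My::SimilarityPair::expensive_calls\n";
```

Because @ISA is unchanged, isa() checks still pass; the trade-off raised in the thread is that anything the skipped constructor would have initialised is no longer set up.
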
I don't see why the loss of standard constructor implementation and behavior is worth a speed-up of less than several fold. I can't imagine that any of the Bioperl-associated speed problems will go away by speeding it up 1.4-fold. I appreciate your work and I appreciate you caring about bioperl being as useful as possible by meeting people's requirements of speedy performance. I also do think though that areas holding the best cost-to-benefit ratio are going to be those places where speed-ups of 5x or 10x or more can be achieved without drastic API or inheritance changes. I would not be surprised if there isn't a great number of such places all over Bioperl; this is not the first time people tried to (and succeeded to) improve speed. Drastic approaches I think will not work though if applied piece-meal; rather, drastic speed improvements are likely to require drastic architecture changes, which I believe cannot be done really well if they are always constrained by backwards compatibility. I.e., if you talk about drastic architecture changes you are no longer talking about Bioperl as we know it ("1.x"). A few years ago several bright young people wanted to get together to build what at the time was dubbed Bioperl 2.0 ... Unfortunately, we all get older and we move on in our lives ... As a result those people are now scattered and I doubt will ever take this on. I.e., Bioperl 2.0 will need a new crop to pick up the challenge. -hilmar On Aug 17, 2006, at 5:53 PM, Sendu Bala wrote: > Chris Fields wrote: >> I have to agree. If there was a way to get around this by having >> the change >> behind the scenes in HSPI then I wouldn't see a problem. >> >> Hence my suggestion of implementing hit() and other >> SeqFeature::SimilarityPair methods directly in >> Bio::Search::HSP::HSPI (i.e. >> no SimilarityPair inheritance) to return >> Bio::SeqFeature::Similarity objects >> directly. > > That is exactly what I did (on your suggestion).
The problem that > Hilmar > points out is that HSPI should continue being a SimilarityPair in case > anything checks that it is a SimilarityPair. > > Would there be any problem with leaving HSPI as a SimilarityPair and > having GenericHSP::new as: > > sub new { > my($class, @args) = @_; > my $self = $class->Bio::Root::Root::new(@args); > #... > # one change I forgot to mention before: the Similarity objects > # created for query() and hit() can no longer have > # '-primary' => $self->primary_tag set unless we also override > # primary_tag, but I've no idea what primary_tag is supposed > to do > } > > # overridden methods, as before > > This gives a 1.43x speedup. (Simply overriding methods gives only a > 1.14x speedup.) > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mblanche at berkeley.edu Thu Aug 17 20:25:24 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Thu, 17 Aug 2006 17:25:24 -0700 Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF In-Reply-To: <6dce9a0b0608171532j77da0146x95d5023200801cb5@mail.gmail.com> Message-ID: Lincoln, I can't seem to find how to fetch genes by their FBgn id (which seems to now be the official GadFly unique identifier for genes...). The get_feature_by_name method uses the real gene name, not the FBgn id... I can see that the FBgn ids are stored in the attributes table in the MySQL database with an attribute id of 7. I tried using "get_features_by_attribute({parent_id => $FBgn})" without success. What is the trick? Also, there is a typo in the Bio::DB::SeqFeature::Store documentation. ... # ...by type @features = $db->get_features_by_name('gene'); ... Should read ...
# ...by type @features = $db->get_features_by_type('gene'); ... Many thanks Marco On 8/17/06 15:32, "Lincoln Stein" wrote: > Let me know how it works. > > I also get a few of the warnings about the ortho:* features. They don't seem > to hurt anything so you can go ahead and use fast loading if you want. The > long-term fix is to sort the GFF3 files so that all features that share the > same ID occur next to each other. > > Lincoln > > On 8/17/06, Marco Blanchette wrote: >> I will answer my own question... >> >> Yes, one can load the fasta file after having loaded the gff file by doing: >> >> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta >> >> Marco >> >> >> On 8/17/06 11:20, "Marco Blanchette" < mblanche at berkeley.edu> wrote: >> >>> > Lincoln, thanks for the precision. I just could not find any references to >>> > how to load the DNA (no where in bp_seqfeature_load.pl or in the >>> > Bio::DB::SeqFeature::Store it says how load the DNA sequences). >>> > >>> > So right now the gff files were loaded in mysql using: >>> > /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff >>> > >>> > I tried the --fast options but got a bunch of warning (see below). >>> > >>> > The DNA file (a single fasta database file containing all chromosome >>> > sequences) was in a different location from the gff files and was not >>> loaded >>> > together with the gff files (the sequence table is empty in the database). >>> > >>> > Can I load the DNA sequence after the gff files were loaded? >>> > >>> > Many thanks >>> > >>> > Marco >>> > >>> > >>> > -------------------- WARNING --------------------- >>> > MSG: ID=ortho:2825 has been used more than once, but it cannot be found in >>> > the database. >>> > This can happen if you have specified fast loading, but features sharing >>> the >>> > same ID >>> > are not contiguous in the GFF file. This will be loaded as a separate >>> > feature. >>> > Line 483681: "X . orthologous_region 19477824 19478027 >>> > . + . 
>>> > ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" >>> > >>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature >>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 >>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load >>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 >>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh >>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 >>> > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load >>> > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 >>> > STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 >>> > >>> > >>> > On 8/17/06 10:27, "Lincoln Stein" wrote: >>> > >>>> >> Hi, >>>> >> >>>> >> This message bounced because I tried to send it from my gmail account >>>> and so >>>> >> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it >>>> >> finds a file that contains DNA data, it simply loads it. There is no >>>> special >>>> >> command line switch. Also you can include the DNA in the GFF3 file. >>>> >> >>>> >> Lincoln >>>> >> >>>> >> ---------- Forwarded message ---------- >>>> >> From: Lincoln Stein >>>> >> Date: Aug 17, 2006 12:26 PM >>>> >> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF >>>> >> To: Chris Fields >>>> >> Cc: Marco Blanchette < mblanche at berkeley.edu >>>> >, "bioperl-l at lists.open-bio.org" >>>> >> , cain.cshl at gmail.com >>>> >> >>>> >> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to >>>> >> load the GFF3-format Fly data? I think you're probably getting confused >>>> by >>>> >> overlapping mRNA splice forms, an issue that won't occur with the full >>>> >> GFF3-formatted data. >>>> >> >>>> >> >>>> >> On 8/13/06, Chris Fields >>>> > wrote: >>>>> >>> >>>>> >>> Marco, >>>>> >>> >>>>> >>> Did you figure out what the problem was? I was curious; the issue >>>>> >>> you were having was rather odd. 
I wanted to see if it was an issue >>>>> >>> with the GFF data or with the database itself. >>>>> >>> >>>>> >>> Chris >>>>> >>> >>>>> >>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: >>>>> >>> >>>>>> >>>> Chris, >>>>>> >>>> >>>>>>> >>>>> Do you mean you get duplicates of sequences back, or that you get >>>>>>> >>>>> more than >>>>>>> >>>>> one chunk of the same sequence back? >>>>>> >>>> >>>>>> >>>> I sometimes get duplicated sequences and sometimes overlapping >>>>>> >>>> regions (see >>>>>> >>>> below) >>>>>> >>>> >>>>>>> >>>>> >>>>>>> >>>>> Is it possible that each query using an ID could contain more than >>>>>>> >>>>> one >>>>>>> >>>>> feature? That might explain it (you could check by testing the >>>>>>> >>>>> size of the >>>>>>> >>>>> array @feats). >>>>>> >>>> Most ids return more than one feature, of various types >>>>>> >>>> (point_mutation, >>>>>> >>>> insertion_site, processed_transcript, etc...). That's why I >>>>>> >>>> restrict the >>>>>> >>>> output to type "gene" using regexp /gene/ on $f->type. >>>>>> >>>> >>>>>>> >>>>> >>>>>>> >>>>> I'm not sure how split locations are handled within Bio::DB::GFF, >>>>>>> >>>>> but do the >>>>>>> >>>>> specific features have split locations?
>>>>>>> >>>>> >>>>>>> >>>>> Chris >>>>>>> >>>>> >>>>>> >>>> Not sure what you mean exactly but have a look at the following >>>>>> >>>> script, it >>>>>> >>>> gives the location and the group id of the feature being reported: >>>>>> >>>> >>>>>> >>>> use Bio::DB::GFF; >>>>>> >>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>> >>>> -dsn => >>>>>> >>>> 'dbi:mysql:database=dmel_43_new'); >>>>>> >>>> my %dups; >>>>>> >>>> while (<>){ >>>>>> >>>> chomp; >>>>>> >>>> my $id = $_; >>>>>> >>>> my @feat = $db->get_feature_by_name($id); >>>>>> >>>> >>>>>> >>>> for my $f (@feat){ >>>>>> >>>> if (exists $dups{$f->group} && $f->type =~/gene/){ >>>>>> >>>> print "Calling >>>", $f->group, "\n"; >>>>>> >>>> print "Chr: ", $f->refseq, >>>>>> >>>> " Strand: ", $f->strand, >>>>>> >>>> " Start: ", $f->start, >>>>>> >>>> " End: ", $f->end, >>>>>> >>>> "\n"; >>>>>> >>>> print "Offending >>>", $dups{$f->group}->group, "\n"; >>>>>> >>>> print "Chr: ", $dups{$f->group}->refseq, >>>>>> >>>> " Strand: ", $dups{$f->group}->strand, >>>>>> >>>> " Start: ", $dups{$f->group}->start, >>>>>> >>>> " End: ", $dups{$f->group}->end; >>>>>> >>>> print "\n\n"; >>>>>> >>>> } else { >>>>>> >>>> $dups{$f->group} = $f; >>>>>> >>>> } >>>>>> >>>> } >>>>>> >>>> } >>>>>> >>>> >>>>>> >>>> Here is the output: >>>>>> >>>> Calling >>>FBgn0004179 >>>>>> >>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 >>>>>> >>>> Offending >>>FBgn0004179 >>>>>> >>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0025681 >>>>>> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>> >>>> Offending >>>FBgn0025681 >>>>>> >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0025803 >>>>>> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>> >>>> Offending >>>FBgn0025803 >>>>>> >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0000117 >>>>>> >>>> Chr: X Strand: -1 Start: 1756796 End: 1747557 
>>>>>> >>>> Offending >>>FBgn0000117 >>>>>> >>>> Chr: X Strand: -1 Start: 1757776 End: 1747182 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0005427 >>>>>> >>>> Chr: X Strand: -1 Start: 136456 End: 125343 >>>>>> >>>> Offending >>>FBgn0005427 >>>>>> >>>> Chr: X Strand: -1 Start: 133199 End: 124949 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0000042 >>>>>> >>>> Chr: X Strand: 1 Start: 5746100 End: 5750026 >>>>>> >>>> Offending >>>FBgn0000042 >>>>>> >>>> Chr: X Strand: 1 Start: 5746096 End: 5746106 >>>>>> >>>> >>>>>> >>>> Calling >>>FBgn0004551 >>>>>> >>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 >>>>>> >>>> Offending >>>FBgn0004551 >>>>>> >>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 >>>>>> >>>> >>>>>> >>>> Do you have any suggestions?? Is the procedure I am using to >>>>>> >>>> retrieve the >>>>>> >>>> genes right? >>>>>> >>>> >>>>>> >>>> Many thanks >>>>>> >>>> >>>>>> >>>> Marco >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>>>> >>>>>> Many thanks Scott, >>>>>>>> >>>>>> >>>>>>>> >>>>>> At the same time I got your email I was coming to the same >>>>>>>> >>>>>> conclusion as >>>>>>>> >>>>>> you. >>>>>>>> >>>>>> >>>>>>>> >>>>>> Now I have a stranger problem in my hands... My goal is quite >>>>>>>> >>>>>> simple, I >>>>>>>> >>>>>> try >>>>>>>> >>>>>> to get the sequence of the genes back from the Bio::DB::GFF >>>>>>>> database >>>>>>>> >>>>>> loaded >>>>>>>> >>>>>> on MySQL. The gene list is from a file with one gene id per per >>>>>>>> >>>>>> line. 
When >>>>>>>> >>>>>> I >>>>>>>> >>>>>> run the following script: >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> use Bio::DB::GFF; >>>>>>>> >>>>>> use Bio::SeqIO; >>>>>>>> >>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>>>>>>> >>>>>> -format => 'fasta'); >>>>>>>> >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>>>> >>>>>> -dsn => >>>>>>>> >>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>>>> >>>>>> >>>>>>>> >>>>>> while (<>){ >>>>>>>> >>>>>> chomp; >>>>>>>> >>>>>> my $id = $_; >>>>>>>> >>>>>> my @feats = $db->get_feature_by_name($id); >>>>>>>> >>>>>> for my $f (@feats){ >>>>>>>> >>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>>>>>>> >>>>>> } >>>>>>>> >>>>>> } >>>>>>>> >>>>>> >>>>>>>> >>>>>> >>>>>>>> >>>>>> I get more sequence back than the number of gene in my input >>>>>>>> file. I >>>>>>>> >>>>>> double >>>>>>>> >>>>>> check there. Some of the duplicated entries are the same, some >>>>>>>> >>>>>> are not! >>>>>>> >>>>> >>>>>>> >>>>> >>>>>>> >>>>> ... >>>>>>> >>>>> >>>>>>> >>>>> _______________________________________________ >>>>>>> >>>>> Bioperl-l mailing list >>>>>>> >>>>> Bioperl-l at lists.open-bio.org >>>>>>> >>>>>>> >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>> >>>> >>>>>> >>>> ______________________________ >>>>>> >>>> Marco Blanchette, Ph.D. >>>>>> >>>> >>>>>> >>>> mblanche at uclink.berkeley.edu >>>>>> >>>> >>>>>> >>>> Donald C. 
Rio's lab >>>>>> >>>> Department of Molecular and Cell Biology >>>>>> >>>> 16 Barker Hall >>>>>> >>>> University of California >>>>>> >>>> Berkeley, CA 94720-3204 >>>>>> >>>> >>>>>> >>>> Tel: (510) 642-1084 >>>>>> >>>> Cell: (510) 847-0996 >>>>>> >>>> Fax: (510) 642-6062 >>>>>> >>>> -- >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> _______________________________________________ >>>>>> >>>> Bioperl-l mailing list >>>>>> >>>> Bioperl-l at lists.open-bio.org >>>>>> >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>> >>>>> >>> Christopher Fields >>>>> >>> Postdoctoral Researcher >>>>> >>> Lab of Dr. Robert Switzer >>>>> >>> Dept of Biochemistry >>>>> >>> University of Illinois Urbana-Champaign >>>>> >>> >>>>> >>> >>>>> >>> >>>>> >>> _______________________________________________ >>>>> >>> Bioperl-l mailing list >>>>> >>> Bioperl-l at lists.open-bio.org >>>>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>> >>>> >> >>>> >> >>> > >>> > ______________________________ >>> > Marco Blanchette, Ph.D. >>> > >>> > mblanche at uclink.berkeley.edu >>> > >>> > Donald C. Rio's lab >>> > Department of Molecular and Cell Biology >>> > 16 Barker Hall >>> > University of California >>> > Berkeley, CA 94720-3204 >>> > >>> > Tel: (510) 642-1084 >>> > Cell: (510) 847-0996 >>> > Fax: (510) 642-6062 >> >> ______________________________ >> Marco Blanchette, Ph.D. >> >> mblanche at uclink.berkeley.edu >> >> Donald C. Rio's lab >> Department of Molecular and Cell Biology >> 16 Barker Hall >> University of California >> Berkeley, CA 94720-3204 >> >> Tel: (510) 642-1084 >> Cell: (510) 847-0996 >> Fax: (510) 642-6062 >> -- >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. 
Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From cjfields at uiuc.edu Thu Aug 17 22:12:52 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 21:12:52 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E4E570.2040503@sendu.me.uk> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> Message-ID: <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu> On Aug 17, 2006, at 4:53 PM, Sendu Bala wrote: > Chris Fields wrote: >> ... > > That is exactly what I did (on your suggestion). The problem that > Hilmar > points out is that HSPI should continue being a SimilarityPair in case > anything checks that it is a SimilarityPair. Okay, fine by me. It was merely a suggestion thrown out there. Seemed like you were banging your head against the wall trying to work this out. What I intended was something that wouldn't dramatically change what was returned from the methods (you would get SeqFeature::Similarity objects back). Hilmar has a point, though; if checks are performed to see if the HSP is-a SeqFeatureI then there will be problems (as the failed tests probably show). > Would there be any problem with leaving HSPI as a SimilarityPair and > having GenericHSP::new as: > ... > This gives a 1.43x speedup. (Simply overriding methods gives only a > 1.14x speedup.) I don't think it's worth that much effort really. There are other ways to go about this, such as your and Aaron's suggested pull parser, the hash-based approach, etc., which may be better. My concern is trying to maintain API in the current set of classes unless (as pointed out, again, by Hilmar) there is a tremendous advantage to making changes that break the current API. So far, sorry to say, it's debatable whether a 1.5-fold increase in speed along with even small API changes is worth all the effort you are putting into it. 
I don't think changing what's already present in the current SearchIO modules will accomplish much. That being said, the nice thing about SearchIO is that you could introduce new SearchIO::* modules using your own custom handler/ Search class combinations to work alongside the current ones; that way everybody has an option (use the old slow more OO ones vs. the new fast hash-based ones). There, they may choose to use a new API for the speed advantages. Make it easier for them to make the right choice i.e. Damian Conway's affordances. You may not even have to use a handler, and you could even build your own Search interface classes to tailor-fit your specific needs. There's a lot of freedom there, which can be a dangerous thing. Those SearchIO classes that get the most usage will likely eventually lead to deprecation of the ones infrequently used/maintained. This is the current idea of Lincoln's Bio::DB::SeqFeature, which I believe is intended to eventually replace Bio::DB::GFF. When everybody realizes that GFF3 works better with Bio::DB::SeqFeature, eventually Bio::DB::GFF likely will no longer be actively maintained and eventually deprecated. Remember, your SearchIO modifications do not have to be included in this release of BioPerl, so don't rush them to make a release. We could feasibly have 1-2 extra dev releases before v1.6, maybe more. Rushing to make a release was one of the initial problems with Bio::SeqFeatureI (I think) in the first 1.5 release. Please correct me if I'm wrong there, Hilmar. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Thu Aug 17 22:27:09 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Thu, 17 Aug 2006 21:27:09 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> Message-ID: <33BEEDF2-6AC3-4C11-B5C1-28BEC5A024AA@uiuc.edu> ... 
> I.e., if you talk about drastic architecture changes you are no
> longer talking about Bioperl as we know it ("1.x"). A few years ago
> several bright young people wanted to get together to build what at
> the time was dubbed Bioperl 2.0 ... Unfortunately, we all get older
> and we move on in our lives ... As a result those people are now
> scattered and I doubt will ever take this on. I.e., Bioperl 2.0 will
> need a new crop to pick up the challenge.

Yeah, I have thought of a few things I would like to change but can't. I like D. Conway's inside-out classes/objects for encapsulation but I can't see using those in BioPerl w/o major architecture changes, as pretty much every class is hash-based (Bio::Root::Root).

With Perl6 not too far off now, we could start thinking about what works well and what doesn't. Then when (if?) the eventual Perl5->Perl6 changeover begins in BioPerl, try implementing new ideas.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From mblanche at berkeley.edu Thu Aug 17 23:15:14 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Thu, 17 Aug 2006 20:15:14 -0700
Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF
In-Reply-To:
Message-ID:

Again, I will answer my own questions...

The get_features_by_alias() method will use the FBgn ids to retrieve a given feature, as in:

use Bio::DB::SeqFeature::Store;

my @ids = ('FBgn0026620', 'FBgn0010772', 'FBgn0025879');
my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
                                          -dsn     => 'dbi:mysql:dmel_43_SeqF');
for my $id (@ids){
    # look each feature up by its FBgn id (stored as an alias)
    my @feats = $db->get_features_by_alias($id);
    for my $f (@feats){
        print $f->name, "\n";
    }
}

Sorry for the spam...

On 8/17/06 17:25, "Marco Blanchette" wrote:

> Lincoln,
>
> I can't seem to find how to fetch genes by their FBgn id (which seems to now
> be the official GadFly unique identifier for genes...).
The > get_feature_by_name method uses the real gene name not the FBgn id... I can > see that the FBgn ids are store in the attributes table in the mySQL > database with an attribute id of 7. I tried using > ?get_features_by_attribute({parent_id => $FBgn})? without success. > > What is the trick?? > > Also, there is a typo in the Bio::DB::SeqFeature::Store documentation. > > ... > # ...by type > @features = $db->get_features_by_name('gene'); > ... > > Should read > ... > # ...by type > @features = $db->get_features_by_type('gene'); > ... > > Many thanks > > Marco > > > On 8/17/06 15:32, "Lincoln Stein" wrote: > >> Let me know how it works. >> >> I also get a few of the warnings about the ortho:* features. They don't seem >> to hurt anything so you can go ahead and use fast loading if you want. The >> long-term fix is to sort the GFF3 files so that all features that share the >> same ID occur next to each other. >> >> Lincoln >> >> On 8/17/06, Marco Blanchette wrote: >>> I will answer my own question... >>> >>> Yes, one can load the fasta file after having loaded the gff file by doing: >>> >>> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta >>> >>> Marco >>> >>> >>> On 8/17/06 11:20, "Marco Blanchette" < mblanche at berkeley.edu> wrote: >>> >>>>> Lincoln, thanks for the precision. I just could not find any references to >>>>> how to load the DNA (no where in bp_seqfeature_load.pl or in the >>>>> Bio::DB::SeqFeature::Store it says how load the DNA sequences). >>>>> >>>>> So right now the gff files were loaded in mysql using: >>>>> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff >>>>> >>>>> I tried the --fast options but got a bunch of warning (see below). >>>>> >>>>> The DNA file (a single fasta database file containing all chromosome >>>>> sequences) was in a different location from the gff files and was not >>>> loaded >>>>> together with the gff files (the sequence table is empty in the database). 
>>>>> >>>>> Can I load the DNA sequence after the gff files were loaded? >>>>> >>>>> Many thanks >>>>> >>>>> Marco >>>>> >>>>> >>>>> -------------------- WARNING --------------------- >>>>> MSG: ID=ortho:2825 has been used more than once, but it cannot be found in >>>>> the database. >>>>> This can happen if you have specified fast loading, but features sharing >>>> the >>>>> same ID >>>>> are not contiguous in the GFF file. This will be loaded as a separate >>>>> feature. >>>>> Line 483681: "X . orthologous_region 19477824 19478027 >>>>> . + . >>>>> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" >>>>> >>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature >>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 >>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load >>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 >>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh >>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 >>>>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load >>>>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 >>>>> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 >>>>> >>>>> >>>>> On 8/17/06 10:27, "Lincoln Stein" wrote: >>>>> >>>>>>> Hi, >>>>>>> >>>>>>> This message bounced because I tried to send it from my gmail account >>>>> and so >>>>>>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it >>>>>>> finds a file that contains DNA data, it simply loads it. There is no >>>>> special >>>>>>> command line switch. Also you can include the DNA in the GFF3 file. 
>>>>>>> >>>>>>> Lincoln >>>>>>> >>>>>>> ---------- Forwarded message ---------- >>>>>>> From: Lincoln Stein >>>>>>> Date: Aug 17, 2006 12:26 PM >>>>>>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF >>>>>>> To: Chris Fields >>>>>>> Cc: Marco Blanchette < mblanche at berkeley.edu >>>>> >, "bioperl-l at lists.open-bio.org" >>>>>>> , cain.cshl at gmail.com >>>>>>> >>>>>>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class to >>>>>>> load the GFF3-format Fly data? I think you're probably getting confused >>>>> by >>>>>>> overlapping mRNA splice forms, an issue that won't occur with the full >>>>>>> GFF3-formatted data. >>>>>>> >>>>>>> >>>>>>> On 8/13/06, Chris Fields >>>>>> wrote: >>>>>>>>> >>>>>>>>> Marco, >>>>>>>>> >>>>>>>>> Did you figure out what the problem was? I was curious; the issue >>>>>>>>> you were having was rather odd. I wanted to see if it was an issue >>>>>>>>> with the GFF data or with the database itself. >>>>>>>>> >>>>>>>>> Chris >>>>>>>>> >>>>>>>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: >>>>>>>>> >>>>>>>>>>> Chris, >>>>>>>>>>> >>>>>>>>>>>> Do you mean you get duplicates of sequences back, or that you get >>>>>>>>>>>> more than >>>>>>>>>>>> one chunk of the same sequence back? >>>>>>>>>>> >>>>>>>>>>> I sometimes get duplicated sequences and sometimes overlapping >>>>>>>>>>> regions (see >>>>>>>>>>> bellow) >>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> Is it possible that each query using an ID could contain more than >>>>>>>>>>>> one >>>>>>>>>>>> feature? That might explain it (you could check by testing the >>>>>>>>>>>> size of the >>>>>>>>>>>> array @feats). >>>>>>>>>>> Most id return more than one features from various type >>>>>>>>>>> ( point_mutation, >>>>>>>>>>> insertion_site, processed_transcript, etc...). That's why I >>>>>>>>>>> restirct the >>>>>>>>>>> output to type "gene" using regexp /gene/ on $f->type. 
>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I'm not sure how split locations are handled within Bio:DB::GFF, >>>>>>>>>>>> but do the >>>>>>>>>>>> specific features have split locations? >>>>>>>>>>>> >>>>>>>>>>>> Chris >>>>>>>>>>>> >>>>>>>>>>> Not sure what you mean exactly but have a look at the following >>>>>>>>>>> script, it >>>>>>>>>>> gives the location and the group id of the feature being reported: >>>>>>>>>>> >>>>>>>>>>> use Bio::DB::GFF; >>>>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>>>>>>> -dsn => >>>>>>>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>>>>>>> my %dups; >>>>>>>>>>> while (<>){ >>>>>>>>>>> chomp; >>>>>>>>>>> my $id = $_; >>>>>>>>>>> my @feat = $db->get_feature_by_name($id); >>>>>>>>>>> >>>>>>>>>>> for my $f (@feat){ >>>>>>>>>>> if (exists $dups{$f->group} && $f->type =~/gene/){ >>>>>>>>>>> print "Calling >>>", $f->group, "\n"; >>>>>>>>>>> print "Chr: ", $f->refseq, >>>>>>>>>>> " Strand: ", $f->strand, >>>>>>>>>>> " Start: ", $f->start, >>>>>>>>>>> " End: ", $f->end, >>>>>>>>>>> "\n"; >>>>>>>>>>> print "Offending >>>", $dups{$f->group}->group, "\n"; >>>>>>>>>>> print "Chr: ", $dups{$f->group}->refseq, >>>>>>>>>>> " Strand: ", $dups{$f->group}->strand, >>>>>>>>>>> " Start: ", $dups{$f->group}->start, >>>>>>>>>>> " End: ", $dups{$f->group}->end; >>>>>>>>>>> print "\n\n"; >>>>>>>>>>> } else { >>>>>>>>>>> $dups{$f->group} = $f; >>>>>>>>>>> } >>>>>>>>>>> } >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Here is the output: >>>>>>>>>>> Calling >>>FBgn0004179 >>>>>>>>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 >>>>>>>>>>> Offending >>>FBgn0004179 >>>>>>>>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0025681 >>>>>>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>>>>>>> Offending >>>FBgn0025681 >>>>>>>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0025803 >>>>>>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>>>>>>> Offending 
>>>FBgn0025803 >>>>>>>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0000117 >>>>>>>>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557 >>>>>>>>>>> Offending >>>FBgn0000117 >>>>>>>>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0005427 >>>>>>>>>>> Chr: X Strand: -1 Start: 136456 End: 125343 >>>>>>>>>>> Offending >>>FBgn0005427 >>>>>>>>>>> Chr: X Strand: -1 Start: 133199 End: 124949 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0000042 >>>>>>>>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026 >>>>>>>>>>> Offending >>>FBgn0000042 >>>>>>>>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106 >>>>>>>>>>> >>>>>>>>>>> Calling >>>FBgn0004551 >>>>>>>>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 >>>>>>>>>>> Offending >>>FBgn0004551 >>>>>>>>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 >>>>>>>>>>> >>>>>>>>>>> Do you have any suggestions?? Is the procedure I am using to >>>>>>>>>>> retrieve the >>>>>>>>>>> genes right? >>>>>>>>>>> >>>>>>>>>>> Many thanks >>>>>>>>>>> >>>>>>>>>>> Marco >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>>> Many thanks Scott, >>>>>>>>>>>> >>>>>>>>>>>> At the same time I got your email I was coming to the same >>>>>>>>>>>> conclusion as >>>>>>>>>>>> you. >>>>>>>>>>>> >>>>>>>>>>>> Now I have a stranger problem in my hands... My goal is quite >>>>>>>>>>>> simple, I >>>>>>>>>>>> try >>>>>>>>>>>> to get the sequence of the genes back from the Bio::DB::GFF >>>>>>>>> database >>>>>>>>>>>> loaded >>>>>>>>>>>> on MySQL. The gene list is from a file with one gene id per per >>>>>>>>>>>> line. 
When >>>>>>>>>>>> I >>>>>>>>>>>> run the following script: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> use Bio::DB::GFF; >>>>>>>>>>>> use Bio::SeqIO; >>>>>>>>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>>>>>>>>>>> -format => 'fasta'); >>>>>>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>>>>>>>> -dsn => >>>>>>>>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>>>>>>>> >>>>>>>>>>>> while (<>){ >>>>>>>>>>>> chomp; >>>>>>>>>>>> my $id = $_; >>>>>>>>>>>> my @feats = $db->get_feature_by_name($id); >>>>>>>>>>>> for my $f (@feats){ >>>>>>>>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>>>>>>>>>>> } >>>>>>>>>>>> } >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> I get more sequence back than the number of gene in my input >>>>>>>>> file. I >>>>>>>>>>>> double >>>>>>>>>>>> check there. Some of the duplicated entries are the same, some >>>>>>>>>>>> are not! >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> ... >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> Bioperl-l mailing list >>>>>>>>>>>> Bioperl-l at lists.open-bio.org >>>>>>>> >>>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>>>>>>> >>>>>>>>>>> ______________________________ >>>>>>>>>>> Marco Blanchette, Ph.D. >>>>>>>>>>> >>>>>>>>>>> mblanche at uclink.berkeley.edu >>>>>>>>>>> >>>>>>>>>>> Donald C. Rio's lab >>>>>>>>>>> Department of Molecular and Cell Biology >>>>>>>>>>> 16 Barker Hall >>>>>>>>>>> University of California >>>>>>>>>>> Berkeley, CA 94720-3204 >>>>>>>>>>> >>>>>>>>>>> Tel: (510) 642-1084 >>>>>>>>>>> Cell: (510) 847-0996 >>>>>>>>>>> Fax: (510) 642-6062 >>>>>>>>>>> -- >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> Bioperl-l mailing list >>>>>>>>>>> Bioperl-l at lists.open-bio.org >>>>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>>>>> >>>>>>>>> Christopher Fields >>>>>>>>> Postdoctoral Researcher >>>>>>>>> Lab of Dr. 
Robert Switzer >>>>>>>>> Dept of Biochemistry >>>>>>>>> University of Illinois Urbana-Champaign >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> Bioperl-l mailing list >>>>>>>>> Bioperl-l at lists.open-bio.org >>>>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>>>>> >>>>>>> >>>>>>> >>>>> >>>>> ______________________________ >>>>> Marco Blanchette, Ph.D. >>>>> >>>>> mblanche at uclink.berkeley.edu >>>>> >>>>> Donald C. Rio's lab >>>>> Department of Molecular and Cell Biology >>>>> 16 Barker Hall >>>>> University of California >>>>> Berkeley, CA 94720-3204 >>>>> >>>>> Tel: (510) 642-1084 >>>>> Cell: (510) 847-0996 >>>>> Fax: (510) 642-6062 >>> >>> ______________________________ >>> Marco Blanchette, Ph.D. >>> >>> mblanche at uclink.berkeley.edu >>> >>> Donald C. Rio's lab >>> Department of Molecular and Cell Biology >>> 16 Barker Hall >>> University of California >>> Berkeley, CA 94720-3204 >>> >>> Tel: (510) 642-1084 >>> Cell: (510) 847-0996 >>> Fax: (510) 642-6062 >>> -- >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. 
Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--

From bix at sendu.me.uk Fri Aug 18 02:50:47 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Fri, 18 Aug 2006 07:50:47 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To:
References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk>
Message-ID: <44E56347.8030408@sendu.me.uk>

Hilmar Lapp wrote:
> I wouldn't do any of this. It is at best unexpected code with
> expected behavior, and from there it only gets worse. I don't see why
> the loss of standard constructor implementation and behavior is worth
> a speed-up of less than several fold.
[...]
> I.e., if you talk about drastic architecture changes you are no
> longer talking about Bioperl as we know it ("1.x").

Well, we come back to my original question in this thread. When can we consider the priority list item resolved? Is it resolved when we've sped it up as much as possible without doing anything drastic ('resolved fixed')? Or is it only resolved when we've done everything we can think of to speed it up, even drastic things? In the latter case, do we just leave it 'verified' until some possible Bioperl 2.0, or do we instead just say 'resolved wontfix'?

From bix at sendu.me.uk Fri Aug 18 03:00:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Fri, 18 Aug 2006 08:00:57 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu>
References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu>
Message-ID: <44E565A9.3050008@sendu.me.uk>

Chris Fields wrote:
> On Aug 17, 2006, at 4:53 PM, Sendu Bala wrote:
>
> I don't think it's worth that much effort really.
There are other > ways to go about this, such as your and Aaron's suggested pull > parser, the hash-based approach, etc., which may be better. Doing multiple things gives a better final result. Every penny counts, so to speak. > So far, sorry to say, it's debatable whether a 1.5-fold increase in speed > along with even small API changes is worth all the effort you are > putting into it. To be fair, no API change is required, and it only took a few minutes to implement and try the idea out :) > That being said, the nice thing about SearchIO is that you could > introduce new SearchIO::* modules using your own custom handler/ > Search class combinations to work alongside the current ones; that > way everybody has an option (use the old slow more OO ones vs. the > new fast hash-based ones). There, they may choose to use a new API > for the speed advantages. Make it easier for them to make the right > choice i.e. Damian Conway's affordances. Even if you were making a new SearchIO module, I think you'd want to have it return HSPI objects for the hsps. Otherwise to what extent is it a bioperl or searchio module? To what extent will people be able to easily use the new module with existing code that expects a SearchIO to eventually provide HSPI objects? Maybe I'm wrong about that - is it reasonable to just come up with a whole new system for returning the results, and have users learn to use the new system? From avilella at gmail.com Fri Aug 18 07:02:57 2006 From: avilella at gmail.com (Albert Vilella) Date: Fri, 18 Aug 2006 12:02:57 +0100 Subject: [Bioperl-l] informative codons method for kaks --Bio/Align/DNAStatistics.pm In-Reply-To: <007701c6c21b$b08d65c0$15327e82@pyrimidine> References: <007701c6c21b$b08d65c0$15327e82@pyrimidine> Message-ID: <1155898977.15817.3.camel@localhost> Upon closer inspection, it turned out to be a one-liner! 

=head2 kaks_pattern_number

 Title   : kaks_pattern_number
 Usage   : my $patterns = $stats->kaks_pattern_number($alnobj);
 Function: Counts the number of codons with no gaps in the MSA
 Returns : Number of codons with no gaps ('patterns' in PAML notation)
 Args    : A Bio::Align::AlignI compliant object such as a
           Bio::SimpleAlign object.

=cut

sub kaks_pattern_number{
    my ($self, $aln) = @_;
    return ($aln->remove_gaps->length)/3;
}

How do you like the method name? Suggestions?

Albert.

On Thu, 2006-08-17 at 11:39 -0500, Chris Fields wrote:
> Albert,
>
> Might be a good idea to start working on those! Nice to use Bugzilla as a
> repository for ideas, but we're heading towards using the Bioperl wiki for
> that more now.
>
> Chris
>
> > -----Original Message-----
> > From: Albert Vilella [mailto:avilella at gmail.com]
> > Sent: Thursday, August 17, 2006 11:17 AM
> > To: Chris Fields
> > Cc: bioperl-l at bioperl.org
> > Subject: Re: [Bioperl-l] informative codons method for kaks --
> > Bio/Align/DNAStatistics.pm
> >
> > I will opt for the "return the number of informative codons" then, which
> > is the easiest :)
> >
> > I don't know when is this going to be there if it depends on me,
> > though... my list of 'enh' bug tickets is growing shamefully fast :p
> >
> > Cheers,
> >
> > Albert.
> >
> > On Thu, 2006-08-17 at 09:26 -0500, Chris Fields wrote:
> > > Sure, why not? If you (or someone) can add one in, I don't see how
> > > it could hurt.
> > >
> > > Make sure to add tests for this in the proper test suite.
> > >
> > > Chris
> > >
> > > On Aug 17, 2006, at 6:37 AM, Albert Vilella wrote:
> > >
> > > > Hi all,
> > > >
> > > > I think it would be nice to have a method in
> > > > Bio/Align/DNAStatistics.pm that gives the number of informative codons
> > > > for kaks in a MSA. That is, the codons that are used in the
> > > > calculation of kaks. This, AFAICS, more or less what codeml calls
> > > > "patterns".
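
To make the idea of "codons that have no gaps" concrete, here is a minimal standalone sketch in plain Perl. It is not the DNAStatistics method and does not use BioPerl at all: the function name count_gapfree_codons and the toy alignment are made up for illustration, and it assumes the rows are equal-length aligned strings that are in frame from the first column. It counts whole codons (triplets of columns) in which no row carries a gap character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): count codon positions,
# i.e. consecutive triplets of alignment columns, in which no row
# contains the gap character.
sub count_gapfree_codons {
    my ($gap_char, @rows) = @_;
    return 0 unless @rows;

    my $len = length $rows[0];
    for my $row (@rows) {
        die "aligned rows differ in length\n" unless length($row) == $len;
    }

    my $count = 0;
    # Walk over complete codons only; a trailing partial codon is ignored.
    for (my $pos = 0; $pos + 3 <= $len; $pos += 3) {
        my $has_gap = 0;
        for my $row (@rows) {
            if (index(substr($row, $pos, 3), $gap_char) >= 0) {
                $has_gap = 1;
                last;
            }
        }
        $count++ unless $has_gap;
    }
    return $count;
}

# Toy alignment, invented for this example: 18 columns, 6 codons.
my @aln = (
    'ATGAAA---CCCGGGTAA',
    'ATGA-ACCCCCCGGGTAA',
    'ATGAAACCCCCCGG-TAA',
);

print count_gapfree_codons('-', @aln), "\n";
```

In this toy alignment codons 1, 4 and 6 are gap-free in every row. Five of the 18 columns contain a gap, so an approach that drops gap-containing columns one by one and divides the remaining length by three would see 13 columns, which is not a whole number of codons. Whether that difference matters for a real alignment depends on how the gaps fall relative to the reading frame.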
> > > > > > > > I often find myself in the situation of wanting to know how big is the > > > > CDS alignment not in terms of sequence length, but of the number of > > > > codons that are going to be used in the kaks statistics. I guess this > > > > method would help in that. > > > > > > > > The method could: > > > > > > > > return the number of informative codons? > > > > maybe return a new seqarray with only the informative codons? > > > > > > > > What do you think? Jason? Chris? > > > > > > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2078 > > > > > > > > Bests, > > > > > > > > Albert. > > > > > > > > > > > > _______________________________________________ > > > > Bioperl-l mailing list > > > > Bioperl-l at lists.open-bio.org > > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > Christopher Fields > > > Postdoctoral Researcher > > > Lab of Dr. Robert Switzer > > > Dept of Biochemistry > > > University of Illinois Urbana-Champaign > > > > > > > > > > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at uiuc.edu Fri Aug 18 07:56:29 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 18 Aug 2006 06:56:29 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E565A9.3050008@sendu.me.uk> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu> <44E565A9.3050008@sendu.me.uk> Message-ID: <047918A2-9385-40CA-AE65-D336FDF7A17D@uiuc.edu> On Aug 18, 2006, at 2:00 AM, Sendu Bala wrote: > Chris Fields wrote: >> On Aug 17, 2006, at 4:53 PM, Sendu Bala wrote: >> >> I don't think it's worth that much effort really. There are other >> ways to go about this, such as your and Aaron's suggested pull >> parser, the hash-based approach, etc., which may be better. > > Doing multiple things gives a better final result. 
Every penny counts, > so to speak. If you have 4-5 fold increases w/o API changes, fine. But I don't think 1.5-fold is worth worrying about if it involves something fundamental about the class (inheritance). > > >> So far, sorry to say, it's debatable whether a 1.5-fold increase >> in speed >> along with even small API changes is worth all the effort you are >> putting into it. > > To be fair, no API change is required, and it only took a few > minutes to > implement and try the idea out :) Maybe I'm missing something here; didn't you say it failed tests somewhere? That's suggestive of API problems. > >> That being said, the nice thing about SearchIO is that you could >> introduce new SearchIO::* modules using your own custom handler/ >> Search class combinations to work alongside the current ones; that >> way everybody has an option (use the old slow more OO ones vs. the >> new fast hash-based ones). There, they may choose to use a new API >> for the speed advantages. Make it easier for them to make the right >> choice i.e. Damian Conway's affordances. > > Even if you were making a new SearchIO module, I think you'd want to > have it return HSPI objects for the hsps. Otherwise to what extent > is it > a bioperl or searchio module? To what extent will people be able to > easily use the new module with existing code that expects a > SearchIO to > eventually provide HSPI objects? > > Maybe I'm wrong about that - is it reasonable to just come up with a > whole new system for returning the results, and have users learn to > use > the new system? My point is, if you create something new that changes the API (i.e. create new Result/Hit/HSP interfaces, then implement them) but keep the old way around (ResultI/HitI/HSPI-based implementations), then the user can make the decision on what to use, not us. If the gain in using the new classes is substantial enough people will probably switch and get used to the new API/methods. If both are used, then both stay. 
If everybody wants to have SearchIO methods return only HSPI/HitI (no new
interface/API allowed), then create something new, basing it on SearchIO
but running it the way you want. That's how SearchIO came about in the
first place. I don't think that's unreasonable.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From cjfields at uiuc.edu Fri Aug 18 07:57:39 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 18 Aug 2006 06:57:39 -0500
Subject: [Bioperl-l] informative codons method for kaks --Bio/Align/DNAStatistics.pm
In-Reply-To: <1155898977.15817.3.camel@localhost>
References: <007701c6c21b$b08d65c0$15327e82@pyrimidine> <1155898977.15817.3.camel@localhost>
Message-ID: <9B8156B3-5C0B-4F94-B1E4-7E0518EEDF85@uiuc.edu>

Looks good to me. Just make sure to add tests!

Chris

On Aug 18, 2006, at 6:02 AM, Albert Vilella wrote:

> Upon closer inspection, it turned out to be a one-liner!
>
> =head2 kaks_pattern_number
>
>  Title   : kaks_pattern_number
>  Usage   : my $patterns = $stats->kaks_pattern_number($alnobj);
>  Function: Counts the number of codons with no gaps in the MSA
>  Returns : Number of codons with no gaps ('patterns' in PAML notation)
>  Args    : A Bio::Align::AlignI compliant object such as a
>            Bio::SimpleAlign object.
>
> =cut
>
> sub kaks_pattern_number{
>     my ($self, $aln) = @_;
>     return ($aln->remove_gaps->length)/3;
> }
>
> How do you like the method name? Suggestions?
>
>     Albert.
>
> On Thu, 2006-08-17 at 11:39 -0500, Chris Fields wrote:
>> Albert,
>>
>> Might be a good idea to start working on those! Nice to use
>> Bugzilla as a
>> repository for ideas, but we're heading towards using the Bioperl
>> wiki for
>> that more now.
>> >> Chris >> >>> -----Original Message----- >>> From: Albert Vilella [mailto:avilella at gmail.com] >>> Sent: Thursday, August 17, 2006 11:17 AM >>> To: Chris Fields >>> Cc: bioperl-l at bioperl.org >>> Subject: Re: [Bioperl-l] informative codons method for kaks -- >>> Bio/Align/DNAStatistics.pm >>> >>> I will opt for the "return the number of informative codons" >>> then, which >>> is the easiest :) >>> >>> I don't know when is this going to be there if it depends on me, >>> though... my list of 'enh' bug tickets is growing shamefully fast :p >>> >>> Cheers, >>> >>> Albert. >>> >>> On Thu, 2006-08-17 at 09:26 -0500, Chris Fields wrote: >>>> Sure, why not? If you (or someone) can add one in, I don't see how >>>> it could hurt. >>>> >>>> Make sure to add tests for this in the proper test suite. >>>> >>>> Chris >>>> >>>> On Aug 17, 2006, at 6:37 AM, Albert Vilella wrote: >>>> >>>>> Hi all, >>>>> >>>>> I think it would be nice to have a method in >>>>> Bio/Align/DNAStatistics.pm that gives the number of informative >>>>> codons >>>>> for kaks in a MSA. That is, the codons that are used in the >>>>> calculation of kaks. This, AFAICS, more or less what codeml calls >>>>> "patterns". >>>>> >>>>> I often find myself in the situation of wanting to know how big >>>>> is the >>>>> CDS alignment not in terms of sequence length, but of the >>>>> number of >>>>> codons that are going to be used in the kaks statistics. I >>>>> guess this >>>>> method would help in that. >>>>> >>>>> The method could: >>>>> >>>>> return the number of informative codons? >>>>> maybe return a new seqarray with only the informative codons? >>>>> >>>>> What do you think? Jason? Chris? >>>>> >>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2078 >>>>> >>>>> Bests, >>>>> >>>>> Albert. 
>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> Christopher Fields >>>> Postdoctoral Researcher >>>> Lab of Dr. Robert Switzer >>>> Dept of Biochemistry >>>> University of Illinois Urbana-Champaign >>>> >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From bix at sendu.me.uk Fri Aug 18 09:11:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Fri, 18 Aug 2006 14:11:10 +0100 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <047918A2-9385-40CA-AE65-D336FDF7A17D@uiuc.edu> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu> <44E565A9.3050008@sendu.me.uk> <047918A2-9385-40CA-AE65-D336FDF7A17D@uiuc.edu> Message-ID: <44E5BC6E.7030805@sendu.me.uk> Chris Fields wrote: > On Aug 18, 2006, at 2:00 AM, Sendu Bala wrote: > >>> So far, sorry to say, it's debatable whether a 1.5-fold increase >>> in speed along with even small API changes is worth all the >>> effort you are putting into it. > >> To be fair, no API change is required, and it only took a few >> minutes to implement and try the idea out :) > > Maybe I'm missing something here; didn't you say it failed tests > somewhere? That's suggestive of API problems. The alternate suggestion using my $self = $class->Bio::Root::Root::new(@args); doesn't cause any test failures because it doesn't involve any API change, only a harmless implementation change. 
Hilmar wasn't happy with that because of 'loss of standard constructor
implementation and behavior'. To be honest, the current implementation
of GenericHSP and SimilarityPair&ancestors is a bit of a messy kludge
with lots of wasted work, which is why we get the speed up in the first
place by going straight to Root.

My PullParser modules solve the problem in a much better way (read:
Hilmar would have no objections), but I was hoping for something that
would work for all existing SearchIO modules as well.

It doesn't matter in the end: it was easy to suggest it, and just as
easy to not use it if people are unhappy with it.

From hlapp at gmx.net Fri Aug 18 09:46:04 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 18 Aug 2006 09:46:04 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44E56347.8030408@sendu.me.uk>
References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> <44E56347.8030408@sendu.me.uk>
Message-ID:

You are fixing Bioperl 1.x, right? :-)

So, if you fixed it as much as you could without turning Bioperl 1.x
into Bioperl 2.x it is resolved, or if there is no such fix, won't fix.

If you have ideas though for how to do it better, just you can't
implement those thoughts because of compatibility constraints, it's
always worthwhile capturing them and writing them down on e.g. a wiki
page, such as one that collects thoughts towards Bioperl 2.x, or call it
Bioperl-Warp9 ;)

	-hilmar

On Aug 18, 2006, at 2:50 AM, Sendu Bala wrote:

> Hilmar Lapp wrote:
>> I wouldn't do any of this. It is at best unexpected code with
>> expected behavior, and from there it only gets worse. I don't see why
>> the loss of standard constructor implementation and behavior is worth
>> a speed-up of less than several fold.
> [...]
>> I.e., if you talk about drastic architecture changes you are no
>> longer talking about Bioperl as we know it ("1.x").
>
> Well, we come back to my original question in this thread.
When can we > consider the priority list item resolved? Is it resolved when we've > sped > it up as much as possible without doing anything drastic ('resolved > fixed')? Or is it only resolved when we've done everything we can > think > of to speed it up, even drastic things? In the later case, do we just > leave it 'verified' until some possible Bioperl 2.0, or do we instead > just say 'resolved wontfix'? > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Fri Aug 18 09:52:54 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 18 Aug 2006 09:52:54 -0400 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E565A9.3050008@sendu.me.uk> References: <001b01c6c23e$9aa67940$15327e82@pyrimidine> <44E4E570.2040503@sendu.me.uk> <9C1F95DA-CDD4-4A68-916C-8A75CA10F935@uiuc.edu> <44E565A9.3050008@sendu.me.uk> Message-ID: <18D1C818-DB29-41EF-8314-DB871DDA992E@gmx.net> On Aug 18, 2006, at 3:00 AM, Sendu Bala wrote: > Maybe I'm wrong about that - is it reasonable to just come up with a > whole new system for returning the results, and have users learn to > use > the new system? My take on this is that if you know your input format and you know how to use perl regular expressions and you need only some small pieces out of a report or sequence file and you just need this quick, then you might as well set Bioperl aside and do a straight perl one-off. I.e., I don't think Bioperl, or any other toolkit that as its main benefits offers a consistent API, object model, tried-and-tested parsers, etc, should try to be the natural choice for a use case when you don't need any of these benefits. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Fri Aug 18 09:58:21 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 18 Aug 2006 08:58:21 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: <44E5BC6E.7030805@sendu.me.uk> Message-ID: <001f01c6c2ce$5f1cc3d0$15327e82@pyrimidine> > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Sendu Bala > Sent: Friday, August 18, 2006 8:11 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] SearchIO speed up > > Chris Fields wrote: > > On Aug 18, 2006, at 2:00 AM, Sendu Bala wrote: > > > >>> So far, sorry to say, it's debatable whether a 1.5-fold increase > >>> in speed along with even small API changes is worth all the > >>> effort you are putting into it. > > > >> To be fair, no API change is required, and it only took a few > >> minutes to implement and try the idea out :) > > > > Maybe I'm missing something here; didn't you say it failed tests > > somewhere? That's suggestive of API problems. > > The alternate suggestion using > my $self = $class->Bio::Root::Root::new(@args); > doesn't cause any test failures because it doesn't involve any API > change, only a harmless implementation change. Hilmar wasn't happy with > that because of 'loss of standard constructor implementation and > behavior'. To be honest, the current implementation of GenericHSP and > SimilarityPair&ancestors is a bit of a messy kludge with lots of wasted > work, which is why we get the speed up in the first place by going > straight to Root. Okay. I understand his objection to it (maintain constructor behavior by chaining them), but I also see your reasoning. This may be one of those points where code obfuscation isn't worth the small increase in speed. 
I'm Switzerland on this point (neutral).

> My PullParser modules solve the problem in a much better way (read:
> Hilmar would have no objections), but I was hoping for something that
> would work for all existing SearchIO modules as well.

I don't think you'll get that. I agree with Hilmar there (that drastic
changes in speed would be necessary to warrant API changes). If you can
demonstrate 'drastic changes' in speed, even with API changes, then it
may be feasible to introduce them alongside the current implementation
and let the user decide. The end user can make the decision on whether
to use the older slower modules or the faster ones.

Remember, these are all committed to an experimental branch so changes
don't pollute the main trunk (and the original SearchIO is largely
intact, with no API changes). Hence these are available to anyone who
wishes to test them out. You can always add demo scripts in the SYNOPSIS
if there are API issues (such as if you return hashes, for instance).

> It doesn't matter in the end: it was easy to suggest it, and just as
> easy to not use it if people are unhappy with it.

I don't have problems with changes, even small API changes. But we have
to deal with the long-term repercussions of making such changes via
broken scripts, bug reports, etc. Small incremental speed increases may
not be worth the extra headache of having to deal with the onslaught of
ticked-off users.

cjf

From hlapp at gmx.net Fri Aug 18 10:00:24 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 18 Aug 2006 10:00:24 -0400
Subject: [Bioperl-l] informative codons method for kaks --Bio/Align/DNAStatistics.pm
In-Reply-To: <1155898977.15817.3.camel@localhost>
References: <007701c6c21b$b08d65c0$15327e82@pyrimidine> <1155898977.15817.3.camel@localhost>
Message-ID: <571BBB96-98C2-4611-B505-3D50086126DC@gmx.net>

On Aug 18, 2006, at 7:02 AM, Albert Vilella wrote:

> How do you like the method name? Suggestions?

Sounds fine to me ...
except in German it sounds very funny or a bit offensive, depending on
your sense of humor (kaks = poop).

However, first I guess the lingua franca is what matters, and second
regularly testing the humor of Germans is not a bad thing to do ;)

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From avilella at gmail.com Fri Aug 18 10:03:46 2006
From: avilella at gmail.com (Albert Vilella)
Date: Fri, 18 Aug 2006 15:03:46 +0100
Subject: [Bioperl-l] informative codons method for kaks --Bio/Align/DNAStatistics.pm
In-Reply-To: <571BBB96-98C2-4611-B505-3D50086126DC@gmx.net>
References: <007701c6c21b$b08d65c0$15327e82@pyrimidine> <1155898977.15817.3.camel@localhost> <571BBB96-98C2-4611-B505-3D50086126DC@gmx.net>
Message-ID: <1155909826.17178.3.camel@localhost>

On Fri, 2006-08-18 at 10:00 -0400, Hilmar Lapp wrote:
> On Aug 18, 2006, at 7:02 AM, Albert Vilella wrote:
>
> > How do you like the method name? Suggestions?
>
> Sounds fine to me ... except in German it sounds very funny or a bit
> offensive, depending on your sense of humor (kaks = poop).

We can also use the "dnds" notation, although then it won't be
systematic with the other "poop"'s in DNAStatistics :)

>
> However, first I guess the lingua franca is what matters, and second
> regularly testing the humor of Germans is not a bad thing to do ;)
>
> -hilmar
>

From jeremy_just at netcourrier.com Fri Aug 18 09:51:41 2006
From: jeremy_just at netcourrier.com (Jérémy JUST)
Date: Fri, 18 Aug 2006 15:51:41 +0200
Subject: [Bioperl-l] How to get rid of warnings
Message-ID: <20060818155141.000068ec@pearson.versailles.inra.fr>

Hello,

I can't manage to get rid of all warnings during a query from GenBank
with BioPerl 1.4. When the query doesn't give any result, I catch the
exception, but there is still a warning that is issued to STDERR.
This is my code:

<<<<<<<<
#!/usr/bin/perl

use strict ;
use warnings ;

use Bio::DB::GenBank ;

my $gb = Bio::DB::GenBank->new( -verbose => -1 ) ;

my $query = Bio::DB::Query::GenBank->new
    (-query => "FooBar",
     -db => 'nuccore',
     -verbose => -1
    ) ;

my $seqio ;
eval
  {$seqio = $gb->get_Stream_by_query( $query ) ;
  } ;

if ( $@ )
  {print STDERR "Cannot find a sequence for FooBar\n" ;
  }

exit 0 ;
>>>>>>>>

and the unwanted warning is:
<<<<<
Warning(s) from GenBank:
FooBar
>>>>>

How could I set verbosity so that I don't get this warning?

Thanks,

--
Jérémy JUST

From sdavis2 at mail.nih.gov Fri Aug 18 10:15:26 2006
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Fri, 18 Aug 2006 10:15:26 -0400
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <18D1C818-DB29-41EF-8314-DB871DDA992E@gmx.net>
Message-ID:

On 8/18/06 9:52 AM, "Hilmar Lapp" wrote:

>
> On Aug 18, 2006, at 3:00 AM, Sendu Bala wrote:
>
>> Maybe I'm wrong about that - is it reasonable to just come up with a
>> whole new system for returning the results, and have users learn to
>> use
>> the new system?
>
> My take on this is that if you know your input format and you know
> how to use perl regular expressions and you need only some small
> pieces out of a report or sequence file and you just need this quick,
> then you might as well set Bioperl aside and do a straight perl one-off.
>
> I.e., I don't think Bioperl, or any other toolkit that as its main
> benefits offers a consistent API, object model, tried-and-tested
> parsers, etc, should try to be the natural choice for a use case when
> you don't need any of these benefits.

I can't agree more here. There are many cases with biological data and
bioinformatics where bioperl is just not the right answer, and that is OK!
Sean

From bix at sendu.me.uk Fri Aug 18 11:04:57 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Fri, 18 Aug 2006 16:04:57 +0100
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <20060818155141.000068ec@pearson.versailles.inra.fr>
References: <20060818155141.000068ec@pearson.versailles.inra.fr>
Message-ID: <44E5D719.4030808@sendu.me.uk>

Jérémy JUST wrote:
> Hello,
>
> I can't manage to get rid of all warnings during a query from GenBank
> with BioPerl 1.4. When the query doesn't give any result, I catch the
> exception, but there is still a warning that is issued to STDERR.
[...]
> and the unwanted warning is:
> <<<<<
> Warning(s) from GenBank:
> FooBar
>
> How could I set verbosity so that I don't get this warning?

You can't easily or appropriately do it with bioperl 1.4, but I've just
fixed the problem in cvs. You might be able to just download revision
1.18 of Bio/DB/Query/GenBank.pm from bioperl-live:
http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/Bio/DB/Query/

(The perl hack would be to redirect STDERR to /dev/null inside your eval.)

From cjfields at uiuc.edu Fri Aug 18 11:06:28 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 18 Aug 2006 10:06:28 -0500
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <20060818155141.000068ec@pearson.versailles.inra.fr>
Message-ID: <002001c6c2d7$e1d9c3a0$15327e82@pyrimidine>

Jeremy,

Looks like Sendu is fixing this in CVS. The warning didn't use BioPerl's
built-in error handling.

Chris

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Jérémy JUST
> Sent: Friday, August 18, 2006 8:52 AM
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] How to get rid of warnings
>
>
> Hello,
>
> I can't manage to get rid of all warnings during a query from GenBank
> with BioPerl 1.4. When the query doesn't give any result, I catch the
> exception, but there is still a warning that is issued to STDERR.
> > This is my code: > > <<<<<<<< > #!/usr/bin/perl > > use strict ; > use warnings ; > > use Bio::DB::GenBank ; > > my $gb = Bio::DB::GenBank->new( -verbose => -1 ) ; > > my $query = Bio::DB::Query::GenBank->new > (-query => "FooBar", > -db => 'nuccore', > -verbose => -1 > ) ; > > my $seqio ; > eval > {$seqio = $gb->get_Stream_by_query( $query ) ; > } ; > > if ( $@ ) > {print STDERR "Cannot find a sequence for FooBar\n" ; > } > > exit 0 ; > >>>>>>>> > > and the unwanted warning is: > <<<<< > Warning(s) from GenBank: > FooBar > >>>>> > > How could I set verbosity so that I don't get this warning? > > > Thanks, > > -- > J?r?my JUST > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From lincoln.stein at gmail.com Thu Aug 17 18:32:35 2006 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Thu, 17 Aug 2006 18:32:35 -0400 Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF In-Reply-To: References: Message-ID: <6dce9a0b0608171532j77da0146x95d5023200801cb5@mail.gmail.com> Let me know how it works. I also get a few of the warnings about the ortho:* features. They don't seem to hurt anything so you can go ahead and use fast loading if you want. The long-term fix is to sort the GFF3 files so that all features that share the same ID occur next to each other. Lincoln On 8/17/06, Marco Blanchette wrote: > > I will answer my own question... > > Yes, one can load the fasta file after having loaded the gff file by > doing: > > bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta > > Marco > > > On 8/17/06 11:20, "Marco Blanchette" wrote: > > > Lincoln, thanks for the precision. I just could not find any references > to > > how to load the DNA (no where in bp_seqfeature_load.pl or in the > > Bio::DB::SeqFeature::Store it says how load the DNA sequences). 
> > > > So right now the gff files were loaded in mysql using: > > /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff > > > > I tried the --fast options but got a bunch of warning (see below). > > > > The DNA file (a single fasta database file containing all chromosome > > sequences) was in a different location from the gff files and was not > loaded > > together with the gff files (the sequence table is empty in the > database). > > > > Can I load the DNA sequence after the gff files were loaded? > > > > Many thanks > > > > Marco > > > > > > -------------------- WARNING --------------------- > > MSG: ID=ortho:2825 has been used more than once, but it cannot be found > in > > the database. > > This can happen if you have specified fast loading, but features sharing > the > > same ID > > are not contiguous in the GFF file. This will be loaded as a separate > > feature. > > Line 483681: "X . orthologous_region 19477824 19478027 > > . + . > > ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" > > > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature > > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load > > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh > > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 > > STACK Bio::DB::SeqFeature::Store::GFF3Loader::load > > /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 > > STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 > > > > > > On 8/17/06 10:27, "Lincoln Stein" wrote: > > > >> Hi, > >> > >> This message bounced because I tried to send it from my gmail account > and so > >> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it > >> finds a file that contains DNA data, it simply loads it. There is no > special > >> command line switch. Also you can include the DNA in the GFF3 file. 
> >> > >> Lincoln > >> > >> ---------- Forwarded message ---------- > >> From: Lincoln Stein > >> Date: Aug 17, 2006 12:26 PM > >> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF > >> To: Chris Fields > >> Cc: Marco Blanchette , " > bioperl-l at lists.open-bio.org" > >> , cain.cshl at gmail.com > >> > >> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class > to > >> load the GFF3-format Fly data? I think you're probably getting confused > by > >> overlapping mRNA splice forms, an issue that won't occur with the full > >> GFF3-formatted data. > >> > >> > >> On 8/13/06, Chris Fields wrote: > >>> > >>> Marco, > >>> > >>> Did you figure out what the problem was? I was curious; the issue > >>> you were having was rather odd. I wanted to see if it was an issue > >>> with the GFF data or with the database itself. > >>> > >>> Chris > >>> > >>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: > >>> > >>>> Chris, > >>>> > >>>>> Do you mean you get duplicates of sequences back, or that you get > >>>>> more than > >>>>> one chunk of the same sequence back? > >>>> > >>>> I sometimes get duplicated sequences and sometimes overlapping > >>>> regions (see > >>>> bellow) > >>>> > >>>>> > >>>>> Is it possible that each query using an ID could contain more than > >>>>> one > >>>>> feature? That might explain it (you could check by testing the > >>>>> size of the > >>>>> array @feats). > >>>> Most id return more than one features from various type > >>>> ( point_mutation, > >>>> insertion_site, processed_transcript, etc...). That's why I > >>>> restirct the > >>>> output to type "gene" using regexp /gene/ on $f->type. > >>>> > >>>>> > >>>>> I'm not sure how split locations are handled within Bio:DB::GFF, > >>>>> but do the > >>>>> specific features have split locations? 
> >>>>> > >>>>> Chris > >>>>> > >>>> Not sure what you mean exactly but have a look at the following > >>>> script, it > >>>> gives the location and the group id of the feature being reported: > >>>> > >>>> use Bio::DB::GFF; > >>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > >>>> -dsn => > >>>> 'dbi:mysql:database=dmel_43_new'); > >>>> my %dups; > >>>> while (<>){ > >>>> chomp; > >>>> my $id = $_; > >>>> my @feat = $db->get_feature_by_name($id); > >>>> > >>>> for my $f (@feat){ > >>>> if (exists $dups{$f->group} && $f->type =~/gene/){ > >>>> print "Calling >>>", $f->group, "\n"; > >>>> print "Chr: ", $f->refseq, > >>>> " Strand: ", $f->strand, > >>>> " Start: ", $f->start, > >>>> " End: ", $f->end, > >>>> "\n"; > >>>> print "Offending >>>", $dups{$f->group}->group, "\n"; > >>>> print "Chr: ", $dups{$f->group}->refseq, > >>>> " Strand: ", $dups{$f->group}->strand, > >>>> " Start: ", $dups{$f->group}->start, > >>>> " End: ", $dups{$f->group}->end; > >>>> print "\n\n"; > >>>> } else { > >>>> $dups{$f->group} = $f; > >>>> } > >>>> } > >>>> } > >>>> > >>>> Here is the output: > >>>> Calling >>>FBgn0004179 > >>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 > >>>> Offending >>>FBgn0004179 > >>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 > >>>> > >>>> Calling >>>FBgn0025681 > >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > >>>> Offending >>>FBgn0025681 > >>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 > >>>> > >>>> Calling >>>FBgn0025803 > >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > >>>> Offending >>>FBgn0025803 > >>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 > >>>> > >>>> Calling >>>FBgn0000117 > >>>> Chr: X Strand: -1 Start: 1756796 End: 1747557 > >>>> Offending >>>FBgn0000117 > >>>> Chr: X Strand: -1 Start: 1757776 End: 1747182 > >>>> > >>>> Calling >>>FBgn0005427 > >>>> Chr: X Strand: -1 Start: 136456 End: 125343 > >>>> Offending >>>FBgn0005427 > >>>> Chr: X Strand: -1 Start: 133199 End: 124949 > >>>> > >>>> Calling 
>>>FBgn0000042 > >>>> Chr: X Strand: 1 Start: 5746100 End: 5750026 > >>>> Offending >>>FBgn0000042 > >>>> Chr: X Strand: 1 Start: 5746096 End: 5746106 > >>>> > >>>> Calling >>>FBgn0004551 > >>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 > >>>> Offending >>>FBgn0004551 > >>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 > >>>> > >>>> Do you have any suggestions?? Is the procedure I am using to > >>>> retrieve the > >>>> genes right? > >>>> > >>>> Many thanks > >>>> > >>>> Marco > >>>> > >>>> > >>>> > >>>>>> Many thanks Scott, > >>>>>> > >>>>>> At the same time I got your email I was coming to the same > >>>>>> conclusion as > >>>>>> you. > >>>>>> > >>>>>> Now I have a stranger problem in my hands... My goal is quite > >>>>>> simple, I > >>>>>> try > >>>>>> to get the sequence of the genes back from the Bio::DB::GFF > database > >>>>>> loaded > >>>>>> on MySQL. The gene list is from a file with one gene id per per > >>>>>> line. When > >>>>>> I > >>>>>> run the following script: > >>>>>> > >>>>>> > >>>>>> > >>>>>> use Bio::DB::GFF; > >>>>>> use Bio::SeqIO; > >>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, > >>>>>> -format => 'fasta'); > >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', > >>>>>> -dsn => > >>>>>> 'dbi:mysql:database=dmel_43_new'); > >>>>>> > >>>>>> while (<>){ > >>>>>> chomp; > >>>>>> my $id = $_; > >>>>>> my @feats = $db->get_feature_by_name($id); > >>>>>> for my $f (@feats){ > >>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; > >>>>>> } > >>>>>> } > >>>>>> > >>>>>> > >>>>>> I get more sequence back than the number of gene in my input file. > I > >>>>>> double > >>>>>> check there. Some of the duplicated entries are the same, some > >>>>>> are not! > >>>>> > >>>>> > >>>>> ... 
> >>>>> > >>>>> _______________________________________________ > >>>>> Bioperl-l mailing list > >>>>> Bioperl-l at lists.open-bio.org > >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>>> > >>>> ______________________________ > >>>> Marco Blanchette, Ph.D. > >>>> > >>>> mblanche at uclink.berkeley.edu > >>>> > >>>> Donald C. Rio's lab > >>>> Department of Molecular and Cell Biology > >>>> 16 Barker Hall > >>>> University of California > >>>> Berkeley, CA 94720-3204 > >>>> > >>>> Tel: (510) 642-1084 > >>>> Cell: (510) 847-0996 > >>>> Fax: (510) 642-6062 > >>>> -- > >>>> > >>>> > >>>> > >>>> _______________________________________________ > >>>> Bioperl-l mailing list > >>>> Bioperl-l at lists.open-bio.org > >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >>> Christopher Fields > >>> Postdoctoral Researcher > >>> Lab of Dr. Robert Switzer > >>> Dept of Biochemistry > >>> University of Illinois Urbana-Champaign > >>> > >>> > >>> > >>> _______________________________________________ > >>> Bioperl-l mailing list > >>> Bioperl-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >> > >> > > > > ______________________________ > > Marco Blanchette, Ph.D. > > > > mblanche at uclink.berkeley.edu > > > > Donald C. Rio's lab > > Department of Molecular and Cell Biology > > 16 Barker Hall > > University of California > > Berkeley, CA 94720-3204 > > > > Tel: (510) 642-1084 > > Cell: (510) 847-0996 > > Fax: (510) 642-6062 > > ______________________________ > Marco Blanchette, Ph.D. > > mblanche at uclink.berkeley.edu > > Donald C. Rio's lab > Department of Molecular and Cell Biology > 16 Barker Hall > University of California > Berkeley, CA 94720-3204 > > Tel: (510) 642-1084 > Cell: (510) 847-0996 > Fax: (510) 642-6062 > -- > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. 
Stein Cold Spring Harbor Laboratory 1 Bungtown Road Cold Spring Harbor, NY 11724 (516) 367-8380 (voice) (516) 367-8389 (fax) FOR URGENT MESSAGES & SCHEDULING, PLEASE CONTACT MY ASSISTANT, SANDRA MICHELSEN, AT michelse at cshl.edu From cjfields at uiuc.edu Fri Aug 18 11:39:18 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 18 Aug 2006 10:39:18 -0500 Subject: [Bioperl-l] SearchIO speed up In-Reply-To: Message-ID: <002201c6c2dc$78f25960$15327e82@pyrimidine> > On 8/18/06 9:52 AM, "Hilmar Lapp" wrote: > > > > > On Aug 18, 2006, at 3:00 AM, Sendu Bala wrote: ... > > I.e., I don't think Bioperl, or any other toolkit that as its main > > benefits offers a consistent API, object model, tried-and-tested > > parsers, etc, should try to be the natural choice for a use case when > > you don't need any of these benefits. > > I can't agree more here. There are many cases with biological data and > bioinformatics where bioperl is just not the right answer, and that is OK! > > Sean True. I agree that, if speed is an issue, BioPerl may not be the best solution. But I think it is feasible to make SearchIO faster. Just not the way it is currently implemented. I think that our focus also needs to include SeqIO. Some significant sequence parsing slowdowns were noted between bioperl 1.4 and bioperl 1.5.1, probably tied to the addition of several new classes. There may not be a way around this except SwissKnife-like 'lazy parsing.' Hilmar's wiki idea for Bioperl 2.0 (Warp9?) is already in place in a way: there is a 'Beyond' section for Bioperl Releases, so it wouldn't be too hard to set up a link to a new page for BioPerl 2.0 ideas. 
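
For readers unfamiliar with the term, 'lazy parsing' means keeping the
raw record text around and only parsing a field the first time it is
requested. The following is a generic Perl sketch of that idea under
stated assumptions: the LazyRecord class and its FASTA-like input are
hypothetical, and nothing here is taken from SwissKnife, SeqIO, or any
other BioPerl module.

```perl
use strict;
use warnings;

package LazyRecord;    # hypothetical class, for illustration only

# Store the raw record text; no parsing happens at construction time.
sub new {
    my ($class, $raw) = @_;
    return bless { raw => $raw, parsed => {} }, $class;
}

# Parse the identifier from the header line on first request, then cache it.
sub id {
    my $self = shift;
    unless (exists $self->{parsed}{id}) {
        my ($id) = $self->{raw} =~ /^>(\S+)/;
        $self->{parsed}{id} = $id;
    }
    return $self->{parsed}{id};
}

# Join the sequence lines on first request, then cache the result.
sub seq {
    my $self = shift;
    unless (exists $self->{parsed}{seq}) {
        my (undef, @lines) = split /\n/, $self->{raw};
        $self->{parsed}{seq} = join '', @lines;
    }
    return $self->{parsed}{seq};
}

package main;

my $rec = LazyRecord->new(">seq1 demo record\nACGT\nTTGA\n");
print $rec->id,  "\n";    # only the header line is examined here
print $rec->seq, "\n";    # sequence lines are joined on first request
```

A script that only ever asks for the identifier never pays for building
the sequence string, which is the kind of saving lazy parsing is meant
to provide.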
Chris

From bix at sendu.me.uk  Fri Aug 18 11:50:39 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Fri, 18 Aug 2006 16:50:39 +0100
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <44E5DCB0.1050806@genomics.dk>
References: <20060818155141.000068ec@pearson.versailles.inra.fr>
	<44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk>
Message-ID: <44E5E1CF.40509@sendu.me.uk>

Niels Larsen wrote:
> I am a casual user of bioperl with my own error handling. Is there
> a way to switch off bioperls error handling? I switch off my own
> with
>
> $Common::Config::with_errors = 0;
> $Common::Config::with_warnings = 0;

There's no current sanctioned way of doing this. It would be relatively
easy to implement though. One possible way is to alter Bio::Root::RootI like:

104c104 (at top)
< use vars qw($DEBUG $ID $Revision $VERBOSITY);
---
> use vars qw($DEBUG $ID $Revision $VERBOSITY $WARNINGS);
114a115 (in BEGIN)
>     $WARNINGS = 1;
174c175,180 (in &warn)
<     my ($self,$string) = @_;
---
>     my ($self,$string,$switch) = @_;
>
>     if ($switch) {
>         $WARNINGS = $WARNINGS ? 0 : 1;
>     }
>     return unless $WARNINGS;

Then you could do:

my $switcher = new Bio::Root::Root();
$switcher->warn(undef, 1); # globally switch all warnings off
$switcher->warn(undef, 1); # switch all warnings back on

Or something along those lines. There's probably a nicer way of doing it.

From genomewalker at gmail.com  Fri Aug 18 14:05:55 2006
From: genomewalker at gmail.com (Antonio)
Date: Fri, 18 Aug 2006 11:05:55 -0700 (PDT)
Subject: [Bioperl-l] Motifs and aligned sequences
Message-ID: <5874751.post@talk.nabble.com>


Hello all,
I am trying to find the solution of this problem, I've tried several options
but no way. I want to find a motif in an aligned sequence, eg:
Aligned Sequence: -G---ATT---AT--ATA
Motif: ATTATA
So i want to find the motif inside this sequence and return the last
position of the motif in the aligned sequence, in this case 16. I don't know
how I've to play with the '-', any suggestions?
Thanks in advance!
Antonio
--
View this message in context: http://www.nabble.com/Motifs-and-aligned-sequences-tf2128750.html#a5874751
Sent from the Perl - Bioperl-L forum at Nabble.com.

From ffauteux at gmail.com  Fri Aug 18 15:33:24 2006
From: ffauteux at gmail.com (francois fauteux)
Date: Fri, 18 Aug 2006 15:33:24 -0400
Subject: [Bioperl-l] DBA_GenericHSP_output???
Message-ID: <7a2727b90608181233od759396wd96c6c439fa30861@mail.gmail.com>

Hi;

DBA with non coding DNA; does something when launched:

Find start end points: [0,1001][0,1002] Score -2990
Recovering alignment: Alignment recoveredplicit read off

I can't find a way to get the output (alignment)...

The code looks like this (see DBA.pm):

my @params = ('matchA' => 0.75, 'matchB' => '0.55','dymem'=>'linear');
my $factory = Bio::Tools::Run::Alignment::DBA->new(@params);
$inputfilename = 'seqs.fasta';
#@hsps is an array of GenericHSP objects
my @hsps = $factory->align($inputfilename);

Missing the howto output pretty alignment...

Many thanks;

François

From mblanche at berkeley.edu  Fri Aug 18 15:52:33 2006
From: mblanche at berkeley.edu (Marco Blanchette)
Date: Fri, 18 Aug 2006 12:52:33 -0700
Subject: [Bioperl-l] Fwd: Extracting gene seq from Bio::DB::GFF
In-Reply-To: <6dce9a0b0608171532j77da0146x95d5023200801cb5@mail.gmail.com>
Message-ID:

Many thanks Lincoln,

Bio::DB::SeqFeature::Store seems to work fine for me, as in:

use Bio::DB::SeqFeature::Store;

my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
                                         -dsn     => 'dbi:mysql:dmel_43_SeqF');

while (<>){
    chomp;
    my $id = $_;
    my @feats = $db->get_features_by_alias($id);
    for my $f (@feats){
        print "$id -> ", $f->name, "\n" if $f->type eq 'gene';
    }
}

This takes a list of FBgn ids and spits out the gene names. The good thing
now is that I am getting the same number of output genes as the number of
genes in my starting list (as opposed to when I was using Bio::DB::GFF).

My only problem is that I had to guess that the methods type() and
attributes() were available.
My understanding by now is that get_features_by_alias() return a Bio::DB::Feature, however, I couldn't find any documentation on that object (it does not return a Bio::SeqFeatureI as I originally thought). Is the Bio::DB::Feature essentially a clone of the Bio::DB::GFF::Feature? Many thanks again, Marco On 8/17/06 15:32, "Lincoln Stein" wrote: > Let me know how it works. > > I also get a few of the warnings about the ortho:* features. They don't seem > to hurt anything so you can go ahead and use fast loading if you want. The > long-term fix is to sort the GFF3 files so that all features that share the > same ID occur next to each other. > > Lincoln > > On 8/17/06, Marco Blanchette wrote: >> >> I will answer my own question... >> >> Yes, one can load the fasta file after having loaded the gff file by >> doing: >> >> bp_seqfeature_load.pl -d dmel_43_SF_slow dmel-all-chromosome-r4.3.fasta >> >> Marco >> >> >> On 8/17/06 11:20, "Marco Blanchette" wrote: >> >>> Lincoln, thanks for the precision. I just could not find any references >> to >>> how to load the DNA (no where in bp_seqfeature_load.pl or in the >>> Bio::DB::SeqFeature::Store it says how load the DNA sequences). >>> >>> So right now the gff files were loaded in mysql using: >>> /usr/bin/bp_seqfeature_load.pl -d dmel_43_SF_slow *.gff >>> >>> I tried the --fast options but got a bunch of warning (see below). >>> >>> The DNA file (a single fasta database file containing all chromosome >>> sequences) was in a different location from the gff files and was not >> loaded >>> together with the gff files (the sequence table is empty in the >> database). >>> >>> Can I load the DNA sequence after the gff files were loaded? >>> >>> Many thanks >>> >>> Marco >>> >>> >>> -------------------- WARNING --------------------- >>> MSG: ID=ortho:2825 has been used more than once, but it cannot be found >> in >>> the database. 
>>> This can happen if you have specified fast loading, but features sharing >> the >>> same ID >>> are not contiguous in the GFF file. This will be loaded as a separate >>> feature. >>> Line 483681: "X . orthologous_region 19477824 19478027 >>> . + . >>> ID=ortho:2825;to_name=FBpp0074514,CG14214-PA;to_species=dpse" >>> >>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::handle_feature >>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:537 >>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::do_load >>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:424 >>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load_fh >>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:342 >>> STACK Bio::DB::SeqFeature::Store::GFF3Loader::load >>> /Library/Perl/5.8.6/Bio/DB/SeqFeature/Store/GFF3Loader.pm:240 >>> STACK toplevel /usr/bin/bp_seqfeature_load.pl:81 >>> >>> >>> On 8/17/06 10:27, "Lincoln Stein" wrote: >>> >>>> Hi, >>>> >>>> This message bounced because I tried to send it from my gmail account >> and so >>>> I'm sending it again. Bio::DB::SeqFeature::Store *does* load DNA. If it >>>> finds a file that contains DNA data, it simply loads it. There is no >> special >>>> command line switch. Also you can include the DNA in the GFF3 file. >>>> >>>> Lincoln >>>> >>>> ---------- Forwarded message ---------- >>>> From: Lincoln Stein >>>> Date: Aug 17, 2006 12:26 PM >>>> Subject: Re: [Bioperl-l] Extracting gene seq from Bio::DB::GFF >>>> To: Chris Fields >>>> Cc: Marco Blanchette , " >> bioperl-l at lists.open-bio.org" >>>> , cain.cshl at gmail.com >>>> >>>> I'm curious. Could you try using the Bio::DB::SeqFeature::Store class >> to >>>> load the GFF3-format Fly data? I think you're probably getting confused >> by >>>> overlapping mRNA splice forms, an issue that won't occur with the full >>>> GFF3-formatted data. >>>> >>>> >>>> On 8/13/06, Chris Fields wrote: >>>>> >>>>> Marco, >>>>> >>>>> Did you figure out what the problem was? 
I was curious; the issue >>>>> you were having was rather odd. I wanted to see if it was an issue >>>>> with the GFF data or with the database itself. >>>>> >>>>> Chris >>>>> >>>>> On Aug 11, 2006, at 6:59 PM, Marco Blanchette wrote: >>>>> >>>>>> Chris, >>>>>> >>>>>>> Do you mean you get duplicates of sequences back, or that you get >>>>>>> more than >>>>>>> one chunk of the same sequence back? >>>>>> >>>>>> I sometimes get duplicated sequences and sometimes overlapping >>>>>> regions (see >>>>>> bellow) >>>>>> >>>>>>> >>>>>>> Is it possible that each query using an ID could contain more than >>>>>>> one >>>>>>> feature? That might explain it (you could check by testing the >>>>>>> size of the >>>>>>> array @feats). >>>>>> Most id return more than one features from various type >>>>>> ( point_mutation, >>>>>> insertion_site, processed_transcript, etc...). That's why I >>>>>> restirct the >>>>>> output to type "gene" using regexp /gene/ on $f->type. >>>>>> >>>>>>> >>>>>>> I'm not sure how split locations are handled within Bio:DB::GFF, >>>>>>> but do the >>>>>>> specific features have split locations? 
>>>>>>> >>>>>>> Chris >>>>>>> >>>>>> Not sure what you mean exactly but have a look at the following >>>>>> script, it >>>>>> gives the location and the group id of the feature being reported: >>>>>> >>>>>> use Bio::DB::GFF; >>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>> -dsn => >>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>> my %dups; >>>>>> while (<>){ >>>>>> chomp; >>>>>> my $id = $_; >>>>>> my @feat = $db->get_feature_by_name($id); >>>>>> >>>>>> for my $f (@feat){ >>>>>> if (exists $dups{$f->group} && $f->type =~/gene/){ >>>>>> print "Calling >>>", $f->group, "\n"; >>>>>> print "Chr: ", $f->refseq, >>>>>> " Strand: ", $f->strand, >>>>>> " Start: ", $f->start, >>>>>> " End: ", $f->end, >>>>>> "\n"; >>>>>> print "Offending >>>", $dups{$f->group}->group, "\n"; >>>>>> print "Chr: ", $dups{$f->group}->refseq, >>>>>> " Strand: ", $dups{$f->group}->strand, >>>>>> " Start: ", $dups{$f->group}->start, >>>>>> " End: ", $dups{$f->group}->end; >>>>>> print "\n\n"; >>>>>> } else { >>>>>> $dups{$f->group} = $f; >>>>>> } >>>>>> } >>>>>> } >>>>>> >>>>>> Here is the output: >>>>>> Calling >>>FBgn0004179 >>>>>> Chr: 3L Strand: 1 Start: 22201102 End: 22207587 >>>>>> Offending >>>FBgn0004179 >>>>>> Chr: 3L Strand: 1 Start: 22200575 End: 22200575 >>>>>> >>>>>> Calling >>>FBgn0025681 >>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>> Offending >>>FBgn0025681 >>>>>> Chr: 2L Strand: 1 Start: 2992964 End: 2998614 >>>>>> >>>>>> Calling >>>FBgn0025803 >>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>> Offending >>>FBgn0025803 >>>>>> Chr: 3R Strand: 1 Start: 16966463 End: 17038413 >>>>>> >>>>>> Calling >>>FBgn0000117 >>>>>> Chr: X Strand: -1 Start: 1756796 End: 1747557 >>>>>> Offending >>>FBgn0000117 >>>>>> Chr: X Strand: -1 Start: 1757776 End: 1747182 >>>>>> >>>>>> Calling >>>FBgn0005427 >>>>>> Chr: X Strand: -1 Start: 136456 End: 125343 >>>>>> Offending >>>FBgn0005427 >>>>>> Chr: X Strand: -1 Start: 133199 End: 124949 >>>>>> >>>>>> Calling 
>>>FBgn0000042 >>>>>> Chr: X Strand: 1 Start: 5746100 End: 5750026 >>>>>> Offending >>>FBgn0000042 >>>>>> Chr: X Strand: 1 Start: 5746096 End: 5746106 >>>>>> >>>>>> Calling >>>FBgn0004551 >>>>>> Chr: 2R Strand: -1 Start: 19443485 End: 19434556 >>>>>> Offending >>>FBgn0004551 >>>>>> Chr: 2R Strand: -1 Start: 19445155 End: 19429977 >>>>>> >>>>>> Do you have any suggestions?? Is the procedure I am using to >>>>>> retrieve the >>>>>> genes right? >>>>>> >>>>>> Many thanks >>>>>> >>>>>> Marco >>>>>> >>>>>> >>>>>> >>>>>>>> Many thanks Scott, >>>>>>>> >>>>>>>> At the same time I got your email I was coming to the same >>>>>>>> conclusion as >>>>>>>> you. >>>>>>>> >>>>>>>> Now I have a stranger problem in my hands... My goal is quite >>>>>>>> simple, I >>>>>>>> try >>>>>>>> to get the sequence of the genes back from the Bio::DB::GFF >> database >>>>>>>> loaded >>>>>>>> on MySQL. The gene list is from a file with one gene id per per >>>>>>>> line. When >>>>>>>> I >>>>>>>> run the following script: >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> use Bio::DB::GFF; >>>>>>>> use Bio::SeqIO; >>>>>>>> my $out = Bio::SeqIO->new( -fh => \*STDOUT, >>>>>>>> -format => 'fasta'); >>>>>>>> my $db = Bio::DB::GFF->new( -adaptor => 'dbi::mysql', >>>>>>>> -dsn => >>>>>>>> 'dbi:mysql:database=dmel_43_new'); >>>>>>>> >>>>>>>> while (<>){ >>>>>>>> chomp; >>>>>>>> my $id = $_; >>>>>>>> my @feats = $db->get_feature_by_name($id); >>>>>>>> for my $f (@feats){ >>>>>>>> $out->write_seq( $f->seq ) if $f->type =~/gene/; >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> >>>>>>>> I get more sequence back than the number of gene in my input file. >> I >>>>>>>> double >>>>>>>> check there. Some of the duplicated entries are the same, some >>>>>>>> are not! >>>>>>> >>>>>>> >>>>>>> ... 
>>>>>>> >>>>>>> _______________________________________________ >>>>>>> Bioperl-l mailing list >>>>>>> Bioperl-l at lists.open-bio.org >>>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>>> >>>>>> ______________________________ >>>>>> Marco Blanchette, Ph.D. >>>>>> >>>>>> mblanche at uclink.berkeley.edu >>>>>> >>>>>> Donald C. Rio's lab >>>>>> Department of Molecular and Cell Biology >>>>>> 16 Barker Hall >>>>>> University of California >>>>>> Berkeley, CA 94720-3204 >>>>>> >>>>>> Tel: (510) 642-1084 >>>>>> Cell: (510) 847-0996 >>>>>> Fax: (510) 642-6062 >>>>>> -- >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioperl-l mailing list >>>>>> Bioperl-l at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>>> Christopher Fields >>>>> Postdoctoral Researcher >>>>> Lab of Dr. Robert Switzer >>>>> Dept of Biochemistry >>>>> University of Illinois Urbana-Champaign >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>> >>>> >>> >>> ______________________________ >>> Marco Blanchette, Ph.D. >>> >>> mblanche at uclink.berkeley.edu >>> >>> Donald C. Rio's lab >>> Department of Molecular and Cell Biology >>> 16 Barker Hall >>> University of California >>> Berkeley, CA 94720-3204 >>> >>> Tel: (510) 642-1084 >>> Cell: (510) 847-0996 >>> Fax: (510) 642-6062 >> >> ______________________________ >> Marco Blanchette, Ph.D. >> >> mblanche at uclink.berkeley.edu >> >> Donald C. 
Rio's lab
>> Department of Molecular and Cell Biology
>> 16 Barker Hall
>> University of California
>> Berkeley, CA 94720-3204
>>
>> Tel: (510) 642-1084
>> Cell: (510) 847-0996
>> Fax: (510) 642-6062
>> --
>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>

______________________________
Marco Blanchette, Ph.D.

mblanche at uclink.berkeley.edu

Donald C. Rio's lab
Department of Molecular and Cell Biology
16 Barker Hall
University of California
Berkeley, CA 94720-3204

Tel: (510) 642-1084
Cell: (510) 847-0996
Fax: (510) 642-6062
--

From osborne1 at optonline.net  Fri Aug 18 17:13:04 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 18 Aug 2006 17:13:04 -0400
Subject: [Bioperl-l] DBA_GenericHSP_output???
In-Reply-To: <7a2727b90608181233od759396wd96c6c439fa30861@mail.gmail.com>
Message-ID:

Francois,

Something like:

use Bio::AlignIO;

# $aln will be a Bio::SimpleAlign object
my $aln = $hsps[0]->get_aln;
my $alnIO = Bio::AlignIO->new(-format => "msf",
                              -file   => ">hsp.msf");
$alnIO->write_aln($aln);

Brian O.


On 8/18/06 3:33 PM, "francois fauteux" wrote:

> Hi;
>
> DBA with non coding DNA; does something when launched:
>
> Find start end points: [0,1001][0,1002] Score -2990
> Recovering alignment: Alignment recoveredplicit read off
>
> I can't find a way to get the output (alignment)...
>
> The code looks like this (see DBA.pm):
>
> my @params = ('matchA' => 0.75, 'matchB' => '0.55','dymem'=>'linear');
> my $factory = Bio::Tools::Run::Alignment::DBA->new(@params);
> $inputfilename = 'seqs.fasta';
> #@hsps is an array of GenericHSP objects
> my @hsps = $factory->align($inputfilename);
>
> Missing the howto output pretty alignment...
>
> Many thanks;
>
> François
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From osborne1 at optonline.net  Fri Aug 18 17:52:44 2006
From: osborne1 at optonline.net (Brian Osborne)
Date: Fri, 18 Aug 2006 17:52:44 -0400
Subject: [Bioperl-l] Motifs and aligned sequences
In-Reply-To: <5874751.post@talk.nabble.com>
Message-ID:

Antonio,

First remove the dashes from the consensus, s/-//g.

Brian O.


On 8/18/06 2:05 PM, "Antonio" wrote:

>
> Hello all,
> I am trying to find the solution of this problem, I've tried several options
> but no way. I want to find a motif in an aligned sequence, eg:
> Aligned Sequence: -G---ATT---AT--ATA
> Motif: ATTATA
> So i want to find the motif inside this sequence and return the last
> position of the motif in the aligned sequence, in this case 16. I don't know
> how I've to play with the '-', any suggestions?
> Thanks in advance!
> Antonio

From genomewalker at gmail.com  Sat Aug 19 01:02:52 2006
From: genomewalker at gmail.com (Antoni Fernàndez-Guerra)
Date: Sat, 19 Aug 2006 07:02:52 +0200
Subject: [Bioperl-l] Motifs and aligned sequences
In-Reply-To: <200608190358.45542.genomewalker@gmail.com>
References: <200608190358.45542.genomewalker@gmail.com>
Message-ID: <200608190702.52815.genomewalker@gmail.com>

Thanks for your answer Brian, but I've already done that. The problem is that
if I remove the dashes I lose the positions in the aligned sequence, eg:
s/-//g --->> GATTATATA; then if I want to know the last position of the
motif it will be 7 instead of 16. I want to know the positions of the dashes
too... but right now I don't have any good idea. I will keep working on it.
Thanks again
Antonio

> On Friday 18 August 2006 23:52, you wrote:
> > Antonio,
> >
> > First remove the dashes from the consensus, s/-//g.
> >
> > Brian O.
> >
> > On 8/18/06 2:05 PM, "Antonio" wrote:
> > > Hello all,
> > > I am trying to find the solution of this problem, I've tried several
> > > options but no way. I want to find a motif in an aligned sequence, eg:
> > > Aligned Sequence: -G---ATT---AT--ATA
> > > Motif: ATTATA
> > > So i want to find the motif inside this sequence and return the last
> > > position of the motif in the aligned sequence, in this case 16. I don't
> > > know how I've to play with the '-', any suggestions?
> > > Thanks in advance!
> > > Antonio

From genomewalker at gmail.com  Sat Aug 19 03:24:31 2006
From: genomewalker at gmail.com (Antoni Fernàndez-Guerra)
Date: Sat, 19 Aug 2006 09:24:31 +0200
Subject: [Bioperl-l] Motifs and aligned sequences
In-Reply-To:
References: <200608190702.52815.genomewalker@gmail.com>
Message-ID: <200608190924.31965.genomewalker@gmail.com>

Thank you for your help. I've now found a temporary solution to my problem.
I'm new to Perl and Bioperl, so I took some help from the book Beginning
Perl for Bioinformatics. Here is part of the code.

I have two arrays: one stores the DNA sequence without dashes (@inter) and
the other the position of each base in the sequence with dashes (@num):

foreach $seq (@filename) {
    if( $seq eq '-'){
        ++$a;
    }elsif ($seq ne '-'){
        ++$a;
        push (@inter, $seq);
        push (@num, $a);
    }
}

After that I ask for the motif and search for it in @inter, which gives me
the beginning and the end of the motif in the modified sequence. With these
positions I can look into @num and @inter and obtain the positions in the
aligned sequence:

my $nucleotide = join( '', @inter);
while( $nucleotide =~ /$motif/g ) {
    my $position = pos($nucleotide) ;
    my $init = pos($nucleotide) - length($&) +1;
    push(@locations, $position);
    push(@initial, $init);
    my $position1 = $position -1;
    my $init1 = $init -1;
    print "Start: $inter[$init1] -- $num[$init1]\n";
    print "End: $inter[$position1] -- $num[$position1]\n\n";
}

I don't know whether it is very elegant, but it seems to work.
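
For anyone finding this thread later, the same idea (remember the alignment column of every non-gap base, search the ungapped string, then map the hit back) can be written as one self-contained subroutine. This is only a sketch: the name motif_columns is made up, it assumes '-' is the only gap character, and it treats the motif as a literal string rather than a regular expression.

```perl
use strict;
use warnings;

# Return [first_column, last_column] (1-based alignment columns) for every
# occurrence of $motif in $aligned, ignoring gap characters ('-').
sub motif_columns {
    my ($aligned, $motif) = @_;

    my $ungapped = '';
    my @column;     # $column[$i] = alignment column of the i-th ungapped base
    my $col = 0;
    for my $char (split //, $aligned) {
        $col++;
        next if $char eq '-';
        $ungapped .= $char;
        push @column, $col;
    }

    my @hits;
    # The zero-width lookahead lets overlapping occurrences be reported too.
    while ($ungapped =~ /(?=\Q$motif\E)/g) {
        my $first = pos($ungapped);                 # 0-based index of first base
        my $last  = $first + length($motif) - 1;    # 0-based index of last base
        push @hits, [ $column[$first], $column[$last] ];
    }
    return @hits;
}

for my $hit (motif_columns('-G---ATT---AT--ATA', 'ATTATA')) {
    print "Start: $hit->[0]  End: $hit->[1]\n";
}
```

For the example sequence from the original question the single hit ends at column 16, the value asked for there.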
Thanks again Antonio A Dissabte 19 Agost 2006 05:17, Seiji Kumagai va escriure: > Hi, > > How about this? > > my $str = q/-G---ATT---AT--ATA/; > my $motif = q/A\-*T\-*T\-*A\-*T\-*A/; > while ($str =~ /$motif/g) { > print $+[0], qq/\n/; > } > > The above code prints the last base positions of a motif. It is only valid > for *non-overlapping* motifs. For overlapping motif, you can replace > /$motif/ with /(?=$motif)/. However, if you do so, you won't be able to > print the positions of the last bases. In stead, it will print positions > of the immediately before the first bases in the motif. But, I think you > can easily find the positions of the last bases if you know that > position. Finally, you can find the explanation in perlre. > > On Sat, 19 Aug 2006, Antoni [iso-8859-1] Fern?ndez-Guerra wrote: > > Thanks for you answer Brian, but I've already done it, the problem is > > that if I remove the dashes I will lose the positions on the aligned > > sequence, eg: s/-//g --->> GATTATATA, then if i want to know where is the > > last position of the motif it will be 7 instead of 16. I want to know the > > positions of the dashes too...but now I don't have any good idea, I will > > keep working on it. Thanks again > > Antonio > > > >> A Divendres 18 Agost 2006 23:52, v?reu escriure: > >>> Antonio, > >>> > >>> First remove the dashes from the consensus, s/-//g. > >>> > >>> Brian O. > >>> > >>> On 8/18/06 2:05 PM, "Antonio" wrote: > >>>> Hello all, > >>>> I am trying to find the solution of this problem, I've tried several > >>>> options but no way. I want to find a motif in an aligned sequence, eg: > >>>> Aligned Sequence: -G---ATT---AT--ATA > >>>> Motif: ATTATA > >>>> So i want to find the motif inside this sequence and return the last > >>>> position of the motif in the aligned sequence, in this case 16. I > >>>> don't know how I've to play with the '-', any suggestions? > >>>> Thanks in advance! 
> >>>> Antonio
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at uiuc.edu  Sat Aug 19 00:03:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 18 Aug 2006 23:03:01 -0500
Subject: [Bioperl-l] Fuzzy Locations and GenBank
Message-ID: <43339DFB-F14F-4C48-BE97-3CE071577439@uiuc.edu>

Don't know how much this will affect Bio::Location::Fuzzy, but I
thought it might be worth a heads-up in case something pops up:

 From the latest GenBank release (154.0):

...

1.4.6 Feature location syntax X.Y to be discontinued

   The Feature Table currently supports feature locations of the
format X.Y, to represent a base position which is greater or
equal to X, and less than or equal to Y. For example:

      misc_feature    1.10..20
      misc_feature    join(100..150,200.210..250)

   In the first example, the misc_feature starts somewhere between
bases 1 and 10 (inclusive), and ends at basepair 20. In the second,
the 51 bases from 100..150 are joined together with a second basepair
interval, which could be anywhere from 200..250 to 210..250 .

   Although this syntax seems like a reasonable way to capture an
uncertain interval, it is used for features on a vanishingly small
number of sequence records, most database submission mechanisms
don't support it, and the meaning of its use in a join() context
is not entirely clear.

   As of October 2006, this type of location will no longer be
supported. Those records with features which utilize X.Y locations
will be reviewed and converted to a non-uncertain format prior to
that date.


Christopher Fields
Postdoctoral Researcher
Lab of Dr.
Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sat Aug 19 00:08:26 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 19 Aug 2006 00:08:26 -0400 Subject: [Bioperl-l] Fuzzy Locations and GenBank In-Reply-To: <43339DFB-F14F-4C48-BE97-3CE071577439@uiuc.edu> References: <43339DFB-F14F-4C48-BE97-3CE071577439@uiuc.edu> Message-ID: <2DBA980B-EC1B-43AB-AB9C-77B7B87FC088@gmx.net> Great, the fewer fuzzy locations the better. -hilmar On Aug 19, 2006, at 12:03 AM, Chris Fields wrote: > Don't know how much this will affect Bio::Location::Fuzzy, but I > thought it might be worth a heads-up in case something pops up: > > From the latest GenBank release (154.0): > > ... > > 1.4.6 Feature location syntax X.Y to be discontinued > > The Feature Table currently supports feature locations of the > format X.Y, to represent a base position which is greater or > equal to X, and less than or equal to Y. For example: > > misc_feature 1.10..20 > misc_feature join(100..150,200.210..250) > > In the first example, the misc_feature starts somewhere between > bases 1 and 10 (inclusive), and ends at basepair 20. In the second, > the 51 bases from 100..150 are joined together with a second basepair > interval, which could be anywhere from 200..250 to 210..250 . > > Although this syntax seems like a reasonable way to capture an > uncertain interval, it is used for features on a vanishingly small > number of sequence records, most database submission mechanisms > don't support it, and the meaning of its use in a join() context > is not entirely clear. > > As of October 2006, this type of location will no longer be > supported. Those records with features which utilize X.Y locations > will be reviewed and converted to a non-uncertain format prior to > that date. > > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. 
Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From niels at genomics.dk Fri Aug 18 11:28:48 2006 From: niels at genomics.dk (Niels Larsen) Date: Fri, 18 Aug 2006 17:28:48 +0200 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <44E5D719.4030808@sendu.me.uk> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> Message-ID: <44E5DCB0.1050806@genomics.dk> I am a casual user of bioperl with my own error handling. Is there a way to switch off bioperls error handling? I switch off my own with $Common::Config::with_errors = 0; $Common::Config::with_warnings = 0; Sendu Bala wrote: > J?r?my JUST wrote: >> Hello, >> >> I can't manage to get rid of all warnings during a query from GenBank >> with BioPerl 1.4. When the query doesn't give any result, I catch the >> exception, but there is still a warning that is issued to STDERR. > [...] >> and the unwanted warning is: >> <<<<< >> Warning(s) from GenBank: >> FooBar >> >> How could I set verbosity so that I don't get this warning? > > You can't easily or appropriately do it with bioperl 1.4, but I've just > fixed the problem in cvs. You might be able to just download revision > 1.18 of Bio/DB/Query/GenBank.pm from bioperl-live: > > http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/Bio/DB/Query/ > > (The perl hack would be to redirect STDERR to /dev/null inside your eval.) 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
------------------------------------------------------------------------

Niels Larsen
Danish Genome Institute
Gustav Wieds vej 10 C
DK-8000 Aarhus C
Denmark

Electronic mail: niels at genomics.dk
Skype: niels_larsen_denmark

Telephone: +45-8942-5268
Telefax: +45-8620-1222

------------------------------------------------------------------------

From skumagai at life.bio.sunysb.edu  Fri Aug 18 23:17:20 2006
From: skumagai at life.bio.sunysb.edu (Seiji Kumagai)
Date: Fri, 18 Aug 2006 23:17:20 -0400 (EDT)
Subject: [Bioperl-l] Motifs and aligned sequences
In-Reply-To: <200608190702.52815.genomewalker@gmail.com>
References: <200608190358.45542.genomewalker@gmail.com>
	<200608190702.52815.genomewalker@gmail.com>
Message-ID:

Hi,

How about this?

my $str = q/-G---ATT---AT--ATA/;
my $motif = q/A\-*T\-*T\-*A\-*T\-*A/;
while ($str =~ /$motif/g) {
    print $+[0], qq/\n/;
}

The above code prints the position of the last base of each match of the
motif. It is only valid for *non-overlapping* motifs. For overlapping
motifs, you can replace /$motif/ with /(?=$motif)/. However, if you do so,
you won't be able to print the positions of the last bases. Instead, it will
print the position immediately before the first base of each match. But I
think you can easily find the position of the last base if you know that
position. Finally, you can find the explanation in perlre.

On Sat, 19 Aug 2006, Antoni Fernàndez-Guerra wrote:

> Thanks for you answer Brian, but I've already done it, the problem is that
> if I remove the dashes I will lose the positions on the aligned sequence,
> eg: s/-//g --->> GATTATATA, then if i want to know where is the last
> position of the motif it will be 7 instead of 16. I want to know the
> positions of the dashes too...but now I don't have any good idea, I will
> keep working on it.
Thanks again
> Antonio
>
>> On Friday 18 August 2006 23:52, you wrote:
>>> Antonio,
>>>
>>> First remove the dashes from the consensus, s/-//g.
>>>
>>> Brian O.
>>>
>>> On 8/18/06 2:05 PM, "Antonio" wrote:
>>>> Hello all,
>>>> I am trying to find the solution of this problem, I've tried several
>>>> options but no way. I want to find a motif in an aligned sequence, eg:
>>>> Aligned Sequence: -G---ATT---AT--ATA
>>>> Motif: ATTATA
>>>> So i want to find the motif inside this sequence and return the last
>>>> position of the motif in the aligned sequence, in this case 16. I don't
>>>> know how I've to play with the '-', any suggestions?
>>>> Thanks in advance!
>>>> Antonio
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From hlapp at gmx.net  Sat Aug 19 08:47:30 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 19 Aug 2006 08:47:30 -0400
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <44E5DCB0.1050806@genomics.dk>
References: <20060818155141.000068ec@pearson.versailles.inra.fr>
	<44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk>
Message-ID:

There is a variable Bio::Root::VERBOSITY which is returned by
Bio::Root::Root::verbose() if called as a function (as opposed to a
method).

The accompanying comment says that this would enable global verbosity
setting, but I don't see the piece of code that would actually set
the variable, let alone defaulting the per-instance value to it if a
different one was never set.

I'm not sure whether I'm overlooking something (quite possible) but
if I'm not this sounds like a bug. Also, I thought that the value of
some environment variable would be taken into account but in the code
I don't see any indication of that being true either.

If no-one sets me straight can you file this as a bug report?
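
To make the behaviour described above concrete (a global verbosity that an instance falls back on when no per-instance value was ever set), here is a minimal stand-alone sketch. The My::Verbose package and everything in it are hypothetical; this is not the code in Bio::Root::Root, only an illustration of how such a fallback might look.

```perl
use strict;
use warnings;

package My::Verbose;    # hypothetical class, not part of BioPerl

our $VERBOSITY;         # global (class-level) setting; undef means "not set"
my  $DEFAULT = 0;       # used when neither instance nor global value is set

sub new { return bless {}, shift }

# Getter/setter. On an instance it reads or sets that instance's value;
# on the class name (My::Verbose->verbose) it reads or sets the global.
sub verbose {
    my ($self, $value) = @_;
    if (ref $self) {
        $self->{verbose} = $value if @_ > 1;
        return $self->{verbose} if defined $self->{verbose};
    }
    else {
        $VERBOSITY = $value if @_ > 1;
    }
    return defined $VERBOSITY ? $VERBOSITY : $DEFAULT;
}

package main;

my $plain  = My::Verbose->new;
my $custom = My::Verbose->new;
$custom->verbose(1);              # instance-level setting

My::Verbose->verbose(-1);         # global setting, e.g. to silence warnings
print $plain->verbose,  "\n";     # falls back to the global value
print $custom->verbose, "\n";     # keeps its own value

My::Verbose->verbose(undef);      # clear the global setting again
print $plain->verbose,  "\n";     # back to the default
```

Setting the global back to undef restores whatever each object had before, which is one way to get a temporary, reversible switch without touching individual objects.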
-hilmar On Aug 18, 2006, at 11:28 AM, Niels Larsen wrote: > I am a casual user of bioperl with my own error handling. Is there > a way to switch off bioperls error handling? I switch off my own > with > > $Common::Config::with_errors = 0; > $Common::Config::with_warnings = 0; > > Sendu Bala wrote: >> J?r?my JUST wrote: >>> Hello, >>> >>> I can't manage to get rid of all warnings during a query from >>> GenBank >>> with BioPerl 1.4. When the query doesn't give any result, I catch >>> the >>> exception, but there is still a warning that is issued to STDERR. >> [...] >>> and the unwanted warning is: >>> <<<<< >>> Warning(s) from GenBank: >>> FooBar >>> >>> How could I set verbosity so that I don't get this warning? >> >> You can't easily or appropriately do it with bioperl 1.4, but I've >> just >> fixed the problem in cvs. You might be able to just download revision >> 1.18 of Bio/DB/Query/GenBank.pm from bioperl-live: >> >> http://code.open-bio.org/cgi/viewcvs.cgi/bioperl-live/Bio/DB/Query/ >> >> (The perl hack would be to redirect STDERR to /dev/null inside >> your eval.) 
>> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- > ---------------------------------------------------------------------- > -- > > Niels Larsen > Danish Genome Institute > Gustav Wieds vej 10 C > DK-8000 Aarhus C > Denmark > > Electronic mail: niels at genomics.dk > Skype: niels_larsen_denmark > > Telephone: +45-8942-5268 > Telefax: +45-8620-1222 > > ---------------------------------------------------------------------- > -- > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Sat Aug 19 09:19:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 19 Aug 2006 14:19:49 +0100 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> Message-ID: <44E70FF5.5010508@sendu.me.uk> Hilmar Lapp wrote: > There is a variable Bio::Root::VERBOSITY which is returned by > Bio::Root::Root::verbose() if called as a function (as opposed to a > method). > > The accompanying comment says that this would enable global verbosity > setting, but I don't see the piece of code that would actually set the > variable, let alone defaulting the per-instance value to it if a > different one was never set. > > I'm not sure whether I'm overlooking something (quite possible) but if > I'm not this sounds like a bug. Also, I thought that the value of some > environment variable would be taken into account but in the code I don't > see any indication of that being true either. 
> > If no-one sets me straight can you file this as a bug report? Well, there's nothing in the actual docs to say that this functionality is present, so it's not really a bug that it doesn't work. A simple global verbosity variable is no good anyway. You want to be able to temporarily switch all objects warn behaviour, but then be able to switch it back to whatever their verbosities were to begin with. If you just set Bio::Root::VERBOSITY to something, how would any method (ie. &warn) know if you wanted that verbosity to be used now all the time, or if the verbosity set on the root instance should be used? I suppose the default of it could be undef, and you set it back to undef when you don't want a global verbosity, but is that very nice? From hlapp at gmx.net Sat Aug 19 12:16:01 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 19 Aug 2006 12:16:01 -0400 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <44E70FF5.5010508@sendu.me.uk> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> Message-ID: <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> Well, the rule would be: 1) if a local (instance) verbosity has been set, use it 2) otherwise, if a global (class, static) verbosity has been set, use it 3) otherwise, use a default value. This would mean indeed that if you changed verbosity for a specific instance it will be unaffected by global changes of the verbosity level. If that doesn't sound good, you would reverse rules 1) and 2). Or am I missing something? -hilmar On Aug 19, 2006, at 9:19 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> There is a variable Bio::Root::VERBOSITY which is returned by >> Bio::Root::Root::verbose() if called as a function (as opposed to a >> method). 
>> >> The accompanying comment says that this would enable global verbosity >> setting, but I don't see the piece of code that would actually set >> the >> variable, let alone defaulting the per-instance value to it if a >> different one was never set. >> >> I'm not sure whether I'm overlooking something (quite possible) >> but if >> I'm not this sounds like a bug. Also, I thought that the value of >> some >> environment variable would be taken into account but in the code I >> don't >> see any indication of that being true either. >> >> If no-one sets me straight can you file this as a bug report? > > Well, there's nothing in the actual docs to say that this > functionality > is present, so it's not really a bug that it doesn't work. > > A simple global verbosity variable is no good anyway. You want to be > able to temporarily switch all objects warn behaviour, but then be > able > to switch it back to whatever their verbosities were to begin with. If > you just set Bio::Root::VERBOSITY to something, how would any method > (ie. &warn) know if you wanted that verbosity to be used now all the > time, or if the verbosity set on the root instance should be used? I > suppose the default of it could be undef, and you set it back to undef > when you don't want a global verbosity, but is that very nice? 
> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From ffauteux at gmail.com Sat Aug 19 12:49:57 2006 From: ffauteux at gmail.com (francois fauteux) Date: Sat, 19 Aug 2006 12:49:57 -0400 Subject: [Bioperl-l] Adding all elements from an array of scores Message-ID: <7a2727b90608190949q6c21a39cq3c5b596bd9c54a14@mail.gmail.com> Hi; Having an array of alignment scores: @scores = ('x1', 'x2', 'x3', 'xn') Where x1, x2...xn are the score values and n is the number of elements in the array; How could I add the values x1 + x2 + x3 + ...xn (would have to adapt to arrays of variable number of elements); It would look like 'add $score[1] + $score[2] ... and so on until there is no more elements in the array'; Thanks for hints; Fran?ois Fauteux From mblanche at berkeley.edu Sat Aug 19 13:50:24 2006 From: mblanche at berkeley.edu (Marco Blanchette) Date: Sat, 19 Aug 2006 10:50:24 -0700 Subject: [Bioperl-l] Adding all elements from an array of scores In-Reply-To: <7a2727b90608190949q6c21a39cq3c5b596bd9c54a14@mail.gmail.com> Message-ID: Fairly simple Francois, @scores = qw( 10 34 25 46 ); $y=0; for $x (@scores){ $y = $y+$x; } print "$y\n"; Marco On 8/19/06 9:49, "francois fauteux" wrote: > Hi; > > Having an array of alignment scores: > > @scores = ('x1', 'x2', 'x3', 'xn') > > Where x1, x2...xn are the score values and n is the number of elements > in the array; > > How could I add the values x1 + x2 + x3 + ...xn (would have to adapt > to arrays of variable number of elements); > > It would look like 'add $score[1] + $score[2] ... 
and so on until > there is no more elements in the array'; > > Thanks for hints; > > Fran?ois Fauteux > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ______________________________ Marco Blanchette, Ph.D. mblanche at uclink.berkeley.edu Donald C. Rio's lab Department of Molecular and Cell Biology 16 Barker Hall University of California Berkeley, CA 94720-3204 Tel: (510) 642-1084 Cell: (510) 847-0996 Fax: (510) 642-6062 -- From bix at sendu.me.uk Sat Aug 19 14:07:49 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sat, 19 Aug 2006 19:07:49 +0100 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> Message-ID: <44E75375.20600@sendu.me.uk> Hilmar Lapp wrote: > Well, the rule would be: > > 1) if a local (instance) verbosity has been set, use it > 2) otherwise, if a global (class, static) verbosity has been set, > use it > 3) otherwise, use a default value. > > This would mean indeed that if you changed verbosity for a specific > instance it will be unaffected by global changes of the verbosity level. > > If that doesn't sound good, you would reverse rules 1) and 2). But then if you set the global verbosity how do you later change your mind, unset it, and go back to using instance verbosity? > Or am I missing something? Well, assuming we would prefer global changes to really be global changes, and we reversed 1) and 2), if we have just a single simple global variable, how does a method like &warn even know that it is 'set' (and not still on its default value)? How does it know when it is unset (we no longer want a globally acting verbosity)? 
Like I say, you have to have a default of undef and set the value to
undef to turn the feature off, which doesn't seem very nice to me.

I'd prefer to be able to choose a global verbosity level and
independently turn global behaviour on and off by supplying a method a
boolean or even the words 'on'|'off', not supplying int or undef.

From bix at sendu.me.uk Sat Aug 19 14:11:20 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sat, 19 Aug 2006 19:11:20 +0100
Subject: [Bioperl-l] Adding all elements from an array of scores
In-Reply-To: <7a2727b90608190949q6c21a39cq3c5b596bd9c54a14@mail.gmail.com>
References: <7a2727b90608190949q6c21a39cq3c5b596bd9c54a14@mail.gmail.com>
Message-ID: <44E75448.8000407@sendu.me.uk>

francois fauteux wrote:
> Having an array of alignment scores:
>
> @scores = ('x1', 'x2', 'x3', 'xn')
>
> Where x1, x2...xn are the score values and n is the number of elements
> in the array;
>
> How could I add the values x1 + x2 + x3 + ...xn (would have to adapt
> to arrays of variable number of elements);
>
> It would look like 'add $score[1] + $score[2] ... and so on until
> there is no more elements in the array';

This list is for questions relating to Bioperl, not general Perl
questions. In any case, just use a foreach loop and the + operator.
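
A minimal sketch of the foreach-and-+ approach suggested above, assuming the scores are plain numbers. The values are the illustrative ones from Marco's reply, not real alignment scores. The List::Util variant is an addition here rather than something proposed in the thread; List::Util ships with core Perl.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Illustrative scores only; the array may hold any number of elements.
my @scores = (10, 34, 25, 46);

# Approach 1: a foreach loop and the + operator.
my $total = 0;
foreach my $score (@scores) {
    $total += $score;
}

# Approach 2: List::Util's sum. The leading 0 keeps the result
# defined (0) for an empty array, where sum() alone returns undef.
my $total_via_sum = sum(0, @scores);

print "loop: $total, sum: $total_via_sum\n";
```

Both approaches adapt to arrays of any length, which was the requirement in the original question.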
From ffauteux at gmail.com Sat Aug 19 21:02:08 2006
From: ffauteux at gmail.com (francois fauteux)
Date: Sat, 19 Aug 2006 21:02:08 -0400
Subject: [Bioperl-l] DBA & HSP -> write_aln (array)
Message-ID: <7a2727b90608191802v3a6a2e8ax609fc3702508f438@mail.gmail.com>

Say I have an array of High Scoring Pair elements (generated with
Bio::Tools::Run::Alignment::DBA):

@hsps = $factory->align(\@files);

To output every alignment, I push each $aln into @aln:

foreach my $hsp (@hsps) {
    my $aln = $hsp->get_aln;
    push @aln, $aln;
}

my $alnIO = Bio::AlignIO->new(-format =>"fasta", -file =>">out.fasta");
$alnIO->write_aln(@aln);

If I use msf format and write the elements separately, it works well:

my $alnIO = Bio::AlignIO->new(-format =>"msf", -file =>">out.msf");
$alnIO->write_aln($aln[0]);

If I use msf format and the whole array:

my $alnIO = Bio::AlignIO->new(-format =>"msf", -file =>">out.msf");
$alnIO->write_aln(@aln);

The output is quite messy... How can I output everything (all
alignments from @aln) in msf format in a single file?

In fasta format it is not too bad, the only bug being that the last
element (2nd seq) prints: >DBA/start-stop instead of >id/start-stop...

Thanks for hints;

François

From cjfields at uiuc.edu Sat Aug 19 21:58:30 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 19 Aug 2006 20:58:30 -0500
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <44E75375.20600@sendu.me.uk>
References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk>
Message-ID:

On Aug 19, 2006, at 1:07 PM, Sendu Bala wrote:

> Hilmar Lapp wrote:
>> Well, the rule would be:
>>
>> 1) if a local (instance) verbosity has been set, use it
>> 2) otherwise, if a global (class, static) verbosity has been set,
>> use it
>> 3) otherwise, use a default value.
>> >> This would mean indeed that if you changed verbosity for a specific >> instance it will be unaffected by global changes of the verbosity >> level. >> >> If that doesn't sound good, you would reverse rules 1) and 2). > > But then if you set the global verbosity how do you later change your > mind, unset it, and go back to using instance verbosity? > I think verbosity can be changed using Bio::Root::Root method verbose ($verbose_value). The method is a get/set which defaults to 0 (or the previous set value if not 0) when called without an argument. Every BioPerl class inherits Bio::Root::Root directly or indirectly; many (but not all) allow you to set it via the constructor using '- verbose'. >> Or am I missing something? > > Well, assuming we would prefer global changes to really be global > changes, and we reversed 1) and 2), if we have just a single simple > global variable, how does a method like &warn even know that it is > 'set' > (and not still on its default value)? How does it know when it is > unset > (we no longer want a globally acting verbosity)? > > Like I say, you have to have a default of undef and set the value to > undef to turn the feature off, which doesn't seem very nice to me. > > I'd prefer to be able to chose a global verbosity level and > independently turn global behaviour on and off by supplying a method a > boolean or even the words 'on'|'off', not supplying int or undef I prefer to leave it as it is. I don't think it is broken. Using verbose() is pretty straightforward: From Bio::Root::Root: Title : verbose Usage : $self->verbose(1) Function: Sets verbose level for how ->warn behaves -1 = no warning 0 = standard, small warning 1 = warning with stack trace 2 = warning becomes throw Returns : The current verbosity setting (integer between -1 to 2) Args : -1,0,1 or 2 sub verbose { my ($self,$value) = @_; # allow one to set global verbosity flag return $DEBUG if $DEBUG; return $VERBOSITY unless ref $self; if (defined $value || ! 
defined $self->{'_root_verbose'}) { $self->{'_root_verbose'} = $value || 0; } return $self->{'_root_verbose'}; } According to RootI, the degree of verbosity for warn() is set to 0 by default if verbose() is not set. Previously, verbose() wasn't available unless the class inherited Bio::Root::Root. However, the constructor for RootI indicates that Bio::Root::Root inheritance is required (I think this came about prior to v1.4). Anyway, the degree of verbosity is based on verbose() You should be able to set this normally using most classes w/o problems, and turn the warnings off (-1) or make the warnings throw instead (2) based upon what you want to accomplish. It is set to 0 by default if it is called at any time; it also only sets verbosity to a defined value (in other words, it is not undef and cannot be set to undef; if you can then there is something wrong). I don't think there are explicit checks on the value set within verbose() but I don't see the problem with that. Error handling methods like warn() do that for you. Now, if you find that something doesn't work this way, it is a bug. BTW, the reason you don't want simple 'on/off' is that you may want different degrees of error handling strictness, hence a sliding and easily testable scale. So, you should be able to turn warnings on and off by resetting verbose() to the appropriate value. You could also have any warning changed to a throw() instead. The problem you'll find (as I have found) is some classes/interfaces do not set verbose() based on parameters upon instantiation (i.e. you could pass the parameter '-verbose' and it won't do anything). This makes it hard to set error handling upon instantiation to a user- defined value. Others may not use RootI-implemented error handling/debugging methods; they may use built-in warn or STDERR (ugh) instead of $self- >warn(). 
And still others with internal objects may also neglect to set '-verbose' for those objects based on the current object's verbose value ($self->verbose). It's all based on how well the classes are maintained and how well the maintainer knows BioPerl's error handling mechanisms. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From hlapp at gmx.net Sat Aug 19 15:04:43 2006 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 19 Aug 2006 15:04:43 -0400 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <44E75375.20600@sendu.me.uk> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> Message-ID: <66F38F2D-4243-4982-8253-B7F2794F61A5@gmx.net> On Aug 19, 2006, at 2:07 PM, Sendu Bala wrote: > Like I say, you have to have a default of undef and set the value to > undef to turn the feature off, which doesn't seem very nice to me. Why? Typically a value of undef for a property (class or instance- level) means it hasn't been set. This is used all over the place, and I'm sure not just in bioperl. > > I'd prefer to be able to chose a global verbosity level and > independently turn global behaviour on and off by supplying a method a > boolean or even the words 'on'|'off', not supplying int or undef. You can do that too but I'm not sure about how much would be gained. If I want to globally alter the verbosity, I will usually know why and therefore to which level. I'm not sure how often the situation would occur that I want to globally change the verbosity level, whatever the system may think that should be. Typically I will want to dictate the level too, not just switching it 'on' regardless of what the system thinks I may mean by 'on'. The only situation where just 'switching on' may apply is inside of a module. 
However, if a module wants to do this then I'm strongly inclined to think that something more fundamental is wrong. Changing verbosity level globally should only be a client's (user's) decision, never that of a module author. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From bix at sendu.me.uk Sun Aug 20 02:41:10 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sun, 20 Aug 2006 07:41:10 +0100 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> Message-ID: <44E80406.1040905@sendu.me.uk> Chris Fields wrote: > On Aug 19, 2006, at 1:07 PM, Sendu Bala wrote: > >> Hilmar Lapp wrote: >>> Well, the rule would be: >>> >>> 1) if a local (instance) verbosity has been set, use it >>> 2) otherwise, if a global (class, static) verbosity has been set, >>> use it >>> 3) otherwise, use a default value. >>> >>> This would mean indeed that if you changed verbosity for a specific >>> instance it will be unaffected by global changes of the verbosity level. >>> >>> If that doesn't sound good, you would reverse rules 1) and 2). >> >> But then if you set the global verbosity how do you later change your >> mind, unset it, and go back to using instance verbosity? >> > > I think verbosity can be changed using Bio::Root::Root method > verbose($verbose_value). [snip] We know how verbose works. We're discussing a desire for a new ability to set the verbosity for /all/ Root instances (hence 'global'), not just one at a time. My 'on'|'off' suggestion is a switch for global behaviour, not the specific level of verbosity, which you would still (independently) chose. 
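
The two designs discussed in this thread can be compared in a small sketch. This is not BioPerl code: Sketch::Root, $GLOBAL_LEVEL, $GLOBAL_FORCED and effective_verbose() are hypothetical names invented for illustration. It assumes Hilmar's three rules (instance value, then global value, then default) plus an independent on/off switch that lets the global level override instance values.

```perl
use strict;
use warnings;

package Sketch::Root;    # hypothetical stand-in, not Bio::Root::Root

our $GLOBAL_LEVEL;       # undef means no global verbosity has been chosen
our $GLOBAL_FORCED = 0;  # the on/off switch: 1 lets the global level override
my $DEFAULT_LEVEL = 0;

sub new {
    my ($class, %args) = @_;
    return bless { level => $args{-verbose} }, $class;
}

sub effective_verbose {
    my ($self) = @_;
    # Switch on: the global level applies to every instance.
    return $GLOBAL_LEVEL if $GLOBAL_FORCED && defined $GLOBAL_LEVEL;
    # Switch off: 1) instance value, 2) global value, 3) default.
    return $self->{level} if defined $self->{level};
    return $GLOBAL_LEVEL  if defined $GLOBAL_LEVEL;
    return $DEFAULT_LEVEL;
}

package main;

my $strict  = Sketch::Root->new(-verbose => 2);  # verbosity set explicitly
my $default = Sketch::Root->new;                 # verbosity never set

# Choose a global level but leave the switch off.
$Sketch::Root::GLOBAL_LEVEL = -1;
my @switch_off = map { $_->effective_verbose } ($strict, $default);

# Turn the switch on: every instance behaves as if it had level -1.
$Sketch::Root::GLOBAL_FORCED = 1;
my @switch_on = map { $_->effective_verbose } ($strict, $default);

# Turn the switch off again: instance settings come back untouched.
$Sketch::Root::GLOBAL_FORCED = 0;
my @restored = map { $_->effective_verbose } ($strict, $default);

print "off: @switch_off; on: @switch_on; restored: @restored\n";
```

In this sketch the explicitly set instance keeps its own level while the switch is off, and flipping the switch back restores the earlier behaviour without passing undef anywhere. The same effect could be had with a single variable by treating undef as "unset", which is the alternative argued for elsewhere in the thread.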
From bix at sendu.me.uk Sun Aug 20 03:10:35 2006 From: bix at sendu.me.uk (Sendu Bala) Date: Sun, 20 Aug 2006 08:10:35 +0100 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <66F38F2D-4243-4982-8253-B7F2794F61A5@gmx.net> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> <66F38F2D-4243-4982-8253-B7F2794F61A5@gmx.net> Message-ID: <44E80AEB.1000401@sendu.me.uk> Hilmar Lapp wrote: > > On Aug 19, 2006, at 2:07 PM, Sendu Bala wrote: > >> Like I say, you have to have a default of undef and set the value to >> undef to turn the feature off, which doesn't seem very nice to me. > > Why? Typically a value of undef for a property (class or instance-level) > means it hasn't been set. > > This is used all over the place, and I'm sure not just in bioperl. Yes, but very very rarely are you ever required to deliberately pass 'undef' as a value to a method in order to do something. Because that's quite a horrible thing to do. >> I'd prefer to be able to chose a global verbosity level and >> independently turn global behaviour on and off by supplying a method a >> boolean or even the words 'on'|'off', not supplying int or undef. > > You can do that too but I'm not sure about how much would be gained. If > I want to globally alter the verbosity, I will usually know why and > therefore to which level. I'm not sure how often the situation would > occur that I want to globally change the verbosity level, whatever the > system may think that should be. Typically I will want to dictate the > level too, not just switching it 'on' regardless of what the system > thinks I may mean by 'on'. Like I say, you would also chose the specific level of verbosity you want. The 'on' is to turn on global behaviour of whatever verbosity you want. 
So, perhaps:

# set variable Bio::Root::Root::VERBOSITY, which has no effect on
# anything in particular, except perhaps VERBOSITY is used as the
# default verbosity for Root instances that you don't manually set
# the verbosity of (in which case, most of the time this would
# seem like a global change)
Bio::Root::Root::verbose(-1);

# choose to make all instances behave as if they had a verbosity
# of -1 (i.e. including the ones you had set to some specific
# verbosity and weren't still on default value - we have fine
# grained control over what we want with this system)
Bio::Root::Root::global_verbosity('on'); # or (1)

# choose to return behaviour to normal, instances behave like they
# had their set or default verbosity
Bio::Root::Root::global_verbosity('off'); # or (0)


The alternative is:
# set variable Bio::Root::Root::VERBOSITY and make all instances
# behave as if they had that verbosity
Bio::Root::Root::verbose(-1);

# choose to return behaviour to normal, instances behave like they
# had their set or default verbosity
Bio::Root::Root::verbose(undef);

From hlapp at gmx.net Sun Aug 20 08:37:56 2006
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 20 Aug 2006 08:37:56 -0400
Subject: [Bioperl-l] How to get rid of warnings
In-Reply-To: <44E80AEB.1000401@sendu.me.uk>
References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> <66F38F2D-4243-4982-8253-B7F2794F61A5@gmx.net> <44E80AEB.1000401@sendu.me.uk>
Message-ID:

Quite frankly I find nothing offensive about passing undef to a
method ... I do this all the time if I want to return a property to its
virgin state. Conceptually it is like passing [] to a method that wants
an array ref.

Hence, I do prefer the latter alternative, because it is simpler and
doesn't involve having to introduce a new method solely for better
control of the verbosity.
my $0.02 ... -hilmar On Aug 20, 2006, at 3:10 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Aug 19, 2006, at 2:07 PM, Sendu Bala wrote: >> >>> Like I say, you have to have a default of undef and set the value to >>> undef to turn the feature off, which doesn't seem very nice to me. >> >> Why? Typically a value of undef for a property (class or instance- >> level) >> means it hasn't been set. >> >> This is used all over the place, and I'm sure not just in bioperl. > > Yes, but very very rarely are you ever required to deliberately pass > 'undef' as a value to a method in order to do something. Because > that's > quite a horrible thing to do. > > >>> I'd prefer to be able to chose a global verbosity level and >>> independently turn global behaviour on and off by supplying a >>> method a >>> boolean or even the words 'on'|'off', not supplying int or undef. >> >> You can do that too but I'm not sure about how much would be >> gained. If >> I want to globally alter the verbosity, I will usually know why and >> therefore to which level. I'm not sure how often the situation would >> occur that I want to globally change the verbosity level, whatever >> the >> system may think that should be. Typically I will want to dictate the >> level too, not just switching it 'on' regardless of what the system >> thinks I may mean by 'on'. > > Like I say, you would also chose the specific level of verbosity you > want. The 'on' is to turn on global behaviour of whatever verbosity > you > want. > > So, perhaps: > > # set variable Bio::Root::Root::VERBOSITY, which has no effect on > # anything in particular, except perhaps VERBOSITY is used as the > # default verbosity for Root instances that you don't manually set > # the verbosity of (in which case, most of the time this would > # seem like a global change) > Bio::Root::Root::verbose(-1); > > # chose to make all instances behave as if they had a verbosity > # of -1 (ie. 
including the ones you had set to some specific > # verbosity and weren't still on default value - we have fine > # grained control over what we want with this system) > Bio::Root::Root::global_verbosity('on'); # or (1) > > # chose to return behaviour to normal, instances behave like they > # had their set or default verbosity > Bio::Root::Root::global_verbosity('off'); # or (0) > > > The alternative is: > # set variable Bio::Root::Root::VERBOSITY and make all instances > # behave as if they had that verbosity > Bio::Root::Root::verbose(-1); > > # chose to return behaviour to normal, instances behave like they > # had their set or default verbosity > Bio::Root::Root::verbose(undef); > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Aug 20 09:26:42 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 20 Aug 2006 08:26:42 -0500 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <44E80AEB.1000401@sendu.me.uk> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> <66F38F2D-4243-4982-8253-B7F2794F61A5@gmx.net> <44E80AEB.1000401@sendu.me.uk> Message-ID: <79604CF6-4482-4B5E-80E9-2F4562CB6949@uiuc.edu> On Aug 20, 2006, at 2:10 AM, Sendu Bala wrote: > Hilmar Lapp wrote: >> >> On Aug 19, 2006, at 2:07 PM, Sendu Bala wrote: >> >>> Like I say, you have to have a default of undef and set the value to >>> undef to turn the feature off, which doesn't seem very nice to me. >> >> Why? 
Typically a value of undef for a property (class or instance- >> level) >> means it hasn't been set. >> >> This is used all over the place, and I'm sure not just in bioperl. > > Yes, but very very rarely are you ever required to deliberately pass > 'undef' as a value to a method in order to do something. Because > that's > quite a horrible thing to do. Why is that? What dictum of developer ethics dictates that we should never pass undef? I believe this is a perfectly valid (and widely used) way to unset a get/set. This is your opinion, Sendu. It is not a fact. Please remember that. ... Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From cjfields at uiuc.edu Sun Aug 20 10:04:52 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 20 Aug 2006 09:04:52 -0500 Subject: [Bioperl-l] How to get rid of warnings In-Reply-To: <44E80406.1040905@sendu.me.uk> References: <20060818155141.000068ec@pearson.versailles.inra.fr> <44E5D719.4030808@sendu.me.uk> <44E5DCB0.1050806@genomics.dk> <44E70FF5.5010508@sendu.me.uk> <07DEE817-AC4C-4A73-BABE-1DBFBC370EBD@gmx.net> <44E75375.20600@sendu.me.uk> <44E80406.1040905@sendu.me.uk> Message-ID: <6CF03066-D4BA-4F12-98F4-AF4032A0F12E@uiuc.edu> On Aug 20, 2006, at 1:41 AM, Sendu Bala wrote: > ... > [snip] > > We know how verbose works. We're discussing a desire for a new ability > to set the verbosity for /all/ Root instances (hence 'global'), not > just > one at a time. > > My 'on'|'off' suggestion is a switch for global behaviour, not the > specific level of verbosity, which you would still (independently) > chose. Global settings are what using env. variables are for. Why not set verbosity to whatever BIOPERLDEBUG is set to? Or, if you don't like that, use a different env. variable for global settings. If it isn't set then verbosity is 0 (default setting anyway). verbose() is called in the Root constructor so it should always be set to a defined value. 

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign

From bix at sendu.me.uk Sun Aug 20 17:56:28 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sun, 20 Aug 2006 22:56:28 +0100
Subject: [Bioperl-l] SearchIO speed up
In-Reply-To: <44E04A36.6090403@sendu.me.uk>
References: <000201c6bccf$8941f090$15327e82@pyrimidine> <44E04A36.6090403@sendu.me.uk>
Message-ID: <44E8DA8C.3030305@sendu.me.uk>

Sendu Bala wrote:
> Chris Fields wrote:
>> ...
>>> My proposal involves the "chunks" being unparsed, raw text "blobs", that
>>> are essentially blessed into a package that does the parsing only when
>>> necessary (and even then, might choose different parsing strategies, based
>>> on what's been asked for). Thus a potentially large amount of parsing and
>>> storage is skipped. Additionally, you now have the option of not even
>>> storing the blobs in memory, just file seek pointers (requiring temp.
>>> storage for streaming pipe data sources), and thus can process very large
>>> reports without consuming memory (currently a problem).
>> Using file pointers is a great touch. Sendu has a slight aversion to temp
>> files but he has already indicated other ways around this.
>
> I'm in the midst of implementing an 'Aaron'-style pull-parser which I
> have called PullParserI.

I've now committed this to bioperl-live. It is Bio::PullParserI and the
first thing to implement it is my new hmmer parser,
Bio::SearchIO::hmmer_pull (for want of a better name).

The API here isn't set in stone, so certainly I'd encourage suggestions
for improvement.

I've made a start on a BLASTN parser so we can see a more familiar speed
comparison, but it's not ready yet. Meanwhile, see thread 'New hmmpfam
parser'.
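
The deferred-parsing idea quoted above (keep the raw text blob and parse it only when a value is asked for) can be sketched as follows. This is an illustration only and does not reproduce the Bio::PullParserI API: the package Sketch::LazyResult, its methods, and the "Key: value" report format are all hypothetical.

```perl
use strict;
use warnings;

package Sketch::LazyResult;   # hypothetical; not Bio::PullParserI

sub new {
    my ($class, $raw) = @_;
    # Store the unparsed text; nothing is parsed at construction time.
    return bless { raw => $raw, parsed => undef, parses => 0 }, $class;
}

# Parse the blob the first time any field is requested, then cache it.
sub _fields {
    my ($self) = @_;
    unless ($self->{parsed}) {
        $self->{parses}++;
        # Invented format: one "Key: value" pair per line.
        my %fields = $self->{raw} =~ /^(\w+):\s*(.*)$/mg;
        $self->{parsed} = \%fields;
    }
    return $self->{parsed};
}

sub query_name  { $_[0]->_fields->{Query} }
sub score       { $_[0]->_fields->{Score} }
sub parse_count { $_[0]->{parses} }

package main;

my $result = Sketch::LazyResult->new("Query: test5\nScore: 42.5\n");

my $parses_before = $result->parse_count;   # no accessor called yet
my $name          = $result->query_name;    # triggers the one and only parse
my $score         = $result->score;         # served from the cache
my $parses_after  = $result->parse_count;

print "$name scored $score after $parses_after parse(s)\n";
```

A loop that only steps through results without asking for any field would never pay the parsing cost in a design like this. The file-seek-pointer variant mentioned in the quoted proposal would store an offset and length instead of the raw string.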
From bix at sendu.me.uk Sun Aug 20 17:56:37 2006
From: bix at sendu.me.uk (Sendu Bala)
Date: Sun, 20 Aug 2006 22:56:37 +0100
Subject: [Bioperl-l] New hmmpfam parser
Message-ID: <44E8DA95.9070803@sendu.me.uk>

I've added a new hmmpfam parser to bioperl-live.

You access it with Bio::SearchIO->new(-format => "hmmer_pull"). It uses
the new Bio::PullParserI discussed in thread 'SearchIO speed up'.

The major differences between it and the existing SearchIO plugin for
hmmpfam reports (hmmer.pm) are speed, memory usage and how it deals with
hits and hsps. hmmer.pm breaks the Bio::Search::HitI API by having hit
(model) name()s that are not unique within a ResultI. It also only ever
has one domain per model. hmmer_pull.pm has unique model names and as
many domains per model as there are in the file being parsed.
hmmer_pull.pm also gives back more correct answers when you try to use
the full variety of HitI, GenericHit, HSPI and GenericHSP methods.


Speed tested on one example hmmpfam report of 441kb comparing hmmer.pm
and hmmer_pull.pm:
(memory usage was always ~1.8x less)

# for the result for query sequence 'test5' (5th result of 10 in my
# test dataset), just get the most significant domain of the most
# significant model:
# while ($result = $searchio->next_result) {
#     if ($result->query_name eq 'test5') {
#         $result->sort_hits(sub{#sort by significance});
#         $hit = $result->next_hit;
#         $hsp = $hit->hsp('best');
#         last;
#     }
# }
23.5x faster

# while ($result = $searchio->next_result) { # do nothing }
38x faster

# while ($result = $searchio->next_result) {
#     while ($hit = $result->next_hit) {
#         while ($hsp = $hit->next_hsp) { # do nothing }
#     }
# }
5.3x faster

# while ($result = $searchio->next_result) {
#     while ($hit = $result->next_hit) {
#         while ($hsp = $hit->next_hsp) {
#             $fi = $hsp->frac_identical('query');
#         }
#     }
# }
(note that hmmer.pm returns the wrong answer for $fi: 0)
2.2x faster

From cjfields at uiuc.edu Sun Aug 20 20:25:32 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 20
Aug 2006 19:25:32 -0500 Subject: [Bioperl-l] New hmmpfam parser In-Reply-To: <44E8DA95.9070803@sendu.me.uk> References: <44E8DA95.9070803@sendu.me.uk> Message-ID: <9C83621D-2FA1-4555-AD78-BF09EB8E3FDD@uiuc.edu> Sendu, Could you post the example file you used somewhere for testing? Chris On Aug 20, 2006, at 4:56 PM, Sendu Bala wrote: > I've added a new hmmpfam parser to bioperl-live. > > You access it with Bio::SearchIO::new(-format => "hmmer_pull"). It > uses > the new Bio::PullParserI discussed in thread 'SearchIO speedup'. > > The major differences between it and the existing SearchIO plugin for > hmmpfam reports (hmmer.pm) are speed, memory usage and how it deals > with > hits and hsps. hmmer.pm breaks Bio::Search::HitI API by having hit > (model) name()s that are not unique within a ResultI. It also only > ever > has one domain per model. hmmer_pull.pm has unique model names and as > many domains per model as there are in the file being parsed. > hmmer_pull.pm also gives back more correct answers when you try to use > the full variety of HitI, GenericHit, HSPI and GenericHSP methods. 
>
>
> Speed tested on one example hmmpfam report of 441kb comparing hmmer.pm
> and hmmer_pull.pm:
> (memory usage was always ~1.8x less)
>
> # for the result for query sequence 'test5' (5th result of 10 in my
> # test dataset), just get the most significant domain of the most
> # significant model:
> # while ($result = $searchio->next_result) {
> #     if ($result->query_name eq 'test5') {
> #         $result->sort_hits(sub{#sort by significance});
> #         $hit = $result->next_hit;
> #         $hsp = $hit->hsp('best');
> #         last;
> #     }
> # }
> 23.5x faster
>
> # while ($result = $searchio->next_result) { # do nothing }
> 38x faster
>
> # while ($result = $searchio->next_result) {
> #     while ($hit = $result->next_hit) {
> #         while ($hsp = $hit->next_hsp) { # do nothing }
> 5.3x faster
>
> # while ($result = $searchio->next_result) {
> #     while ($hit = $result->next_hit) {
> #         while ($hsp = $hit->next_hsp) {
> #             $fi = $hsp->frac_identical('query');
> #         }
> (note that hmmer.pm returns the wrong answer for $fi: 0)
> 2.2x faster
>
>
> ______________________________