From lincoln.stein at gmail.com Fri May 1 13:33:09 2009
From: lincoln.stein at gmail.com (Lincoln Stein)
Date: Fri, 1 May 2009 13:33:09 -0400
Subject: [Bioperl-l] Bio::DB::SeqFeature::Segment problem
In-Reply-To: <23319982.post@talk.nabble.com>
References: <23319982.post@talk.nabble.com>
Message-ID: <6dce9a0b0905011033r307e8b88l5caaddc953f7de95@mail.gmail.com>

Hi Jon,

Sounds like your multiple chromosome-1 problems have been cleared up. The
documentation should mention the exception and doesn't. I will fix it.

Lincoln

On Thu, Apr 30, 2009 at 12:40 PM, Jon Flowers wrote:

>
> Dear colleagues,
>
> I have set up a mySQL database and loaded a GFF3 and fasta file using
> Bio::DB::SeqFeature::Store::GFF3Loader. Everything appears to be working
> normally except when I attempt to create a Bio::DB::SeqFeature::Segment
> object.
>
> The following works as expected:
>
> my $db = Bio::DB::SeqFeature::Store->new(-adaptor => 'DBI::mysql',
>                                          -dsn     => 'dbi:mysql:foo',
>                                          -user    => 'myuser',
>                                          -pass    => 'mypassword',
>                                          -write   => '1');
>
> my @features = $db->features(-seq_id=>'chr1',
>                              -start=>1,
>                              -end=>10000,
>                              -types=>['gene']);
>
> However, when I try to create a segment object using either of the two
> following method calls I get an error:
>
> my $segment = $db->segment('chr1',1=>10000);
>
> my $segment = $db->segment( -seq_id => 'chr1', -start => '1', -end => '10000');
>
> -------------------------------- EXCEPTION ------------------------------------
>
> MSG: segment() called in a scalar context but multiple features match.
> Either call in a list context or narrow your search using the -types or
> -class arguments
>
> STACK Bio::DB::SeqFeature::Store::segment /usr/share/perl5/Bio/DB/SeqFeature/Store.pm:1178
> STACK toplevel trial.pl:42
> -------------------------------------------------------
>
> Calling in list context (which is not defined in the documentation) produces
> an array of 22 identical scalars = 'chr1:1..10000'.
>
> Any ideas?
> > Thanks > > Jonathan > > -- > View this message in context: > http://www.nabble.com/Bio%3A%3ADB%3A%3ASeqFeature%3A%3ASegment-problem-tp23319982p23319982.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. Stein Director, Informatics and Biocomputing Platform Ontario Institute for Cancer Research 101 College St., Suite 800 Toronto, ON, Canada M5G0A3 416 673-8514 Assistant: Renata Musa From hlapp at gmx.net Sun May 3 14:36:59 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 3 May 2009 14:36:59 -0400 Subject: [Bioperl-l] Other object oddities In-Reply-To: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> Message-ID: I agree, $seq->seq() could possibly be better named. Maybe $seq- >seqstr()? The thing is that having $seq->seq() return an object would be meaningless - it would be $self. You can test what kind of object you have using ref() or isa(): $seq = $obj->seq(); # we need the sequence string $seq = $seq->seq() if ref($seq) && $seq->isa("Bio::PrimarySeqI"); There has been a naming consistency review, but it's been a long time. -hilmar On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: > So, I'm using quite a bit of bioperl code in my own stuff and have > been > seeing some oddities with the naming of methods. A good example > would be > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method > called > "seq" but in the latter case it returns an object (and expects an > object > when doing a Set) and in the former it returns a string and expects a > string when doing a Set. 
>
> This makes for a bit of brain freeze on my part when the return from
> another object might be a Bio::Seq or Bio::SeqFeature::Generic and now
> calling the ->seq returns different things.
>
> Guess I'm just curious if anyone has done an audit of the methods of
> the various objects and their return types to see how consistent they are
> across even a subsection of the codebase?
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From wangyi2412 at gmail.com Mon May 4 00:42:31 2009
From: wangyi2412 at gmail.com (yi wang)
Date: Mon, 4 May 2009 12:42:31 +0800
Subject: [Bioperl-l] bioperl / emboss on windows
Message-ID:

---------- Forwarded message ----------
From: yi wang
Date: 2009/5/4
Subject: bioperl on windows
To: bioperl-l at bioperl.org

I have installed BioPerl and EMBOSS on my Windows XP machine, as guided on
the web. But it reports:

--------------------- WARNING ---------------------
MSG: Application [needle] is not available!
---------------------------------------------------

use warnings;
use CGI;
use Bio::Perl;
use Bio::Root::Root;
use Bio::Factory::ApplicationFactoryI;
use Bio::Factory::EMBOSS;
use Bio::Tools::Run::EMBOSSApplication;

my $f = Bio::Factory::EMBOSS->new();
$f->program("needle");
#my $factory = new Bio::Factory::EMBOSS;
#my $compseqapp = $factory->program("needle");

I checked the manual and EMBOSS.pm and wrote the program as in the demo,
but it doesn't work. What could the problem be? Thank you very much!
Looking forward to your reply!

Best Wishes,
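One quick way to rule out a PATH problem behind the "Application [needle] is
not available" warning is to check, from Perl itself, whether the executable
can be found at all. The following is only a minimal sketch using core
modules: the helper name find_on_path is hypothetical (it is not part of
BioPerl), and it assumes the EMBOSS binaries are meant to be reached through
the PATH environment variable.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not a BioPerl function): return the full path of the
# first file named $name found in a PATH directory, or nothing if none exists.
sub find_on_path {
    my ($name) = @_;
    my $is_windows = $^O =~ /MSWin32/i;
    # On Windows, executables usually carry an extension such as .exe
    my @exts = $is_windows
        ? ('', split /;/, ($ENV{PATHEXT} // '.exe;.bat;.cmd'))
        : ('');
    for my $dir (File::Spec->path) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile($dir, $name . $ext);
            # Require the executable bit on Unix-like systems only
            return $candidate if -f $candidate && ($is_windows || -x _);
        }
    }
    return;
}

my $found = find_on_path('needle');
print defined $found
    ? "needle found at $found\n"
    : "needle is not on PATH\n";
```

If this reports that needle is on PATH but BioPerl still warns, the cause is
more likely inside the BioPerl wrapper than in the EMBOSS installation.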
From SMarkel at accelrys.com Mon May 4 09:41:06 2009
From: SMarkel at accelrys.com (Scott Markel)
Date: Mon, 4 May 2009 09:41:06 -0400
Subject: [Bioperl-l] bioperl / emboss on windows
In-Reply-To:
References:
Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A11B418@exch1-hi.accelrys.net>

Is needle in your path?

Note that needle needs two input sequences, which you don't provide. You
might try invoking embossversion, which takes no inputs.

Scott

Scott Markel, Ph.D.
Principal Bioinformatics Architect    email: smarkel at accelrys.com
Accelrys (SciTegic R&D)               mobile: +1 858 205 3653
10188 Telesis Court, Suite 100        voice: +1 858 799 5603
San Diego, CA 92121                   fax: +1 858 799 5222
USA                                   web: http://www.accelrys.com

http://www.linkedin.com/in/smarkel
Vice President, Board of Directors: International Society for Computational Biology
Co-chair: ISCB Publications Committee
Associate Editor: PLoS Computational Biology
Editorial Board: Briefings in Bioinformatics

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of yi wang
> Sent: Sunday, 03 May 2009 9:43 PM
> To: bioperl-l at bioperl.org
> Subject: [Bioperl-l] bioperl / emboss on windows
>
> ---------- Forwarded message ----------
> From: yi wang
> Date: 2009/5/4
> Subject: bioperl on windows
> To: bioperl-l at bioperl.org
>
>
> I have installed the bioperl and emboss on my* windows xp*, as guided on
> the web.
But it > --------------------- WARNING --------------------- > *MSG: Application [needle] is not available!* > --------------------------------------------------- > > > use warnings; > use CGI; > use Bio::Perl; > use Bio::Root::Root; > use Bio::Factory::ApplicationFactoryI; > use Bio::Factory::EMBOSS; > use Bio::Tools::Run::EMBOSSApplication; > > > > *my $f = Bio::Factory::EMBOSS -> new();* > *$f->program("needle");* > #my $factory = new Bio::Factory::EMBOSS; #my $compseqapp = $factory- > >program("needle"); > > I checked the manual and the emboss.pm, write the programe as the demo, > but it could work! How could it be the problem? Thank you very much! > *Looking for your reply!* > > > Best Wishes, > > > > -- > ?????????? From Kevin.M.Brown at asu.edu Mon May 4 11:31:30 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Mon, 4 May 2009 08:31:30 -0700 Subject: [Bioperl-l] Other object oddities In-Reply-To: References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> Message-ID: <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> I don't mind that Bio::Seq uses seq to return a string. In fact I prefer that. Just would be nice if other objects obeyed the same convention. Bio::SeqFeature::Generic returns an object for both entire_seq and seq, but uses attach_seq to store the Bio::Seq object into the Feature. Maybe SeqFeature could be adjusted so that ->seq returns the sequence string of the feature (just like Bio::Seq) and ->feature_seq returns the Bio::Seq object. > -----Original Message----- > From: Hilmar Lapp [mailto:hlapp at gmx.net] > Sent: Sunday, May 03, 2009 11:37 AM > To: Kevin Brown > Cc: BioPerl List > Subject: Re: [Bioperl-l] Other object oddities > > I agree, $seq->seq() could possibly be better named. Maybe $seq- > >seqstr()? > > The thing is that having $seq->seq() return an object would be > meaningless - it would be $self. 
> > You can test what kind of object you have using ref() or isa(): > > $seq = $obj->seq(); > # we need the sequence string > $seq = $seq->seq() if ref($seq) && > $seq->isa("Bio::PrimarySeqI"); > > There has been a naming consistency review, but it's been a long time. > > -hilmar > > > On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: > > > So, I'm using quite a bit of bioperl code in my own stuff and have > > been > > seeing some oddities with the naming of methods. A good example > > would be > > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method > > called > > "seq" but in the latter case it returns an object (and expects an > > object > > when doing a Set) and in the former it returns a string and > expects a > > string when doing a Set. > > > > This makes for a bit of brain freeze on my part when the return from > > another object might be a Bio::Seq or > Bio::SeqFeature::Generic and now > > calling the ->seq returns different things. > > > > Guess I'm just curious if anyone has done an audit of the > methods of > > the > > various objects and their return types to see how > consistent they are > > across even a subsection of the codebase? > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > From uludag at ebi.ac.uk Mon May 4 11:39:55 2009 From: uludag at ebi.ac.uk (uludag at ebi.ac.uk) Date: Mon, 4 May 2009 16:39:55 +0100 (BST) Subject: [Bioperl-l] bioperl / emboss on windows In-Reply-To: References: Message-ID: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk> It looks like EMBOSS was disabled in Bio\Factory\EMBOSS.pm for Windows platform. 
After commenting out the related condition in the _program_list
function (as shown below) I don't get the "Application [needle] is not
available" error any more.

    if( #$^O =~ /MSWIN/i ||

Regards,
Mahmut

> I have installed the bioperl and emboss on my* windows xp*, as guided on
> the web. But it
> --------------------- WARNING ---------------------
> *MSG: Application [needle] is not available!*
> ---------------------------------------------------
>
>
> use warnings;
> use CGI;
> use Bio::Perl;
> use Bio::Root::Root;
> use Bio::Factory::ApplicationFactoryI;
> use Bio::Factory::EMBOSS;
> use Bio::Tools::Run::EMBOSSApplication;
>
>
>
> *my $f = Bio::Factory::EMBOSS -> new();*
> *$f->program("needle");*
> #my $factory = new Bio::Factory::EMBOSS;
> #my $compseqapp = $factory->program("needle");

From maj at fortinbras.us Mon May 4 11:50:59 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Mon, 4 May 2009 11:50:59 -0400
Subject: [Bioperl-l] Other object oddities
In-Reply-To: <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu>
Message-ID: <4D0732D667FD4A26B6161660107920E5@NewLife>

This is definitely a reasonable issue to chase down. How to do it needs
a little care. I personally see 'seq' and think 'object', and have resorted to
'seqstr' in my own code to hold/access just strings. FWIW, my preference would
be to have any object that has a seq object as a property return objects
when a '..._seq' accessor is called. However, the seq objects themselves
generally contain the sequence string in their seq() property. We wouldn't
want to disrupt that, but would it be worth creating an alias getter/setter for
the Seq classes seq() property called 'seqstr'?
We could then count on $foo->bar_seq, an object $foo->bar_seq->seqstr, a string $foo->seqstr, a string (not nec same as above) cheers Mark ----- Original Message ----- From: "Kevin Brown" Cc: "BioPerl List" Sent: Monday, May 04, 2009 11:31 AM Subject: Re: [Bioperl-l] Other object oddities >I don't mind that Bio::Seq uses seq to return a string. In fact I prefer > that. Just would be nice if other objects obeyed the same convention. > Bio::SeqFeature::Generic returns an object for both entire_seq and seq, > but uses attach_seq to store the Bio::Seq object into the Feature. > > Maybe SeqFeature could be adjusted so that ->seq returns the sequence > string of the feature (just like Bio::Seq) and ->feature_seq returns the > Bio::Seq object. > >> -----Original Message----- >> From: Hilmar Lapp [mailto:hlapp at gmx.net] >> Sent: Sunday, May 03, 2009 11:37 AM >> To: Kevin Brown >> Cc: BioPerl List >> Subject: Re: [Bioperl-l] Other object oddities >> >> I agree, $seq->seq() could possibly be better named. Maybe $seq- >> >seqstr()? >> >> The thing is that having $seq->seq() return an object would be >> meaningless - it would be $self. >> >> You can test what kind of object you have using ref() or isa(): >> >> $seq = $obj->seq(); >> # we need the sequence string >> $seq = $seq->seq() if ref($seq) && >> $seq->isa("Bio::PrimarySeqI"); >> >> There has been a naming consistency review, but it's been a long time. >> >> -hilmar >> >> >> On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: >> >> > So, I'm using quite a bit of bioperl code in my own stuff and have >> > been >> > seeing some oddities with the naming of methods. A good example >> > would be >> > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method >> > called >> > "seq" but in the latter case it returns an object (and expects an >> > object >> > when doing a Set) and in the former it returns a string and >> expects a >> > string when doing a Set. 
>> > >> > This makes for a bit of brain freeze on my part when the return from >> > another object might be a Bio::Seq or >> Bio::SeqFeature::Generic and now >> > calling the ->seq returns different things. >> > >> > Guess I'm just curious if anyone has done an audit of the >> methods of >> > the >> > various objects and their return types to see how >> consistent they are >> > across even a subsection of the codebase? >> > >> > _______________________________________________ >> > Bioperl-l mailing list >> > Bioperl-l at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From Kevin.M.Brown at asu.edu Mon May 4 11:58:05 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Mon, 4 May 2009 08:58:05 -0700 Subject: [Bioperl-l] Other object oddities In-Reply-To: <4D0732D667FD4A26B6161660107920E5@NewLife> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <4D0732D667FD4A26B6161660107920E5@NewLife> Message-ID: <1A4207F8295607498283FE9E93B775B405F7F347@EX02.asurite.ad.asu.edu> I guess since my first exposure to BioPerl was reading in FASTA data, that I picked up the preference for ->seq to be a string as that is what happens in Bio::Seq objects. So, I see seq and think sequence string, heheh. Just be aware, ->seq returning/setting a string seems to be far more common than it returning an object. > -----Original Message----- > From: Mark A. 
Jensen [mailto:maj at fortinbras.us] > Sent: Monday, May 04, 2009 8:51 AM > To: Kevin Brown > Cc: BioPerl List > Subject: Re: [Bioperl-l] Other object oddities > > This is definitely a reasonable issue to chase down. How to > do it needs > a little care. I personally see 'seq' and think 'object', and > have resorted to > 'seqstr' in my own code to hold/access just strings. FWIW, my > preference would > be to have any object that has a seq object as a property > return objects > when a '..._seq' accessor is called. However, the seq objects > themselves > generally contain the sequence string in their seq() > property. We wouldn't > want to disrupt that, but would it be worth creating an alias > getter/setter for > the Seq classes seq() property called 'seqstr'? We could then count on > > $foo->bar_seq, an object > $foo->bar_seq->seqstr, a string > $foo->seqstr, a string (not nec same as above) > > cheers Mark > ----- Original Message ----- > From: "Kevin Brown" > Cc: "BioPerl List" > Sent: Monday, May 04, 2009 11:31 AM > Subject: Re: [Bioperl-l] Other object oddities > > > >I don't mind that Bio::Seq uses seq to return a string. In > fact I prefer > > that. Just would be nice if other objects obeyed the same > convention. > > Bio::SeqFeature::Generic returns an object for both > entire_seq and seq, > > but uses attach_seq to store the Bio::Seq object into the Feature. > > > > Maybe SeqFeature could be adjusted so that ->seq returns > the sequence > > string of the feature (just like Bio::Seq) and > ->feature_seq returns the > > Bio::Seq object. > > > >> -----Original Message----- > >> From: Hilmar Lapp [mailto:hlapp at gmx.net] > >> Sent: Sunday, May 03, 2009 11:37 AM > >> To: Kevin Brown > >> Cc: BioPerl List > >> Subject: Re: [Bioperl-l] Other object oddities > >> > >> I agree, $seq->seq() could possibly be better named. Maybe $seq- > >> >seqstr()? > >> > >> The thing is that having $seq->seq() return an object would be > >> meaningless - it would be $self. 
> >> > >> You can test what kind of object you have using ref() or isa(): > >> > >> $seq = $obj->seq(); > >> # we need the sequence string > >> $seq = $seq->seq() if ref($seq) && > >> $seq->isa("Bio::PrimarySeqI"); > >> > >> There has been a naming consistency review, but it's been > a long time. > >> > >> -hilmar > >> > >> > >> On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: > >> > >> > So, I'm using quite a bit of bioperl code in my own > stuff and have > >> > been > >> > seeing some oddities with the naming of methods. A good example > >> > would be > >> > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method > >> > called > >> > "seq" but in the latter case it returns an object (and expects an > >> > object > >> > when doing a Set) and in the former it returns a string and > >> expects a > >> > string when doing a Set. > >> > > >> > This makes for a bit of brain freeze on my part when the > return from > >> > another object might be a Bio::Seq or > >> Bio::SeqFeature::Generic and now > >> > calling the ->seq returns different things. > >> > > >> > Guess I'm just curious if anyone has done an audit of the > >> methods of > >> > the > >> > various objects and their return types to see how > >> consistent they are > >> > across even a subsection of the codebase? 
> >> > > >> > _______________________________________________ > >> > Bioperl-l mailing list > >> > Bioperl-l at lists.open-bio.org > >> > http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > From cjfields at illinois.edu Mon May 4 11:53:47 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 4 May 2009 10:53:47 -0500 Subject: [Bioperl-l] bioperl / emboss on windows In-Reply-To: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk> References: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk> Message-ID: Okay, so I assume everything works then? I remember getting this to work at some point on WinXP years ago (I have since moved on to Linux/ Mac). chris On May 4, 2009, at 10:39 AM, uludag at ebi.ac.uk wrote: > > It looks like EMBOSS was disabled in Bio\Factory\EMBOSS.pm for Windows > platform. After commenting out the related condition in the > _program_list > function (as shown below) i don't get the "Application [needle] is not > available" error any more. > > if( #$^O =~ /MSWIN/i || > > Regards, > Mahmut > > >> I have installed the bioperl and emboss on my* windows xp*, as >> guided on >> the web. 
But it
>> --------------------- WARNING ---------------------
>> *MSG: Application [needle] is not available!*
>> ---------------------------------------------------
>>
>>
>> use warnings;
>> use CGI;
>> use Bio::Perl;
>> use Bio::Root::Root;
>> use Bio::Factory::ApplicationFactoryI;
>> use Bio::Factory::EMBOSS;
>> use Bio::Tools::Run::EMBOSSApplication;
>>
>>
>>
>> *my $f = Bio::Factory::EMBOSS -> new();*
>> *$f->program("needle");*
>> #my $factory = new Bio::Factory::EMBOSS;
>> #my $compseqapp = $factory->program("needle");
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at illinois.edu Mon May 4 12:04:10 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 4 May 2009 11:04:10 -0500
Subject: [Bioperl-l] Other object oddities
In-Reply-To: <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu>
Message-ID: <87C756F8-44FB-4930-8154-478BE50AE270@illinois.edu>

On May 4, 2009, at 10:31 AM, Kevin Brown wrote:

> I don't mind that Bio::Seq uses seq to return a string. In fact I prefer
> that. Just would be nice if other objects obeyed the same convention.
> Bio::SeqFeature::Generic returns an object for both entire_seq and seq,
> but uses attach_seq to store the Bio::Seq object into the Feature.

I think most of these are legacy issues that (for the most part) have
just been dealt with ('they just work'), and with the thought that
changing things breaks legacy code. I agree with you, though; it's a
good time to rethink how we're naming methods, work towards some
consistency, and possibly do this for the next significant release. I
don't want to fall into the trap that perl 5.x had fallen into (and is
working towards digging out of), namely fear of breaking old code.
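Until the method names are made consistent, calling code can shield itself
from the string-versus-object difference with a small normalizing helper,
along the lines of the ref()/isa() test suggested earlier in this thread.
What follows is only a sketch: My::StringSeq, My::ObjectSeq and
sequence_string are hypothetical stand-ins so that it runs without BioPerl
installed. With real BioPerl objects one would presumably test
isa("Bio::PrimarySeqI") instead of can('seq').

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical stand-in for a class whose seq() returns a plain string
# (the behaviour described for Bio::Seq in this thread).
package My::StringSeq;
sub new { my ($class, $str) = @_; return bless { str => $str }, $class }
sub seq { my ($self) = @_; return $self->{str} }

# Hypothetical stand-in for a class whose seq() returns an object
# (the behaviour described for Bio::SeqFeature::Generic).
package My::ObjectSeq;
sub new { my ($class, $obj) = @_; return bless { obj => $obj }, $class }
sub seq { my ($self) = @_; return $self->{obj} }

package main;

# Always hand back a plain string, whichever convention $obj follows.
sub sequence_string {
    my ($obj) = @_;
    my $seq = $obj->seq;
    # If we got an object back, ask that object for its string in turn.
    $seq = $seq->seq if blessed($seq) && $seq->can('seq');
    return $seq;
}

my $plain   = My::StringSeq->new('ACGT');
my $wrapped = My::ObjectSeq->new($plain);
print sequence_string($plain),   "\n";
print sequence_string($wrapped), "\n";
```

The helper unwraps one level only, which matches the two cases discussed
here; it is a workaround in calling code, not a substitute for a consistent
API.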
> Maybe SeqFeature could be adjusted so that ->seq returns the sequence > string of the feature (just like Bio::Seq) and ->feature_seq returns > the > Bio::Seq object. That would be a significant API change and would be inconsistent with seq() in other classes returning a Bio::Seq. Not that it's any different than some of the current behavior, but if we want to correct this it should be done in a *consistent*, well-defined way. My thoughts: To me, seq() should always return a Bio::PrimarySeqI (derived from invocant PrimarySeqI class). However, this is currently inconsistent as illustrated by your example. Changing this would require a deprecation cycle. A new method, seqstr()/str()/rawseq(), could be guaranteed to return a raw sequence. Similarly, bioseq(), could always return a Bio::PrimarySeqI. chris >> -----Original Message----- >> From: Hilmar Lapp [mailto:hlapp at gmx.net] >> Sent: Sunday, May 03, 2009 11:37 AM >> To: Kevin Brown >> Cc: BioPerl List >> Subject: Re: [Bioperl-l] Other object oddities >> >> I agree, $seq->seq() could possibly be better named. Maybe $seq- >>> seqstr()? >> >> The thing is that having $seq->seq() return an object would be >> meaningless - it would be $self. >> >> You can test what kind of object you have using ref() or isa(): >> >> $seq = $obj->seq(); >> # we need the sequence string >> $seq = $seq->seq() if ref($seq) && >> $seq->isa("Bio::PrimarySeqI"); >> >> There has been a naming consistency review, but it's been a long >> time. >> >> -hilmar >> >> >> On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: >> >>> So, I'm using quite a bit of bioperl code in my own stuff and have >>> been >>> seeing some oddities with the naming of methods. A good example >>> would be >>> in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method >>> called >>> "seq" but in the latter case it returns an object (and expects an >>> object >>> when doing a Set) and in the former it returns a string and >> expects a >>> string when doing a Set. 
>>> >>> This makes for a bit of brain freeze on my part when the return from >>> another object might be a Bio::Seq or >> Bio::SeqFeature::Generic and now >>> calling the ->seq returns different things. >>> >>> Guess I'm just curious if anyone has done an audit of the >> methods of >>> the >>> various objects and their return types to see how >> consistent they are >>> across even a subsection of the codebase? >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From uludag at ebi.ac.uk Mon May 4 12:26:20 2009 From: uludag at ebi.ac.uk (uludag at ebi.ac.uk) Date: Mon, 4 May 2009 17:26:20 +0100 (BST) Subject: [Bioperl-l] bioperl / emboss on windows In-Reply-To: References: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk> Message-ID: <43939.86.149.78.35.1241454380.squirrel@webmail.ebi.ac.uk> > Okay, so I assume everything works then? I remember getting this to > work at some point on WinXP years ago (I have since moved on to Linux/ > Mac). I cannot say everything works but it looks like at least basic things are working. I just tested the 'water' example given on top of EMBOSS.pm. Example Bio::Seq inputs were properly transferred to 'water' and bioperl was able to construct Bio::AlignIO object from the output file EMBOSS generated. In the example, 'water' inputs are named as 'sequencea' and 'seqall', however, i needed to rename them as 'asequence' and 'bsequence' (i use mEMBOSS-6.0.1). 
Regards, Mahmut > On May 4, 2009, at 10:39 AM, uludag at ebi.ac.uk wrote: > >> >> It looks like EMBOSS was disabled in Bio\Factory\EMBOSS.pm for Windows >> platform. After commenting out the related condition in the >> _program_list >> function (as shown below) i don't get the "Application [needle] is not >> available" error any more. >> >> if( #$^O =~ /MSWIN/i || >> >> Regards, >> Mahmut >> >> >>> I have installed the bioperl and emboss on my* windows xp*, as >>> guided on >>> the web. But it >>> --------------------- WARNING --------------------- >>> *MSG: Application [needle] is not available!* >>> --------------------------------------------------- >>> >>> >>> use warnings; >>> use CGI; >>> use Bio::Perl; >>> use Bio::Root::Root; >>> use Bio::Factory::ApplicationFactoryI; >>> use Bio::Factory::EMBOSS; >>> use Bio::Tools::Run::EMBOSSApplication; >>> >>> >>> >>> *my $f = Bio::Factory::EMBOSS -> new();* >>> *$f->program("needle");* >>> #my $factory = new Bio::Factory::EMBOSS; >>> #my $compseqapp = $factory->program("needle"); From cjfields at illinois.edu Mon May 4 12:30:23 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 4 May 2009 11:30:23 -0500 Subject: [Bioperl-l] bioperl / emboss on windows In-Reply-To: <43939.86.149.78.35.1241454380.squirrel@webmail.ebi.ac.uk> References: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk> <43939.86.149.78.35.1241454380.squirrel@webmail.ebi.ac.uk> Message-ID: Yes, I recall something along those lines. Parameters are something that need to be genericized for EMBOSS use. Good to hear it works, though. chris On May 4, 2009, at 11:26 AM, uludag at ebi.ac.uk wrote: > >> Okay, so I assume everything works then? I remember getting this to >> work at some point on WinXP years ago (I have since moved on to >> Linux/ >> Mac). > > I cannot say everything works but it looks like at least basic > things are > working. I just tested the 'water' example given on top of EMBOSS.pm. 
> Example Bio::Seq inputs were properly transferred to 'water' and > bioperl > was able to construct Bio::AlignIO object from the output file EMBOSS > generated. > > In the example, 'water' inputs are named as 'sequencea' and 'seqall', > however, i needed to rename them as 'asequence' and 'bsequence' (i use > mEMBOSS-6.0.1). > > Regards, > Mahmut > > >> On May 4, 2009, at 10:39 AM, uludag at ebi.ac.uk wrote: >> >>> >>> It looks like EMBOSS was disabled in Bio\Factory\EMBOSS.pm for >>> Windows >>> platform. After commenting out the related condition in the >>> _program_list >>> function (as shown below) i don't get the "Application [needle] is >>> not >>> available" error any more. >>> >>> if( #$^O =~ /MSWIN/i || >>> >>> Regards, >>> Mahmut >>> >>> >>>> I have installed the bioperl and emboss on my* windows xp*, as >>>> guided on >>>> the web. But it >>>> --------------------- WARNING --------------------- >>>> *MSG: Application [needle] is not available!* >>>> --------------------------------------------------- >>>> >>>> >>>> use warnings; >>>> use CGI; >>>> use Bio::Perl; >>>> use Bio::Root::Root; >>>> use Bio::Factory::ApplicationFactoryI; >>>> use Bio::Factory::EMBOSS; >>>> use Bio::Tools::Run::EMBOSSApplication; >>>> >>>> >>>> >>>> *my $f = Bio::Factory::EMBOSS -> new();* >>>> *$f->program("needle");* >>>> #my $factory = new Bio::Factory::EMBOSS; >>>> #my $compseqapp = $factory->program("needle"); > > From cjfields at illinois.edu Mon May 4 13:51:23 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 4 May 2009 12:51:23 -0500 Subject: [Bioperl-l] Can I load ontologies into BioSQL? In-Reply-To: <0F6F530C-3EE5-4F1D-AA03-151B810AB068@berkeleybop.org> References: <0F6F530C-3EE5-4F1D-AA03-151B810AB068@berkeleybop.org> Message-ID: <6D2B293A-7BC5-4F4D-8D8C-3579BB4FD5AB@illinois.edu> We can note it as deprecated for the next minor release (1.7). chris On Apr 29, 2009, at 3:58 PM, Chris Mungall wrote: > The .ontology files have been deprecated by GO. 
Use the .obo files > instead. > > It appears the bioperl parser for the .ontology files isn't able to > deal with the new relations in GO. I suggest that the > bioperl .ontology parser is deprecated too > > On Apr 22, 2009, at 6:38 AM, Hilmar Lapp wrote: > >> Hi Carlos, >> >> I am moving your inquiry to the BioPerl list, as the tool is a part >> of Bioperl-db and uses BioPerl for parsing the ontologies. >> >> In your case, the goflat parser in BioPerl seems to balk at the >> second one of the input files. It may be that the input file is >> (was?) corrupted, that does happen every once in a while. More >> likely though is that the goflat parser hasn't kept up with some >> format changes. Have you tried using the obo format version instead? >> >> -hilmar >> >> On Apr 20, 2009, at 11:44 AM, Carlos A. Canchaya wrote: >> >>> Hi guys >>> >>> I'm working with biosql and I try to figure out how to load >>> ontologies into biosql. >>> >>> I've tried >>> >>> load_ontology.pl --driver mysql --dbuser carlos --dbpass xxx -- >>> host localhost --dbname biosql --namespace "Gene Ontology" -- >>> format goflat --fmtargs "-defs_file,GO.defs" function.ontology >>> process.ontology component.ontology >>> >>> as in the script info but I have an error, >>> >>> >>> ------------------- WARNING --------------------- >>> MSG: DBLink exists in the dblink of _default >>> --------------------------------------------------- >>> >>> ------------- EXCEPTION ------------- >>> MSG: format error (file process.ontology) offending line: >>> -negative regulation of angiogenesis ; GO:0016525 ; synonym:down >>> regulation of angiogenesis ; synonym:down\-regulation of >>> angiogenesis ; synonym:downregulation of angiogenesis ; >>> synonym:inhibition of angiogenesis % negative regulation of >>> developmental process ; GO:0051093 % regulation of angiogenesis ; >>> GO:0045765 >>> >>> STACK Bio::OntologyIO::dagflat::_parse_flat_file /usr/local/share/ >>> perl/5.10.0/Bio/OntologyIO/dagflat.pm:627 >>> STACK 
Bio::OntologyIO::dagflat::parse /usr/local/share/perl/5.10.0/ >>> Bio/OntologyIO/dagflat.pm:284 >>> STACK Bio::OntologyIO::dagflat::next_ontology /usr/local/share/ >>> perl/5.10.0/Bio/OntologyIO/dagflat.pm:317 >>> STACK toplevel /usr/local/share/biosql/bioperl-db/scripts/biosql/ >>> load_ontology.pl:604 >>> ------------------------------------- >>> >>> Any suggestion? >>> >>> Cheers, >>> >>> Carlos >>> >>> >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Mon May 4 14:20:16 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 4 May 2009 13:20:16 -0500 Subject: [Bioperl-l] Other object oddities In-Reply-To: <4D0732D667FD4A26B6161660107920E5@NewLife> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <4D0732D667FD4A26B6161660107920E5@NewLife> Message-ID: <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> Sorry I haven't chimed in, but $job had killed me the last couple weeks! Unfortunately the reason this hasn't been chased down before is the headache involved. It requires significant API changes to a broadly used codebase (read: so devs are scared about breaking someone's old scripts), having to deal with deprecation cycles, not to mention the most critical aspect, which would be tuits. 
Saying that, the reason I made a 1.6 branch is to maintain the snapshot of the code for API reasons. There is no reason we can't add in more explicit methods to main trunk. We can deprecate the use of more ambiguous methods down the road. chris On May 4, 2009, at 10:50 AM, Mark A. Jensen wrote: > This is definitely a reasonable issue to chase down. How to do it > needs > a little care. I personally see 'seq' and think 'object', and have > resorted to > 'seqstr' in my own code to hold/access just strings. FWIW, my > preference would > be to have any object that has a seq object as a property return > objects > when a '..._seq' accessor is called. However, the seq objects > themselves > generally contain the sequence string in their seq() property. We > wouldn't > want to disrupt that, but would it be worth creating an alias getter/ > setter for > the Seq classes seq() property called 'seqstr'? We could then count on > > $foo->bar_seq, an object > $foo->bar_seq->seqstr, a string > $foo->seqstr, a string (not nec same as above) > > cheers Mark > ----- Original Message ----- From: "Kevin Brown" > > Cc: "BioPerl List" > Sent: Monday, May 04, 2009 11:31 AM > Subject: Re: [Bioperl-l] Other object oddities > > >> I don't mind that Bio::Seq uses seq to return a string. In fact I >> prefer >> that. Just would be nice if other objects obeyed the same convention. >> Bio::SeqFeature::Generic returns an object for both entire_seq and >> seq, >> but uses attach_seq to store the Bio::Seq object into the Feature. >> >> Maybe SeqFeature could be adjusted so that ->seq returns the sequence >> string of the feature (just like Bio::Seq) and ->feature_seq >> returns the >> Bio::Seq object. >> >>> -----Original Message----- >>> From: Hilmar Lapp [mailto:hlapp at gmx.net] >>> Sent: Sunday, May 03, 2009 11:37 AM >>> To: Kevin Brown >>> Cc: BioPerl List >>> Subject: Re: [Bioperl-l] Other object oddities >>> >>> I agree, $seq->seq() could possibly be better named. Maybe $seq- >>> >seqstr()? 
>>> >>> The thing is that having $seq->seq() return an object would be >>> meaningless - it would be $self. >>> >>> You can test what kind of object you have using ref() or isa(): >>> >>> $seq = $obj->seq(); >>> # we need the sequence string >>> $seq = $seq->seq() if ref($seq) && >>> $seq->isa("Bio::PrimarySeqI"); >>> >>> There has been a naming consistency review, but it's been a long >>> time. >>> >>> -hilmar >>> >>> >>> On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: >>> >>> > So, I'm using quite a bit of bioperl code in my own stuff and have >>> > been >>> > seeing some oddities with the naming of methods. A good example >>> > would be >>> > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method >>> > called >>> > "seq" but in the latter case it returns an object (and expects an >>> > object >>> > when doing a Set) and in the former it returns a string and >>> expects a >>> > string when doing a Set. >>> > >>> > This makes for a bit of brain freeze on my part when the return >>> from >>> > another object might be a Bio::Seq or >>> Bio::SeqFeature::Generic and now >>> > calling the ->seq returns different things. >>> > >>> > Guess I'm just curious if anyone has done an audit of the >>> methods of >>> > the >>> > various objects and their return types to see how >>> consistent they are >>> > across even a subsection of the codebase? 
>>> > >>> > _______________________________________________ >>> > Bioperl-l mailing list >>> > Bioperl-l at lists.open-bio.org >>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> -- >>> =========================================================== >>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >>> =========================================================== >>> >>> >>> >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Kevin.M.Brown at asu.edu Mon May 4 14:25:54 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Mon, 4 May 2009 11:25:54 -0700 Subject: [Bioperl-l] Other object oddities In-Reply-To: <87C756F8-44FB-4930-8154-478BE50AE270@illinois.edu> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <87C756F8-44FB-4930-8154-478BE50AE270@illinois.edu> Message-ID: <1A4207F8295607498283FE9E93B775B405F7F3F4@EX02.asurite.ad.asu.edu> > > I don't mind that Bio::Seq uses seq to return a string. In fact I > > prefer > > that. Just would be nice if other objects obeyed the same > convention. > > Bio::SeqFeature::Generic returns an object for both entire_seq and > > seq, > > but uses attach_seq to store the Bio::Seq object into the Feature. > > I think most of these are legacy issues that (for the most > part) have > just been dealt with ('they just work'), and with the thought that > changing things breaks legacy code. I agree with you, > though; it's a > good time to rethink how we're naming methods, work towards some > consistency, and possibly do this for the next significant > release. 
I > don't want to fall into the trap that perl 5.x had fallen > into (and is > working towards digging out of), namely fear of breaking old code. > > > Maybe SeqFeature could be adjusted so that ->seq returns > the sequence > > string of the feature (just like Bio::Seq) and > ->feature_seq returns > > the > > Bio::Seq object. > > That would be a significant API change and would be > inconsistent with > seq() in other classes returning a Bio::Seq. Not that it's any > different than some of the current behavior, but if we want > to correct > this it should be done in a *consistent*, well-defined way. Changing it in either set of objects would be a break in the API. Either it always returns an object or always returns a string. Right now Bio::Seq/LocatableSeq/PrimarySeq/etc... and others of its ilk return strings when calling ->seq() and also allow one to set the sequence with that same method. Bio::SeqFeature::*, the Bio::DB objects, etc... only allow one to get the seq object that way, but set it via a different method. > My thoughts: > > To me, seq() should always return a Bio::PrimarySeqI (derived from > invocant PrimarySeqI class). However, this is currently > inconsistent > as illustrated by your example. Changing this would require a > deprecation cycle. > > A new method, seqstr()/str()/rawseq(), could be guaranteed to > return a > raw sequence. Similarly, bioseq(), could always return a > Bio::PrimarySeqI. Those sound like possibilities. With one or another of the methods being aliased to ->seq if you still want to keep the call around. 
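
A minimal sketch of the naming convention discussed in this thread. My::Seq and My::Feature are hypothetical toy classes, not BioPerl classes, and seqstr() and bioseq() are the proposed accessor names rather than methods that exist in BioPerl today. The legacy seq() is kept as an alias in each class so the current, inconsistent behaviour stays visible.

```perl
use strict;
use warnings;

# Hypothetical toy classes for illustration only; this is NOT BioPerl code.
package My::Seq;

sub new {
    my ($class, %args) = @_;
    return bless { seqstr => $args{-seq} }, $class;
}

# Proposed explicit accessor: always gets/sets a plain string.
sub seqstr {
    my ($self, $value) = @_;
    $self->{seqstr} = $value if defined $value;
    return $self->{seqstr};
}

# Proposed explicit accessor: always returns an object (here the invocant).
sub bioseq { return $_[0] }

# Legacy name kept as an alias for the string accessor (Bio::Seq-like).
sub seq { my $self = shift; return $self->seqstr(@_) }

package My::Feature;

sub new {
    my ($class, %args) = @_;
    return bless { start => $args{-start}, end => $args{-end} }, $class;
}

# Store the full-length sequence object the feature is located on.
sub attach_seq { my ($self, $seq) = @_; $self->{entire} = $seq; return }

# Always an object: the subsequence covered by the feature (1-based, inclusive).
sub bioseq {
    my $self = shift;
    my $full = $self->{entire}->seqstr;
    my $sub  = substr($full, $self->{start} - 1, $self->{end} - $self->{start} + 1);
    return My::Seq->new(-seq => $sub);
}

# Always a string, whatever kind of object the caller holds.
sub seqstr { return $_[0]->bioseq->seqstr }

# Legacy name kept as an alias for the object accessor (SeqFeature-like).
sub seq { return $_[0]->bioseq }

package main;

my $seq  = My::Seq->new(-seq => 'ATGCATGCAT');
my $feat = My::Feature->new(-start => 3, -end => 6);
$feat->attach_seq($seq);

print $seq->seqstr, "\n";         # ATGCATGCAT
print $feat->seqstr, "\n";        # GCAT
print ref($feat->bioseq), "\n";   # My::Seq
```

With explicit names, calling code no longer needs the ref()/isa() test quoted earlier in the thread: seqstr() yields a string and bioseq() yields an object regardless of the class. Whether seq() should eventually alias one or the other is exactly the open question above.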
From cjfields at illinois.edu Mon May 4 15:11:18 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 4 May 2009 14:11:18 -0500 Subject: [Bioperl-l] Other object oddities In-Reply-To: <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> Message-ID: <30B87481-2314-48D4-8E84-F90FB02E90DB@illinois.edu> On May 4, 2009, at 2:01 PM, Mark A. Jensen wrote: > [I hear you re: $job] > Def. thanks for chiming- Maybe this should be an element of > the "Align refactor" that perhaps should be an overall > "Seq refactor". > > Are you saying that the trunk is fair game for api additions > for this issue? > cheers I don't think anyone should feel afraid to change things on trunk, but I think significant changes should be discussed here so everyone has a chance to chime in. And API additions are not nearly as severe as having a method like seq() return a different value. In fact, I personally don't have a problem with merging that to the 1.6 branch (others may disagree though). I consider it a 'bug fix' in a loose way. chris From maj at fortinbras.us Mon May 4 15:01:41 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Mon, 4 May 2009 15:01:41 -0400 Subject: [Bioperl-l] Other object oddities In-Reply-To: <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> Message-ID: <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> [I hear you re: $job] Def. thanks for chiming- Maybe this should be an element of the "Align refactor" that perhaps should be an overall "Seq refactor". 
Are you saying that the trunk is fair game for api additions for this issue? cheers ----- Original Message ----- From: "Chris Fields" To: "Mark A. Jensen" Cc: "BioPerl List" ; "Kevin Brown" Sent: Monday, May 04, 2009 2:20 PM Subject: Re: [Bioperl-l] Other object oddities > Sorry I haven't chimed in, but $job had killed me the last couple weeks! > > Unfortunately the reason this hasn't been chased down before is the headache > involved. It requires significant API changes to a broadly used codebase > (read: so devs are scared about breaking someone's old scripts), having to > deal with deprecation cycles, not to mention the most critical aspect, which > would be tuits. > > Saying that, the reason I made a 1.6 branch is to maintain the snapshot of > the code for API reasons. There is no reason we can't add in more explicit > methods to main trunk. We can deprecate the use of more ambiguous methods > down the road. > > chris > > On May 4, 2009, at 10:50 AM, Mark A. Jensen wrote: > >> This is definitely a reasonable issue to chase down. How to do it needs >> a little care. I personally see 'seq' and think 'object', and have resorted >> to >> 'seqstr' in my own code to hold/access just strings. FWIW, my preference >> would >> be to have any object that has a seq object as a property return objects >> when a '..._seq' accessor is called. However, the seq objects themselves >> generally contain the sequence string in their seq() property. We wouldn't >> want to disrupt that, but would it be worth creating an alias getter/ setter >> for >> the Seq classes seq() property called 'seqstr'? We could then count on >> >> $foo->bar_seq, an object >> $foo->bar_seq->seqstr, a string >> $foo->seqstr, a string (not nec same as above) >> >> cheers Mark >> ----- Original Message ----- From: "Kevin Brown" > > >> Cc: "BioPerl List" >> Sent: Monday, May 04, 2009 11:31 AM >> Subject: Re: [Bioperl-l] Other object oddities >> >> >>> I don't mind that Bio::Seq uses seq to return a string. 
In fact I prefer >>> that. Just would be nice if other objects obeyed the same convention. >>> Bio::SeqFeature::Generic returns an object for both entire_seq and seq, >>> but uses attach_seq to store the Bio::Seq object into the Feature. >>> >>> Maybe SeqFeature could be adjusted so that ->seq returns the sequence >>> string of the feature (just like Bio::Seq) and ->feature_seq returns the >>> Bio::Seq object. >>> >>>> -----Original Message----- >>>> From: Hilmar Lapp [mailto:hlapp at gmx.net] >>>> Sent: Sunday, May 03, 2009 11:37 AM >>>> To: Kevin Brown >>>> Cc: BioPerl List >>>> Subject: Re: [Bioperl-l] Other object oddities >>>> >>>> I agree, $seq->seq() could possibly be better named. Maybe $seq- >>>> >seqstr()? >>>> >>>> The thing is that having $seq->seq() return an object would be >>>> meaningless - it would be $self. >>>> >>>> You can test what kind of object you have using ref() or isa(): >>>> >>>> $seq = $obj->seq(); >>>> # we need the sequence string >>>> $seq = $seq->seq() if ref($seq) && >>>> $seq->isa("Bio::PrimarySeqI"); >>>> >>>> There has been a naming consistency review, but it's been a long time. >>>> >>>> -hilmar >>>> >>>> >>>> On Apr 30, 2009, at 5:56 PM, Kevin Brown wrote: >>>> >>>> > So, I'm using quite a bit of bioperl code in my own stuff and have >>>> > been >>>> > seeing some oddities with the naming of methods. A good example >>>> > would be >>>> > in the Bio::Seq and Bio::SeqFeature::Generic. Both have a method >>>> > called >>>> > "seq" but in the latter case it returns an object (and expects an >>>> > object >>>> > when doing a Set) and in the former it returns a string and >>>> expects a >>>> > string when doing a Set. >>>> > >>>> > This makes for a bit of brain freeze on my part when the return >>>> from >>>> > another object might be a Bio::Seq or >>>> Bio::SeqFeature::Generic and now >>>> > calling the ->seq returns different things. 
>>>> >
>>>> > Guess I'm just curious if anyone has done an audit of the methods of the various objects and their return types to see how consistent they are across even a subsection of the codebase?
>>>>
>>>> --
>>>> ===========================================================
>>>> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
>>>> ===========================================================

From punit_vergoboy2004 at yahoo.co.in Mon May 4 15:31:34 2009
From: punit_vergoboy2004 at yahoo.co.in (punit kumar)
Date: Tue, 5 May 2009 01:01:34 +0530 (IST)
Subject: [Bioperl-l] machine learnings
Message-ID: <704392.20390.qm@web8402.mail.in.yahoo.com>

Hello,

I am Punit Kumar. I would like to know whether artificial neural networks and other machine learning techniques or modules are available in BioPerl. If they are, please suggest how I can use them.

Punit Kumar Kadimi

From wangyi2412 at gmail.com Mon May 4 23:59:54 2009
From: wangyi2412 at gmail.com (yi wang)
Date: Tue, 5 May 2009 11:59:54 +0800
Subject: [Bioperl-l] bioperl / emboss on windows
In-Reply-To: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk>
References: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk>
Message-ID:

Thanks for the good suggestion, which prompted me to trace through EMBOSS.pm.

I found that, besides the MSWin check, there is another problem: *open(WOSSOUT, "wossname -auto |") does not succeed*, so WOSSOUT is empty and the following while loop, in which important data is set, is never executed.

Does anybody know how to fix this problem?

Thanks very much!
Best wishes!

2009/5/4
>
> It looks like EMBOSS was disabled in Bio\Factory\EMBOSS.pm for Windows
> platform. After commenting out the related condition in the _program_list
> function (as shown below) i don't get the "Application [needle] is not
> available" error any more.
>
> if( #$^O =~ /MSWIN/i ||
>
> Regards,
> Mahmut
>
> > I have installed the bioperl and emboss on my* windows xp*, as guided on
> > the web. But it
> > --------------------- WARNING ---------------------
> > *MSG: Application [needle] is not available!*
> > ---------------------------------------------------
> >
> > use warnings;
> > use CGI;
> > use Bio::Perl;
> > use Bio::Root::Root;
> > use Bio::Factory::ApplicationFactoryI;
> > use Bio::Factory::EMBOSS;
> > use Bio::Tools::Run::EMBOSSApplication;
> >
> > *my $f = Bio::Factory::EMBOSS -> new();*
> > *$f->program("needle");*
> > #my $factory = new Bio::Factory::EMBOSS;
> > #my $compseqapp = $factory->program("needle");

From uludag at ebi.ac.uk Tue May 5 00:55:46 2009
From: uludag at ebi.ac.uk (uludag at ebi.ac.uk)
Date: Tue, 5 May 2009 05:55:46 +0100 (BST)
Subject: [Bioperl-l] bioperl / emboss on windows
In-Reply-To:
References: <60194.86.149.78.35.1241451595.squirrel@webmail.ebi.ac.uk>
Message-ID: <42480.86.149.78.35.1241499346.squirrel@webmail.ebi.ac.uk>

> I found beside mswin supporting, there is another problem:
> *open(WOSSOUT, "wossname -auto |")
> is not successful*, so the WOSSOUT is got empty, and the following while
> loop does not executed, in which important data is set.

As Scott wrote yesterday, you can double-check whether the EMBOSS programs are included in your PATH environment variable. Installing mEMBOSS (the version of EMBOSS for Windows) from the EMBOSS FTP site should update the PATH environment variable automatically (otherwise you should update it manually via Control Panel->System->Advanced->Environment Variables).

ftp://emboss.open-bio.org/pub/EMBOSS/windows/

To check whether your PATH environment variable has been updated successfully, you can run the 'wossname -auto' command from the Windows Command Prompt; it should return the names of the EMBOSS programs with their short descriptions.

Regards,
Mahmut

From wangyi2412 at gmail.com Tue May 5 03:32:42 2009
From: wangyi2412 at gmail.com (yi wang)
Date: Tue, 5 May 2009 15:32:42 +0800
Subject: [Bioperl-l] Solved: bioperl / emboss on windows
Message-ID:

Thank you very much!

What a big mistake I made! I had not actually installed EMBOSS at all; I thought bioperl and bioperl-run were enough, because the installed EMBOSS.pm made me think so. Now it is clear that bioperl and bioperl-run are just the base for calling BioPerl modules and external programs like EMBOSS, and do not themselves contain those programs. EMBOSS.pm is just a handle for calling EMBOSS. How foolish I was!

Thank you very much for your patient and detailed answer!

Best wishes!
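
The PATH check described in this thread can be sketched in Perl as follows. This is a minimal sketch under stated assumptions, not code from Bio::Factory::EMBOSS: find_on_path is a hypothetical helper written for this example, the list of Windows extensions is an assumption, and only core modules are used.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: return the full path of the first executable named
# $prog found in @dirs (default: the directories on PATH), or undef.
sub find_on_path {
    my ($prog, @dirs) = @_;
    @dirs = File::Spec->path unless @dirs;
    # On Windows the executable carries an extension such as .exe (assumed list).
    my @exts = $^O =~ /MSWin32/i ? ('', '.exe', '.bat', '.cmd') : ('');
    for my $dir (@dirs) {
        for my $ext (@exts) {
            my $candidate = File::Spec->catfile($dir, $prog . $ext);
            return $candidate if -f $candidate && -x _;
        }
    }
    return;
}

my $wossname = find_on_path('wossname');
if (defined $wossname) {
    print "wossname found at $wossname\n";
    # Same pipe open that the thread reports as failing, with error checking added.
    open(my $woss, '-|', 'wossname -auto')
        or die "cannot run wossname: $!";
    my $lines = 0;
    $lines++ while <$woss>;
    close $woss;
    print "wossname -auto returned $lines lines\n";
}
else {
    print "wossname is not on PATH; install EMBOSS (mEMBOSS on Windows) or fix PATH\n";
}
```

If the first branch reports zero lines, or the second branch is taken, the empty WOSSOUT handle described earlier is explained by the EMBOSS installation or PATH rather than by BioPerl itself.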

2009/5/5
>
> > I found beside mswin supporting, there is another problem:
> > *open(WOSSOUT, "wossname -auto |")
> > is not successful*, so the WOSSOUT is got empty, and the following while
> > loop does not executed, in which important data is set.
>
> As Scott wrote yesterday you can double check whether EMBOSS programs are
> included in your PATH environment variable. Installing mEMBOSS (version of
> EMBOSS for Windows) from EMBOSS ftp site should automatically update PATH
> environment variable (otherwise you should update it manually, Control
> Panel->System->Advanced->Environment Variables).
>
> ftp://emboss.open-bio.org/pub/EMBOSS/windows/
>
> to check whether your PATH environment variable has successfully been
> updated you can call 'wossname -auto' command from Windows Command Prompt,
> it should return names of EMBOSS programs with their short descriptions.
>
> Regards,
> Mahmut

From hlapp at gmx.net Tue May 5 08:31:41 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 5 May 2009 08:31:41 -0400
Subject: [Bioperl-l] Other object oddities
In-Reply-To: <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife>
Message-ID: <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net>

On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote:

> Maybe this should be an element of
> the "Align refactor" that perhaps should be an overall
> "Seq refactor".

Possibly. Most importantly, it'd be great if someone would volunteer to summarize what's been said here so it won't get lost.

> Are you saying that the trunk is fair game for api additions
> for this issue?

There's been talk some (a long, actually) time ago about BioPerl 2.0 that would start on a clean slate and not be bothered by backwards compatibility demands. That effort never really took off, but maybe this is also a good time to ask the question again whether it's better to introduce the API changes we desire in add/deprecate/remove cycles, or in a more radical fashion starting on a clean slate. The obvious advantage of the former is that we get API improvements sooner, but making them is possibly more dreadful, discouraging, or not even doable due to compatibility constraints. The disadvantage of the latter is that it really needs a committed crew of people to see it through or otherwise all the nice changes die in some grand but half-finished 2.0 construction site. I think Chris also had plans to branch off a Perl6 version of BioPerl - maybe those could be the same efforts? I'm not trying to advocate one over the other here; rather, I'd like to help push on that front that is best able to capture the energy of volunteers, as that's what it takes in the end. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Tue May 5 10:31:23 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 5 May 2009 09:31:23 -0500 Subject: [Bioperl-l] Other object oddities In-Reply-To: <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> Message-ID: <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: > > On May 4, 2009, at 3:01 PM, Mark A. 
Jensen wrote: > >> Maybe this should be an element of >> the "Align refactor" that perhaps should be an overall >> "Seq refactor". > > Possibly. Most importantly, it'd be great if someone would volunteer > to summarize what's been said here so it won't get lost. Looks like mark's done it. >> Are you saying that the trunk is fair game for api additions >> for this issue? > > There's been talk some (a long, actually) time ago about BioPerl 2.0 > that would start on a clean slate and not be bothered by backwards > compatibility demands. That effort never really took off, but maybe > this is also a good time to ask the question again whether it's > better to introduce the API changes we desire in add/deprecate/ > remove cycles, or in a more radical fashion starting on a clean slate. That's what I'm thinking. > The obvious advantage of the former is that we get API improvements > sooner, but making them is possibly more dreadful, discouraging, or > not even doable due to compatibility constraints. The disadvantage > of the latter is that it really needs a committed crew of people to > see it through or otherwise all the nice changes die in some grand > but half-finished 2.0 construction site. I think Chris also had > plans to branch off a Perl6 version of BioPerl - maybe those could > be the same efforts? I have been toying around with perl6 for a bit now (Rakudo on Parrot implementation). It's possible an alpha for perl6 will be available by christmas this year; Rakudo is now passing over 11000 spec tests. Just to note, Perl6 is another beast altogether from Perl5. Yes, there is supposed to be a backwards compatibility mode, but no one has implemented that yet, and it likely won't be implemented in the near future. Based on that I'm not sure we could really call a bioperl in perl6 bioperl 2.0, more like bioperl6 1.0, as it would be a complete refactor. As for perl5, it has a nice OO set of modules (Moose) that could be used for refactoring. 
It implements roles and a few other perl6-ish bits (along with MooseX modules). perl 5.10 also has a few things backported from p6, say(), given/when, state vars, etc. We could require Modern::Perl (perl5.10 with strict/warnings pragmas on) and Moose. I have played around with both and find them quite nice, so I suggest if we were to start a 2.0 effort it should include Moose, and we should push most of the interfaces into roles. Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl implemented in Moose) on github. We can set up something there using those namespaces if needed. > I'm not trying to advocate one over the other here; rather, I'd like > to help push on that front that is best able to capture the energy > of volunteers, as that's what it takes in the end. > > -hilmar Depends on where everyone wants to place their efforts. May be less work to port the most important core classes over to Moose, and a simple test implementation will give us an idea on what works Role- wise and what doesn't. From there we could work on p6 variants; that would have to be a separate project altogether. We could also include a few other MooseX modules if it makes life easier. chris From maj at fortinbras.us Tue May 5 10:13:04 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Tue, 5 May 2009 10:13:04 -0400 Subject: [Bioperl-l] Other object oddities In-Reply-To: <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> Message-ID: <727BD57B31FE464082258697A4D742A7@NewLife> > Possibly. Most importantly, it'd be great if someone would volunteer > to summarize what's been said here so it won't get lost. 
> http://www.bioperl.org/wiki/Naming_Conventions_and_the_Future From cjm at berkeleybop.org Tue May 5 14:28:02 2009 From: cjm at berkeleybop.org (Chris Mungall) Date: Tue, 5 May 2009 11:28:02 -0700 Subject: [Bioperl-l] Moose [was Re: Other object oddities] In-Reply-To: <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> Message-ID: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> On May 5, 2009, at 7:31 AM, Chris Fields wrote: > On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: > >> >> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: >> >>> Maybe this should be an element of >>> the "Align refactor" that perhaps should be an overall >>> "Seq refactor". >> >> Possibly. Most importantly, it'd be great if someone would >> volunteer to summarize what's been said here so it won't get lost. > > Looks like mark's done it. > >>> Are you saying that the trunk is fair game for api additions >>> for this issue? >> >> There's been talk some (a long, actually) time ago about BioPerl >> 2.0 that would start on a clean slate and not be bothered by >> backwards compatibility demands. That effort never really took off, >> but maybe this is also a good time to ask the question again >> whether it's better to introduce the API changes we desire in add/ >> deprecate/remove cycles, or in a more radical fashion starting on a >> clean slate. > > That's what I'm thinking. > >> The obvious advantage of the former is that we get API improvements >> sooner, but making them is possibly more dreadful, discouraging, or >> not even doable due to compatibility constraints. 
The disadvantage >> of the latter is that it really needs a committed crew of people to >> see it through or otherwise all the nice changes die in some grand >> but half-finished 2.0 construction site. I think Chris also had >> plans to branch off a Perl6 version of BioPerl - maybe those could >> be the same efforts? > > I have been toying around with perl6 for a bit now (Rakudo on Parrot > implementation). It's possible an alpha for perl6 will be available > by christmas this year; Rakudo is now passing over 11000 spec tests. > > Just to note, Perl6 is another beast altogether from Perl5. Yes, > there is supposed to be a backwards compatibility mode, but no one > has implemented that yet, and it likely won't be implemented in the > near future. Based on that I'm not sure we could really call a > bioperl in perl6 bioperl 2.0, more like bioperl6 1.0, as it would be > a complete refactor. > > As for perl5, it has a nice OO set of modules (Moose) that could be > used for refactoring. It implements roles and a few other perl6-ish > bits (along with MooseX modules). perl 5.10 also has a few things > backported from p6, say(), given/when, state vars, etc. We could > require Modern::Perl (perl5.10 with strict/warnings pragmas on) and > Moose. I have played around with both and find them quite nice, so > I suggest if we were to start a 2.0 effort it should include Moose, > and we should push most of the interfaces into roles. We're playing around with a rewrite of go-perl using Moose: http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ This is early enough that parts could be scrapped or rewritten. Compatibility with bioperl is important. Speed was an initial concern but apparently there are some moose tricks to speed things up DBIx::Class compatibility is also important. Not sure if there is specific support for this yet > > Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl > implemented in Moose) on github. 
We can set up something there > using those namespaces if needed. > >> I'm not trying to advocate one over the other here; rather, I'd >> like to help push on that front that is best able to capture the >> energy of volunteers, as that's what it takes in the end. >> >> -hilmar > > Depends on where everyone wants to place their efforts. May be less > work to port the most important core classes over to Moose, and a > simple test implementation will give us an idea on what works Role- > wise and what doesn't. From there we could work on p6 variants; > that would have to be a separate project altogether. We could also > include a few other MooseX modules if it makes life easier. > > chris > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From sidd.basu at gmail.com Tue May 5 16:51:07 2009 From: sidd.basu at gmail.com (Siddhartha Basu) Date: Tue, 5 May 2009 15:51:07 -0500 Subject: [Bioperl-l] Re: Moose [was Re:Other object oddities] In-Reply-To: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> References: <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> Message-ID: <20090505205105.GD422@Macintosh-47.local> On Tue, 05 May 2009, Chris Mungall wrote: > > On May 5, 2009, at 7:31 AM, Chris Fields wrote: > > > On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: > > > >> > >> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: > >> > >>> Maybe this should be an element of > >>> the "Align refactor" that perhaps should be an overall > >>> "Seq refactor". > >> > >> Possibly. Most importantly, it'd be great if someone would volunteer to > >> summarize what's been said here so it won't get lost. > > > > Looks like mark's done it. 
> > > >>> Are you saying that the trunk is fair game for api additions > >>> for this issue? > >> > >> There's been talk some (a long, actually) time ago about BioPerl 2.0 that > >> would start on a clean slate and not be bothered by backwards > >> compatibility demands. That effort never really took off, but maybe this > >> is also a good time to ask the question again whether it's better to > >> introduce the API changes we desire in add/deprecate/remove cycles, or in > >> a more radical fashion starting on a clean slate. > > > > That's what I'm thinking. > > > >> The obvious advantage of the former is that we get API improvements > >> sooner, but making them is possibly more dreadful, discouraging, or not > >> even doable due to compatibility constraints. The disadvantage of the > >> latter is that it really needs a committed crew of people to see it > >> through or otherwise all the nice changes die in some grand but > >> half-finished 2.0 construction site. I think Chris also had plans to > >> branch off a Perl6 version of BioPerl - maybe those could be the same > >> efforts? > > > > I have been toying around with perl6 for a bit now (Rakudo on Parrot > > implementation). It's possible an alpha for perl6 will be available by > > christmas this year; Rakudo is now passing over 11000 spec tests. > > > > Just to note, Perl6 is another beast altogether from Perl5. Yes, there is > > supposed to be a backwards compatibility mode, but no one has implemented > > that yet, and it likely won't be implemented in the near future. Based on > > that I'm not sure we could really call a bioperl in perl6 bioperl 2.0, > > more like bioperl6 1.0, as it would be a complete refactor. > > > > As for perl5, it has a nice OO set of modules (Moose) that could be used > > for refactoring. It implements roles and a few other perl6-ish bits > > (along with MooseX modules). perl 5.10 also has a few things backported > > from p6, say(), given/when, state vars, etc. 
We could require > > Modern::Perl (perl5.10 with strict/warnings pragmas on) and Moose. I have > > played around with both and find them quite nice, so I suggest if we were > > to start a 2.0 effort it should include Moose, and we should push most of > > the interfaces into roles. > > We're playing around with a rewrite of go-perl using Moose: > http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ > > This is early enough that parts could be scrapped or rewritten. > Compatibility with bioperl is important. > > Speed was an initial concern but apparently there are some moose tricks to > speed things up > > DBIx::Class compatibility is also important. Not sure if there is specific > support for this yet > > > > > > Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl > > implemented in Moose) on github. We can set up something there using > > those namespaces if needed. > > > >> I'm not trying to advocate one over the other here; rather, I'd like to > >> help push on that front that is best able to capture the energy of > >> volunteers, as that's what it takes in the end. I would definitely like to volunteer for the 'biomoose' project as much as my skills will permit. I wrote a 'homologene' parser in the early Moose days (0.3) and have since been quite interested in working on a Moose-based project. Hopefully I will be able to help as the project takes shape. Though it is quite early, two MooseX extensions are worth looking at: MooseX::Declare http://search.cpan.org/~flora/MooseX-Declare-0.21/lib/MooseX/Declare.pm MooseX::MultiMethods http://search.cpan.org/~flora/MooseX-MultiMethods-0.02/lib/MooseX/MultiMethods.pm thanks, -siddhartha > >> > >> -hilmar > > > > Depends on where everyone wants to place their efforts. May be less work > > to port the most important core classes over to Moose, and a simple test > > implementation will give us an idea on what works Role-wise and what > > doesn't.
From there we could work on p6 variants; that would have to be a > > separate project altogether. We could also include a few other MooseX > > modules if it makes life easier. > > > > chris > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hartzell at alerce.com Tue May 5 17:42:19 2009 From: hartzell at alerce.com (George Hartzell) Date: Tue, 5 May 2009 14:42:19 -0700 Subject: [Bioperl-l] question about in-between overlapping exact location Message-ID: <18944.45755.94431.882844@already.local> I was surprised to see that: $ins = Bio::Location::Simple->new(-start => 2, -end => 3, -location_type => 'IN-BETWEEN', ); $start = Bio::Location::Simple->new(-start => 3, -end => 5); print "Wow!\n" if $start->overlaps($ins); To my mind they would only overlap if the insertion were 3^4 or 4^5. Is my mental model of in-between's overlapping exact's wrong, or could the code be improved (I'm happy to make a change, but...)? g. From jason at bioperl.org Tue May 5 18:06:50 2009 From: jason at bioperl.org (Jason Stajich) Date: Tue, 5 May 2009 15:06:50 -0700 Subject: [Bioperl-l] question about in-between overlapping exact location In-Reply-To: <18944.45755.94431.882844@already.local> References: <18944.45755.94431.882844@already.local> Message-ID: <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> George - I don't think the location type is taken into account in the overlap code testing code. Would you expect 2..3 and 3..5 to overlap? 
-jason On May 5, 2009, at 2:42 PM, George Hartzell wrote: > > I was surprised to see that: > > $ins = Bio::Location::Simple->new(-start => 2, > -end => 3, > -location_type => 'IN-BETWEEN', > ); > $start = Bio::Location::Simple->new(-start => 3, > -end => 5); > > print "Wow!\n" if $start->overlaps($ins); > > To my mind they would only overlap if the insertion were 3^4 or 4^5. > > Is my mental model of in-between's overlapping exact's wrong, or could > the code be improved (I'm happy to make a change, but...)? > > g. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From hartzell at alerce.com Wed May 6 00:17:49 2009 From: hartzell at alerce.com (George Hartzell) Date: Tue, 5 May 2009 21:17:49 -0700 Subject: [Bioperl-l] question about in-between overlapping exact location In-Reply-To: <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> Message-ID: <18945.3949.852961.763626@already.local> Jason Stajich writes: > George - > I don't think the location type is taken into account in the overlap > code testing code. Would you expect 2..3 and 3..5 to overlap? > > -jason > On May 5, 2009, at 2:42 PM, George Hartzell wrote: > > > > > I was surprised to see that: > > > > $ins = Bio::Location::Simple->new(-start => 2, > > -end => 3, > > -location_type => 'IN-BETWEEN', > > ); > > $start = Bio::Location::Simple->new(-start => 3, > > -end => 5); > > > > print "Wow!\n" if $start->overlaps($ins); > > > > To my mind they would only overlap if the insertion were 3^4 or 4^5. > > > > Is my mental model of in-between's overlapping exact's wrong, or could > > the code be improved (I'm happy to make a change, but...)? Yep, I'd expect them to overlap. 1 2 3 4 5 A T T A A I'm trying to ask a question like the following. 
Given a location that describes, e.g., a start codon (3..5) and a description of a mutation, does the mutation cause a change in the ATG? Substitutions are described with exact locations (change bases 3..4 from AT to TA) and insertions are modeled as in-between locations (insert G at 3^4). 1 2 3 4 5 6 A T G T G C C Given 3..5, I can just ask if 3..4 overlaps it (yes), if 3 overlaps it (yes) and if 3^4 overlaps it (yes). For things to work out this easily, 2^3 shouldn't overlap (an insertion there wouldn't change the codon). I can get the in-between to work by using RangeI->contains, but then I end up with 1..4 not "causing a change". I've ended up with a two-part if() that checks the location_type and uses ->overlaps() or ->contains() so that it works out. g. From webb.daniel at yahoo.com Wed May 6 07:21:47 2009 From: webb.daniel at yahoo.com (Daniel Webb) Date: Wed, 6 May 2009 04:21:47 -0700 (PDT) Subject: [Bioperl-l] retrieving gene sequence given protein id Message-ID: <277937.66024.qm@web45507.mail.sp1.yahoo.com> Hi all, is there a script or a module with which I could, given a list of protein GIs or accessions, retrieve the corresponding genes from Entrez Gene/GenBank? What I would like is the sequence of the whole gene in FASTA format, with all the introns and UTRs. I would be grateful for any help. Dan From hlapp at gmx.net Wed May 6 07:54:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 6 May 2009 07:54:14 -0400 Subject: [Bioperl-l] question about in-between overlapping exact location In-Reply-To: <18945.3949.852961.763626@already.local> References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> <18945.3949.852961.763626@already.local> Message-ID: This sounds like a bug to me - the location type should be taken into account, shouldn't it? Would you mind submitting this (and a patch if you have one :) to bugzilla?
-hilmar On May 6, 2009, at 12:17 AM, George Hartzell wrote: > > Jason Stajich writes: >> George - >> I don't think the location type is taken into account in the overlap >> code testing code. Would you expect 2..3 and 3..5 to overlap? >> >> -jason >> On May 5, 2009, at 2:42 PM, George Hartzell wrote: >> >>> >>> I was surprised to see that: >>> >>> $ins = Bio::Location::Simple->new(-start => 2, >>> -end => 3, >>> -location_type => 'IN-BETWEEN', >>> ); >>> $start = Bio::Location::Simple->new(-start => 3, >>> -end => 5); >>> >>> print "Wow!\n" if $start->overlaps($ins); >>> >>> To my mind they would only overlap if the insertion were 3^4 or 4^5. >>> >>> Is my mental model of in-between's overlapping exact's wrong, or >>> could >>> the code be improved (I'm happy to make a change, but...)? > > Yep, I'd expect them to overlap. > > 1 2 3 4 5 > A T > T A A > > I'm trying to ask a question like the following. Given a location > that describes an e.g. start codon (3..5) and a description of a > mutation, does the mutation cause a change in the ATG. Substitutions > are described with exact locations (change bases 3..4 from AT to TA) > and insertions are modeled as in-between locations (insert G at 3^4). > > 1 2 3 4 5 6 > A T G > T G C C > > Given 3..5, I can just ask if 3..4 overlaps it (yes), if 3 overlaps it > (yes) and if 3^4 overlaps it (yes). For things to work out this > easily, 2^3 shouldn't overlap (an insertion there wouldn't change the > codon). > > I can get the in-between to work by using RangeI->contains, but then I > end up with 1..4 not "causing a change". > > I've ended up with a two part if() that checks the location_type and > uses ->overlap() or ->contains() so that it works out. > > g. 
> _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Wed May 6 10:26:35 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 6 May 2009 09:26:35 -0500 Subject: [Bioperl-l] question about in-between overlapping exact location In-Reply-To: References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> <18945.3949.852961.763626@already.local> Message-ID: <0FF287BC-6EFE-498E-81BD-3D1E8DF37353@illinois.edu> We should definitely come up with some test cases and expected results for this; e.g. whether 2^3 should overlap with 1..2 or 3..5, etc (I would guess, in the latter example, they shouldn't). Also, as these are LocationI-specific, I'm not sure we should make changes to RangeI methods. Maybe fix LocationI-specific bits within LocationI and delegate to RangeI::overlaps/etc in simple cases? chris On May 6, 2009, at 6:54 AM, Hilmar Lapp wrote: > This sounds like a bug to me - the location type should be taken > into account, shouldn't it? > > Would you mind submitting this (and a patch if you have one :) to > bugzilla? > > -hilmar > > On May 6, 2009, at 12:17 AM, George Hartzell wrote: > >> >> Jason Stajich writes: >>> George - >>> I don't think the location type is taken into account in the overlap >>> code testing code. Would you expect 2..3 and 3..5 to overlap? 
>>> >>> -jason >>> On May 5, 2009, at 2:42 PM, George Hartzell wrote: >>> >>>> >>>> I was surprised to see that: >>>> >>>> $ins = Bio::Location::Simple->new(-start => 2, >>>> -end => 3, >>>> -location_type => 'IN-BETWEEN', >>>> ); >>>> $start = Bio::Location::Simple->new(-start => 3, >>>> -end => 5); >>>> >>>> print "Wow!\n" if $start->overlaps($ins); >>>> >>>> To my mind they would only overlap if the insertion were 3^4 or >>>> 4^5. >>>> >>>> Is my mental model of in-between's overlapping exact's wrong, or >>>> could >>>> the code be improved (I'm happy to make a change, but...)? >> >> Yep, I'd expect them to overlap. >> >> 1 2 3 4 5 >> A T >> T A A >> >> I'm trying to ask a question like the following. Given a location >> that describes an e.g. start codon (3..5) and a description of a >> mutation, does the mutation cause a change in the ATG. Substitutions >> are described with exact locations (change bases 3..4 from AT to TA) >> and insertions are modeled as in-between locations (insert G at 3^4). >> >> 1 2 3 4 5 6 >> A T G >> T G C C >> >> Given 3..5, I can just ask if 3..4 overlaps it (yes), if 3 overlaps >> it >> (yes) and if 3^4 overlaps it (yes). For things to work out this >> easily, 2^3 shouldn't overlap (an insertion there wouldn't change the >> codon). >> >> I can get the in-between to work by using RangeI->contains, but >> then I >> end up with 1..4 not "causing a change". >> >> I've ended up with a two part if() that checks the location_type and >> uses ->overlap() or ->contains() so that it works out. >> >> g. 
>> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Wed May 6 10:32:51 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 6 May 2009 09:32:51 -0500 Subject: [Bioperl-l] Moose [was Re: Other object oddities] In-Reply-To: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> Message-ID: <17AF855A-AC55-4322-BE15-050F5EE3E802@illinois.edu> On May 5, 2009, at 1:28 PM, Chris Mungall wrote: > > On May 5, 2009, at 7:31 AM, Chris Fields wrote: > >> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: >> >>> >>> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: >>> >>>> Maybe this should be an element of >>>> the "Align refactor" that perhaps should be an overall >>>> "Seq refactor". >>> >>> Possibly. Most importantly, it'd be great if someone would >>> volunteer to summarize what's been said here so it won't get lost. >> >> Looks like mark's done it. >> >>>> Are you saying that the trunk is fair game for api additions >>>> for this issue? 
>>> >>> There's been talk some (a long, actually) time ago about BioPerl >>> 2.0 that would start on a clean slate and not be bothered by >>> backwards compatibility demands. That effort never really took >>> off, but maybe this is also a good time to ask the question again >>> whether it's better to introduce the API changes we desire in add/ >>> deprecate/remove cycles, or in a more radical fashion starting on >>> a clean slate. >> >> That's what I'm thinking. >> >>> The obvious advantage of the former is that we get API >>> improvements sooner, but making them is possibly more dreadful, >>> discouraging, or not even doable due to compatibility constraints. >>> The disadvantage of the latter is that it really needs a committed >>> crew of people to see it through or otherwise all the nice changes >>> die in some grand but half-finished 2.0 construction site. I think >>> Chris also had plans to branch off a Perl6 version of BioPerl - >>> maybe those could be the same efforts? >> >> I have been toying around with perl6 for a bit now (Rakudo on >> Parrot implementation). It's possible an alpha for perl6 will be >> available by christmas this year; Rakudo is now passing over 11000 >> spec tests. >> >> Just to note, Perl6 is another beast altogether from Perl5. Yes, >> there is supposed to be a backwards compatibility mode, but no one >> has implemented that yet, and it likely won't be implemented in the >> near future. Based on that I'm not sure we could really call a >> bioperl in perl6 bioperl 2.0, more like bioperl6 1.0, as it would >> be a complete refactor. >> >> As for perl5, it has a nice OO set of modules (Moose) that could be >> used for refactoring. It implements roles and a few other perl6- >> ish bits (along with MooseX modules). perl 5.10 also has a few >> things backported from p6, say(), given/when, state vars, etc. We >> could require Modern::Perl (perl5.10 with strict/warnings pragmas >> on) and Moose. 
I have played around with both and find them quite >> nice, so I suggest if we were to start a 2.0 effort it should >> include Moose, and we should push most of the interfaces into roles. > > We're playing around with a rewrite of go-perl using Moose: > http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ > > This is early enough that parts could be scrapped or rewritten. > Compatibility with bioperl is important. I don't think it needs to be scrapped. A stable Moose-based BioPerl is probably still a ways off from production use (I would like to test out a bit of interface->role conversion). > Speed was an initial concern but apparently there are some moose > tricks to speed things up > > DBIx::Class compatibility is also important. Not sure if there is > specific support for this yet I'm not sure about DBIx::Class, but I know Moose sometimes doesn't play well with Error.pm and its exported methods (I think there is a conflict). I believe there have been some musings in the past over changing Bio::Root::Exceptions to use Exception::Class or similar, so maybe this'll be the push to do so. Startup speed is an issue with Moose but as you noted there are ways to optimize things. And, truthfully, if we can get around the interface issues using roles it might actually help a bit. chris From Michael.Stubbington at hpa.org.uk Wed May 6 10:39:27 2009 From: Michael.Stubbington at hpa.org.uk (Michael Stubbington) Date: Wed, 6 May 2009 15:39:27 +0100 Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters Message-ID: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> Dear all, I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly program. I have some reads that will only assemble if cap3 is used with the '-y 150' option. This is fine from the command line but I can't work out how to pass this option to the Cap3 factory object in my script.
If I do the following my $params = "y 150" ; my $cap3Factory = Bio::Tools::Run::Cap3->new($params); my $assembly = $cap3Factory->run($file); Then I get an exception as follows: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Unallowed parameter: y ! STACK: Error::throw STACK: Bio::Root::Root::throw /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357 STACK: Bio::Tools::Run::Cap3::AUTOLOAD /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116 STACK: Bio::Tools::Run::Cap3::new /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101 STACK: /Users/mike/perlScripts/QGenotype.pl:150 If I don't try to pass any parameters to Cap3 it runs fine but just fails to assemble the reads that need the -y 150 flag. I'd very much appreciate any help with this. I'm pretty new to bioperl, hope I haven't missed anything obvious! Thanks in advance, Mike ------------------------------------------------------------------------ ---- Mike Stubbington Novel and Dangerous Pathogens Health Protection Agency Centre for Emergency Preparedness and Response Porton Down Salisbury SP4 0JG Tel: +44 1980 619812 ----------------------------------------- ************************************************************************** The information contained in the EMail and any attachments is confidential and intended solely and for the attention and use of the named addressee(s). It may not be disclosed to any other person without the express authority of the HPA, or the intended recipient, or both. If you are not the intended recipient, you must not disclose, copy, distribute or retain this message or any part of it. This footnote also confirms that this EMail has been swept for computer viruses, but please re-sweep any attachments before opening or saving. 
HTTP://www.HPA.org.uk ************************************************************************** From Michael.Stubbington at hpa.org.uk Wed May 6 11:27:39 2009 From: Michael.Stubbington at hpa.org.uk (Michael Stubbington) Date: Wed, 6 May 2009 16:27:39 +0100 Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters In-Reply-To: <3AE5C5E1-8551-4F44-92B3-C8DD40752A56@verizon.net> References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <3AE5C5E1-8551-4F44-92B3-C8DD40752A56@verizon.net> Message-ID: <335635A922FA2B43B35B6ADD7929CC59017B1317@porhpaexc001.HPA.org.uk> Hi Brian, Thanks for your reply. If I do that it doesn't throw the exception any more but it also doesn't successfully assemble the reads that need the -y 150 flag. M ________________________________ From: Brian Osborne [mailto:bosborne11 at verizon.net] Sent: 06 May 2009 16:09 To: Michael Stubbington Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters Michael, In Bio/Tools/Run/CAP3.pm you see this at the top: BEGIN { @PARAMS = qw(a b c d e f g m n o p s u v x); $PROGRAMDIR = '/usr/local/bin'; # Authorize attribute fields foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; } } If you add the letter y to @PARAMS does it work? Brian O. On May 6, 2009, at 10:39 AM, Michael Stubbington wrote: Dear all, I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly program. I have some reads that will only assemble if cap3 is used with the '-y 150' option. This is fine from the command line but I can't work out how to pass this option to the Cap3 factory object in my script. If I do the following my $params = "y 150" ; my $cap3Factory = Bio::Tools::Run::Cap3->new($params); my $assembly = $cap3Factory->run($file); Then I get an exception as follows: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Unallowed parameter: y ! 
STACK: Error::throw STACK: Bio::Root::Root::throw /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357 STACK: Bio::Tools::Run::Cap3::AUTOLOAD /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116 STACK: Bio::Tools::Run::Cap3::new /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101 STACK: /Users/mike/perlScripts/QGenotype.pl:150 If I don't try to pass any parameters to Cap3 it runs fine but just fails to assemble the reads that need the -y 150 flag. I'd very much appreciate any help with this. I'm pretty new to bioperl, hope I haven't missed anything obvious! Thanks in advance, Mike ------------------------------------------------------------------------ ---- Mike Stubbington Novel and Dangerous Pathogens Health Protection Agency Centre for Emergency Preparedness and Response Porton Down Salisbury SP4 0JG Tel: +44 1980 619812 ----------------------------------------- ************************************************************************ ** The information contained in the EMail and any attachments is confidential and intended solely and for the attention and use of the named addressee(s). It may not be disclosed to any other person without the express authority of the HPA, or the intended recipient, or both. If you are not the intended recipient, you must not disclose, copy, distribute or retain this message or any part of it. This footnote also confirms that this EMail has been swept for computer viruses, but please re-sweep any attachments before opening or saving. 
HTTP://www.HPA.org.uk ************************************************************************ ** _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From Kevin.M.Brown at asu.edu Wed May 6 11:23:30 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Wed, 6 May 2009 08:23:30 -0700 Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters In-Reply-To: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> Message-ID: <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu> BEGIN { @PARAMS = qw(a b c d e f g m n o p s u v x); $PROGRAMDIR = '/usr/local/bin'; # Authorize attribute fields foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; } That is the list of params that Cap3 will accept in the BioPerl module. I'm guessing if you add the y to that list that it might work. > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Michael Stubbington > Sent: Wednesday, May 06, 2009 7:39 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters > > Dear all, > > > > I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly > program. I have some reads that will only assemble if cap3 is > used with > the '-y 150' option. This is fine from the command line but I > can't work > out how to pass this option to the Cap3 factory object in my script. > > > > If I do the following > > > > my $params = "y 150" ; > > my $cap3Factory = Bio::Tools::Run::Cap3->new($params); > > my $assembly = $cap3Factory->run($file); > > > > Then I get an exception as follows: > > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > > MSG: Unallowed parameter: y ! 
> > STACK: Error::throw > > STACK: Bio::Root::Root::throw > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357 > > STACK: Bio::Tools::Run::Cap3::AUTOLOAD > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116 > > STACK: Bio::Tools::Run::Cap3::new > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101 > > STACK: /Users/mike/perlScripts/QGenotype.pl:150 > > > > If I don't try to pass any parameters to Cap3 it runs fine but just > fails to assemble the reads that need the -y 150 flag. > > > > I'd very much appreciate any help with this. I'm pretty new > to bioperl, > hope I haven't missed anything obvious! > > > > Thanks in advance, > > > > Mike > > > > -------------------------------------------------------------- > ---------- > ---- > > Mike Stubbington > > Novel and Dangerous Pathogens > > Health Protection Agency > > Centre for Emergency Preparedness and Response > > Porton Down > > Salisbury > > SP4 0JG > > > > Tel: +44 1980 619812 > > > > > > ----------------------------------------- > ************************************************************** > ************ > The information contained in the EMail and any attachments is > confidential and intended solely and for the attention and use of > the named addressee(s). It may not be disclosed to any other person > without the express authority of the HPA, or the intended > recipient, or both. If you are not the intended recipient, you must > not disclose, copy, distribute or retain this message or any part > of it. This footnote also confirms that this EMail has been swept > for computer viruses, but please re-sweep any attachments before > opening or saving. 
From hartzell at alerce.com Wed May 6 11:31:59 2009
From: hartzell at alerce.com (George Hartzell)
Date: Wed, 6 May 2009 08:31:59 -0700
Subject: [Bioperl-l] question about in-between overlapping exact location
In-Reply-To:
References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> <18945.3949.852961.763626@already.local>
Message-ID: <18945.44399.403097.640951@already.local>

Hilmar Lapp writes:
> This sounds like a bug to me - the location type should be taken into
> account, shouldn't it?
>
> Would you mind submitting this (and a patch if you have one :) to
> bugzilla?

Will do. I can just commit a fix if you'd like, provided the behaviour I
expected makes sense to people.

g.

From jonathancrabtree at gmail.com Wed May 6 11:45:32 2009
From: jonathancrabtree at gmail.com (Jonathan Crabtree)
Date: Wed, 6 May 2009 11:45:32 -0400
Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
In-Reply-To: <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu>
References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu>
Message-ID: <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com>

The "new" argument to Cap3 expects an array, not a string. So I think you
need to do this:

    my $cap3Factory = Bio::Tools::Run::Cap3->new('y', '150');

rather than this:

    my $cap3Factory = Bio::Tools::Run::Cap3->new('y 150');

Otherwise it will silently ignore the parameter. There are also several
problems with the Cap3 module itself, at least the version shown here:

http://cpansearch.perl.org/src/CJFIELDS/BioPerl-run-1.6.1/Bio/Tools/Run/Cap3.pm

Those problems are:

1. "y" is not in the PARAMS array, as Brian and Kevin have noted
2. $PROGRAMDIR appears to be hard-coded to /usr/local/bin (OK if that's
   where your cap3 is installed)
3. The run() method does this:

       my $commandstring = $exe . $param_string . " $infilename1";

   but at least for the version of cap3 I'm using, you need to put the
   $param_string _after_ the $infilename1 for it to work.

Once all these things are corrected it worked for me and correctly passed
the -y 150 to cap3 when new() was called as shown above.

Jonathan

On Wed, May 6, 2009 at 11:23 AM, Kevin Brown wrote:

> BEGIN {
>     @PARAMS = qw(a b c d e f g m n o p s u v x);
>     $PROGRAMDIR = '/usr/local/bin';
>
>     # Authorize attribute fields
>     foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; }
> }
>
> That is the list of params that Cap3 will accept in the BioPerl module.
> I'm guessing if you add the y to that list that it might work.
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org
> > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> > Michael Stubbington
> > Sent: Wednesday, May 06, 2009 7:39 AM
> > To: bioperl-l at lists.open-bio.org
> > Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
> >
> > Dear all,
> >
> > I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly
> > program. I have some reads that will only assemble if cap3 is used
> > with the '-y 150' option. This is fine from the command line but I
> > can't work out how to pass this option to the Cap3 factory object in
> > my script.
> >
> > If I do the following
> >
> >     my $params = "y 150" ;
> >     my $cap3Factory = Bio::Tools::Run::Cap3->new($params);
> >     my $assembly = $cap3Factory->run($file);
> >
> > Then I get an exception as follows:
> >
> > ------------- EXCEPTION: Bio::Root::Exception -------------
> > MSG: Unallowed parameter: y !
> > STACK: Error::throw
> > STACK: Bio::Root::Root::throw
> > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357
> > STACK: Bio::Tools::Run::Cap3::AUTOLOAD
> > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116
> > STACK: Bio::Tools::Run::Cap3::new
> > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101
> > STACK: /Users/mike/perlScripts/QGenotype.pl:150
> >
> > If I don't try to pass any parameters to Cap3 it runs fine but just
> > fails to assemble the reads that need the -y 150 flag.
> >
> > I'd very much appreciate any help with this. I'm pretty new to
> > bioperl, hope I haven't missed anything obvious!
> >
> > Thanks in advance,
> >
> > Mike

From hartzell at alerce.com Wed May 6 11:48:14 2009
From: hartzell at alerce.com (George Hartzell)
Date: Wed, 6 May 2009 08:48:14 -0700
Subject: [Bioperl-l] question about in-between overlapping exact location
In-Reply-To: <0FF287BC-6EFE-498E-81BD-3D1E8DF37353@illinois.edu>
References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> <18945.3949.852961.763626@already.local> <0FF287BC-6EFE-498E-81BD-3D1E8DF37353@illinois.edu>
Message-ID: <18945.45374.875448.871575@already.local>

Chris Fields writes:
> We should definitely come up with some test cases and expected results
> for this; e.g. whether 2^3 should overlap with 1..2 or 3..5, etc (I
> would guess, in the latter example, they shouldn't).

My expectations agree with your guess.

> Also, as these are LocationI-specific, I'm not sure we should make
> changes to RangeI methods. Maybe fix LocationI-specific bits within
> LocationI and delegate to RangeI::overlaps/etc in simple cases?

I think that LocationI would be my intended victim. I'll build up some
test cases w/ expected output and a patch and see what people think
before I commit it.

g.
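One way to pin down the expectations discussed above is a small table of test cases. The sketch below is plain Perl and does not call the BioPerl API: the overlaps() and loc() helpers and the hash layout are hypothetical, and the rule it encodes (an in-between site such as 2^3 overlaps an exact range only when that range covers both flanking positions) is just one possible reading of the guess in this thread, not settled behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl): build a location as a plain hash.
# The type strings 'EXACT' and 'IN-BETWEEN' are chosen for illustration.
sub loc {
    my ($start, $end, $type) = @_;
    return { start => $start, end => $end, type => $type || 'EXACT' };
}

# Hypothetical overlap test that takes the location type into account.
sub overlaps {
    my ($x, $y) = @_;
    my $x_between = $x->{type} eq 'IN-BETWEEN';
    my $y_between = $y->{type} eq 'IN-BETWEEN';

    # Two in-between sites overlap only if they are the same site.
    if ($x_between && $y_between) {
        return ($x->{start} == $y->{start} && $x->{end} == $y->{end}) ? 1 : 0;
    }

    # Assumed convention: a site between two bases overlaps an exact
    # range only if the range contains both flanking bases.
    if ($x_between || $y_between) {
        my ($site, $range) = $x_between ? ($x, $y) : ($y, $x);
        return ($range->{start} <= $site->{start}
             && $range->{end}   >= $site->{end}) ? 1 : 0;
    }

    # Two exact ranges: ordinary interval overlap.
    return ($x->{start} <= $y->{end} && $y->{start} <= $x->{end}) ? 1 : 0;
}

# Candidate test cases for 2^3 against a few exact ranges.
my $site = loc(2, 3, 'IN-BETWEEN');
for my $range (loc(1, 2), loc(3, 5), loc(2, 3), loc(1, 5)) {
    printf "2^3 vs %d..%d: %s\n", $range->{start}, $range->{end},
        overlaps($site, $range) ? 'overlap' : 'no overlap';
}
```

Under this convention 2^3 overlaps neither 1..2 nor 3..5, but does overlap 2..3 and 1..5. A different convention would only require changing the middle branch, which is why keeping the type-specific logic separate from the plain range comparison seems attractive.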
From cjfields at illinois.edu Wed May 6 12:07:00 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 6 May 2009 11:07:00 -0500
Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
In-Reply-To: <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com>
References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu> <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com>
Message-ID:

Jonathan,

Have a diff file? We can fix that on main trunk for the next release.

chris

On May 6, 2009, at 10:45 AM, Jonathan Crabtree wrote:

> The "new" argument to Cap3 expects an array, not a string. So I
> think you need to do this:
>
>     my $cap3Factory = Bio::Tools::Run::Cap3->new('y', '150');
>
> rather than this:
>
>     my $cap3Factory = Bio::Tools::Run::Cap3->new('y 150');
>
> Otherwise it will silently ignore the parameter. There are also
> several problems with the Cap3 module itself, at least the version
> shown here:
>
> http://cpansearch.perl.org/src/CJFIELDS/BioPerl-run-1.6.1/Bio/Tools/Run/Cap3.pm
>
> Those problems are:
>
> 1. "y" is not in the PARAMS array, as Brian and Kevin have noted
> 2. $PROGRAMDIR appears to be hard-coded to /usr/local/bin (OK if
>    that's where your cap3 is installed)
> 3. The run() method does this:
>
>        my $commandstring = $exe . $param_string . " $infilename1";
>
>    but at least for the version of cap3 I'm using, you need to put the
>    $param_string _after_ the $infilename1 for it to work.
>
> Once all these things are corrected it worked for me and correctly
> passed the -y 150 to cap3 when new() was called as shown above.
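The three points in the quoted explanation (a name/value list rather than one string, a whitelist of option letters, and options placed after the input file) can be sketched in plain Perl. This is only an illustration under stated assumptions: build_param_string() and build_command() are hypothetical helpers, not the code in Bio::Tools::Run::Cap3, and the whitelist simply repeats the letters quoted in this thread plus the 'y' under discussion.

```perl
use strict;
use warnings;

# Illustrative whitelist: the letters quoted in the thread, plus 'y'.
my %OK_FIELD = map { $_ => 1 } qw(a b c d e f g m n o p s u v x y);

# Hypothetical helper: turn a flat name/value list into an option string.
sub build_param_string {
    my @args = @_;
    die "Expected name/value pairs\n" if @args % 2;
    my $param_string = '';
    while (my ($name, $value) = splice(@args, 0, 2)) {
        die "Unallowed parameter: $name !\n" unless $OK_FIELD{$name};
        $param_string .= " -$name $value";
    }
    return $param_string;
}

# Hypothetical helper: options go after the input file name, which is
# the ordering reported to work in the message above.
sub build_command {
    my ($exe, $infile, @args) = @_;
    return $exe . " $infile" . build_param_string(@args);
}

print build_command('/usr/local/bin/cap3', 'reads.fa', 'y', '150'), "\n";
```

Passing the single string 'y 150' to this sketch fails the pair check instead of being accepted, which makes the mistake visible early. Whether the real module should die or warn in that situation is a design choice for its maintainers.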
>
> Jonathan
>
> On Wed, May 6, 2009 at 11:23 AM, Kevin Brown wrote:
>
>> BEGIN {
>>     @PARAMS = qw(a b c d e f g m n o p s u v x);
>>     $PROGRAMDIR = '/usr/local/bin';
>>
>>     # Authorize attribute fields
>>     foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; }
>> }
>>
>> That is the list of params that Cap3 will accept in the BioPerl
>> module. I'm guessing if you add the y to that list that it might work.

From bosborne11 at verizon.net Wed May 6 11:09:27 2009
From: bosborne11 at verizon.net (Brian Osborne)
Date: Wed, 06 May 2009 11:09:27 -0400
Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
In-Reply-To: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk>
References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk>
Message-ID: <3AE5C5E1-8551-4F44-92B3-C8DD40752A56@verizon.net>

Michael,

In Bio/Tools/Run/Cap3.pm you see this at the top:

    BEGIN {
        @PARAMS = qw(a b c d e f g m n o p s u v x);
        $PROGRAMDIR = '/usr/local/bin';

        # Authorize attribute fields
        foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; }
    }

If you add the letter y to @PARAMS does it work?

Brian O.

On May 6, 2009, at 10:39 AM, Michael Stubbington wrote:

> Dear all,
>
> I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly
> program. I have some reads that will only assemble if cap3 is used
> with the '-y 150' option. This is fine from the command line but I
> can't work out how to pass this option to the Cap3 factory object in
> my script.
>
> If I do the following
>
>     my $params = "y 150" ;
>     my $cap3Factory = Bio::Tools::Run::Cap3->new($params);
>     my $assembly = $cap3Factory->run($file);
>
> Then I get an exception as follows:
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: Unallowed parameter: y !

From cjfields at illinois.edu Wed May 6 12:49:09 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 6 May 2009 11:49:09 -0500
Subject: [Bioperl-l] Moose [was Re: Other object oddities]
In-Reply-To: <18945.47362.738611.609881@already.local>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> <17AF855A-AC55-4322-BE15-050F5EE3E802@illinois.edu> <18945.47362.738611.609881@already.local>
Message-ID: <5A04CB9C-6B21-4DF6-868F-7B5F9A45C679@illinois.edu>

On May 6, 2009, at 11:21 AM, George Hartzell wrote:

> Chris Fields writes:
>> [...]
>> Startup speed is an issue with Moose but as you noted there are ways
>> to optimize things. And, truthfully, if we can get around the
>> interface issues using roles it might actually help a bit.
>
> Can anyone point to a thread/presentation/paper about Moose best
> practices and/or common workarounds?
>
> Thanks,
>
> g.

The best place is the actual module docs for Moose (including the cookbook and manual).

http://search.cpan.org/~drolsky/Moose-0.77/lib/Moose/Cookbook.pod
http://search.cpan.org/~drolsky/Moose-0.77/lib/Moose/Manual.pod

For Moose extensions:

http://search.cpan.org/~stevan/Task-Moose-0.01/lib/Task/Moose.pm

Main Moose page:

http://www.iinteractive.com/moose/

I have added these to:

http://www.bioperl.org/wiki/BioMoose

chris

From hartzell at alerce.com Wed May 6 12:21:22 2009
From: hartzell at alerce.com (George Hartzell)
Date: Wed, 6 May 2009 09:21:22 -0700
Subject: [Bioperl-l] Moose [was Re: Other object oddities]
In-Reply-To: <17AF855A-AC55-4322-BE15-050F5EE3E802@illinois.edu>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> <17AF855A-AC55-4322-BE15-050F5EE3E802@illinois.edu>
Message-ID: <18945.47362.738611.609881@already.local>

Chris Fields writes:
> [...]
> Startup speed is an issue with Moose but as you noted there are ways
> to optimize things. And, truthfully, if we can get around the
> interface issues using roles it might actually help a bit.

Can anyone point to a thread/presentation/paper about Moose best
practices and/or common workarounds?

Thanks,

g.

From maj at fortinbras.us Wed May 6 13:56:03 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Wed, 6 May 2009 13:56:03 -0400
Subject: [Bioperl-l] Moose [was Re: Other object oddities]
In-Reply-To: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org>
References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu> <1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu> <4D0732D667FD4A26B6161660107920E5@NewLife> <31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu> <02EEF4C7F37247C7BBA8EC1068069FC3@NewLife> <38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net> <6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org>
Message-ID:

Great discussion-- I have redacted the moose portions to
http://www.bioperl.org/wiki/Talk:BioMoose
and encourage all interested folks to log comments there as well.

cheers Mark

----- Original Message -----
From: "Chris Mungall"
To: "Chris Fields"
Cc: "BioPerl List" ; "Mark A. Jensen" ; "Kevin Brown"
Sent: Tuesday, May 05, 2009 2:28 PM
Subject: [Bioperl-l] Moose [was Re: Other object oddities]

>
> On May 5, 2009, at 7:31 AM, Chris Fields wrote:
>
>> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote:
>>
>>>
>>> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote:
>>>
>>>> Maybe this should be an element of
>>>> the "Align refactor" that perhaps should be an overall
>>>> "Seq refactor".
>>>
>>> Possibly. Most importantly, it'd be great if someone would volunteer to
>>> summarize what's been said here so it won't get lost.
>>
>> Looks like mark's done it.
>>
>>>> Are you saying that the trunk is fair game for api additions
>>>> for this issue?
>>>
>>> There's been talk some (a long, actually) time ago about BioPerl 2.0 that
>>> would start on a clean slate and not be bothered by backwards compatibility
>>> demands. That effort never really took off, but maybe this is also a good
>>> time to ask the question again whether it's better to introduce the API
>>> changes we desire in add/deprecate/remove cycles, or in a more radical
>>> fashion starting on a clean slate.
>>
>> That's what I'm thinking.
>>
>>> The obvious advantage of the former is that we get API improvements sooner,
>>> but making them is possibly more dreadful, discouraging, or not even doable
>>> due to compatibility constraints. The disadvantage of the latter is that it
>>> really needs a committed crew of people to see it through or otherwise all
>>> the nice changes die in some grand but half-finished 2.0 construction site.
>>> I think Chris also had plans to branch off a Perl6 version of BioPerl -
>>> maybe those could be the same efforts?
>>
>> I have been toying around with perl6 for a bit now (Rakudo on Parrot
>> implementation). It's possible an alpha for perl6 will be available by
>> christmas this year; Rakudo is now passing over 11000 spec tests.
>>
>> Just to note, Perl6 is another beast altogether from Perl5. Yes, there is
>> supposed to be a backwards compatibility mode, but no one has implemented
>> that yet, and it likely won't be implemented in the near future. Based on
>> that I'm not sure we could really call a bioperl in perl6 bioperl 2.0, more
>> like bioperl6 1.0, as it would be a complete refactor.
>>
>> As for perl5, it has a nice OO set of modules (Moose) that could be used for
>> refactoring. It implements roles and a few other perl6-ish bits (along with
>> MooseX modules). perl 5.10 also has a few things backported from p6, say(),
>> given/when, state vars, etc. We could require Modern::Perl (perl5.10 with
>> strict/warnings pragmas on) and Moose. I have played around with both and
>> find them quite nice, so I suggest if we were to start a 2.0 effort it
>> should include Moose, and we should push most of the interfaces into roles.
>
> We're playing around with a rewrite of go-perl using Moose:
> http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/
>
> This is early enough that parts could be scrapped or rewritten. Compatibility
> with bioperl is important.
>
> Speed was an initial concern but apparently there are some moose tricks to
> speed things up
>
> DBIx::Class compatibility is also important. Not sure if there is specific
> support for this yet
>
>>
>> Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl
>> implemented in Moose) on github. We can set up something there using those
>> namespaces if needed.
>>
>>> I'm not trying to advocate one over the other here; rather, I'd like to
>>> help push on that front that is best able to capture the energy of
>>> volunteers, as that's what it takes in the end.
>>>
>>> -hilmar
>>
>> Depends on where everyone wants to place their efforts. May be less work to
>> port the most important core classes over to Moose, and a simple test
>> implementation will give us an idea on what works Role-wise and what
>> doesn't. From there we could work on p6 variants; that would have to be a
>> separate project altogether. We could also include a few other MooseX
>> modules if it makes life easier.
>>
>> chris

From hlapp at gmx.net Wed May 6 14:40:55 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 6 May 2009 14:40:55 -0400
Subject: [Bioperl-l] question about in-between overlapping exact location
In-Reply-To: <18945.45374.875448.871575@already.local>
References: <18944.45755.94431.882844@already.local> <1E9CA287-58C3-48B1-B9AD-3AC9541984C3@bioperl.org> <18945.3949.852961.763626@already.local> <0FF287BC-6EFE-498E-81BD-3D1E8DF37353@illinois.edu> <18945.45374.875448.871575@already.local>
Message-ID:

On May 6, 2009, at 11:48 AM, George Hartzell wrote:

> I think that LocationI would be my intended victim. I'll build up some
> test cases w/ expected output and a patch and see what people think
> before I commit it.

That'd be great - forgot that you can commit away already!

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From scott at scottcain.net Wed May 6 15:37:52 2009
From: scott at scottcain.net (Scott Cain)
Date: Wed, 6 May 2009 15:37:52 -0400
Subject: [Bioperl-l] Blasting 100kb against dbEST?
Message-ID: <4536f7700905061237r50bd7c7cqe64d9eecfebab035@mail.gmail.com>

Hi all,

I'm working on a project that needs to BLAST 100kb genomic fragments
against large DBs like dbEST. Now, 100kb is a big query, and I was
hoping that there might be a standard way to break this apart,
parallelize the BLAST and then reassemble/collate the results. Is
there a standard way to do that? That is, the first two things are
easy to do, but putting it all back together seems fraught with traps.
It's those traps I'm looking for.

Thanks,
Scott

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From jonathancrabtree at gmail.com Wed May 6 15:34:07 2009
From: jonathancrabtree at gmail.com (Jonathan Crabtree)
Date: Wed, 6 May 2009 15:34:07 -0400
Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
In-Reply-To:
References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu> <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com>
Message-ID: <8e5b8bf80905061234v2c980235hb2d8f9ac38edb02d@mail.gmail.com>

Chris-

It looks like Brian already added the 'y' option and fixed one of the
typos in the SYNOPSIS, so here's a suggested diff based on the SVN head
as of a few minutes ago. It includes the following changes:

1. Removed reference to BLAST in the comments.
2. Modified SYNOPSIS to make intended new() usage clearer.
3. Added the following @PARAMS: h i j k r t w z (to match those in the
   12/21/07 version of cap3)
4. Calling $factory->program_dir('/some/path') now changes the default
   PROGRAMDIR (/usr/local/bin)
5. Bug fix, at least for post-2005 cap3 versions: changed run() method
   to pass CAP3 options _after_ the filename.
6. Throw an (informative) exception if the executable couldn't be found
   by WrapperBase::executable.
7. Changed comments to use "CAP3", not "Cap3" or "cap3" as the name of
   the software package. The Perl module is still "Cap3".

Change 4 is probably the only one that might be controversial. It seems
that most of the Bio::Tools::Run wrappers determine the directory in
which the program executable resides by checking an environment
variable. Cap3.pm stands out by hard-coding it. program_dir() is a class
method but I've changed it to allow it to be called _either_ as a class
method (in which case it returns the default $PROGRAMDIR) _or_ as an
object method (in which case it returns or sets an internal copy of the
program directory, as illustrated in the new SYNOPSIS.) If you want to
change the class default you have to modify $PROGRAMDIR directly. I also
noticed that if cap3 _isn't_ in the default $PROGRAMDIR the error
message is completely unhelpful, so I've added a new throw() statement
for this case.

Finally, it doesn't seem that there's a test file for Cap3. If I have
some free time later I'll look into adding one.

Jonathan

On Wed, May 6, 2009 at 12:07 PM, Chris Fields wrote:

> Jonathan,
>
> Have a diff file? We can fix that on main trunk for the next release.
>
> chris

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Cap3.pm.diff
Type: text/x-diff
Size: 1724 bytes
Desc: not available
URL:

From len.zaifman at sickkids.ca Wed May 6 15:59:19 2009
From: len.zaifman at sickkids.ca (len.zaifman at sickkids.ca)
Date: Wed, 6 May 2009 15:59:19 -0400
Subject: [Bioperl-l] Installing bioperl without ftp
Message-ID:

Due to institutional requirements we cannot use ftp (port 21) to obtain
the pre-requisite packages using the easy build method. Is there a way
to do this using https or at least http? Or better yet scp/sftp?

Thanks.
Sent by a BlackBerry device From scott at scottcain.net Wed May 6 16:29:08 2009 From: scott at scottcain.net (Scott Cain) Date: Wed, 6 May 2009 16:29:08 -0400 Subject: [Bioperl-l] Installing bioperl without ftp In-Reply-To: References: Message-ID: <536f21b00905061329y7464db27o685bb61433d3c29e@mail.gmail.com> Hi Len, When you configure cpan, you can specify to use http urls instead of ftp urls. If you've already configured cpan, and need to redo it, enter the cpan shell and type "o conf init" and look for http urls for mirrors. If there aren't any in Canada, look in the US--there are about 10. Scott On Wed, May 6, 2009 at 3:59 PM, wrote: > > Due to institutional requirements we cannot use ftp (port 21) to obtain the > pre-requisite packages using the easy build method. Is there a way to do > this using https or at least http? Or better yet scp/sftp? > > Thanks. > Sent by a BlackBerry device > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From cjfields at illinois.edu Wed May 6 16:22:47 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 6 May 2009 15:22:47 -0500 Subject: [Bioperl-l] Installing bioperl without ftp In-Reply-To: References: Message-ID: <0C6EC7DB-15BD-44B7-8330-177551749033@illinois.edu> Do you mean via CPAN or from the bioperl.org site? If the latter, you can use http: http://bioperl.org/DIST/ chris On May 6, 2009, at 2:59 PM, len.zaifman at sickkids.ca wrote: > Due to institutional requirements we cannot use ftp (port 21) to > obtain the > pre-requisite packages using the easy build method. Is there a way > to do > this using https or at least http? Or better yet scp/sftp? > > Thanks. 
> Sent by a BlackBerry device > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From heikki.lehvaslaiho at gmail.com Wed May 6 16:47:56 2009 From: heikki.lehvaslaiho at gmail.com (Heikki Lehvaslaiho) Date: Wed, 6 May 2009 22:47:56 +0200 Subject: [Bioperl-l] Installing bioperl without ftp In-Reply-To: References: Message-ID: Len, Install bioperl-live and other repositories directly from the SVN repository. http://www.bioperl.org/wiki/Using_Subversion SVN uses ssh, so that should work for you. -Heikki 2009/5/6 : > > Due to institutional requirements we cannot use ftp (port 21) to obtain the > pre-requisite packages using the easy build method. Is there a way to do > this using https or at least http? Or better yet scp/sftp? > > Thanks. > Sent by a BlackBerry device > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- -Heikki Heikki Lehvaslaiho - skype:heikki_lehvaslaiho cell: +27 (0)714328090 Sent from Claremont, WC, South Africa From cjfields at illinois.edu Wed May 6 16:49:46 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 6 May 2009 15:49:46 -0500 Subject: [Bioperl-l] Blasting 100kb against dbEST? In-Reply-To: <4536f7700905061237r50bd7c7cqe64d9eecfebab035@mail.gmail.com> References: <4536f7700905061237r50bd7c7cqe64d9eecfebab035@mail.gmail.com> Message-ID: <408B0EA1-5B3A-49CA-B3D7-459DD7F7BF8D@illinois.edu> I have locally run ~100kb fragments before (BLASTN) w/o problems off my first-gen MacBook, but this was against a small database. If you need to iterate through the sequence in chunks you can specify start/ stop with -L, so the hits/HSPs will be mapped accordingly (instead of starting from 1). 
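A minimal sketch of the chunking idea in plain Perl, assuming 10 kb windows with a 1 kb overlap (both numbers, and the helper name chunk_ranges, are illustrative and not from this thread). It only computes 1-based start/stop pairs over the full query; each pair could then be given to the -L option so that hit coordinates stay relative to the original 100kb sequence:

```perl
use strict;
use warnings;

# Return a list of [start, stop] pairs (1-based, inclusive) that tile a
# query of $length bases with windows of $window bases, where consecutive
# windows share $overlap bases. chunk_ranges is a hypothetical helper,
# not part of BioPerl or BLAST.
sub chunk_ranges {
    my ($length, $window, $overlap) = @_;
    die "window must be larger than overlap\n" if $window <= $overlap;

    my @ranges;
    my $step  = $window - $overlap;
    my $start = 1;
    while ($start <= $length) {
        my $stop = $start + $window - 1;
        $stop = $length if $stop > $length;   # clip the last window
        push @ranges, [$start, $stop];
        last if $stop == $length;             # final window reached the end
        $start += $step;
    }
    return @ranges;
}

# Example: a 100 kb query, 10 kb windows, 1 kb overlap (assumed values).
for my $range (chunk_ranges(100_000, 10_000, 1_000)) {
    printf "%d\t%d\n", @$range;
}
```

One trap when collating: a hit lying entirely inside an overlap region is reported by both neighbouring windows, so the merged results probably need de-duplicating on subject ID and query coordinates.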
Also, mpiBLAST appears to segment queries: http://www.mpiblast.org/ chris On May 6, 2009, at 2:37 PM, Scott Cain wrote: > Hi all, > > I'm working on a project that needs to BLAST 100kb genomic fragments > against large DBs like dbEST. Now, 100kb is a big query, and I was > hoping that there might be a standard way to break this apart, > parallelize the BLAST and then reassemble/collate the results. Is > there a standard way to do that? That is, the first two things are > easy to do, but putting it all back together seems fraught with traps. > It's those traps I'm looking for. > > Thanks, > Scott > > > -- > ------------------------------------------------------------------------ > Scott Cain, Ph. D. scott at > scottcain dot net > GMOD Coordinator (http://gmod.org/) 216-392-3087 > Ontario Institute for Cancer Research > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From David.Messina at sbc.su.se Wed May 6 17:05:06 2009 From: David.Messina at sbc.su.se (Dave Messina) Date: Wed, 6 May 2009 23:05:06 +0200 Subject: [Bioperl-l] Installing bioperl without ftp In-Reply-To: <536f21b00905061329y7464db27o685bb61433d3c29e@mail.gmail.com> References: <536f21b00905061329y7464db27o685bb61433d3c29e@mail.gmail.com> Message-ID: <628aabb70905061405w357a80ela3231c14aae973b3@mail.gmail.com> Hey Len, In addition to what Scott said, it's possible to get BioPerl via http directly off the website. See http://www.bioperl.org/wiki/Getting_BioPerl for the URLs and other details. 
Dave From cjfields at illinois.edu Wed May 6 18:41:48 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 6 May 2009 17:41:48 -0500 Subject: [Bioperl-l] Moose [was Re: Other object oddities] In-Reply-To: References: <1A4207F8295607498283FE9E93B775B405F1257B@EX02.asurite.ad.asu.edu><1A4207F8295607498283FE9E93B775B405F1286C@EX02.asurite.ad.asu.edu><4D0732D667FD4A26B6161660107920E5@NewLife><31FC08BB-1AF2-4064-8F7F-273517ECBE81@illinois.edu><02EEF4C7F37247C7BBA8EC1068069FC3@NewLife><38483E75-E05A-4A3D-B057-28B7C928ADC6@gmx.net><6B76016F-60E8-4FE5-B083-E64762D79039@illinois.edu> <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> Message-ID: As a final bit: if we go the Moose route, we should be very careful about which MooseX modules we want. I don't think we want to expand the dependency tree. For instance, I am attempting to install one possible module (MooseX::Declare) and the dependency tree was ginormous and included modules only needed for installation. chris On May 6, 2009, at 12:56 PM, Mark A. Jensen wrote: > Great discussion-- I have redacted the moose portions to http://www.bioperl.org/wiki/Talk:BioMoose > and encourage all interested folks to log comments there as well. > cheers Mark > ----- Original Message ----- From: "Chris Mungall" > > To: "Chris Fields" > Cc: "BioPerl List" ; "Mark A. Jensen" >; "Kevin Brown" > Sent: Tuesday, May 05, 2009 2:28 PM > Subject: [Bioperl-l] Moose [was Re: Other object oddities] > > >> >> On May 5, 2009, at 7:31 AM, Chris Fields wrote: >> >>> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: >>> >>>> >>>> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: >>>> >>>>> Maybe this should be an element of >>>>> the "Align refactor" that perhaps should be an overall >>>>> "Seq refactor". >>>> >>>> Possibly. Most importantly, it'd be great if someone would >>>> volunteer to summarize what's been said here so it won't get lost. >>> >>> Looks like mark's done it. 
>>> >>>>> Are you saying that the trunk is fair game for api additions >>>>> for this issue? >>>> >>>> There's been talk some (a long, actually) time ago about BioPerl >>>> 2.0 that would start on a clean slate and not be bothered by >>>> backwards compatibility demands. That effort never really took >>>> off, but maybe this is also a good time to ask the question >>>> again whether it's better to introduce the API changes we desire >>>> in add/ deprecate/remove cycles, or in a more radical fashion >>>> starting on a clean slate. >>> >>> That's what I'm thinking. >>> >>>> The obvious advantage of the former is that we get API >>>> improvements sooner, but making them is possibly more dreadful, >>>> discouraging, or not even doable due to compatibility >>>> constraints. The disadvantage of the latter is that it really >>>> needs a committed crew of people to see it through or otherwise >>>> all the nice changes die in some grand but half-finished 2.0 >>>> construction site. I think Chris also had plans to branch off a >>>> Perl6 version of BioPerl - maybe those could be the same efforts? >>> >>> I have been toying around with perl6 for a bit now (Rakudo on >>> Parrot implementation). It's possible an alpha for perl6 will be >>> available by christmas this year; Rakudo is now passing over >>> 11000 spec tests. >>> >>> Just to note, Perl6 is another beast altogether from Perl5. Yes, >>> there is supposed to be a backwards compatibility mode, but no >>> one has implemented that yet, and it likely won't be implemented >>> in the near future. Based on that I'm not sure we could really >>> call a bioperl in perl6 bioperl 2.0, more like bioperl6 1.0, as >>> it would be a complete refactor. >>> >>> As for perl5, it has a nice OO set of modules (Moose) that could >>> be used for refactoring. It implements roles and a few other >>> perl6-ish bits (along with MooseX modules). perl 5.10 also has a >>> few things backported from p6, say(), given/when, state vars, >>> etc. 
We could require Modern::Perl (perl5.10 with strict/ >>> warnings pragmas on) and Moose. I have played around with both >>> and find them quite nice, so I suggest if we were to start a 2.0 >>> effort it should include Moose, and we should push most of the >>> interfaces into roles. >> >> We're playing around with a rewrite of go-perl using Moose: >> http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ >> >> This is early enough that parts could be scrapped or rewritten. >> Compatibility with bioperl is important. >> >> Speed was an initial concern but apparently there are some moose >> tricks to speed things up >> >> DBIx::Class compatibility is also important. Not sure if there is >> specific support for this yet >> >> >>> >>> Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl >>> implemented in Moose) on github. We can set up something there >>> using those namespaces if needed. >>> >>>> I'm not trying to advocate one over the other here; rather, I'd >>>> like to help push on that front that is best able to capture the >>>> energy of volunteers, as that's what it takes in the end. >>>> >>>> -hilmar >>> >>> Depends on where everyone wants to place their efforts. May be >>> less work to port the most important core classes over to Moose, >>> and a simple test implementation will give us an idea on what >>> works Role- wise and what doesn't. From there we could work on p6 >>> variants; that would have to be a separate project altogether. >>> We could also include a few other MooseX modules if it makes life >>> easier. 
>>> >>> chris >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From alden.huang at gmail.com Wed May 6 19:30:37 2009 From: alden.huang at gmail.com (Alden Huang) Date: Wed, 6 May 2009 16:30:37 -0700 Subject: [Bioperl-l] retrieving gene sequence given protein id In-Reply-To: <277937.66024.qm@web45507.mail.sp1.yahoo.com> References: <277937.66024.qm@web45507.mail.sp1.yahoo.com> Message-ID: <9e408d720905061630o7f9348f8o6e3285eb53e6a09e@mail.gmail.com> I am pretty sure you can just do that on the NCBI website through "Batch Entrez." Just select like gene or nucleotide for your database. If I am wrong, sorry. On Wed, May 6, 2009 at 4:21 AM, Daniel Webb wrote: > > Hi all, > > is there a script or a module with which I could, given the list of protein gi or accessions, retrieve corresponding genes from Entrez Gene/GenBank? What I would like is sequence of the whole gene in fasta format, with all the introns and UTRs. 
> I would be grateful for any help > > Dan > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From jason at bioperl.org Wed May 6 19:37:30 2009 From: jason at bioperl.org (Jason Stajich) Date: Wed, 6 May 2009 16:37:30 -0700 Subject: [Bioperl-l] retrieving gene sequence given protein id In-Reply-To: <9e408d720905061630o7f9348f8o6e3285eb53e6a09e@mail.gmail.com> References: <277937.66024.qm@web45507.mail.sp1.yahoo.com> <9e408d720905061630o7f9348f8o6e3285eb53e6a09e@mail.gmail.com> Message-ID: <07585EBF-561D-47EF-864B-CD8982EE716C@bioperl.org> or the bioperl modules and the related Entrez query module: Bio::DB::GenBank Bio::DB::GenPept covered in the HOWTOs -jason On May 6, 2009, at 4:30 PM, Alden Huang wrote: > I am pretty sure you can just do that on the NCBI website through > "Batch Entrez." Just select like gene or nucleotide for your database. > If I am wrong, sorry. > > On Wed, May 6, 2009 at 4:21 AM, Daniel Webb > wrote: >> >> Hi all, >> >> is there a script or a module with which I could, given the list of >> protein gi or accessions, retrieve corresponding genes from Entrez >> Gene/GenBank? What I would like is sequence of the whole gene in >> fasta format, with all the introns and UTRs. 
>> I would be grateful for any help >> >> Dan >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From Russell.Smithies at agresearch.co.nz Wed May 6 19:56:59 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 7 May 2009 11:56:59 +1200 Subject: [Bioperl-l] retrieving gene sequence given protein id In-Reply-To: <277937.66024.qm@web45507.mail.sp1.yahoo.com> References: <277937.66024.qm@web45507.mail.sp1.yahoo.com> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE8D07@exchsth.agresearch.co.nz> Hi Daniel, You should be able to do it with Bio::DB::EUtilities http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook Or use wget and manually link from protein (gi2088631) to gene: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Link&LinkName=protein_gene&from_uid=2088631 then link gene to nucleotide: http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Link&LinkName=gene_nuccore&from_uid=282375 Or use NCBI eUtils http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework&part=eutils Or build a pipeline: http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/epipe.html NCBI is like Perl, there's always more than one way to do it :-) --Russell Smithies Bioinformatics Applications Developer T +64 3 489 9085 E russell.smithies at agresearch.co.nz Invermay Research Centre Puddle Alley, Mosgiel, New Zealand T +64 3 489 3809 F +64 3 489 9174 www.agresearch.co.nz > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Daniel Webb > Sent: Wednesday, 6 May 2009 11:22 p.m. 
> To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] retrieving gene sequence given protein id > > > Hi all, > > is there a script or a module with which I could, given the list of protein gi > or accessions, retrieve corresponding genes from Entrez Gene/GenBank? What I > would like is sequence of the whole gene in fasta format, with all the introns > and UTRs. > I would be grateful for any help > > Dan > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From brianli.cas at gmail.com Wed May 6 21:10:46 2009 From: brianli.cas at gmail.com (brian li) Date: Thu, 7 May 2009 09:10:46 +0800 Subject: [Bioperl-l] Asking for advice on full EMBL extraction Message-ID: Dear all, I need to extract all EMBL entries and put them into a relational database so that our report generation tool can access the data. I used Bio::SeqIO::embl to get entries one by one, but it cannot get past the big million-line entries: a segmentation fault occurs. And since SeqBuilder is currently not integrated into Bio::SeqIO::embl, SeqBuilder->add_unwanted_slot can't help (http://bugzilla.open-bio.org/show_bug.cgi?id=2823). Is there another way to get entries one by one with BioPerl? 
Brian From Russell.Smithies at agresearch.co.nz Wed May 6 23:32:32 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 7 May 2009 15:32:32 +1200 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: References: Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> Hi Brian, I hate to say it, but it worked OK for me using rel_ann_mus_01_r99.dat.gz and the simple example Bio::SeqIO code from bugzilla. It's not using more than 1GB memory on our server and doesn't segfault. Send me your example code and I'll give it a go if you like. Russell Smithies Bioinformatics Applications Developer T +64 3 489 9085 E russell.smithies at agresearch.co.nz Invermay Research Centre Puddle Alley, Mosgiel, New Zealand T +64 3 489 3809 F +64 3 489 9174 www.agresearch.co.nz > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of brian li > Sent: Thursday, 7 May 2009 1:11 p.m. > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Asking for advice on full EMBL extraction > > Dear all, > > Recently I have to extract all EMBL entries and put them into > relational database so that our report generation tool can access the > data. > > I used Bio::SeqIO::embl to get entries one by one, but can not > move on when dealing with big million-line entries. Segmentation Fault > popped. And as currently SeqBuilder is not integrated into > Bio::SeqIO::embl, SeqBuilder->add_unwanted_slot can't help > (http://bugzilla.open-bio.org/show_bug.cgi?id=2823). > > Is there another way to get entries one by one with BioPerl? 
> > Brian > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From brianli.cas at gmail.com Thu May 7 00:50:26 2009 From: brianli.cas at gmail.com (brian li) Date: Thu, 7 May 2009 12:50:26 +0800 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> Message-ID: Dear Russell, My example code is as following. I omit the parse process and these lines give me "Segmentation Fault" too. # Start of code my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', -format => 'EMBL'); my $index = 1; while (my $seq = $seqio->next_seq) { print "Dealing with entry: $index\n"; $index++; } # End The platform I run this code on: BioPerl 1.6.0 Perl 5.8.8 Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) I have monitored the memory usage when I run the code above. There is always around 20GB free memory (buffer size counted in) left. So I suppose the segfault can't be explained just by memory shortage. 
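One way to narrow this down without BioPerl in the loop is to split the flat file on the EMBL record terminator (a line containing only `//`) in plain Perl and look at the size of each record. The sketch below is a workaround under that assumption, not something proposed in this thread; next_embl_record is a hypothetical helper name:

```perl
use strict;
use warnings;

# Read one EMBL flat-file record from an open filehandle.
# Records are assumed to end with a line holding only "//".
# Returns the record text, or undef at end of input.
# next_embl_record is a hypothetical helper, not a BioPerl function.
sub next_embl_record {
    my ($fh) = @_;
    local $/ = "\n//\n";               # read up to and including the terminator
    my $record = <$fh>;
    return unless defined $record;
    return if $record !~ /\S/;         # ignore trailing blank input
    return $record;
}

# Usage: report the size of each record so an oversized entry stands out.
# The file name is taken from the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    my $index = 1;
    while (defined(my $record = next_embl_record($fh))) {
        printf "entry %d: %d bytes\n", $index++, length $record;
    }
    close $fh;
}
```

Each record string could then be parsed on its own, for example by opening an in-memory filehandle on it and passing that to Bio::SeqIO->new(-fh => $sfh, -format => 'EMBL'), which at least identifies the entry being parsed when the crash occurs.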
Brian On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell wrote: > Hi Brian, > I hate to say it but it worked OK for me using rel_ann_mus_01_r99.dat.gz and simple example Bio::SeqIO code from bugzilla > It's not using more than 1GB memory on our server and doesn't segfault. > > Send me your example code and I'll give it a go if you like. > > > Russell Smithies > > Bioinformatics Applications Developer > T +64 3 489 9085 > E russell.smithies at agresearch.co.nz > > Invermay Research Centre > Puddle Alley, > Mosgiel, > New Zealand > T +64 3 489 3809 > F +64 3 489 9174 > www.agresearch.co.nz > > From Russell.Smithies at agresearch.co.nz Thu May 7 01:01:13 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Thu, 7 May 2009 17:01:13 +1200 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> Sadly, that's the same code as I ran but I had a Data::Dump in the middle. Versions of Perl and BioPerl are the same. We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM. If you get a full script running on a smaller dataset, I could probably run it on the bigger stuff and give you back tab-separated (or is that tab\tseparated ?) data for loading into your db. --Russell > -----Original Message----- > From: brian li [mailto:brianli.cas at gmail.com] > Sent: Thursday, 7 May 2009 4:50 p.m. > To: Smithies, Russell > Cc: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > Dear Russell, > > My example code is as following. I omit the parse process and these > lines give me "Segmentation Fault" too. 
> > # Start of code > my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', > -format => 'EMBL'); > my $index = 1; > while (my $seq = $seqio->next_seq) > { > print "Dealing with entry: $index\n"; > $index++; > } > # End > > The platform I run this code on: > BioPerl 1.6.0 > Perl 5.8.8 > Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) > > I have monitored the memory usage when I run the code above. There is > always around 20GB free memory (buffer size counted in) left. So I > suppose the segfault can't be explained just by memory shortage. > > Brian > > > On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell > wrote: > > Hi Brian, > > I hate to say it but it worked OK for me using rel_ann_mus_01_r99.dat.gz and > simple example Bio::SeqIO code from bugzilla > > It's not using more than 1GB memory on our server and doesn't segfault. > > > > Send me your example code and I'll give it a go if you like. > > > > > > Russell Smithies > > > > Bioinformatics Applications Developer > > T +64 3 489 9085 > > E russell.smithies at agresearch.co.nz > > > > Invermay Research Centre > > Puddle Alley, > > Mosgiel, > > New Zealand > > T +64 3 489 3809 > > F +64 3 489 9174 > > www.agresearch.co.nz > > > > ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. 
======================================================================= From brianli.cas at gmail.com Thu May 7 01:32:56 2009 From: brianli.cas at gmail.com (brian li) Date: Thu, 7 May 2009 13:32:56 +0800 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> Message-ID: Thank you very much for your offer. The director of our lab wants me to do the extraction every time a new release of EMBL is published. I can't push the task to you every time. I can offer more information of the server I run my script on if needed. -Brian On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell wrote: > Sadly, that's the same code as I ran but I had a Data::Dump in the middle. > Versions of Perl and BioPerl are the same. > We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM > > If you get a full script running on a smaller dataset, I could probably run it on the bigger stuff and give you back tab-separated (or is that tab\tseparated ?) data for loading into your db. > > --Russell > >> -----Original Message----- >> From: brian li [mailto:brianli.cas at gmail.com] >> Sent: Thursday, 7 May 2009 4:50 p.m. >> To: Smithies, Russell >> Cc: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction >> >> Dear Russell, >> >> My example code is as following. I omit the parse process and these >> lines give me "Segmentation Fault" too. >> >> # Start of code >> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', >> -format => 'EMBL'); >> my $index = 1; >> while (my $seq = $seqio->next_seq) >> { >> print "Dealing with entry: $index\n"; >> 
$index++; >> } >> # End >> >> The platform I run this code on: >> BioPerl 1.6.0 >> Perl 5.8.8 >> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) >> >> I have monitored the memory usage when I run the code above. There is >> always around 20GB free memory (buffer size counted in) left. So I >> suppose the segfault can't be explained just by memory shortage. >> >> Brian >> >> >> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell >> wrote: >> > Hi Brian, >> > I hate to say it but it worked OK for me using rel_ann_mus_01_r99.dat.gz and >> simple example Bio::SeqIO code from bugzilla >> > It's not using more than 1GB memory on our server and doesn't segfault. >> > >> > Send me your example code and I'll give it a go if you like. >> > >> > >> > Russell Smithies >> > >> > Bioinformatics Applications Developer >> > T +64 3 489 9085 >> > E russell.smithies at agresearch.co.nz >> > >> > Invermay Research Centre >> > Puddle Alley, >> > Mosgiel, >> > New Zealand >> > T +64 3 489 3809 >> > F +64 3 489 9174 >> > www.agresearch.co.nz >> > >> > > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> ======================================================================= > From Michael.Stubbington at hpa.org.uk Thu May 7 03:53:29 2009 From: Michael.Stubbington at hpa.org.uk (Michael Stubbington) Date: Thu, 7 May 2009 08:53:29 +0100 Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters In-Reply-To: <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com> References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu> <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com> Message-ID: <335635A922FA2B43B35B6ADD7929CC59017B138B@porhpaexc001.HPA.org.uk> Jonathan, Thanks a lot for this advice. It now all works for me. Strangely my cap3 installation is not in /usr/local/bin but everything works fine without me having to change $PROGRAMDIR in cap3.pm Thanks to everyone else involved in this thread for their efforts in improving Bio::Tools::Run::Cap3. Best wishes, Mike ________________________________ From: Jonathan Crabtree [mailto:jonathancrabtree at gmail.com] Sent: 06 May 2009 16:46 To: Kevin Brown Cc: Michael Stubbington; bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters The "new" argument to Cap3 expects an array, not a string. So I think you need to do this: my $cap3Factory = Bio::Tools::Run::Cap3->new('y', '150'); rather than this: my $cap3Factory = Bio::Tools::Run::Cap3->new('y 150'); Otherwise it will silently ignore the parameter. There are also several problems with the Cap3 module itself, at least the version shown here: http://cpansearch.perl.org/src/CJFIELDS/BioPerl-run-1.6.1/Bio/Tools/Run/ Cap3.pm Those problems are: 1. "y" is not in the PARAMS array, as Brian and Kevin have noted 2. $PROGRAMDIR appears to be hard-coded to /usr/local/bin (OK if that's where your cap3 is installed) 3. The run() method does this: my $commandstring = $exe . $param_string . 
" $infilename1"; but at least for the version of cap3 I'm using, you need to put the $param_string _after_ the $infilename1 for it to work. Once all these things are corrected it worked for me and correctly passed the -y 150 to cap3 when new() was called as shown above. Jonathan On Wed, May 6, 2009 at 11:23 AM, Kevin Brown wrote: BEGIN { @PARAMS = qw(a b c d e f g m n o p s u v x); $PROGRAMDIR = '/usr/local/bin'; # Authorize attribute fields foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++; } That is the list of params that Cap3 will accept in the BioPerl module. I'm guessing if you add the y to that list that it might work. > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Michael Stubbington > Sent: Wednesday, May 06, 2009 7:39 AM > To: bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters > > Dear all, > > > > I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly > program. I have some reads that will only assemble if cap3 is > used with > the '-y 150' option. This is fine from the command line but I > can't work > out how to pass this option to the Cap3 factory object in my script. > > > > If I do the following > > > > my $params = "y 150" ; > > my $cap3Factory = Bio::Tools::Run::Cap3->new($params); > > my $assembly = $cap3Factory->run($file); > > > > Then I get an exception as follows: > > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > > MSG: Unallowed parameter: y ! 
> > STACK: Error::throw > > STACK: Bio::Root::Root::throw > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357 > > STACK: Bio::Tools::Run::Cap3::AUTOLOAD > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116 > > STACK: Bio::Tools::Run::Cap3::new > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101 > > STACK: /Users/mike/perlScripts/QGenotype.pl:150 > > > > If I don't try to pass any parameters to Cap3 it runs fine but just > fails to assemble the reads that need the -y 150 flag. > > > > I'd very much appreciate any help with this. I'm pretty new > to bioperl, > hope I haven't missed anything obvious! > > > > Thanks in advance, > > > > Mike > > > > -------------------------------------------------------------- > ---------- > ---- > > Mike Stubbington > > Novel and Dangerous Pathogens > > Health Protection Agency > > Centre for Emergency Preparedness and Response > > Porton Down > > Salisbury > > SP4 0JG > > > > Tel: +44 1980 619812 > > > > > > ----------------------------------------- > ************************************************************** > ************ > The information contained in the EMail and any attachments is > confidential and intended solely and for the attention and use of > the named addressee(s). It may not be disclosed to any other person > without the express authority of the HPA, or the intended > recipient, or both. If you are not the intended recipient, you must > not disclose, copy, distribute or retain this message or any part > of it. This footnote also confirms that this EMail has been swept > for computer viruses, but please re-sweep any attachments before > opening or saving. 
HTTP://www.HPA.org.uk > ************************************************************** > ************ > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l ----------------------------------------- ************************************************************************** The information contained in the EMail and any attachments is confidential and intended solely and for the attention and use of the named addressee(s). It may not be disclosed to any other person without the express authority of the HPA, or the intended recipient, or both. If you are not the intended recipient, you must not disclose, copy, distribute or retain this message or any part of it. This footnote also confirms that this EMail has been swept for computer viruses, but please re-sweep any attachments before opening or saving. HTTP://www.HPA.org.uk ************************************************************************** From webb.daniel at yahoo.com Thu May 7 04:28:44 2009 From: webb.daniel at yahoo.com (Daniel Webb) Date: Thu, 7 May 2009 01:28:44 -0700 (PDT) Subject: [Bioperl-l] retrieving gene sequence given protein id Message-ID: <36212.82053.qm@web45511.mail.sp1.yahoo.com> Awesome! 
Thank you all for replying :)

--- On Wed, 5/6/09, Smithies, Russell wrote:

From: Smithies, Russell
Subject: RE: [Bioperl-l] retrieving gene sequence given protein id
To: "'Daniel Webb'" , "'bioperl-l at lists.open-bio.org'"
Date: Wednesday, May 6, 2009, 11:56 PM

Hi Daniel,

You should be able to do it with Bio::DB::EUtilities
http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook

Or use wget and manually link from protein (gi2088631) to gene:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Link&LinkName=protein_gene&from_uid=2088631
then link gene to nucleotide:
http://www.ncbi.nlm.nih.gov/sites/entrez?db=gene&cmd=Link&LinkName=gene_nuccore&from_uid=282375

Or use NCBI eUtils
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=coursework?=eutils

Or build a pipeline:
http://www.ncbi.nlm.nih.gov/Class/wheeler/eutils/epipe.html

NCBI is like Perl, there's always more than one way to do it? :-)

--Russell Smithies

Bioinformatics Applications Developer
T +64 3 489 9085
E russell.smithies at agresearch.co.nz

Invermay Research Centre
Puddle Alley, Mosgiel, New Zealand
T +64 3 489 3809   F +64 3 489 9174
www.agresearch.co.nz

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Daniel Webb
> Sent: Wednesday, 6 May 2009 11:22 p.m.
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] retrieving gene sequence given protein id
>
>
> Hi all,
>
> is there a script or a module with which I could, given the list of protein gi
> or accessions, retrieve corresponding genes from Entrez Gene/GenBank? What I
> would like is sequence of the whole gene in fasta format, with all the introns
> and UTRs.
> I would be grateful for any help
>
> Dan
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From cjfields at illinois.edu Thu May 7 08:07:54 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 7 May 2009 07:07:54 -0500
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To:
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz>
Message-ID: <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>

I noticed that Russell has 16GB RAM on his setup. Was yours equivalent?

chris

On May 7, 2009, at 12:32 AM, brian li wrote:

> Thank you very much for your offer.
>
> The director of our lab wants me to do the extraction every time a new
> release of EMBL is published. I can't push the task to you every time.
>
> I can offer more information of the server I run my script on if
> needed.
>
> -Brian
>
> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
> wrote:
>> Sadly, that's the same code as I ran but I had a Data::Dump in the
>> middle.
>> Versions of Perl and BioPerl are the same.
>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM >> >> If you get a full script running on a smaller dataset, I could >> probably run it on the bigger stuff and give you back tab-separated >> (or is that tab\tseparated ?) data for loading into your db. >> >> --Russell >> >>> -----Original Message----- >>> From: brian li [mailto:brianli.cas at gmail.com] >>> Sent: Thursday, 7 May 2009 4:50 p.m. >>> To: Smithies, Russell >>> Cc: bioperl-l at lists.open-bio.org >>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction >>> >>> Dear Russell, >>> >>> My example code is as following. I omit the parse process and these >>> lines give me "Segmentation Fault" too. >>> >>> # Start of code >>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', >>> -format => 'EMBL'); >>> my $index = 1; >>> while (my $seq = $seqio->next_seq) >>> { >>> print "Dealing with entry: $index\n"; >>> $index++; >>> } >>> # End >>> >>> The platform I run this code on: >>> BioPerl 1.6.0 >>> Perl 5.8.8 >>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) >>> >>> I have monitored the memory usage when I run the code above. There >>> is >>> always around 20GB free memory (buffer size counted in) left. So I >>> suppose the segfault can't be explained just by memory shortage. >>> >>> Brian >>> >>> >>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell >>> wrote: >>>> Hi Brian, >>>> I hate to say it but it worked OK for me using >>>> rel_ann_mus_01_r99.dat.gz and >>> simple example Bio::SeqIO code from bugzilla >>>> It's not using more than 1GB memory on our server and doesn't >>>> segfault. >>>> >>>> Send me your example code and I'll give it a go if you like. 
>>>>
>>>>
>>>> Russell Smithies
>>>>
>>>> Bioinformatics Applications Developer
>>>> T +64 3 489 9085
>>>> E russell.smithies at agresearch.co.nz
>>>>
>>>> Invermay Research Centre
>>>> Puddle Alley,
>>>> Mosgiel,
>>>> New Zealand
>>>> T +64 3 489 3809
>>>> F +64 3 489 9174
>>>> www.agresearch.co.nz
>>>>
>>>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From brianli.cas at gmail.com Thu May 7 08:59:59 2009
From: brianli.cas at gmail.com (brian li)
Date: Thu, 7 May 2009 20:59:59 +0800
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
Message-ID:

My server has 32 GB RAM.

The os of my server is 64-bit version of Ubuntu Server Edition 8.04
LTS. And I have run my example code on another server with 32-bit
version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
-Brian

On Thu, May 7, 2009 at 8:07 PM, Chris Fields wrote:
> I noticed that Russell has 16GB RAM on his setup. Was yours equivalent?
>
> chris
>
> On May 7, 2009, at 12:32 AM, brian li wrote:
>
>> Thank you very much for your offer.
>>
>> The director of our lab wants me to do the extraction every time a new
>> release of EMBL is published. I can't push the task to you every time.
>>
>> I can offer more information of the server I run my script on if needed.
>>
>> -Brian
>>
>> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell
>> wrote:
>>>
>>> Sadly, that's the same code as I ran but I had a Data::Dump in the
>>> middle.
>>> Versions of Perl and BioPerl are the same.
>>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>>>
>>> If you get a full script running on a smaller dataset, I could probably
>>> run it on the bigger stuff and give you back tab-separated (or is that
>>> tab\tseparated ?) data for loading into your db.
>>>
>>> --Russell
>>>
>>>> -----Original Message-----
>>>> From: brian li [mailto:brianli.cas at gmail.com]
>>>> Sent: Thursday, 7 May 2009 4:50 p.m.
>>>> To: Smithies, Russell
>>>> Cc: bioperl-l at lists.open-bio.org
>>>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>>>>
>>>> Dear Russell,
>>>>
>>>> My example code is as following. I omit the parse process and these
>>>> lines give me "Segmentation Fault" too.
>>>>
>>>> # Start of code
>>>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat',
>>>>                             -format => 'EMBL');
>>>> my $index = 1;
>>>> while (my $seq = $seqio->next_seq)
>>>> {
>>>>    print "Dealing with entry: $index\n";
>>>>    $index++;
>>>> }
>>>> # End
>>>>
>>>> The platform I run this code on:
>>>> BioPerl 1.6.0
>>>> Perl 5.8.8
>>>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>>>>
>>>> I have monitored the memory usage when I run the code above. There is
>>>> always around 20GB free memory (buffer size counted in) left.
So I
>>>> suppose the segfault can't be explained just by memory shortage.
>>>>
>>>> Brian
>>>>
>>>>
>>>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell
>>>> wrote:
>>>>>
>>>>> Hi Brian,
>>>>> I hate to say it but it worked OK for me using
>>>>> rel_ann_mus_01_r99.dat.gz and
>>>>
>>>> simple example Bio::SeqIO code from bugzilla
>>>>>
>>>>> It's not using more than 1GB memory on our server and doesn't segfault.
>>>>>
>>>>> Send me your example code and I'll give it a go if you like.
>>>>>
>>>>>
>>>>> Russell Smithies
>>>>>
>>>>> Bioinformatics Applications Developer
>>>>> T +64 3 489 9085
>>>>> E russell.smithies at agresearch.co.nz
>>>>>
>>>>> Invermay Research Centre
>>>>> Puddle Alley,
>>>>> Mosgiel,
>>>>> New Zealand
>>>>> T +64 3 489 3809
>>>>> F +64 3 489 9174
>>>>> www.agresearch.co.nz
>>>>>
>>>>>
>>> =======================================================================
>>> Attention: The information contained in this message and/or attachments
>>> from AgResearch Limited is intended only for the persons or entities
>>> to which it is addressed and may contain confidential and/or privileged
>>> material. Any review, retransmission, dissemination or other use of, or
>>> taking of any action in reliance upon, this information by persons or
>>> entities other than the intended recipients is prohibited by AgResearch
>>> Limited. If you have received this message in error, please notify the
>>> sender immediately.
>>> =======================================================================
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From jonathancrabtree at gmail.com Thu May 7 10:20:08 2009
From: jonathancrabtree at gmail.com (Jonathan Crabtree)
Date: Thu, 7 May 2009 10:20:08 -0400
Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
In-Reply-To: <335635A922FA2B43B35B6ADD7929CC59017B138B@porhpaexc001.HPA.org.uk>
References: <335635A922FA2B43B35B6ADD7929CC59017B12E8@porhpaexc001.HPA.org.uk> <1A4207F8295607498283FE9E93B775B405F7F7C0@EX02.asurite.ad.asu.edu> <8e5b8bf80905060845q59e91a2l8b84f4839de5065f@mail.gmail.com> <335635A922FA2B43B35B6ADD7929CC59017B138B@porhpaexc001.HPA.org.uk>
Message-ID: <8e5b8bf80905070720j837b842x32a8b0b3d924c544@mail.gmail.com>

No problem. With respect to the location of cap3, I was a bit quick to
pass judgment and didn't look at exactly what WrapperBase does: it first
tries the hard-coded directory location from the Cap3 module, and then
falls back to Bio::Root::IO::exists_exe, which searches your PATH for the
executable, provided that File::Spec can be loaded.

Jonathan

On Thu, May 7, 2009 at 3:53 AM, Michael Stubbington <
Michael.Stubbington at hpa.org.uk> wrote:

> Jonathan,
>
> Thanks a lot for this advice. It now all works for me.
>
> Strangely my cap3 installation is not in /usr/local/bin but everything
> works fine without me having to change $PROGRAMDIR in cap3.pm
>
> Thanks to everyone else involved in this thread for their efforts in
> improving Bio::Tools::Run::Cap3.
>
> Best wishes,
>
> Mike
>
> ------------------------------
>
> *From:* Jonathan Crabtree [mailto:jonathancrabtree at gmail.com]
> *Sent:* 06 May 2009 16:46
> *To:* Kevin Brown
> *Cc:* Michael Stubbington; bioperl-l at lists.open-bio.org
> *Subject:* Re: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters
>
>
> The "new" argument to Cap3 expects an array, not a string. So I think you
> need to do this:
>
> my $cap3Factory = Bio::Tools::Run::Cap3->new('y', '150');
>
> rather than this:
>
> my $cap3Factory = Bio::Tools::Run::Cap3->new('y 150');
>
> Otherwise it will silently ignore the parameter. There are also several
> problems with the Cap3 module itself, at least the version shown here:
>
> http://cpansearch.perl.org/src/CJFIELDS/BioPerl-run-1.6.1/Bio/Tools/Run/Cap3.pm
>
> Those problems are:
>
> 1. "y" is not in the PARAMS array, as Brian and Kevin have noted
> 2. $PROGRAMDIR appears to be hard-coded to /usr/local/bin (OK if that's
> where your cap3 is installed)
> 3. The run() method does this:
>
> my $commandstring = $exe . $param_string . " $infilename1";
>
> but at least for the version of cap3 I'm using, you need to put the
> $param_string _after_ the $infilename1 for it to work. Once all these
> things are corrected it worked for me and correctly passed the -y 150 to
> cap3 when new() was called as shown above.
>
> Jonathan
>
> On Wed, May 6, 2009 at 11:23 AM, Kevin Brown
> wrote:
>
> BEGIN {
>
> @PARAMS = qw(a b c d e f g m n o p s u v x);
> $PROGRAMDIR = '/usr/local/bin';
>
> # Authorize attribute fields
> foreach my $attr (@PARAMS) { $OK_FIELD{$attr}++;
>
> }
>
> That is the list of params that Cap3 will accept in the BioPerl module.
> I'm guessing if you add the y to that list that it might work.
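
A minimal standalone sketch of the whitelist check discussed above. This is not the code in Cap3.pm: it only assumes that the module authorizes parameter names from @PARAMS in the way the quoted BEGIN block suggests, and the helpers make_whitelist and check_param are hypothetical names used for illustration. It shows why 'y' is rejected until it is added to the list, and why the name has to arrive as its own list element ('y', '150') rather than inside one string ('y 150').

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parameter letters copied from the BEGIN block quoted above.
my @PARAMS = qw(a b c d e f g m n o p s u v x);

# Build a lookup hash keyed by allowed parameter name (hypothetical helper).
sub make_whitelist {
    my %ok;
    $ok{$_}++ for @_;
    return \%ok;
}

# Return 1 if $name is an authorized parameter, 0 otherwise (hypothetical helper).
sub check_param {
    my ($ok, $name) = @_;
    return exists $ok->{$name} ? 1 : 0;
}

my $stock   = make_whitelist(@PARAMS);        # list as shipped
my $patched = make_whitelist(@PARAMS, 'y');   # list with 'y' added

print "stock list accepts 'y':       ", check_param($stock,   'y'),     "\n";
print "patched list accepts 'y':     ", check_param($patched, 'y'),     "\n";
print "patched list accepts 'y 150': ", check_param($patched, 'y 150'), "\n";
```

The last line illustrates the list-versus-string point: a single string 'y 150' never matches the key 'y', so the name and its value need to be passed as separate arguments.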
> > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org > > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > > Michael Stubbington > > Sent: Wednesday, May 06, 2009 7:39 AM > > To: bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] Bio::Tools::Run::Cap3 - Parameters > > > > Dear all, > > > > > > > > I am using the Bio::Tools::Run::Cap3 wrapper to the Cap3 assembly > > program. I have some reads that will only assemble if cap3 is > > used with > > the '-y 150' option. This is fine from the command line but I > > can't work > > out how to pass this option to the Cap3 factory object in my script. > > > > > > > > If I do the following > > > > > > > > my $params = "y 150" ; > > > > my $cap3Factory = Bio::Tools::Run::Cap3->new($params); > > > > my $assembly = $cap3Factory->run($file); > > > > > > > > Then I get an exception as follows: > > > > > > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > > > > MSG: Unallowed parameter: y ! > > > > STACK: Error::throw > > > > STACK: Bio::Root::Root::throw > > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Root/Root.pm:357 > > > > STACK: Bio::Tools::Run::Cap3::AUTOLOAD > > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:116 > > > > STACK: Bio::Tools::Run::Cap3::new > > /Users/mike/lib/perl5/site_perl/5.8.9/Bio/Tools/Run/Cap3.pm:101 > > > > STACK: /Users/mike/perlScripts/QGenotype.pl:150 > > > > > > > > If I don't try to pass any parameters to Cap3 it runs fine but just > > fails to assemble the reads that need the -y 150 flag. > > > > > > > > I'd very much appreciate any help with this. I'm pretty new > > to bioperl, > > hope I haven't missed anything obvious! 
> >
> > Thanks in advance,
> >
> > Mike
> >
> > --------------------------------------------------------------
> > ----------
> > ----
> >
> > Mike Stubbington
> > Novel and Dangerous Pathogens
> > Health Protection Agency
> > Centre for Emergency Preparedness and Response
> > Porton Down
> > Salisbury
> > SP4 0JG
> >
> > Tel: +44 1980 619812
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From wgallin at ualberta.ca Thu May 7 16:01:58 2009
From: wgallin at ualberta.ca (Warren Gallin)
Date: Thu, 7 May 2009 14:01:58 -0600
Subject: [Bioperl-l] Appending efetch results to a file
Message-ID: <9325A632-C366-456B-AF94-604E77A1F9AF@ualberta.ca>

Hi,

I am having trouble with a script that was working a few months ago,
but has started giving unexpected results.
I need to request 100's of records, and to avoid stressing the Entrez
server I do my fetching inside a loop that increments the -retstart
parameter in the factory. This should append the fetched records to the
file that I am using to collect all the records, but instead it is
replacing the file.

How can I make the get_Response append to an existing file instead of
overwriting it?

Warren Gallin

From Russell.Smithies at agresearch.co.nz Thu May 7 17:24:53 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Fri, 8 May 2009 09:24:53 +1200
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To:
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>

I'm not sure if this will help with your problem or how it deals with
memory management but using "ordinary" Perl to split the large EMBL
file might work.
Give this a go:

============================
#!perl -w

use strict;
use Bio::SeqIO;
use IO::String;

use constant SEP => "//\n";    # EMBL record terminator

open(my $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |")
    or die "Cannot start gunzip: $!";

my $index = 1;

# Test the record itself for end of input: an IO::String object is
# always true, so looping on the handle would never terminate.
while (defined(my $record = get_next_record($fh))) {

    my $stringfh = IO::String->new($record);
    my $seqio = Bio::SeqIO->new( -fh => $stringfh, -format => "EMBL" ) or die $!;

    while ( my $seq_object = $seqio->next_seq ) {
        print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";

        # show the features
        for my $feat_object ($seq_object->get_SeqFeatures) {
            print "primary tag: ", $feat_object->primary_tag, "\n";
            for my $tag ($feat_object->get_all_tags) {
                print " tag: ", $tag, "\n";
                for my $value ($feat_object->get_tag_values($tag)) {
                    print " value: ", $value, "\n";
                }
            }
        }
    }

}
close($fh);

# Read one EMBL record, up to and including the next "//\n" line.
sub get_next_record{
    my($fh) = @_;
    local $/ = SEP;    # restored automatically when the sub returns
    my $record = <$fh>;
    return $record;
}
========================================

--Russell

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of brian li
> Sent: Friday, 8 May 2009 1:00 a.m.
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> My server has 32 GB RAM.
>
> The os of my server is 64-bit version of Ubuntu Server Edition 8.04
> LTS. And I have run my example code on another server with 32-bit
> version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
>
> -Brian
>
> On Thu, May 7, 2009 at 8:07 PM, Chris Fields wrote:
> > I noticed that Russell has 16GB RAM on his setup. Was yours equivalent?
> >
> > chris
> >
> > On May 7, 2009, at 12:32 AM, brian li wrote:
> >
> >> Thank you very much for your offer.
> >>
> >> The director of our lab wants me to do the extraction every time a new
> >> release of EMBL is published. I can't push the task to you every time.
> >>
> >> I can offer more information of the server I run my script on if needed.
> >> > >> -Brian > >> > >> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell > >> wrote: > >>> > >>> Sadly, that's the same code as I ran but I had a Data::Dump in the > >>> middle. > >>> Versions of Perl and BioPerl are the same. > >>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM > >>> > >>> If you get a full script running on a smaller dataset, I could probably > >>> run it on the bigger stuff and give you back tab-separated (or is that > >>> tab\tseparated ?) data for loading into your db. > >>> > >>> --Russell > >>> > >>>> -----Original Message----- > >>>> From: brian li [mailto:brianli.cas at gmail.com] > >>>> Sent: Thursday, 7 May 2009 4:50 p.m. > >>>> To: Smithies, Russell > >>>> Cc: bioperl-l at lists.open-bio.org > >>>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > >>>> > >>>> Dear Russell, > >>>> > >>>> My example code is as following. I omit the parse process and these > >>>> lines give me "Segmentation Fault" too. > >>>> > >>>> # Start of code > >>>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', > >>>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? -format => 'EMBL'); > >>>> my $index = 1; > >>>> while (my $seq = $seqio->next_seq) > >>>> { > >>>> ? ?print "Dealing with entry: $index\n"; > >>>> ? ?$index++; > >>>> } > >>>> # End > >>>> > >>>> The platform I run this code on: > >>>> BioPerl 1.6.0 > >>>> Perl 5.8.8 > >>>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) > >>>> > >>>> I have monitored the memory usage when I run the code above. There is > >>>> always around 20GB free memory (buffer size counted in) left. So I > >>>> suppose the segfault can't be explained just by memory shortage. 
> >>>> > >>>> Brian > >>>> > >>>> > >>>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell > >>>> wrote: > >>>>> > >>>>> Hi Brian, > >>>>> I hate to say it but it worked OK for me using > >>>>> rel_ann_mus_01_r99.dat.gz and > >>>> > >>>> simple example Bio::SeqIO code from bugzilla > >>>>> > >>>>> It's not using more than 1GB memory on our server and doesn't segfault. > >>>>> > >>>>> Send me your example code and I'll give it a go if you like. > >>>>> > >>>>> > >>>>> Russell Smithies > >>>>> > >>>>> Bioinformatics Applications Developer > >>>>> T +64 3 489 9085 > >>>>> E ?russell.smithies at agresearch.co.nz > >>>>> > >>>>> Invermay ?Research Centre > >>>>> Puddle Alley, > >>>>> Mosgiel, > >>>>> New Zealand > >>>>> T ?+64 3 489 3809 > >>>>> F ?+64 3 489 9174 > >>>>> www.agresearch.co.nz > >>>>> > >>>>> > >>> ======================================================================= > >>> Attention: The information contained in this message and/or attachments > >>> from AgResearch Limited is intended only for the persons or entities > >>> to which it is addressed and may contain confidential and/or privileged > >>> material. Any review, retransmission, dissemination or other use of, or > >>> taking of any action in reliance upon, this information by persons or > >>> entities other than the intended recipients is prohibited by AgResearch > >>> Limited. If you have received this message in error, please notify the > >>> sender immediately. 
> >>> =======================================================================
> >>>
> >>
> >> _______________________________________________
> >> Bioperl-l mailing list
> >> Bioperl-l at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From jason at bioperl.org Thu May 7 17:54:39 2009
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 7 May 2009 14:54:39 -0700
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>
Message-ID: <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>

Russell - I am not sure how that will help as only 1 sequence is parsed
at a time by SeqIO parsers and they use the "//" delimiter.

If the equivalent data exists in genbank format at NCBI I think _that_
module (Bio::SeqIO::genbank) has the ability to ignore
annotations/features.

Really we have to re-work the whole thing to be more lightweight and
lazy-parse.

-jason

On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:

> I'm not sure if this will help with your problem or how it deals
> with memory management but using "ordinary" Perl to split the large
> EMBL file might work.
> Give this a go: > > ============================ > #!perl -w > > use Bio::SeqIO; > use IO::String; > > use constant SEP => "//\n"; > > open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die; > > my $index = 1; > > while(my $stringfh = new IO::String(get_next_record($fh))){ > > my $seqio = Bio::SeqIO->new( -fh => $stringfh,-format => > "EMBL" ) or die $!; > > while ( my $seq_object = $seqio->next_seq ) { > print "Dealing with entry: ".$index++."\t".$seq_object->id."\n"; > > # show the features > for my $feat_object ($seq_object->get_SeqFeatures) { > print "primary tag: ", $feat_object->primary_tag, "\n"; > for my $tag ($feat_object->get_all_tags) { > print " tag: ", $tag, "\n"; > for my $value ($feat_object->get_tag_values($tag)) { > print " value: ", $value, "\n"; > } > } > } > } > > } > > > sub get_next_record{ > my($fh) = @_; > (my $old_sep,$/) = ($/,SEP); > my $record = <$fh>; > $/ = $old_sep; > return $record; > } > ======================================== > > > --Russell > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of brian li >> Sent: Friday, 8 May 2009 1:00 a.m. >> To: Chris Fields >> Cc: bioperl-l at lists.open-bio.org >> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction >> >> My server has 32 GB RAM. >> >> The os of my server is 64-bit version of Ubuntu Server Edition 8.04 >> LTS. And I have run my example code on another server with 32-bit >> version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again. >> >> -Brian >> >> On Thu, May 7, 2009 at 8:07 PM, Chris Fields >> wrote: >>> I noticed that Russell has 16GB RAM on his setup. Was yours >>> equivalent? >>> >>> chris >>> >>> On May 7, 2009, at 12:32 AM, brian li wrote: >>> >>>> Thank you very much for your offer. >>>> >>>> The director of our lab wants me to do the extraction every time >>>> a new >>>> release of EMBL is published. I can't push the task to you every >>>> time. 
>>>> >>>> I can offer more information of the server I run my script on if >>>> needed. >>>> >>>> -Brian >>>> >>>> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell >>>> wrote: >>>>> >>>>> Sadly, that's the same code as I ran but I had a Data::Dump in the >>>>> middle. >>>>> Versions of Perl and BioPerl are the same. >>>>> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM >>>>> >>>>> If you get a full script running on a smaller dataset, I could >>>>> probably >>>>> run it on the bigger stuff and give you back tab-separated (or >>>>> is that >>>>> tab\tseparated ?) data for loading into your db. >>>>> >>>>> --Russell >>>>> >>>>>> -----Original Message----- >>>>>> From: brian li [mailto:brianli.cas at gmail.com] >>>>>> Sent: Thursday, 7 May 2009 4:50 p.m. >>>>>> To: Smithies, Russell >>>>>> Cc: bioperl-l at lists.open-bio.org >>>>>> Subject: Re: [Bioperl-l] Asking for advice on full EMBL >>>>>> extraction >>>>>> >>>>>> Dear Russell, >>>>>> >>>>>> My example code is as following. I omit the parse process and >>>>>> these >>>>>> lines give me "Segmentation Fault" too. >>>>>> >>>>>> # Start of code >>>>>> my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', >>>>>> -format => 'EMBL'); >>>>>> my $index = 1; >>>>>> while (my $seq = $seqio->next_seq) >>>>>> { >>>>>> print "Dealing with entry: $index\n"; >>>>>> $index++; >>>>>> } >>>>>> # End >>>>>> >>>>>> The platform I run this code on: >>>>>> BioPerl 1.6.0 >>>>>> Perl 5.8.8 >>>>>> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) >>>>>> >>>>>> I have monitored the memory usage when I run the code above. >>>>>> There is >>>>>> always around 20GB free memory (buffer size counted in) left. >>>>>> So I >>>>>> suppose the segfault can't be explained just by memory shortage. 
>>>>>> >>>>>> Brian >>>>>> >>>>>> >>>>>> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell >>>>>> wrote: >>>>>>> >>>>>>> Hi Brian, >>>>>>> I hate to say it but it worked OK for me using >>>>>>> rel_ann_mus_01_r99.dat.gz and >>>>>> >>>>>> simple example Bio::SeqIO code from bugzilla >>>>>>> >>>>>>> It's not using more than 1GB memory on our server and doesn't >>>>>>> segfault. >>>>>>> >>>>>>> Send me your example code and I'll give it a go if you like. >>>>>>> >>>>>>> >>>>>>> Russell Smithies >>>>>>> >>>>>>> Bioinformatics Applications Developer >>>>>>> T +64 3 489 9085 >>>>>>> E russell.smithies at agresearch.co.nz >>>>>>> >>>>>>> Invermay Research Centre >>>>>>> Puddle Alley, >>>>>>> Mosgiel, >>>>>>> New Zealand >>>>>>> T +64 3 489 3809 >>>>>>> F +64 3 489 9174 >>>>>>> www.agresearch.co.nz >>>>>>> >>>>>>> >>>>> = >>>>> = >>>>> = >>>>> = >>>>> = >>>>> ================================================================== >>>>> Attention: The information contained in this message and/or >>>>> attachments >>>>> from AgResearch Limited is intended only for the persons or >>>>> entities >>>>> to which it is addressed and may contain confidential and/or >>>>> privileged >>>>> material. Any review, retransmission, dissemination or other use >>>>> of, or >>>>> taking of any action in reliance upon, this information by >>>>> persons or >>>>> entities other than the intended recipients is prohibited by >>>>> AgResearch >>>>> Limited. If you have received this message in error, please >>>>> notify the >>>>> sender immediately. 
>>>>> =======================================================================
>>>>>
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org

From Russell.Smithies at agresearch.co.nz Thu May 7 18:05:55 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Fri, 8 May 2009 10:05:55 +1200
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz>

OK, I misunderstood, I thought the entire file was loaded into memory
first then each sequence was extracted from there. I hoped splitting
into 588 individual sequences might help.

--Russell

From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason Stajich
Sent: Friday, 8 May 2009 9:55 a.m.
To: Smithies, Russell
Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction

Russell - I am not sure how that will help as only 1 sequence is parsed
at a time by SeqIO parsers and they use the "//" delimiter.
If the equivalent data exists in genbank format at NCBI I think _that_ module (Bio::SeqIO::genbank) has the ability to ignore annotations/features. Really we have to re-work the whole thing to be more lightweight and lazy-parse. -jason On May 7, 2009, at 2:24 PM, Smithies, Russell wrote: I'm not sure if this will help with your problem or how it deals with memory management but using "ordinary" Perl to split the large EMBL file might work. Give this a go: ============================ #!perl -w use Bio::SeqIO; use IO::String; use constant SEP => "//\n"; open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die; my $index = 1; while(my $stringfh = new IO::String(get_next_record($fh))){ my $seqio = Bio::SeqIO->new( -fh => $stringfh,-format => "EMBL" ) or die $!; while ( my $seq_object = $seqio->next_seq ) { print "Dealing with entry: ".$index++."\t".$seq_object->id."\n"; # show the features for my $feat_object ($seq_object->get_SeqFeatures) { print "primary tag: ", $feat_object->primary_tag, "\n"; for my $tag ($feat_object->get_all_tags) { print " tag: ", $tag, "\n"; for my $value ($feat_object->get_tag_values($tag)) { print " value: ", $value, "\n"; } } } } } sub get_next_record{ my($fh) = @_; (my $old_sep,$/) = ($/,SEP); my $record = <$fh>; $/ = $old_sep; return $record; } ======================================== --Russell -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- bounces at lists.open-bio.org] On Behalf Of brian li Sent: Friday, 8 May 2009 1:00 a.m. To: Chris Fields Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction My server has 32 GB RAM. The os of my server is 64-bit version of Ubuntu Server Edition 8.04 LTS. And I have run my example code on another server with 32-bit version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again. -Brian On Thu, May 7, 2009 at 8:07 PM, Chris Fields > wrote: I noticed that Russell has 16GB RAM on his setup. 
Was yours equivalent? chris On May 7, 2009, at 12:32 AM, brian li wrote: Thank you very much for your offer. The director of our lab wants me to do the extraction every time a new release of EMBL is published. I can't push the task to you every time. I can offer more information of the server I run my script on if needed. -Brian On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell > wrote: Sadly, that's the same code as I ran but I had a Data::Dump in the middle. Versions of Perl and BioPerl are the same. We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM If you get a full script running on a smaller dataset, I could probably run it on the bigger stuff and give you back tab-separated (or is that tab\tseparated ?) data for loading into your db. --Russell -----Original Message----- From: brian li [mailto:brianli.cas at gmail.com] Sent: Thursday, 7 May 2009 4:50 p.m. To: Smithies, Russell Cc: bioperl-l at lists.open-bio.org Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction Dear Russell, My example code is as following. I omit the parse process and these lines give me "Segmentation Fault" too. # Start of code my $seqio = Bio::SeqIO->new(-file => 'rel_ann_mus_01_r99.dat', -format => 'EMBL'); my $index = 1; while (my $seq = $seqio->next_seq) { print "Dealing with entry: $index\n"; $index++; } # End The platform I run this code on: BioPerl 1.6.0 Perl 5.8.8 Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server) I have monitored the memory usage when I run the code above. There is always around 20GB free memory (buffer size counted in) left. So I suppose the segfault can't be explained just by memory shortage. Brian On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell > wrote: Hi Brian, I hate to say it but it worked OK for me using rel_ann_mus_01_r99.dat.gz and simple example Bio::SeqIO code from bugzilla It's not using more than 1GB memory on our server and doesn't segfault. Send me your example code and I'll give it a go if you like. 
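
For readers following the record-splitting idea in this thread, here is a minimal sketch of reading one "//"-terminated record at a time in plain Perl. It assumes nothing about BioPerl: next_record and record_ids are illustrative names, and the two sample records are made up. As noted elsewhere in the thread, splitting the stream this way probably does not reduce what a parser has to build for each individual record; it only shows how to stop cleanly at end of input.

```perl
use strict;
use warnings;

# Read one "//"-terminated record from an open filehandle.
# Returns undef at end of input, so callers can test with defined().
sub next_record {
    my ($fh) = @_;
    local $/ = "//\n";    # record separator, restored on return
    my $record = <$fh>;
    return $record;
}

# Collect the identifier from the ID line of every record in a stream.
sub record_ids {
    my ($fh) = @_;
    my @ids;
    while (defined(my $record = next_record($fh))) {
        next unless $record =~ /^ID\s+(\S+?);?\s/m;
        push @ids, $1;
    }
    return @ids;
}

# Tiny in-memory sample standing in for an EMBL flat file (made-up data).
my $sample = "ID   AAA001; SV 1\nXX\n//\nID   BBB002; SV 1\nXX\n//\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @ids = record_ids($fh);
close $fh;
print join(",", @ids), "\n";
```

Each record string could then be handed to a parser through an in-memory handle, as the script quoted in this thread does with IO::String.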

From wgallin at ualberta.ca Thu May 7 19:00:45 2009
From: wgallin at ualberta.ca (Warren Gallin)
Date: Thu, 7 May 2009 17:00:45 -0600
Subject: [Bioperl-l] More on Eutilities get_Response problem
Message-ID:

Hi,

I am using the get_Response method inside a loop, so I want to iteratively append the retrieved material to a file.

If I pass temp_hold.gb as the file parameter a file called temp_hold.gb is created and that file is successively overwritten as I cycle through the loop.

If I pass >temp_hold.gb as the file parameter a file called temp_hold.gb is created and that file is successively overwritten as I cycle through the loop.

If I pass >>temp_hold.gb as the file parameter a file called >temp_hold.gb (yes, the > is part of the file name) is created and that file is successively overwritten as I cycle through the loop.

Could it be that the way the file parameter is passed in has been slightly broken so it is no longer reading the >> as an indicator to append?

Warren Gallin

From Russell.Smithies at agresearch.co.nz Thu May 7 19:04:52 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Fri, 8 May 2009 11:04:52 +1200
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz>
 <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>
 <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE90A4@exchsth.agresearch.co.nz>

I guess Tie::File is going to do the same thing?
(this works on my 32-bit Windows pc with 2GB RAM but is slow)

--Russell

=====================
#!perl -w
use strict;
use Bio::SeqIO;
use IO::String;
use Tie::File;

tie my @array, 'Tie::File', "rel_ann_mus_01_r99.dat", recsep => "//\n" or die $!;

# $#array is the last index, not the record count
print "loaded " . scalar(@array) . " records\n";

for (my $i = 0; $i < @array; $i++) {
    print "$i\n";
    # Tie::File strips the record separator by default, so put "//" back
    my $record = $array[$i] . "//\n";
    my $seqio = Bio::SeqIO->new(-fh => IO::String->new($record), -format => "EMBL") or die $!;
    # should only be one seq
    my $seq_object = $seqio->next_seq;
    print "Dealing with entry: $i\t" . $seq_object->id . "\n";
}
=====================

From jason at bioperl.org Thu May 7 19:25:16 2009
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 7 May 2009 16:25:16 -0700
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz>
 <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>
 <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
 <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz>
Message-ID: <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org>

It parses from a stream or file, one sequence at a time so it only reads a single sequence out at a time, but it does have to parse that whole sequence record which is where feature rich sequences might be causing problems. I think per your other mention of Tie::File - the whole file is not going into memory so that is not the problem, it is the creation of many objects that it does as it parses the sequence that is likely the problem.
It will read up to the first "//" from that Tie::File anyways, and that becomes an entire string which is then parsed to pull out the relevant features, so you don't gain anything with Tie::File. What would solve it is if the objects could be created and reside in a DB on disk rather than in memory. I'd really enjoy seeing more indexed and hashed data mapped to objects stored on disk when memory requirements call for it, so that very large datasets can be handled more nimbly.

I think there have been several attempts to simplify, but it basically means a dedicated developer to really overhaul or map to a new system. What we've tried to build is a decent API so a new implementation can be done without affecting the 'next_seq' and 'write_seq' API.

Notwithstanding the seeming API confusion caused by _ancient_ decisions on giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq' which return different types -- don't forget that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a sequence as a string as well. So major API changes in general here will in all likelihood create a big split between the branches that will make any new BioPerl not match up well with existing scripts or libraries that use it - hence the reason for no "great realigning" to a completely well-planned out API rather than the organically grown whims of several generations of devs. I say this in jest a bit - I do want to see changes, but I think it really will have to be called something else besides BioPerl to avoid confusion and the fact that a lot of things will break that depend on the current APIs. BioPerl2 or something indicating a Perl6 association.

-jason

On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:

> OK, I misunderstood. I thought the entire file was loaded into
> memory first and each sequence was then extracted from there.
> I hoped splitting into 588 individual sequences might help.
>
> --Russell

From Russell.Smithies at agresearch.co.nz Thu May 7 20:03:58 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Fri, 8 May 2009 12:03:58 +1200
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org>
References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz>
 <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz>
 <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org>
 <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz>
 <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE9104@exchsth.agresearch.co.nz>

I think the problem here though is the size of the sequences rather than too many features.
If one was inclined to bodge/hack and didn't care about sequence, I guess you could filter them out with awk so Bio::SeqIO doesn't have to create the Bio::PrimarySeq :)
Probably breaks the EMBL file spec ...

E.g.

open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |" ) or die;

--Russell
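
The awk pre-filter suggested in this thread can also be written in Perl, which may be convenient when the records are already being read in a Perl loop. The sketch below is only an illustration under stated assumptions: strip_sequence_lines is a made-up helper name, the sample record is invented, and the pattern simply mirrors the awk expression as it appears in the archive (drop the "SQ" line and any line that starts with a space). Whether an EMBL parser accepts a record with the sequence block removed has not been checked here.

```perl
use strict;
use warnings;

# Drop the sequence block from one EMBL-style record: the "SQ" header
# line and every line that begins with a space (the residue lines).
sub strip_sequence_lines {
    my ($record) = @_;
    # split /^/m keeps each line's trailing newline
    return join '', grep { !/^(?:SQ| )/ } split /^/m, $record;
}

# Made-up record, just large enough to exercise the filter.
my $record = "ID   AAA001; SV 1\n"
           . "FT   gene            1..10\n"
           . "SQ   Sequence 10 BP;\n"
           . "     acgtacgtac        10\n"
           . "//\n";

print strip_sequence_lines($record);
```

The ID, FT and "//" lines pass through unchanged, so annotation-only processing sees the same feature table as before.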

From valiente at lsi.upc.edu Fri May 8 08:49:22 2009
From: valiente at lsi.upc.edu (Gabriel Valiente)
Date: Fri, 8 May 2009 21:49:22 +0900
Subject: [Bioperl-l] The Power of R (Chris Fields)
In-Reply-To:
References:
Message-ID: <8F435B66-33CF-467D-8D86-AA8EF2309E98@lsi.upc.edu>

>>> While we're on the topic, can anyone recommend a good book or
>>> resource from which to learn R, to supplement the official docs?

Well, my new book

G. Valiente. Combinatorial Pattern Matching Algorithms in Computational Biology using Perl and R.
Taylor & Francis/CRC Press (2009)
http://www.crcpress.com/product/isbn/9781420063677

is already available. I hope it will also be of much use to BioPerl
developers and users.

Gabriel

From maj at fortinbras.us Fri May 8 08:43:16 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Fri, 8 May 2009 08:43:16 -0400
Subject: [Bioperl-l] More on Eutilities get_Response problem
In-Reply-To:
References:
Message-ID: <2C9B353790344B11A3A5F0ACB1F95830@NewLife>

Hi Warren,

The get_Response function is really a wrapper for LWP::UserAgent::get;
as such, the -file parameter works differently from the usual BioPerl -file.
I agree that this is a bug; it's just not a BioPerl bug. If the behavior of
your script really did change, maybe it did so after an update of
LWP::UserAgent. Anyway, one way to work around this is to use the callback
instead of the file parameter; something like

    my $global_file = 'eutil-dump.txt';
    ...
    $thing->get_Response( -cb => \&_append_file );
    ...
    # called once per chunk of downloaded data; appends it to the dump file
    sub _append_file {
        my ($data, $response_obj, $protocol_obj) = @_;
        open my $fh, '>>', $global_file
            or die "can't open dump file: $!";
        print $fh $data;
        close $fh;
        return;
    }

See http://search.cpan.org/~gaas/libwww-perl-5.826/lib/LWP/UserAgent.pm,
and 'perldoc Bio::DB::EUtilities'.

cheers,
Mark

----- Original Message -----
From: "Warren Gallin"
To: "BioPerl List"
Sent: Thursday, May 07, 2009 7:00 PM
Subject: [Bioperl-l] More on Eutilities get_Response problem

> Hi,
>
> I am using the get_response method inside a loop, so I want to iteratively
> append the retrieved material to a file.
>
> If I pass temp_hold.gb as the file parameter a file called temp_hold.gb is
> created and that file is successively overwritten as I cycle through the
> loop.
>
> If I pass >temp_hold.gb as the file parameter a file called temp_hold.gb is
> created and that file is successively overwritten as I cycle through the
> loop.
>
> If I pass >>temp_hold.gb as the file parameter a file called
> >temp_hold.gb (yes, the > is part of the file name) is created and
> that file is successively overwritten as I cycle through the loop.
>
> Could it be that the way the file parameter is passed in has been slightly
> broken so it is no longer reading the >> as an indicator to append?
>
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From cjfields at illinois.edu Fri May 8 08:27:09 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 May 2009 07:27:09 -0500
Subject: [Bioperl-l] More on Eutilities get_Response problem
In-Reply-To:
References:
Message-ID:

Append was working at some point (on Mac OS X). Just curious, what OS
are you using?

Regardless, there is some hacky code in get_Response to deal with
possible filename issues, but I'll change that to delegate to a
Bio::Root::IO for consistency, just in case. Also, so it's available
as an internal workaround, I'll add in a -fh (filehandle) option.

As a current workaround, you could open a file handle on your own and
just print to it:

    open(my $fh, '>>', 'mydata.gb') or die "can't open mydata.gb: $!";
    # later in a loop
    while (...) {
        # get_Response returns an HTTP::Response object; print its content
        print $fh $eutil->get_Response->content;
    }
    close $fh;

chris

On May 7, 2009, at 6:00 PM, Warren Gallin wrote:

> Hi,
>
> I am using the get_response method inside a loop, so I want to
> iteratively append the retrieved material to a file.
>
> If I pass temp_hold.gb as the file parameter a file called
> temp_hold.gb is created and that file is successively overwritten as
> I cycle through the loop.
>
> If I pass >temp_hold.gb as the file parameter a file called
> temp_hold.gb is created and that file is successively overwritten as
> I cycle through the loop.
> > If I pass >>temp_hold.gb as the file parameter a file called > >temp_hold.gb (yes, the > is part of the file name) is created and > that file is successively overwritten as I cycle through the loop. > > Could it be that the way the file parameter is passed in has been > slightly broken so it is no loner reading the >> as an indicator to > append? > > > Warren Gallin > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From maj at fortinbras.us Fri May 8 09:42:59 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Fri, 8 May 2009 09:42:59 -0400 Subject: [Bioperl-l] Appending efetch results to a file In-Reply-To: <9325A632-C366-456B-AF94-604E77A1F9AF@ualberta.ca> References: <9325A632-C366-456B-AF94-604E77A1F9AF@ualberta.ca> Message-ID: > I need to request 100's of records, and to avoid stress the Entrez > server I do my fetching inside a loop that increments the -retstart > parameter in the factory. This raises a question in my mind: should EUtilities use Bio::WebAgent rather than LWP::UserAgent directly, and doesn't Bio::WebAgent have magical properties that ease the server burden without having to build it into the user code directly? ----- Original Message ----- From: "Warren Gallin" To: "BioPerl List" Sent: Thursday, May 07, 2009 4:01 PM Subject: [Bioperl-l] Appending efetch results to a file > Hi, > > I am having trouble with a script that was working a few months ago, > but has started giving unexpected results. > > I need to request 100's of records, and to avoid stress the Entrez > server I do my fetching inside a loop that increments the -retstart > parameter in the factory. This should append the fetched records to > the file that I am using to collect all the records, but instead it is > replacing the file. How can I make the get_Response append to an > existing file instead of overwriting it? 
>
> Warren Gallin
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From cjfields at illinois.edu Fri May 8 10:22:31 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 May 2009 09:22:31 -0500
Subject: [Bioperl-l] Appending efetch results to a file
In-Reply-To:
References: <9325A632-C366-456B-AF94-604E77A1F9AF@ualberta.ca>
Message-ID:

On May 8, 2009, at 8:42 AM, Mark A. Jensen wrote:

>> I need to request 100's of records, and to avoid stress the Entrez
>> server I do my fetching inside a loop that increments the
>> -retstart parameter in the factory.
>
> This raises a question in my mind: should EUtilities use
> Bio::WebAgent rather
> than LWP::UserAgent directly, and doesn't Bio::WebAgent have
> magical properties that ease the server burden without having to
> build it into the user code directly?

I thought about that originally, but there is a significant difference
between the two agent implementations. Bio::WebAgent is-a
LWP::UserAgent subclass, whereas Bio::DB::GenericWebAgent and its ilk
contain a user agent instance (has-a). I chose the latter course b/c I
favor composition over inheritance, and LWP::UserAgent uses different
named parameter handling than BioPerl (no '-'); Bio::WebAgent code
works around that in the constructor. I would rather do that than risk
running into odd parameter issues down the road. Not to mention, I may
genericize it more in the future to be capable of using SOAP-based
methods, so switching out the ua made more sense in the long run
(still a lot to do on that end).

I haven't discussed this extensively on the list before, but when I
redesigned EUtilities I wanted to separate out the various tasks, e.g.
ua, parser, parameter handling, etc. So, for the specific eutil tools,
parser = Bio::Tools::EUtilities, parameter =
Bio::Tools::EUtilities::EUtilParameters, ua = LWP::UserAgent.
For other DBs one could switch out the relevant bits for DB-specific implementations. Then, Bio::DB::EUtilities basically decorates all three, acts as the traffic cop to get the various bits playing well together, delegates as needed, etc. This'll allow additional components to be added in at later points if needed, and the basic tool can be used for retrieving raw data or as a souped-up agent for retrieving remote data in a new set of modules (Bio::Entrez::*, maybe). There are some experimental bits in there still (repeated requests with the exact same params do not spam eutils, for instance, and there is some 'lazy' code in the parser), but it seems to largely work, and those bits can be removed fairly easily if they prove problematic. chris From brianli.cas at gmail.com Fri May 8 10:48:32 2009 From: brianli.cas at gmail.com (brian li) Date: Fri, 8 May 2009 22:48:32 +0800 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE9104@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz> <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE9104@exchsth.agresearch.co.nz> Message-ID: open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^FT|^CO/{print}' |" works. open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |" segfaults. So it seems the features are causing problems. Although I still don't know how that hurts my os to pop a segfault, my extraction can move on again. Maybe I can find a clue when I know more about my os's memory management strategy. Really appreciate all your help. 
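
For anyone who would rather not depend on awk, the same feature-stripping filter can be sketched in plain Perl. This is only a sketch under stated assumptions: strip_embl_lines is a hypothetical helper (not part of BioPerl), the miniature record is made up for illustration, and nothing here explains the segfault itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): drop every line that starts
# with one of the given EMBL line codes, like awk '!/^FT|^CO/{print}'.
sub strip_embl_lines {
    my ($record, @codes) = @_;
    my $skip = join '|', map { quotemeta($_) } @codes;
    # split /^/m keeps the trailing newline on each line
    return join '', grep { !/^(?:$skip)/ } split /^/m, $record;
}

# Made-up miniature record, for illustration only.
my $record = join '',
    "ID   TEST1; SV 1; linear; genomic DNA; STD; MUS; 10 BP.\n",
    "FT   source          1..10\n",
    "CO   join(AB000001.1:1..10)\n",
    "SQ   Sequence 10 BP;\n",
    "     acgtacgtac        10\n",
    "//\n";

print strip_embl_lines($record, 'FT', 'CO');
```

A filtered record could then be wrapped in an IO::String handle and passed to Bio::SeqIO->new(-fh => ..., -format => 'EMBL'), along the lines of Russell's record-splitting script quoted further down.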
-Brian

On Fri, May 8, 2009 at 8:03 AM, Smithies, Russell wrote:
> I think the problem here though is the size of the sequences rather than too
> many features.
>
> If one was inclined to bodge/hack and didn't care about sequence, I guess
> you could filter them out with awk so Bio::SeqIO doesn't have to create the
> Bio::PrimarySeq :)
>
> Probably breaks the EMBL file spec?
>
> Eg.
>
> open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |"
> ) or die;
>
> --Russell
>
> From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> Stajich
> Sent: Friday, 8 May 2009 11:25 a.m.
> To: Smithies, Russell
> Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> It parses from a stream or file, one sequence at a time so it only reads a
> single sequence out at a time, but it does have to parse that whole sequence
> record which is where feature rich sequences might be causing problems.
>
> I think per your other mention of Tie::File - the whole file is not going
> into memory so that is not the problem, it is the creation of many objects
> that it does as it parses the sequence that is likely the problem. It will
> read up to the first "//" from that Tie::File anyways, that becomes an
> entire string which is then parsed to pull out the relevant features so you
> don't gain anything with Tie::File -- what would be the way to solve it is
> if the objects could be created and reside in a DB on disk rather than
> in-memory. I'd really enjoy seeing more indexed and hashed data to objects
> stored on disk when mem requirements are such so that very large datasets
> can be handled more nimbly.
>
> I think there have been several attempts to simplify, but it basically means
> a dedicated developer to really overhaul or map to a new system.
> What we've tried to build is a decent API so a new implementation can be
> done without affecting the 'next_seq' and 'write_seq' API.
>
> Non-withstanding the seemed API confusion caused by _ancient_ decisions on
> giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq'
> which return different types -- don't forget that Lincoln's Bio::DB::Fasta
> uses the 'seq' method to return a sequence as a string as well so major API
> changes in general here will create in all likelihood a big split between
> the branches that will make any new Bioperl not match up well with existing
> scripts or libraries that use it - hence the reason for no "great
> realigning" to a completely well-planned out API rather than the organically
> grown whims of several generations of devs. I say this in jest a bit - I do
> want to see changes, but I think it really will have to be called something
> else besides BioPerl to avoid confusion and the fact that a lot of things
> will break that depend on the current APIs. BioPerl2 or something
> indicating a Perl6 association.
>
> -jason
>
> On May 7, 2009, at 3:05 PM, Smithies, Russell wrote:
>
> OK, I misunderstood, I thought the entire file loaded was loaded into memory
> first then each sequence was extracted from there.
> I hoped splitting into 588 individual sequences might help.
>
> --Russell
>
> From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason
> Stajich
> Sent: Friday, 8 May 2009 9:55 a.m.
> To: Smithies, Russell
> Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Russell -
>
> I am not sure how that will help as only 1 sequence is parsed at a time by
> SeqIO parsers and they use the "//" delimiter.
>
> If the equivalent data exists in genbank format at NCBI I think _that_
> module (Bio::SeqIO::genbank) has the ability to ignore
> annotations/features.
> Really we have to re-work the whole thing to be more
> lightweight and lazy-parse.
>
> -jason
> On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
>
> I'm not sure if this will help with your problem or how it deals with memory
> management but using "ordinary" Perl to split the large EMBL file might
> work.
> Give this a go:
>
> ============================
> #!perl -w
>
> use Bio::SeqIO;
> use IO::String;
>
> use constant SEP => "//\n";
>
> open(my $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
>
> my $index = 1;
>
> # stop at end of input; IO::String->new(undef) would still be true
> while (defined(my $record = get_next_record($fh))) {
>     my $stringfh = IO::String->new($record);
>     my $seqio = Bio::SeqIO->new(-fh => $stringfh, -format => "EMBL")
>         or die $!;
>
>     while (my $seq_object = $seqio->next_seq) {
>         print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
>
>         # show the features
>         for my $feat_object ($seq_object->get_SeqFeatures) {
>             print "primary tag: ", $feat_object->primary_tag, "\n";
>             for my $tag ($feat_object->get_all_tags) {
>                 print "  tag: ", $tag, "\n";
>                 for my $value ($feat_object->get_tag_values($tag)) {
>                     print "    value: ", $value, "\n";
>                 }
>             }
>         }
>     }
> }
>
> # read one "//"-terminated EMBL record from the pipe
> sub get_next_record {
>     my ($fh) = @_;
>     (my $old_sep, $/) = ($/, SEP);
>     my $record = <$fh>;
>     $/ = $old_sep;
>     return $record;
> }
> ========================================
>
> --Russell
>
> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of
> brian li
> Sent: Friday, 8 May 2009 1:00 a.m.
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> My server has 32 GB RAM.
>
> The os of my server is 64-bit version of Ubuntu Server Edition 8.04
> LTS. And I have run my example code on another server with 32-bit
> version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again.
>
> -Brian
>
> On Thu, May 7, 2009 at 8:07 PM, Chris Fields wrote:
> I noticed that Russell has 16GB RAM on his setup. Was yours equivalent?
>
> chris
>
> On May 7, 2009, at 12:32 AM, brian li wrote:
>
> Thank you very much for your offer.
>
> The director of our lab wants me to do the extraction every time a new
> release of EMBL is published. I can't push the task to you every time.
>
> I can offer more information of the server I run my script on if needed.
>
> -Brian
>
> On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell wrote:
>
> Sadly, that's the same code as I ran but I had a Data::Dump in the
> middle.
> Versions of Perl and BioPerl are the same.
> We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM
>
> If you get a full script running on a smaller dataset, I could probably
> run it on the bigger stuff and give you back tab-separated (or is that
> tab\tseparated ?) data for loading into your db.
>
> --Russell
>
> -----Original Message-----
> From: brian li [mailto:brianli.cas at gmail.com]
> Sent: Thursday, 7 May 2009 4:50 p.m.
> To: Smithies, Russell
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction
>
> Dear Russell,
>
> My example code is as following. I omit the parse process and these
> lines give me "Segmentation Fault" too.
>
> # Start of code
> my $seqio = Bio::SeqIO->new(-file   => 'rel_ann_mus_01_r99.dat',
>                             -format => 'EMBL');
> my $index = 1;
> while (my $seq = $seqio->next_seq)
> {
>   print "Dealing with entry: $index\n";
>   $index++;
> }
> # End
>
> The platform I run this code on:
> BioPerl 1.6.0
> Perl 5.8.8
> Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
>
> I have monitored the memory usage when I run the code above. There is
> always around 20GB free memory (buffer size counted in) left. So I
> suppose the segfault can't be explained just by memory shortage.
>
> Brian
>
> On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell wrote:
>
> Hi Brian,
> I hate to say it but it worked OK for me using
> rel_ann_mus_01_r99.dat.gz and
> simple example Bio::SeqIO code from bugzilla
>
> It's not using more than 1GB memory on our server and doesn't segfault.
>
> Send me your example code and I'll give it a go if you like.
>
> Russell Smithies
>
> Bioinformatics Applications Developer
> T +64 3 489 9085
> E russell.smithies at agresearch.co.nz
>
> Invermay Research Centre
> Puddle Alley,
> Mosgiel,
> New Zealand
> T +64 3 489 3809
> F +64 3 489 9174
> www.agresearch.co.nz
>
> =======================================================================
> Attention: The information contained in this message and/or attachments
> from AgResearch Limited is intended only for the persons or entities
> to which it is addressed and may contain confidential and/or privileged
> material. Any review, retransmission, dissemination or other use of, or
> taking of any action in reliance upon, this information by persons or
> entities other than the intended recipients is prohibited by AgResearch
> Limited. If you have received this message in error, please notify the
> sender immediately.
> =======================================================================
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
> Jason Stajich
> jason at bioperl.org
>

From avilella at gmail.com Fri May 8 12:43:55 2009
From: avilella at gmail.com (Albert Vilella)
Date: Fri, 8 May 2009 17:43:55 +0100
Subject: [Bioperl-l] parsing paml output
In-Reply-To: <1252ab5f0905080844r5664f3d2la7eede658602bcb5@mail.gmail.com>
References: <1252ab5f0905080844r5664f3d2la7eede658602bcb5@mail.gmail.com>
Message-ID: <358f4d650905080943i69016199t4df58bbe4853cff2@mail.gmail.com>

If I remember correctly, there was a way to just parse the output that is
decoupled from running PAML through bioperl.
I am ccing the bioperl mailing list as I've seen people having parsing
issues with some newer versions of PAML.

What version are you trying to parse? Can you attach a small example?

On Fri, May 8, 2009 at 4:44 PM, Irene Newton wrote:

> Hello!
>
> I first want to thank you for contributing your module to the bioperl
> community. Every time I think, "hey, wouldn't it be great if someone coded
> this tool?" It's there! It's much appreciated.
>
> I've been trying to implement it but am confused about one thing: what is
> the main codeml output file that the parser expects?
> I usually work with
> the *.out files or the rst files but when I try either of those as input,
> the parser throws an error:
>
> ------------- EXCEPTION: Bio::Root::NotImplemented -------------
> MSG: Unknown format of PAML output
> STACK: Error::throw
> STACK: Bio::Root::Root::throw
> /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:328
> STACK: Bio::Tools::Phylo::PAML::_parse_summary
> /usr/local/share/perl/5.8.8/Bio/Tools/Phylo/PAML.pm:359
> STACK: Bio::Tools::Phylo::PAML::next_result
> /usr/local/share/perl/5.8.8/Bio/Tools/Phylo/PAML.pm:224
> STACK: ./paml_parser.pl:14
> ----------------------------------------------------------------
>
> Any thoughts? Warm regards,
> Irene
>
> --
> Irene L.G. Newton
> Postdoctoral Fellow
> Tufts University - Microbiology Department
> Jaharis 424
> 136 Harrison Ave.
> Boston, MA 02111
>

From jason at bioperl.org Fri May 8 12:57:54 2009
From: jason at bioperl.org (Jason Stajich)
Date: Fri, 8 May 2009 09:57:54 -0700
Subject: [Bioperl-l] parsing paml output
In-Reply-To: <358f4d650905080943i69016199t4df58bbe4853cff2@mail.gmail.com>
References: <1252ab5f0905080844r5664f3d2la7eede658602bcb5@mail.gmail.com> <358f4d650905080943i69016199t4df58bbe4853cff2@mail.gmail.com>
Message-ID: <00798FDA-02D6-4527-A8F6-904D9C88F617@bioperl.org>

The parsing is decoupled (that is Bio::Tools::Phylo::PAML), and I'm
pretty sure that is where the errors are coming from.

I think a sample report and sample script as a bug report is a good
first step in case this is a simple problem to diagnose.

However we need programmers who want to work on this problem to step
up and help update the module to deal with new variants in the PAML
output file format. I think we've failed to recruit any new developers
to work on supporting the latest PAML so things work(ed) for 3.15 but
I think 4 has new variations in the format that cause it to fall over.
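
Since the parser is described above as working for 3.15 but not necessarily for 4, it may help to check which PAML release produced a report before filing the bug. A minimal sketch, assuming the report carries a banner of the form "CODONML (in paml version X, ...)"; that banner format is an assumption based on typical codeml output, paml_version is a hypothetical helper (not part of Bio::Tools::Phylo::PAML), and the sample banner line is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: return the PAML version string found in a report,
# or undef when no "(in paml version X" banner is present.
sub paml_version {
    my ($text) = @_;
    return $text =~ /\(in\s+paml\s+version\s+([\w.]+)/i ? $1 : undef;
}

# Made-up banner line, for illustration only.
my $report  = "CODONML (in paml version 4.2, March 2009)  example.phy\n";
my $version = paml_version($report);
print defined $version ? "PAML $version\n" : "no PAML banner found\n";
```

Including the detected version alongside the sample report would answer the "what version are you trying to parse?" question up front.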
-jason On May 8, 2009, at 9:43 AM, Albert Vilella wrote: > If I remember correctly, there was a way to just parse the output > that is > decoupled of running PAML through bioperl. > I am ccing the bioperl mailing list as I've seen people having parsing > issues with some newer versions of PAML. > > What version are you trying to parse? Can you attach a small example? > > On Fri, May 8, 2009 at 4:44 PM, Irene Newton > wrote: > >> Hello! >> >> I first want to thank you for contributing your module to the bioperl >> community. Every time I think, "hey, wouldn't it be great if >> someone coded >> this tool?" It's there! It's much appreciated. >> >> I've been trying to implement it but am confused about one thing: >> what is >> the main codeml output file that the parser expects? I usually >> work with >> the *.out files or the rst files but when I try either of those as >> input, >> the parser throws an error: >> >> ------------- EXCEPTION: Bio::Root::NotImplemented ------------- >> MSG: Unknown format of PAML output >> STACK: Error::throw >> STACK: Bio::Root::Root::throw >> /usr/local/share/perl/5.8.8/Bio/Root/Root.pm:328 >> STACK: Bio::Tools::Phylo::PAML::_parse_summary >> /usr/local/share/perl/5.8.8/Bio/Tools/Phylo/PAML.pm:359 >> STACK: Bio::Tools::Phylo::PAML::next_result >> /usr/local/share/perl/5.8.8/Bio/Tools/Phylo/PAML.pm:224 >> STACK: ./paml_parser.pl:14 >> ---------------------------------------------------------------- >> >> Any thoughts? Warm regards, >> Irene >> >> -- >> Irene L.G. Newton >> Postdoctoral Fellow >> Tufts University - Microbiology Department >> Jaharis 424 >> 136 Harrison Ave. 
>> Boston, MA 02111
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org

From cjfields at illinois.edu Fri May 8 13:32:46 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 8 May 2009 12:32:46 -0500
Subject: [Bioperl-l] More on Eutilities get_Response problem
In-Reply-To: <2C9B353790344B11A3A5F0ACB1F95830@NewLife>
References: <2C9B353790344B11A3A5F0ACB1F95830@NewLife>
Message-ID: <39D1D481-EF7F-4948-98FF-3A60DEF29DE5@illinois.edu>

Yes, whoops, forgot about that (slept since then). Yes, the filename
is passed on to LWP::UserAgent,

http://search.cpan.org/~gaas/libwww-perl-5.826/lib/LWP/UserAgent.pm#REQUEST_METHODS

IIRC, there was an odd issue that popped up when passing the filename
on to get(), but I do recall append working correctly at one point.
Let me see what I can work out.

chris

On May 8, 2009, at 7:43 AM, Mark A. Jensen wrote:

> Hi Warren,
>
> The get_Response function is really a wrapper for LWP::UserAgent::get;
> as such, the -file parameter works differently from the usual
> BioPerl -file.
> I agree that this is a bug; it's just not a BioPerl bug. If the
> behavior of your
> script really did change, maybe it did so after an update of
> LWP::UserAgent.
> Anyway, one way to work around this is to use the callback instead
> of the
> file parameter; something like
>
> my $global_file = 'eutil-dump.txt';
> ...
> $thing->get_Response( -cb => \&_append_file );
> ...
> sub _append_file {
>     my ($data, $response_obj, $protocol_obj) = @_;
>     open my $fh, ">>$global_file" or die "can't open dump file: $!";
>     print $fh $data;
>     return;
> }
>
> See http://search.cpan.org/~gaas/libwww-perl-5.826/lib/LWP/UserAgent.pm
> ,
> and 'perldoc Bio::DB::EUtilities'.
> > cheers, > Mark > > > > ----- Original Message ----- From: "Warren Gallin" > > To: "BioPerl List" > Sent: Thursday, May 07, 2009 7:00 PM > Subject: [Bioperl-l] More on Eutilities get_Response problem > > >> Hi, >> >> I am using the get_response method inside a loop, so I want to >> iteratively append the retrieved material to a file. >> >> If I pass temp_hold.gb as the file parameter a file called >> temp_hold.gb is created and that file is successively overwritten >> as I cycle through the loop. >> >> If I pass >temp_hold.gb as the file parameter a file called >> temp_hold.gb is created and that file is successively overwritten >> as I cycle through the loop. >> >> If I pass >>temp_hold.gb as the file parameter a file called >> >temp_hold.gb (yes, the > is part of the file name) is created and >> that file is successively overwritten as I cycle through the loop. >> >> Could it be that the way the file parameter is passed in has been >> slightly broken so it is no loner reading the >> as an indicator >> to append? 
>> >> >> Warren Gallin >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 8 13:45:09 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 8 May 2009 12:45:09 -0500 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz> <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> Message-ID: On May 7, 2009, at 6:25 PM, Jason Stajich wrote: > It parses from a stream or file, one sequence at a time so it only > reads a single sequence out at a time, but it does have to parse > that whole sequence record which is where feature rich sequences > might be causing problems. > > I think per your other mention of Tie::File - the whole file is not > going into memory so that is not the problem, it is the creation of > many objects that it does as it parses the sequence that is likely > the problem. It will read up to the first "//" from that Tie::File > anyways, that becomes an entire string which is then parsed to pull > out the relevant features so you don't gain anything with Tie::File > -- what would be the way to solve it is if the objects could be > created and reside in a DB on disk rather than in-memory. 
I'd > really enjoy seeing more indexed and hashed data to objects stored > on disk when mem requirements are such so that very large datasets > can be handled more nimbly. Or maybe implement some lazy iterator-based methods. We have brought up the subject of the SwissKnife modules here before... > I think there have been several attempts to simplify, but it > basically means a dedicated developer to really overhaul or map to a > new system. What we've tried to build is a decent API so a new > implementation can be done without affecting the 'next_seq' and > 'write_seq' API. > > Non-withstanding the seemed API confusion caused by _ancient_ > decisions on giving function names of Bio::SeqFeatureI 'seq' and > Bio::PrimarySeq 'seq' which return different types -- don't forget > that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a > sequence as a string as well so major API changes in general here > will create in all likelihood a big split between the branches that > will make any new Bioperl not match up well with existing scripts or > libraries that use it - hence the reason for no "great realigning" > to a completely well-planned out API rather than the organically > grown whims of several generations of devs. I say this in jest a > bit - I do want to see changes, but I think it really will have to > be called something else besides BioPerl to avoid confusion and the > fact that a lot of things will break that depend on the current > APIs. BioPerl2 or something indicating a Perl6 association. > > -jason Just thought of this: doesn't the feature iterator in Bio::DB::SeqFeature::Store use next_seq for features? Yikes... Anyway, I think if we set a decent enough deprecation schedule, users would adjust, but that's generally for small changes. Dramatic large-scale changes (such as Moose integration and conversion of interfaces to roles) should be done in a separate project. 
Similarly, as mentioned before, perl6 is a different (yet related)
beast to perl5, and so a bioperl-related project using perl6 shouldn't
be called BioPerl 2.0. The nice aspect of this: we can take what we
like from BioPerl now and refactor it for either project, along the
way making sure only the most critical modules get in.

chris

From raulmendez at cbm.uam.es Fri May 8 12:19:40 2009
From: raulmendez at cbm.uam.es (Raul Mendez Giraldez)
Date: Fri, 08 May 2009 18:19:40 +0200
Subject: [Bioperl-l] How to get coil prediction out of Bio::Tools::Run::Coil modules
Message-ID: <1241799580.6963.165.camel@pepa.cbm.uam.es>

Hi,

I'm trying to get coiled-coil predictions for protein sequences using Bob
Russell's program ncoils, through the bioperl interface
Bio::Tools::Run::Coil, but the only thing I can get from any element of
the features list is the sequence name and a few other not so useful
attributes. I'm running the following script:

    #!/home/rmendez/bin/perl -w
    use strict;
    use FileHandle;
    use Data::Dumper;
    use Bio::Tools::Run::Coil;

    my $seqin = 'filein.fasta';   # path to the input FASTA file
    my $factory = Bio::Tools::Run::Coil->new('-c');
    my @features = $factory->run($seqin);
    print "Printing content of features[0]\n";
    print Dumper $features[0];

----

And the output (the content of the first element of the features array)
is:

    '_gsf_tag_hash' => {
        'percent_id' => [ 'NULL' ],
        'hid' => [ 'ncoils' ],
        'evalue' => [ 0 ]
    },
    '_location' => bless( {
        '_location_type' => 'EXACT',
        '_start' => 138,
        '_end' => 172
    }, 'Bio::Location::Simple' ),
    '_gsf_seq_id' => 'ENSDARP00000084927',
    '_parse_h' => {},
    '_root_cleanup_methods' => [ sub { "DUMMY" } ],
    '_source_tag' => 'Coils',
    '_primary_tag' => 'ncoils',
    '_root_verbose' => 0
    }, 'Bio::SeqFeature::Generic' );

Then how could I get the sequence itself with the coil annotation 'xxx'?
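
One possible way to rebuild an 'xxx'-style masked sequence from the coordinates in such a feature (start 138 and end 172 in the dump) is sketched below. It is a sketch, not the module's documented behaviour: mask_region is a hypothetical helper, the peptide is made up, and it assumes the protein sequence is already available as a plain string (in BioPerl, something like `$seq->seq`), with 1-based coordinates taken from `$feature->start` and `$feature->end`.

```perl
use strict;
use warnings;

# Hypothetical helper: replace residues $start..$end (1-based, inclusive)
# of a sequence string with 'x' to mark a predicted coiled-coil region.
sub mask_region {
    my ($sequence, $start, $end) = @_;
    die "bad coordinates\n"
        if $start < 1 || $end < $start || $end > length($sequence);
    my $len = $end - $start + 1;
    substr($sequence, $start - 1, $len) = 'x' x $len;
    return $sequence;
}

# Made-up 12-residue peptide, masking residues 4 to 8.
print mask_region('MKTAYIAKQRQI', 4, 8), "\n";   # prints MKTxxxxxQRQI
```

With several features for the same sequence, the helper could be applied once per feature to the same string so that every predicted region ends up masked.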
Thanks, Raul From cjfields at illinois.edu Fri May 8 13:45:09 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 8 May 2009 12:45:09 -0500 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> References: <18DF7D20DFEC044098A1062202F5FFF32493CE8E5C@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz> <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> Message-ID: On May 7, 2009, at 6:25 PM, Jason Stajich wrote: > It parses from a stream or file, one sequence at a time so it only > reads a single sequence out at a time, but it does have to parse > that whole sequence record which is where feature rich sequences > might be causing problems. > > I think per your other mention of Tie::File - the whole file is not > going into memory so that is not the problem, it is the creation of > many objects that it does as it parses the sequence that is likely > the problem. It will read up to the first "//" from that Tie::File > anyways, that becomes an entire string which is then parsed to pull > out the relevant features so you don't gain anything with Tie::File > -- what would be the way to solve it is if the objects could be > created and reside in a DB on disk rather than in-memory. I'd > really enjoy seeing more indexed and hashed data to objects stored > on disk when mem requirements are such so that very large datasets > can be handled more nimbly. Or maybe implement some lazy iterator-based methods. We have brought up the subject of the SwissKnife modules here before... 
> I think there have been several attempts to simplify, but it > basically means a dedicated developer to really overhaul or map to a > new system. What we've tried to build is a decent API so a new > implementation can be done without affecting the 'next_seq' and > 'write_seq' API. > > Non-withstanding the seemed API confusion caused by _ancient_ > decisions on giving function names of Bio::SeqFeatureI 'seq' and > Bio::PrimarySeq 'seq' which return different types -- don't forget > that Lincoln's Bio::DB::Fasta uses the 'seq' method to return a > sequence as a string as well so major API changes in general here > will create in all likelihood a big split between the branches that > will make any new Bioperl not match up well with existing scripts or > libraries that use it - hence the reason for no "great realigning" > to a completely well-planned out API rather than the organically > grown whims of several generations of devs. I say this in jest a > bit - I do want to see changes, but I think it really will have to > be called something else besides BioPerl to avoid confusion and the > fact that a lot of things will break that depend on the current > APIs. BioPerl2 or something indicating a Perl6 association. > > -jason Just thought of this: doesn't the feature iterator in Bio::DB::SeqFeature::Store use next_seq for features? Yikes... Anyway, I think if we set a decent enough deprecation schedule, users would adjust, but that's generally for small changes. Dramatic large-scale changes (such as Moose integration and conversion of interfaces to roles) should be done in a separate project. Similarly, as mentioned before, perl6 is a different (yet related) beast to perl5, and so a bioperl-related project using perl6 shouldn't be called BioPerl 2.0. The nice aspect of this: we can take what we like from BioPerl now and refactor it for either project, along the way making sure only the most critical modules get in. 
chris From sidd.basu at gmail.com Fri May 8 14:30:19 2009 From: sidd.basu at gmail.com (Siddhartha Basu) Date: Fri, 8 May 2009 13:30:19 -0500 Subject: [Bioperl-l] Re: Moose [was Re:Other object oddities] In-Reply-To: References: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> Message-ID: <4a047a3e.23bb720a.3b09.ffff9430@mx.google.com> On Wed, 06 May 2009, Chris Fields wrote: > As a final bit: if we go the Moose route, we should be very careful about > which MooseX modules we want. I don't think we want to expand the > dependency tree. For instance, I am attempting to install one possible > module (MooseX::Declare) and the dependency tree was ginormous and included > modules only needed for installation. > > chris Since we are on the topic of Moose dependencies, here is a nice article about it. http://chris.prather.org/perl/moose-dependencies-a-lurid-tale/ -siddhartha > > On May 6, 2009, at 12:56 PM, Mark A. Jensen wrote: > > > Great discussion-- I have redacted the moose portions to > > http://www.bioperl.org/wiki/Talk:BioMoose and encourage all interested > > folks to log comments there as well. cheers Mark > > ----- Original Message ----- From: "Chris Mungall" > > To: "Chris Fields" > > Cc: "BioPerl List" ; "Mark A. Jensen" > > ; "Kevin Brown" > > Sent: Tuesday, May 05, 2009 2:28 PM > > Subject: [Bioperl-l] Moose [was Re: Other object oddities] > > > > > >> > >> On May 5, 2009, at 7:31 AM, Chris Fields wrote: > >> > >>> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: > >>> > >>>> > >>>> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: > >>>> > >>>>> Maybe this should be an element of > >>>>> the "Align refactor" that perhaps should be an overall > >>>>> "Seq refactor". > >>>> > >>>> Possibly. Most importantly, it'd be great if someone would volunteer > >>>> to summarize what's been said here so it won't get lost. > >>> > >>> Looks like mark's done it. > >>> > >>>>> Are you saying that the trunk is fair game for api additions > >>>>> for this issue? 
> >>>> > >>>> There's been talk some (a long, actually) time ago about BioPerl 2.0 > >>>> that would start on a clean slate and not be bothered by backwards > >>>> compatibility demands. That effort never really took off, but maybe > >>>> this is also a good time to ask the question again whether it's better > >>>> to introduce the API changes we desire in add/ deprecate/remove cycles, > >>>> or in a more radical fashion starting on a clean slate. > >>> > >>> That's what I'm thinking. > >>> > >>>> The obvious advantage of the former is that we get API improvements > >>>> sooner, but making them is possibly more dreadful, discouraging, or > >>>> not even doable due to compatibility constraints. The disadvantage of > >>>> the latter is that it really needs a committed crew of people to see > >>>> it through or otherwise all the nice changes die in some grand but > >>>> half-finished 2.0 construction site. I think Chris also had plans to > >>>> branch off a Perl6 version of BioPerl - maybe those could be the same > >>>> efforts? > >>> > >>> I have been toying around with perl6 for a bit now (Rakudo on Parrot > >>> implementation). It's possible an alpha for perl6 will be available by > >>> christmas this year; Rakudo is now passing over 11000 spec tests. > >>> > >>> Just to note, Perl6 is another beast altogether from Perl5. Yes, there > >>> is supposed to be a backwards compatibility mode, but no one has > >>> implemented that yet, and it likely won't be implemented in the near > >>> future. Based on that I'm not sure we could really call a bioperl in > >>> perl6 bioperl 2.0, more like bioperl6 1.0, as it would be a complete > >>> refactor. > >>> > >>> As for perl5, it has a nice OO set of modules (Moose) that could be > >>> used for refactoring. It implements roles and a few other perl6-ish > >>> bits (along with MooseX modules). perl 5.10 also has a few things > >>> backported from p6, say(), given/when, state vars, etc. 
We could > >>> require Modern::Perl (perl5.10 with strict/warnings pragmas on) and > >>> Moose. I have played around with both and find them quite nice, so I > >>> suggest if we were to start a 2.0 effort it should include Moose, and > >>> we should push most of the interfaces into roles. > >> > >> We're playing around with a rewrite of go-perl using Moose: > >> http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ > >> > >> This is early enough that parts could be scrapped or rewritten. > >> Compatibility with bioperl is important. > >> > >> Speed was an initial concern but apparently there are some moose tricks > >> to speed things up > >> > >> DBIx::Class compatibility is also important. Not sure if there is > >> specific support for this yet > >> > >> > >>> > >>> Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl > >>> implemented in Moose) on github. We can set up something there using > >>> those namespaces if needed. > >>> > >>>> I'm not trying to advocate one over the other here; rather, I'd like > >>>> to help push on that front that is best able to capture the energy of > >>>> volunteers, as that's what it takes in the end. > >>>> > >>>> -hilmar > >>> > >>> Depends on where everyone wants to place their efforts. May be less > >>> work to port the most important core classes over to Moose, and a > >>> simple test implementation will give us an idea on what works Role- wise > >>> and what doesn't. From there we could work on p6 variants; that would > >>> have to be a separate project altogether. We could also include a few > >>> other MooseX modules if it makes life easier. 
> >>> > >>> chris > >>> _______________________________________________ > >>> Bioperl-l mailing list > >>> Bioperl-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >>> > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From roychu at gmail.com Fri May 8 14:48:15 2009 From: roychu at gmail.com (Chu, Roy) Date: Fri, 8 May 2009 11:48:15 -0700 Subject: [Bioperl-l] The Power of R (Chris Fields) Message-ID: <4d7f3e450905081148i28579baai3b9889ec7c8e8b1e@mail.gmail.com> Gabriel's book looks like a promising reference that I think I'll want to check out. I just recently came across this use R series by Springer: http://www.springer.com/series/6991 Another one I don't see listed, but probably more relevant is the Springer book: Statistical Methods in Computational Biology. Roy Date: Fri, 8 May 2009 21:49:22 +0900 From: Gabriel Valiente Subject: Re: [Bioperl-l] The Power of R (Chris Fields) To: bioperl-l at lists.open-bio.org Message-ID: <8F435B66-33CF-467D-8D86-AA8EF2309E98 at lsi.upc.edu> Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed >>> While we're on the topic, can anyone recommend a good book or >>> resource from which to learn R, to supplement the official docs? Well, my new book G. Valiente. Combinatorial Pattern Matching Algorithms in Computational Biology using Perl and R. Taylor & Francis/CRC Press (2009) http://www.crcpress.com/product/isbn/9781420063677 is already available. 
I hope it will also be of much use to BioPerl developers and users.

Gabriel

From mmuratet at hudsonalpha.org Fri May 8 15:29:38 2009
From: mmuratet at hudsonalpha.org (Michael Muratet)
Date: Fri, 8 May 2009 14:29:38 -0500
Subject: [Bioperl-l] fastq parsing problem
Message-ID:

Greetings

I've got a problem parsing fastq output from the maq aligner. The parser is throwing an exception for the following record:

@HWI-EAS146:3:1:2:177#0/1
CTCCGCTNNCTTCTCAGCTTTCTTGTAGGCGATAGACTTCCCGAGCCTANCCAGAGCAACGAGCNTNNNGNNNNTN
+
@,AB=>-&&:5).;+*=<*8?%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%

I looked up the line in fastq.pm that does the parsing:

116 my ($top,$sequence,$top2,$qualsequence) = $entry =~ /^
117     \@?(.+?)\n
118     ([^\@]*?)\n
119     \+?(.+?)\n
120     (.*)\n
121     /xs

I don't consider myself a regex-pert, but I would interpret the above as "put everything after one or zero @ characters on the first line in $top; then put anything that is not @ on the second line in $sequence; then everything after one or zero + characters on the third line in $top2; then everything on the fourth line in $qualsequence; and don't be greedy".

It seems like the fastq record above should parse with these rules. I note that the @ character is escaped in the regex and appears in several of the problem records, but not all. Has anyone come across this before? I don't see this exact problem in the list archives.

Thanks

Mike

From jason at bioperl.org Fri May 8 16:04:06 2009
From: jason at bioperl.org (Jason Stajich)
Date: Fri, 8 May 2009 13:04:06 -0700
Subject: [Bioperl-l] How to get coil prediction out of Bio::Tools::Run::Coil modules
In-Reply-To: <1241799580.6963.165.camel@pepa.cbm.uam.es>
References: <1241799580.6963.165.camel@pepa.cbm.uam.es>
Message-ID:

The sequence isn't part of the report - or at least isn't parsed but you can just do this (pseudo-y-code here).

use Bio::SeqIO;

# Assumes $seqin is a Bio::Seq object (not a filename) and @features
# holds the features returned by $factory->run($seqin).
my $seqout = Bio::SeqIO->new(-format => 'fasta');    # writes to STDOUT
for my $feature (@features) {
    # trunc() returns a sequence object; subseq() would return a plain string
    my $featseq = $seqin->trunc($feature->start, $feature->end);
    $seqout->write_seq($featseq);
}

On May 8, 2009, at 9:19 AM, Raul Mendez Giraldez wrote:

> Hi,
>
> I'm trying to get coiled-coil predictions for protein sequences using
> Bob Russell's program ncoils, through the BioPerl interface
> Bio::Tools::Run::Coil, but the only things I can get from any element
> of the features list are the sequence name and a few other
> not-so-useful attributes.
>
> I'm running the following script:
>
> #!/home/rmendez/bin/perl -w
>
> use strict;
> use FileHandle;
> use Data::Dumper;
>
> use Bio::Tools::Run::Coil;
>
> my $seqin = 'filein.fasta';
> my $factory = Bio::Tools::Run::Coil->new('-c');
> my @features = $factory->run($seqin);
>
> print "Printing content of features[0]\n";
> print Dumper $features[0];
>
> ----
>
> And the output (the content of the first element of the features
> array) is:
>
> '_gsf_tag_hash' => {
>     'percent_id' => [
>         'NULL'
>     ],
>     'hid' => [
>         'ncoils'
>     ],
>     'evalue' => [
>         0
>     ]
> },
> '_location' => bless( {
>     '_location_type' => 'EXACT',
>     '_start' => 138,
>     '_end' => 172
> }, 'Bio::Location::Simple' ),
> '_gsf_seq_id' => 'ENSDARP00000084927',
> '_parse_h' => {},
> '_root_cleanup_methods' => [
>     sub { "DUMMY" }
> ],
> '_source_tag' => 'Coils',
> '_primary_tag' => 'ncoils',
> '_root_verbose' => 0
> }, 'Bio::SeqFeature::Generic' );
>
> Then how could I get the sequence itself with the coil annotation 'xxx'?
> > Thanks, > > Raul > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From SMarkel at accelrys.com Fri May 8 16:05:29 2009 From: SMarkel at accelrys.com (Scott Markel) Date: Fri, 8 May 2009 16:05:29 -0400 Subject: [Bioperl-l] The Power of R (Chris Fields) In-Reply-To: <8F435B66-33CF-467D-8D86-AA8EF2309E98@lsi.upc.edu> References: <8F435B66-33CF-467D-8D86-AA8EF2309E98@lsi.upc.edu> Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A29BE7E@exch1-hi.accelrys.net> Gabriel, I just finished looking through my copy of your book last night. You cover a nice combination of pattern matching tasks and include background information for both Perl and R. I like the fact that the Perl examples use BioPerl where appropriate. Too bad that most other bioinformatics books don't do the same. A quick personal comment - Thank you for referencing the "Using BioPerl" book that Jason Stajich, Ewan Birney, and I are writing. Now we'll have to finish it. :) Scott Scott Markel, Ph.D. 
Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Co-chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Gabriel Valiente > Sent: Friday, 08 May 2009 5:49 AM > To: bioperl-l at lists.open-bio.org > Subject: Re: [Bioperl-l] The Power of R (Chris Fields) > > >>> While we're on the topic, can anyone recommend a good book or > >>> resource from which to learn R, to supplement the official docs? > > Well, my new book > > G. Valiente. Combinatorial Pattern Matching Algorithms in > Computational Biology using Perl and R. Taylor & Francis/CRC Press > (2009) > > http://www.crcpress.com/product/isbn/9781420063677 > > is already available. I hope it will also be of much use to BioPerl > developers and users. > > Gabriel > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From roychu at gmail.com Fri May 8 16:46:48 2009 From: roychu at gmail.com (Chu, Roy) Date: Fri, 8 May 2009 13:46:48 -0700 Subject: [Bioperl-l] The Power of R (Chris Fields) In-Reply-To: <4d7f3e450905081148i28579baai3b9889ec7c8e8b1e@mail.gmail.com> References: <4d7f3e450905081148i28579baai3b9889ec7c8e8b1e@mail.gmail.com> Message-ID: <4d7f3e450905081346j14aa9fbdv9712f7e4b0cc8781@mail.gmail.com> My mistake, Statistical Methods in Bioinformatics. On Fri, May 8, 2009 at 11:48 AM, Chu, Roy wrote: > Gabriel's book looks like a promising reference that I think I'll want > to check out. 
> I just recently came across this use R series by Springer: > http://www.springer.com/series/6991 > > Another one I don't see listed, but probably more relevant is the > Springer book: Statistical Methods in Computational Biology. > > Roy > > > Date: Fri, 8 May 2009 21:49:22 +0900 > From: Gabriel Valiente > Subject: Re: [Bioperl-l] The Power of R (Chris Fields) > To: bioperl-l at lists.open-bio.org > Message-ID: <8F435B66-33CF-467D-8D86-AA8EF2309E98 at lsi.upc.edu> > Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed > >>>> While we're on the topic, can anyone recommend a good book or >>>> resource from which to learn R, to supplement the official docs? > > Well, my new book > > G. Valiente. Combinatorial Pattern Matching Algorithms in > Computational Biology using Perl and R. Taylor & Francis/CRC Press > (2009) > > http://www.crcpress.com/product/isbn/9781420063677 > > is already available. I hope it will also be of much use to BioPerl > developers and users. > > Gabriel > From maj at fortinbras.us Fri May 8 21:33:58 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Fri, 8 May 2009 21:33:58 -0400 Subject: [Bioperl-l] Moose [was Re:Other object oddities] In-Reply-To: <4a047a3e.23bb720a.3b09.ffff9430@mx.google.com> References: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> <4a047a3e.23bb720a.3b09.ffff9430@mx.google.com> Message-ID: <4076AEE2CB9F45138C6FE76DD807D0BA@NewLife> thanks Siddhartha- very informative [but he misquotes Eliot in his header!] cheers MAJ ----- Original Message ----- From: "Siddhartha Basu" To: Sent: Friday, May 08, 2009 2:30 PM Subject: [Bioperl-l] Re: Moose [was Re:Other object oddities] > On Wed, 06 May 2009, Chris Fields wrote: > >> As a final bit: if we go the Moose route, we should be very careful about >> which MooseX modules we want. I don't think we want to expand the >> dependency tree. 
For instance, I am attempting to install one possible >> module (MooseX::Declare) and the dependency tree was ginormous and included >> modules only needed for installation. >> >> chris > > Since we are on the topic of Moose dependencies, here is a nice article > about it. > http://chris.prather.org/perl/moose-dependencies-a-lurid-tale/ > > -siddhartha > >> >> On May 6, 2009, at 12:56 PM, Mark A. Jensen wrote: >> >> > Great discussion-- I have redacted the moose portions to >> > http://www.bioperl.org/wiki/Talk:BioMoose and encourage all interested >> > folks to log comments there as well. cheers Mark >> > ----- Original Message ----- From: "Chris Mungall" >> > To: "Chris Fields" >> > Cc: "BioPerl List" ; "Mark A. Jensen" >> > ; "Kevin Brown" >> > Sent: Tuesday, May 05, 2009 2:28 PM >> > Subject: [Bioperl-l] Moose [was Re: Other object oddities] >> > >> > >> >> >> >> On May 5, 2009, at 7:31 AM, Chris Fields wrote: >> >> >> >>> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: >> >>> >> >>>> >> >>>> On May 4, 2009, at 3:01 PM, Mark A. Jensen wrote: >> >>>> >> >>>>> Maybe this should be an element of >> >>>>> the "Align refactor" that perhaps should be an overall >> >>>>> "Seq refactor". >> >>>> >> >>>> Possibly. Most importantly, it'd be great if someone would volunteer >> >>>> to summarize what's been said here so it won't get lost. >> >>> >> >>> Looks like mark's done it. >> >>> >> >>>>> Are you saying that the trunk is fair game for api additions >> >>>>> for this issue? >> >>>> >> >>>> There's been talk some (a long, actually) time ago about BioPerl 2.0 >> >>>> that would start on a clean slate and not be bothered by backwards >> >>>> compatibility demands. That effort never really took off, but maybe >> >>>> this is also a good time to ask the question again whether it's better >> >>>> to introduce the API changes we desire in add/ deprecate/remove cycles, >> >>>> or in a more radical fashion starting on a clean slate. >> >>> >> >>> That's what I'm thinking. 
>> >>> >> >>>> The obvious advantage of the former is that we get API improvements >> >>>> sooner, but making them is possibly more dreadful, discouraging, or >> >>>> not even doable due to compatibility constraints. The disadvantage of >> >>>> the latter is that it really needs a committed crew of people to see >> >>>> it through or otherwise all the nice changes die in some grand but >> >>>> half-finished 2.0 construction site. I think Chris also had plans to >> >>>> branch off a Perl6 version of BioPerl - maybe those could be the same >> >>>> efforts? >> >>> >> >>> I have been toying around with perl6 for a bit now (Rakudo on Parrot >> >>> implementation). It's possible an alpha for perl6 will be available by >> >>> christmas this year; Rakudo is now passing over 11000 spec tests. >> >>> >> >>> Just to note, Perl6 is another beast altogether from Perl5. Yes, there >> >>> is supposed to be a backwards compatibility mode, but no one has >> >>> implemented that yet, and it likely won't be implemented in the near >> >>> future. Based on that I'm not sure we could really call a bioperl in >> >>> perl6 bioperl 2.0, more like bioperl6 1.0, as it would be a complete >> >>> refactor. >> >>> >> >>> As for perl5, it has a nice OO set of modules (Moose) that could be >> >>> used for refactoring. It implements roles and a few other perl6-ish >> >>> bits (along with MooseX modules). perl 5.10 also has a few things >> >>> backported from p6, say(), given/when, state vars, etc. We could >> >>> require Modern::Perl (perl5.10 with strict/warnings pragmas on) and >> >>> Moose. I have played around with both and find them quite nice, so I >> >>> suggest if we were to start a 2.0 effort it should include Moose, and >> >>> we should push most of the interfaces into roles. 
>> >> >> >> We're playing around with a rewrite of go-perl using Moose: >> >> http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ >> >> >> >> This is early enough that parts could be scrapped or rewritten. >> >> Compatibility with bioperl is important. >> >> >> >> Speed was an initial concern but apparently there are some moose tricks >> >> to speed things up >> >> >> >> DBIx::Class compatibility is also important. Not sure if there is >> >> specific support for this yet >> >> >> >> >> >>> >> >>> Anyway, I grabbed the git repos for bioperl6 and biomoose (bioperl >> >>> implemented in Moose) on github. We can set up something there using >> >>> those namespaces if needed. >> >>> >> >>>> I'm not trying to advocate one over the other here; rather, I'd like >> >>>> to help push on that front that is best able to capture the energy of >> >>>> volunteers, as that's what it takes in the end. >> >>>> >> >>>> -hilmar >> >>> >> >>> Depends on where everyone wants to place their efforts. May be less >> >>> work to port the most important core classes over to Moose, and a >> >>> simple test implementation will give us an idea on what works Role- wise >> >>> and what doesn't. From there we could work on p6 variants; that would >> >>> have to be a separate project altogether. We could also include a few >> >>> other MooseX modules if it makes life easier. 
>> >>>
>> >>> chris

From maj at fortinbras.us Fri May 8 21:45:18 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Fri, 8 May 2009 21:45:18 -0400
Subject: [Bioperl-l] fastq parsing problem
In-Reply-To:
References:
Message-ID: <0CECA54FA78F46839114FAB96CD53F39@NewLife>

Hi Michael--

Can you send along the exception? The line you sent seems to parse as advertised in the debugger (as long as the last newline that breaks up the string of %'s is not really there).

thanks,
Mark

----- Original Message -----
From: "Michael Muratet"
To: ;
Sent: Friday, May 08, 2009 3:29 PM
Subject: [Bioperl-l] fastq parsing problem

> Greetings
>
> I've got a problem parsing fastq output from the maq aligner. The
> parser is throwing an exception for the following record:
>
> @HWI-EAS146:3:1:2:177#0/1
> CTCCGCTNNCTTCTCAGCTTTCTTGTAGGCGATAGACTTCCCGAGCCTANCCAGAGCAACGAGCNTNNNGNNNNTN
> +
> @,AB=>-&&:5).;+*=<*8?%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
> %%%%%
>
> I looked up the line in fastq.pm that does the parsing:
>
> 116 my ($top,$sequence,$top2,$qualsequence) = $entry =~ /^
> 117     \@?(.+?)\n
> 118     ([^\@]*?)\n
> 119     \+?(.+?)\n
> 120     (.*)\n
> 121     /xs
>
> I don't consider myself a regex-pert, but I would interpret the above
> as "put everything after one or zero @ characters on the first line in
> $top; then put anything that is not @ on the second line in $sequence;
> then everything after one or zero + characters on the third line in
> $top2; then everything on the fourth line in $qualsequence; and don't
> be greedy".
>
> It seems like the fastq record above should parse with these rules. I
> note that the @ character is escaped in the regex and appears in
> several of the problem records, but not all. Has anyone come across
> this before? I don't see this exact problem in the list archives.
>
> Thanks
>
> Mike

From cjfields at illinois.edu Sat May 9 11:26:42 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 9 May 2009 10:26:42 -0500
Subject: [Bioperl-l] Moose [was Re:Other object oddities]
In-Reply-To: <4076AEE2CB9F45138C6FE76DD807D0BA@NewLife>
References: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> <4a047a3e.23bb720a.3b09.ffff9430@mx.google.com> <4076AEE2CB9F45138C6FE76DD807D0BA@NewLife>
Message-ID: <003BA940-D974-44A8-9634-55963C2E8341@illinois.edu>

Decent article, but it is slightly misleading. These are dependencies for Moose itself, which I don't have a problem with (off the subject, but I personally would like to add in a requirement for Modern::Perl!). What I am worried about is the many additional dependencies introduced by some of the 'syntactic sugar' in various MooseX modules.
For instance, MooseX::Declare, and MooseX::Method::Signatures (two popular ones): http://deps.cpantesters.org/?module=MooseX%3A%3ADeclare&perl=any+version&os=any+OS http://deps.cpantesters.org/?module=MooseX%3A%3AMethod%3A%3ASignatures&perl=any+version&os=any+OS chris On May 8, 2009, at 8:33 PM, Mark A. Jensen wrote: > thanks Siddhartha- very informative [but he misquotes Eliot in his > header!] > cheers MAJ > ----- Original Message ----- From: "Siddhartha Basu" > > To: > Sent: Friday, May 08, 2009 2:30 PM > Subject: [Bioperl-l] Re: Moose [was Re:Other object oddities] > > >> On Wed, 06 May 2009, Chris Fields wrote: >> >>> As a final bit: if we go the Moose route, we should be very >>> careful about >>> which MooseX modules we want. I don't think we want to expand the >>> dependency tree. For instance, I am attempting to install one >>> possible >>> module (MooseX::Declare) and the dependency tree was ginormous and >>> included >>> modules only needed for installation. >>> >>> chris >> >> Since we are on the topic of Moose dependencies, here is a nice >> article >> about it. >> http://chris.prather.org/perl/moose-dependencies-a-lurid-tale/ >> >> -siddhartha >> >>> >>> On May 6, 2009, at 12:56 PM, Mark A. Jensen wrote: >>> >>> > Great discussion-- I have redacted the moose portions to >>> > http://www.bioperl.org/wiki/Talk:BioMoose and encourage all >>> interested >>> > folks to log comments there as well. cheers Mark >>> > ----- Original Message ----- From: "Chris Mungall" >> > >>> > To: "Chris Fields" >>> > Cc: "BioPerl List" ; "Mark A. >>> Jensen" >>> > ; "Kevin Brown" >>> > Sent: Tuesday, May 05, 2009 2:28 PM >>> > Subject: [Bioperl-l] Moose [was Re: Other object oddities] >>> > >>> > >>> >> >>> >> On May 5, 2009, at 7:31 AM, Chris Fields wrote: >>> >> >>> >>> On May 5, 2009, at 7:31 AM, Hilmar Lapp wrote: >>> >>> >>> >>>> >>> >>>> On May 4, 2009, at 3:01 PM, Mark A. 
Jensen wrote: >>> >>>> >>> >>>>> Maybe this should be an element of >>> >>>>> the "Align refactor" that perhaps should be an overall >>> >>>>> "Seq refactor". >>> >>>> >>> >>>> Possibly. Most importantly, it'd be great if someone would >>> volunteer >>> >>>> to summarize what's been said here so it won't get lost. >>> >>> >>> >>> Looks like mark's done it. >>> >>> >>> >>>>> Are you saying that the trunk is fair game for api additions >>> >>>>> for this issue? >>> >>>> >>> >>>> There's been talk some (a long, actually) time ago about >>> BioPerl 2.0 >>> >>>> that would start on a clean slate and not be bothered by >>> backwards >>> >>>> compatibility demands. That effort never really took off, >>> but maybe >>> >>>> this is also a good time to ask the question again whether >>> it's better >>> >>>> to introduce the API changes we desire in add/ deprecate/ >>> remove cycles, >>> >>>> or in a more radical fashion starting on a clean slate. >>> >>> >>> >>> That's what I'm thinking. >>> >>> >>> >>>> The obvious advantage of the former is that we get API >>> improvements >>> >>>> sooner, but making them is possibly more dreadful, >>> discouraging, or >>> >>>> not even doable due to compatibility constraints. The >>> disadvantage of >>> >>>> the latter is that it really needs a committed crew of people >>> to see >>> >>>> it through or otherwise all the nice changes die in some >>> grand but >>> >>>> half-finished 2.0 construction site. I think Chris also had >>> plans to >>> >>>> branch off a Perl6 version of BioPerl - maybe those could be >>> the same >>> >>>> efforts? >>> >>> >>> >>> I have been toying around with perl6 for a bit now (Rakudo on >>> Parrot >>> >>> implementation). It's possible an alpha for perl6 will be >>> available by >>> >>> christmas this year; Rakudo is now passing over 11000 spec >>> tests. >>> >>> >>> >>> Just to note, Perl6 is another beast altogether from Perl5. 
>>> Yes, there >>> >>> is supposed to be a backwards compatibility mode, but no one >>> has >>> >>> implemented that yet, and it likely won't be implemented in >>> the near >>> >>> future. Based on that I'm not sure we could really call a >>> bioperl in >>> >>> perl6 bioperl 2.0, more like bioperl6 1.0, as it would be a >>> complete >>> >>> refactor. >>> >>> >>> >>> As for perl5, it has a nice OO set of modules (Moose) that >>> could be >>> >>> used for refactoring. It implements roles and a few other >>> perl6-ish >>> >>> bits (along with MooseX modules). perl 5.10 also has a few >>> things >>> >>> backported from p6, say(), given/when, state vars, etc. We >>> could >>> >>> require Modern::Perl (perl5.10 with strict/warnings pragmas >>> on) and >>> >>> Moose. I have played around with both and find them quite >>> nice, so I >>> >>> suggest if we were to start a 2.0 effort it should include >>> Moose, and >>> >>> we should push most of the interfaces into roles. >>> >> >>> >> We're playing around with a rewrite of go-perl using Moose: >>> >> http://geneontology.svn.sourceforge.net/viewvc/geneontology/go-moose/OBO/ >>> >> >>> >> This is early enough that parts could be scrapped or rewritten. >>> >> Compatibility with bioperl is important. >>> >> >>> >> Speed was an initial concern but apparently there are some >>> moose tricks >>> >> to speed things up >>> >> >>> >> DBIx::Class compatibility is also important. Not sure if there is >>> >> specific support for this yet >>> >> >>> >> >>> >>> >>> >>> Anyway, I grabbed the git repos for bioperl6 and biomoose >>> (bioperl >>> >>> implemented in Moose) on github. We can set up something >>> there using >>> >>> those namespaces if needed. >>> >>> >>> >>>> I'm not trying to advocate one over the other here; rather, >>> I'd like >>> >>>> to help push on that front that is best able to capture the >>> energy of >>> >>>> volunteers, as that's what it takes in the end. 
>>> >>>> >>> >>>> -hilmar >>> >>> >>> >>> Depends on where everyone wants to place their efforts. May >>> be less >>> >>> work to port the most important core classes over to Moose, >>> and a >>> >>> simple test implementation will give us an idea on what works >>> Role- wise >>> >>> and what doesn't. From there we could work on p6 variants; >>> that would >>> >>> have to be a separate project altogether. We could also >>> include a few >>> >>> other MooseX modules if it makes life easier. >>> >>> >>> >>> chris >>> >>> _______________________________________________ >>> >>> Bioperl-l mailing list >>> >>> Bioperl-l at lists.open-bio.org >>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >> >>> >> _______________________________________________ >>> >> Bioperl-l mailing list >>> >> Bioperl-l at lists.open-bio.org >>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >>> > >>> > _______________________________________________ >>> > Bioperl-l mailing list >>> > Bioperl-l at lists.open-bio.org >>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Russell.Smithies at agresearch.co.nz Sun May 10 16:49:56 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 11 May 2009 08:49:56 +1200 Subject: [Bioperl-l] Asking for advice on full EMBL extraction In-Reply-To: References: <18DF7D20DFEC044098A1062202F5FFF32493CE8F24@exchsth.agresearch.co.nz> <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> 
<18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz> <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE9104@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493CE92E3@exchsth.agresearch.co.nz> How about splitting the big file into smaller chunks and processing one sequence at a time? It could be one specific feature line that's causing the segfault and nothing to do with file size. You should be able to split the file with awk as well (I like awk :-) zcat rel_ann_mus_01_r99.dat.gz | awk 'BEGIN{RS="//";OFS="\n"}{$1=$1; print > "chunk"NR}' --Russell > -----Original Message----- > From: brian li [mailto:brianli.cas at gmail.com] > Sent: Saturday, 9 May 2009 2:49 a.m. > To: Smithies, Russell > Cc: bioperl-l at lists.open-bio.org; Jason Stajich; Chris Fields > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk > '!/^FT|^CO/{print}' |" works. > open $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ > /{print}' |" segfaults. > > So it seems the features are causing problems. Although I still don't > know how that hurts my os to pop a segfault, my extraction can move on > again. Maybe I can find a clue when I know more about my os's memory > management strategy. > > Really appreciate all your help. > > -Brian > > On Fri, May 8, 2009 at 8:03 AM, Smithies, Russell > wrote: > > I think the problem here though is the size of the sequences rather than too > > many features. > > > > If one was inclined to bodge/hack and didn't care about sequence, I guess > > you could filter them out with awk so Bio::SeqIO doesn't have to create the > > Bio::PrimarySeq J > > > > Probably breaks the EMBL file spec . > > > > Eg. 
> > > > open( $fh, "gunzip -c rel_ann_mus_01_r99.dat.gz | awk '!/^SQ|^ /{print}' |" > > ) or die; > > > > > > > > > > > > --Russell > > > > > > > > > > > > > > > > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason > > Stajich > > Sent: Friday, 8 May 2009 11:25 a.m. > > To: Smithies, Russell > > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org' > > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > > > > > > > It parses from a stream or file, one sequence at a time so it only reads a > > single sequence out at a time, but it does have to parse that whole sequence > > record which is where feature rich sequences might be causing problems. > > > > > > > > I think per your other mention of Tie::File - the whole file is not going > > into memory so that is not the problem, it is the creation of many objects > > that it does as it parses the sequence that is likely the problem. ?It will > > read up to the first "//" from that Tie::File anyways, that becomes an > > entire string which is then parsed to pull out the relevant features so you > > don't gain anything with Tie::File -- what would be the way to solve it is > > if the objects could be created and reside in a DB on disk rather than > > in-memory. ?I'd really enjoy seeing more indexed and hashed data to objects > > stored on disk when mem requirements are such so that very large datasets > > can be handled more nimbly. > > > > > > > > I think there have been several attempts to simplify, but it basically means > > a dedicated developer to really overhaul or map to a new system. ?What we've > > tried to build is a decent API so a new implementation can be done without > > affecting the 'next_seq' and 'write_seq' API. 
> > > > > > > > Non-withstanding the seemed API confusion caused by _ancient_ decisions on > > giving function names of Bio::SeqFeatureI 'seq' and Bio::PrimarySeq 'seq' > > which return different types -- don't forget that Lincoln's Bio::DB::Fasta > > uses the 'seq' method to return a sequence as a string as well so major API > > changes in general here will create in all likelihood a big split between > > the branches that will make any new Bioperl not match up well with existing > > scripts or libraries that use it - hence the reason for no "great > > realigning" to a completely well-planned out API rather than the organically > > grown whims of several generations of devs. ?I say this in jest a bit - I do > > want to see changes, but I think it really will have to be called something > > else besides BioPerl to avoid confusion and the fact that a lot of things > > will break that depend on the current APIs. ?BioPerl2 or something > > indicating a Perl6 association. > > > > > > > > -jason > > > > On May 7, 2009, at 3:05 PM, Smithies, Russell wrote: > > > > OK, I misunderstood, I thought the entire file loaded was loaded into memory > > first then each sequence was extracted from there. > > I hoped splitting into 588 individual sequences might help. > > > > --Russell > > > > From: Jason Stajich [mailto:jason.stajich at gmail.com] On Behalf Of Jason > > Stajich > > Sent: Friday, 8 May 2009 9:55 a.m. > > To: Smithies, Russell > > Cc: 'brian li'; 'Chris Fields'; 'bioperl-l at lists.open-bio.org' > > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > > > Russell - > > > > I am not sure how that will help as only 1 sequence is parsed at a time by > > SeqIO parsers and they use the "//" delimiter. > > > > If the equivalent data exists in genbank format at NCBI I think _that_ > > ?module (Bio::SeqIO::genbank) has the ability to ignore > > annotations/features. ?Really we have to re-work the whole thing to be more > > lightweight and lazy-parse. 
> >
> > -jason
> > On May 7, 2009, at 2:24 PM, Smithies, Russell wrote:
> >
> >
> > I'm not sure if this will help with your problem or how it deals with memory
> > management but using "ordinary" Perl to split the large EMBL file might
> > work.
> > Give this a go:
> >
> > ============================
> > #!perl -w
> >
> > use Bio::SeqIO;
> > use IO::String;
> >
> > use constant SEP => "//\n";
> >
> > open($fh, "gunzip -c rel_ann_mus_01_r99.dat.gz |") or die;
> >
> > my $index = 1;
> >
> > while(my $stringfh = new IO::String(get_next_record($fh))){
> >
> >     my $seqio = Bio::SeqIO->new( -fh => $stringfh, -format => "EMBL" ) or die $!;
> >
> >     while ( my $seq_object = $seqio->next_seq ) {
> >         print "Dealing with entry: ".$index++."\t".$seq_object->id."\n";
> >
> >         # show the features
> >         for my $feat_object ($seq_object->get_SeqFeatures) {
> >             print "primary tag: ", $feat_object->primary_tag, "\n";
> >             for my $tag ($feat_object->get_all_tags) {
> >                 print "  tag: ", $tag, "\n";
> >                 for my $value ($feat_object->get_tag_values($tag)) {
> >                     print "    value: ", $value, "\n";
> >                 }
> >             }
> >         }
> >     }
> >
> > }
> >
> >
> > sub get_next_record{
> >     my($fh) = @_;
> >     (my $old_sep,$/) = ($/,SEP);
> >     my $record = <$fh>;
> >     $/ = $old_sep;
> >     return $record;
> > }
> > ========================================
> >
> >
> > --Russell
> >
> >
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> > brian li
> > Sent: Friday, 8 May 2009 1:00 a.m.
> > To: Chris Fields > > Cc: bioperl-l at lists.open-bio.org > > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > > > My server has 32 GB RAM. > > > > The os of my server is 64-bit version of Ubuntu Server Edition 8.04 > > LTS. And I have run my example code on another server with 32-bit > > version of Ubuntu Server Edition 8.04 and 4 GB RAM. Segfault again. > > > > -Brian > > > > On Thu, May 7, 2009 at 8:07 PM, Chris Fields > > > wrote: > > I noticed that Russell has 16GB RAM on his setup. ?Was yours equivalent? > > > > chris > > > > On May 7, 2009, at 12:32 AM, brian li wrote: > > > > Thank you very much for your offer. > > > > The director of our lab wants me to do the extraction every time a new > > release of EMBL is published. I can't push the task to you every time. > > > > I can offer more information of the server I run my script on if needed. > > > > -Brian > > > > On Thu, May 7, 2009 at 1:01 PM, Smithies, Russell > > > > > > wrote: > > > > Sadly, that's the same code as I ran but I had a Data::Dump in the > > middle. > > Versions of Perl and BioPerl are the same. > > We're running RHEL 5 (kernel 2.6.18-92.1.18.el5) with 16GB RAM > > > > If you get a full script running on a smaller dataset, I could probably > > run it on the bigger stuff and give you back tab-separated (or is that > > tab\tseparated ?) data for loading into your db. > > > > --Russell > > > > -----Original Message----- > > From: brian li [mailto:brianli.cas at gmail.com] > > Sent: Thursday, 7 May 2009 4:50 p.m. > > To: Smithies, Russell > > Cc: bioperl-l at lists.open-bio.org > > Subject: Re: [Bioperl-l] Asking for advice on full EMBL extraction > > > > Dear Russell, > > > > My example code is as following. I omit the parse process and these > > lines give me "Segmentation Fault" too. 
> >
> > # Start of code
> > my $seqio = Bio::SeqIO->new(-file   => 'rel_ann_mus_01_r99.dat',
> >                             -format => 'EMBL');
> > my $index = 1;
> > while (my $seq = $seqio->next_seq)
> > {
> >   print "Dealing with entry: $index\n";
> >   $index++;
> > }
> > # End
> >
> > The platform I run this code on:
> > BioPerl 1.6.0
> > Perl 5.8.8
> > Ubuntu 8.04 LTS Server 64-bit version (Linux 2.6.24-23-server)
> >
> > I have monitored the memory usage when I run the code above. There is
> > always around 20GB free memory (buffer size counted in) left. So I
> > suppose the segfault can't be explained just by memory shortage.
> >
> > Brian
> >
> >
> > On Thu, May 7, 2009 at 11:32 AM, Smithies, Russell wrote:
> >
> > Hi Brian,
> > I hate to say it but it worked OK for me using rel_ann_mus_01_r99.dat.gz and
> > simple example Bio::SeqIO code from bugzilla
> >
> > It's not using more than 1GB memory on our server and doesn't segfault.
> >
> > Send me your example code and I'll give it a go if you like.
> >
> >
> > Russell Smithies
> >
> > Bioinformatics Applications Developer
> > T +64 3 489 9085
> > E russell.smithies at agresearch.co.nz
> >
> > Invermay Research Centre
> > Puddle Alley,
> > Mosgiel,
> > New Zealand
> > T +64 3 489 3809
> > F +64 3 489 9174
> > www.agresearch.co.nz
> >
> >
> > =======================================================================
> > Attention: The information contained in this message and/or attachments
> > from AgResearch Limited is intended only for the persons or entities
> > to which it is addressed and may contain confidential and/or privileged
> > material. Any review, retransmission, dissemination or other use of, or
> > taking of any action in reliance upon, this information by persons or
> > entities other than the intended recipients is prohibited by AgResearch
> > Limited. If you have received this message in error, please notify the
> > sender immediately.
> > =======================================================================
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
> > Jason Stajich
> > jason at bioperl.org

From brianli.cas at gmail.com Sun May 10 22:43:48 2009
From: brianli.cas at gmail.com (brian li)
Date: Mon, 11 May 2009 10:43:48 +0800
Subject: [Bioperl-l] Asking for advice on full EMBL extraction
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493CE92E3@exchsth.agresearch.co.nz>
References: <3070BEFE-CC10-44CC-9FB9-79B7BB0E53E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493CE8FE3@exchsth.agresearch.co.nz> <6C1564CE-EC1E-446B-BD11-A0C1E627B14B@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE904C@exchsth.agresearch.co.nz> <82AAC49D-458A-4E79-90EA-A793A053314F@bioperl.org> <18DF7D20DFEC044098A1062202F5FFF32493CE9104@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493CE92E3@exchsth.agresearch.co.nz>
Message-ID:

Thanks for your advice.

I agree with you that some feature lines are causing the segfault, but I don't know which ones. I am afraid that splitting the big files by sequence would not help, as I don't know which chunk has the mean feature :) and I can't write and run test code for every file.

I think for now I have to just skip the features and get the extraction running first.
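
A minimal sketch of the skip-the-features workaround, for illustration only. It assumes that each EMBL entry ends with a line containing only "//" and that feature-table and contig lines start with "FT" and "CO", as in the awk filter quoted earlier in this thread. The helper names (next_record, strip_features, record_id) and the demo entry are made up for this example, and nothing here calls BioPerl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: every EMBL entry is terminated by a line holding only "//".
use constant RECORD_SEP => "//\n";

# Read one complete entry from a filehandle; returns undef at end of input.
sub next_record {
    my ($fh) = @_;
    local $/ = RECORD_SEP;    # restored automatically when the sub returns
    my $record = <$fh>;
    return $record;
}

# Drop feature-table (FT) and contig (CO) lines from a single entry.
sub strip_features {
    my ($record) = @_;
    return join '', grep { !/^(?:FT|CO)\s/ } split /^/m, $record;
}

# Return the entry name from the ID line, or undef if there is none.
sub record_id {
    my ($record) = @_;
    return $record =~ /^ID\s+([^;\s]+)/m ? $1 : undef;
}

# Hypothetical, heavily shortened entry used only for this demonstration.
my $demo = join '',
    "ID   DEMO0001; SV 1; linear; genomic DNA; STD; MUS; 12 BP.\n",
    "FT   source          1..12\n",
    "FT                   /organism=\"Mus musculus\"\n",
    "SQ   Sequence 12 BP;\n",
    "     acgtacgtac gt                                                12\n",
    "//\n";

# Read from an in-memory handle here; a real run would read the
# "gunzip -c ... |" pipe used elsewhere in this thread instead.
open my $fh, '<', \$demo or die "cannot open in-memory record: $!";
while (defined(my $record = next_record($fh))) {
    my $id = record_id($record);
    print "Entry: ", (defined $id ? $id : 'unknown'), "\n";
    print strip_features($record);
}
close $fh;
```

In a real run, each cleaned record could be wrapped in an IO::String handle and passed to Bio::SeqIO->new(-fh => ..., -format => 'EMBL'), as in Russell's script. Printing the ID before parsing also means that the last ID printed before a crash points at the entry worth inspecting.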
--Brian

On Mon, May 11, 2009 at 4:49 AM, Smithies, Russell wrote:
> How about splitting the big file into smaller chunks and processing one sequence at a time?
> It could be one specific feature line that's causing the segfault and nothing to do with file size.
> You should be able to split the file with awk as well (I like awk :-)
>
> zcat rel_ann_mus_01_r99.dat.gz | awk 'BEGIN{RS="//";OFS="\n"}{$1=$1; print > "chunk"NR}'
>
> --Russell

From dan.bolser at gmail.com Mon May 11 09:58:01 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Mon, 11 May 2009 14:58:01 +0100
Subject: [Bioperl-l] machine learnings
In-Reply-To: <704392.20390.qm@web8402.mail.in.yahoo.com>
References: <704392.20390.qm@web8402.mail.in.yahoo.com>
Message-ID: <2c8757af0905110658q593c3684h3de8d7e0c294c0@mail.gmail.com>

2009/5/4 punit kumar :
> hello
>
> i am punit kumar , i want to know that is the artificial neural network, and other machine learnings techniques modules are available
> in bio perl or not,

I don't think they are available in BioPerl.

> if available pls give suggestion that how i can utilise them.

You could try looking in "R" or here:

http://smw.referata.com/wiki/Emergent_Neural_Network_Simulation_System

Good luck!
Dan.

> punit kumar kadimi.

From dan.bolser at gmail.com Mon May 11 10:34:22 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Mon, 11 May 2009 15:34:22 +0100
Subject: [Bioperl-l] Getting 'features' from SearchIO?
Message-ID: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com>

Hi,

I am parsing a blasttable and extracting Bio::Search::HSP::GenericHSP objects as a result. I read somewhere that HSP objects inherit from Feature objects... How can I get a 'standard' representation of the HSP as a feature? Basically I'd like to simply load the blast results into a feature database...

When I call feature methods on the HSP objects I just get blank or undef results... I think this is because I'm trying to get at the sequence's existing (nonexistent) features, rather than getting the HSP object itself as a feature... If that makes sense... How can I confirm that I have a feature object containing the details of the HSP?

I thought of trying to just pass the HSP object to the Bio::DB::SeqFeature::Store, but I need to get that up and running first (I'm looking into it). In the meantime I thought I'd ask if this sounds like the right thing to do.

More generally I want to have features attached to sequences that are themselves annotations of larger sequences (but with unknown position). Is Bio::DB::SeqFeature::Store a way to go? I need to manage various different bits of information coming from a sequencing project, and I need a solution to the whole 'assembly life cycle management' problem.

Thanks for any help,
Dan.

From maj at fortinbras.us Mon May 11 10:31:35 2009
From: maj at fortinbras.us (Mark A.
Jensen) Date: Mon, 11 May 2009 10:31:35 -0400 Subject: [Bioperl-l] Google Summer of Code student Chase Miller Message-ID: Hello all, With great pleasure, I want to introduce Chase Miller, my Google Summer of Code student from George Washington University, to the community. Chase will be working with me and Rutger Vos on a BioPerl wrapper for Rutger's Bio::Phylo package, with a particular emphasis on creating a BioPerl-native way to import and export the NeXML (http://nexml.org) phylogenetic data format. He wrote a great proposal, available here: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit. We will be working throughout the summer on the project, and will of course come to you for sage advice. I know you will welcome him warmly, as you did me. Cheers, Mark From dan.bolser at gmail.com Mon May 11 11:07:47 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Mon, 11 May 2009 16:07:47 +0100 Subject: [Bioperl-l] Getting 'features' from SearchIO? In-Reply-To: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com> References: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com> Message-ID: <2c8757af0905110807g557efe35laf33a95f256dbf10@mail.gmail.com> 2009/5/11 Dan Bolser : > Hi, > > I am parsing a blasttable and extracting Bio::Search::HSP::GenericHSP > objects as a result. I read somewhere that HSP objects inherit Feature > objects... How can I get a 'standard' representation of the HSP as a > feature? Basically I'd like to simply load the blast results into a > feature database... > > When I call feature methods on the HSP objects I just get blank or > undef results... I think this is because I'm trying to get at the > sequences existing (non existent) features, rather than get the HSP > object as a feature... If that makes sense... How can I confirm that I > have a feature object containing the details of the HSP? 
> > I thought of trying to just pass the HSP object to the > Bio::DB::SeqFeature::Store, but I need to get that up and running > first (I'm looking into it). In the mean time I thought I'd ask if > this sounds like the right thing to do. Well it works... I am seeing things fill into the database as I call $db->store($p) or die "Couldn't store!"; (I needed to upgrade bioperl to get Bio::DB::SeqFeature working). Here is my code; while(my $r = $s->next_result ){ print $r->query_name, "\n"; while(my $h = $r->next_hit){ print "\t", $h->name, "\n"; while(my $p = $h->next_hsp){ $db->store($p) or die "Couldn't store!"; } } } How can I visualize the resulting set of HSPs? i.e. If I point gbrowse at this location, will it automatically pick up the entry points and their features from the database? Or how much manual configuration will I need? Is there some boilerplate config I can use to visualize this? Cheers, Dan. > More generally I want to have features attached to sequences that are > themselves annotations of larger sequences (but with unknown > position). Is Bio::DB::SeqFeature::Store a way to go? I need to manage > various different bits of information coming from a sequencing > project, and I need a solution to the whole 'assembly life cycle > management' problem. > > Thanks for any help, > Dan. > From jason at bioperl.org Mon May 11 11:38:14 2009 From: jason at bioperl.org (Jason Stajich) Date: Mon, 11 May 2009 08:38:14 -0700 Subject: [Bioperl-l] Getting 'features' from SearchIO? In-Reply-To: <2c8757af0905110807g557efe35laf33a95f256dbf10@mail.gmail.com> References: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com> <2c8757af0905110807g557efe35laf33a95f256dbf10@mail.gmail.com> Message-ID: Dan - There is nice documentation on the gmod website covering the gbrowse tutorial on the expected format of alignment features. 
That is what you should probably be generating and loading with the bp_seqfeature_load script -- otherwise you need to be converting the HSPs into seqfeatures with the right associated information (i.e. the tag/value pairs that are in the 9th column) in order to have well structured data in the database. There are boilerplate examples of how to visualize alignments on the Gbrowse tutorial website as well, so I commend that as a great starting place for GFF, data, conf files, and what kind of visualization you can obtain with the browser. There are also some helper scripts that do this for you, like bp_search2gff. Just dumping the feature will take the query (I believe) of the feature pair that is the HSP by default, so you will need to make some choices about what information you want. You can get the individual features from the feature pair with $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call $hsp->hit->gff_string). Note that since the data storage is not structured in a GFF3-like way this won't immediately produce well-formed GFF3 for the 9th column. Here's a script I use for some DNA to genome alignments, from FASTA output for example - it assumes 1 HSP per Hit as per what you get from SSEARCH but is a reasonable jumping off place. http://bit.ly/fasta2gff There is also a wublast-to-GFF conversion script in that repository. -jason On May 11, 2009, at 8:07 AM, Dan Bolser wrote: > 2009/5/11 Dan Bolser : >> Hi, >> >> I am parsing a blasttable and extracting Bio::Search::HSP::GenericHSP >> objects as a result. I read somewhere that HSP objects inherit >> Feature >> objects... How can I get a 'standard' representation of the HSP as a >> feature? Basically I'd like to simply load the blast results into a >> feature database... >> >> When I call feature methods on the HSP objects I just get blank or >> undef results...
I think this is because I'm trying to get at the >> sequences existing (non existent) features, rather than get the HSP >> object as a feature... If that makes sense... How can I confirm >> that I >> have a feature object containing the details of the HSP? >> >> I thought of trying to just pass the HSP object to the >> Bio::DB::SeqFeature::Store, but I need to get that up and running >> first (I'm looking into it). In the mean time I thought I'd ask if >> this sounds like the right thing to do. > > Well it works... I am seeing things fill into the database as I call > > $db->store($p) > or die "Couldn't store!"; > > (I needed to upgrade bioperl to get Bio::DB::SeqFeature working). > > > Here is my code; > > while(my $r = $s->next_result ){ > print $r->query_name, "\n"; > while(my $h = $r->next_hit){ > print "\t", $h->name, "\n"; > while(my $p = $h->next_hsp){ > $db->store($p) > or die "Couldn't store!"; > } > } > } > > > How can I visualize the resulting set of HSPs? i.e. If I point > gbrowse at this location, will it automatically pick up the entry > points and their features from the database? Or how much manual > configuration will I need? Is there some boilerplate config I can use > to visualize this? > > Cheers, > Dan. > > >> More generally I want to have features attached to sequences that are >> themselves annotations of larger sequences (but with unknown >> position). Is Bio::DB::SeqFeature::Store a way to go? I need to >> manage >> various different bits of information coming from a sequencing >> project, and I need a solution to the whole 'assembly life cycle >> management' problem. >> >> Thanks for any help, >> Dan. 
>> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From cjfields at illinois.edu Mon May 11 11:39:54 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 11 May 2009 10:39:54 -0500 Subject: [Bioperl-l] Getting 'features' from SearchIO? In-Reply-To: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com> References: <2c8757af0905110734i26f7102k69b615dc413e6be9@mail.gmail.com> Message-ID: On May 11, 2009, at 9:34 AM, Dan Bolser wrote: > Hi, > > I am parsing a blasttable and extracting Bio::Search::HSP::GenericHSP > objects as a result. I read somewhere that HSP objects inherit Feature > objects... How can I get a 'standard' representation of the HSP as a > feature? Basically I'd like to simply load the blast results into a > feature database... They are Bio::SeqFeature::SimilarityPair (all Bio::Search::HSP::HSPI are). > When I call feature methods on the HSP objects I just get blank or > undef results... I think this is because I'm trying to get at the > sequences existing (non existent) features, rather than get the HSP > object as a feature... If that makes sense... How can I confirm that I > have a feature object containing the details of the HSP? These are decorated feature pairs (they map to one another), so you would need to do something like $hsp->hit to get at the actual SeqFeature data for the hit, and similarly $hsp->query for the query SF. They technically have the SeqFeatureI methods but I believe they delegate to one specific feature (the query) unless you explicitly specify which feature to grab info from ('query', 'hit/subject'). I have added some tests for t/SearchIO//blasttable for this. > I thought of trying to just pass the HSP object to the > Bio::DB::SeqFeature::Store, but I need to get that up and running > first (I'm looking into it). 
In the mean time I thought I'd ask if > this sounds like the right thing to do. Worth a try to see what happens, but I'm not sure it would work as you expect, seeing as the methods by default delegate to the query (and I don't know if support for feature pairs is built in to Bio::DB::SeqFeature::Store). Also, last I recall, SF::Store stores everything based on a specified SF class, not the interface, so mixing SF classes in the same database (such as Bio::DB::SeqFeature, Bio::SeqFeature::Generic, and HSPs) may not be the wisest thing. I haven't used it in a little while, though, so that may have changed. Just to note, this problem has been 'solved' to some degree in the past. I think there are a few blast2gff scripts floating around, and there is a Bio::SearchIO::Writer::GbrowseGFF module, though it isn't maintained. The main problem is the mapping is subjective based on what your reference sequence is within the BLAST run (e.g. whether it is the query or the hit), and is something that can't be automatically discerned. I ended up rolling my own with SeqFeature::Store (just mapped the relevant data to Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts to integrate my changes in, just haven't had the time (though that may change soon :) > More generally I want to have features attached to sequences that are > themselves annotations of larger sequences (but with unknown > position). Did you mean 'features of larger sequences'? At the very least, you can define a region a feature falls within; if it falls within a region that has gaps on both sides: gap1 gap2 ----------xxxxxxxx--------xxxxxxx------------ |---| you can still assign coordinates to the feature for that release based on the estimated length of the gaps. Therefore it may change in a future release if the gaps are filled in. Otherwise I would assume it's simpler to designate it as a feature in a singleton sequence (on its own) that hasn't been mapped.
> Is Bio::DB::SeqFeature::Store a way to go? I need to manage > various different bits of information coming from a sequencing > project, and I need a solution to the whole 'assembly life cycle > management' problem. It's a good start, but it's not the only solution (by far). If you want to integrate in more information you could look into Chado (Apollo has a plugin for Chado). > Thanks for any help, > Dan. np. chris From hlapp at gmx.net Mon May 11 12:09:20 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 11 May 2009 12:09:20 -0400 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: References: Message-ID: Welcome to the fold, Chase, and looking forward to the project! :-) -hilmar On May 11, 2009, at 10:31 AM, Mark A. Jensen wrote: > Hello all, > With great pleasure, I want to introduce Chase Miller, my Google > Summer of Code student from George Washington University, to the > community. Chase will be working with me and Rutger Vos on a BioPerl > wrapper for Rutger's Bio::Phylo package, with a particular emphasis > on creating a BioPerl-native way to import and export the NeXML (http://nexml.org > ) phylogenetic data format. He wrote a great proposal, available > here: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit > . > We will be working throughout the summer on the project, and will of > course come to you for sage advice. I know you will welcome him > warmly, as you did me. 
> Cheers, > Mark > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jason at bioperl.org Mon May 11 12:24:06 2009 From: jason at bioperl.org (Jason Stajich) Date: Mon, 11 May 2009 09:24:06 -0700 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: References: Message-ID: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> Welcome Chase. Look forward to the project and helping where needed. -jason On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote: > Hello all, > With great pleasure, I want to introduce Chase Miller, my Google > Summer of Code student from George Washington University, to the > community. Chase will be working with me and Rutger Vos on a BioPerl > wrapper for Rutger's Bio::Phylo package, with a particular emphasis > on creating a BioPerl-native way to import and export the NeXML (http://nexml.org > ) phylogenetic data format. He wrote a great proposal, available > here: https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit > . > We will be working throughout the summer on the project, and will of > course come to you for sage advice. I know you will welcome him > warmly, as you did me. 
> Cheers, > Mark > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From rmb32 at cornell.edu Mon May 11 12:43:42 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Mon, 11 May 2009 09:43:42 -0700 Subject: [Bioperl-l] Moose [was Re:Other object oddities] In-Reply-To: <003BA940-D974-44A8-9634-55963C2E8341@illinois.edu> References: <79D2E471-A9D1-4759-BC1F-4FEE9A812788@berkeleybop.org> <4a047a3e.23bb720a.3b09.ffff9430@mx.google.com> <4076AEE2CB9F45138C6FE76DD807D0BA@NewLife> <003BA940-D974-44A8-9634-55963C2E8341@illinois.edu> Message-ID: <4A0855BE.80509@cornell.edu> Anybody going to YAPC::NA? There are some talks about managing dependencies and using CPAN, could be quite valuable for figuring out what to do about using modern perl techniques in BioPerl. http://yapc10.org/yn2009/talk/1985 http://yapc10.org/yn2009/talk/1975 There are probably more. Rob -- Robert Buels Bioinformatics Analyst, Sol Genomics Network Boyce Thompson Institute for Plant Research Tower Rd Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu Chris Fields wrote: > Decent article, but it is slightly misleading. These are dependencies > for Moose itself, which I don't have a problem with (off the subject, > but I personally would like to add in a requirement for Modern::Perl!). > > What I am worried about are lots of additional dependencies introduced > using some of the 'syntactic sugar' in various MooseX modules. For > instance, MooseX::Declare, and MooseX::Method::Signatures (two popular > ones): > > http://deps.cpantesters.org/?module=MooseX%3A%3ADeclare&perl=any+version&os=any+OS > > http://deps.cpantesters.org/?module=MooseX%3A%3AMethod%3A%3ASignatures&perl=any+version&os=any+OS > > > chris > > On May 8, 2009, at 8:33 PM, Mark A. 
Jensen wrote: From jm18 at sanger.ac.uk Sat May 9 06:55:29 2009 From: jm18 at sanger.ac.uk (John Marshall) Date: Sat, 9 May 2009 11:55:29 +0100 (BST) Subject: [Bioperl-l] fastq parsing problem In-Reply-To: References: Message-ID: Michael Muratet wrote: > I've got a problem parsing fastq output from the maq aligner. The > parser is throwing an exception for the following record: > > @HWI-EAS146:3:1:2:177#0/1 > CTCCGCTNNCTTCTCAG[...] > + > @,AB=>-&&:5).;+*=[...] > > I looked up the line in fastq.pm that does the parsing: > > 116 my ($top,$sequence,$top2,$qualsequence) = [...] This is the fastq parser from 1.5.2 or thereabouts, which had a bug (the $/ definition just above this code) that prevented it from parsing a record with a quality line starting with "@". This was probably not recognised as a bug for a long time due to the enduring myth that fastq quality lines always start with "!". The fastq next_seq() was rewritten for 1.6.0 and parses this successfully. (Unfortunately the documentation at the top of fastq.pm was not updated and still reflects the now-unused false belief about an initial "!" quality.) You may be able to just drop 1.6.0's Bio/SeqIO/fastq.pm in front of your existing Bioperl installation, if you're a little crazy and don't want to update the installation properly. If you do that, or if you update, you'll find that the new parser emits the following pedantic warning for your fastq sequences: MSG: Seq/Qual descriptions don't match; using sequence description In practice, lots of people (probably even most!) don't bother putting the sequence id on the "+" line, as it is entirely pointless duplication, instead leaving the "+" line otherwise empty. So I hope the maintainers agree that this warning should be relaxed, such as in the attached patch. Or even removed -- there was no equivalent warning in the previous code. 
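The failure mode described above (splitting records on "@" breaks when a quality line itself starts with "@") can be avoided by reading a record as four consecutive lines. What follows is only a minimal sketch, not the code in Bio/SeqIO/fastq.pm: it assumes unwrapped four-line records, and the name next_fastq_record is made up for illustration.

```perl
use strict;
use warnings;

# Read one four-line FASTQ record from a filehandle.
# Assumption: sequence and quality are each on a single line (no wrapping).
# Returns a hashref, or nothing at end of input.
sub next_fastq_record {
    my ($fh) = @_;
    my $head = <$fh>;
    return unless defined $head;
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    die "truncated FASTQ record\n" unless defined $qual;
    chomp($head, $seq, $plus, $qual);

    # Only the first line of a record is treated as a header, so a quality
    # line that happens to begin with "@" is never mistaken for one.
    die "bad header: $head\n" unless $head =~ /^\@(\S+)/;
    my $id = $1;
    die "bad separator: $plus\n" unless $plus =~ /^\+/;
    die "length mismatch for $id\n" unless length($seq) == length($qual);

    return { id => $id, seq => $seq, qual => $qual };
}
```

The "+" line is accepted whether or not it repeats the sequence id, which matches the relaxed behaviour argued for above.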
Cheers, John -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. -------------- next part -------------- A non-text attachment was scrubbed... Name: qualdesc.diff Type: application/octet-stream Size: 580 bytes Desc: not available URL: From Joao.Fadista at agrsci.dk Mon May 11 05:31:43 2009 From: Joao.Fadista at agrsci.dk (fadista) Date: Mon, 11 May 2009 02:31:43 -0700 (PDT) Subject: [Bioperl-l] alignable portion of a genome Message-ID: <23480025.post@talk.nabble.com> Hi, I would like to know of a good and fast way that could help me calculate the alignable portion of a genome (not human), given a reference sequence. When I say alignable portion I mean that I want to know all the positions of the genome that can be covered uniquely by reads of 36 bp and up to 2 mismatches. Some have advised me to work with Perl using the following strategy but I am not a Perl user so if someone has already a script for this function, it would be nice: "you could approach it by walking along the genome in a sliding window of 36 nt, and hash the frequency of each 36 nt sequence that you encounter. Then count how many of the 36 nt sequences had a frequency of exactly one. Divide this by the total number of 36nt windows visited. This should be do-able in about 20 lines of Perl." Best regards and thanks in advance -- View this message in context: http://www.nabble.com/alignable-portion-of-a-genome-tp23480025p23480025.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
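The sliding-window strategy quoted above can be sketched in a few lines of Perl. This is a rough illustration under stated assumptions: it counts exact matches only (the 2-mismatch requirement is not handled), looks at the forward strand only, and holds every window in memory, which will not scale to a large genome. The function name unique_window_fraction is invented for the example.

```perl
use strict;
use warnings;

# Fraction of length-$k windows whose sequence occurs exactly once in $seq.
# Exact matches, forward strand only.
sub unique_window_fraction {
    my ($seq, $k) = @_;
    my $n_windows = length($seq) - $k + 1;
    return 0 if $n_windows < 1;

    # Hash the frequency of every window encountered.
    my %count;
    for my $i (0 .. $n_windows - 1) {
        $count{ uc substr($seq, $i, $k) }++;
    }

    # grep in scalar context gives the number of windows seen exactly once.
    my $unique = grep { $_ == 1 } values %count;
    return $unique / $n_windows;
}

# Toy call; for real reads $k would be 36.
printf "%.3f\n", unique_window_fraction('atcgacgatcgaacgatcga', 5);
```

For real data the sequence would come from a FASTA file and both strands would need to be considered; a short-read aligner run on simulated reads is probably the more practical route once mismatches matter.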
From raulmendez at cbm.uam.es Mon May 11 09:16:44 2009 From: raulmendez at cbm.uam.es (Raul Mendez Giraldez) Date: Mon, 11 May 2009 15:16:44 +0200 Subject: [Bioperl-l] How to get coil prediction out of Bio::Tools::Run::Coil modules In-Reply-To: References: <1241799580.6963.165.camel@pepa.cbm.uam.es> Message-ID: <1242047804.6963.192.camel@pepa.cbm.uam.es> Hi Jason, Thank you so much for your suggestion, although it was my $featseq = $seqin->trunc($feature->start, $feature->end); since the subseq method just gives you a string with the sequence, whereas trunc returns a sequence object, which is what needs to be passed to write_seq. Cheers, Raul On Fri, 08-05-2009 at 13:04 -0700, Jason Stajich wrote: > The sequence isn't part of the report - or at least isn't parsed but > you can just do this (pseudo-y-code here). > my $seqout =Bio::SeqIO->new(-format => 'fasta'); > > > > > for my $feature ( @features ) > my $featseq = $seqin->subseq($feature->start, $feature->end); > $seqout->write_seq($featseq); > } > > > > On May 8, 2009, at 9:19 AM, Raul Mendez Giraldez wrote: > > > Hi, > > > > I'm trying to get coiled-coiled prediction in protein sequences > > using > > Bob Russell's program ncoils, through the bioperl interface > > Bio::Tools::Run::Coil, but the only thing I can get from any element > > on > > the features list is just the sequence name, and few more not so > > useful > > attributes.
> > > > I'm running the following script: > > > > > > #!/home/rmendez/bin/perl -w > > > > use strict; > > use FileHandle; > > use Data::Dumper; > > > > use Bio::Tools::Run::Coil; > > > > my $seqin=filein.fasta > > my $factory=Bio::Tools::Run::Coil->new('-c'); > > my @features=$factory->run($seqin); > > > > print "Printing content of features[0]\n"; > > print Dumper $features[0]; > > > > ---- > > > > And the output is (the content of the first element of the features > > array) is : > > '_gsf_tag_hash' => { > > 'percent_id' => [ > > 'NULL' > > ], > > 'hid' => [ > > 'ncoils' > > ], > > 'evalue' => [ > > 0 > > ] > > }, > > '_location' => bless( { > > '_location_type' => 'EXACT', > > '_start' => 138, > > '_end' => 172 > > }, 'Bio::Location::Simple' ), > > '_gsf_seq_id' => 'ENSDARP00000084927', > > '_parse_h' => {}, > > '_root_cleanup_methods' => [ > > sub { "DUMMY" } > > ], > > '_source_tag' => 'Coils', > > '_primary_tag' => 'ncoils', > > '_root_verbose' => 0 > > }, 'Bio::SeqFeature::Generic' ); > > > > Then how could I get the sequence itself with the coil annotation > > 'xxx'? 
> > > > Thanks, > > > > Raul > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Jason Stajich > jason at bioperl.org > > > > > > > > From wgallin at ualberta.ca Mon May 11 21:35:58 2009 From: wgallin at ualberta.ca (Warren Gallin) Date: Mon, 11 May 2009 19:35:58 -0600 Subject: [Bioperl-l] Eutilities epost/efetch problem Message-ID: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> Hi folks, Something started failing for me this morning that had been working reliably for the last week. I post an array of gi numbers, a history is successfully returned, but when I try to use efetch to get the records, it fails with the error: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Response Error Not Found STACK: Error::throw STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm:368 STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/Bio/DB/GenericWebAgent.pm:215 STACK: 090507_Stable_gb_update.pl:238 ----------------------------------------------------------- I'm running the efetch inside an eval and letting it try a total of 6 times with a 5-second sleep in between, but the error is consistent. So I consider two possibilities: 1) Has something changed on the Entrez server recently? Has anyone else started having this kind of problem? 2) Have I inserted some subtle flaw into my code that would lead to a failure of efetch? I am attaching two text files, one with the code chunklet that is doing this and the other the output from the script. Any help or suggestions are profoundly appreciated. Warren Gallin -------------- next part -------------- A non-text attachment was scrubbed...
Name: Fetch_Fail_Output Type: application/octet-stream Size: 2685 bytes Desc: not available URL: -------------- next part -------------- From cjfields at illinois.edu Mon May 11 23:07:56 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 11 May 2009 22:07:56 -0500 Subject: [Bioperl-l] fastq parsing problem In-Reply-To: References: Message-ID: <0A77B262-B808-4A02-82CB-16970EBF4C2C@illinois.edu> On May 9, 2009, at 5:55 AM, John Marshall wrote: > Michael Muratet wrote: >> I've got a problem parsing fastq output from the maq aligner. The >> parser is throwing an exception for the following record: >> >> @HWI-EAS146:3:1:2:177#0/1 >> CTCCGCTNNCTTCTCAG[...] >> + >> @,AB=>-&&:5).;+*=[...] >> >> I looked up the line in fastq.pm that does the parsing: >> >> 116 my ($top,$sequence,$top2,$qualsequence) = [...] > > This is the fastq parser from 1.5.2 or thereabouts, which had a bug > (the > $/ definition just above this code) that prevented it from parsing a > record with a quality line starting with "@". This was probably not > recognised as a bug for a long time due to the enduring myth that > fastq > quality lines always start with "!". > > The fastq next_seq() was rewritten for 1.6.0 and parses this > successfully. > (Unfortunately the documentation at the top of fastq.pm was not > updated > and still reflects the now-unused false belief about an initial "!" > quality.) > > You may be able to just drop 1.6.0's Bio/SeqIO/fastq.pm in front of > your > existing Bioperl installation, if you're a little crazy and don't > want to > update the installation properly. If you do that, or if you update, > you'll find that the new parser emits the following pedantic warning > for > your fastq sequences: > > MSG: Seq/Qual descriptions don't match; using sequence description > > In practice, lots of people (probably even most!) don't bother > putting the > sequence id on the "+" line, as it is entirely pointless duplication, > instead leaving the "+" line otherwise empty. 
So I hope the > maintainers > agree that this warning should be relaxed, such as in the attached > patch. > Or even removed -- there was no equivalent warning in the previous > code. > > Cheers, > > John Okay, patch committed (also removed the blurb about '!'). Thanks! chris From Russell.Smithies at agresearch.co.nz Mon May 11 23:55:39 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Tue, 12 May 2009 15:55:39 +1200 Subject: [Bioperl-l] alignable portion of a genome In-Reply-To: <23480025.post@talk.nabble.com> References: <23480025.post@talk.nabble.com> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> Perfect matches are easy:

    use strict;
    use warnings;

    my $seq = "atcgacgatcgaacgatcga";
    my ($h, $singles) = (0, 0);
    my %hash;

    # the lookahead captures every overlapping 5-mer (toy size; use 36 for real reads)
    foreach ($seq =~ /(?=(\w{5}))/g) { $h++; $hash{$_}++ }
    # count the 5-mers that were seen exactly once
    foreach (keys %hash) { $singles++ if $hash{$_} == 1 }
    print $singles / $h;

Could probably be done with map as well. Counting the mismatches might take a bit more thinking.... Any ideas MAJ? --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of fadista > Sent: Monday, 11 May 2009 9:32 p.m. > To: Bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] alignable portion of a genome > > > Hi, > > I would like to know of a good and fast way that could help me calculate the > alignable portion of a genome (not human), given a reference sequence. > When I say alignable portion I mean that I want to know all the positions of > the genome that can be covered uniquely by reads of 36 bp and up to 2 > mismatches. > > Some have advised me to work with Perl using the following strategy but I am > not a Perl user so if someone has already a script for this function, it > would be nice: > > "you could approach it by walking along the genome in a sliding window of > 36 nt, and hash the frequency of each 36 nt sequence that you encounter. > Then count how many of the 36 nt sequences had a frequency of exactly > one.
Divide this by the total number of 36nt windows visited. This > should be do-able in about 20 lines of Perl." > > > Best regards and thanks in advance > > -- > View this message in context: http://www.nabble.com/alignable-portion-of-a- > genome-tp23480025p23480025.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From dan.bolser at gmail.com Tue May 12 05:10:59 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Tue, 12 May 2009 10:10:59 +0100 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?) Message-ID: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> Thanks for the info guys, I think I was naively hoping that the feature would know how to cast itself as a 'SeqFeature' (GFF). I think I understand the problem better now, so I'll try to summarise: There is no standard way to encode a HSP as a feature (not least because there are two choices about which sequence (query or the hit) it should be attached to). BioPerl will try, but the result will not be "well structured" SeqFeatures or "well formed" GFF. 
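To make the "two choices" concrete, here is a hedged sketch that turns one line of BLAST tabular output (the 12-column -m 8 layout: query id, subject id, % identity, alignment length, mismatches, gap opens, q.start, q.end, s.start, s.end, e-value, bit score) into a GFF3 line, with the caller deciding whether the feature sits on the query or on the hit. It does not use BioPerl objects and is not what bp_search2gff.pl does; the function name blasttable_to_gff3, the 'blast' source tag and the lowercase attribute names are assumptions made for illustration.

```perl
use strict;
use warnings;

# Convert one BLAST tabular (-m 8) line into a GFF3 "match" line.
# $ref is 'query' or 'hit': the sequence the feature is placed on.
# The other sequence goes into the Target attribute.
sub blasttable_to_gff3 {
    my ($line, $ref) = @_;
    chomp $line;
    my ($qid, $sid, $pct, $len, $mm, $gaps,
        $qs, $qe, $ss, $se, $evalue, $bits) = split /\t/, $line;

    # A reversed coordinate pair on exactly one side means a minus-strand match.
    my $strand = (($qs > $qe) xor ($ss > $se)) ? '-' : '+';

    # GFF3 requires start <= end.
    ($qs, $qe) = ($qe, $qs) if $qs > $qe;
    ($ss, $se) = ($se, $ss) if $ss > $se;

    my ($seqid, $start, $end, $target) =
        $ref eq 'hit'
        ? ($sid, $ss, $se, "$qid $qs $qe")
        : ($qid, $qs, $qe, "$sid $ss $se");

    my $attrs = "Target=$target;pct_identity=$pct;bits=$bits";
    return join "\t", $seqid, 'blast', 'match', $start, $end,
                      $evalue, $strand, '.', $attrs;
}
```

Calling it once with 'query' and once with 'hit' on the same line gives two different, equally valid features, which is the ambiguity being discussed. IDs containing spaces or other reserved characters would need escaping before this output is loaded anywhere.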
From what I read I guess it should be possible to standardize this mapping (based on something in one of the examples or the 'search2gff' script), assuming you specify whether you want features put on the query or on the hit. At some point last year I was trying out the bp_search2gff.pl and my own code to write a GFF file for loading and viewing by Gbrowse. At that time I gave up, as nothing seemed to be working. I was hoping that doing this at a lower level (i.e. never writing any GFF myself) it would stand a better chance of working. Also I was thinking that Gbrowse, if given a SeqFeature::Store, could autoconfigure its interface to some degree. I guess it's back to the docs ;-) I'll keep trying and see if I can get anywhere. Thanks again, Dan. References for the above: 2009/5/11 Jason Stajich : > otherwise you need to be converting the HSPs into seqfeatures with the right associated information (i.e. the tag/value pairs that are in the 9th column) in order to have well structured data in the database. > You can get the individual features from the feature pair with $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call $hsp->hit->gff_string). Note that since the data storage is not structured in a GFF3 like-way this won't immediately produce well formed GFF3 for the 9th column. 2009/5/11 Chris Fields : > The main problem is the mapping is subjective based on what your reference sequence is within the BLAST run (e.g. whether it is the query or the hit), and is something that can't be automatically discerned.
I ended up rolling my own with SeqFeature::Store (just mapped the relevant data to Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts to integrate my changes in, just haven't had the time From miguel.pignatelli at uv.es Tue May 12 04:45:46 2009 From: miguel.pignatelli at uv.es (Miguel Pignatelli) Date: Tue, 12 May 2009 10:45:46 +0200 Subject: [Bioperl-l] alignable portion of a genome In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> References: <23480025.post@talk.nabble.com> <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> Message-ID: For mismatches, take a look at the CPAN module Text::LevenshteinXS which calculates the Levenshtein distance (edit distance) of two strings. For more information about Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance M; On 12/05/2009, at 5:55, Smithies, Russell wrote: > Perfect matches is easy: > > $seq = "atcgacgatcgaacgatcga"; > > foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++} > foreach (keys %hash){ $singles++ if($hash{$_} eq 1)} > print $singles/$h; > > Could probably be done with map as well. > Counting the miss-matches might take a bit more thinking.... > Any ideas MAJ? > > --Russell > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of fadista >> Sent: Monday, 11 May 2009 9:32 p.m. >> To: Bioperl-l at lists.open-bio.org >> Subject: [Bioperl-l] alignable portion of a genome >> >> >> Hi, >> >> I would like to know of a good and fast way that could help me >> calculate the >> alignable portion of a genome (not human), given a reference >> sequence. >> When I say alignable portion I mean that I want to know all the >> positions of >> the genome that can be covered uniquely by reads of 36 bp and up to 2 >> mismatches.
>> >> Some have advised me to work with Perl using the following strategy >> but I am >> not a Perl user so if someone has already a script for this >> function, it >> would be nice: >> >> "you could approach it by walking along the genome in a sliding >> window of >> 36 nt, and hash the frequency of each 36 nt sequence that you >> encounter. >> Then count how many of the 36 nt sequences had a frequency of exactly >> one. Divide this by the total number of 36nt windows visited. This >> should be do-able in about 20 lines of Perl." >> >> >> Best regards and thanks in advance >> >> -- >> View this message in context: http://www.nabble.com/alignable-portion-of-a- >> genome-tp23480025p23480025.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > = > ====================================================================== > Attention: The information contained in this message and/or > attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or > privileged > material. Any review, retransmission, dissemination or other use of, > or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by > AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. 
> = > ====================================================================== > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From dan.bolser at gmail.com Tue May 12 06:11:39 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Tue, 12 May 2009 11:11:39 +0100 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?) In-Reply-To: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> Message-ID: <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com> Unfortunately bp_search2gff.pl is giving me errors: bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f blasttable -o BlastResults/blast_table_filtered.gff -t hit --match --target --component --------------------- WARNING --------------------- MSG: Removing score value(s) --------------------------------------------------- Can't locate object method "remove_tags" via package "Bio::SeqFeature::Similarity" at /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line 393, line 5. Anyone seen this before? Cheers, Dan. 2009/5/12 Dan Bolser : > Thanks for the info guys, I think I was naively hoping that the > feature would know how to cast itself as a 'SeqFeature' (GFF). > > I think I understand the problem better now, so I'll try to summarise: > > There is no standard way to encode a HSP as a feature (not least > because there are two choices about which sequence (query or the hit) > it should be attached to). BioPerl will try, but the result will not > be "well structured" SeqFeatures or "well formed" GFF. > > > From what I read I guess it should be possible to standardize this > mapping (based on something in one of the examples or the 'search2gff' > script), assuming you specify weather you want features put on the > query or on the hit. 
>
> At some point last year I was trying out the bp_search2gff.pl and my
> own code to write a GFF file for loading and viewing by Gbrowse. At
> that time I gave up, as nothing seemed to be working. I was hoping
> that doing this at a lower level (i.e. never writing any GFF myself)
> would stand a better chance of working.
>
> Also I was thinking that Gbrowse, if given a SeqFeature::Store, could
> autoconfigure its interface to some degree. I guess it's back to the
> docs ;-)
>
> I'll keep trying and see if I can get anywhere.
>
> Thanks again,
> Dan.
>
> References for the above:
>
> 2009/5/11 Jason Stajich :
>
>> otherwise you need to be converting the HSPs into seqfeatures with the right associated information (i.e. the tag/value pairs that are in the 9th column) in order to have well structured data in the database.
>
>> You can get the individual features from the feature pair with $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call $hsp->hit->gff_string). Note that since the data storage is not structured in a GFF3-like way this won't immediately produce well formed GFF3 for the 9th column.
>
> 2009/5/11 Chris Fields :
>
>> The main problem is the mapping is subjective based on what your reference sequence is within the BLAST run (e.g. whether it is the query or the hit), and is something that can't be automatically discerned. I ended up rolling my own with SeqFeature::Store (just mapped the relevant data to Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts to integrate my changes in, just haven't had the time
>

From dan.bolser at gmail.com Tue May 12 06:55:34 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Tue, 12 May 2009 11:55:34 +0100
Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?)
In-Reply-To: <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com>
References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com>
Message-ID: <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com>

2009/5/12 Dan Bolser :
> Unfortunately bp_search2gff.pl is giving me errors:
>
> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f
> blasttable -o BlastResults/blast_table_filtered.gff -t hit
> --match --target --component
>
> --------------------- WARNING ---------------------
> MSG: Removing score value(s)
> ---------------------------------------------------
> Can't locate object method "remove_tags" via package
> "Bio::SeqFeature::Similarity" at
> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line
> 393, line 5.


I'm just learning the ropes...

--- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 15:25:55.000000000 +0100
+++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 11:52:41.000000000 +0100
@@ -390,7 +390,7 @@
        }
        if ($self->has_tag('score')) {
            $self->warn("Removing score value(s)");
-           $self->remove_tags('score');
+           $self->remove_tag('score');
        }
        $self->add_tag_value('score',$value);
    }


> Anyone seen this before?
>
> Cheers,
> Dan.
>
>
>
> 2009/5/12 Dan Bolser :
>> Thanks for the info guys, I think I was naively hoping that the
>> feature would know how to cast itself as a 'SeqFeature' (GFF).
>>
>> I think I understand the problem better now, so I'll try to summarise:
>>
>> There is no standard way to encode a HSP as a feature (not least
>> because there are two choices about which sequence (query or the hit)
>> it should be attached to). BioPerl will try, but the result will not
>> be "well structured" SeqFeatures or "well formed" GFF.
>>
>>
>> From what I read I guess it should be possible to standardize this
>> mapping (based on something in one of the examples or the 'search2gff'
>> script), assuming you specify whether you want features put on the
>> query or on the hit.
>>
>> At some point last year I was trying out the bp_search2gff.pl and my
>> own code to write a GFF file for loading and viewing by Gbrowse. At
>> that time I gave up, as nothing seemed to be working. I was hoping
>> that doing this at a lower level (i.e. never writing any GFF myself)
>> would stand a better chance of working.
>>
>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, could
>> autoconfigure its interface to some degree. I guess it's back to the
>> docs ;-)
>>
>> I'll keep trying and see if I can get anywhere.
>>
>> Thanks again,
>> Dan.
>>
>> References for the above:
>>
>> 2009/5/11 Jason Stajich :
>>
>>> otherwise you need to be converting the HSPs into seqfeatures with the right associated information (i.e. the tag/value pairs that are in the 9th column) in order to have well structured data in the database.
>>
>>> You can get the individual features from the feature pair with $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call $hsp->hit->gff_string). Note that since the data storage is not structured in a GFF3-like way this won't immediately produce well formed GFF3 for the 9th column.
>>
>> 2009/5/11 Chris Fields :
>>
>>> The main problem is the mapping is subjective based on what your reference sequence is within the BLAST run (e.g. whether it is the query or the hit), and is something that can't be automatically discerned. I ended up rolling my own with SeqFeature::Store (just mapped the relevant data to Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts to integrate my changes in, just haven't had the time
>>
>

From ajmackey at gmail.com Tue May 12 08:18:25 2009
From: ajmackey at gmail.com (Aaron Mackey)
Date: Tue, 12 May 2009 08:18:25 -0400
Subject: [Bioperl-l] alignable portion of a genome
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz>
References: <23480025.post@talk.nabble.com> <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz>
Message-ID: <24c96eca0905120518v4a5aae30r364986ef1211afaf@mail.gmail.com>

A better idea than using Perl to count the mismatches is to actually generate all the (unique) 36-mers from the reference genome as an artificial set of "reads", and then use a program like Mosaik or maq to align them back to the reference genome. Both tools have means of then reporting the coverage along the genome of uniquely aligned reads. That way you can also change the Mosaik/maq parameters to reflect your true read alignment strategy.

-Aaron

On Mon, May 11, 2009 at 11:55 PM, Smithies, Russell <Russell.Smithies at agresearch.co.nz> wrote:

> Perfect matches are easy:
>
> $seq = "atcgacgatcgaacgatcga";
>
> foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> foreach (keys %hash){ $singles++ if($hash{$_} == 1)}
> print $singles/$h;
>
> Could probably be done with map as well.
> Counting the mismatches might take a bit more thinking....
> Any ideas MAJ?
>
> --Russell
>
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of fadista
> > Sent: Monday, 11 May 2009 9:32 p.m.
> > To: Bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] alignable portion of a genome > > > > > > Hi, > > > > I would like to know of a good and fast way that could help me calculate > the > > alignable portion of a genome (not human), given a reference sequence. > > When I say alignable portion I mean that I want to know all the positions > of > > the genome that can be covered uniquely by reads of 36 bp and up to 2 > > mismatches. > > > > Some have advised me to work with Perl using the following strategy but I > am > > not a Perl user so if someone has already a script for this function, it > > would be nice: > > > > "you could approach it by walking along the genome in a sliding window of > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter. > > Then count how many of the 36 nt sequences had a frequency of exactly > > one. Divide this by the total number of 36nt windows visited. This > > should be do-able in about 20 lines of Perl." > > > > > > Best regards and thanks in advance > > > > -- > > View this message in context: > http://www.nabble.com/alignable-portion-of-a- > > genome-tp23480025p23480025.html > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. 
If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at illinois.edu Tue May 12 08:23:35 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 07:23:35 -0500 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?) In-Reply-To: <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com> References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com> <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com> Message-ID: <9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu> Fixed that in svn. We're all still learning the ropes... chris On May 12, 2009, at 5:55 AM, Dan Bolser wrote: > 2009/5/12 Dan Bolser : >> Unfortunately bp_search2gff.pl is giving me errors: >> >> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered >> -f >> blasttable -o BlastResults/blast_table_filtered.gff -t hit >> --match --target --component >> >> --------------------- WARNING --------------------- >> MSG: Removing score value(s) >> --------------------------------------------------- >> Can't locate object method "remove_tags" via package >> "Bio::SeqFeature::Similarity" at >> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line >> 393, line 5. > > > I'm just learning the ropes... 
> > --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 > 15:25:55.000000000 +0100 > +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 > 11:52:41.000000000 +0100 > @@ -390,7 +390,7 @@ > } > if ($self->has_tag('score')) { > $self->warn("Removing score value(s)"); > - $self->remove_tags('score'); > + $self->remove_tag('score'); > } > $self->add_tag_value('score',$value); > } > > > > > >> Anyone seen this before? >> >> Cheers, >> Dan. >> >> >> >> 2009/5/12 Dan Bolser : >>> Thanks for the info guys, I think I was naively hoping that the >>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>> >>> I think I understand the problem better now, so I'll try to >>> summarise: >>> >>> There is no standard way to encode a HSP as a feature (not least >>> because there are two choices about which sequence (query or the >>> hit) >>> it should be attached to). BioPerl will try, but the result will not >>> be "well structured" SeqFeatures or "well formed" GFF. >>> >>> >>> From what I read I guess it should be possible to standardize this >>> mapping (based on something in one of the examples or the >>> 'search2gff' >>> script), assuming you specify weather you want features put on the >>> query or on the hit. >>> >>> At some point last year I was trying out the bp_search2gff.pl and my >>> own code to write a GFF file for loading and viewing by Gbrowse. At >>> that time I gave up, as nothing seemed to be working. I was hoping >>> that doing this at a lower level (i.e. never writing any GFF myself) >>> it would stand a better chance of working. >>> >>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, >>> could >>> autoconfigure its interface to some degree. I guess its back to the >>> docs ;-) >>> >>> >>> >>> I'll keep trying and see if I can get anywhere. >>> >>> Thanks again, >>> Dan. 
>>>
>>>
>>> References for the above:
>>>
>>> 2009/5/11 Jason Stajich :
>>>
>>>> otherwise you need to be converting the HSPs into seqfeatures
>>>> with the right associated information (i.e. the tag/value pairs
>>>> that are in the 9th column) in order to have well structured data
>>>> in the database.
>>>
>>>> You can get the individual features from the feature pair with
>>>> $hsp->query or $hsp->hit which can also be passed to a GFF
>>>> writer (or call $hsp->hit->gff_string). Note that since the
>>>> data storage is not structured in a GFF3-like way this won't
>>>> immediately produce well formed GFF3 for the 9th column.
>>>
>>>
>>> 2009/5/11 Chris Fields :
>>>
>>>> The main problem is the mapping is subjective based on what your
>>>> reference sequence is within the BLAST run (e.g. whether it is
>>>> the query or the hit), and is something that can't be
>>>> automatically discerned. I ended up rolling my own with
>>>> SeqFeature::Store (just mapped the relevant data to
>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the
>>>> relevant scripts to integrate my changes in, just haven't had the
>>>> time
>>>
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From dan.bolser at gmail.com Tue May 12 09:17:56 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Tue, 12 May 2009 14:17:56 +0100
Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?)
In-Reply-To: <9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu>
References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com> <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com> <9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu>
Message-ID: <2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com>

2009/5/12 Chris Fields :
> Fixed that in svn. We're all still learning the ropes...
In that case, I'm seeing multiple instances of...

Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 256
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 412
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 429
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 465
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 473
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 494
Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 502

Hmm... I was about to go on to complain about the weird GFF that I was seeing, but suddenly it looks OK. My bioperl install must think you're standing over my shoulder and is therefore behaving itself!

Thanks again for all the help,
Dan.

> chris
>
> On May 12, 2009, at 5:55 AM, Dan Bolser wrote:
>
>> 2009/5/12 Dan Bolser :
>>>
>>> Unfortunately bp_search2gff.pl is giving me errors:
>>>
>>> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f
>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit
>>> --match --target --component
>>>
>>> --------------------- WARNING ---------------------
>>> MSG: Removing score value(s)
>>> ---------------------------------------------------
>>> Can't locate object method "remove_tags" via package
>>> "Bio::SeqFeature::Similarity" at
>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line
>>> 393, line 5.
>>
>>
>> I'm just learning the ropes...
>>
>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 15:25:55.000000000 +0100
>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 11:52:41.000000000 +0100
>> @@ -390,7 +390,7 @@
>>         }
>>         if ($self->has_tag('score')) {
>>             $self->warn("Removing score value(s)");
>> -           $self->remove_tags('score');
>> +           $self->remove_tag('score');
>>         }
>>         $self->add_tag_value('score',$value);
>>     }
>>
>>
>>> Anyone seen this before?
>>>
>>> Cheers,
>>> Dan.
>>>
>>>
>>>
>>> 2009/5/12 Dan Bolser :
>>>>
>>>> Thanks for the info guys, I think I was naively hoping that the
>>>> feature would know how to cast itself as a 'SeqFeature' (GFF).
>>>>
>>>> I think I understand the problem better now, so I'll try to summarise:
>>>>
>>>> There is no standard way to encode a HSP as a feature (not least
>>>> because there are two choices about which sequence (query or the hit)
>>>> it should be attached to). BioPerl will try, but the result will not
>>>> be "well structured" SeqFeatures or "well formed" GFF.
>>>>
>>>>
>>>> From what I read I guess it should be possible to standardize this
>>>> mapping (based on something in one of the examples or the 'search2gff'
>>>> script), assuming you specify whether you want features put on the
>>>> query or on the hit.
>>>>
>>>> At some point last year I was trying out the bp_search2gff.pl and my
>>>> own code to write a GFF file for loading and viewing by Gbrowse. At
>>>> that time I gave up, as nothing seemed to be working. I was hoping
>>>> that doing this at a lower level (i.e. never writing any GFF myself)
>>>> would stand a better chance of working.
>>>>
>>>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, could
>>>> autoconfigure its interface to some degree. I guess it's back to the
>>>> docs ;-)
>>>>
>>>> I'll keep trying and see if I can get anywhere.
>>>>
>>>> Thanks again,
>>>> Dan.
>>>>
>>>> References for the above:
>>>>
>>>> 2009/5/11 Jason Stajich :
>>>>
>>>>> otherwise you need to be converting the HSPs into seqfeatures with the
>>>>> right associated information (i.e. the tag/value pairs that are in the 9th
>>>>> column) in order to have well structured data in the database.
>>>>
>>>>> You can get the individual features from the feature pair with
>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call
>>>>> $hsp->hit->gff_string). Note that since the data storage is not structured
>>>>> in a GFF3-like way this won't immediately produce well formed GFF3 for the
>>>>> 9th column.
>>>>
>>>>
>>>> 2009/5/11 Chris Fields :
>>>>
>>>>> The main problem is the mapping is subjective based on what your
>>>>> reference sequence is within the BLAST run (e.g. whether it is the query or
>>>>> the hit), and is something that can't be automatically discerned. I ended
>>>>> up rolling my own with SeqFeature::Store (just mapped the relevant data to
>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant scripts
>>>>> to integrate my changes in, just haven't had the time
>>>>
>>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
>

From maj at fortinbras.us Tue May 12 09:29:32 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 12 May 2009 09:29:32 -0400
Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?)
In-Reply-To: <2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com>
References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com> <2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com> <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com> <9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu> <2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com>
Message-ID:

This sounds like a

$sum = eval join( '+', @a);

problem, which can be fixed with

$sum = eval join('+', map { $_ || () } @a) ;

MAJ

----- Original Message -----
From: "Dan Bolser"
To: "Chris Fields"
Cc: "BioPerl List"
Sent: Tuesday, May 12, 2009 9:17 AM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' from SearchIO?)
2009/5/12 Chris Fields : > Fixed that in svn. We're all still learning the ropes... In that case, I'm seeing multiple instances of... Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 256 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 412 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 429 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 465 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 473 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 494 Argument "" isn't numeric in addition (+) at Bio/Search/SearchUtils.pm line 502 Hmm... I was about to go on to complain about the weird GFF that I was seeing, but suddenly it looks OK. My bioperl install must think your standing over my shoulder and is therefore behaving itself! Thanks again for all the help, Dan. > chris > > On May 12, 2009, at 5:55 AM, Dan Bolser wrote: > >> 2009/5/12 Dan Bolser : >>> >>> Unfortunately bp_search2gff.pl is giving me errors: >>> >>> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f >>> blasttable -o BlastResults/blast_table_filtered.gff -t hit >>> --match --target --component >>> >>> --------------------- WARNING --------------------- >>> MSG: Removing score value(s) >>> --------------------------------------------------- >>> Can't locate object method "remove_tags" via package >>> "Bio::SeqFeature::Similarity" at >>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line >>> 393, line 5. >> >> >> I'm just learning the ropes... 
>> >> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >> 15:25:55.000000000 +0100 >> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >> 11:52:41.000000000 +0100 >> @@ -390,7 +390,7 @@ >> } >> if ($self->has_tag('score')) { >> $self->warn("Removing score value(s)"); >> - $self->remove_tags('score'); >> + $self->remove_tag('score'); >> } >> $self->add_tag_value('score',$value); >> } >> >> >> >> >> >>> Anyone seen this before? >>> >>> Cheers, >>> Dan. >>> >>> >>> >>> 2009/5/12 Dan Bolser : >>>> >>>> Thanks for the info guys, I think I was naively hoping that the >>>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>>> >>>> I think I understand the problem better now, so I'll try to summarise: >>>> >>>> There is no standard way to encode a HSP as a feature (not least >>>> because there are two choices about which sequence (query or the hit) >>>> it should be attached to). BioPerl will try, but the result will not >>>> be "well structured" SeqFeatures or "well formed" GFF. >>>> >>>> >>>> From what I read I guess it should be possible to standardize this >>>> mapping (based on something in one of the examples or the 'search2gff' >>>> script), assuming you specify weather you want features put on the >>>> query or on the hit. >>>> >>>> At some point last year I was trying out the bp_search2gff.pl and my >>>> own code to write a GFF file for loading and viewing by Gbrowse. At >>>> that time I gave up, as nothing seemed to be working. I was hoping >>>> that doing this at a lower level (i.e. never writing any GFF myself) >>>> it would stand a better chance of working. >>>> >>>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, could >>>> autoconfigure its interface to some degree. I guess its back to the >>>> docs ;-) >>>> >>>> >>>> >>>> I'll keep trying and see if I can get anywhere. >>>> >>>> Thanks again, >>>> Dan. 
>>>> >>>> >>>> >>>> References for the above: >>>> >>>> 2009/5/11 Jason Stajich : >>>> >>>>> otherwise you need to be converting the HSPs into seqfeatures with the >>>>> right associated information (i.e. the tag/value pairs that are in the 9th >>>>> column) in order to have well structured data in the database. >>>> >>>>> You can get the individual features from the feature pair with >>>>> $hsp->query or $hsp->hit which can also be passed to a GFF writer (or call >>>>> $hsp->hit->gff_string). Note that since the data storage is not structured >>>>> in a GFF3 like-way this won't immediately produce well formed GFF3 for the >>>>> 9th column. >>>> >>>> >>>> 2009/5/11 Chris Fields : >>>> >>>>> The main problem is the mapping is subjective based on what your >>>>> reference sequence is within the BLAST run (e.g. whether it is the query >>>>> or >>>>> the hit), and is something that can't be automatically discerned. I ended >>>>> up rolling my own with SeqFeature::Store (just mapped the relevant data to >>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant >>>>> scripts >>>>> to integrate my changes in, just haven't had the time >>>> >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Tue May 12 10:04:26 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 09:04:26 -0500 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features' fromSearchIO?) 
In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu> <2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> Message-ID: More complicated than that, I'm afraid. We should try to fix that at the source of the problem. This appears to stem from SearchUtils HSP tiling, which in turn utilizes HSPI::matches(), which in turn checks num_identical/ num_conserved. My guess is, since this is blasttable format, one of these isn't set and thus is returning the wrong value. I'll attempt to track it down today, but it may take some time. chris On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote: > This sounds like a > > $sum = eval join( '+', @a); > > problem, which can be fixed with > > $sum = eval join('+', map { $_ || () } @a) ; > > MAJ > ----- Original Message ----- From: "Dan Bolser" > To: "Chris Fields" > Cc: "BioPerl List" > Sent: Tuesday, May 12, 2009 9:17 AM > Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' > fromSearchIO?) > > > 2009/5/12 Chris Fields : >> Fixed that in svn. We're all still learning the ropes... > > In that case, I'm seeing multiple instances of... > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 256 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 412 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 429 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 465 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 473 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 494 > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 502 > > > Hmm... 
I was about to go on to complain about the weird GFF that I was > seeing, but suddenly it looks OK. My bioperl install must think your > standing over my shoulder and is therefore behaving itself! > > > Thanks again for all the help, > Dan. > > > > >> chris >> >> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >> >>> 2009/5/12 Dan Bolser : >>>> >>>> Unfortunately bp_search2gff.pl is giving me errors: >>>> >>>> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered >>>> -f >>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit >>>> --match --target --component >>>> >>>> --------------------- WARNING --------------------- >>>> MSG: Removing score value(s) >>>> --------------------------------------------------- >>>> Can't locate object method "remove_tags" via package >>>> "Bio::SeqFeature::Similarity" at >>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm >>>> line >>>> 393, line 5. >>> >>> >>> I'm just learning the ropes... >>> >>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >>> 15:25:55.000000000 +0100 >>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >>> 11:52:41.000000000 +0100 >>> @@ -390,7 +390,7 @@ >>> } >>> if ($self->has_tag('score')) { >>> $self->warn("Removing score value(s)"); >>> - $self->remove_tags('score'); >>> + $self->remove_tag('score'); >>> } >>> $self->add_tag_value('score',$value); >>> } >>> >>> >>> >>> >>> >>>> Anyone seen this before? >>>> >>>> Cheers, >>>> Dan. >>>> >>>> >>>> >>>> 2009/5/12 Dan Bolser : >>>>> >>>>> Thanks for the info guys, I think I was naively hoping that the >>>>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>>>> >>>>> I think I understand the problem better now, so I'll try to >>>>> summarise: >>>>> >>>>> There is no standard way to encode a HSP as a feature (not least >>>>> because there are two choices about which sequence (query or the >>>>> hit) >>>>> it should be attached to). 
BioPerl will try, but the result will >>>>> not >>>>> be "well structured" SeqFeatures or "well formed" GFF. >>>>> >>>>> >>>>> From what I read I guess it should be possible to standardize this >>>>> mapping (based on something in one of the examples or the >>>>> 'search2gff' >>>>> script), assuming you specify weather you want features put on the >>>>> query or on the hit. >>>>> >>>>> At some point last year I was trying out the bp_search2gff.pl >>>>> and my >>>>> own code to write a GFF file for loading and viewing by Gbrowse. >>>>> At >>>>> that time I gave up, as nothing seemed to be working. I was hoping >>>>> that doing this at a lower level (i.e. never writing any GFF >>>>> myself) >>>>> it would stand a better chance of working. >>>>> >>>>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, >>>>> could >>>>> autoconfigure its interface to some degree. I guess its back to >>>>> the >>>>> docs ;-) >>>>> >>>>> >>>>> >>>>> I'll keep trying and see if I can get anywhere. >>>>> >>>>> Thanks again, >>>>> Dan. >>>>> >>>>> >>>>> >>>>> References for the above: >>>>> >>>>> 2009/5/11 Jason Stajich : >>>>> >>>>>> otherwise you need to be converting the HSPs into seqfeatures >>>>>> with the >>>>>> right associated information (i.e. the tag/value pairs that are >>>>>> in the 9th >>>>>> column) in order to have well structured data in the database. >>>>> >>>>>> You can get the individual features from the feature pair with >>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF >>>>>> writer (or call >>>>>> $hsp->hit->gff_string). Note that since the data storage is not >>>>>> structured >>>>>> in a GFF3 like-way this won't immediately produce well formed >>>>>> GFF3 for the >>>>>> 9th column. >>>>> >>>>> >>>>> 2009/5/11 Chris Fields : >>>>> >>>>>> The main problem is the mapping is subjective based on what your >>>>>> reference sequence is within the BLAST run (e.g. 
whether it is >>>>>> the query or >>>>>> the hit), and is something that can't be automatically >>>>>> discerned. I ended >>>>>> up rolling my own with SeqFeature::Store (just mapped the >>>>>> relevant data to >>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the >>>>>> relevant scripts >>>>>> to integrate my changes in, just haven't had the time >>>>> >>>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From mmuratet at hudsonalpha.org Tue May 12 10:31:21 2009 From: mmuratet at hudsonalpha.org (Michael Muratet) Date: Tue, 12 May 2009 09:31:21 -0500 Subject: [Bioperl-l] fastq parsing problem In-Reply-To: References: Message-ID: On May 9, 2009, at 5:55 AM, John Marshall wrote: > Michael Muratet wrote: >> I've got a problem parsing fastq output from the maq aligner. The >> parser is throwing an exception for the following record: >> >> @HWI-EAS146:3:1:2:177#0/1 >> CTCCGCTNNCTTCTCAG[...] >> + >> @,AB=>-&&:5).;+*=[...] >> >> I looked up the line in fastq.pm that does the parsing: >> >> 116 my ($top,$sequence,$top2,$qualsequence) = [...] > > This is the fastq parser from 1.5.2 or thereabouts, which had a bug > (the > $/ definition just above this code) that prevented it from parsing a > record with a quality line starting with "@". This was probably not > recognised as a bug for a long time due to the enduring myth that > fastq > quality lines always start with "!". > > The fastq next_seq() was rewritten for 1.6.0 and parses this > successfully. > (Unfortunately the documentation at the top of fastq.pm was not > updated > and still reflects the now-unused false belief about an initial "!" > quality.) 
> > You may be able to just drop 1.6.0's Bio/SeqIO/fastq.pm in front of > your > existing Bioperl installation, if you're a little crazy and don't > want to > update the installation properly. If you do that, or if you update, > you'll find that the new parser emits the following pedantic warning > for > your fastq sequences: > John I did install 1.6.0 (which is very smooth, my compliments to the chefs) and it solved the problem except for the warning you note which Chris Fields fixed this morning. Thanks for the help. Mike > MSG: Seq/Qual descriptions don't match; using sequence description > > In practice, lots of people (probably even most!) don't bother > putting the > sequence id on the "+" line, as it is entirely pointless duplication, > instead leaving the "+" line otherwise empty. So I hope the > maintainers > agree that this warning should be relaxed, such as in the attached > patch. > Or even removed -- there was no equivalent warning in the previous > code. > > Cheers, > > John > > > > -- > The Wellcome Trust Sanger Institute is operated by Genome Research > Limited, a charity registered in England with number 1021457 and a > company registered in England with number 2742969, whose registered > office is 215 Euston Road, London, NW1 2BE. > From KBriedis at accelrys.com Tue May 12 13:19:39 2009 From: KBriedis at accelrys.com (Kristine Briedis) Date: Tue, 12 May 2009 13:19:39 -0400 Subject: [Bioperl-l] Eutilities epost/efetch problem In-Reply-To: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> Message-ID: Hi Warren, We've noticed the same EFetch error. I emailed NCBI and will let you know what they say. Cheers, Kristine =============================== Kristine Briedis, Ph.D. Bioinformatics Software Engineer Accelrys, Inc. 
10188 Telesis Court, Suite 100 San Diego, CA 92121 USA kbriedis at accelrys.com -----Original Message----- From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Warren Gallin Sent: Monday, May 11, 2009 6:36 PM To: BioPerl List Subject: [Bioperl-l] Eutilities epost/efetch problem Hi folks, Something started failing for me this morning that had been working reliably for the last week. I post an array of gi numbers, a history is successfully returned, but when I try to use efetch to get the records, it fails with the error: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Response Error Not Found STACK: Error::throw STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm:368 STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/Bio/DB/GenericWebAgent.pm:215 STACK: 090507_Stable_gb_update.pl:238 ----------------------------------------------------------- I'm running the efetch inside an eval and letting it try a total of 6 times with a 5 second sleep in between, but the error is consistent. So I consider two possibilities: 1) Has something changed on the Entrez server recently? Has anyone else started having this kind of problem? 2) Have I inserted some subtle flaw into my code that would lead to a failure of efetch? I am attaching two text files, one with the code chunklet that is doing this and the other the output from the script. Any help or suggestions are profoundly appreciated. Warren Gallin From bix at sendu.me.uk Tue May 12 14:11:44 2009 From: bix at sendu.me.uk (Sendu Bala) Date: Tue, 12 May 2009 19:11:44 +0100 Subject: [Bioperl-l] fastq parsing problem In-Reply-To: References: Message-ID: <4A09BBE0.7010000@sendu.me.uk> John Marshall wrote: > Michael Muratet wrote: >> I've got a problem parsing fastq output from the maq aligner. The >> parser is throwing an exception for the following record: >> >> @HWI-EAS146:3:1:2:177#0/1 >> CTCCGCTNNCTTCTCAG[...]
>> + >> @,AB=>-&&:5).;+*=[...] >> >> I looked up the line in fastq.pm that does the parsing: >> >> 116 my ($top,$sequence,$top2,$qualsequence) = [...] > > This is the fastq parser from 1.5.2 or thereabouts, which had a bug (the > $/ definition just above this code) that prevented it from parsing a > record with a quality line starting with "@". This was probably not > recognised as a bug for a long time due to the enduring myth that fastq > quality lines always start with "!". I see you talked about it in the discussion page, but I think it might be time to change the wiki page as well: http://www.bioperl.org/wiki/FASTQ_sequence_format That caught me out as well. *sigh* From gmodhelp at googlemail.com Tue May 12 13:36:27 2009 From: gmodhelp at googlemail.com (Dave Clements, GMOD Help Desk) Date: Tue, 12 May 2009 10:36:27 -0700 Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem with Module::Builder versions In-Reply-To: <4A03986D.7080007@gmail.com> References: <4A03986D.7080007@gmail.com> Message-ID: <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> Hi Neil, I'm cross-posting your question to the BioPerl list as 1) it is more of a perl question than a GBrowse question, and 2) I don't know the answer. Dave C. GMOD Help Desk Was this helpful? Let us know at http://gmod.org/wiki/Help_Desk_Feedback Learn more about GMOD at SMBE & Arthropod Genomics: http://ccg.biology.uiowa.edu/smbe/symposia.php?action=view&sym_ID=27 http://www.k-state.edu/agc/symp2009/seminar.html On Thu, May 7, 2009 at 7:26 PM, Neil Saunders wrote: > I'm trying to install the latest Gbrowse (1.99) on a machine where I do > not have root access (Ubuntu/dapper). > > I have set up non-root CPAN and installed all of the prerequisites, no > problems, in ~/lib/perl5. However, when I try to install Gbrowse either > via CPAN or using the latest CVS Build script, I run into this problem: > > Global symbol "$VAR1" requires explicit package name at (eval 28) line > 1088, line 1. >
...propagated at /usr/local/share/perl/5.8.7/Module/Build/Base.pm line > 1002, line 1. > make: *** [all] Error 255 > LDS/GBrowse-1.99.tar.gz > /usr/bin/make -- NOT OK > > > It seems that there are 2 versions of Module::Builder on the machine. I > have installed a version from CPAN which is found in > ~/lib/perl5/site_perl/Module/. However, from the above error it looks > as though the install is trying to use a system-wide version of > Module::Build in /usr/local/share/perl/5.8.7. > > Can anyone shed any light on either the error message, or a way to force > usage of my $HOME module, not the system one? > > thanks, > Neil Saunders > -- > Statistical Bioinformatics - Health > CSIRO Mathematical and Information Sciences > Locked Bag 17, North Ryde, NSW 1670, Australia > > http://friendfeed.com/neilfws > > _______________________________________________ > Gmod-gbrowse mailing list > Gmod-gbrowse at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse > From cjfields at illinois.edu Tue May 12 14:36:25 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 13:36:25 -0500 Subject: [Bioperl-l] Eutilities epost/efetch problem In-Reply-To: References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> Message-ID: <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu> Not showing up in tests, so this may be something very specific that changed. I'll try to reproduce it.
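For anyone hitting intermittent failures like this one, the retry pattern from the original report (an eval around the fetch, up to 6 attempts, 5 seconds apart) might look roughly like the sketch below. This is only an illustration: the helper name with_retries and the flaky stand-in sub are made up here, and a real script would put its own get_Response call inside the code reference.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): run $code up to $max_tries
# times, sleeping $delay seconds between failed attempts. Returns whatever
# $code returns, or dies with the last error once all attempts are used up.
sub with_retries {
    my ($code, $max_tries, $delay) = @_;
    my $last_error = 'unknown error';
    for my $attempt (1 .. $max_tries) {
        my @result;
        # The trailing 1 makes the eval return true only if $code did not die.
        my $ok = eval { @result = $code->(); 1 };
        return wantarray ? @result : $result[0] if $ok;
        $last_error = $@ || 'unknown error';
        warn "Attempt $attempt of $max_tries failed: $last_error";
        sleep $delay if $attempt < $max_tries;
    }
    die "Giving up after $max_tries attempts: $last_error";
}

# Stand-in for a remote call that fails twice and then succeeds.
# The delay is 0 here so the demo runs instantly; the report used 5.
my $calls = 0;
my $value = with_retries(sub {
    $calls++;
    die "Response Error\nNot Found\n" if $calls < 3;
    return "fetched on call $calls";
}, 6, 0);
print "$value\n";
```

A retry loop like this only helps with short outages; if the remote service is down for longer, the final die still propagates, which is probably what you want in a pipeline.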
chris On May 12, 2009, at 12:19 PM, Kristine Briedis wrote: > Hi Warren, > > We've noticed the same EFetch error. I emailed NCBI and will let > you know what they say. > > Cheers, > Kristine > > > =============================== > Kristine Briedis, Ph.D. > Bioinformatics Software Engineer > Accelrys, Inc. > 10188 Telesis Court, Suite 100 > San Diego, CA 92121 USA > kbriedis at accelrys.com > > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org > ] On Behalf Of Warren Gallin > Sent: Monday, May 11, 2009 6:36 PM > To: BioPerl List > Subject: [Bioperl-l] Eutilities epost/efetch problem > > Hi folks, > > Something started failing for me this morning that had been working > reliably for the last week, > > I post an array of gi numbers, a history is successfully returned, > but when I try to use efetch to get the records, it fails with the > error: > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Response Error > Not Found > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm:368 > STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/Bio/ > DB/GenericWebAgent.pm:215 > STACK: 090507_Stable_gb_update.pl:238 > ----------------------------------------------------------- > > > I'm running the efetch inside an eval and letting it try a total of 6 > times with a 5 second sleep in between, but the error is consistent. > > So I consider two possibilities: > 1) Has something changed on the Entrez server recently? Has anyone > else started having this kind of problem? > > 2) Have I inserted some subtle flaw into my code that would lead to a > failure of efetch. > > I am attaching two text files, one with the code chunklet that is > doing this and the other the output from the script. > > Any help or suggestions are profoundly appreciated.
> > Warren Gallin > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Tue May 12 14:36:40 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 13:36:40 -0500 Subject: [Bioperl-l] fastq parsing problem In-Reply-To: <4A09BBE0.7010000@sendu.me.uk> References: <4A09BBE0.7010000@sendu.me.uk> Message-ID: On May 12, 2009, at 1:11 PM, Sendu Bala wrote: > John Marshall wrote: >> Michael Muratet wrote: >>> I've got a problem parsing fastq output from the maq aligner. The >>> parser is throwing an exception for the following record: >>> >>> @HWI-EAS146:3:1:2:177#0/1 >>> CTCCGCTNNCTTCTCAG[...] >>> + >>> @,AB=>-&&:5).;+*=[...] >>> >>> I looked up the line in fastq.pm that does the parsing: >>> >>> 116 my ($top,$sequence,$top2,$qualsequence) = [...] >> This is the fastq parser from 1.5.2 or thereabouts, which had a bug >> (the >> $/ definition just above this code) that prevented it from parsing a >> record with a quality line starting with "@". This was probably not >> recognised as a bug for a long time due to the enduring myth that >> fastq >> quality lines always start with "!". > > I see you talked about it in the discussion page, but I think it > might be time to change the wiki page as well: > http://www.bioperl.org/wiki/FASTQ_sequence_format > > That caught me out as well. *sigh* Updated, along with links to the MAQ FASTQ page and Wikipedia. I'll update the module docs as well. 
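For readers following the thread, the point about quality lines beginning with "@" can be illustrated with a minimal sketch. This is not the code in Bio::SeqIO::fastq; read_fastq_record is a made-up helper, it assumes strict four-line records (no wrapped sequence or quality lines), and the record is the visible, truncated part of the one quoted above. Because the reader consumes lines by position rather than splitting the input on "@", a leading "@" in the quality string is harmless.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not BioPerl's parser): read one four-line FASTQ
# record from a filehandle. Lines are taken by position, so a quality
# string that starts with '@' is never mistaken for a header line.
sub read_fastq_record {
    my ($fh) = @_;
    my $header = <$fh>;
    return unless defined $header;          # end of input
    my $seq  = <$fh>;
    my $plus = <$fh>;
    my $qual = <$fh>;
    die "Truncated FASTQ record\n" unless defined $qual;
    chomp($header, $seq, $plus, $qual);
    die "Bad header line: $header\n"    unless $header =~ /^@/;
    die "Bad separator line: $plus\n"   unless $plus   =~ /^\+/;
    die "Sequence and quality lengths differ\n"
        unless length($seq) == length($qual);
    return { id => substr($header, 1), seq => $seq, qual => $qual };
}

# Shortened example: only the part of the record shown in this thread.
my $fastq = <<'END';
@HWI-EAS146:3:1:2:177#0/1
CTCCGCTNNCTTCTCAG
+
@,AB=>-&&:5).;+*=
END

open my $fh, '<', \$fastq or die "Cannot open in-memory file: $!";
while (my $rec = read_fastq_record($fh)) {
    print "$rec->{id}\t", length($rec->{seq}), " bases\n";
}
close $fh;
```

Note that the "+" line is accepted whether or not it repeats the sequence id, which matches how most FASTQ files are written in practice.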
chris From KBriedis at accelrys.com Tue May 12 14:42:33 2009 From: KBriedis at accelrys.com (Kristine Briedis) Date: Tue, 12 May 2009 14:42:33 -0400 Subject: [Bioperl-l] Eutilities epost/efetch problem In-Reply-To: <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu> References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu> Message-ID: Hi Chris, I'm not getting the error anymore. NCBI must have fixed something. Cheers, Kristine -----Original Message----- From: Chris Fields [mailto:cjfields at illinois.edu] Sent: Tuesday, May 12, 2009 11:36 AM To: Kristine Briedis Cc: Warren Gallin; BioPerl List Subject: Re: [Bioperl-l] Eutilities epost/efetch problem Not showing up in tests, so this may be something very specific that changed. I'll try to reproduce it. chris On May 12, 2009, at 12:19 PM, Kristine Briedis wrote: > Hi Warren, > > We've noticed the same EFetch error. I emailed NCBI and will let > you know what they say. > > Cheers, > Kristine > > > =============================== > Kristine Briedis, Ph.D. > Bioinformatics Software Engineer > Accelrys, Inc. 
> 10188 Telesis Court, Suite 100 > San Diego, CA 92121 USA > kbriedis at accelrys.com > > > > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org > ] On Behalf Of Warren Gallin > Sent: Monday, May 11, 2009 6:36 PM > To: BioPerl List > Subject: [Bioperl-l] Eutilities epost/efetch problem > > Hi folks, > > Something started failing for me this morning that had been working > reliably for the last week, > > I post an array of gi numbers, a history is successfully returned, > but when I try to use efetch to get the records, it fails with the > error: > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Response Error > Not Found > STACK: Error::throw > STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm:368 > STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/Bio/ > DB/GenericWebAgent.pm:215 > STACK: 090507_Stable_gb_update.pl:238 > ----------------------------------------------------------- > > > I'm running the efetch inside an eval and letting it try a total of 6 > times with a 5 second sleep in between, but the error is consistent. > > So I consider two possibilities: > 1) Has something changed on the Entrez server recently? Has anyone > else started having this kind of problem? > > 2) Have I inserted some subtle flaw into my code that would lead to a > failure of efetch. > > I am attaching two text files, one with the code chunklet that is > doing this and the other the output from the script. > > Any help or suggestions are profoundly appreciated.
>
> Warren Gallin
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From cjfields at illinois.edu Tue May 12 14:57:32 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 12 May 2009 13:57:32 -0500
Subject: [Bioperl-l] Eutilities epost/efetch problem
In-Reply-To:
References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu>
Message-ID:

Same here (no error). Just ran the below.

chris

#!/usr/bin/perl -w

use strict;
use warnings;
use Bio::DB::EUtilities;

my @gi_number = qw(
    41395563
    31618162
    81831839
    54038971
);

my $gpeptfactory = Bio::DB::EUtilities->new(
    -eutil          => 'epost',
    -db             => 'protein',
    -rettype        => 'gp',
    -retmode        => 'text',
    -tool           => 'VKCDB_Update',
    -email          => 'wgallin at ualberta.ca',
    -id             => \@gi_number,
    -keep_histories => 1);

my $hist = $gpeptfactory->next_cookie || die "Arghh!";

$gpeptfactory->set_parameters(-eutil   => 'efetch',
                              -history => $hist);

$gpeptfactory->get_Response(-file => '>test.gb');

On May 12, 2009, at 1:42 PM, Kristine Briedis wrote:

> Hi Chris,
>
> I'm not getting the error anymore. NCBI must have fixed something.
>
> Cheers,
> Kristine
>
>
> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Tuesday, May 12, 2009 11:36 AM
> To: Kristine Briedis
> Cc: Warren Gallin; BioPerl List
> Subject: Re: [Bioperl-l] Eutilities epost/efetch problem
>
> Not showing up in tests, so this may be something very specific that
> changed. I'll try to reproduce it.
>
> chris
>
> On May 12, 2009, at 12:19 PM, Kristine Briedis wrote:
>
>> Hi Warren,
>>
>> We've noticed the same EFetch error. I emailed NCBI and will let
>> you know what they say.
>>
>> Cheers,
>> Kristine
>>
>>
>> ===============================
>> Kristine Briedis, Ph.D.
>> Bioinformatics Software Engineer
>> Accelrys, Inc.
>> 10188 Telesis Court, Suite 100 >> San Diego, CA 92121 USA >> kbriedis at accelrys.com >> >> >> >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org >> ] On Behalf Of Warren Gallin >> Sent: Monday, May 11, 2009 6:36 PM >> To: BioPerl List >> Subject: [Bioperl-l] Eutilities epost/efetch problem >> >> Hi folks, >> >> Something started failing for me this morning that had been working >> reliably for the last week, >> >> I post an array of gi numbers, a history is successfully returned, >> but when I try to use efetch to get the records, it fails with the >> error: >> >> >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Response Error >> Not Found >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm: >> 368 >> STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/ >> Bio/ >> DB/GenericWebAgent.pm:215 >> STACK: 090507_Stable_gb_update.pl:238 >> ----------------------------------------------------------- >> >> >> I'm running the efetch inside an eval and letting it try a total >> of 6 >> times with a 5 second sleep in between, but the error is consistent. >> >> So I consider two possibilities: >> 1) Has something changed on the Entrez server recently? Has anyone >> else started having this kind of problem? >> >> 2) Have I inserted some subtle flaw into my code that would lead >> to a >> failure of efetch. >> >> I am attaching two text files, one with the code chunklet that is >> doing this and the other the output from the script. >> >> Any help or suggestions are profoundly appreciated.
>> >> Warren Gallin >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rmb32 at cornell.edu Tue May 12 15:19:03 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Tue, 12 May 2009 12:19:03 -0700 Subject: [Bioperl-l] Eutilities epost/efetch problem In-Reply-To: References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu> Message-ID: <4A09CBA7.7010206@cornell.edu> I don't think this is terribly unusual to have Efetch go down. I have an automated pipeline that uses efetch to cross-check some stuff, and it goes down every once in a while, sometimes for up to a day or so. Might consider having a little nicer error message for this case? Rob -- Robert Buels Bioinformatics Analyst, Sol Genomics Network Boyce Thompson Institute for Plant Research Tower Rd Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu Chris Fields wrote: > Same here (no error). Just ran the below. > > chris > > #!/usr/bin/perl -w > > use strict; > use warnings; > use Bio::DB::EUtilities; > > my @gi_number = qw( > 41395563 > 31618162 > 81831839 > 54038971 > ); > > my $gpeptfactory = Bio::DB::EUtilities->new( > -eutil => 'epost', > -db => 'protein', > -rettype => 'gp', > -retmode => 'text', > -tool => 'VKCDB_Update', > -email => 'wgallin at ualberta.ca', > -id => \@gi_number, > -keep_histories => 1); > > my $hist = $gpeptfactory->next_cookie || die "Arghh!"; > > $gpeptfactory->set_parameters(-eutil => 'efetch', > -history => $hist); > > $gpeptfactory->get_Response(-file => '>test.gb'); > > On May 12, 2009, at 1:42 PM, Kristine Briedis wrote: > >> Hi Chris, >> >> I'm not getting the error anymore. NCBI must have fixed something. 
>> >> Cheers, >> Kristine >> >> >> -----Original Message----- >> From: Chris Fields [mailto:cjfields at illinois.edu] >> Sent: Tuesday, May 12, 2009 11:36 AM >> To: Kristine Briedis >> Cc: Warren Gallin; BioPerl List >> Subject: Re: [Bioperl-l] Eutilities epost/efetch problem >> >> Not showing up in tests, so this may be something very specific that >> changed. I'll try to reproduce it. >> >> chris >> >> On May 12, 2009, at 12:19 PM, Kristine Briedis wrote: >> >>> Hi Warren, >>> >>> We've noticed the same EFetch error. I emailed NCBI and will let >>> you know what they say. >>> >>> Cheers, >>> Kristine >>> >>> >>> =============================== >>> Kristine Briedis, Ph.D. >>> Bioinformatics Software Engineer >>> Accelrys, Inc. >>> 10188 Telesis Court, Suite 100 >>> San Diego, CA 92121 USA >>> kbriedis at accelrys.com >>> >>> >>> >>> -----Original Message----- >>> From: bioperl-l-bounces at lists.open-bio.org >>> [mailto:bioperl-l-bounces at lists.open-bio.org >>> ] On Behalf Of Warren Gallin >>> Sent: Monday, May 11, 2009 6:36 PM >>> To: BioPerl List >>> Subject: [Bioperl-l] Eutilities epost/efetch problem >>> >>> Hi folks, >>> >>> Something started failing for me this morning that had been working >>> reliably for the last week, >>> >>> I post an array of gi numbers, a history is successfully returned, >>> but when I try to use efetch to get the records, it fails with the >>> error: >>> >>> >>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>> MSG: Response Error >>> Not Found >>> STACK: Error::throw >>> STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/Root.pm:368 >>> STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/Bio/ >>> DB/GenericWebAgent.pm:215 >>> STACK: 090507_Stable_gb_update.pl:238 >>> ----------------------------------------------------------- >>> >>> >>> I'm running the efetch inside an eval and letting it try a total >>> of 6 >>> times with a 5 second sleep in between, but the error is consistent.
>>> >>> So I consider two possibilities: >>> 1) Has something changed on the Entrez server recently? Has anyone >>> else started having this kind of problem? >>> >>> 2) Have I inserted some subtle flaw into my code that would lead >>> to a >>> failure of efetch. >>> >>> I am attaching two text files, one with the code chunklet that is >>> doing this and the other the output from the script. >>> >>> Any help or suggestions are profoundly appreciated. >>> >>> Warren Gallin >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Tue May 12 15:36:00 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 14:36:00 -0500 Subject: [Bioperl-l] Eutilities epost/efetch problem In-Reply-To: <4A09CBA7.7010206@cornell.edu> References: <2977DDC5-D26F-4643-AA4C-5A11EC323B94@ualberta.ca> <2F9A2E1F-877B-4571-B10C-4F89FB22488D@illinois.edu> <4A09CBA7.7010206@cornell.edu> Message-ID: Rob, The error message is generated on their side or from LWP; from GenericWebAgent: if ($response->is_error) { $self->throw("Response Error\n".$response->message); } We could change that, but I try to leave it as generic as possible. chris On May 12, 2009, at 2:19 PM, Robert Buels wrote: > I don't think this is terribly unusual to have Efetch go down. I > have an automated pipeline that uses efetch to cross-check some > stuff, and it goes down every once in a while, sometimes for up to a > day or so. > > Might consider having a little nicer error message for this case? 
> > Rob > > -- > Robert Buels > Bioinformatics Analyst, Sol Genomics Network > Boyce Thompson Institute for Plant Research > Tower Rd > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > > > Chris Fields wrote: >> Same here (no error). Just ran the below. >> >> chris >> >> #!/usr/bin/perl -w >> >> use strict; >> use warnings; >> use Bio::DB::EUtilities; >> >> my @gi_number = qw( >> 41395563 >> 31618162 >> 81831839 >> 54038971 >> ); >> >> my $gpeptfactory = Bio::DB::EUtilities->new( >> -eutil => 'epost', >> -db => 'protein', >> -rettype => 'gp', >> -retmode => 'text', >> -tool => 'VKCDB_Update', >> -email => 'wgallin at ualberta.ca', >> -id => \@gi_number, >> -keep_histories => 1); >> >> my $hist = $gpeptfactory->next_cookie || die "Arghh!"; >> >> $gpeptfactory->set_parameters(-eutil => 'efetch', >> -history => $hist); >> >> $gpeptfactory->get_Response(-file => '>test.gb'); >> >> On May 12, 2009, at 1:42 PM, Kristine Briedis wrote: >> >>> Hi Chris, >>> >>> I'm not getting the error anymore. NCBI must have fixed something. >>> >>> Cheers, >>> Kristine >>> >>> >>> -----Original Message----- >>> From: Chris Fields [mailto:cjfields at illinois.edu] >>> Sent: Tuesday, May 12, 2009 11:36 AM >>> To: Kristine Briedis >>> Cc: Warren Gallin; BioPerl List >>> Subject: Re: [Bioperl-l] Eutilities epost/efetch problem >>> >>> Not showing up in tests, so this may be something very specific that >>> changed. I'll try to reproduce it. >>> >>> chris >>> >>> On May 12, 2009, at 12:19 PM, Kristine Briedis wrote: >>> >>>> Hi Warren, >>>> >>>> We've noticed the same EFetch error. I emailed NCBI and will let >>>> you know what they say. >>>> >>>> Cheers, >>>> Kristine >>>> >>>> >>>> =============================== >>>> Kristine Briedis, Ph.D. >>>> Bioinformatics Software Engineer >>>> Accelrys, Inc. 
>>>> 10188 Telesis Court, Suite 100 >>>> San Diego, CA 92121 USA >>>> kbriedis at accelrys.com >>>> >>>> >>>> >>>> -----Original Message----- >>>> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org >>>> ] On Behalf Of Warren Gallin >>>> Sent: Monday, May 11, 2009 6:36 PM >>>> To: BioPerl List >>>> Subject: [Bioperl-l] Eutilities epost/efetch problem >>>> >>>> Hi folks, >>>> >>>> Something started failing for me this morning that had been >>>> working >>>> reliably for the last week, >>>> >>>> I post an array of gi numbers, a history is successfully >>>> returned, >>>> but when I try to use efetch to get the records, it fails with the >>>> error: >>>> >>>> >>>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>>> MSG: Response Error >>>> Not Found >>>> STACK: Error::throw >>>> STACK: Bio::Root::Root::throw /Library/Perl/5.8.8/Bio/Root/ >>>> Root.pm:368 >>>> STACK: Bio::DB::GenericWebAgent::get_Response /Library/Perl/5.8.8/ >>>> Bio/ >>>> DB/GenericWebAgent.pm:215 >>>> STACK: 090507_Stable_gb_update.pl:238 >>>> ----------------------------------------------------------- >>>> >>>> >>>> I'm running the efetch inside an eval and letting it try a >>>> total of 6 >>>> times with a 5 second sleep in between, but the error is >>>> consistent. >>>> >>>> So I consider two possibilities: >>>> 1) Has something changed on the Entrez server recently? Has >>>> anyone >>>> else started having this kind of problem? >>>> >>>> 2) Have I inserted some subtle flaw into my code that would >>>> lead to a >>>> failure of efetch. >>>> >>>> I am attaching two text files, one with the code chunklet that >>>> is >>>> doing this and the other the output from the script. >>>> >>>> Any help or suggestions are profoundly appreciated.
>>>> >>>> Warren Gallin >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bosborne11 at verizon.net Tue May 12 14:50:40 2009 From: bosborne11 at verizon.net (Brian Osborne) Date: Tue, 12 May 2009 14:50:40 -0400 Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem with Module::Builder versions In-Reply-To: <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> References: <4A03986D.7080007@gmail.com> <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> Message-ID: <2A93D0F6-3A2A-4638-A095-26CFC595E8A1@verizon.net> Neil, Try setting the environmental variable PERL5LIB: PERL5LIB A colon-separated list of directories in which to look for Perl library files before looking in the standard library and the current directory. If PERL5LIB is not defined, PERLLIB is used. When running taint checks (because the script was running setuid or setgid, or the - T switch was used), neither variable is used. The script should instead say use lib "/my/directory"; Brian O. On May 12, 2009, at 1:36 PM, Dave Clements, GMOD Help Desk wrote: > Hi Neil, > > I'm cross-posting your question to the BioPerl list as 1) it is more > of a perl question than a GBrowse question, and 2) I don't know the > answer. > > Dave C. > GMOD Help Desk > > Was this helpful? 
Let us know at http://gmod.org/wiki/Help_Desk_Feedback > > Learn more about GMOD at SMBE & Arthropod Genomics: > http://ccg.biology.uiowa.edu/smbe/symposia.php?action=view&sym_ID=27 > http://www.k-state.edu/agc/symp2009/seminar.html > > > > On Thu, May 7, 2009 at 7:26 PM, Neil Saunders > wrote: >> I'm trying to install the latest Gbrowse (1.99) on a machine where >> I do >> not have root access (Ubuntu/dapper). >> >> I have set up non-root CPAN and installed all of the prerequisites, >> no >> problems, in ~/lib/perl5. However, when I try to install Gbrowse >> either >> via CPAN or using the latest CVS Build script, I run into this >> problem: >> >> Global symbol "$VAR1" requires explicit package name at (eval 28) >> line >> 1088, line 1. >> ...propagated at /usr/local/share/perl/5.8.7/Module/Build/ >> Base.pm line >> 1002, line 1. >> make: *** [all] Error 255 >> LDS/GBrowse-1.99.tar.gz >> /usr/bin/make -- NOT OK >> >> >> It seems that there are 2 versions of Module::Builder on the >> machine. I >> have installed a version from CPAN which is found in >> ~/lib/perl5/site_perl/Module/. However, from the above error it >> looks >> as though the install is trying to use a system-wide version of >> Module::Build in /usr/local/share/perl/5.8.7. >> >> Can anyone shed any light on either the error message, or a way to >> force >> usage of my $HOME module, not the system one? >> >> thanks, >> Neil Saunders >> -- >> Statistical Bioinformatics - Health >> CSIRO Mathematical and Information Sciences >> Locked Bag 17, North Ryde, NSW 1670, Australia >> >> http://friendfeed.com/neilfws >> >> ------------------------------------------------------------------------------ >> The NEW KODAK i700 Series Scanners deliver under ANY circumstances! >> Your >> production scanning environment may not be a perfect world - but >> thanks to >> Kodak, there's a perfect scanner to get the job done! 
With the NEW >> KODAK i700 >> Series Scanner you'll get full speed at 300 dpi even with all image >> processing features enabled. http://p.sf.net/sfu/kodak-com >> _______________________________________________ >> Gmod-gbrowse mailing list >> Gmod-gbrowse at lists.sourceforge.net >> https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dan.bolser at gmail.com Tue May 12 16:27:40 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Tue, 12 May 2009 21:27:40 +0100 Subject: [Bioperl-l] The Power of R (Chris Fields) In-Reply-To: <1F1240778FB0AF46B4E5A72C44D2C7472A29BE7E@exch1-hi.accelrys.net> References: <8F435B66-33CF-467D-8D86-AA8EF2309E98@lsi.upc.edu> <1F1240778FB0AF46B4E5A72C44D2C7472A29BE7E@exch1-hi.accelrys.net> Message-ID: <2c8757af0905121327i681b13c3q4892ea9751c4adad@mail.gmail.com> 2009/5/8 Scott Markel : > Gabriel, > > A quick personal comment - Thank you for referencing the "Using > BioPerl" book that Jason Stajich, Ewan Birney, and I are writing. > Now we'll have to finish it. :) Please hurry! ;-) > Scott > From maj at fortinbras.us Tue May 12 16:06:56 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Tue, 12 May 2009 16:06:56 -0400 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?) In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> Message-ID: <88072CDBECA446D0A7C48E74237F59FE@NewLife> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on Dan's data, but something deeper may be going on. 
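The guard that the patch below applies at each addition can be shown in isolation. This is only a minimal sketch: the helper name `add_counts` and the `%contig` hash are hypothetical and are not part of SearchUtils.pm. It shows how `|| 0` turns an unset or empty field into a number before it is added, which is what silences the "isn't numeric" warning.

```perl
use strict;
use warnings;

# Hypothetical helper: add two values, treating undef or '' as zero.
# This mirrors the "|| 0" guard in the patch; it is not BioPerl code.
sub add_counts {
    my ($total, $value) = @_;
    return ($total || 0) + ($value || 0);
}

# Illustrative contig record with unset fields (assumed shape, for the demo only).
my %contig = (iden => '', cons => undef);
$contig{iden} = add_counts($contig{iden}, 12);
$contig{cons} = add_counts($contig{cons}, '');
print "iden=$contig{iden} cons=$contig{cons}\n";
```

The guard only hides the symptom. If a parser leaves these fields empty, the totals will still be zero rather than correct.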
Also get many of the following warnings, haven't looked at it closely: --------------------- WARNING --------------------- MSG: Removing score value(s) --------------------------------------------------- PATCH: Index: SearchUtils.pm =================================================================== --- SearchUtils.pm (revision 15674) +++ SearchUtils.pm (working copy) @@ -252,8 +252,8 @@ } $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_->{'stop'} - $_->{'start'} + 1; - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'}; - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'}; + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_->{'iden'} || 0; + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_->{'cons'} || 0; $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; } @@ -407,9 +407,12 @@ }; if($@) { warn "\a\n$@\n"; } else { + # make sure numerical + $_->{'iden'} ||= 0; + $_->{'cons'} ||= 0; $_->{'start'} = $start; # Assign a new start coordinate to the contig - $_->{'iden'} += $numID; # and add new data to #identical, #conserved. - $_->{'cons'} += $numCons; + $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved. + $_->{'cons'} += ($numCons||0); push(@{$_->{hsps}}, $hsp); $overlap = 1; } @@ -424,9 +427,13 @@ }; if($@) { warn "\a\n$@\n"; } else { + # make sure numerical + $_->{'iden'} ||= 0; + $_->{'cons'} ||= 0; + $_->{'stop'} = $stop; # Assign a new stop coordinate to the contig - $_->{'iden'} += $numID; # and add new data to #identical, #conserved. - $_->{'cons'} += $numCons; + $_->{'iden'} += ($numID||0); # and add new data to #identical, #conserved. 
+ $_->{'cons'} += ($numCons||0); push(@{$_->{hsps}}, $hsp); $overlap = 1; } @@ -461,8 +468,8 @@ }; if($@) { warn "\a\n$@\n"; } else { - $ids += $these_ids; - $cons += $these_cons; + $ids += ($these_ids||0); + $cons += ($these_cons||0); } last if $hsp_start == $u_start; @@ -490,8 +497,8 @@ }; if($@) { warn "\a\n$@\n"; } else { - $ids += $these_ids; - $cons += $these_cons; + $ids += ($these_ids||0); + $cons += ($these_cons||0); } last if $hsp_end == $u_stop; ----- Original Message ----- From: "Chris Fields" To: "Mark A. Jensen" Cc: "BioPerl List" ; "Dan Bolser" Sent: Tuesday, May 12, 2009 10:04 AM Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?) > More complicated than that, I'm afraid. We should try to fix that at the > source of the problem. > > This appears to stem from SearchUtils HSP tiling, which in turn utilizes > HSPI::matches(), which in turn checks num_identical/ num_conserved. My guess > is, since this is blasttable format, one of these isn't set and thus is > returning the wrong value. I'll attempt to track it down today, but it may > take some time. > > chris > > On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote: > >> This sounds like a >> >> $sum = eval join( '+', @a); >> >> problem, which can be fixed with >> >> $sum = eval join('+', map { $_ || () } @a) ; >> >> MAJ >> ----- Original Message ----- From: "Dan Bolser" >> To: "Chris Fields" >> Cc: "BioPerl List" >> Sent: Tuesday, May 12, 2009 9:17 AM >> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' >> fromSearchIO?) >> >> >> 2009/5/12 Chris Fields : >>> Fixed that in svn. We're all still learning the ropes... >> >> In that case, I'm seeing multiple instances of... 
>> >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 256 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 412 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 429 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 465 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 473 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 494 >> Argument "" isn't numeric in addition (+) at Bio/Search/ SearchUtils.pm line >> 502 >> >> >> Hmm... I was about to go on to complain about the weird GFF that I was >> seeing, but suddenly it looks OK. My bioperl install must think your >> standing over my shoulder and is therefore behaving itself! >> >> >> Thanks again for all the help, >> Dan. >> >> >> >> >>> chris >>> >>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >>> >>>> 2009/5/12 Dan Bolser : >>>>> >>>>> Unfortunately bp_search2gff.pl is giving me errors: >>>>> >>>>> bp_search2gff.pl --version 3 -i BlastResults/blast_table_filtered -f >>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit >>>>> --match --target --component >>>>> >>>>> --------------------- WARNING --------------------- >>>>> MSG: Removing score value(s) >>>>> --------------------------------------------------- >>>>> Can't locate object method "remove_tags" via package >>>>> "Bio::SeqFeature::Similarity" at >>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/Generic.pm line >>>>> 393, line 5. >>>> >>>> >>>> I'm just learning the ropes... 
>>>> >>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >>>> 15:25:55.000000000 +0100 >>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >>>> 11:52:41.000000000 +0100 >>>> @@ -390,7 +390,7 @@ >>>> } >>>> if ($self->has_tag('score')) { >>>> $self->warn("Removing score value(s)"); >>>> - $self->remove_tags('score'); >>>> + $self->remove_tag('score'); >>>> } >>>> $self->add_tag_value('score',$value); >>>> } >>>> >>>> >>>> >>>> >>>> >>>>> Anyone seen this before? >>>>> >>>>> Cheers, >>>>> Dan. >>>>> >>>>> >>>>> >>>>> 2009/5/12 Dan Bolser : >>>>>> >>>>>> Thanks for the info guys, I think I was naively hoping that the >>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>>>>> >>>>>> I think I understand the problem better now, so I'll try to summarise: >>>>>> >>>>>> There is no standard way to encode a HSP as a feature (not least >>>>>> because there are two choices about which sequence (query or the hit) >>>>>> it should be attached to). BioPerl will try, but the result will not >>>>>> be "well structured" SeqFeatures or "well formed" GFF. >>>>>> >>>>>> >>>>>> From what I read I guess it should be possible to standardize this >>>>>> mapping (based on something in one of the examples or the 'search2gff' >>>>>> script), assuming you specify weather you want features put on the >>>>>> query or on the hit. >>>>>> >>>>>> At some point last year I was trying out the bp_search2gff.pl and my >>>>>> own code to write a GFF file for loading and viewing by Gbrowse. At >>>>>> that time I gave up, as nothing seemed to be working. I was hoping >>>>>> that doing this at a lower level (i.e. never writing any GFF myself) >>>>>> it would stand a better chance of working. >>>>>> >>>>>> Also I was thinking that Gbrowse, if given a SeqFeature::Store, could >>>>>> autoconfigure its interface to some degree. I guess its back to the >>>>>> docs ;-) >>>>>> >>>>>> >>>>>> >>>>>> I'll keep trying and see if I can get anywhere. 
>>>>>> >>>>>> Thanks again, >>>>>> Dan. >>>>>> >>>>>> >>>>>> >>>>>> References for the above: >>>>>> >>>>>> 2009/5/11 Jason Stajich : >>>>>> >>>>>>> otherwise you need to be converting the HSPs into seqfeatures with the >>>>>>> right associated information (i.e. the tag/value pairs that are in the >>>>>>> 9th >>>>>>> column) in order to have well structured data in the database. >>>>>> >>>>>>> You can get the individual features from the feature pair with >>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF writer (or >>>>>>> call >>>>>>> $hsp->hit->gff_string). Note that since the data storage is not >>>>>>> structured >>>>>>> in a GFF3 like-way this won't immediately produce well formed GFF3 for >>>>>>> the >>>>>>> 9th column. >>>>>> >>>>>> >>>>>> 2009/5/11 Chris Fields : >>>>>> >>>>>>> The main problem is the mapping is subjective based on what your >>>>>>> reference sequence is within the BLAST run (e.g. whether it is the >>>>>>> query or >>>>>>> the hit), and is something that can't be automatically discerned. 
I >>>>>>> ended >>>>>>> up rolling my own with SeqFeature::Store (just mapped the relevant data >>>>>>> to >>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the relevant >>>>>>> scripts >>>>>>> to integrate my changes in, just haven't had the time >>>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From Russell.Smithies at agresearch.co.nz Tue May 12 17:27:57 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Wed, 13 May 2009 09:27:57 +1200 Subject: [Bioperl-l] alignable portion of a genome In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> References: <23480025.post@talk.nabble.com> <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493DA54D2@exchsth.agresearch.co.nz> Adding the mutations is a little hacky (and probably slow) but I think it works correctly. 
The stats should work out OK but it's too early and I haven't had a coffee yet so can't be sure :-)

--Russell

============================
#!perl -w

my $seq = "atcgacgatcgaacgatcga";
my $debug = 0;


foreach ($seq =~ /(?=(\w{5}))/g){
  $h++;
  # add all the exact words to the hash
  $hash{$_}++;
  print "$_\n" if $debug;
  # mutate words and add to hash
  my @rr = mutate($_);
  foreach (@rr){
    print "$_\n" if $debug;
    $h++;
    $hash{$_}++;
  }
}


# print out the hash counts & stats
foreach (keys %hash){
  print "$_\t$hash{$_}\n" if $debug;
  $singles++ if($hash{$_} eq 1);
}
print $singles/$h,"\n";


sub mutate{
  my @array = split '',shift;
  my @res = ();
  my $rep = 'X';
  for(my$i = 0; $i <= $#array; $i++){
    my $old1 = $array[$i];
    splice @array, $i, 1, $rep;
    push @res, (join '', @array);
    for(my$j = $i+1; $j <= $#array; $j++){
      my $old2 = $array[$j];
      splice @array, $j, 1, $rep;
      push @res, (join '', @array);
      splice @array, $j, 1, $old2;
    }
    splice @array, $i, 1, $old1;
  }
  return @res;
}

================================

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> Sent: Tuesday, 12 May 2009 3:56 p.m.
> To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> Subject: Re: [Bioperl-l] alignable portion of a genome
>
> Perfect matches is easy:
>
> $seq = "atcgacgatcgaacgatcga";
>
> foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> print $singles/$h;
>
> Could probably be done with map as well.
> Counting the miss-matches might take a bit more thinking....
> Any ideas MAJ?
>
> --Russell
>
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of fadista
> > Sent: Monday, 11 May 2009 9:32 p.m.
> > To: Bioperl-l at lists.open-bio.org > > Subject: [Bioperl-l] alignable portion of a genome > > > > > > Hi, > > > > I would like to know of a good and fast way that could help me calculate the > > alignable portion of a genome (not human), given a reference sequence. > > When I say alignable portion I mean that I want to know all the positions of > > the genome that can be covered uniquely by reads of 36 bp and up to 2 > > mismatches. > > > > Some have advised me to work with Perl using the following strategy but I am > > not a Perl user so if someone has already a script for this function, it > > would be nice: > > > > "you could approach it by walking along the genome in a sliding window of > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter. > > Then count how many of the 36 nt sequences had a frequency of exactly > > one. Divide this by the total number of 36nt windows visited. This > > should be do-able in about 20 lines of Perl." > > > > > > Best regards and thanks in advance > > > > -- > > View this message in context: http://www.nabble.com/alignable-portion-of-a- > > genome-tp23480025p23480025.html > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > ======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. 
If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From jason at bioperl.org Tue May 12 19:04:35 2009 From: jason at bioperl.org (Jason Stajich) Date: Tue, 12 May 2009 16:04:35 -0700 Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem with Module::Builder versions In-Reply-To: <4A09FBE2.8040000@gmail.com> References: <4A03986D.7080007@gmail.com> <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> <2A93D0F6-3A2A-4638-A095-26CFC595E8A1@verizon.net> <4A09FBE2.8040000@gmail.com> Message-ID: So this doesn't work? ./Build --install_base ~/ install On May 12, 2009, at 3:44 PM, Neil Saunders wrote: >> Try setting the environmental variable PERL5LIB: > > Thanks for the tip - however, PERL5LIB is set (to ~/lib/perl5). > > The Module::Build docs state that Config.pm is used by Module::Build. > So far as I can tell, the initial Build is using the system-wide perl > installation (/usr/local, /etc/CPAN) and the 'Build install' to $HOME > uses my personal Module::Build. Problems arise because these are > different versions (0.28 v 0.32). > > I assume that I can edit the Build scripts in some way to use only my > personal installation - will keep working on this. > > Neil > -- > Statistical Bioinformatics - Health > CSIRO Mathematical and Information Sciences > Locked Bag 17, North Ryde, NSW 1670, Australia > > http://friendfeed.com/neilfws > > ------------------------------------------------------------------------------ > The NEW KODAK i700 Series Scanners deliver under ANY circumstances! > Your > production scanning environment may not be a perfect world - but > thanks to > Kodak, there's a perfect scanner to get the job done! 
With the NEW > KODAK i700 > Series Scanner you'll get full speed at 300 dpi even with all image > processing features enabled. http://p.sf.net/sfu/kodak-com > _______________________________________________ > Gmod-gbrowse mailing list > Gmod-gbrowse at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse Jason Stajich jason at bioperl.org From jason at bioperl.org Tue May 12 19:07:21 2009 From: jason at bioperl.org (Jason Stajich) Date: Tue, 12 May 2009 16:07:21 -0700 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?) In-Reply-To: <88072CDBECA446D0A7C48E74237F59FE@NewLife> References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> <88072CDBECA446D0A7C48E74237F59FE@NewLife> Message-ID: I really don't think tile_hsps should be used on BLAST data folks, it is a pretty blind approach. If you really want the right answer you need to do -links with WU- BLAST or FASTA. Been discussed a few times on the mailing list. Good to fix the code bug I guess to avoid the warnings, but unless you are going to walk through all the HSPs and extract the consistent paths wrt query I think you'll have loops, etc in there which will make Hit->percent_id non-accurate. -jason On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote: > Patch below (to SearchUtils.pm) fixes the non-numeric warnings on > Dan's data, but something deeper may be going on. 
> Also get many of the following warnings, haven't looked at it closely: > > --------------------- WARNING --------------------- > MSG: Removing score value(s) > --------------------------------------------------- > > PATCH: > > Index: SearchUtils.pm > =================================================================== > --- SearchUtils.pm (revision 15674) > +++ SearchUtils.pm (working copy) > @@ -252,8 +252,8 @@ > } > > $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- > >{'stop'} - $_->{'start'} + 1; > - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- > >{'iden'}; > - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- > >{'cons'}; > + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- > >{'iden'} || 0; > + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- > >{'cons'} || 0; > $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; > } > > @@ -407,9 +407,12 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > + # make sure numerical > + $_->{'iden'} ||= 0; > + $_->{'cons'} ||= 0; > $_->{'start'} = $start; # Assign a new start > coordinate to the contig > - $_->{'iden'} += $numID; # and add new data to > #identical, #conserved. > - $_->{'cons'} += $numCons; > + $_->{'iden'} += ($numID||0); # and add new data to > #identical, #conserved. > + $_->{'cons'} += ($numCons||0); > push(@{$_->{hsps}}, $hsp); > $overlap = 1; > } > @@ -424,9 +427,13 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > + # make sure numerical > + $_->{'iden'} ||= 0; > + $_->{'cons'} ||= 0; > + > $_->{'stop'} = $stop; # Assign a new stop coordinate > to the contig > - $_->{'iden'} += $numID; # and add new data to > #identical, #conserved. > - $_->{'cons'} += $numCons; > + $_->{'iden'} += ($numID||0); # and add new data to > #identical, #conserved. 
> + $_->{'cons'} += ($numCons||0); > push(@{$_->{hsps}}, $hsp); > $overlap = 1; > } > @@ -461,8 +468,8 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > - $ids += $these_ids; > - $cons += $these_cons; > + $ids += ($these_ids||0); > + $cons += ($these_cons||0); > } > > last if $hsp_start == $u_start; > @@ -490,8 +497,8 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > - $ids += $these_ids; > - $cons += $these_cons; > + $ids += ($these_ids||0); > + $cons += ($these_cons||0); > } > > last if $hsp_end == $u_stop; > > ----- Original Message ----- From: "Chris Fields" > > To: "Mark A. Jensen" > Cc: "BioPerl List" ; "Dan Bolser" > > Sent: Tuesday, May 12, 2009 10:04 AM > Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting > 'features'fromSearchIO?) > > >> More complicated than that, I'm afraid. We should try to fix that >> at the source of the problem. >> >> This appears to stem from SearchUtils HSP tiling, which in turn >> utilizes HSPI::matches(), which in turn checks num_identical/ >> num_conserved. My guess is, since this is blasttable format, one >> of these isn't set and thus is returning the wrong value. I'll >> attempt to track it down today, but it may take some time. >> >> chris >> >> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote: >> >>> This sounds like a >>> >>> $sum = eval join( '+', @a); >>> >>> problem, which can be fixed with >>> >>> $sum = eval join('+', map { $_ || () } @a) ; >>> >>> MAJ >>> ----- Original Message ----- From: "Dan Bolser" >> > >>> To: "Chris Fields" >>> Cc: "BioPerl List" >>> Sent: Tuesday, May 12, 2009 9:17 AM >>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' >>> fromSearchIO?) >>> >>> >>> 2009/5/12 Chris Fields : >>>> Fixed that in svn. We're all still learning the ropes... >>> >>> In that case, I'm seeing multiple instances of... 
>>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 256 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 412 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 429 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 465 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 473 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 494 >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 502 >>> >>> >>> Hmm... I was about to go on to complain about the weird GFF that I >>> was >>> seeing, but suddenly it looks OK. My bioperl install must think your >>> standing over my shoulder and is therefore behaving itself! >>> >>> >>> Thanks again for all the help, >>> Dan. >>> >>> >>> >>> >>>> chris >>>> >>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >>>> >>>>> 2009/5/12 Dan Bolser : >>>>>> >>>>>> Unfortunately bp_search2gff.pl is giving me errors: >>>>>> >>>>>> bp_search2gff.pl --version 3 -i BlastResults/ >>>>>> blast_table_filtered -f >>>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit >>>>>> --match --target --component >>>>>> >>>>>> --------------------- WARNING --------------------- >>>>>> MSG: Removing score value(s) >>>>>> --------------------------------------------------- >>>>>> Can't locate object method "remove_tags" via package >>>>>> "Bio::SeqFeature::Similarity" at >>>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ >>>>>> Generic.pm line >>>>>> 393, line 5. >>>>> >>>>> >>>>> I'm just learning the ropes... 
>>>>> >>>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >>>>> 15:25:55.000000000 +0100 >>>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >>>>> 11:52:41.000000000 +0100 >>>>> @@ -390,7 +390,7 @@ >>>>> } >>>>> if ($self->has_tag('score')) { >>>>> $self->warn("Removing score value(s)"); >>>>> - $self->remove_tags('score'); >>>>> + $self->remove_tag('score'); >>>>> } >>>>> $self->add_tag_value('score',$value); >>>>> } >>>>> >>>>> >>>>> >>>>> >>>>> >>>>>> Anyone seen this before? >>>>>> >>>>>> Cheers, >>>>>> Dan. >>>>>> >>>>>> >>>>>> >>>>>> 2009/5/12 Dan Bolser : >>>>>>> >>>>>>> Thanks for the info guys, I think I was naively hoping that the >>>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>>>>>> >>>>>>> I think I understand the problem better now, so I'll try to >>>>>>> summarise: >>>>>>> >>>>>>> There is no standard way to encode a HSP as a feature (not least >>>>>>> because there are two choices about which sequence (query or >>>>>>> the hit) >>>>>>> it should be attached to). BioPerl will try, but the result >>>>>>> will not >>>>>>> be "well structured" SeqFeatures or "well formed" GFF. >>>>>>> >>>>>>> >>>>>>> From what I read I guess it should be possible to standardize >>>>>>> this >>>>>>> mapping (based on something in one of the examples or the >>>>>>> 'search2gff' >>>>>>> script), assuming you specify weather you want features put on >>>>>>> the >>>>>>> query or on the hit. >>>>>>> >>>>>>> At some point last year I was trying out the bp_search2gff.pl >>>>>>> and my >>>>>>> own code to write a GFF file for loading and viewing by >>>>>>> Gbrowse. At >>>>>>> that time I gave up, as nothing seemed to be working. I was >>>>>>> hoping >>>>>>> that doing this at a lower level (i.e. never writing any GFF >>>>>>> myself) >>>>>>> it would stand a better chance of working. >>>>>>> >>>>>>> Also I was thinking that Gbrowse, if given a >>>>>>> SeqFeature::Store, could >>>>>>> autoconfigure its interface to some degree. 
I guess its back >>>>>>> to the >>>>>>> docs ;-) >>>>>>> >>>>>>> >>>>>>> >>>>>>> I'll keep trying and see if I can get anywhere. >>>>>>> >>>>>>> Thanks again, >>>>>>> Dan. >>>>>>> >>>>>>> >>>>>>> >>>>>>> References for the above: >>>>>>> >>>>>>> 2009/5/11 Jason Stajich : >>>>>>> >>>>>>>> otherwise you need to be converting the HSPs into >>>>>>>> seqfeatures with the >>>>>>>> right associated information (i.e. the tag/value pairs that >>>>>>>> are in the 9th >>>>>>>> column) in order to have well structured data in the database. >>>>>>> >>>>>>>> You can get the individual features from the feature pair with >>>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF >>>>>>>> writer (or call >>>>>>>> $hsp->hit->gff_string). Note that since the data storage is >>>>>>>> not structured >>>>>>>> in a GFF3 like-way this won't immediately produce well >>>>>>>> formed GFF3 for the >>>>>>>> 9th column. >>>>>>> >>>>>>> >>>>>>> 2009/5/11 Chris Fields : >>>>>>> >>>>>>>> The main problem is the mapping is subjective based on what >>>>>>>> your >>>>>>>> reference sequence is within the BLAST run (e.g. whether it >>>>>>>> is the query or >>>>>>>> the hit), and is something that can't be automatically >>>>>>>> discerned. 
I ended >>>>>>>> up rolling my own with SeqFeature::Store (just mapped the >>>>>>>> relevant data to >>>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the >>>>>>>> relevant scripts >>>>>>>> to integrate my changes in, just haven't had the time >>>>>>> >>>>>> >>>>> >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From j_martin at lbl.gov Tue May 12 19:34:08 2009 From: j_martin at lbl.gov (Joel Martin) Date: Tue, 12 May 2009 16:34:08 -0700 Subject: [Bioperl-l] alignable portion of a genome In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493DA54D2@exchsth.agresearch.co.nz> References: <23480025.post@talk.nabble.com> <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493DA54D2@exchsth.agresearch.co.nz> Message-ID: <20090512233408.GB17765@eniac.jgi-psf.org> Hello, Doing this with hashes ends up being a little inefficient for larger kmers like 35 in larger genomes. A suffix array tool like 'tallymer' will tell you the unique/non-unique kmer counts quickly, and as Aaron suggested generating fake reads based on a reference and introduce errors into them so you can evaluate how well they map back is a good strategy. maq has a command to do that built in. 
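For a small reference where memory is not a concern, the exact-match sliding-window count discussed in this thread can be sketched roughly as below. The function name `unique_kmer_fraction` is made up for illustration. The sketch counts forward-strand exact matches only, ignores mismatches, and keeps every k-mer in a hash, so it does not address the inefficiency noted above for large genomes.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of k-length windows whose sequence
# occurs exactly once in $seq (forward strand, exact matches only).
sub unique_kmer_fraction {
    my ($seq, $k) = @_;
    my $windows = length($seq) - $k + 1;
    return 0 if $windows < 1;          # sequence shorter than k

    # Count every overlapping k-mer.
    my %count;
    $count{ substr($seq, $_, $k) }++ for 0 .. $windows - 1;

    # A k-mer seen once corresponds to exactly one unique window.
    my $unique = grep { $_ == 1 } values %count;
    return $unique / $windows;
}

printf "%.4f\n", unique_kmer_fraction('atcgacgatcgaacgatcga', 5);
```

For real data the sequence would be read from a FASTA file and k set to the read length (36 in the original question).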
Joel

On Wed, May 13, 2009 at 09:27:57AM +1200, Smithies, Russell wrote:
> Adding the mutations is a little hacky (and probably slow) but I think it works correctly.
> The stats should work out OK but it's too early and I haven't had a coffee yet so can't be sure :-)
>
> --Russell
>
> ============================
> #!perl -w
>
> my $seq = "atcgacgatcgaacgatcga";
> my $debug = 0;
>
>
> foreach ($seq =~ /(?=(\w{5}))/g){
>   $h++;
>   # add all the exact words to the hash
>   $hash{$_}++;
>   print "$_\n" if $debug;
>   # mutate words and add to hash
>   my @rr = mutate($_);
>   foreach (@rr){
>     print "$_\n" if $debug;
>     $h++;
>     $hash{$_}++;
>   }
> }
>
>
> # print out the hash counts & stats
> foreach (keys %hash){
>   print "$_\t$hash{$_}\n" if $debug;
>   $singles++ if($hash{$_} eq 1);
> }
> print $singles/$h,"\n";
>
>
> sub mutate{
>   my @array = split '',shift;
>   my @res = ();
>   my $rep = 'X';
>   for(my$i = 0; $i <= $#array; $i++){
>     my $old1 = $array[$i];
>     splice @array, $i, 1, $rep;
>     push @res, (join '', @array);
>     for(my$j = $i+1; $j <= $#array; $j++){
>       my $old2 = $array[$j];
>       splice @array, $j, 1, $rep;
>       push @res, (join '', @array);
>       splice @array, $j, 1, $old2;
>     }
>     splice @array, $i, 1, $old1;
>   }
>   return @res;
> }
>
> ================================
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-
> > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > Sent: Tuesday, 12 May 2009 3:56 p.m.
> > To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> > Subject: Re: [Bioperl-l] alignable portion of a genome
> >
> > Perfect matches is easy:
> >
> > $seq = "atcgacgatcgaacgatcga";
> >
> > foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> > foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> > print $singles/$h;
> >
> > Could probably be done with map as well.
> > Counting the miss-matches might take a bit more thinking....
> > Any ideas MAJ?
> > > > --Russell > > > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of fadista > > > Sent: Monday, 11 May 2009 9:32 p.m. > > > To: Bioperl-l at lists.open-bio.org > > > Subject: [Bioperl-l] alignable portion of a genome > > > > > > > > > Hi, > > > > > > I would like to know of a good and fast way that could help me calculate the > > > alignable portion of a genome (not human), given a reference sequence. > > > When I say alignable portion I mean that I want to know all the positions of > > > the genome that can be covered uniquely by reads of 36 bp and up to 2 > > > mismatches. > > > > > > Some have advised me to work with Perl using the following strategy but I am > > > not a Perl user so if someone has already a script for this function, it > > > would be nice: > > > > > > "you could approach it by walking along the genome in a sliding window of > > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter. > > > Then count how many of the 36 nt sequences had a frequency of exactly > > > one. Divide this by the total number of 36nt windows visited. This > > > should be do-able in about 20 lines of Perl." > > > > > > > > > Best regards and thanks in advance > > > > > > -- > > > View this message in context: http://www.nabble.com/alignable-portion-of-a- > > > genome-tp23480025p23480025.html > > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
From maj at fortinbras.us Tue May 12 20:31:53 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Tue, 12 May 2009 20:31:53 -0400
Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem withModule::Builder versions
In-Reply-To: <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com>
References: <4A03986D.7080007@gmail.com> <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com>
Message-ID: <35C29690B62445D4AAE15050823A6514@NewLife>

Is this really an install problem? The error begins in Module::Build::Base in site_perl, no problem there. The error says $VAR1 has got scoping problems; that doesn't sound like a permissions problem.
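
For what it's worth, 'Global symbol "$VAR1" requires explicit package name' is the message Perl gives when a string eval runs Data::Dumper output of the default "$VAR1 = ...;" form under "use strict". The sketch below reproduces that outside Module::Build. It is only an assumption that the build is re-reading a dumped config this way, and the hash contents and path are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical settings, standing in for whatever a build tool might save.
my $config = { install_base => '/home/user/lib/perl5' };

# Default Dumper output starts with "$VAR1 = "; Terse output is a bare value.
my ( $default_dump, $terse_dump );
{
    local $Data::Dumper::Terse = 0;
    $default_dump = Dumper($config);
}
{
    local $Data::Dumper::Terse = 1;
    $terse_dump = Dumper($config);
}

# Under "use strict", $VAR1 is an undeclared global, so this eval fails.
my $from_default  = eval $default_dump;
my $default_error = $@;

# The terse form is just an anonymous hash expression and evals cleanly.
my $from_terse  = eval $terse_dump;
my $terse_error = $@;

print "default form: ", ( $default_error ? "failed: $default_error" : "ok\n" );
print "terse form:   ", ( $terse_error   ? "failed: $terse_error"   : "ok\n" );
```

If something like this is the cause, it would fit two different Module::Build versions writing and reading the same saved state, though that remains a guess. Separately, putting a private library directory ahead of the system ones in @INC (for example via the PERL5LIB environment variable) is the usual way to make Perl prefer a module installed under $HOME.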
----- Original Message ----- From: "Dave Clements, GMOD Help Desk" To: "Neil Saunders" ; "BioPerl List" Cc: Sent: Tuesday, May 12, 2009 1:36 PM Subject: Re: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem withModule::Builder versions Hi Neil, I'm cross-posting your question to the BioPerl list as 1) it is more of a perl question than a GBrowse question, and 2) I don't know the answer. Dave C. GMOD Help Desk Was this helpful? Let us know at http://gmod.org/wiki/Help_Desk_Feedback Learn more about GMOD at SMBE & Arthropod Genomics: http://ccg.biology.uiowa.edu/smbe/symposia.php?action=view&sym_ID=27 http://www.k-state.edu/agc/symp2009/seminar.html On Thu, May 7, 2009 at 7:26 PM, Neil Saunders wrote: > I'm trying to install the latest Gbrowse (1.99) on a machine where I do > not have root access (Ubuntu/dapper). > > I have set up non-root CPAN and installed all of the prerequisites, no > problems, in ~/lib/perl5. However, when I try to install Gbrowse either > via CPAN or using the latest CVS Build script, I run into this problem: > > Global symbol "$VAR1" requires explicit package name at (eval 28) line > 1088, line 1. > ...propagated at /usr/local/share/perl/5.8.7/Module/Build/Base.pm line > 1002, line 1. > make: *** [all] Error 255 > LDS/GBrowse-1.99.tar.gz > /usr/bin/make -- NOT OK > > > It seems that there are 2 versions of Module::Builder on the machine. I > have installed a version from CPAN which is found in > ~/lib/perl5/site_perl/Module/. However, from the above error it looks > as though the install is trying to use a system-wide version of > Module::Build in /usr/local/share/perl/5.8.7. > > Can anyone shed any light on either the error message, or a way to force > usage of my $HOME module, not the system one? 
> > thanks, > Neil Saunders > -- > Statistical Bioinformatics - Health > CSIRO Mathematical and Information Sciences > Locked Bag 17, North Ryde, NSW 1670, Australia > > http://friendfeed.com/neilfws > > ------------------------------------------------------------------------------ > The NEW KODAK i700 Series Scanners deliver under ANY circumstances! Your > production scanning environment may not be a perfect world - but thanks to > Kodak, there's a perfect scanner to get the job done! With the NEW KODAK i700 > Series Scanner you'll get full speed at 300 dpi even with all image > processing features enabled. http://p.sf.net/sfu/kodak-com > _______________________________________________ > Gmod-gbrowse mailing list > Gmod-gbrowse at lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/gmod-gbrowse > _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Tue May 12 20:46:39 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 12 May 2009 19:46:39 -0500 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?) In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> <88072CDBECA446D0A7C48E74237F59FE@NewLife> Message-ID: We should probably indicate this in the BLAST docs (and possibly deprecate using tile_hsps and its ilk in the long run). Worries me seeing modification of the score where none is apparent, so it may be worth tracking that down. chris On May 12, 2009, at 6:07 PM, Jason Stajich wrote: > I really don't think tile_hsps should be used on BLAST data folks, > it is a pretty blind approach. 
> If you really want the right answer you need to do -links with WU- > BLAST or FASTA. > Been discussed a few times on the mailing list. > > Good to fix the code bug I guess to avoid the warnings, but unless > you are going to walk through all the HSPs and extract the > consistent paths wrt query I think you'll have loops, etc in there > which will make Hit->percent_id non-accurate. > > -jason > > On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote: > >> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on >> Dan's data, but something deeper may be going on. >> Also get many of the following warnings, haven't looked at it >> closely: >> >> --------------------- WARNING --------------------- >> MSG: Removing score value(s) >> --------------------------------------------------- >> >> PATCH: >> >> Index: SearchUtils.pm >> =================================================================== >> --- SearchUtils.pm (revision 15674) >> +++ SearchUtils.pm (working copy) >> @@ -252,8 +252,8 @@ >> } >> >> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- >> >{'stop'} - $_->{'start'} + 1; >> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >> >{'iden'}; >> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >> >{'cons'}; >> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >> >{'iden'} || 0; >> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >> >{'cons'} || 0; >> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; >> } >> >> @@ -407,9 +407,12 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> + # make sure numerical >> + $_->{'iden'} ||= 0; >> + $_->{'cons'} ||= 0; >> $_->{'start'} = $start; # Assign a new start >> coordinate to the contig >> - $_->{'iden'} += $numID; # and add new data to >> #identical, #conserved. >> - $_->{'cons'} += $numCons; >> + $_->{'iden'} += ($numID||0); # and add new data to >> #identical, #conserved. 
>> + $_->{'cons'} += ($numCons||0); >> push(@{$_->{hsps}}, $hsp); >> $overlap = 1; >> } >> @@ -424,9 +427,13 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> + # make sure numerical >> + $_->{'iden'} ||= 0; >> + $_->{'cons'} ||= 0; >> + >> $_->{'stop'} = $stop; # Assign a new stop coordinate >> to the contig >> - $_->{'iden'} += $numID; # and add new data to >> #identical, #conserved. >> - $_->{'cons'} += $numCons; >> + $_->{'iden'} += ($numID||0); # and add new data to >> #identical, #conserved. >> + $_->{'cons'} += ($numCons||0); >> push(@{$_->{hsps}}, $hsp); >> $overlap = 1; >> } >> @@ -461,8 +468,8 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> - $ids += $these_ids; >> - $cons += $these_cons; >> + $ids += ($these_ids||0); >> + $cons += ($these_cons||0); >> } >> >> last if $hsp_start == $u_start; >> @@ -490,8 +497,8 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> - $ids += $these_ids; >> - $cons += $these_cons; >> + $ids += ($these_ids||0); >> + $cons += ($these_cons||0); >> } >> >> last if $hsp_end == $u_stop; >> >> ----- Original Message ----- From: "Chris Fields" > > >> To: "Mark A. Jensen" >> Cc: "BioPerl List" ; "Dan Bolser" > > >> Sent: Tuesday, May 12, 2009 10:04 AM >> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >> 'features'fromSearchIO?) >> >> >>> More complicated than that, I'm afraid. We should try to fix that >>> at the source of the problem. >>> >>> This appears to stem from SearchUtils HSP tiling, which in turn >>> utilizes HSPI::matches(), which in turn checks num_identical/ >>> num_conserved. My guess is, since this is blasttable format, one >>> of these isn't set and thus is returning the wrong value. I'll >>> attempt to track it down today, but it may take some time. >>> >>> chris >>> >>> On May 12, 2009, at 8:29 AM, Mark A. 
Jensen wrote: >>> >>>> This sounds like a >>>> >>>> $sum = eval join( '+', @a); >>>> >>>> problem, which can be fixed with >>>> >>>> $sum = eval join('+', map { $_ || () } @a) ; >>>> >>>> MAJ >>>> ----- Original Message ----- From: "Dan Bolser" >>> > >>>> To: "Chris Fields" >>>> Cc: "BioPerl List" >>>> Sent: Tuesday, May 12, 2009 9:17 AM >>>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' >>>> fromSearchIO?) >>>> >>>> >>>> 2009/5/12 Chris Fields : >>>>> Fixed that in svn. We're all still learning the ropes... >>>> >>>> In that case, I'm seeing multiple instances of... >>>> >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 256 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 412 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 429 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 465 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 473 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 494 >>>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>>> SearchUtils.pm line 502 >>>> >>>> >>>> Hmm... I was about to go on to complain about the weird GFF that >>>> I was >>>> seeing, but suddenly it looks OK. My bioperl install must think >>>> your >>>> standing over my shoulder and is therefore behaving itself! >>>> >>>> >>>> Thanks again for all the help, >>>> Dan. 
>>>> >>>> >>>> >>>> >>>>> chris >>>>> >>>>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >>>>> >>>>>> 2009/5/12 Dan Bolser : >>>>>>> >>>>>>> Unfortunately bp_search2gff.pl is giving me errors: >>>>>>> >>>>>>> bp_search2gff.pl --version 3 -i BlastResults/ >>>>>>> blast_table_filtered -f >>>>>>> blasttable -o BlastResults/blast_table_filtered.gff -t hit >>>>>>> --match --target --component >>>>>>> >>>>>>> --------------------- WARNING --------------------- >>>>>>> MSG: Removing score value(s) >>>>>>> --------------------------------------------------- >>>>>>> Can't locate object method "remove_tags" via package >>>>>>> "Bio::SeqFeature::Similarity" at >>>>>>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ >>>>>>> Generic.pm line >>>>>>> 393, line 5. >>>>>> >>>>>> >>>>>> I'm just learning the ropes... >>>>>> >>>>>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >>>>>> 15:25:55.000000000 +0100 >>>>>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >>>>>> 11:52:41.000000000 +0100 >>>>>> @@ -390,7 +390,7 @@ >>>>>> } >>>>>> if ($self->has_tag('score')) { >>>>>> $self->warn("Removing score value(s)"); >>>>>> - $self->remove_tags('score'); >>>>>> + $self->remove_tag('score'); >>>>>> } >>>>>> $self->add_tag_value('score',$value); >>>>>> } >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>>> Anyone seen this before? >>>>>>> >>>>>>> Cheers, >>>>>>> Dan. >>>>>>> >>>>>>> >>>>>>> >>>>>>> 2009/5/12 Dan Bolser : >>>>>>>> >>>>>>>> Thanks for the info guys, I think I was naively hoping that the >>>>>>>> feature would know how to cast itself as a 'SeqFeature' (GFF). >>>>>>>> >>>>>>>> I think I understand the problem better now, so I'll try to >>>>>>>> summarise: >>>>>>>> >>>>>>>> There is no standard way to encode a HSP as a feature (not >>>>>>>> least >>>>>>>> because there are two choices about which sequence (query or >>>>>>>> the hit) >>>>>>>> it should be attached to). 
BioPerl will try, but the result >>>>>>>> will not >>>>>>>> be "well structured" SeqFeatures or "well formed" GFF. >>>>>>>> >>>>>>>> >>>>>>>> From what I read I guess it should be possible to standardize >>>>>>>> this >>>>>>>> mapping (based on something in one of the examples or the >>>>>>>> 'search2gff' >>>>>>>> script), assuming you specify weather you want features put >>>>>>>> on the >>>>>>>> query or on the hit. >>>>>>>> >>>>>>>> At some point last year I was trying out the >>>>>>>> bp_search2gff.pl and my >>>>>>>> own code to write a GFF file for loading and viewing by >>>>>>>> Gbrowse. At >>>>>>>> that time I gave up, as nothing seemed to be working. I was >>>>>>>> hoping >>>>>>>> that doing this at a lower level (i.e. never writing any GFF >>>>>>>> myself) >>>>>>>> it would stand a better chance of working. >>>>>>>> >>>>>>>> Also I was thinking that Gbrowse, if given a >>>>>>>> SeqFeature::Store, could >>>>>>>> autoconfigure its interface to some degree. I guess its back >>>>>>>> to the >>>>>>>> docs ;-) >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> I'll keep trying and see if I can get anywhere. >>>>>>>> >>>>>>>> Thanks again, >>>>>>>> Dan. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> References for the above: >>>>>>>> >>>>>>>> 2009/5/11 Jason Stajich : >>>>>>>> >>>>>>>>> otherwise you need to be converting the HSPs into >>>>>>>>> seqfeatures with the >>>>>>>>> right associated information (i.e. the tag/value pairs that >>>>>>>>> are in the 9th >>>>>>>>> column) in order to have well structured data in the database. >>>>>>>> >>>>>>>>> You can get the individual features from the feature pair with >>>>>>>>> $hsp->query or $hsp->hit which can also be passed to a GFF >>>>>>>>> writer (or call >>>>>>>>> $hsp->hit->gff_string). Note that since the data storage is >>>>>>>>> not structured >>>>>>>>> in a GFF3 like-way this won't immediately produce well >>>>>>>>> formed GFF3 for the >>>>>>>>> 9th column. 
>>>>>>>> >>>>>>>> >>>>>>>> 2009/5/11 Chris Fields : >>>>>>>> >>>>>>>>> The main problem is the mapping is subjective based on what >>>>>>>>> your >>>>>>>>> reference sequence is within the BLAST run (e.g. whether it >>>>>>>>> is the query or >>>>>>>>> the hit), and is something that can't be automatically >>>>>>>>> discerned. I ended >>>>>>>>> up rolling my own with SeqFeature::Store (just mapped the >>>>>>>>> relevant data to >>>>>>>>> Bio::DB::SeqFeatures), but I have long wanted to fix up the >>>>>>>>> relevant scripts >>>>>>>>> to integrate my changes in, just haven't had the time >>>>>>>> >>>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> Bioperl-l mailing list >>>>>> Bioperl-l at lists.open-bio.org >>>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Jason Stajich > jason at bioperl.org > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From maj at fortinbras.us Tue May 12 22:00:41 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Tue, 12 May 2009 22:00:41 -0400 Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?) 
In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> <88072CDBECA446D0A7C48E74237F59FE@NewLife> Message-ID:

I dislike my patch, because it doesn't get to the bottom of why data members associated with numbers of conserved sites return from eval's undefined; it seems clear from the code that this is unexpected. Please excuse my naivete--why would this happen only in the blasttable format, and why hasn't this thing clucked before?

----- Original Message -----
From: Jason Stajich
To: Mark A. Jensen
Cc: Chris Fields ; BioPerl List ; Dan Bolser
Sent: Tuesday, May 12, 2009 7:07 PM
Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)

I really don't think tile_hsps should be used on BLAST data folks, it is a pretty blind approach. If you really want the right answer you need to do -links with WU-BLAST or FASTA. Been discussed a few times on the mailing list.

Good to fix the code bug I guess to avoid the warnings, but unless you are going to walk through all the HSPs and extract the consistent paths wrt query I think you'll have loops, etc in there which will make Hit->percent_id non-accurate.

-jason

From jason at bioperl.org Wed May 13 02:34:56 2009
From: jason at bioperl.org (Jason Stajich)
Date: Tue, 12 May 2009 23:34:56 -0700
Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com> <88072CDBECA446D0A7C48E74237F59FE@NewLife>
Message-ID: <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org>

Not looking at the code so I can't be sure, but I think since there is no sequence in the blasttable format and I think %cons or %id is calculated from some of the numbers that are taken from underlying sequence data in the alignment. I don't know why that would trip up tile_hsps necessarily but maybe that's not where the errors are coming from.
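
The warning Dan reported is what Perl prints when an empty string is used as a number, so it can be reproduced without BioPerl. A minimal sketch, assuming the tiling code is handed '' where it expects a count; the hash and variable names below are illustrative only and are not the actual SearchUtils.pm data structures:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Illustrative contig record and per-HSP counts (hypothetical values).
my %contig = ( iden => 10, cons => 12 );
my ( $num_id, $num_cons ) = ( '', '' );    # assumed: no counts available

$contig{iden} += $num_id;                  # unguarded: triggers the warning
$contig{cons} += ( $num_cons || 0 );       # guarded, as in the patch: silent

print "iden=$contig{iden} cons=$contig{cons}\n";
print "warning: $_" for @warnings;
```

Only the unguarded addition warns. The "|| 0" guard silences the message but says nothing about why the counts are empty to begin with, which is the open question in this thread.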
I've generally solved these conversions with simpler perl scripts rather than searchio since I usually want speed for this kind of stuff -- so I convert all the data into blast m9 format (FASTA,SSEARCH,WUBLAST, BLAST) and then have a simple parser that converts the m9 (without searchio) into the appropriate GFF with or without additional bells and whistles. Dan --- I am slowly cleaning up and migrating more of the pipelines that we use to generate polished GFF for loading into Gbrowse. You should just set a goal of making good GFF3 and many of the visualization and query issues go away. For genomes I tend to have a separate scaffold file which is just the 1 line per chromosome/scaffold/contig listing. Then a file for each analysis ie. Protein to genome BLAST, EST to genome, remapped Pfam domains to genome coordinates, gene predictions, etc. Keep it simple at first you know what you've updated when something stops working the way you expect. -jason On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote: > I dislike my patch, because it doesn't get to the bottom of why > data members associated with numbers of conserved sites return > from eval's undefined; it seems clear from the code that this is > unexpected. Please excuse my naivete--why would this happen > only in the blasttable format, and why hasn't this thing clucked > before? > ----- Original Message ----- > From: Jason Stajich > To: Mark A. Jensen > Cc: Chris Fields ; BioPerl List ; Dan Bolser > Sent: Tuesday, May 12, 2009 7:07 PM > Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting > 'features'fromSearchIO?) > > > I really don't think tile_hsps should be used on BLAST data folks, > it is a pretty blind approach. > If you really want the right answer you need to do -links with WU- > BLAST or FASTA. > Been discussed a few times on the mailing list. 
> > > Good to fix the code bug I guess to avoid the warnings, but unless > you are going to walk through all the HSPs and extract the > consistent paths wrt query I think you'll have loops, etc in there > which will make Hit->percent_id non-accurate. > > > -jason > > > On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote: > > > Patch below (to SearchUtils.pm) fixes the non-numeric warnings on > Dan's data, but something deeper may be going on. > Also get many of the following warnings, haven't looked at it > closely: > > --------------------- WARNING --------------------- > MSG: Removing score value(s) > --------------------------------------------------- > > PATCH: > > Index: SearchUtils.pm > =================================================================== > --- SearchUtils.pm (revision 15674) > +++ SearchUtils.pm (working copy) > @@ -252,8 +252,8 @@ > } > > $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- > >{'stop'} - $_->{'start'} + 1; > - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- > >{'iden'}; > - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- > >{'cons'}; > + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- > >{'iden'} || 0; > + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- > >{'cons'} || 0; > $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; > } > > @@ -407,9 +407,12 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > + # make sure numerical > + $_->{'iden'} ||= 0; > + $_->{'cons'} ||= 0; > $_->{'start'} = $start; # Assign a new start > coordinate to the contig > - $_->{'iden'} += $numID; # and add new data to > #identical, #conserved. > - $_->{'cons'} += $numCons; > + $_->{'iden'} += ($numID||0); # and add new data > to #identical, #conserved. 
> + $_->{'cons'} += ($numCons||0); > push(@{$_->{hsps}}, $hsp); > $overlap = 1; > } > @@ -424,9 +427,13 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > + # make sure numerical > + $_->{'iden'} ||= 0; > + $_->{'cons'} ||= 0; > + > $_->{'stop'} = $stop; # Assign a new stop > coordinate to the contig > - $_->{'iden'} += $numID; # and add new data to > #identical, #conserved. > - $_->{'cons'} += $numCons; > + $_->{'iden'} += ($numID||0); # and add new data > to #identical, #conserved. > + $_->{'cons'} += ($numCons||0); > push(@{$_->{hsps}}, $hsp); > $overlap = 1; > } > @@ -461,8 +468,8 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > - $ids += $these_ids; > - $cons += $these_cons; > + $ids += ($these_ids||0); > + $cons += ($these_cons||0); > } > > last if $hsp_start == $u_start; > @@ -490,8 +497,8 @@ > }; > if($@) { warn "\a\n$@\n"; } > else { > - $ids += $these_ids; > - $cons += $these_cons; > + $ids += ($these_ids||0); > + $cons += ($these_cons||0); > } > > last if $hsp_end == $u_stop; > > ----- Original Message ----- From: "Chris Fields" > > To: "Mark A. Jensen" > Cc: "BioPerl List" ; "Dan Bolser" > > Sent: Tuesday, May 12, 2009 10:04 AM > Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting > 'features'fromSearchIO?) > > > > More complicated than that, I'm afraid. We should try to fix > that at the source of the problem. > > > > This appears to stem from SearchUtils HSP tiling, which in > turn utilizes HSPI::matches(), which in turn checks num_identical/ > num_conserved. My guess is, since this is blasttable format, one > of these isn't set and thus is returning the wrong value. I'll > attempt to track it down today, but it may take some time. > > > > chris > > > > On May 12, 2009, at 8:29 AM, Mark A. 
Jensen wrote: > > > > This sounds like a > > > > $sum = eval join( '+', @a); > > > > problem, which can be fixed with > > > > $sum = eval join('+', map { $_ || () } @a) ; > > > > MAJ > > ----- Original Message ----- From: "Dan Bolser" > > > To: "Chris Fields" > > Cc: "BioPerl List" > > Sent: Tuesday, May 12, 2009 9:17 AM > > Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting > 'features' fromSearchIO?) > > > > > > 2009/5/12 Chris Fields : > > Fixed that in svn. We're all still learning the ropes... > > > > In that case, I'm seeing multiple instances of... > > > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 256 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 412 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 429 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 465 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 473 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 494 > > Argument "" isn't numeric in addition (+) at Bio/Search/ > SearchUtils.pm line 502 > > > > > > Hmm... I was about to go on to complain about the weird GFF > that I was > > seeing, but suddenly it looks OK. My bioperl install must > think your > > standing over my shoulder and is therefore behaving itself! > > > > > > Thanks again for all the help, > > Dan. 
> > > > > > > > > > chris > > > > On May 12, 2009, at 5:55 AM, Dan Bolser wrote: > > > > 2009/5/12 Dan Bolser : > > > > Unfortunately bp_search2gff.pl is giving me errors: > > > > bp_search2gff.pl --version 3 -i BlastResults/ > blast_table_filtered -f > > blasttable -o BlastResults/blast_table_filtered.gff -t > hit > > --match --target --component > > > > --------------------- WARNING --------------------- > > MSG: Removing score value(s) > > --------------------------------------------------- > > Can't locate object method "remove_tags" via package > > "Bio::SeqFeature::Similarity" at > > /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ > Generic.pm line > > 393, line 5. > > > > > > I'm just learning the ropes... > > > > --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 > > 15:25:55.000000000 +0100 > > +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 > > 11:52:41.000000000 +0100 > > @@ -390,7 +390,7 @@ > > } > > if ($self->has_tag('score')) { > > $self->warn("Removing score value(s)"); > > - $self->remove_tags('score'); > > + $self->remove_tag('score'); > > } > > $self->add_tag_value('score',$value); > > } > > > > > > > > > > > > Anyone seen this before? > > > > Cheers, > > Dan. > > > > > > > > 2009/5/12 Dan Bolser : > > > > Thanks for the info guys, I think I was naively > hoping that the > > feature would know how to cast itself as a > 'SeqFeature' (GFF). > > > > I think I understand the problem better now, so I'll > try to summarise: > > > > There is no standard way to encode a HSP as a feature > (not least > > because there are two choices about which sequence > (query or the hit) > > it should be attached to). BioPerl will try, but the > result will not > > be "well structured" SeqFeatures or "well formed" GFF. 
> > > > > > From what I read I guess it should be possible to > standardize this > > mapping (based on something in one of the examples or > the 'search2gff' > > script), assuming you specify weather you want > features put on the > > query or on the hit. > > > > At some point last year I was trying out the > bp_search2gff.pl and my > > own code to write a GFF file for loading and viewing > by Gbrowse. At > > that time I gave up, as nothing seemed to be working. > I was hoping > > that doing this at a lower level (i.e. never writing > any GFF myself) > > it would stand a better chance of working. > > > > Also I was thinking that Gbrowse, if given a > SeqFeature::Store, could > > autoconfigure its interface to some degree. I guess > its back to the > > docs ;-) > > > > > > > > I'll keep trying and see if I can get anywhere. > > > > Thanks again, > > Dan. > > > > > > > > References for the above: > > > > 2009/5/11 Jason Stajich : > > > > otherwise you need to be converting the HSPs into > seqfeatures with the > > right associated information (i.e. the tag/value > pairs that are in the 9th > > column) in order to have well structured data in > the database. > > > > You can get the individual features from the > feature pair with > > $hsp->query or $hsp->hit which can also be passed > to a GFF writer (or call > > $hsp->hit->gff_string). Note that since the data > storage is not structured > > in a GFF3 like-way this won't immediately produce > well formed GFF3 for the > > 9th column. > > > > > > 2009/5/11 Chris Fields : > > > > The main problem is the mapping is subjective based > on what your > > reference sequence is within the BLAST run (e.g. > whether it is the query or > > the hit), and is something that can't be > automatically discerned. 
I ended > > up rolling my own with SeqFeature::Store (just > mapped the relevant data to > > Bio::DB::SeqFeatures), but I have long wanted to > fix up the relevant scripts > > to integrate my changes in, just haven't had the time > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > > Jason Stajich > jason at bioperl.org > > > > > > > Jason Stajich jason at bioperl.org From jonathan at leto.net Wed May 13 03:05:47 2009 From: jonathan at leto.net (Jonathan Leto) Date: Wed, 13 May 2009 00:05:47 -0700 Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem withModule::Builder versions In-Reply-To: <35C29690B62445D4AAE15050823A6514@NewLife> References: <4A03986D.7080007@gmail.com> <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> <35C29690B62445D4AAE15050823A6514@NewLife> Message-ID: <9aaadf9c0905130005x89c867an5a786025261abb5@mail.gmail.com> Howdy, >> I'm trying to install the latest Gbrowse (1.99) on a machine where I do >> not have root access (Ubuntu/dapper). >> >> I have set up non-root CPAN and installed all of the prerequisites, no >> problems, in ~/lib/perl5. However, when I try to install Gbrowse either >> via CPAN or using the latest CVS Build script, I run into this problem: >> Make sure that you export PERL5LIB=~/lib/perl5 before building. 
Cheers,

--
Jonathan Leto
jonathan at leto.net
http://leto.net

From dan.bolser at gmail.com Wed May 13 06:47:11 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Wed, 13 May 2009 11:47:11 +0100
Subject: [Bioperl-l] SearchIO to GFF (was: Getting 'features'fromSearchIO?)
In-Reply-To: <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org>
References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com>
 <2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com>
 <9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu>
 <2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com>
 <88072CDBECA446D0A7C48E74237F59FE@NewLife>
 <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org>
Message-ID: <2c8757af0905130347v4f99a7f2u6ab19211a2e13ee8@mail.gmail.com>

2009/5/13 Jason Stajich :
> Dan ---
> I am slowly cleaning up and migrating more of the pipelines that we use to
> generate polished GFF for loading into Gbrowse. You should just set a goal
> of making good GFF3 and many of the visualization and query issues go away.

I see. That's actually really useful to know.

> For genomes I tend to have a separate scaffold file which is just the 1 line
> per chromosome/scaffold/contig listing.

Is this the file that defines the 'entry points' for each sequence
object? i.e. instead of generating this on the fly while processing
the blast results, do it once initially for every sequence object you
have?

> Then a file for each analysis ie. Protein to genome BLAST, EST to genome,
> remapped Pfam domains to genome coordinates, gene predictions, etc. Keep it
> simple at first you know what you've updated when something stops working
> the way you expect.
> -jason

Thanks all for the help. Very much appreciated!

Dan.

P.S. dumb question but... what defines a HIT in blasttable format? is
it strictly the set of HSPs with the same query and subject, or is
phase and relative position also taken into account?
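
For reference, the "scaffold file" Jason describes (one GFF3 line per
chromosome/scaffold/contig) could be generated along the following lines.
This is only a minimal sketch in core Perl, not BioPerl code: the helper
name fasta_to_landmarks is hypothetical, and the '.' source column and
'contig' type are assumptions that should be adjusted to match the
GBrowse configuration in use.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): read FASTA records from a
# filehandle and return one GFF3 landmark line per sequence, in file order.
# The source column ('.') and the default type ('contig') are assumptions.
sub fasta_to_landmarks {
    my ($fh, $type) = @_;
    $type = 'contig' unless defined $type;

    my (@order, %length, $id);
    while (my $line = <$fh>) {
        $line =~ s/[\r\n]+\z//;                  # strip Unix or DOS line endings
        if ($line =~ /^>/) {                     # header: the ID is the first word
            ($id) = $line =~ /^>(\S+)/
                or die "FASTA header without an ID at line $.\n";
            die "duplicate sequence ID '$id'\n" if exists $length{$id};
            push @order, $id;
            $length{$id} = 0;
        }
        elsif (defined $id) {
            $line =~ s/\s+//g;                   # ignore whitespace inside sequence
            $length{$id} += length $line;
        }
    }

    my @lines;
    for my $seqid (@order) {
        next unless $length{$seqid};             # skip records with no sequence
        # columns: seqid source type start end score strand phase attributes
        # (IDs containing ';', '=', ',' or tabs would need URL-escaping first)
        push @lines, join "\t", $seqid, '.', $type, 1, $length{$seqid},
                                '.', '.', '.', "ID=$seqid;Name=$seqid";
    }
    return @lines;
}

# Demo on an in-memory FASTA string; a real run would open the scaffold
# FASTA file instead.
my $demo = ">C08HBa0216M19.1 example clone\nACGTACGTAC\nGGTTAA\n";
open my $demo_fh, '<', \$demo or die "cannot open in-memory FASTA: $!";
print "##gff-version 3\n";
print "$_\n" for fasta_to_landmarks($demo_fh);
```

Doing this once per genome, rather than on the fly while processing
BLAST results, keeps the landmark lines in their own file, separate from
the per-analysis GFF files.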
From dan.bolser at gmail.com Wed May 13 13:01:27 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Wed, 13 May 2009 18:01:27 +0100
Subject: [Bioperl-l] fasta2gff_landmark ...
Message-ID: <2c8757af0905131001j37c8d29ev2a2b95fbfbe46b9f@mail.gmail.com>

Hi,

I am trying to create a script to write one 'GFF landmark' line for
each entry in a fasta file. I'm doing this so that the landmarks show
up in GBrowse (along with my attached similarity 'features').

In a nutshell, I was told to add lines like:

C08HBa0216M19.1    .    contig    1        .    .    .    ID=C08HBa0216M19.1;Name=C08HBa0216M19.1

to the GFF so that GBrowse would pick up the features which have
C08HBa0216M19.1 as their 'landmark'. This seems to be working great
for the sequences where I added this line manually, but now I want to
write a script to automatically generate these GFF landmarks from a
fasta file.

So far I have a script that looks like this:

#!/usr/bin/perl -w

use strict;

use Bio::SeqIO;
use Bio::SeqFeature::Generic;
use Bio::FeatureIO;

[snip]

my $gffIO =
    new Bio::FeatureIO( -format  => 'GFF',
                        -version => 3
                        );

## Now read in the fasta file to create the landmarks

my $seqIO =
  new Bio::SeqIO( -file   => $finished_bacs_repeatmasked,
                  -format => 'fasta'
                );

## Begwin

while(my $seq = $seqIO->next_seq){
  ## Create a new generic feature to hold the properties of the
  ## sequence;

  my $name = $seq->id;
  my $leng = $seq->length;

  my $accn;

  next unless $accn = $finished{$name};

  my $f =
    new Bio::SeqFeature::Generic( -seq_id  => $name,
                                  -primary => "SGN",
                                  -source  => "genomic_clone",
                                  -start   =>     1,
                                  -end     => $leng,
                                  -tag     => { ID   => $name,
                                                Name => $name,
                                                Dbxref => "GB:$accn",
                                                Ontology_term => "SO:0000040"
                                              }
                                );

  print $gffIO->write_feature($f);
  exit;
}

The script gives me the following error:

Use of uninitialized value in pattern match (m//) at
~/perl5/lib/perl5/Bio/FeatureIO/gff.pm line 107, line 688.

------------- EXCEPTION: Bio::Root::Exception -------------
MSG: only Bio::SeqFeature::Annotated objects are writeable
STACK: Error::throw
STACK: Bio::Root::Root::throw ~/perl5/lib/perl5/Bio/Root/Root.pm:368
STACK: Bio::FeatureIO::gff::write_feature ~/perl5/lib/perl5/Bio/FeatureIO/gff.pm:272
STACK: ./fasta2gff_landmarks.plx:72
-----------------------------------------------------------

With all due respect to:

* http://www.bioperl.org/wiki/GFF_Refactor and
* http://www.bioperl.org/wiki/GFF_code_audit

I'm a bit stuck.

Any quick fix in the meantime? I realise that using a feature object
is overkill here, but I wanted to learn the 'BioPerl way' of doing
things. Should I drop that approach until the code audit and
refactorization is over, or is this the kind of use-case that can help
with that process?

Thanks for any pointers,
Dan.

P.S. Thanks also to the friendly people on
irc://ircfreenode.net/#bioperl and
irc://irc.freenode.net/#bioinformatics who helped me get this script
together as it is.

From scott at scottcain.net Wed May 13 13:17:48 2009
From: scott at scottcain.net (Scott Cain)
Date: Wed, 13 May 2009 13:17:48 -0400
Subject: [Bioperl-l] fasta2gff_landmark ...
In-Reply-To: <2c8757af0905131001j37c8d29ev2a2b95fbfbe46b9f@mail.gmail.com>
References: <2c8757af0905131001j37c8d29ev2a2b95fbfbe46b9f@mail.gmail.com>
Message-ID: <536f21b00905131017t375774c7r20bfb5e261fb7692@mail.gmail.com>

Hi Dan,

I've already done this; take a look at:

http://gmod.cvs.sourceforge.net/viewvc/*checkout*/gmod/schema/chado/bin/gmod_fasta2gff3.pl

While it is part of the GMOD/Chado distribution, the only modules it
uses are Bio::DB::Fasta and Getopt::Long, and it has some nice
convenience functions for modifying the resulting GFF.

Scott

On Wed, May 13, 2009 at 1:01 PM, Dan Bolser wrote:
> Hi,
>
> I am trying to create a script to write one 'GFF landmark' line for
> each entry in a fasta file. I'm doing this so that the landmarks show
> up in GBrowse (along with my attached similarity 'features').
>
> In a nutshell, I was told to add lines like:
>
> C08HBa0216M19.1    .    contig    1        .    .    .
> ID=C08HBa0216M19.1;Name=C08HBa0216M19.1
>
> to the GFF so that GBrowse would pick up the features which have
> C08HBa0216M19.1 as their 'landmark'. This seems to be working great
> for the sequences where I added this line manually, but now I want to
> write a script to automatically generate these GFF landmarks from a
> fasta file.
>
> So far I have a script that looks like this:
>
> #!/usr/bin/perl -w
>
> use strict;
>
> use Bio::SeqIO;
> use Bio::SeqFeature::Generic;
> use Bio::FeatureIO;
>
> [snip]
>
> my $gffIO =
>     new Bio::FeatureIO( -format  => 'GFF',
>                         -version => 3
>                         );
>
> ## Now read in the fasta file to create the landmarks
>
> my $seqIO =
>   new Bio::SeqIO( -file   => $finished_bacs_repeatmasked,
>                   -format => 'fasta'
>                 );
>
> ## Begwin
>
> while(my $seq = $seqIO->next_seq){
>   ## Create a new generic feature to hold the properties of the
>   ## sequence;
>
>   my $name = $seq->id;
>   my $leng = $seq->length;
>
>   my $accn;
>
>   next unless $accn = $finished{$name};
>
>   my $f =
>     new Bio::SeqFeature::Generic( -seq_id  => $name,
>                                   -primary => "SGN",
>                                   -source  => "genomic_clone",
>                                   -start   =>     1,
>                                   -end     => $leng,
>                                   -tag     => { ID   => $name,
>                                                 Name => $name,
>                                                 Dbxref => "GB:$accn",
>                                                 Ontology_term => "SO:0000040"
>                                               }
>                                 );
>
>   print $gffIO->write_feature($f);
>   exit;
> }
>
> The script gives me the following error:
>
> Use of uninitialized value in pattern match (m//) at
> ~/perl5/lib/perl5/Bio/FeatureIO/gff.pm line 107, line 688.
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: only Bio::SeqFeature::Annotated objects are writeable
> STACK: Error::throw
> STACK: Bio::Root::Root::throw ~/perl5/lib/perl5/Bio/Root/Root.pm:368
> STACK: Bio::FeatureIO::gff::write_feature
> ~/perl5/lib/perl5/Bio/FeatureIO/gff.pm:272
> STACK: ./fasta2gff_landmarks.plx:72
> -----------------------------------------------------------
>
> With all due respect to:
>
> * http://www.bioperl.org/wiki/GFF_Refactor and
> * http://www.bioperl.org/wiki/GFF_code_audit
>
> I'm a bit stuck.
>
> Any quick fix in the meantime? I realise that using a feature object
> is overkill here, but I wanted to learn the 'BioPerl way' of doing
> things. Should I drop that approach until the code audit and
> refactorization is over, or is this the kind of use-case that can help
> with that process?
>
> Thanks for any pointers,
> Dan.
>
> P.S. Thanks also to the friendly people on
> irc://ircfreenode.net/#bioperl and
> irc://irc.freenode.net/#bioinformatics who helped me get this script
> together as it is.

--
------------------------------------------------------------------------
Scott Cain, Ph. D.
scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From ajmackey at gmail.com Wed May 13 13:47:58 2009
From: ajmackey at gmail.com (Aaron Mackey)
Date: Wed, 13 May 2009 13:47:58 -0400
Subject: [Bioperl-l] alignable portion of a genome
In-Reply-To: <20090512233408.GB17765@eniac.jgi-psf.org>
References: <23480025.post@talk.nabble.com>
 <18DF7D20DFEC044098A1062202F5FFF32493DA5373@exchsth.agresearch.co.nz>
 <18DF7D20DFEC044098A1062202F5FFF32493DA54D2@exchsth.agresearch.co.nz>
 <20090512233408.GB17765@eniac.jgi-psf.org>
Message-ID: <24c96eca0905131047s6ec727a5k97a5d506e1dc7bb0@mail.gmail.com>

Just to be clear, my suggestion was to realign the genome-derived reads
without any additional errors introduced. By not adding any other
errors, you'll still get to find out which regions of the genome are
unique even while accepting 2 mutations (or however you set up your
alignment filtering).

Russell's hash method also won't work well when the number of mutations
increases to, say, 10 out of 100bp reads (which is what I'm dealing with
right now, as a matter of fact ...), and certainly won't take into
account the ability of using paired end reads to render an ambiguous
alignment unique based on appropriate location relative to a uniquely
aligned mate pair.

So for all these reasons, I still argue that the best way to accurately
and meaningfully measure the alignable portion of a genome is to simply
align the genome (without errors) back to itself using your alignment
method/filters of choice. If you're doing paired-end reads, you'll still
have to simulate that aspect (don't just take reads spaced exactly by
200 bp -- instead, you should use the empirical distribution of read
distances seen in your data, and repeat the simulation-alignment a few
times to see how stable your final estimates are).
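
The paired-end point above (resampling mate distances from the data
rather than using a fixed 200 bp spacing) might be sketched as follows.
This is a hedged illustration in core Perl only: simulate_pairs is a
hypothetical helper, and it assumes "distance" means the outer distance
from the start of the left read to the end of the right read. Real data
may define insert size differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: draw $n read-pair placements (1-based coordinates)
# on a genome of length $genome_len. The outer distance of each pair is
# resampled, with replacement, from distances observed in real data.
sub simulate_pairs {
    my ($genome_len, $read_len, $observed, $n) = @_;

    # keep only distances that can hold a read and fit inside the genome
    my @usable = grep { $_ >= $read_len && $_ <= $genome_len } @$observed;
    die "no usable insert sizes\n" unless @usable;

    my @pairs;
    for (1 .. $n) {
        my $insert = $usable[ int rand @usable ];              # empirical draw
        my $left   = 1 + int rand($genome_len - $insert + 1);  # left read start
        my $right  = $left + $insert - $read_len;              # right read start
        push @pairs, [ $left, $right, $insert ];
    }
    return @pairs;
}

# Demo: three pairs of 36 bp reads on a 10 kb sequence, using a made-up
# list of observed distances.
my @observed = (185, 192, 200, 204, 211, 230);
for my $pair (simulate_pairs(10_000, 36, \@observed, 3)) {
    printf "left=%d right=%d insert=%d\n", @$pair;
}
```

Running the simulation several times with different random seeds, and
aligning each set, gives a rough idea of how stable the final alignable
fraction is.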
-Aaron

On Tue, May 12, 2009 at 7:34 PM, Joel Martin wrote:

> Hello,
> Doing this with hashes ends up being a little inefficient for
> larger kmers like 35 in larger genomes. A suffix array tool
> like 'tallymer' will tell you the unique/non-unique kmer counts
> quickly, and as Aaron suggested generating fake reads based
> on a reference and introduce errors into them so you can evaluate
> how well they map back is a good strategy. maq has a command to do
> that built in.
>
> Joel
>
> On Wed, May 13, 2009 at 09:27:57AM +1200, Smithies, Russell wrote:
> > Adding the mutations is a little hacky (and probably slow) but I think it
> > works correctly.
> > The stats should work out OK but it's too early and I haven't had a
> > coffee yet so can't be sure :-)
> >
> > --Russell
> >
> > ============================
> > #!perl -w
> >
> > my $seq = "atcgacgatcgaacgatcga";
> > my $debug = 0;
> >
> >
> > foreach ($seq =~ /(?=(\w{5}))/g){
> >     $h++;
> >     # add all the exact words to the hash
> >     $hash{$_}++;
> >     print "$_\n" if $debug;
> >     # mutate words and add to hash
> >     my @rr = mutate($_);
> >     foreach (@rr){
> >         print "$_\n" if $debug;
> >         $h++;
> >         $hash{$_}++;
> >     }
> > }
> >
> >
> > # print out the hash counts & stats
> > foreach (keys %hash){
> >     print "$_\t$hash{$_}\n" if $debug;
> >     $singles++ if($hash{$_} eq 1);
> > }
> > print $singles/$h,"\n";
> >
> >
> > sub mutate{
> >     my @array = split '',shift;
> >     my @res = ();
> >     my $rep = 'X';
> >     for(my$i = 0; $i <= $#array; $i++){
> >         my $old1 = $array[$i];
> >         splice @array, $i, 1, $rep;
> >         push @res, (join '', @array);
> >         for(my$j = $i+1; $j <= $#array; $j++){
> >             my $old2 = $array[$j];
> >             splice @array, $j, 1, $rep;
> >             push @res, (join '', @array);
> >             splice @array, $j, 1, $old2;
> >         }
> >         splice @array, $i, 1, $old1;
> >     }
> >     return @res;
> > }
> >
> > ================================
> >
> > > -----Original Message-----
> > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of Smithies, Russell
> > > Sent: Tuesday, 12 May 2009 3:56 p.m.
> > > To: 'fadista'; 'Bioperl-l at lists.open-bio.org'
> > > Subject: Re: [Bioperl-l] alignable portion of a genome
> > >
> > > Perfect matches is easy:
> > >
> > > $seq = "atcgacgatcgaacgatcga";
> > >
> > > foreach ($seq =~ /(?=(\w{5}))/g){$h++; $hash{$_}++}
> > > foreach (keys %hash){ $singles++ if($hash{$_} eq 1)}
> > > print $singles/$h;
> > >
> > > Could probably be done with map as well.
> > > Counting the mismatches might take a bit more thinking....
> > > Any ideas MAJ?
> > >
> > > --Russell
> > >
> > > > -----Original Message-----
> > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of fadista
> > > > Sent: Monday, 11 May 2009 9:32 p.m.
> > > > To: Bioperl-l at lists.open-bio.org
> > > > Subject: [Bioperl-l] alignable portion of a genome
> > > >
> > > > Hi,
> > > >
> > > > I would like to know of a good and fast way that could help me calculate the
> > > > alignable portion of a genome (not human), given a reference sequence.
> > > > When I say alignable portion I mean that I want to know all the positions of
> > > > the genome that can be covered uniquely by reads of 36 bp and up to 2
> > > > mismatches.
> > > >
> > > > Some have advised me to work with Perl using the following strategy but I am
> > > > not a Perl user so if someone has already a script for this function, it
> > > > would be nice:
> > > >
> > > > "you could approach it by walking along the genome in a sliding window of
> > > > 36 nt, and hash the frequency of each 36 nt sequence that you encounter.
> > > > Then count how many of the 36 nt sequences had a frequency of exactly
> > > > one. Divide this by the total number of 36nt windows visited. This
> > > > should be do-able in about 20 lines of Perl."
> > > >
> > > > Best regards and thanks in advance
> > > >
> > > > --
> > > > View this message in context:
> > > > http://www.nabble.com/alignable-portion-of-a-genome-tp23480025p23480025.html
> > > > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From cjfields at illinois.edu Wed May 13 13:55:17 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 13 May 2009 12:55:17 -0500
Subject: [Bioperl-l] Creating a fastq format file?
Message-ID:

Heikki,

Did you still want to commit this?
I think it's a good idea and would be worth including in the next 1.6
point release.

chris

------------------------------------------------------------

I convinced at least myself to the degree that I wrote the
range_convert() method - with plenty of tests. I mention this now so
that no-one else need to start thinking through all the edge values. :)

I'll contribute it to the code base once there is a consensus of best
way forward.

-Heikki

2009/4/27 Heikki Lehvaslaiho :
>> I have tried to summarise this in a central place:
>> http://en.wikipedia.org/wiki/FASTQ_format
>
> Torsten,
>
> Thanks for putting this together. Very helpful.
>
> Do you have a plan of action? Let me propose one for BioPerl. It
> based on following assumptions:
>
> 1. There is multitude of different ways of coding quality values out there.
> 2. Bio::Seq::Quality is agnostic of any quality value range rules
> 3. The emerging open standard is the Sanger fastq specification
> 4. Open source programs use the Sanger fastq specs
>
>
> From these it follows that:
>
>
> 1. BioPerl should support Sanger fastq standard
>
> 1.1. it already does and there are other SeqIO modules for dealing
> with other non-fastq formats.
>
> 2. BioPerl should offer simple ways of converting between quality range rules
>
> 2.1. Have a generic method accessible from Bio::Seq::Quality with
> preset versions of the method for converting between known variants
> (Sanger fastq and the two Illumina versions)
>
> For example:
>
> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>     throw if $value < $from_lower or $value > $from_upper
>     return $newvalue
>
> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
> range_convert_fastq2phred(), range_convert_phred2fastq()....
>
> (assuming that illumina 1.3 eq phred)
>
> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
> qualities into Sanger fastq on the fly
>
> 2.2.1 Bio::SeqIO::Fastq::next_seq should detect the incoming stream of
> quality value range either automatically or be given a keyword
> parameter indicating the range.
>
> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
> a quality value out of range.
>
> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
> detects a quality value out of range.
>
> 2.2.4. It would be useful but not absolutely necessary for
> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
> ranges
>
>
> What do you think?
>
> -Heikki
>
> 2009/4/26 Torsten Seemann :
>>> > This might be a good place to ask the question: having looked at the
>>> > fastq.pm page, is the fastq format defined (only) by a "@'" followed by a
>>> > sequence line and a "+" header followed by a quality line and the two
>>> > headers have to agree? Now that Illumina is using phred scaling, are
>>> > 'Sanger' and 'Illumina' versions the same?
>>>
>>> No they aren't the same, Illumina still encodes the ascii as value + 64
>>> and Sanger as value + 33.
>>>
>>
>> Illumina have now CHANGED how they calculate the quality value however in
>> the last month or so... Their Q range used to be -5..40 mapped to ASCII 64+,
>> but now they produce Q >= 0 and it is unclear if they start at 69 or 64
>> now...
>>
>> I have tried to summarise this in a central place:
>>
>> http://en.wikipedia.org/wiki/FASTQ_format
>>
>> Corrections welcome!
>>
>>
>> --Torsten Seemann
>> --Victorian Bioinformatics Consortium, Dept.
Microbiology, Monash
>> University, AUSTRALIA
>
>
>
> --
> -Heikki
> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
> cell: +27 (0)714328090
> Sent from Claremont, WC, South Africa
>

--
-Heikki
Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
cell: +27 (0)714328090
Sent from Claremont, WC, South Africa

From jason at bioperl.org Wed May 13 19:57:52 2009
From: jason at bioperl.org (Jason Stajich)
Date: Wed, 13 May 2009 16:57:52 -0700
Subject: [Bioperl-l] Bio::DB::SeqFeature::Store::berkeleydb
Message-ID: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org>

Lincoln -

Looks like there has been a change to the berkeleydb.pm Store module
which forces the 'create' bit to 0 when doing reindex_gfffiles. This
means an empty DB cannot be created.

So I get these errors:

-------------------- EXCEPTION --------------------
MSG: Couldn't tie: ../gff/indexes/features.bdb No such file or directory
STACK Bio::DB::SeqFeature::Store::berkeleydb::_open_databases /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:576
STACK Bio::DB::SeqFeature::Store::berkeleydb::reindex_gfffiles /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:374
STACK Bio::DB::SeqFeature::Store::berkeleydb::auto_reindex /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:314
STACK Bio::DB::SeqFeature::Store::berkeleydb::init /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:293
STACK Bio::DB::SeqFeature::Store::new /usr/local/lib/perl5/Bio/DB/SeqFeature/Store.pm:360
STACK toplevel scripts/haplotype_block_association.pl:12

This stems from Line 374 of berkeleydb.pm in reindex_gfffiles:

371   warn "Reindexing GFF files...\n" if $self->verbose;
372   $self->_permissions(1,1);
373   $self->_close_databases();
374   $self->_open_databases(1,0);

*Could* be changed to this:

374   $self->_open_databases(1,1);

But I'm not sure: do you want the CREATE bit always set to on when
reindexing? Maybe yes, since the DB was typically erased beforehand
whenever a reindex (or initial index) is applied?

Thanks,
-jason

--
Jason Stajich
jason at bioperl.org

From neilfws at gmail.com Tue May 12 18:44:50 2009
From: neilfws at gmail.com (Neil Saunders)
Date: Wed, 13 May 2009 08:44:50 +1000
Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem with Module::Builder versions
In-Reply-To: <2A93D0F6-3A2A-4638-A095-26CFC595E8A1@verizon.net>
References: <4A03986D.7080007@gmail.com>
 <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com>
 <2A93D0F6-3A2A-4638-A095-26CFC595E8A1@verizon.net>
Message-ID: <4A09FBE2.8040000@gmail.com>

> Try setting the environmental variable PERL5LIB:

Thanks for the tip - however, PERL5LIB is set (to ~/lib/perl5).

The Module::Build docs state that Config.pm is used by Module::Build.
So far as I can tell, the initial Build is using the system-wide perl
installation (/usr/local, /etc/CPAN) and the 'Build install' to $HOME
uses my personal Module::Build. Problems arise because these are
different versions (0.28 v 0.32).

I assume that I can edit the Build scripts in some way to use only my
personal installation - will keep working on this.
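
When two copies of a module such as Module::Build are suspected, one
quick check is to ask perl which file it actually loads and what version
that file reports. The following is a minimal sketch using only core
Perl; module_location is a hypothetical helper name, not part of
Module::Build or CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: load a module by name and report the file it was
# loaded from plus its $VERSION, so mixed installations are easy to spot.
# Returns an empty list when the module cannot be found or loaded.
sub module_location {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Module::Build -> Module/Build.pm
    return unless eval { require $file; 1 };   # not loadable from @INC
    my $version = $module->VERSION;
    return ($INC{$file}, defined $version ? $version : 'unknown');
}

# Report on the modules named on the command line (default: Module::Build).
for my $module (@ARGV ? @ARGV : ('Module::Build')) {
    my ($path, $version) = module_location($module);
    print defined $path
        ? "$module $version loaded from $path\n"
        : "$module not found in \@INC\n";
}
```

Running the script once in the normal shell and once with PERL5LIB
unset should show whether the personal copy or the system-wide copy is
being picked up at each stage.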
Neil -- Statistical Bioinformatics - Health CSIRO Mathematical and Information Sciences Locked Bag 17, North Ryde, NSW 1670, Australia http://friendfeed.com/neilfws From neilfws at gmail.com Tue May 12 19:09:59 2009 From: neilfws at gmail.com (Neil Saunders) Date: Wed, 13 May 2009 09:09:59 +1000 Subject: [Bioperl-l] [Gmod-gbrowse] Non-root installation: problem with Module::Builder versions In-Reply-To: References: <4A03986D.7080007@gmail.com> <71ee57c70905121036w7fba1ac6x7e68db41f3035ee3@mail.gmail.com> <2A93D0F6-3A2A-4638-A095-26CFC595E8A1@verizon.net> <4A09FBE2.8040000@gmail.com> Message-ID: <4A0A01C7.5090301@gmail.com> > So this doesn't work? > ./Build --install_base ~/ install Afraid not. That gives me the: Global symbol "$VAR1" requires explicit package name at (eval 28) line 1093, line 1. ...propagated at /usr/local/share/perl/5.8.7/Module/Build/Base.pm line 1002, line 1. I think I need the sysadmin to update the system Module::Build. Neil -- Statistical Bioinformatics - Health CSIRO Mathematical and Information Sciences Locked Bag 17, North Ryde, NSW 1670, Australia http://friendfeed.com/neilfws From ABARRIE at uta.edu Wed May 13 18:07:57 2009 From: ABARRIE at uta.edu (Barrie, Assiatu B) Date: Wed, 13 May 2009 17:07:57 -0500 Subject: [Bioperl-l] Errors with Bioperl installation Message-ID: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> Hello, I am a graduate student attempting to install Bioperl on a unix (Mac OS ) platform without any success. I keep running into the problem with the make file (here is the error message: Running make test Can't test without successful make Running make install make had returned bad status, install seems impossible). I always end up with this message no matter what I tried. Please help, I would really appreciate the help, I need Bioperl to run another program. 
Thank you Assiatu Barrie Research Assistant University of Texas at Arlington Biology Department Box 19498 Arlington, TX 76019-0498 abarrie at uta.edu Lab# 817-272-0523 From scott at scottcain.net Wed May 13 22:24:55 2009 From: scott at scottcain.net (Scott Cain) Date: Wed, 13 May 2009 22:24:55 -0400 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> Message-ID: <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com> Hi Assiatu, Did you install the developers tools for Mac OS X? You probably need to have make installed. Scott On Wed, May 13, 2009 at 6:07 PM, Barrie, Assiatu B wrote: > Hello, > > I am a graduate student attempting to install Bioperl on a unix (Mac OS ) platform without any success. I keep running into the problem with the make file (here is the error message: Running make test > Can't test without successful make > Running make install > make had returned bad status, install seems impossible). I always end up with this message no matter what I tried. Please help, I would really appreciate the help, I need Bioperl to run another program. Thank you > > > > Assiatu Barrie > Research Assistant > University of Texas at Arlington > Biology Department Box 19498 > Arlington, TX 76019-0498 > abarrie at uta.edu > Lab# 817-272-0523 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- ------------------------------------------------------------------------ Scott Cain, Ph. D.
scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From cjfields at illinois.edu Wed May 13 22:26:21 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 13 May 2009 21:26:21 -0500 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> Message-ID: This doesn't sound like the latest installation directions: http://www.bioperl.org/wiki/Installing_Bioperl_for_Unix chris On May 13, 2009, at 5:07 PM, Barrie, Assiatu B wrote: > Hello, > > I am a graduate student attempting to install Bioperl on a unix (Mac > OS ) platform without any success. I keep running into the problem > with the make file (here is the error message: Running make test > Can't test without successful make > Running make install > make had returned bad status, install seems impossible). I always > end up with this message no matter what I tried. Please help, I > would really appreciate the help, I need Bioperl to run another > program. Thank you > > > > Assiatu Barrie > Research Assistant > University of Texas at Arlington > Biology Department Box 19498 > Arlington, TX 76019-0498 > abarrie at uta.edu > Lab# 817-272-0523 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From dan.bolser at gmail.com Thu May 14 04:16:33 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Thu, 14 May 2009 09:16:33 +0100 Subject: [Bioperl-l] fasta2gff_landmark ... 
In-Reply-To: <536f21b00905131017t375774c7r20bfb5e261fb7692@mail.gmail.com> References: <2c8757af0905131001j37c8d29ev2a2b95fbfbe46b9f@mail.gmail.com> <536f21b00905131017t375774c7r20bfb5e261fb7692@mail.gmail.com> Message-ID: <2c8757af0905140116g7927ba2dlbcdc71b0f78857@mail.gmail.com> 2009/5/13 Scott Cain : > Hi Dan, > > I've already done this; take a look at: > > http://gmod.cvs.sourceforge.net/viewvc/*checkout*/gmod/schema/chado/bin/gmod_fasta2gff3.pl > > While it is part of the GMOD/Chado distribution, the only modules it > uses is Bio::DB::Fasta and GetOpt::Long, and it has some nice > convenience functions for modifying the resulting GFF. Thanks Scott, That is more or less exactly what I need (and a vote for 'wait for the code audit' I guess ;-). I'm still seeing a couple of errors: Use of uninitialized value in pattern match (m//) at ~/perl5/lib/perl5/Bio/DB/Fasta.pm line 1168. Use of uninitialized value in exists at ~/perl5/lib/perl5/Bio/DB/Fasta.pm line 617. But the resulting GFF looks fine. Thanks again, Dan. > Scott > From avilella at gmail.com Thu May 14 09:45:17 2009 From: avilella at gmail.com (Albert Vilella) Date: Thu, 14 May 2009 14:45:17 +0100 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> Message-ID: <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> Hi all, In Ensembl, we are interested in providing NeXML dumps for our Comparative Genomics data. Because our pipeline is written in Perl, I guess most of the work done here will be of great use to us. If I could only ask for only a feature, that would be to *try* and backport the NeXML support to bioperl-1.2.3 --- stress on the *try*. Bioperl 1.2.3 is the release that Ensembl decided to stick to many years ago, so it's cleaner for people to use our Perl API with only one version of bioperl as a dependency. Looking forward to hearing from this SoC. 
Have you got a blog? Cheers, Albert. On Mon, May 11, 2009 at 5:24 PM, Jason Stajich wrote: > Welcome Chase. > > Look forward to the project and helping where needed. > > -jason > > > On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote: > > Hello all, >> With great pleasure, I want to introduce Chase Miller, my Google Summer of >> Code student from George Washington University, to the community. Chase will >> be working with me and Rutger Vos on a BioPerl wrapper for Rutger's >> Bio::Phylo package, with a particular emphasis on creating a BioPerl-native >> way to import and export the NeXML (http://nexml.org) phylogenetic data >> format. He wrote a great proposal, available here: >> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit >> . >> We will be working throughout the summer on the project, and will of >> course come to you for sage advice. I know you will welcome him warmly, as >> you did me. >> Cheers, >> Mark >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > Jason Stajich > jason at bioperl.org > > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From scott at scottcain.net Thu May 14 09:53:22 2009 From: scott at scottcain.net (Scott Cain) Date: Thu, 14 May 2009 09:53:22 -0400 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com> <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu> Message-ID: <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com> Hi Assiatu, Please keep your replies on the mailing list; if you had
you'd probably have had an answer before now. First when you type "which make" on the command line, what do you get? If nothing then make and the developer tools are not properly installed. If you get something then the cpan shell may not be properly configured to use make. That can happen if you tried to use cpan before you installed make. To fix that, when you're in the cpan shell, type "o conf init" and go through the config again. Scott On Thursday, May 14, 2009, Barrie, Assiatu B wrote: > Hello, > > Thank you for the reply, I'm sorry but I am very new to this and not very familiar with the program language. How do i install make, I already downloaded X code tools for my Mac OS X ver 10.5. I would really appreciate the help. > > Assiatu Barrie > Research Assistant > University of Texas at Arlington > Biology Department Box 19498 > Arlington, TX 76019-0498 > abarrie at uta.edu > ________________________________________ > From: cain.cshl at gmail.com [cain.cshl at gmail.com] On Behalf Of Scott Cain [scott at scottcain.net] > Sent: Wednesday, May 13, 2009 9:24 PM > To: Barrie, Assiatu B > Cc: bioperl-l at bioperl.org > Subject: Re: [Bioperl-l] Errors with Bioperl installation > > Hi Assiatu, > > Did you install the developers tools for Mac OS X? You probably need > to have make installed. > > Scott > > > On Wed, May 13, 2009 at 6:07 PM, Barrie, Assiatu B wrote: >> Hello, >> >> I am a graduate student attempting to install Bioperl on a unix (Mac OS ) platform without any success. I keep running into the problem with the make file (here is the error message: Running make test >> Can't test without successful make >> Running make install >> make had returned bad status, install seems impossible). I always end up with this message no matter what I tried. Please help, I would really appreciate the help, I need Bioperl to run another program.
Thank you >> >> >> >> Assiatu Barrie >> Research Assistant >> University of Texas at Arlington >> Biology Department Box 19498 >> Arlington, TX 76019-0498 >> abarrie at uta.edu >> Lab# 817-272-0523 >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > > > -- > ------------------------------------------------------------------------ > Scott Cain, Ph. D. scott at scottcain dot net > GMOD Coordinator (http://gmod.org/) 216-392-3087 > Ontario Institute for Cancer Research > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From scott at scottcain.net Thu May 14 09:55:42 2009 From: scott at scottcain.net (Scott Cain) Date: Thu, 14 May 2009 09:55:42 -0400 Subject: [Bioperl-l] fasta2gff_landmark ... In-Reply-To: <2c8757af0905140116g7927ba2dlbcdc71b0f78857@mail.gmail.com> References: <2c8757af0905131001j37c8d29ev2a2b95fbfbe46b9f@mail.gmail.com> <536f21b00905131017t375774c7r20bfb5e261fb7692@mail.gmail.com> <2c8757af0905140116g7927ba2dlbcdc71b0f78857@mail.gmail.com> Message-ID: <4536f7700905140655n59cc3515rbc10de5863505832@mail.gmail.com> Hi Dan Could the uninit warnings be a data problem? Like something as simple as having a space between the '>' and the id? Scott On Thursday, May 14, 2009, Dan Bolser wrote: > 2009/5/13 Scott Cain : >> Hi Dan, >> >> I've already done this; take a look at: >> >> http://gmod.cvs.sourceforge.net/viewvc/*checkout*/gmod/schema/chado/bin/gmod_fasta2gff3.pl >> >> While it is part of the GMOD/Chado distribution, the only modules it >> uses is Bio::DB::Fasta and GetOpt::Long, and it has some nice >> convenience functions for modifying the resulting GFF.
> > Thanks Scott, > > That is more or less exactly what I need (and a vote for 'wait for the > code audit' I guess ;-). > > > I'm still seeing a couple of errors: > > Use of uninitialized value in pattern match (m//) at > ~/perl5/lib/perl5/Bio/DB/Fasta.pm line 1168. > Use of uninitialized value in exists at > ~/perl5/lib/perl5/Bio/DB/Fasta.pm line 617. > > > But the resulting GFF looks fine. > > Thanks again, > Dan. > > >> Scott >> > -- ------------------------------------------------------------------------ Scott Cain, Ph. D. scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From lincoln.stein at gmail.com Thu May 14 09:59:03 2009 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Thu, 14 May 2009 09:59:03 -0400 Subject: [Bioperl-l] Bio::DB::SeqFeature::Store::berkeleydb In-Reply-To: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> Message-ID: <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> Hi Jason, Could you send me the script that elicits the problem? I am unsure of the create/write logic because -- unfortunately -- the berkeleydb implementation has this dual use of supporting conventional loading as well as autoindexing. I have to redo the autoindexing in any case so that it doesn't need to reload every file whenever one file changes. Lincoln On Wed, May 13, 2009 at 7:57 PM, Jason Stajich wrote: > Lincoln - > > Looks like there have been a change to the berkeleydb.pm Store module > which force the 'create' bit to 0 when doing reindex_gfffiles . This means > an empty DB cannot be created.
> > So I get these errors: > -------------------- EXCEPTION -------------------- > MSG: Couldn't tie: ../gff/indexes/features.bdb No such file or directory > STACK Bio::DB::SeqFeature::Store::berkeleydb::_open_databases > /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:576 > STACK Bio::DB::SeqFeature::Store::berkeleydb::reindex_gfffiles > /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:374 > STACK Bio::DB::SeqFeature::Store::berkeleydb::auto_reindex > /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:314 > STACK Bio::DB::SeqFeature::Store::berkeleydb::init > /usr/local/lib/perl5/Bio/DB/SeqFeature/Store/berkeleydb.pm:293 > STACK Bio::DB::SeqFeature::Store::new > /usr/local/lib/perl5/Bio/DB/SeqFeature/Store.pm:360 > STACK toplevel scripts/haplotype_block_association.pl:12 > > > This stems from Line 374 of berkeleydb.pm in the redindex_gfffiles ? > > 371 warn "Reindexing GFF files...\n" if $self->verbose; > 372 $self->_permissions(1,1); > 373 $self->_close_databases(); > 374 $self->_open_databases(1,0); > > *Could* be changed to this: > 374 $self->_open_databases(1,1); > > > But I'm not sure do you want the CREATE bit always set to on when > reindexing? Maybe yes since the DB was typically erased beforehand whenever > a reindex (or initial index) is applied? > > Thanks, > -jason > > -- > Jason Stajich > jason at bioperl.org > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > -- Lincoln D. Stein Director, Informatics and Biocomputing Platform Ontario Institute for Cancer Research 101 College St., Suite 800 Toronto, ON, Canada M5G0A3 416 673-8514 Assistant: Renata Musa From maj at fortinbras.us Thu May 14 10:07:45 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Thu, 14 May 2009 10:07:45 -0400 Subject: [Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?) 
In-Reply-To: <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org> References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com><88072CDBECA446D0A7C48E74237F59FE@NewLife> <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org> Message-ID: I'll patch in the fix with a cluck that these are undefined, just so's it doesn't slip through Bioperl's fingers (if no objections in the next 10min). ----- Original Message ----- From: "Jason Stajich" To: "Mark A. Jensen" ; "Dan Bolser" Cc: "Chris Fields" ; "BioPerl List" Sent: Wednesday, May 13, 2009 2:34 AM Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?) > Not looking at the code so I can't be sure, but I think since there is no > sequence in the blasttable format and I think %cons or %id is calculated from > some of the numbers that are taken from underlying sequence data in the > alignment. I don't know why that would trip up tile_hsps necessarily but > maybe that's not where the errors are coming from. > > I've generally solved these conversions with simpler perl scripts rather than > searchio since I usually want speed for this kind of stuff -- so I convert > all the data into blast m9 format (FASTA,SSEARCH,WUBLAST, BLAST) and then > have a simple parser that converts the m9 (without searchio) into the > appropriate GFF with or without additional bells and whistles. > > Dan --- > I am slowly cleaning up and migrating more of the pipelines that we use to > generate polished GFF for loading into Gbrowse. You should just set a goal > of making good GFF3 and many of the visualization and query issues go away. > > For genomes I tend to have a separate scaffold file which is just the 1 line > per chromosome/scaffold/contig listing. > Then a file for each analysis ie. 
Protein to genome BLAST, EST to genome, > remapped Pfam domains to genome coordinates, gene predictions, etc. Keep it > simple at first you know what you've updated when something stops working the > way you expect. > > -jason > On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote: > >> I dislike my patch, because it doesn't get to the bottom of why >> data members associated with numbers of conserved sites return >> from eval's undefined; it seems clear from the code that this is >> unexpected. Please excuse my naivete--why would this happen >> only in the blasttable format, and why hasn't this thing clucked before? >> ----- Original Message ----- >> From: Jason Stajich >> To: Mark A. Jensen >> Cc: Chris Fields ; BioPerl List ; Dan Bolser >> Sent: Tuesday, May 12, 2009 7:07 PM >> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >> 'features'fromSearchIO?) >> >> >> I really don't think tile_hsps should be used on BLAST data folks, it is a >> pretty blind approach. >> If you really want the right answer you need to do -links with WU- BLAST or >> FASTA. >> Been discussed a few times on the mailing list. >> >> >> Good to fix the code bug I guess to avoid the warnings, but unless you are >> going to walk through all the HSPs and extract the consistent paths wrt >> query I think you'll have loops, etc in there which will make >> Hit->percent_id non-accurate. >> >> >> -jason >> >> >> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote: >> >> >> Patch below (to SearchUtils.pm) fixes the non-numeric warnings on Dan's >> data, but something deeper may be going on. 
>> Also get many of the following warnings, haven't looked at it closely: >> >> --------------------- WARNING --------------------- >> MSG: Removing score value(s) >> --------------------------------------------------- >> >> PATCH: >> >> Index: SearchUtils.pm >> =================================================================== >> --- SearchUtils.pm (revision 15674) >> +++ SearchUtils.pm (working copy) >> @@ -252,8 +252,8 @@ >> } >> >> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- >> >{'stop'} - $_->{'start'} + 1; >> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >> >{'iden'}; >> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >> >{'cons'}; >> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >> >{'iden'} || 0; >> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >> >{'cons'} || 0; >> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; >> } >> >> @@ -407,9 +407,12 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> + # make sure numerical >> + $_->{'iden'} ||= 0; >> + $_->{'cons'} ||= 0; >> $_->{'start'} = $start; # Assign a new start coordinate to >> the contig >> - $_->{'iden'} += $numID; # and add new data to >> #identical, #conserved. >> - $_->{'cons'} += $numCons; >> + $_->{'iden'} += ($numID||0); # and add new data to >> #identical, #conserved. >> + $_->{'cons'} += ($numCons||0); >> push(@{$_->{hsps}}, $hsp); >> $overlap = 1; >> } >> @@ -424,9 +427,13 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> + # make sure numerical >> + $_->{'iden'} ||= 0; >> + $_->{'cons'} ||= 0; >> + >> $_->{'stop'} = $stop; # Assign a new stop coordinate to >> the contig >> - $_->{'iden'} += $numID; # and add new data to >> #identical, #conserved. >> - $_->{'cons'} += $numCons; >> + $_->{'iden'} += ($numID||0); # and add new data to >> #identical, #conserved. 
>> + $_->{'cons'} += ($numCons||0); >> push(@{$_->{hsps}}, $hsp); >> $overlap = 1; >> } >> @@ -461,8 +468,8 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> - $ids += $these_ids; >> - $cons += $these_cons; >> + $ids += ($these_ids||0); >> + $cons += ($these_cons||0); >> } >> >> last if $hsp_start == $u_start; >> @@ -490,8 +497,8 @@ >> }; >> if($@) { warn "\a\n$@\n"; } >> else { >> - $ids += $these_ids; >> - $cons += $these_cons; >> + $ids += ($these_ids||0); >> + $cons += ($these_cons||0); >> } >> >> last if $hsp_end == $u_stop; >> >> ----- Original Message ----- From: "Chris Fields" > > >> To: "Mark A. Jensen" >> Cc: "BioPerl List" ; "Dan Bolser" >> > > >> Sent: Tuesday, May 12, 2009 10:04 AM >> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >> 'features'fromSearchIO?) >> >> >> >> More complicated than that, I'm afraid. We should try to fix that at >> the source of the problem. >> >> >> >> This appears to stem from SearchUtils HSP tiling, which in turn >> utilizes HSPI::matches(), which in turn checks num_identical/ num_conserved. >> My guess is, since this is blasttable format, one of these isn't set and >> thus is returning the wrong value. I'll attempt to track it down today, >> but it may take some time. >> >> >> >> chris >> >> >> >> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote: >> >> >> >> This sounds like a >> >> >> >> $sum = eval join( '+', @a); >> >> >> >> problem, which can be fixed with >> >> >> >> $sum = eval join('+', map { $_ || () } @a) ; >> >> >> >> MAJ >> >> ----- Original Message ----- From: "Dan Bolser" > > >> >> To: "Chris Fields" >> >> Cc: "BioPerl List" >> >> Sent: Tuesday, May 12, 2009 9:17 AM >> >> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting 'features' >> fromSearchIO?) >> >> >> >> >> >> 2009/5/12 Chris Fields : >> >> Fixed that in svn. We're all still learning the ropes... >> >> >> >> In that case, I'm seeing multiple instances of... 
>> >> >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 256 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 412 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 429 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 465 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 473 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 494 >> >> Argument "" isn't numeric in addition (+) at Bio/Search/ >> SearchUtils.pm line 502 >> >> >> >> >> >> Hmm... I was about to go on to complain about the weird GFF that I >> was >> >> seeing, but suddenly it looks OK. My bioperl install must think your >> >> standing over my shoulder and is therefore behaving itself! >> >> >> >> >> >> Thanks again for all the help, >> >> Dan. >> >> >> >> >> >> >> >> >> >> chris >> >> >> >> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >> >> >> >> 2009/5/12 Dan Bolser : >> >> >> >> Unfortunately bp_search2gff.pl is giving me errors: >> >> >> >> bp_search2gff.pl --version 3 -i BlastResults/ >> blast_table_filtered -f >> >> blasttable -o BlastResults/blast_table_filtered.gff -t hit >> >> --match --target --component >> >> >> >> --------------------- WARNING --------------------- >> >> MSG: Removing score value(s) >> >> --------------------------------------------------- >> >> Can't locate object method "remove_tags" via package >> >> "Bio::SeqFeature::Similarity" at >> >> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ >> Generic.pm line >> >> 393, line 5. >> >> >> >> >> >> I'm just learning the ropes... 
>> >> >> >> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ 2009-05-11 >> >> 15:25:55.000000000 +0100 >> >> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >> >> 11:52:41.000000000 +0100 >> >> @@ -390,7 +390,7 @@ >> >> } >> >> if ($self->has_tag('score')) { >> >> $self->warn("Removing score value(s)"); >> >> - $self->remove_tags('score'); >> >> + $self->remove_tag('score'); >> >> } >> >> $self->add_tag_value('score',$value); >> >> } >> >> >> >> >> >> >> >> >> >> >> >> Anyone seen this before? >> >> >> >> Cheers, >> >> Dan. >> >> >> >> >> >> >> >> 2009/5/12 Dan Bolser : >> >> >> >> Thanks for the info guys, I think I was naively hoping that >> the >> >> feature would know how to cast itself as a 'SeqFeature' >> (GFF). >> >> >> >> I think I understand the problem better now, so I'll try to >> summarise: >> >> >> >> There is no standard way to encode a HSP as a feature (not >> least >> >> because there are two choices about which sequence (query or >> the hit) >> >> it should be attached to). BioPerl will try, but the result >> will not >> >> be "well structured" SeqFeatures or "well formed" GFF. >> >> >> >> >> >> From what I read I guess it should be possible to standardize >> this >> >> mapping (based on something in one of the examples or the >> 'search2gff' >> >> script), assuming you specify weather you want features put >> on the >> >> query or on the hit. >> >> >> >> At some point last year I was trying out the bp_search2gff.pl >> and my >> >> own code to write a GFF file for loading and viewing by >> Gbrowse. At >> >> that time I gave up, as nothing seemed to be working. I was >> hoping >> >> that doing this at a lower level (i.e. never writing any GFF >> myself) >> >> it would stand a better chance of working. >> >> >> >> Also I was thinking that Gbrowse, if given a >> SeqFeature::Store, could >> >> autoconfigure its interface to some degree. 
I guess its back >> to the >> >> docs ;-) >> >> >> >> >> >> >> >> I'll keep trying and see if I can get anywhere. >> >> >> >> Thanks again, >> >> Dan. >> >> >> >> >> >> >> >> References for the above: >> >> >> >> 2009/5/11 Jason Stajich : >> >> >> >> otherwise you need to be converting the HSPs into >> seqfeatures with the >> >> right associated information (i.e. the tag/value pairs that >> are in the 9th >> >> column) in order to have well structured data in the >> database. >> >> >> >> You can get the individual features from the feature pair >> with >> >> $hsp->query or $hsp->hit which can also be passed to a GFF >> writer (or call >> >> $hsp->hit->gff_string). Note that since the data storage is >> not structured >> >> in a GFF3 like-way this won't immediately produce well >> formed GFF3 for the >> >> 9th column. >> >> >> >> >> >> 2009/5/11 Chris Fields : >> >> >> >> The main problem is the mapping is subjective based on what >> your >> >> reference sequence is within the BLAST run (e.g. whether it >> is the query or >> >> the hit), and is something that can't be automatically >> discerned. 
I ended >> >> up rolling my own with SeqFeature::Store (just mapped the >> relevant data to >> >> Bio::DB::SeqFeatures), but I have long wanted to fix up the >> relevant scripts >> >> to integrate my changes in, just haven't had the time >> >> >> >> >> >> >> >> _______________________________________________ >> >> Bioperl-l mailing list >> >> Bioperl-l at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> >> >> >> >> >> _______________________________________________ >> >> Bioperl-l mailing list >> >> Bioperl-l at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> >> >> >> >> >> _______________________________________________ >> >> Bioperl-l mailing list >> >> Bioperl-l at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> >> >> Jason Stajich >> jason at bioperl.org >> >> >> >> >> >> >> > > Jason Stajich > jason at bioperl.org > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at illinois.edu Thu May 14 10:14:54 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 09:14:54 -0500 Subject: [Bioperl-l] SearchIO to GFF (was: Getting'features'fromSearchIO?) 
In-Reply-To: References: <2c8757af0905120210i607dfb90sad0d426e6e6b4a4e@mail.gmail.com><2c8757af0905120311y3075c96cs12b0bfb1ad9c0d52@mail.gmail.com><2c8757af0905120355i576ef21o3f1ed8774c00d01c@mail.gmail.com><9ED2DE76-C331-41D2-B303-7BD97B5CAF10@illinois.edu><2c8757af0905120617v37aa7b8udf94ca558ed3415@mail.gmail.com><88072CDBECA446D0A7C48E74237F59FE@NewLife> <97AD1493-F228-4B6D-9FCF-571951923C10@bioperl.org> Message-ID: <1FEF35E8-D982-4AFB-A958-35BDB08FCA31@illinois.edu> Go for it (as long as it passes regression tests). chris On May 14, 2009, at 9:07 AM, Mark A. Jensen wrote: > I'll patch in the fix with a cluck that these are undefined, just > so's it doesn't slip > through Bioperl's fingers (if no objections in the next 10min). > ----- Original Message ----- From: "Jason Stajich" > To: "Mark A. Jensen" ; "Dan Bolser" > > Cc: "Chris Fields" ; "BioPerl List" > > Sent: Wednesday, May 13, 2009 2:34 AM > Subject: Re: [Bioperl-l] SearchIO to GFF (was: > Getting'features'fromSearchIO?) > > >> Not looking at the code so I can't be sure, but I think since there >> is no sequence in the blasttable format and I think %cons or %id >> is calculated from some of the numbers that are taken from >> underlying sequence data in the alignment. I don't know why that >> would trip up tile_hsps necessarily but maybe that's not where the >> errors are coming from. >> >> I've generally solved these conversions with simpler perl scripts >> rather than searchio since I usually want speed for this kind of >> stuff -- so I convert all the data into blast m9 format >> (FASTA,SSEARCH,WUBLAST, BLAST) and then have a simple parser that >> converts the m9 (without searchio) into the appropriate GFF with >> or without additional bells and whistles. >> >> Dan --- >> I am slowly cleaning up and migrating more of the pipelines that >> we use to generate polished GFF for loading into Gbrowse. You >> should just set a goal of making good GFF3 and many of the >> visualization and query issues go away. 
>> >> For genomes I tend to have a separate scaffold file which is just >> the 1 line per chromosome/scaffold/contig listing. >> Then a file for each analysis ie. Protein to genome BLAST, EST to >> genome, remapped Pfam domains to genome coordinates, gene >> predictions, etc. Keep it simple at first you know what you've >> updated when something stops working the way you expect. >> >> -jason >> On May 12, 2009, at 7:00 PM, Mark A. Jensen wrote: >> >>> I dislike my patch, because it doesn't get to the bottom of why >>> data members associated with numbers of conserved sites return >>> from eval's undefined; it seems clear from the code that this is >>> unexpected. Please excuse my naivete--why would this happen >>> only in the blasttable format, and why hasn't this thing clucked >>> before? >>> ----- Original Message ----- >>> From: Jason Stajich >>> To: Mark A. Jensen >>> Cc: Chris Fields ; BioPerl List ; Dan Bolser >>> Sent: Tuesday, May 12, 2009 7:07 PM >>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >>> 'features'fromSearchIO?) >>> >>> >>> I really don't think tile_hsps should be used on BLAST data >>> folks, it is a pretty blind approach. >>> If you really want the right answer you need to do -links with WU- >>> BLAST or FASTA. >>> Been discussed a few times on the mailing list. >>> >>> >>> Good to fix the code bug I guess to avoid the warnings, but >>> unless you are going to walk through all the HSPs and extract >>> the consistent paths wrt query I think you'll have loops, etc in >>> there which will make Hit->percent_id non-accurate. >>> >>> >>> -jason >>> >>> >>> On May 12, 2009, at 1:06 PM, Mark A. Jensen wrote: >>> >>> >>> Patch below (to SearchUtils.pm) fixes the non-numeric warnings >>> on Dan's data, but something deeper may be going on. 
>>> Also get many of the following warnings, haven't looked at it >>> closely: >>> >>> --------------------- WARNING --------------------- >>> MSG: Removing score value(s) >>> --------------------------------------------------- >>> >>> PATCH: >>> >>> Index: SearchUtils.pm >>> >>> =================================================================== >>> --- SearchUtils.pm (revision 15674) >>> +++ SearchUtils.pm (working copy) >>> @@ -252,8 +252,8 @@ >>> } >>> >>> $qctg_dat{ "$frame$strand" }->{'length_aln_query'} += $_- >>> >{'stop'} - $_->{'start'} + 1; >>> - $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >>> >{'iden'}; >>> - $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >>> >{'cons'}; >>> + $qctg_dat{ "$frame$strand" }->{'totalIdentical'} += $_- >>> >{'iden'} || 0; >>> + $qctg_dat{ "$frame$strand" }->{'totalConserved'} += $_- >>> >{'cons'} || 0; >>> $qctg_dat{ "$frame$strand" }->{'qstrand'} = $strand; >>> } >>> >>> @@ -407,9 +407,12 @@ >>> }; >>> if($@) { warn "\a\n$@\n"; } >>> else { >>> + # make sure numerical >>> + $_->{'iden'} ||= 0; >>> + $_->{'cons'} ||= 0; >>> $_->{'start'} = $start; # Assign a new start >>> coordinate to the contig >>> - $_->{'iden'} += $numID; # and add new data to >>> #identical, #conserved. >>> - $_->{'cons'} += $numCons; >>> + $_->{'iden'} += ($numID||0); # and add new >>> data to #identical, #conserved. >>> + $_->{'cons'} += ($numCons||0); >>> push(@{$_->{hsps}}, $hsp); >>> $overlap = 1; >>> } >>> @@ -424,9 +427,13 @@ >>> }; >>> if($@) { warn "\a\n$@\n"; } >>> else { >>> + # make sure numerical >>> + $_->{'iden'} ||= 0; >>> + $_->{'cons'} ||= 0; >>> + >>> $_->{'stop'} = $stop; # Assign a new stop >>> coordinate to the contig >>> - $_->{'iden'} += $numID; # and add new data to >>> #identical, #conserved. >>> - $_->{'cons'} += $numCons; >>> + $_->{'iden'} += ($numID||0); # and add new >>> data to #identical, #conserved. 
>>> + $_->{'cons'} += ($numCons||0); >>> push(@{$_->{hsps}}, $hsp); >>> $overlap = 1; >>> } >>> @@ -461,8 +468,8 @@ >>> }; >>> if($@) { warn "\a\n$@\n"; } >>> else { >>> - $ids += $these_ids; >>> - $cons += $these_cons; >>> + $ids += ($these_ids||0); >>> + $cons += ($these_cons||0); >>> } >>> >>> last if $hsp_start == $u_start; >>> @@ -490,8 +497,8 @@ >>> }; >>> if($@) { warn "\a\n$@\n"; } >>> else { >>> - $ids += $these_ids; >>> - $cons += $these_cons; >>> + $ids += ($these_ids||0); >>> + $cons += ($these_cons||0); >>> } >>> >>> last if $hsp_end == $u_stop; >>> >>> ----- Original Message ----- From: "Chris Fields" >> > >>> To: "Mark A. Jensen" >>> Cc: "BioPerl List" ; "Dan Bolser" >> > >>> Sent: Tuesday, May 12, 2009 10:04 AM >>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >>> 'features'fromSearchIO?) >>> >>> >>> >>> More complicated than that, I'm afraid. We should try to fix >>> that at the source of the problem. >>> >>> >>> >>> This appears to stem from SearchUtils HSP tiling, which in >>> turn utilizes HSPI::matches(), which in turn checks >>> num_identical/ num_conserved. My guess is, since this is >>> blasttable format, one of these isn't set and thus is returning >>> the wrong value. I'll attempt to track it down today, but it >>> may take some time. >>> >>> >>> >>> chris >>> >>> >>> >>> On May 12, 2009, at 8:29 AM, Mark A. Jensen wrote: >>> >>> >>> >>> This sounds like a >>> >>> >>> >>> $sum = eval join( '+', @a); >>> >>> >>> >>> problem, which can be fixed with >>> >>> >>> >>> $sum = eval join('+', map { $_ || () } @a) ; >>> >>> >>> >>> MAJ >>> >>> ----- Original Message ----- From: "Dan Bolser" >> > >>> >>> To: "Chris Fields" >>> >>> Cc: "BioPerl List" >>> >>> Sent: Tuesday, May 12, 2009 9:17 AM >>> >>> Subject: Re: [Bioperl-l] SearchIO to GFF (was: Getting >>> 'features' fromSearchIO?) >>> >>> >>> >>> >>> >>> 2009/5/12 Chris Fields : >>> >>> Fixed that in svn. We're all still learning the ropes... 
>>> >>> >>> >>> In that case, I'm seeing multiple instances of... >>> >>> >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 256 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 412 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 429 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 465 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 473 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 494 >>> >>> Argument "" isn't numeric in addition (+) at Bio/Search/ >>> SearchUtils.pm line 502 >>> >>> >>> >>> >>> >>> Hmm... I was about to go on to complain about the weird GFF >>> that I was >>> >>> seeing, but suddenly it looks OK. My bioperl install must >>> think your >>> >>> standing over my shoulder and is therefore behaving itself! >>> >>> >>> >>> >>> >>> Thanks again for all the help, >>> >>> Dan. >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> chris >>> >>> >>> >>> On May 12, 2009, at 5:55 AM, Dan Bolser wrote: >>> >>> >>> >>> 2009/5/12 Dan Bolser : >>> >>> >>> >>> Unfortunately bp_search2gff.pl is giving me errors: >>> >>> >>> >>> bp_search2gff.pl --version 3 -i BlastResults/ >>> blast_table_filtered -f >>> >>> blasttable -o BlastResults/blast_table_filtered.gff - >>> t hit >>> >>> --match --target --component >>> >>> >>> >>> --------------------- WARNING --------------------- >>> >>> MSG: Removing score value(s) >>> >>> --------------------------------------------------- >>> >>> Can't locate object method "remove_tags" via package >>> >>> "Bio::SeqFeature::Similarity" at >>> >>> /local/Scratch/dbolser/perl5/lib/perl5/Bio/SeqFeature/ >>> Generic.pm line >>> >>> 393, line 5. >>> >>> >>> >>> >>> >>> I'm just learning the ropes... 
>>> >>> >>> >>> --- ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm~ >>> 2009-05-11 >>> >>> 15:25:55.000000000 +0100 >>> >>> +++ ~/perl5/lib/perl5/Bio/SeqFeature/Generic.pm 2009-05-12 >>> >>> 11:52:41.000000000 +0100 >>> >>> @@ -390,7 +390,7 @@ >>> >>> } >>> >>> if ($self->has_tag('score')) { >>> >>> $self->warn("Removing score value(s)"); >>> >>> - $self->remove_tags('score'); >>> >>> + $self->remove_tag('score'); >>> >>> } >>> >>> $self->add_tag_value('score',$value); >>> >>> } >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> Anyone seen this before? >>> >>> >>> >>> Cheers, >>> >>> Dan. >>> >>> >>> >>> >>> >>> >>> >>> 2009/5/12 Dan Bolser : >>> >>> >>> >>> Thanks for the info guys, I think I was naively >>> hoping that the >>> >>> feature would know how to cast itself as a >>> 'SeqFeature' (GFF). >>> >>> >>> >>> I think I understand the problem better now, so >>> I'll try to summarise: >>> >>> >>> >>> There is no standard way to encode a HSP as a >>> feature (not least >>> >>> because there are two choices about which sequence >>> (query or the hit) >>> >>> it should be attached to). BioPerl will try, but >>> the result will not >>> >>> be "well structured" SeqFeatures or "well formed" GFF. >>> >>> >>> >>> >>> >>> From what I read I guess it should be possible to >>> standardize this >>> >>> mapping (based on something in one of the examples >>> or the 'search2gff' >>> >>> script), assuming you specify weather you want >>> features put on the >>> >>> query or on the hit. >>> >>> >>> >>> At some point last year I was trying out the >>> bp_search2gff.pl and my >>> >>> own code to write a GFF file for loading and >>> viewing by Gbrowse. At >>> >>> that time I gave up, as nothing seemed to be >>> working. I was hoping >>> >>> that doing this at a lower level (i.e. never >>> writing any GFF myself) >>> >>> it would stand a better chance of working. 
>>> >>> >>> >>> Also I was thinking that Gbrowse, if given a >>> SeqFeature::Store, could >>> >>> autoconfigure its interface to some degree. I guess >>> its back to the >>> >>> docs ;-) >>> >>> >>> >>> >>> >>> >>> >>> I'll keep trying and see if I can get anywhere. >>> >>> >>> >>> Thanks again, >>> >>> Dan. >>> >>> >>> >>> >>> >>> >>> >>> References for the above: >>> >>> >>> >>> 2009/5/11 Jason Stajich : >>> >>> >>> >>> otherwise you need to be converting the HSPs into >>> seqfeatures with the >>> >>> right associated information (i.e. the tag/value >>> pairs that are in the 9th >>> >>> column) in order to have well structured data in >>> the database. >>> >>> >>> >>> You can get the individual features from the >>> feature pair with >>> >>> $hsp->query or $hsp->hit which can also be passed >>> to a GFF writer (or call >>> >>> $hsp->hit->gff_string). Note that since the data >>> storage is not structured >>> >>> in a GFF3 like-way this won't immediately produce >>> well formed GFF3 for the >>> >>> 9th column. >>> >>> >>> >>> >>> >>> 2009/5/11 Chris Fields : >>> >>> >>> >>> The main problem is the mapping is subjective >>> based on what your >>> >>> reference sequence is within the BLAST run (e.g. >>> whether it is the query or >>> >>> the hit), and is something that can't be >>> automatically discerned. 
I ended >>> >>> up rolling my own with SeqFeature::Store (just >>> mapped the relevant data to >>> >>> Bio::DB::SeqFeatures), but I have long wanted to >>> fix up the relevant scripts >>> >>> to integrate my changes in, just haven't had the >>> time >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> >>> Bioperl-l mailing list >>> >>> Bioperl-l at lists.open-bio.org >>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> >>> Bioperl-l mailing list >>> >>> Bioperl-l at lists.open-bio.org >>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> >>> Bioperl-l mailing list >>> >>> Bioperl-l at lists.open-bio.org >>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> >>> >>> Jason Stajich >>> jason at bioperl.org >>> >>> >>> >>> >>> >>> >>> >> >> Jason Stajich >> jason at bioperl.org >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 14 10:16:46 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 09:16:46 -0500 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> Message-ID: 
<4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> Albert, Just to note, I have been using bioperl 1.6.0 with the ensembl API w/o problems, and Sendu Bala added an ensembl 'wrapper' to bioperl-run. Do we know precisely what breaks btwn 1.2.3 and 1.6 (and thus leads everyone to believe 1.2.3 is absolutely required)? The previous answers have been pretty nebulous and unspecific. I would have to go on record as being opposed to this. If there is a true compatibility issue, I would much rather spend the energy and tuits driving towards ensembl compatibility with the current bioperl version than backporting to 1.2.3. What about having users popping in with bug reports on list (here or ensembl) about bioperl versions 5+ years out-of-date? Furthermore, it's a slippery slope; the next thing will be requests to backport specific bug fixes in the current branch to 1.2.3. Who's willing to maintain that branch? We have few devs as it is, so is someone on the ensembl end willing to take that up? Perl 5 development has been held up with the same issues, something they have recently just started digging themselves out of. Regardless, I think way too many changes have occurred in that particular code that make such endeavors unrealistic, unfeasible, and unmaintainable. chris On May 14, 2009, at 8:45 AM, Albert Vilella wrote: > Hi all, > > In Ensembl, we are interested in providing NeXML dumps for our > Comparative > Genomics data. Because our pipeline is > written in Perl, I guess most of the work done here will be of great > use to > us. > > If I could only ask for only a feature, that would be to *try* and > backport > the NeXML support to bioperl-1.2.3 --- stress on the *try*. Bioperl > 1.2.3 is > the release that Ensembl decided to stick to many years ago, so it's > cleaner > for people to use our Perl API with only one version of bioperl as a > dependency. > > Looking forward to hearing from this SoC. Have you got a blog? > > Cheers, > > Albert. 
> > > On Mon, May 11, 2009 at 5:24 PM, Jason Stajich > wrote: > >> Welcome Chase. >> >> Look forward to the project and helping where needed. >> >> -jason >> >> >> On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote: >> >> Hello all, >>> With great pleasure, I want to introduce Chase Miller, my Google >>> Summer of >>> Code student from George Washington University, to the community. >>> Chase will >>> be working with me and Rutger Vos on a BioPerl wrapper for Rutger's >>> Bio::Phylo package, with a particular emphasis on creating a >>> BioPerl-native >>> way to import and export the NeXML (http://nexml.org) phylogenetic >>> data >>> format. He wrote a great proposal, available here: >>> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit >>> . >>> We will be working throughout the summer on the project, and will of >>> course come to you for sage advice. I know you will welcome him >>> warmly, as >>> you did me. >>> Cheers, >>> Mark >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> Jason Stajich >> jason at bioperl.org >> >> >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From bix at sendu.me.uk Thu May 14 11:36:12 2009 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 14 May 2009 16:36:12 +0100 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com> 
<96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu> <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com>
Message-ID: <4A0C3A6C.2010001@sendu.me.uk>

Scott Cain wrote:
> Hi Assiatu,
>
> Please keep your replies on the mailing list; if you had you'd probably
> have had an answer before now.
>
> First when you type "which make" on the command line, what do you get?
> If nothing then make and the developer tools are not properly
> installed. If you get something then the cpan shell may not be
> properly configured to use make. That can happen if you tried to use
> cpan before you installed make. To fix that, when you're in the cpan
> shell, type "o conf init" and go through the config again.

If using the Module::Build install method of recent BioPerl versions, you
don't need 'make' (or the dev tools for Mac OS X). See the install
instructions that Chris linked.

From scott at scottcain.net Thu May 14 11:46:24 2009
From: scott at scottcain.net (Scott Cain)
Date: Thu, 14 May 2009 11:46:24 -0400
Subject: [Bioperl-l] Errors with Bioperl installation
In-Reply-To: <4A0C3A6C.2010001@sendu.me.uk>
References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu>
 <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com>
 <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu>
 <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com>
 <4A0C3A6C.2010001@sendu.me.uk>
Message-ID: <4536f7700905140846p38415766we2835e4fab9cd9ba@mail.gmail.com>

Just out of curiosity, since I haven't tried this in a long time, can
all the prereqs be installed without make as well? If so, that is
excellent.

Scott

On Thu, May 14, 2009 at 11:36 AM, Sendu Bala wrote:
> Scott Cain wrote:
>>
>> Hi Assiatu,
>>
>> Please keep your replies on the mailing list; if you had you'd probably
>> have had an answer before now.
>>
>> First when you type "which make" on the command line, what do you get?
>> If nothing then make and the developer tools are not properly
>> installed.
If you get something then the cpan shell may not be
>> properly configured to use make. That can happen if you tried to use
>> cpan before you installed make. To fix that, when you're in the cpan
>> shell, type "o conf init" and go through the config again.
>
> If using the Module::Build install method of recent BioPerl versions, you
> don't need 'make' (or the dev tools for Mac OS X). See the install
> instructions that Chris linked.
>

--
------------------------------------------------------------------------
Scott Cain, Ph. D. scott at scottcain dot net
GMOD Coordinator (http://gmod.org/) 216-392-3087
Ontario Institute for Cancer Research

From koenvanderdrift at gmail.com Thu May 14 12:54:31 2009
From: koenvanderdrift at gmail.com (Koen van der Drift)
Date: Thu, 14 May 2009 12:54:31 -0400
Subject: [Bioperl-l] Errors with Bioperl installation
Message-ID: <5cba6b9f0905140954x2f42afc6sbb89152d6612d17f@mail.gmail.com>

Hi,

I'm jumping in a bit late here since I am on the mailinglist digest. I
hope I included (by copy-pasting) the correct recipients.

There is also the possibility to install bioperl on Mac OS X using fink
(http://fink.sf.net). Version 1.6.0 is available, including many of the
prerequisites which are automagically downloaded and installed as needed.
The current fink package uses Module::Build for this.

- Koen.
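A minimal sketch of the checks discussed in this thread, for anyone hitting the same installation problem. It looks for make on the PATH (roughly what "which make" reports) and tests whether Module::Build can be loaded, since the Build.PL route does not need make. The helper names find_on_path and has_module are made up for this illustration; they are not part of BioPerl or CPAN.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return the full path of the first executable named $tool found on PATH,
# or nothing if there is none (roughly what "which make" reports).
sub find_on_path {
    my ($tool) = @_;
    for my $dir ( File::Spec->path ) {
        my $candidate = File::Spec->catfile( $dir, $tool );
        return $candidate if -f $candidate && -x $candidate;
    }
    return;
}

# Return 1 if the named module can be loaded by this perl, 0 otherwise.
sub has_module {
    my ($module) = @_;
    ( my $file = "$module.pm" ) =~ s{::}{/}g;
    return eval { require $file; 1 } ? 1 : 0;
}

my $make      = find_on_path('make');
my $has_build = has_module('Module::Build');

print 'make:          ', ( defined $make ? $make : 'not found' ), "\n";
print 'Module::Build: ', ( $has_build ? 'installed' : 'missing' ), "\n";

if ( !defined $make && $has_build ) {
    # Standard Module::Build steps, run from the unpacked distribution:
    #   perl Build.PL
    #   ./Build test
    #   ./Build install
    print "No make found, but the Build.PL route should still work.\n";
}
```

If make turns up only after the cpan shell was first configured, rerunning "o conf init" inside the cpan shell, as suggested earlier in the thread, lets it pick up the new path.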
From bix at sendu.me.uk Thu May 14 12:51:59 2009 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 14 May 2009 17:51:59 +0100 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <4536f7700905140846p38415766we2835e4fab9cd9ba@mail.gmail.com> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com> <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu> <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com> <4A0C3A6C.2010001@sendu.me.uk> <4536f7700905140846p38415766we2835e4fab9cd9ba@mail.gmail.com> Message-ID: <4A0C4C2F.4030907@sendu.me.uk> Scott Cain wrote: > Just out of curiosity, since I haven't tried this in a long time, can > all the prereqs be installed without make as well? If so, that is > excellent. Oh, I've no idea. They'd all need to use Module::Build as well. It's within the realms of possibility? But anyway, BioPerl would still get installed even if some deps failed to install due to lack of make. From chmille4 at gmail.com Thu May 14 13:15:26 2009 From: chmille4 at gmail.com (Chase Miller) Date: Thu, 14 May 2009 13:15:26 -0400 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> Message-ID: <991fb8210905141015w297d2ca9lde6916b5db330ed@mail.gmail.com> Hi all, Thanks for the warm welcome. I'm really looking forward to working with everyone. Albert, I don't have a blog yet. Currently, you can check the project page for any updates ( https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit ). 
Chase On Thu, May 14, 2009 at 10:16 AM, Chris Fields wrote: > Albert, > > Just to note, I have been using bioperl 1.6.0 with the ensembl API w/o > problems, and Sendu Bala added an ensembl 'wrapper' to bioperl-run. Do we > know precisely what breaks btwn 1.2.3 and 1.6 (and thus leads everyone to > believe 1.2.3 is absolutely required)? The previous answers have been > pretty nebulous and unspecific. > > I would have to go on record as being opposed to this. If there is a true > compatibility issue, I would much rather spend the energy and tuits driving > towards ensembl compatibility with the current bioperl version than > backporting to 1.2.3. What about having users popping in with bug reports > on list (here or ensembl) about bioperl versions 5+ years out-of-date? > Furthermore, it's a slippery slope; the next thing will be requests to > backport specific bug fixes in the current branch to 1.2.3. > > Who's willing to maintain that branch? We have few devs as it is, so is > someone on the ensembl end willing to take that up? > > Perl 5 development has been held up with the same issues, something they > have recently just started digging themselves out of. Regardless, I think > way too many changes have occurred in that particular code that make such > endeavors unrealistic, unfeasible, and unmaintainable. > > chris > > > On May 14, 2009, at 8:45 AM, Albert Vilella wrote: > > Hi all, >> >> In Ensembl, we are interested in providing NeXML dumps for our Comparative >> Genomics data. Because our pipeline is >> written in Perl, I guess most of the work done here will be of great use >> to >> us. >> >> If I could only ask for only a feature, that would be to *try* and >> backport >> the NeXML support to bioperl-1.2.3 --- stress on the *try*. Bioperl 1.2.3 >> is >> the release that Ensembl decided to stick to many years ago, so it's >> cleaner >> for people to use our Perl API with only one version of bioperl as a >> dependency. 
>> >> Looking forward to hearing from this SoC. Have you got a blog? >> >> Cheers, >> >> Albert. >> >> >> On Mon, May 11, 2009 at 5:24 PM, Jason Stajich wrote: >> >> Welcome Chase. >>> >>> Look forward to the project and helping where needed. >>> >>> -jason >>> >>> >>> On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote: >>> >>> Hello all, >>> >>>> With great pleasure, I want to introduce Chase Miller, my Google Summer >>>> of >>>> Code student from George Washington University, to the community. Chase >>>> will >>>> be working with me and Rutger Vos on a BioPerl wrapper for Rutger's >>>> Bio::Phylo package, with a particular emphasis on creating a >>>> BioPerl-native >>>> way to import and export the NeXML (http://nexml.org) phylogenetic data >>>> format. He wrote a great proposal, available here: >>>> >>>> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit >>>> . >>>> We will be working throughout the summer on the project, and will of >>>> course come to you for sage advice. I know you will welcome him warmly, >>>> as >>>> you did me. 
>>>> Cheers, >>>> Mark >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> >>> Jason Stajich >>> jason at bioperl.org >>> >>> >>> >>> >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >>> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > From deep.vik at gmail.com Thu May 14 14:06:02 2009 From: deep.vik at gmail.com (vikas.deep) Date: Thu, 14 May 2009 11:06:02 -0700 (PDT) Subject: [Bioperl-l] problem with bioperl code- How to create a hash Message-ID: <23545942.post@talk.nabble.com> Some one Kindly help me I am a novice programmer I am trying to create a hash using the code as follows #! /usr/bin/perl -w use Bio::DB::GenBank; use strict; use Bio::Seq; use Bio::SeqIO; use Bio::DB::SwissProt; {my @key; my @value; my @arr; open (my $FH, 'parse21.txt') or die "Cannot open the file parse21.txt: $!"; open (my $FHH, '>parse211.txt') or die "Cannot open the file parse212.txt: $!"; @key = <$FH>; chomp (@key); my $arref = \@key; my $gb = Bio::DB::SwissProt->new; my $seqio = $gb -> get_Stream_by_id($arref); while (my $clone = $seqio->next_seq ) { print $FHH $clone ->seq,"\n"; } close $FHH; open (my $FHHH, 'parse211.txt') or die "Cannot open the file parse211.txt: $!"; @value = <$FHHH>; chomp (@value); @h{@key} = @value; print values %h; In the output print keys %h does not prints anything while print values %h prints what is expected and print %h prints what is expected only for the last key and for the rest it is just printing the values!. Also if @key is not chomped then keys ARE PRINTED!! 
And $h{$key} always says #[root at localhost Documents]# perl bp.pl #Use of uninitialized value in print at bp.pl line 46, <$FHHH> line 4. At the very basic level the problem is to create a hash using a key file which has been opened using $FH and a value file which has been opened twice once for writing with a file handle $FHH and then a second time for reading using $FHHH. I do not know why this is not working. Please Help:-(( -- View this message in context: http://www.nabble.com/problem-with-bioperl-code--How-to-create-a-hash-tp23545942p23545942.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From ABARRIE at uta.edu Thu May 14 13:55:27 2009 From: ABARRIE at uta.edu (Barrie, Assiatu B) Date: Thu, 14 May 2009 12:55:27 -0500 Subject: [Bioperl-l] Error with bioperl Message-ID: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E2@MAVMAIL1.uta.edu> Hello, I downloaded the developer's tools, reconfig make, but I am still running into an error when I go to install bioperl (here is message). I will really appreciate the help, I really need Bioperl installed on my Mac. # running Build.PL /usr/bin/perl Build.PL BioPerl minimal core version 1.006000 is required for BioPerl-run Couldn't run Build.PL: at /Library/Perl/5.8.8/Module/Build/Compat.pm line 261. Running make test Make had some problems, maybe interrupted? Won't test Running make install Make had some problems, maybe interrupted? 
Won't install Assiatu Barrie Research Assistant University of Texas at Arlington Biology Department Box 19498 Arlington, TX 76019-0498 abarrie at uta.edu From shalabh.sharma7 at gmail.com Thu May 14 14:46:45 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Thu, 14 May 2009 14:46:45 -0400 Subject: [Bioperl-l] Error with bioperl In-Reply-To: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E2@MAVMAIL1.uta.edu> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E2@MAVMAIL1.uta.edu> Message-ID: <9fcc48c70905141146sf167369o42bccdf544c01bd1@mail.gmail.com> Hi Assiatu, It seems by your error that you are installing bioperl-run not bioperl-live which is a core module. -Shalabh On Thu, May 14, 2009 at 1:55 PM, Barrie, Assiatu B wrote: > Hello, > > I downloaded the developer's tools, reconfig make, but I am still running > into an error when I go to install bioperl (here is message). I will really > appreciate the help, I really need Bioperl installed on my Mac. > # running Build.PL > /usr/bin/perl Build.PL > BioPerl minimal core version 1.006000 is required for BioPerl-run > Couldn't run Build.PL: at /Library/Perl/5.8.8/Module/Build/Compat.pm line > 261. > Running make test > Make had some problems, maybe interrupted? Won't test > Running make install > Make had some problems, maybe interrupted? 
Won't install
>
>
>
> Assiatu Barrie
> Research Assistant
> University of Texas at Arlington
> Biology Department Box 19498
> Arlington, TX 76019-0498
> abarrie at uta.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From scott at scottcain.net Thu May 14 14:47:30 2009
From: scott at scottcain.net (Scott Cain)
Date: Thu, 14 May 2009 14:47:30 -0400
Subject: [Bioperl-l] Error with bioperl
In-Reply-To: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E2@MAVMAIL1.uta.edu>
References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E2@MAVMAIL1.uta.edu>
Message-ID: <536f21b00905141147k7134f9ck5547ae81535d960a@mail.gmail.com>

Hi Assiatu,

You need to install BioPerl before you can install BioPerl-run.

Scott

On Thu, May 14, 2009 at 1:55 PM, Barrie, Assiatu B wrote:
> Hello,
>
> I downloaded the developer's tools, reconfig make, but I am still running into an error when I go to install bioperl (here is message). I will really appreciate the help, I really need Bioperl installed on my Mac.
> # running Build.PL
> /usr/bin/perl Build.PL
> BioPerl minimal core version 1.006000 is required for BioPerl-run
> Couldn't run Build.PL: at /Library/Perl/5.8.8/Module/Build/Compat.pm line 261.
> Running make test
>   Make had some problems, maybe interrupted? Won't test
> Running make install
>   Make had some problems, maybe interrupted? Won't install
>
>
>
> Assiatu Barrie
> Research Assistant
> University of Texas at Arlington
> Biology Department Box 19498
> Arlington, TX 76019-0498
> abarrie at uta.edu
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

--
------------------------------------------------------------------------
Scott Cain, Ph. D.
scott at scottcain dot net GMOD Coordinator (http://gmod.org/) 216-392-3087 Ontario Institute for Cancer Research From cjfields at illinois.edu Thu May 14 15:29:58 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 14:29:58 -0500 Subject: [Bioperl-l] Errors with Bioperl installation In-Reply-To: <4A0C4C2F.4030907@sendu.me.uk> References: <96093972E63BDB408B0700EB1C1FF74D6CBA6D58DF@MAVMAIL1.uta.edu> <536f21b00905131924s38c197b4if5a94915fb8b0c31@mail.gmail.com> <96093972E63BDB408B0700EB1C1FF74D6CBA6D58E1@MAVMAIL1.uta.edu> <4536f7700905140653y5367ee52w8c61e4fcf8152b4c@mail.gmail.com> <4A0C3A6C.2010001@sendu.me.uk> <4536f7700905140846p38415766we2835e4fab9cd9ba@mail.gmail.com> <4A0C4C2F.4030907@sendu.me.uk> Message-ID: <0EC80532-11BC-4971-994D-69865AF9A199@illinois.edu> On May 14, 2009, at 11:51 AM, Sendu Bala wrote: > Scott Cain wrote: >> Just out of curiosity, since I haven't tried this in a long time, can >> all the prereqs be installed without make as well? If so, that is >> excellent. > > Oh, I've no idea. They'd all need to use Module::Build as well. It's > within the realms of possibility? > > But anyway, BioPerl would still get installed even if some deps > failed to install due to lack of make. I think the makefile is just a passthrough, generated with distribution building ('./Build dist'): http://cpansearch.perl.org/src/CJFIELDS/BioPerl-1.6.0/Makefile.PL I don't think there is a flag to run automatic installation of all prereqs, though it could be added in. 
chris From cjfields at illinois.edu Thu May 14 15:33:18 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 14:33:18 -0500 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: <991fb8210905141015w297d2ca9lde6916b5db330ed@mail.gmail.com> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <991fb8210905141015w297d2ca9lde6916b5db330ed@mail.gmail.com> Message-ID: <9741778A-B401-41AF-8430-8A83DD04F4E5@illinois.edu> Welcome to the BioPerl community Chase! Let us know if you need any help. chris (less cranky now I've had my coffee) On May 14, 2009, at 12:15 PM, Chase Miller wrote: > Hi all, > Thanks for the warm welcome. I'm really looking forward to working > with > everyone. > > Albert, I don't have a blog yet. Currently, you can check the > project page > for any updates ( > https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit > ). > > Chase > > On Thu, May 14, 2009 at 10:16 AM, Chris Fields > wrote: > >> Albert, >> >> Just to note, I have been using bioperl 1.6.0 with the ensembl API >> w/o >> problems, and Sendu Bala added an ensembl 'wrapper' to bioperl- >> run. Do we >> know precisely what breaks btwn 1.2.3 and 1.6 (and thus leads >> everyone to >> believe 1.2.3 is absolutely required)? The previous answers have >> been >> pretty nebulous and unspecific. >> >> I would have to go on record as being opposed to this. If there is >> a true >> compatibility issue, I would much rather spend the energy and tuits >> driving >> towards ensembl compatibility with the current bioperl version than >> backporting to 1.2.3. What about having users popping in with bug >> reports >> on list (here or ensembl) about bioperl versions 5+ years out-of- >> date? 
>> Furthermore, it's a slippery slope; the next thing will be requests >> to >> backport specific bug fixes in the current branch to 1.2.3. >> >> Who's willing to maintain that branch? We have few devs as it is, >> so is >> someone on the ensembl end willing to take that up? >> >> Perl 5 development has been held up with the same issues, something >> they >> have recently just started digging themselves out of. Regardless, >> I think >> way too many changes have occurred in that particular code that >> make such >> endeavors unrealistic, unfeasible, and unmaintainable. >> >> chris >> >> >> On May 14, 2009, at 8:45 AM, Albert Vilella wrote: >> >> Hi all, >>> >>> In Ensembl, we are interested in providing NeXML dumps for our >>> Comparative >>> Genomics data. Because our pipeline is >>> written in Perl, I guess most of the work done here will be of >>> great use >>> to >>> us. >>> >>> If I could only ask for only a feature, that would be to *try* and >>> backport >>> the NeXML support to bioperl-1.2.3 --- stress on the *try*. >>> Bioperl 1.2.3 >>> is >>> the release that Ensembl decided to stick to many years ago, so it's >>> cleaner >>> for people to use our Perl API with only one version of bioperl as a >>> dependency. >>> >>> Looking forward to hearing from this SoC. Have you got a blog? >>> >>> Cheers, >>> >>> Albert. >>> >>> >>> On Mon, May 11, 2009 at 5:24 PM, Jason Stajich >>> wrote: >>> >>> Welcome Chase. >>>> >>>> Look forward to the project and helping where needed. >>>> >>>> -jason >>>> >>>> >>>> On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote: >>>> >>>> Hello all, >>>> >>>>> With great pleasure, I want to introduce Chase Miller, my Google >>>>> Summer >>>>> of >>>>> Code student from George Washington University, to the >>>>> community. 
Chase >>>>> will >>>>> be working with me and Rutger Vos on a BioPerl wrapper for >>>>> Rutger's >>>>> Bio::Phylo package, with a particular emphasis on creating a >>>>> BioPerl-native >>>>> way to import and export the NeXML (http://nexml.org) >>>>> phylogenetic data >>>>> format. He wrote a great proposal, available here: >>>>> >>>>> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit >>>>> . >>>>> We will be working throughout the summer on the project, and >>>>> will of >>>>> course come to you for sage advice. I know you will welcome him >>>>> warmly, >>>>> as >>>>> you did me. >>>>> Cheers, >>>>> Mark >>>>> _______________________________________________ >>>>> Bioperl-l mailing list >>>>> Bioperl-l at lists.open-bio.org >>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>>> >>>>> >>>> Jason Stajich >>>> jason at bioperl.org >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> Bioperl-l mailing list >>>> Bioperl-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>>> >>>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Thu May 14 15:53:44 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Thu, 14 May 2009 15:53:44 -0400 Subject: [Bioperl-l] Parsing needle/water output Message-ID: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> Hi All, Is there any parser/module available to parse needle/water output report (from emboss) to get the start and end position of alignment. 
Thanks Shalabh From jason at bioperl.org Thu May 14 15:56:01 2009 From: jason at bioperl.org (Jason Stajich) Date: Thu, 14 May 2009 12:56:01 -0700 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> Message-ID: <749E6053-A77E-4174-A5AC-25D90FE782EC@bioperl.org> Bio::AlignIO::emboss does parse or you can request it in msf format and parse with Bio::AlignIO::msf -jason On May 14, 2009, at 12:53 PM, shalabh sharma wrote: > Hi All, > Is there any parser/module available to parse needle/water > output > report (from emboss) to get the start and end position of alignment. > > Thanks > Shalabh > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From SMarkel at accelrys.com Thu May 14 15:58:32 2009 From: SMarkel at accelrys.com (Scott Markel) Date: Thu, 14 May 2009 15:58:32 -0400 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> Shalabh, Have you looked at Bio::AlignIO::emboss? Scott Scott Markel, Ph.D. 
Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Co-chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of shalabh sharma > Sent: Thursday, 14 May 2009 12:54 PM > To: bioperl-l > Subject: [Bioperl-l] Parsing needle/water output > > Hi All, > Is there any parser/module available to parse needle/water output > report (from emboss) to get the start and end position of alignment. > > Thanks > Shalabh > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Thu May 14 16:13:33 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Thu, 14 May 2009 16:13:33 -0400 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> Message-ID: <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> yes, i tried to read the documentation about Bio::AlignIO::emboss, but there is not much in it.So is like i can call the same functions which are used in searchIO. I need start and end position of the pairwise alignment. Thanks Shalabh On Thu, May 14, 2009 at 3:58 PM, Scott Markel wrote: > Shalabh, > > Have you looked at Bio::AlignIO::emboss? > > Scott > > Scott Markel, Ph.D. 
> Principal Bioinformatics Architect email: smarkel at accelrys.com > Accelrys (SciTegic R&D) mobile: +1 858 205 3653 > 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 > San Diego, CA 92121 fax: +1 858 799 5222 > USA web: http://www.accelrys.com > > http://www.linkedin.com/in/smarkel > Vice President, Board of Directors: > International Society for Computational Biology > Co-chair: ISCB Publications Committee > Associate Editor: PLoS Computational Biology > Editorial Board: Briefings in Bioinformatics > > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of shalabh sharma > > Sent: Thursday, 14 May 2009 12:54 PM > > To: bioperl-l > > Subject: [Bioperl-l] Parsing needle/water output > > > > Hi All, > > Is there any parser/module available to parse needle/water > output > > report (from emboss) to get the start and end position of alignment. > > > > Thanks > > Shalabh > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From assiatu.barrie at gmail.com Thu May 14 17:09:50 2009 From: assiatu.barrie at gmail.com (Assiatu Barrie) Date: Thu, 14 May 2009 16:09:50 -0500 Subject: [Bioperl-l] Bioperl help Message-ID: <84cc2c200905141409y64028d33s7fa39a99aabf9463@mail.gmail.com> Hello again, So I think I installed bioperl successfully, but when I try running a test script i get a blank output file so i tried running a script called fastaConcat.pl on 2 fasta seq files but the output file was blank "perl fastaConcat.pl seq1.txt seq2.txt > out.txt" would appreciate the help From carson.holt at utah.edu Thu May 14 16:33:04 2009 From: carson.holt at utah.edu (Carson Hinton Holt) Date: Thu, 14 May 2009 14:33:04 -0600 Subject: [Bioperl-l] GFF3 to GenBank converter Message-ID: Hello, I'm trying to see if anyone has built a GFF3 to GenBank format converter? 
I've found multiple examples of GenBank to GFF3 converters, but not the other way around. Thanks, Carson From cjfields at illinois.edu Thu May 14 17:14:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 14 May 2009 16:14:49 -0500 Subject: [Bioperl-l] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai Message-ID: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> All, I am proud to introduce Xin 'David' Shuai, my student for the Google Summer of Code 2009, to the Open Bioinformatics community. David's project centers on developing SWIG-based bindings to libsequence (a population genetics library) for the BioLib project: http://biolib.open-bio.org/wiki/Main_Page Besides myself, David will be co-mentored by Mark Jensen and Pjotr Prins. As the BioLib project centers on creating common, maintainable SWIG- based bindings to popular bioinformatics libraries for the various Bio* toolkits, we will likely need input from the various Open Bio communities at various stages in the project. At this time, David's initial plans are to develop and test libsequence bindings for Perl and Python. David's proposal and project plan are available here: http://biolib.open-bio.org/wiki/User:David Congratulations David, and welcome to the Open-Bio community! Sincerely, Christopher Fields University of Illinois Urbana-Champaign Institute for Genomic Biology Urbana, IL 61801 From SMarkel at accelrys.com Thu May 14 17:26:29 2009 From: SMarkel at accelrys.com (Scott Markel) Date: Thu, 14 May 2009 17:26:29 -0400 Subject: [Bioperl-l] Bioperl help In-Reply-To: <84cc2c200905141409y64028d33s7fa39a99aabf9463@mail.gmail.com> References: <84cc2c200905141409y64028d33s7fa39a99aabf9463@mail.gmail.com> Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A37752B@exch1-hi.accelrys.net> Is the Perl script the same as the one found at http://www.faculty.uaf.edu/ffnt/teaching/programming/perl-scripts/fastaConcat.pl? That's the only one Google finds. What do seq1.txt and seq2.txt contain? 
Scott Scott Markel, Ph.D. Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Co-chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Assiatu Barrie > Sent: Thursday, 14 May 2009 2:10 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] Bioperl help > > Hello again, > > So I think I installed bioperl successfully, but when I try running a test > script i get a blank output file > > so i tried running a script called fastaConcat.pl on 2 fasta seq files but > the output file was blank > > "perl fastaConcat.pl seq1.txt seq2.txt > out.txt" > > would appreciate the help > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From Kevin.M.Brown at asu.edu Thu May 14 17:41:53 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Thu, 14 May 2009 14:41:53 -0700 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com><1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> Message-ID: <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> http://bioperl.org/cgi-bin/deob_interface.cgi?Search=Search&module=Bio%3 A%3AAlignIO%3A%3Aemboss&sort_order=by+method&search_string=Bio%3A%3AAlig nio%3A%3Aemboss > -----Original Message----- > 
From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > shalabh sharma > Sent: Thursday, May 14, 2009 1:14 PM > To: Scott Markel > Cc: bioperl-l > Subject: Re: [Bioperl-l] Parsing needle/water output > > yes, i tried to read the documentation about > Bio::AlignIO::emboss, but there > is not much in it.So is like i can call the same functions > which are used in > searchIO. I need start and end position of the pairwise alignment. > > Thanks > Shalabh > > > On Thu, May 14, 2009 at 3:58 PM, Scott Markel > wrote: > > > Shalabh, > > > > Have you looked at Bio::AlignIO::emboss? > > > > Scott > > > > Scott Markel, Ph.D. > > Principal Bioinformatics Architect email: smarkel at accelrys.com > > Accelrys (SciTegic R&D) mobile: +1 858 205 3653 > > 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 > > San Diego, CA 92121 fax: +1 858 799 5222 > > USA web: http://www.accelrys.com > > > > http://www.linkedin.com/in/smarkel > > Vice President, Board of Directors: > > International Society for Computational Biology > > Co-chair: ISCB Publications Committee > > Associate Editor: PLoS Computational Biology > > Editorial Board: Briefings in Bioinformatics > > > > > > > -----Original Message----- > > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > > bounces at lists.open-bio.org] On Behalf Of shalabh sharma > > > Sent: Thursday, 14 May 2009 12:54 PM > > > To: bioperl-l > > > Subject: [Bioperl-l] Parsing needle/water output > > > > > > Hi All, > > > Is there any parser/module available to parse > needle/water > > output > > > report (from emboss) to get the start and end position of > alignment. 
> > > > > > Thanks > > > Shalabh > > > _______________________________________________ > > > Bioperl-l mailing list > > > Bioperl-l at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From SMarkel at accelrys.com Thu May 14 18:01:51 2009 From: SMarkel at accelrys.com (Scott Markel) Date: Thu, 14 May 2009 18:01:51 -0400 Subject: [Bioperl-l] Bioperl help In-Reply-To: <84cc2c200905141439w67ab5b1dm26186625f72fb4f0@mail.gmail.com> References: <84cc2c200905141409y64028d33s7fa39a99aabf9463@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A37752B@exch1-hi.accelrys.net> <84cc2c200905141439w67ab5b1dm26186625f72fb4f0@mail.gmail.com> Message-ID: <1F1240778FB0AF46B4E5A72C44D2C7472A37757B@exch1-hi.accelrys.net> Please keep replies on the list. You get no warning or error messages? Just an empty output file? I downloaded the scripts, used your input file contents, and got the following result. 59 > perl fastaConcat.pl seq1.txt seq2.txt >seq1 AAAAAATT >seq2 TTTGG >seq3 GGGGCC Scott From: Assiatu Barrie [mailto:assiatu.barrie at gmail.com] Sent: Thursday, 14 May 2009 2:40 PM To: Scott Markel Subject: Re: [Bioperl-l] Bioperl help This is the same script, it is the only I could find, and seq 1 and two are the same on the website seq1file >seq1 AAAAAA >seq2 TTT >seq3 GGGG seq2file contains >seq1 TT >seq2 GG >seq3 CC On Thu, May 14, 2009 at 4:26 PM, Scott Markel > wrote: Is the Perl script the same as the one found at http://www.faculty.uaf.edu/ffnt/teaching/programming/perl-scripts/fastaConcat.pl? That's the only one Google finds. What do seq1.txt and seq2.txt contain? Scott Scott Markel, Ph.D. 
Principal Bioinformatics Architect email: smarkel at accelrys.com Accelrys (SciTegic R&D) mobile: +1 858 205 3653 10188 Telesis Court, Suite 100 voice: +1 858 799 5603 San Diego, CA 92121 fax: +1 858 799 5222 USA web: http://www.accelrys.com http://www.linkedin.com/in/smarkel Vice President, Board of Directors: International Society for Computational Biology Co-chair: ISCB Publications Committee Associate Editor: PLoS Computational Biology Editorial Board: Briefings in Bioinformatics > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Assiatu Barrie > Sent: Thursday, 14 May 2009 2:10 PM > To: bioperl-l at bioperl.org > Subject: [Bioperl-l] Bioperl help > > Hello again, > > So I think I installed bioperl successfully, but when I try running a test > script i get a blank output file > > so i tried running a script called fastaConcat.pl on 2 fasta seq files but > the output file was blank > > "perl fastaConcat.pl seq1.txt seq2.txt > out.txt" > > would appreciate the help > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at gmx.net Thu May 14 18:16:30 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 14 May 2009 18:16:30 -0400 Subject: [Bioperl-l] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai In-Reply-To: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> References: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> Message-ID: <4482F182-B0B0-4C71-B8DE-B5B7A4EC4D81@gmx.net> Welcome David, good luck with your project, and I hope (actually, am quite certain) that you'll enjoy your summer with us. -hilmar On May 14, 2009, at 5:14 PM, Chris Fields wrote: > All, > > I am proud to introduce Xin 'David' Shuai, my student for the Google > Summer of Code 2009, to the Open Bioinformatics community. 
David's > project centers on developing SWIG-based bindings to libsequence (a > population genetics library) for the BioLib project: > > http://biolib.open-bio.org/wiki/Main_Page > > Besides myself, David will be co-mentored by Mark Jensen and Pjotr > Prins. > > As the BioLib project centers on creating common, maintainable SWIG- > based bindings to popular bioinformatics libraries for the various > Bio* toolkits, we will likely need input from the various Open Bio > communities at various stages in the project. At this time, David's > initial plans are to develop and test libsequence bindings for Perl > and Python. > > David's proposal and project plan are available here: > > http://biolib.open-bio.org/wiki/User:David > > Congratulations David, and welcome to the Open-Bio community! > > Sincerely, > > Christopher Fields > University of Illinois Urbana-Champaign > Institute for Genomic Biology > Urbana, IL 61801 > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From lincoln.stein at gmail.com Thu May 14 23:01:07 2009 From: lincoln.stein at gmail.com (Lincoln Stein) Date: Thu, 14 May 2009 23:01:07 -0400 Subject: [Bioperl-l] populating gbrowse with genomic data rapidly In-Reply-To: References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> Message-ID: <6dce9a0b0905142001v307a703dk105fc7737affd409@mail.gmail.com> Hi Liam, You'll find a bunch of genomes in gff3 format here: http://www.gbrowse.org/wiki/index.php/Main_Page I think you can simply load them up into gbrowse databases. Unfortunately you won't be able to find ready-to-go configuration files for gbrowse2. 
If they have gbrowse configuration files at all, they will be for gbrowse version 1, which is similar, but not identical. So I'm afraid you'll have to inspect the gff3 files and figure out what tracks you want to show, and then write the track config sections. If you like, I can send you some ready-made gbrowse2 config/gff3 sets for worm and fly genes. Add this to the yeast example, and you'll have three genomes to show. Lincoln On Thu, May 14, 2009 at 10:48 PM, Liam Elbourne wrote: > Hi Lincoln (and all), > > This is really a gbrowse specific, and not a particularly bioperly > question, but I'm not on a gbrowse list, and I figured other bioperl people > were likeliest to know how to help. > > I've (to all appearances) completely successfully installed gbrowse(2.0), > with some minor glitches mainly caused by typos in the instructions, which I > will pass on in due course. The demo data looks great. > > I've been asked (spelt begged, ordered, requested, commanded) if at all > possible to get about 6/7 genomes available for browsing by Sunday (USA > time) for a meeting. I've skimmed the tutorial (which looks excellent, thank > you Lincoln!) and started working through it, but wondered if somewhere > there was a cheat sheet or "Dummies guide to stuffing gbrowse full of Genome > Data" that would allow me to get these genomes up by then. I know there is a > script for converting genbank data to gff, which will get me part of the way > there, as most or all of the genomes have annotation in genbank format, so > from my attempts to date (yesterday afternoon) I would say that what I need > is: > > * appropriately setup .conf files > * and instructions on how the data needs to formatted (ie what has to go > into the gff files) named and located (presumably all in the "databases" > directory), in order to 'match' the .conf files. > > Absolutely any assistance would be appreciated, including "it's completely > impossible, give up now!" 
or I guess potentially "it's all in the > instructions", which I'm sure it is... I apologise in advance if there is > already a short guide available on the GMOD wiki or elsewhere that I have > missed, and will happily thank whoever will point me towards it! > > > Regards, > Liam. > > > > > > > ______________________________ > > Dr Liam Elbourne > Research Fellow (Bioinformatics) > Paulsen Laboratory > Macquarie University > Sydney > Australia. > > > -- Lincoln D. Stein Director, Informatics and Biocomputing Platform Ontario Institute for Cancer Research 101 College St., Suite 800 Toronto, ON, Canada M5G0A3 416 673-8514 Assistant: Renata Musa From lelbourn at cbms.mq.edu.au Thu May 14 22:48:44 2009 From: lelbourn at cbms.mq.edu.au (Liam Elbourne) Date: Fri, 15 May 2009 12:48:44 +1000 Subject: [Bioperl-l] populating gbrowse with genomic data rapidly In-Reply-To: <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> Message-ID: Hi Lincoln (and all), This is really a gbrowse specific, and not a particularly bioperly question, but I'm not on a gbrowse list, and I figured other bioperl people were likeliest to know how to help. I've (to all appearances) completely successfully installed gbrowse(2.0), with some minor glitches mainly caused by typos in the instructions, which I will pass on in due course. The demo data looks great. I've been asked (spelt begged, ordered, requested, commanded) if at all possible to get about 6/7 genomes available for browsing by Sunday (USA time) for a meeting. I've skimmed the tutorial (which looks excellent, thank you Lincoln!) and started working through it, but wondered if somewhere there was a cheat sheet or "Dummies guide to stuffing gbrowse full of Genome Data" that would allow me to get these genomes up by then. 
I know there is a script for converting genbank data to gff, which will get me part of the way there, as most or all of the genomes have annotation in genbank format, so from my attempts to date (yesterday afternoon) I would say that what I need is: * appropriately setup .conf files * and instructions on how the data needs to formatted (ie what has to go into the gff files) named and located (presumably all in the "databases" directory), in order to 'match' the .conf files. Absolutely any assistance would be appreciated, including "it's completely impossible, give up now!" or I guess potentially "it's all in the instructions", which I'm sure it is... I apologise in advance if there is already a short guide available on the GMOD wiki or elsewhere that I have missed, and will happily thank whoever will point me towards it! Regards, Liam. ______________________________ Dr Liam Elbourne Research Fellow (Bioinformatics) Paulsen Laboratory Macquarie University Sydney Australia. From rajgolla at indiana.edu Thu May 14 23:21:47 2009 From: rajgolla at indiana.edu (rajesh gollapudi) Date: Thu, 14 May 2009 23:21:47 -0400 Subject: [Bioperl-l] populating gbrowse with genomic data rapidly In-Reply-To: <6dce9a0b0905142001v307a703dk105fc7737affd409@mail.gmail.com> References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> <6dce9a0b0905142001v307a703dk105fc7737affd409@mail.gmail.com> Message-ID: <43fa5f7b0905142021u50fec60ey581f1cc14737a2f1@mail.gmail.com> Hello Liam, We have developed a tool called WebGBrowse which could be the solution to your problem. You can find it at http://webgbrowse.cgb.indiana.edu. WebGBrowse serves as a configuration utility for GBrowse, so that users can visualize their genomes in gff3 format. It creates a configuration file for you, based on the tracks available. In the end you could even download the configuration file and use it on your installation of GBrowse. 
Alternatively, you can use the version of GBrowse provided by this tool too. You can find a tutorial on how to use WebGBrowse at http://webgbrowse.cgb.indiana.edu/webgbrowse/tutorial.html Rajesh On Thu, May 14, 2009 at 11:01 PM, Lincoln Stein wrote: > Hi Liam, > > You'll find a bunch of genomes in gff3 format here: > > http://www.gbrowse.org/wiki/index.php/Main_Page > > I think you can simply load them up into gbrowse databases. > > Unfortunately you won't be able to find ready-to-go configuration files for > gbrowse2. If they have gbrowse configuration files at all, they will be for > gbrowse version 1, which is similar, but not identical. So I'm afraid > you'll > have to inspect the gff3 files and figure out what tracks you want to show, > and then write the track config sections. > > If you like, I can send you some ready-made gbrowse2 config/gff3 sets for > worm and fly genes. Add this to the yeast example, and you'll have three > genomes to show. > > Lincoln > > On Thu, May 14, 2009 at 10:48 PM, Liam Elbourne >wrote: > > > Hi Lincoln (and all), > > > > This is really a gbrowse specific, and not a particularly bioperly > > question, but I'm not on a gbrowse list, and I figured other bioperl > people > > were likeliest to know how to help. > > > > I've (to all appearances) completely successfully installed gbrowse(2.0), > > with some minor glitches mainly caused by typos in the instructions, > which I > > will pass on in due course. The demo data looks great. > > > > I've been asked (spelt begged, ordered, requested, commanded) if at all > > possible to get about 6/7 genomes available for browsing by Sunday (USA > > time) for a meeting. I've skimmed the tutorial (which looks excellent, > thank > > you Lincoln!) and started working through it, but wondered if somewhere > > there was a cheat sheet or "Dummies guide to stuffing gbrowse full of > Genome > > Data" that would allow me to get these genomes up by then. 
I know there > is a > > script for converting genbank data to gff, which will get me part of the > way > > there, as most or all of the genomes have annotation in genbank format, > so > > from my attempts to date (yesterday afternoon) I would say that what I > need > > is: > > > > * appropriately setup .conf files > > * and instructions on how the data needs to formatted (ie what has to go > > into the gff files) named and located (presumably all in the "databases" > > directory), in order to 'match' the .conf files. > > > > Absolutely any assistance would be appreciated, including "it's > completely > > impossible, give up now!" or I guess potentially "it's all in the > > instructions", which I'm sure it is... I apologise in advance if there > is > > already a short guide available on the GMOD wiki or elsewhere that I have > > missed, and will happily thank whoever will point me towards it! > > > > > > Regards, > > Liam. > > > > > > > > > > > > > > ______________________________ > > > > Dr Liam Elbourne > > Research Fellow (Bioinformatics) > > Paulsen Laboratory > > Macquarie University > > Sydney > > Australia. > > > > > > > > > -- > Lincoln D. Stein > Director, Informatics and Biocomputing Platform > Ontario Institute for Cancer Research > 101 College St., Suite 800 > Toronto, ON, Canada M5G0A3 > 416 673-8514 > Assistant: Renata Musa > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From lelbourn at cbms.mq.edu.au Fri May 15 01:57:09 2009 From: lelbourn at cbms.mq.edu.au (Liam Elbourne) Date: Fri, 15 May 2009 15:57:09 +1000 Subject: [Bioperl-l] Fwd: populating gbrowse with genomic data rapidly References: Message-ID: <51B2B662-3C24-4E92-9091-446164504EAC@cbms.mq.edu.au> Hi All, Quel embarrassment, I just found the highly esteemed Dr Stein's http://search.cpan.org/~lds/GBrowse-1.993/docs/pod/GENBANK_HOWTO.pod which is exactly what I asked for. 
Sorry for spamming everybody on the list! Google first, ask questions later... Regards, Liam. Begin forwarded message: > From: Liam Elbourne > Date: 15 May 2009 12:48:44 PM > To: Lincoln Stein > Cc: BioPerl List > Subject: populating gbrowse with genomic data rapidly > > Hi Lincoln (and all), > > This is really a gbrowse specific, and not a particularly bioperly > question, but I'm not on a gbrowse list, and I figured other bioperl > people were likeliest to know how to help. > > I've (to all appearances) completely successfully installed > gbrowse(2.0), with some minor glitches mainly caused by typos in the > instructions, which I will pass on in due course. The demo data > looks great. > > I've been asked (spelt begged, ordered, requested, commanded) if at > all possible to get about 6/7 genomes available for browsing by > Sunday (USA time) for a meeting. I've skimmed the tutorial (which > looks excellent, thank you Lincoln!) and started working through it, > but wondered if somewhere there was a cheat sheet or "Dummies guide > to stuffing gbrowse full of Genome Data" that would allow me to get > these genomes up by then. I know there is a script for converting > genbank data to gff, which will get me part of the way there, as > most or all of the genomes have annotation in genbank format, so > from my attempts to date (yesterday afternoon) I would say that what > I need is: > > * appropriately setup .conf files > * and instructions on how the data needs to formatted (ie what has > to go into the gff files) named and located (presumably all in the > "databases" directory), in order to 'match' the .conf files. > > Absolutely any assistance would be appreciated, including "it's > completely impossible, give up now!" or I guess potentially "it's > all in the instructions", which I'm sure it is... 
I apologise in
> advance if there is already a short guide available on the GMOD wiki
> or elsewhere that I have missed, and will happily thank whoever will
> point me towards it!
>
>
> Regards,
> Liam.
>
> ______________________________
>
> Dr Liam Elbourne
> Research Fellow (Bioinformatics)
> Paulsen Laboratory
> Macquarie University
> Sydney
> Australia.
>

______________________________

Dr Liam Elbourne
Research Fellow (Bioinformatics)
Paulsen Laboratory
Macquarie University
Sydney
Australia.

From heikki.lehvaslaiho at gmail.com Fri May 15 05:00:45 2009
From: heikki.lehvaslaiho at gmail.com (Heikki Lehvaslaiho)
Date: Fri, 15 May 2009 11:00:45 +0200
Subject: [Bioperl-l] Creating a fastq format file?
In-Reply-To:
References:
Message-ID:

My initial assumption of a linear relationship between the different
integer ranges used to represent quality values was overly simplistic.
I've spent some time trying to understand the real relationships
between quality integers and their ASCII encodings. My notes turned
into a small essay, which makes this quite long for an email. Please
bear with me and reply to the list if you think you have an insight
into these matters.

   -Heikki

I started from a series of error probabilities for nucleotide calls
[1, 2]. They are in column 1 (Prob) in the table below. The second
column (phq) gives the corresponding phred quality. The third has the
solexa qualities as defined in [1].

From these columns, I can see that phred and solexa qualities are
nearly identical (within one unit) at value 6 and above. That quality
is a lot lower than any sensible threshold: 6 means one out of five
nucleotides is wrong. A quality score of 20, which is often used as a
threshold, means that one out of 100 nucleotides is wrong. In practice,
quality values of 6 or under are automatically discarded and never
considered seriously. I find it difficult to understand why these
qualities are considered to be so different!

Both phred and solexa qualities are encoded as one-character
representations in the FASTQ text format [3, 4]. The encoding rules
are slightly different.

Phred qualities are non-negative integers. The Sanger FASTQ
specification [6] shows that the encoding goes from "!" (ASCII 33,
corresponding to phred quality 0) up to "~" (ASCII 126, phred quality
93, probability 5e-10 of the nucleotide being wrong). See column 4
(sac(ASCII)) for the Sanger encoding.

The Wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
quality score from 0 to 60 using ASCII 33 to 93." It is generally
accepted that Phred qualities are limited to 60. See column 5
(sa6(ASCII)). Or are there exceptions? Could it be that the authors of
the maq web pages [3] were confusing quality 60 with ASCII 60, and that
led to talking about quality 93 being the limit? The following perl
code snippet ($q = chr(($Q<=93? $Q : 93) + 33);) from [3] seems to
point that way.

Aside: What confuses me even more is that the bioperl test data set
contains a quality file (t/data/qualfile.qual) where values go from 0
to 90. Where do these values come from? Which program generated them?

Solexa quality is encoded with an offset of 64 and is limited to an
upper value of 40. There is no predefined lower limit, although in
practice the values used do not go lower than -5. In other words, the
encoding goes from "" (ASCII xx) to "@" (ASCII 64, solexa quality 0,
probability 0.5) up to "h" (ASCII 104, solexa quality 40, probability
1e-04) (column 6, solc(ASCII)).

It was reported [7] that high solexa quality scores are
over-optimistic and low scores underestimate the data quality. It is
therefore for the better that Solexa quality is now only of historical
interest, as the Solexa/Illumina pipeline v.1.3 ("illumina") uses
phred qualities 0 to 40. It still uses 64 as the offset. So: the
character encoding goes from "@" (ASCII 64, phred quality 0,
probability 0.8) up to "h" (ASCII 104, phred quality 40, probability
1e-04) (final column, illc(ASCII)).
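
The offsets discussed above can be sketched in a few lines of Perl.
This is only an illustration under stated assumptions, not BioPerl
code: the function names (phred_to_char, char_to_phred,
illumina13_to_sanger) are hypothetical, and the upper limits of 93
(Sanger) and 40 (Illumina 1.3) are simply the ones quoted in this
message.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Offsets and upper limits as quoted in the discussion (assumptions).
my %ENCODING = (
    sanger     => { offset => 33, max => 93 },
    illumina13 => { offset => 64, max => 40 },
);

# Encode a non-negative integer phred quality as one character.
# Hypothetical helper, not part of BioPerl.
sub phred_to_char {
    my ($q, $variant) = @_;
    my $enc = $ENCODING{$variant} or die "unknown variant '$variant'";
    die "phred quality must be >= 0, got $q" if $q < 0;
    $q = $enc->{max} if $q > $enc->{max};    # cap at the variant's limit
    return chr($q + $enc->{offset});
}

# Decode one quality character back to an integer phred quality.
sub char_to_phred {
    my ($char, $variant) = @_;
    my $enc = $ENCODING{$variant} or die "unknown variant '$variant'";
    my $q = ord($char) - $enc->{offset};
    die "character '$char' is out of range for $variant"
        if $q < 0 || $q > $enc->{max};
    return $q;
}

# Re-encode a whole Illumina 1.3 quality string as Sanger.
sub illumina13_to_sanger {
    my ($qual) = @_;
    return join '',
        map { phred_to_char(char_to_phred($_, 'illumina13'), 'sanger') }
        split //, $qual;
}

print phred_to_char(0,  'sanger'),     "\n";   # "!" (ASCII 33)
print phred_to_char(40, 'illumina13'), "\n";   # "h" (ASCII 104)
print illumina13_to_sanger('@Th'),     "\n";   # qualities 0, 20, 40
```

Keeping offset and limit together in one table means a further variant
only needs one more entry; the out-of-range check in char_to_phred is
one possible way to catch a file read with the wrong variant.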
If the facts and assumptions from above can be agreed on, we can move on to coding. :) Table: The columns are probability, phred quality, solexa quality, sanger encoding of phred qualities with corresponding ASCII values, sanger encoding of phred qualities limited to Q60 with corresponding ASCII values, solexa/illumina (1.0) encoding of solexa qualities with corresponding ASCII values, and illumina (1.3) encoding of phred quality with corresponding ASCII values.

   Prob  phq  soq  sac(ASCII)  sa6(ASCII)  solc(ASCII)  illc(ASCII)
  0.999    0  -29  !( 33)      !( 33)      "( 34)       @( 64)
    0.9    0   -9  !( 33)      !( 33)      6( 54)       @( 64)
    0.8    1   -5  !( 33)      !( 33)      9( 57)       @( 64)
    0.7    1   -3  "( 34)      "( 34)      <( 60)       A( 65)
    0.6    2   -1  #( 35)      #( 35)      >( 62)       B( 66)
    0.5    3    0  $( 36)      $( 36)      @( 64)       C( 67)
    0.4    4    1  $( 36)      $( 36)      A( 65)       C( 67)
    0.3    5    3  &( 38)      &( 38)      C( 67)       E( 69)
    0.2    7    6  '( 39)      '( 39)      F( 70)       F( 70)
    0.1   10    9  +( 43)      +( 43)      I( 73)       J( 74)
   0.01   20   20  5( 53)      5( 53)      S( 83)       T( 84)
  0.001   30   30  ?( 63)      ?( 63)      ]( 93)       ^( 94)
 0.0001   40   40  I( 73)      I( 73)      g(103)       h(104)
  1e-05   50   50  S( 83)      S( 83)      h(104)       h(104)
  1e-06   60   60  ]( 93)      ]( 93)      h(104)       h(104)
  1e-07   70   70  g(103)      ]( 93)      h(104)       h(104)
  1e-08   80   80  q(113)      ]( 93)      h(104)       h(104)
  1e-09   90   90  z(122)      ]( 93)      h(104)       h(104)
  5e-10   93   93  ~(126)      ]( 93)      h(104)       h(104)
  1e-10  100  100  ~(126)      ]( 93)      h(104)       h(104)
  1e-11  110  110  ~(126)      ]( 93)      h(104)       h(104)

Code. This perl code generates the above table.

--------------------------------------------------------------------
#!/usr/bin/env perl
use strict;
use warnings;

# error probabilities to tabulate
my @p = (0.999, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.01,
         0.001, 0.0001, 0.00001, 0.000001, 0.0000001, 0.00000001,
         0.000000001, 0.0000000005, 0.0000000001, 0.00000000001);

printf "%7s %3s %3s %s%6s %s%6s %s%6s %s%6s\n", 'Prob', 'phq', 'soq',
    'sac', '(ASCII)', 'sa6', '(ASCII)', 'solc', '(ASCII)',
    'illc', '(ASCII)';

for my $e (@p) {
    my $Q  = -10 * log($e) / log(10);             # phred quality
    my $sQ = -10 * log($e / (1 - $e)) / log(10);  # solexa quality
    my $qc = chr(($Q<=93 ? $Q : 93) + 33);        # sanger character
    my $q6 = chr(($Q<=60 ? $Q : 60) + 33);        # sanger character, limited to Q60
    my $qs = chr(($Q<=40 ? $sQ : 40) + 64);       # solexa character
    # based on phred quality
    my $qi = chr(($Q<=40 ? $Q : 40) + 64);        # illumina character

    # I've added 0.1 to some int values to counteract a strange downward
    # rounding of values by perl (printf's %d and chr() truncate toward
    # zero rather than round)
    printf "%7s %3d %3d %s(%3d) %s(%3d) %s(%3d) %s(%3d)\n", $e, $Q+0.1, $sQ+0.1,
        $qc, ord($qc), $q6, ord($q6), $qs, ord($qs), $qi, ord($qi);
}
--------------------------------------------------------------------

References:

1. http://maq.sourceforge.net/qual.shtml
2. http://en.wikipedia.org/wiki/Phred_quality_score
3. http://maq.sourceforge.net/fastq.shtml
4. http://en.wikipedia.org/wiki/FASTQ_format
5. http://en.wikipedia.org/wiki/FASTQ_format#Encoding
6. http://maq.sourceforge.net/fastq.shtml#spec
7. Dohm JC, Lottaz C, Borodina T and Himmelbauer H:
   Substantial biases in ultra-short read data sets from
   high-throughput DNA sequencing.
   Nucleic Acids Research 2008 36(16):e105; doi:10.1093/nar/gkn425
   http://nar.oxfordjournals.org/cgi/content/abstract/36/16/e105
8. Solution to Sanger/Solexa/Illumina FASTQ confusion
   http://seqanswers.com/forums/showthread.php?t=1526

2009/5/13 Chris Fields :
> Heikki,
>
> Did you still want to commit this?  I think it's a good idea and would be
> worth including in the next 1.6 point release.
>
> chris
>
> ------------------------------------------------------------
> I convinced at least myself to the degree that I wrote the
> range_convert() method - with plenty of tests. I mention this now so
> that no-one else need to start thinking through all the edge values.
> :)
>
> I'll contribute it to the code base once there is a consensus of best
> way forward.
>
>    -Heikki
>
> 2009/4/27 Heikki Lehvaslaiho :
>>> I have tried to summarise this in a central place:
>>> http://en.wikipedia.org/wiki/FASTQ_format
>>
>> Torsten,
>>
>> Thanks for putting this together. Very helpful.
>>
>> Do you have a plan of action?  Let me propose one for BioPerl. It
>> based on following assumptions:
>>
>> 1. There is multitude of different ways of coding quality values out
>> there.
>> 2.
Bio::Seq::Quality is agnostic of any quality value range rules
>> 3. The emerging open standard is the Sanger fastq specification
>> 4. Open source programs use the Sanger fastq specs
>>
>>
>> From these it follows that:
>>
>>
>> 1. BioPerl should support Sanger fastq standard
>>
>> 1.1. it already does and there are other SeqIO modules for dealing
>> with other non-fastq formats.
>>
>> 2. BioPerl should offer simple ways of converting between quality range
>> rules
>>
>> 2.1. Have a generic method accessible from Bio::Seq::Quality with
>> preset versions of the method for converting between known variants
>> (Sanger fastq and the two Illumina versions)
>>
>> For example:
>>
>> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>>  throw if $value < $from_lower or $value > $from_upper
>>  return $newvalue
>>
>> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
>> range_convert_fastq2phred(),  range_convert_phred2fastq()....
>>
>> (assuming that illumina 1.3 eq phred)
>>
>> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
>> qualities into Sanger fastq on the fly
>>
>> 2.2.1 Bio::SeqIO::Fastq::next_seq should detect the incoming stream of
>> quality value range either automatically or be given a keyword
>> parameter indicating the range.
>>
>> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
>> a quality value out of range.
>>
>> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
>> detects a quality value out of range.
>>
>> 2.2.4. It would be useful but not absolutely necessary for
>> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
>> ranges
>>
>>
>> What do you think?
>>
>>    -Heikki
>>
>> 2009/4/26 Torsten Seemann :
>>>> > This might be a good place to ask the question: having looked at the
>>>> > fastq.pm page, is the fastq format defined (only) by a "@'" followed
>>>> > by
>>>> a
>>>> > sequence line and a "+" header followed by a quality line and the two
>>>> > headers have to agree? Now that Illumina is using phred scaling, are
>>>> > 'Sanger' and 'Illumina' versions the same?
>>>>
>>>> No they aren't the same, Illumina still encodes the ascii as value + 64
>>>> and Sanger as value + 33.
>>>>
>>>
>>> Illumina have now CHANGED how they calculate the quality value however in
>>> the last month or so... Their Q range used to be -5..40 mapped to ASCII
>>> 64+,
>>> but now they produce Q >= 0 and it is unclear if they start at 69 or 64
>>> now...
>>>
>>> I have tried to summarise this in a central place:
>>>
>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> Corrections welcome!
>>>
>>>
>>> --Torsten Seemann
>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>> University, AUSTRALIA
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>>
>> --
>>    -Heikki
>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>> cell: +27 (0)714328090
>> Sent from Claremont, WC, South Africa
>>
>
>
>
> --
>    -Heikki
> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
> cell: +27 (0)714328090
> Sent from Claremont, WC, South Africa
>

--
   -Heikki
Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
cell: +27 (0)714328090
Sent from Claremont, WC, South Africa

From dan.bolser at gmail.com Fri May 15 08:53:16 2009
From: dan.bolser at gmail.com (Dan Bolser)
Date: Fri, 15 May 2009 13:53:16 +0100
Subject: [Bioperl-l] populating gbrowse with genomic data rapidly
In-Reply-To: <43fa5f7b0905142021u50fec60ey581f1cc14737a2f1@mail.gmail.com>
References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org>
    <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com>
    <6dce9a0b0905142001v307a703dk105fc7737affd409@mail.gmail.com>
    <43fa5f7b0905142021u50fec60ey581f1cc14737a2f1@mail.gmail.com>
Message-ID: <2c8757af0905150553r3451268en8affc80736fcfcb5@mail.gmail.com>

2009/5/15 rajesh gollapudi :
> Hello Liam,
> We have developed a tool called WebGBrowse which could be the solution to
> your problem. You can find it at http://webgbrowse.cgb.indiana.edu.
>
> WebGBrowse serves as a configuration utility for GBrowse, so that users can
> visualize their genomes in gff3 format. It creates a configuration file for
> you, based on the tracks available. In the end you could even download the
> configuration file and use it on your installation of GBrowse.
> Alternatively, you can use the version of GBrowse provided by this tool too.
> You can find a tutorial on how to use WebGBrowse at
> http://webgbrowse.cgb.indiana.edu/webgbrowse/tutorial.html

I really like this service, but can you confirm that it will create
configuration files compatible with GBrowse 2.0?

Dan.

> Rajesh
>
>
> On Thu, May 14, 2009 at 11:01 PM, Lincoln Stein wrote:
>
>> Hi Liam,
>>
>> You'll find a bunch of genomes in gff3 format here:
>>
>> http://www.gbrowse.org/wiki/index.php/Main_Page
>>
>> I think you can simply load them up into gbrowse databases.
>> >> Unfortunately you won't be able to find ready-to-go configuration files for >> gbrowse2. If they have gbrowse configuration files at all, they will be for >> gbrowse version 1, which is similar, but not identical. So I'm afraid >> you'll >> have to inspect the gff3 files and figure out what tracks you want to show, >> and then write the track config sections. >> >> If you like, I can send you some ready-made gbrowse2 config/gff3 sets for >> worm and fly genes. Add this to the yeast example, and you'll have three >> genomes to show. >> >> Lincoln >> >> On Thu, May 14, 2009 at 10:48 PM, Liam Elbourne > >wrote: >> >> > Hi Lincoln (and all), >> > >> > This is really a gbrowse specific, and not a particularly bioperly >> > question, but I'm not on a gbrowse list, and I figured other bioperl >> people >> > were likeliest to know how to help. >> > >> > I've (to all appearances) completely successfully installed gbrowse(2.0), >> > with some minor glitches mainly caused by typos in the instructions, >> which I >> > will pass on in due course. The demo data looks great. >> > >> > ?I've been asked (spelt begged, ordered, requested, commanded) if at all >> > possible to get about 6/7 genomes available for browsing by Sunday (USA >> > time) for a meeting. I've skimmed the tutorial (which looks excellent, >> thank >> > you Lincoln!) and started working through it, but wondered if somewhere >> > there was a cheat sheet or "Dummies guide to stuffing gbrowse full of >> Genome >> > Data" that would allow me to get these genomes up by then. 
I know there >> is a >> > script for converting genbank data to gff, which will get me part of the >> way >> > there, as most or all of the genomes have annotation in genbank format, >> so >> > from my attempts to date (yesterday afternoon) I would say that what I >> need >> > is: >> > >> > ?* appropriately setup .conf files >> > ?* and instructions on how the data needs to formatted (ie what has to go >> > into the gff files) ?named and located (presumably all in the "databases" >> > directory), in order to 'match' the .conf files. >> > >> > Absolutely any assistance would be appreciated, including "it's >> completely >> > impossible, give up now!" or I guess potentially "it's all in the >> > instructions", which I'm sure it is... ?I apologise in advance if there >> is >> > already a short guide available on the GMOD wiki or elsewhere that I have >> > missed, and will happily thank whoever will point me towards it! >> > >> > >> > Regards, >> > Liam. >> > >> > >> > >> > >> > >> > >> > ______________________________ >> > >> > Dr Liam Elbourne >> > Research Fellow (Bioinformatics) >> > Paulsen Laboratory >> > Macquarie University >> > Sydney >> > Australia. >> > >> > >> > >> >> >> -- >> Lincoln D. 
Stein >> Director, Informatics and Biocomputing Platform >> Ontario Institute for Cancer Research >> 101 College St., Suite 800 >> Toronto, ON, Canada M5G0A3 >> 416 673-8514 >> Assistant: Renata Musa >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rajgolla at indiana.edu Fri May 15 09:07:02 2009 From: rajgolla at indiana.edu (rajesh gollapudi) Date: Fri, 15 May 2009 09:07:02 -0400 Subject: [Bioperl-l] populating gbrowse with genomic data rapidly In-Reply-To: <2c8757af0905150553r3451268en8affc80736fcfcb5@mail.gmail.com> References: <776A880F-7477-41D3-8FCD-5571EE6D804D@bioperl.org> <6dce9a0b0905140659o7d54e7ebq52d04ee92d036c47@mail.gmail.com> <6dce9a0b0905142001v307a703dk105fc7737affd409@mail.gmail.com> <43fa5f7b0905142021u50fec60ey581f1cc14737a2f1@mail.gmail.com> <2c8757af0905150553r3451268en8affc80736fcfcb5@mail.gmail.com> Message-ID: <43fa5f7b0905150607i53dd42ffo77ced9105045cd48@mail.gmail.com> Hello Dan, As per this url (http://gmod.org/wiki/Migrating_from_GBrowse_1.X_to_2.X), there isn't much different between gbrowse1.X and 2.X conf files. Personally I haven't tried it, but I am confident that it will work. Rajesh On Fri, May 15, 2009 at 8:53 AM, Dan Bolser wrote: > 2009/5/15 rajesh gollapudi : > > Hello Liam, > > We have developed a tool called WebGBrowse which could be the solution to > > your problem. You can find it at http://webgbrowse.cgb.indiana.edu. > > > > WebGBrowse serves as a configuration utility for GBrowse, so that users > can > > visualize their genomes in gff3 format. It creates a configuration file > for > > you, based on the tracks available. In the end you could even download > the > > configuration file and use it on your installation of GBrowse. 
> > Alternatively, you can use the version of GBrowse provided by this tool > too. > > You can find a tutorial on how to use WebGBrowse at > > http://webgbrowse.cgb.indiana.edu/webgbrowse/tutorial.html > > I really like this service, but can you confirm that it will create > configuration files compatible with GBrowse 2.0? > > Dan. > > > > Rajesh > > > > > > On Thu, May 14, 2009 at 11:01 PM, Lincoln Stein >wrote: > > > >> Hi Liam, > >> > >> You'll find a bunch of genomes in gff3 format here: > >> > >> http://www.gbrowse.org/wiki/index.php/Main_Page > >> > >> I think you can simply load them up into gbrowse databases. > >> > >> Unfortunately you won't be able to find ready-to-go configuration files > for > >> gbrowse2. If they have gbrowse configuration files at all, they will be > for > >> gbrowse version 1, which is similar, but not identical. So I'm afraid > >> you'll > >> have to inspect the gff3 files and figure out what tracks you want to > show, > >> and then write the track config sections. > >> > >> If you like, I can send you some ready-made gbrowse2 config/gff3 sets > for > >> worm and fly genes. Add this to the yeast example, and you'll have three > >> genomes to show. > >> > >> Lincoln > >> > >> On Thu, May 14, 2009 at 10:48 PM, Liam Elbourne < > lelbourn at cbms.mq.edu.au > >> >wrote: > >> > >> > Hi Lincoln (and all), > >> > > >> > This is really a gbrowse specific, and not a particularly bioperly > >> > question, but I'm not on a gbrowse list, and I figured other bioperl > >> people > >> > were likeliest to know how to help. > >> > > >> > I've (to all appearances) completely successfully installed > gbrowse(2.0), > >> > with some minor glitches mainly caused by typos in the instructions, > >> which I > >> > will pass on in due course. The demo data looks great. 
> >> > > >> > I've been asked (spelt begged, ordered, requested, commanded) if at > all > >> > possible to get about 6/7 genomes available for browsing by Sunday > (USA > >> > time) for a meeting. I've skimmed the tutorial (which looks excellent, > >> thank > >> > you Lincoln!) and started working through it, but wondered if > somewhere > >> > there was a cheat sheet or "Dummies guide to stuffing gbrowse full of > >> Genome > >> > Data" that would allow me to get these genomes up by then. I know > there > >> is a > >> > script for converting genbank data to gff, which will get me part of > the > >> way > >> > there, as most or all of the genomes have annotation in genbank > format, > >> so > >> > from my attempts to date (yesterday afternoon) I would say that what I > >> need > >> > is: > >> > > >> > * appropriately setup .conf files > >> > * and instructions on how the data needs to formatted (ie what has to > go > >> > into the gff files) named and located (presumably all in the > "databases" > >> > directory), in order to 'match' the .conf files. > >> > > >> > Absolutely any assistance would be appreciated, including "it's > >> completely > >> > impossible, give up now!" or I guess potentially "it's all in the > >> > instructions", which I'm sure it is... I apologise in advance if > there > >> is > >> > already a short guide available on the GMOD wiki or elsewhere that I > have > >> > missed, and will happily thank whoever will point me towards it! > >> > > >> > > >> > Regards, > >> > Liam. > >> > > >> > > >> > > >> > > >> > > >> > > >> > ______________________________ > >> > > >> > Dr Liam Elbourne > >> > Research Fellow (Bioinformatics) > >> > Paulsen Laboratory > >> > Macquarie University > >> > Sydney > >> > Australia. > >> > > >> > > >> > > >> > >> > >> -- > >> Lincoln D. 
Stein > >> Director, Informatics and Biocomputing Platform > >> Ontario Institute for Cancer Research > >> 101 College St., Suite 800 > >> Toronto, ON, Canada M5G0A3 > >> 416 673-8514 > >> Assistant: Renata Musa > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > From xshuai at umail.iu.edu Thu May 14 21:52:40 2009 From: xshuai at umail.iu.edu (Xin Shuai) Date: Thu, 14 May 2009 21:52:40 -0400 Subject: [Bioperl-l] [ANNOUNCEMENT] Google Summer of Code student Xin Shuai In-Reply-To: <4482F182-B0B0-4C71-B8DE-B5B7A4EC4D81@gmx.net> References: <7A388EB0-E9B2-4579-80EA-83AC95817EF9@illinois.edu> <4482F182-B0B0-4C71-B8DE-B5B7A4EC4D81@gmx.net> Message-ID: <3a7743460905141852x69e32266oc440e7557d59d84f@mail.gmail.com> Thank you for everyone's help during my application. I will do my best to accomplish it.Since I'm a totally newcomer in Bio* language program, I will have a lot to ask during the following months and hope to get your help! On Thu, May 14, 2009 at 6:16 PM, Hilmar Lapp wrote: > Welcome David, good luck with your project, and I hope (actually, am quite > certain) that you'll enjoy your summer with us. > > -hilmar > > > On May 14, 2009, at 5:14 PM, Chris Fields wrote: > > All, >> >> I am proud to introduce Xin 'David' Shuai, my student for the Google >> Summer of Code 2009, to the Open Bioinformatics community. David's project >> centers on developing SWIG-based bindings to libsequence (a population >> genetics library) for the BioLib project: >> >> http://biolib.open-bio.org/wiki/Main_Page >> >> Besides myself, David will be co-mentored by Mark Jensen and Pjotr Prins. 
>> >> As the BioLib project centers on creating common, maintainable SWIG-based >> bindings to popular bioinformatics libraries for the various Bio* toolkits, >> we will likely need input from the various Open Bio communities at various >> stages in the project. At this time, David's initial plans are to develop >> and test libsequence bindings for Perl and Python. >> >> David's proposal and project plan are available here: >> >> http://biolib.open-bio.org/wiki/User:David >> >> Congratulations David, and welcome to the Open-Bio community! >> >> Sincerely, >> >> Christopher Fields >> University of Illinois Urbana-Champaign >> Institute for Genomic Biology >> Urbana, IL 61801 >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > -- Xin Shuai (David) PhD of Complex System in School of Informatics Indiana University Bloomington 812-606-8019 From fungazid at yahoo.com Fri May 15 09:05:28 2009 From: fungazid at yahoo.com (fungazid) Date: Fri, 15 May 2009 06:05:28 -0700 (PDT) Subject: [Bioperl-l] looks like a Bio::SeqIO error Message-ID: <23559474.post@talk.nabble.com> Hello, I hope this is the right address for bioperl programming issues. Bioperl saves me a lot of time (not to re-invent the wheel), but there are some extremely irritating problems (I would change the code myself if I knew how). I am trying to read a file (~20MB) containing multiple fasta sequences: >a AGTAGTGAGTGCGCTGA......... >b GCGCTGAAGTAGTGAGT....... >c AGTAGTGAGTGCGCTGA......... >d........... with the following lines: my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); LOOP1: while ( my $seqobj1 = $seqin->next_seq()) { ...... 
my $seq=$seqobj1->subseq(1,$seqobj1->length); ....... } This works right for the first ~30000 contig sequences but then the following message appears: Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm line 744 DESTROY() mysql_insert obj destroying HANDLE What to do ??? (this is only one of some different Bioperl related bugs that I'm experiencing) -- View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
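
One thing that may be worth trying here, offered as a guess rather
than a confirmed diagnosis: the 'largefasta' format is intended for
very large individual sequences and keeps sequence data in temporary
files, so a file with tens of thousands of small contigs may run into
a filesystem limit under /tmp. For many short sequences, the plain
'fasta' format holds each sequence in memory instead. A minimal
sketch, assuming BioPerl is installed; the in-memory sample data and
the %length_of hash are stand-ins for the real file and the real
per-sequence work.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Bio::SeqIO;

# Two tiny records standing in for the real multi-FASTA file.
my $fasta = ">a\nAGTAGTGAGTGCGCTGA\n>b\nGCGCTGAAGTAGTGAGT\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";

# 'fasta' rather than 'largefasta': each sequence is held in memory.
my $seqin = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');

my %length_of;
while (my $seqobj = $seqin->next_seq) {
    my $seq = $seqobj->seq;    # the whole sequence as a string
    $length_of{ $seqobj->display_id } = length $seq;
}

printf "%s\t%d\n", $_, $length_of{$_} for sort keys %length_of;
```

With a real file, -file => $file1 would replace the -fh argument.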

From heikki.lehvaslaiho at gmail.com Fri May 15 09:39:55 2009
From: heikki.lehvaslaiho at gmail.com (Heikki Lehvaslaiho)
Date: Fri, 15 May 2009 15:39:55 +0200
Subject: [Bioperl-l] Creating a fastq format file?
In-Reply-To:
References:
Message-ID:

Sorry for the confusingly composed post. Here is a cleaner version.

   -Heikki

I started from a series of error probabilities for nucleotide calls
[1, 2]. They are in column 1 (Prob) in the table below. The second
column (phq) gives the corresponding phred quality. The third has the
solexa qualities as defined in [1].

From these columns, I can see that phred and solexa qualities are
nearly identical (within one unit) at value 6 and above. That quality
is a lot lower than any sensible threshold: 6 means one out of five
nucleotides is wrong. A quality score of 20, which is often used as a
threshold, means that one out of 100 nucleotides is wrong. In practice,
quality values of 6 or under are automatically discarded and never
considered seriously. I find it difficult to understand why these
qualities are considered to be so different!

Both phred and solexa qualities are encoded as one-character
representations in the FASTQ text format [3, 4]. The encoding rules
are slightly different.

Phred qualities are non-negative integers. The Sanger FASTQ
specification [6] shows that the encoding goes from "!" (ASCII 33,
corresponding to phred quality 0) up to "~" (ASCII 126, phred quality
93, probability 5e-10 of the nucleotide being wrong). See column 4
(sac(ASCII)) for the Sanger encoding.

The Wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
quality score from 0 to 60 using ASCII 33 to 93." It is generally
accepted that Phred qualities are limited to 60. See column 5
(sa6(ASCII)). Or are there exceptions? Could it be that the authors of
the maq web pages [3] were confusing quality 60 with ASCII 60, and that
led to talking about quality 93 being the limit? The following perl
code snippet ($q = chr(($Q<=93? $Q : 93) + 33);) from [3] seems to
point that way.
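
To see how the two scales relate numerically, here is a small Perl
sketch of the two definitions, Q = -10 log10(p) and
sQ = -10 log10(p / (1 - p)), where p is the error probability, plus
the direct conversion that follows from eliminating p. The function
names are hypothetical and chosen for illustration only; none of this
is BioPerl code.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Base-10 logarithm built from Perl's natural log().
sub log10 { my ($x) = @_; return log($x) / log(10) }

# Error probability -> phred quality:  Q = -10 * log10(p)
sub prob_to_phred { my ($p) = @_; return -10 * log10($p) }

# Error probability -> solexa quality: sQ = -10 * log10(p / (1 - p))
sub prob_to_solexa { my ($p) = @_; return -10 * log10($p / (1 - $p)) }

# Solexa -> phred, eliminating p from the two formulas above:
# Q = 10 * log10(10**(sQ/10) + 1)
sub solexa_to_phred {
    my ($sq) = @_;
    return 10 * log10(10 ** ($sq / 10) + 1);
}

# Phred -> solexa (inverse of the above); undefined for Q <= 0.
sub phred_to_solexa {
    my ($q) = @_;
    die "phred quality must be > 0 to convert to solexa" if $q <= 0;
    return 10 * log10(10 ** ($q / 10) - 1);
}

for my $p (0.5, 0.2, 0.1, 0.01, 0.001) {
    my ($q, $sq) = (prob_to_phred($p), prob_to_solexa($p));
    printf "p=%-6s phred=%6.2f solexa=%6.2f diff=%5.2f\n",
        $p, $q, $sq, $q - $sq;
}
```

At p = 0.2 the two scales differ by about one unit, and by p = 0.01
the difference is under 0.05, so any conversion only matters at the
low-quality end.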
Aside: What confuses me even more is that the bioperl test data set
contains a quality file (t/data/qualfile.qual) where values go from 0
to 90. Where do these values come from? Which program generated them?

Solexa quality is encoded using 64 as a base and is limited to an upper
value of 40. There is no predefined lower limit, although in practice
the values used do not go lower than -5. In other words, the encoding
goes from "" (ASCII xx) to "@" (ASCII 64, solexa quality 0, probability
0.5) up to "h" (ASCII 104, solexa quality 40, probability 1e-04)
(column 6, solc(ASCII)).

It was reported [7] that high solexa quality scores are over-optimistic
and low scores underestimate the data quality. It is therefore for the
better that Solexa quality is now only of historical interest, as
Solexa/Illumina pipeline v.1.3 ("illumina") uses phred qualities 0 to
40. It still uses 64 as a base. So: the character encoding goes from
"@" (ASCII 64, phred quality 0, probability 0.8) up to "h" (ASCII 104,
phred quality 40, probability 1e-04) (final column, illc(ASCII)).

If the facts and assumptions from above can be agreed on, we can move
on to coding. :)

Table:

The columns are probability, phred quality, solexa quality, sanger
encoding of phred qualities with corresponding ASCII values, sanger
encoding of phred qualities limited to Q60 with corresponding ASCII
values, solexa/illumina (1.0) encoding of solexa qualities with
corresponding ASCII values, and illumina (1.3) encoding of phred
quality with corresponding ASCII values.
   Prob  phq  soq    sac(ASCII)  sa6(ASCII)  solc(ASCII) illc(ASCII)
  0.999    0  -29    !( 33)      !( 33)      "( 34)      @( 64)
    0.9    0   -9    !( 33)      !( 33)      6( 54)      @( 64)
    0.8    1   -5    !( 33)      !( 33)      9( 57)      @( 64)
    0.7    1   -3    "( 34)      "( 34)      <( 60)      A( 65)
    0.6    2   -1    #( 35)      #( 35)      >( 62)      B( 66)
    0.5    3    0    $( 36)      $( 36)      @( 64)      C( 67)
    0.4    4    1    $( 36)      $( 36)      A( 65)      C( 67)
    0.3    5    3    &( 38)      &( 38)      C( 67)      E( 69)
    0.2    7    6    '( 39)      '( 39)      F( 70)      F( 70)
    0.1   10    9    +( 43)      +( 43)      I( 73)      J( 74)
   0.01   20   20    5( 53)      5( 53)      S( 83)      T( 84)
  0.001   30   30    ?( 63)      ?( 63)      ]( 93)      ^( 94)
 0.0001   40   40    I( 73)      I( 73)      g(103)      h(104)
  1e-05   50   50    S( 83)      S( 83)      h(104)      h(104)
  1e-06   60   60    ]( 93)      ]( 93)      h(104)      h(104)
  1e-07   70   70    g(103)      ]( 93)      h(104)      h(104)
  1e-08   80   80    q(113)      ]( 93)      h(104)      h(104)
  1e-09   90   90    z(122)      ]( 93)      h(104)      h(104)
  5e-10   93   93    ~(126)      ]( 93)      h(104)      h(104)
  1e-10  100  100    ~(126)      ]( 93)      h(104)      h(104)
  1e-11  110  110    ~(126)      ]( 93)      h(104)      h(104)


Code. This perl code generates the above table.

--------------------------------------------------------------------
#!/usr/bin/env perl

use strict;
use warnings;

my @p = (0.999, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1,
         0.01, 0.001, 0.0001, 0.00001, 0.000001, 0.0000001,
         0.00000001, 0.000000001, 0.0000000005, 0.0000000001,
         0.00000000001);

printf "%7s  %3s  %3s    %s%6s  %s%6s  %s%6s %s%6s\n",
    'Prob', 'phq', 'soq', 'sac', '(ASCII)', 'sa6', '(ASCII)',
    'solc', '(ASCII)', 'illc', '(ASCII)';

for my $e (@p) {

    my $Q = -10 * log($e) / log(10);   #phred quality
    my $sQ = -10 * log($e / (1 - $e)) / log(10); # solexa quality

    my $qc = chr(($Q<=93 ? $Q : 93)  + 33); # sanger character
    my $q6 = chr(($Q<=60 ? $Q : 60)  + 33); # sanger character, limited to Q60
    my $qs = chr(($Q<=40 ? $sQ : 40) + 64);  # solexa character
    # based on phred quality
    my $qi = chr(($Q<=40 ?
                     $Q : 40)  + 64); #illumina character

    # I've added 0.1 to some int values to counteract a strange downward
    # rounding of values by perl
    printf "%7s  %3d  %3d    %s(%3d)      %s(%3d)      %s(%3d)      %s(%3d)\n",
        $e, $Q+0.1, $sQ+0.1, $qc, ord($qc), $q6, ord($q6),
        $qs, ord($qs), $qi, ord($qi);
}
--------------------------------------------------------------------

References:

1. http://maq.sourceforge.net/qual.shtml

2. http://en.wikipedia.org/wiki/Phred_quality_score

3. http://maq.sourceforge.net/fastq.shtml

4. http://en.wikipedia.org/wiki/FASTQ_format

5. http://en.wikipedia.org/wiki/FASTQ_format#Encoding

6. http://maq.sourceforge.net/fastq.shtml#spec

7. Dohm JC, Lottaz C, Borodina T and Himmelbauer H: Substantial biases
   in ultra-short read data sets from high-throughput DNA sequencing
   Nucleic Acids Research 2008 36(16):e105; doi:10.1093/nar/gkn425
   http://nar.oxfordjournals.org/cgi/content/abstract/36/16/e105

8. Solution to Sanger/Solexa/Illumina FASTQ confusion
   http://seqanswers.com/forums/showthread.php?t=1526

2009/5/15 Heikki Lehvaslaiho :
> My initial assumption of linear relationship between different integer
> ranges used to represent quality values was overly simplistic. I've
> spent some time trying to understand the real relationships between
> quality integers and their ASCII encodings.
>
> My notes turned into a small essay which makes it quite long for an
> email. Please bear with me and reply back to the list if you think you
> have an insight into these matters.
>
>    -Heikki
>
> 2009/5/13 Chris Fields :
>> Heikki,
>>
>> Did you still want to commit this?  I think it's a good idea and would be
>> worth including in the next 1.6 point release.
>>
>> chris
>>
>> ------------------------------------------------------------
>> I convinced at least myself to the degree that I wrote the
>> range_convert() method - with plenty of tests. I mention this now so
>> that no-one else need to start thinking through all the edge values.
>> :)
>>
>> I'll contribute it to the code base once there is a consensus of best
>> way forward.
>>
>>    -Heikki
>>
>> 2009/4/27 Heikki Lehvaslaiho :
>>>> I have tried to summarise this in a central place:
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>
>>> Torsten,
>>>
>>> Thanks for putting this together. Very helpful.
>>>
>>> Do you have a plan of action?  Let me propose one for BioPerl. It
>>> based on following assumptions:
>>>
>>> 1. There is multitude of different ways of coding quality values out
>>> there.
>>> 2. Bio::Seq::Quality is agnostic of any quality value range rules
>>> 3. The emerging open standard is the Sanger fastq specification
>>> 4. Open source programs use the Sanger fastq specs
>>>
>>>
>>> From these it follows that:
>>>
>>>
>>> 1. BioPerl should support Sanger fastq standard
>>>
>>> 1.1. it already does and there are other SeqIO modules for dealing
>>> with other non-fastq formats.
>>>
>>> 2. BioPerl should offer simple ways of converting between quality range
>>> rules
>>>
>>> 2.1.
Have a generic method accessible from Bio::Seq::Quality with
>>> preset versions of the method for converting between known variants
>>> (Sanger fastq and the two Illumina versions)
>>>
>>> For example:
>>>
>>> range_convert ($from_lower, $from_upper, $to_lower, $to_upper, $value)
>>>  throw if $value < $from_lower or $value > $from_upper
>>>  return $newvalue
>>>
>>> range_convert_illumina2fastq(), range_convert_fastq2illumina(),
>>> range_convert_fastq2phred(),  range_convert_phred2fastq()....
>>>
>>> (assuming that illumina 1.3 eq phred)
>>>
>>> 2.2. Bio::SeqIO::Fastq::next_seq methods should convert Illumina
>>> qualities into Sanger fastq on the fly
>>>
>>> 2.2.1 Bio::SeqIO::Fastq::next_seq should detect the incoming stream of
>>> quality value range either automatically or be given a keyword
>>> parameter indicating the range.
>>>
>>> 2.2.2. Bio::SeqIO::Fastq::next_seq should throw an error if it detects
>>> a quality value out of range.
>>>
>>> 2.2.3. Bio::SeqIO::Fastq::write_seq should throw an error if it
>>> detects a quality value out of range.
>>>
>>> 2.2.4. It would be useful but not absolutely necessary for
>>> Bio::SeqIO::Fastq::write_seq to be able to write out in Illumina
>>> ranges
>>>
>>>
>>> What do you think?
>>>
>>>    -Heikki
>>>
>>> 2009/4/26 Torsten Seemann :
>>>>> > This might be a good place to ask the question: having looked at the
>>>>> > fastq.pm page, is the fastq format defined (only) by a "@'" followed by a
>>>>> > sequence line and a "+" header followed by a quality line and the two
>>>>> > headers have to agree? Now that Illumina is using phred scaling, are
>>>>> > 'Sanger' and 'Illumina' versions the same?
>>>>>
>>>>> No they aren't the same, Illumina still encodes the ascii as value + 64
>>>>> and Sanger as value + 33.
>>>>>
>>>>
>>>> Illumina have now CHANGED how they calculate the quality value however in
>>>> the last month or so...
Their Q range used to be -5..40 mapped to ASCII 64+,
>>>> but now they produce Q >= 0 and it is unclear if they start at 69 or 64
>>>> now...
>>>>
>>>> I have tried to summarise this in a central place:
>>>>
>>>> http://en.wikipedia.org/wiki/FASTQ_format
>>>>
>>>> Corrections welcome!
>>>>
>>>>
>>>> --Torsten Seemann
>>>> --Victorian Bioinformatics Consortium, Dept. Microbiology, Monash
>>>> University, AUSTRALIA
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>
>>> --
>>>    -Heikki
>>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>>> cell: +27 (0)714328090
>>> Sent from Claremont, WC, South Africa
>>>
>>
>> --
>>    -Heikki
>> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
>> cell: +27 (0)714328090
>> Sent from Claremont, WC, South Africa
>>
>
> --
>    -Heikki
> Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
> cell: +27 (0)714328090
> Sent from Claremont, WC, South Africa
>

--
    -Heikki
Heikki Lehvaslaiho - skype:heikki_lehvaslaiho
cell: +27 (0)714328090
Sent from Claremont, WC, South Africa

From maj at fortinbras.us Fri May 15 10:03:36 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Fri, 15 May 2009 10:03:36 -0400
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23559474.post@talk.nabble.com>
References: <23559474.post@talk.nabble.com>
Message-ID: <886C0614C34E441398B700E01E4DCCE1@NewLife>

fungazid-
One thing to do is to clean up your /tmp directory. Another thing to
try is to specify the temporary directory where the tempfiles are
created according to the perldoc for largefasta.pm:

This module handles very large sequence files by using the Bio::Seq::LargePrimarySeq module to store all the sequence data in a file.
This can be a problem if you have limited disk space on your computer because this will effectively cause 2 copies of the sequence file to reside on disk for the life of the Bio::Seq::LargePrimarySeq object. The default location for this is specified by the File::Spec->tmpdir routine which is usually /tmp on UNIX. If a sequence file is larger than the swap space (capacity of the /tmp dir) this could cause problems for the machine. It is possible to set the directory where the temporary file is located by adding the following line to your code BEFORE calling next_seq. See Bio::Seq::LargePrimarySeq for more information. This will give you more control over where the big seqs are cached. You also may be inadvertently creating tempfiles in your code somewhere on each loop. If a bug extremely irritates you, you may direct your irritation to http://bugzilla.bioperl.org. Read http://www.bioperl.org/wiki/Bugs#Submitting_Bugs before you do. Thanks for using BioPerl, and have a great day. Mark ----- Original Message ----- From: "fungazid" To: Sent: Friday, May 15, 2009 9:17 AM Subject: [Bioperl-l] looks like a Bio::SeqIO error > > Hello, > > I hope this is the right address for bioperl programming issues. Bioperl > saves me a lot of time (not to re-invent the wheel), but there are some > extremely irritating problems (I would change the code myself if I knew > how). > > I am trying to read a file (~20MB) containing multiple fasta sequences: >>a > AGTAGTGAGTGCGCTGA......... >>b > GCGCTGAAGTAGTGAGT....... >>c > AGTAGTGAGTGCGCTGA......... >>d........... > > with the following lines: > > my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); > > LOOP1: while ( my $seqobj1 = $seqin->next_seq()) > > { > ...... > my $seq=$seqobj1->subseq(1,$seqobj1->length); > ....... 
> } > > > This works right for the first ~30000 contig sequences but then the > following message appears: > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory > /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm line 744 > > What to do ??? (this is only one of some different Bioperl related bugs that > I'm experiencing) > > > > > -- > View this message in context: > http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From bosborne11 at verizon.net Fri May 15 09:52:40 2009 From: bosborne11 at verizon.net (Brian Osborne) Date: Fri, 15 May 2009 09:52:40 -0400 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23559474.post@talk.nabble.com> References: <23559474.post@talk.nabble.com> Message-ID: <00FF03C7-F451-42A9-8997-86A0016C758E@verizon.net> fungazid, What version of BioPerl are you using? If I'm not mistaken this problem has been seen before, and is now fixed in Bioperl 1.6. Brian O. On May 15, 2009, at 9:17 AM, fungazid wrote: > > Hello, > > I hope this is the right address for bioperl programming issues. > Bioperl > saves me a lot of time (not to re-invent the wheel), but there are > some > extremely irritating problems (I would change the code myself if I > knew > how). > > I am trying to read a file (~20MB) containing multiple fasta > sequences: >> a > AGTAGTGAGTGCGCTGA......... >> b > GCGCTGAAGTAGTGAGT....... >> c > AGTAGTGAGTGCGCTGA......... >> d........... > > with the following lines: > > my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); > > LOOP1: while ( my $seqobj1 = $seqin->next_seq()) > > { > ...... > my $seq=$seqobj1->subseq(1,$seqobj1->length); > ....... 
> } > > > This works right for the first ~30000 contig sequences but then the > following message appears: > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory > /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm > line 744 > > What to do ??? (this is only one of some different Bioperl related > bugs that > I'm experiencing) > > > > > -- > View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Fri May 15 11:16:23 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 15 May 2009 10:16:23 -0500 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23559474.post@talk.nabble.com> References: <23559474.post@talk.nabble.com> Message-ID: Just curious, but why not use something like Bio::DB::Fasta? It may be better suited for something like this. chris On May 15, 2009, at 8:17 AM, fungazid wrote: > Hello, > > I hope this is the right address for bioperl programming issues. > Bioperl > saves me a lot of time (not to re-invent the wheel), but there are > some > extremely irritating problems (I would change the code myself if I > knew > how). > > I am trying to read a file (~20MB) containing multiple fasta > sequences: >> a > AGTAGTGAGTGCGCTGA......... >> b > GCGCTGAAGTAGTGAGT....... >> c > AGTAGTGAGTGCGCTGA......... >> d........... > > with the following lines: > > my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); > > LOOP1: while ( my $seqobj1 = $seqin->next_seq()) > > { > ...... > my $seq=$seqobj1->subseq(1,$seqobj1->length); > ....... 
> } > > > This works right for the first ~30000 contig sequences but then the > following message appears: > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory > /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm > line 744 > > What to do ??? (this is only one of some different Bioperl related > bugs that > I'm experiencing) > > > > > -- > View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From shalabh.sharma7 at gmail.com Fri May 15 11:39:13 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Fri, 15 May 2009 11:39:13 -0400 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> Message-ID: <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com> Thanks a lot, I really appreciate it. 
-Shalabh

On Thu, May 14, 2009 at 5:41 PM, Kevin Brown wrote:

> http://bioperl.org/cgi-bin/deob_interface.cgi?Search=Search&module=Bio%3A%3AAlignIO%3A%3Aemboss&sort_order=by+method&search_string=Bio%3A%3AAlignio%3A%3Aemboss
>
> > -----Original Message-----
> > From: bioperl-l-bounces at lists.open-bio.org
> > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
> > shalabh sharma
> > Sent: Thursday, May 14, 2009 1:14 PM
> > To: Scott Markel
> > Cc: bioperl-l
> > Subject: Re: [Bioperl-l] Parsing needle/water output
> >
> > yes, i tried to read the documentation about Bio::AlignIO::emboss, but there
> > is not much in it.So is like i can call the same functions which are used in
> > searchIO. I need start and end position of the pairwise alignment.
> >
> > Thanks
> > Shalabh
> >
> >
> > On Thu, May 14, 2009 at 3:58 PM, Scott Markel wrote:
> >
> > > Shalabh,
> > >
> > > Have you looked at Bio::AlignIO::emboss?
> > >
> > > Scott
> > >
> > > Scott Markel, Ph.D.
> > > Principal Bioinformatics Architect  email:  smarkel at accelrys.com
> > > Accelrys (SciTegic R&D)             mobile: +1 858 205 3653
> > > 10188 Telesis Court, Suite 100      voice:  +1 858 799 5603
> > > San Diego, CA 92121                 fax:    +1 858 799 5222
> > > USA                                 web:    http://www.accelrys.com
> > >
> > > http://www.linkedin.com/in/smarkel
> > > Vice President, Board of Directors:
> > > International Society for Computational Biology
> > > Co-chair: ISCB Publications Committee
> > > Associate Editor: PLoS Computational Biology
> > > Editorial Board: Briefings in Bioinformatics
> > >
> > >
> > > > -----Original Message-----
> > > > From: bioperl-l-bounces at lists.open-bio.org
> > > > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of shalabh sharma
> > > > Sent: Thursday, 14 May 2009 12:54 PM
> > > > To: bioperl-l
> > > > Subject: [Bioperl-l] Parsing needle/water output
> > > >
> > > > Hi All,
> > > > Is there any parser/module available to parse needle/water output
> > > > report (from
emboss) to get the start and end position of alignment.
> > > >
> > > > Thanks
> > > > Shalabh
> > > > _______________________________________________
> > > > Bioperl-l mailing list
> > > > Bioperl-l at lists.open-bio.org
> > > > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> > >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
> >
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

From fungazid at yahoo.com Fri May 15 11:45:11 2009
From: fungazid at yahoo.com (fungazid)
Date: Fri, 15 May 2009 08:45:11 -0700 (PDT)
Subject: [Bioperl-l] looks like a Bio::SeqIO error
Message-ID: <23562461.post@talk.nabble.com>

Thanks all for your rapid replies,

1) I tried cleaning my swap space (/tmp), and this indeed delays the
eruption of the error but does not prevent it totally.

2) I used bioperl 1.5.2.02 lubuntu1 (this is installed automatically by
the linux ubuntu package manager). I unsuccessfully tried to install
bioperl 1.6 with the 'sudo cpan' command, and I get the following
report (is it possible to force install with this report ??).

Test Summary Report
-------------------
t/ClusterIO/ClusterIO.t (Wstat: 65280 Tests: 2 Failed: 0)
  Non-zero exit status: 255
  Parse errors: Bad plan. You planned 12 tests but ran 2.
Files=318, Tests=15584, 160 wallclock secs ( 4.88 usr 0.56 sys + 146.54 cusr 7.12 csys = 159.10 CPU)
Result: FAIL
Failed 1/318 test programs. 0/15584 subtests failed.
make: *** [test] Error 255
  CJFIELDS/BioPerl-1.6.0.tar.gz
  /usr/bin/make test -- NOT OK
//hint// to see the cpan-testers results for installing this module, try:
  reports CJFIELDS/BioPerl-1.6.0.tar.gz
Warning (usually harmless): 'YAML' not installed, will not store persistent state
Running make install
  make test had returned bad status, won't install without force
Failed during this command:
 CJFIELDS/BioPerl-1.6.0.tar.gz : make_test NO

3) I used Bio::Seq::LargeSeq (and not other modules) simply because it
worked fine in the past, and did what I wanted with no problems.

____________________________________________________________________________________

Brian Osborne-2 wrote:
>
> fungazid,
>
> What version of BioPerl are you using?
>
> If I'm not mistaken this problem has been seen before, and is now
> fixed in Bioperl 1.6.
>
> Brian O.
>
>
> On May 15, 2009, at 9:17 AM, fungazid wrote:
>
>>
>> Hello,
>>
>> I hope this is the right address for bioperl programming issues.
>> Bioperl saves me a lot of time (not to re-invent the wheel), but there
>> are some extremely irritating problems (I would change the code myself
>> if I knew how).
>>
>> I am trying to read a file (~20MB) containing multiple fasta sequences:
>> >a
>> AGTAGTGAGTGCGCTGA.........
>> >b
>> GCGCTGAAGTAGTGAGT.......
>> >c
>> AGTAGTGAGTGCGCTGA.........
>> >d...........
>>
>> with the following lines:
>>
>> my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1);
>>
>> LOOP1: while ( my $seqobj1 = $seqin->next_seq())
>>
>> {
>> ......
>> my $seq=$seqobj1->subseq(1,$seqobj1->length);
>> .......
>> }
>>
>>
>> This works right for the first ~30000 contig sequences but then the
>> following message appears:
>>
>> Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory
>> /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm
>> line 744
>>
>> What to do ???
(this is only one of some different Bioperl related >> bugs that >> I'm experiencing) >> >> >> >> >> -- >> View this message in context: >> http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23562461.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From Kevin.M.Brown at asu.edu Fri May 15 12:03:16 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Fri, 15 May 2009 09:03:16 -0700 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23559474.post@talk.nabble.com> References: <23559474.post@talk.nabble.com> Message-ID: <1A4207F8295607498283FE9E93B775B405FE1AC8@EX02.asurite.ad.asu.edu> How large are the individual sequences? You have more than 30,000 sequences in there? Then I wouldn't recommend using the LargeFasta format as each of those sequences could easily be loaded and held in memory. LargeFasta really seems to be better for holding large individual sequences (e.g. a human chromosome). > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of fungazid > Sent: Friday, May 15, 2009 6:18 AM > To: Bioperl-l at lists.open-bio.org > Subject: [Bioperl-l] looks like a Bio::SeqIO error > > > Hello, > > I hope this is the right address for bioperl programming > issues. 
Bioperl > saves me a lot of time (not to re-invent the wheel), but > there are some > extremely irritating problems (I would change the code myself > if I knew > how). > > I am trying to read a file (~20MB) containing multiple fasta > sequences: > >a > AGTAGTGAGTGCGCTGA......... > >b > GCGCTGAAGTAGTGAGT....... > >c > AGTAGTGAGTGCGCTGA......... > >d........... > > with the following lines: > > my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); > > LOOP1: while ( my $seqobj1 = $seqin->next_seq()) > > { > ...... > my $seq=$seqobj1->subseq(1,$seqobj1->length); > ....... > } > > > This works right for the first ~30000 contig sequences but then the > following message appears: > > Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory > /tmp/6eS92VzVjm: Too many links at > /usr/share/perl5/Bio/Root/IO.pm line 744 > > What to do ??? (this is only one of some different Bioperl > related bugs that > I'm experiencing) > > > > > -- > View this message in context: > http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp2355 > 9474p23559474.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at illinois.edu Fri May 15 12:20:10 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 15 May 2009 11:20:10 -0500 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23562461.post@talk.nabble.com> References: <23562461.post@talk.nabble.com> Message-ID: <9CF9E13F-96FB-4BB8-9165-0500AD534877@illinois.edu> You can still install core with 'force install', but it would be nice to see what was causing ClusterIO tests to bork. That's the first time I recall seeing that one fail. 
chris

From bosborne11 at verizon.net  Fri May 15 12:00:41 2009
From: bosborne11 at verizon.net (Brian Osborne)
Date: Fri, 15 May 2009 12:00:41 -0400
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23562461.post@talk.nabble.com>
References: <23562461.post@talk.nabble.com>
Message-ID: <4FD65D25-56E3-49FF-BB04-02CC674263A5@verizon.net>

fungazid,

I would do a "force install" in CPAN, unless you think that single
failed test in ClusterIO.t will affect you. Will you use ClusterIO?

Brian O.

From avilella at gmail.com  Fri May 15 13:07:43 2009
From: avilella at gmail.com (Albert Vilella)
Date: Fri, 15 May 2009 18:07:43 +0100
Subject: [Bioperl-l] Google Summer of Code student Chase Miller
In-Reply-To: <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
Message-ID: <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>

Heh, I understand what you say.
I am in a similar position from the point
that I would prefer to switch to a more modern bioperl but the ensembl
comparative genomics code -- ensembl-compara -- relies on the ensembl-core
code, which relies on bioperl 1.2.3. We could all switch to bioperl 1.6 but
I cannot switch the ensembl-compara code if code doesn't switch as well. I
haven't been very successful in raising this issue so far, but I can try
again :-p

One of the things that has changed a lot is swissprot support (swiss.pm).
Another object that I am using a lot is SimpleAlign.pm, which in the modern
version has a lot more methods.

On Thu, May 14, 2009 at 3:16 PM, Chris Fields wrote:

> Albert,
>
> Just to note, I have been using bioperl 1.6.0 with the ensembl API w/o
> problems, and Sendu Bala added an ensembl 'wrapper' to bioperl-run. Do we
> know precisely what breaks btwn 1.2.3 and 1.6 (and thus leads everyone to
> believe 1.2.3 is absolutely required)? The previous answers have been
> pretty nebulous and unspecific.
>
> I would have to go on record as being opposed to this. If there is a true
> compatibility issue, I would much rather spend the energy and tuits driving
> towards ensembl compatibility with the current bioperl version than
> backporting to 1.2.3. What about having users popping in with bug reports
> on list (here or ensembl) about bioperl versions 5+ years out-of-date?
> Furthermore, it's a slippery slope; the next thing will be requests to
> backport specific bug fixes in the current branch to 1.2.3.
>
> Who's willing to maintain that branch? We have few devs as it is, so is
> someone on the ensembl end willing to take that up?
>
> Perl 5 development has been held up with the same issues, something they
> have recently just started digging themselves out of. Regardless, I think
> way too many changes have occurred in that particular code that make such
> endeavors unrealistic, unfeasible, and unmaintainable.
>
> chris
>
> On May 14, 2009, at 8:45 AM, Albert Vilella wrote:
>
>> Hi all,
>>
>> In Ensembl, we are interested in providing NeXML dumps for our Comparative
>> Genomics data. Because our pipeline is written in Perl, I guess most of
>> the work done here will be of great use to us.
>>
>> If I could only ask for only a feature, that would be to *try* and
>> backport the NeXML support to bioperl-1.2.3 --- stress on the *try*.
>> Bioperl 1.2.3 is the release that Ensembl decided to stick to many years
>> ago, so it's cleaner for people to use our Perl API with only one version
>> of bioperl as a dependency.
>>
>> Looking forward to hearing from this SoC. Have you got a blog?
>>
>> Cheers,
>>
>> Albert.
>>
>> On Mon, May 11, 2009 at 5:24 PM, Jason Stajich wrote:
>>
>>> Welcome Chase.
>>>
>>> Look forward to the project and helping where needed.
>>>
>>> -jason
>>>
>>> On May 11, 2009, at 7:31 AM, Mark A. Jensen wrote:
>>>
>>>> Hello all,
>>>>
>>>> With great pleasure, I want to introduce Chase Miller, my Google Summer
>>>> of Code student from George Washington University, to the community.
>>>> Chase will be working with me and Rutger Vos on a BioPerl wrapper for
>>>> Rutger's Bio::Phylo package, with a particular emphasis on creating a
>>>> BioPerl-native way to import and export the NeXML (http://nexml.org)
>>>> phylogenetic data format. He wrote a great proposal, available here:
>>>>
>>>> https://www.nescent.org/wg_phyloinformatics/PhyloSoC:BioPerl_integration_of_the_NeXML_exchange_standard_and_Bio::Phylo_toolkit
>>>> .
>>>> We will be working throughout the summer on the project, and will of
>>>> course come to you for sage advice. I know you will welcome him warmly,
>>>> as you did me.
>>>> Cheers,
>>>> Mark
>>>
>>> Jason Stajich
>>> jason at bioperl.org

From fungazid at yahoo.com  Fri May 15 12:47:33 2009
From: fungazid at yahoo.com (fungazid)
Date: Fri, 15 May 2009 09:47:33 -0700 (PDT)
Subject: [Bioperl-l] looks like a Bio::SeqIO error
Message-ID: <23563520.post@talk.nabble.com>

Chris,

I attach to the end of this message the specific cpan test output reporting
the ClusterIO error.

About using another module (Bio::DB::Fasta): I will change the code if
necessary (in fact the same code works fine in windows, but not in linux
ubuntu). The fasta sequences are 50-1500bp long (not huge sequences, I
admit).

t/ClusterIO/ClusterIO.t ...................... 1/12
Bio::ClusterIO: could not load dbsnp - for more details on supported formats please see the ClusterIO docs
Exception
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Failed to load module Bio::ClusterIO::dbsnp. Can't locate XML/SAX.pm in @INC (@INC contains: t/lib . /home/home/Desktop/bioperl1.6/BioPerl-1.6.0/blib/lib /home/home/Desktop/bioperl1.6/BioPerl-1.6.0/blib/arch /home/home/Desktop/bioperl1.6/BioPerl-1.6.0 /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl) at Bio/ClusterIO/dbsnp.pm line 59.
BEGIN failed--compilation aborted at Bio/ClusterIO/dbsnp.pm line 59.
Compilation failed in require at Bio/Root/Root.pm line 420.

Chris Fields-5 wrote:
>
> You can still install core with 'force install', but it would be nice
> to see what was causing ClusterIO tests to bork. That's the first
> time I recall seeing that one fail.
>
> chris
>

--
View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23563520.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From fungazid at yahoo.com  Fri May 15 13:12:48 2009
From: fungazid at yahoo.com (fungazid)
Date: Fri, 15 May 2009 10:12:48 -0700 (PDT)
Subject: [Bioperl-l] looks like a Bio::SeqIO error
Message-ID: <23563950.post@talk.nabble.com>

After installing the XML-SAX module, my installation of bioperl 1.6 was
successful (I didn't need to force install). But the error I reported
previously (Too many links at /usr/local/share/perl/5.8.8/Bio/Root/IO.pm
line 740) still exists.

--
View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23563950.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
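
The ClusterIO test failure above came down to a missing prerequisite (XML/SAX.pm was not in @INC). As a rough sketch only, a check like the following could show which optional modules are absent before running the test suite. The subroutine name missing_modules is made up for illustration and is not part of BioPerl or CPAN.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): return the names from
# @modules that cannot be loaded by this Perl installation.
sub missing_modules {
    my @modules = @_;
    my @missing;
    for my $module (@modules) {
        # Turn a package name such as XML::SAX into the path XML/SAX.pm
        (my $file = "$module.pm") =~ s{::}{/}g;
        # require dies if the file is not found in @INC; eval traps that
        push @missing, $module unless eval { require $file; 1 };
    }
    return @missing;
}

# XML::SAX is the module named in the exception quoted in this thread.
print "Missing prerequisite: $_\n" for missing_modules('XML::SAX');
```

Any module reported this way can then be installed from CPAN before re-running 'make test'.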

From uludag at ebi.ac.uk  Fri May 15 13:45:29 2009
From: uludag at ebi.ac.uk (Mahmut Uludag)
Date: Fri, 15 May 2009 18:45:29 +0100
Subject: [Bioperl-l] Parsing needle/water output
In-Reply-To: <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com>
References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com>
Message-ID: <1242409529.21726.58.camel@emboss2.ebi.ac.uk>

> > > yes, i tried to read the documentation about
> > > Bio::AlignIO::emboss, but there
> > > is not much in it.So is like i can call the same functions
> > > which are used in
> > > searchIO. I need start and end position of the pairwise alignment.

Hi Shalabh,

I copied below an example script that prints the start and end positions of
the sequences used for constructing a pairwise alignment by the EMBOSS
'water' program. After your last email this became an unnecessary example
but I thought it would be useful for people with similar questions in the
future.

Regards,
Mahmut

use strict;
use warnings;
use Bio::Factory::EMBOSS;
use Bio::Seq;
use Bio::AlignIO;

# two short test sequences to align
my $aseq = Bio::Seq->new( -id => 'seq1', -seq => 'AACATGTAGGGATAG' );
my $bseq = Bio::Seq->new( -id => 'seq2', -seq => 'GCATGTTTAGATAG' );
my $factory = Bio::Factory::EMBOSS->new;
my $water = $factory->program("water");
my $wateroutfile = 'out.water';

# run water and write the alignment to $wateroutfile
$water->run(
    {
        -asequence => $aseq,
        -bsequence => $bseq,
        -gapopen   => '10.0',
        -gapextend => '0.5',
        -outfile   => $wateroutfile
    }
);

my $alnin = Bio::AlignIO->new(
    -format => 'emboss',
    -file   => $wateroutfile
);

while ( my $aln = $alnin->next_aln ) {
    # process the alignment -- Bio::SimpleAlign objects
    foreach my $seq ( $aln->each_seq() ) {
        print "\n" . $seq->display_id;
        print " " . $seq->start . " " . $seq->end;
    }
    print "\n" . $aln->percentage_identity;
    print "\n" . $aln->consensus_string(50) . "\n";
}

From j_martin at lbl.gov  Fri May 15 13:44:34 2009
From: j_martin at lbl.gov (Joel Martin)
Date: Fri, 15 May 2009 10:44:34 -0700
Subject: [Bioperl-l] Creating a fastq format file?
In-Reply-To:
References:
Message-ID: <20090515174434.GA15394@eniac.jgi-psf.org>

Hello,

> The wikipedia FASTQ page [5] says that "Sanger format encodes a Phred
> quality score from 0 to 60 using ASCII 33 to 93." It generally
> accepted that Phred qualities are limited to 60. See column 5
> (sa6(ASCII)). Or are there exceptions?

consed and phrap reserve values 98 and 99 to mean user-edited bases.
Disallowing those would probably cause trouble eventually, as consed
reads fastq files now.

Joel

On Fri, May 15, 2009 at 03:39:55PM +0200, Heikki Lehvaslaiho wrote:
> Sorry for confusingly composed post. Here is a cleaner version.
>
> -Heikki

From maj at fortinbras.us  Fri May 15 14:09:37 2009
From: maj at fortinbras.us (Mark A. Jensen)
Date: Fri, 15 May 2009 14:09:37 -0400
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23563950.post@talk.nabble.com>
References: <23563950.post@talk.nabble.com>
Message-ID: <8CB3AAC43B574C5E9E86DBE4AD229820@NewLife>

(you might need to clear out the /tmp dir once again, left over from the
buggy 1.5.2 routines)

----- Original Message -----
From: "fungazid"
To:
Sent: Friday, May 15, 2009 1:12 PM
Subject: Re: [Bioperl-l] looks like a Bio::SeqIO error

> After installing the XML-SAX module, my installation of bioperl 1.6 was
> successful (I didn't need to force install). But the error I reported
> previously (Too many links at /usr/local/share/perl/5.8.8/Bio/Root/IO.pm
> line 740) still exists.

From shalabh.sharma7 at gmail.com  Fri May 15 14:13:50 2009
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Fri, 15 May 2009 14:13:50 -0400
Subject: [Bioperl-l] Parsing needle/water output
In-Reply-To: <1242409529.21726.58.camel@emboss2.ebi.ac.uk>
References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com> <1242409529.21726.58.camel@emboss2.ebi.ac.uk>
Message-ID: <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com>

Hi Mahmut,

Thanks a lot, actually this is exactly what I was looking for. It's really
helpful.

One more thing: is there any way I can get the full ID of the sequence from
the water output? For example, my sequence ID is JCVI_READ_1103769852490
and the water output gives something like this:

4206170-42077 409 CCATGCCGCGTGTATGAAGAAGGCCTTCGG 458
||| .||| ||.| ||||||| |||.||||| ||.|
JCVI_READ_110 1 CCA--ACGC-----TGCA----------GGGTTGT-AAG 31

So when I run the script I get this output:

JCVI_READ_110 1 1078

Instead of getting JCVI_READ_1103769852490 I am getting just JCVI_READ_110.

Thanks
Shalabh

From cjfields at illinois.edu  Fri May 15 14:42:01 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Fri, 15 May 2009 13:42:01 -0500
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23563950.post@talk.nabble.com>
References: <23563950.post@talk.nabble.com>
Message-ID:

As mentioned in the thread by Kevin, using LargeSeq may not be the best
route as it generates a tmpdir/tmpfile for each sequence, so you are
probably reaching a hard limit for the number of temp files for your
system or for File::Temp. The oddity is that one or the other (or both)
are not removed, so I think this is a legit bug. Regardless, there's a good
reason this hasn't been encountered before.

If you have ~30,000 sequences that you extract subseqs from, you should
seriously consider using a flat file db such as Bio::DB::Fasta or similar.
It is capable of handling very large sequences or files and can extract
subseqs easily.
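
A minimal sketch of the two alternatives suggested in this thread, assuming BioPerl is installed: streaming the file with the ordinary 'fasta' parser, and indexed access through Bio::DB::Fasta. The subroutine names and the idea of collecting lengths in a hash are illustrative assumptions, not code from the original poster.

```perl
use strict;
use warnings;
use Bio::SeqIO;
use Bio::DB::Fasta;

# Option 1 (hypothetical helper): stream the records with the plain
# 'fasta' parser, which keeps each sequence in memory as a string.
sub lengths_by_streaming {
    my ($fasta_path) = @_;
    my $seqin = Bio::SeqIO->new( -format => 'fasta', -file => $fasta_path );
    my %length_of;
    while ( my $seqobj = $seqin->next_seq ) {
        my $seq = $seqobj->seq;    # the full sequence string
        $length_of{ $seqobj->display_id } = length $seq;
    }
    return \%length_of;
}

# Option 2 (hypothetical helper): index the file with Bio::DB::Fasta and
# fetch a subsequence by ID. Coordinates are 1-based and inclusive.
sub fetch_subseq {
    my ( $fasta_path, $id, $start, $end ) = @_;
    my $db = Bio::DB::Fasta->new($fasta_path);    # builds an index file next to the FASTA file
    return $db->seq( $id, $start, $end );
}
```

For contigs of 50-1500 bp the streaming version is probably enough; the indexed version is more useful when subsequences are fetched repeatedly or in no particular order.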

chris

On May 15, 2009, at 12:12 PM, fungazid wrote:

> After installing the XML-SAX module, my installation of bioperl 1.6 was
> successful (I didn't need to force install). But the error I reported
> previously (Too many links at /usr/local/share/perl/5.8.8/Bio/Root/IO.pm
> line 740) still exists.

From hlapp at gmx.net  Fri May 15 14:44:50 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 15 May 2009 14:44:50 -0400
Subject: [Bioperl-l] Google Summer of Code student Chase Miller
In-Reply-To: <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
Message-ID: <54B1B8D5-AB84-4A56-A696-BF32AB5D94CD@gmx.net>

On May 15, 2009, at 1:07 PM, Albert Vilella wrote:

> Heh, I understand what you say. I am in a similar position from the point
> that I would prefer to switch to a more modern bioperl but the ensembl
> comparative genomics code -- ensembl-compara -- relies on the ensembl-core
> code, which relies on bioperl 1.2.3. We could all switch to bioperl 1.6
> but I cannot switch the ensembl-compara code if code doesn't switch as
> well. I haven't been very successful in raising this issue so far, but I
> can try again :-p
>
> One of the things that has changed a lot is swissprot support (swiss.pm).
> Another object that I am using a lot is SimpleAlign.pm, which in the
> modern version has a lot more methods.

That should be a positive, no?

I understand that there have been (are?) good reasons for inertia on
the Ensembl end - undoubtedly such a switch would require a huge
amount of testing to be sure all the wrinkles have been ironed out. So
the question I'd like to ask is, from an Ensembl perspective what
BioPerl features or functions or other things we can actually control
would make that effort worth it?

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net  Fri May 15 14:58:38 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 15 May 2009 14:58:38 -0400
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23559474.post@talk.nabble.com>
References: <23559474.post@talk.nabble.com>
Message-ID: <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net>

I think you're running up against an OS limit on the number of open
files, or the number of files in a directory. You can check (and
change) your limits with ulimit.

The largefasta module is designed for reading in and handling large
(like, really large - whole-chromosome scale) sequences which, if all
held in memory, would exhaust the memory either immediately or pretty
quickly. So it stores them in temporary files. Most unix systems will
limit the number of files you can have open at any one time.

If the sequences in that file aren't huge, largefasta isn't the
module you want to use - just use the fasta parser, or if you need
random access to sequences in the file (do you?) then Bio::DB::Fasta.
Writing sequences to temporary files is a waste of time if they fit
into memory just fine.

The odd thing is that you actually run up to the limit. Normally the
temporary files should be closed and deleted when the sequence objects
go out of scope (I think - should verify in the code of course ...),
so the fact that they aren't makes me suspect that the code snippet you
presented isn't all that there is to it - are you storing the
sequences somewhere in a variable, such as in an array or a hash table?

	-hilmar

On May 15, 2009, at 9:05 AM, fungazid wrote:

> Hello,
>
> I hope this is the right address for bioperl programming issues. Bioperl
> saves me a lot of time (not to re-invent the wheel), but there are some
> extremely irritating problems (I would change the code myself if I knew
> how).
>
> I am trying to read a file (~20MB) containing multiple fasta sequences:
> >a
> AGTAGTGAGTGCGCTGA.........
> >b
> GCGCTGAAGTAGTGAGT.......
> >c
> AGTAGTGAGTGCGCTGA.........
> >d...........
>
> with the following lines:
>
> my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1);
>
> LOOP1: while ( my $seqobj1 = $seqin->next_seq())
> {
> ......
> my $seq=$seqobj1->subseq(1,$seqobj1->length);
> .......
> }
>
> This works right for the first ~30000 contig sequences but then the
> following message appears:
>
> Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory
> /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm
> line 744
> DESTROY() mysql_insert obj
> destroying HANDLE
>
> What to do ??? (this is only one of some different Bioperl related
> bugs that I'm experiencing)
>
> --
> View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html
> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From fungazid at yahoo.com Fri May 15 14:06:52 2009 From: fungazid at yahoo.com (fungazid) Date: Fri, 15 May 2009 11:06:52 -0700 (PDT) Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23563950.post@talk.nabble.com> References: <23559474.post@talk.nabble.com> <00FF03C7-F451-42A9-8997-86A0016C758E@verizon.net> <23562461.post@talk.nabble.com> <9CF9E13F-96FB-4BB8-9165-0500AD534877@illinois.edu> <23563520.post@talk.nabble.com> <23563950.post@talk.nabble.com> Message-ID: <23564748.post@talk.nabble.com> Ok, thank you all very very much, The problems were solved completely: 1) Bioperl 1.6 was installed (just needed to install XML-SAX) 2) after changing '-format' to 'Fasta' instead of 'largefasta', all problems were gone. my $seqin = Bio::SeqIO->new('-format'=>'Fasta','-file'=>$file1); LOOP1: while ( my $seqobj1 = $seqin->next_seq()) { .... -- View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23564748.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
From jason at bioperl.org Fri May 15 16:03:05 2009 From: jason at bioperl.org (Jason Stajich) Date: Fri, 15 May 2009 13:03:05 -0700 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com> <1242409529.21726.58.camel@emboss2.ebi.ac.uk> <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com> Message-ID: That's a problem from water not bioperl - if the report doesn't include the header then it won't be parsed by Bioperl. You could consider replacing water with SSEARCH and GGSEARCH for needle functionality both in the FASTA package. Bio::SearchIO parses those formats and IDs aren't truncated there. -jason On May 15, 2009, at 11:13 AM, shalabh sharma wrote: > Hi Mahmut, Thanks a lot, actually this is exactly > what i > was looking for. Its really helpful. > One thing more, is there any way i can get the full id of the > sequence from > the water output. > like -> > My sequece id is: JCVI_READ_1103769852490 > > and the water output gives something like this : > > 4206170-42077 409 CCATGCCGCGTGTATGAAGAAGGCCTTCGG 458 > > ||| .||| ||.| ||||||| |||.||||| ||.| > > JCVI_READ_110 1 CCA--ACGC-----TGCA----------GGGTTGT-AAG 31 > > So when i run the script i get output: > > JCVI_READ_110 1 > 1078 > > Instead of getting JCVI_READ_1103769852490 i am getting just > JCVI_READ_110 > > Thanks > Shalabh > > On Fri, May 15, 2009 at 1:45 PM, Mahmut Uludag > wrote: > >> >>>>> yes, i tried to read the documentation about >>>>> Bio::AlignIO::emboss, but there >>>>> is not much in it.So is like i can call the same functions >>>>> which are used in >>>>> searchIO. 
I need start and end position of the pairwise alignment. >> >> Hi Shalabh, >> >> I copied below an example script that prints start and end position >> of >> sequences used for constructing pairwise alignment by EMBOSS 'water' >> program. After your last email this became an unnecessary example >> but i >> thought it would be useful for people with similar questions in the >> future. >> >> Regards, >> Mahmut >> >> >> use Bio::Factory::EMBOSS; >> use Bio::Seq; >> use Bio::AlignIO; >> >> my $aseq = Bio::Seq->new( -id => 'seq1', -seq => 'AACATGTAGGGATAG' ); >> my $bseq = Bio::Seq->new( -id => 'seq2', -seq => 'GCATGTTTAGATAG' ); >> my $factory = new Bio::Factory::EMBOSS; >> my $water = $factory->program("water"); >> my $wateroutfile = 'out.water'; >> >> $water->run( >> { >> -asequence => $aseq, >> -bsequence => $bseq, >> -gapopen => '10.0', >> -gapextend => '0.5', >> -outfile => $wateroutfile >> } >> ); >> >> my $alnin = new Bio::AlignIO( >> -format => 'emboss', >> -file => $wateroutfile >> ); >> >> while ( my $aln = $alnin->next_aln ) { >> # process the alignment -- Bio::SimpleAlign objects >> foreach $seq ( $aln->each_seq() ) { >> print "\n" . $seq->display_id; >> print " " . $seq->start . " " . $seq->end; >> } >> print "\n" . $aln->percentage_identity; >> print "\n" . $aln->consensus_string(50) . 
"\n"; >> } >> >> >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From uludag at ebi.ac.uk Fri May 15 17:09:16 2009 From: uludag at ebi.ac.uk (uludag at ebi.ac.uk) Date: Fri, 15 May 2009 22:09:16 +0100 (BST) Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com> <1242409529.21726.58.camel@emboss2.ebi.ac.uk> <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com> Message-ID: <48174.86.154.46.60.1242421756.squirrel@webmail.ebi.ac.uk> Hi Shalab, > is there any way i can get the full id of the sequence from the water output. I checked emboss website and have a quick look into the relevant source file but cannot find an option to adjust length of the sequence identifiers written in alignment reports. For example prettyplot has 'maxnamelen' option for a similar purpose, a similar option seems to be reasonable for alignment reports. For now, as a workaround you can read sequence identifiers from your input bioperl sequence objects. It seems EMBOSS doesn't change the order of sequences. 
Regards, Mahmut From cjfields at illinois.edu Fri May 15 19:36:47 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 15 May 2009 18:36:47 -0500 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> References: <23559474.post@talk.nabble.com> <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> Message-ID: <7147B639-3F75-4617-ACBF-20C85B7C6673@illinois.edu> On May 15, 2009, at 1:58 PM, Hilmar Lapp wrote: > I think you're running up against an OS limit on the number of open > files, or the number of files in a directory. You can check (and > change) your limits with ulimit. > > The largefasta modules is designed for reading in and handling large > (like, really large - whole-chromosome scale) sequences which, if > all held in memory, would exhaust the memory either immediately or > pretty quickly. So it stores them in temporary files. Most unix > systems will limit the number of files you can have open at any one > time. > > If your sequences in that file aren't huge, largefasta isn't the > module you want to use - just use the fasta parser, or if you need > random access to sequences in the file (do you?) then > Bio::DB::Fasta. Writing sequences to temporary files is a waste of > time if they fit into memory just fine. > > The odd thing is that you actually run up to the limit. Normally the > temporary files should be closed and deleted when the sequence > objects go out of scope (I think - should verify in the code of > course ...) , so the fact that they don't lets me suspect that the > code snippet that you presented isn't all that there is to it - are > you storing the sequences somewhere in a variable, such as in an > array or a hash table? > > -hilmar This was a legit bug. The DESTROY method in LargePrimarySeq only removed files but not their directories. I added a few extra lines to do that. 
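
To illustrate the kind of cleanup described above, here is a small sketch using only core Perl modules. TempBackedSeq is a made-up class, not the actual Bio::Seq::LargePrimarySeq code, and the real fix may differ in detail. Each object gets its own temporary directory; if DESTROY removed only the file, one empty directory per sequence would be left behind, which would fit the "Too many links" error on filesystems that cap the number of subdirectories.

```perl
package TempBackedSeq;    # hypothetical class, for illustration only

use strict;
use warnings;
use File::Temp ();
use File::Spec;

sub new {
    my ( $class, $seq ) = @_;

    # One temporary directory per object, with no automatic cleanup.
    my $dir  = File::Temp::tempdir();
    my $file = File::Spec->catfile( $dir, 'seq.txt' );

    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} $seq;
    close $fh or die "Cannot close $file: $!";

    return bless { dir => $dir, file => $file }, $class;
}

sub DESTROY {
    my $self = shift;
    unlink $self->{file};    # removing only the file leaves the directory
    rmdir $self->{dir};      # removing the directory as well avoids the pile-up
}

package main;

my $dir;
{
    my $obj = TempBackedSeq->new('ACGT');
    $dir = $obj->{dir};
}    # $obj goes out of scope here, so DESTROY runs

print -d $dir ? "directory still exists\n" : "directory removed\n";
```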
chris From hlapp at gmx.net Sat May 16 11:19:06 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 11:19:06 -0400 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <7147B639-3F75-4617-ACBF-20C85B7C6673@illinois.edu> References: <23559474.post@talk.nabble.com> <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> <7147B639-3F75-4617-ACBF-20C85B7C6673@illinois.edu> Message-ID: <0FAD4F85-0367-4B99-9483-D4130DB0C179@gmx.net> On May 15, 2009, at 7:36 PM, Chris Fields wrote: > This was a legit bug. The DESTROY method in LargePrimarySeq only > removed files but not their directories. I added a few extra lines > to do that. Cool you spotted this and thanks for fixing! -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 18:34:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 18:34:57 -0400 Subject: [Bioperl-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Don't you love SwissProt (or UniProt as we must call it now I suppose). They (understandably) try to squeeze ever more annotation into the existing tags, rather than adding new tags. 
So, of the following structure: DE RecName: Full=11S globulin seed storage protein 2; DE AltName: Full=11S globulin seed storage protein II; DE AltName: Full=Alpha-globulin; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 acidic chain; DE AltName: Full=11S globulin seed storage protein II acidic chain; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 basic chain; DE AltName: Full=11S globulin seed storage protein II basic chain; DE Flags: Precursor; really only the first line, with the 'RecName: Full=' removed, is the description line as we know it. The rest, I would say, is annotation, such as two alternative names, amino acid chains contained in the full record (shouldn't this be feature annotation, really? and indeed it is - why it needs to be repeated here is beyond me) and their names as well as alternative names, and the fact that the sequence is a precursor form. Leaving all this in one string has the advantage that we can round- trip it (and there is probably hardly any other way to accomplish that), but clearly in terms of semantics this isn't the sequence description as we know it anymore. Does anyone else think too that completely changing the semantics of sequence annotation fields is a bad idea? My inclination from a BioPerl perspective is to extract the part following 'RecName: Full=' as the description, and attach the rest as annotation. We could in fact use the TagTree class for this. I'm cross- posting to BioPerl too to gather what other BioPerl'ers think about this. -hilmar On May 14, 2009, at 2:20 PM, Peter wrote: > Hi, > > This is cross-posted between biopython-dev and biosql-l as it regards > parsing the description (DE) lines in SwissProt files and how they are > stored in BioSQL. 
This follows from an earlier discussion on > biopython-dev > > Older SwissProt files just had one or two DE lines, and it made sense > to treat this as a simple string mapped onto the description field in > the bioentry table in BioSQL. This appears to what happens with > BioPerl 1.5.x and in Biopython (although the details regarding white > space differ). However, newer SwissProt files have many DE lines with > additional structure. The example Michiel gave earlier on the > biopython-dev list was: > > http://www.uniprot.org/uniprot/Q9XHP0.txt > > This has the following DE lines: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > I had to fight with perl to get my old copy of BioPerl working again > (some week reference thing), but I managed, and then loaded this file > into my test BioSQL database with: > > $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass > XXX --namespace biosql_test --format swiss Q9XHP0.txt > > Then I looked at the resulting description in the main bioentry table: > > $ mysql --user=root -p biosql_test -e 'SELECT description FROM > bioentry WHERE accession="Q9XHP0";' > > This is stored as one huge long string (without the newlines, I'm not > sure if BioPerl strips those in parsing the file, or when loading it > into the database): > > RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S > globulin seed storage protein II; AltName: Full=Alpha-globulin; > Contains: RecName: Full=11S globulin seed storage protein 2 acidic > chain; AltName: Full=11S globulin seed storage protein II 
acidic > chain; Contains: RecName: Full=11S globulin seed storage protein 2 > basic chain; AltName: Full=11S globulin seed storage protein II basic > chain; Flags: Precursor; > > For Biopython, I emptied the database then did: > >>>> from Bio import SeqIO >>>> from BioSQL import BioSeqDatabase >>>> server = BioSeqDatabase.open_database(driver="MySQLdb", >>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test") >>>> db = server["biosql-test"] #namespace >>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss")) > 1 >>>> server.commit() > > As before, I looked in the table with mysql. Again - this stores the > full description from the DE line, although with the newlines > embedded. So, Biopython is consistent with my old copy of BioPerl > (1.5.x) if we ignore the white space. > > However, how does this look in BioPerl 1.6? If this is the same, are > there any plans to change this? For Biopython we have discussed > recording most of the DE information under the annotations instead > (keyed off RecName, AltName, Contains, Flags), but I would like to be > consistent with BioPerl+BioSQL. 
> > Thanks > > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Sat May 16 19:16:05 2009 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 16 May 2009 18:16:05 -0500 Subject: [Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Message-ID: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > Don't you love SwissProt (or UniProt as we must call it now I > suppose). They (understandably) try to squeeze ever more annotation > into the existing tags, rather than adding new tags. > > So, of the following structure: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > really only the first line, with the 'RecName: Full=' removed, is > the description line as we know it. The rest, I would say, is > annotation, such as two alternative names, amino acid chains > contained in the full record (shouldn't this be feature annotation, > really? 
and indeed it is - why it needs to be repeated here is > beyond me) and their names as well as alternative names, and the > fact that the sequence is a precursor form. > > Leaving all this in one string has the advantage that we can round- > trip it (and there is probably hardly any other way to accomplish > that), but clearly in terms of semantics this isn't the sequence > description as we know it anymore. > > Does anyone else think too that completely changing the semantics of > sequence annotation fields is a bad idea? > > My inclination from a BioPerl perspective is to extract the part > following 'RecName: Full=' as the description, and attach the rest > as annotation. We could in fact use the TagTree class for this. I'm > cross-posting to BioPerl too to gather what other BioPerl'ers think > about this. > > -hilmar This is much like the GN issues we've run into before, and we *could* set this up using TagTree or similar. In the latter case of gene name the data is stored in a text tree as follows: gene_names: gene_name: Name: GC1QBP Synonyms: HABP1 Synonyms: SF2P32 Synonyms: C1QBP That could be changed to an XML string: GC1QBP HABP1 SF2P32 C1QBP Thinking about this we should attempt to coalesce around a standard instead of forcing the other Bio* to a specific format. 
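
As a rough sketch of the proposal quoted above (use the text after the first 'RecName: Full=' as the description and keep everything else as annotation), the split could look like this in plain Perl. The helper split_de_lines is hypothetical and is not what the BioPerl swiss parser does; it also flattens the Contains: nesting, which a TagTree-style structure would preserve.

```perl
use strict;
use warnings;

# Hypothetical helper: split new-style SwissProt DE lines into a short
# description (the first 'RecName: Full=' value) and the remaining
# entries, kept in their original order.
sub split_de_lines {
    my @lines = @_;
    my ( $description, @rest );

    for my $line (@lines) {
        ( my $text = $line ) =~ s/^DE\s+//;    # drop the line tag
        $text =~ s/\s+$//;                     # drop trailing whitespace

        if ( !defined $description
            && $text =~ /^RecName:\s*Full=(.+?);?$/ )
        {
            $description = $1;
        }
        else {
            push @rest, $text;
        }
    }
    return ( $description, \@rest );
}

# A subset of the Q9XHP0 DE lines quoted in this thread.
my ( $desc, $rest ) = split_de_lines(
    'DE   RecName: Full=11S globulin seed storage protein 2;',
    'DE   AltName: Full=11S globulin seed storage protein II;',
    'DE   AltName: Full=Alpha-globulin;',
    'DE   Flags: Precursor;',
);

print "description: $desc\n";
print "annotation:  $_\n" for @$rest;
```

Keeping the leftover lines in order means the original DE block can still be rebuilt for round-tripping, apart from the indentation under Contains:.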
chris From hlapp at gmx.net Sat May 16 19:37:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:37:14 -0400 Subject: [Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> On May 16, 2009, at 7:28 PM, Peter wrote: >> That could be changed to an XML string: >> >> >> >> >> GC1QBP >> HABP1 >> SF2P32 >> C1QBP >> >> >> >> Thinking about this we should attempt to coalesce around a standard >> instead >> of forcing the other Bio* to a specific format. > > How would you record this in BioSQL? As an XML string for an > annotation value? Yes. A TagTree object can be serialized to XML, and the XML can be stored as the annotation value in BioSQL. As the XML can be read back in, it allows full round-tripping. > Brad has suggested JSON might be useful for this kind of thing (see > also per-letter-annotation discussion). JSON could be another serialization format, but XML is equally or better supported in all languages except JavaScript. Furthermore, you could just send the XML to the browser and have an XSLT (either directly, or indirectly through JavaScript doing the transformation) do the rendering. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun May 17 11:21:59 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 11:21:59 -0400 Subject: [Bioperl-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: On May 17, 2009, at 8:40 AM, Peter wrote: > On 5/17/09, Hilmar Lapp wrote: >> >> On May 16, 2009, at 7:28 PM, Peter wrote: >>>> That could be changed to an XML string: >>>> >>>> >>>> >>>> >>>> GC1QBP >>>> HABP1 >>>> SF2P32 >>>> C1QBP >>>> >>>> >>>> >>>> Thinking about this we should attempt to coalesce around a standard >>>> instead of forcing the other Bio* to a specific format. > > [...] Here you have mapped RecName and AltName fields in the DE > lines to > Name and Synonyms (shouldn't that be Synonym singular?). The example is for the GN lines in SwissProt, not the DE lines. > [...] > On 5/17/09, Hilmar Lapp wrote: >> Not necessarily. If you have a flat serialization (such as XML) the >> nested >> structure isn't needed. Of course that's not a fully normalized >> relational >> representation, but if you had one, how often would it be used, how >> efficient would those queries be (SQL is poor at nested or >> recursive data >> structures), and how much pain would it be to write the object- >> relational >> mappings? 
> > In this example, searching the database using one of the SwissProt > AltNames (synonyms), or filtering on the Flags sounds like a > reasonable request - but this would be very difficult if the data is > stored inside XML strings. Actually no. Modern full-text indexers (inside or outside the database) can index XML text columns right away and very well. In fact, for the last project that I built a full-text search for (on top of a BioSQL database) I did that by writing custom XML documents to a separate table for each record I wanted indexed. Oracle's full text indexer did the rest. I also built a separate identifier/name/ accession index that pulled all the gene names, symbols, accession numbers, identifiers etc into a single table for indexing. What I mean is, a fully normalized relational representation, especially if nested, is often not the most efficient data structure for efficient searching and filtering. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Sun May 17 18:40:24 2009 From: cjfields at illinois.edu (Chris Fields) Date: Sun, 17 May 2009 17:40:24 -0500 Subject: [Bioperl-l] Google Summer of Code student Chase Miller In-Reply-To: <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com> Message-ID: <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu> On May 15, 2009, at 12:07 PM, Albert Vilella wrote: > Heh, I understand what you say. 
I am in a similar position from the > point that I would prefer to switch to a more modern bioperl but the > ensembl comparative genomics code -- ensembl-compara -- relies on > the ensembl-core code, which relies on bioperl 1.2.3. We could all > switch to bioperl 1.6 but I cannot switch the ensembl-compara code > if code doesn't switch as well. I haven't been very successful in > raising this issue so far, but I can try again :-p > > One of the things that has changed a lot is swissprot support > (swiss.pm). Another object that I am using a lot is SimpleAlign.pm, > which in the modern version has a lot more methods. I understand that the reasoning for requiring 1.2.3 has something to do with Bio::Annotation being too heavyweight. If that is the only impediment I think we can work something out. chris From Russell.Smithies at agresearch.co.nz Mon May 18 00:53:05 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Mon, 18 May 2009 16:53:05 +1200 Subject: [Bioperl-l] Uniprot/Swiss accessions? In-Reply-To: <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com> <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz> Does anyone know of a way to get GI numbers for Uniprot/Swissprot accessions? Fasta from Uniprot's FTP site doesn't formatdb correctly (with the -o T option) as it's missing the gi number in the fasta header. NCBI won't let you use SwissProt ids in batch-entrez and I don't want to have to look up all 466,739 of them. I could use Bio::DB::Eutilities and query each id but even at 10 queries/second (the limit changed recently) it would take too long. Any ideas? Is there a swissprot2gi list somewhere? 
Thanx, Russell Smithies Bioinformatics Applications Developer T +64 3 489 9085 E russell.smithies at agresearch.co.nz Invermay Research Centre Puddle Alley, Mosgiel, New Zealand T +64 3 489 3809 F +64 3 489 9174 www.agresearch.co.nz ======================================================================= Attention: The information contained in this message and/or attachments from AgResearch Limited is intended only for the persons or entities to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipients is prohibited by AgResearch Limited. If you have received this message in error, please notify the sender immediately. ======================================================================= From fungazid at yahoo.com Fri May 15 16:55:50 2009 From: fungazid at yahoo.com (fungazid) Date: Fri, 15 May 2009 13:55:50 -0700 (PDT) Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> References: <23559474.post@talk.nabble.com> <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> Message-ID: <23567169.post@talk.nabble.com> hilmar, I believe your suspicions are wrong. The proof: changing -format to 'Fasta' instead of 'largefasta' in: Bio::SeqIO->new(-file=> $fileIn, -format => 'Fasta') solved my problem (as was suggested, this is probably not the right method to use, but it works). Hilmar Lapp wrote: > > I think you're running up against an OS limit on the number of open > files, or the number of files in a directory. You can check (and > change) your limits with ulimit. > > The largefasta modules is designed for reading in and handling large > (like, really large - whole-chromosome scale) sequences which, if all > held in memory, would exhaust the memory either immediately or pretty > quickly.
So it stores them in temporary files. Most unix systems will > limit the number of files you can have open at any one time. > > If your sequences in that file aren't huge, largefasta isn't the > module you want to use - just use the fasta parser, or if you need > random access to sequences in the file (do you?) then Bio::DB::Fasta. > Writing sequences to temporary files is a waste of time if they fit > into memory just fine. > > The odd thing is that you actually run up to the limit. Normally the > temporary files should be closed and deleted when the sequence objects > go out of scope (I think - should verify in the code of course ...) , > so the fact that they don't lets me suspect that the code snippet that > you presented isn't all that there is to it - are you storing the > sequences somewhere in a variable, such as in an array or a hash table? > > -hilmar > > On May 15, 2009, at 9:05 AM, fungazid wrote: > >> >> Hello, >> >> I hope this is the right address for bioperl programming issues. >> Bioperl >> saves me a lot of time (not to re-invent the wheel), but there are >> some >> extremely irritating problems (I would change the code myself if I >> knew >> how). >> >> I am trying to read a file (~20MB) containing multiple fasta >> sequences: >>> a >> AGTAGTGAGTGCGCTGA......... >>> b >> GCGCTGAAGTAGTGAGT....... >>> c >> AGTAGTGAGTGCGCTGA......... >>> d........... >> >> with the following lines: >> >> my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=>$file1); >> >> LOOP1: while ( my $seqobj1 = $seqin->next_seq()) >> >> { >> ...... >> my $seq=$seqobj1->subseq(1,$seqobj1->length); >> ....... >> } >> >> >> This works right for the first ~30000 contig sequences but then the >> following message appears: >> >> Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory >> /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm >> line 744 >> DESTROY() mysql_insert obj >> destroying HANDLE >> >> What to do ??? 
(this is only one of some different Bioperl related >> bugs that >> I'm experiencing) >> >> >> >> >> -- >> View this message in context: >> http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html >> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23567169.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. From biopython at maubp.freeserve.co.uk Sat May 16 19:28:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:28:43 +0100 Subject: [Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> On 5/17/09, Chris Fields wrote: > > On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > > My inclination from a BioPerl perspective is to extract the part following > > 'RecName: Full=' as the description, and attach the rest as annotation. We > > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > > too to gather what other BioPerl'ers think about this. 
> > > > -hilmar > > > > This is much like the GN issues we've run into before, and we *could* set > this up using TagTree or similar. In the latter case of gene name the data > is stored in a text tree as follows: > > gene_names: > gene_name: > Name: GC1QBP > Synonyms: HABP1 > Synonyms: SF2P32 > Synonyms: C1QBP > > That could be changed to an XML string: > > > > > GC1QBP > HABP1 > SF2P32 > C1QBP > > > > Thinking about this we should attempt to coalesce around a standard instead > of forcing the other Bio* to a specific format. How would you record this in BioSQL? As an XML string for an annotation value? Brad has suggested JSON might be useful for this kind of thing (see also per-letter-annotation discussion). Peter From biopython at maubp.freeserve.co.uk Sun May 17 08:40:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 13:40:47 +0100 Subject: [Bioperl-l] [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> On 5/17/09, Hilmar Lapp wrote: > > On May 16, 2009, at 7:28 PM, Peter wrote: > > > That could be changed to an XML string: > > > > > > > > > > > > > > > GC1QBP > > > HABP1 > > > SF2P32 > > > C1QBP > > > > > > > > > > > > Thinking about this we should attempt to coalesce around a standard > > > instead of forcing the other Bio* to a specific format. Absolutely - some common standard should be agreed. Would you envision doing this for other structured fields, inventing a new mini XML format each time? That seems open ended and likely to cause a lot of work keeping all the Bio* project synchronised. 
Here you have mapped RecName and AltName fields in the DE lines to Name and Synonyms (shouldn't that be Synonym singular?). I also don't get why you have used a gene_name entry inside a gene_names list. Would you hold the contains information and the flags information from the DE lines in separate XML entries? I would have gone for something much closer to the original DE line markup i.e. using the field names UniProt use, RecName and AltName, rather than mapping these to Name and Synonym. > > How would you record this in BioSQL? As an XML string for an annotation > > value? > > Yes. A TagTree object can be serialized to XML, and the XML can be stored > as the annotation value in BioSQL. As the XML can be read back in, it allows > full round-tripping. Assuming you stored all the DE markup, then yes, a round trip back to the SwissProt file could be possible. And, depending on the details of the XML structure used, it would be possible to represent this in a python structure too. > > Brad has suggested JSON might be useful for this kind of thing (see > > also per-letter-annotation discussion). > > JSON could be another serialization format, but XML is equally or better > supported in all languages except JavaScript. Furthermore, you could just > send the XML to the browser and have an XSLT (either directly, or indirectly > through JavaScript doing the transformation) do the rendering. I have no strong preference for either XML or JSON (but would rather avoid them if they are not really needed). For other types of annotation there may be a clearer advantage for one over the other, e.g. per letter annotation like the secondary structure of a protein sequence, or the quality scores of a nucleotide contig. On 5/17/09, Hilmar Lapp wrote: > Not necessarily. If you have a flat serialization (such as XML) the nested > structure isn't needed. 
Of course that's not a fully normalized relational > representation, but if you had one, how often would it be used, how > efficient would those queries be (SQL is poor at nested or recursive data > structures), and how much pain would it be to write the object-relational > mappings? In this example, searching the database using one of the SwissProt AltNames (synonyms), or filtering on the Flags sounds like a reasonable request - but this would be very difficult if the data is stored inside XML strings. Of course, because the RecName and AltName entries are top level, we could just record them as normal - simple strings in the annotations table. This seems much nicer. Likewise the "Flags: Precursor;" line. i.e. listing the tag/value pairs which could be used in the bioentry_qualifier_value table: AltName = "Full=11S globulin seed storage protein II" AltName = "Full=Alpha-globulin" Flags = "Precursor" (the RecName field, "Full=11S globulin seed storage protein 2", could be used for the bioentry.description instead) The above are all pretty easy. We only need to consider nesting (or something like XML or JSON) for some of the DE information, in the example discussed the Contains lines. 
Even this could even be done by storing each contains entry as a single
long string (holding both the name and synonyms) directly from the DE
line itself, something like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;"

Peter

From hlapp at gmx.net Mon May 18 09:25:45 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 18 May 2009 09:25:45 -0400
Subject: [Bioperl-l] looks like a Bio::SeqIO error
In-Reply-To: <23567169.post@talk.nabble.com>
References: <23559474.post@talk.nabble.com>
 <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net>
 <23567169.post@talk.nabble.com>
Message-ID: <5CEB0EF7-8B43-4A44-8DD6-8CB52E0423DF@gmx.net>

Yep, as Chris wrote, there was a bug about the cleanup not being
complete. Thanks for your report and sticking with it, it helped us
identify and fix that problem.

-hilmar

On May 15, 2009, at 4:55 PM, fungazid wrote:

>
> hilmar, I believe your suspicions are wrong. The proof: changing -format to
> 'Fasta' instead of 'largefasta' in:
> Bio::SeqIO->new(-file=> $fileIn, -format => 'Fasta')
> solved my problem (as was suggested, this is probably not the right method
> to use, but it works).

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Mon May 18 09:38:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:38:03 +0100
Subject: [Bioperl-l] [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To:
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
 <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
 <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
 <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
 <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
 <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com>

On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest.
I also built a > separate identifier/name/accession index that pulled all the gene names, > symbols, accession numbers, identifiers etc into a single table for > indexing. OK, when I said searching "would be very difficult if the data is stored inside XML strings", maybe it wasn't so difficult for you - but that still sounds complicated! Sticking with the GN lines and the synonym, if this was stored as a simple tag/value as usual in BioSQL, I would write my SQL statement to search the annotation table where the term id was that associated with a GN synonym, and the annotation value was "HABP1". Simple. Using the XML approach, are you suggesting you could do a full text search on the annotation value field, looking for any rows where the field contains "HABP1", where the term id matches the GN lines' XML string? This sounds simplistic and probably rather slow - presumably why you resorted to the more complicated indexing scheme described above? > What I mean is, a fully normalized relational representation, especially if > nested, is often not the most efficient data structure for efficient > searching and filtering. OK. But do we really need to worry about complex nested structures for the SwissProt annotation (or in general)? 
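
To make the contrast in this exchange concrete, here is a minimal Perl sketch
using core Perl only and in-memory rows instead of a real BioSQL database. The
row layout, the term names gene_synonym and gene_names_xml, and the XML element
names are assumptions made up for illustration; they are not the BioSQL schema
or the actual TagTree serialization. It only illustrates why an exact tag/value
match is easier to get right than a text search inside a serialized string.

```perl
use strict;
use warnings;

# Hypothetical flat storage: one tag/value row per name or synonym.
my @flat_rows = (
    { bioentry_id => 1, term => 'gene_name',    value => 'GC1QBP' },
    { bioentry_id => 1, term => 'gene_synonym', value => 'HABP1'  },
    { bioentry_id => 1, term => 'gene_synonym', value => 'SF2P32' },
    { bioentry_id => 2, term => 'gene_name',    value => 'HABP1L' },
);

# Hypothetical alternative: the whole GN structure as one XML string per entry.
my @xml_rows = (
    { bioentry_id => 1, term => 'gene_names_xml',
      value => '<gene_names><gene_name><Name>GC1QBP</Name>'
             . '<Synonyms>HABP1</Synonyms><Synonyms>SF2P32</Synonyms>'
             . '</gene_name></gene_names>' },
    { bioentry_id => 2, term => 'gene_names_xml',
      value => '<gene_names><gene_name><Name>HABP1L</Name>'
             . '</gene_name></gene_names>' },
);

# Flat storage: exact match on both the tag and the value.
sub find_flat {
    my ($rows, $term, $value) = @_;
    return map  { $_->{bioentry_id} }
           grep { $_->{term} eq $term && $_->{value} eq $value } @$rows;
}

# XML storage, naive: plain substring search over the serialized string.
# This also hits entries where the text merely contains the query.
sub find_xml_naive {
    my ($rows, $value) = @_;
    return map  { $_->{bioentry_id} }
           grep { index($_->{value}, $value) >= 0 } @$rows;
}

# XML storage, element-aware: the query must be the full content of one
# specific element. Works here, but ties the search to the XML layout.
sub find_xml_element {
    my ($rows, $element, $value) = @_;
    my $pattern = qr{<\Q$element\E>\Q$value\E</\Q$element\E>};
    return map  { $_->{bioentry_id} }
           grep { $_->{value} =~ $pattern } @$rows;
}

my @flat  = find_flat(\@flat_rows, 'gene_synonym', 'HABP1');
my @naive = find_xml_naive(\@xml_rows, 'HABP1');
my @elem  = find_xml_element(\@xml_rows, 'Synonyms', 'HABP1');

print "flat:    @flat\n";
print "naive:   @naive\n";
print "element: @elem\n";
```

In this toy data the naive substring search also returns entry 2, because
"HABP1L" contains "HABP1"; the flat tag/value match and the element-aware
pattern both return only entry 1. A real full-text indexer would behave
differently from a substring match, so this is only a picture of the concern,
not a benchmark of either approach.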
Peter From shalabh.sharma7 at gmail.com Mon May 18 10:29:40 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Mon, 18 May 2009 10:29:40 -0400 Subject: [Bioperl-l] Parsing needle/water output In-Reply-To: <48174.86.154.46.60.1242421756.squirrel@webmail.ebi.ac.uk> References: <9fcc48c70905141253y23c0f835vc94e85ccb07d7238@mail.gmail.com> <1F1240778FB0AF46B4E5A72C44D2C7472A377482@exch1-hi.accelrys.net> <9fcc48c70905141313t1718dcd6l8106423c00ce768@mail.gmail.com> <1A4207F8295607498283FE9E93B775B405FE19DF@EX02.asurite.ad.asu.edu> <9fcc48c70905150839l29844c60tddf763e5d90d55d0@mail.gmail.com> <1242409529.21726.58.camel@emboss2.ebi.ac.uk> <9fcc48c70905151113r732544epa2d48d6c2fcc263c@mail.gmail.com> <48174.86.154.46.60.1242421756.squirrel@webmail.ebi.ac.uk> Message-ID: <9fcc48c70905180729s35f06434s504396be53105bcf@mail.gmail.com> Thanks for your valuable suggestions. -Shalabh On Fri, May 15, 2009 at 5:09 PM, wrote: > Hi Shalab, > > > is there any way i can get the full id of the sequence from the water > output. > > I checked emboss website and have a quick look into the relevant source > file but cannot find an option to adjust length of the sequence > identifiers written in alignment reports. For example prettyplot has > 'maxnamelen' option for a similar purpose, a similar option seems to be > reasonable for alignment reports. > > For now, as a workaround you can read sequence identifiers from your input > bioperl sequence objects. It seems EMBOSS doesn't change the order of > sequences. > > Regards, > Mahmut > > > From MEC at stowers.org Mon May 18 10:34:39 2009 From: MEC at stowers.org (Cook, Malcolm) Date: Mon, 18 May 2009 09:34:39 -0500 Subject: [Bioperl-l] Uniprot/Swiss accessions? 
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz> References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com> <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz> Message-ID: you could: 1) Use eutils search with -database protein -term "srcdb swiss prot"[Properties] If you use a retmax of 100000 it should only take a few seconds to download the 458,445 ginumbers. I just did it. 2) use fastacmd to extract the fasta from nr for these gis, and parse the defline. (assuming you have a copy of nr) Does this work for you? Malcolm Cook Stowers Institute for Medical Research - Kansas City, Missouri > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org > [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Smithies, Russell > Sent: Sunday, May 17, 2009 11:53 PM > To: 'BioPerl List' > Subject: [Bioperl-l] Uniprot/Swiss accessions? > > Does anyone know of a way to get GI numbers for > Uniprot/Swissprot accessions? > > Fasta from Uniprot's FTP site doesn't formatdb correctly > (with the -o T option) as it's missing the gi number in the > fasta header. > NCBI won't let you use SwissProt ids in batch-entrez and I > don't want to have to look up all 466,739 of them. > I could use Bio::DB::Eutilities and query each id but even at > 10 queries/second (the limit changed recently) it would take too long. > > Any ideas? > Is there a swissprot2gi list somewhere? > > Thanx, > > > Russell Smithies > > Bioinformatics Applications Developer > T +64 3 489 9085 > E? russell.smithies at agresearch.co.nz > > Invermay? Research Centre > Puddle Alley, > Mosgiel, > New Zealand > T? +64 3 489 3809 > F? 
+64 3 489 9174 > www.agresearch.co.nz > > > > ============================================================== > ========= > Attention: The information contained in this message and/or > attachments from AgResearch Limited is intended only for the > persons or entities to which it is addressed and may contain > confidential and/or privileged material. Any review, > retransmission, dissemination or other use of, or taking of > any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by > AgResearch Limited. If you have received this message in > error, please notify the sender immediately. > ============================================================== > ========= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From cjfields at illinois.edu Mon May 18 12:26:39 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 18 May 2009 11:26:39 -0500 Subject: [Bioperl-l] looks like a Bio::SeqIO error In-Reply-To: <23567169.post@talk.nabble.com> References: <23559474.post@talk.nabble.com> <810A96D0-CCAC-403E-A3C2-E5AA60DE0176@gmx.net> <23567169.post@talk.nabble.com> Message-ID: <1AF6C7CD-2BA4-4A3A-B3AC-CE83FEEF1C4A@illinois.edu> The problem was that both temp files and temp dirs were created for each seq, but only the files were removed (not the dirs). That's been fixed in svn. Regardless, using 'fasta' format is a better option (though I still think using Bio::DB::Fasta is best). chris On May 15, 2009, at 3:55 PM, fungazid wrote: > > hilmar, I believe your suspicions are wrong. The proof: changing - > format to > 'Fasta' instead of 'largefasta' in: > Bio::SeqIO->new(-file=> $fileIn, -format => 'Fasta') > solved my problem (as was suggested, this is probably not the right > method > to use, but it works). 
> > > Hilmar Lapp wrote: >> >> I think you're running up against an OS limit on the number of open >> files, or the number of files in a directory. You can check (and >> change) your limits with ulimit. >> >> The largefasta modules is designed for reading in and handling large >> (like, really large - whole-chromosome scale) sequences which, if all >> held in memory, would exhaust the memory either immediately or pretty >> quickly. So it stores them in temporary files. Most unix systems will >> limit the number of files you can have open at any one time. >> >> If your sequences in that file aren't huge, largefasta isn't the >> module you want to use - just use the fasta parser, or if you need >> random access to sequences in the file (do you?) then Bio::DB::Fasta. >> Writing sequences to temporary files is a waste of time if they fit >> into memory just fine. >> >> The odd thing is that you actually run up to the limit. Normally the >> temporary files should be closed and deleted when the sequence >> objects >> go out of scope (I think - should verify in the code of course ...) , >> so the fact that they don't lets me suspect that the code snippet >> that >> you presented isn't all that there is to it - are you storing the >> sequences somewhere in a variable, such as in an array or a hash >> table? >> >> -hilmar >> >> On May 15, 2009, at 9:05 AM, fungazid wrote: >> >>> >>> Hello, >>> >>> I hope this is the right address for bioperl programming issues. >>> Bioperl >>> saves me a lot of time (not to re-invent the wheel), but there are >>> some >>> extremely irritating problems (I would change the code myself if I >>> knew >>> how). >>> >>> I am trying to read a file (~20MB) containing multiple fasta >>> sequences: >>>> a >>> AGTAGTGAGTGCGCTGA......... >>>> b >>> GCGCTGAAGTAGTGAGT....... >>>> c >>> AGTAGTGAGTGCGCTGA......... >>>> d........... 
>>> >>> with the following lines: >>> >>> my $seqin = Bio::SeqIO->new('-format'=>'largefasta','-file'=> >>> $file1); >>> >>> LOOP1: while ( my $seqobj1 = $seqin->next_seq()) >>> >>> { >>> ...... >>> my $seq=$seqobj1->subseq(1,$seqobj1->length); >>> ....... >>> } >>> >>> >>> This works right for the first ~30000 contig sequences but then the >>> following message appears: >>> >>> Error in tempdir() using /tmp/XXXXXXXXXX: Could not create directory >>> /tmp/6eS92VzVjm: Too many links at /usr/share/perl5/Bio/Root/IO.pm >>> line 744 >>> DESTROY() mysql_insert obj >>> destroying HANDLE >>> >>> What to do ??? (this is only one of some different Bioperl related >>> bugs that >>> I'm experiencing) >>> >>> >>> >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23559474.html >>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. >>> >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> >> > > -- > View this message in context: http://www.nabble.com/looks-like-a-Bio%3A%3ASeqIO-error-tp23559474p23567169.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Mon May 18 16:44:17 2009 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 18 May 2009 15:44:17 -0500 Subject: [Bioperl-l] Uniprot/Swiss accessions? In-Reply-To: References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org> <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com> <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu> <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com> <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu> <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz> Message-ID: <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu> If you need to retain mapping between acc => gi it gets a little more complicated; most procedures to NCBI return a 'bag' of gi's w/o any relation to their original accession. You can grab them via esummary, though, but you'll have to iterate through them. The other option is LiveLists (has both nuc and protein acc => gi). I'm assuming this would have the swissprot accessions included (famous last words): ftp://ftp.ncbi.nih.gov/genbank/livelists/README.genbank.livelists chris On May 18, 2009, at 9:34 AM, Cook, Malcolm wrote: > you could: > > 1) Use eutils search with -database protein -term "srcdb swiss > prot"[Properties] > If you use a retmax of 100000 it should only take a few seconds to > download the 458,445 ginumbers. > I just did it. > > 2) use fastacmd to extract the fasta from nr for these gis, and > parse the defline. > (assuming you have a copy of nr) > > > Does this work for you? 
> > > Malcolm Cook > Stowers Institute for Medical Research - Kansas City, Missouri > > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of >> Smithies, Russell >> Sent: Sunday, May 17, 2009 11:53 PM >> To: 'BioPerl List' >> Subject: [Bioperl-l] Uniprot/Swiss accessions? >> >> Does anyone know of a way to get GI numbers for >> Uniprot/Swissprot accessions? >> >> Fasta from Uniprot's FTP site doesn't formatdb correctly >> (with the -o T option) as it's missing the gi number in the >> fasta header. >> NCBI won't let you use SwissProt ids in batch-entrez and I >> don't want to have to look up all 466,739 of them. >> I could use Bio::DB::Eutilities and query each id but even at >> 10 queries/second (the limit changed recently) it would take too >> long. >> >> Any ideas? >> Is there a swissprot2gi list somewhere? >> >> Thanx, >> >> >> Russell Smithies >> >> Bioinformatics Applications Developer >> T +64 3 489 9085 >> E russell.smithies at agresearch.co.nz >> >> Invermay Research Centre >> Puddle Alley, >> Mosgiel, >> New Zealand >> T +64 3 489 3809 >> F +64 3 489 9174 >> www.agresearch.co.nz >> >> >> >> ============================================================== >> ========= >> Attention: The information contained in this message and/or >> attachments from AgResearch Limited is intended only for the >> persons or entities to which it is addressed and may contain >> confidential and/or privileged material. Any review, >> retransmission, dissemination or other use of, or taking of >> any action in reliance upon, this information by persons or >> entities other than the intended recipients is prohibited by >> AgResearch Limited. If you have received this message in >> error, please notify the sender immediately. 
>> ==============================================================
>> =========
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

From MEC at stowers.org Mon May 18 17:11:40 2009
From: MEC at stowers.org (Cook, Malcolm)
Date: Mon, 18 May 2009 16:11:40 -0500
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org>
 <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com>
 <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
 <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
 <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz>
 <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
Message-ID:

Chris,

livelists, eh? Cool!

So, the gis could be obtained using eutil search, which could be
translated to accessions using livelists.

On a side note.... Do you happen to know if livelists includes refseq
identifiers/gis?

Thx,

Malcolm Cook
Stowers Institute for Medical Research - Kansas City, Missouri

> -----Original Message-----
> From: Chris Fields [mailto:cjfields at illinois.edu]
> Sent: Monday, May 18, 2009 3:44 PM
> To: Cook, Malcolm
> Cc: 'Smithies, Russell'; 'BioPerl List'
> Subject: Re: [Bioperl-l] Uniprot/Swiss accessions?
>
> If you need to retain mapping between acc => gi it gets a
> little more complicated; most procedures to NCBI return a
> 'bag' of gi's w/o any relation to their original accession.
> You can grab them via esummary, though, but you'll have to
> iterate through them.
>
> The other option is LiveLists (has both nuc and protein acc => gi).
> I'm assuming this would have the swissprot accessions > included (famous last words): > > ftp://ftp.ncbi.nih.gov/genbank/livelists/README.genbank.livelists > > chris > > > > On May 18, 2009, at 9:34 AM, Cook, Malcolm wrote: > > > you could: > > > > 1) Use eutils search with -database protein -term "srcdb swiss > > prot"[Properties] If you use a retmax of 100000 it should > only take a > > few seconds to download the 458,445 ginumbers. > > I just did it. > > > > 2) use fastacmd to extract the fasta from nr for these gis, > and parse > > the defline. > > (assuming you have a copy of nr) > > > > > > Does this work for you? > > > > > > Malcolm Cook > > Stowers Institute for Medical Research - Kansas City, Missouri > > > > > >> -----Original Message----- > >> From: bioperl-l-bounces at lists.open-bio.org > >> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of > Smithies, > >> Russell > >> Sent: Sunday, May 17, 2009 11:53 PM > >> To: 'BioPerl List' > >> Subject: [Bioperl-l] Uniprot/Swiss accessions? > >> > >> Does anyone know of a way to get GI numbers for Uniprot/Swissprot > >> accessions? > >> > >> Fasta from Uniprot's FTP site doesn't formatdb correctly > (with the -o > >> T option) as it's missing the gi number in the fasta header. > >> NCBI won't let you use SwissProt ids in batch-entrez and I > don't want > >> to have to look up all 466,739 of them. > >> I could use Bio::DB::Eutilities and query each id but even at 10 > >> queries/second (the limit changed recently) it would take too long. > >> > >> Any ideas? > >> Is there a swissprot2gi list somewhere? 
> >> > >> Thanx, > >> > >> > >> Russell Smithies > >> > >> Bioinformatics Applications Developer T +64 3 489 9085 E > >> russell.smithies at agresearch.co.nz > >> > >> Invermay Research Centre > >> Puddle Alley, > >> Mosgiel, > >> New Zealand > >> T +64 3 489 3809 > >> F +64 3 489 9174 > >> www.agresearch.co.nz > >> > >> > >> > >> ============================================================== > >> ========= > >> Attention: The information contained in this message and/or > >> attachments from AgResearch Limited is intended only for > the persons > >> or entities to which it is addressed and may contain confidential > >> and/or privileged material. Any review, retransmission, > dissemination > >> or other use of, or taking of any action in reliance upon, this > >> information by persons or entities other than the intended > recipients > >> is prohibited by AgResearch Limited. If you have received this > >> message in error, please notify the sender immediately. > >> ============================================================== > >> ========= > >> > >> _______________________________________________ > >> Bioperl-l mailing list > >> Bioperl-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > >> > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From Russell.Smithies at agresearch.co.nz Mon May 18 17:52:31 2009 From: Russell.Smithies at agresearch.co.nz (Smithies, Russell) Date: Tue, 19 May 2009 09:52:31 +1200 Subject: [Bioperl-l] Uniprot/Swiss accessions? Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz> Hi guys, Thanx for your suggestions. With the magic of awk and comm, I split the amalgamated accessions and created lists of swissprot IDs for both the file from NCBI and the file from Uniprot. 
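
A rough Perl equivalent of that comm step, as a hedged sketch: it splits two
ID lists into "only in the first", "only in the second" and "in both". The
accessions below are invented placeholders, and compare_id_lists is a
hypothetical helper name, not part of BioPerl; the real lists would be read
from the two accession files.

```perl
use strict;
use warnings;

# Compare two lists of IDs. Returns three array refs of sorted, unique IDs:
# those only in the first list, those only in the second, and those in both.
sub compare_id_lists {
    my ($list_a, $list_b) = @_;
    my %in_a = map { $_ => 1 } @$list_a;
    my %in_b = map { $_ => 1 } @$list_b;
    my @only_a = sort { $a cmp $b } grep { !$in_b{$_} } keys %in_a;
    my @only_b = sort { $a cmp $b } grep { !$in_a{$_} } keys %in_b;
    my @both   = sort { $a cmp $b } grep {  $in_b{$_} } keys %in_a;
    return (\@only_a, \@only_b, \@both);
}

# Placeholder accessions, for illustration only.
my @ncbi    = qw(P00001 P00002 P00003);
my @uniprot = qw(P00002 P00003 P00004 P00005);

my ($ncbi_only, $uniprot_only, $shared) = compare_id_lists(\@ncbi, \@uniprot);

printf "NCBI only: %d, UniProt only: %d, both: %d\n",
    scalar @$ncbi_only, scalar @$uniprot_only, scalar @$shared;
```

A quick consistency check on any such split is that the ids unique to one
list plus the shared ids add up to that list's total. The figures reported in
this message pass that check: 95 + 458,282 = 458,377 and
8,457 + 458,282 = 466,739.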
sp_ncbi_accessions.txt     458,377 ids
sp_uniprot_accessions.txt  466,739 ids

* The NCBI file has 95 ids that don't appear in the Uniprot list
* The Uniprot file has 8,457 ids that don't appear in the NCBI list
* There are 458,282 ids that appear on both lists.

I did a quick random sample of the 8,457 ids unique to Uniprot and none
could be found in the "protein" database at NCBI but all were in the
"gene" database as "reference sequences that belong to a specific genome
build" and all belonged to recently sequenced bacterial genomes. As none
are in the "protein" database, they don't have GI numbers.

The 95 ids that were at NCBI but not in Uniprot were usually (random
sample again) described as "putative protein" (or "very putative protein"
in one case) and are the result of gene predictions. Eg
http://www.ncbi.nlm.nih.gov/protein/48429254

So what I'll do is use the NCBI database and add in the extra 8,457 ids
unique to Uniprot and assign them fake GI numbers so I can formatdb them
with the " -o T" option.

Thanx again for your help,

Russell Smithies
Bioinformatics Applications Developer
T +64 3 489 9085
E russell.smithies at agresearch.co.nz
Invermay Research Centre
Puddle Alley,
Mosgiel,
New Zealand
T +64 3 489 3809
F +64 3 489 9174
www.agresearch.co.nz

Toitu te whenua, Toitu te tangata
Sustain the land, Sustain the people

=======================================================================
Attention: The information contained in this message and/or attachments
from AgResearch Limited is intended only for the persons or entities
to which it is addressed and may contain confidential and/or privileged
material. Any review, retransmission, dissemination or other use of, or
taking of any action in reliance upon, this information by persons or
entities other than the intended recipients is prohibited by AgResearch
Limited. If you have received this message in error, please notify the
sender immediately.
=======================================================================

From granjeau at tagc.univ-mrs.fr Mon May 18 17:39:07 2009
From: granjeau at tagc.univ-mrs.fr (granjeau at tagc.univ-mrs.fr)
Date: Mon, 18 May 2009 23:39:07 +0200 (CEST)
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org>
 <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com>
 <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
 <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
 <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz>
 <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
Message-ID: <3300.85.69.51.120.1242682747.squirrel@tagc.univ-mrs.fr>

Maybe you could try the PICR service at EBI
http://www.ebi.ac.uk/Tools/picr/
or some other ID converter (for example, some Gene Ontology tools) or
even SRS.

I think there could be more than one gi per sp (it's not clear to me
whether you are looking at SwissProt or UniProtKB, i.e. SP+TrEMBL).

Let us know your solution.

Regards,
Samuel

From Russell.Smithies at agresearch.co.nz Mon May 18 19:11:40 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 19 May 2009 11:11:40 +1200
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <3300.85.69.51.120.1242682747.squirrel@tagc.univ-mrs.fr>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org>
 <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com>
 <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
 <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
 <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz>
 <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
 <3300.85.69.51.120.1242682747.squirrel@tagc.univ-mrs.fr>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E0291E@exchsth.agresearch.co.nz>

As far as I can see, none of the fasta at
ftp://ftp.uniprot.org/pub/databases/uniprot_datafiles_by_format/fasta/
will correctly formatdb with the "-o T" option. This is with the latest
version of blast (2.2.20 [Feb-08-2009]).
If you formatdb uniprot_sprot.fasta or uniprot_trembl.fasta from the above
link, formatdb successfully creates the required files but the blast result
descriptions are truncated.
NCBI say it's not their fault and EBI don't answer their email.

A quick hack of prepending fake GI numbers to each accession gets the
files formatted correctly and allows sequence retrieval but it's not an
ideal solution.

--Russell

From bill at genenformics.com Mon May 18 20:11:51 2009
From: bill at genenformics.com (bill at genenformics.com)
Date: Mon, 18 May 2009 17:11:51 -0700 (PDT)
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E0291E@exchsth.agresearch.co.nz>
References: <59B4ABC0-7C98-4CD6-9629-50B2503F040E@bioperl.org>
 <358f4d650905140645x3f4f8b91ke87ecb0b783e43e6@mail.gmail.com>
 <4C7181E4-FDBC-484E-99E5-D26A98555C9A@illinois.edu>
 <358f4d650905151007o23fb95eg1c06b1df8a5257ca@mail.gmail.com>
 <6A465AFB-6D69-4B2C-9A71-CAF0761E96E0@illinois.edu>
 <18DF7D20DFEC044098A1062202F5FFF32493E027BB@exchsth.agresearch.co.nz>
 <46E6664A-ABA5-45F3-9B48-2DB39DE5BFF9@illinois.edu>
 <3300.85.69.51.120.1242682747.squirrel@tagc.univ-mrs.fr>
 <18DF7D20DFEC044098A1062202F5FFF32493E0291E@exchsth.agresearch.co.nz>
Message-ID:

The problem is that makeblastdb does not recognize the first block of
deflines:

I changed the defline from:

>UniRef50_P0C9F1 Protein MGF 100-1R n=5 Tax=African swine fever virus RepID=1001R_ASFM2

to

>sp|P0C9F1 Protein MGF 100-1R n=5 Tax=African swine fever virus RepID=1001R_ASFM2

and it works!

It seems that prefixing your protein id with 'sp|' right after '>' will
work.

Good luck!

Bill at genenformics

From bill at genenformics.com Mon May 18 20:19:53 2009
From: bill at genenformics.com (bill at genenformics.com)
Date: Mon, 18 May 2009 17:19:53 -0700 (PDT)
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz>
References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz>
Message-ID: <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com>

Hi, Smithies,

Using an integer local id should work as well.

A defline will look like '>lcl|12345 ...'

Bill

From Russell.Smithies at agresearch.co.nz Mon May 18 21:44:35 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 19 May 2009 13:44:35 +1200
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com>
References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz>
 <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz>

No, that doesn't work :-(
Here's some blast output with the database formatted with local ids:
=====================================================================
Database: uniprot_sprot.fasta
           466,739 sequences; 165,389,953 total letters

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

sp|Q4U9M9|104K_THEAN Unknown                                          421   e-117
sp|P15711|104K_THEPA Unknown                                          265   6e-70
sp|Q2SPQ2|CHED_HAHCH Unknown                                           33   4.2

 Score =  421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
 Identities = 0/209 (0%), Positives = 0/209 (0%)

Query: 1   VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60

Query: 61  QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120

Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180

Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209

===========================================================================

If I tweak the fasta and change the ids from lcl to gi and re-formatdb,
all works correctly:

===========================================================================
Query= test
         (612 letters)

Database: uniprot_sprot.fasta
           466,739 sequences; 165,389,953 total letters

Searching..................................................done

                                                                    Score    E
Sequences producing significant alignments:                         (bits) Value

sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theile...   421   e-117
sp|P15711|104K_THEPA 104 kDa microneme/rhoptry antigen OS=Theile...   265   6e-70
sp|Q2SPQ2|CHED_HAHCH Probable chemoreceptor glutamine deamidase ...    33   4.2

>sp|Q4U9M9|104K_THEAN 104 kDa microneme/rhoptry antigen OS=Theileria
           annulata GN=TA08425 PE=3 SV=1
          Length = 893

 Score =  421 bits (1083), Expect = e-117, Method: Compositional matrix adjust.
 Identities = 201/209 (96%), Positives = 201/209 (96%)

Query: 1   VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60
           VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED
Sbjct: 72  VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 131

Query: 61  QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 120
           QYLA        IHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD
Sbjct: 132 QYLARLQQLRQQIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD 191

Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 180
           KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY
Sbjct: 192 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY 251

Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209
           VATIPKLKDFAEPYHPIILDISDIDYVNF
Sbjct: 252 VATIPKLKDFAEPYHPIILDISDIDYVNF 280

============================================================================

To my mind, this is a bug in formatdb but NCBI don't see it that way.

--Russell

From bill at genenformics.com Mon May 18 23:13:19 2009
From: bill at genenformics.com (bill at genenformics.com)
Date: Mon, 18 May 2009 20:13:19 -0700 (PDT)
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz>
References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz>
 <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com>
 <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz>
Message-ID: <9c8429fa8d42621828062dcb08845c06.squirrel@mail.dreamhost.com>

I could not see the difference.

Do you follow the rules for the FASTA defline:

http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.table.632

Bill

From Russell.Smithies at agresearch.co.nz Mon May 18 23:43:20 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 19 May 2009 15:43:20 +1200
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <9c8429fa8d42621828062dcb08845c06.squirrel@mail.dreamhost.com>
References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz>
 <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com>
 <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz>
 <9c8429fa8d42621828062dcb08845c06.squirrel@mail.dreamhost.com>
Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E02ABB@exchsth.agresearch.co.nz>

There are no descriptions at the top of the blast output and no accessions
in the alignments.
The fasta is coming from UniProt so surely they know how to format files.
And it does match what NCBI require in their defline, i.e.
sp|accession|entry name

--Russell

From Russell.Smithies at agresearch.co.nz Mon May 18 23:54:41 2009
From: Russell.Smithies at agresearch.co.nz (Smithies, Russell)
Date: Tue, 19 May 2009 15:54:41 +1200
Subject: [Bioperl-l] Uniprot/Swiss accessions?
In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E02ABB@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz> <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com> <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz> <9c8429fa8d42621828062dcb08845c06.squirrel@mail.dreamhost.com> <18DF7D20DFEC044098A1062202F5FFF32493E02ABB@exchsth.agresearch.co.nz> Message-ID: <18DF7D20DFEC044098A1062202F5FFF32493E02AD4@exchsth.agresearch.co.nz> We re-installed blast version 2.2.18 and everything works perfectly. It formats the Uniprot fasta as it should and retrieves sequences with fastacmd as it should. I think we'll email NCBI and tell them they broke formatdb in their "upgrade" --Russell > -----Original Message----- > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > bounces at lists.open-bio.org] On Behalf Of Smithies, Russell > Sent: Tuesday, 19 May 2009 3:43 p.m. > To: 'bill at genenformics.com'; 'bioperl-l at lists.open-bio.org' > Subject: Re: [Bioperl-l] Uniprot/Swiss accessions? > > There's no descriptions in the top of the blast output and no accessions in > the alignments. > The fasta is coming from UniProt so surely they know how to format files. > And it does match what NCBI require in their defline i.e. sp|accession|entry > name > > --Russell > > > -----Original Message----- > > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- > > bounces at lists.open-bio.org] On Behalf Of bill at genenformics.com > > Sent: Tuesday, 19 May 2009 3:13 p.m. > > To: bioperl-l at lists.open-bio.org > > Subject: Re: [Bioperl-l] Uniprot/Swiss accessions? > > > > I could not see the difference. 
> > > > Do you follow the rules for FASTA defline: > > > > http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.table.632 > > > > Bill > > > > > > > No, that doesn't work :-( > > > Here's some blast output with the database formatted with local ids: > > > ===================================================================== > > > Database: uniprot_sprot.fasta > > > 466,739 sequences; 165,389,953 total letters > > > > > > Searching..................................................done > > > > > > > > > > > > Score > > > E > > > Sequences producing significant alignments: (bits) > > > Value > > > > > > sp|Q4U9M9|104K_THEAN Unknown 421 > > > e-117 > > > sp|P15711|104K_THEPA Unknown 265 > > > 6e-70 > > > sp|Q2SPQ2|CHED_HAHCH Unknown 33 > > > 4.2 > > > > > > > > > Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix > > > adjust. > > > Identities = 0/209 (0%), Positives = 0/209 (0%) > > > > > > Query: 1 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60 > > > > > > Query: 61 QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD > > > 120 > > > > > > Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY > > > 180 > > > > > > Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209 > > > > > > > =========================================================================== > > > > > > If I tweak the fasta and change the ids from lcl to gi and re-formatdb, > > > all works correctly: > > > > > > > =========================================================================== > > > Query= test > > > (612 letters) > > > > > > Database: uniprot_sprot.fasta > > > 466,739 sequences; 165,389,953 total letters > > > > > > Searching..................................................done > > > > > > > > > > > > %2 > > > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > 
======================================================================= > Attention: The information contained in this message and/or attachments > from AgResearch Limited is intended only for the persons or entities > to which it is addressed and may contain confidential and/or privileged > material. Any review, retransmission, dissemination or other use of, or > taking of any action in reliance upon, this information by persons or > entities other than the intended recipients is prohibited by AgResearch > Limited. If you have received this message in error, please notify the > sender immediately. > ======================================================================= > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From nandini_bn at hotmail.com Tue May 19 04:30:26 2009 From: nandini_bn at hotmail.com (nandini_bn) Date: Tue, 19 May 2009 01:30:26 -0700 (PDT) Subject: [Bioperl-l] Errors in my script Message-ID: <23611935.post@talk.nabble.com> Hi, I have written this script using Bioperl methods but it gives me a lot of errors: the script is as follows Could u please help me out? Thanks #! 
/usr/bin/perl use warnings; use strict; #Reading in an alignment file in msf format from the command line, performing an analysis of the alignment #Writing the original alignment in fasta format of sequences selected by user #Storing the consensus sequence as a new sequence object and it's annotation in genebank format #loading the necessary packages use Bio::Perl; use Bio::AlignIO; use Bio::Seq; use Bio::SeqIO; #Reading in of msf file my $in = Bio::AlignIO->new(-file => $ARGV[0] , -format => 'msf'); #The threshold value my $t = $ARGV[1]; my $aln = $in -> next_aln(); #some descriptors my $length = $aln->length(); my $residues = $aln->no_residues(); my $isflush = $aln->is_flush(); my $sequence = $aln->no_sequences(); my $identity = $aln->percentage_identity(); #Consensus sequence my $consensus = $aln->consensus_string($t); #Printing the details of the alignment print "Length: $length \n"; print "Number of residues: $residues \n"; print "Is flush: $isflush \n"; print "Number of sequences: $sequence \n"; print "Percentage of identity: $identity \n"; print "Consensus string: $consensus \n"; #Writing out the file in fasta format my $out = Bio::AlignIO->new(-file => ">$ARGV[2]", -format => 'fasta'); $out -> write_aln($aln); #All ? in the sequence replaced by a X. $consensus =~s/\?/X/g; print "\n$consensus\n"; #Making a sequence object with annotation my $seqobj = Bio::PrimarySeq->new( -seq => $consensus, -id => $identity, -organism => $length, -comment => $residues, -alphabet => 'protein'); #Writing the consensus sequence in swissprot format my $conin = Bio::SeqIO->new(-seq => $seqobj, -format => 'txt'); my $conout = Bio::SeqIO->new(-file => ">$ARGV[3]", -format => 'Genbank'); while ( my $seq = $conin->next_seq() ) { $conout->write_seq($seq); } THE ERRORS ARE Bio::SeqIO: txt cannot be found Exception ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Failed to load module Bio::SeqIO::txt. 
Can't locate Bio/SeqIO/txt.pm in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .) at /usr/share/perl5/Bio/Root/Root.pm line 425, line 82.

STACK: Error::throw
STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359
STACK: Bio::Root::Root::_load_module /usr/share/perl5/Bio/Root/Root.pm:427
STACK: Bio::SeqIO::_load_format_module /usr/share/perl5/Bio/SeqIO.pm:555
STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:376
STACK: nb175hw4.pl:60
-----------------------------------------------------------

For more information about the SeqIO system please see the SeqIO docs.
This includes ways of checking for formats at compile time, not run time
Can't call method "next_seq" on an undefined value at nb175hw4.pl line 65, line 82.

--
View this message in context: http://www.nabble.com/Errors-in-my-script-tp23611935p23611935.html
Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.

From praveen_spillai at yahoo.co.in Tue May 19 08:03:50 2009
From: praveen_spillai at yahoo.co.in (Praveen Surendran)
Date: Tue, 19 May 2009 17:33:50 +0530 (IST)
Subject: [Bioperl-l] Warning while running bioperl StandAloneBlast.
Message-ID: <420922.4996.qm@web94205.mail.in2.yahoo.com>

Hi,

I am getting a warning along with the results which says --> "[NULL_CAPTION] Warning: Failed to initialize search. ISAM Error code is -5"
Can someone please let me know what this is all about.

Please find my script below.

----------------------------------------------------------------------------------
use Bio::SeqIO;
use Bio::Tools::Run::StandAloneBlast;
open FILE, ">./blast.txt";
my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0],
                              -format => 'fasta');
my $query = $Seq_in->next_seq();
my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => 'blastn',
    'database' => 'nr',
    _READMETHOD => "Blast" );
my $blast_report = $factory->blastall($query);
my $result = $blast_report->next_result;
while( my $hit = $result->next_hit()) {
  print FILE "\thit name: ", $hit->name(), " significance: ", $hit->significance(), "\n";
  # while( my $hsp = $hit->next_hsp()) {
  # print "E: ", $hsp->evalue(), "frac_identical: ", $hsp->frac_identical(), "\n";    }}
}
close FILE
-----------------------------------------------------------------------------------

Kind Regards,

Praveen Surendran.

From shalabh.sharma7 at gmail.com Tue May 19 10:28:11 2009
From: shalabh.sharma7 at gmail.com (shalabh sharma)
Date: Tue, 19 May 2009 10:28:11 -0400
Subject: [Bioperl-l] Warning while running bioperl StandAloneBlast.
In-Reply-To: <420922.4996.qm@web94205.mail.in2.yahoo.com>
References: <420922.4996.qm@web94205.mail.in2.yahoo.com>
Message-ID: <9fcc48c70905190728x5d950e94r21211a247e29a316@mail.gmail.com>

Hi Praveen,
Your script is a bit confusing; I don't understand what you are actually trying to do.
But what I see is that you are parsing the blast report without using the SearchIO module.
You have to include Bio::SearchIO in your script.
Hope this helps.

-Shalabh Sharma

On Tue, May 19, 2009 at 8:03 AM, Praveen Surendran <praveen_spillai at yahoo.co.in> wrote:
> Hi,
>
> I am getting some warning along with results which says --> "[NULL_CAPTION]
> Warning: Failed to initialize search. ISAM Error code is -5"
> Can someone please let me know what this is all about.
>
> Please find my script below.
> > > ---------------------------------------------------------------------------------- > use Bio::SeqIO; > use Bio::Tools::Run::StandAloneBlast; > open FILE, ">./blast.txt"; > my $Seq_in = Bio::SeqIO->new (-file => $ARGV[0], > -format => 'fasta'); > my $query = $Seq_in->next_seq(); > my $factory = Bio::Tools::Run::StandAloneBlast->new( 'program' => > 'blastn', > 'database' => 'nr', > _READMETHOD => "Blast" ); > my $blast_report = $factory->blastall($query); > my $result = $blast_report->next_result; > while( my $hit = $result->next_hit()) { > print FILE "\thit name: ", $hit->name(), " significance: ", > $hit->significance(), "\n"; > # while( my $hsp = $hit->next_hsp()) { > # print "E: ", $hsp->evalue(), "frac_identical: ", > $hsp->frac_identical(), "\n"; }} > } > close FILE > > ----------------------------------------------------------------------------------- > > Kind Regards, > > Praveen Surendran. > > > Explore your hobbies and interests. Go to > http://in.promos.yahoo.com/groups/ > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From markus.liebscher at gmx.de Tue May 19 10:53:47 2009 From: markus.liebscher at gmx.de (manni122) Date: Tue, 19 May 2009 07:53:47 -0700 (PDT) Subject: [Bioperl-l] Errors in my script In-Reply-To: <23611935.post@talk.nabble.com> References: <23611935.post@talk.nabble.com> Message-ID: <23618038.post@talk.nabble.com> Hi, it seems that error comes from your line my $conin = Bio::SeqIO->new(-seq => $seqobj, -format => 'txt'); "txt" is not supported by this module. Try something else like "fasta". Markus nandini_bn wrote: > > Hi, I have written this script using Bioperl methods but it gives me a lot > of errors: the script is as follows > Could u please help me out? > Thanks > > > #! 
/usr/bin/perl > > use warnings; > use strict; > > #Reading in an alignment file in msf format from the command line, > performing an analysis of the alignment > #Writing the original alignment in fasta format of sequences selected by > user > #Storing the consensus sequence as a new sequence object and it's > annotation in genebank format > > #loading the necessary packages > use Bio::Perl; > use Bio::AlignIO; > use Bio::Seq; > use Bio::SeqIO; > > #Reading in of msf file > my $in = Bio::AlignIO->new(-file => $ARGV[0] , > -format => 'msf'); > > #The threshold value > my $t = $ARGV[1]; > my $aln = $in -> next_aln(); > > #some descriptors > my $length = $aln->length(); > my $residues = $aln->no_residues(); > my $isflush = $aln->is_flush(); > my $sequence = $aln->no_sequences(); > my $identity = $aln->percentage_identity(); > #Consensus sequence > my $consensus = $aln->consensus_string($t); > > #Printing the details of the alignment > print "Length: $length \n"; > print "Number of residues: $residues \n"; > print "Is flush: $isflush \n"; > print "Number of sequences: $sequence \n"; > print "Percentage of identity: $identity \n"; > print "Consensus string: $consensus \n"; > > #Writing out the file in fasta format > my $out = Bio::AlignIO->new(-file => ">$ARGV[2]", > -format => 'fasta'); > > $out -> write_aln($aln); > > #All ? in the sequence replaced by a X. 
> $consensus =~s/\?/X/g; > > print "\n$consensus\n"; > > #Making a sequence object with annotation > my $seqobj = Bio::PrimarySeq->new( -seq => $consensus, > -id => $identity, > -organism => $length, > -comment => $residues, > -alphabet => 'protein'); > > #Writing the consensus sequence in swissprot format > my $conin = Bio::SeqIO->new(-seq => $seqobj, > -format => 'txt'); > my $conout = Bio::SeqIO->new(-file => ">$ARGV[3]", > -format => 'Genbank'); > > while ( my $seq = $conin->next_seq() ) > { > $conout->write_seq($seq); > } > > > > > > > THE ERRORS ARE > Bio::SeqIO: txt cannot be found > Exception > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Failed to load module Bio::SeqIO::txt. Can't locate Bio/SeqIO/txt.pm > in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 > /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 > /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .) at > /usr/share/perl5/Bio/Root/Root.pm line 425, line 82. > > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359 > STACK: Bio::Root::Root::_load_module /usr/share/perl5/Bio/Root/Root.pm:427 > STACK: Bio::SeqIO::_load_format_module /usr/share/perl5/Bio/SeqIO.pm:555 > STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:376 > STACK: nb175hw4.pl:60 > ----------------------------------------------------------- > > For more information about the SeqIO system please see the SeqIO docs. > This includes ways of checking for formats at compile time, not run time > Can't call method "next_seq" on an undefined value at nb175hw4.pl line 65, > line 82. > > > > -- View this message in context: http://www.nabble.com/Errors-in-my-script-tp23611935p23618038.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
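A note on the failing line, with a minimal sketch. Changing 'txt' to another format name removes the "cannot be found" exception, but Bio::SeqIO streams are opened on a file, filehandle, or string rather than on a sequence object, so the intermediate input stream is not needed at all: a sequence object that already exists can be handed straight to write_seq() on the output stream. The sketch below assumes a working BioPerl install; the consensus string, the display ID, the description, and the output file name are placeholders for illustration and are not taken from the original script.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

# Placeholder standing in for the consensus string computed from the alignment
my $consensus = 'VHKVVEGDIVIWENEEMPLYTCAIVTQNEV';
my $outfile   = 'consensus.gb';    # hypothetical output file name

# Bio::Seq is used here (rather than Bio::PrimarySeq) because the
# GenBank writer expects an object implementing Bio::SeqI
my $seqobj = Bio::Seq->new(
    -seq        => $consensus,
    -display_id => 'consensus',
    -desc       => 'Consensus of the input alignment',
    -alphabet   => 'protein',
);

# The leading '>' opens the file for writing
my $conout = Bio::SeqIO->new(-file => ">$outfile", -format => 'genbank');
$conout->write_seq($seqobj);
$conout->close;
```

The design choice is simply to skip the read step: next_seq() is only needed when sequences come from a file or string, not when the object was built in the same script.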
From Kevin.M.Brown at asu.edu Tue May 19 11:15:50 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Tue, 19 May 2009 08:15:50 -0700 Subject: [Bioperl-l] Errors in my script In-Reply-To: <23611935.post@talk.nabble.com> References: <23611935.post@talk.nabble.com> Message-ID: <1A4207F8295607498283FE9E93B775B405FE1E13@EX02.asurite.ad.asu.edu> This is the line you are erroring on. > my $conin = Bio::SeqIO->new(-seq => $seqobj, > -format => 'txt'); > STACK: nb175hw4.pl:60 Hmm... Nb175hw4??? Sounds like this is a script for a homework assignment. From Kevin.M.Brown at asu.edu Tue May 19 11:50:15 2009 From: Kevin.M.Brown at asu.edu (Kevin Brown) Date: Tue, 19 May 2009 08:50:15 -0700 Subject: [Bioperl-l] Warning while running bioperl StandAloneBlast. In-Reply-To: <9fcc48c70905190728x5d950e94r21211a247e29a316@mail.gmail.com> References: <420922.4996.qm@web94205.mail.in2.yahoo.com> <9fcc48c70905190728x5d950e94r21211a247e29a316@mail.gmail.com> Message-ID: <1A4207F8295607498283FE9E93B775B405FE1E24@EX02.asurite.ad.asu.edu> > Hi Praveen, > Your script is bit confusing, i don't > understand what you > actually trying to do. > But what i see is that you are parsing blast report without > using SearchIO > module. > You have to include Bio::SearchIO in your script. > Hope this helps. Actually, he doesn't as it is included from the factory that is running blast and returning the report. > > Hi, > > > > I am getting some warning along with results which says --> > "[NULL_CAPTION] > > Warning: Failed to initialize search. ISAM Error code is -5" > > Can someone please let me know what this is all about. This sounds like you are running into a database error, not a perl script error. Something might be wrong with your blast database. Yep, quick google search verifies that ISAM errors are a database problem. From maj at fortinbras.us Tue May 19 15:31:48 2009 From: maj at fortinbras.us (Mark A. 
Jensen)
Date: Tue, 19 May 2009 15:31:48 -0400
Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation
Message-ID:

Hi All-

With the frequent posts concerning HSP tilings, I thought it was time to create the sought-after Bio::Search::Tiling namespace, and attempt to provide a robust and exact tiling algorithm. I think it's timely, too, since Jason's usual remarks involve the use of wu-blast with the --links option, and wu-blast has recently turned commercial and is evidently costly to obtain.

The namespace includes an abstract interface B:S:Tiling::TilingI, and a concrete class called B:S:Tiling::MapTiling. The object is constructed like so

    $tiling = Bio::Search::Tiling::MapTiling->new($your_blast_hit);

and provides methods for identities(), conserved(), and length(); other stats could also be provided. Identities and conserved sites are correctly estimated, accounting for multiple overlapping HSPs.

There is also a method next_tiling($type), where $type is 'hit', 'subject' (alias for 'hit'), or 'query', which is an iterator stepping through all minimal sets of HSPs that completely cover the 'hit' or 'query' sequence. One feature is that the individual tilings do not need to be generated to estimate the statistics; next_tiling provides the individual tilings only if you want/need them.

I've made it available in a pre-alpha state on bioperl-dev. It's working and workable with plenty of pod: see the synopses. It would be excellent if interested folks would try it out on their favorite data. Some niceties are not yet implemented, so BLASTP data is your best bet for success. Check it out via svn into a separate working directory, and let me know if there are any questions.

Below is a table of comparison numbers using the current SearchUtils tiling implementation and some of the new methods, on some test data in t/data. Please see pod for many more details.
Cheers,
Mark

*****

Comparison of methods with (patched) Bio::Search::SearchUtils
using test data t/data/dcr1_sp.WUBLASTP

SU: SearchUtils
MT: MapTiling, using methods 'exact', 'est', 'max'
so MT(q:x) is MapTiling, stats calculated on the query, with the exact method, etc.

Identities

Hit                             SU  MT(q:x)  MT(q:e)  MT(q:m)  MT(s:x)  MT(s:e)  MT(s:m)
sp|P34529.2|DCR1_CAEEL     1845.00  1845.00  1845.00  1845.00  1845.00  1845.00  1845.00
sp|Q9VCU9.1|DCR1_DROME      668.00   664.50   666.26   668.00   678.00   678.00   678.00
sp|Q9UPY3.2|DICER_HUMAN     706.00   706.00   706.00   706.00   706.00   706.00   706.00
sp|Q8R418.2|DICER_MOUSE     698.00   698.00   698.00   698.00   698.00   698.00   698.00
sp|P84634.1|DCL4_ARATH      341.00   341.00   341.00   341.00   341.00   341.00   341.00
sp|Q7S8J7.1|DCL1_NEUCR      403.00   402.00   403.47   403.00   374.17   371.65   379.00
sp|A4RKC3.2|DCL1_MAGGR      331.00   333.00   333.54   337.00   347.50   348.51   348.00
sp|Q0CW42.2|DCL1_ASPTN      387.00   387.50   388.07   388.00   381.00   382.67   389.00
sp|Q1DKI1.2|DCL1_COCIM      331.00   335.00   334.44   339.00   337.50   336.98   341.00
sp|A2RAF3.2|DCL1_ASPNC      282.00   277.50   279.39   282.00   289.00   289.00   289.00
sp|Q09884.1|DCR1_SCHPO      314.00   314.00   314.60   316.00   319.33   318.59   328.00
sp|A1CBC9.2|DCL1_ASPCL      343.00   338.50   341.51   343.00   333.50   332.13   340.00
sp|A1DE13.1|DCL1_NEOFI      339.00   342.50   343.06   346.00   348.00   348.45   351.00
sp|Q2VF19.1|DCL1_CRYPA      284.00   284.00   284.78   284.00   288.00   289.00   290.00
sp|Q0UI93.2|DCL1_PHANO      366.00   366.00   366.00   366.00   355.00   356.36   356.00
sp|Q4WVE3.3|DCL1_ASPFU      313.00   310.00   314.71   315.00   328.00   328.45   331.00
sp|Q2U6C4.2|DCL1_ASPOR      325.00   325.00   325.00   325.00   325.00   325.00   325.00
sp|Q2UNX5.1|DCL2_ASPOR      282.00   282.50   282.74   283.00   275.50   275.86   276.00
sp|Q4WA22.2|DCL2_ASPFU      319.00   319.00   318.95   319.00   320.00   320.00   320.00

Conserved Sites ('Positives')

Hit                             SU  MT(q:x)  MT(q:e)  MT(q:m)  MT(s:x)  MT(s:e)  MT(s:m)
sp|P34529.2|DCR1_CAEEL     1845.00  1845.00  1845.00  1845.00  1845.00  1845.00  1845.00
sp|Q9VCU9.1|DCR1_DROME      993.00   991.50   991.84   993.00  1010.00  1010.00  1010.00
sp|Q9UPY3.2|DICER_HUMAN    1011.00  1011.00  1011.00  1011.00  1011.00  1011.00  1011.00
sp|Q8R418.2|DICER_MOUSE    1005.00  1005.00  1005.00  1005.00  1005.00  1005.00  1005.00
sp|P84634.1|DCL4_ARATH      518.00   518.00   518.00   518.00   518.00   518.00   518.00
sp|Q7S8J7.1|DCL1_NEUCR      659.00   659.00   660.70   659.00   602.33   602.65   609.00
sp|A4RKC3.2|DCL1_MAGGR      535.00   538.00   538.59   543.00   563.50   564.61   564.00
sp|Q0CW42.2|DCL1_ASPTN      616.00   622.00   619.70   628.00   608.50   611.79   613.00
sp|Q1DKI1.2|DCL1_COCIM      548.00   552.50   550.64   557.00   554.00   554.47   557.00
sp|A2RAF3.2|DCL1_ASPNC      445.00   441.00   441.96   445.00   457.00   457.00   457.00
sp|Q09884.1|DCR1_SCHPO      509.00   502.50   503.94   509.00   508.67   511.01   522.00
sp|A1CBC9.2|DCL1_ASPCL      550.00   542.00   541.71   550.00   525.17   525.26   534.00
sp|A1DE13.1|DCL1_NEOFI      516.00   517.00   518.34   518.00   518.00   518.91   521.00
sp|Q2VF19.1|DCL1_CRYPA      464.00   462.50   463.91   464.00   470.00   471.74   472.00
sp|Q0UI93.2|DCL1_PHANO      633.00   633.00   633.00   633.00   612.50   611.97   613.00
sp|Q4WVE3.3|DCL1_ASPFU      484.00   481.00   485.86   484.00   500.00   500.51   503.00
sp|Q2U6C4.2|DCL1_ASPOR      515.00   515.00   515.00   515.00   515.00   515.00   515.00
sp|Q2UNX5.1|DCL2_ASPOR      473.00   474.00   473.07   475.00   462.50   462.57   463.00
sp|Q4WA22.2|DCL2_ASPFU      529.00   529.50   530.12   530.00   532.00   532.00   532.00

From nandini_bn at hotmail.com Tue May 19 12:54:58 2009
From: nandini_bn at hotmail.com (nandini_bn)
Date: Tue, 19 May 2009 09:54:58 -0700 (PDT)
Subject: [Bioperl-l] Errors in my script
In-Reply-To: <23618096.post@talk.nabble.com>
References: <23611935.post@talk.nabble.com> <23618096.post@talk.nabble.com>
Message-ID: <23620286.post@talk.nabble.com>

Thanks a lot for ur reply. I tried changing the format to a flag. That worked, but I am still getting these errors. Could u help me out with these?
Thank u ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: Could not open s.Genbank: No such file or directory STACK: Error::throw STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359 STACK: Bio::Root::IO::_initialize_io /usr/share/perl5/Bio/Root/IO.pm:310 STACK: Bio::SeqIO::_initialize /usr/share/perl5/Bio/SeqIO.pm:454 STACK: Bio::SeqIO::genbank::_initialize /usr/share/perl5/Bio/SeqIO/genbank.pm:202 STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:351 STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:377 STACK: Bio::SeqIO::newFh /usr/share/perl5/Bio/SeqIO.pm:398 manni122 wrote: > > Hi, > it seems that error comes from your line > > my $conin = Bio::SeqIO->new(-seq => $seqobj, > -format => 'txt'); > > "txt" is not supported by this module. Try something else like "fasta". > > Markus > > > nandini_bn wrote: >> >> Hi, I have written this script using Bioperl methods but it gives me a >> lot of errors: the script is as follows >> Could u please help me out? >> Thanks >> >> >> #! 
/usr/bin/perl >> >> use warnings; >> use strict; >> >> #Reading in an alignment file in msf format from the command line, >> performing an analysis of the alignment >> #Writing the original alignment in fasta format of sequences selected by >> user >> #Storing the consensus sequence as a new sequence object and it's >> annotation in genebank format >> >> #loading the necessary packages >> use Bio::Perl; >> use Bio::AlignIO; >> use Bio::Seq; >> use Bio::SeqIO; >> >> #Reading in of msf file >> my $in = Bio::AlignIO->new(-file => $ARGV[0] , >> -format => 'msf'); >> >> #The threshold value >> my $t = $ARGV[1]; >> my $aln = $in -> next_aln(); >> >> #some descriptors >> my $length = $aln->length(); >> my $residues = $aln->no_residues(); >> my $isflush = $aln->is_flush(); >> my $sequence = $aln->no_sequences(); >> my $identity = $aln->percentage_identity(); >> #Consensus sequence >> my $consensus = $aln->consensus_string($t); >> >> #Printing the details of the alignment >> print "Length: $length \n"; >> print "Number of residues: $residues \n"; >> print "Is flush: $isflush \n"; >> print "Number of sequences: $sequence \n"; >> print "Percentage of identity: $identity \n"; >> print "Consensus string: $consensus \n"; >> >> #Writing out the file in fasta format >> my $out = Bio::AlignIO->new(-file => ">$ARGV[2]", >> -format => 'fasta'); >> >> $out -> write_aln($aln); >> >> #All ? in the sequence replaced by a X. 
>> $consensus =~s/\?/X/g; >> >> print "\n$consensus\n"; >> >> #Making a sequence object with annotation >> my $seqobj = Bio::PrimarySeq->new( -seq => $consensus, >> -id => $identity, >> -organism => $length, >> -comment => $residues, >> -alphabet => 'protein'); >> >> #Writing the consensus sequence in swissprot format >> my $conin = Bio::SeqIO->new(-seq => $seqobj, >> -format => 'txt'); >> my $conout = Bio::SeqIO->new(-file => ">$ARGV[3]", >> -format => 'Genbank'); >> >> while ( my $seq = $conin->next_seq() ) >> { >> $conout->write_seq($seq); >> } >> >> >> >> >> >> >> THE ERRORS ARE >> Bio::SeqIO: txt cannot be found >> Exception >> ------------- EXCEPTION: Bio::Root::Exception ------------- >> MSG: Failed to load module Bio::SeqIO::txt. Can't locate Bio/SeqIO/txt.pm >> in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 >> /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 >> /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .) at >> /usr/share/perl5/Bio/Root/Root.pm line 425, line 82. >> >> STACK: Error::throw >> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359 >> STACK: Bio::Root::Root::_load_module >> /usr/share/perl5/Bio/Root/Root.pm:427 >> STACK: Bio::SeqIO::_load_format_module /usr/share/perl5/Bio/SeqIO.pm:555 >> STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:376 >> STACK: nb175hw4.pl:60 >> ----------------------------------------------------------- >> >> For more information about the SeqIO system please see the SeqIO docs. >> This includes ways of checking for formats at compile time, not run time >> Can't call method "next_seq" on an undefined value at nb175hw4.pl line >> 65, line 82. >> >> >> >> > > -- View this message in context: http://www.nabble.com/Errors-in-my-script-tp23611935p23620286.html Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
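A note on the new exception, with a small sketch. "Could not open s.Genbank: No such file or directory" is the kind of failure an open-for-reading produces when the file does not exist yet. The -file argument of Bio::SeqIO follows Perl's open() conventions: a bare name is opened for reading, and a leading '>' opens it for writing. The changed script is not shown in the thread, so the exact cause is an assumption here. The sketch below uses core Perl only; the helper names write_spec and check_input are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an output file name into a write-mode
# file spec of the form accepted by -file (leading '>').
sub write_spec {
    my ($path) = @_;
    die "No output file name given\n" unless defined $path && length $path;
    return $path =~ /^>/ ? $path : ">$path";
}

# Hypothetical helper: make sure an input file exists and is readable
# before any stream is opened on it.
sub check_input {
    my ($path) = @_;
    die "No input file name given\n" unless defined $path && length $path;
    die "Cannot read input file '$path'\n" unless -r $path;
    return $path;
}

# Runs only when the four command-line arguments of the original
# script (alignment, threshold, fasta output, GenBank output) are given.
if (@ARGV >= 4) {
    my $in = check_input($ARGV[0]);
    my $gb = write_spec($ARGV[3]);
    print "reading $in, writing with file spec $gb\n";
}
```

With these helpers, the value returned by write_spec($ARGV[3]) would be what gets passed as -file when creating the GenBank output stream.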
From emsch at its.caltech.edu Tue May 19 15:54:54 2009 From: emsch at its.caltech.edu (Erich Schwarz) Date: Tue, 19 May 2009 12:54:54 -0700 (PDT) Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) Message-ID: On Tue, 19 May 2009, Mark A. Jensen wrote: > Jason's usual remarks involve the use of wu-blast with the --links > option, and wu-blast has recently turned commercial and is > evidently costly to obtain. Is AB-BLAST (the putative successor to WU-BLAST) obtainable *anywhere* on any terms? The only information I've been able to get has been this rather unhelpful site: http://www.advbiocomp.com/ http://www.advbiocomp.com/faq.html whose FAQ has been claiming availability "soon" for the last half year. I wouldn't care, except that [1] I belatedly moved my heavier computing jobs from an i686 Linux box to an x86_64 megabox, and [2] a repeat-finding package I will probably need to be running fairly often in 2009-2010: http://www.repeatmasker.org/RepeatModeler.html relies on WU-BLAST i686 binaries which don't work terribly well on x86_64, and for which I cannot get x86_64 replacements. Sorry to go on a tangent, but I'd appreciate any information anybody has about getting WU-BLAST effectively replaced. --Erich From cjfields at illinois.edu Tue May 19 16:30:52 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 19 May 2009 15:30:52 -0500 Subject: [Bioperl-l] Errors in my script In-Reply-To: <23620286.post@talk.nabble.com> References: <23611935.post@talk.nabble.com> <23618096.post@talk.nabble.com> <23620286.post@talk.nabble.com> Message-ID: <29440385-0381-4F1C-A11B-F05D027527B4@illinois.edu> As pointed out by Kevin, this is very likely a homework problem. You also notably removed the last line from the exception output below indicating where the problem derived from. We don't help with homework (for the obvious reasons). 
chris On May 19, 2009, at 11:54 AM, nandini_bn wrote: > > Thanks a lot for ur reply. I tried changing the format to a flag. > that worked > but I am still getting these errors. Could u help me out with these? > Thank u > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: Could not open s.Genbank: No such file or directory > STACK: Error::throw > STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359 > STACK: Bio::Root::IO::_initialize_io /usr/share/perl5/Bio/Root/IO.pm: > 310 > STACK: Bio::SeqIO::_initialize /usr/share/perl5/Bio/SeqIO.pm:454 > STACK: Bio::SeqIO::genbank::_initialize > /usr/share/perl5/Bio/SeqIO/genbank.pm:202 > STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:351 > STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:377 > STACK: Bio::SeqIO::newFh /usr/share/perl5/Bio/SeqIO.pm:398 > > > manni122 wrote: >> >> Hi, >> it seems that error comes from your line >> >> my $conin = Bio::SeqIO->new(-seq => $seqobj, >> -format => 'txt'); >> >> "txt" is not supported by this module. Try something else like >> "fasta". >> >> Markus >> >> >> nandini_bn wrote: >>> >>> Hi, I have written this script using Bioperl methods but it gives >>> me a >>> lot of errors: the script is as follows >>> Could u please help me out? >>> Thanks >>> >>> >>> #! 
/usr/bin/perl >>> >>> use warnings; >>> use strict; >>> >>> #Reading in an alignment file in msf format from the command line, >>> performing an analysis of the alignment >>> #Writing the original alignment in fasta format of sequences >>> selected by >>> user >>> #Storing the consensus sequence as a new sequence object and it's >>> annotation in genebank format >>> >>> #loading the necessary packages >>> use Bio::Perl; >>> use Bio::AlignIO; >>> use Bio::Seq; >>> use Bio::SeqIO; >>> >>> #Reading in of msf file >>> my $in = Bio::AlignIO->new(-file => $ARGV[0] , >>> -format => 'msf'); >>> >>> #The threshold value >>> my $t = $ARGV[1]; >>> my $aln = $in -> next_aln(); >>> >>> #some descriptors >>> my $length = $aln->length(); >>> my $residues = $aln->no_residues(); >>> my $isflush = $aln->is_flush(); >>> my $sequence = $aln->no_sequences(); >>> my $identity = $aln->percentage_identity(); >>> #Consensus sequence >>> my $consensus = $aln->consensus_string($t); >>> >>> #Printing the details of the alignment >>> print "Length: $length \n"; >>> print "Number of residues: $residues \n"; >>> print "Is flush: $isflush \n"; >>> print "Number of sequences: $sequence \n"; >>> print "Percentage of identity: $identity \n"; >>> print "Consensus string: $consensus \n"; >>> >>> #Writing out the file in fasta format >>> my $out = Bio::AlignIO->new(-file => ">$ARGV[2]", >>> -format => 'fasta'); >>> >>> $out -> write_aln($aln); >>> >>> #All ? in the sequence replaced by a X. 
>>> $consensus =~s/\?/X/g; >>> >>> print "\n$consensus\n"; >>> >>> #Making a sequence object with annotation >>> my $seqobj = Bio::PrimarySeq->new( -seq => $consensus, >>> -id => $identity, >>> -organism => $length, >>> -comment => $residues, >>> -alphabet => 'protein'); >>> >>> #Writing the consensus sequence in swissprot format >>> my $conin = Bio::SeqIO->new(-seq => $seqobj, >>> -format => 'txt'); >>> my $conout = Bio::SeqIO->new(-file => ">$ARGV[3]", >>> -format => 'Genbank'); >>> >>> while ( my $seq = $conin->next_seq() ) >>> { >>> $conout->write_seq($seq); >>> } >>> >>> >>> >>> >>> >>> >>> THE ERRORS ARE >>> Bio::SeqIO: txt cannot be found >>> Exception >>> ------------- EXCEPTION: Bio::Root::Exception ------------- >>> MSG: Failed to load module Bio::SeqIO::txt. Can't locate Bio/SeqIO/ >>> txt.pm >>> in @INC (@INC contains: /etc/perl /usr/local/lib/perl/5.8.8 >>> /usr/local/share/perl/5.8.8 /usr/lib/perl5 /usr/share/perl5 >>> /usr/lib/perl/5.8 /usr/share/perl/5.8 /usr/local/lib/site_perl .) at >>> /usr/share/perl5/Bio/Root/Root.pm line 425, line 82. >>> >>> STACK: Error::throw >>> STACK: Bio::Root::Root::throw /usr/share/perl5/Bio/Root/Root.pm:359 >>> STACK: Bio::Root::Root::_load_module >>> /usr/share/perl5/Bio/Root/Root.pm:427 >>> STACK: Bio::SeqIO::_load_format_module /usr/share/perl5/Bio/ >>> SeqIO.pm:555 >>> STACK: Bio::SeqIO::new /usr/share/perl5/Bio/SeqIO.pm:376 >>> STACK: nb175hw4.pl:60 >>> ----------------------------------------------------------- >>> >>> For more information about the SeqIO system please see the SeqIO >>> docs. >>> This includes ways of checking for formats at compile time, not >>> run time >>> Can't call method "next_seq" on an undefined value at nb175hw4.pl >>> line >>> 65, line 82. >>> >>> >>> >>> >> >> > > -- > View this message in context: http://www.nabble.com/Errors-in-my-script-tp23611935p23620286.html > Sent from the Perl - Bioperl-L mailing list archive at Nabble.com. 
> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From schwarz at tenaya.caltech.edu Tue May 19 16:30:10 2009 From: schwarz at tenaya.caltech.edu (Erich Schwarz) Date: Tue, 19 May 2009 13:30:10 -0700 (PDT) Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) Message-ID: On Tue, 19 May 2009, Mark A. Jensen wrote: > Jason's usual remarks involve the use of wu-blast with the --links > option, and wu-blast has recently turned commercial and is > evidently costly to obtain. Is AB-BLAST (the putative successor to WU-BLAST) obtainable *anywhere* on any terms? The only information I've been able to get has been this rather unhelpful site: http://www.advbiocomp.com/ http://www.advbiocomp.com/faq.html whose FAQ has been claiming availability "soon" for the last half year. I wouldn't care, except that [1] I belatedly moved my heavier computing jobs from an i686 Linux box to an x86_64 megabox, and [2] a repeat-finding package I will probably need to be running fairly often in 2009-2010: http://www.repeatmasker.org/RepeatModeler.html relies on WU-BLAST i686 binaries which don't work terribly well on x86_64, and for which I cannot get x86_64 replacements. Sorry to go on a tangent, but I'd appreciate any information anybody has about getting WU-BLAST effectively replaced. --Erich From sac at bioperl.org Tue May 19 18:21:25 2009 From: sac at bioperl.org (Steve Chervitz) Date: Tue, 19 May 2009 15:21:25 -0700 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: References: Message-ID: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> Mark, Great work. My SearchUtils tiling function has been lingering for far too long (at least a decade). Your comment about BLASTP is fitting. 
I was working almost exclusively with BLASTP when developing the original tiling function and it seems like the trouble ensued when using it with other blast flavors. There was insufficient exploration of blast alignment edge cases. It would be good to come up with a comprehensive collection of blast reports to stress test your tiling impl. The set currently in t/data is a good start, but may not be sufficient.

Cheers,
Steve

On Tue, May 19, 2009 at 12:31 PM, Mark A. Jensen wrote:
> Hi All-
>
> With the frequent posts concerning HSP tilings, I thought it was time
> to create the sought-after Bio::Search::Tiling namespace, and attempt
> to provide a robust and exact tiling algorithm. I think it's timely,
> too, since Jason's usual remarks involve the use of wu-blast with
> the --links option, and wu-blast has recently turned commercial and
> is evidently costly to obtain.
>
> The namespace includes an abstract interface B:S:Tiling::TilingI, and
> a concrete class called B:S:Tiling::MapTiling. The object is
> constructed like so
>
>   $tiling = Bio::Search::Tiling::MapTiling($your_blast_hit);
>
> and provides methods for identities(), conserved(), and length();
> other stats could also be provided. Identities and conserved sites are
> correctly estimated, accounting for multiple overlapping HSPs. There
> is also a method next_tiling($type), where $type is 'hit', 'subject'
> (alias for 'hit'), or 'query', which an iterator stepping through all
> minimal sets of HSPs that completely cover the 'hit' or 'query'
> sequence. One feature is that the individual tilings do not need to be
> generated to estimate the statistics; next_tiling provides the individual
> tilings only if you want/need them.
>
> I've made it available in a pre-alpha state on bioperl-dev. It's
> working and workable with plenty pod: see the synopses. It would
> be excellent if interested folks would try it out on their favorite
> data. Some niceties are not yet implemented, so BLASTP data is your
> best bet for success. Check it out via svn into a separate working
> directory, let me know if there are any questions.
>
> Below is table of comparison numbers using the current SearchUtils
> tiling implementation and some of the new methods, on some test data
> in t/data. Please see pod for many more details.
>
> Cheers,
> Mark
>
>
> *****
> Comparision of methods with (patched) Bio::Search::SearchUtils
> using test data t/data/dcr1_sp.WUBLASTP
>
> SU: SearchUtils
> MT: MapTiling, using methods 'exact', 'est', 'max'
> so MT(q:x) is MapTiling, stats calculated on the query, with the exact
> method, etc.
>
> Identities
> Hit                            SU  MT(q:x)  MT(q:e)  MT(q:m)  MT(s:x)  MT(s:e)  MT(s:m)
> sp|P34529.2|DCR1_CAEEL    1845.00  1845.00  1845.00  1845.00  1845.00  1845.00  1845.00
> sp|Q9VCU9.1|DCR1_DROME     668.00   664.50   666.26   668.00   678.00   678.00   678.00
> sp|Q9UPY3.2|DICER_HUMAN    706.00   706.00   706.00   706.00   706.00   706.00   706.00
> sp|Q8R418.2|DICER_MOUSE    698.00   698.00   698.00   698.00   698.00   698.00   698.00
> sp|P84634.1|DCL4_ARATH     341.00   341.00   341.00   341.00   341.00   341.00   341.00
> sp|Q7S8J7.1|DCL1_NEUCR     403.00   402.00   403.47   403.00   374.17   371.65   379.00
> sp|A4RKC3.2|DCL1_MAGGR     331.00   333.00   333.54   337.00   347.50   348.51   348.00
> sp|Q0CW42.2|DCL1_ASPTN     387.00   387.50   388.07   388.00   381.00   382.67   389.00
> sp|Q1DKI1.2|DCL1_COCIM     331.00   335.00   334.44   339.00   337.50   336.98   341.00
> sp|A2RAF3.2|DCL1_ASPNC     282.00   277.50   279.39   282.00   289.00   289.00   289.00
> sp|Q09884.1|DCR1_SCHPO     314.00   314.00   314.60   316.00   319.33   318.59   328.00
> sp|A1CBC9.2|DCL1_ASPCL     343.00   338.50   341.51   343.00   333.50   332.13   340.00
> sp|A1DE13.1|DCL1_NEOFI     339.00   342.50   343.06   346.00   348.00   348.45   351.00
> sp|Q2VF19.1|DCL1_CRYPA     284.00   284.00   284.78   284.00   288.00   289.00   290.00
> sp|Q0UI93.2|DCL1_PHANO     366.00   366.00   366.00   366.00   355.00   356.36   356.00
> sp|Q4WVE3.3|DCL1_ASPFU     313.00   310.00   314.71   315.00   328.00   328.45   331.00
> sp|Q2U6C4.2|DCL1_ASPOR     325.00   325.00   325.00   325.00   325.00   325.00   325.00
> sp|Q2UNX5.1|DCL2_ASPOR     282.00   282.50   282.74   283.00   275.50   275.86   276.00
> sp|Q4WA22.2|DCL2_ASPFU     319.00   319.00   318.95   319.00   320.00   320.00   320.00
>
> Conserved Sites ('Positives')
> Hit                            SU  MT(q:x)  MT(q:e)  MT(q:m)  MT(s:x)  MT(s:e)  MT(s:m)
> sp|P34529.2|DCR1_CAEEL    1845.00  1845.00  1845.00  1845.00  1845.00  1845.00  1845.00
> sp|Q9VCU9.1|DCR1_DROME     993.00   991.50   991.84   993.00  1010.00  1010.00  1010.00
> sp|Q9UPY3.2|DICER_HUMAN   1011.00  1011.00  1011.00  1011.00  1011.00  1011.00  1011.00
> sp|Q8R418.2|DICER_MOUSE   1005.00  1005.00  1005.00  1005.00  1005.00  1005.00  1005.00
> sp|P84634.1|DCL4_ARATH     518.00   518.00   518.00   518.00   518.00   518.00   518.00
> sp|Q7S8J7.1|DCL1_NEUCR     659.00   659.00   660.70   659.00   602.33   602.65   609.00
> sp|A4RKC3.2|DCL1_MAGGR     535.00   538.00   538.59   543.00   563.50   564.61   564.00
> sp|Q0CW42.2|DCL1_ASPTN     616.00   622.00   619.70   628.00   608.50   611.79   613.00
> sp|Q1DKI1.2|DCL1_COCIM     548.00   552.50   550.64   557.00   554.00   554.47   557.00
> sp|A2RAF3.2|DCL1_ASPNC     445.00   441.00   441.96   445.00   457.00   457.00   457.00
> sp|Q09884.1|DCR1_SCHPO     509.00   502.50   503.94   509.00   508.67   511.01   522.00
> sp|A1CBC9.2|DCL1_ASPCL     550.00   542.00   541.71   550.00   525.17   525.26   534.00
> sp|A1DE13.1|DCL1_NEOFI     516.00   517.00   518.34   518.00   518.00   518.91   521.00
> sp|Q2VF19.1|DCL1_CRYPA     464.00   462.50   463.91   464.00   470.00   471.74   472.00
> sp|Q0UI93.2|DCL1_PHANO     633.00   633.00   633.00   633.00   612.50   611.97   613.00
> sp|Q4WVE3.3|DCL1_ASPFU     484.00   481.00   485.86   484.00   500.00   500.51   503.00
> sp|Q2U6C4.2|DCL1_ASPOR     515.00   515.00   515.00   515.00   515.00   515.00   515.00
> sp|Q2UNX5.1|DCL2_ASPOR     473.00   474.00   473.07   475.00   462.50   462.57   463.00
> sp|Q4WA22.2|DCL2_ASPFU     529.00   529.50   530.12   530.00   532.00   532.00   532.00
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>
From maj at fortinbras.us Tue May 19 19:16:55 2009
From: maj at fortinbras.us (Mark A.
Jensen) Date: Tue, 19 May 2009 19:16:55 -0400 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> Message-ID: Thanks Steve--great idea. It would be great if users who have had any issues with Bioperl tiling (or any other algorithm, for that matter) on particular datasets would send them along. I will enter an enhancement bug report for this purpose; folks can attach their problem data to it. MAJ (P.S. to all; there are also some rudimentary run tests at bioperl-dev/trunk/t/SearchIO/Tiling.t) ----- Original Message ----- From: "Steve Chervitz" To: "Mark A. Jensen" Cc: "BioPerl List" Sent: Tuesday, May 19, 2009 6:21 PM Subject: Re: [Bioperl-l] experimental Bio::Search::Tiling implementation Mark, Great work. My SearchUtils tiling function has been lingering for far too long (at least a decade). Your comment about BLASTP is fitting. I was working almost exclusively with BLASTP when developing the original tiling function and it seems like the trouble ensued when using it with other blast flavors. There was insufficient exploration of blast alignment edge cases. It would be good to come up with a comprehensive collection of blast reports to stress test your tiling impl. The set currently in t/data is a good start, but may not be sufficient. Cheers, Steve On Tue, May 19, 2009 at 12:31 PM, Mark A. Jensen wrote: > Hi All- > > With the frequent posts concerning HSP tilings, I thought it was time > to create the sought-after Bio::Search::Tiling namespace, and attempt > to provide a robust and exact tiling algorithm. I think it's timely, > too, since Jason's usual remarks involve the use of wu-blast with > the --links option, and wu-blast has recently turned commercial and > is evidently costly to obtain. 
> > The namespace includes an abstract interface B:S:Tiling::TilingI, and > a concrete class called B:S:Tiling::MapTiling. The object is > constructed like so > > $tiling = Bio::Search::Tiling::MapTiling($your_blast_hit); > > and provides methods for identities(), conserved(), and length(); > other stats could also be provided. Identities and conserved sites are > correctly estimated, accounting for multiple overlapping HSPs. There > is also a method next_tiling($type), where $type is 'hit', 'subject' > (alias for 'hit'), or 'query', which an iterator stepping through all > minimal sets of HSPs that completely cover the 'hit' or 'query' > sequence. One feature is that the individual tilings do not need to be > generated to estimate the statistics; next_tiling provides the individual > tilings only if you want/need them. > > I've made it available in a pre-alpha state on bioperl-dev. It's > working and workable with plenty pod: see the synopses. It would > be excellent if interested folks would try it out on their favorite > data. Some niceties are not yet implemented, so BLASTP data is your > best bet for success. Check it out via svn into a separate working > directory, let me know if there are any questions. > > Below is table of comparison numbers using the current SearchUtils > tiling implementation and some of the new methods, on some test data > in t/data. Please see pod for many more details. > > Cheers, > Mark > > > ***** > Comparision of methods with (patched) Bio::Search::SearchUtils > using test data t/data/dcr1_sp.WUBLASTP > > SU: SearchUtils > MT: MapTiling, using methods 'exact', 'est', 'max' > so MT(q:x) is MapTiling, stats calculated on the query, with the exact > method, etc. > > Identities > Hit SU MT(q:x) MT(q:e) MT(q:m) MT(s:x) MT(s:e) MT(s:m) ... 
> Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > _______________________________________________ Bioperl-l mailing list Bioperl-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/bioperl-l From maj at fortinbras.us Tue May 19 19:32:10 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Tue, 19 May 2009 19:32:10 -0400 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> Message-ID: <2D053EA003BA451BB07A1B75CB75C835@NewLife> [and here it is: http://bugzilla.bioperl.org/show_bug.cgi?id=2830 ] ----- Original Message ----- From: "Mark A. Jensen" To: "Steve Chervitz" ; "BioPerl List" Sent: Tuesday, May 19, 2009 7:16 PM Subject: Re: [Bioperl-l] experimental Bio::Search::Tiling implementation > Thanks Steve--great idea. It would be great if users who have had any issues > with Bioperl tiling (or any other algorithm, for that matter) > on particular datasets would send them along. I will enter an enhancement bug > report for this purpose; folks can attach their problem data to it. MAJ > (P.S. to all; there are also some rudimentary run tests at > bioperl-dev/trunk/t/SearchIO/Tiling.t) > > ----- Original Message ----- > From: "Steve Chervitz" > To: "Mark A. Jensen" > Cc: "BioPerl List" > Sent: Tuesday, May 19, 2009 6:21 PM > Subject: Re: [Bioperl-l] experimental Bio::Search::Tiling implementation > > > Mark, > > Great work. My SearchUtils tiling function has been lingering for far > too long (at least a decade). > > Your comment about BLASTP is fitting. I was working almost exclusively > with BLASTP when developing the original tiling function and it seems > like the trouble ensued when using it with other blast flavors. There > was insufficient exploration of blast alignment edge cases. 
It would > be good to come up with a comprehensive collection of blast reports to > stress test your tiling impl. The set currently in t/data is a good > start, but may not be sufficient. > > Cheers, > Steve > > On Tue, May 19, 2009 at 12:31 PM, Mark A. Jensen wrote: >> Hi All- >> >> With the frequent posts concerning HSP tilings, I thought it was time >> to create the sought-after Bio::Search::Tiling namespace, and attempt >> to provide a robust and exact tiling algorithm. I think it's timely, >> too, since Jason's usual remarks involve the use of wu-blast with >> the --links option, and wu-blast has recently turned commercial and >> is evidently costly to obtain. >> >> The namespace includes an abstract interface B:S:Tiling::TilingI, and >> a concrete class called B:S:Tiling::MapTiling. The object is >> constructed like so >> >> $tiling = Bio::Search::Tiling::MapTiling($your_blast_hit); >> >> and provides methods for identities(), conserved(), and length(); >> other stats could also be provided. Identities and conserved sites are >> correctly estimated, accounting for multiple overlapping HSPs. There >> is also a method next_tiling($type), where $type is 'hit', 'subject' >> (alias for 'hit'), or 'query', which an iterator stepping through all >> minimal sets of HSPs that completely cover the 'hit' or 'query' >> sequence. One feature is that the individual tilings do not need to be >> generated to estimate the statistics; next_tiling provides the individual >> tilings only if you want/need them. >> >> I've made it available in a pre-alpha state on bioperl-dev. It's >> working and workable with plenty pod: see the synopses. It would >> be excellent if interested folks would try it out on their favorite >> data. Some niceties are not yet implemented, so BLASTP data is your >> best bet for success. Check it out via svn into a separate working >> directory, let me know if there are any questions. 
>> >> Below is table of comparison numbers using the current SearchUtils >> tiling implementation and some of the new methods, on some test data >> in t/data. Please see pod for many more details. >> >> Cheers, >> Mark >> >> >> ***** >> Comparision of methods with (patched) Bio::Search::SearchUtils >> using test data t/data/dcr1_sp.WUBLASTP >> >> SU: SearchUtils >> MT: MapTiling, using methods 'exact', 'est', 'max' >> so MT(q:x) is MapTiling, stats calculated on the query, with the exact >> method, etc. >> >> Identities >> Hit SU MT(q:x) MT(q:e) MT(q:m) MT(s:x) MT(s:e) MT(s:m) > ... >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From bix at sendu.me.uk Tue May 19 20:07:03 2009 From: bix at sendu.me.uk (Sendu Bala) Date: Wed, 20 May 2009 01:07:03 +0100 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> Message-ID: <4A1349A7.50603@sendu.me.uk> Steve Chervitz wrote: > Great work. My SearchUtils tiling function has been lingering for far > too long (at least a decade). I actually reworked it a little from revisions r10199 to r11247. Mark: tricky.wublast and frac_problems.blast were the test files that had problems that needed solving. The corresponding tests were in SearchIO.t but have since (hopefully!) been moved to another test file somewhere. 
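As background to the tiling discussion in this thread: when HSPs overlap on the query or the hit, adding up per-HSP lengths (or identities) counts the overlapping positions more than once. The sketch below illustrates only that coverage problem in plain Perl. It is not the MapTiling algorithm and not the SearchUtils code; the helper names merge_ranges and covered_length are hypothetical, and the HSP coordinates are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): merge overlapping or
# adjacent closed intervals [start, end] and return the merged list.
sub merge_ranges {
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $r (@sorted) {
        if ( @merged && $r->[0] <= $merged[-1][1] + 1 ) {
            # Overlaps (or abuts) the previous block: extend it if needed.
            $merged[-1][1] = $r->[1] if $r->[1] > $merged[-1][1];
        }
        else {
            # Starts a new block; copy so the caller's data is untouched.
            push @merged, [@$r];
        }
    }
    return @merged;
}

# Hypothetical helper: number of positions covered by at least one range.
sub covered_length {
    my $total = 0;
    $total += $_->[1] - $_->[0] + 1 for merge_ranges(@_);
    return $total;
}

# Made-up HSP coordinates on the query, purely for illustration.
my @hsps = ( [ 1, 100 ], [ 80, 150 ], [ 300, 350 ] );

my $naive = 0;
$naive += $_->[1] - $_->[0] + 1 for @hsps;

printf "naive sum of HSP lengths:   %d\n", $naive;
printf "positions actually covered: %d\n", covered_length(@hsps);
```

With these coordinates the naive sum is 222 and the covered length is 201, because positions 80 to 100 belong to two HSPs. Deciding how to count identities inside such doubly covered stretches is a separate question that this sketch does not address.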
From cjfields at illinois.edu Tue May 19 21:14:36 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 19 May 2009 20:14:36 -0500 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: <4A1349A7.50603@sendu.me.uk> References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> <4A1349A7.50603@sendu.me.uk> Message-ID: <36FE2174-B63C-48B3-8DF1-9744666D5E7E@illinois.edu> On May 19, 2009, at 7:07 PM, Sendu Bala wrote: > Steve Chervitz wrote: >> Great work. My SearchUtils tiling function has been lingering for far >> too long (at least a decade). > > I actually reworked it a little from revisions r10199 to r11247. > > Mark: tricky.wublast and frac_problems.blast were the test files > that had problems that needed solving. > > The corresponding tests were in SearchIO.t but have since > (hopefully!) been moved to another test file somewhere. They should be in t/SearchIO/blast.t; the SearchIO tests were split up based on format. Was there a bug linked to these that need to be merged with Mark's? Mark: great work! I suppose the next step is to try refactoring the relevant methods in HSPI's to see what happens. chris From maj at fortinbras.us Tue May 19 21:27:27 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Tue, 19 May 2009 21:27:27 -0400 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: <36FE2174-B63C-48B3-8DF1-9744666D5E7E@illinois.edu> References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com> <4A1349A7.50603@sendu.me.uk> <36FE2174-B63C-48B3-8DF1-9744666D5E7E@illinois.edu> Message-ID: <2EA754E64FE448E58488E6B226EC55FC@NewLife> Great, guys -- thanks-- Yesterday I committed a fix (r15689) that took care of problems with tile_hsps()'s handling of gapped alignments. This may have solved some of the old problems. I will have a go at the data Sendu mentions. (I didn't go through all 1093 tests in blast.t; will look more closely...) 
Will also try to make representations of the HSPI methods that are generalizable, and use these to flesh out the TilingI interface (create aliases back to HSPI, e.g.) cheers- Mark ----- Original Message ----- From: "Chris Fields" To: "Sendu Bala" Cc: "Steve Chervitz" ; "BioPerl List" ; "Mark A. Jensen" Sent: Tuesday, May 19, 2009 9:14 PM Subject: Re: [Bioperl-l] experimental Bio::Search::Tiling implementation > On May 19, 2009, at 7:07 PM, Sendu Bala wrote: > >> Steve Chervitz wrote: >>> Great work. My SearchUtils tiling function has been lingering for far >>> too long (at least a decade). >> >> I actually reworked it a little from revisions r10199 to r11247. >> >> Mark: tricky.wublast and frac_problems.blast were the test files that had >> problems that needed solving. >> >> The corresponding tests were in SearchIO.t but have since (hopefully!) been >> moved to another test file somewhere. > > They should be in t/SearchIO/blast.t; the SearchIO tests were split up based > on format. Was there a bug linked to these that need to be merged with > Mark's? > > Mark: great work! I suppose the next step is to try refactoring the relevant > methods in HSPI's to see what happens. > > chris > > > From rmb32 at cornell.edu Tue May 19 23:48:55 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Tue, 19 May 2009 20:48:55 -0700 Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) In-Reply-To: References: Message-ID: <4A137DA7.4080905@cornell.edu> Are the developers of RepeatMasker and RepeatModeler aware of this availability problem? Perhaps they have some ideas? I'm CCing R. Hubley, whose email is listed first in the RepeatModeler source file headers. Rob -- Robert Buels Bioinformatics Analyst, Sol Genomics Network Boyce Thompson Institute for Plant Research Tower Rd Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu Erich Schwarz wrote: > On Tue, 19 May 2009, Mark A. 
Jensen wrote: > >> Jason's usual remarks involve the use of wu-blast with the --links >> option, and wu-blast has recently turned commercial and is >> evidently costly to obtain. > > Is AB-BLAST (the putative successor to WU-BLAST) obtainable > *anywhere* on any terms? The only information I've been able to get > has been this rather unhelpful site: > > http://www.advbiocomp.com/ > http://www.advbiocomp.com/faq.html > > whose FAQ has been claiming availability "soon" for the last half > year. > > I wouldn't care, except that [1] I belatedly moved my heavier > computing jobs from an i686 Linux box to an x86_64 megabox, and [2] > a repeat-finding package I will probably need to be running fairly > often in 2009-2010: > > http://www.repeatmasker.org/RepeatModeler.html > > relies on WU-BLAST i686 binaries which don't work terribly well on > x86_64, and for which I cannot get x86_64 replacements. > > Sorry to go on a tangent, but I'd appreciate any information > anybody has about getting WU-BLAST effectively replaced. > > > --Erich > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From schwarz at tenaya.caltech.edu Wed May 20 00:06:25 2009 From: schwarz at tenaya.caltech.edu (Erich Schwarz) Date: Tue, 19 May 2009 21:06:25 -0700 (PDT) Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) In-Reply-To: <4A137DA7.4080905@cornell.edu> References: <4A137DA7.4080905@cornell.edu> Message-ID: On Tue, 19 May 2009, Robert Buels wrote: > Are the developers of RepeatMasker and RepeatModeler aware of this > availability problem? Perhaps they have some ideas? Now that you point it out, I had taken it for granted that they already knew, but shouldn't have assumed that. So, thanks for cc:ing Dr. Hubley! (This is a somewhat non-BioPerl-ish topic -- thanks for everybody's collective patience with that. 
On the other hand, given how prevalent a dependency WU-BLAST is for a lot of bioinformatics Perl code, maybe this isn't *so* off-topic after all...) --Erich From maj at fortinbras.us Wed May 20 07:46:43 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Wed, 20 May 2009 07:46:43 -0400 Subject: [Bioperl-l] experimental Bio::Search::Tiling implementation In-Reply-To: <36FE2174-B63C-48B3-8DF1-9744666D5E7E@illinois.edu> References: <8f200b4c0905191521o1680df6cyf634559625f95485@mail.gmail.com><4A1349A7.50603@sendu.me.uk> <36FE2174-B63C-48B3-8DF1-9744666D5E7E@illinois.edu> Message-ID: <13E8884FC3474B6B9C1853EBF79D21C9@NewLife> Re: refactor-- I'll have a go at the relevant Bio::Search::[Hit | HSP] modules and put the modified files in the right places in bioperl-dev. (BTW: development may slow down a tad, due to $job, but never fear.) ----- Original Message ----- From: "Chris Fields" To: "Sendu Bala" Cc: "BioPerl List" ; "Mark A. Jensen" Sent: Tuesday, May 19, 2009 9:14 PM Subject: Re: [Bioperl-l] experimental Bio::Search::Tiling implementation > On May 19, 2009, at 7:07 PM, Sendu Bala wrote: > >> Steve Chervitz wrote: >>> Great work. My SearchUtils tiling function has been lingering for far >>> too long (at least a decade). >> >> I actually reworked it a little from revisions r10199 to r11247. >> >> Mark: tricky.wublast and frac_problems.blast were the test files that had >> problems that needed solving. >> >> The corresponding tests were in SearchIO.t but have since (hopefully!) been >> moved to another test file somewhere. > > They should be in t/SearchIO/blast.t; the SearchIO tests were split up based > on format. Was there a bug linked to these that need to be merged with > Mark's? > > Mark: great work! I suppose the next step is to try refactoring the relevant > methods in HSPI's to see what happens. 
> > chris > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From cjfields at illinois.edu Wed May 20 08:23:58 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 20 May 2009 07:23:58 -0500 Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) In-Reply-To: References: <4A137DA7.4080905@cornell.edu> Message-ID: <1C8DAFE4-8EF8-4B88-97C7-092D07BF911D@illinois.edu> On May 19, 2009, at 11:06 PM, Erich Schwarz wrote: > On Tue, 19 May 2009, Robert Buels wrote: > >> Are the developers of RepeatMasker and RepeatModeler aware of this >> availability problem? Perhaps they have some ideas? > > Now that you point it out, I had taken it for granted that they > already knew, but shouldn't have assumed that. So, thanks for > cc:ing Dr. Hubley! > > (This is a somewhat non-BioPerl-ish topic -- thanks for > everybody's collective patience with that. On the other hand, given > how prevalent a dependency WU-BLAST is for a lot of bioinformatics > Perl code, maybe this isn't *so* off-topic after all...) > > > --Erich There does seem to be a preference for WU-BLAST over BLAST in many cases, so I think it's relevant. I hate to see it go closed-source and commercial. chris From shalabh.sharma7 at gmail.com Wed May 20 09:05:39 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Wed, 20 May 2009 09:05:39 -0400 Subject: [Bioperl-l] Accession to locus Message-ID: <9fcc48c70905200605x7bd2e73dyc4e9783fef632995@mail.gmail.com> Hi All, Is there anyway i can get locus from accession number? Thanks Shalabh From jay at jays.net Wed May 20 10:00:17 2009 From: jay at jays.net (Jay Hannah) Date: Wed, 20 May 2009 09:00:17 -0500 Subject: [Bioperl-l] YAPC::NA 2009 Message-ID: Going to YAPC::NA? Want to hack on some BioPerl? Me too! http://yapc10.org/yn2009/wiki?node=Hackathon I'll buy you a beer/sandwich. 
:) Jay Hannah http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah From maj at fortinbras.us Wed May 20 11:13:48 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Wed, 20 May 2009 11:13:48 -0400 Subject: [Bioperl-l] YAPC::NA 2009 In-Reply-To: References: Message-ID: How about a beer sandwich? ----- Original Message ----- From: "Jay Hannah" To: Sent: Wednesday, May 20, 2009 10:00 AM Subject: [Bioperl-l] YAPC::NA 2009 > Going to YAPC::NA? Want to hack on some BioPerl? Me too! > > http://yapc10.org/yn2009/wiki?node=Hackathon > > I'll buy you a beer/sandwich. :) > > Jay Hannah > http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From shalabh.sharma7 at gmail.com Wed May 20 11:27:18 2009 From: shalabh.sharma7 at gmail.com (shalabh sharma) Date: Wed, 20 May 2009 11:27:18 -0400 Subject: [Bioperl-l] GenBank entries Message-ID: <9fcc48c70905200827y53bcda13ub4dd55c502241826@mail.gmail.com> Hi, Is there any way i can get a complete genBank entry for a corresponding accession number? I have a list of Accession number and i want locus_tag for those ids. I am thinking of getting complete genBank entry for those ids and then parse them to get locus_tag. Is there any easy way to do it (like directly accession to locus_tag. Thanks Shalabh From cjfields at illinois.edu Wed May 20 11:40:54 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 20 May 2009 10:40:54 -0500 Subject: [Bioperl-l] YAPC::NA 2009 In-Reply-To: References: Message-ID: Might be kinda soggy... chris On May 20, 2009, at 10:13 AM, Mark A. Jensen wrote: > How about a beer sandwich? > > ----- Original Message ----- From: "Jay Hannah" > To: > Sent: Wednesday, May 20, 2009 10:00 AM > Subject: [Bioperl-l] YAPC::NA 2009 > > >> Going to YAPC::NA? Want to hack on some BioPerl? Me too! 
>> http://yapc10.org/yn2009/wiki?node=Hackathon >> I'll buy you a beer/sandwich. :) >> Jay Hannah >> http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From maj at fortinbras.us Wed May 20 11:51:59 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Wed, 20 May 2009 11:51:59 -0400 Subject: [Bioperl-l] GenBank entries In-Reply-To: <9fcc48c70905200827y53bcda13ub4dd55c502241826@mail.gmail.com> References: <9fcc48c70905200827y53bcda13ub4dd55c502241826@mail.gmail.com> Message-ID: <1B32556442C84B01BCFDA10C84929C6E@NewLife> Shalabh- I think you want to look at http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook specif. http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#Get_GIs_for_a_list_of_accessions and then http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#Get_accessions_.28as_well_as_other_information.29_for_a_list_of_GIs MAJ ----- Original Message ----- From: "shalabh sharma" To: "bioperl-l" Sent: Wednesday, May 20, 2009 11:27 AM Subject: [Bioperl-l] GenBank entries > Hi, > Is there any way i can get a complete genBank entry for a corresponding > accession number? > I have a list of Accession number and i want locus_tag for those ids. I am > thinking of getting complete genBank entry for those ids and then parse them > to get locus_tag. > Is there any easy way to do it (like directly accession to locus_tag. 
> > Thanks > Shalabh > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From schwarz at tenaya.caltech.edu Wed May 20 13:40:00 2009 From: schwarz at tenaya.caltech.edu (Erich Schwarz) Date: Wed, 20 May 2009 10:40:00 -0700 (PDT) Subject: [Bioperl-l] the binary formerly known as WU-BLAST (Was: experimental Bio::Search::Tiling implementation) In-Reply-To: <4A143C88.4050901@systemsbiology.org> References: <4A137DA7.4080905@cornell.edu> <4A143C88.4050901@systemsbiology.org> Message-ID: On Wed, 20 May 2009, Robert Hubley wrote: > I have been working on fixing some incompatibilities in > RepeatMasker/RepeatModeler/RepeatProteinMask etc.. and should have > something out soon ( probably in a week or so ). Great! > Warren [Gish] assures me that you can still get ABBlast free ( or > cheaply ) under an academic license. Where?!? Not here: http://www.advbiocomp.com/blast.html http://www.advbiocomp.com/faq.html which is still saying, after half a year, that AB-BLAST will be available Real Soon Now... --Erich From bernd.web at gmail.com Wed May 20 15:19:25 2009 From: bernd.web at gmail.com (Bernd Web) Date: Wed, 20 May 2009 21:19:25 +0200 Subject: [Bioperl-l] Uniprot/Swiss accessions? In-Reply-To: <18DF7D20DFEC044098A1062202F5FFF32493E02AD4@exchsth.agresearch.co.nz> References: <18DF7D20DFEC044098A1062202F5FFF32493E0289E@exchsth.agresearch.co.nz> <0727310c9364ae23501cf64293d15209.squirrel@mail.dreamhost.com> <18DF7D20DFEC044098A1062202F5FFF32493E02A06@exchsth.agresearch.co.nz> <9c8429fa8d42621828062dcb08845c06.squirrel@mail.dreamhost.com> <18DF7D20DFEC044098A1062202F5FFF32493E02ABB@exchsth.agresearch.co.nz> <18DF7D20DFEC044098A1062202F5FFF32493E02AD4@exchsth.agresearch.co.nz> Message-ID: <716af09c0905201219k6394e201y7699bf9a8abb24fc@mail.gmail.com> Hi Russel, Thanks for posting this issue. I have the same problem with formatdb from 2.2.19. 
When using my older, still-installed 2.2.17, everything was fine again :) So you were lucky to revert to 2.2.18. Regards, Bernd On Tue, May 19, 2009 at 5:54 AM, Smithies, Russell wrote: > We re-installed blast version 2.2.18 and everything works perfectly. > It formats the Uniprot fasta as it should and retrieves sequences with fastacmd as it should. > > I think we'll email NCBI and tell them they broke formatdb in their "upgrade" > > --Russell > >> -----Original Message----- >> From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> bounces at lists.open-bio.org] On Behalf Of Smithies, Russell >> Sent: Tuesday, 19 May 2009 3:43 p.m. >> To: 'bill at genenformics.com'; 'bioperl-l at lists.open-bio.org' >> Subject: Re: [Bioperl-l] Uniprot/Swiss accessions? >> >> There's no descriptions in the top of the blast output and no accessions in >> the alignments. >> The fasta is coming from UniProt so surely they know how to format files. >> And it does match what NCBI require in their defline i.e. sp|accession|entry >> name >> >> --Russell >> >> > -----Original Message----- >> > From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l- >> > bounces at lists.open-bio.org] On Behalf Of bill at genenformics.com >> > Sent: Tuesday, 19 May 2009 3:13 p.m. >> > To: bioperl-l at lists.open-bio.org >> > Subject: Re: [Bioperl-l] Uniprot/Swiss accessions? >> > >> > I could not see the difference.
>> > >> > Do you follow the rules for FASTA defline: >> > >> > http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=handbook.table.632 >> > >> > Bill >> > >> > >> > > No, that doesn't work :-( >> > > Here's some blast output with the database formatted with local ids: >> > > ===================================================================== >> > > Database: uniprot_sprot.fasta >> > > 466,739 sequences; 165,389,953 total letters >> > > >> > > Searching..................................................done >> > > >> > > >> > > >> > > Score >> > > E >> > > Sequences producing significant alignments: (bits) >> > > Value >> > > >> > > sp|Q4U9M9|104K_THEAN Unknown 421 >> > > e-117 >> > > sp|P15711|104K_THEPA Unknown 265 >> > > 6e-70 >> > > sp|Q2SPQ2|CHED_HAHCH Unknown 33 >> > > 4.2 >> > > >> > > >> > > Score = 421 bits (1083), Expect = e-117, Method: Compositional matrix >> > > adjust. >> > > Identities = 0/209 (0%), Positives = 0/209 (0%) >> > > >> > > Query: 1 VHKVVEGDIVIWENEEMPLYTCAIVTQNEVPYMAYVELLEDPDLIFFLKEGDQWAPIPED 60 >> > > >> > > Query: 61 QYLAXXXXXXXXIHTESFFSLNLSFQHENYKYEMVSSFQHSIKMVVFTPKNGHICKMVYD >> > > 120 >> > > >> > > Query: 121 KNIRIFKALYNEYVTSVIGFFRGLKLLLLNIFVIDDRGMIGNKYFQLLDDKYAPISVQGY >> > > 180 >> > > >> > > Query: 181 VATIPKLKDFAEPYHPIILDISDIDYVNF 209 >> > > >> > > >> =========================================================================== >> > > >> > > If I tweak the fasta and change the ids from lcl to gi and re-formatdb, >> > > all works correctly: >> > > >> > > >> =========================================================================== >> > > Query= test >> > > (612 letters) >> > > >> > > Database: uniprot_sprot.fasta >> > > 466,739 sequences; 165,389,953 total letters >> > > >> > > Searching..................................................done >> > > >> > > >> > > >> > > %2 >> > >> > _______________________________________________ >> > Bioperl-l mailing list >> > Bioperl-l at lists.open-bio.org >> > 
http://lists.open-bio.org/mailman/listinfo/bioperl-l >> ======================================================================= >> Attention: The information contained in this message and/or attachments >> from AgResearch Limited is intended only for the persons or entities >> to which it is addressed and may contain confidential and/or privileged >> material. Any review, retransmission, dissemination or other use of, or >> taking of any action in reliance upon, this information by persons or >> entities other than the intended recipients is prohibited by AgResearch >> Limited. If you have received this message in error, please notify the >> sender immediately. >> ======================================================================= >> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rmb32 at cornell.edu Wed May 20 16:02:40 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 20 May 2009 13:02:40 -0700 Subject: [Bioperl-l] YAPC::NA 2009 In-Reply-To: References: Message-ID: <4A1461E0.80306@cornell.edu> Erm, so you guys didn't answer his question. I'm going, anyway. Would love to meet you all and see if I can make myself useful. Rob -- Robert Buels Bioinformatics Analyst, Sol Genomics Network Boyce Thompson Institute for Plant Research Tower Rd Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu Jay Hannah wrote: > Going to YAPC::NA? Want to hack on some BioPerl? Me too! > > http://yapc10.org/yn2009/wiki?node=Hackathon > > I'll buy you a beer/sandwich. 
:) > > Jay Hannah > http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > From rmb32 at cornell.edu Wed May 20 16:11:31 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 20 May 2009 13:11:31 -0700 Subject: [Bioperl-l] YAPC::NA 2009 In-Reply-To: <4A1461E0.80306@cornell.edu> References: <4A1461E0.80306@cornell.edu> Message-ID: <4A1463F3.2090106@cornell.edu> Also, I would like a beer sandwich. R Robert Buels wrote: > Erm, so you guys didn't answer his question. > > I'm going, anyway. Would love to meet you all and see if I can make > myself useful. > > Rob > -- Robert Buels Bioinformatics Analyst, Sol Genomics Network Boyce Thompson Institute for Plant Research Tower Rd Ithaca, NY 14853 Tel: 503-889-8539 rmb32 at cornell.edu http://www.sgn.cornell.edu From cjfields at illinois.edu Wed May 20 16:20:53 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 20 May 2009 15:20:53 -0500 Subject: [Bioperl-l] YAPC::NA 2009 In-Reply-To: <4A1461E0.80306@cornell.edu> References: <4A1461E0.80306@cornell.edu> Message-ID: <00698523-8FD6-417E-8A62-F1ABF321E2BF@illinois.edu> I'm likely not going due to $job (though if I do, now I know to aim for June 25-26). chris On May 20, 2009, at 3:02 PM, Robert Buels wrote: > Erm, so you guys didn't answer his question. > > I'm going, anyway. Would love to meet you all and see if I can make > myself useful. > > Rob > > -- > Robert Buels > Bioinformatics Analyst, Sol Genomics Network > Boyce Thompson Institute for Plant Research > Tower Rd > Ithaca, NY 14853 > Tel: 503-889-8539 > rmb32 at cornell.edu > http://www.sgn.cornell.edu > > Jay Hannah wrote: >> Going to YAPC::NA? Want to hack on some BioPerl? Me too! >> http://yapc10.org/yn2009/wiki?node=Hackathon >> I'll buy you a beer/sandwich. 
:) >> Jay Hannah >> http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From hlapp at duke.edu Thu May 21 16:00:54 2009 From: hlapp at duke.edu (Hilmar Lapp) Date: Thu, 21 May 2009 16:00:54 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> Message-ID: Moving this question to the BioPerl list, which is where we need to discuss this, I think. Can someone refresh my memory on what the Bioperl-dev repository is or was meant for? It doesn't seem documented on the wiki. My (admittedly vague) recollection is that bioperl-dev is basically for highly experimental changes or functionality. I'm not clear why everything else shouldn't go either into the main trunk or into a branch. If there is a realistic expectation for something to be folded into the main trunk sooner or later, what would be the reasons for not putting it into a branch of the main repository? If we are putting it into a separate repository, we're waiving a lot of svn's support for merging and resolving concurrent edits. I would actually go a step further and suggest that even if this GSoC project starts out on a branch (which I can see good reasons for, such as eliminating the fear of disrupting something), there should be a plan to move to the main trunk before the end of the project. We've had a good tradition in BioPerl of developing directly on the main trunk.
It sometimes leads to disruptions when lots of tests seem to be failing, but it also encourages development discipline and makes new code melt into the BioPerl code base without requiring any extra steps by someone. Any and all thoughts or comments welcome and appreciated! -hilmar On May 21, 2009, at 11:26 AM, Chase Miller wrote: > This brings me to a question about where I should have my code > repository. Originally, I was going to use Bioperl-dev, but it was > brought to my attention that that repository does not normally > receive daily updates and it might not be the right place for my day > to day development. An alternative would be to use something like > google code on a daily basis and commit to Bioperl-dev on a weekly > basis. -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : =========================================================== From maj at fortinbras.us Thu May 21 16:26:54 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Thu, 21 May 2009 16:26:54 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> Message-ID: These are key points. I do believe (and think in these terms) that bioperl-dev modules are intended for the trunk, as soon as they are not so broken as to be untestable by users (my interp). See this thread to refresh memory: http://lists.open-bio.org/pipermail/bioperl-l/2009-March/029661.html ----- Original Message ----- From: "Hilmar Lapp" To: "Chase Miller" Cc: "BioPerl List" Sent: Thursday, May 21, 2009 4:00 PM Subject: Re: [Bioperl-l] bioperl-dev or branch? > Moving this question to the BioPerl list, which is where we need to > discuss this I think. Can someone refresh my memory on what the > Bioperl-dev repository is or was meant for? It doesn't seem documented > on the wiki.
> > My (admittedly vague) recollection is that bioperl-dev is basically > for highly experimental changes or functionality. > > I'm not clear why everything else shouldn't go either into the main > trunk or into a branch. If there is a realistic expectation for > something to be folded into the main trunk sooner or later, what would > be the reasons for not putting it into a branch of the main > repository? If we are putting it into a separate repository, we're > waiving a lot of svn's support for merging and resolving concurrent > edits. > > I would also go actually go a step further and suggest that even if > this GSoC project starts out on a branch (which I can see good reasons > for, such as eliminating fear to disrupt something), there should be a > plan to move to main trunk before the end of the project. We've had a > good tradition in BioPerl of developing directly on the main trunk. It > sometimes leads to occasional disruptions when lots of tests seem > failing, but it also encourages development discipline and make new > code to melt into the BioPerl code base without requiring any extra > steps by someone. > > Any and all thoughts or comments welcome and appreciated! > > -hilmar > > On May 21, 2009, at 11:26 AM, Chase Miller wrote: > >> This brings me to a question about where I should have my code >> repository. Originally, I was going to use Bioperl-dev, but it was >> brought to my attention that that repository does not normally >> receive daily updates and it might not be the right place for my day >> to day development. An alternative would be to use something like >> google code on a daily basis and commit to Bioperl-dev on a weekly >> basis. 
> > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : > =========================================================== > > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From bix at sendu.me.uk Thu May 21 16:41:19 2009 From: bix at sendu.me.uk (Sendu Bala) Date: Thu, 21 May 2009 21:41:19 +0100 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> Message-ID: <4A15BC6F.7010306@sendu.me.uk> Hilmar Lapp wrote: > Moving this question to the BioPerl list, which is where we need to > discuss this I think. Can someone refresh my memory on what the > Bioperl-dev repository is or was meant for? It doesn't seem documented > on the wiki. > > My (admittedly vague) recollection is that bioperl-dev is basically for > highly experimental changes or functionality. [...] > Any and all thoughts or comments welcome and appreciated! I'd say anything new or not radical enough to make tests always fail (due to API changes) go straight to head. Radical changes also go straight to head if they've been discussed and people think they're OK and the dev is confident in them. These changes can always be reverted or excluded in a subsequent release branch if necessary. From hlapp at gmx.net Thu May 21 17:00:46 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 17:00:46 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> Message-ID: <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> To quote from the thread: "The idea behind bioperl-dev, as I understand from Chris, is to provide a sort of sandbox for experimental code. 
Adventuresome users should feel free to play with the code there, but not expect much in the way of support, bug fixes, and the like. There be dragons there. When a bioperl-dev module graduates to the core, then the usual support mechanisms kick in." I.e., there is a possibility, but no expectation, of graduating to core. I think that's important. My sense is that we all agree that we don't want to abandon svn branches (or do we?). So I'll state the question again: what disqualifies a development project from going into the main trunk (thanks to Sendu for keeping this on the table), and what disqualifies it from going onto a branch, with the remaining resort being bioperl-dev? I'm worried about fragmentation here - historically we've been a crowd that has been rather inviting of new contributions into the main code base and tolerant of those additions needing time to mature, and we have been lazy about committing on behalf of other people (which merging patches, branches, and separate repositories on behalf of someone else is) and hence liberal in giving out commit access, and main-trunk commit access. -hilmar On May 21, 2009, at 4:26 PM, Mark A. Jensen wrote: > These are key points. I do believe (and think in these terms) that > bioperl-dev modules are intended for the trunk, as soon as they are > not so broken as to be testable by users. (my interp). See this > thread to refresh memory: http://lists.open-bio.org/pipermail/bioperl-l/2009-March/029661.html > > ----- Original Message ----- From: "Hilmar Lapp" > To: "Chase Miller" > Cc: "BioPerl List" > Sent: Thursday, May 21, 2009 4:00 PM > Subject: Re: [Bioperl-l] bioperl-dev or branch? > > >> Moving this question to the BioPerl list, which is where we need >> to discuss this I think. Can someone refresh my memory on what >> the Bioperl-dev repository is or was meant for? It doesn't seem >> documented on the wiki.
>> My (admittedly vague) recollection is that bioperl-dev is >> basically for highly experimental changes or functionality. >> I'm not clear why everything else shouldn't go either into the >> main trunk or into a branch. If there is a realistic expectation >> for something to be folded into the main trunk sooner or later, >> what would be the reasons for not putting it into a branch of the >> main repository? If we are putting it into a separate repository, >> we're waiving a lot of svn's support for merging and resolving >> concurrent edits. >> I would also go actually go a step further and suggest that even >> if this GSoC project starts out on a branch (which I can see good >> reasons for, such as eliminating fear to disrupt something), there >> should be a plan to move to main trunk before the end of the >> project. We've had a good tradition in BioPerl of developing >> directly on the main trunk. It sometimes leads to occasional >> disruptions when lots of tests seem failing, but it also >> encourages development discipline and make new code to melt into >> the BioPerl code base without requiring any extra steps by someone. >> Any and all thoughts or comments welcome and appreciated! >> -hilmar >> On May 21, 2009, at 11:26 AM, Chase Miller wrote: >>> This brings me to a question about where I should have my code >>> repository. Originally, I was going to use Bioperl-dev, but it >>> was brought to my attention that that repository does not >>> normally receive daily updates and it might not be the right >>> place for my day to day development. An alternative would be to >>> use something like google code on a daily basis and commit to >>> Bioperl-dev on a weekly basis. 
>> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : >> =========================================================== >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Thu May 21 17:52:48 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 21 May 2009 16:52:48 -0500 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> Message-ID: On May 21, 2009, at 4:00 PM, Hilmar Lapp wrote: > To quote from the thread: > > "The idea behind bioperl-dev, as I understand from Chris, is to > provide a sort of sandbox for experimental code. Adventuresome users > should feel free to play with the code there, but not expect much in > the way of support, bug fixes, and the like. There be dragons there. > When a bioperl-dev module graduates to the core, then the usual > support mechanisms kick in." > > I.e., there is a possibility, but no expectation to graduate to > core. I think that's important. > > My sense is that we all agree that we don't want to abandon svn > branches (or do we?). To I'll state the question again: what > disqualifies a development project from going into the main trunk > (thanks to Sendu for keeping this on the table), and what > disqualifies it from going onto a branch, with the remaining resort > being bioperl-dev. 
At one point not so long ago, we had a very enlivening discussion on-list about splitting off chunks of modules into their own subdistributions. Everyone seemed relatively on board with the idea, which is somewhat summarized here (bioperl-dev is also mentioned):

http://www.bioperl.org/wiki/Proposed_1.6_core_modules

The key aspect that started this all is to have a lean (read: few dependencies, relatively stable) core set of modules. Other related modules which add additional dependencies could be moved into a bioperl-tools. And (finally) anything lacking the guarantee of API stability would be bioperl-dev. I think several other alternatives came up but these seemed to be the final ones.

So:

core = minimal functionality
tools = complete functionality (requires core)
dev = experimental APIs, etc (requires core and possibly tools)

If we claim that bioperl-live is representing 'core', we are going in the direction of code bloat by dropping all new (possibly untested) code there. However, if we can somehow package up the 'live' modules into distinct bundles as delineated above, that also solves the issue, so any new code can go in. Depending on which avenue we take, I either agree or disagree with Sendu on dropping all new code into bioperl-live (like a coding version of Schrödinger's cat).

Anyway, I saw a distinct lack of progress towards any solution, so I just took the initiative to set up *SOMETHING* so progress is made, one way or the other. If bioperl-dev ends up being a temporary spot in the road I don't have a problem with that, just as long as we come to some consensus on what to do, how to approach it, and make progress towards that goal.
> I'm worried about fragmentation here - historically we've been a > crowd that has been rather inviting of new contributions into the > main code base and tolerant of those additions needing time to > mature, and we have been lazy on committing on behalf of other > people (which merging patches, branches, and separate repositories > on behalf of someone else is) and hence liberal in giving out commit > access, and commit to main trunk access. > > -hilmar I'm a big fan of branches (I ran all the feature/annotation refactors off a branch). So I have no problem with anyone wanting to test stuff out there, with the idea that stuff be merged back in at some point. (Aside: do we really need to retain feature branches? Most projects remove them after they are merged back in) Anyway, I don't have a problem with new code being added in and adding new devs. And there isn't anything that prevents any dev from committing to any of the other bioperl-* repos AFAIK. But a consensus would be nice, and the sooner the better. chris From rmb32 at cornell.edu Thu May 21 19:31:48 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 21 May 2009 16:31:48 -0700 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> Message-ID: <4A15E464.60801@cornell.edu> Hilmar Lapp wrote: > something to be folded into the main trunk sooner or later, what would > be the reasons for not putting it into a branch of the main repository? > If we are putting it into a separate repository, we're waiving a lot of > svn's support for merging and resolving concurrent edits. Just to clarify, it doesn't look like bioperl-dev is actually in a separate repo, it's just separated from bioperl-live as a different distribution, but still in the 'bioperl' repository. So it seems to me there's no need to worry about being able to merge from it. 
Sorry if I'm butting in on the larger organization issue here, since I don't exactly have any history with this group, but here are my 2 cents, they may or may not make sense: I would agree with Sendu's assertion that there doesn't really seem to be a need for a separate distribution for highly experimental things, that role would probably be most straightforwardly performed by a branch of the appropriate bioperl-* distribution. In fact, having a separate bioperl-dev distribution could actually be a headache for anybody wanting to actually install it (as in make install from a tarball or something), since anything radioactive enough to be in there is quite likely going to *conflict* namespace-wise or at least functionality-wise with what's in bioperl-live. And by the way (I may be opening a can of worms here), wouldn't bioperl-live be more appropriately called bioperl-core? Rob From jason at bioperl.org Thu May 21 19:48:17 2009 From: jason at bioperl.org (Jason Stajich) Date: Thu, 21 May 2009 16:48:17 -0700 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <4A15E464.60801@cornell.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> Message-ID: <7EFDDC65-B111-4666-BB31-67CF4F262AEE@bioperl.org> > > And by the way (I may be opening a can of worms here), wouldn't > bioperl-live be more appropriately called bioperl-core? > yes, I agree. But (like a lot of code/API/naming decisions) we're sort of wedded to the name right now only for historical reasons. If we can symlink in SVN so the old name still worked that would be fine. We call it bioperl-core in the downloads and distribution links on the wiki. 
> Rob > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Jason Stajich jason at bioperl.org From kellert at ohsu.edu Thu May 21 19:54:10 2009 From: kellert at ohsu.edu (Thomas Keller) Date: Thu, 21 May 2009 16:54:10 -0700 Subject: [Bioperl-l] GenBank entries In-Reply-To: References: Message-ID: <0433B9F6-34A9-4115-840E-04F20D49D6A5@ohsu.edu> Greetings, I don't think the EUtilities HOWTO is working. But the HOWTO: getting genomic sequences on the wiki is pretty nice. URL: http://www.bioperl.org/wiki/Getting_Genomic_Sequences It does take some tweaking though; sometimes you think you're getting a value when you're getting another object and you need to pipe it to another method. Tom kellert at ohsu.edu 503-494-2442 On May 21, 2009, at 2:53 PM, bioperl-l-request at lists.open-bio.org wrote: > 7. Re: GenBank entries (Mark A. Jensen) From hlapp at gmx.net Thu May 21 21:12:44 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 21:12:44 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> Message-ID: <70167486-E718-4AB7-9295-276D4E20CD52@gmx.net> On May 21, 2009, at 5:52 PM, Chris Fields wrote: > Aside: do we really need to retain feature branches? Most projects > remove them after they are merged back in I agree completely, with svn we should really do that too. I can't see a good reason to keep them - it's like cruft code. It's probably one of those cvs-derived habits that die hard. (cvs doesn't version directories) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From maj at fortinbras.us Thu May 21 20:56:17 2009 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Thu, 21 May 2009 20:56:17 -0400 Subject: [Bioperl-l] GenBank entries In-Reply-To: <0433B9F6-34A9-4115-840E-04F20D49D6A5@ohsu.edu> References: <0433B9F6-34A9-4115-840E-04F20D49D6A5@ohsu.edu> Message-ID: Thanks Tom-- if you have code examples of the tweaks, and also places where the EUt HOWTO is lying, please send them along, and "we" (maybe I) can do some updating-- [and, of course, everyone's welcome to make fixes to the wiki!] cheers, Mark ----- Original Message ----- From: "Thomas Keller" To: Sent: Thursday, May 21, 2009 7:54 PM Subject: Re: [Bioperl-l] GenBank entries > Greetings, > > I don't think the EUtilities HOWTO is working. > But the HOWTO: getting genomic sequences on the wiki is pretty nice. > URL: http://www.bioperl.org/wiki/Getting_Genomic_Sequences > > It does take some tweaking though; sometimes you think you're getting > a value when you're getting another object and you need to pipe it to > another method. > > > Tom > kellert at ohsu.edu > 503-494-2442 > > > On May 21, 2009, at 2:53 PM, bioperl-l-request at lists.open-bio.org wrote: > >> 7. Re: GenBank entries (Mark A. Jensen) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From hlapp at gmx.net Thu May 21 21:33:52 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 21:33:52 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <4A15E464.60801@cornell.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> Message-ID: <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> On May 21, 2009, at 7:31 PM, Robert Buels wrote: > Just to clarify, it doesn't look like bioperl-dev is actually in a > separate repo, it's just separated from bioperl-live as a different > distribution, but still in the 'bioperl' repository. True, actually, my mistake. 
I guess I was assuming from the early days of cvs that these are actually different repositories. Sorry 'bout that. So in essence bioperl-dev and bioperl-live are different directory trees within the same repository. On May 21, 2009, at 5:52 PM, Chris Fields wrote: > The key aspect that started this all is to have a lean (read: few > dependencies, relatively stable) core set of modules. Other related > modules which add additional dependencies could be moved into a > bioperl-tools. And (finally) anything lacking the guarantee of API > stability would be bioperl-dev. I think several other alternatives > came up but these seemed to be the final ones. > > So: > > core = minimal functionality > tools = complete functionality (requires core) > dev = experimental APIs, etc (requires core and possibly tools) I do like this. Am I right that the way we should be looking at this is as disjoint subsets that as a total make up BioPerl. So a module would be found in one and only one subset, and if I wanted the entire BioPerl package I download each one. Bioperl-dev then isn't for holding different (e.g., more experimental) versions of the same modules that are also in bioperl-core (aka bioperl-live). Likewise, each subset then has its tags and branches and main trunk, right? (Though hopefully the release tags would be present in all) That sounds all good to me, except that bioperl-dev has Bio/Root/* replicated. It should not, right? If we want to introduce experimental changes to Bio/Root modules, they should go into a bioperl-live branch, right? (Otherwise I'm confused what a bioperl-live branch is for.) So in this picture Chase's project would go into bioperl-dev, main trunk. Users would obtain it by downloading bioperl-dev from svn or as package and simply install on top of a Bioperl-core package, without fear of clobbering stable modules that came with Bioperl-core. Right? If that's the idea it makes a lot of sense to me and seems sane. 
Conversely, using bioperl-dev as another way to branch bioperl-core modules doesn't, though I may be missing something. (I hope I'm making sense. Please do say if I'm not ...) -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Thu May 21 21:38:09 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 21:38:09 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <4A15E464.60801@cornell.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> Message-ID: On May 21, 2009, at 7:31 PM, Robert Buels wrote: > I would agree with Sendu's assertion that there doesn't really seem > to be a need for a separate distribution for highly experimental > things, that role would probably be most straightforwardly performed > by a branch of the appropriate bioperl-* distribution. Yes, if by "highly experimental things" we are talking about experimental versions of modules that already exist in either bioperl- core or bioperl-dev. > In fact, having a separate bioperl-dev distribution could actually > be a headache for anybody wanting to actually install it (as in make > install from a tarball or something), since anything radioactive > enough to be in there is quite likely going to *conflict* namespace- > wise or at least functionality-wise with what's in bioperl-live. Yeah I think that's why bioperl-dev and bioperl-core need to be disjoint sets. Or do you think that even in that case your scenario could be a problem? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From maj at fortinbras.us Thu May 21 21:51:47 2009 From: maj at fortinbras.us (Mark A. 
Jensen) Date: Thu, 21 May 2009 21:51:47 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu> <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> Message-ID: <454792CE1F444C01B8835206DB42A80D@NewLife> H- ----- Original Message ----- From: "Hilmar Lapp" To: "Robert Buels" ; "Chris Fields" Cc: "BioPerl List" Sent: Thursday, May 21, 2009 9:33 PM Subject: Re: [Bioperl-l] bioperl-dev or branch? ... > That sounds all good to me, except that bioperl-dev has Bio/Root/* > replicated. It should not, right? If we want to introduce experimental > changes to Bio/Root modules, they should go into a bioperl-live branch, > right? (Otherwise I'm confused what a bioperl-live branch is for.) Well, this is what I'm wondering. For Chase's project, e.g., there may be mods that need to take place (or be tried out, and possibly discarded later, which is key) in the core modules, to interface the new stuff with the core. So experiments may occasionally depend on (possibly radical) changes to the core--nasty for folks (I expect they are many) that update on the core regularly and expect changes there to be incremental fixes and/or new and relatively well-tested functionality. So I do conceive of bioperl-dev (rightly or wrongly) as a parallel branch of bioperl-live, but not a temporary "feature" branch as such. It can and prob should be pretty persistent. But read on... > > So in this picture Chase's project would go into bioperl-dev, main trunk. > Users would obtain it by downloading bioperl-dev from svn or as package and > simply install on top of a Bioperl-core package, without fear of clobbering > stable modules that came with Bioperl-core. Right? This is the way I'd love it to work, modulo the exptl core changes mentioned above. 
My own tendency in development is to make stuff as separable as possible (by specifically overriding core methods in 'Helper' modules, for example), and if this were part of the bioperl-dev rules of engagement (i.e., no core modules, only overrides), then users could count on the behavior you describe. Mirroring the existing core paths is also key for this expectation. > > If that's the idea it makes a lot of sense to me and seems sane. Conversely, > using bioperl-dev as another way to branch bioperl-core modules doesn't, > though I may be missing something. > That's the idea in my own private Idaho, here in Georgia. > (I hope I'm making sense. Please do say if I'm not ...) > Me too X 2. MAJ > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From maj at fortinbras.us Thu May 21 22:02:12 2009 From: maj at fortinbras.us (Mark A. Jensen) Date: Thu, 21 May 2009 22:02:12 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> Message-ID: <945F5E2CFA8542F2AD1A04AD8BD7C67B@NewLife> Also wanted to chime in briefly here--for me as a new developer, commit access to The Trunk is a little scary, but bioperl-dev seems friendly, so I find I'm more comfortable putting my hare-brained schemes there where I know I won't break anything, but experienced folks can monitor, comment, ignore, get excited, emit raspberries, etc. the whole time. So I'm committing and developing where I might otherwise have shied away. 
In this way bioperl-dev may be an encouragement to the liberal tradition you describe. also: > When a bioperl-dev module graduates to the core, then the usual support > mechanisms kick in." > > I.e., there is a possibility, but no expectation to graduate to core. I think > that's important. Not "If", but "When"! I had expectation and not possibility in mind! (My perennial optimism...) cheers- MAJ ----- Original Message ----- From: "Hilmar Lapp" To: "Mark A. Jensen" Cc: "BioPerl List" ; "Chase Miller" Sent: Thursday, May 21, 2009 5:00 PM Subject: Re: [Bioperl-l] bioperl-dev or branch? > To quote from the thread: > > "The idea behind bioperl-dev, as I understand from Chris, is to provide a > sort of sandbox for experimental code. Adventuresome users should feel free > to play with the code there, but not expect much in the way of support, bug > fixes, and the like. There be dragons there. When a bioperl-dev module > graduates to the core, then the usual support mechanisms kick in." > > I.e., there is a possibility, but no expectation to graduate to core. I think > that's important. > > My sense is that we all agree that we don't want to abandon svn branches (or > do we?). To I'll state the question again: what disqualifies a development > project from going into the main trunk (thanks to Sendu for keeping this on > the table), and what disqualifies it from going onto a branch, with the > remaining resort being bioperl- dev. > > I'm worried about fragmentation here - historically we've been a crowd that > has been rather inviting of new contributions into the main code base and > tolerant of those additions needing time to mature, and we have been lazy on > committing on behalf of other people (which merging patches, branches, and > separate repositories on behalf of someone else is) and hence liberal in > giving out commit access, and commit to main trunk access. > > -hilmar > > On May 21, 2009, at 4:26 PM, Mark A. Jensen wrote: > >> These are key points. 
I do believe (and think in these terms) that >> bioperl-dev modules are intended for the trunk, as soon as they are not so >> broken as to be testable by users. (my interp). See this thread to refresh >> memory: http://lists.open-bio.org/pipermail/bioperl-l/2009-March/029661.html >> >> ----- Original Message ----- From: "Hilmar Lapp" >> To: "Chase Miller" >> Cc: "BioPerl List" >> Sent: Thursday, May 21, 2009 4:00 PM >> Subject: Re: [Bioperl-l] bioperl-dev or branch? >> >> >>> Moving this question to the BioPerl list, which is where we need to >>> discuss this I think. Can someone refresh my memory on what the >>> Bioperl-dev repository is or was meant for? It doesn't seem documented on >>> the wiki. >>> My (admittedly vague) recollection is that bioperl-dev is basically for >>> highly experimental changes or functionality. >>> I'm not clear why everything else shouldn't go either into the main trunk >>> or into a branch. If there is a realistic expectation for something to be >>> folded into the main trunk sooner or later, what would be the reasons for >>> not putting it into a branch of the main repository? If we are putting it >>> into a separate repository, we're waiving a lot of svn's support for >>> merging and resolving concurrent edits. >>> I would also go actually go a step further and suggest that even if this >>> GSoC project starts out on a branch (which I can see good reasons for, >>> such as eliminating fear to disrupt something), there should be a plan to >>> move to main trunk before the end of the project. We've had a good >>> tradition in BioPerl of developing directly on the main trunk. It >>> sometimes leads to occasional disruptions when lots of tests seem failing, >>> but it also encourages development discipline and make new code to melt >>> into the BioPerl code base without requiring any extra steps by someone. >>> Any and all thoughts or comments welcome and appreciated! 
>>> -hilmar >>> On May 21, 2009, at 11:26 AM, Chase Miller wrote: >>>> This brings me to a question about where I should have my code >>>> repository. Originally, I was going to use Bioperl-dev, but it was >>>> brought to my attention that that repository does not normally receive >>>> daily updates and it might not be the right place for my day to day >>>> development. An alternative would be to use something like google code >>>> on a daily basis and commit to Bioperl-dev on a weekly basis. >>> -- >>> =========================================================== >>> : Hilmar Lapp -:- Durham, NC -:- hlapp at duke dot edu : >>> =========================================================== >>> _______________________________________________ >>> Bioperl-l mailing list >>> Bioperl-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l >>> >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > From hlapp at gmx.net Thu May 21 22:27:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 21 May 2009 22:27:57 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <945F5E2CFA8542F2AD1A04AD8BD7C67B@NewLife> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> <945F5E2CFA8542F2AD1A04AD8BD7C67B@NewLife> Message-ID: On May 21, 2009, at 10:02 PM, Mark A. 
Jensen wrote: > Also wanted to chime in briefly here--for me as a new developer, > commit > access to The Trunk is a little scary, but bioperl-dev seems friendly, > so I find I'm more comfortable putting my hare-brained schemes there > where I know I won't break anything, but experienced folks can > monitor, > comment, ignore, get excited, emit raspberries, etc. the whole time. Yeah I understand that but what you are describing is a branch, really. Bioperl-dev as a subset of BioPerl may also be updated frequently by other people, so unless you want only useless and/or outright dangerous stuff there (which if I understand correctly runs counter to what Chris has in mind) there is no reason to expect that no-one will be using that. It's simply stuff whose API hasn't stabilized yet to the extent desirable for being moved to -core or -tools. (Chris chime in if I am misrepresenting this) In theory, branches are cheap in svn (in practice, that's not necessarily so in a local checkout, but we're still only talking about required storage), and you should feel free to open one at any time. If people find branching off in bioperl too expensive then that's another argument to really look into git. On May 21, 2009, at 9:51 PM, Mark A. Jensen wrote: > For Chase's project, e.g., there may be mods that need to take place > (or be tried out, and possibly discarded later, which is key) in the > core modules, to interface the new stuff with the core. So > experiments may occasionally depend on (possibly radical) changes to > the core--nasty for folks (I expect they are many) that update on > the core regularly and expect changes there to be incremental fixes > and/or new and relatively well-tested functionality. You should really branch off the core for this. BTW in subversion it is up to you what you put on a branch. You can copy the entire code tree, or only a small part of it. 
So I find it pretty important that we get the distinction between bioperl-dev and a branch (of bioperl-dev or of bioperl-live) down as unambiguously as possible. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at illinois.edu Thu May 21 22:33:20 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 21 May 2009 21:33:20 -0500 Subject: [Bioperl-l] GenBank entries In-Reply-To: <0433B9F6-34A9-4115-840E-04F20D49D6A5@ohsu.edu> References: <0433B9F6-34A9-4115-840E-04F20D49D6A5@ohsu.edu> Message-ID: <7D1C5126-EB9B-4C61-A82F-7624F4E01F50@illinois.edu> If you let me know where it doesn't work, then I can fix it. chris On May 21, 2009, at 6:54 PM, Thomas Keller wrote: > Greetings, > > I don't think the EUtilities HOWTO is working. > But the HOWTO: getting genomic sequences on the wiki is pretty nice. > URL: http://www.bioperl.org/wiki/Getting_Genomic_Sequences > > It does take some tweaking though; sometimes you think you're > getting a value when you're getting another object and you need to > pipe it to another method. > > > Tom > kellert at ohsu.edu > 503-494-2442 > > > On May 21, 2009, at 2:53 PM, bioperl-l-request at lists.open-bio.org > wrote: > >> 7. Re: GenBank entries (Mark A. Jensen) > > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From cjfields at illinois.edu Thu May 21 23:08:27 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 21 May 2009 22:08:27 -0500 Subject: [Bioperl-l] bioperl-dev or branch? 
In-Reply-To: <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> Message-ID: <0FEBC2BD-2AB4-4077-8DFE-6DF624C0D994@illinois.edu> On May 21, 2009, at 8:33 PM, Hilmar Lapp wrote: > > On May 21, 2009, at 7:31 PM, Robert Buels wrote: > >> Just to clarify, it doesn't look like bioperl-dev is actually in a >> separate repo, it's just separated from bioperl-live as a different >> distribution, but still in the 'bioperl' repository. > > True, actually, my mistake. I guess I was assuming from the early > days of cvs that these are actually different repositories. Sorry > 'bout that. > > So in essence bioperl-dev and bioperl-live are different directory > trees within the same repository. Yes. > On May 21, 2009, at 5:52 PM, Chris Fields wrote: > >> The key aspect that started this all is to have a lean (read: few >> dependencies, relatively stable) core set of modules. Other >> related modules which add additional dependencies could be moved >> into a bioperl-tools. And (finally) anything lacking the guarantee >> of API stability would be bioperl-dev. I think several other >> alternatives came up but these seemed to be the final ones. >> >> So: >> >> core = minimal functionality >> tools = complete functionality (requires core) >> dev = experimental APIs, etc (requires core and possibly tools) > > I do like this. Am I right that the way we should be looking at this > is as disjoint subsets that as a total make up BioPerl. So a module > would be found in one and only one subset, and if I wanted the > entire BioPerl package I download each one. Yes. > Bioperl-dev then isn't for holding different (e.g., more > experimental) versions of the same modules that are also in bioperl- > core (aka bioperl-live). It's however we want to set it up. 
I would prefer that we not have a module share the exact same namespace, but I think we can put experimental implementations there that could replace something in core (particularly if replacing something in core could possibly become a thorny proposition, or if the new code goes unmaintained). However, as I pointed out before I have done some major revisions on branch and merged back w/o significant issues (and those that did pop up were fixed very easily). > Likewise, each subset then has its tags and branches and main trunk, > right? (Though hopefully the release tags would be present in all) Yes. > That sounds all good to me, except that bioperl-dev has Bio/Root/* > replicated. It should not, right? If we want to introduce > experimental changes to Bio/Root modules, they should go into a > bioperl-live branch, right? (Otherwise I'm confused what a bioperl- > live branch is for.) Mm, yes, see what you're saying, the Bio::Root modules in dev should be removed. Was there a particular reason those were present, Mark? Do they contain anything new? I ran a diff on a few of them and couldn't find anything. I think, for the sake of not confusing users module names should be new and not conflict with a core module's already-claimed namespace. I think it's okay to have something new with a base namespace like Bio::Root::*, but the module name should be unique (I have thought of Bio::Root::Moose, for instance, as a Moose-based Root metaclass, but that may go elsewhere...). If the module is intended as a replacement for something in core we can decide then how to proceed, but (as mentioned above) it could be something that is worked out on a branch. Seeing how the perl6 and Parrot projects progress, everything goes to a branch and gets merged back unless the changes don't work. After merging it gets removed unless it's a release branch. > So in this picture Chase's project would go into bioperl-dev, main > trunk. 
Users would obtain it by downloading bioperl-dev from svn or > as package and simply install on top of a Bioperl-core package, > without fear of clobbering stable modules that came with Bioperl- > core. Right? Yes. > If that's the idea it makes a lot of sense to me and seems sane. > Conversely, using bioperl-dev as another way to branch bioperl-core > modules doesn't, though I may be missing something. No, we should use branches as normally intended. I had thought about doing some stuff with FeatureIO in dev but decided against it and will probably go to a branch when the time comes. > (I hope I'm making sense. Please do say if I'm not ...) > > -hilmar You're making sense. :> chris From cjfields at illinois.edu Thu May 21 23:38:08 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 21 May 2009 22:38:08 -0500 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <4A15E464.60801@cornell.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> Message-ID: <3872B96E-A97F-4569-B7C6-045C07F4257F@illinois.edu> On May 21, 2009, at 6:31 PM, Robert Buels wrote: > Hilmar Lapp wrote: >> something to be folded into the main trunk sooner or later, what >> would be the reasons for not putting it into a branch of the main >> repository? If we are putting it into a separate repository, we're >> waiving a lot of svn's support for merging and resolving concurrent >> edits. > > Just to clarify, it doesn't look like bioperl-dev is actually in a > separate repo, it's just separated from bioperl-live as a different > distribution, but still in the 'bioperl' repository. So it seems to > me there's no need to worry about being able to merge from it. 
>
> Sorry if I'm butting in on the larger organization issue here, since
> I don't exactly have any history with this group, but here are my 2
> cents, they may or may not make sense: I would agree with Sendu's
> assertion that there doesn't really seem to be a need for a separate
> distribution for highly experimental things, that role would
> probably be most straightforwardly performed by a branch of the
> appropriate bioperl-* distribution.

Ah, but we're trying to put core on a diet and not arbitrarily drop code in. There's lots of cruft, dead hunks o' code floating about in bioperl that could be moved/removed. I have no qualms about getting rid of stuff that no longer works or never worked as intended (see my reverts on feature/annotation, or deprecation/removal of modules in the last release). I would like to see some unmaintained core modules go the same route (I'm staring directly at you, Bio::SF::Annotated).

> In fact, having a separate bioperl-dev distribution could actually
> be a headache for anybody wanting to actually install it (as in make
> install from a tarball or something), since anything radioactive
> enough to be in there is quite likely going to *conflict* namespace-
> wise or at least functionality-wise with what's in bioperl-live.

Nope, that's not its purpose (we would branch for that). If our regression tests are worth their salt they should catch issues with code merged back in.

> And by the way (I may be opening a can of worms here), wouldn't
> bioperl-live be more appropriately called bioperl-core?
>
> Rob

Yes, it should (as Jason points out). As mentioned we could alias it...

chris

From cjfields at illinois.edu Thu May 21 23:40:09 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 21 May 2009 22:40:09 -0500
Subject: [Bioperl-l] bioperl-dev or branch?
In-Reply-To: <454792CE1F444C01B8835206DB42A80D@NewLife> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu> <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <454792CE1F444C01B8835206DB42A80D@NewLife> Message-ID: <53ABB3C4-CFBF-46DD-8C0C-60EBDF7B6432@illinois.edu> On May 21, 2009, at 8:51 PM, Mark A. Jensen wrote: > H- > ----- Original Message ----- From: "Hilmar Lapp" > To: "Robert Buels" ; "Chris Fields" > > Cc: "BioPerl List" > Sent: Thursday, May 21, 2009 9:33 PM > Subject: Re: [Bioperl-l] bioperl-dev or branch? > > ... > >> That sounds all good to me, except that bioperl-dev has Bio/Root/* >> replicated. It should not, right? If we want to introduce >> experimental changes to Bio/Root modules, they should go into a >> bioperl-live branch, right? (Otherwise I'm confused what a bioperl- >> live branch is for.) > > Well, this is what I'm wondering. For Chase's project, e.g., there > may be > mods that need to take place (or be tried out, and possibly > discarded later, > which is key) in the core modules, to interface the new stuff with > the core. That should probably happen in bioperl-live, then (or a branch thereof). Don't be afraid to branch off bioperl-live if needed to do the work. If it passes tests and looks good we merge it back to core trunk. > So experiments may occasionally depend on (possibly radical) changes > to > the core--nasty for folks (I expect they are many) that update on > the core > regularly and expect changes there to be incremental fixes and/or > new and > relatively well-tested functionality. So I do conceive of bioperl- > dev (rightly > or wrongly) as a parallel branch of bioperl-live, but not a temporary > "feature" branch as such. It can and prob should be pretty persistent. > But read on... >> >> So in this picture Chase's project would go into bioperl-dev, main >> trunk. 
Users would obtain it by downloading bioperl-dev from svn or >> as package and simply install on top of a Bioperl-core package, >> without fear of clobbering stable modules that came with Bioperl- >> core. Right? > > This is the way I'd love it to work, modulo the exptl core changes > mentioned > above. My own tendency in development is to make stuff as separable as > possible (by specifically overriding core methods in 'Helper' > modules, for example), > and if this were part of the bioperl-dev rules of engagement (i.e., > no core modules, > only overrides), then users could count on the behavior you > describe. Mirroring > the existing core paths is also key for this expectation. I think mirroring the paths is feasible, but not copying the module's name directly. So you could have a Bio::Root::Moose, but not a Bio::Root::Root. >> If that's the idea it makes a lot of sense to me and seems sane. >> Conversely, using bioperl-dev as another way to branch bioperl- >> core modules doesn't, though I may be missing something. > > That's the idea in my own private Idaho, here in Georgia. Same here (from somewhere in corn country). >> (I hope I'm making sense. Please do say if I'm not ...) >> > Me too X 2. > MAJ chris From lsbrath at gmail.com Thu May 21 23:41:39 2009 From: lsbrath at gmail.com (Mgavi Brathwaite) Date: Thu, 21 May 2009 23:41:39 -0400 Subject: [Bioperl-l] Parsing blast with Bio::SearchIO::get_aln Message-ID: <69367b8f0905212041i16637756g79709dc914e0ab0c@mail.gmail.com> Hello, I am parsing a blast report. My script is returning the following output to file: Name: 13414|HD Percent Identity: 100 Alignment: Bio::SimpleAlign=HASH(0x40d0ac4) How do I access " Bio::SimpleAlign=HASH(0x40d0ac4)" actual alignment for this object? 
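
For what it's worth, here is a minimal sketch of one way to turn that Bio::SimpleAlign reference into readable text. It assumes the object came from get_aln() on an HSP while parsing with Bio::SearchIO, that the report is plain-text BLAST output given as the first command-line argument, and that clustalw is an acceptable output format. The variable names are made up for illustration and are not taken from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Bio::SearchIO;
use Bio::AlignIO;
use IO::String;

# Assumption: the BLAST report path is passed on the command line.
my $file = shift @ARGV or die "usage: $0 blast_report\n";

my $in = Bio::SearchIO->new(-format => 'blast', -file => $file);

while (my $result = $in->next_result) {
    while (my $hit = $result->next_hit) {
        while (my $hsp = $hit->next_hsp) {

            # get_aln() returns a Bio::SimpleAlign object, which is why
            # printing it directly shows "Bio::SimpleAlign=HASH(...)".
            my $aln = $hsp->get_aln;

            # Write the alignment into an in-memory string rather than
            # to a file or STDOUT.
            my $buffer = IO::String->new;
            my $out    = Bio::AlignIO->new(-format => 'clustalw',
                                           -fh     => $buffer);
            $out->write_aln($aln);

            print "Name: ", $hit->name, "\n";
            print "Percent Identity: ", $hsp->percent_identity, "\n";
            print "Alignment:\n", ${ $buffer->string_ref }, "\n";
        }
    }
}
```

Swapping 'clustalw' for another Bio::AlignIO format name such as 'fasta' should change only the layout of the printed block. If the raw strings are enough, the HSP methods query_string, homology_string and hit_string avoid the round trip through Bio::AlignIO.
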
LomSpace

From jason at bioperl.org Thu May 21 23:51:05 2009
From: jason at bioperl.org (Jason Stajich)
Date: Thu, 21 May 2009 20:51:05 -0700
Subject: [Bioperl-l] SimpleAlign object
In-Reply-To: <69367b8f0905212038j1212a886i254ebe523bc622af@mail.gmail.com>
References: <69367b8f0905212038j1212a886i254ebe523bc622af@mail.gmail.com>
Message-ID:

I'm not sure what you *want* to print in the "alignment" section. You already have the "actual alignment" as an object.

It looks like you already got that alignment object with get_aln() from your HSP. Did you look at the documentation for that method? You can also get the query/hit/homology strings with methods called on the HSP object. The SearchIO HOWTO describes this.

See Bio::SearchIO::Writer::TextResultWriter if you want to reformat a blast report.

Or, if you want to write it out in a multiple alignment format (multi-fasta, clustalw, phylip, etc.), use Bio::AlignIO.

You can use IO::String to write the alignment into a string instead of to a file/STDOUT if you want to format things in your script:

  # $aln is the Bio::SimpleAlign object returned by $hsp->get_aln
  my $str = IO::String->new;
  my $alnio = Bio::AlignIO->new(-format => 'clustalw',
                                -fh     => $str);
  $alnio->write_aln($aln);

The alignment formatted as a string is now behind $str->string_ref, so you can access it with ${ $str->string_ref }.

On May 21, 2009, at 8:38 PM, Mgavi Brathwaite wrote:

> Hello,
>
> I am parsing a blast report. My script is returning the following
> output to file:
>
> Name: 13414|HD
> Percent Identity: 100
> Alignment: Bio::SimpleAlign=HASH(0x40d0ac4)
> How do I access " Bio::SimpleAlign=HASH(0x40d0ac4)" actual alignment
> for this object?
>
> LomSpace

Jason Stajich
jason at bioperl.org

From rmb32 at cornell.edu Thu May 21 23:55:06 2009
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 21 May 2009 20:55:06 -0700
Subject: [Bioperl-l] bioperl-dev or branch?
In-Reply-To: <454792CE1F444C01B8835206DB42A80D@NewLife>
References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu> <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <454792CE1F444C01B8835206DB42A80D@NewLife>
Message-ID: <4A16221A.90007@cornell.edu>

Mark A. Jensen wrote:
> relatively well-tested functionality. So I do conceive of bioperl-dev
> (rightly
> or wrongly) as a parallel branch of bioperl-live, but not a temporary
> "feature" branch as such. It can and prob should be pretty persistent.

Branches can be persistent, if there is a really good reason to keep them as such. But this instance does not seem to be one of them.

> This is the way I'd love it to work, modulo the exptl core changes
> mentioned
> above. My own tendency in development is to make stuff as separable as
> possible (by specifically overriding core methods in 'Helper' modules,
> for example),
> and if this were part of the bioperl-dev rules of engagement (i.e., no
> core modules,
> only overrides), then users could count on the behavior you describe.
> Mirroring
> the existing core paths is also key for this expectation.

The problem with trying elaborate ploys to avoid developing on branches is that 'trying to keep stuff separable' invariably fails in some cases. It is far better to rely on your modern version control system to sanely manage your changes.

I think you guys are pretty scarred from years of CVS, a version control system which (by modern standards) is laughably broken. SVN is no shining jewel either, but at least it understands the concept of file trees.

Rob

From rmb32 at cornell.edu Thu May 21 23:55:26 2009
From: rmb32 at cornell.edu (Robert Buels)
Date: Thu, 21 May 2009 20:55:26 -0700
Subject: [Bioperl-l] bioperl-dev or branch?
In-Reply-To: <945F5E2CFA8542F2AD1A04AD8BD7C67B@NewLife> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <6B19AA9E-F4A5-4359-91B8-9CD536CFAB59@gmx.net> <945F5E2CFA8542F2AD1A04AD8BD7C67B@NewLife> Message-ID: <4A16222E.2010402@cornell.edu> Mark A. Jensen wrote: > Also wanted to chime in briefly here--for me as a new developer, commit > access to The Trunk is a little scary, but bioperl-dev seems friendly, The point of a version control system is so that mistakes can be traced and undone. However, this is no substitute for writing good tests and running them before you commit. If your test coverage is good, you have a reasonable chance of catching a goof before it's committed, but if it gets by, relief is just an svn-reverse-merge away. > so I find I'm more comfortable putting my hare-brained schemes there > where I know I won't break anything, but experienced folks can monitor, > comment, ignore, get excited, emit raspberries, etc. the whole time. So I'm > committing and developing where I might otherwise have shied away. In > this way bioperl-dev may be an encouragement to the liberal tradition you > describe. These things are what a branch is for. "Hey, svn switch to my branch, run the tests, does this look sane?", "Could you diff this part to trunk and tell me what you think about those changes". But of course changes should not be merged back into the trunk until they have test coverage and the rest of the test suite in that branch is passing. Rob From rmb32 at cornell.edu Thu May 21 23:44:23 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Thu, 21 May 2009 20:44:23 -0700 Subject: [Bioperl-l] bioperl-dev or branch? 
In-Reply-To: References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com> <4A15E464.60801@cornell.edu> Message-ID: <4A161F97.5010809@cornell.edu> Hilmar Lapp wrote: > Yes, if by "highly experimental things" we are talking about > experimental versions of modules that already exist in either > bioperl-core or bioperl-dev. > [snip] > Yeah I think that's why bioperl-dev and bioperl-core need to be disjoint > sets. Or do you think that even in that case your scenario could be a > problem? It's just the normal way to develop a new, potentially disruptive feature: branch, develop your feature (possibly pulling updates from the main branch if you feel that you need to), and then when it's far enough along for development to proceed on the trunk, you use your version control system's merging tools to merge the changes into the trunk, toss out your old branch, and continue your polishing on the trunk. That's just the way it's usually done nowadays, with modern version control systems. If the things you're doing on the branch don't overlap at all with the existing code from the trunk, the merge is completely clean and you're golden. However, if your changes are not completely disjoint, with merging you have a pretty good shot at getting them in there cleanly and automatically, whereas if you're developing essentially outside of the codebase, you're going to have to merge your changes in manually. In the disjoint case, you're equally fine with a branch, and in the non-disjoint case, you are much better off with a branch. One of the cases is a tie, and one is a clear win. Trust your version control system, and use its features. Rob P.S. more modern version control systems make this sort of thing quite a bit easier than with svn, but svn's simple merge functionality is still better than merging changes manually. From maj at fortinbras.us Fri May 22 00:31:35 2009 From: maj at fortinbras.us (Mark A.
Jensen) Date: Fri, 22 May 2009 00:31:35 -0400 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <0FEBC2BD-2AB4-4077-8DFE-6DF624C0D994@illinois.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu><11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <0FEBC2BD-2AB4-4077-8DFE-6DF624C0D994@illinois.edu> Message-ID: <895D57CAEB37462CACF9005619ACDD91@NewLife> ----- Original Message ----- From: "Chris Fields" To: "Hilmar Lapp" Cc: "Robert Buels" ; "BioPerl List" Sent: Thursday, May 21, 2009 11:08 PM Subject: Re: [Bioperl-l] bioperl-dev or branch? ... > Mm, yes, see what you're saying, the Bio::Root modules in dev should be > removed. Was there a particular reason those were present, Mark? Do they > contain anything new? I ran a diff on a few of them and couldn't find > anything. I think I had a vague idea that the Root modules should exist directly in the repository, but that is making less and less sense to me now, especially after these discussions. The gross distinction I'm making now is that while bioperl-live and bioperl-dev possess parallel organization, they are otherwise very different animals: bioperl-live is a unit wherein the component modules are (or can be expected to be) completely interdependent, while different experiments going on in different regions of bioperl-dev don't necessarily have to play at all together. (That sounds suspiciously like a bunch of legitimate branches.) From this angle, Root shouldn't be there at all, unless it's also on the operating table. There are Root modules involved in building/testing. I thought that these at least should be there, but now not so sure. What sense does a bioperl-dev build make, since one is probably not interested in installing everyone's random experiments at once? Probably incorporating the desired code as a layer onto the current install is the way to go.
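One possible reading of "a layer onto the current install" is to put a bioperl-dev checkout ahead of the base modules on Perl's module search path, so that a module present in both places is found in the overlay first. The following is only a sketch under that assumption: the function name layered_perl5lib and the checkout paths are hypothetical, and nothing in this thread establishes it as the agreed BioPerl procedure.

```shell
# Hypothetical helper: print a PERL5LIB value that lists the overlay
# directory first, then the base directory, then any existing PERL5LIB.
# Perl searches PERL5LIB entries in order, so the overlay takes precedence.
layered_perl5lib() {
    overlay=$1   # e.g. a bioperl-dev checkout (assumed layout: Bio/ at top level)
    base=$2      # e.g. a bioperl-live checkout
    printf '%s:%s%s\n' "$overlay" "$base" "${PERL5LIB:+:$PERL5LIB}"
}

# Hypothetical checkout locations; adjust to wherever the working copies live.
PERL5LIB=$(layered_perl5lib "$HOME/src/bioperl-dev" "$HOME/src/bioperl-live")
export PERL5LIB
```

Done this way, nothing from the overlay is installed; removing the overlay directory from PERL5LIB restores the plain base behaviour.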
I get the impression from Rob's insights that more sophistication (on my part, definitely) in the use of version control would obviate a lot of our consternation. So the question is: what's bioperl-dev got that svn ain't got? Maybe the solution here is what you cjf have been suggesting lately, that the trunk is your friend, and that frequent and fearless branching+merging needs to become part of the Modern BioPerl Way. [Jason, fellow bioperl-dev user, are you there?] MAJ > > I think, for the sake of not confusing users module names should be new and > not conflict with a core module's already-claimed namespace. I think it's > okay to have something new with a base namespace like Bio::Root::*, but the > module name should be unique (I have thought of Bio::Root::Moose, for > instance, as a Moose-based Root metaclass, but that may go elsewhere...). > From rmb32 at cornell.edu Fri May 22 03:30:42 2009 From: rmb32 at cornell.edu (Robert Buels) Date: Fri, 22 May 2009 00:30:42 -0700 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <895D57CAEB37462CACF9005619ACDD91@NewLife> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu><11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <0FEBC2BD-2AB4-4077-8DFE-6DF624C0D994@illinois.edu> <895D57CAEB37462CACF9005619ACDD91@NewLife> Message-ID: <4A1654A2.2010607@cornell.edu> Mark A. Jensen wrote: > Maybe the solution here is what you cjf have been suggesting lately, that > the trunk is your friend, and that frequent and fearless branching+merging > needs to become part of the Modern BioPerl Way. Never any need for fear when you have version control. Once it's in the repository, everything is completely reversible, completely recoverable. Actually, junk lying around a developer's working copy is the most frequent source of pain. 
For example, somebody moves a dir in the repository, you update your working copy, and your uncommitted changes are attempting to sit on top of something somebody else has moved out from under you, and you have to move your local changes out of the way and then put them back in the right (new) place, and it's a pain. A couple of techniques for avoiding things like this: 1.) As much as possible, try to keep your working copy clean and make your commits both small and atomic, meaning each commit you make should strive to be a self-contained change that works and stands alone. This is especially important if you are working on the trunk, where changes might be coming in often and some of them could be major. I find it useful to run 'svn diff' on the set of changes I'm about to commit, to make sure they make sense in and of themselves. Also, by following this rule, not only do you avoid problems in your working copy but you also help ensure that the svn HEAD is not very broken at any given time. 2.) If you find it increasingly hard to satisfy rule 1, that is a good indication that you are making a lot of changes, and should be working on a branch. If you get to the point where you realize this and have heretofore managed to follow rule 1, then you can actually just make yourself a branch and switch to it right then and there, even with your uncommitted changes still in your working copy: svn cp svn+ssh://.../trunk svn+ssh://.../branches/mybrokenstuff svn switch svn+ssh://.../branches/mybrokenstuff and poof, there you are, you are now sitting on a checkout of that branch, and the changes that you have sitting in your working copy are still there, and you can commit whatever you want, and it won't affect anybody else. You should still strive to hew to rule 1 though, it just makes things nicer all around. "But what about when you have to merge your stuff back into trunk?" you say?
People sometimes make rather too much of 'the difficulty of merging back into trunk'. If you are working on different things from other people, the merge is likely to come off without a hitch. If there are conflicts, they are usually minor textual differences that can be resolved easily. If two developers are actually working on the same thing, they should probably be on the same branch, or at least merge their work into one of their branches before merging that one back into trunk. All this branching and merging is a little more cumbersome with svn than with more recent tools, though, because only the very latest versions of svn (1.5 and above I think) actually have support for tracking exactly which revisions of a branch need to be merged at any given time. Because of this, if you're doing a big long-lived branch in which you are tracking changes from the trunk (the most common case), or periodically sending changes to the trunk and keeping the branch, or both, it's a good idea to use an additional tool to help track exactly which revisions you have already merged in each direction, like svnmerge.py, svk, or git + git-svn (listed in increasing order of sophistication). I've only ever used svnmerge.py, because I'm still relatively new at this version control stuff. I'm working on learning git and git-svn though. :-) Rob From dan.bolser at gmail.com Fri May 22 07:38:38 2009 From: dan.bolser at gmail.com (Dan Bolser) Date: Fri, 22 May 2009 12:38:38 +0100 Subject: [Bioperl-l] Error loading GFF3: MSG: xxx doesn't have a primary id ... Message-ID: <2c8757af0905220438m76b27421r1ab8f410ca3d09d@mail.gmail.com> Hi, I'm using Bio::DB::SeqFeature::Store::GFF3Loader to load GFF into a DB::SeqFeature::Store database. I first load in a set of 'clones' in a GFF file that looks like this... S.lycopersicum-chr4 SGN:chr04.v14.agp cloned_genomic_insert 7400895 7558294 . - .
ID=C04SLm0125H12.1;Alias=89;Ontology_term=SO:0000914 S.lycopersicum-chr4 SGN:chr04.v14.agp cloned_genomic_insert 7558295 7620759 . + . ID=C04HBa0002B09.1;Alias=90;Ontology_term=SO:0000914 S.lycopersicum-chr4 SGN:chr04.v14.agp cloned_genomic_insert 7670760 7801908 . + . ID=C04HBa0077O05.2;Alias=92;Ontology_term=SO:0000914 And then I load a bunch of Blast hits from those clones in a GFF file that looks like this... S.lycopersicum-chr4 BLASTN match_part 14263569 14263620 56.0 - 0 Target=BAC10.Contig16 314 365;score=56.0;Parent=C04HBa0107N23.1 S.lycopersicum-chr4 BLASTN match_part 7565714 7565734 42.1 + 0 Target=BAC10.Contig16 199 219;score=42.1;Parent=C04HBa0002B09.1 S.lycopersicum-chr4 BLASTN match_part 4309103 4309134 48.1 - 0 Target=BAC10.Contig18 1704 1735;score=48.1;Parent=C04HBa0308B07.2 I'm not 100% sure I got the "tags" part of the latter GFF correct. I'm getting the following error loading the second GFF file: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: C04HBa0002B09.1 doesn't have a primary id STACK: Error::throw STACK: Bio::Root::Root::throw ~/perl5/lib/perl5/Bio/Root/Root.pm:368 STACK: Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree_in_tables ~/perl5/lib/perl5/Bio/DB/SeqFeature/Store/GFF3Loader.pm:685 STACK: Bio::DB::SeqFeature::Store::GFF3Loader::build_object_tree ~/perl5/lib/perl5/Bio/DB/SeqFeature/Store/GFF3Loader.pm:664 STACK: Bio::DB::SeqFeature::Store::GFF3Loader::finish_load ~/perl5/lib/perl5/Bio/DB/SeqFeature/Store/GFF3Loader.pm:318 STACK: Bio::DB::SeqFeature::Store::Loader::load_fh ~/perl5/lib/perl5/Bio/DB/SeqFeature/Store/Loader.pm:325 STACK: Bio::DB::SeqFeature::Store::Loader::load ~/perl5/lib/perl5/Bio/DB/SeqFeature/Store/Loader.pm:222 STACK: ~/BiO/Util/my_seqfeature_load.plx:44 ----------------------------------------------------------- As you can see the ID C04HBa0002B09.1 (from the Parent tag of the second GFF) *does* exist in the first GFF. 
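The by-eye check described here (that each Parent value in the second file has a matching ID in the first) can be scripted for whole files. The sketch below assumes both files are tab-separated GFF3 with attributes in column 9; the function name missing_parents and the file names in the usage note are hypothetical. It only confirms or rules out dangling Parent references and makes no claim about how GFF3Loader resolves parents across separate loads.

```shell
# Hypothetical helper (not part of BioPerl): print each Parent= value found in
# one GFF3 file that has no matching ID= in another GFF3 file.
missing_parents() {
    parents_gff=$1   # file whose Parent= tags are checked (e.g. the Blast hits)
    ids_gff=$2       # file expected to define the matching ID= tags (e.g. the clones)
    awk -F '\t' '
        # Return the value of attribute "key" from a column-9 attribute string.
        function attr(col9, key,    n, i, parts) {
            n = split(col9, parts, ";")
            for (i = 1; i <= n; i++)
                if (index(parts[i], key "=") == 1)
                    return substr(parts[i], length(key) + 2)
            return ""
        }
        # Skip comments, directives and lines without nine columns.
        /^#/ || NF < 9 { next }
        # First file operand: remember every ID.
        FILENAME == ARGV[1] {
            id = attr($9, "ID")
            if (id != "") seen[id] = 1
            next
        }
        # Second file operand: report each unknown Parent once
        # (Parent may hold a comma-separated list).
        {
            p = attr($9, "Parent")
            if (p == "") next
            m = split(p, plist, ",")
            for (j = 1; j <= m; j++)
                if (!(plist[j] in seen) && !(plist[j] in reported)) {
                    reported[plist[j]] = 1
                    print plist[j]
                }
        }
    ' "$ids_gff" "$parents_gff"
}
```

With hypothetical file names, `missing_parents blast_hits.gff clones.gff` prints nothing when every Parent is defined in the clones file.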
The features are apparently loaded correctly, and calling 'reindex' on the database seems to run without error. I tried to look into the above code, but I'm confused by all the calls to the Load 'Helper'. a) is this the problem of my GFF? b) is this important? (the features are apparently loaded) c) can you fix it? ;-) Thanks for any tips, Dan. From cjfields at illinois.edu Fri May 22 08:24:11 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 22 May 2009 07:24:11 -0500 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <4A16221A.90007@cornell.edu> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu> <11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <454792CE1F444C01B8835206DB42A80D@NewLife> <4A16221A.90007@cornell.edu> Message-ID: On May 21, 2009, at 10:55 PM, Robert Buels wrote: > Mark A. Jensen wrote: >> relatively well-tested functionality. So I do conceive of bioperl- >> dev (rightly >> or wrongly) as a parallel branch of bioperl-live, but not a temporary >> "feature" branch as such. It can and prob should be pretty >> persistent. > Branches can be persistent, if there is a really good reason to keep > them as such. But this instance does not seem to be one of them. > >> This is the way I'd love it to work, modulo the exptl core changes >> mentioned >> above. My own tendency in development is to make stuff as separable >> as >> possible (by specifically overriding core methods in 'Helper' >> modules, for example), >> and if this were part of the bioperl-dev rules of engagement (i.e., >> no core modules, >> only overrides), then users could count on the behavior you >> describe. Mirroring >> the existing core paths is also key for this expectation. > > The problem with trying elaborate ploys to avoid developing on > branches > is the 'trying to keep stuff separable' invariably fails in some > cases. > It is far better to rely to on your modern version control system to > sanely manage your changes. 
I think you guys are pretty scarred from > years of CVS, a version control system which (by modern standards) is > laughably broken. SVN is no shining jewel either, but at least it > understands the concept of file trees. > > Rob Well, there is the issue that we *don't* want to pollute core any more than it already is. I would rather not dump any/all code directly into core de novo. That practice has created code bloat, and unmaintained code bloat at that. chris From cjfields at illinois.edu Fri May 22 08:36:16 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 22 May 2009 07:36:16 -0500 Subject: [Bioperl-l] bioperl-dev or branch? In-Reply-To: <895D57CAEB37462CACF9005619ACDD91@NewLife> References: <991fb8210905210826v2a7990c0u90fcb3256f54b7d7@mail.gmail.com><4A15E464.60801@cornell.edu><11721AB8-5473-4630-8AE2-BDDB307F0F72@gmx.net> <0FEBC2BD-2AB4-4077-8DFE-6DF624C0D994@illinois.edu> <895D57CAEB37462CACF9005619ACDD91@NewLife> Message-ID: On May 21, 2009, at 11:31 PM, Mark A. Jensen wrote: > > ----- Original Message ----- From: "Chris Fields" > > To: "Hilmar Lapp" > Cc: "Robert Buels" ; "BioPerl List" > > Sent: Thursday, May 21, 2009 11:08 PM > Subject: Re: [Bioperl-l] bioperl-dev or branch? > ... >> Mm, yes, see what you're saying, the Bio::Root modules in dev >> should be removed. Was there a particular reason those were >> present, Mark? Do they contain anything new? I ran a diff on a >> few of them and couldn't find anything. > > I think I had a vague idea that the Root modules should exist directly > in the repository, but that is making less and less sense to me now, > especially after these discussions.
The gross distinction I'm making > now > is that while bioperl-live and bioperl-dev possess parallel > organization, > they are otherwise very different animals: bioperl-live is a unit > wherein > the component modules are (or can be expected to be) completely > interdependent, while different experiments going on in different > regions of bioperl-dev don't necessarily have to play at all together. > (That sounds suspiciously like a bunch of legitimate branches.) > From this angle, Root shouldn't be there at all, unless it's also on > the > operating table. > > There are Root modules involved in building/testing. I thought that > these at least should be there, but now not so sure. What sense > does a bioperl-dev build make, since one is probably not interested > in installing everyone's random experiments at once? Probably > incorporating the desired code as a layer onto the current install > is the way to go. That's sort of the idea. If you want just the base functionality, get core/live. 'tools' is an additional layer, and 'dev' a layer around that. > I get the impression from Rob's insights that more sophistication > (on my part, > definitely) in the use of version control would obviate a lot of our > consternation. So the question is: what's bioperl-dev got that svn > ain't got? > Maybe the solution here is what you cjf have been suggesting lately, > that > the trunk is your friend, and that frequent and fearless branching > +merging > needs to become part of the Modern BioPerl Way. > > [Jason, fellow bioperl-dev user