[Bioperl-l] Bio::PopGen modules performance
bli1 at bcm.tmc.edu
Fri Nov 4 18:35:15 EST 2005
Hi Jason and Albert,
Thanks for the quick response. Actually, the data are gene-like
sequences, so I have decided to implement C++ or Perl code myself for
calculating the various statistics. I haven't touched C++ programming
for a long time, so I want to try Perl first and hope it works fine.
As Jason pointed out, I may only need to calculate the pi value (with
my own code, not Bioperl, to increase speed) and pass it to the
XX_count() functions. I believe the performance issue is not the
calculation at all: most of the time goes into constructing the
various objects, and in most of the functions those objects are then
accessed to do the final calculation. In other words, it takes a
detour to reach the final destination. Bioperl is very useful and
convenient for small data sets, but for larger data sets the code must
be optimized, since performance rather than reusability becomes the
priority.
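For reference, both pieces can be done with plain code and no objects at all. Here is a minimal sketch (in Python rather than Perl, purely for illustration; the function names pairwise_pi and tajima_D_from_counts are mine, not Bioperl's, and the constants follow Tajima's 1989 formulation):

```python
import math
from itertools import combinations

def pairwise_pi(seqs):
    """Nucleotide diversity pi: the average number of pairwise
    differences over all sequence pairs (sequences assumed aligned
    and of equal length)."""
    n = len(seqs)
    total = sum(sum(a != b for a, b in zip(s, t))
                for s, t in combinations(seqs, 2))
    return total / (n * (n - 1) / 2.0)

def tajima_D_from_counts(n, S, pi):
    """Tajima's D from sample size n, number of segregating sites S,
    and pi, using the standard Tajima (1989) constants."""
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # D = (pi - S/a1) / sqrt(Var), with Var = e1*S + e2*S*(S-1)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))
```

Calling pairwise_pi once per simulated sample and passing the result straight into the counts-based D formula avoids building any per-individual objects, which is exactly the shortcut discussed below.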
Thanks and have a nice weekend! ;-)
On Nov 4, 2005, at 3:29 PM, Albert Vilella wrote:
> If your datasets are annotations inside large syntenic regions (Mbp
> scale) and you are interested in sliding-window/multiresolution
> analyses, you may be interested in trying VariScan;
> otherwise, for "gene-like" sequences (kbp-Mbp scale), Kevin's
> libsequence+analyse is a fantastic tool, and I believe the best
> available.
> On Fri, 04 Nov 2005 at 15:35 -0500, Jason Stajich wrote:
>> My guess is it has more to do with the object creation/teardown than
>> the actual code calculating the statistics. I'm not entirely sure
>> how we solve this, as I chose to use rich objects so that you can
>> pass lots of different kinds of data in.
>> I wrote simple methods to calculate Tajima's D, Fu & Li's D, etc.
>> just from the simple counts - see the XX_counts methods for more
>> information.
>> For example, for Tajima's D, you can call tajima_D_counts with the #
>> samples, # sites, and pi to get D back directly. But of course, to
>> calculate pi you need to pass a population object to the pi method,
>> so it doesn't really solve the problem for you. Maybe we can figure
>> out a way to simplify it; I embraced the object-oriented approach
>> here to support a flexible design, but I didn't realize the speed
>> was going to be such an issue.
>> You can certainly use Kevin Thornton's msstats, which is going to be
>> a bazillion (approx.) times faster than the bioperl object code.
>> search down for msstats
>> I am hoping someone will have a magic Perl insight on how to do OO
>> better one day - and may that be a day before Perl6!
>> On Nov 4, 2005, at 2:18 PM, Bingshan Li wrote:
>>> Hi all,
>>> I used Bio::PopGen modules to calculate various statistics such as
>>> Tajima's D, Pi and so on. For single data, the performance is fine.
>>> But to get a sense of significance, I simulated the data using
>>> Hudson's "ms" program to generate 10000 simulated populations. When
>>> I used Bio::PopGen modules on the 10000 samples, it takes long time
>>> (finished 600 samples in about 10 hours, population size about 200,
>>> segregating size about 500). If I have a set of data, say 100, for
>>> each data I need 10000 simulated populations, I do not think it is
>>> doable. I am wondering if it makes sense for these modules or I can
>>> increase the performance by optimization of my code. I think 10000
>>> simulations are typical for population genetics analysis. Does any
>>> body have experiences with this issue and can anyone give me any
>>> suggestions about the performance?
>>> Thanks a lot!
>>> Bioperl-l mailing list
>>> Bioperl-l at portal.open-bio.org
>> Jason Stajich
>> Duke University