Google Summer of Code
Mentor Volunteer List
BioPerl developers who volunteer to act as a mentor to a GSoC student.
- Robert Buels - 1 student
- Chris Fields - 1 student
- Mark Jensen - 1 student
Project Ideas for 2010
Applications for GSoC 2010 are currently being accepted; please email the BioPerl List for details.
Major BioPerl reorganization
- Rationale
- BioPerl currently suffers from an overly monolithic structure, which is becoming unwieldy and is contributing to paralysis of the project.
- Approach
- Under the supervision of their mentor(s), the GSoC student will:
- break current thousand-module monolithic distributions into smaller, more manageable pieces
- improve characterization of dependencies
- improve build and testing systems for new distributions
- Challenges
- BioPerl contains nearly 2000 modules, with very complex relationships between them.
- Difficulty and needed skills
- Easy to medium. The student will need excellent Perl programming skills, familiarity with Perl module authoring, and a good command of text processing tools (grep, ack, and Perl itself).
- Mentors
- Robert Buels, Chris Fields
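The "improve characterization of dependencies" step above could start from a simple text scan of module sources. The following is a minimal sketch, written in Python purely for illustration and not part of BioPerl; the function names `find_dependencies` and `dependency_graph` are hypothetical. It assumes dependencies are declared with plain `use`, `require`, `use base`, or `use parent` statements, and it will also pick up such statements if they appear in POD or comments.

```python
import re

# Matches lines such as "use Bio::PrimarySeq;" or "require Bio::Root::Root;".
# Pragmas such as "strict" are skipped because the name must start with a capital.
USE_RE = re.compile(r"^\s*(?:use|require)\s+([A-Z]\w*(?:::\w+)*)", re.MULTILINE)
# Captures the argument list of "use base ...;" and "use parent ...;".
BASE_RE = re.compile(r"^\s*use\s+(?:base|parent)\s+(.*?);", re.MULTILINE | re.DOTALL)
# Picks module-like names out of an argument list such as "qw(Bio::Root::Root Bio::SeqI)".
NAME_RE = re.compile(r"\b[A-Z]\w*(?:::\w+)*")


def find_dependencies(perl_source):
    """Return the set of module names a piece of Perl source appears to load."""
    deps = set(USE_RE.findall(perl_source))
    for arg_list in BASE_RE.findall(perl_source):
        deps.update(NAME_RE.findall(arg_list))
    return deps


def dependency_graph(sources):
    """Map each module in `sources` to the other modules in `sources` it loads.

    `sources` is a dict of module name -> source text (an assumed input shape).
    """
    known = set(sources)
    return {
        name: sorted(find_dependencies(text) & (known - {name}))
        for name, text in sources.items()
    }
```

Modules whose entries in the resulting graph only point at each other are natural candidates for a shared, smaller distribution; a real analysis would also need to handle runtime `require` calls built from strings, which this scan cannot see.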
Lightweight Sequence objects and Lazy Parsing
- Rationale
- The current Bio::Seq implementation is greedy: it loads entire records into memory and can consume large amounts of it, particularly with whole-genome information.
- Approach
- Implement a Bio::Seq class that can deal with very large datasets in a memory-efficient manner. Implement at least one corresponding parser that can either parse records lazily (akin to an XML pull parser), or create lightweight objects that can parse the raw data on the fly. These could be considered two projects but they are interrelated (lightweight objects could have many different backends, including lazy parsing), so development should proceed with this in mind.
- Difficulty and needed skills
- Medium to hard. The student should have an excellent command of Perl and data structures, and some familiarity with parsing methodologies.
- Mentors
- Chris Fields
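The two ideas in the approach above (pull-style lazy parsing and lightweight objects that read raw data on the fly) can be sketched together. This is an illustrative sketch in Python, not BioPerl code and not a proposed design; `LazySeq` and `lazy_fasta` are hypothetical names, and it assumes a seekable FASTA file handle. Each record object stores only an identifier and a file offset, and the residues are read from the handle when `seq()` is called.

```python
class LazySeq:
    """Lightweight record: keeps an ID and a file offset, not the residues."""

    def __init__(self, handle, seq_id, offset):
        self._handle = handle
        self.id = seq_id
        self._offset = offset  # position of the first sequence line

    def seq(self):
        """Read the residues from the underlying handle only when asked."""
        self._handle.seek(self._offset)
        parts = []
        for line in iter(self._handle.readline, ""):
            if line.startswith(">"):  # next record begins here
                break
            parts.append(line.strip())
        return "".join(parts)


def lazy_fasta(handle):
    """Yield one LazySeq per FASTA record, in the manner of a pull parser.

    The scanner tracks its own position, so callers may call seq() on a
    record while iteration is still in progress.
    """
    pos = handle.tell()
    while True:
        handle.seek(pos)
        line = handle.readline()
        pos = handle.tell()
        if not line:
            return
        if line.startswith(">"):
            fields = line[1:].split()
            yield LazySeq(handle, fields[0] if fields else "", pos)
```

The same lightweight object could sit in front of other backends (an index file or a database, for example), which is why the two sub-projects are described as interrelated.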
BioPerl 2.0 (and beyond)
- Rationale
- Design or reimplement BioPerl classes without API constraint, using Modern Perl tools or Perl 6.
- Approach
- Most BioPerl code is over 6 years old and doesn't take advantage of Modern Perl tools, such as new features available in Perl 5.10 and 5.12, Moose/MooseX, DBIx::Class, Catalyst, and more. Furthermore, a viable Perl 6 implementation, Rakudo, is currently available. This gives us an enormous opportunity to redesign fundamental aspects of BioPerl without development being hindered by a requirement for backwards compatibility.
Two projects, Biome (Moose-based BioPerl) and BioPerl6 (Perl 6 BioPerl), have already started but are at a very early stage. One could participate in:
- IO implementations for object iteration, or Perl 6 grammars for common formats
- Redesign of common BioPerl classes
- etc.
This is an area ripe for new student project ideas. The more focused the better! Discussion is a must, either via IRC or email.
- Difficulty
- Project-dependent
- Mentors
- Chris Fields, Robert Buels
Alignment Subsystem Refactoring
- Rationale
- BioPerl's Bio::Align::AlignI subsystem is quite old and in need of significant refactoring. Furthermore, the Bio::Align::AlignI and Bio::Assembly subsystems need further integration. This is an area ripe for reimplementation to produce a more consistent set of modules.
- Approach
- see the Align Refactor page for more details.
- Difficulty and needed skills
- Medium to hard. Excellent command of Perl and familiarity with sequence alignment and alignment tools.
- Mentors
- Chris Fields, Mark Jensen
- Student
- Jun Yin
- Code blog
- http://gsoc2010-junyin.blogspot.com/
Bio::Assembly
Not terribly sexy, but it could be very useful: an Ace2Sam and Sam2Ace converter.
Continued refinement of AssemblyIO: SAM or ACE files, once imported, should have similar handles and/or methods.
Perl Run Wrappers for External Programs in a Flash
- Rationale
- BioPerl has a long tradition of providing wrapper objects for running external programs and parsing their output, mainly through the distribution called bioperl-run. Wrappers make it relatively easy to process data in highly customizable pipelines with the benefits of BioPerl objects and I/O. They also help to standardize the interfaces to typically idiosyncratic open-source utilities, reducing the burden on the developer. With new bioinformatics tools being released almost daily, however, it can be difficult for the BioPerl regulars to maintain a stable of run wrappers for the latest and greatest tools. Even harder is making the wrapper interfaces themselves conform to a standard API that users can count on.
- Approach
- There are two tracks, one relatively harder than the other.
- Improve/tighten/extend the Bio::Tools::Run::WrapperBase and Bio::Tools::Run::WrapperBase::CommandExts system for very general run wrappers, making them work robustly with the new Bio::Tools::WrapperMaker module currently under development. The goal will be to get these modules ready for release into the trunk.
- Produce and test a library of XML wrapper definition files that (through WrapperMaker) can substitute for or improve the current implementation of existing run wrappers. This may involve some experimental modification of existing wrappers, as well as performing experiments to determine whether WrapperMaker-based modules can successfully pass existing wrapper unit tests.
See HOWTO:Wrappers and the above module documentation for more details.
- Difficulty and needed skills
- Medium. The student should understand or be willing to work hard at understanding BioPerl object-oriented style. Some familiarity with XML and XML Schema will help in getting up to speed. An interest in playing with new open-source bioinformatics tools, especially those for managing next-generation sequence assembly, would also be valuable.
- Mentors
- Mark Jensen
Semantic Web Support
- Rationale
- There are great development opportunities in information discovery for bioinformatics using the Semantic Web, especially in implementing SPARQL queries for a "discoverable bio-cloud".
- Approach
- Previous efforts can be adopted and extended, such as resulting code from BioHackathon 3 and the code provided by Expasy. There are two main areas to explore:
- Parsers and converters from and to RDF, including IO modules for GenBank, EMBL, several XML specifications, et cetera.
- Storage and retrieval of information using SPARQL.
- Difficulty and needed skills
- Medium. Familiarity with the SeqIO modules and with Perl itself. The student should also be familiar with the RDF format and the concept of RDF triples as used in the Semantic Web.
- Mentors
- To be determined.
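To make the first area above (converting sequence records to RDF) concrete, here is a minimal sketch in Python, for illustration only and not BioPerl code. It turns a record into subject-predicate-object triples and serializes them as N-Triples. The record layout, the `http://example.org/` URIs, the `sequence` predicate, and the function names are all assumptions made for this example; only the Dublin Core `identifier` and `description` terms are existing vocabulary.

```python
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace
EX = "http://example.org/bio#"         # hypothetical vocabulary for this sketch


def escape_literal(text):
    """Escape the characters that N-Triples requires escaping inside a literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def record_to_triples(record, base="http://example.org/seq/"):
    """Turn a record dict (assumed keys: accession, description, sequence)
    into a list of (subject, predicate, literal object) triples."""
    subject = base + record["accession"]
    return [
        (subject, DCTERMS + "identifier", record["accession"]),
        (subject, DCTERMS + "description", record["description"]),
        (subject, EX + "sequence", record["sequence"]),
    ]


def to_ntriples(triples):
    """Serialize triples with literal objects as N-Triples, one per line."""
    return "".join(
        f'<{s}> <{p}> "{escape_literal(o)}" .\n' for s, p, o in triples
    )
```

Triples in this form can be loaded into any RDF store and then queried with SPARQL, which is the second area listed above.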
(your idea here)
Past Projects
2009
As part of NESCent's Phyloinformatics GSoC
- Chase Miller - BioPerl wrapper for Rutger Vos's Bio::Phylo package
- See the HOWTO...
- Xin "David" Shuai - SWIG-based wrappers for libsequence
- Part of the BioLib project
- blog
- Project Page
2008
As part of NESCent's Phyloinformatics GSoC
- Mira Han - BioPerl PhyloXML support [1]
- Project page: PhyloXML support in BioPerl
- HOWTO:PhyloXML
Publications
- Han MV and Zmasek CM. phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 2009 Oct 27;10:356. doi:10.1186/1471-2105-10-356 pmid:19860910.


