PhyloXML support in BioPerl

Author

Mira Han, Indiana University. mirhan-at-indiana.edu

Original Proposal

Abstract

PhyloXML is an XML document model for phylogenetic data that incorporates various annotation types, including user-customized data. The format is currently not supported by BioPerl. I propose a SAX-based data structure and interface for PhyloXML support in BioPerl. I will reuse most of the existing IO structures, such as TreeIO and TreeEventBuilder, and subclass them to add the functions specific to PhyloXML. The project will consist of three main parts, corresponding to the objects TreeIO::phyloXML, TreeIO::PhyloEventBuilder and Tree::NodePhyloXML. TreeIO::phyloXML inherits TreeIO, uses the node type NodePhyloXML, and uses PhyloEventBuilder as the SAX parser. TreeIO::PhyloEventBuilder inherits TreeEventBuilder and specifies the action for each element. Tree::NodePhyloXML inherits NodeI and carries phyloXML-specific internal variables, including a hash for custom types. NodePhyloXML will be connected by reference to various existing BioPerl modules, such as SeqI, TaxonI and AnnotationI, in order to accommodate the different phyloXML elements.


Building a parser using the TreeEventBuilder is going to be a straightforward task of detailing what to do with every element defined in the PhyloXML syntax. Since BioPerl already has a working version of TreeEventBuilder that builds nodes using the NodeI interface, the basics would be to use those methods for the <clade> elements, and extend the NodeI interface for PhyloXML specific data types.

The Project will consist of three main parts.

1. TreeIO::phyloXML.pm

 a. inherits TreeIO. 
 b. module is loaded by TreeIO just like other subclasses of TreeIO.
 c. uses node type NodePhyloXML.
 d. uses PhyloEventBuilder as the SAX parser.

2. TreeIO::PhyloEventBuilder.pm

 a. inherits TreeEventBuilder
 b. specifies actions for phyloXML elements.

3. Tree::NodePhyloXML.pm

 a. inherits NodeI
 b. implementation of NodeI interface and extension to handle phyloXML specific elements.

After these modules are implemented, TreeFunctionsI, which only uses methods defined in NodeI and TreeI, should work without any modification.

The XML parser module will be the basic parser that is selected for the EventHandlerI. An extension of the project, once it is successfully finished, would be to add support for DOM-based parsing of phyloXML. BioPerl would then let the user choose the mode of parsing, similar to the way bsml is handled.

Time-line: Prep

Design Object Structure

TreeIO::phyloXML

  • inherits TreeIO
  • module is loaded by TreeIO
  • uses node type NodePhyloXML
  • uses PhyloEventBuilder as the SAX parser

TreeIO::PhyloEventBuilder

  • inherits TreeEventBuilder
  • specifies actions for elements

Tree::NodePhyloXML

  • inherits NodeI
  • design phyloXML specific internal variables

Design the connection with existing BioPerl modules, such as SeqI, TaxonI, AnnotationI, etc.


Project Progress

5/25 Project Start

1. basic structure of TreeIO::phyloXML

  • attach_EventHandler
  • set _handler to a new instance of TreeIO::PhyloEventBuilder
  • load phyloXML module with TreeIO

2. basic structure of TreeIO::PhyloEventBuilder

  • start_document, end_document
  • connect with Tree->new()

3. basic structure of NodePhyloXML

  • new()
 done:
 1. made skeleton files for TreeIO::PhyloEventBuilder, TreeIO::phyloXML, Tree::NodePhyloXML
 2. managed to connect and load them up, but there is a bus error problem.
 I think it’s probably due to some of the function calls I’m making
 that I haven’t looked into properly. I suspect it will go away once I properly
 build in the end_element for <clade>

6/1

implement start_element, and end_element for <phylogeny> and <clade>

  • start_element: <phylogeny>: increment treelevel, <clade>: push data to current_items.
  • end_element: <phylogeny>: decrement treelevel, <clade>: pop data from current_elements, use new() to build node from popped data.

Get rid of that bus error. TreeIO::phyloXML::next_tree(): look for element </phylogeny>

 done:
 The bug was fixed when <clade> was properly built in, but I had to discard the whole code anyway, because I decided to use XML::LibXML::Reader instead of the SAX EventBuilder. The PhyloXMLEventBuilder file is removed, and all the parsing is now done in phyloxml.pm.
 The pull-style interface of XML::LibXML::Reader made returning in the middle of parsing more straightforward.
 Basic structure of next_tree() is finished for now, with the subroutines
 element_phylogeny() / end_element_phylogeny()
 element_clade() / end_element_clade() &
 end_element_name()
 Recursive clades are parsed into a tree object.
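
The idea of returning a tree as soon as </phylogeny> is reached, with nested <clade> elements collected on a stack, can be illustrated with a small language-neutral sketch. It is written in Python with the standard library rather than Perl, and it is not the code in phyloxml.pm. The names Node, local and next_trees are hypothetical, chosen only for this illustration.

```python
import io
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical stand-in for a tree node; not a BioPerl class."""

    def __init__(self):
        self.name = None
        self.children = []


def local(tag):
    """Drop a namespace prefix such as '{http://www.phyloxml.org}clade'."""
    return tag.rsplit("}", 1)[-1]


def next_trees(source):
    """Yield the root node of each <phylogeny> as soon as its end tag is read."""
    path = []    # names of all currently open elements
    stack = []   # nodes for currently open <clade> elements, innermost last
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        tag = local(elem.tag)
        if event == "start":
            path.append(tag)
            if tag == "clade":
                node = Node()
                if stack:
                    stack[-1].children.append(node)  # nested clade: child node
                else:
                    root = node                      # outermost clade: tree root
                stack.append(node)
        else:
            path.pop()  # after this, path[-1] is the parent element
            if tag == "name" and path and path[-1] == "clade":
                stack[-1].name = elem.text
            elif tag == "clade":
                stack.pop()
            elif tag == "phylogeny":
                yield root  # hand the finished tree back mid-document
                root = None


doc = (b"<phyloxml><phylogeny><clade>"
       b"<clade><name>A</name></clade><clade><name>B</name></clade>"
       b"</clade></phylogeny>"
       b"<phylogeny><clade><name>C</name></clade></phylogeny></phyloxml>")
trees = list(next_trees(io.BytesIO(doc)))
```

Because next_trees is a generator, a caller can stop after the first tree without reading the rest of the document, which is the behaviour the pull-style reader makes easy.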

6/8

Start adding specific variables to NodePhyloXML. Will parse all elements that are straightforward.

  • <branch_length><Uri><confidence><Identifier><Distribution><BranchColor>, etc.

And add the elements to the NodePhyloXML structure.

 done:
 Changed if-elses to hash of functions. 
 Organized the test script.
 Parsed elements <branch_length> and <confidence>.
 Removed NodePhyloXML and added more general AnnotatableNode
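
Replacing a chain of if-elses with a hash of functions can be sketched as a dispatch table keyed by element name. The example below is a generic Python illustration, not the BioPerl code. The handler names, the dictionary used as a node, and the (tag, text, attrs) arguments are all assumptions made for the sketch.

```python
def handle_branch_length(node, tag, text, attrs):
    """Store the branch length as a number."""
    node["branch_length"] = float(text)


def handle_confidence(node, tag, text, attrs):
    """A clade may carry several confidence values, each with a type."""
    node.setdefault("confidence", []).append((attrs.get("type"), float(text)))


def handle_default(node, tag, text, attrs):
    """Fallback: keep the raw text of any element without its own handler."""
    node.setdefault("other", {})[tag] = text


# The dispatch table: element name -> function, replacing an if/elif chain.
HANDLERS = {
    "branch_length": handle_branch_length,
    "confidence": handle_confidence,
}


def end_element(node, tag, text, attrs):
    """Look up the handler for this element and call it."""
    HANDLERS.get(tag, handle_default)(node, tag, text, attrs)


node = {}
end_element(node, "branch_length", "0.25", {})
end_element(node, "confidence", "89", {"type": "bootstrap"})
end_element(node, "width", "3", {})
```

Adding support for another element then means adding one function and one table entry, without touching the lookup code.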

6/15

Tags left over from last week. Custom elements: parse into Annotation::SimpleValue.

 done:
 <property> that encodes user defined properties to be attached to the <clade> via Annotation::SimpleValue.
 <id_ref> and <id_source> to define properties outside of the <clade>.

6/22

<taxonomy> - decided to use Annotation::StructuredValue instead of Bio::Taxon, because Bio::Taxon is incompatible with the phyloxml design (in BioPerl, Bio::Taxon is-a Node, but in phyloxml a <clade> has-a <taxonomy>). Finish up <taxonomy>. Start implementing the to_string method. I think this is where we need the PhyloXMLNode that inherits AnnotatableNode, since we need phyloxml-specific implementations of to_string for the node.

6/29

Finish up <taxonomy> and the to_string method. Chris suggested using a callback function for to_string instead of making a new class PhyloXMLNode. Will try that this week. If to_string is finished, validate against phyloXML.xsd.

done:
AnnotatableNode::to_string
phyloxml::node_to_string callback.
phyloxml::write_tree with helper.
Changed element-specific subroutines to a default subroutine,
so it works for all clade-annotating elements with a single level of nesting.
Used nested AnnotationCollections to handle attributes and text.
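
The callback approach to to_string, as opposed to a format-specific node subclass, can be illustrated roughly as follows. This is a Python sketch of the design idea only. The Node class and phyloxml_node_to_string are hypothetical names and do not reproduce the signatures of AnnotatableNode::to_string or phyloxml::node_to_string.

```python
from xml.sax.saxutils import escape


class Node:
    """Hypothetical generic node that knows nothing about phyloXML."""

    def __init__(self, name=None, branch_length=None):
        self.name = name
        self.branch_length = branch_length
        self.children = []

    def to_string(self, callback=None):
        """Delegate formatting to the callback; default to the plain name."""
        if callback is not None:
            return callback(self)
        return self.name or ""


def phyloxml_node_to_string(node):
    """Format-specific serializer, supplied by the writer, not by the node."""
    parts = ["<clade>"]
    if node.name is not None:
        parts.append(f"<name>{escape(node.name)}</name>")
    if node.branch_length is not None:
        parts.append(f"<branch_length>{node.branch_length}</branch_length>")
    for child in node.children:
        parts.append(child.to_string(phyloxml_node_to_string))
    parts.append("</clade>")
    return "".join(parts)


root = Node("root")
root.children.append(Node("A", 0.5))
xml_text = root.to_string(phyloxml_node_to_string)
```

The node class stays format-agnostic, and each writer passes in its own serializer, which is why no PhyloXMLNode subclass is needed in this design.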

7/6

Make clade annotation work for elements with nested levels

  • <taxonomy> <events>,

Connect with other objects

  • <sequence> <domain_architecture>
done:
clade annotation works for all elements with nested levels,
e.g. <taxonomy>, <events>, <distribution>, <annotation>, <date>
currently all <sequence> annotations are connected to the node’s AnnotationCollection as regular annotations.

7/13 Midterm Evaluation due

Should create a sequence object and move the annotations to the sequence. <domain_architecture>, <sequence_relation>. Update documentation.

 done:
 Seq object created with corresponding annotations. 
 AnnotatableNode has-a SeqI accessed via sequence()
 wiki updated.

7/20

<sequence_relation> There is no BioPerl module to handle sequence relations (ortholog/paralog etc.). Something vaguely similar would be Ontology. The tree holds sequence relations in phyloxml. Should Tree be annotatable as well?

<clade_relation> multiple parents.

 done:
 <sequence_relation> orthology/paralogy/etc.
 <clade_relation> network connection.
 Made a temporary implementation that puts the information in the tree as tags.
 Since BioPerl doesn’t support multiple parents, this may be the best way to keep the information.
 Will ask about it on the BioPerl list.
 Received the domain_architecture example document from Zmasek.

7/27

<domain_architecture><protein domain>, <alignment> - connect with Bio::SimpleAlign

 done:
 creating an Annotation::Relation object per people’s recommendation.
 modifying <sequence_relation> and <clade_relation> to use Annotation::Relation on the nodes instead of tags on the trees.
 turns out <alignment> is not part of the phyloxml specification.

8/3

Finish up Annotation::Relation. <domain_architecture><protein domain>. Testing & documentation.


done:
Finished the Annotation::Relation object.
Attached the Relation annotations to the corresponding nodes instead of the tree.
Modified the write_tree command to look up the clade_relation of each node and the sequence_relation of each sequence while writing the nodes, and to summarize the clade_relations at the top of the tree.
Added get_deep_Annotations in Annotation::Collection in order to get annotations that are within nested collections.
I wasn’t able to implement the code for <domain_architecture> and <protein_domain>.

8/10 End coding

Test reading and writing of different phyloXML example documents. Finalize testing and documentation.

done:
added -nowarnonempty to Bio::PrimarySeqI
renamed get_deep_Annotations to get_nested_Annotations and added the optional argument '-recursive' and the default behavior of 'get_Annotations'
added method read_annotation to TreeIO::phyloxml to provide an easy way to access deep annotations via XPath-like path arguments.
more testing added to phyloxml.t (90 tests in total)
documentation: Howto (Phyloxml Project Demo) and pdocs.
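
The two ways of reaching annotations inside nested collections mentioned above, a recursive lookup by key and an XPath-like path, can be sketched as follows. This is an illustrative Python model and not the BioPerl implementation. Collection, get_nested and read_path are hypothetical names, and their arguments do not mirror the actual signatures of get_nested_Annotations or read_annotation.

```python
class Collection:
    """Hypothetical nested container: key -> list of values or Collections."""

    def __init__(self):
        self._items = {}

    def add(self, key, value):
        self._items.setdefault(key, []).append(value)

    def get(self, key):
        """Top-level lookup only."""
        return list(self._items.get(key, []))

    def get_nested(self, key, recursive=False):
        """Like get(), but optionally also search collections nested inside."""
        found = self.get(key)
        if recursive:
            for values in self._items.values():
                for value in values:
                    if isinstance(value, Collection):
                        found.extend(value.get_nested(key, recursive=True))
        return found


def read_path(collection, path):
    """Follow a slash-separated path such as 'taxonomy/scientific_name'."""
    current = [collection]
    for step in path.split("/"):
        following = []
        for item in current:
            if isinstance(item, Collection):
                following.extend(item.get(step))
        current = following
    return current


taxonomy = Collection()
taxonomy.add("scientific_name", "Escherichia coli")
clade = Collection()
clade.add("name", "node 1")
clade.add("taxonomy", taxonomy)

shallow = clade.get_nested("scientific_name")
deep = clade.get_nested("scientific_name", recursive=True)
by_path = read_path(clade, "taxonomy/scientific_name")
```

The recursive lookup is convenient when the caller only knows the key, while the path form is unambiguous when the same key can occur at several levels.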

8/17 End Project

Write up Final Evaluation


Phyloxml Project Demo

To do list

Order of the tags to be compliant with the Phyloxml schema (as Christian activated the XSD validation in APTX)
Redundancy of sequence_relation (symmetrical duplication).
<protein domain>
<domain architecture>
xsd validation