[Bioperl-l] bioperl based database infrastructure for directed graphs

Robson Francisco de Souza robfsouza at gmail.com
Wed Jan 9 08:20:08 EST 2008


Hello All!

Greetings to everybody, and happy new year to those following a
Western calendar!

I'm starting a new project to store and analyze distinct sets of
sequence annotation data which are related in a way suitable for
representation in a directed (e.g. transcript splicing) or undirected
(e.g. gene product interaction) graph. Analysis will require frequent
queries based on interval overlaps, feature neighbourhood, annotation
and, most importantly, feature relationships and stored paths.

At first, I thought of building an entirely new database structure to
store project-specific data (e.g. alternative splicing or protein
interaction), but as I have some experience with Lincoln's
Bio::DB::SeqFeature::Store, I'm now considering extending it for the
purpose of storing graphs describing relationships among features.

I'm aware that some other bioperl-related databases, specifically
BioSQL and Chado, do have components which might be suitable for
storing all or some of these data but, since Lincoln's feature storage
and interval binning implementations in
Bio::DB::SeqFeature::Store::mysql are clean, simple and very fast,
perhaps extending it in a modular way is desirable. A good
extension to Lincoln's database could include tables like
feature_relationship and feature_path, for edges and transitive
closures (just like in BioSQL), and feature_stored_path, for exclusion
of biologically irrelevant paths in DAGs, like certain splicing
isoforms. These tables could also be used to store sequence assemblies
or EST alignments efficiently, including scaffolds inferred by
connecting contigs.
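
To make the idea a bit more concrete, here is a rough sketch of the
edge and transitive closure tables. It is written in Python against
SQLite only so that it is easy to run; every column name, the distance
column and the build_closure helper are my own assumptions for
illustration, not the actual Bio::DB::SeqFeature::Store, BioSQL or
Chado layout. It also assumes the graph is a DAG, since the recursive
query would not terminate on a cycle.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE feature_relationship (   -- direct edges of the graph
    subject_id INTEGER NOT NULL,
    object_id  INTEGER NOT NULL,
    PRIMARY KEY (subject_id, object_id)
);
CREATE TABLE feature_path (           -- transitive closure of the edges
    subject_id INTEGER NOT NULL,
    object_id  INTEGER NOT NULL,
    distance   INTEGER NOT NULL,      -- length of the shortest path
    PRIMARY KEY (subject_id, object_id)
);
"""


def build_closure(conn):
    """Rebuild feature_path from feature_relationship (DAG assumed)."""
    conn.execute("DELETE FROM feature_path")
    conn.execute("""
        WITH RECURSIVE walk(subject_id, object_id, distance) AS (
            SELECT subject_id, object_id, 1 FROM feature_relationship
            UNION
            SELECT w.subject_id, r.object_id, w.distance + 1
            FROM walk AS w
            JOIN feature_relationship AS r ON r.subject_id = w.object_id
        )
        INSERT INTO feature_path (subject_id, object_id, distance)
        SELECT subject_id, object_id, MIN(distance)
        FROM walk
        GROUP BY subject_id, object_id
    """)


def reachable_from(conn, feature_id):
    """Return (object_id, distance) pairs reachable from feature_id."""
    rows = conn.execute(
        "SELECT object_id, distance FROM feature_path "
        "WHERE subject_id = ? ORDER BY object_id",
        (feature_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Toy splicing graph: exon 1 -> exon 2, then exon 2 -> exon 3 or 4.
    conn.executemany(
        "INSERT INTO feature_relationship VALUES (?, ?)",
        [(1, 2), (2, 3), (2, 4)],
    )
    build_closure(conn)
    print(reachable_from(conn, 1))
```

With the closure precomputed, "everything downstream of feature X"
becomes a single indexed lookup instead of a graph traversal, at the
cost of rebuilding or updating feature_path whenever edges change.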

Before starting, I would like to know whether the BioSQL and Chado
schemata have accelerators for querying intervals among billions of
features and feature relationships (some examples using these
databases would also help, if they show that these databases are
efficient for such tasks). If these or other databases are not as
suitable as Bio::DB::SeqFeature for feature retrieval based on
interval overlap and attributes, then again I might consider
extending Bio::DB::SeqFeature and contributing such extensions back
to bioperl...
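
By "accelerator" I mean something along the lines of the sketch
below: features are filed under coarse bins, so that an overlap query
only inspects the bins the query range touches before doing the exact
coordinate test. This is a deliberately simplified fixed-width scheme
in Python; the BinnedIndex class, its method names and the bin width
are assumptions for illustration and make no claim to match what
Bio::DB::SeqFeature::Store::mysql, BioSQL or Chado actually do.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width, in bases


class BinnedIndex:
    """Toy fixed-width bin index for interval overlap queries."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)  # bin number -> [(start, end, id)]

    def _bin_range(self, start, end):
        # All bin numbers touched by the inclusive interval [start, end].
        return range(start // self.bin_size, end // self.bin_size + 1)

    def add(self, feature_id, start, end):
        """File a feature (inclusive coordinates) under every bin it touches."""
        for b in self._bin_range(start, end):
            self.bins[b].append((start, end, feature_id))

    def overlapping(self, start, end):
        """Return sorted ids of features overlapping [start, end]."""
        hits = set()
        for b in self._bin_range(start, end):
            for f_start, f_end, feature_id in self.bins.get(b, ()):
                # Exact test: two inclusive intervals overlap when each
                # one starts no later than the other one ends.
                if f_start <= end and f_end >= start:
                    hits.add(feature_id)
        return sorted(hits)


if __name__ == "__main__":
    index = BinnedIndex()
    index.add("exon_a", 100, 200)
    index.add("exon_b", 950, 1050)
    index.add("gene_c", 5000, 6000)
    print(index.overlapping(150, 1000))
```

In a relational setting the same idea would translate to an indexed
bin column that narrows the candidate rows before the start/end
comparison is applied.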

Any thoughts?

Best regards,
Robson

PS: sorry if anyone gets two copies of this post, but it took me some
time to realize my new e-mail wasn't subscribed to bioperl-l...