[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Sendu Bala bix at sendu.me.uk
Tue Aug 22 04:23:40 EDT 2006

I'm looking to extract data from some Transcription Factor Binding Site 
(TFBS) databases: for example, matrix, sequence and known binding 
position information from Transfac flatfiles.

Currently there is Bio::Matrix::PSM::IO::transfac, but it only gives you 
the PSM matrices, not the 'instance' sequences. Bio::Matrix::PSM also 
has this to say:

> To handle a combination of site matrices and/or their corresponding
> sequence matches (instances). This object inherits from
> Bio::Matrix::PSM::SiteMatrix, so you can use the respective
> methods. It may hold also an array of Bio::Matrix::PSM::InstanceSite
> object, but you will have to retrieve these through
> Bio::Matrix::PSM::Psm->instances method (see below). To some extent
> this is an expanded SiteMatrix object, holding data from analysis that
> also deal with sequence matches of a particular matrix.
> This does not make too much sense to me I am mixing PSM with PSM
> sequence matches Though they are very closely related, I am not
> satisfied by the way this is implemented here.  Heikki suggested
> different objects when one has something like meme But does this mean
> we have to write a different objects for mast, meme, transfac,
> teiresias, etc.?  To me the best way is to return SiteMatrix object +
> array of InstanceSite objects and then mast will return undef for
> SiteMatrix and transfac will return undef for InstanceSite. Probably I
> cannot see some other design issues that might arise from such
> approach, but it seems more straightforward.  Hilmar does not like
> this because it is an exception from the general BioPerl rules Should
> I leave this as an option?  Also the header rightfully belongs the
> driver object, and could be retrieved as hashes.  I do not think it
> can be done any other way, unless we want to create even one more
> object with very unclear content.

I actually want to get even more kinds of data out, so rather than 
extend Bio::Matrix::PSM::IO::transfac and related modules in some way, 
would it be more appropriate to have something like 
Bio::DB::TFBS::transfac which had a number of methods that gave specific 
kinds of objects? We could have get_psm() which gives a normal 'pure' 
Bio::Matrix::PSM with no InstanceSite objects, get_aln() which returns a 
Bio::SimpleAlign for the 'instance' sequences that were used to generate 
a given PSM, and get_map() which returns a new special kind of Bio::Map 
with binding site position information.
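
To make the proposal a bit more concrete, here is a rough sketch of the 
interface shape I have in mind. It is written in Python purely for 
brevity; the real thing would be a Perl module returning 
Bio::Matrix::PSM, Bio::SimpleAlign and Bio::Map objects. The class names 
TransfacDB and Psm and the three dict attributes are hypothetical 
stand-ins, not existing code.

```python
from dataclasses import dataclass, field


@dataclass
class Psm:
    """Stand-in for a 'pure' Bio::Matrix::PSM: counts only, no instance sites."""
    accession: str
    counts: list  # one [A, C, G, T] row per matrix position


@dataclass
class TransfacDB:
    """Hypothetical analogue of the proposed Bio::DB::TFBS::transfac."""
    matrices: dict = field(default_factory=dict)   # accession -> count rows
    instances: dict = field(default_factory=dict)  # accession -> site sequences
    positions: dict = field(default_factory=dict)  # accession -> site positions

    def get_psm(self, acc):
        # Matrix only; instance sequences are deliberately not attached.
        return Psm(acc, self.matrices[acc])

    def get_aln(self, acc):
        # Stand-in for a Bio::SimpleAlign of the instance sequences.
        return list(self.instances.get(acc, []))

    def get_map(self, acc):
        # Stand-in for the proposed Bio::Map of binding site positions.
        return list(self.positions.get(acc, []))
```

The point of the split is that each getter hands back exactly one kind 
of object, so nothing has to mix a matrix with its sequence matches.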

Another reason it makes more sense for this to be a 'DB' module rather 
than an IO one is that the database consists of multiple huge Transfac 
data files containing related, cross-referenced information. To extract 
the complete information you would want to parse them all and create 
indexes for fast lookups later, which is not something you really expect 
of an IO module.
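
As a rough illustration of the indexing idea (again in Python for 
brevity, not actual BioPerl code), the sketch below scans a flatfile 
once and remembers the byte offset of each record, so a later lookup can 
seek straight to it. It assumes each record carries an 'AC' accession 
line and ends with a '//' line; build_index and fetch_record are 
hypothetical names, and the demo records are made up.

```python
import io


def build_index(handle):
    """Map each accession to the byte offset where its record starts."""
    index = {}
    record_start = handle.tell()
    while True:
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"AC"):
            index[line[2:].strip().decode("ascii")] = record_start
        elif line.startswith(b"//"):
            # The next record begins right after the terminator line.
            record_start = handle.tell()
    return index


def fetch_record(handle, index, accession):
    """Seek to one record and return its lines up to the '//' terminator."""
    handle.seek(index[accession])
    lines = []
    for line in iter(handle.readline, b""):
        if line.startswith(b"//"):
            break
        lines.append(line.decode("ascii").rstrip("\n"))
    return lines


# Tiny in-memory demo; these records are invented for illustration.
demo = io.BytesIO(b"AC  X0001\nID  demo_one\n//\nAC  X0002\nID  demo_two\n//\n")
demo_index = build_index(demo)
```

A real implementation would persist the index and also follow the 
cross-references between files, which this sketch does not attempt.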

Thoughts anyone?
