[Bioperl-l] TFBS databases, Bio::Matrix::PSM suitable?

Sendu Bala bix at sendu.me.uk
Tue Aug 22 08:18:15 EDT 2006

Stefan Kirov wrote:
> Sendu Bala wrote:
>> I'm looking to extract data from some Transcription Factor Binding 
>> Site (TFBS) databases. For example, matrix, sequence and known 
>> position information out of Transfac flatfiles.
>> Currently there is Bio::Matrix::PSM::IO::transfac, but it only gives 
>> you the PSM matrices, not the 'instance' sequences. Bio::Matrix::PSM 
>> also has this to say:
> Transfac is not an open database so, you cannot get the instance data 
> anyway.

You can. It is in the sites.dat file and often in the matrix.dat file. 
It is also available freely and publicly via at least 2 websites.

> There was a discussion on that recently. Since Bioperl is 
> completely open project, I am not sure it makes sense to put efforts 
> into supporting something that is not open- even if you have access to 
> the data files (which I believe Transfac does not allow in general)

It does allow it; you just have to pay for fast access to the latest 
data. Or you can use older data for free via the web. A Bio::DB module 
could provide access to either.

> how the rest of us can use it or debug/support it?

It may be possible to include a small example subset of the data in 
t/data; there is after all already t/data/transfac.dat (which is a small 
matrix.dat file).

In any case, I don't see that your argument is valid. Why should Bioperl 
be restricted to only dealing with 'open' data sources? If someone is 
willing to develop and maintain a module that deals with a data source, 
it makes no difference if that source is open or not - it is useful 
either way to other people who also have access to that data. If there 
comes a time that the maintainer can no longer maintain it and it stops 
working because the data format changes, and no one knows the new 
format, it can be deprecated.

Is there some 'popularity' threshold that must be passed before it is 
'worth' adding a database module to Bioperl? Why should there be one? 
The cost of having one is a few kb of disc storage space; the benefit is 
extremely large to the person who might want to use it. There may be an 
argument that core shouldn't become cluttered with too much stuff that 
the majority of people won't use, but how is that line drawn? I don't 
personally use the majority of Bioperl modules, but I don't think they 
should all be removed. And clearly the idea of having PWM and 
Transfac-related modules in Bioperl has been deemed acceptable in the 
past, or we wouldn't have Bio::Matrix::PSM::IO::transfac.
