Bioperl: XML/BioPerl

Gunther Birznieks
Wed, 30 Dec 1998 14:23:34 -0500 (EST)

On Wed, 30 Dec 1998, David J. States wrote:

> Is anyone aware of plans on the part of the database organizations to serve 
> XML?
If you mean general databases, Matthew Sergeant has a DBI->XML and
XML->DBI converter. If you mean specific biological databases, then I am
naive on this aspect but I do have some thoughts on the subject you bring
up below.
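
To make the DBI->XML idea concrete, here is a minimal sketch of turning
query results into XML. It is written in Python with the standard sqlite3
module standing in for DBI; it is not the converter mentioned above, and
the table, column names, and the rows_to_xml helper are all hypothetical.
It also assumes every column name is a valid XML element name.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(cursor, root_tag="resultset", row_tag="row"):
    """Serialize the rows of an executed cursor as a small XML document."""
    columns = [col[0] for col in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for name, value in zip(columns, record):
            # One child element per column; NULLs become empty elements.
            ET.SubElement(row, name).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (id TEXT, organism TEXT, length INTEGER)")
conn.execute("INSERT INTO seq VALUES ('X1', 'E. coli', 1200)")
xml_text = rows_to_xml(conn.execute("SELECT id, organism, length FROM seq"))
```

The reverse direction (XML->DBI) would walk the same structure and issue
one INSERT per row element.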

> An alternative to agreement on a bio DTD is to push the burden of data 
> resolution issues onto the client.  In writing an applet for a specific 
> display function, you would need to know the relationship of the various 
> fields in the data sources that you were referencing.  This seems less 
> desirable, but at least it is a way forward.
> Thoughts or suggestions?
Your questions are thought-provoking. To me, they reveal much about the
subtle tension between XML as a data parsing standard and how much
attention should actually be paid to the new markup languages that are
vying to get formed as new "standards" under the XML umbrella.

[1] XML Markup "standards"

My feeling is that too many people are trying too hard to define
standard "all-encompassing" DTDs for their problem domains. My belief is
that in your initial prototype stage, you should definitely consider
placing the burden on the client to understand your XML structure rather
than waiting to conform to something that doesn't exist or isn't well
defined yet.

Even if one is defined but is too complex for your data needs, you
probably should define your own XML markup anyway.  Much of the value of
XML lies in (a) being able to build structures that map easily to your
specific problem space and (b) making that structure efficient and
trivial to parse.

If you are exposing a relatively simple interface to your data on the
Internet, your users are probably going to be happy just knowing that they
can do away with HTML::Parser, and instead download the data in a simple
format.  In addition, they won't have to weed through a bloated Document
standard API.
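
As a rough illustration of how little a client needs in order to consume a
small data-centered markup, here is a sketch in Python using the standard
xml.etree.ElementTree module. The <sequences>/<sequence> markup and the
parse_sequences helper are invented for this example and are not an
existing standard or part of any library.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup invented for this example, not an existing standard.
SAMPLE = """\
<sequences>
  <sequence id="seq1" organism="Homo sapiens">ATGGCC</sequence>
  <sequence id="seq2" organism="Mus musculus">ATGTTTAA</sequence>
</sequences>
"""


def parse_sequences(xml_text):
    """Return a list of dicts, one per <sequence> element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": node.get("id"),
            "organism": node.get("organism"),
            "residues": (node.text or "").strip(),
        }
        for node in root.findall("sequence")
    ]


records = parse_sequences(SAMPLE)
```

The client only has to know three names (id, organism, and the element
text), which is the sense in which the burden moves to the client.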

If the data you are presenting is complicated, I would still say the same
thing: you may still want to consider not using a "standard biological
markup".  The reason is efficiency.  Your XML may actually be more
readable if you create it in a way that is centered around your data
rather than someone else's idea of how the data should look.

I am very wary of database "standards" since they tend to lock people into
inefficiencies and kludges to get around those inefficiencies. In many
cases, it would have been just as easy to provide a simpler, more readable
format without the 20 million "exceptions to the rule" that pile up as
people start using these things in the real world.

My rule of thumb: if you are creating a general tool to do lots of general
things, then stick to a standard. If you are creating a specific tool to
generate specific data, conforming to the "standard" may not produce code
that is as readable or usable as what you would get by defining a simple
API for your specific data set in the first place.

By the way, this rule of thumb might make more sense if you think about
the analogy to relational database schemas.  If you want a general
database to store lots of different general data types in a
problem domain, it makes sense to stick with a standard database schema.
However, the more specific your database becomes within that problem
domain, the more you will find yourself trying to jump through hoops to
gain efficiency and ease of data access.
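
The schema analogy can be sketched as follows, in Python with the standard
sqlite3 module. Both table layouts and all of the data are hypothetical;
the generic "attribute" table is only one possible reading of a general
schema, chosen to show the kind of hoops a specific query has to jump
through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Generic layout: any attribute of any entity fits, one row per value.
conn.execute("CREATE TABLE attribute (entity_id TEXT, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO attribute VALUES (?, ?, ?)",
    [("seq1", "organism", "Homo sapiens"), ("seq1", "length", "1200"),
     ("seq2", "organism", "Mus musculus"), ("seq2", "length", "300")],
)

# Specific layout: columns mirror the one data set being stored.
conn.execute("CREATE TABLE sequence (id TEXT, organism TEXT, length INTEGER)")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("seq1", "Homo sapiens", 1200), ("seq2", "Mus musculus", 300)],
)

# Same question against both: which human sequences are longer than 1000?
generic = conn.execute(
    """SELECT o.entity_id
       FROM attribute AS o
       JOIN attribute AS l ON l.entity_id = o.entity_id
       WHERE o.name = 'organism' AND o.value = 'Homo sapiens'
         AND l.name = 'length' AND CAST(l.value AS INTEGER) > 1000"""
).fetchall()
specific = conn.execute(
    "SELECT id FROM sequence WHERE organism = 'Homo sapiens' AND length > 1000"
).fetchall()
```

Both queries return the same rows, but the generic layout needs a
self-join and a cast to answer a question the specific layout answers
with a plain WHERE clause.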

I realize my views may be controversial. In advance I have to say that I
am struggling with the notion of what XML is and is not good for in this
early stage of XML. So take what I say with that caveat.

[2] XML Perl object serialization?

To touch on another note, I would not recommend using a tool
to auto-generate XML from Perl object structures.  Such a tool is likely
to expose too much extraneous stuff to the user.
The only time I would really think seriously about doing that is to get 
the ability to potentially recreate the object over the network based on
an XML stream. Sort of like Java object serialization over HTTP.

But that would not be a good "cross-language" interface to expose to the
world, I imagine.
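
A small sketch of the difference, in Python rather than Perl. The
SequenceRecord class, its attributes, and both helper functions are
hypothetical; auto_dump stands in for a generic object-to-XML tool and is
not any particular library's behavior.

```python
import xml.etree.ElementTree as ET


class SequenceRecord:
    """Hypothetical class; the underscore attributes are internal state."""

    def __init__(self, seq_id, residues):
        self.seq_id = seq_id
        self.residues = residues
        self._cache_path = "/tmp/seq.cache"   # implementation detail
        self._dirty = False                   # implementation detail


def auto_dump(obj):
    """Naive serializer: every instance attribute becomes an element."""
    root = ET.Element(type(obj).__name__)
    for name, value in vars(obj).items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def export(record):
    """Hand-written export: only what a client needs to see."""
    root = ET.Element("sequence", id=record.seq_id)
    root.text = record.residues
    return ET.tostring(root, encoding="unicode")


rec = SequenceRecord("seq1", "ATGGCC")
leaky = auto_dump(rec)
clean = export(rec)
```

The automatic dump carries the cache path and dirty flag along with the
data, and its element names are tied to one class in one language; the
hand-written export carries only the identifier and the residues.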

[3] Applets and XML

I think you also mentioned something about applets for display? I would
not recommend XML parsing inside applets quite yet.  The Java XML parsers
are not quite that fast and they are still a bit bloated.  Let's put it
this way: Aelfred, which is optimized for speed and size, is still 26k
uncompressed and 15k as a compressed jar. :(  And that does not even give you
any utilities for handling timeouts or any other standard communications
stuff you would want to do via HTTP communications.

If someone wants to write a small applet that displays your data in a
novel way, I would tend to suggest they write a CGI/Perl script to fetch
the XML data through LWP and decode it. Then, the script can output the
data to the Java applet in a trivially easy-to-parse comma- or
pipe-delimited format.
Furthermore, using CGI/Perl (or Servlet or whatever) middleware allows you
to move processing logic to the web server and out of the applet, further
minimizing the impact of the size of the applet itself.
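
The middleware step might look roughly like this. It is sketched in
Python rather than as a CGI/Perl script, the HTTP fetch is left out, and
the <sequence> markup, field names, and both function names are
hypothetical. The second function shows how little parsing is left for
the client to do.

```python
import xml.etree.ElementTree as ET


def xml_to_delimited(xml_text, fields=("id", "organism"), sep="|"):
    """Flatten each <sequence> element into one delimited line."""
    lines = []
    for node in ET.fromstring(xml_text).iter("sequence"):
        values = [node.get(f, "") for f in fields] + [(node.text or "").strip()]
        # Drop any separator inside a value so the client can split blindly.
        lines.append(sep.join(v.replace(sep, " ") for v in values))
    return "\n".join(lines)


def parse_delimited(text, sep="|"):
    """What the thin client has to do: split lines, then split fields."""
    return [line.split(sep) for line in text.splitlines() if line]


# Hypothetical input; in practice the script would fetch this over HTTP.
SAMPLE = '<sequences><sequence id="seq1" organism="Homo sapiens">ATGGCC</sequence></sequences>'
flat = xml_to_delimited(SAMPLE)
rows = parse_delimited(flat)
```

All of the XML handling stays on the server side; the client needs
nothing beyond line and field splitting.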

If you are interested in a library to handle this sort of
communications/parsing for you, JavaCGIBridge 2.0 is located @ It comes with a smaller
automatic delimited file->Vector parser, is less than 10k uncompressed for
the core classes, and handles communications timeouts and stuff like that
automatically for you so you aren't stuck with the blocking URLConnection
JDK class.

Of course, if the applet being developed is already huge for some other
reason like a complication in the data display algorithms or is set for
deployment on an intranet, then a 26k XML parser becomes less of a
concern. But most people tend to want to keep their applets as thin as
possible.

