[Bioperl-guts-l] bioperl commit

Brian Osborne bosborne at pub.open-bio.org
Tue Aug 3 08:12:26 EDT 2004


bosborne
Tue Aug  3 08:12:26 EDT 2004
Update of /home/repository/bioperl/bioperl-live/doc/howto/sgml
In directory pub.open-bio.org:/tmp/cvs-serv12611

Modified Files:
	OBDA_Access.sgml 
Log Message:
OBDA does not currently support Biosql

bioperl-live/doc/howto/sgml OBDA_Access.sgml,1.4,1.5
===================================================================
RCS file: /home/repository/bioperl/bioperl-live/doc/howto/sgml/OBDA_Access.sgml,v
retrieving revision 1.4
retrieving revision 1.5
diff -u -r1.4 -r1.5
--- /home/repository/bioperl/bioperl-live/doc/howto/sgml/OBDA_Access.sgml	2003/12/28 18:32:22	1.4
+++ /home/repository/bioperl/bioperl-live/doc/howto/sgml/OBDA_Access.sgml	2004/08/03 12:12:25	1.5
@@ -122,7 +122,7 @@
     <abstract>
       <para>
 	This is a HOWTO written in DocBook (SGML) that explains how to set up 
-      and use the Open Biological Database Access system. 
+	and use the Open Biological Database Access system. 
       </para>
     </abstract>
   </articleinfo>
@@ -130,74 +130,75 @@
   <section id="introduction">
     <title>Introduction</title>
     <para>
-Importing sequences with annotations is a central part of
-most bioinformatics tasks.  Bioperl supports importing sequences from
-indexed flat-files, local relational databases and remote (internet)
-databases. Previously, separate programming syntax was required for
-each of these types of data access (see for example Section III.1 of
-the Bioperl tutorial). In addition, if one wanted to change one's mode
-of sequence-data acquisition (for example, by implementing a local
-relational database version of Genbank when previously the data had
-been stored in an indexed flat-file) one had to rewrite all of the
-data-access subroutines in one's application code.
-    </para>
-    <para>
-The Open Biological Database Access (OBDA) System was designed so that
-one could use the same application code to access data from all three
-of the database types by simply changing a few lines in a
-configuration file. This makes application code more portable and
-easier to maintain. This document shows how to set up the required
-OBDA-registry configuration file and how to access data from the
-databases referred to in the configuration file using a perl script as
-well as from the command line. The Web site for OBDA is 
-<ulink url="http://obda.open-bio.org">obda.open-bio.org</ulink>.
+      Importing sequences with annotations is a central part of
+      most bioinformatics tasks.  Bioperl supports importing sequences from
+      indexed flat-files, local relational databases and remote (internet)
+      databases. Previously, separate programming syntax was required for
+      each of these types of data access (see for example Section III.1 of
+      the Bioperl tutorial). In addition, if one wanted to change one's mode
+      of sequence-data acquisition (for example, by implementing a local
+      relational database version of Genbank when previously the data had
+      been stored in an indexed flat-file) one had to rewrite all of the
+      data-access subroutines in one's application code.
+    </para>
+    <para>
+      The Open Biological Database Access (OBDA) System was designed so that
+      one could use the same application code to access data from all three
+      of the database types by simply changing a few lines in a
+      configuration file. This makes application code more portable and
+      easier to maintain. This document shows how to set up the required
+      OBDA-registry configuration file and how to access data from the
+      databases referred to in the configuration file using a perl script as
+      well as from the command line. The Web site for OBDA is 
+      <ulink url="http://obda.open-bio.org">obda.open-bio.org</ulink>.
     </para>
     <note>
-Accessing data via the OBDA system is optional in Bioperl.  It
-is still possible, though probably not advantageous, to access
-sequence data via the old database-format-specific modules such as 
-Bio::Index::Fasta or Bio::DB::Fasta.
+      Accessing data via the OBDA system is optional in Bioperl.  It
+      is still possible to access
+      sequence data via the old database-format-specific modules such as 
+      Bio::Index::Fasta or Bio::DB::Fasta.
     </note>
   </section>
+
   <section id="registry">
     <title>Using the OBDA Registry System</title>
     <para>
-The OBDA BioDirectory Registry is a platform-independent system for 
-specifying how BioPerl programs find sequence databases.  It uses a 
-site-wide configuration file, known as the registry, which defines 
-one or more databases and the access methods to use to access them.
+      The OBDA BioDirectory Registry is a platform-independent system for 
+      specifying how BioPerl programs find sequence databases.  It uses both 
+      local and site-wide configuration files, known as the registry, which define 
+      one or more databases and the access methods to use to access them.
     </para>
     <para>
-For instance, you might start out by accessing sequence data over the
-web, and later decide to install a locally mirrored copy of GenBank.
-By changing one line in the registry file, all
-Bio{Perl,Java,Python,Ruby} applications will start using the mirrored
-local copy automatically - no source code changes are necessary.
+      For instance, you might start out by accessing sequence data over the
+      web, and later decide to install a locally mirrored copy of GenBank.
+      By changing one line in the registry file, all
+      Bio{Perl,Java,Python,Ruby} applications will start using the mirrored
+      local copy automatically - no source code changes are necessary.
     </para>
   </section>
 
   <section id="installation">
     <title>Installing the Registry File</title>
     <para>
-The registry file should be named seqdatabase.ini.  By default, it
-should be installed in one or more of the following locations:
+      The registry file should be named seqdatabase.ini.  By default, it
+      should be installed in one or more of the following locations:
     <programlisting>
-   $HOME/.bioinformatics/seqdatabase.ini
-   /etc/bioinformatics/seqdatabase.ini
+	$HOME/.bioinformatics/seqdatabase.ini
+	/etc/bioinformatics/seqdatabase.ini
     </programlisting>
-The Bio{Perl,Java,Python,Ruby} registry-handling code will initialize
-itself from the registry file located in the home directory first,
-and then it will read the system-wide default in /etc. Windows Perl users
-should make sure to set the $HOME variable, otherwise the
-seqdatabase.ini file may not be found. Unix users need not do this
-since the code will use the getpwuid() method.    
+      The Bio{Perl,Java,Python,Ruby} registry-handling code will initialize
+      itself from the registry file located in the home directory first,
+      and then it will read the system-wide default in /etc. Windows Perl users
+      should make sure to set the $HOME variable; otherwise the
+      seqdatabase.ini file may not be found. Unix users need not do this
+      since the code will use the getpwuid() method.    
     </para>
     <para>
       If a local registry file cannot be found, the registry-handling code
       will attempt to copy the file located at this URL to a
       $HOME/.bioinformatics directory:
     <programlisting>
-   http://www.open-bio.org/registry/seqdatabase.ini
+	http://www.open-bio.org/registry/seqdatabase.ini
     </programlisting>    
     </para>
   </section>
@@ -205,37 +206,37 @@
   <section id="modifying">
     <title>Modifying the Search Path</title>
     <para>
-The registry file search path can be modified by setting the
-environment variable OBDA_SEARCH_PATH.  This variable is a
+      The registry file search path can be modified by setting the
+      environment variable OBDA_SEARCH_PATH.  This variable is a
       semicolon-delimited string of directories and URLs, for example:
     <programlisting>
  OBDA_SEARCH_PATH=/home/lstein/;http://foo.org/
     </programlisting>
     </para>    
     <important>
-Note that the fact that the search path is for an entire file
-(seqdatabase.ini) rather than for single entry (e.g. 'genbank') means
-that you have to copy any default values you want to keep from the
-(old) default configuration file to your new configuration file. 
+      Note that the fact that the search path is for an entire file
+      (seqdatabase.ini) rather than for single entry (e.g. 'genbank') means
+      that you have to copy any default values you want to keep from the
+      (old) default configuration file to your new configuration file. 
     </important>
     <para>
-For example, say you have been using biofetch with the default
-configuration file http://www.open-bio.org/registry/seqdatabase.ini
-for all your sequence-data retrieval.  If you now install a local copy
-of genbank, your local seqdatabase.ini must not only have a section
-indicating that the genbank copy is local but it must have sections
-configuring the web access for all the other databases you use, since 
-http://www.open-bio.org/registry/seqdatabase.ini will no longer be
-found in a registry-file search.
+      For example, say you have been using biofetch with the default
+      configuration file http://www.open-bio.org/registry/seqdatabase.ini
+      for all your sequence-data retrieval.  If you now install a local copy
+      of genbank, your local seqdatabase.ini must not only have a section
+      indicating that the genbank copy is local but it must have sections
+      configuring the web access for all the other databases you use, since 
+      http://www.open-bio.org/registry/seqdatabase.ini will no longer be
+      found in a registry-file search.
     </para>
   </section>
 
   <section id="format">
     <title>Format of the Registry File</title>
     <para>
-The registry file is a simple text file, as shown in the following
-example:
-     <programlisting>
+      The registry file is a simple text file, as shown in the following
+      example:
+      <programlisting>
  VERSION=1.00
 
  [embl]
@@ -247,13 +248,13 @@
  protocol=biofetch
  location=http://www.ebi.ac.uk/cgi-bin/dbfetch
  dbname=swall
-    </programlisting>
-The first line is the registry format version number in the format
-VERSION=X.XX.  The current version is 1.00.
+      </programlisting>
+      The first line is the registry format version number in the format
+      VERSION=X.XX.  The current version is 1.00.
     </para>
     <para>
-The rest of the file is composed of simple sections, formatted as:
-    <programlisting>
+      The rest of the file is composed of simple sections, formatted as:
+      <programlisting>
   [database-name]
   tag=value
   tag=value
@@ -261,65 +262,74 @@
   [database-name]
   tag=value
   tag=value
-    </programlisting>
-Each section starts with a symbolic database service name enclosed in
-square brackets.  Service names are case insensitive.  The remainder
-of the section is followed by a series of tag=value pairs that
-configure access to the service.
+      </programlisting>
+      Each section starts with a symbolic database service name enclosed in
+      square brackets. Service names are case-insensitive.  The remainder
+      of the section is a series of tag=value pairs that
+      configure access to the service.
     </para>
     <para>
-Database-name sections can be repeated, in which case the client should
-try each service in turn from top to bottom.
+      Database-name sections can be repeated, in which case the client should
+      try each service in turn from top to bottom.
     </para>
     <para>
-The options under each section must have two non-optional tag=value
-lines being
+      Each section must contain two mandatory tag=value
+      lines:
      <programlisting>
   protocol="protocol-type"
   location="location-string"
      </programlisting>
     </para>
   </section>
+
   <section id="protocol">
     <title>The Protocol Tag</title>
     <para>
-The protocol tag species what access mode to use.  Currently it can be
-one of:
+      The protocol tag specifies what access mode to use.  Currently it can be
+      one of:
     <programlisting>
   flat
   biofetch
   biosql
-    </programlisting>
-"flat" is used to fetch sequences from local flat files that have been
-indexed using BerkeleyDB or binary search indexing.
+      </programlisting>
+      "flat" is used to fetch sequences from local flat files that have been
+      indexed using BerkeleyDB or binary search indexing.
     </para>
     <para>
-"biofetch" is used to fetch sequences from web-based databses.  Due to
-restrictions on the use of these databases, this is recommended only
-for lightweight applications.
+      "biofetch" is used to fetch sequences from web-based databases.  Due to
+      restrictions on the use of these databases, this is recommended only
+      for lightweight applications.
     </para>
     <para>
-"biosql" fetches sequences from BioSQL databases.  To use this module
-you will need to have an instantiated relational database conforming
-to the BioSQL schema, and install the bioperl-db distribution.
+      "biosql" fetches sequences from BioSQL databases.  To use this module
+      you will need to have an instantiated relational database conforming
+      to the BioSQL schema, and install the bioperl-db distribution.
+    </para>
+    <para>
+      <emphasis>
+	Support for the biosql protocol is disabled as of
+	Bioperl version 1.4. We hope to remedy this in a subsequent
+	release.
+      </emphasis>
     </para>
   </section>
+
   <section id="location">
   <title>The Location Tag</title>
   <para>
-The location tag tells the bioperl sequence fetching code where the
-database is located.  Its interpretation depends on the protocol
-chosen.  For example, it might be a directory on the local file
-system, or a remote URL.  See below for protocol-specific details.
+      The location tag tells the bioperl sequence fetching code where the
+      database is located.  Its interpretation depends on the protocol
+      chosen.  For example, it might be a directory on the local file
+      system, or a remote URL.  See below for protocol-specific details.
   </para>
   </section>
   <section id="others">
   <title>Other Tags</title>
     <para>
-Any number of additional tag values are allowed.  The number and
-nature of these tags depends on the access protocol selected.  Some
-protocols require no additional tags, whereas others will require
-several.
+      Any number of additional tag values are allowed.  The number and
+      nature of these tags depends on the access protocol selected.  Some
+      protocols require no additional tags, whereas others will require
+      several.
     </para>
     <table>
       <title>OBDA Protocols</title>
@@ -332,142 +342,172 @@
           </row>
         </thead>
         <tbody>
-        <row>
- <entry>flat</entry><entry>location</entry><entry>Directory in which
- the index is stored. The "config.dat" file generated during indexing must be found in this location</entry>
- </row>
- <row>
- <entry>flat</entry><entry>dbname</entry><entry>Name of database</entry>
- </row>
- <row>
- <entry>biofetch</entry><entry>location</entry><entry>Base URL for the
- web service. Currently the only compatible biofetch service is http://www.ebi.ac.uk/cgi-bin/dbfetch</entry>
- </row>
- <row>>
- <entry>biofetch</entry><entry>dbname</entry><entry>Name of the
- database.  Currently recognized values are "embl" (sequence and
- protein), "swall" (SwissProt + TREMBL), and "refseq" (NCBI RefSeq entries)</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>location</entry><entry>host:port</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>dbname</entry><entry>database name</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>driver</entry><entry>mysql|postgres|oracle|sybase|sqlserver|access|csv|informix|odbc|rdb</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>user</entry><entry>username</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>passwd</entry><entry>password</entry>
- </row>
- <row>
- <entry>biosql</entry><entry>biodbname</entry><entry>biodatabase name</entry>
-      </row>
-      </tbody>
-     </tgroup>
+	  <row>
+	    <entry>flat</entry><entry>location</entry><entry>Directory in which
+	      the index is stored. The "config.dat" file generated during indexing must be found in this location</entry>
+	  </row>
+	  <row>
+	    <entry>flat</entry><entry>dbname</entry><entry>Name of database</entry>
+	  </row>
+	  <row>
+	    <entry>biofetch</entry><entry>location</entry><entry>Base URL for the
+	      web service. Currently the only compatible biofetch service is http://www.ebi.ac.uk/cgi-bin/dbfetch</entry>
+	  </row>
+	  <row>
+	    <entry>biofetch</entry><entry>dbname</entry><entry>Name of the
+	      database.  Currently recognized values are "embl" (sequence and
+	      protein), "swall" (SwissProt + TREMBL), and "refseq" (NCBI RefSeq entries)</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>location</entry><entry>host:port</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>dbname</entry><entry>database name</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>driver</entry><entry>mysql|postgres|oracle|sybase|sqlserver|access|csv|informix|odbc|rdb</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>user</entry><entry>username</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>passwd</entry><entry>password</entry>
+	  </row>
+	  <row>
+	    <entry>biosql</entry><entry>biodbname</entry><entry>biodatabase name</entry>
+	  </row>
+	</tbody>
+      </tgroup>
     </table>
   </section>
+
   <section id="local">
   <title>Installing Local Databases</title>
   <para>
-If you are using the biofetch protocol, you're all set.  You can start
-reading sequences immediately.  For the flat and biosql protocols, you
-will need to create and initialize local databases.  See the following
-documentation on how to do this:
+      If you are using the biofetch protocol, you're all set.  You can start
+      reading sequences immediately.  For the flat protocol, you
+      will need to create and initialize a local database.  See the following
+      documentation on how to do this:
   </para>	
   <para>
-   flat protocol:<ulink
-   url="http://bioperl.org/HOWTOs/html/Flat_Databases.html">Flat
-   Databases HOWTO</ulink>
-  </para>
-  <para>
-   biosql protocol: BioSQL INSTALL (from the biosql-schema package,
-   available at <ulink
-   url="http://obda.open-bio.org/">obda.open-bio.org</ulink>).
+      flat protocol:<ulink
+	url="http://bioperl.org/HOWTOs/html/Flat_Databases.html">Flat
+	Databases HOWTO</ulink>
+    </para>
+    <para>
+      biosql protocol: BioSQL INSTALL (from the biosql-schema package,
+      available at <ulink
+	url="http://obda.open-bio.org/">obda.open-bio.org</ulink>).
       Download the Biosql tar file to view all the documentation.
-  </para>
+    </para>
+    <para>
+      <emphasis>
+	Support for the biosql protocol in OBDA is disabled as of
+	Bioperl version 1.4. We hope to remedy this in a subsequent
+	release.
+      </emphasis>
+    </para>
+    <para>
+      Once the "flat" database is created you can configure your
+      seqdatabase.ini file. Let's say that you have used the
+      bioflat_index.pl script to create the flat database and a new 
+      directory called "ppp" has
+      been created in your /home/sally/bioinf/ directory (and the ppp/ 
+      directory contains the config.dat file). Your seqdatabase.ini entry
+      should contain these lines:
+      <programlisting>
+	protocol=flat
+        location=/home/sally/bioinf
+        dbname=ppp
+      </programlisting>
+    </para>
+    <para>
+      The database-name can be any useful name; it does not have to
+      refer to existing files or directories.
+    </para>
   </section>
 
   <section id="code">
   <title>Writing code to use the Registry</title>
   <para>
-Once you've set up the OBDA registry file, accessing sequence data
-from within a perl script is simple. The following example shows how;
-note that nowhere in the script do you explicitly specify whether the
-data is stored in a flat file, a local relational database or a
-database on the internet.
-  </para>
-  <para>
-To use the registry from a Perl script, use the following idiom:
-  <programlisting>
+      Once you've set up the OBDA registry file, accessing sequence data
+      from within a perl script is simple. The following example shows how;
+      note that nowhere in the script do you explicitly specify whether the
+      data is stored in a flat file, a local relational database or a
+      database on the internet.
+    </para>
+    <para>
+      To use the registry from a Perl script, use the following idiom:
+      <programlisting>
     1 use Bio::DB::Registry;
     2 $registry = Bio::DB::Registry->new;
     3 $db = $registry->get_database('embl');
     4 $seq = $db->get_Seq_by_acc("J02231");
     5 print $seq->seq,"\n";
-  </programlisting>
-  </para>
-  <para>
-In lines 1 and 2, we bring in the Bio::DB::Registry module and create
-a new Bio::DB::Registry object.  We then ask the registry to return a
-database accessor for the symbolic data source "embl", which must be
-defined in an [embl] section in the seqdatabase.ini registry file.
-  </para>
-  <para>
-The returned accessor is a Bio::DB::RandomAccessI object (see the
-appropriate manual page), which has just three methods:
-  <programlisting>
+      </programlisting>
+    </para>
+    <para>
+      In lines 1 and 2, we bring in the Bio::DB::Registry module and create
+      a new Bio::DB::Registry object.  We then ask the registry to return a
+      database accessor for the symbolic data source "embl", which must be
+      defined in an [embl] section in the seqdatabase.ini registry file.
+    </para>
+    <para>
+      The returned accessor is a Bio::DB::RandomAccessI object (see the
+      appropriate manual page), which has just three methods:
+      <programlisting>
    $db->get_Seq_by_id($id);
    $db->get_Seq_by_acc($acc);
    $db->get_Seq_by_version($versioned_acc);
-  </programlisting>
-  </para>
-  <para>
-These methods return Bio::Seq objects by searching for their primary
-IDs, accession numbers and accession.version numbers respectively.
-The returned objects have all the methods defined by Bio::Seq (see
-the appropriate manual page, online at <ulink
+      </programlisting>
+    </para>
+    <para>
+      These methods return Bio::Seq objects by searching for their primary
+      IDs, accession numbers and accession.version numbers respectively.
+      The returned objects have all the methods defined by Bio::Seq (see
+      the appropriate manual page, online at <ulink
 	url="http://doc.bioperl.org">doc.bioperl.org</ulink>).  In 
-	  line 5, we call the sequence
-object's seq() method to fetch and print out the DNA or protein
-string.
-  </para>
+      line 5, we call the sequence
+      object's seq() method to fetch and print out the DNA or protein
+      string.
+    </para>
   </section>
+
   <section id="biogetseq">
-  <title>Using biogetseq to Access Registry Databases</title>
-  <para>
+    <title>Using biogetseq to Access Registry Databases</title>
+    <para>
       As a convenience, the Bioperl distribution includes the script
       biogetseq.PLS that enables one to have OBDA access to sequence
       data from the command line.
-  </para>
-  <para>
-The script 'biogetseq' is located in the scripts/DB directory of the
-bioperl distribution. Move or add it into your path to run it. You can get to help by running it with no arguments:
-  <programlisting>
+    </para>
+    <para>
+      The script 'biogetseq' is located in the scripts/DB directory of the
+      bioperl distribution (it may also have been installed on your system
+      if you asked for a script installation after 'make
+      install'). Move or add it into your path to run it. You can get 
+      this help text by running it with no arguments:
+      <programlisting>
 Usage: biogetseq --dbname embl --format embl --namespace acc id [ id ... ]*
        dbname defaults to embl
        format defaults to embl
        namespace defaults to 'acc' ['id', 'acc', 'version']
        rest of the arguments is a list of ids in the given namespace
-  </programlisting>
-  </para>
-  <para>
-If you have a set of ids you want to fetch from EMBL database, you
-just give them as space separated parameters:
-  <programlisting>
+      </programlisting>
+    </para>
+    <para>
+      If you have a set of ids you want to fetch from the EMBL database, you
+      just give them as space-separated parameters:
+      <programlisting>
   % biogetseq J02231 A21530 A10516
-  </programlisting>
-  </para>
-  <para>
-The output is directed to STDOUT, so it can be redirected to a
-file.  The options can be given in long "double hyphen" format or
-abbreviated to one letter format:
-  <programlisting>
-  % biogetseq -f fasta -n acc J02231 A21530 A10516 > filed.seq
-  </programlisting>
-  </para>
+      </programlisting>
+    </para>
+    <para>
+      The output is directed to STDOUT, so it can be redirected to a
+      file.  The options can be given in the long "double hyphen" format or
+      abbreviated to one-letter format ("--format" or "-f"):
+      <programlisting>
+	% biogetseq -f fasta -n acc J02231 A21530 A10516 > filed.seq
+      </programlisting>
+    </para>
   </section>
 </article>
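
For illustration, two statements in the revised HOWTO can be combined into
one registry file: the new flat-protocol example, and the existing rule that
repeated database-name sections are tried from top to bottom. The fragment
below is a sketch and is not part of OBDA_Access.sgml. It assumes that the
/home/sally/bioinf/ppp index from the new text holds EMBL entries, and it
reuses the biofetch location and the "embl" dbname listed in the HOWTO's
protocol table. The first [embl] section is the local index and the second
is the remote service. Whether a given Bioperl release actually falls back
to the second section should be checked against that release.

```ini
VERSION=1.00

[embl]
protocol=flat
location=/home/sally/bioinf
dbname=ppp

[embl]
protocol=biofetch
location=http://www.ebi.ac.uk/cgi-bin/dbfetch
dbname=embl
```

With a file like this, the call $registry->get_database('embl') from the
HOWTO's Perl idiom is meant to stay the same regardless of which section
ends up serving the request.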


