Bioperl: Any non-redundant database tools out there ???

Ian Korf
Thu, 27 Aug 1998 11:54:42 -0500 (CDT)

Warren Gish has a non-redundant database tool at :

The README associated with the software is below.


Last update 1998/05/25

The file nrdb2.tar.Z is a compressed UNIX tar archive of UNIX-compatible C
source code for version 2 of a program called "nrdb" that can be used to
generate quasi-nonredundant protein and nucleotide sequence databases.  The
program merges 100% identical sequence entries into a single entry in the
output, with the associated descriptions concatenated into a single
description.  The program will read one or more input databases that are in
FASTA/Pearson format to produce a single, compacted output file that is also in
FASTA format.
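
The merging step described above can be sketched in a few lines of Python.
This is only an illustration of the idea, not the nrdb source (which is C):
the function names are made up for this example, sequences are compared as
exact strings, and the '\x01' separator is the one the README describes
further down.

```python
from collections import OrderedDict


def read_fasta(lines):
    """Yield (description, sequence) pairs from an iterable of FASTA lines."""
    desc, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if desc is not None:
                yield desc, "".join(chunks)
            desc, chunks = line[1:], []
        elif desc is not None:
            chunks.append(line.strip())
    if desc is not None:
        yield desc, "".join(chunks)


def merge_identical(records, sep="\x01"):
    """Collapse records with identical sequences, joining their descriptions.

    Assumption: "identical" means exact string equality; the real program
    may normalize case or letter codes differently.
    """
    merged = OrderedDict()  # sequence -> list of descriptions, in input order
    for desc, seq in records:
        merged.setdefault(seq, []).append(desc)
    return [(sep.join(descs), seq) for seq, descs in merged.items()]
```

Records from several input files can simply be chained together before the
call to merge_identical, which is how duplicates between files would be caught
in this sketch.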

Data sources for producing a comprehensive protein sequence database include
the SWISS-PROT, PIR, PDB, and "GenPept" databases, and the cumulative daily
GenPept updates.  A quasi-nonredundant nucleotide sequence database can be
built from the GenBank major release and cumulative GenBank daily updates.
All of the aforementioned input databases are presently available in their
native formats via anonymous FTP.  GenBank/GenPept, PIR, SWISS-PROT and PDB
are on; and the EMBL Data Library and its updates are
available on  There exist a variety of parsers from these
databases' native formats into FASTA format (see below).  The FASTA-format
output from nrdb can then be processed with "setdb" and "pressdb" to produce
blastable databases.

The size of a comprehensive, "nonredundant" protein sequence database is
roughly half of the total size of the input databases.  A nonredundant
database is consequently easier and faster to search, yet searching it is no
less informative than searching the input databases individually.  The
statistical significance ascribed to BLAST alignment scores also improves by
roughly a factor of two, because the size of the search space is cut roughly
in half.
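
The factor of two follows if the expected number of chance hits scales
linearly with the search space, as in the Karlin-Altschul form
E = K * m * n * exp(-lambda * S).  The sketch below only demonstrates that
scaling; the K and lambda defaults are placeholder values chosen for
illustration, not parameters of any real scoring matrix, and the function name
is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, K=0.13, lam=0.32):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are illustrative placeholders, not real BLAST parameters.
    """
    return K * query_len * db_len * math.exp(-lam * score)


full = expected_hits(60, 300, 2e8)   # searching the redundant inputs
half = expected_hits(60, 300, 1e8)   # same score, half the database size
ratio = full / half                  # linear in db_len, so about 2
```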

When definition lines for identical sequences are concatenated, the component
definitions are separated from one another in the output by a '\001'
character (ASCII Control-A or SOH [start of header]).  The BLAST application
programs use the Control-A character(s) to know where to break the sequence
descriptions for output in BLAST ASN.1 format.
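
A script that reads nrdb output can recover the component definitions the same
way, by splitting on Control-A.  A minimal sketch (the function name and the
example descriptions are invented for illustration):

```python
def split_defline(defline):
    """Split a merged definition line into its component definitions."""
    return defline.split("\x01")


parts = split_defline("seqA first description\x01seqB second description")
```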

A single file is acceptable input to the nrdb program -- duplicate entries are
found by the program both within files and between files.  A comprehensive
nonredundant protein sequence database can be generated in about 10 minutes by
a 200 MHz PentiumPro-based system with 128 MB RAM and fast disk drives.
This may be only a little faster than a Perl script written to accomplish
the same thing.  Memory use for the nrdb program will be less, however;
this is particularly true when operating on nucleotide sequences, where the
nrdb program rapidly compresses all-ACGT sequences 4:1 and all-IUPAC nucleotide
sequences (including ambiguity codes) 2:1 to conserve memory.
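
The 4:1 figure is what one gets by storing each of A, C, G and T in two bits,
so that four bases fit in one byte.  The README does not say how nrdb lays out
its bits, so the encoding below is an assumption made for illustration, and
the function names are hypothetical.  The 2:1 case for IUPAC codes would
work the same way with four bits per letter.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_acgt(seq):
    """Pack an all-ACGT string into bytes, four bases per byte.

    Raises KeyError on any other letter; such sequences would need the
    wider (4-bit) encoding instead.
    """
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack_acgt(packed, length):
    """Recover the original string; `length` is needed because the last
    byte may be only partly filled."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)
    )
```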

Additional source code is required to compile and link the nrdb program:  the
ncbi.tar.Z and gish.tar.Z archives posted beneath the /blast-14 directory.
This additional code may already be available on your system if you have
already built the BLAST database search software there (see /blast-14).  The
BLAST software distributed here also includes parsers to convert GenBank,
PIR, and SWISS-PROT flat files into FASTA format; the SWISS-PROT parser can
be used to parse EMBL flat files, too.  The NCBI software Toolbox includes a
demonstration program called "asn2fast" that converts NCBI ASN.1 sequence
data into FASTA format.

See Makefile for customizations that may be necessary to build the nrdb program
on your system.

CAUTION:  the nrdb program performs no validation of the letter codes it reads.
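
Because of this, it may be worth checking the letters yourself before running
nrdb.  A minimal sketch for nucleotide input, assuming the standard IUPAC
nucleotide codes (plus U) are the only letters you want to accept; the
function name is made up for this example:

```python
IUPAC_NUCLEOTIDE = set("ACGTURYSWKMBDHVN")


def invalid_letters(seq, alphabet=IUPAC_NUCLEOTIDE):
    """Return the sorted list of characters in seq outside the alphabet."""
    return sorted(set(seq.upper()) - alphabet)
```

An empty list means the sequence contains only accepted codes.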

Warren Gish
May 9, 1992
Washington University, St. Louis