(Modest) software for DNA data

Will Fischer wfischer@sunflower.bio.indiana.edu
Tue, 18 Feb 1997 10:25:02 -0500

Thought this might be of interest to the members of this list.
There is no doubt much duplication of effort here, but the
functions are among those that should be included in the
bioperl modules.

If I only had the time to do it right, I wouldn't have to do it over.

-- WF

Release Announcement:  (Modest) Perl programs for molecular sequence data

I often extract data from GenBank, to make nucleotide and
protein alignments (for primer design and for phylogenetic analysis).
Some perl programs I wrote to ease this task are now available
for public use.

Where they are:

What they do:
1. parse features or whole entries from files of GenBank entries;
2. translate DNA sequences into amino-acid sequences;
3. make DNA alignments based on amino-acid alignments.

Complete descriptions are available at the above URL.


Step One:  Parsing features from GenBank files

	extracts either whole genbank entries matching a pattern, 
	or new entries consisting of only those subsequences 
	specified by a FEATURE; I use it mostly to extract coding
	sequences from files of entries retrieved from NCBI, but 
	the code is general.  It will concatenate exons and complement
	sequence as specified in the FEATURE table.
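
For illustration only (this is not the announced Perl code), here is a minimal Python sketch of how exons can be concatenated and complemented from a feature-table location string. The function names are hypothetical, and it assumes only simple locations built from "a..b", single positions, "join(...)" and "complement(...)"; partial locations such as "<1..3" and remote references are not handled.

```python
import re

# Complement table for plain bases plus N (an assumption; extend for other IUPAC codes).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_top_level(text):
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def extract_feature(seq, location):
    """Return the subsequence named by a simple location string (1-based, inclusive)."""
    location = location.replace(" ", "")
    if location.startswith("complement(") and location.endswith(")"):
        inner = location[len("complement("):-1]
        return reverse_complement(extract_feature(seq, inner))
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
        return "".join(extract_feature(seq, part) for part in split_top_level(inner))
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        return seq[start - 1:end]
    if re.fullmatch(r"\d+", location):
        return seq[int(location) - 1]
    raise ValueError("unsupported location: " + location)
```

With the sequence "ATGCCCGGGTTTAAA", the location "join(1..3,7..9)" yields "ATGGGG", and "complement(join(1..3,7..9))" yields "CCCCAT".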

Step Two:  Generating amino-acid translations from DNA sequences

	"nt2aa" (NucleoTide to AminoAcid)
	reads genbank, fasta, or GCG format files (or raw sequence data)
	and produces an amino acid sequence in any or all frames,  
	using any of the genetic codes defined by NCBI.
	It will translate degenerate codons as far as possible (which
	GCG's "translate" will not), and present a list of possible
	amino acids if desired.  Output is either raw or fasta format.
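
To make "as far as possible" concrete, here is a small Python sketch (not the nt2aa source). It expands each degenerate codon into every codon it could stand for and emits an amino acid only when all of them agree. It assumes the NCBI standard code (table 1) only; the "X" placeholder and the bracketed list of alternatives are illustrative choices, not nt2aa's actual output format.

```python
from itertools import product

# NCBI standard genetic code (table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def possible_amino_acids(codon):
    """Return the set of amino acids a (possibly degenerate) codon can encode."""
    codon = codon.upper().replace("U", "T")
    # Unknown symbols are treated like N (an assumption of this sketch).
    choices = [IUPAC.get(base, "ACGT") for base in codon]
    return {CODON_TABLE["".join(c)] for c in product(*choices)}


def translate(dna, frame=0, show_alternatives=False):
    """Translate one forward frame (0, 1 or 2); a trailing partial codon is dropped."""
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        options = possible_amino_acids(dna[i:i + 3])
        if len(options) == 1:
            protein.append(next(iter(options)))
        elif show_alternatives:
            protein.append("[" + "".join(sorted(options)) + "]")
        else:
            protein.append("X")
    return "".join(protein)
```

For example, translate("ATGGCN") gives "MA", because all four GCN codons encode alanine, while translate("AAN") gives "X" (or "[KN]" with show_alternatives=True), because AAN may encode either lysine or asparagine.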

Step Three:  Generating an amino-acid alignment

	You're on your own here:  use GCG's pileup, or clustalw, or
	mase, or macaw, or seaview, or whatever you like.  
	Save the output as fasta, or genbank, or, if Don Gilbert's
	excellent "readseq" program (q.v.) is installed, any format
	that it can handle.

Step Four:  Aligning sequence data to the amino-acid alignment

	reads two input files containing DNA and amino-acid sequences (and an
	optional names file if the names are different), and inserts
	gaps in the DNA sequences (via reverse-translation) to match
	the alignment of the corresponding amino-acid sequences.  Takes
	input in fasta-format, or genbank, or anything "readseq"
	can handle (see above); writes fasta-format output.
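
The gap-insertion idea can be sketched in a few lines of Python (an illustration, not the announced program). The function name is hypothetical. It assumes the DNA begins in frame with the first residue of the aligned protein, that gaps are written as "-" or ".", and that any DNA left over after the last residue (for example a stop codon) may be dropped.

```python
def thread_dna(aligned_protein, dna, gap_chars="-."):
    """Insert codon-sized gaps into dna so that it matches aligned_protein."""
    codons = []
    pos = 0
    for residue in aligned_protein:
        if residue in gap_chars:
            # One gap column in the protein becomes three gap columns in the DNA.
            codons.append("---")
        else:
            codon = dna[pos:pos + 3]
            if len(codon) < 3:
                raise ValueError("DNA sequence is too short for the protein alignment")
            codons.append(codon)
            pos += 3
    # Leftover DNA past the last residue is ignored (an assumption of this sketch).
    return "".join(codons)
```

For example, threading "ATGGCTAAA" onto the aligned protein "M-AK" gives "ATG---GCTAAA".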


These perl programs were written to serve my own needs;
they are not sterling examples of good coding practice 
(indeed, they may shock or amuse those who write elegant code).

I am making these programs available in the hope that they may be
useful; I cannot guarantee that they will not corrupt your data (so
keep backup copies).  Nonetheless, if you find a problem, I will
apologize to you and (my time permitting, and given enough detail)
fix the problem that caused you trouble.  No other warranties,
expressed or implied, apply.

The code is not in the public domain.  It may not be sold, or used in a
product which is sold, without the express consent of the author.

Will Fischer			

Biology Department             		wfischer@indiana.edu
Jordan Hall                   		http://www.bio.indiana.edu/~wfischer
Indiana University            		Lab:    812-855-2549
Bloomington, Indiana 47405 USA		FAX:    812-855-6705