[Bioperl-l] How to assess sequencing quality with Bioperl?
MEC at Stowers-Institute.org
Fri Sep 2 10:24:08 EDT 2005
Following is a manpage for a program I wrote using Bioperl (and phred).
If you'd like it, let me know, I'll slap our institute approved software
license on it an put a `make install`able tarball on a webpage
Malcolm Cook - mec at stowers-institute.org - 816-926-4449
Database Applications Manager - Bioinformatics
Stowers Institute for Medical Research - Kansas City, MO USA
SEQTRACESTATS(1) User Contributed Perl Documentation
seqtracestats --bins=max1,max2,max3 --informat trace_format
convention attributelist=expression --mtime daycount --abi2phd <
files > quality score summary
Given a set of DNA sequence traces (or possibly a directory to
them in based on their age), produce summary descriptive
the read quality (after optionally (re)base calling using phred).
Their case is ignored, and they may be abbreviated to uniqueness
--v instead of --verbose).
Options may be specified on the command line, and may optionally
be read from files by providing on the command line the path to
file preceded by a '@'. These option files provide simple access
typical calling scenarios (such as an analysis that is repeatedly
invoked from the command line with the same parameters).
if the current directory contains a file named
will be automatically used as an option file.
--informat = PHD|SCF|ABI
Format of input files. Defaults to PHD. Can be any
format that contains quality values, such as SCF.
TODO: currently only PHD works. Get the staden integration
and SCF files working with BioPerl.
--bins = n,m,o,p
A comma separated list of numbers which should be in
order. They define the ranges into which the sequence
score for each base-pair read is binned; each number sets the
mum allowable value in the bin, with the minimum value being
mined by the previous bin, if any.
Note that any quality scores encountered which are greater
largest bin will not be included in any bin. Otherwise, the
the values in each of the bins will equal the entire sequence
I have not encountered any documentation specifying a maximum
score assigned in practice by any base caller. However, I
never encountered a score, or a report of a score, greater
So, you can pretty safely use 60 as your highest bin, which
sponds to an error probability of one in a million.
Defaults to --bins=20,30,40,50,60.
Allows specification of time attributes to appear in report.
A method of decoding attributes such as are typically encoded
the name of files holding seqreads.
Implemented as an association between a comma delimited list
attribute names and a perl expression which, when evaluated
corresponding values for each attribute. The expression is
ated in the context of $_ being the filename holding a
Any attributes containing the word 'ignore' as a substring
supressed from the summary.
when reading the file: VPL3-15-I-BSB460_A01.seq
will assign to the current read attributes as follows:
gene => VPL3
library => 15
plate => I
primer => BSB460
platewellcoord => A01
Named aliases exist as shorthand methods for decoding
naming conventions. The following are predefined:
TODO: IMPLEMENT THESE - THEY ARE NOT.
Interpret the input files as directorys to search for ab1
created (modified) mtime days ago. If none supplied,
current working directory.
This value is used as --mtime argument to the unix find
(along with -daystart), and as such may be prefixed with '-'
'+', to mean 'less than' or 'greater than' mtime days ago.
May be specified multiple times to produce data ranges. I.e.
--mtime -31 --mtime +0 will select files between 1 and 30
Input files are ab1 chromatogram files and they should be
with corresponding quality file, generated using phred as
If there is an existing corresponding .phd (or .phd.1) file,
will be used.
Otherwise, the ab1 files will be (re) base-called by phred,
ating sequence quality files in the same directory with
extension of ".phd".
Note: if the input files already have corresponding '.phd' or
'.phd.1' file, it will NOT be re-basecalled and the existing
scores will be used.
Display command line usage with options.
Display complete manual page and exit.
Provides a trace of processing on STDERR.
Display the scripts version number and exit.
Output is in csv format (comma separated values) and may be
TODO: Elaborate this description?
seqtracestats --mtime 1 > seqtracestats.txt
Summarize the result of run phred on all ab1 files created
day which are subordinate to current directory, putting the
in file seqtracestats.txt
./seqtracestats --mtime -10 '/n/facility/Genomics/Sequenc-
ing/2003/Molecular\ Biology/' --readname 'platerow,plate-
Summarize reads less than 10 days old, parsing out platerow
plate col from the read filename, and interpreting selected
tory components as the submitter of the sequencing request,
lab of which s/he is a member.
seqtracestats --mtime +1,-30 ./2003 > ./2003/febqual.txt
Summarize trace files that are less than 30 but more than 1
old which are within the directory ./2003 which is in turn
the current directory. Put the results in
Note: if this example is preceded by
then additional parameters will be picked up from the file
tracestats.config in that directory. This file implements
standard read naming conventions.
seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
Result of summarize the PHD formatted files into three bins,
ing the selected attributes from the read name:
seqtracestats ./t/data/*.phd.1 --informat=PHD --bins=20,30,60
Result of running on the same data but parsing the readname
ently, including pulling out read direction and plate row and
ngth,gc_content,mean,variance,num bp q<=20,num bp q<=30,num bp q<=60
seqtracestats ./t/data/*.phd.1 @./eg.config
Process those same files, taking command line options from
fig in the current directory.
Process those same file, taking command line options from
tracestats.config in the current working directory.
Malcolm Cook (mec at stowers-institute.org)
perl, BioPerl, bioperl-ext, Time::Seconds,
Email the author for sources.
$Revision: 1.4 $
include support for ABI files
support other output formats & destinations? tab v. csv format
write to (mysql?) database. Use DBD::AnyData for implementation?
create textual report similar to ABI's software
allow for reporting on trimmed sequences
contibute patch to Bio::Tools::SeqStats to BioPerl
see the unix command trev, part of staden package, for viewing
files under unix
regarding the conversion of ABI files using SEQIO
perl v5.8.6 2005-09-02
More information about the Bioperl-l