[Bioperl-l] Bio::SeqIO::scf traces scrambled?

Cook, Malcolm MEC at stowers.org
Thu Jun 18 11:42:48 EDT 2009


Another possible stopgap that might work for you, if you're working with AB1 chromatograms and have ABIs kb-basecaller turned on, is to use Bio::Trace::ABIF


It works great and includes implementation of ABIs algorithm allowing to (re)compute trace clear ranges using kc-basecallers quality scores and any windowing/quality parameters.

Its not in the bioperl project but it is an easy install from CPAN.

I am familiar with staden::read installation woes.  

Below is a quick script I wrote that employs it... it could be better parameterized, but it might be useful to you "out of the box"....

Malcolm Cook
Stowers Institute for Medical Research - Kansas City, Missouri

#!/usr/bin/env perl

# PURPOSE: extract from AB1 files into fasta format the sequence in
# the 'clear range' defined by 3 parameters.  If there is no clear
# range, emit warning and skip the sequence.  The fasta 'defline'
# identifier is taken as the sample name.  Other useful attributes are
# also embedded into the defline using attribute=value syntax.

# USAGE: ABIFqtrim $window_width $bad_bases_threshold $quality_threshold f1.ab1 ... fn.ab1

# NOTE: 20 4 20 is ABI default settings

# ABIFqtrim 20 4 20 /n/facility/Bioinformatics/Software/ABI/ab1_test_files/*.ab1 > ab1_test_files_trimmed.fasta

# AUTHOR: malcolm_cook at stowers-institute.org

use strict;
use warnings;
use Bio::Trace::ABIF;
use Text::Wrap qw(wrap);
$Text::Wrap::columns = 72;	# wrap the sequence

use File::Basename;
my ($window_width,
    @ARGV) = @ARGV;

my $abif = Bio::Trace::ABIF->new();

sub main {} {
  foreach (@ARGV) {
    $abif->open_abif($_) or die "error opening $_ as ABIF";
    my ($clear_range_start,$clear_range_stop) = $abif->clear_range($window_width,
    my $sample_score = $abif->sample_score(
    #    my $contiguous_read_length = $abif->contiguous_read_length($window_width,
    #							       $quality_threshold,
    #							       0, # ==> trim_ends
    #							      );
    #    my $length_of_read = $abif->length_of_read(
    #				    $window_width,
    #				    $quality_threshold,
    #				    # $method
    #				   );
    my $defline = 
      join "\t", 
	  #basename($_,qw(.ab1 .abi)), # use this to use the filename's basename in the defline
	  #$abif->container_identifier . ':' . $abif->well_id,  # or this, for container:well_id formatted defline identifiers
	  (map {my $method = $_;
		"$method=". ($abif->$method() || '')}
	   qw(sample_name comment run_name well_id container_identifier sequence_length )), #comment
	     # sample_tracking_id - don't use this - it is internal to ABI software
    if ($clear_range_start == -1) {
      warn "NO CLEAR RANGE! SKIPPING $_:\n\t$defline";
    my $seq = wrap('','',substr($abif->sequence, $clear_range_start + 1, ($clear_range_stop + 1) - ($clear_range_start + 1) + 1));
    print ">$defline\n$seq\n";


main ();

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org 
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
> Charles Tilford
> Sent: Thursday, June 18, 2009 8:39 AM
> To: BioPerl List
> Subject: [Bioperl-l] Bio::SeqIO::scf traces scrambled?
> Nutshell: Bio::SeqIO::scf seems to be mixing up A/C/G/T trace 
> channels. 
> Can anyone confirm?
> Hi all,
> I'm using the SCF Bio::SeqIO module to parse trace data out 
> of chromatograms. The SCF files are being produced by phred 
> using the "-cd" 
> parameter. The traces come out great, and the corresponding 
> base calls from the .phd files align with the peaks 
> wonderfully when I visualize them on a rendered trace. 
> However, only the A bases align to the appropriate trace 
> channel, the rest are mixed up. I find that if I do the 
> following re-mapping, the phred base calls match the
> SeqIO : Remapped
> A : A
> C : G
> G : T
> T : C
> The relevant part of Bio::SeqIO::scf is here:
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
> io/SeqIO/scf.html#CODE9
> ... which indicates that it expects the pack()ed trace data 
> to be in order ATGC. The base call parsing code is here:
> http://doc.bioperl.org/releases/bioperl-current/bioperl-live/B
> io/SeqIO/scf.html#CODE8
> ... which is unpacking in order ACGT. As far as I can tell, 
> the relevant official SCF documentation is here:
> http://staden.sourceforge.net/manual/formats_unix_4.html
> ... which indicates that both trace and base order should be 
> ACGT (matching the SeqIO unpack() for bases, but not traces). 
> My empirical channel unscrambling mapping implies order ACTG, 
> which is different from either of the two orders above. The 
> sequence from the SCF file (should be that from original AB1 
> file, I think) is not perfectly identical to that called by 
> phred, but is very similar (to be expected); that is, I don't 
> need to remap C, G and T to get it to align with the phred data.
> So it looks like the SeqIO module is not mapping the sections 
> of the packed trace data to the appropriate bases. The unpack 
> order is different than the staden documentation ... but so 
> is the order I impose to correct the problem. I am still 
> unclear as to the differences between
> V2 and V3 of the format. The major difference appears to be 
> coding the trace absolutely (V2) or relatively to prior 
> values (V3); I'd expect if I was using one format and SeqIO 
> was trying to parse the other that I would get garbage out. 
> Running in verbose reports "scf.pm is working with a version 2 scf."
> Thoughts on this would be appreciated - can anyone confirm a 
> problem with trace extraction from SCF?
> I'm hoping that once I convince our admin to (properly) 
> install staden::read that I can work directly with the ab1 
> files, but I need to stopgap on SCF for the time being....
> -CAT
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

More information about the Bioperl-l mailing list