Bioperl: RE:manipulating long strings (genomes) in PERL

Ian Korf
Tue, 30 Mar 1999 10:53:33 -0600 (CST)

X-Sun-Data-Type: text
X-Sun-Data-Description: text
X-Sun-Data-Name: text
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 17

I've attached a script called chromoShatter that probably does what you're
looking for. It breaks up chromosome fasta files into smaller chunks with
overlapping edges. The default is 10 Kb chunks with 100 bp overlap, but it's
a command line option. It creates two files:
	1) a fasta database named something.genome
	2) a lookup file named    something.lookup

The fasta database is pretty self explanatory I think. The lookup file
contains the name of each chunk and its offset in the something.genome file.
The sequence is all on one line, so a single read gets it. The lookup
file is sorted so you can search it with a binary search such as the UNIX
look command. Alternatively, you can just put the names and offsets in a
tied HASH (dbmopen) and access that way. A bit more overhead, but faster.
If you need help with the dbmopen, or possibely some other related stuff,
this link may help you:

X-Sun-Data-Type: default-app
X-Sun-Data-Description: default
X-Sun-Data-Name: chromoShatter
X-Sun-Charset: us-ascii
X-Sun-Content-Lines: 87

#!/usr/local/bin/perl -w
use strict;
use Getopt::Std;
use vars qw($opt_o $opt_s);
my $usage = '
CHROMOSHATTER - breaks a list of chromosome-sized fasta files into smaller
chunks and concatenates these into a fasta database. Progress is sent to
STDOUT, which should be sent to a .log file. Two files are created:

  file.genome    the concatenated fasta files (each chunk on one line)
  file.lookup    a list of each chunk and its offset in the .genome file

You can specify the size of each chunk and overlaps between chunks. Each entry
in the database specifies which chromosome and chunk number (and also the chunk
size and overlap). For example, chunk 57 on LG1 (with default chunksize and
overlap) would look like this:

  >LG1#57 (10000:100)

usage: chromoShatter [options] <database name> <chrom1> <chrom2> <...>

options:                                 default
  -s      size of each chunk              10000
  -o      overlap between each chunk        100

die $usage unless @ARGV > 1;

# Setup
my $size    = $opt_s ? $opt_s : 10000;
my $overlap = $opt_o ? $opt_o : 100;
my ($DBNAME, @chrom) = @ARGV;
my @lookup; # store lookups for sorting later
foreach my $chrom (@chrom) {die "$chrom no found\n" unless -e $chrom}
open(GENOME, ">$DBNAME.genome") or die "cannot create $DBNAME.genome\n";

# Main loop
my $offset = 0;
foreach my $chrom (@chrom) {
	open(CHROM, $chrom) or die "cannot open $chrom\n";
	my $chunk = 0;
	# let perl know how big I want this thing upfront to manage memory
	my $seq = " " x ($size + $overlap + 1000);
	$seq = "";
	my $def = <CHROM>;
	my $end_of_file = 0;
		my $line = <CHROM>;
		unless (defined $line) {
			$end_of_file = 1;
			last SEGMENT;
		$seq .= $line;
		last SEGMENT if (length($seq) > $size + $overlap);
		redo SEGMENT;

	# trim seq to proper size and save leftover
	my $leftover = substr($seq, $size) unless $end_of_file;
	$seq = substr($seq, 0, $size + $overlap);
	# write def and seq to genome database
	$def = ">$chrom#$chunk ($size:$overlap)\n";
	print GENOME $def, $seq, "\n";
	# write chrom + chunk to lookup
	push @lookup, "$chrom#$chunk $offset\n";
	# reset the offset and seq, increment chunk #
	$offset += length($def) + length($seq) + 1;
	$seq = $leftover;
	# get another segment or another chromosome
	goto SEGMENT unless $end_of_file;
	close CHROM;

close GENOME;

# sort lookup alphabetically
@lookup = sort {$a cmp $b} @lookup;
open(LOOKUP, ">$DBNAME.lookup") or die "cannot create $DBNAME.lookup\n";
print LOOKUP @lookup;
close LOOKUP;
=========== Bioperl Project Mailing List Message Footer =======
Project URL:
For info about how to (un)subscribe, where messages are archived, etc: