[Bioperl-l] Output a subset of FASTA data from a single large file

Sendu Bala sb at mrc-dunn.cam.ac.uk
Mon Jun 12 04:49:49 EDT 2006

Michael Oldham wrote:
> Thanks to everyone for their helpful advice.  I think I am getting closer,
> but no cigar quite yet.  The script below runs quickly with no errors--but
> the output file is empty.  It seems that the problem must lie somewhere in
> the 'while' loop, and I'm sure it's quite obvious to a more experienced
> eye--but not to mine!  Any suggestions?  Thanks again for your help.
> --Mike O.
> #!/usr/bin/perl -w
> use strict;
> my $IDs = 'ID.dat.txt';
> unless (open(IDFILE, $IDs)) {
> 	print "Could not open file $IDs!\n";
> 	}
> my $probes = 'HG_U95Av2_probe_fasta.txt';
> unless (open(PROBES, $probes)) {
> 	print "Could not open file $probes!\n";
> 	}
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
> my @ID = <IDFILE>;
> chomp @ID;
> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with keys=PSIDs and
> all values=1.
> 	while (<PROBES>) {
> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
> 		if ($idmatch){
> 			print OUT;
> 		}
> 	}
> exit;

Not sure why it would print nothing (are the ids in IDFILE the same case 
as the ids in the fasta file, do they only contain word characters?), 
but even if it did you would only be printing out the fasta headers and 
not the sequences. Doing it the bioperl way gives you more flexibility 
in the future; you may want to do something with the sequences after 
printing them out, in which case do it in bioperl using Seq objects and 
skip the intermediate step of printing them.

More information about the Bioperl-l mailing list