[Bioperl-l] FW: ORF finder

Gatherer, D. (Derek) D.Gatherer@organon.nhe.akzonobel.nl
Tue, 5 Sep 2000 10:45:30 +0200

-----Original Message-----
From: hilmar.lapp@pharma.Novartis.com
Sent: 04 September 2000 18:28
To: Gatherer, D. (Derek)
Cc: Ewan Birney; bioperl-l@bioperl.org
Subject: RE: ORF finder

>>what sort of ORF finder would you be interested in? One that takes the
>>sequence as it is and outputs the most likely ORF (based on length and
>>coding potential)?

A _very_ simple ORF finder that could scan a database of cDNAs, and print
out all the potential ORFs and their putative translated products.
>>A couple of weeks ago I wrote to the BioPerl list about ESTScan.....
educated suggestion of how to model the insertions and deletions ESTScan

Nothing so sophisticated, as yet!!

Here's the code I have so far:

# PACKAGE    : FindORF.pm
# PURPOSE    : To find open reading frames in a cDNA sequence
# AUTHOR     : Derek Gatherer (D.Gatherer@organon.nhe.akzonobel.nl)
# SOURCE     : 
# CREATED    : 5th September 2000
# DISCLAIMER : I am employed in the pharmaceutical industry but my 
#	       : employers do not endorse or sponsor this module
#	       : in any way whatsoever.  The above email address is
#	       : given purely for the purpose of easy communication
#            : with the author, and does not imply any connection
#	       : between my employers and anything written below.
# LICENCE    : You may distribute this module under the same terms 
#	       : as the rest of BioPerl.

=head1 NAME

Bio::Tools::FindORF - Object holding ORF statistics for one cDNA sequence


Take a sequence object from, say, an inputstream, and creates an object 
for the purposes of holding open reading frame (ORF) statistics about 
that sequence.  The minimum size of the ORF is specified.  The ORFs 
in only the forward orientation are searched.  For reverse orientation, 
it is recommended that the reverse complement be generated, by eg. 
$seqobj->revcom(), and the FindORF call repeated.

Creating the FindORF object, eg:

	my $inputstream = Bio::SeqIO->new( -file => "seqfile", -format =>
	my $seqobj = $inputstream->next_seq();
	my $ORF_obj = Bio::Tools::FindORF->new($seqobj);

	my $seqobj = Bio::PrimarySeq->new(-seq=>'[cut and paste a sequence
here]', -moltype = 'dna', -id = 'test');
	my $ORF_obj  =  Bio::Tools::FindORF->new($seqobj);

obtain an array of arrays, with each of the constituent arrays holding 
the start point, the stop point, and the DNA sequence of the ORF.
The start point is taken to be the position of the A of the starting ATG,
and the stop point is taken to be the position immediately prior to the 
first T of the stop codon.

So a sequence ATGAAAAAATGA would have start at 1 and stop at 9. 

	my $array_ref = $ORF_obj->get_ORFs(50);

will get all such ORFs of length more than 50 bases.

display array of arrays, eg:

	my @array_of_arrays = @$array_ref;
		print "\n@$_";


Bio::Tools::FindORF is a featherweight object for finding ORFs in a single 
cDNA sequence.  RNA sequences need to be converted to DNA before ORF
Protein sequences will throw an exception.  All ORFs are assumed to begin
methionine ATG, so this module applies to cDNA only, as yet.

See Synopsis above for object creation code.



=head2 Mailing Lists

User feedback is an integral part of the evolution of this
and other Bioperl modules. Send your comments and suggestions preferably
to one of the Bioperl mailing lists.
Your participation is much appreciated.

  mailto:vsns-bcd-perl@lists.uni-bielefeld.de         - General discussion
  mailto:vsns-bcd-perl-guts@lists.uni-bielefeld.de    - Technically-oriented
  http://bio.perl.org/MailList.html            	      - About the mailing

=head2 Reporting Bugs

Report bugs to the Bioperl bug tracking system to help us keep track
the bugs and their resolution.
Bug reports can be submitted via email or the web:


=head1 AUTHOR

Derek Gatherer


The rest of the documentation details each of the object methods. 
Internal methods are usually preceded with a _


package Bio::Tools::FindORF;
use vars qw(@ISA);
use strict;
use Bio::PrimarySeq;		# for translating the ORF
use Bio::PrimarySeqI;

# Object preamble - inherits from Bio::Root::Object

@ISA = qw(Bio::Root::Object);

# new() is inherited from Bio::Root::Object

# _initialize is is called from within new()

sub _initialize 
    	my($self,@args) = @_;

	my $seqobj = shift (@args);
		$seqobj->throw("die in _init, FindORF works only on
PrimarySeqI objects\n");
	unless($seqobj->moltype() eq 'dna')   # one of 'dna','rna','protein'
		$seqobj->throw("die in _init, FindORF works only on DNA

    	$self->{'_seqref'} = $seqobj;
    	return $self;

=head2 getORFs

 Title   : getORFs
 Usage   : my $array_ref = $ORF_obj->get_ORFs($length);
 Function: finds ORFs in a single DNA sequence
	   : will give the array of arrays (each line one array)
	   : 3 29 ATGCCCATGCCCCGTAATGACCTAGCC and the corresponding protein
         : 19 24 ATGACC 
 Returns : Reference to an array of arrays, as above.
 Args    : $length, the minimum size of ORF to bother with.

  Throws an exception if the submitted sequence is not DNA.


sub get_ORFs
	my ($self,$length) = @_;
	my $seqobj =  $self->{'_seqref'};

	if(!$length | $length =~ /\D/ig | $length <= 0) 
		$self->throw("die in get_ORFs, the length variable must be a
positive integer\n");

	unless  ($seqobj->isa("Bio::PrimarySeqI")) 
		$self->throw("die in get_ORFs, FindORF works only on
PrimarySeqI objects\n");

	my $seqstring = uc $seqobj->seq();
	my @all_ORFs = ();			# the thing returned at the

# now the real business
	my $p = 0;					# count starts
	my $q = 0;					# count stops
	my @start = ();				# store starts
	my @stop = ();				# store stops
	while ($seqstring =~ /(ATG)/g)		# got a start?
		$start[$p++] = pos($seqstring);	# note position

	while ($seqstring =~ /(TGA|TAA|TAG)/g) 	# got a stop?
		$stop[$q++] = pos($seqstring); 	# note position

	for(my $x=0;$x<=$#start;$x++)		# all starts
		for(my $y=0;$y<=$#stop;$y++)	# pair up each with all
# test if stop is after start, AND is in same frame, AND is of required

			if(($stop[$y]>$start[$x]) &&
(($stop[$y]-$start[$x])%3==0) && (($stop[$y]-$start[$x])>=$length))
# if we have a start and stop in frame, is there a clear ORF between them?

				my $ORFseq =
				my $cont=0;		# switch to decide
if we have ORF
				while($ORFseq =~ /TGA|TAA|TAG/ig) # got an
internal stop?
					if((pos($ORFseq))%3==0)	  # does it
interfere with the frame?
						$cont=1;	  # if yes,
not an ORF, ignore.
				if($cont==0)			  # if no,
ORF is clean, proceed
					# make a new seqobj from the good
ORF sequence
					my $ORF_seqobj =
Bio::PrimarySeq->new(-seq => "$ORFseq", -moltype => 'dna');
					# make a protein object from it
					my $translated_obj =
					# pull out the protein sequence
					my $prot_product =
# make up data array for ORF, with start pos., stop pos., ORF DNA seq. and
predicted protein seq.
					my @this_ORF =
(($start[$x]-2),($stop[$y]-3),$ORFseq, $prot_product);
# stick this into a big array of results
					push @all_ORFs, [@this_ORF];

	return \@all_ORFs;		# return the results array
# and that's all the module


##!/usr/local/bin/perl -w

use strict;

use lib "/usr/lib/perl5/5.005/"; # or whatever your path is
use Bio::Seq;
use Bio::SeqIO;
use Bio::Tools::FindORF;

my $seqin = Bio::SeqIO->new( '-format' => 'Fasta' , -file => 'YOUR FILE

while((my $seqobj = $seqin->next_seq()))		# run through flat
	print "\n".$seqobj->display_id();		# each sequence in
	my $ORF_obj = Bio::Tools::FindORF->new($seqobj);# make an ORF object
	my $array_ref = $ORF_obj->get_ORFs(2000);	# run the method
	my @array_of_arrays = @$array_ref;		# dereference the
		print "\n@$_";				# print results