[Bioperl-l] ncDNA

Chris Fields cjfields at uiuc.edu
Wed Mar 8 17:48:27 EST 2006

It's a good idea to mail to the list as well; others may have better or
alternative ways of doing this.


Jason mentions masking all regions you don't want (CDS, rRNA, tRNA) by
changing them to N's, then splitting on multiple N's:




Using substr (as Jason suggests) should also speed things up.  


Here's my solution using only bioperl objects; I use this for some work I do
here.  It works but isn't terribly fast with large sequences due to object
creation. There is very likely a better way with larger sequences.  You
could probably use the first loop to extract seqfeatures and change the
second part to simple perl to speed things up a bit or try straight regex
matching combined with substr for maximum velocity.  Be careful though;
bioperl subseq() searches starting with 1 while I think perl substr starts
at 0!



use Bio::SeqIO;

use Bio::Seq;

use strict;


my $file = shift @ARGV;


my $seqio = Bio::SeqIO->new(-file => "$file",

                            -format => 'genbank');

my $seqout = Bio::SeqIO->new(-fh => \*STDOUT,

                             -format => 'fasta');

my @cr = {};                                             

my $seq = $seqio->next_seq;


# grab features, filter for those we want to keep

for my $feature ($seq->get_SeqFeatures) {                

   next unless $feature->primary_tag =~ /CDS|rRNA|tRNA/; 

   for my $value ($feature->get_tag_values('gene')) {

      push @cr, {'gene' => $value, 

                 'start' => $feature->start,             

                 'end' => $feature->end};                




# extract and format subseqs

my $count = 0;

while($cr[$count]) {

   if (defined $cr[$count]->{'gene'}) {

      # get end of previous gene; if no gene (undef), set to 1 (startpoint)

      my $start = $cr[$count-1]->{'end'} ? $cr[$count-1]->{'end'} + 1 : 1 ;

      # get start of second gene (endpoint)

      my $end = $cr[$count]->{'start'} - 1; 

      my $igrname =  $cr[$count-1]->{'gene'}

                     ? $cr[$count-1]->{'gene'}."-".$cr[$count]->{'gene'}.' |


                     : $cr[$count]->{'gene'}.'|'.


      my $igr = $end - $start;

      if($igr > 0 ) {  # filter out negative or zero values

         my $sub = $seq->subseq($start, $end);

         my $subseq = Bio::Seq->new(-seq   => $sub,

                                    -desc  => $igrname);








Christopher Fields
Postdoctoral Researcher - Switzer Lab
Dept. of Biochemistry
University of Illinois Urbana-Champaign 



From: perlmails at gmail.com [mailto:perlmails at gmail.com] 
Sent: Wednesday, March 08, 2006 3:37 AM
To: cjfields at uiuc.edu
Subject: Re: ncDNA


Hi Chris,

Apologies for the personal email and the delay in replying.

I think, there is a misunderstanding, since I want to extract non-coding
sequences (intergenic) from the EMBL chromosome file.

With the links you mentioned annotated features could be extracted from EMBL
FT entries.

I am deleteing the annotated parts of the EMBL file to get the inter-feature

In my script(below) I replace the annotated feature sequences from a fasta
formatted file based on the (start-end) coordinates saved in a different
file using 'substr'.

Do you know, how to go about it? or suggest a better way?



You're not using bioperl.  See:


then go to:



On Feb 26, 2006, at 5:51 AM, perlmails at gmail.com wrote:

> Dear Bioperl group,
> I have been working on extracting non-coding DNA (ncDNA) sequences 
> from an organimsm.
> I tried extracting the intergenic sequences from the sense-strand
> after filtering the features (CDS, gene, mRNA, tRNA, rRNA etc) from
> the EMBL feature table entries using the Bioperl and the additional 
> script (mentioned below).
> Now, I realised that there is a problem to extract the ncDNA sequences
> from the negative-strand, Any ideas?
> To extract the ncDNAs from negative-strand, I thought of converting 
> the negative-strand co-ordinates to sense-strand co-ordinates and
> adding these to the sense-strand cords. Then filter all the features
> (select the ncDNAs after discarding the features from EMBL FT) to get 
> all the ncDNAs.
> Is there anything I am missing for using from the bioperl kit?
> ##<<<code start>>
> use strict;
> my $EMBL_cord_file = "Organism.feature.cords ";  # feature
> co-ordinates: start \t end
> my $RAW_file = "Organism.raw";
> my $ncDNA_file = "Organism.ncDNA";
> open(EMBLCORD, $EMBL_cord_file) or die "Canot open EMBL_cord_file"; 
> open(RAW, $RAW_file) or die "Canot open RAW_file";
> open(OUT, ">$ncDNA_file") or die;
> my @dna=<RAW>;
> my $dna = join('', at dna);
> while($dna){ 
>       $dna=~s/\s//g;
>       while(<EMBLCORD>){
>               my @cords = split /\t/;
>               my      $start = $cords[0];
>               my      $end = $cords[1];
>               my $replaceString = "\n>$start..$end"; 
>               substr($dna, $start-1, $end-$start+1, $replaceString);
> }
>       print OUT $dna,"\n";
>       exit;
> }
> ##<<<code end>>
> Another thing is, since I am reading the whole file in a scalar the 
> script does not complete the extraction of all ncDNAs from the
> sense-strand. Obviously, the features are parsed first before the
> flattening of the 266,000 nt sequence into a single string.
> Any help would be appreciated.
> -PO

More information about the Bioperl-l mailing list