[Bioperl-l] Memory requirements for conversion from embl to genbank

Chris Fields cjfields at uiuc.edu
Thu Aug 31 10:13:31 EDT 2006


Martin,

That's the issue; I believe the tags are supposed to be unique (part of the
EMBL standard, I think).  I'll look at it but this may be, again, one of
those issues which we may not fix as it's a problem with the input sequence
(not in the correct format).  
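
To find such records before running the conversion, something like the
following could be used. It is only a sketch: find_duplicate_qualifiers is a
hypothetical helper, not part of BioPerl, and it assumes the usual EMBL layout
in which the feature key follows "FT" and three spaces, and qualifier lines
start with "/" after the indentation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): scan EMBL feature-table
# lines and return "FEATURE:/qualifier" for every qualifier that occurs
# more than once within a single feature.
sub find_duplicate_qualifiers {
    my (@lines) = @_;
    my (@dups, %seen, $feature);
    for my $line (@lines) {
        if ($line =~ /^FT   (\S+)\s+\S/) {      # feature key line
            $feature = $1;
            %seen    = ();                      # start counting afresh
        }
        elsif ($line =~ m{^FT\s+/(\w+)}) {      # qualifier line
            next unless defined $feature;
            # Report each duplicated qualifier once, on its second occurrence.
            push @dups, "${feature}:/$1" if $seen{$1}++ == 1;
        }
    }
    return @dups;
}

# Example using feature lines from the record quoted in this thread.
my @record = (
    q{FT   5'UTR           1..1892},
    q{FT                   /source="EMBL::AJ428955:1..1892"},
    q{FT                   /product="non-structural polyprotein"},
    q{FT   VECTOR          477..1274},
    q{FT                   /source="EMBL::AJ428955:477..1274"},
    q{FT                   /evidence="Similarity"},
    q{FT                   /db_xref="EMBL:"},
    q{FT                   /note="Possible vector contamination"},
    q{FT                   /note="Length=798 BP. Identities=99.6%"},
);
print "Duplicate qualifier in $_\n" for find_duplicate_qualifiers(@record);
```

The sketch only flags candidates; whether a repeated qualifier is actually
invalid depends on the format rules for that qualifier.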

At the very least it should break out of an infinite loop with a thrown
message.  Have you tried adding a debugging statement to the specific line
in genbank.pm to verify the infinite loop?
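
As a rough illustration of that kind of guard (a sketch only: read_quoted_value
is a made-up helper, not the code in genbank.pm, and the idea that the parser
keeps appending lines until the double quotes balance is an assumption):

```perl
#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # autoflush STDOUT so debugging output is not held in a buffer

# Hypothetical helper: join continuation lines until the double quotes in
# the value are balanced, but give up with an exception instead of looping
# forever when they never balance.
sub read_quoted_value {
    my ($lines, $max_lines) = @_;    # $lines is an array ref of pending lines
    my $value = shift @$lines;
    my $count = 0;
    # tr/"// with an empty replacement only counts the quotes.
    while (($value =~ tr/"//) % 2) {
        die "Unbalanced quotes after $count continuation line(s): $value\n"
            if !@$lines || ++$count > $max_lines;
        $value .= ' ' . shift @$lines;
    }
    return $value;
}

# A balanced value spread over two lines is joined normally.
print read_quoted_value([ '/note="Possible vector', 'contamination"' ], 10), "\n";

# A value whose closing quote is missing raises an error that can be caught.
my $ok = eval { read_quoted_value([ '/note="never closed' ], 10); 1 };
print "Caught: $@" unless $ok;
```

The caller can wrap the conversion of each record in eval, report the
offending accession, and move on to the next record.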
 
Wow, you've run into a hornet's nest of bad sequences.  Missing quotes, too
many quotes, now this!

Chris

> -----Original Message-----
> From: Martin MOKREJŠ [mailto:mmokrejs at ribosome.natur.cuni.cz]
> Sent: Thursday, August 31, 2006 8:50 AM
> To: Chris Fields
> Cc: bioperl-l at lists.open-bio.org
> Subject: Re: [Bioperl-l] Memory requirements for conversion from embl to
> genbank
> 
> It slowed down after printing the lines below; in fact, it stopped
> printing altogether (but that could be because the output is buffered.
> There is a way to turn buffering off; I used to know the contents of
> 'man perlopentut' some years ago, but it's gone from my head now):
> 
> Acc:BB133146
> Acc:BB199913
> Acc:BB199915
> Acc:BB199667
> Acc:BB199670
> Acc:BB199673
> Acc:BB199676
> Acc:BB199679
> Acc:BB199682
> Acc:BB228934
> Acc:BB229388
> Acc:BB229266
> Acc:BB229267
> Acc:BB199709
> Acc:BB199710
> Acc:BB199711
> Acc:BB199712
> Acc:BB200048
> Acc:BB199986
> Acc:BB199993
> 
> 
> It hasn't died yet, but I guess it will in a while. The next record
> which it did not spit out is:
> 
> ID   5HGB000664 standard; mRNA; VRL; 1892 BP.
> XX
> AC   BB199698;
> XX
> DT   20-NOV-2002 (Rel. 16, Created)
> DT   20-NOV-2002 (Rel. 16, Last updated, Version 1)
> XX
> DE   5'UTR in Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> DR   EMBL; AJ428955;
> DR   UTR; CC221018;
> XX
> OS   Hepatitis GB virus B
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
> XX
> UT   5'UTR;
> XX
> FH   Key             Location/Qualifiers
> FH
> FT   5'UTR           1..1892
> FT                   /source="EMBL::AJ428955:1..1892"
> FT                   /product="non-structural polyprotein"
> FT   VECTOR          477..1274
> FT                   /source="EMBL::AJ428955:477..1274"
> FT                   /evidence="Similarity"
> FT                   /db_xref="EMBL:"
> FT                   /note="Possible vector contamination"
> FT                   /note="Length=798 BP. Identities=99.6%"
> XX
> 
> 
> Note the two /note feature lines. I guess the quoting code loops over
> them and keeps adding quote after quote. ;-)
> 
> M.
> 
> 
> 
> 
> Chris Fields wrote:
> > Martin,
> >
> > Do you get the same issue using SeqIO?
> >
> > #!/usr/bin/perl -w
> >
> > use strict;
> > use warnings;
> > use Bio::SeqIO;
> >
> > my $file_in = '5UTR.Vrl_nr.dat';
> >
> > my $file_out = '5UTR.Vrl_nr.gb';
> >
> > my $seqin = Bio::SeqIO->new(-format => 'embl',
> >                             -file   => "<$file_in");
> >
> > my $seqout = Bio::SeqIO->new(-format => 'genbank',
> >                             -file   => ">$file_out");
> >
> > while (my $seq = $seqin->next_seq) {
> >     print "Acc:",$seq->accession,"\n";
> >     $seqout->write_seq($seq);
> > }
> >
> >
> > Chris
> >
> >
> > On Aug 31, 2006, at 7:44 AM, Martin MOKREJŠ wrote:
> >
> >> Hi,
> >>   I use bp_sreformat.pl to convert a file from embl format
> >> to genbank. I use the current cvs HEAD version and cannot parse
> >> two files. Each record is small and I don't understand why
> >> there is such a huge memory requirement. The machine has 1GB
> >> RAM and runs a recent linux kernel. Moreover, I could
> >> parse the same file with bioperl-1.5.1 after I manually
> >> fixed some missing quotes in the file.
> >>
> >>   With the current changes to the embl & genbank parsing (bug #2077)
> >> I can no longer parse the file.
> >>
> >>   Here is the memory status at the moment when the machine ran
> >> out of memory and linux kernel killed the application:
> >>
> >>  1  0 803212  20936      8   2184    0    0     0     0 1062   38  99  1  0  0
> >>  1  0 803208  19944      8   2184    0    0     0     0 1062   38 100  0  0  0
> >>  1  0 803208  18828      8   2184    0    0     0     0 1061   37 100  0  0  0
> >>  1  0 803204  17836      8   2184    0    0     0     0 1062   40 100  0  0  0
> >>  1  0 803204  16844      8   2184    0    0     0     0 1062   48 100  0  0  0
> >>  1  0 803200  15728      8   2184   32    0    32     0 1063   41 100  0  0  0
> >>  1  0 803200  14736      8   2184    0    0     0     0 1062   41  99  1  0  0
> >>  1  0 803196  13744      8   2184    0    0     0     0 1061   38 100  0  0  0
> >>  1  0 803240  13640      8   2184    0   48     0    48 1063   68  99  1  0  0
> >>  1  1 803240  12920      8   1984    0   40     0    40 1065  136 100  0  0  0
> >>  1  1 803240  13192      8   1872    0 1056     0  1056 1114  326  96  4  0  0
> >>  1  1 803240  14448      8   1336    0   20     0    20 1081  192  90 10  0  0
> >>  1  1 803240  13656      8   1232    0   28     0    28 1070  104  87 13  0  0
> >>  1  1 803240  12892      8   1260   32    4   176     4 1069  113  86 14  0  0
> >>  0  4 803240  12144      8   1344  192   24   612    24 1088  185  44 16  0 40
> >>  0  7 803240  11952      8   1180   32   32   508    32 1113  591  46 23  0 32
> >>  0  3 803240  11948      8   1336 1120  500 10816   500 4390 1397   2 31  0 66
> >>  2  6 803240  12056      8   1788  752  136  9412   136 6101 1795   0 27  0 73
> >>  0  7 803240  12176      8   1748   12    0  2180     0 1132  326   0 20  0 80
> >> procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs  us sy id wa
> >>  0  5 803240  12492      8   1508  136   32  7508    32 2610  865   4 45  0 51
> >>  0  6 803240  12056      8   2004   64    8  1456     8 1138  312   9 18  0 73
> >>  1  6 803240  12668      8   1452   96   28 14856    28 2434  658   0 31  0 69
> >>  0  7 803240  13240      8    564    0    0  3112     0 4602 1492   4 38  0 58
> >>  0 10 803240  12768      8    688   36 15272  6000 15272 2026  431  26 39  0 35
> >>  0  2  81780 966512      8   5692  108    0  2904     0 2204  372   0 11  0 89
> >>  0  3  81780 966204      8   6056  128    0   488     3 1155   82   1  0  0 99
> >>  0  1  81780 965460      8   6260  492    0   696     0 1150  161   0  1 13 86
> >>  0  1  81732 963652      8   7860    8    0  1608     0 1147  199   1  2 42 55
> >>  0  1  81732 962052      8   8560    4    0   704     0 1129  177   6  1 43 50
> >>  0  1  81732 960120      8   9128    0    0   568     0 1124  161  12  2 57 29
> >>  0  1  81732 957512      8   9840    4    0   716     0 1137  191  13  2 27 58
> >>  1  0  81732 954992      8  10640   32    0   832     0 1135  191  14  1 47 38
> >>  1  0  81732 952824      8  11016    0    0   340     0 1096  128  64  1 18 16
> >>  1  0  81732 952152      8  11092    0    0     0     0 1062   80  99  1  0  0
> >>  1  0  81732 951424      8  11196    0    0     0     0 1062  105  99  1  0  0
> >>  1  0  81732 950808      8  11264    0    0     0     0 1062   74  99  1  0  0
> >>
> >>
> >> $ bp_sreformat.pl -if embl -of genbank -i 5UTR.Vrl_nr.dat -o 5UTR.Vrl_nr.gb
> >> Killed
> >> $
> >>
> >> The file can be obtained from
> >> ftp://bighost.ba.itb.cnr.it-fixed/pub/Embnet/Database/UTR/data/
> >>
> >> I am not a Perl guru, nor am I familiar with the bioperl code. Does
> >> someone know whether the parsed records are held in memory or not?
> >> It seems so. I guess the objects could be freed by dereferencing them
> >> immediately after they are written out in the new format. Or perhaps
> >> the garbage collector does not work well in perl 5.8.8.
> >>
> >> Thanks for any help.
> >> Martin
> >>
> >> --
> >> Dr. Martin Mokrejs
> >> Faculty of Science, Charles University
> >> Vinicna 5, 128 43 Prague, Czech Republic
> >> http://www.iresite.org
> >> http://www.iresite.org/~mmokrejs
> >
> >
> > Christopher Fields
> > Postdoctoral Researcher
> > Lab of Dr. Robert Switzer
> > Dept of Biochemistry
> > University of Illinois Urbana-Champaign
> >
> >
> >
> >
> 
> --
> Dr. Martin Mokrejs
> Faculty of Science, Charles University
> Vinicna 5, 128 43 Prague, Czech Republic
> http://www.iresite.org
> http://www.iresite.org/~mmokrejs
