[Bioperl-l] [Structure of remote GenBank files]

Barry Moore barry.moore at genetics.utah.edu
Thu Apr 22 13:24:11 EDT 2004


I ran your script with BioPerl 1.4 and ActiveState Perl 5.8 on Windows XP,
and it works fine for me.  I don't know what's causing your problem.
Telling us more about your system might help.  This doesn't have anything
to do with your file format problems, but I thought I'd mention that, since
your script takes accession numbers as input, you could skip the query
and call $gb->get_Stream_by_id(\@accession) on an array of accessions
or $gb->get_Seq_by_acc($acc) on a scalar.
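
Something along these lines, as a rough sketch only: the @accessions
array and reading it from the command line are my own assumptions, and I
haven't run this exact version against your records.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::SeqIO;

# Accessions come from the command line, e.g. ./update_estCDK.pl NM_178432
# (@accessions is an illustrative name, not from the original script).
my @accessions = @ARGV
    or die "Usage: $0 ACCESSION [ACCESSION ...]\n";

my $gb = Bio::DB::GenBank->new;

# Write GenBank flat-file records straight to STDOUT, so there is no
# need to capture and clean up the return value of write_seq().
my $out = Bio::SeqIO->new(-format => 'genbank', -fh => \*STDOUT);

if (@accessions == 1) {
    # A single accession: fetch one sequence object directly.
    my $seq = $gb->get_Seq_by_acc($accessions[0]);
    $out->write_seq($seq);
}
else {
    # Several accessions: fetch them as a stream and write each record.
    my $stream = $gb->get_Stream_by_id(\@accessions);
    while (my $seq = $stream->next_seq) {
        $out->write_seq($seq);
    }
}
```

Either way there is no Bio::DB::Query::GenBank object involved, which is
the only point I was making.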


Sebastien Moretti wrote:

>I use a BioPerl script to get GenBank and RefSeq files in GenBank flat file format:
>	#!/usr/bin/perl -w
>	use strict;
>	use Bio::DB::GenBank;
>	use Bio::DB::Query::GenBank;
>	use Bio::SeqIO;
>	my $acc=$ARGV[0] or die "\n\tThe accession number you seek for is missing.\n\tTry something like: ./update_estCDK.pl NM_178432\n\n";
>	$acc=$acc."[Accession]";
>	my $query_string = "$acc";
>	my $query = Bio::DB::Query::GenBank->new(-db=>'nucleotide',
>	                                         -query=>$query_string);
>	my $gb = new Bio::DB::GenBank;
>	my $stream = $gb->get_Stream_by_query($query);
>	my $out=Bio::SeqIO->new(-format=>'genbank');
>	my $seq = $stream->next_seq();
>	my $result=$out->write_seq($seq);
>	$result =~ s/^1.*$//;
>	#print $out->write_seq($seq);
>	print $result;
>	exit;
>It works fine but I have two structure problems in my files:
>	- the PUBMED fields are appended to the JOURNAL lines above them:
>  JOURNAL   J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED   12912980
>  JOURNAL   J. Cancer Res. Clin. Oncol. 129 (9), 498-502 (2003) PUBMED
>            12884029
>  JOURNAL   Am. J. Physiol. Heart Circ. Physiol. 284 (6), H1917-H1923 (2003)
>            PUBMED   12742823
>	- the COMMENT fields have no blank lines or \n, so they look
>	   compact:
>COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
>            reference sequence was derived from Y00272.1 and BC014563.1. On
>            Oct 22, 2001 this sequence version replaced gi:4502708. Summary:
>            The protein encoded by this gene is a member of the Ser/Thr
>            protein kinase family. This protein is a catalytic subunit of the
>            highly conserved protein kinase complex known as M-phase promoting
>            factor (MPF), which is essential for G1/S and G2/M phase
>            transitions of eukaryotic cell cycle. Mitotic cyclins stably
>            associate with this protein and function as regulatory subunits.
>            The kinase activity of this protein is controlled by cyclin
>            accumulation and destruction through the cell cycle. The
>            phosphorylation and dephosphorylation of this protein also play
>            important regulatory roles in cell cycle control. Transcript
>            Variant: This variant (1) encodes the full length isoform.
>            COMPLETENESS: complete on the 3' end.
>Does it come from my script?
>Do you see the same thing?

Barry Moore
Dept. of Human Genetics
University of Utah
Salt Lake City, UT
