[Bioperl-l] [Structure of remote GenBank files]
Sebastien.Moretti at igs.cnrs-mrs.fr
Fri Apr 23 03:10:29 EDT 2004
> I ran your script with BioPerl 1.4 and ActiveState Perl 5.8 on Windows XP,
> and it works fine for me. I don't know what's causing your problem. Maybe
> telling us more about your system might help. This doesn't have anything
> to do with your file format problems, but I thought I'd mention that since
> your script takes accession numbers as input, you could skip the query
> and call $gb->get_Stream_by_id(\@accession) on an array of accessions
> or $gb->get_Seq_by_acc($acc) on a scalar.
I use SuSE Linux 9.0 and 8.2 with BioPerl 1.4.
I tried fetching GenBank and RefSeq files with
'$gb->get_Stream_by_id(\@accession)' and I still have the same problems (with
NM_178432, BC032122 or NM_000559 as accession numbers):
 - PUBMED fields are not on their own lines but are appended to the JOURNAL fields
 - COMMENT fields are compact, without blank lines or line breaks
Do you think this comes from the Linux system?
That might explain the blank lines, but why the PUBMED fields?
When a MEDLINE field is present (e.g. NM_169678 or NM_079645), the PUBMED and
MEDLINE fields are placed correctly.
The COMMENT fields are still compact, without blank lines or line breaks.
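
For reference, here is a minimal sketch of the query-free approach described above, so that others can try to reproduce the output. It assumes BioPerl is installed and that NCBI is reachable. The sub name fetch_genbank and the script name fetch_gb.pl are hypothetical, chosen only for this example; get_Stream_by_id, next_seq and write_seq are the calls already used in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fetch each accession from GenBank and write it to $out_fh in GenBank
# flat-file format. Returns the number of records written.
# fetch_genbank is a hypothetical name for this sketch, not a BioPerl function.
sub fetch_genbank {
    my ($out_fh, @accessions) = @_;
    return 0 unless @accessions;    # nothing to fetch, so no network access

    # Loaded here so the modules are only needed when a fetch actually happens.
    require Bio::DB::GenBank;
    require Bio::SeqIO;

    my $gb     = Bio::DB::GenBank->new;
    # get_Stream_by_id takes a reference to an array of IDs/accessions.
    my $stream = $gb->get_Stream_by_id(\@accessions);
    my $out    = Bio::SeqIO->new(-format => 'genbank', -fh => $out_fh);

    my $count = 0;
    while (my $seq = $stream->next_seq) {
        $out->write_seq($seq);      # write_seq prints to the filehandle itself
        $count++;
    }
    return $count;
}

# Example (hypothetical script name):
#   ./fetch_gb.pl NM_178432 BC032122 NM_000559 > records.gb
fetch_genbank(\*STDOUT, @ARGV) if @ARGV;
```

Because write_seq prints directly to the filehandle, there is no need to capture and print its return value.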
> >I use a BioPerl script to get GenBank and RefSeq files in GenBank flat
> > file format.
> > #!/usr/bin/perl -w
> > use strict;
> > use Bio::DB::GenBank;
> > use Bio::DB::Query::GenBank;
> > use Bio::SeqIO;
> > my $acc = $ARGV[0] or die "\n\tThe accession number you are looking for is missing.\n"
> >   . "\tTry something like: ./update_estCDK.pl NM_178432\n\n";
> > $acc=$acc."[Accession]";
> > my $query_string = "$acc";
> > my $query = Bio::DB::Query::GenBank->new(-db=>'nucleotide',
> > -query=>$query_string);
> > my $gb = new Bio::DB::GenBank;
> > my $stream = $gb->get_Stream_by_query($query);
> > my $out=Bio::SeqIO->new(-format=>'genbank');
> > my $seq = $stream->next_seq();
> > my $result=$out->write_seq($seq);
> > $result =~ s/^1.*$//;
> > #print $out->write_seq($seq);
> > print $result;
> > exit;
> >It works fine, but I have two structural problems in my files:
> > - the PUBMED fields are appended to the JOURNAL line above them:
> > JOURNAL J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED 12912980
> > JOURNAL J. Cancer Res. Clin. Oncol. 129 (9), 498-502 (2003) PUBMED
> > 12884029
> > JOURNAL Am. J. Physiol. Heart Circ. Physiol. 284 (6), H1917-H1923
> > (2003) PUBMED 12742823
> > - the COMMENT fields have no blank lines or line breaks (\n), so they look
> > compact:
> >COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff.
> > The reference sequence was derived from Y00272.1 and BC014563.1. On Oct
> > 22, 2001 this sequence version replaced gi:4502708. Summary: The protein
> > encoded by this gene is a member of the Ser/Thr protein kinase family.
> > This protein is a catalytic subunit of the highly conserved protein
> > kinase complex known as M-phase promoting factor (MPF), which is
> > essential for G1/S and G2/M phase transitions of eukaryotic cell cycle.
> > Mitotic cyclins stably associate with this protein and function as
> > regulatory subunits. The kinase activity of this protein is controlled by
> > cyclin accumulation and destruction through the cell cycle. The
> > phosphorylation and dephosphorylation of this protein also play important
> > regulatory roles in cell cycle control. Transcript Variant: This variant
> > (1) encodes the full length isoform. COMPLETENESS: complete on the 3'
> > end.
> >Does this come from my script?
> >Do you see the same thing?
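
Whatever the cause turns out to be, one possible workaround for the PUBMED problem is to post-process the text. The sketch below is only that: split_pubmed is a hypothetical helper, not part of BioPerl, and it assumes the usual GenBank flat-file layout in which the PUBMED keyword is indented by three spaces and its value starts in column 13. It does nothing for the COMMENT problem, since the original line breaks cannot be recovered from the compacted text.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): move a "PUBMED <id>" that trails
# a JOURNAL entry onto its own line.
sub split_pubmed {
    my ($text) = @_;
    # (?<=\S)  : PUBMED must follow other text on the same line, so a PUBMED
    #            field that already starts its own line is left untouched.
    # \s+(\d+) : the ID may have been wrapped onto the next line.
    $text =~ s/(?<=\S)[ \t]+PUBMED\s+(\d+)[ \t]*$/\n   PUBMED   $1/mg;
    return $text;
}

# Usage on one of the lines quoted above:
my $sample = "  JOURNAL   J. Biol. Chem. 278 (42), 40815-40828 (2003) PUBMED 12912980\n";
print split_pubmed($sample);
```

Note that the pattern is not restricted to REFERENCE blocks: any line ending in "PUBMED" followed by a number, for example inside a COMMENT, would be split as well.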
CNRS - IGS
31 chemin Joseph Aiguier
13402 Marseille cedex 20, FRANCE
tel. +33 (0)4 91 16 44 55