[Bioperl-l] FASTA 2 GenBank

Peter.Robinson at t-online.de Peter.Robinson at t-online.de
Mon Oct 17 15:27:06 EDT 2005

Dear bioperlers,

Forgive what may be a simple question, but consulting the HOWTOs and Google did not reveal an answer.
I am in the process of analyzing ESTs from a non-model organism and would like to build GenBank-style files for the contig sequences by adding information about sequence features. I would like to start by adding information about the presumed ORF as follows:

## 1) This is the 'new' sequence
my $seqio = Bio::SeqIO->new(-file => $inname, -format => 'fasta');
my $seq    = $seqio->next_seq();

## 2) This is the feature I would like to add, with $startpos
## and $endpos being the start/end of the ORF based on translations 
## and alignments
my $feat = Bio::SeqFeature::Generic->new(-start   => $startpos,
                                         -end     => $endpos,
                                         -strand  => 1,
                                         -primary => 'CDS',
                                         -source  => 'Manual annotation of CDS');
## Attach the feature to the sequence so that it is written out with it
$seq->add_SeqFeature($feat);

## 3) Here I would like to output the sequence in GenBank format
my $out = Bio::SeqIO->new(-file   => ">$outputfilename",
                          -format => 'EMBL');
$out->write_seq($seq);

### However, I get this:

ID   ABC2002.1   standard; DNA; UNK; 5914 BP.
AC   unknown;
DE   /early=858 /middle=1093 /late=436
FH   Key             Location/Qualifiers
FT   CDS             104..4501
SQ   Sequence 5914 BP; 1088 A; 1893 C; 1748 G; 1174 T; 11 other;

But I would like to get something like this:

LOCUS       XM_213440               5804 bp    mRNA    linear   ROD 15-APR-2005
DEFINITION  PREDICTED: Rattus norvegicus collagen, type 1, alpha 1 (Col1a1),
VERSION     XM_213440.3  GI:62656859
SOURCE      Rattus norvegicus (Norway rat)
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Rattus.
COMMENT     MODEL REFSEQ:  This record is predicted by automated computational
            analysis. This record is derived from an annotated genomic sequence
            (NW_047337) using gene prediction method: GNOMON, supported by mRNA
            and EST evidence.
            Also see:
                Documentation of NCBI's Annotation Process
            On Apr 15, 2005 this sequence version replaced gi:34873454.
FEATURES             Location/Qualifiers
     source          1..5804
                     /organism="Rattus norvegicus"
     gene            1..5804
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON. Supporting evidence
                     includes similarity to: 2 mRNAs, 48 ESTs, 1 Protein"
     CDS             95..4456
                     /product="similar to Collagen alpha1"
                      etc "
        1 gacggagcag gaggcacacg gagtgaggcc acgcatgagc cgaagctaac cccccacccc
       61 agccgcaaag agtctacatg tctagggtct agacatgttc a

I would be happy if I could get the CDS bit right and very happy if I could add some further information in the above style. At the moment some downstream applications are not working because the GenBank format is incorrect.
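For comparison, here is a minimal sketch of the same steps with the writer switched to GenBank. It assumes that passing 'genbank' as the -format argument to Bio::SeqIO is the only change needed on the output side. The sequence, the identifier, the coordinates, and the /product and /codon_start values are made-up placeholders for illustration, not data from the contigs above.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;

# Placeholder sequence built in memory (hypothetical; a real run would read
# the contig from the FASTA file with Bio::SeqIO instead).
my $seq = Bio::Seq->new(-display_id => 'ABC2002',
                        -desc       => 'Example contig',
                        -alphabet   => 'dna',
                        -seq        => 'GGATGGCTTAAGG');

# CDS feature with two qualifiers passed through -tag (values are illustrative).
my $feat = Bio::SeqFeature::Generic->new(
    -start   => 3,
    -end     => 11,
    -strand  => 1,
    -primary => 'CDS',
    -source  => 'Manual annotation of CDS',
    -tag     => { product => 'hypothetical protein', codon_start => 1 },
);
$seq->add_SeqFeature($feat);

# Write to an in-memory filehandle so the record can be inspected as a string.
open(my $fh, '>', \my $genbank) or die "Cannot open in-memory handle: $!";
my $out = Bio::SeqIO->new(-fh => $fh, -format => 'genbank');
$out->write_seq($seq);
$out->close;

print $genbank;
```

If this behaves as expected, the printed record should begin with a LOCUS line and list the CDS under FEATURES. Fields such as the molecule type and division on the LOCUS line are probably better set on a Bio::Seq::RichSeq object, which is not shown here.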

Thanks,

