[Bioperl-l] GenBank entries creation dates

Martin MOKREJŠ mmokrejs at ribosome.natur.cuni.cz
Tue Apr 15 06:45:48 EDT 2008


Chris Fields wrote:
> Note in the example I gave that, during the revision history, the 
> DBSOURCE changed at the point of the creation date (the original nuc.
>  record was a M. tuberculosis contig sequence, which later changed to
> an updated full M. tuberculosis genome record at the time of the
> 'create date').
> 
> Couldn't find anything specific in the GenBank docs on this, but it 
> appears (at least for a protein record) the creation date reflects
> the date in which the sequence was either originally deposited or
> originally derived from the nucleotide source record present in the
> record.  In other words, it may not reflect the original date of
> deposition (which could have come from a different record, as in this
> case).
> 
> chris

Hi,
I have a few answers from NCBI staff to similar questions I asked in the past
regarding DATE issues and VERSION numbers not being increased upon
"changes" in a record.
Below I have tried to put my former correspondence into a more readable form.
Hope this helps everybody to understand what happens in the black box. ;)
Martin





Date: Thu, 17 Jan 2002 15:40:07 -0500 (EST)
From: David Wheeler
Subject: Brucella_melitensis on ftp site

> Hi, I'd like to point out that the descriptions of
> Brucella_melitensis differ in 
> ftp.ncbi.nih.nlm.gov/genomes/Bacteria/Brucella_melitensis and 
> ftp.ncbi.nih.nlm.gov/genbank/genomes/Bacteria/Brucella_melitensis
> 
> Namely, the description of the strain is retained in *.gbk files
> under /genomes/Bacteria/Brucella_melitensis only under the strain
> description field, but not in the DEFINITION line, where it is
> present in *.gbk files under
> /genbank/genomes/Bacteria/Brucella_melitensis.
> 
> LOCUS       NC_003318 1177787 bp    DNA   circular  BCT       13-NOV-2001
> DEFINITION  Brucella melitensis chromosome II, complete sequence.
> ACCESSION   NC_003318
> VERSION     NC_003318.1  GI:17988344
> 
> compared to
> 
> LOCUS       AE008918  1177787 bp    DNA   circular  BCT       27-DEC-2001
> DEFINITION  Brucella melitensis strain 16M chromosome II, complete sequence.
> ACCESSION   AE008918
> VERSION     AE008918
> 
> This makes me worried about the data. Why is the release date of
> NON-curated files (AE008918) newer than the release date of CURATED
> data (NC_003318)? Is this expected? Could someone explain to me the
> difference between them (i.e. CURATED vs. NONCURATED)?

The curated record is initially a copy of the non-curated record with certain 
changes in documentation made in order to comply with the NCBI standard for 
reference genomes. One change which you have noticed is the difference in 
Definition line format.  Curated genomic records are created in order to 
standardize annotation for genomes in the Entrez Genomes database while leaving 
editorial control for the parent GenBank records in the hands of the original 
submitters.

Regardless of the date you see on the record, the curated version is derived from 
the non-curated one.  In this case, it appears that the processing of the 
non-curated version lagged a little bit relative to that of the curated version. 
Normally, however, the non-curated version will have the earlier date.




Date: Sun, 27 Jan 2002 00:16:55 -0500 (EST)
From: David Wheeler
Subject: Re: CONSULT: Brucella_melitensis on ftp site

> Are the raw sequence data always same in non-curated and curated 
> flatfiles?
> 
> Is the annotation of orf's/proteins different between them?
> 
> Are there any new or withdrawn orf's or proteins in the curated
> flatfiles compared to non-curated ones?
> 
> My feeling is that no-one except original submitters can modify
> submitted data, so you cannot modify non-curated files, i.e. cannot
> modify them and increase the version number.
> 
> Because of that, you've introduced curated versions, which are just
> copies of original but public data so you are free to modify it. So
> once again, are the differences between non-curated and curated
> flatfiles only in structure of the file? I don't think so. Examples
> would be Listeria genomes or the 2 Agrobacterium's, if I remember
> right.

Initially, there should be no or very few differences; however, as time
goes by, differences in the annotation will materialize.  There may also
be differences in the sequence, if errors in the original sequence come to
light, but these differences should be very rare.

So, practically speaking, you will probably find few differences but,
since the purpose of the Refseq is to curate, there may well be some
differences.




Date: Mon, 17 Dec 2001 11:57:06 -0500 (EST)
From: Dawn Lipshultz
Subject: Re: Buggy date in Staphylococcus aureus N315

>>>> Hi, I've found that Staphylococcus aureus N315 is listed as released
>>>> on 01-JAN-1900, which is nonsense. I guess you had a Y2K bug.
>>>> 
>>>> 
>>>> Please see
>>>> 
>> ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Staphylococcus_aureus_N315/BA000018.gbk
>> 
>>>> 
>>>> Can you please tell me the real release date?
>>>> 
>>>> Also, which is newer: the NC_xxxx for Staphylococcus aureus N315 under
>>>>
>>>> ftp://ncbi.nlm.nih.gov/genomes/Bacteria/Staphylococcus_aureus_N315/
>>>> or this BA000018 non-curated version?
>>>> 
>>>> 
>>>> LOCUS       BA000018  2814816 bp    DNA   circular  BCT       01-JAN-1900
>>>> DEFINITION  Staphylococcus aureus strain N315, complete genome.

>> I cannot find the record to which you refer in your message. When I
>>  did a search for accession number BA000018, I received results for
>>  accession numbers AP003129-AP003138. They are all dated June 2001.
>> 
>> 
>> The date for the record in the ftp file is April 2001. The record
>> in GenBank (NC_002745) is dated October 2001. This version is
>> apparently more updated than the one on the ftp site. Therefore,
>> you may want to download the sequence from GenBank rather than the
>> ftp site. Regards, Dawn S. Lipshultz

> 
> Hmm, but I do get: 
> http://www.ncbi.nlm.nih.gov:80/cgi-bin/Entrez/framik?db=genome&gi=179
> 
> 
> look at the "GenBank: NC_002745" text in left upper part of the
> window, it points to that OLD ftp file. The "RefSeq: NC_002745"
> points to the April 2001 version. So what is the right way to get the
> October 2001 release?
> 
> Where can I find the difference between NC_002745 from April compared
>  to NC_002745 from October?
> 
> What do you mean by "you may want to download the sequence from 
> GenBank rather than the ftp site."?
> 
> BOTH ftp directories at ftp://ncbi.nlm.nih.gov are outdated. I mean 
> the genomes/Bacteria/Staphylococcus_aureus_N315/NC_002745.* version 
> and also the 
> genbank/genomes/Bacteria/Staphylococcus_aureus_N315/BA000018.* 
> version.
> 
> The web links from www.ncbi.nlm.nih.gov:80/cgi-bin/Entrez/ point
> to the ftp site anyway. Are you saying that the ftp versions
> aren't updated anymore?

The genome was originally released into the database on 4/20/2001
as 10 pieces with secondary accession number BA000018.  You can 
find these pieces in Entrez nucleotides by querying with BA000018.

The Genomes group here will fix the date on the record that is available
from Entrez genomes.

Regards,
Dawn



Date: Fri, 16 Nov 2001 16:09:59 -0500 (EST)
From: Susan Dombrowski
Subject: Re: Agrobacterium tumefaciens C58

> Dear colleague, I've noticed that the genomic flatfiles of
> Agrobacterium tumefaciens C58 at
> ftp://ncbi.nlm.nih.gov/genbank/genomes/Bacteria/Agrobacterium_tumefaciens/
> were somehow updated on Oct 17. However, AE007869.gbs, for example, does
> NOT explain what has been changed, and the VERSION number has not been
> increased. Would you please explain what the change is, and where I can
> find such information on the web next time?
> 
> I used the published sequence with the same ID from your ftp site on
> 2001-08-29 and would like to know what differs.
> 
> LOCUS       AE007869  2841581 bp    DNA   circular  CON       17-OCT-2001
> DEFINITION  Agrobacterium tumefaciens strain C58 circular chromosome,
>             complete sequence.
> ACCESSION   AE007869
> VERSION     AE007869

Dear Colleague,
The version number of a sequence will *only* change if the content of the actual 
sequence has changed in any way since it was first made available. Although the 
date has changed, this date refers to the last time the actual record was 
manipulated by an NCBI staff member. Even if there is something simple, like 
adding a reference, changing a spelling mistake, etc., this will cause a change 
in the date field of the record. 

Thus, since the version has not changed, there are no differences to report.
Best Regards,
Susan
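
The rule described above (the VERSION changes only with the sequence, the date changes with any edit) can be illustrated with a minimal Python sketch. This is not NCBI tooling; the names `parse_header` and `classify_change` are hypothetical, and the parsing assumes the date sits on the LOCUS line in DD-MON-YYYY form and that the second field of the VERSION line identifies the sequence version.

```python
import re
from datetime import date

# GenBank dates look like 28-MAR-2001; a fixed month table avoids locale issues.
MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
DATE_RE = re.compile(r"(\d{2})-(" + "|".join(MONTHS) + r")-(\d{4})")


def parse_header(text):
    """Return (modification_date, version) read from the top of a flat file.

    Either element is None if the corresponding line is missing.
    """
    mod_date = version = None
    for line in text.splitlines():
        if line.startswith("LOCUS") and mod_date is None:
            match = DATE_RE.search(line)
            if match:
                day, month, year = match.groups()
                mod_date = date(int(year), MONTHS.index(month) + 1, int(day))
        elif line.startswith("VERSION") and version is None:
            fields = line.split()
            if len(fields) > 1:
                version = fields[1]  # e.g. "NC_002678.1"
        elif line.startswith(("FEATURES", "ORIGIN")):
            break  # header is over; no need to scan the rest
    return mod_date, version


def classify_change(old_text, new_text):
    """Compare two copies of a record using only the date and VERSION."""
    old_date, old_version = parse_header(old_text)
    new_date, new_version = parse_header(new_text)
    if old_version != new_version:
        return "sequence changed"
    if old_date != new_date:
        return "record edited, sequence unchanged"
    return "no change detected"
```

Applied to the two NC_002678 headers quoted elsewhere in this thread (28-MAR-2001 and 10-SEP-2001, both with VERSION NC_002678.1), this sketch would report "record edited, sequence unchanged".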




Date: Wed, 26 Jun 2002 11:04:48 -0400 (EDT)
From: Eric Sayers
Subject: Re: Mesorhizobium_loti flatfiles

>>>>> Hi,
>>>>>   I've found that some time ago you again silently changed flatfiles on your ftp
>>>>> site without changing the revision number. Please excuse me,
>>>>> but this really causes trouble for other people working in this so-called
>>>>> bioinformatics. :(
>>>>> 
>>>>> A week ago there was:
>>>>> 
>>>>> LOCUS       NC_002678            7036074 bp    DNA     circular BCT 10-SEP-2001
>>>>> DEFINITION  Mesorhizobium loti, complete genome.
>>>>> ACCESSION   NC_002678
>>>>> VERSION     NC_002678.1  GI:13470324
>>>>> 
>>>>> 
>>>>> and two other plasmid sequences. This yields 7275 proteins.
>>>>> 
>>>>> But, last autumn there was:
>>>>> 
>>>>> LOCUS       NC_002678 7036074 bp    DNA   circular  BCT       28-MAR-2001
>>>>> DEFINITION  Mesorhizobium loti, complete genome.
>>>>> ACCESSION   NC_002678
>>>>> VERSION     NC_002678.1  GI:13470324
>>>>> 
>>>>> 
>>>>> That version had 7281 proteins in total.
>>>>> I have a simple question: "Why was the VERSION number NOT changed?"
>>>>>
>>>>> Am I wrong in thinking that it should be updated whenever a single
>>>>> character in the file contents is changed?
>>> 
>>>> The version number of a sequence only changes if the sequence itself is
>>>> modified. If anything else in the flat file is changed (ie spelling, authors,
>>>> annotations, etc) the version will not change. However, the modification date in
>>> 
>>> Sorry, by annotation do you also mean the number of predicted genes, their
>>> coordinates (positions), etc.?
>>> 
>>>> the top line of the flat file will change for any of these modifications. (Note
>>>> that the dates are different in the file you display: Mar 28, 2001 vs Sept 10,
>>>> 2001.) I would track the modification date rather than or as well as the version
>>>> number to catch all changes in the files.
>>>> Regards,
>>>> Eric W. Sayers, Ph.D.
>>> 
>>> OK, but unless some of our programs were buggy before or are buggy now (and
>>> in either case failed to extract genes from the flatfiles), I do
>>> not have an explanation for the differences in the number of
>>> predicted/annotated genes.
>>> 
>>> I no longer have the old flatfiles from Mar 28 available, but it
>>> seems to me that these were newly introduced in the Sept. 10 version:
>>> gi_15600768, gi_15600770, gi_15600769, gi_15600766, gi_15600767
>> 
>> Dear Colleague,
>> Again, the only reason the version number will change is if the sequence itself 
>> changes. The number of annotated/predicted genes is merely an annotation on the 
>> sequence, and does not change the sequence itself. Therefore, the version will 
>> not change when the number of annotations changes. The modification date on the 
>> flat file will (and did) change, of course.
>> 
>> Regards,
>> Eric W. Sayers, Ph.D.
> 
> Finally I've heard that from someone, thanks!
> Now just tell me, how can I figure out what changed between those
> different "date" releases? Is there a changelog available?
> I consider annotation changes very important.

We do not provide the details of flat file changes on our public websites, 
except for changes in the version number (ie actual sequence changes). In that 
particular case, all of the previous versions are linked to the current one. My 
advice to you if you want to chronicle non-sequence changes would be to check 
the flat files of interest periodically (by a script, for example) and look for 
changes in the modification dates. You could then simply compare the before and 
after flat files.

Regards,
Eric W. Sayers, Ph.D.
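
The "compare the before and after flat files" advice above could be sketched roughly as follows in Python, using only the standard library. The function names `gi_set` and `compare_flatfiles` are hypothetical, and the sketch assumes that protein GIs appear in the feature table as `/db_xref="GI:..."` qualifiers, which may not hold for every flat file.

```python
import difflib
import re

# Assumption: protein GIs are recorded as /db_xref="GI:<digits>" qualifiers.
GI_RE = re.compile(r'/db_xref="GI:(\d+)"')


def gi_set(text):
    """Collect every GI found in a /db_xref qualifier of the flat-file text."""
    return set(GI_RE.findall(text))


def compare_flatfiles(old_text, new_text, old_name="old.gbk", new_name="new.gbk"):
    """Return a line diff plus the GIs added to or removed from the new copy."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm=""))
    old_gis, new_gis = gi_set(old_text), gi_set(new_text)
    return {
        "diff": diff,  # empty list when the two texts are identical
        "added": sorted(new_gis - old_gis, key=int),
        "removed": sorted(old_gis - new_gis, key=int),
    }
```

A periodic job would keep the previously downloaded copy, fetch the current one, and call `compare_flatfiles` only when the date on the LOCUS line differs; how the files are fetched and stored is left open here.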


> Hi, Miguel:
> 
> id1_fetch can do it. Detailed instruction can be found at:  
> 
> http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=toolkit.section.ch_demo.id1_fetch.html
> 
> Here is an example:
> 
> >id1_fetch -lt revisions -flat '12:74311105' -fmt fasta
> GI        Loaded      DB    Retrieval No.
> --        ------      --    -------------
> 74311105  12/07/2007  NCBI  19766263
> 74311105  01/23/2007  NCBI  16325656
> 74311105  03/30/2006  NCBI  13131204
> 74311105  03/03/2006  NCBI  12915541
> 74311105  03/02/2006  NCBI  12885275
> 74311105  12/03/2005  NCBI  12259793
> 74311105  09/09/2005  NCBI  11257262
> 74311105  09/09/2005  NCBI  11242667
> 
> Wenwu Cui PhD
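
A revision listing like the one above can be turned into dated records with a short Python sketch. It only parses text laid out exactly as shown (GI, load date as MM/DD/YYYY, DB, retrieval number) and does not call id1_fetch itself; `parse_revisions` is a hypothetical name.

```python
from datetime import datetime


def parse_revisions(listing):
    """Parse a revision table into dicts, oldest load date first.

    Header, separator and any other non-data lines are skipped.
    """
    rows = []
    for line in listing.splitlines():
        # Drop e-mail quote markers and surrounding spaces before splitting.
        fields = line.lstrip("> ").split()
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        gi, loaded, db, retrieval = fields
        rows.append({
            "gi": int(gi),
            "loaded": datetime.strptime(loaded, "%m/%d/%Y").date(),
            "db": db,
            "retrieval_no": int(retrieval),
        })
    return sorted(rows, key=lambda r: (r["loaded"], r["retrieval_no"]))
```

Sorting by load date and then by retrieval number keeps the two entries loaded on the same day (09/09/2005) in a stable order.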






