[Bioperl-l] Best Practices for Downloading/Mirroring Genbank

Karalius, Joseph Joseph.Karalius at seminis.com
Mon Jun 14 13:31:21 EDT 2004

I'm working on setting up a local mirror of Genbank here at work and am
unsure of the best way to go about it.

I started off simply with a wget -m ftp://genbank.sdsc.edu/pub (yes, I
wanted the BLAST-formatted databases and executables as well) and the
transfer is going just fine, albeit excruciatingly slowly at times.

But what happens:
1) between now and the next build?
2) if I choose to mirror from an alternate source?
3) after the next build?

For the first part, I planned on doing daily wgets for the updates.  But it
occurred to me that if I miss the last couple of days' worth of updates
before the new build, those updates get folded into the main build files.
Would I then have to download the whole thing again?

For the second, if I choose to mirror from Biomirror or NCBI instead of San
Diego, the timestamps seem to differ for what I am assuming to be the same
build.  For example,

File           Size (bytes)  Date     Time    Source
gbest1.seq.gz  19,454,020    5/22/04  5:04am  SDSC Mirror
gbest1.seq.gz  19,454,020    4/25/04  2:01am  NCBI Mirror
gbest1.seq.gz  19,454,020    4/25/04  2:01am  BioMirror
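
Since the sizes agree across mirrors while the timestamps do not, one option
is to decide what to fetch from file name and size alone. The following is
only a minimal sketch in Python, not anything the mirrors provide: the
"name size" listing format and the function names parse_listing and
files_to_fetch are assumptions made for illustration. Note that an equal size
does not prove equal content, so this is a cheap first filter rather than a
verification.

```python
from typing import Dict, List


def parse_listing(text: str) -> Dict[str, int]:
    """Parse lines of 'name size' (hypothetical format) into {name: size}."""
    sizes: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        # Allow thousands separators such as 19,454,020.
        sizes[parts[0]] = int(parts[1].replace(",", ""))
    return sizes


def files_to_fetch(remote: Dict[str, int], local: Dict[str, int]) -> List[str]:
    """Return remote files that are missing locally or differ in size.

    Timestamps are deliberately ignored, because mirrors of the same
    build can carry different modification times.
    """
    return sorted(name for name, size in remote.items()
                  if local.get(name) != size)


if __name__ == "__main__":
    remote = parse_listing("gbest1.seq.gz 19,454,020\ngbest2.seq.gz 100\n")
    local = {"gbest1.seq.gz": 19454020}
    print(files_to_fetch(remote, local))
```

With this comparison, switching between SDSC, NCBI, and BioMirror would not
by itself trigger a re-download of gbest1.seq.gz.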

For the third part, do the build files really change, or are new entries and
revisions just added on as extra build files?  I read that the files are
non-cumulative, which would seem to confirm the latter, but the timestamps
are updated in sync with the latest build date.

How do I keep an updated mirror without losing daily builds or having to
download the whole thing every couple of months?  How do I verify that I
have the latest data, given that checking timestamps does not seem like it
will work?  Should I even bother with creating a true mirror?
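
If timestamps cannot be trusted, a content checksum is one alternative way to
verify a file. The sketch below, in Python, hashes a local file in chunks and
compares the result with an expected digest. Where that expected digest comes
from is an assumption: it presumes the mirror, or a trusted copy of the file,
can supply an MD5 value, which I have not confirmed for any of the sites
mentioned. The names file_md5 and is_current are hypothetical.

```python
import hashlib


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that multi-gigabyte build files do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_current(path: str, expected_md5: str) -> bool:
    """Compare a local file against an expected digest.

    expected_md5 is assumed to come from a trusted source; this
    function only does the comparison.
    """
    return file_md5(path) == expected_md5.strip().lower()
```

A file whose digest does not match would be re-fetched regardless of its
timestamp; a matching file could be skipped even if the mirror touched it.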

I ran across this recent thesis on some of the issues in maintaining these
types of databases accurately while minimizing file transfers.

I know that Biomirror has some scripts to facilitate efficient transfers, but
do they handle updates?  I'm guessing this problem has already been
addressed; I just can't find the solution.

Thanks in advance for any input,

Joseph Karalius
RA, Bioinformatics
Molecular Markers and Applied Genomics
Seminis Vegetable Seeds, Inc
37437 State Highway 16
Woodland, CA 95695-9353
