[Bioperl-l] Best Practices for Downloading/Mirroring Genbank
mjohnson at watson.wustl.edu
Thu Jun 24 12:13:07 EDT 2004
Rsync is your friend. Both NCBI and Biomirror are rsync-friendly, so you
can use rsync to maintain a local copy of whatever parts of the NCBI ftp
site you'd like. Once the rsync finishes, you can be assured that you
have a consistent local snapshot (as long as you didn't rsync in the
middle of a file update on the other end). It will even minimize your
bandwidth consumption: on subsequent invocations it only transfers
files you don't have, or the changes to files you do have.
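To make that concrete, here is a minimal sketch of a wrapper that builds
the rsync command line. The mirror URL and local path are placeholders
(assumptions, not addresses taken from this thread), so substitute the
rsync location your mirror actually publishes. The flags are standard
rsync options. The optional --checksum flag makes rsync compare file
contents instead of size and timestamp, which may help when two mirrors
carry the same file with different timestamps, at the cost of reading
every file on both ends.

```python
import shlex
import subprocess


def build_rsync_command(source, dest, dry_run=False, checksum=False, delete=False):
    """Return the argument list for an rsync run that mirrors source into dest."""
    cmd = [
        "rsync",
        "--archive",   # recurse and preserve times, permissions, symlinks
        "--verbose",   # list each file as it is transferred
        "--partial",   # keep partially transferred files so a rerun can resume
    ]
    if checksum:
        cmd.append("--checksum")  # compare by content, not size + timestamp
    if delete:
        cmd.append("--delete")    # remove local files that vanished upstream
    if dry_run:
        cmd.append("--dry-run")   # report what would change, transfer nothing
    cmd += [source, dest]
    return cmd


def mirror(source, dest, **options):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_rsync_command(source, dest, **options), check=True)


# Placeholder values (assumptions): replace with your mirror and local path.
# The trailing slash on the source copies the directory's contents.
EXAMPLE_SOURCE = "rsync://mirror.example.org/genbank/"
EXAMPLE_DEST = "/data/mirror/genbank/"

# Only print the command here; call mirror() to actually run it.
print(shlex.join(build_rsync_command(EXAMPLE_SOURCE, EXAMPLE_DEST, dry_run=True)))
```

Running it with dry_run=True first shows what a real run would fetch.
Scheduling mirror() from cron once a day would then pick up the daily
updates and any rebuilt files without re-downloading unchanged ones.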
> I'm working on setting up a local mirror of Genbank here at work and am
> unsure of what the best way to go about it is.
> I started off real simple with a wget -m ftp://genbank.sdsc.edu/pub (yes, I
> wanted the BLAST-formatted databases and executables as well) and the
> transfer is going just fine, albeit excruciatingly slowly at times.
> But what happens:
> 1) between now and the next build?;
> 2) if I choose to mirror from an alternate source?;
> 3) after the next build?
> For the first part, I just planned on doing daily wgets for the updates, but
> the possibility occurred to me that if I miss the last couple of days' worth
> of updates before the new build, those updates get shuffled into the main
> build files and I have to download the whole thing again?
> For the second, if I choose to mirror from Biomirror or NCBI instead of San
> Diego, the timestamps seem to be different for what I am assuming to be
> the same build. For example,
> gbest1.seq.gz  19,454,020 bytes  5/22/04 5:04am  SDSC Mirror
>                19,454,020 bytes  4/25/04 2:01am  NCBI Mirror
>                19,454,020 bytes  4/25/04 2:01am  BioMirror
> For the third part, do the build files really change, or are new entries and
> revisions just added on as extra build files? I read that the files are
> non-cumulative, so that would seem to confirm it, but the timestamps are
> updated in sync with the latest build date.
> How do I keep an updated mirror without losing daily builds or having to
> download the whole thing every couple of months? How do I verify that I
> have the latest data, because checking timestamps does not seem like it
> would work? Should I even bother with creating a true mirror?
> I ran across this recent thesis on some of the issues in maintaining these
> types of databases accurately while minimizing file transfers
> I know that Biomirror has some scripts to facilitate efficient transfers, but
> do they handle updates? I'm guessing this problem has already been
> addressed; I just can't find the solution.
> Thanks in advance for any input,
> Joseph Karalius
> RA, Bioinformatics
> Molecular Markers and Applied Genomics
> Seminis Vegetable Seeds, Inc
> 37437 State Highway 16
> Woodland, CA 95695-9353