SynCron: tools for maintaining synchronised copies of the EMBL Nucleotide Sequence Database

Nicole Redaschi, Leif Landeman, Matteo diTommaso
Database Programming Group, EMBL Outstation, The European Bioinformatics Institute and
Reinhard Doelz
Sandoz Pharmaceuticals, Basel, Switzerland

Introduction

The EMBL nucleotide database doubles in size approximately every twelve months. A release of the database is made every three months and distributed via ftp and CD-ROM. Between releases, updates are available in two forms on the EBI ftp server: a cumulative set, containing all updates since the last full release up to the current day, and incremental sets, issued at least once a day and compiled into weekly sets. The incremental update files range between 100 Kbytes and 25 Mbytes, with the typical file being less than 2 Mbytes. At the end of release 48, the cumulative set exceeded 150 Mbytes.

Using the cumulative file to update a local database copy offers the advantage that there is no need for further data processing and, hence, no risk of errors arising through the manipulation of incremental update files. At many sites, however, the available network bandwidth makes it impractical to transfer the cumulative file by ftp from the database provider site daily or even weekly. Downloading incremental files requires much less bandwidth but more local maintenance effort, since the incremental files have to be integrated into a cumulative file to present the data as a single file for conversion into GCG format or indexing by SRS. To facilitate this processing step, and to provide a reliable mechanism to regenerate the cumulative file locally from incremental updates, we have developed the SynCron tools.

Systems and methods

The SynCron package contains a set of file manipulation utilities written in the C language and scripts to run the requested updating job on Unix (csh shell) and VMS (DCL) operating systems. SynCron is available by anonymous ftp from:

UNIX Version

VMS Version (backup/gzip):

VMS Version (tar/compress):

The version number (currently, 005) will change as the programs are updated.

SynCron makes use of transaction listings, which have been made available by the EBI for the nucleotide database since January 1996. For each of its update files, the EBI supplies a listing that describes the update, insert and delete operations on the database represented in the flat-file updates. The core program of SynCron merges updates into the cumulative file following the instructions given in the transaction listings. Two additional utilities are provided to verify that the resulting cumulative file contains the correct entries. Currently, the accession number, entry name, version, division and datestamp of an entry are validated. In the future, we may include the NID of an entry (a unique identifier for the sequence), or a checksum for the sequence to prove its identity.
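The merge step described above can be illustrated with a minimal sketch in Python. This is not the SynCron implementation (which is written in C), and the layout of the EBI transaction listings is not described here: the (operation, accession) pairs, the dictionary representation of the data files and the function name apply_transactions are all assumptions made for illustration.

```python
def apply_transactions(cumulative, incremental, transactions):
    """Return a new cumulative mapping after applying one transaction listing.

    cumulative, incremental: dicts mapping accession number -> entry text
    (a stand-in for the flat files; hypothetical representation).
    transactions: iterable of (operation, accession) pairs, where operation
    is "insert", "update" or "delete" (assumed layout, not the EBI format).
    """
    merged = dict(cumulative)  # work on a copy; the input is left unchanged
    for operation, accession in transactions:
        if operation == "delete":
            # Remove the entry; a missing entry is silently ignored here.
            merged.pop(accession, None)
        elif operation in ("insert", "update"):
            # Take the new version of the entry from the incremental data.
            # A KeyError is raised if the listing names an entry that the
            # incremental data does not contain.
            merged[accession] = incremental[accession]
        else:
            raise ValueError("unknown operation: %r" % (operation,))
    return merged
```

Driving the merge from the listing rather than from the data file alone is what allows deletions to be propagated, since a deleted entry no longer appears in the incremental data.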


   #######      1        +--+
   #     #    ----->     |  |
   #######               +--+
cumulative         cumulative
      data               list
                              +-+           2          +---+
                           +  | |         ----->       |   | <.....
                              +-+                      +---+      :
                              incremental              merged     :
                              list (EBI)               list       :
                                                                  : I
                                                                  :
   #######     ###     +-+      3       ##########     +---+      :
   #     #  +  # #  +  | |    ----->    #        #  +  |   | <....:
   #######     ###     +-+              ##########     +---+      :
cumulative     incremental                  merged     merged     :
      data     data + list                    data     list       :
                                                                  : II
                                                                  :
                                                       +---+      :
                                                       |   | <....:
                                                       +---+
                                                       cumulative
                                                       list (EBI)

Figure Legend:

A cumulative list is derived (1) from the local cumulative data file and merged (2) with the new incremental list to produce a merged list that will serve to verify the actual data merging process. The new incremental data file is then merged into the cumulative data file according to the instructions given in the incremental list (3), generating the new cumulative data file, as well as a merged list file. The latter is compared to the merged list produced in the previous steps (1+2), to verify the contents of the merged data file (I). If daily update files are used, there is an additional, optional verification step: a comparison to the cumulative list file that is generated daily at the EBI (II).
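The list derivation (1) and the comparisons (I) and (II) can be sketched as follows. This is an illustrative Python sketch, not the SynCron utilities themselves: the record fields follow the attributes named in the text (accession number, entry name, version, division, datestamp), while the names ListRecord, derive_list and compare_lists and the dictionary form of an entry are hypothetical.

```python
from typing import NamedTuple


class ListRecord(NamedTuple):
    """One line of a list file (assumed fields, taken from the text)."""
    accession: str
    entry_name: str
    version: int
    division: str
    datestamp: str


def derive_list(entries):
    """Step 1: summarise each data entry as one list record.

    entries: iterable of dicts holding at least the five validated fields.
    """
    return {
        ListRecord(e["accession"], e["entry_name"], e["version"],
                   e["division"], e["datestamp"])
        for e in entries
    }


def compare_lists(expected, actual):
    """Checks I and II: report records that occur in only one of the lists."""
    return {
        "missing": sorted(expected - actual),      # expected but not present
        "unexpected": sorted(actual - expected),   # present but not expected
    }
```

An empty result in both categories indicates that the two lists agree on all validated fields; an entry whose version or datestamp differs shows up once as missing and once as unexpected.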


The programs are launched by a configurable script. The script keeps track of the incremental files that have already been merged into the cumulative file by writing their names to a logfile. If there is more than one new incremental file, the files are processed in sequential fashion. The script can easily be customized to exclude classes of data (e.g., EST, GSS, etc.) from the updates by preprocessing the transaction listings.
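The bookkeeping performed by the script can be sketched in Python as shown below. The actual scripts are written in csh and DCL; this sketch only illustrates the idea. It assumes that update file names sort in the order in which they were issued and that each listing line has the whitespace-separated form "operation accession division"; both assumptions, like the function names, are hypothetical.

```python
def pending_updates(available, logged):
    """Return the update files that have not yet been merged, in order.

    available: names of the incremental files found locally.
    logged: names already recorded in the logfile.
    Assumes that sorting by name yields the order of issue.
    """
    done = set(logged)
    return sorted(name for name in available if name not in done)


def drop_divisions(listing_lines, excluded):
    """Remove transactions for excluded data classes from a listing.

    Assumed line layout: "operation accession division".
    Lines with fewer than three fields are kept unchanged.
    """
    kept = []
    for line in listing_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[2] in excluded:
            continue  # skip entries of an excluded class, e.g. EST
        kept.append(line)
    return kept
```

Filtering the listing before the merge means that the excluded entries are never copied into the local cumulative file, so no separate clean-up pass is needed afterwards.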

The SynCron package also includes tools to assist file transfer. If you already have a working configuration for obtaining the update files, there is no need to change it, except that you will need to add a procedure that fetches the list files in addition to the data files.

List files are available. The naming scheme is the same as for incremental and cumulative update files with the extension ".lis". All customization for local file names and paths can be done in a general configuration file, as explained in the installation instructions for the package.

Discussion

Surveys of the EBI ftp server show that incremental update files are already used by some sites. We noticed, however, that the data available on public servers (such as the EMBnet SRS servers) often differ by a number of entries.

Using SynCron it should be possible to keep a copy of the EMBL Nucleotide Sequence Database that exactly matches the contents of the database in operation at the EBI for external services, with manual intervention required only in the event of network failure, etc. We hope, thereby, to help users to improve and guarantee the quality of the EMBL Nucleotide Sequence Database updates obtained by electronic transfer.

