CLEANUP: a fast computer program for cleaning nucleotide sequence databasesof redundancies.

Graziano Pesole (1,2), Giorgio Grillo (3), Marcella Attimonelli (4) and Sabino Liuni (1,3)

1 Area di Ricerca del CNR, Bari, Italy
2 Dipartimento di Biologia, Difesa e Biotecnologie Agro-Forestali, Universit della Basilicata, Potenza, Italy
3 Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, Bari, Italy
4 Dipartimento di Biochimica e Biologia Molecolare, Universit di Bari, Italy

Correspondence to: Dr. Graziano Pesole, graziano@area.ba.cnr.it

Introduction

A key concept in comparing sequence collections is the issue of redundancy. The production of sequence collections cleaned from redundancy is useful both in performing statistical analyses and accelerating extensive database searches on nucleotide sequences. Publicly available databases contain multiple entries of identical or almost identical sequences. Performing statistical analysis on such biased data runs the risk of assignation of high significance to non significant patterns.

Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the program presented uses a quantitative description of redundancy based on a measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlap with a longer sequence in the database greater than a threshold fixed by the user. Here we present a new algorithm, based on an "approximate string matching" procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database without performing the time consuming task of pairwise global best alignments. The algorithm generates a new purified dataset from the redundant nucleotide sequence collection using cutoff parameters set by the user. The method is fast, sensitive and automatic.

Systems and methods

The algorithm presented here has been implemented in CLEANUP, a computer program written in the C language and will run on Unix and VMS operating systems. CLEANUP allows Pearson/FastA and GCG input formats for nucleotide sequences and automatically produces a new dataset cleaned from redundant sequences in both these formats. It is available free of charge by anonymous FTP

Algorithm

The searching process needs to perform pairwise comparisons of each sequence with all the others. We define the first sequence in the comparison as the "primary sequence" and all the others as "secondary sequences".

The algorithm is structured in the following steps:

  1. Sorting of the sequences according to decreasing length
  2. Numeric coding of the sequences and generation of the index structure
  3. Generation of the primary sequences data structure containing the pattern positions and compositions
  4. Similarity searching between primary and secondary sequences
  5. Creation of a sequence collection cleaned from redundancy according to user-set parameters
A detailed description of the algorithm is reported in Grillo et al. (1996).

Results and discussion

The CLEANUP algorithm has been tested on a dataset of 362 sequences accounting for all the human mitochondrial DNA sequences available in release 41 of the EMBL database. This dataset provides a very simple check of the cleaning accurateness of the algorithm as we can expect that after cleaning only a single sequence corresponding to the complete human genome is retained and all the others are erased. Table I reports part of the output produced by the CLEANUP application when run against the human mtDNA collection.

In this test application, 20 sequences have been retained out of the 362 instead of the expected single one corresponding to the complete human mtDNA genome.

Figure 1 shows a CLEANUP session on a dataset of 2400 sequences (5,523,925 nt) comprising all Drosophila entries in the Invertebrate division of Genbank collection (release 90). 400 out of the 2,400 sequences (420,707 nucleotides) were removed by CLEANUP as both their degree of sequence similarity and overlap exceeded the fixed threshold of 95%.

CLEANUP is an invaluable and efficient tool for all database users/developers to remove or classify redundant information.

Acknowledgements

We thank C. Saccone for helpful comments. This work was partially financed by MURST, Italy, Progetto Finalizzato Ingegneria Genetica, CNR, Italy and by the EMBnet Research and Development Project Committee.

Bibliography

Etzold, T., and P. Argos. 1993. SRS an indexing and retrieval tool for flat file data libraries. Comput. Appl. Biosci. 9: 49-57.

Gouy, M., C. Gautier, M. Attimonelli, C. Lanave, and G. Di Paola. 1985. ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Applic. Biosci. 1: 167-172.

Grillo, G., M. Attimonelli, S. Liuni, and G. Pesole. 1996. CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Comput. Applic. Biosci. 12: 1-8.

Figures and tables

Figure 1. Sample run of a CLEANUP session on a dataset of 2400 Drosophila entries. User input are boldface typed. CLEANUP generates a file listing all non-redundant sequence entry names which can be used as input to programs of the GCG package as well as with retrieval programs such as SRS (Etzold and Argos 1993) or ACNUC (Gouy et al. 1985). It optionally also generates a sequence data file in Pearson/FASTA format.
% Cleanup -sim=95 -overlap=95 

Cleanup generates a non redundant sequence data library from any set of 
sequences. 

Cleanup of what sequences ?  in:dro*

What should I call the output searching file (* in_dro.s *) ?
What should I call the file listing erased sequences (* in_dro.e *) ?
What should I call the file listing retained sequences (* in_dro.nr *) ?


Sequences = 2400 Nucleotides = 5523925 % Overlapping for cleaning = 95.00 % Similarity for cleaning = 95.00 Sequence matches = 424 Searching C.P.U. time = 158.84
Erased sequences = 400 Erased nucleotides = 420707 % Erased sequences = 16.67 % Erased nucleotides = 7.62
Table I. Output produced by CLEANUP on a sample of 362 sequences extracted from release 41 of the EMBL database (L1, L2 : length of primary (L1) and secondary (L2) sequence; P1, P2: starting position of the similarity region in the primary (P1) and secondary (P2) sequence; OS1, OS2: length of the overlapping segment in the primary (OS1) and secondary (OS2) sequence, (c) : similarity match found on the reverse / complementary strand of the secondary sequence).


C L E A N U P - MAIN SEARCH FILE
Sequence 1 Sequence 2 L1 L2 P1 P2 OS1 OS2 %Ident. %Overl.
MIHSCG HSCOXII 16569 708 7584 1 709 708 99.86 100.00 . HSMTALT1 . 360 16024 1 360 360 100.00 100.00 . HSMTNA02 . 360 16024 1 360 360 98.33 100.00 . HSMTNA03 . 360 16024 1 360 360 98.06 100.00 . HSMTNA04 . 360 16024 1 360 360 98.89 100.00 . HSMTNA05 . 360 16024 1 360 360 98.33 100.00 . HSMTNA06 . 360 16024 1 360 360 98.06 100.00 . HSMTNA07 . 360 16024 1 360 360 97.78 100.00 . HSMTNA08 . 360 16024 1 360 360 98.61 100.00 . HSMTNA09 . 360 16024 1 360 360 98.89 100.00 . HUMMTALT0 . 360 16024 1 360 360 99.72 100.00 . MIHS01 . 608 324 1 610 608 99.51 100.00 (c) . MIHSAIA . 360 16024 1 360 360 98.89 100.00 . MIHSAIAA . 360 16024 1 360 360 99.44 100.00 . MIHSAIAB . 360 16024 1 360 360 99.17 100.00 . MIHSAIB . 360 16024 1 360 360 99.17 100.00 . MIHSAIC . 360 16024 1 360 360 99.44 100.00 . MIHSAID . 360 16024 1 360 360 99.17 100.00 . MIHSAIE . 360 16024 1 360 360 98.06 100.00 . MIHSAIF . 360 16024 1 360 360 98.61 100.00 . MIHSAIG . 360 16024 1 360 360 98.06 100.00 . MIHSAIH . 360 16024 1 360 360 98.33 100.00 . MIHSAII . 360 16024 1 360 360 98.33 100.00 . MIHSAIJ . 360 16024 1 360 360 98.33 100.00 . MIHSAIK . 360 16024 1 360 360 98.61 100.00 . MIHSAIL . 360 16024 1 360 360 98.89 100.00 . MIHSAIM . 360 16024 1 360 360 98.61 100.00 . MIHSAIN . 360 16024 1 360 360 98.33 100.00 . MIHSAIO . 360 16024 1 360 360 98.06 100.00 . MIHSAIP . 360 16024 1 360 360 98.89 100.00 . MIHSAIQ . 360 16024 1 360 360 98.89 100.00 . MIHSAIR . 360 16024 1 360 360 98.89 100.00 . MIHSAIS . 360 16024 1 360 360 98.61 100.00 . MIHSAIT . 360 16024 1 360 360 98.61 100.00 . MIHSAIU . 360 16024 1 360 360 99.17 100.00 . MIHSAIV . 360 16024 1 360 360 99.17 100.00 . MIHSAIW . 360 16024 1 360 360 98.89 100.00 . MIHSAIX . 360 16024 1 360 360 98.89 100.00 . MIHSAIY . 360 16024 1 318 318 98.43 88.33 . MIHSALT01 . 360 16024 1 360 360 98.61 100.00 . MIHSALT02 . 360 16024 1 360 360 98.89 100.00 . MIHSALT03 . 360 16024 1 360 360 99.17 100.00 . MIHSALT04 . 360 16024 1 360 360 98.61 100.00 . MIHSALT05 . 360 16024 1 360 360 98.61 100.00 . MIHSALT06 . 360 16024 1 360 360 99.17 100.00 . MIHSALT07 . 360 16024 1 360 360 98.89 100.00 . MIHSALT08 . 360 16024 1 262 262 98.47 72.78 . . . . 16287 264 97 97 98.97 26.94 . MIHSALT10 . 360 16024 1 360 360 98.61 100.00 ( ...... Continued .......)

Go to:
previous article - next article - Table of contents