1 Area di Ricerca del CNR, Bari, Italy
2 Dipartimento di Biologia, Difesa e Biotecnologie Agro-Forestali, Universit
della Basilicata, Potenza, Italy
3 Centro di Studio sui Mitocondri e Metabolismo Energetico, CNR, Bari, Italy
4 Dipartimento di Biochimica e Biologia Molecolare, Universit di Bari, Italy
Correspondence to: Dr. Graziano Pesole, graziano@area.ba.cnr.it
Given that an unambiguous definition of redundancy is impracticable for biological sequence data, the program presented uses a quantitative description of redundancy based on a measure of sequence similarity. A sequence is considered redundant if it shows a degree of similarity and overlap with a longer sequence in the database greater than a threshold fixed by the user. Here we present a new algorithm, based on an "approximate string matching" procedure, which is able to determine the overall degree of similarity between each pair of sequences contained in a nucleotide sequence database without performing the time consuming task of pairwise global best alignments. The algorithm generates a new purified dataset from the redundant nucleotide sequence collection using cutoff parameters set by the user. The method is fast, sensitive and automatic.
The algorithm is structured in the following steps:
In this test application, 20 sequences have been retained out of the 362 instead of the expected single one corresponding to the complete human mtDNA genome.
Figure 1 shows a CLEANUP session on a dataset of 2400 sequences (5,523,925 nt) comprising all Drosophila entries in the Invertebrate division of Genbank collection (release 90). 400 out of the 2,400 sequences (420,707 nucleotides) were removed by CLEANUP as both their degree of sequence similarity and overlap exceeded the fixed threshold of 95%.
CLEANUP is an invaluable and efficient tool for all database users/developers to remove or classify redundant information.
Gouy, M., C. Gautier, M. Attimonelli, C. Lanave, and G. Di Paola. 1985. ACNUC - a portable retrieval system for nucleic acid sequence databases: logical and physical designs and usage. Comput. Applic. Biosci. 1: 167-172.
Grillo, G., M. Attimonelli, S. Liuni, and G. Pesole. 1996. CLEANUP: a fast computer program for removing redundancies from nucleotide sequence databases. Comput. Applic. Biosci. 12: 1-8.
% Cleanup -sim=95 -overlap=95 Cleanup generates a non redundant sequence data library from any set of sequences. Cleanup of what sequences ? in:dro* What should I call the output searching file (* in_dro.s *) ? What should I call the file listing erased sequences (* in_dro.e *) ? What should I call the file listing retained sequences (* in_dro.nr *) ?Table I. Output produced by CLEANUP on a sample of 362 sequences extracted from release 41 of the EMBL database (L1, L2 : length of primary (L1) and secondary (L2) sequence; P1, P2: starting position of the similarity region in the primary (P1) and secondary (P2) sequence; OS1, OS2: length of the overlapping segment in the primary (OS1) and secondary (OS2) sequence, (c) : similarity match found on the reverse / complementary strand of the secondary sequence).
Sequences = 2400 Nucleotides = 5523925 % Overlapping for cleaning = 95.00 % Similarity for cleaning = 95.00 Sequence matches = 424 Searching C.P.U. time = 158.84
Erased sequences = 400 Erased nucleotides = 420707 % Erased sequences = 16.67 % Erased nucleotides = 7.62
C L E A N U P - MAIN SEARCH FILE
Sequence 1 Sequence 2 L1 L2 P1 P2 OS1 OS2 %Ident. %Overl.
MIHSCG HSCOXII 16569 708 7584 1 709 708 99.86 100.00 . HSMTALT1 . 360 16024 1 360 360 100.00 100.00 . HSMTNA02 . 360 16024 1 360 360 98.33 100.00 . HSMTNA03 . 360 16024 1 360 360 98.06 100.00 . HSMTNA04 . 360 16024 1 360 360 98.89 100.00 . HSMTNA05 . 360 16024 1 360 360 98.33 100.00 . HSMTNA06 . 360 16024 1 360 360 98.06 100.00 . HSMTNA07 . 360 16024 1 360 360 97.78 100.00 . HSMTNA08 . 360 16024 1 360 360 98.61 100.00 . HSMTNA09 . 360 16024 1 360 360 98.89 100.00 . HUMMTALT0 . 360 16024 1 360 360 99.72 100.00 . MIHS01 . 608 324 1 610 608 99.51 100.00 (c) . MIHSAIA . 360 16024 1 360 360 98.89 100.00 . MIHSAIAA . 360 16024 1 360 360 99.44 100.00 . MIHSAIAB . 360 16024 1 360 360 99.17 100.00 . MIHSAIB . 360 16024 1 360 360 99.17 100.00 . MIHSAIC . 360 16024 1 360 360 99.44 100.00 . MIHSAID . 360 16024 1 360 360 99.17 100.00 . MIHSAIE . 360 16024 1 360 360 98.06 100.00 . MIHSAIF . 360 16024 1 360 360 98.61 100.00 . MIHSAIG . 360 16024 1 360 360 98.06 100.00 . MIHSAIH . 360 16024 1 360 360 98.33 100.00 . MIHSAII . 360 16024 1 360 360 98.33 100.00 . MIHSAIJ . 360 16024 1 360 360 98.33 100.00 . MIHSAIK . 360 16024 1 360 360 98.61 100.00 . MIHSAIL . 360 16024 1 360 360 98.89 100.00 . MIHSAIM . 360 16024 1 360 360 98.61 100.00 . MIHSAIN . 360 16024 1 360 360 98.33 100.00 . MIHSAIO . 360 16024 1 360 360 98.06 100.00 . MIHSAIP . 360 16024 1 360 360 98.89 100.00 . MIHSAIQ . 360 16024 1 360 360 98.89 100.00 . MIHSAIR . 360 16024 1 360 360 98.89 100.00 . MIHSAIS . 360 16024 1 360 360 98.61 100.00 . MIHSAIT . 360 16024 1 360 360 98.61 100.00 . MIHSAIU . 360 16024 1 360 360 99.17 100.00 . MIHSAIV . 360 16024 1 360 360 99.17 100.00 . MIHSAIW . 360 16024 1 360 360 98.89 100.00 . MIHSAIX . 360 16024 1 360 360 98.89 100.00 . MIHSAIY . 360 16024 1 318 318 98.43 88.33 . MIHSALT01 . 360 16024 1 360 360 98.61 100.00 . MIHSALT02 . 360 16024 1 360 360 98.89 100.00 . MIHSALT03 . 360 16024 1 360 360 99.17 100.00 . MIHSALT04 . 360 16024 1 360 360 98.61 100.00 . MIHSALT05 . 360 16024 1 360 360 98.61 100.00 . MIHSALT06 . 360 16024 1 360 360 99.17 100.00 . MIHSALT07 . 360 16024 1 360 360 98.89 100.00 . MIHSALT08 . 360 16024 1 262 262 98.47 72.78 . . . . 16287 264 97 97 98.97 26.94 . MIHSALT10 . 360 16024 1 360 360 98.61 100.00 ( ...... Continued .......)