MPSRCH - a New Method to Access Rapid Sequence Similarity Searching on Massively Parallel Computers using the HASSLE Protocol

R.Doelz, Swiss EMBnet node

The Searching Method

Sequence searching applications need to balance speed against sensitivity. As described in an earlier issue of embnet.news [1], the 'exhaustive' searching methods are optimised to detect even low similarity in large data collections. Most database searching programs are based on local alignment algorithms, which aim to find a reliable target hit: the region of the query sequence that best matches a database sequence.

The Smith and Waterman best local alignment algorithm is currently the most sensitive general-purpose method for database searching. The basic principle of this rigorous method is to compare each character of the query sequence with each character of a databank sequence. To detect remote similarity, gaps of any size may be inserted at each position. A sophisticated scoring scheme is employed to find the best alignment out of all possible combinations. The handling of gaps can vary. Some implementations, like the one applied in the Blitz E-mail Server at the EMBL in Heidelberg, use straightforward constant gap penalties. In the implementation of MPSRCH, the gap insertion and gap extension penalties have different numerical values: once a gap is opened, an extension is less heavily penalised than the gap insertion itself.
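The difference between opening and extending a gap can be illustrated with a minimal sketch of a local alignment score with affine gap penalties. This is not the MPSRCH code: the function name, the simple match/mismatch scoring and all penalty values are assumptions chosen for illustration only, and a real protein search would use a substitution matrix instead.

```python
def smith_waterman_affine(query, target, match=2, mismatch=-1,
                          gap_open=-3, gap_extend=-1):
    """Return the best local alignment score with affine gap penalties.

    Illustrative values only. A gap of length k costs
    gap_open + (k - 1) * gap_extend, so extending is cheaper than opening.
    """
    neg_inf = float("-inf")
    m = len(target)
    h_prev = [0] * (m + 1)        # best scores for the previous query row
    f_prev = [neg_inf] * (m + 1)  # previous row, alignments ending in a gap in the target
    best = 0
    for i in range(1, len(query) + 1):
        h_cur = [0] * (m + 1)
        f_cur = [neg_inf] * (m + 1)
        e = neg_inf               # current row, alignments ending in a gap in the query
        for j in range(1, m + 1):
            # Either open a new gap or extend the one that is already open.
            e = max(h_cur[j - 1] + gap_open, e + gap_extend)
            f_cur[j] = max(h_prev[j] + gap_open, f_prev[j] + gap_extend)
            s = match if query[i - 1] == target[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
            best = max(best, h_cur[j])
        h_prev, f_prev = h_cur, f_cur
    return best
```

With these illustrative values, smith_waterman_affine("ACGTACGT", "ACGTTTACGT") scores 12: eight matches (16) minus one gap of length two (3 for the opening, 1 for the extension). A constant penalty of 3 per gap position would give only 10 for the same alignment.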

The result of a rigorous database search must always be evaluated for its significance. As lab-bench biology is unknown to computers, statistical methods are employed to discriminate between findings that could have occurred by chance and those which are expected to occur only rarely. A 'probability' or a 'number of expected hits' helps the researcher to judge the importance of any finding. Parameters which affect the ranking of a hit (i.e. its position in the result list) include the length difference between the query sequence and the database sequence, as well as the specific handling of very short, but well-matched, motifs.
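The idea of separating chance findings from rare ones can be sketched with a generic shuffle test: the observed score is compared with the scores obtained against randomly shuffled copies of the database sequence. This is explicitly not the statistic used by MPSRCH; the function names, the simple ungapped scorer and the number of trials are assumptions made for the sake of a short, self-contained example.

```python
import random
import statistics


def local_score(a, b, match=2, mismatch=-1):
    """Best ungapped local alignment score (a deliberately simple stand-in scorer)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0, prev[j - 1] + s)
            best = max(best, cur[j])
        prev = cur
    return best


def shuffle_significance(query, target, trials=200, seed=0):
    """Compare the observed score with scores against shuffled targets.

    Returns (observed score, z-score, fraction of shuffles scoring at least
    as high as the observed score). Shuffling keeps the composition and
    length of the target but destroys its order.
    """
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    observed = local_score(query, target)
    chars = list(target)
    null_scores = []
    for _ in range(trials):
        rng.shuffle(chars)
        null_scores.append(local_score(query, "".join(chars)))
    mean = statistics.mean(null_scores)
    sd = statistics.pstdev(null_scores)
    z = (observed - mean) / sd if sd > 0 else 0.0
    fraction = sum(s >= observed for s in null_scores) / trials
    return observed, z, fraction
```

A high z-score together with a small fraction suggests that the hit is unlikely to be a chance finding. Production programs use analytical estimates rather than repeated shuffling, which would be far too slow for a full database.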

The core sequence comparison method used in the MPSRCH program suite is the Smith and Waterman algorithm, as modified by Gotoh [2], combined with a prediction of an expected number of results and a ranking function developed by Collins and Coulson [3]. The package was developed by John F. Collins and Shane Sturrock at the Biocomputing Research Unit at the University of Edinburgh, and is distributed by IntelliGenetics, Inc.

Data for MPSRCH Sequence Searching at EMBnet Switzerland

Both protein and DNA sequence searches are possible with the MPSRCH program suite. Protein searches are, however, recommended, because the results are obtained within a few minutes, whereas DNA searches may take hours to complete. EMBnet Switzerland obtains the EMBL, SWISS-PROT and PIR databases from the original database providers. Updates are achieved with HASSLE-based services, which were developed specifically for this purpose. Data updates are synchronised with the originals as closely as possible, and the daily update status is subjected to a quality control system. EMBnet Switzerland also produces TREMBL, a database of all translated reading frames from the EMBL data library. TREMBL is created with the SRS software package (T.Etzold, EMBL). All available protein databases are merged using the 'nrdb' program from the NCBI. After processing, the MasPar currently searches a total of more than 150,000 protein sequences in a few seconds, even for long protein query sequences.
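The principle of merging several databases into a non-redundant set can be sketched as follows. This is only an illustration of the idea of collapsing identical sequences; it is not the 'nrdb' program, and the function name, the record layout and the identifiers in the usage note are hypothetical.

```python
def merge_non_redundant(*databases):
    """Merge (identifier, sequence) records, keeping one entry per distinct sequence.

    Each database is an iterable of (identifier, sequence) pairs. Entries
    with identical sequences (ignoring case) are collapsed into one record
    that lists all contributing identifiers.
    """
    merged = {}  # sequence -> identifiers, in order of first appearance
    for records in databases:
        for identifier, sequence in records:
            merged.setdefault(sequence.upper(), []).append(identifier)
    return [(identifiers, sequence) for sequence, identifiers in merged.items()]
```

For example, merging [("SP:A", "MKT"), ("SP:B", "GGG")] with [("PIR:X", "mkt")] yields two records: "MKT" carrying the identifiers SP:A and PIR:X, and "GGG" carrying SP:B.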

Accessibility

A subset of the MPSRCH software suite is now available on the powerful 4096-processor MasPar system of the University of Basel. Integration into the widely used GCG software package has been achieved with GCG-like text and WPI-type interfaces, utilising the GCG procedure library interface. The transfer of query data and the execution of commands operate via the HASSLE protocol [4], which is built into the MPSRCH clients and the server at the MasPar front end.

The MPSRCH clients deliberately offer only very basic functionality, to make sequence searching as easy and as fast as possible for a biologist. The only parameter which can be modified, besides the input and output file names, is the set of databases to be searched.

The following databases are supported:

       Input      Database
       peptide    1) SWISS-PROT + weekly updates
                  2) PIR
                  3) SWISS-PROT, updates, PIR, TREMBL non-redundant set
       DNA        1) Daily EMBL updates
                  2) EMBL database + EMBL updates
                  3) non-redundant set, including GenBank

The service is freely accessible from all Internet nodes for testing and occasional use. Unlimited access is available to Swiss academia after registration.

URL for general information: http://www.ch.embnet.org/hassle.html

Software references

MPSRCH is Release 2.1D, Copyright (c) 1993, 1994, 1995, John F. Collins, Biocomputing Research Unit, University of Edinburgh, U.K.; Distribution rights by IntelliGenetics, Inc.
A GCG package (Genetics Computer Group, Inc.) is required locally if the GCG-like interfaces are to be used.
HASSLE is Version 5, (C) BioComputing Basel, 1992-1995.

Literature references

[1] Bottu G.
Fundamentals of Database Similarity searching methods, embnet.news 1(1), 1994.

[2] Gotoh O.
An improved algorithm for matching biological sequences. Journal of Molecular Biology, 162, 705-708, 1982.

[3] Collins J.F. and Coulson A.F.W.
Significance of protein sequence similarities. Methods in Enzymology, 183, 474-487, 1990.

[4] Redaschi N., Doelz R. and Eggenberger F.
Advanced Computer Network Communication: Hierarchical Access System for Sequence Libraries in Europe. HASSLE v5. Verlag Dr.U.Doelz, Basel, Switzerland. ISBN 3-905 434-01-6. Electronic documentation available via http://www.ch.embnet.org/hassle.html

