The Smith and Waterman best local alignment algorithm is currently the most sensitive general-purpose method for database searching. The basic principle of this rigorous method is to compare each character of the query sequence with each character of a databank sequence. To detect remote similarity, gaps of any size may be inserted at any position. A scoring scheme is used to find the best alignment among all possible combinations. The handling of gaps varies between implementations. Some, like the one used in the Blitz E-mail Server at the EMBL in Heidelberg, use straightforward constant gap penalties. In MPSRCH, the gap insertion and gap extension penalties have different numerical values: once a gap is opened, extending it is penalised less heavily than the insertion itself.
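The scoring principle described above can be illustrated with a minimal sketch of local alignment with separate gap insertion and gap extension penalties. It is not the MPSRCH implementation: the function name and all score values are hypothetical, and a simple match/mismatch score stands in for the substitution matrix a real search would use. Setting gap_open equal to gap_extend gives the constant gap penalty case.

```python
def smith_waterman_affine(query, target, match=2, mismatch=-1,
                          gap_open=-3, gap_extend=-1):
    """Return the best local alignment score of two sequences.

    Illustrative sketch only. A gap of length k costs
    gap_open + (k - 1) * gap_extend (both values are negative).
    """
    neg_inf = float("-inf")
    n, m = len(query), len(target)
    # H[i][j]: best score of a local alignment ending at query[i-1], target[j-1]
    # E[i][j]: best score ending with a gap that consumes a target character
    # F[i][j]: best score ending with a gap that consumes a query character
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap or extend an existing one.
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if query[i - 1] == target[j - 1] else mismatch
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

With these assumed scores, smith_waterman_affine("ACDEFGHI", "ACDEWWFGHI") bridges the two-character insertion and returns 12 (eight matches minus one gap of length two), whereas the best ungapped segment scores only 8.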
The result of a rigorous database search must always be evaluated for its significance. As lab-bench biology is unknown to computers, statistical methods are employed to discriminate between findings that could have occurred by chance and those which are expected to occur only rarely. A 'probability' or a 'number of expected hits' helps the researcher to judge the importance of any finding. Parameters which affect the ranking of a hit (i.e. its position in the result list) include the length difference between the query sequence and the database sequence, as well as the specific handling of very short, but well-matched, motifs.
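To make the idea of a 'number of expected hits' concrete, the sketch below uses the widely known Karlin-Altschul form E = K * m * n * exp(-lambda * S). This is a swapped-in illustration and not the Collins and Coulson method used by MPSRCH. The function names are hypothetical, and the default values of lam and k are placeholders that would have to be estimated for a given scoring scheme.

```python
import math


def expected_hits(score, query_len, db_len, lam=0.267, k=0.041):
    """Expected number of chance hits scoring at least `score`.

    Karlin-Altschul form: E = k * m * n * exp(-lam * score).
    lam and k are placeholder values, not MPSRCH parameters.
    """
    return k * query_len * db_len * math.exp(-lam * score)


def hit_probability(e_value):
    """Probability of at least one chance hit, assuming hits are Poisson."""
    return -math.expm1(-e_value)  # equals 1 - exp(-e_value)
```

Under this model the expected number of chance hits grows linearly with database size and falls exponentially with score, which is why the same score is less significant in a larger databank.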
The core sequence comparison method used in the MPSRCH program suite is the Smith and Waterman algorithm, as modified by Gotoh[2], combined with a prediction of the expected number of results and a ranking function, developed by Collins and Coulson[3]. The package was developed by John F. Collins and Shane Sturrock at the Biocomputing Resource Unit at the University of Edinburgh, and is distributed by IntelliGenetics, Inc.
The MPSRCH clients deliberately offer only basic functionality, to make sequence searching as easy and as fast as possible for a biologist. The only parameter which can be modified, besides input and output file names, is the set of databases to be searched.
The following databases are supported:
Input     Database
peptide   1) SWISS-PROT + weekly updates
          2) PIR
          3) SWISS-PROT, updates, PIR, TREMBL non-redundant set
DNA       1) Daily EMBL updates
          2) EMBL database + EMBL updates
          3) non-redundant set, including GenBank
The service is freely accessible to all Internet nodes for testing
and occasional use. Unlimited access is available to Swiss academia
after registration.
URL for general information: http://www.ch.embnet.org/hassle.html
[2] Gotoh O.
An improved algorithm for matching biological sequences. Journal
of Molecular Biology, 162, 705-708, 1982.
[3] Collins J.F. and Coulson A.F.W.
Significance of protein sequence similarities. Methods in Enzymology,
183, 474-487, 1990.
[4] Redaschi N., Doelz R. and Eggenberger F.
Advanced Computer Network Communication: Hierarchical Access System
for Sequence Libraries in Europe. HASSLE v5. Verlag Dr. U. Doelz,
Basel, Switzerland. ISBN 3-905 434-01-6. Electronic documentation
available via http://www.ch.embnet.org/hassle.html