Much of sequence analysis involves comparing a query sequence or pattern to a reference database. In general, the time required to complete such a search is directly proportional to the size of the database. With the relentless growth of the databases, and in spite of almost equally rapid progress in computer performance, exhaustive searches are becoming more and more time-consuming, to the point of making some promising analytical approaches impractical.
There are basically three ways to tackle the problem, which are not mutually exclusive:
Our approach has been to develop an Inexpensive Networked SEquence Comparison Technology (INSECT) based on the "pile of PCs" concept. Standard desktop computers have been equipped with increasingly powerful processors, mostly to keep up with increasingly bloated software products and fancy user interfaces. Being mass-produced commodity items, their prices have steadily plunged. As the raw processing power of a cheap Intel (or compatible) or PowerPC processor differs by much less than an order of magnitude from top-of-the-line DEC, Sun or MIPS processors, one can naively assume that 10 cheap processors should handily outperform a single expensive one. Our experience amply proves this point.
In principle, any cheap hardware from your favourite clone vendor will do. You can even recycle old machines that were deemed too slow by some local "power user", provided you balance database sizes according to CPU performance (see below). In our case, since we had a little money to spend, we bought the following hardware:
For a system with one master and 8 slaves, total cost was about CHF 10000 (about £4000 or $6500 at current rates). The main problem was to find space for the equipment, which landed on an old utility cart now decorated with spaghetti wiring. In a future incarnation, we may try to use an industrial rack with structured cabling, central power supply, etc.
The master machine (which also sports a CD-ROM drive to simplify software installation) received two network cards, one to communicate with our in-house network and the Net, and the other to talk to the slaves. Linux 2.0.30 was installed on all the machines (an easy duplication from the master), with kernel patches to power down the processor when idle. The entire hard disk of each slave is exported to the master. The INSECT network is invisible to the outside, and has its own (fake) IP domain, "beehive.org".
The basic concept is simple: each slave receives a portion of each searchable database on its local hard disk, proportional to its processor power (equal in our case). It accesses the executables and configuration files for the analysis software in a shared directory structure exported by the master. When it receives instructions to start a job, it performs a search on its chunk of the database and returns the results to the master. The master is responsible for scheduling the jobs, post-processing the results, and keeping order among the slaves.
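The scatter/gather idea described above can be sketched in a few lines of Python. This is only an illustration, not the MOLLUSCS code: `search_chunk`, `scatter_gather` and the toy scoring rule are hypothetical names and assumptions, and local threads stand in for the networked slaves.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq

def search_chunk(query, chunk):
    """Hypothetical stand-in for a slave searching its local database chunk.

    A real slave would run a search program on its chunk; here the toy score
    is the number of distinct query letters that occur in each sequence.
    """
    return [(len(set(query) & set(seq)), name) for name, seq in chunk]

def scatter_gather(query, chunks, top_n=3):
    """Search every chunk in parallel, then merge the hits on the 'master'."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        per_chunk = list(pool.map(lambda c: search_chunk(query, c), chunks))
    all_hits = [hit for hits in per_chunk for hit in hits]
    # Post-processing step: keep only the best-scoring hits overall.
    return heapq.nlargest(top_n, all_hits)

# Two "slaves", each holding its own portion of a toy database.
chunks = [
    [("seqA", "MKVLA"), ("seqB", "GGGGG")],
    [("seqC", "MKV"), ("seqD", "LAKVM")],
]
best = scatter_gather("MKVLA", chunks, top_n=2)
print(best)
```

Because each chunk is searched independently, the only work left for the master in this sketch is merging and ranking the partial hit lists.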
We decided against using PVM or some other sophisticated task scheduler, mostly because of stability concerns (our experience with PVM had been mixed at best). Instead we developed a MOdular Low-cost Linux-based Unified Sequence Comparison System (MOLLUSCS). The role of MOLLUSCS is to provide the user with a Unix command line as similar as possible to the one used on a traditional system and to handle the details of dispatching, process creation, data collection, post-processing and cleanup.
The current [05/08/97] version of MOLLUSCS consists of one core script (mollusc.pl), a Perl module for each of the biological programs (e.g., Pfscan.pm), and auxiliary modules. Features include:
Modules currently exist for ssearch3 (Smith-Waterman searches, W. Pearson), pfsearch and pfscan (generalised profile against sequence database and sequence against profile database, Ph. Bucher), and pattern_find (search database with extended regexp, K. Hofmann); an additional module will be developed for searchwise (framesearch with differential gap scores, E. Birney). We have also provided an easy way for MOLLUSCS to be invoked from CGI scripts, and have thus been able to incorporate the INSECT into our Web-based services.
An additional utility takes care of splitting the databases into chunks and distributing them to the slaves. The percentage of the database given to each slave can be specified, to account for possible differences in performance.
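As a rough illustration of such a split, the following Python sketch divides a list of database entries into consecutive chunks according to per-slave shares. The function name `split_by_share` and the cumulative-rounding scheme are assumptions made for this example; the actual utility may work differently (for instance, by balancing on sequence length rather than entry count).

```python
def split_by_share(entries, shares):
    """Split `entries` into consecutive chunks sized in proportion to `shares`.

    `shares` holds one weight per slave (percentages or any positive numbers).
    """
    total = sum(shares)
    chunks, start, cumulative = [], 0, 0
    for i, share in enumerate(shares):
        cumulative += share
        if i == len(shares) - 1:
            end = len(entries)  # the last slave takes whatever remains
        else:
            # Round the cumulative boundary so rounding errors do not pile up.
            end = round(len(entries) * cumulative / total)
        chunks.append(entries[start:end])
        start = end
    return chunks

# Toy database with as many entries as the yeast protein set used in the tests.
entries = [f"entry{i}" for i in range(6141)]

equal_chunks = split_by_share(entries, [1] * 8)        # eight identical slaves
uneven_chunks = split_by_share(entries, [50, 25, 25])  # one faster machine
print([len(c) for c in equal_chunks])
print([len(c) for c in uneven_chunks])
```

Every entry ends up in exactly one chunk, and with equal shares the chunk sizes differ by at most one entry.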
We have done some rough testing of the INSECT's performance. The following
table gives some preliminary data on performance with the pfsearch (search
the yeast protein database, 6141 entries, with a 53-aa profile) and ssearch
(search the same database with a 72-aa peptide, using the Smith-Waterman
algorithm) programs. Times are in seconds. More precise results may be obtained
by using longer queries and larger databases.
| Hardware (compile options) | pfsearch (s) | ssearch3 (s) |
|---|---|---|
| Sun Sparc 20 | 786 | 29 |
| HP 735/125 (+O4) | 92 | 21 |
| DEC3000 (gcc -O4) | 164 | 25 |
| Pentium 166 (Linux, gcc -O6 -pent) | 210 | 18 |
| Dual PentiumPro 200 (Solaris, gcc -O6) | 112 | 22 |
| SGI Origin2000 server | 36 | 10 |
| Single slave | 172 | 22 |
| INSECT (8 slaves) | 22 | 8 |
It is already obvious from these preliminary data that the INSECT performs extremely well compared to other hardware we have available in Lausanne, including a brand-new SGI Origin2000 that cost almost 20 times more. We have observed another 2-fold increase in speed when we moved to the "full" configuration of one master and fifteen slaves.
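The scaling implied by the single-slave and 8-slave timings reported above can be checked with a short calculation. The sketch below uses the standard definitions of speedup (single-slave time divided by parallel time) and parallel efficiency (speedup divided by the number of slaves); the function names are ours, and only the four timings come from the measurements.

```python
def speedup(t_single, t_parallel):
    """How many times faster the parallel run is than a single slave."""
    return t_single / t_parallel

def efficiency(t_single, t_parallel, n_workers):
    """Fraction of the ideal n-fold speedup that is actually achieved."""
    return speedup(t_single, t_parallel) / n_workers

N_SLAVES = 8
# (single slave, INSECT with 8 slaves), in seconds, from the timings above.
timings = {"pfsearch": (172, 22), "ssearch3": (22, 8)}

report = {
    prog: (round(speedup(s, p), 2), round(efficiency(s, p, N_SLAVES), 2))
    for prog, (s, p) in timings.items()
}
for prog, (sp, eff) in report.items():
    print(f"{prog}: speedup {sp}x, efficiency {eff}")
```

On these figures pfsearch scales almost linearly (about 7.8-fold on 8 slaves), whereas the much shorter ssearch3 run gains only about 2.75-fold. One plausible reading, which the measurements alone cannot confirm, is that fixed start-up and collection costs dominate when the search itself takes only a few seconds.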
We have also run the INSECT through the test devised at the EBI (see
http://industry.ebi.ac.uk/~thanaraj/seqassess/repedit3.html ) to compare
the Bioccelerator, the FDF and the MasPar sequence comparison accelerators.
On the INSECT, we ran the ssearch3 program using the blosum62 scoring matrix,
and with default gap opening and extension penalties. We repeated the same
search on the SGI Origin2000 (4 processors), using ssearch3 compiled either
as a standard or as a threaded application. The results show that the INSECT
with 15 slaves outperforms the MasPar for all but the longest queries.
The search time for the INSECT increases roughly in proportion to the query
length, whereas the dedicated machines become relatively more efficient as
queries get longer. Nevertheless, it is clear that the price/performance ratio
still strongly favours the INSECT (cost of about CHF 15000 for the
configuration tested here, as compared to CHF 60000 and up for the dedicated
processors and about CHF 150000 for the SGI server).
Raw search time (seconds)

| Query | Bioccelerator | FDF | MPsrch_pp | MPsrch_ppa | INSECT (15) | SGI (single) | SGI (threaded) |
|---|---|---|---|---|---|---|---|
| plasto | 30.0 | 11.3 | 37.88 | 47.86 | 24.69 | 128.9 | 34.17 |
| calmod | 33.0 | 12.3 | 42.03 | 59.02 | 36.20 | 187.08 | 48.72 |
| histone | 38.0 | 14.3 | 46.17 | 71.20 | 45.72 | 239.39 | 57.92 |
| riboS3 | 40.0 | 14.3 | 49.89 | 75.06 | 48.05 | 285.65 | 68.54 |
| vmat | 45.0 | 17.4 | 56.10 | 88.72 | 59.64 | 366.09 | 88.67 |
| coat | 52.0 | 20.6 | 65.61 | 107.17 | 76.63 | 480.09 | 119.88 |
| amid | 60.0 | 28.6 | 76.40 | 131.02 | 100.72 | 629.38 | 159.31 |
| dnak | 67.0 | 32.7 | 87.38 | 158.32 | 155.21 | 798.34 | 191.87 |
| efg | 76.0 | 36.3 | 95.40 | 174.16 | 167.49 | 907.34 | 240.62 |
| ski | 80.0 | 42.0 | 99.95 | 185.44 | 163.74 | 964.94 | 217.75 |
| amdm | 85.0 | 80.0 | 105.59 | 194.33 | 159.49 | 1031.35 | 243.44 |
| phsg | 92.0 | 79.9 | 114.03 | 212.71 | 191.71 | 1157.95 | 288.07 |
| abl | 141.0 | 118.0 | 181.77 | 386.38 | 426.81 | 2083.54 | 523.41 |
| cin2 | 179.0 | 146.6 | 218.59 | 444.95 | 556.09 | 2549.9 | 646.71 |
NB: the queries are ranked by length. Details of the test can be found on the EBI Web site. The benchmarks on the SGI Origin2000 were not done in single-user (max. performance) mode.
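One simple way to put a number on the price/performance argument is to multiply hardware cost by search time, so that a lower product means better value. The Python sketch below does this for the shortest (plasto) and longest (cin2) queries. The metric itself is our assumption, not one used in the benchmark, and the dedicated accelerators are priced at the quoted lower bound of CHF 60000, which flatters them if their real price was higher.

```python
# Approximate hardware costs in CHF, taken from the figures quoted in the text.
# CHF 60000 is only a lower bound ("and up") for the dedicated machines.
COST_CHF = {
    "INSECT (15)": 15000,
    "Bioccelerator": 60000,
    "FDF": 60000,
    "SGI (threaded)": 150000,
}

# Raw search times in seconds for the shortest and longest benchmark queries.
TIME_S = {
    "plasto": {"INSECT (15)": 24.69, "Bioccelerator": 30.0,
               "FDF": 11.3, "SGI (threaded)": 34.17},
    "cin2": {"INSECT (15)": 556.09, "Bioccelerator": 179.0,
             "FDF": 146.6, "SGI (threaded)": 646.71},
}

def cost_time_product(system, query):
    """Cost multiplied by search time (CHF x seconds); lower is better."""
    return COST_CHF[system] * TIME_S[query][system]

def ranking(query):
    """Systems ordered from best to worst value for the given query."""
    return sorted(COST_CHF, key=lambda s: cost_time_product(s, query))

for query in TIME_S:
    print(query, ranking(query))
```

Under these assumptions the INSECT ranks first for both queries, although its margin over the FDF becomes small for the longest query.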
In environments with limited financial resources, such as many of the EMBnet nodes or modestly endowed academic institutions, INSECT technology may provide an attractive alternative to expensive dedicated hardware. Scaling up is easy, in that additional units can be added at any time, but storage, wiring and maintenance may become problematic when too many units are connected. This may be circumvented (at a price...) by substituting industrial motherboards for the PCs, or by introducing structured wiring cabinets and autosensing KVM (keyboard, video, mouse) switches. Using PCs has the added bonus of allowing "recycling" of units between the biocomputing lab and regular office users.
Besides performance at an attractive price, the INSECTs also offer total
flexibility in the choice of the most appropriate algorithm and parameters.
This is far from the case with hardware-accelerated sequence comparison
machines. In principle, any Unix-based sequence comparison program can be
adapted to run in this environment, by adding appropriate modules to the
MOLLUSCS. For example, there is no hardware-accelerated version of the
pfscan program, which scans a protein sequence against a database
of profiles. We have recently developed a Web interface to pfscan
and a PROSITE regexp scanner running on the INSECTs, with access to our own
profile collection as well as to a reformatted version of Eddy & Sonnhammer's
PfamA HMM collection. It can be found at
http://ulrec3.unil.ch/software/PFSCAN_form.html. To our knowledge, this
is the fastest available Web server for protein motif searches.
The code for the MOLLUSCS is available from Thomas Junier (Thomas.Junier@isrec.unil.ch).
Help for setting up INSECTs can be obtained from Christian Iseli (chris@cmpteam4.unil.ch).