INSECT and MOLLUSCS - supercomputing on the cheap

Victor Jongeneel, Thomas Junier, Christian Iseli, Kay Hofmann, and Philipp Bucher
ISREC Biocomputing Group and Swiss EMBnet node, 1066 Epalinges, Switzerland

The problem

Much of sequence analysis involves comparing a query sequence or pattern to a reference database. In general, the time required to complete such a search is directly proportional to the size of the database. With the relentless growth of the databases, and in spite of almost equally rapid progress in computer performance, exhaustive searches are becoming more and more time-consuming, to the point of making some promising analytical approaches impractical.

There are basically three ways to tackle the problem, which are not mutually exclusive:

  1. Develop more efficient algorithms, and fine-tune them to require as little computation as possible. This approach was taken by the developers of the BLAST and FASTA algorithms, and explains their popularity to a large extent. Unfortunately, there is always a price to pay in terms of the sensitivity of the comparison and the quality of the statistics. These limitations make heuristic algorithms unsuitable for exploring distant evolutionary relationships.
  2. Buy more powerful general-purpose computers (high-end servers, "supercomputers"). While this guarantees flexibility in the choice of algorithms, it is usually a very expensive proposition. Also, very few centres can afford to dedicate a high-end server to the tasks of database searching alone: usually, these jobs have to compete for CPU time with housekeeping tasks (e.g. reformatting databases), user interactive sessions, general-purpose sequence analysis (GCG), and sometimes unrelated activities such as molecular dynamics calculations from the Chemistry Department. Additional speedup can sometimes be achieved with a multiprocessor architecture, provided the program code supports threading. This is far from always the case. Also, multiprocessor machines are more complex, and thus more expensive, than single-processor models.
  3. Buy a dedicated sequence comparison processor (Bioccelerator, FDF), whose hardware has been designed for this purpose. This is an increasingly popular option, as the performance of these machines far outstrips that of a similarly priced workstation. A serious drawback, however, is that the machine architecture limits the flexibility of the algorithm. While this could in some cases be alleviated by reprogramming the microcode of the FPGA chips, there is no accelerator available at the present time that can accommodate a full complement of useful algorithms.

Our approach has been to develop an Inexpensive Networked SEquence Comparison Technology (INSECT) based on the "pile of PCs" concept. Standard desktop computers have been equipped with increasingly powerful processors, mostly to keep up with increasingly bloated software products and fancy user interfaces. Being mass-produced commodity items, their prices have steadily plunged. As the raw processing power of a cheap Intel (or compatible) or PowerPC processor differs by much less than an order of magnitude from that of top-of-the-line DEC, Sun or MIPS processors, one can naively assume that 10 cheap processors should handily outperform a single expensive one. Our experience amply proves this point.

Hardware

In principle, any cheap hardware from your favourite clone vendor will do. You can even recycle old machines that were deemed too slow by some local "power user", provided you balance database sizes according to CPU performance (see below). In our case, since we had a little money to spend, we bought the following hardware:

For a system with one master and 8 slaves, total cost was about CHF 10000 (about £4000 or $6500 at current rates). The main problem was to find space for the equipment, which landed on an old utility cart now decorated with spaghetti wiring. In a future incarnation, we may try an industrial rack with structured cabling, a central power supply, and so on.

The master machine (which also sports a CD-ROM drive to simplify software installation) received two network cards, one to communicate with our in-house network and the Net, and the other to talk to the slaves. Linux 2.0.30 was installed on all the machines (an easy duplication from the Master), with kernel patches to power down the processor when idle. The entire hard disk of each slave is exported to the master. The INSECT network is invisible to the outside, and has its own (fake) IP domain, "beehive.org".

Software

The basic concept is simple: each slave receives a portion of each searchable database on its local hard disk, proportional to its processor power (equal in our case). It accesses the executables and configuration files for the analysis software in a shared directory structure exported by the master. When it receives instructions to start a job, it performs a search on its chunk of the database and returns the results to the master. The master is responsible for scheduling the jobs, post-processing the results, and keeping order among the slaves.
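Because every slave only sees its own chunk, the master has to combine the partial hit lists into one ranking. The following Python fragment is a minimal sketch of that post-processing step, not the MOLLUSCS code (which is written in Perl). The `(score, identifier)` record layout and the name `merge_hits` are assumptions made for illustration; the article does not describe the format in which slaves return results.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple

# Hypothetical record layout: (score, sequence identifier).
Hit = Tuple[float, str]


def merge_hits(per_slave_hits: Iterable[List[Hit]], top_n: int = 10) -> List[Hit]:
    """Merge per-slave hit lists into one global ranking.

    Each input list must already be sorted by descending score,
    as a search program would typically report its hits.
    """
    merged = heapq.merge(*per_slave_hits, key=lambda hit: hit[0], reverse=True)
    # Keep only the best top_n hits of the combined stream.
    return list(islice(merged, top_n))
```

With three slaves returning `[(90.0, "a1"), (40.0, "a2")]`, `[(75.0, "b1"), (10.0, "b2")]` and an empty list, `merge_hits(..., top_n=3)` keeps the three best-scoring hits regardless of which slave found them.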

We decided against using PVM or some other sophisticated task scheduler, mostly because of stability concerns (our experience with PVM had been mixed at best). Instead we developed a MOdular Low-cost Linux-based Unified Sequence Comparison System (MOLLUSCS). The role of MOLLUSCS is to provide the user with a Unix command line as similar as possible to the one used on a traditional system and to handle the details of dispatching, process creation, data collection, post-processing and cleanup.
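The dispatch and collection steps can be pictured as a simple scatter/gather loop. The sketch below is written in Python purely for illustration (MOLLUSCS itself is a Perl script) and rests on several assumptions: that slaves are reachable through a remote shell such as `ssh`, and that the hypothetical helpers `run_remote` and `scatter_gather` stand in for whatever mechanism MOLLUSCS actually uses, which the article does not specify.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_remote(slave: str, command: List[str]) -> str:
    """Run `command` on `slave` and return its standard output.

    Assumption: the slave accepts remote shell logins via ssh.
    """
    result = subprocess.run(["ssh", slave, *command],
                            capture_output=True, text=True, check=True)
    return result.stdout


def scatter_gather(slaves: List[str], command: List[str],
                   runner: Callable[[str, List[str]], str] = run_remote
                   ) -> Dict[str, str]:
    """Start the same search on every slave in parallel and collect the outputs."""
    with ThreadPoolExecutor(max_workers=max(1, len(slaves))) as pool:
        # One thread per slave; each thread just waits for its remote job.
        futures = {slave: pool.submit(runner, slave, command) for slave in slaves}
        return {slave: future.result() for slave, future in futures.items()}
```

Passing the runner as a parameter keeps the transport replaceable, so the same loop can be tried out with a stub function before any real slave is involved.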

The current [05/08/97] version of MOLLUSCS consists of one core script (mollusc.pl), a Perl module for each of the biological programs (e.g., Pfscan.pm), and auxiliary modules. Features include:

Modules currently exist for ssearch3 (Smith-Waterman searches, W. Pearson), pfsearch and pfscan (generalised profile against sequence database and sequence against profile database, Ph. Bucher), and pattern_find (search database with extended regexp, K. Hofmann); an additional module will be developed for searchwise (framesearch with differential gap scores, E. Birney). We have also provided an easy way for MOLLUSCS to be invoked from CGI scripts, and have thus been able to incorporate the INSECT into our Web-based services.

An additional utility takes care of splitting the databases into chunks and distributing them to the slaves. The percentage of the database given to each slave can be specified, to account for possible differences in performance.
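One way to picture the splitting step is a greedy assignment that gives each slave roughly its requested share of the residues. This is a hedged Python sketch, not the actual utility: the function name `split_database`, the `(header, sequence)` entry layout and the choice to balance by residue count rather than by number of entries are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical entry layout: (header line, sequence).
Entry = Tuple[str, str]


def split_database(entries: List[Entry],
                   shares: Dict[str, float]) -> Dict[str, List[Entry]]:
    """Distribute entries so each slave holds roughly its share of the residues.

    `shares` maps a slave name to a positive weight (for example a percentage).
    """
    total_share = sum(shares.values())
    chunks: Dict[str, List[Entry]] = {slave: [] for slave in shares}
    loaded = {slave: 0 for slave in shares}  # residues assigned so far
    # Placing the longest entries first keeps the final imbalance small.
    for entry in sorted(entries, key=lambda e: len(e[1]), reverse=True):
        # Choose the slave whose load is lowest relative to its share.
        target = min(shares, key=lambda s: loaded[s] / (shares[s] / total_share))
        chunks[target].append(entry)
        loaded[target] += len(entry[1])
    return chunks
```

With equal shares, as in the configuration described here, this reduces to an even split; unequal shares would let a recycled, slower machine receive a smaller chunk.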

Performance

We have done some rough testing of the INSECT's performance. The following table gives some preliminary data on performance with the pfsearch (search the yeast protein database, 6141 entries, with a 53-aa profile) and ssearch3 (search the same database with a 72-aa peptide, using the Smith-Waterman algorithm) programs. Times are in seconds. More precise results may be obtained by using longer queries and larger databases.
 
Hardware (compile options)               pfsearch  ssearch3
Sun Sparc 20                                  786        29
HP 735/125 (+O4)                               92        21
DEC3000 (gcc -O4)                             164        25
Pentium 166 (Linux, gcc -O6 -pent)            210        18
Dual PentiumPro 200 (Solaris, gcc -O6)        112        22
SGI Origin2000 server                          36        10
Single slave                                  172        22
INSECT (8 slaves)                              22         8

 

It is already obvious from these preliminary data that the INSECT performs extremely well compared to other hardware we have available in Lausanne, including a brand-new SGI Origin2000 that cost almost 20 times more. We have observed another 2-fold increase in speed when we moved to the "full" configuration of one master and fifteen slaves.
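The figures for a single slave and for the 8-slave INSECT allow a quick estimate of parallel speedup and efficiency. The short Python calculation below uses only the times from the table above; the function names are introduced here for illustration.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Ratio of single-slave time to parallel time."""
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, n_slaves: int) -> float:
    """Speedup divided by the number of slaves (1.0 means perfect scaling)."""
    return speedup(t_single, t_parallel) / n_slaves


# Times in seconds, taken from the table: single slave versus INSECT (8 slaves).
pfsearch = (172, 22)
ssearch3 = (22, 8)

for name, (t1, t8) in (("pfsearch", pfsearch), ("ssearch3", ssearch3)):
    print(f"{name}: speedup {speedup(t1, t8):.2f}, "
          f"efficiency {efficiency(t1, t8, 8):.2f}")
```

For pfsearch the speedup is about 7.8 on 8 slaves, which is close to linear. For ssearch3 it is only 2.75, which may indicate that fixed start-up and collection costs dominate such a short search; this would be consistent with the remark that longer queries and larger databases should give more precise results.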

We have also run the INSECT through the test devised at the EBI (see http://industry.ebi.ac.uk/~thanaraj/seqassess/repedit3.html) to compare the Bioccelerator, the FDF and the MasPar sequence comparison accelerators. On the INSECT, we ran the ssearch3 program using the blosum62 scoring matrix, with default gap opening and extension penalties. We repeated the same search on the SGI Origin2000 (4 processors), using ssearch3 compiled either as a standard or as a threaded application. The results show that the INSECT with 15 slaves outperforms the MasPar for all but the longest queries. The search time for the INSECT increases roughly in proportion to the query length, whereas the dedicated machines become relatively more efficient as queries get longer. Nevertheless, it is clear that the price/performance ratio still strongly favours the INSECT (about CHF 15000 for the configuration tested here, as compared to CHF 60000 and up for the dedicated processors and about CHF 150000 for the SGI server).
 
Raw Search Time (seconds)

Query    Bioccelerator    FDF  MPsrch_pp  MPsrch_ppa  INSECT (15)  SGI (single)  SGI (threaded)
plasto            30.0   11.3      37.88       47.86        24.69         128.9           34.17
calmod            33.0   12.3      42.03       59.02        36.20        187.08           48.72
histone           38.0   14.3      46.17       71.20        45.72        239.39           57.92
riboS3            40.0   14.3      49.89       75.06        48.05        285.65           68.54
vmat              45.0   17.4      56.10       88.72        59.64        366.09           88.67
coat              52.0   20.6      65.61      107.17        76.63        480.09          119.88
amid              60.0   28.6      76.40      131.02       100.72        629.38          159.31
dnak              67.0   32.7      87.38      158.32       155.21        798.34          191.87
efg               76.0   36.3      95.40      174.16       167.49        907.34          240.62
ski               80.0   42.0      99.95      185.44       163.74        964.94          217.75
amdm              85.0   80.0     105.59      194.33       159.49       1031.35          243.44
phsg              92.0   79.9     114.03      212.71       191.71       1157.95          288.07
abl              141.0  118.0     181.77      386.38       426.81       2083.54          523.41
cin2             179.0  146.6     218.59      444.95       556.09        2549.9          646.71

NB: the queries are ranked by length. Details of the test can be found on the EBI Web site. The benchmarks on the SGI Origin2000 were not done in single-user (max. performance) mode.
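The price/performance argument can be made concrete with a rough cost-times-time product, where lower is better. The Python sketch below is only an illustration under stated assumptions: the metric itself is not from the article, and the CHF 60000 lower bound quoted for "the dedicated processors" is applied to the Bioccelerator purely as an example. The times are those of the shortest (plasto) and longest (cin2) queries in the table above.

```python
# Approximate costs in CHF as quoted in the text.
# Assumption: the "60000 and up" figure is used as a lower bound for the Bioccelerator.
COST_CHF = {"INSECT (15)": 15_000, "Bioccelerator": 60_000, "SGI (threaded)": 150_000}

# Raw search times in seconds for the shortest and longest queries.
TIMES = {
    "plasto": {"INSECT (15)": 24.69, "Bioccelerator": 30.0, "SGI (threaded)": 34.17},
    "cin2": {"INSECT (15)": 556.09, "Bioccelerator": 179.0, "SGI (threaded)": 646.71},
}


def cost_time_product(system: str, query: str) -> float:
    """Purchase cost multiplied by search time (lower is better)."""
    return COST_CHF[system] * TIMES[query][system]


def relative_to_insect(system: str, query: str) -> float:
    """How many times worse than the INSECT a system scores on this metric."""
    return cost_time_product(system, query) / cost_time_product("INSECT (15)", query)


for query in TIMES:
    for system in ("Bioccelerator", "SGI (threaded)"):
        print(f"{query:7s} {system:15s} {relative_to_insect(system, query):5.1f}x")
```

On this crude measure the Bioccelerator comes out roughly 4.9 times worse than the INSECT for the shortest query and about 1.3 times worse for the longest, so the advantage narrows with query length but does not disappear.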

Conclusions

In environments with limited financial resources, such as many of the EMBnet nodes or modestly endowed academic institutions, INSECT technology may provide an attractive alternative to expensive dedicated hardware. Scaling up is easy, in that additional units can be added at any time, but storage, wiring and maintenance may become problematic when too many units are connected. This may be circumvented (at a price...) by substituting industrial motherboards for the PCs, or by introducing structured wiring cabinets and autosensing KVM (keyboard, video, mouse) switches. Using PCs has the added bonus of allowing "recycling" of units between the biocomputing lab and regular office users.

Besides performance at an attractive price, the INSECTs also offer total flexibility in the choice of the most appropriate algorithm and parameters. This is far from the case with hardware-accelerated sequence comparison machines. In principle, any Unix-based sequence comparison program can be adapted to run in this environment by adding appropriate modules to the MOLLUSCS. For example, there is no hardware-accelerated version of the pfscan program, which scans a protein sequence against a database of profiles. We have recently developed a Web interface to pfscan and a PROSITE regexp scanner running on the INSECTs, with access to our own profile collection as well as to a reformatted version of Eddy & Sonnhammer's PfamA HMM collection. It can be found at http://ulrec3.unil.ch/software/PFSCAN_form.html. To our knowledge, this is the fastest available Web server for protein motif searches.


The code for the MOLLUSCS is available from Thomas Junier (Thomas.Junier@isrec.unil.ch).

Help for setting up INSECTs can be obtained from Christian Iseli (chris@cmpteam4.unil.ch).

