Oftentimes, all thoughts of algorithms are left behind with school. And if they’re not, it can be too easy to get caught up in doing something the “most efficient” way, spending two weeks of coding to save a few milliseconds. Still, there is a lot to be said for knowing and studying algorithms.
I’ve always enjoyed opportunities to study algorithms, but haven’t had many chances to do so in recent years. However, the work I’ve been doing recently on the Raspberry Pi has brought me back to algorithms (Yay!). The main limitation of the Pi is I/O speed; the next is memory. A lot has been said about limitations inspiring creativity, and these limitations have brought me back to exploring algorithms. It’s been fun!
This is the first in a series of posts examining algorithms related to searching and set intersections (and differences), leading up to a non-trivial, non-naive implementation of substructure matching of chemical compounds on a cluster of Raspberry Pi computers.
First we need to frame the problem.
The previous, naive implementation used a fairly simple approach: grep the collection of substances, stored as SMILES™ (Simplified Molecular Input Line Entry System) strings, a syntax for describing a compound as text, for the structure that was fed in. This is naive in that there is potentially more than one way to specify a structure using SMILES™, and this initial solution would only find one representation of the structure.
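For illustration only, the naive approach amounts to something like the following Python sketch; the tab-separated file layout and the names here are assumptions, not the actual format:

```python
# Toy equivalent of the naive grep-based search: scan a file of
# "substance_id<TAB>SMILES" lines for an exact substring match.
# The file layout is assumed purely for illustration.

def naive_search(path, query_smiles):
    hits = []
    with open(path) as f:
        for line in f:
            sid, smiles = line.rstrip("\n").split("\t", 1)
            if query_smiles in smiles:
                hits.append(sid)
    return hits

# e.g. naive_search("substances.smi", "C1CCCCC1=O")
```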
The task was spread across five Pi 2 Bs, each of them running Docker instances which used xargs to spread the processing across the cores. A search for something like ‘C1CCCCC1=O’ (a cyclohexane ring with a double-bonded oxygen, i.e. cyclohexanone) would take ~1 minute the first time to discover and report ~2500 hits. The second time it would take about 20 seconds, and subsequent searches approximately 5 seconds, as the Linux I/O cache became populated. Still, it’s pretty impressive: finding matches among ~69 million records in 5 seconds.
However, leaving off the ‘=’ drastically increased the size of the results to ~17K hits. Here the I/O limitations of the Raspberry Pi came into play: the results were reported to a single host over its 100 Mb/s link. In previous tests I had determined that a single link could move ~7 MB/s, and having multiple hosts responding impacts this. I did find that by using bonded network interfaces I could obtain rates of approximately 13 MB/s, but no more, because of the shared USB 2 hub: the USB 2 bus allows a maximum of 480 Mb/s, and the best performance I was able to obtain from any single device was approximately 300 Mb/s, from an SSD.
One potential improvement lies in memoization: once a solution set has been found for a particular substructure search, cache the result rather than re-calculating it.
Memoization will be used in the new, less naive implementation. Initial performance will be slower than that of the naive implementation; however, it will speed up over time as the cache fills.
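As a rough illustration of the idea (not the actual implementation), here is a minimal Python sketch in which results are memoized keyed by the query’s fingerprint bits; the function names and dummy result are hypothetical:

```python
# Minimal sketch of memoizing substructure search results, keyed by the
# query's fingerprint bits. `expensive_substructure_search` is a hypothetical
# stand-in for the real work; a real deployment would use a cache shared
# across the cluster rather than a per-process lru_cache.

from functools import lru_cache

def expensive_substructure_search(bits):
    print("cache miss: computing...")   # only runs the first time
    return frozenset({7, 42})           # dummy result for illustration

@lru_cache(maxsize=None)
def cached_search(fingerprint_bits: frozenset) -> frozenset:
    return expensive_substructure_search(fingerprint_bits)

print(cached_search(frozenset({1, 4, 9})))   # computed
print(cached_search(frozenset({1, 4, 9})))   # served from the cache
```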
The PubChem database consists of roughly 69 million substances. One version of the data is distributed as SDF files. For each substance there are a number of pieces of information, such as the formula, representations of the substance and its structure, angles between the various atoms, etc. Of interest for substructure matching is the fingerprint field. The fingerprint consists of 880 bits, each of which describes a property of the substance. For instance, there are bits which specify whether or not a particular element is part of the substance, or whether the substance has more than ‘N’ carbon atoms. In surveying a sample of the substances, the average number of bits set per substance was on the order of 160.
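For a concrete picture, here is a small Python sketch that enumerates the set bits of a fingerprint, assuming it has already been decoded from the SDF field into raw bytes with the most significant bit of each byte first; those layout details are assumptions for illustration:

```python
def set_bits(fingerprint: bytes, n_bits: int = 880) -> set[int]:
    """Return the positions of the bits that are set in a fingerprint.

    Assumes the fingerprint has already been decoded into raw bytes,
    with the most significant bit of each byte coming first.
    """
    on = set()
    for pos in range(n_bits):
        byte, offset = divmod(pos, 8)
        if fingerprint[byte] & (0x80 >> offset):
            on.add(pos)
    return on

# Example: a toy 2-byte fingerprint with bits 0, 3, and 9 set.
print(sorted(set_bits(bytes([0b10010000, 0b01000000]), n_bits=16)))  # [0, 3, 9]
```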
The CDK (Chemistry Development Kit) is an open-source library of tools for cheminformatics. One of the things it can do is convert a SMILES™ or other representation of a structure into the fingerprint used by PubChem, which can then be used to find matches.
The new approach uses adaptive set intersections and differences to find matches. The basic idea is that we process the search fingerprint, concerned only with the bits which are turned on; for each of those bits we have a set of substances which also have that bit set. We can then say that the intersection of all of those sets matches the pattern.
Differences are used where the number of substances which have a particular bit turned on in their fingerprint is greater than half of all substances. An example is the bit specifying that the substance contains carbon: there are far more substances in the database with carbon than without. As an optimization, the difference is used in this case; if we are looking for a substance with carbon, we remove those which have none from consideration.
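A minimal Python sketch of that idea, assuming each bit’s set is stored either directly (“include”) or as its complement (“exclude”) when more than half the database has the bit; the data layout and names are illustrative, not the real format:

```python
# Sketch of intersecting per-bit sets, using complements for common bits.

ALL_IDS = set(range(1, 11))          # stand-in for the full substance ID space

POSTINGS = {
    # bit -> ("include", ids WITH the bit) or ("exclude", ids WITHOUT the bit)
    12: ("include", {2, 3, 5, 7}),
    40: ("exclude", {4, 9}),         # e.g. "contains carbon": most have it
    51: ("include", {2, 3, 5, 8}),
}

def candidates(query_bits):
    result = set(ALL_IDS)
    for bit in query_bits:
        kind, ids = POSTINGS[bit]
        if kind == "include":
            result &= ids            # intersect with the set of holders
        else:
            result -= ids            # difference: drop the few that lack the bit
    return result

print(candidates([12, 40, 51]))      # {2, 3, 5}
```

In practice the running result would not start from the full ID space; starting from the smallest “include” set and pairing small sets with larger ones (an optimization described below) keeps the intermediate results small.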
In order to prepare for the search, a total of 880 sets is first created; the SDF Toolkit is useful for this. Each set contains a list of the substance IDs (as 4-byte unsigned integers) whose fingerprint has the corresponding bit set. If the number of substances with the bit set is greater than half the total number of substances, the list of substances without the bit set is calculated instead. The size of each set is stored for later use, both keyed by bit position (as a hash key) and as an array sorted by size.
These sets are then chunked into ranges; the idea is that the work will be spread across a number of Raspberry Pi computers, with the goal that each chunk takes less than a second per set intersection.
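A sketch of what this pre-processing might look like in Python (chunking into ranges omitted); the structures and names are illustrative, not the actual on-disk format:

```python
# Build one set of substance IDs per fingerprint bit, flip to the complement
# when a bit is too common, and pack each set as sorted 4-byte unsigned ints.

import struct

N_BITS = 880

def build_postings(substances):
    """substances: iterable of (substance_id, set_of_on_bits) pairs."""
    per_bit = [set() for _ in range(N_BITS)]
    all_ids = set()
    for sid, bits in substances:
        all_ids.add(sid)
        for b in bits:
            per_bit[b].add(sid)

    postings = {}
    for bit, ids in enumerate(per_bit):
        if len(ids) > len(all_ids) / 2:
            postings[bit] = ("exclude", sorted(all_ids - ids))
        else:
            postings[bit] = ("include", sorted(ids))
    return postings

def serialize(ids):
    """Pack a sorted list of IDs as little-endian 4-byte unsigned integers."""
    return struct.pack(f"<{len(ids)}I", *ids)
```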
Once the pre-processing is complete, the data is loaded into a distributed data store, such as SeaweedFS, to be replicated and made available across the cluster.
How requests are handled
At a high level, the implementation can be thought of as a series of map and reduce tasks; however, all of the mapping was performed in pre-processing, leaving only the reduce tasks. Hadoop was not considered a good fit: while it is possible to run it on the Pi, it doesn’t run well. Also, HDFS works best with large chunks of data, and the pre-generated sets range from 0 bytes to approximately 10 MB in size.
- A request comes into the system to do a substructure search for a structure. If the fingerprint is not already cached, it is calculated.
- Work is pushed onto a job queue such as beanstalkd. Bits which have small sets are paired with bits having larger sets; this is one optimization. Additionally, a task can be broken into smaller chunks in order to stay within I/O limits. A counter is kept of the number of pieces of work to be performed.
- Workers pick up the tasks and, if the result is not already cached, compute the intersection or difference. The result is cached and a results queue is notified that the piece is complete.
- The worker which monitors the results queue pops two results off of it, decrements the counter, and pushes a task onto the job queue to calculate the intersection of those two results (incrementing the counter). A sketch of this reduce loop follows the list.
- The previous two steps repeat until the counter shows only one piece of work left; when it completes, we are done.
- The resulting set is returned to the requestor.
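Here is a toy, single-process sketch of the reduce loop described above, using Python’s queue module as a stand-in for beanstalkd; in the real system the new intersection task would go back onto the job queue and its result would be memoized by a worker:

```python
# Toy single-process version of the reduce loop: pair up partial results and
# queue new intersection work until a single result remains.

import queue

def reduce_results(initial_sets):
    results = queue.Queue()
    for s in initial_sets:
        results.put(s)
    outstanding = len(initial_sets)      # the "counter" from the list above

    while outstanding > 1:
        a = results.get()
        b = results.get()
        outstanding -= 2
        # In the real system this intersection would be a new job-queue task
        # handled (and cached) by a worker; here we do it inline.
        results.put(a & b)
        outstanding += 1

    return results.get()

print(reduce_results([{1, 2, 3, 5}, {2, 3, 5, 8}, {2, 5, 9}]))   # {2, 5}
```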
Memoization will reduce CPU and I/O (as well as the number of tasks!) over time. Common requests will eventually execute in O(1).
The next entry in the series will talk about adaptive algorithms and how they help calculate the intersections and differences in less than O(n) time. The algorithm should approach O(p log n) in the worst case, where ‘p’ is the size of the smaller set and ‘n’ is the size of the larger; on average it should do far better. Other algorithms may approach O(n+p).
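As a preview of why a bound like O(p log n) is plausible, here is a sketch of one well-known approach: binary-searching each element of the smaller sorted set within the larger one. This is not necessarily the exact algorithm the next post will cover:

```python
# Rough sketch: intersect two sorted ID lists in about O(p log n) by
# binary-searching each element of the smaller list in the larger one.

from bisect import bisect_left

def intersect_sorted(small, large):
    out = []
    for x in small:
        i = bisect_left(large, x)
        if i < len(large) and large[i] == x:
            out.append(x)
    return out

print(intersect_sorted([3, 7, 42], list(range(0, 100, 7))))   # [7, 42]
```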
I’m also considering exploring the use of machine learning to build a neural net of fingerprints against which substructures are compared. I’m not sure how well it would work; there would have to be some way of determining match thresholds to eliminate false hits. Another avenue of exploration is the use of Kalman filters to predict the number of workers needed, based on thresholds for how long a calculation should take. More on that later….