Apr 18

Naive Substructure Substance Matching on the Raspberry Pi

Chemists can search databases using parts of structures, parts of their IUPAC names as well as based on constraints on properties. Chemical databases are particularly different from other general purpose databases in their support for sub-structure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of Graph theory. The algorithms for searching are computationally intensive, often of O (n3) or O (n4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom-searching (ABAS), in which a mapping of the search substructure atoms and bonds with the target molecule is sought. ABAS searching usually makes use of the Ullman algorithm or variations of it (i.e. SMSD ). Speedups are achieved by time amortization, that is, some of the time on search tasks are saved by using precomputed information. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug-discovery). The bit-strings used for these applications are also called structural-keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash-codes based on fragments derived computationally. These are called ‘fingerprints’ although the term is sometimes used synonymously with structural-keys. The amount of memory needed to store these structural-keys and fingerprints can be reduced by ‘folding’, which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length. — Chemical database

Substructure substance matching is, in many ways, a non-trivial exercise in Cheminformatics. The amount of data used to determine matches grows very quickly. For instance, one method of describing a molecule’s “fingerprint” uses 880 bytes. Or 2^880 combinations. This space is very sparsely populated, but there are still many potential combinations.

Another way of describing the structure of a molecule is Simplified molecular-input line-entry system or SMILES. This method uses a string which describes the structure of a molecule. Hydrogen atoms are generally stripped from the structure, so the SMILES representation for water is ‘O’. Likewise, methane is ‘C’. Single bonds are assumed. Double bonds are described by ‘=’, so carbon dioxide is ‘O=C=O’.

As it turns out, grep happens to work very well to find substructure matches of SMILE data. The following searches are performed on a subset of the NIH PubChem Compound database, 13689519 compounds in total. The original data has been processed on a Raspberry Pi — compressed, this portion of the database is ~13GB. Pulling out the SMILES representation and the compound ID, the resultant flat data is 842M in 733 files.

The 842M happens to fit into the ram of the Pi. After a few searches, the files are buffered in RAM. At that point, the speed increases mightily. The limit for reads of a MicroSD card is ~15M/s. Once cached in RAM, however, it is able to read >400M/s:

Following is a series of searching demonstrating how the search speeds up as the data is read into cache.

Once the files are buffered in memory, the greps occur in close to constant time for reasonable searches sorted by the compound ID — the previous search matched 123 compounds; by comparison follows a search for a ring structure:

However, a ridiculous search for substances containing carbon does take a bit longer — there are limits to IO. This search matches almost all of the substances:

How, then, is the Pi processing so much data so quickly? Part of the secret lies in splitting the data into “reasonable” chunks of ~55MB. The other secret is in how xargs is invoked. Not all versions of xargs support multiple concurrent processes. The -P 4 says to run four instances of grep concurrently.

Notice that the improvement on the time required is not linear; there is not much difference in time between three (3) and four (4) concurrent threads. The limit of IO has been reached.

With five Pi 2 boards, substructure searches of all 68279512 compounds can be performed in seconds.

It’s not perfect, some structures can be described in more than one way with SMILES. However, it’s fast and simple.

The next substructure search will utilize fingerprints.

3 pings

  1. Docker Containers: Smaller is not always better » Ramblings

    […] « Naive Substructure Substance Matching on the Raspberry Pi […]

  2. Fast molecular data processing with the Raspberry Pi | Raspberry Pi Pod

    […] If you’re still with me, in his article he explains how the Pi does pattern matching with grep and how it speeds up when reading from the cache. Read his article here. […]

  3. ‘Piping’ Hot Docker Containers » Ramblings

    […] I ‘docker’ized the ‘grep’ discussed in Naive Substructure Substance Matching on the Raspberry Pi, I was able to attach the STDOUT from the grep to wc -l to get a count of the matching […]

Leave a Reply

%d bloggers like this: