Chemists can search databases using parts of structures, parts of their IUPAC names as well as based on constraints on properties. Chemical databases are particularly different from other general purpose databases in their support for sub-structure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of Graph theory. The algorithms for searching are computationally intensive, often of O (n3) or O (n4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom-searching (ABAS), in which a mapping of the search substructure atoms and bonds with the target molecule is sought. ABAS searching usually makes use of the Ullman algorithm or variations of it (i.e. SMSD ). Speedups are achieved by time amortization, that is, some of the time on search tasks are saved by using precomputed information. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug-discovery). The bit-strings used for these applications are also called structural-keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash-codes based on fragments derived computationally. These are called ‘fingerprints’ although the term is sometimes used synonymously with structural-keys. The amount of memory needed to store these structural-keys and fingerprints can be reduced by ‘folding’, which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length. — Chemical database
Substructure substance matching is, in many ways, a non-trivial exercise in Cheminformatics. The amount of data used to determine matches grows very quickly. For instance, one method of describing a molecule’s “fingerprint” uses 880 bytes. Or 2^880 combinations. This space is very sparsely populated, but there are still many potential combinations.
Another way of describing the structure of a molecule is Simplified molecular-input line-entry system or SMILES. This method uses a string which describes the structure of a molecule. Hydrogen atoms are generally stripped from the structure, so the SMILES representation for water is ‘O’. Likewise, methane is ‘C’. Single bonds are assumed. Double bonds are described by ‘=’, so carbon dioxide is ‘O=C=O’.
As it turns out, grep
happens to work very well to find substructure matches of SMILE data. The following searches are performed on a subset of the NIH PubChem Compound database, 13689519 compounds in total. The original data has been processed on a Raspberry Pi — compressed, this portion of the database is ~13GB. Pulling out the SMILES representation and the compound ID, the resultant flat data is 842M in 733 files.
The 842M happens to fit into the ram of the Pi. After a few searches, the files are buffered in RAM. At that point, the speed increases mightily. The limit for reads of a MicroSD card is ~15M/s. Once cached in RAM, however, it is able to read >400M/s:
1 2 3 4 5 6 7 8 9 10 11 |
HypriotOS: root@apis-rpi-09 in /opt $ hdparm -t /dev/mmcblk0 /dev/mmcblk0: Timing buffered disk reads: 50 MB in 3.01 seconds = 16.60 MB/sec HypriotOS: root@apis-rpi-09 in /opt $ hdparm -T /dev/mmcblk0 /dev/mmcblk0: Timing cached reads: 836 MB in 2.00 seconds = 418.22 MB/sec |
Following is a series of searching demonstrating how the search speeds up as the data is read into cache.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
HypriotOS: root@apis-rpi-09 in /opt $ du -h smiles 824M smiles HypriotOS: root@apis-rpi-09 in /opt $ ls smiles|wc -l 733 HypriotOS: root@apis-rpi-09 in /opt $ cat smiles/*|wc -l 13689519 HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l 123 real 0m17.955s user 0m3.060s sys 0m2.610s HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l 123 real 0m11.874s user 0m2.540s sys 0m2.290s HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l 123 real 0m1.600s user 0m2.500s sys 0m3.290s |
Once the files are buffered in memory, the greps occur in close to constant time for reasonable searches sorted by the compound ID — the previous search matched 123 compounds; by comparison follows a search for a ring structure:
1 2 3 4 5 6 7 8 |
HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l 75325 real 0m1.709s user 0m3.300s sys 0m2.950s |
However, a ridiculous search for substances containing carbon does take a bit longer — there are limits to IO. This search matches almost all of the substances:
1 2 3 4 5 6 7 8 |
HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C'|wc -l 13679424 real 0m9.394s user 0m26.280s sys 0m7.140s |
How, then, is the Pi processing so much data so quickly? Part of the secret lies in splitting the data into “reasonable” chunks of ~55MB. The other secret is in how xargs
is invoked. Not all versions of xargs
support multiple concurrent processes. The -P 4
says to run four instances of grep
concurrently.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l 75325 real 0m1.702s user 0m3.130s sys 0m3.020s HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 3 -n 50 grep -h 'C1CCCCC1'|wc -l 75325 real 0m1.881s user 0m3.210s sys 0m2.250s HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 2 -n 50 grep -h 'C1CCCCC1'|wc -l 75325 real 0m2.423s user 0m2.970s sys 0m1.860s HypriotOS: root@apis-rpi-09 in /opt $ time find /opt/smiles -type f |xargs -P 1 -n 50 grep -h 'C1CCCCC1'|wc -l 75325 real 0m4.344s user 0m2.920s sys 0m1.350s |
Notice that the improvement on the time required is not linear; there is not much difference in time between three (3) and four (4) concurrent threads. The limit of IO has been reached.
With five Pi 2 boards, substructure searches of all 68279512 compounds can be performed in seconds.
It’s not perfect, some structures can be described in more than one way with SMILES. However, it’s fast and simple.
The next substructure search will utilize fingerprints.
3 pings
Docker Containers: Smaller is not always better » Ramblings
April 19, 2015 at 11:19 am (UTC -5) Link to this comment
[…] « Naive Substructure Substance Matching on the Raspberry Pi […]
Fast molecular data processing with the Raspberry Pi | Raspberry Pi Pod
April 20, 2015 at 4:08 am (UTC -5) Link to this comment
[…] If you’re still with me, in his article he explains how the Pi does pattern matching with grep and how it speeds up when reading from the cache. Read his article here. […]
‘Piping’ Hot Docker Containers » Ramblings
April 21, 2015 at 12:48 am (UTC -5) Link to this comment
[…] I ‘docker’ized the ‘grep’ discussed in Naive Substructure Substance Matching on the Raspberry Pi, I was able to attach the STDOUT from the grep to wc -l to get a count of the matching […]