Naive Substructure Substance Matching on the Raspberry Pi

Chemists can search databases using parts of structures, parts of their IUPAC names as well as based on constraints on properties. Chemical databases are particularly different from other general purpose databases in their support for sub-structure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of Graph theory. The algorithms for searching are computationally intensive, often of O (n3) or O (n4) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom-searching (ABAS), in which a mapping of the search substructure atoms and bonds with the target molecule is sought. ABAS searching usually makes use of the Ullman algorithm or variations of it (i.e. SMSD ). Speedups are achieved by time amortization, that is, some of the time on search tasks are saved by using precomputed information. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug-discovery). The bit-strings used for these applications are also called structural-keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash-codes based on fragments derived computationally. These are called ‘fingerprints’ although the term is sometimes used synonymously with structural-keys. The amount of memory needed to store these structural-keys and fingerprints can be reduced by ‘folding’, which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length. — Chemical database

Substructure substance matching is, in many ways, a non-trivial exercise in Cheminformatics. The amount of data used to determine matches grows very quickly. For instance, one method of describing a molecule’s “fingerprint” uses 880 bytes. Or 2^880 combinations. This space is very sparsely populated, but there are still many potential combinations.

Another way of describing the structure of a molecule is Simplified molecular-input line-entry system or SMILES. This method uses a string which describes the structure of a molecule. Hydrogen atoms are generally stripped from the structure, so the SMILES representation for water is ‘O’. Likewise, methane is ‘C’. Single bonds are assumed. Double bonds are described by ‘=’, so carbon dioxide is ‘O=C=O’.

As it turns out, grep happens to work very well to find substructure matches of SMILE data. The following searches are performed on a subset of the NIH PubChem Compound database, 13689519 compounds in total. The original data has been processed on a Raspberry Pi — compressed, this portion of the database is ~13GB. Pulling out the SMILES representation and the compound ID, the resultant flat data is 842M in 733 files.

The 842M happens to fit into the ram of the Pi. After a few searches, the files are buffered in RAM. At that point, the speed increases mightily. The limit for reads of a MicroSD card is ~15M/s. Once cached in RAM, however, it is able to read >400M/s:

HypriotOS: root@apis-rpi-09 in /opt
$ hdparm -t /dev/mmcblk0

/dev/mmcblk0:
 Timing buffered disk reads:  50 MB in  3.01 seconds =  16.60 MB/sec
HypriotOS: root@apis-rpi-09 in /opt
$ hdparm -T /dev/mmcblk0

/dev/mmcblk0:
 Timing cached reads:   836 MB in  2.00 seconds = 418.22 MB/sec

HypriotOS: root@apis-rpi-09 in /opt

$ hdparm -t /dev/mmcblk0

/dev/mmcblk0:

Timing buffered disk reads: 50 MB in 3.01 seconds = 16.60 MB/sec

HypriotOS: root@apis-rpi-09 in /opt

$ hdparm -T /dev/mmcblk0

/dev/mmcblk0:

Timing cached reads: 836 MB in 2.00 seconds = 418.22 MB/sec

Following is a series of searching demonstrating how the search speeds up as the data is read into cache.

HypriotOS: root@apis-rpi-09 in /opt
$ du -h smiles
824M    smiles
HypriotOS: root@apis-rpi-09 in /opt
$ ls smiles|wc -l
733
HypriotOS: root@apis-rpi-09 in /opt
$ cat smiles/*|wc -l
13689519
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l
123

real    0m17.955s
user    0m3.060s
sys 0m2.610s
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l
123

real    0m11.874s
user    0m2.540s
sys 0m2.290s
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l
123

real    0m1.600s
user    0m2.500s
sys 0m3.290s

HypriotOS: root@apis-rpi-09 in /opt

$ du -h smiles

824M smiles

HypriotOS: root@apis-rpi-09 in /opt

$ ls smiles|wc -l

733

HypriotOS: root@apis-rpi-09 in /opt

$ cat smiles/*|wc -l

13689519

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l

123

real 0m17.955s

user 0m3.060s

sys 0m2.610s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l

123

real 0m11.874s

user 0m2.540s

sys 0m2.290s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'CC1CCCCC1=O'|sort|wc -l

123

real 0m1.600s

user 0m2.500s

sys 0m3.290s

Once the files are buffered in memory, the greps occur in close to constant time for reasonable searches sorted by the compound ID — the previous search matched 123 compounds; by comparison follows a search for a ring structure:

HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l
75325

real    0m1.709s
user    0m3.300s
sys 0m2.950s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l

75325

real 0m1.709s

user 0m3.300s

sys 0m2.950s

However, a ridiculous search for substances containing carbon does take a bit longer — there are limits to IO. This search matches almost all of the substances:

HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C'|wc -l
13679424

real    0m9.394s
user    0m26.280s
sys 0m7.140s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C'|wc -l

13679424

real 0m9.394s

user 0m26.280s

sys 0m7.140s

How, then, is the Pi processing so much data so quickly? Part of the secret lies in splitting the data into “reasonable” chunks of ~55MB. The other secret is in how xargs is invoked. Not all versions of xargs support multiple concurrent processes. The -P 4 says to run four instances of grep concurrently.

HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l
75325

real    0m1.702s
user    0m3.130s
sys 0m3.020s
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 3 -n 50 grep -h 'C1CCCCC1'|wc -l
75325

real    0m1.881s
user    0m3.210s
sys 0m2.250s
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 2 -n 50 grep -h 'C1CCCCC1'|wc -l
75325

real    0m2.423s
user    0m2.970s
sys 0m1.860s
HypriotOS: root@apis-rpi-09 in /opt
$ time find /opt/smiles -type f |xargs -P 1 -n 50 grep -h 'C1CCCCC1'|wc -l
75325

real    0m4.344s
user    0m2.920s
sys 0m1.350s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 4 -n 50 grep -h 'C1CCCCC1'|wc -l

75325

real 0m1.702s

user 0m3.130s

sys 0m3.020s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 3 -n 50 grep -h 'C1CCCCC1'|wc -l

75325

real 0m1.881s

user 0m3.210s

sys 0m2.250s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 2 -n 50 grep -h 'C1CCCCC1'|wc -l

75325

real 0m2.423s

user 0m2.970s

sys 0m1.860s

HypriotOS: root@apis-rpi-09 in /opt

$ time find /opt/smiles -type f |xargs -P 1 -n 50 grep -h 'C1CCCCC1'|wc -l

75325

real 0m4.344s

user 0m2.920s

sys 0m1.350s

Notice that the improvement on the time required is not linear; there is not much difference in time between three (3) and four (4) concurrent threads. The limit of IO has been reached.

With five Pi 2 boards, substructure searches of all 68279512 compounds can be performed in seconds.

It’s not perfect, some structures can be described in more than one way with SMILES. However, it’s fast and simple.

The next substructure search will utilize fingerprints.

Fast molecular data processing with the Raspberry Pi | Raspberry Pi Pod

April 20, 2015 at 4:08 am (UTC -4) Link to this comment

[…] If you’re still with me, in his article he explains how the Pi does pattern matching with grep and how it speeds up when reading from the cache. Read his article here. […]

‘Piping’ Hot Docker Containers » Ramblings

April 21, 2015 at 12:48 am (UTC -4) Link to this comment

[…] I ‘docker’ized the ‘grep’ discussed in Naive Substructure Substance Matching on the Raspberry Pi, I was able to attach the STDOUT from the grep to wc -l to get a count of the matching […]

Ramblings

Musings of Matt Williams

Naive Substructure Substance Matching on the Raspberry Pi

Like this:

Related

3 pings

Docker Containers: Smaller is not always better » Ramblings

Fast molecular data processing with the Raspberry Pi | Raspberry Pi Pod

‘Piping’ Hot Docker Containers » Ramblings

Leave a Reply Cancel reply

Subscribe to Blog via Email

Recent Posts

Top Posts & Pages

Archives

Categories

Copyright

Ramblings

Musings of Matt Williams

Naive Substructure Substance Matching on the Raspberry Pi

Share this:

Like this:

Related

3 pings

Docker Containers: Smaller is not always better » Ramblings

Fast molecular data processing with the Raspberry Pi | Raspberry Pi Pod

‘Piping’ Hot Docker Containers » Ramblings

Leave a Reply Cancel reply

Subscribe to Blog via Email

Recent Posts

Top Posts & Pages

Archives

Categories

Copyright