Aug 27

The Three ‘R’s of DevOps

School days, school days
Dear old Golden Rule days
‘Reading and ‘riting and ‘rithmetic
Taught to the tune of the hick’ry stick
School Days by Will Cobb and Gus Edwards

It’s that time of year again… back to school. Just like in school where we learned ‘Reading and ‘riting and ‘rithmetic, DevOps has its own three ‘R’s.

*Repeatable*
Processes are documented and may be repeated with known outcomes. Scripting, or using tools such as Puppet or Chef, enables this.
*Reproducible*
Not only can proper results be reproduced; errors can be as well! This ties into Repeatable, but there is more to it than just repeatability.
*Reliable*
The system is not fragile or prone to vapours. Monitoring, alerting, and high availability all play a role here. However, you can’t have a truly reliable system without the previous two: without repeatability and reproducibility, there is no reliability.

Without school’s three ‘R’s, we can’t succeed in life. Likewise, DevOps needs its ‘R’s. Practicing them makes us ‘able’.

Stealing a note from Maslow, you’d get this:

[Image: the three ‘R’s of DevOps stacked as a pyramid]

Aug 27

Hacking Haiku

It’s late. Not tired. So… I’ll hack some Haiku. Not saying they’ll be good Haiku at this late hour, but they’ll do for Haiku.

 

Bits lost in aether
Wrote files to /dev/null
I lost a day’s work

 

Remember Mabel
Let her death be a lesson
Mount a scratch monkey

 

 

While novice wonders
Which language is the best
Master just writes code

 

one-forty letters
a prison; can only tweet
this bird wants to sing

 

And finally… one not written tonight. This one I entered in the 2015 Columbus Asian Festival, where it won first place!

Promise of summer
Falls in ambush; a late frost
Blossoms on the wind

Aug 27

Stronger Faster Algorithms

Oft times, all thoughts of algorithms are left behind with school. And if they’re not, it’s easy to get caught up in chasing the “most efficient” way to do something, spending two weeks of coding to save a few milliseconds. Still, there is a lot to be said for knowing and studying algorithms.

I’ve always enjoyed opportunities to study algorithms, but I haven’t had many chances in recent years to do so. However, the work I’ve been doing recently on the Raspberry Pi has brought me back to algorithms (Yay!). The main limitation of the Pi is I/O speed; the next is memory. A lot has been said about limitations inspiring creativity, and these limitations have brought me back to exploring algorithms. It’s been fun!

This is the first in a series examining algorithms related to searching and set intersections (and differences), leading up to a non-trivial, non-naive implementation of substructure matching of chemical compounds on a cluster of Raspberry Pi computers.

First we need to frame the problem.

The previous naive implementation used a fairly simple approach: grep the collection of substances, represented in SMILES™ (Simplified Molecular Input Line Entry System, a syntax for describing compounds as strings), for the structure fed in. This is naive in that there is potentially more than one way to specify a structure using SMILES™, and this initial solution would find only one representation of the structure.

The task was spread across five Pi 2 Model Bs, each running Docker instances which used xargs to spread the processing across the cores. A search for something like ‘C1CCCCC1=O’ (a six-carbon ring with a double-bonded oxygen) would take ~1 minute the first time, discovering and reporting ~2,500 hits. The second time it would take about 20 seconds, and subsequent searches would require approximately 5 seconds, thanks to the Linux I/O cache being populated. Still, it’s pretty impressive: finding matches among ~69 million records in 5 seconds.

However, leaving off the ‘=’ drastically increased the size of the results to ~17K hits. Here the I/O limitations of the Raspberry Pi came into play: the results were reported back to a single host over its 100 Mb/s link. In previous tests I had determined that the link could handle ~7 MB/s, and having multiple hosts responding at once pushes against that limit. I did find that by using bonded network interfaces I could obtain rates of approximately 13 MB/s, but no more, because the Pi’s network and USB ports share a single USB 2 hub. The USB 2 bus allows a maximum of 480 Mb/s; the best performance I was able to obtain from any single device was approximately 300 Mb/s, from an SSD.

One potential improvement lies in memoization: once a solution set has been found for a particular substructure search, cache the result rather than re-calculating it.

That will be used in the new, less naive implementation. Initial performance will be slower than the naive implementation; over time, however, it will speed up as the cache fills.
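
As a minimal sketch of that idea (a plain Python dict standing in for whatever shared cache the real system would use; the names here are mine, not the eventual implementation’s):

```python
# Memoization sketch: cache finished result sets keyed on the query.
# `compute_matches` is a placeholder for the full set-intersection search.
cache = {}

def cached_search(query_bits, compute_matches):
    """query_bits: the bit positions set in the search fingerprint."""
    key = frozenset(query_bits)          # the same substructure maps to the same key
    if key not in cache:
        cache[key] = compute_matches(query_bits)   # pay the full cost only once
    return cache[key]
```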

The PubChem database consists of roughly 69 million substances; one distribution format is SDF. For each substance there are a number of pieces of information, such as the formula, representations of the substance and its structure, angles between the various atoms, etc. Of interest for substructure matching is the fingerprint field. The fingerprint consists of 880 bits, each of which describes a property of the substance. For instance, there are bits which specify whether or not a particular element is part of the substance, or whether the substance has more than ‘N’ carbon atoms. In a survey of a sample of the substances, the average number of bits set per substance was on the order of 160.
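
For the later steps it helps to have the positions of the set bits. A small sketch of that bookkeeping, assuming the 880-bit fingerprint has already been pulled out of the SDF record as raw bytes (how PubChem actually encodes the field is not shown here):

```python
def set_bit_positions(fingerprint: bytes):
    """Yield the positions of the bits that are set in an 880-bit fingerprint
    supplied as raw bytes (880 bits = 110 bytes), most-significant bit first."""
    for byte_index, byte in enumerate(fingerprint):
        for offset in range(8):
            if byte & (0x80 >> offset):
                yield byte_index * 8 + offset
```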

The CDK is an open source library of tools for cheminformatics. One of the things it can do is convert a SMILES™ (or other) representation of a structure into the fingerprint used by PubChem, which can then be used to find matches.

The new approach uses adaptive set intersections and differences to find matches. The basic idea is to process the search fingerprint considering only the bits which are turned on; for each such bit there is a pre-computed set of substances which have that bit set. The intersection of all of those sets is the set of substances matching the pattern.

Differences are used where the number of substances which have a particular bit set is greater than half of the database. An example would be the bit specifying that the substance contains carbon: there are far more substances in the database with carbon than without. As an optimization, a difference is used in this case; if we are looking for a substance with carbon, we remove from consideration those which have none.
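
Here is a rough sketch of that query-time logic in Python, using plain sets; `postings` and `complements` are names I have made up for the two kinds of pre-computed sets, and the sketch assumes at least one query bit is rare enough to have been stored directly:

```python
def match_candidates(query_bits, postings, complements):
    """postings[b]: IDs whose fingerprint has bit b set (stored when that is the
    minority case). complements[b]: IDs WITHOUT bit b (stored when more than
    half the database has the bit). Returns IDs that match every query bit."""
    direct = sorted((b for b in query_bits if b in postings),
                    key=lambda b: len(postings[b]))
    if not direct:
        raise ValueError("every query bit is a 'common' bit in this sketch")

    # Start from the rarest bit so the candidate set shrinks as fast as possible.
    result = set(postings[direct[0]])
    for b in direct[1:]:
        result &= postings[b]            # intersection: must also have bit b
    for b in query_bits:
        if b in complements:
            result -= complements[b]     # difference: drop IDs known to lack bit b
    return result
```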

In order to prepare for the search, a total of 880 sets is first created (the SDF Toolkit is useful for this). Each set contains a list of substance IDs, stored as 4-byte unsigned integers, whose fingerprints have the corresponding bit set. If the number of substances with the bit set is greater than half the total number of substances, the list of substances without the bit is calculated and stored instead. The size of each set is also recorded, both keyed by bit position and as a sorted array, for use later when ordering the work.
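
A sketch of that preparation step, assuming the fingerprints have already been extracted into (CID, set-bit-positions) pairs (e.g. while walking the SDF files with the SDF Toolkit); the complement step for over-half-full sets is only noted in a comment:

```python
from array import array

NUM_BITS = 880

def build_posting_sets(substances):
    """substances: iterable of (cid, bit_positions) pairs, with cid fitting in
    a 4-byte unsigned integer. Returns one array of CIDs per fingerprint bit;
    walking the input in CID order means each array comes out sorted."""
    sets = [array('I') for _ in range(NUM_BITS)]     # 'I' = unsigned int
    for cid, bits in substances:
        for b in bits:
            sets[b].append(cid)

    # Record the sizes both keyed by bit position and as a sorted list, so the
    # later stages can pair small sets with large ones.
    sizes = {b: len(s) for b, s in enumerate(sets)}
    sizes_sorted = sorted(sizes.items(), key=lambda kv: kv[1])

    # Any set larger than half the database would be replaced here by its
    # complement (all CIDs not in the set); that step is omitted in this sketch.
    return sets, sizes, sizes_sorted
```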

These sets are then chunked into ranges; the idea is that the work will be spread across a number of Raspberry Pi computers, with the goal that each set intersection over a chunk completes in under a second.

Once the pre-processing is complete, the data is loaded into a distributed data store, such as SeaweedFS, to be replicated and made available across the cluster.

How requests are handled

At a high level, the implementation can be thought of as a series of map and reduce tasks; however, all of the mapping was performed in pre-processing, leaving only the reduce tasks. Hadoop was not considered a good fit: while it is possible to run it on the Pi, it doesn’t run well, and HDFS works best on large chunks of data, whereas the pre-generated sets range from 0 bytes to approximately 10 MB in size.

  1. A request comes into the system to perform a substructure search for a structure. If the structure’s fingerprint is not already cached, it is calculated.
  2. Work is pushed onto a job queue such as beanstalkd. As one optimization, bits which have small sets are paired with bits having larger sets. Additionally, a task can be broken into smaller chunks in order to stay within I/O limits. A counter is kept of the number of pieces of work to be performed.
  3. Workers pick up the tasks; if the result is not already cached, the intersection or difference is calculated. The result is cached and a results queue is notified that the piece is complete.
  4. The worker which monitors the results queue pops two results off of it, decrements the counter, and pushes a task onto the job queue to calculate the intersection of those two results (incrementing the counter).
  5. Steps 3-4 repeat until the counter shows only one piece of work remaining; when it completes, we are done (a minimal sketch of this loop follows the list).
  6. The resulting set is returned to the requestor.
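
Here is that minimal sketch: a single-process stand-in for steps 2 through 5, with a plain in-memory queue in place of beanstalkd and the per-piece caching and chunking left out.

```python
from queue import Queue

def reduce_search(query_bits, postings):
    """Pairwise-reduce the per-bit candidate sets until a single set remains."""
    assert query_bits, "need at least one set bit in the query fingerprint"
    results = Queue()
    counter = 0

    # Step 2: one piece of work per set bit, smallest sets first.
    for b in sorted(query_bits, key=lambda b: len(postings[b])):
        results.put(set(postings[b]))
        counter += 1

    # Steps 3-5: pop two partial results, intersect them, push the combination
    # back, and repeat until the counter says only one piece remains.
    while counter > 1:
        a, b = results.get(), results.get()
        results.put(a & b)               # a real worker would cache this result
        counter -= 1

    return results.get()                 # step 6: hand the set back to the requestor
```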

Memoization will reduce CPU and I/O usage (as well as the number of tasks!) over time. Common requests will eventually be answered in O(1), straight from the cache.

The next entry in the series will talk about adaptive algorithms and how they help to calculate the intersections and differences in less than O(n) time. The algorithm should approach O(p log n) in the worst case, where ‘p’ is the size of the smaller set and ‘n’ is the size of the larger; on average, it should do far better. Non-adaptive merge-based algorithms, by comparison, approach O(n+p).
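
As a preview of the adaptive idea, here is one way (my sketch, not necessarily the exact algorithm the next post will cover) to intersect two sorted ID lists in roughly O(p log n): for each element of the smaller list, gallop through the larger list in doubling steps, then binary-search inside the bracketed range.

```python
from bisect import bisect_left

def gallop_intersect(small, large):
    """Intersect two sorted lists, galloping through `large` so the work is
    closer to O(p log n) than to O(n + p) when `small` is much smaller."""
    out, lo = [], 0
    for x in small:
        # Double the step size until we reach a value >= x (or run off the end).
        step, hi = 1, lo
        while hi < len(large) and large[hi] < x:
            lo, hi = hi, hi + step
            step *= 2
        # Binary search for x inside the bracketed range.
        i = bisect_left(large, x, lo, min(hi + 1, len(large)))
        if i < len(large) and large[i] == x:
            out.append(x)
        lo = i      # later elements of `small` can only land further right
    return out
```

The more the two lists differ, the more of the larger list the gallop skips; that data-dependence is what makes the approach adaptive.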

I’m also considering exploring the use of machine learning to build a neural net of fingerprints against which substructures are compared. I’m not sure how well it would work; there would have to be some way of determining match thresholds to eliminate false hits. Another avenue of exploration is the use of Kalman filters to predict the number of workers needed, based on thresholds of how long a calculation should take. More on that later…

Aug 15

“Well, I’m back,” he said.

When last we saw our intrepid hero, he was busily working with the Raspberry Pi. With but a couple of days’ notice, he was swept off to a knee surgeon. The surgery itself went well; however, I developed a secondary infection which threw me for a loop. I’m feeling better and finally getting back to it.

Kudos to anyone who can recognize the source of the title….

May 07

What developers think…

Check out @chaseadamsio’s Tweet: https://twitter.com/chaseadamsio/status/596368402765209601?s=09

May 07

Swarming Raspberry Pi: Building a Cloud in a Box

I’m speaking at a Meetup with DevOps Columbus http://meetu.ps/2HsLPg

The topic: Docker, the cloud, and the Raspberry Pi.

Apr 28

Heterogeneous Docker Swarms Teaser

Note: This is all very experimental; Docker does not support any architecture other than x86_64.

The last few evenings I’ve been working on Multifarious, a means of creating heterogeneous Docker Swarms. I’d previously found that I could create a swarm with heterogeneous members, say a swarm which has both x86_64 and Raspberry Pi members. The problem arose, of course, once I attempted to run containers in the swarm: containers are architecture specific.

Enter Multifarious. And no, multifarious isn’t nefarious, even if the words sound similar. Rather, it means “many varied parts or aspects” (Google).

Multifarious uses dependency injection to tell Docker the name of an image suited to the host’s architecture.

In the preliminary version, ClusterHQ’s Powerstrip is used to inject the proper image name into the request to create a Docker container. Powerstrip, in turn, calls a small Sinatra application which performs a lookup in Redis to find the proper name for the host’s architecture. If the image name is not registered in Redis, it is passed through without modification. Multifarious can be configured either to provide an image name for every architecture of a canonical name, or to replace the default name only for the architectures that need a special case.
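
The real adapter is the Sinatra application just described; purely as an illustration of the lookup it performs, here is a Python sketch using redis-py, with a hypothetical ‘images:<canonical-name>’ hash layout (one field per architecture) that may differ from what Multifarious actually stores:

```python
import platform
import redis

r = redis.Redis(host="localhost", port=6379)

def resolve_image(requested):
    """Map a canonical image name to the architecture-specific image, falling
    back to the requested name when nothing is registered for it."""
    arch = platform.machine()                  # e.g. 'x86_64' or 'armv7l'
    mapped = r.hget(f"images:{requested}", arch)
    return mapped.decode() if mapped else requested

# After seeding something like:  HSET images:hello x86_64 hello-world
# an x86_64 host asking for 'hello' would be handed 'hello-world'.
```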

[Image: Multifarious data flow diagram]

Quite possibly a future version will be written in Go; rather than requiring multiple executables to perform the injection, I expect to merge Powerstrip and the adapter into one. This should reduce the footprint a good deal.

I am still working on a cohesive demo, but a quick test shows that the dependency injection is working.

In that test, the -i flag is needed due to a Powerstrip quirk. However, take note that the Docker image invoked on the command line is ‘hello’, while the image actually run is ‘hello-world’; there is no ‘hello’ image at all. Injection is working, and I can configure which images run based upon the architecture.

I’ve injected the proper name for the image based upon a Redis lookup. I chose Redis because it’s available for multiple platforms and is pretty easy to use; it just needs to have the lookup table fed to it.

The items are stored in Redis as an HSET.

At runtime the image is chosen and injected and life proceeds.

The repository is available on GitHub and will be added to over the next couple of days, with a full-fledged writeup and demo to follow.

The Featured Image is a modification of a photo by JD Hancock, shared on Flickr under a Creative Commons (BY) license.

Apr 21

‘Piping’ Hot Docker Containers

One of the possibly lesser-used flags for docker run is -a, which allows you to attach the container’s STDIN, STDOUT, or STDERR and pipe it back to the shell which invoked the container. This allows you to construct pipelines of commands, just as you can with UNIX processes. For instance, using UNIX commands to count the number of files in a directory, you would do something like `ls | wc -l`.

Since the Docker container acts as a command, it has its own STDIN, STDOUT, and STDERR. You can string together multiple commands and containers.

After I ‘docker’ized the ‘grep’ discussed in Naive Substructure Substance Matching on the Raspberry Pi, I was able to attach the STDOUT from the grep to wc -l to get a count of the matching substances.

This works just fine. In fact, it opens up opportunities for all sorts of other commands and suites running inside a container. Pandoc running in a container to generate PDFs comes to mind. Or ImageMagick. Or any number of other commands. All of the advantages of Docker containers with all of the fun of UNIX pipes.

Then the imp of the perverse struck. If I could redirect the STDOUT of a container running on the local host, would it work as well on another? In short… yes.

You can attach to the streams of a Docker container running on a different host. The Docker daemon just needs to be bound to a TCP port on the other host(s), with the client pointed at it via -H.

So, if I can run one at a time, why not five? I knocked out a couple of one-line shell scripts (harness and runner) and, for grins and giggles, added a ‘-x’ magick cookie to demonstrate what’s happening. The lines below with the ‘+’ inside show the commands which are being performed behind the scenes:

In less than six seconds, it has spawned Docker containers on five other hosts. Each of these containers is performing a substructure (read: grep) search of ~13.7 million chemical compounds, for a total of ~69M compounds. The results are then sent back to the initiating host, which dumps them to a file as well as counting them. Not too shabby. And it scales roughly linearly, too; I/O is the main limiting factor here.

I can think of lots of uses for this. Poor man’s parallel processing. Map/Reduce. Many more.

The disadvantage of this quick and dirty method is that you need to know the IP addresses on which to run the commands. Swarm alleviates the necessity of knowing the addresses or of coming up with a methodology for distributing the workload, which is always a plus.

It’s not necessarily something I’d do to go to production, but for testing or experimentation, it works quite well. It also leads to other experiments.

Docker is really awesome; I’m learning new things to do with it all the time.

Apr 19

Docker Containers: Smaller is not always better

Generally, smaller Docker containers are preferred to larger ones. However, a smaller container is not always as performant as a larger one. In this case, using a (slightly) larger container improved performance by over 30x.

TL;DR

The grep included in busybox is painfully slow. When using grep to process lots of data, add a (real) grep to the container.

Background

As discussed in Naive Substructure Substance Matching on the Raspberry Pi, I am exploring the limits of the Raspberry Pi for processing data. I chose substructure searching as a problem set because it is a non-trivial problem and a decent demonstration for co-workers of the processing power of the Pi.

I’ve pre-processed the NIH PubChem Compounds database to extract SMILES data, a language for describing the structure of chemical compounds. As a relatively naive first implementation, I’m using grep to match substructures. I have split the files amongst five Pi 2s; each is processing ~840 MB in ~730 files, and xargs is used to do concurrent processing across multiple cores. After a few passes, the entire data set is read into cache and each Pi is able to process it in 1-2 seconds for realistic searches. A ridiculous search, finding all of the carbon-containing compounds (over 13 million), takes 8-10 seconds.

Having developed a solution, I then set about dockerizing it.

I chose voxxit/alpine-rpi for my base; it’s quite small, about 5 MB, and has almost everything needed. I discovered that the version of xargs which ships with the container does not support -P, so a real xargs needs to be added to the container (on Alpine, the findutils package provides a GNU xargs).

I ran my test and found that the performance was horrid.

I decided to drop into an interactive shell so that I could tweak things. You can see the performance below, in the ‘Before’ listing.

Before:

Typically the performance of a large I/O operation will improve after a few passes, as the system caches disk reads; it generally takes about three passes before all of the data is in the cache. However, the numbers above did not improve. I did verify that multiple cores were, indeed, being used.

I proceeded down a rabbit hole, looking at IO and VM statistics. Horrible. From there I googled to see if, indeed, Docker uses the disk cache (it does) and/or if there was a flag I needed to set (I didn’t). Admittedly, I couldn’t believe that IO using Docker could be that much slower, but I am a firm believer in testing my assumptions.

After poking about in /proc and /sys and running the search outside of Docker, I decided to see if there might be a faster grep. As it turns out, the container’s grep is the busybox applet.

Busybox is generally a good choice in terms of size. However, it appears that the embedded grep is considerably slower than molasses in January. On a whim I decided to install the real GNU grep (on Alpine, apk provides it as the grep package).

I then re-ran the test and did a Snoopy Dance.

After:

Lessons Learned

This episode drove home the need to question assumptions. In this case, the assumption was that a smaller container is inherently better. I believe that smaller and lighter containers are a Good Practice and an admirable goal. However, as seen here, smaller is not always better.

I also habitually look at a container’s Dockerfile before pulling it. In this case it wasn’t enough. It reinforced the lesson that I need to know what’s running in a container before I try to use it.

Apr 18

Naive Substructure Substance Matching on the Raspberry Pi

Chemists can search databases using parts of structures, parts of their IUPAC names as well as based on constraints on properties. Chemical databases are particularly different from other general purpose databases in their support for sub-structure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of Graph theory. The algorithms for searching are computationally intensive, often of O(n³) or O(n⁴) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom-searching (ABAS), in which a mapping of the search substructure atoms and bonds with the target molecule is sought. ABAS searching usually makes use of the Ullman algorithm or variations of it (i.e. SMSD). Speedups are achieved by time amortization, that is, some of the time on search tasks are saved by using precomputed information. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug-discovery). The bit-strings used for these applications are also called structural-keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash-codes based on fragments derived computationally. These are called ‘fingerprints’ although the term is sometimes used synonymously with structural-keys. The amount of memory needed to store these structural-keys and fingerprints can be reduced by ‘folding’, which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length. — Chemical database

Substructure substance matching is, in many ways, a non-trivial exercise in cheminformatics. The amount of data used to determine matches grows very quickly. For instance, one method of describing a molecule’s “fingerprint” uses 880 bits, which allows 2^880 combinations. This space is very sparsely populated, but there are still many potential combinations.

Another way of describing the structure of a molecule is the Simplified Molecular-Input Line-Entry System, or SMILES. This method uses a string which describes the structure of a molecule. Hydrogen atoms are generally omitted from the structure, so the SMILES representation for water is ‘O’; likewise, methane is ‘C’. Single bonds are assumed. Double bonds are described by ‘=’, so carbon dioxide is ‘O=C=O’.

As it turns out, grep happens to work very well for finding substructure matches in SMILES data. The following searches are performed on a subset of the NIH PubChem Compound database, 13,689,519 compounds in total. The original data has been processed on a Raspberry Pi; compressed, this portion of the database is ~13 GB. Pulling out just the SMILES representation and the compound ID, the resultant flat data is 842 MB in 733 files.

The 842 MB happens to fit into the RAM of the Pi. After a few searches the files are buffered in RAM, and at that point the speed increases mightily. The limit for reads from the MicroSD card is ~15 MB/s; once cached in RAM, however, it is able to read at over 400 MB/s.

Following is a series of searches demonstrating how the search speeds up as the data is read into cache.

Once the files are buffered in memory, the greps complete in close to constant time for reasonable searches, with results sorted by the compound ID. The previous search matched 123 compounds; by comparison, the next is a search for a ring structure.

However, a ridiculous search for substances containing carbon does take a bit longer; there are limits to I/O. This search matches almost all of the substances.

How, then, is the Pi processing so much data so quickly? Part of the secret lies in splitting the data into “reasonable” chunks of ~55 MB. The other secret is in how xargs is invoked: not all versions of xargs support multiple concurrent processes, but GNU xargs does, and -P 4 says to run four instances of grep concurrently.
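
The actual fan-out is the xargs pipeline described above; purely as an analogue in Python (not the command the post uses), the same split-the-files-and-scan-in-parallel idea looks something like this, with a hypothetical smiles/*.txt chunk layout:

```python
import glob
from multiprocessing import Pool

def scan(path, pattern="C1CCCCC1"):
    """Return the lines (compound ID + SMILES) containing the pattern."""
    with open(path) as fh:
        return [line for line in fh if pattern in line]

if __name__ == "__main__":
    files = glob.glob("smiles/*.txt")          # the ~55 MB chunks
    with Pool(4) as pool:                      # mirrors xargs -P 4
        hits = [h for chunk in pool.map(scan, files) for h in chunk]
    print(len(hits), "matching compounds")
```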

Notice that the improvement in the time required is not linear; there is not much difference between three (3) and four (4) concurrent processes. The limit of I/O has been reached.

With five Pi 2 boards, substructure searches of all 68,279,512 compounds can be performed in seconds.

It’s not perfect; some structures can be described in more than one way with SMILES. However, it’s fast and simple.

The next substructure search will utilize fingerprints.
