Apr 28

Heterogeneous Docker Swarms Teaser

Note: This is all very experimental; Docker does not officially support any architecture other than x86_64.

The last few evenings I’ve been working on Multifarious, a means of creating heterogeneous Docker Swarms. I’d previously found that I can create a swarm with heterogeneous members — a swarm which has, say, x86_64 and Raspberry Pi members. The problem arose, of course, once I attempted to run containers in the swarm. Containers are architecture specific.

Enter Multifarious. And no, multifarious isn’t nefarious, even if the words sound similar. Rather it means “many varied parts or aspects” (Google)

Multifarious uses dependency injection to tell Docker the name of an image suited to the host’s architecture.

In the preliminary version, ClusterHQ’s powerstrip is used to inject the proper image name into the request to create a Docker container. Powerstrip, in turn, calls a small Sinatra application which performs a lookup in Redis to find the proper image name for the host’s architecture. If the image name is not registered in Redis, it is passed through without modification. Multifarious can be configured either to provide an architecture-specific image name for every canonical name, or to replace the default name only for special cases.

(Diagram: multifarious data flow)

Quite possibly a future version will be written in Go and rather than requiring multiple executables to perform the injection, I expect to merge powerstrip and the adapter into one. This should reduce the footprint a good deal.

I am still working on a cohesive demo, but the following will show that the dependency injection is working:
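
A sketch of the invocation, assuming powerstrip is listening on the local Docker TCP port (2375 in my setup) and that ‘hello’ is the canonical name registered in Redis:

    # the client talks to powerstrip, which rewrites the image name on its way to Docker
    DOCKER_HOST=tcp://localhost:2375 docker run -i hello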

The -i is needed due to a powerstrip quirk. However, take note that the docker image being invoked on the command line is ‘hello’, while the image actually being run is ‘hello-world’; there is no ‘hello’ image. Injection is working, and I can configure which images to run based upon the architecture.

I’ve injected the proper name for the image based upon a Redis lookup. I chose Redis because it’s available for multiple platforms and is pretty easy to use. It just needs to have the lookup table fed to it.

The items are stored in Redis as an HSET:
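
Sketched with redis-cli; the key layout and the ARM image name below are illustrative rather than the exact schema:

    # one hash per architecture, mapping a canonical name to an architecture-specific image
    redis-cli HSET images:x86_64 hello hello-world
    redis-cli HSET images:armv7l hello someuser/armhf-hello-world

    # the adapter's lookup then amounts to:
    redis-cli HGET images:armv7l hello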

At runtime the image is chosen and injected and life proceeds.

The repository is available on GitHub and will be added to over the next couple of days, with a full-fledged writeup and demo to follow.

The Featured Image is a modification of a photo by JD Hancock:


flickr photo shared by JD Hancock under a Creative Commons ( BY ) license

Apr 21

‘Piping’ Hot Docker Containers

One of the possibly lesser-used flags for docker run is -a, which allows you to attach the container’s STDIN, STDOUT, or STDERR and pipe it back to the shell which invoked the container. This allows you to construct pipelines of commands, just as you can with UNIX processes. For instance, using UNIX commands to count the number of files in a directory, you would do something like:
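
The classic one-liner, counting one filename per line:

    ls | wc -l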

Since the Docker container acts as a command, it has its own STDIN, STDOUT, and STDERR. You can string together multiple commands and containers.

After I ‘docker’ized the ‘grep’ discussed in Naive Substructure Substance Matching on the Raspberry Pi, I was able to attach the STDOUT from the grep to wc -l to get a count of the matching substances.
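
In shell terms it is just another pipe. A sketch, with the image name and SMILES pattern as stand-ins for my actual ones:

    # -a stdout hands the container's STDOUT to the local pipeline
    docker run -a stdout rpi-substructure-grep 'C(=O)O' | wc -l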

This works just fine. In fact, it opens up opportunities for all sorts of other commands/suites running inside a container. Pandoc running in a container to generate PDFs comes to mind. Or ImageMagick. Or any of a number of other commands. All of the advantages of docker containers with all of the fun of UNIX pipes.

Then the imp of the perverse struck. If I could redirect the STDOUT of a container running on a local host, would it work as well on another? In short…. yes.

You can attach to the streams of a docker container running on a different host. The docker daemon needs to be bound to a port on the other host(s).
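
The only change is pointing the client at the remote daemon with -H; a sketch (address, image, and pattern are examples):

    # runs on 192.168.1.101, but the output lands in the local shell's pipeline
    docker -H tcp://192.168.1.101:2375 run -a stdout rpi-substructure-grep 'C(=O)O' | wc -l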

So, if I can run one at a time, why not five? I knocked out a couple of one line shell scripts (harness and runner) and, for grins and giggles, added a ‘-x’ magick cookie to demonstrate what’s happening. The lines below with the ‘+’ inside show the commands which are being performed behind the scenes:
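
The two scripts boil down to something like this sketch (addresses, image, and pattern are placeholders):

    # runner -- run the search container on one remote host, emitting its STDOUT locally
    #   usage: runner <host> <pattern>
    docker -H "tcp://$1:2375" run -a stdout rpi-substructure-grep "$2"

    # harness -- fan one pattern out to five hosts, up to five at a time
    #   usage: sh -x harness <pattern>   (the -x is what produces the '+' trace lines)
    printf '%s\n' 192.168.1.101 192.168.1.102 192.168.1.103 192.168.1.104 192.168.1.105 \
        | xargs -P 5 -I{} sh runner {} "$1" | tee results.txt | wc -l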

In less than six seconds, it’s spawned docker containers on five other hosts. Each of these containers is performing a substructure (read grep) search of ~13.7 million chemical compounds, for a total of ~69M compounds. The results are then sent back to the initiating host, which is dumping the results to a file as well as counting them. Not too shabby. And it scales roughly linearly with the number of hosts, too — IO is the main limiting factor here.

I can think of lots of uses for this. Poor man’s parallel processing. Map/Reduce. Many more.

The disadvantage of this quick and dirty method is that you need to know the IP addresses on which to run the commands. Swarm alleviates the necessity of knowing the addresses or of coming up with a methodology for distributing the workload, which is always a plus.

It’s not necessarily something I’d take to production, but for testing or experimentation it works quite well. It also leads to other experiments.

Docker is really awesome; I’m learning new things to do with it all the time.

Apr 19

Docker Containers: Smaller is not always better

Generally, smaller Docker containers are preferred to larger ones. However, a smaller container is not always as performant as a larger one. By using a (slightly) larger container, I improved performance by over 30x.

TL;DR

The grep included in busybox is painfully slow. When using grep to process lots of data, add a (real) grep to the container.

Background

As discussed in Naive Substructure Substance Matching on the Raspberry Pi, I am exploring the limits of the Raspberry Pi for processing data. I chose substructure searching as a problem set because it is non-trivial and a decent demonstration for co-workers of the processing power of the Pi.

I’ve pre-processed the NIH PubChem Compounds database to extract SMILES data — this is a language for describing the structure of chemical compounds. As a relatively naive first implementation I’m using grep to match substructures. I have split the files amongst five Pi 2s; each is processing ~840MB in ~730 files. xargs is used to do concurrent processing across multiple cores. After a few cycles, the entire data set is read into cache and the Pi is able to process it in 1-2 seconds for realistic searches. A ridiculous search, finding all of the carbon-containing compounds (over 13 million), takes 8-10 seconds.

Having developed a solution, I then set about dockerizing it.

I chose voxxit/alpine-rpi for my base — it’s quite small, about 5MB, and has almost everything needed. I discovered that the version of xargs which ships with the container does not support -P. So xargs is added via:
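
On Alpine, a GNU xargs that understands -P comes from the findutils package; something like:

    apk update && apk add findutils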

I ran my test and found that the performance was horrid.

I decided to drop into an interactive shell so that I could tweak. You can see the performance below in the ‘Before’.

Before:

Typically the performance of a large IO operation will improve after a few cycles; the system is able to cache disk reads. It generally takes 3 cycles before all of the data is in the cache. However, the numbers above did not improve. I did verify that multiple cores were, indeed, being used.

I proceeded down a rabbit hole, looking at IO and VM statistics. Horrible. From there I googled to see if, indeed, Docker uses the disk cache (it does) and/or if there was a flag I needed to set (I didn’t). Admittedly, I couldn’t believe that IO using Docker could be that much slower, but I am a firm believer in testing my assumptions.

After poking about in /proc and /sys and running the search outside of Docker, I decided to see if there might be a faster grep. As it turns out, the container uses busybox:
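
A quick check from inside the container makes that clear (sketch):

    # inside the container: grep resolves to a busybox applet
    ls -l "$(which grep)"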

This is generally a good choice in terms of size. However, it appears that the embedded grep is considerably slower than molasses in January. On a whim I decided to install grep:
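
Again a one-liner with apk (the package is simply called grep):

    apk update && apk add grep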

I then re-ran the test and did a Snoopy Dance.

After:

Lessons Learned

This episode drove home the need to question assumptions. In this case the assumption is that a smaller sized container is inherently better. I believe that smaller and lighter containers are a Good Practice and an admirable goal. However, as seen here, smaller is not always better.

I also habitually look at a container’s Dockerfile before pulling it. In this case it wasn’t enough. It reinforced the lesson that I need to know what’s running in a container before I try to use it.

Apr 18

Naive Substructure Substance Matching on the Raspberry Pi

Chemists can search databases using parts of structures, parts of their IUPAC names as well as based on constraints on properties. Chemical databases are particularly different from other general purpose databases in their support for sub-structure search. This kind of search is achieved by looking for subgraph isomorphism (sometimes also called a monomorphism) and is a widely studied application of Graph theory. The algorithms for searching are computationally intensive, often of O(n³) or O(n⁴) time complexity (where n is the number of atoms involved). The intensive component of search is called atom-by-atom-searching (ABAS), in which a mapping of the search substructure atoms and bonds with the target molecule is sought. ABAS searching usually makes use of the Ullman algorithm or variations of it (i.e. SMSD). Speedups are achieved by time amortization, that is, some of the time on search tasks are saved by using precomputed information. This pre-computation typically involves creation of bitstrings representing presence or absence of molecular fragments. By looking at the fragments present in a search structure it is possible to eliminate the need for ABAS comparison with target molecules that do not possess the fragments that are present in the search structure. This elimination is called screening (not to be confused with the screening procedures used in drug-discovery). The bit-strings used for these applications are also called structural-keys. The performance of such keys depends on the choice of the fragments used for constructing the keys and the probability of their presence in the database molecules. Another kind of key makes use of hash-codes based on fragments derived computationally. These are called ‘fingerprints’ although the term is sometimes used synonymously with structural-keys. The amount of memory needed to store these structural-keys and fingerprints can be reduced by ‘folding’, which is achieved by combining parts of the key using bitwise-operations and thereby reducing the overall length. — Chemical database

Substructure substance matching is, in many ways, a non-trivial exercise in Cheminformatics. The amount of data used to determine matches grows very quickly. For instance, one method of describing a molecule’s “fingerprint” uses 880 bits. Or 2^880 combinations. This space is very sparsely populated, but there are still many potential combinations.

Another way of describing the structure of a molecule is Simplified molecular-input line-entry system or SMILES. This method uses a string which describes the structure of a molecule. Hydrogen atoms are generally stripped from the structure, so the SMILES representation for water is ‘O’. Likewise, methane is ‘C’. Single bonds are assumed. Double bonds are described by ‘=’, so carbon dioxide is ‘O=C=O’.

As it turns out, grep happens to work very well for finding substructure matches in SMILES data. The following searches are performed on a subset of the NIH PubChem Compound database, 13,689,519 compounds in total. The original data has been processed on a Raspberry Pi — compressed, this portion of the database is ~13GB. Pulling out the SMILES representation and the compound ID, the resultant flat data is 842MB in 733 files.

The 842MB happens to fit into the RAM of the Pi. After a few searches, the files are buffered in RAM. At that point, the speed increases mightily. The limit for reads of a MicroSD card is ~15MB/s. Once cached in RAM, however, it is able to read >400MB/s:
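
A crude way to watch the cache kick in; the path is a placeholder, and the drop_caches line needs to run as root:

    sync && echo 3 > /proc/sys/vm/drop_caches    # flush the page cache (as root)
    time cat /data/smiles/* > /dev/null          # first pass: limited by the MicroSD card
    time cat /data/smiles/* > /dev/null          # second pass: served from RAM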

Following is a series of searches demonstrating how the search speeds up as the data is read into cache.

Once the files are buffered in memory, the greps occur in close to constant time for reasonable searches, with the results sorted by the compound ID — the previous search matched 123 compounds; by comparison, the following is a search for a ring structure:

However, a ridiculous search for substances containing carbon does take a bit longer — there are limits to IO. This search matches almost all of the substances:

How, then, is the Pi processing so much data so quickly? Part of the secret lies in splitting the data into “reasonable” chunks of ~55MB. The other secret is in how xargs is invoked. Not all versions of xargs support multiple concurrent processes. The -P 4 says to run four instances of grep concurrently.
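
The invocation is shaped something like this; the paths and pattern are stand-ins for the real ones:

    # -n caps the files handed to each grep so there is enough work to keep four going;
    # -P 4 runs four greps at once, one per core on the Pi 2
    ls /data/smiles/* | xargs -n 50 -P 4 grep 'C1CCCCC1' > matches.txt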

Notice that the improvement in the time required is not linear; there is not much difference in time between three (3) and four (4) concurrent threads. The limit of IO has been reached.

With five Pi 2 boards, substructure searches of all 68279512 compounds can be performed in seconds.

It’s not perfect; some structures can be described in more than one way with SMILES. However, it’s fast and simple.

The next substructure search will utilize fingerprints.

Apr 15

Raspberry Pi and First World Problems

And now, dear reader, a brief intercalary segue…..

Stating the obvious, I’ve been doing a lot of work recently with the Raspberry Pi. In truth, I’ve been trying to discern and/or work around its limitations. Consequently, I’ve caught myself wishing for just a little bit more bandwidth — thus far I/O is the limiting factor for me and the types of work I’ve been doing with the Pi.

Limiting as in 30MB/s disk read/writes. ~13MB/s MicroSD read/writes. ~7MB/s network I/O over ethernet — I wonder whether, if I went with wireless, I could squeeze out a little bit more… See? I’m doing it again.

I can remember not terribly long ago that I’d have killed for such performance. For that matter, I’ve (only) got 30Mb/s (note the lowercase ‘b’) coming into the house. The Pi could consume the entire bandwidth into the house.

And then I think back a decade, when I thought that dedicated 768Kb/s up and down was quite nice. Two and a half decades ago, I thought that transferring files from White Sands to CMU at 9600 baud was quite impressive.

Then I start to think about all that I now take for granted in the Day-to-Day which not terribly long ago would have been considered a “hard problem” if not Magick. I told my daughter about a year ago that I had a magick mirror which would allow me to see and talk to people on the other side of the world. She didn’t believe me, so I pulled out my phone. “Dad, that’s not Magick, that’s Technology.” Out of the mouths of babes and innocents, I am reminded of Arthur C. Clarke:

Any sufficiently advanced technology is indistinguishable from Magic

Frankly, a decade ago the idea that I could have a computer with 4 cores was not something I’d contemplated. In 2001 I purchased a laptop with a single-core 1000MHz processor and a gigabyte of RAM for over $1700. At the time I thought it was something quite nice. Go back a bit further and I had a computer with a 1MHz processor and 64K of RAM. In the mid-90s I was running systems with hundreds of users on 60MHz processors and 128MB of RAM.

Yet I’m complaining about limitations of I/O on a machine which is considerably more powerful.

And then I think about all of the regions in the world where there isn’t good, stable electricity. Or internet access. Or libraries and books.

Or Food.

Or Water.

Or stable government. Of being able to walk outside my house with a reasonable expectation that I won’t be kidnapped or killed. My daughter can leave the house and go to school without worrying that she’ll be shot or stolen.

Suddenly I’m ashamed to be complaining about I/O constraints. First World problems, indeed.

Apr 15

Swarming Raspberry Pi: Private Registry for Swarm Images

Some more backstory on the Pi Swarm

I was really excited when Amazon announced their Lambda offering. I thought that it was an awesome idea, except for the lack of an open solution and the fact that it locked you into JavaScript.

I believe that using Docker, we can have a relatively simple Amazon Lambda work-alike which allows code from arbitrary languages to be run.

Along the way, I’ve investigated using Kubernetes, but it didn’t support ephemeral containers. It kept trying to resurrect the dead container. Hilarity ensued after a fashion.

Enter Swarm….

Swarm

I’d seen it mentioned that Swarm wasn’t working with registries which require authentication. Issue #374, in its history, indicates that there is no (as of yet) support for a registry requiring authentication.

It occurred to me that it might be possible to have swarm working with a local private registry via an insecure registry. A few tests later, and it’s alive!!!

Configuration

Registry

You’ll need to have a registry running. Swarming Raspberry Pi Part 2: Registry and Mirror has details.

Private registries have the standalone=true flag set. According to the documentation:

On first reading, it looks as though a registry cannot work as both a mirror and a private registry. In tests, however, I was able to use a private registry as both a mirror and a private registry. I believe, but have not verified, that the registry is passing index queries to the Docker Hub. I have verified that images cached on the local private registry are served locally. So… it might be thought of as a private registry which just so happens to act as an image cache for images from the canonical registry. However, it is not indexing the images. If you care to experiment, you can start a registry as follows:
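
Something along these lines, using the stock (v1) registry image; STANDALONE stays true, and the two MIRROR_* variables are the documented mirroring settings:

    docker run -d -p 5000:5000 \
        -e STANDALONE=true \
        -e MIRROR_SOURCE=https://registry-1.docker.io \
        -e MIRROR_SOURCE_INDEX=https://index.docker.io \
        --name registry registry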

On a Raspberry Pi, substitute nimblestratus/rpi-docker-registry for the image. One thing which I didn’t get working (although didn’t test extensively — the TCP/IP stack on my laptop gets fussy after repeatedly connecting to different networks and connecting/disconnecting from multiple VPNs) is to run a mirror registry as a docker container and pointing the docker daemon to the registry container. Part of me thinks it might work, but I can also see where it wouldn’t — Docker might attempt to talk to the registry on startup, realize it isn’t up, then give up. There’s a good chance that it’s a race condition, though I have not looked at the code as yet.

Note that it is both a STANDALONE and mirroring registry.

Daemon

On the host(s) which access the local private registry, the docker daemon needs to be configured to allow access to the private registry. You can set an environment variable, pass a command-line argument, or edit a config file. More information may be found in the Docker Documentation. I’ve used the config file; on Debian-based systems it’s generally located at /etc/default/docker. Add a line similar to the following at the end of the file:
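
For example (registry.local and port 5000 are placeholders for your own registry’s name and port):

    # /etc/default/docker
    DOCKER_OPTS="$DOCKER_OPTS --insecure-registry registry.local:5000"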

Once done, the daemon will need to be restarted. On a debian system it is typically sudo service docker restart.

Note: In my test, I added the private registry to a host which already had port 80 bound. Hence the specification of the port. Your name, etc., will vary. I have written some thoughts about the pros and cons of various private registry schemes at Good Practices for Configuring Docker Private Registries.

Catching My Breath

At this point, the Raspberry Pi Swarm has:

  1. Swarm
  2. Consul and Registrator
  3. A Private Registry and a Mirror
  4. Monitoring, available through Consul and Registrator. I am not sure how well they work with ephemeral containers, however; that is the subject of a future test. I may need to hack registrator to ignore ephemeral containers.

The remainder:

  1. Bootstrapping the Swarm and basic services
  2. Storage. I’ve got NFS working (it’s easy). I intend to evaluate:
    a. S3
    b. HDFS
    c. Ceph or Gluster
  3. Log aggregation
  4. Solving some “real” problems. ${WORK} is involved in Cheminformatics and authoritative chemical information. I’ve decided as a way to stretch the abilities of the swarm to do some substructure searching of chemical substances. I am not a chemist; I remember a good deal of my Advanced Placement Chemistry from ’87, but let’s just say I’m learning a lot. It’s good though, I think. I don’t know what’s impossible!

"Lamb"da in the Cloud

At this point I’d like to introduce Agni. Agni is my answer to Amazon’s Lambda. However, it differs in two major areas:

  • It’s Open Source and built upon Open Source.
  • It supports multiple languages — not just Javascript.

On a high level, code is registered with Agni and a container image is created. Part of the creation process entails specifying an event/message which will trigger the running of an instance of the container. When events are received, a listener spawns instances via swarm, passing the details of the message to the newly created Docker container.

More on Agni shortly…. Meanwhile I’m back to the cluster and seeing how I can leverage the work which the good people at Hypriot have done with Docker Machine (and to a lesser degree Kitematic since I don’t have a Mac).

Apr 14

Good Practices for Configuring Docker Private Registries

Private registries can be very helpful when using Docker — particularly if you want to be able to share images locally without either making them public or incurring the cost of a round trip. This post presents some practices which I think make life easier when using a private registry.

Where to look

Docker recognizes that an image is on a private registry when any of the following conditions occurs:

  • An explicit port is specified in the image name, such as registry:5000/foobar.
  • An IP address is used, such as 127.0.0.1 or 192.168.1.123.
  • A fully qualified domain name (FQDN) is used, such as registry.nimblestrat.us or registry.local.

By default, the registry port is 5000. By adhering to convention, it’s easy to look at an image and tell that it is coming from a private location. However, it’s extra typing and more to remember. I prefer using an FQDN and having the registry bind to port 80 — then the name alone suffices, assuming that the host has a good name (or CNAME record) such as registry.foo.bar.

How to use a private registry

In order to place an image into a private registry, you must first tag it with a name in which you have specified the location of the registry.

Each of these examples would work (assuming that a registry is bound to the IP/Port):

  • docker tag a1b2c3d4e5f6 127.0.0.1:5000/gnomovision
  • docker tag a1b2c3d4e5f6 registry.foo.bar/gnomovision
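
Once tagged, the push uses the same name (assuming a registry actually answers at that address):

    docker push registry.foo.bar/gnomovision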

However, the following wouldn’t work for pushing an image to a private registry:

  • docker tag a1b2c3d4e5f6 gnomovision — mere mortals cannot “bless” an image and make it part of the “Official Repositories”
  • docker tag a1b2c3d4e5f6 registry/gnomovision — in this case, it considers registry to be a userid for the Docker Hub. There is not enough information to tell it that you’re trying to send it to a host named registry.

Recommended Practices

  1. Either name a host registry or, better yet, use a CNAME record to alias a host as registry. That way you don’t have to remember that xyz.pdq.io is the registry.
  2. Bind to the HTTP port.
  3. Where possible, use authentication. Since my major use case is with Swarm and it does not as yet support authentication, I am investigating other means, such as only allowing connections from a local network. Socketplane is an option, too — have the registry listening on a private network address. Neither is perfect, but for the moment….

I’d love to hear what other folk think — are there practices which you use?

Apr 12

Docker Workers Scale Nicely with Multiple Cores

Disclaimer: The title might be a bit misleading. For this workload, it’s scaling pretty much linearly. Other workloads might scale differently.

I was running a quick-ish test on a Pi to see how long it would take to churn through 50GB of compressed data.

This processing consists of:

  1. Determining the files to be processed — this is done via an offset, since ultimately there will be 10 workers processing the data. I also think I get a slightly more representative set this way than by simply taking the first N files.
  2. For each file, start a Docker container which uncompresses the file to stdout, where data is extracted from the stream and appended to a file.
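
In shell terms each worker does something like the sketch below; the image is the one I pushed to the Hub, but the paths and the extract_smiles step are placeholders for the actual SDF Toolkit invocation:

    # worker.sh <compressed-sdf-file> -- hypothetical sketch
    # the container streams the extracted data to STDOUT via -a stdout,
    # and the host appends it to a local file
    docker run -a stdout -v /mnt/nfs/pubchem:/data nimblestratus/rpi-sdf-toolkit \
        sh -c "zcat /data/$1 | extract_smiles" >> extracted.smi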

The input is read over NFS from a server with a single 10/100 NIC and the output is written locally. Why NFS? In this case, it’s easy to configure and works well enough until proven otherwise.

top output demonstrates that the process doing the extracting is, indeed, working hard:

Eeek! The extractor (from the SDF Toolkit) is pretty much eating a CPU by itself. It may need to be replaced, depending on whether I have good enough results.

However, I am not going to optimize without testing and evaluating. It might just be the case that pegging the CPU is OK — when the whole swarm is working, I believe that IO is ultimately the limiting factor. I haven’t tested it yet, so I don’t have much confidence in it. As I’m writing, I begin to doubt it — even if NFS is stupidly chatty, this extraction should only be a one- or two-time event. I’d spend more time writing code to parse and then testing it than just letting it run. If this were happening on a regular basis, I think that I’d be more concerned. (I just had the glimmer of a fairly easy-to-implement AWK or Ruby streaming parser, so if I find myself performing the extraction more than I anticipate…)

That usage pattern remains consistent with more processors:

The following tests were performed on a Pi 2 with increasing amounts of parallelism:

Files / Total Size    Concurrent Containers    Elapsed Time                                     Avg Time/File (rounded)
3 / 43.6477MB         1                        real 13m46.731s (user 0m1.420s, sys 0m1.350s)    4:35
10 / 135.167MB        3                        real 17m22.104s (user 0m2.910s, sys 0m1.420s)    1:44
12 / 162.307MB        4                        real 15m45.713s (user 0m3.840s, sys 0m0.920s)    1:19

Note: 10 is not evenly divisible by 3, so the last file was running by itself.

Yes, the individual runs are slower (~4:35 for a single run vs. ~5:15 per file when 4 cores are in use); however, the multiple cores more than make up for it.

The number of concurrent processes was controlled by xargs:
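
Roughly like so (the glob and the script name are placeholders):

    ls /mnt/nfs/pubchem/*.sdf.gz | xargs -n 1 -P 4 ./worker.sh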

The -n 1 specifies that one argument (file) is sent to each invocation of the worker script. One advantage of doing it this way is that if one file finishes quicker than another (or is smaller) then processing is not held up.

The output is, on average, slightly more than 7MB for each of the 12 files. Small enough that I’m not too concerned about compressing them (yet) — I have 16GB MicroSD cards which are less than 25% full.

So…. there are 3664 files. Assuming 4 processes per Pi 2 and 1 per Pi B+, that gives me 25 workers among the 10 worker nodes. If I press additional hosts into service I could get up to ~37, at the expense of more hosts hitting a single NFS server. I think I shall copy the data to another data host and split the reads in half.

So, assuming 25 workers and each file taking about 1.5 minutes of wall clock time (padding for IO latency), I should be able to churn through the files in approximately 3 hours and 40 minutes (3664 files / 25 workers ≈ 147 files apiece). Even at 2 minutes per file, that is just under 5 hours. Not too terribly bad.

I might be able to get a little more performance if I allow the docker containers to use the host’s network stack. That’s a test for another day, however.

Apr 12

Docker Commandline Arguments are Context Sensitive

All I can say is that it was late when I wrote the script. And I was distracted between the feline overlords (one of whom is attempting to climb into my lap) and the babble box. That’s my story and I’m sticking to it. PEBKAC and ID10T errors were not involved.


Creative Commons licensed ( BY ) flickr photo shared by JeepersMedia

I’m processing 50GB of compressed cheminformatics data on the Pi Swarm, extracting certain pieces of data from substance records from NIH. I created a docker image containing Perl and the SDF Toolkit rather than writing my own parser. I tested a trivial case and pushed it to Docker Hub as nimblestratus/rpi-sdf-toolkit. So far, so good. Then, in order to get an idea of how long it would take to process the lot, I wrote a quick script to split up a chunk of the data among the Pis.

After the script determines its next chunk to process, it starts the container. However, my script was invoking Docker incorrectly.
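
The line was shaped something like this (the image is mine, the path a placeholder); note where the -v landed:

    # with -v before the subcommand, docker reads it as the global "print version"
    # flag rather than a volume, prints the version, and exits without running anything
    docker -v /mnt/nfs/pubchem:/data run nimblestratus/rpi-sdf-toolkit zcat /data/somefile.sdf.gz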

And it printed the version and came back immediately. Did not pass Go. Did not collect $200. Certainly didn’t process any of my data.

What I should have typed to mount a volume was:
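
Something along these lines (paths again placeholders); after run, -v means "mount a volume":

    docker run -v /mnt/nfs/pubchem:/data nimblestratus/rpi-sdf-toolkit zcat /data/somefile.sdf.gz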

So…. long story made short: order does matter. Now serving number 35.

Apr 09

Abusing Awk

Almost ashamed to admit I did this, yet it’s still kinda cool.

I use awk for a lot of commandline parsing; I learned it back in 1989…. before perl was much of a thing. For some problems, awk “just works”. So I wanted to count the number of instances of ‘A’ in a collection (25K) of long strings (each one >150 characters).

I thought about a quick way of counting these characters…. and it occurred to me that I could split() the string, using ‘A’ as the delimiter, then count the array size:
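
Something along these lines (the filename is a placeholder):

    # split() returns the number of fields it produced; N fields means N-1 'A's
    awk '{ count += split($0, parts, "A") - 1 } END { print count }' strings.txt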

This worked. Quite well, actually. I’m sure that there’s a much “better” way to do it, but this one works.

A little later it occurred to me that awk was already splitting the string.
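
In other words, set the field separator to ‘A’ and let NF do the counting:

    # with FS set to 'A', each line has (number of 'A's + 1) fields
    # (assumes no blank lines, which holds for this data)
    awk -F 'A' '{ count += NF - 1 } END { print count }' strings.txt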

One of the nice things about Unix is that there are usually five ways to do something, and it’s usually faster to do it the way you know how rather than spend the time looking up the “right” way.
