
Apr 19

Docker Containers: Smaller is not always better

Generally, smaller Docker containers are preferred to larger ones. However, a smaller container is not always as performant as a larger one. In my case, using a (slightly) larger container improved performance by over 30x.

TL;DR

The grep included in busybox is painfully slow. When using grep to process lots of data, add a (real) grep to the container.

Background

As discussed in Naive Substructure Substance Matching on the Raspberry Pi, I am exploring the limits of the Raspberry Pi for processing data. I chose substructure searching as a problem set because it is non-trivial and makes a decent demonstration for co-workers of the Pi's processing power.

I’ve pre-processed the NIH PubChem Compounds database to extract SMILES data (a language for describing the structure of chemical compounds). As a relatively naive first implementation, I’m using grep to match substructures. I have split the files among five Pi 2s; each processes ~840M in ~730 files, with xargs providing concurrent processing across multiple cores. After a few cycles, the entire data set is read into cache and each Pi can process it in 1-2 seconds for realistic searches. A ridiculous search, finding all of the carbon-containing compounds (over 13 million), takes 8-10 seconds.
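
A minimal sketch of the approach (the directory layout, batch size, and SMILES pattern here are illustrative assumptions, not the original scripts):

    # Fan the ~730 SMILES files out across the Pi 2's four cores.
    # -P 4 runs four greps at once; -n 32 hands each invocation a batch of files.
    # 'c1ccccc1' (a benzene ring) stands in for a real substructure query.
    find /data/smiles -name '*.smi' -print0 \
        | xargs -0 -P 4 -n 32 grep -F 'c1ccccc1'

Because SMILES encodes structure as plain text, a substring match like this is exactly the kind of naive first pass described above.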

Having developed a solution, I then set about dockerizing it.

I chose voxxit/alpine-rpi for my base; it's quite small (about 5 MB) and has almost everything needed. I discovered that the version of xargs which ships with the container does not support -P, so xargs is added via:
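
(The exact command didn't survive this copy; on Alpine, an xargs with -P support comes with GNU findutils, so it was presumably something along these lines:)

    # BusyBox xargs lacks -P; pull in the GNU version
    apk update && apk add findutils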

I ran my test and found that the performance was horrid.

I decided to drop into an interactive shell so that I could tweak. You can see the performance below in the ‘Before’.

Before:

Typically the performance of a large IO operation improves after a few cycles, as the system caches disk reads. It generally takes three cycles before all of the data is in the cache. However, the numbers above did not improve. I did verify that multiple cores were, indeed, being used.
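
For reference, an easy way to watch the warm-up (illustrative commands, not captures from the original run):

    # Run the identical search a few times; wall-clock time should drop
    # sharply once the ~840M of SMILES data is sitting in the page cache.
    for i in 1 2 3; do time grep -rF 'c1ccccc1' /data/smiles > /dev/null; done
    free    # watch buffers/cache climb toward the size of the data set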

I proceeded down a rabbit hole, looking at IO and VM statistics. Horrible. From there I googled to see if, indeed, Docker uses the disk cache (it does) and/or whether there was a flag I needed to set (there wasn't). Admittedly, I couldn't believe that IO under Docker could be that much slower, but I am a firm believer in testing my assumptions.
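
For what it's worth, that assumption is easy to test directly (a sketch; the paths and image are illustrative): run the same search over the same bind-mounted files from the host and from inside a container.

    # Same files, same kernel, same page cache; a big gap implicates
    # the image's userland (e.g. its grep), not Docker's IO path.
    time grep -rF 'c1ccccc1' /data/smiles > /dev/null
    docker run --rm -v /data/smiles:/data/smiles alpine \
        sh -c 'time grep -rF "c1ccccc1" /data/smiles > /dev/null'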

After poking about in /proc and /sys and running the search outside of Docker, I decided to see if there might be a faster grep. As it turns out, the container uses busybox:
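
This is easy to confirm from an interactive shell; on a stock Alpine/BusyBox image the familiar utilities are symlinks to a single multi-call binary:

    ls -l /bin/grep          # shows: /bin/grep -> /bin/busybox
    busybox 2>&1 | head -n 1 # prints the BusyBox version banner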

This is generally a good choice in terms of size. However, it appears that the embedded grep is considerably slower than molasses in January. On a whim I decided to install grep:
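
(Again the command is missing from this copy, but Alpine packages GNU grep simply as grep, so presumably:)

    # Install GNU grep; /usr/bin comes before /bin in PATH, so it
    # typically shadows the BusyBox applet at /bin/grep
    apk update && apk add grep
    grep --version    # should now report GNU grep rather than BusyBox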

I then re-ran the test and did a Snoopy Dance.

After:

Lessons Learned

This episode drove home the need to question assumptions. In this case the assumption was that a smaller container is inherently better. I believe that smaller and lighter containers are a Good Practice and an admirable goal. However, as seen here, smaller is not always better.

I also habitually look at a container's Dockerfile before pulling it. In this case that wasn't enough. It reinforced the lesson that I need to know what's actually running in a container before I try to use it.

4 comments


  1. Andreas Heissenberger

    It is more important to use the same base image for all your projects than to use different small containers. Docker shares image layers, so this way you save disk space. If you need a data container, use the same image you used for your application.

  2. Matt Williams

    Thank you for your comment.

    It depends, in my opinion. The case that you cite, namely a data container, should start from scratch (the empty filesystem). That saves the most space ;-). Likewise for Go executables living as a single file in a container.

    That said, I think you can make a case for fewer layers in a container; there is a cost to maintaining the layers.

    However, I'm also a pragmatist. If there's a utility tool, such as a database or a monitor, is the relatively small amount of space saved in the container's footprint worth the price of rebuilding the tool from a base container? I'd argue not.

    Also, putting everything in every container goes against the grain of the Unix philosophy: lots of tools which each do one thing well.

    Starting from the same base, such as voxxit/alpine-rpi, is something I could see for new development. But again, in many instances I don't think there is a sufficient return on the investment of my time and energy.

  3. jonnalley

    I think the title of your post is a bit misleading. Your problem is with busybox grep, and has nothing to do with container size. You would have suffered the same performance issue if you were using busybox grep on bare metal. BTW, if you are interested in a performant grep alternative, check out the silver searcher.

    http://geoff.greer.fm/ag/

  4. Matt Williams

    Well…. I’d gotten busybox’s grep by trying to make a smaller container… so at least in the headspace I was in at the time it made sense to me. I can see where you might find it misleading.

    I’ll definitely check out silver searcher, though. Thanks for the tip!

