Generally, smaller Docker containers are preferred to larger ones. However, a smaller container is not always as performant as a larger one. In my case, moving to a (slightly) larger container improved performance by more than 30x.
TL;DR
The grep included in busybox is painfully slow. When using grep to process lots of data, add a (real) grep to the container.
Background
As discussed in Naive Substructure Substance Matching on the Raspberry Pi, I am exploring the limits of the Raspberry Pi for processing data. I chose substructure searching as a problem set because it is a non-trivial problem and a decent demonstration for co-workers of the processing power of the Pi.
I’ve pre-processed the NIH PubChem Compounds database to extract SMILES data; SMILES is a language for describing the structure of chemical compounds. As a relatively naive first implementation I’m using grep to match substructures. I have split the files amongst five Pi 2s; each is processing ~840M in ~730 files. xargs is used to do concurrent processing across multiple cores. After a few cycles, the entire data set is read into cache and the Pi is able to process it in 1-2 seconds for realistic searches. A ridiculous search, finding all of the carbon-containing compounds (over 13 million), takes 8-10 seconds.
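The xargs fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the exact script used on the Pis: the function name `count_matches`, the `.smi` file names, and the sample data are all made up, and `-F` (literal matching) is an added choice so that SMILES characters such as parentheses are never read as regex syntax. It also assumes an xargs that accepts `-P`, which is not required by POSIX.

```shell
#!/bin/sh
# Hypothetical helper: count lines containing a fixed string across all files
# in a directory, spreading the work over several grep processes.
count_matches() {
    pattern=$1    # substring to look for, e.g. a SMILES fragment
    dir=$2        # directory holding the data files
    jobs=${3:-4}  # number of concurrent grep processes (xargs -P)
    # -n 50 hands each grep up to 50 file names; -h suppresses file names so
    # that wc -l counts only matching lines; -F treats the pattern literally.
    (cd "$dir" && ls | xargs -P "$jobs" -n 50 grep -h -F -e "$pattern" | wc -l)
}

# Tiny demonstration on made-up data (not the PubChem extract).
demo=$(mktemp -d)
printf 'C1CCCCC1C=O\nCCO\n' > "$demo/a.smi"
printf 'CC(=O)O\nOC1CCCCC1C=O\n' > "$demo/b.smi"
count_matches 'C1CCCCC1C=O' "$demo"
rm -rf "$demo"
```

With four cores, `-P 4` keeps one grep busy per core, while `-n 50` keeps the number of process launches small relative to the ~730 files.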
Having developed a solution, I then set about dockerizing it.
I chose voxxit/alpine-rpi for my base; it’s quite small, about 5 MB, and has almost everything needed. I discovered that the version of xargs which ships with the container does not support -P, so I added a full xargs via:
    apk --update add findutils
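Whether a given xargs accepts -P can be probed before relying on it. The following is a small sketch under the assumption that an xargs without parallel support rejects the option and exits non-zero; the function name `xargs_supports_parallel` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the xargs on PATH accepts -P.
xargs_supports_parallel() {
    # Feed one dummy item; xargs runs `true probe`, which exits 0.
    # An xargs that does not know -P is assumed to fail on the option.
    echo probe | xargs -P 2 true >/dev/null 2>&1
}

if xargs_supports_parallel; then
    echo "xargs supports -P"
else
    echo "xargs lacks -P; consider: apk --update add findutils"
fi
```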
I ran my test and found that the performance was horrid.
I decided to drop into an interactive shell so that I could tweak. You can see the performance below in the ‘Before’.
Before:
    /opt/smiles # date;time /bin/ash -c " ls | xargs -P 4 -n 50 grep -h 'C1CCCCC1C=O'| wc -l ";date
    Sun Apr 19 14:25:54 GMT 2015
    19
    real 1m 4.21s
    user 3m 57.52s
    sys 0m 3.52s
    Sun Apr 19 14:26:58 GMT 2015
Typically the performance of a large IO operation will improve after a few cycles; the system is able to cache disk reads. It generally takes 3 cycles before all of the data is in the cache. However, the numbers above did not improve. I did verify that multiple cores were, indeed, being used.
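One way to watch for that warm-up effect is to repeat the same command a few times and compare wall-clock times. The sketch below is only an illustration: `time_cycles` is a hypothetical name, and `date +%s` gives one-second resolution, which is coarse but enough to tell a minute from a couple of seconds.

```shell
#!/bin/sh
# Hypothetical helper: run a command N times and print elapsed seconds per run.
time_cycles() {
    cycles=$1
    shift
    i=1
    while [ "$i" -le "$cycles" ]; do
        start=$(date +%s)
        "$@" >/dev/null 2>&1   # discard output; only the timing matters here
        end=$(date +%s)
        echo "cycle $i: $((end - start))s"
        i=$((i + 1))
    done
}

# Harmless demonstration; substitute the real search command as needed, e.g.
#   time_cycles 3 sh -c "ls | xargs -P 4 -n 50 grep -h 'C1CCCCC1C=O' | wc -l"
time_cycles 3 grep -c root /etc/passwd
```

If the later cycles are no faster than the first, as happened here, the bottleneck is probably not disk IO.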
I proceeded down a rabbit hole, looking at IO and VM statistics. Horrible. From there I googled to see if, indeed, Docker uses the disk cache (it does) and/or if there was a flag I needed to set (I didn’t). Admittedly, I couldn’t believe that IO using Docker could be that much slower, but I am a firm believer in testing my assumptions.
After poking about in /proc and /sys and running the search outside of Docker, I decided to see if there might be a faster grep. As it turns out, the container uses busybox:
    /opt/smiles # ls -li /bin/grep
    501101 lrwxrwxrwx 1 root root 12 Mar 6 13:27 /bin/grep -> /bin/busybox
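The same check can be scripted for any tool. This is a rough sketch assuming busybox applets are installed as symlinks to a binary named `busybox`, as in the listing above; `tool_impl` is a hypothetical name, and `readlink -f` is assumed to be available (the code falls back to the unresolved path if it is not).

```shell
#!/bin/sh
# Hypothetical helper: report whether a command resolves to busybox.
tool_impl() {
    path=$(command -v "$1") || { echo "missing"; return 1; }
    # Follow symlinks to the real binary; fall back to the plain path.
    target=$(readlink -f "$path" 2>/dev/null || echo "$path")
    case "$target" in
        */busybox) echo "busybox" ;;
        *)         echo "$target" ;;
    esac
}

tool_impl grep
```

Running it over the handful of tools a workload leans on (grep, xargs, sort, and so on) gives a quick picture of which ones are busybox applets.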
This is generally a good choice in terms of size. However, it appears that the embedded grep is considerably slower than molasses in January. On a whim I decided to install grep:
    /opt/smiles # apk search grep
    ngrep-1.45-r1
    grep-doc-2.20-r1
    grep-2.20-r1
    /opt/smiles # apk --update add grep
    fetch http://repos.lax-noc.com/alpine/v3.1/main/armhf/APKINDEX.tar.gz
    (1/2) Installing pcre (8.36-r1)
    (2/2) Installing grep (2.20-r1)
    Executing busybox-1.22.1-r14.trigger
    OK: 6 MiB in 18 packages
    /opt/smiles # which grep
    /usr/bin/grep
    /opt/smiles # ls -li /usr/bin/grep
    66417 -rwxr-xr-x 1 root root 189840 Feb 2 11:05 /usr/bin/grep
I then re-ran the test and did a Snoopy Dance.
After:
    /opt/smiles # date;time /bin/ash -c " ls | xargs -P 4 -n 50 grep -h 'C1CCCCC1C=O'| wc -l ";date
    Sun Apr 19 14:30:35 GMT 2015
    19
    real 0m 1.81s
    user 0m 4.39s
    sys 0m 2.38s
    Sun Apr 19 14:30:36 GMT 2015
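For reference, the speedup implied by the two `real` times (1m 4.21s before, 1.81s after) can be worked out with a little awk. The `speedup` function name is just for illustration.

```shell
#!/bin/sh
# Hypothetical helper: ratio of two durations in seconds, to one decimal place.
speedup() {
    awk -v before="$1" -v after="$2" 'BEGIN { printf "%.1f\n", before / after }'
}

# "real" times from the Before and After runs: 1m 4.21s = 64.21s, and 1.81s.
echo "wall-clock speedup: $(speedup 64.21 1.81)x"
```

This prints a speedup of 35.5x, consistent with the "over 30x" figure at the top of the post.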
Lessons Learned
This episode drove home the need to question assumptions. In this case the assumption is that a smaller sized container is inherently better. I believe that smaller and lighter containers are a Good Practice and an admirable goal. However, as seen here, smaller is not always better.
I also habitually look at a container’s Dockerfile before pulling it. In this case it wasn’t enough. It reinforced the lesson that I need to know what’s running in a container before I try to use it.
4 comments
Andreas Heissenberger
April 21, 2015 at 6:28 am (UTC -5)
It is more important to use the same container for all your projects instead of using different small containers. Docker shares the resources of images and this way you save disk space. If you need a data container – use the same image you used for your application
Matt Williams
April 21, 2015 at 8:35 am (UTC -5)
Thank you for your comment.
It depends, in my opinion. The case that you use, namely a data container, should start from scratch — the empty filesystem. That way saves the most space ;-). Likewise for go executables living as a single file in a container.
That said, I think you can make a case for fewer layers in a container; there’s a cost in maintaining the layers.
However, I’m also a pragmatist. If there’s a utility tool, such as a database or a monitor, is the relatively small amount of space saved in the footprint of the container worth the price of rebuilding the tool from a base container? I’d argue not.
Also, if I am to put everything in every container it goes against the grain of the unix philosophy — having lots of tools which do one thing well.
Starting from the same base, such as voxxit/alpine-rpi, is something I could see for new development. But again, in many instances I don’t think there is a sufficient return on the investment of my time and energy.
jonnalley
April 26, 2015 at 12:41 pm (UTC -5)
I think the title of your post is a bit misleading. Your problem is with busybox grep, and has nothing to do with container size. You would have suffered the same performance issue if you were using busybox grep on bare metal. BTW, if you are interested in a performant grep alternative check out the silver searcher.
http://geoff.greer.fm/ag/
Matt Williams
April 27, 2015 at 3:06 am (UTC -5)
Well…. I’d gotten busybox’s grep by trying to make a smaller container… so at least in the headspace I was in at the time it made sense to me. I can see where you might find it misleading.
I’ll definitely check out silver searcher, though. Thanks for the tip!