Apr 12

Docker Workers Scale Nicely with Multiple Cores

Disclaimer: The title might be a bit misleading. For this workload, it’s scaling pretty much linearly. Other workloads might scale differently.

I was running a quick-ish test on a Pi to see how long it would take to churn through 50GB compressed data.

This processing consists of:

  1. Determining the files to be processed — this is via an offset since ultimately there will be 10 workers processing the data. Also I think I might get a slightly more representative set than by simply taking the first N files.
  2. For each file, start a Docker container which uncompresses the file to stdout, where data is extracted from the stream and appended to a file.

The input is read over NFS from a server with a single 10/100 NIC and the output is written locally. Why NFS? In this case, it’s easy to configure and works well enough until proven otherwise.

top output demonstrates that the process doing the extracting is, indeed, working hard:

Eeek! The extractor (from the SDF Toolkit) is pretty much eating a CPU by itself. It may need to be replaced, depending on whether I have good enough results.

However, I am not going to optimize without testing and evaluating. It might just be the case that pegging the CPU might be ok — when the whole swarm is working, I believe that IO is ultimately the limiting factor. I haven’t tested it yet, so I don’t have much confidence in it. As I’m writing, I begin to doubt it — even if NFS is stupidly chatty, this extraction should only be a one or maybe two time event. I’d spend more time writing code to parse and then testing it than just letting it run. If this were happening on a regular basis, I think that I’d be more concerned. (I just had the glimmer of a fairly easy to implement AWK or Ruby streaming parser, so if I find myself performing the extraction more than I anticipate…)

That usage pattern remains consistent with more processors:

The following tests were performed on a Pi 2 with increasing amounts of parallelism:

Number of Files/Total Size Number of Concurrent Docker Containers Elapsed time Average time/file (rounded)
3/43.6477MB 1 real 13m46.731s
user 0m1.420s
sys 0m1.350s
10/135.167MB 3 real 17m22.104s
user 0m2.910s
sys 0m1.420s
12/162.307MB 4 real 15m45.713s
user 0m3.840s
sys 0m0.920s

Note: 10 is not evenly divisible by 3, so the last file was running by itself.

Yes, the individual runs are slower (~4:35 for one vs. 5:15 when 4 cores in use), however the multiple cores more than make up for it.

The number of concurrent processes was controlled by xargs:

The -n 1 specifies that one argument (file) is sent to each invocation of the worker script. One advantage of doing it this way is that if one file finishes quicker than another (or is smaller) then processing is not held up.

The output is, on average, slightly more than 7MB for each of the 12 files. Small enough that I’m not too concerned about compressing them (yet) — I have 16GB MicroSD cards which are less than 25% full.

So…. there are 3664 files. Assuming that I have 4 processes per Pi 2, and 1 per Pi B+, that will give me 25 among the 10 worker nodes. If I press additional hosts to work I could get up to ~37 at the expense of more hosts hitting
a single NFS server. I think I shall copy the data to another data host & split the reads in half.

So, assuming 25 workers and each file taking about 1.5 minutes of wall clock time (padding for IO latency), I should be able to churn through the files in approximately 3 hours and 40 min. Even at 2 minutes/file, that is about 4 hours. Not too terribly bad.

I might be able to get a little more performance if I allow the docker containers to use the host’s network stack. That’s a test for another day, however.

Leave a Reply

%d bloggers like this: