Disclaimer: The title might be a bit misleading. For this workload, it’s scaling pretty much linearly. Other workloads might scale differently.
I was running a quick-ish test on a Pi to see how long it would take to churn through 50GB of compressed data.
This processing consists of:
- Determining the files to be processed. This is done via an offset, since ultimately there will be 10 workers processing the data; I also think I get a slightly more representative set than by simply taking the first N files. (A sketch of one way to do this follows the list.)
- For each file, start a Docker container that uncompresses the file to stdout; data is extracted from the stream and appended to a file.
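To make the offset idea concrete, here is a minimal sketch of how a worker could pick its slice of the files. The mount point, file extension, and the `WORKER_ID`/`NUM_WORKERS` variables are placeholders of mine, not names from the actual scripts:

```bash
#!/bin/bash
# Hypothetical file-selection step: each worker strides through the full
# listing, starting at its own offset, rather than taking the first N files.
WORKER_ID=${1:-0}   # 0..9, assumed to be passed in per worker
NUM_WORKERS=10

ls /mnt/nfs/data/*.gz \
  | awk -v id="$WORKER_ID" -v n="$NUM_WORKERS" 'NR % n == id' \
  > file_list
```

A side effect of striding is that every worker gets a mix of files from across the whole set, which is the "more representative" property mentioned above.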
The input is read over NFS from a server with a single 10/100 NIC and the output is written locally. Why NFS? In this case, it’s easy to configure and works well enough until proven otherwise.
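For reference, the NFS side of this is deliberately plain; the mount on each Pi is something along these lines (the server name and paths are placeholders, not the actual configuration):

```bash
# Hypothetical read-only NFS mount of the input data on a worker Pi.
sudo mkdir -p /mnt/nfs/data
sudo mount -t nfs -o ro fileserver:/export/data /mnt/nfs/data
```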
`top` output demonstrates that the process doing the extracting is, indeed, working hard:
```
$ top
top - 21:12:03 up 2 days, 13:33,  2 users,  load average: 1.03, 2.09, 1.80
Tasks:  97 total,   2 running,  95 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.2 us,  0.2 sy,  0.0 ni, 74.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    947468 total,   554564 used,   392904 free,    48836 buffers
KiB Swap:        0 total,        0 used,        0 free,   438636 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6358 root      20   0  4736 4288 1752 R  97.8  0.5   3:27.19 extract_prop_sd
 6357 root      20   0  1100    8    0 S   2.6  0.0   0:05.27 gzip
```
Eeek! The extractor (from the SDF Toolkit) is pretty much eating a CPU by itself. It may need to be replaced, depending on whether the results are good enough.
However, I am not going to optimize without testing and evaluating. It may turn out that pegging the CPU is fine; when the whole swarm is working, I believe that IO will ultimately be the limiting factor. I haven’t tested that yet, so I don’t have much confidence in it. As I write this, I’m starting to doubt it: even if NFS is stupidly chatty, this extraction should only be a one- or maybe two-time event. I’d spend more time writing and testing a parser than just letting the current one run. If this were happening on a regular basis, I think I’d be more concerned. (I just had the glimmer of a fairly easy-to-implement AWK or Ruby streaming parser, so if I find myself performing the extraction more often than I anticipate…)
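For what it’s worth, here is the shape of that AWK idea: a streaming pass that keeps only the value line following a tagged data header in an SD file. The property name and file names are placeholders, and this is just a sketch of the approach, not the extractor actually in use:

```bash
# Hypothetical streaming extraction: gunzip to stdout and keep only the
# line that follows a "> <SOME_PROPERTY>" data header in each SD record.
gunzip -c input.sdf.gz | awk '
  grab { print; grab = 0 }            # the line after the header is the value
  /^> *<SOME_PROPERTY>/ { grab = 1 }  # SD data header for the wanted field
' >> extracted_values.txt
```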
That usage pattern remains consistent with more processors:
```
$ top
top - 21:15:14 up 2 days, 13:36,  2 users,  load average: 2.12, 1.69, 1.67
Tasks: 118 total,   5 running, 113 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.9 us,  1.0 sy,  0.0 ni,  0.0 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    947468 total,   500500 used,   446968 free,    48908 buffers
KiB Swap:        0 total,        0 used,        0 free,   369856 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6534 root      20   0  4748 4272 1736 R  97.5  0.5   0:26.10 extract_prop_sd
 6545 root      20   0  4732 4360 1812 R  97.5  0.5   0:25.72 extract_prop_sd
 6558 root      20   0  4744 4292 1756 R  97.5  0.5   0:25.90 extract_prop_sd
 6554 root      20   0  4716 4292 1756 R  96.5  0.5   0:25.85 extract_prop_sd
 6532 root      20   0  1100    8    0 S   2.6  0.0   0:00.63 gzip
 6543 root      20   0  1100    8    0 S   2.3  0.0   0:00.57 gzip
 6557 root      20   0  1100    8    0 S   2.3  0.0   0:00.62 gzip
 6553 root      20   0  1100    8    0 S   2.0  0.0   0:00.55 gzip
```
The following tests were performed on a Pi 2 with increasing amounts of parallelism:
| Number of Files/Total Size | Number of Concurrent Docker Containers | Elapsed time | Average time/file (rounded) |
|---|---|---|---|
| 3 / 43.6477 MB | 1 | real 13m46.731s, user 0m1.420s, sys 0m1.350s | 4:35 |
| 10 / 135.167 MB | 3 | real 17m22.104s, user 0m2.910s, sys 0m1.420s | 1:44 |
| 12 / 162.307 MB | 4 | real 15m45.713s, user 0m3.840s, sys 0m0.920s | 1:19 |
Note: 10 is not evenly divisible by 3, so the last file was running by itself.
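The real/user/sys figures above presumably come from wrapping the whole run in `time`, something like the following, with the `-P` value varied per row (the `xargs` invocation itself is described next):

```bash
# Timing one run; -P was 1, 3, or 4 depending on the row above.
time (cat file_list | xargs -P 4 -n 1 ./worker.sh)
```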
Yes, the individual runs are slower (roughly 4:35 for a file run on its own vs. about 5:15 each when all 4 cores are in use), but the multiple cores more than make up for it.
The number of concurrent processes was controlled by `xargs`:
```bash
cat file_list | xargs -P $procs -n 1 worker.sh
```
The `-n 1` specifies that one argument (file) is sent to each invocation of the worker script. One advantage of doing it this way is that if one file finishes quicker than another (or is smaller), then processing is not held up.
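worker.sh itself isn’t shown here, so the following is only a sketch of what each invocation does, based on the description above. The image name, mount paths, output location, and the extractor invocation (its name is taken from the `top` output and may be truncated) are all assumptions:

```bash
#!/bin/bash
# Hypothetical worker.sh: xargs passes in exactly one compressed file.
set -e
file="$1"

# A container gunzips the NFS-hosted file to stdout; the SDF Toolkit
# extractor reads the stream and the result is appended to a local file.
# (The real extract_prop_sd arguments are not shown in the post.)
docker run --rm -v /mnt/nfs/data:/data:ro some/gzip-image \
    gunzip -c "/data/$(basename "$file")" \
  | extract_prop_sd \
  >> "/home/pi/output/$(basename "$file" .gz).out"
```

Because each invocation handles exactly one file, a short file simply frees its slot and `xargs` starts the next one.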
The output is, on average, slightly more than 7MB for each of the 12 files. Small enough that I’m not too concerned about compressing them (yet) — I have 16GB MicroSD cards which are less than 25% full.
So… there are 3664 files. Assuming 4 processes per Pi 2 and 1 per Pi B+, that gives me 25 workers across the 10 worker nodes. If I press additional hosts into service, I could get up to ~37, at the expense of more hosts hitting a single NFS server. I think I shall copy the data to another data host and split the reads in half.
So, assuming 25 workers and each file taking about 1.5 minutes of wall clock time (padding for IO latency), I should be able to churn through the files in approximately 3 hours and 40 minutes. Even at 2 minutes/file, that is just under 5 hours. Not too terribly bad.
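A quick back-of-the-envelope check of those numbers (25 workers assumed):

```bash
# ~147 files per worker; ~220 minutes at 1.5 min/file; ~293 minutes at 2 min/file.
echo "3664 / 25"       | bc -l   # 146.56 files per worker
echo "3664 / 25 * 1.5" | bc -l   # 219.84 minutes (~3 h 40 m)
echo "3664 / 25 * 2"   | bc -l   # 293.12 minutes (just under 5 h)
```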
I might be able to get a little more performance if I allow the docker containers to use the host’s network stack. That’s a test for another day, however.
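If I do try it, the change amounts to adding Docker’s host-network flag to whatever `docker run` invocation ends up in the worker, e.g. applied to the hypothetical sketch above:

```bash
# Same hypothetical container invocation, but sharing the host's network stack.
docker run --rm --net=host -v /mnt/nfs/data:/data:ro some/gzip-image \
    gunzip -c "/data/$file"
```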