Disclaimer: The title might be a bit misleading. For this workload, it’s scaling pretty much linearly. Other workloads might scale differently.
I was running a quick-ish test on a Pi to see how long it would take to churn through 50GB of compressed data.
This processing consists of:
- Determining the files to be processed. This is done via an offset, since ultimately there will be 10 workers processing the data; I also think I get a slightly more representative set than by simply taking the first N files. (A sketch of one way to do this follows the list.)
- For each file, start a Docker container that uncompresses the file to stdout; data is extracted from the stream and appended to a file.
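To make the offset idea concrete, here is a minimal sketch of how a worker could pick its slice of the files. The mount point, file extension, and the `WORKER_ID`/`NUM_WORKERS` variables are placeholders of mine, not names from the actual scripts:

```bash
#!/bin/bash
# Hypothetical file-selection step: each worker strides through the full
# listing, starting at its own offset, rather than taking the first N files.
WORKER_ID=${1:-0}   # 0..9, assumed to be passed in per worker
NUM_WORKERS=10

ls /mnt/nfs/data/*.gz \
  | awk -v id="$WORKER_ID" -v n="$NUM_WORKERS" 'NR % n == id' \
  > file_list
```

A side effect of striding is that every worker gets a mix of files from across the whole set, which is the "more representative" property mentioned above.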
The input is read over NFS from a server with a single 10/100 NIC and the output is written locally. Why NFS? In this case, it’s easy to configure and works well enough until proven otherwise.
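For reference, the NFS side of this is deliberately plain; the mount on each Pi is something along these lines (the server name and paths are placeholders, not the actual configuration):

```bash
# Hypothetical read-only NFS mount of the input data on a worker Pi.
sudo mkdir -p /mnt/nfs/data
sudo mount -t nfs -o ro fileserver:/export/data /mnt/nfs/data
```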
`top` output demonstrates that the process doing the extracting is, indeed, working hard:
```
$ top
top - 21:12:03 up 2 days, 13:33,  2 users,  load average: 1.03, 2.09, 1.80
Tasks:  97 total,   2 running,  95 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.2 us,  0.2 sy,  0.0 ni, 74.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    947468 total,   554564 used,   392904 free,    48836 buffers
KiB Swap:        0 total,        0 used,        0 free,   438636 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6358 root      20   0  4736 4288 1752 R  97.8  0.5   3:27.19 extract_prop_sd
 6357 root      20   0  1100    8    0 S   2.6  0.0   0:05.27 gzip
```
Eeek! The extractor (from the SDF Toolkit) is pretty much eating a CPU by itself. It may need to be replaced, depending on whether the results are good enough.
However, I am not going to optimize without testing and evaluating. It may turn out that pegging the CPU is fine; when the whole swarm is working, I believe that IO will ultimately be the limiting factor. I haven’t tested that yet, so I don’t have much confidence in it. As I write this, I’m starting to doubt it: even if NFS is stupidly chatty, this extraction should only be a one- or maybe two-time event. I’d spend more time writing and testing a parser than just letting the current one run. If this were happening on a regular basis, I think I’d be more concerned. (I just had the glimmer of a fairly easy-to-implement AWK or Ruby streaming parser, so if I find myself performing the extraction more often than I anticipate…)
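For what it’s worth, here is the shape of that AWK idea: a streaming pass that keeps only the value line following a tagged data header in an SD file. The property name and file names are placeholders, and this is just a sketch of the approach, not the extractor actually in use:

```bash
# Hypothetical streaming extraction: gunzip to stdout and keep only the
# line that follows a "> <SOME_PROPERTY>" data header in each SD record.
gunzip -c input.sdf.gz | awk '
  grab { print; grab = 0 }            # the line after the header is the value
  /^> *<SOME_PROPERTY>/ { grab = 1 }  # SD data header for the wanted field
' >> extracted_values.txt
```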
That usage pattern remains consistent with more processors:
```
$ top
top - 21:15:14 up 2 days, 13:36,  2 users,  load average: 2.12, 1.69, 1.67
Tasks: 118 total,   5 running, 113 sleeping,   0 stopped,   0 zombie
%Cpu(s): 98.9 us,  1.0 sy,  0.0 ni,  0.0 id,  0.1 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:    947468 total,   500500 used,   446968 free,    48908 buffers
KiB Swap:        0 total,        0 used,        0 free,   369856 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 6534 root      20   0  4748 4272 1736 R  97.5  0.5   0:26.10 extract_prop_sd
 6545 root      20   0  4732 4360 1812 R  97.5  0.5   0:25.72 extract_prop_sd
 6558 root      20   0  4744 4292 1756 R  97.5  0.5   0:25.90 extract_prop_sd
 6554 root      20   0  4716 4292 1756 R  96.5  0.5   0:25.85 extract_prop_sd
 6532 root      20   0  1100    8    0 S   2.6  0.0   0:00.63 gzip
 6543 root      20   0  1100    8    0 S   2.3  0.0   0:00.57 gzip
 6557 root      20   0  1100    8    0 S   2.3  0.0   0:00.62 gzip
 6553 root      20   0  1100    8    0 S   2.0  0.0   0:00.55 gzip
```
The following tests were performed on a Pi 2 with increasing amounts of parallelism:
| Number of Files/Total Size | Number of Concurrent Docker Containers | Elapsed time | Average time/file (rounded) |
|---|---|---|---|
| 3 / 43.6477 MB | 1 | real 13m46.731s, user 0m1.420s, sys 0m1.350s | 4:35 |
| 10 / 135.167 MB | 3 | real 17m22.104s, user 0m2.910s, sys 0m1.420s | 1:44 |
| 12 / 162.307 MB | 4 | real 15m45.713s, user 0m3.840s, sys 0m0.920s | 1:19 |
Note: 10 is not evenly divisible by 3, so the last file was running by itself.
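The real/user/sys figures above presumably come from wrapping the whole run in `time`, something like the following, with the `-P` value varied per row (the `xargs` invocation itself is described next):

```bash
# Timing one run; -P was 1, 3, or 4 depending on the row above.
time (cat file_list | xargs -P 4 -n 1 ./worker.sh)
```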
Yes, the individual runs are slower (roughly 4:35 for a file run on its own vs. about 5:15 each when all 4 cores are in use), but the multiple cores more than make up for it.
The number of concurrent processes was controlled by `xargs`:
```bash
cat file_list | xargs -P $procs -n 1 worker.sh
```
The `-n 1` specifies that one argument (file) is sent to each invocation of the worker script. One advantage of doing it this way is that if one file finishes quicker than another (or is smaller), then processing is not held up.
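worker.sh itself isn’t shown here, so the following is only a sketch of what each invocation does, based on the description above. The image name, mount paths, output location, and the extractor invocation (its name is taken from the `top` output and may be truncated) are all assumptions:

```bash
#!/bin/bash
# Hypothetical worker.sh: xargs passes in exactly one compressed file.
set -e
file="$1"

# A container gunzips the NFS-hosted file to stdout; the SDF Toolkit
# extractor reads the stream and the result is appended to a local file.
# (The real extract_prop_sd arguments are not shown in the post.)
docker run --rm -v /mnt/nfs/data:/data:ro some/gzip-image \
    gunzip -c "/data/$(basename "$file")" \
  | extract_prop_sd \
  >> "/home/pi/output/$(basename "$file" .gz).out"
```

Because each invocation handles exactly one file, a short file simply frees its slot and `xargs` starts the next one.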
The output is, on average, slightly more than 7MB for each of the 12 files. Small enough that I’m not too concerned about compressing them (yet) — I have 16GB MicroSD cards which are less than 25% full.
So… there are 3664 files. Assuming 4 processes per Pi 2 and 1 per Pi B+, that gives me 25 workers across the 10 worker nodes. If I press additional hosts into service, I could get up to ~37, at the expense of more hosts hitting a single NFS server. I think I shall copy the data to another data host and split the reads in half.
So, assuming 25 workers and each file taking about 1.5 minutes of wall clock time (padding for IO latency), I should be able to churn through the files in approximately 3 hours and 40 minutes. Even at 2 minutes/file, that is just under 5 hours. Not too terribly bad.
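A quick back-of-the-envelope check of those numbers (25 workers assumed):

```bash
# ~147 files per worker; ~220 minutes at 1.5 min/file; ~293 minutes at 2 min/file.
echo "3664 / 25"       | bc -l   # 146.56 files per worker
echo "3664 / 25 * 1.5" | bc -l   # 219.84 minutes (~3 h 40 m)
echo "3664 / 25 * 2"   | bc -l   # 293.12 minutes (just under 5 h)
```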
I might be able to get a little more performance if I allow the docker containers to use the host’s network stack. That’s a test for another day, however.
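If I do try it, the change amounts to adding Docker’s host-network flag to whatever `docker run` invocation ends up in the worker, e.g. applied to the hypothetical sketch above:

```bash
# Same hypothetical container invocation, but sharing the host's network stack.
docker run --rm --net=host -v /mnt/nfs/data:/data:ro some/gzip-image \
    gunzip -c "/data/$file"
```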