TL;DR: When using AUFS in a memory-constrained environment, Java can spawn (lots of!) zombies. A workaround is to change the Docker storage driver to devicemapper.
While working on the Hadoop-in-a-box CDH cluster with Cloudera Manager, I discovered a few interesting things about AUFS. These experiences are with Ubuntu 14.04 and Docker 1.9.1. Others have reported similar results using Java in Docker without CDH.
I did my initial development of the CDH-in-a-box containers in environments with 32 GB and 24 GB of RAM, switching to the latter when I was informed the target was a host with 24 GB. With that much memory, everything just worked: no zombies. However, people started running it on hosts with less RAM, and Java started spawning zombies. So I took a closer look.
I had previously noticed that the amount of cached and buffered memory seemed awfully high to me, but I knew that Linux uses it to optimize I/O. As it turns out, this memory doesn't seem to be "free-able" when using AUFS. Add Java to the mix, and weird things occur.
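A quick way to check whether that cached memory is actually reclaimable is to ask the kernel to drop its clean caches and compare `free` output before and after (a minimal check, run as root; `drop_caches` only releases clean pages, hence the `sync` first):

```
# Snapshot memory, drop clean page cache/dentries/inodes, compare.
free -m
sync                                 # flush dirty pages so caches are clean
echo 3 > /proc/sys/vm/drop_caches    # requires root
free -m
```

If AUFS is holding on to that memory, the cached figure stays stubbornly high even after the drop.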
I tested on a quad-core, 12 GB host, running up the manager and three workers. And then the zombies appeared. In very short order (minutes) I had 260 zombies! This is in part due to supervisord restarting the failed JVMs.
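For the curious, a one-liner to watch the carnage (standard procps `ps`; zombies report process state `Z`):

```
# Count processes in the zombie (Z) state
ps -eo stat= | grep -c '^Z'

# List them with their parent PIDs to see which process should be reaping them
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
```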
This necessitated a reboot. Once rebooted, I started doing some research.
I found a couple of items hinting at issues and workarounds. I then decided to test the devicemapper driver and set about converting my AUFS rig to it. After a few iterations, the least invasive steps are as follows:
- `docker ps -aq | xargs docker rm -f`
- `docker images -q | xargs docker rmi`
- `service docker stop`
- Edit `/etc/default/docker` and add the following to the end: `DOCKER_OPTS="${DOCKER_OPTS} --storage-driver=devicemapper"`
- `service docker start`
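Once Docker is back up, it's worth verifying that the new driver actually took; if `docker info` still reports aufs, the `DOCKER_OPTS` line isn't being picked up:

```
# Should now report "Storage Driver: devicemapper"
docker info | grep -i 'storage driver'
```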
Now you can restart the cluster. I did so, and once things stabilized, I started adding services back to the cluster. I did not tweak any parameters, except:
- DataNode Default Group / Resource Management: `dfs.datanode.max.locked.memory = 65536 B`; this alleviates "Cannot start datanode because the configured max locked memory size… is more than the datanode's available RLIMIT_MEMLOCK ulimit," as documented at Apache Hadoop 2.4.1 – Hadoop Distributed File System-2.4.1 – Centralized Cache Management in HDFS.
- Service-Wide / Replication: `dfs.replication = 1`
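For reference, if you manage configurations by hand rather than through Cloudera Manager, the same two settings would live in `hdfs-site.xml`, roughly like this (an illustrative sketch with the values above; `dfs.datanode.max.locked.memory` is specified in bytes):

```
<!-- hdfs-site.xml (sketch): the two tweaks made above -->
<property>
  <name>dfs.datanode.max.locked.memory</name>
  <!-- bytes; must not exceed the datanode's RLIMIT_MEMLOCK ulimit -->
  <value>65536</value>
</property>
<property>
  <name>dfs.replication</name>
  <!-- single-host test cluster, so one replica suffices -->
  <value>1</value>
</property>
```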
Notes and Caveats
- YMMV, but changing to devicemapper seems to slow things down by about 10%. However, particularly in a test/development environment, I'd rather be stable and not spawning zombies!
- This is not using LVM-backed storage.
- Ubuntu 14.04 is on kernel 3.13; other storage-driver options (such as overlay) emerge post-3.18.
- I have been able to run quite a few more services on an OpenStack instance with 24 GB of RAM: