I’ve been rewriting a cleanroom version of the hadoop-in-a-box, and it’s just about finished. Truth be told, the code is a bit tighter than the original encumbered version.
However, I ran into an interesting feature of volumes: I had thought to optimize things a bit, but instead got some unexpected behavior at O’dark-thirty.
There are some directories and files that really need to live outside the container, both for efficiency and to reduce overhead:
- HDFS-related directory trees — all of the writes quickly lead to confusion in the storage drivers I’ve used.
- Parcels on the worker nodes — these are also painful under memory constraints.
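As a sketch of what that looks like at run time, bind mounts keep both trees on the host without declaring anything in the image. All paths and the image name below are hypothetical; adjust them to your layout:

```shell
# Hypothetical host paths and image name.
# Bind-mounting keeps HDFS writes and the parcel cache out of the
# container's storage driver entirely.
docker run -d \
  -v /data/dfs:/dfs \
  -v /data/opt/cloudera/parcels:/opt/cloudera/parcels \
  cdh-worker
```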
I thought I’d get ahead of the curve and add VOLUME declarations to the base Dockerfile. For a variety of reasons I bootstrap the container in which Cloudera Manager runs — it certainly speeds things up, and it removes human intervention from a few steps. However, one of the directory trees, /opt, is one where I want different behavior between the manager and the worker nodes. So I went through the process of bootstrapping, downloading parcels to the manager, and committing the container, only to find that they’d disappeared.
After a few cycles of this, and some digging inside the container and the exported tar images, it occurred to me that I was seeing permission issues on /opt/cloudera that I hadn’t seen previously, and that files were disappearing. A quick check of the documentation revealed the following nuggets (emphasis my own):
> The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers…
>
> The docker run command initializes the newly created volume with any data that exists at the specified location within the base image…
>
> Note: If any build steps change the data within the volume after it has been declared, those changes will be discarded.
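That note describes exactly what I was hitting. A minimal reproduction of the build-time behavior looks something like this (the base image and file name are just for illustration):

```dockerfile
FROM centos:6
# Declare the volume first...
VOLUME /opt/cloudera
# ...then write into it. Per the note above, this change is discarded:
# containers started from the resulting image see an empty volume.
RUN mkdir -p /opt/cloudera && touch /opt/cloudera/parcel-manifest.json
```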
So, I was downloading the parcels only to have them go to the great bit-bucket in the sky. Premature optimization is the Enemy.
After removing the volume declaration and rebuilding the images, everything worked as expected.
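If you do still want the declaration baked into an image, ordering appears to be the workaround: populate the directory first and declare the VOLUME as one of the last instructions, since only changes made *after* the declaration are discarded. A hedged sketch, with a hypothetical parcel URL:

```dockerfile
FROM centos:6
# Download the parcels while /opt/cloudera is still an ordinary
# directory inside the image (URL is hypothetical).
RUN mkdir -p /opt/cloudera/parcels && \
    curl -o /opt/cloudera/parcels/CDH.parcel http://example.com/CDH.parcel
# Only now mark it as a volume: data present at declaration time is
# preserved and used to initialize the volume on `docker run`.
VOLUME /opt/cloudera
```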
I am curious whether there’s a way to “undeclare” a volume that was declared in a parent Dockerfile. I’ve not yet had a chance to play with it, however.