Ramblings

Weaving with Light Pt. 1

This is the first in a series of posts regarding a recent project which integrated handweaving, fiber optics, and electronics. It’s a part of a costume for a cosplayer at work, but I’ll be discussing my part of it. TL;DR For those who can’t wait, here’s what the project looks like in the dark: And …

Abusing HAProxy: Stupid Simple Easy Dashboards

I wanted a simple way to have a dashboard to show if hosts and services are alive & didn’t want to write much code and/or run up a nagios instance (or anything like that). All I care is whether it’s green or red. I’d already been setting up HAProxy for a proxy forwarder, so I …

Rules for Operations

The following list was compiled in 2012 for a talk on Operations Principles for Developers (Ops4Devs). They are loosely inspired by the list of rules from Zombieland as well as from my experiences and those shared by others. Looking over the list four years later, I believe that they are still (very) applicable for all …

DevOps Creed (Work in Progress)

This is a work in progress of a DevOps Creed. It will always be a work in progress as I and others learn and grow. Suggestions are welcome! I have drunk deep of the DevOps Kool-Aid. From the visions which ensued, I have come to the following…. I Believe: DevOps methodologies lead to systems which …

I am not a Mindreader: a mini-saga

I must confess a severe failing on my part. I am not a mindreader. I am not privy to the thoughts in your head. I do not know your needs or desires. And I am certainly not aware of your expectations. This is why requirement documents exist. Please use them.

Aug 31

Weaving with Light Pt. 1

Categories:

cosplay, electronics, hardware, textiles, weaving

by Matt Williams

TL;DR

For those who can’t wait, here’s what the project looks like in the dark:

Handweaving with fiber optics viewed in the dark

And in the light:

You can still see the glow, it’s just not as bright.

In the Beginning

I recently started a new job; when one of my co-workers heard I am a weaver, she approached me with a challenge: to weave fiber optics into a fabric so that it would have an otherworldly glow. Originally the idea was for a Patronus from Harry Potter; we pivoted to a ghost from Ghostbusters.

I spent some time researching on the net; I have only found a couple of other instances of handweavers making fabric with embedded fiber optics. So it’s pushing the envelope in that regard 😉

Just the Facts

200 fiber optic strands, 2m in length
32 bright LEDs
Over 100 solder joints
5v input
640 mA draw
2 watt resistor
“Skirt” is separated into 8 strands, each 4″ wide
6.25 fiber optics/inch
Two types of cotton thread used for fabric structure
30 hours loom time
10 hours prototype
15 hours research and shopping
25 hours constructing wiring and LED Harness
LEDs are swappable
Two power busses to distribute to LEDs
Power regulator has a fan.
Almost everything (except fabric) can be swapped out and/or replaced.

I know this is something of a tease, but I’ll write more soon!

Jun 24

Abusing HAProxy: Stupid Simple Easy Dashboards

Categories:

devops, Docker, hack, infrastructure, monitoring

by Matt Williams

I’d already been setting up HAProxy for a proxy forwarder, so I got the idea to turn on the stats page and just have a set of backends which HAProxy would check.

Sample config follows:

global
    daemon

defaults
    maxconn 250
    timeout connect 5s
    timeout client 5s
    timeout server 5s

listen stats 0.0.0.0:2001
    mode http
    log global
    timeout client 50s
    timeout connect 50s
    timeout server 50s
    stats enable
    stats hide-version
    stats refresh 30s
    stats show-node
    stats uri /
    stats auth admin:admin

backend CheckMe
    mode tcp
    server s1 xxx.xxx.xxx.xxx:yy check
    server s2 xxx.xxx.xxx.xxx:zz check

Create the file. If you don’t even feel like installing HAProxy, you can run it in Docker:

docker run -d --restart=always -p 2001:2001 --name stupid-simple-mon \
-v `pwd`/haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg:ro \
haproxy:1.5

Point your browser to http://localhost:2001/ and enter user/pass as admin and you’re good to go. It even refreshes every 30 seconds.
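If a script needs the same red/green answer, HAProxy can also serve the stats page as CSV when `;csv` is appended to the stats URI (here that would be http://localhost:2001/;csv). The following is a minimal sketch of reading that export; the function names and the trimmed sample data are made up for illustration, and a real export carries many more columns.

```python
import csv
import io


def parse_stats_csv(text):
    """Return (proxy, server, status) tuples from HAProxy's CSV stats export."""
    # The header line starts with "# "; strip it so the column names are clean.
    cleaned = text.lstrip("# ")
    rows = csv.DictReader(io.StringIO(cleaned))
    return [(r["pxname"], r["svname"], r["status"]) for r in rows]


def down_servers(text):
    """List real servers (not FRONTEND/BACKEND summary rows) that are not UP."""
    return [
        (px, sv, st)
        for px, sv, st in parse_stats_csv(text)
        if sv not in ("FRONTEND", "BACKEND") and not st.startswith("UP")
    ]


if __name__ == "__main__":
    # Hypothetical, heavily trimmed sample of the export.
    sample = (
        "# pxname,svname,status\n"
        "CheckMe,s1,UP\n"
        "CheckMe,s2,DOWN\n"
        "CheckMe,BACKEND,UP\n"
    )
    for px, sv, st in down_servers(sample):
        print(f"{px}/{sv} is {st}")
```

Anything that is not reported as UP is treated as red, which matches the "green or red" spirit of the dashboard.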

Apr 25

Rules for Operations

Categories:

devops, infrastructure, monitoring, philosophy

by Matt Williams

Exercise your environment
Test
Beware of Assumptions
Monitors
Be a smart reporter
Read the rest of this entry »

Apr 24

DevOps Creed (Work in Progress)

Categories:

devops, disaster recovery, philosophy

by Matt Williams

This is a work in progress of a DevOps Creed. It will always be a work in progress as I and others learn and grow. Suggestions are welcome!

I have drunk deep of the DevOps Kool-Aid. From the visions which ensued, I have come to the following….

I Believe:

DevOps methodologies lead to systems which are Repeatable, Reproducible, and Reliable.
You are not accountable (or responsible) if you will never get a call in the middle of the night.
Without shared pain, systems and software will not improve.
The goal of operations is to not get unexpected calls in the middle of the night. If I am doing my job, then preventable issues are dealt with before they become a problem.
Post Mortems need to be blameless and a learning tool to mitigate or prevent future occurrences.
ITIL describes a set of best practices; it doesn’t dictate how to implement the practices. Just because they come from “suits” doesn’t mean they’re not valid.
Learning is key to growth. Once you stop learning, you’re dead.
It is important to question one’s assumptions. How many RFCs have there been for email over the past three decades?
Utopia contains no pets and only cattle. It is an ideal to be pursued.
Infrastructure as code and revision control make disaster recovery much less painful.
Reduce the likelihood of operator error: automate whenever appropriate.
Pair programming and code review work for infrastructure, too.

Apr 11

I am not a Mindreader: a mini-saga

Categories:

mini sagas, rants

by Matt Williams

I must confess a severe failing on my part. I am not a mindreader.

I am not privy to the thoughts in your head. I do not know your needs or desires. And I am certainly not aware of your expectations.

This is why requirement documents exist. Please use them.

Apr 07

Cloudera Manager Disaster Recovery with JSON Deployment Dump

Categories:

cloudera manager, disaster recovery, gotchas

by Matt Williams

Cloudera Manager is fairly opinionated. In its defence, it pretty much needs to be, given that it has to wrangle multiple underlying Open Source projects. Each of these, in turn, has its own quirks and opinions.

The following is a description of how to recover a Cloudera Manager cluster post disaster, assuming that you have a copy of the deployment. I will say that this is something of a hack; it treats the cluster a bit too much like a pet, but you could make a case that the Cloudera Manager’s deployment dump behaves similarly to infrastructure as code.

One of the tricks I’d previously discovered which can help is that the /var/lib/cloudera-scm-agent/uuid file can either be generated by the agent or you can choose your own. It is used by the Cloudera Manager as a primary key for hosts on which the agent lives. In the case of a disaster or server crash, if replacement hostnames and IP addresses remain the same (since they are set once within Cloudera Manager and cannot change) then the hosts can be dropped back into the cluster without creating multiple records of the same hostname. A means of doing so would be something like:

    #!/bin/bash

    python -c 'import socket; \
              print socket.getfqdn(), \
                    socket.gethostbyname(socket.getfqdn())' | \
    md5sum | \
    sed -e 's/ .*//' | \
    tr -d '\n' > /var/lib/cloudera-scm-agent/uuid

Note the call to tr above. The uuid file is used explicitly, so if there are any linefeeds in the file, the linefeed becomes part of the UUID inside Cloudera Manager. While it is legal, this can make some API calls awkward.
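The same idea can be sketched in Python 3, which may be handy where the Python 2 `print` statement above is not available. This is an illustrative equivalent, not something taken from Cloudera: it hashes the text "<fqdn> <ip>" plus the newline that `print` emits, and writes the digest without a trailing linefeed. The function names are invented for the example.

```python
import hashlib
import socket


def host_uuid(fqdn, ip):
    """MD5 hex digest of "<fqdn> <ip>" plus newline, as the shell pipeline hashes it."""
    line = f"{fqdn} {ip}\n"  # Python 2's print statement adds the trailing newline
    return hashlib.md5(line.encode("utf-8")).hexdigest()


def local_host_uuid():
    """Derive the UUID for the machine this runs on (needs working name resolution)."""
    fqdn = socket.getfqdn()
    return host_uuid(fqdn, socket.gethostbyname(fqdn))


def write_uuid(path, value):
    """Write the UUID with no trailing linefeed, mirroring the tr -d step."""
    with open(path, "w") as fh:
        fh.write(value)
```

Calling `write_uuid("/var/lib/cloudera-scm-agent/uuid", local_host_uuid())` as root would reproduce what the shell version does.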

However, if you aren’t able to reuse the same UUID(s), or if the UUID is overwritten, say by Puppet, after the cluster is created, all is not lost. You likely will have some cleanup to perform, but it’s not insurmountable.

The Cloudera Manager API is very powerful and opens up many possibilities — not the least of which is opening a door into the mind of the Manager and how it thinks. One of the calls, /cm/deployment, I had figured would work for backing up and restoring the state of a cluster. I’d tested it previously in a single node cluster, so I knew it could work — at least in the small!

I had an opportunity to test my theory in a larger cluster this evening.

The basic symptoms were a dashboard full of red and no messages in the logs when you attempted to restart a server — you could start the Agent, but it wouldn’t do much good — it wasn’t speaking to the Manager.

I determined (after some experimentation) that the reason why they weren’t speaking very well was that the UUID was being overwritten by Puppet. I started giggling. In a warped, BOFH sense, it is actually pretty funny what was causing the cluster to misbehave. All I could think was that the cluster was most definitely borked by a master!

bork

The Cloudera Manager instance was still available so I had a starting point from which to work — although if I’d had a backup of the deployment configuration it would still have worked if the cluster was totally dead.

I set out to replace the hostId entries and figured that was all I would need to do. Turns out there was a bit more than just that.

Here is a basic set of steps to recover:

Note: Replace MANAGER with the name of the host on which the Cloudera Manager is located. Also, replace the authentication user/pass as needed — it’s unlikely (I hope!) that you’re still using admin/admin for user/pass.

If the cluster is still up, then dump the hosts. The information you need is in the deployment, but it’s convenient to pull it from the hosts: curl -u 'admin:admin' http://MANAGER:7180/api/v11/hosts > hosts.json
If you don’t have a dump of the deployment, you can get one via:
curl -u 'admin:admin' http://MANAGER:7180/api/v11/cm/deployment > deployment.json
In this case, I needed all of the new UUIDs. You may be able to skip ahead to step 7 if the UUIDs haven’t changed. Modifications may be needed for the IPs/hosts if your replacement cluster is on a different network. Alternatively, you could use this to create a template and reproduce the cluster in different networks.

    mkdir -p uuids 2>/dev/null
    for i in $(grep hostname hosts.json | sed -e 's/^.*: "//' -e 's/".*//' | sort | uniq); do
        echo $i
        scp ${i}:/var/lib/cloudera-scm-agent/uuid uuids/$i
    done
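The grep/sed extraction of hostnames depends on how the JSON happens to be laid out. As a hedged alternative, the list can be pulled by parsing the dump; the sketch below assumes hosts.json has a top-level `items` array whose entries carry `hostname` (the same shape the Ruby script in the next step relies on). The sample data in the test is invented.

```python
import json


def hostnames(hosts_json_text):
    """Sorted, de-duplicated hostnames from a dump of the hosts API call."""
    hosts = json.loads(hosts_json_text)
    return sorted({item["hostname"] for item in hosts["items"]})


if __name__ == "__main__":
    # Hypothetical two-host dump, with one host listed twice.
    sample = json.dumps({"items": [
        {"hostname": "n2.example.com", "hostId": "old-2"},
        {"hostname": "n1.example.com", "hostId": "old-1"},
        {"hostname": "n2.example.com", "hostId": "old-2"},
    ]})
    for name in hostnames(sample):
        print(name)
```

The printed names can feed the same scp loop.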

We’re going to use sed to replace all of the UUID entries. The following ruby code will generate our sed script for us:

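    #!/usr/bin/env ruby
    # Generate a shell command (saved as fixer.sed) that rewrites every old
    # hostId in deployment.json to the UUID collected in uuids/<hostname>.

    require 'json'

    hosts = JSON.parse(File.read("hosts.json"))

    # One -e expression per host: s/<old hostId>/<new uuid>/
    sed_script = hosts["items"].map do |host|
                 newid=File.read("uuids/#{host["hostname"]}")
                 "-e 's/#{host["hostId"]}/#{newid}/'"
    end.join(" \\\n")

    # "-i .bak" is the BSD/macOS form of in-place editing; GNU sed expects "-i.bak".
    sed_script = "sed -i .bak #{sed_script} deployment.json"

    File.write("fixer.sed",sed_script)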

Save it to `sedder.rb` and run it `ruby sedder.rb`.
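For comparison, here is a rough sketch of the same substitution done directly in Python rather than by generating a sed command. This is a swapped-in technique, not what was actually run: it replaces every occurrence of an old hostId (sed's `s/old/new/` only touches the first match on each line), and the helper names and sample values are hypothetical.

```python
import json


def host_id_map(hosts_json_text, new_ids):
    """Map each old hostId to the new UUID collected for the same hostname."""
    hosts = json.loads(hosts_json_text)["items"]
    return {
        h["hostId"]: new_ids[h["hostname"]]
        for h in hosts
        if h["hostname"] in new_ids
    }


def remap_host_ids(deployment_text, mapping):
    """Replace every old hostId in the deployment text with its new value."""
    for old, new in mapping.items():
        deployment_text = deployment_text.replace(old, new)
    return deployment_text


if __name__ == "__main__":
    hosts = '{"items": [{"hostname": "n1.example.com", "hostId": "old-1"}]}'
    mapping = host_id_map(hosts, {"n1.example.com": "new-1"})
    print(remap_host_ids('{"hostRef": {"hostId": "old-1"}}', mapping))
```

Writing the result to a new file instead of editing in place keeps the original deployment.json as the backup.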

Paranoia check time. Look at the generated fixer.sed script. If the UUIDs were generated, then there is a good chance that they will have a “\n” in them, in which case you may need to edit the sed script.
Run the script which was just generated: bash -x fixer.sed
At this point, there is likely some more fixing needed. The following is needed because:

When specifying roles to be created, the names provided for each role must not conflict with the names that CM auto-generates for roles. Specifically, names of the form “<service name>-<role type>-<arbitrary value>” cannot be used unless the <arbitrary value> is the same one CM would use. If CM detects such a conflict, the error message will indicate what is safe to use. Alternately, a differently formatted name should be used. — Cloudera Manager API

Cloudera Manager played fickle and didn’t think that the “arbitrary” values it thought safe previously were still any good. I searched a bit, but could not find anything to tell how it calculates those arbitrary safe values :-/. They are a large hexadecimal number; I was able to identify that part, which led (after much experimentation) to the following fix.

Save the following to fixer.awk:

/-[0-9a-f]+"/{gsub(/-/,"_")} {print}

Execute it via: awk -f fixer.awk deployment.json > fixed-deployment.json
More Paranoia. Look at the output JSON text — if it looks borked, then don’t deploy it!
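Part of that paranoia can be automated. The sketch below mirrors the awk rule in Python (on any line containing a hyphen followed by hex digits and a closing quote, turn every hyphen into an underscore) and then checks that the result still parses as JSON. The function names and the sample role name are illustrative only; parsing cleanly does not prove that Cloudera Manager will accept the deployment.

```python
import json
import re

# Same pattern as the awk script: "-<hex digits>" immediately before a quote.
ROLE_SUFFIX = re.compile(r'-[0-9a-f]+"')


def fix_role_names(text):
    """Apply the awk rewrite line by line and return the new text."""
    out = []
    for line in text.splitlines():
        if ROLE_SUFFIX.search(line):
            line = line.replace("-", "_")  # gsub(/-/, "_") on the whole line
        out.append(line + "\n")
    return "".join(out)


def looks_like_json(text):
    """True if the text parses as JSON at all."""
    try:
        json.loads(text)
    except ValueError:
        return False
    return True
```

Running `looks_like_json` over fixed-deployment.json before the PUT catches a truncated or mangled file early.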
Install the fixed deployment

    curl -u 'admin:admin' -X PUT -i -H "content-type:application/json" -d @fixed-deployment.json \
            'http://MANAGER:7180/api/v11/cm/deployment?deleteCurrentDeployment=true'

At this point, if all has gone well, it accepts the deployment. You’ll need to restart the cluster(s). I strongly suggest starting the Management Cluster first.

After the dust settles, you’ll have a repaired and/or newly deployed cluster.

In my case, I was mostly there when I found that Puppet was also overwriting the Manager server entry in /etc/cloudera-scm-agent/config.ini with localhost.

doh

Good luck and let me know what you think.

Mar 30

Interesting Feature of Dockerfile Volume Directives

Categories:

cloudera, cloudera manager, Docker, gotchas, hadoop

by Matt Williams

I’ve been rewriting a cleanroom version of the hadoop-in-a-box — just about finished. And, truth be told, the code, all in all, is a bit tighter than the original encumbered version.

However, I ran into an interesting feature of Volumes — I had thought perhaps to optimize things a bit, but it caused some unexpected behavior at O’dark thirty.

There are some directories and files that really need to be outside of the container for purposes of efficiency and reducing overhead:

hdfs related directory trees — All of the writes soon lead to confusion in the storage drivers I’ve used.
parcels on the worker nodes — these are also painful when there are constraints of memory

I thought I’d get ahead of the curve and add Volume declarations to the base Dockerfile. For a variety of reasons I bootstrap the container in which the Cloudera Manager is running; it certainly helps to speed things up and it also removes human intervention from a few steps. However, /opt is one of the directory trees where I want different behaviors between the manager and the worker nodes. So, I went through the process of bootstrapping, downloading parcels to the manager, and committing the container, only to find that they’d disappeared.

After a few cycles of this and looking inside the container and exported tar images, it occurred to me that I was seeing issues with /opt/cloudera permissions which I hadn’t previously and files were disappearing. A quick check of the documentation revealed the following nuggets (emphasis my own):

The VOLUME instruction creates a mount point with the specified name and marks it as holding externally mounted volumes from native host or other containers…
The docker run command initializes the newly created volume with any data that exists at the specified location within the base image….
Note: If any build steps change the data within the volume after it has been declared, those changes will be discarded.

So, I was downloading the parcels only to have them go to the great bit-bucket in the sky. Premature optimization is the Enemy.

After removing the volume declaration and rebuilding the images, everything worked as expected.

I am curious if there’s a way to “undeclare” a volume which was in a parent dockerfile. I’ve not had a chance yet to play with it, however.

Mar 17

Docker, Cgroups, Memory Constraints, and Java: A Cautionary Tale, or Here be Reapers (sometimes)

Categories:

Docker, Experiments, gotchas, Java, JVM, linux

by Matt Williams

TL;DR: Java and cgroups/Docker memory constraints don’t always
behave as you might expect. Always explicitly specify JVM heap
sizes. Also be aware that kernel features may not be enabled. And Linux… lies.

I’ve recently discovered an interesting “quirk” in potential
interactions between Java, cgroups, Docker, and the kernel which can
cause some surprising results.

Unless you explicitly state heap sizes, the JVM makes guesses about
sizing based on the host on which it runs. In general, on any “server
class” machine (which now refers to just about anything other than a
Windows desktop or a Raspberry Pi), the JVM defaults to a maximum
heap size of approximately 1/4 of the RAM on the host. Where this
becomes interesting is that limiting the amount of memory available
to a container does not affect what the JVM believes is available.

Last year I wrote in
Looking Inside a JVM: -XX:+PrintFlagsFinal
about finding the values configured in the JVM at runtime. By not
specifying a heap size, I get the following on a host with 12G of ram:

$ java -XX:+PrintFlagsFinal -version|grep -i heapsize|egrep 'Initial|Max'
java version "1.8.0_74"
Java(TM) SE Runtime Environment (build 1.8.0_74-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.74-b02, mixed mode)
    uintx InitialHeapSize                          := 188743680                           {product}
    uintx MaxHeapSize                              := 2988441600                          {product}

Notice that the MaxHeapSize is ~3GB.
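The arithmetic behind that number is worth spelling out. A rough sketch, assuming the 1/4 ratio described above and taking 256m to mean 256 MiB; the function and constant names are illustrative, not JVM settings.

```python
def default_max_heap(ram_bytes, fraction=4):
    """Default maximum heap if the JVM takes 1/fraction of the RAM it sees."""
    return ram_bytes // fraction


REPORTED_MAX_HEAP = 2988441600       # MaxHeapSize from the output above
CONTAINER_LIMIT = 256 * 1024 * 1024  # a 256m container limit, in bytes

if __name__ == "__main__":
    # Working backwards: the RAM the JVM must have seen to pick this heap size.
    print("RAM seen by the JVM:", REPORTED_MAX_HEAP * 4, "bytes")
    # How many times over a 256m limit that default heap would be.
    print("Heap / container limit:", REPORTED_MAX_HEAP / CONTAINER_LIMIT)
```

Four times the reported heap is just under 12 GB, which lines up with the 12G host, and the default heap is roughly eleven times larger than a 256m container would allow.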

You ever look inside of Java …. in Docker? — Half Brewed

$ docker run --rm java java  -XX:+PrintFlagsFinal -version |grep -i heapsize | egrep 'Initial|Max'
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)
    uintx InitialHeapSize                          := 188743680                           {product}
    uintx MaxHeapSize                              := 2988441600                          {product}

It’s the same. Ok, let’s set the max memory size of the container to
256m (-m 256m) and try again:

$ docker run -m 256m --rm java java  -XX:+PrintFlagsFinal -version |grep -i heapsize | egrep 'Initial|Max'

WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.
openjdk version "1.8.0_72-internal"
OpenJDK Runtime Environment (build 1.8.0_72-internal-b15)
OpenJDK 64-Bit Server VM (build 25.72-b15, mixed mode)
    uintx InitialHeapSize                          := 188743680                           {product}
    uintx MaxHeapSize                              := 2988441600                          {product}

Note the Warning…. we’ll come back to it later (much later)

And… it’s the same.

Fabio Kung has written an interesting discussion of
Memory inside Linux containers
and the reasons why system calls do not return the amount of
memory inside a container. In short, the various tools and system
calls (including those which the JVM invokes) were created before
cgroups and have no concept that such limits might exist.
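A process that wants the real ceiling has to ask cgroups itself. The sketch below shows the idea as two pure functions: take the total from /proc/meminfo and the limit from the cgroup memory controller, and use the smaller. The usual file locations are /sys/fs/cgroup/memory/memory.limit_in_bytes (cgroup v1) and /sys/fs/cgroup/memory.max (cgroup v2), but treat those paths as assumptions about how the host is mounted; the function names are invented.

```python
def mem_total_bytes(meminfo_text):
    """Total RAM according to /proc/meminfo (the value is reported in kB)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            return int(line.split()[1]) * 1024
    raise ValueError("MemTotal not found")


def effective_limit(meminfo_text, cgroup_limit_text):
    """Smaller of the host total and the cgroup limit, in bytes."""
    total = mem_total_bytes(meminfo_text)
    raw = cgroup_limit_text.strip()
    if raw == "max":  # cgroup v2 spelling of "no limit"
        return total
    # cgroup v1 reports a huge number when unlimited; min() handles that too.
    return min(total, int(raw))
```

Feeding it the contents of /proc/meminfo and the cgroup limit file from inside a container started with -m 256m should give 268435456 rather than the host total.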

So, how much memory is actually available to the
JVM? Let’s start with a class which eats memory. I found the following
code at
Java memory test – How to consume all the memory (RAM) on a computer:

import java.util.Vector;

//  The following is from:
//  http://alvinalexander.com/blog/post/java/java-program-consume-all-memory-ram-on-computer


public class MemoryEater
{
    public static void main(String[] args)
    {
        Vector v = new Vector();
        while (true)
            {
                byte b[] = new byte[1048576];
                v.add(b);
                Runtime rt = Runtime.getRuntime();
                System.out.println( "free memory: " + rt.freeMemory() );
            }
    }
}

We can use the Docker Java container to compile it:

docker run --rm -v "$PWD":/usr/src/myapp -w /usr/src/myapp java javac MemoryEater.java

Now that it is compiled, let’s test:

docker run --name memory_eater -d -v "$PWD":/usr/src/myapp -w /usr/src/myapp -m 256m java java -XX:+PrintFlagsFinal -XX:OnOutOfMemoryError="echo Out of Memory" -XX:ErrorFile=fatal.log MemoryEater

There are a few interesting flags:

Flag	Explanation
-XX:OnOutOfMemoryError="echo Out of Memory"	Instruct the JVM to run a command (here, echo a message) when an [OutOfMemoryError](https://docs.oracle.com/javase/7/docs/api/java/lang/OutOfMemoryError.html) is first thrown
-XX:ErrorFile=fatal.log	When a fatal error occurs, an error log is created with information and the state obtained at the time of the fatal error. ([Fatal Error Log – Troubleshooting Guide for Java SE 6 with HotSpot VM](http://www.oracle.com/technetwork/java/javase/felog-138657.html))

Betwixt the two flags, we should get some indication of an error….

Testing, Testing….

The tests were performed in a variety of scenarios:

Environment	Docker Version	Ram	Swap	Docker Memory Constraint	Note(s)
4 Core, Openstack Instance	1.8.3	24G	0	`--memory=256m`	[HCF](https://en.wikipedia.org/wiki/Halt_and_Catch_Fire) within seconds — the OOMKiller kills the process.
4 Core, Physical	1.10.3	12G	15G	`--memory=256m`	Runs for a while and ends with OutOfMemoryError
8 Core, Physical	1.9.1	32G	32G	`--memory=256m`	Runs for about 5 minutes and exits with OutOfMemoryError
8 Core, Physical	1.9.1	32G	32G	`--memory=255m --memory-swap=256m`	Runs for about 5 minutes and exits with OutOfMemoryError
8 Core, Physical	1.9.1	32G	32G	`--memory=255m --memory-swap=256m`	Kernel level swap accounting turned on. OOMKiller strikes almost immediately.

In each case, the OS is Ubuntu 14.04 and the Docker container is java:latest.

I was expecting that the jvm would quickly attempt to grow beyond the container constraints and be killed. In the first test, it behaved as I expected. The container starts and then the logs abruptly end:

.....
free memory: 185915048
free memory: 184866456
free memory: 183817864

Upon inspection of the container, I see that it was killed by the OOMKiller:

....
"State": {
    "Running": false,
    "Paused": false,
    "Restarting": false,
    "OOMKilled": true,
    "Dead": false,
    "Pid": 0,
    "ExitCode": 137,
    "Error": "",
    "StartedAt": "2016-03-15T21:21:48.845032635Z",
    "FinishedAt": "2016-03-15T21:21:49.140794192Z"
},
....
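The two fields that matter here are OOMKilled and ExitCode: 137 is 128 + 9, the conventional way of reporting that a process died from signal 9 (SIGKILL). A small sketch of interpreting the State block from `docker inspect`; the function name and the wording of the messages are made up for illustration.

```python
def describe_exit(state):
    """Summarise how a container ended, given the "State" dict from docker inspect."""
    code = state.get("ExitCode", 0)
    # By convention, exit codes above 128 mean "terminated by signal (code - 128)".
    signal = code - 128 if code > 128 else None
    if state.get("OOMKilled"):
        suffix = f" (signal {signal})" if signal else ""
        return "killed by the OOM killer" + suffix
    if signal:
        return f"terminated by signal {signal}"
    if code != 0:
        return f"exited with status {code}"
    return "exited cleanly"


if __name__ == "__main__":
    print(describe_exit({"OOMKilled": True, "ExitCode": 137}))
```

For the State shown above this prints "killed by the OOM killer (signal 9)".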

Odd behavior, but just as I expected. cgroups is enforcing the amount
of space used by a container, but when the JVM or any other program
queries for the available memory, it doesn’t interfere:

matt@nimbus:~/memory_eater$ free
             total       used       free     shared    buffers     cached
Mem:      32414832    1228224   31186608        948     262900     451720
-/+ buffers/cache:     513604   31901228
Swap:     33013756          0   33013756
matt@nimbus:~/memory_eater$ docker run --rm --memory=256m -it ubuntu /bin/bash
root@e584d1c56f32:/# free
             total       used       free     shared    buffers     cached
Mem:      32414832    1237240   31177592       1012     262956     451844
-/+ buffers/cache:     522440   31892392
Swap:     33013756          0   33013756
root@e584d1c56f32:/# exit

At this point I decided that I had an interesting enough topic to write about. Little did I know but that I was about to go…..

Down the Rabbit Hole

I sat down to diligently write about my findings. Re-running the test on my laptop (the second entry in the table above), I was surprised to find that it behaved differently.

At first I thought it might be due to differences in Docker versions, so I tried on the 3rd host, where it ran even longer than on the laptop!

269.180: [Full GC (Ergonomics) [PSYoungGen: 1252864K->1252371K(1274368K)] [ParOldGen: 5401989K->5401989K(5402624K)] 6654853K->6654360K(6676992K), [Metaspace: 2574K->2574K(1056768K)], 3.7960775 secs] [Times: user=11.39 sys=0.91, real=3.80 secs] 
272.978: [Full GC (Allocation Failure) [PSYoungGen: 1252371K->1252371K(1274368K)] [ParOldGen: 5401989K->5401977K(5402624K)] 6654360K->6654349K(6676992K), [Metaspace: 2574K->2574K(1056768K)], 87.2372140 secs] [Times: user=529.11 sys=34.78, real=87.24 secs] 
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at MemoryEater.main(MemoryEater.java:15)
Heap
 PSYoungGen      total 1274368K, used 1252864K [0x000000071b200000, 0x000000076c400000, 0x00000007c0000000)
  eden space 1252864K, 100% used [0x000000071b200000,0x0000000767980000,0x0000000767980000)
  from space 21504K, 0% used [0x0000000768e80000,0x0000000768e80000,0x000000076a380000)
  to   space 21504K, 0% used [0x0000000767980000,0x0000000767980000,0x0000000768e80000)
 ParOldGen       total 5402624K, used 5401978K [0x00000005d1600000, 0x000000071b200000, 0x000000071b200000)
  object space 5402624K, 99% used [0x00000005d1600000,0x000000071b15ea50,0x000000071b200000)
 Metaspace       used 2604K, capacity 4486K, committed 4864K, reserved 1056768K
  class space    used 273K, capacity 386K, committed 512K, reserved 1048576K

(note the insane length of the garbage collection; this should have been my clue that something was seriously weird!)

I didn’t find anything indicating that memory constraints behaved differently between the 1.8.3 and more current versions.

I then wondered if it might be related to HugePageTables. As of 2011, Documentation/cgroups/memory.txt [LWN.net] states:

Kernel memory and Hugepages are not under control yet. We just manage
pages on LRU.

Ok… let’s see if they’re enabled:

sudo hugeadm --explain

Yup… I had them.

I then disabled HugePages on the 8 core host:

sudo hugeadm --thp-never

Ok, disabled. I rebooted for paranoia and re-ran my test. Still failed. Grump.

It was time to….

The Docker Run Reference section on memory constraints specifies that there are four scenarios for setting user memory usage:

No memory limits; the container can use as much as it likes. (Default behavior)
Specify memory, but no memory-swap — the container ram is limited and it may use an equivalent amount of swap as memory.
Specify memory and infinite (-1) memory-swap — the container is limited in ram, but not in swap.
Specify memory and memory-swap to set the total amount. In this case, memory-swap needs to be larger than memory:

$ docker run --rm --memory=255m --memory-swap=128m -it ubuntu /bin/bash
Error response from daemon: Minimum memoryswap limit should be larger than memory limit, see usage.

The total amount is denoted by memory-swap.
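The four scenarios map onto docker run flags roughly as follows. This is a sketch only: memory_flags is a hypothetical helper, sizes are assumed to be whole megabytes, and it merely prints the flags after applying a conservative version of the "memory-swap larger than memory" rule quoted above.

```shell
# memory_flags [MEM_MB [TOTAL_MB]]   (hypothetical helper)
#   no arguments      -> no limits (scenario 1)
#   MEM_MB only       -> --memory only (scenario 2)
#   TOTAL_MB = -1     -> limited RAM, unlimited swap (scenario 3)
#   TOTAL_MB > MEM_MB -> RAM plus swap capped at TOTAL_MB (scenario 4)
memory_flags() {
  mem_mb=${1:-}
  total_mb=${2:-}
  if [ -z "$mem_mb" ]; then
    echo ""
  elif [ -z "$total_mb" ]; then
    echo "--memory=${mem_mb}m"
  elif [ "$total_mb" = "-1" ]; then
    echo "--memory=${mem_mb}m --memory-swap=-1"
  elif [ "$total_mb" -gt "$mem_mb" ]; then
    echo "--memory=${mem_mb}m --memory-swap=${total_mb}m"
  else
    echo "memory-swap (${total_mb}m) must be larger than memory (${mem_mb}m)" >&2
    return 1
  fi
}

# Example: 256m of RAM plus 256m of swap, 512m in total.
# docker run --rm $(memory_flags 256 512) -it ubuntu /bin/bash
```

Because memory-swap is the total, the swap actually available to the container is memory-swap minus memory.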

Aha! I’ll just set these flags and run my container again…. Drat.

It still isn’t working.

And swap keeps growing and growing….

By now, it’s going on 3AM, but I’m definitely going to figure this out.

At this point I remembered the warning:

WARNING: Your kernel does not support swap limit capabilities, memory limited without swap.

A little bit of googling and I find that I need to set a kernel parameter. This can be done via grub.

You will need to edit /etc/default/grub — it is owned by root, so you will likely need to sudo.

On the GRUB_CMDLINE_LINUX line, edit it to add

cgroup_enable=memory
swapaccount=1

If there are no other arguments, it will look like this:

GRUB_CMDLINE_LINUX="cgroup_enable=memory swapaccount=1"

If there are other arguments, then just add the above; you’ll end up with something along the lines of:

GRUB_CMDLINE_LINUX="acpi=off noapic cgroup_enable=memory swapaccount=1"

Next, run sudo update-grub && sudo reboot
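The edit can also be scripted. A minimal sketch, assuming GRUB_CMDLINE_LINUX sits on a single double-quoted line as in the examples above; add_swap_accounting is a hypothetical helper that prints the modified file to stdout rather than editing it in place, so the result can be reviewed first:

```shell
# add_swap_accounting FILE   (hypothetical helper)
# Prints FILE with the two kernel arguments appended to GRUB_CMDLINE_LINUX.
# Lines that already contain swapaccount=1 are left untouched.
add_swap_accounting() {
  sed '/^GRUB_CMDLINE_LINUX=/ {
    /swapaccount=1/!s/"$/ cgroup_enable=memory swapaccount=1"/
    s/^GRUB_CMDLINE_LINUX=" /GRUB_CMDLINE_LINUX="/
  }' "$1"
}

# Review the output, then install it by hand:
# add_swap_accounting /etc/default/grub > /tmp/grub.new
```

The pattern ends in "=" so it does not touch GRUB_CMDLINE_LINUX_DEFAULT.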

Once the host reboots, the warning disappears and the JVM is killed as expected:

   "State": {
        "Status": "exited",
        "Running": false,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": true,
        "Dead": false,
        "Pid": 0,
        "ExitCode": 137,
        "Error": "",
        "StartedAt": "2016-03-16T07:06:51.254992071Z",
        "FinishedAt": "2016-03-16T07:06:51.724280821Z"
    },
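Two things in that output are worth checking by script: the OOMKilled flag and the exit code. Exit codes above 128 conventionally mean "killed by a signal", with the signal number being the code minus 128, so 137 is signal 9 (SIGKILL), which is what the OOM killer sends. A sketch with hypothetical helper names; CONTAINER is a placeholder:

```shell
# was_oom_killed: succeeds if the inspect JSON on stdin reports OOMKilled true.
was_oom_killed() {
  grep -q '"OOMKilled": *true'
}

# signal_from_exit CODE: for exit codes above 128, print the signal number.
signal_from_exit() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    return 1
  fi
}

signal_from_exit 137   # prints 9 (SIGKILL)
# docker inspect CONTAINER | was_oom_killed && echo "reaped by the OOM killer"
```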

Conclusion

The reason it behaved as expected on the OpenStack instance was that there is no swap on the instance. Since there is no swap to be had, the container is, by necessity, limited to the size of the memory specified. And the JVM instance was reaped by the OOM killer, as I’d expected it would be.

This was definitely an instance of accidental success!

“The most exciting phrase to hear in science, the one that heralds new discoveries, is not ‘Eureka!’ but ‘That’s funny…’” (Isaac Asimov)

I’m glad I went down the rabbit hole on this one; I learned a good bit even if it took considerably longer than I’d expected.

A few caveats with which to leave you:

It is best to always specify heap sizes when using the JVM. Don’t depend on heuristics. They can, have, and do change from version to version, let alone operating system and a host of other variables.
Assume that the OS lies and there’s less memory than it tells you. I haven’t even mentioned Linux’ “optimistic malloc” yet.
Know thy system. Understand how the different pieces work together.
And remember…. No software, just like no plan, survives contact with the …. user.
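On the first caveat, one way to avoid relying on JVM heuristics is to derive -Xmx from the container limit and pass both explicitly. A sketch under stated assumptions: heap_mb is a hypothetical helper, IMAGE is a placeholder, and the 75% ratio is an arbitrary illustration (the JVM also needs room for metaspace, thread stacks, and native memory), not a figure taken from the tests above.

```shell
# heap_mb LIMIT_MB: print a heap size in MB that leaves 25% headroom
# below the container limit (hypothetical helper, illustrative ratio).
heap_mb() {
  echo $(( $1 * 3 / 4 ))
}

limit_mb=512
xmx="-Xmx$(heap_mb "$limit_mb")m"
echo "$xmx"   # prints -Xmx384m
# docker run --rm --memory=${limit_mb}m IMAGE java "$xmx" MemoryEater
```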

Mar 15

Zombie Apocalypse! Docker AUFS + Java + Low Memory …. Hadoop in a Box Cloudera Manager Cluster

Categories:

cloudera manager, Docker, gotchas, howto, Java, troubleshooting, Uncategorized

by Matt Williams

TL;DR: When using AUFS in a memory-constrained environment, Java can spawn zombies (lots of them!). A workaround is to change the storage driver to the device mapper.

In working on the Hadoop in a box CDH cluster with Cloudera Manager, I’ve discovered a few interesting things about AUFS. These experiences are with Ubuntu 14.04 and Docker 1.9.1. Others have reported similar results using Java in Docker without CDH.

I did my initial development of the CDH in a box containers in environments with 32G and 24G ram, switching to the latter when I was informed the target was for a host with 24G. With that amount of memory, everything just worked and no zombies. However, people started placing it on hosts with less ram and Java started spawning zombies. So I took a closer look.

I had previously noticed that the amount of cached and buffered memory seemed, to me, awfully high, but I know that Linux uses it for optimizing IO. As it turns out, this memory doesn’t seem to be “free-able” when using aufs. Add Java to this, and weird things occur.

I tested on a quad-core, 12G host, running up the manager and three workers. And then the zombies appeared. In very short order (minutes) I had 260 zombies! This is in part due to supervisord restarting the failed JVMs.
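Counting zombies is easy to script: in ps output their state column starts with Z. A minimal sketch, assuming a procps-style ps; count_zombies is a hypothetical helper that reads one state string per line on stdin:

```shell
# count_zombies: print how many state strings on stdin start with Z.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_zombies() {
  grep -c '^Z' || true
}

# One state string per process, no header line.
ps -eo stat= | count_zombies
```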

This necessitated a reboot. Once rebooted I started to do some research.

I found a couple of items hinting at issues and workarounds. I then decided to test the device mapper driver and set about converting my aufs rig to device mapper. After a few iterations, the least invasive steps are as follows:

docker ps -aq | xargs docker rm -f
docker images -q | xargs docker rmi
service docker stop
Edit /etc/default/docker and add the following to the end: DOCKER_OPTS="${DOCKER_OPTS} --storage-driver=devicemapper"
service docker start
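To confirm the change took effect, docker info reports the active driver on a "Storage Driver:" line. A small sketch; storage_driver is a hypothetical helper that reads docker info output on stdin:

```shell
# storage_driver: print the value of the "Storage Driver:" line from
# docker info output supplied on stdin.
storage_driver() {
  sed -n 's/^ *Storage Driver: *//p'
}

# After the restart this should print devicemapper:
# docker info 2>/dev/null | storage_driver
```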

Now you can restart the cluster. I did so and once things stabilized, started adding services back to the cluster. I did not tweak any parameters, except:

DataNode Default Group / Resource Management: dfs.datanode.max.locked.memory = 65536 B. This alleviates “Cannot start datanode because the configured max locked memory size… is more than the datanode’s available RLIMIT_MEMLOCK ulimit,” as documented at Apache Hadoop 2.4.1 – Hadoop Distributed File System-2.4.1 – Centralized Cache Management in HDFS.
Service-Wide / Replication: dfs.replication = 1
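The locked-memory error boils down to a simple comparison: the configured value, in bytes, must not exceed the locked-memory ulimit of the user running the DataNode. Assuming ulimit -l reports kilobytes (as bash does), 65536 B is exactly 64 KB. A sketch with a hypothetical helper name:

```shell
# fits_memlock BYTES LIMIT_KB: succeed if BYTES fits within LIMIT_KB
# kilobytes. A limit of "unlimited" always fits.
fits_memlock() {
  [ "$2" = "unlimited" ] && return 0
  [ "$1" -le $(( $2 * 1024 )) ]
}

fits_memlock 65536 64 && echo "65536 B fits within a 64 KB memlock limit"
# Against the live limit of the current user:
# fits_memlock 65536 "$(ulimit -l)" && echo "ok"
```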

After 2+ hours, no zombies

I started Zookeeper, HDFS, and Yarn.

Notes and Caveats

YMMV, but changing to the device mapper seems to slow things down about 10%. However, particularly in a test/development environment, I’d rather be stable and not spawning zombies!
This is not using the LVM backed storage.
Ubuntu 14.04 is on Kernel 3.13; other options emerge post 3.18.
I have been able to run quite a few more services on an openstack instance with 24G of ram:
Cluster of 4 Docker containers on an openstack image with 24G ram

Mar 10

Cloudera Manager GUI and API Can Step on Each Other

Categories:

cloudera manager, gotchas

by Matt Williams

While learning how the configuration worked (in particular, which arguments to pass in order to set non-default values), I discovered that I could lose changes by following these steps:

Use the GUI to set a value and save it. This is just so that you can find the variable. Keep the GUI open.
Dump the deployment to see what the variable name is (curl 'http://MANAGER:7180/api/v11/cm/deployment?view=export' > SOMEFILE). The URL is quoted so the shell does not try to interpret the ? character.
Call the API, setting the variable to the desired value.
Back in the GUI, either do a reload or look up another configuration parameter. (I’m not sure of the exact steps here, but I think I noticed it happening two different ways)

It appears that the GUI is storing the state (again) when you reload or navigate away from the page. This emerged when I spent a bit of time helping someone figure out why his API calls weren’t changing variables.
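One way to catch this is to dump the deployment right after the API call and again after touching the GUI, then compare the two. A sketch only: MANAGER is the placeholder from the steps above, the admin:admin credentials and file names are assumptions, both helper names are hypothetical, and ignoring lines that contain "timestamp" is a guess at what might legitimately differ between exports.

```shell
# dump_deployment FILE: save the current deployment export to FILE.
# CM_USER and CM_PASS are assumed environment variables.
dump_deployment() {
  curl -s -u "${CM_USER:-admin}:${CM_PASS:-admin}" \
    "http://MANAGER:7180/api/v11/cm/deployment?view=export" > "$1"
}

# deployment_diff A B: print the differences between two dumps, ignoring
# lines containing "timestamp". Succeeds if the dumps otherwise match.
deployment_diff() {
  a=$(mktemp)
  b=$(mktemp)
  grep -v '"timestamp"' "$1" > "$a" || true
  grep -v '"timestamp"' "$2" > "$b" || true
  status=0
  diff "$a" "$b" || status=$?
  rm -f "$a" "$b"
  return "$status"
}

# dump_deployment after-api.json
# (reload the GUI page or look up another parameter)
# dump_deployment after-gui.json
# deployment_diff after-api.json after-gui.json || echo "the GUI overwrote something"
```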
