
Apr 07

Cloudera Manager Disaster Recovery with JSON Deployment Dump

Cloudera Manager is fairly opinionated. In its defence, it pretty much has to be, given that it wrangles multiple underlying Open Source projects. Each of these, in turn, has its own quirks and opinions.

The following is a description of how to recover a Cloudera Manager cluster after a disaster, assuming that you have a copy of the deployment. I will say that this is something of a hack; it treats the cluster a bit too much like a pet, but you could make a case that Cloudera Manager’s deployment dump behaves similarly to infrastructure as code.

One of the tricks I’d previously discovered which can help is that the /var/lib/cloudera-scm-agent/uuid file can either be generated by the agent or populated with a UUID of your own choosing. Cloudera Manager uses it as the primary key for the host on which the agent lives. In the case of a disaster or server crash, if the replacement hostnames and IP addresses remain the same (they are set once within Cloudera Manager and cannot change), then the hosts can be dropped back into the cluster without creating duplicate records for the same hostname. A means of doing so would be something like the sketch below.
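
This is a minimal sketch rather than a canonical recipe; it assumes uuidgen is available and that the agent lives in its default locations. Substitute a fixed UUID of your own if you want it deterministic.

```bash
# Write the chosen UUID with no trailing newline; the agent will adopt it
# instead of generating its own on first start.
uuidgen | tr -d '\n' > /var/lib/cloudera-scm-agent/uuid
service cloudera-scm-agent restart
```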

Note the call to tr above. The contents of the uuid file are used verbatim, so if there is a linefeed in the file, the linefeed becomes part of the UUID inside Cloudera Manager. While that is legal, it can make some API calls awkward.

However, if you aren’t able to reuse the same UUID(s), or if the UUID is overwritten, say by Puppet, after the cluster is created, all is not lost. You will likely have some cleanup to perform, but it’s not insurmountable.

The Cloudera Manager API is very powerful and opens up many possibilities — not the least of which is opening a door into the mind of the Manager and how it thinks. One of the calls, /cm/deployment, I had figured would work for backing up and restoring the state of a cluster. I’d tested it previously in a single node cluster, so I knew it could work — at least in the small!

I had an opportunity to test my theory in a larger cluster this evening.

The basic symptoms were a dashboard full of red and no messages in the logs when you attempted to restart a server — you could start the Agent, but it wouldn’t do much good — it wasn’t speaking to the Manager.

I determined (after some experimentation) that the reason why they weren’t speaking very well was that the UUID was being overwritten by Puppet. I started giggling. In a warped, BOFH sense, it is actually pretty funny what was causing the cluster to misbehave. All I could think was that the cluster was most definitely borked by a master!


The Cloudera Manager instance was still available, so I had a starting point from which to work — although with a backup of the deployment configuration, the approach would have worked even if the cluster was totally dead.

I set out to replace the hostId entries and figured that was all I would need to do. Turns out there was a bit more than just that.

Here is a basic set of steps to recover:

Note: Replace MANAGER with the name of the host on which the Cloudera Manager is located. Also, replace the authentication user/pass as needed — it’s unlikely (I hope!) that you’re still using admin/admin for user/pass.

  1. If the cluster is still up, then dump the hosts. The information you need is in the deployment, but it’s convenient to pull it from the hosts:
    curl -u 'admin:admin' http://MANAGER:7180/api/v11/hosts > hosts.json

  2. If you don’t have a dump of the deployment, you can get one via:
    curl -u 'admin:admin' http://MANAGER:7180/api/v11/cm/deployment > deployment.json

  3. In this case, I needed all of the new UUIDs. You may be able to skip ahead to step 7 if the UUIDs haven’t changed. You may also need to modify the IPs/hostnames if your replacement cluster is on a different network, or you could use this to create a template and reproduce the cluster on different networks.

  4. We’re going to use sed to replace all of the UUID entries. A short Ruby script can generate the sed script for us; see the sketch after this list.

  5. Paranoia-check time. Look at the generated sedder.sed script. If the UUIDs were generated (rather than chosen), there is a good chance that they will have a “\n” in them, so you may need to edit the sed script.

  6. Run the script which was just generated: bash -x sedder.sed

  7. At this point, there is likely some more fixing needed. The following is needed because:

    When specifying roles to be created, the names provided for each role must not conflict with the names that CM auto-generates for roles. Specifically, names of the form “<service name>-<role type>-<arbitrary value>” cannot be used unless the <arbitrary value> is the same one CM would use. If CM detects such a conflict, the error message will indicate what is safe to use. Alternately, a differently formatted name should be used. — Cloudera Manager API

    Cloudera Manager was being fickle and didn’t think that the “arbitrary” values it had previously considered safe were still any good. I searched a bit, but could not find anything explaining how it calculates those arbitrary safe values :-/. They are a large hexadecimal number; I was able to identify that part, which led (after much experimentation) to the following fix.

    Save a suitable script to fixer.awk; a sketch appears after this list.

    Execute it via: awk -f fixer.awk deployment.json > fixed-deployment.json

  8. More paranoia. Look at the output JSON text — if it looks borked, then don’t deploy it!

  9. Install the fixed deployment (see the example after this list).
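
For step 4, a minimal sketch of such a generator might look like the following. It assumes the old hostIds come from the hosts.json dumped in step 1, and that the new UUIDs have been collected into a hypothetical new-uuids.txt containing one "hostname uuid" pair per line (for example, read from each replacement host’s /var/lib/cloudera-scm-agent/uuid).

```ruby
#!/usr/bin/env ruby
# Sketch: emit one sed command per host, mapping the old hostId to the new UUID.
require 'json'

# hostname => old hostId, from the hosts dump taken in step 1
old_ids = {}
JSON.parse(File.read('hosts.json'))['items'].each do |host|
  old_ids[host['hostname']] = host['hostId']
end

# new-uuids.txt is a hypothetical "hostname uuid" file gathered from the new hosts
File.open('sedder.sed', 'w') do |out|
  File.readlines('new-uuids.txt').each do |line|
    hostname, new_id = line.split
    old_id = old_ids[hostname]
    next unless old_id && new_id
    # Edit deployment.json in place; step 7 reads it back out via awk.
    out.puts "sed -i -e 's/#{old_id}/#{new_id}/g' deployment.json"
  end
end
```

The generated sedder.sed is really just a shell script full of sed commands, which is why step 6 runs it with bash -x.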
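
For step 7, one possible shape for fixer.awk (a sketch, not necessarily the exact fix) is below. It assumes the deployment JSON is pretty-printed with one field per line, as the API returns it, and that the offending role names end in a long run of hexadecimal digits. The "recovered-" segment it splices in is an arbitrary choice; any differently formatted name should do. It also needs an awk that understands interval expressions (gawk does).

```awk
#!/usr/bin/awk -f
# Sketch: rewrite role names of the form <service>-<role type>-<long hex value>
# so they no longer collide with the names CM wants to auto-generate.
{
    if ($0 ~ /"name"[[:space:]]*:[[:space:]]*"[A-Za-z0-9_]+-[A-Za-z0-9_]+-[0-9a-f]{16,}"/) {
        # "&" is the matched hex suffix (plus its closing quote); prefixing it
        # changes the name format without losing the original value.
        sub(/[0-9a-f]{16,}"/, "recovered-&")
    }
    print
}
```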
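
For step 9, something along these lines should do it. The deleteCurrentDeployment flag asks Cloudera Manager to discard the existing deployment before applying the new one; the json.tool check is just one more bit of paranoia to confirm the edited file still parses.

```bash
# Last paranoia check: make sure the edited JSON still parses.
python -m json.tool fixed-deployment.json > /dev/null || exit 1

# Push the repaired deployment back into Cloudera Manager.
curl -u 'admin:admin' -X PUT -H 'Content-Type: application/json' \
     --data-binary @fixed-deployment.json \
     'http://MANAGER:7180/api/v11/cm/deployment?deleteCurrentDeployment=true'
```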

At this point, if all has gone well, Cloudera Manager accepts the deployment. You’ll need to restart the cluster(s); I strongly suggest starting the Management Cluster first.
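
You can do that through the UI, or via the API with something like the following (a sketch; CLUSTERNAME is a placeholder for whatever your cluster is called):

```bash
# Restart the Cloudera Management Service first...
curl -u 'admin:admin' -X POST 'http://MANAGER:7180/api/v11/cm/service/commands/restart'

# ...then the cluster itself.
curl -u 'admin:admin' -X POST 'http://MANAGER:7180/api/v11/clusters/CLUSTERNAME/commands/restart'
```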

After the dust settles, you’ll have a repaired and/or newly deployed cluster.

In my case, I was mostly there when I found that Puppet was also overwriting the Manager server entry in /etc/cloudera-scm-agent/config.ini with localhost.
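
That entry is worth a quick check on every agent host; the server_host setting is the one that matters, and it should name the Manager rather than localhost:

```bash
# On each agent host, confirm the agent is pointed at the Manager.
grep '^server_host' /etc/cloudera-scm-agent/config.ini
# expected: server_host=MANAGER
```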


Good luck and let me know what you think.

