Monitoring your Puppet nodes using PuppetDB

When you run Puppet, it is very important to monitor whether all nodes have an uptodate catalog and did not miss the last year of changes because of a typo in a manifest or a broken cron-script. The most common solution to this is a script that checks /var/lib/puppet/state/last_run_summary.yaml on each node. While this is nice and easy in a small setup, it can get a bit messy in a bigger environment as you have to do an NRPE call for every node (or integrate the check as a local check into check_mk).

Given a slightly bigger Puppet environment, I guess you already have PuppetDB running. Bonuspoints if you already let it save the reports of the nodes via reports = store,puppetdb. Given a central knowledgebase about your Puppet environment one could ask PuppetDB about the last node runs, right? I did not find any such script on the web, so I wrote my own: check_puppetdb_nodes.

The script requires a "recent" (1.5) PuppetDB and a couple of Perl modules (JSON, LWP, Date::Parse, Nagios::Plugin) installed. When run, the script will contact the PuppetDB via HTTP on localhost:8080 (obviously configurable via -H and -p, HTTPS is available via -s) and ask for a list of nodes from the /nodes endpoint of the API. PuppetDB will answer with a list of all nodes, their catalog timestamps and whether the node is deactivated. Based on this result, check_puppetdb_nodes will check the last catalog run of all not deactivated nodes and issue a WARNING notification if there was none in the last 2 hours (-w) or a CRITICAL notification if there was none for 24 hours (-c).

As a fresh catalog does not mean that the node was able to apply it, check_puppetdb_nodes will also query the /event-counts endpoint for each node and verify that the node did not report any failures in the last run (for this feature to work, you need reports stored in PuppetDB). You can modify the thresholds for the number of failures that trigger a WARNING/CRITICAL with -W and -C, but I think 1 is quite a reasonable default for a CRITICAL in this case.

Using check_puppetdb_nodes you can monitor the health of ALL your Puppet nodes with a singe NRPE call. Or even with zero, if your monitoring host can access PuppetDB directly.

Comments