Automating Orchestration Tasks With Sensu

And a one, and a two, and a three

GoDaddy web hosting is a very complex system of systems. We use several tools to help manage and monitor thousands of servers serving up millions of web sites every day. As a 24/7/365 business, our customers expect us to maintain a very high level of service reliability.

Sensu is one of the tools we use. It provides a scalable framework that is ideal for application monitoring and server health validation. Historically, monitoring systems have been based on concepts like pulling metrics into round-robin time-series tables, polling TCP/UDP services to verify that they are responding, and sometimes having applications push a periodic heartbeat out to a watchdog service. For years, we’ve used open source, homebrewed, and commercial solutions to add observability to our environments.

With these types of monitoring systems, we’ve often pushed responsibility away from product developers and engineers to a centralized group that manages all aspects of monitoring. This works, but it can create application knowledge gaps or organizational complexities that slow down service delivery. Recognizing the bottlenecks and gaps imposed by this structure, we’ve evolved many of our product teams into tribes with cross-functional skill sets (development, engineering, and operations) and enhanced our tools to align with small-team agile processes. Combining the expertise of these product-oriented teams with the Sensu software framework as a service has empowered our product groups to quickly deliver robust monitoring and to significantly increase automated responses when a service interruption is detected.

Sensu handles basic push/pull metric monitoring scenarios, but that’s not the limit of the Sensu magic. Sensu lets you execute custom verification commands and then, based on their results, chain multiple complex orchestration actions to them. What does that mean? It means we can build automated responses to our alarms. For instance, if a web server stops running, we can fire off a job using Sensu to try to restart the web service automatically. Concurrently, that job can capture some basic troubleshooting data (e.g., process listing, web server configuration, etc.) for further analysis. If we can’t restore service, we can further automate the notification to an on-call human to investigate.
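To make that escalation flow concrete, here is a minimal Python sketch of the decision a handler might make. It is an illustration under stated assumptions, not our production handler: the action names (restart_service, capture_diagnostics, notify_on_call) and the max_restart_attempts threshold are hypothetical, and the event is assumed to carry the check's exit status and an occurrences counter.

```python
def plan_response(event, max_restart_attempts=2):
    """Return the ordered actions to take for a web server check event.

    Assumptions (not from the article): the event dict carries
    event["check"]["status"] (0 means OK) and event["occurrences"]
    (how many times in a row the check has failed). The action names
    are hypothetical labels, not Sensu built-ins.
    """
    status = event["check"]["status"]
    occurrences = event.get("occurrences", 1)

    if status == 0:
        # Service is healthy, nothing to do.
        return []
    if occurrences <= max_restart_attempts:
        # Early failures: try to self-heal and keep evidence for later analysis.
        return ["restart_service", "capture_diagnostics"]
    # Restarts did not restore service, so hand off to a human.
    return ["notify_on_call"]
```

The point of the sketch is the ordering: automated recovery and data capture come first, and a person is paged only once the automated attempts are exhausted.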

Sensu on demand

Using the Sensu framework, we can utilize the REST API to execute orchestration tasks on demand.

Wait, what, really?

Oh, yes!

Within Sensu, each client has a number of identification tags associated with the host. These are called subscribers, and they are set in the subscriptions list of the client configuration. We’ll often tag each host with a set of subscribers that includes the short hostname of the asset (computer0001), the major services or business purpose of the asset (httpd_service, provisioning_service, monitoring_service), and a generic operating system tag for the host (linux_host).

Here’s an example client configuration.

/etc/sensu/conf.d/client.json:
{
  "client": {
    "name": "coolhost001.example.com",
    "address": "10.0.0.1",
    "subscriptions": [
      "linux_host",
      "coolhost001",
      "httpd_service",
      "puppet_group001"
    ]
  }
}
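If you generate client configurations from an inventory rather than writing them by hand, the tagging convention above is easy to encode. The following Python sketch is one possible way to do it; the function name build_client_config and its parameters are hypothetical, and it assumes the short hostname is simply the first label of the fully qualified name.

```python
import json


def build_client_config(fqdn, address, service_tags, os_tag="linux_host"):
    """Build a Sensu client definition following the tagging convention above.

    Hypothetical helper: subscriptions are the OS tag, the short hostname
    (first label of the FQDN), then any service or group tags.
    """
    short_name = fqdn.split(".", 1)[0]
    subscriptions = [os_tag, short_name] + list(service_tags)
    return {
        "client": {
            "name": fqdn,
            "address": address,
            "subscriptions": subscriptions,
        }
    }


config = build_client_config(
    "coolhost001.example.com",
    "10.0.0.1",
    ["httpd_service", "puppet_group001"],
)
# json.dumps always emits valid JSON, so stray trailing commas cannot creep in.
print(json.dumps(config, indent=2))
```

Generating the file this way keeps the short-hostname subscriber consistent across hosts, which matters later when that subscriber is used to target a single machine.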

Let’s create a check to restart the httpd service (/etc/sensu/plugins/actions/restart-service.rb already exists on the clients).

/etc/sensu/conf.d/checks/restart_httpd.json:
{
  "checks": {
    "restart_httpd": {
      "handlers": ["notifyops"],
      "command": "/etc/sensu/plugins/actions/restart-service.rb httpd",
      "publish": false,
      "interval": 9999
    }
  }
}

Because this check is not published ("publish": false), Sensu won’t fire it off automatically at a periodic interval. We can then chain this action to httpd service checks using a straightforward API call within a handler.

curl -XPOST http://sensuserver.example.com:4567/check/request -d '{"subscribers": ["coolhost001"], "check":"restart_httpd"}'
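For callers that are not shell scripts (a handler, a Jenkins job step, a small tool), the same request can be sketched in Python with only the standard library. The helper names build_check_request and send_check_request are hypothetical; the URL, endpoint, and payload mirror the curl call above. Setting a JSON Content-Type header is a choice made in this sketch, since curl -d does not send one by default.

```python
import json
import urllib.request


def build_check_request(api_url, check, subscribers):
    """Build (but do not send) the POST that asks Sensu to run a check.

    Hypothetical helper mirroring the curl call above: same endpoint,
    same "check" and "subscribers" fields.
    """
    body = json.dumps({"check": check, "subscribers": list(subscribers)})
    return urllib.request.Request(
        api_url.rstrip("/") + "/check/request",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_check_request(request, timeout=10):
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Target one host by its short-hostname subscriber...
single_host = build_check_request(
    "http://sensuserver.example.com:4567", "restart_httpd", ["coolhost001"]
)
# ...or a whole group by a shared subscriber from the client config.
whole_group = build_check_request(
    "http://sensuserver.example.com:4567", "restart_httpd", ["puppet_group001"]
)
```

Separating the build step from the send step makes the payload easy to inspect or log before anything is actually restarted. Only the subscriber list changes between targeting one host and targeting a group.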

We can even tie this into our Jenkins build system or call this action manually. The possibilities here are numerous. We can adjust the target of such calls by changing the subscriber named in the request. Some of the interesting cases we’ve explored in this space are:

  • Automating Puppet/Chef remediation, or calling these on demand as a step within a CI/CD deployment. This allows for A-side / B-side deployments based on the subscriber group
  • Automating content syncs on demand, allowing us to use Sensu as a framework with which to deploy content updates from git, triggered by a git commit hook
  • Automating non-periodic validation/audit scripts to be executed on demand, and being able to gather the pass/fail results of those checks directly from the Sensu API
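For the audit case in the last bullet, the pass/fail roll-up can be sketched in a few lines of Python once the results have been fetched from the API. This is a sketch under assumptions: each result record is assumed to be a dict holding a client name and a check with name and status fields, the helper name summarize_results is hypothetical, and the sample records below are made up for illustration. What it relies on from Sensu is the exit-status convention, where 0 means the check passed.

```python
def summarize_results(results):
    """Group check results into passed and failed by exit status.

    Assumed record shape (verify against your Sensu API version):
    {"client": "<name>", "check": {"name": "<check>", "status": <int>}}
    A status of 0 is a pass; anything else counts as a failure.
    """
    summary = {"passed": [], "failed": []}
    for result in results:
        check = result["check"]
        label = "{}/{}".format(result["client"], check["name"])
        bucket = "passed" if check["status"] == 0 else "failed"
        summary[bucket].append(label)
    return summary


# Made-up sample records, for illustration only.
sample = [
    {"client": "coolhost001", "check": {"name": "audit_example", "status": 0}},
    {"client": "coolhost002", "check": {"name": "audit_example", "status": 2}},
]
print(summarize_results(sample))
```

A summary like this is what a CI/CD step would inspect to decide whether to continue a deployment or stop and report.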

With all this said, Sensu is not intended to be a replacement for traditional orchestration tools like MCollective or Rundeck. (Just because you can do something doesn’t always mean you should.) But in some cases, it can make sense to reuse the same automated commands run by periodic checks as on-demand commands within the same system.

What are some of the monitoring tools or services that you use?

Image by: chrisbb@prodigy.net via Compfight cc

Mike McLane
Mike grew up in the Southwest USA with systems technology: embracing dial-up modems, IBM PC XTs, and programming in BASIC all by the age of 8. From CompUSA to IBM and from GlobalCrossing to GoDaddy: Mike has made his rounds over the last 2 decades in the tech industry. He helps answer questions on #httpd and #centos on FreeNode and is a member of the IEEE. His latest project is Webrockit.