#BeachOps. Build for lazy!

What if.

What if I asked you to change your way of thinking? From building to do, to building for what you’re not going to do?

What if, instead, I asked you to build for lazy?

What if you decided you want to spend every moment at the beach? Not working? How would you build and run a production service? How would you build to be lazy? To build for the beach?

This is the core of #BeachOps.

I’d rather be at the beach than work. And to do that,

  • I have to build systems and tools to do my job.
  • I have to empower others to do my job.
  • I need computers to reason for me.
  • I want to build so I can be lazy.

Because, remember, we’d rather be at the beach. Not working. Eating tacos.

Story time.

But I didn’t use to think this way, so I want to share the true story of the moment that changed my thinking from simply building to building for lazy.

Years ago I worked with this CEO and after a few years, she stepped out of her CEO role to just focus on her role as chairperson. This felt very abrupt. And it felt like she had quit one role to take on a smaller role, a role that felt like it had less impact than her CEO role.

A couple years after that, at a company all-hands, she reflected on that and I’ve never forgotten the essence of what she said.

You see, back in 2008, she started to ask herself:

  • What am I doing that someone else could do at least as well as me?
  • Are there things not being done that only I could do?

I learned two things from her:

  1. There is a point in time where you realize that others can do what you can do, or can do most of what you can do (and you can coach/mentor the rest).
  2. There is always a giant pile of work that only you can work on.

By quitting as “CEO”, she freed her time to focus on what only she could work on.

The Essence

#BeachOps is a way of thinking about everything we do and asking –

  • Is there a way for me to empower someone else to do this?
  • How can a computer reason for me?
  • How can I be lazy?

And being lazy is hard work!

Instead of building to do, we build for what we’re not going to do.

Focus on the important. Focus on work that only we can do.

Recognize that there is always work that only I can do. Find ways to focus on that.

What should I stop doing? What can only I work on?

Thoughts from an Operations Wrangler: how we use alerts to monitor Wavefront

Thoughts from an Operations Wrangler: I lead the production engineering team, running one of the largest SaaS observability platforms on the planet. Wavefront started in 2013 (I joined in 2016) and was acquired by VMware in 2017. These thoughts are all mine.


More and more, I find myself talking with Wavefront users, both internal and external, about how Wavefront uses Wavefront to Wavefront.

Or, how Wavefront’s Production Engineers use Wavefront to run Wavefront.

And, in what I hope becomes a series of posts, I plan to go a bit deeper into how we do what we do.

At our core, as production engineers, Reliability is our product. Alerts are fundamental to that and the starting point.

why alerts?

#BeachOps because sometimes it's better at the beach

We don’t look at dashboards until an alert tells us to. We don’t look at charts or create ad-hoc queries until an alert tells us to.

We generally aspire to sit beachside until an alert says otherwise.

Within Wavefront Operations we have a few truisms:

  • alerts are always evolving
  • alerts are actionable or informative (more on this later)
  • any alert that pages is an alert that keeps us from the beach (and tacos)

An alert is the system telling us to go look at a thing. It’s the system telling us something important is outside of some definition of normal.

The majority of our alerts measure the rate of change of a metric or a change in the slope of a line or some other complicated math & science.

Data Ingester SQS Message Processing Variance Detected

variance(rate(ts(dataingester.sqs.processed, context=* and (tag="*-primary" or tag="*-secondary"))), hosttags, context) > 10
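To make the idea behind that query concrete, here is a minimal Python sketch of "variance of the rate" across a set of ingesters. It is an illustration only, not how Wavefront evaluates the query: the sample data and function names are hypothetical, the rate is simplified to the average per-second change between the first and last sample, the variance is the population variance, and the grouping by host tags and context in the real query is left out.

```python
from statistics import pvariance

def per_second_rate(samples):
    """Average per-second change of a counter, from (timestamp_seconds, value) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def variance_alert(series_by_source, threshold=10.0):
    """Return (fires, spread): does the variance of per-source rates exceed the threshold?"""
    rates = [per_second_rate(s) for s in series_by_source.values()]
    spread = pvariance(rates)  # population variance; an assumption of this sketch
    return spread > threshold, spread

# Hypothetical data: three ingesters, one of which has nearly stalled.
series = {
    "ingester-a": [(0, 0), (60, 600)],  # 10 messages/s
    "ingester-b": [(0, 0), (60, 660)],  # 11 messages/s
    "ingester-c": [(0, 0), (60, 60)],   # 1 message/s
}
fires, spread = variance_alert(series)
```

The point of alerting on the spread rather than on any single rate is that the alert stays quiet when every ingester speeds up or slows down together, and fires when one of them drifts away from its peers.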

the evolution of an alert

Alerts start as

  • a chart, exploring the data or patterns
  • an alert where we test our hypothesis using Wavefront’s back testing
  • an experimental alert, tagged with an alert tag path “experimental”, with an alert destination to anywhere but PagerDuty. It’s here where we refine the alert – it should not contribute to alert fatigue.

Eventually, we have a “production push” and the alert is live.

But we aren’t done until we have the alert automagically fixed through an integration with something like Jenkins or StackStorm.
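The rule that an experimental alert goes anywhere but PagerDuty can be sketched as a small routing check. This is a hypothetical helper, not part of Wavefront: the destination names and the function names are assumptions made for illustration, and the only thing taken from our process is the "experimental" tag path.

```python
PAGING_TARGET = "pagerduty"  # hypothetical destination name
QUIET_TARGET = "team-chat"   # hypothetical non-paging fallback

def is_experimental(tags):
    """True if any tag sits on the 'experimental' tag path."""
    return any(t == "experimental" or t.startswith("experimental.") for t in tags)

def destinations(tags, requested):
    """Return where an alert may notify; never page while it is still experimental."""
    if not is_experimental(tags):
        return list(requested)
    quiet = [d for d in requested if d != PAGING_TARGET]
    return quiet or [QUIET_TARGET]
```

For example, `destinations(["experimental.ingest"], ["pagerduty"])` falls back to the quiet target, while the same alert without the experimental tag keeps its paging destination.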

why get up when StackStorm can handle this?

when an alert triggers there must be an action

We try to frame things as the “2am problem” – what am I willing to wake up for at 2am? There are many things for which I will and many more that I won’t.

When an alert fires it must have some action. And generally, we obsess about refining each alert so that when it fires, there is little to no debugging. Because alerts & queries in Wavefront can be so precise, we evolve each alert until it represents a singular action.

Singular actions → computer code → an alert that triggers a webhook → leave me alone, I’m in bed sleeping.
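As a rough sketch of that chain, the receiving end of the webhook can be little more than a lookup from alert name to one action. Everything here is an assumption for illustration: the payload field names (`alert_name`, `status`, `host`) are hypothetical and not Wavefront's actual webhook schema, and the restart action is a placeholder for whatever a Jenkins job or StackStorm workflow would really do.

```python
def restart_ingester(payload):
    """Placeholder remediation; a real one would trigger a Jenkins job or StackStorm workflow."""
    return f"restart requested for {payload['host']}"

# One alert name maps to exactly one action (hypothetical mapping).
ACTIONS = {
    "Data Ingester SQS Message Processing Variance Detected": restart_ingester,
}

def handle_webhook(payload):
    """Dispatch a parsed webhook body (a dict) to its singular action, if one exists."""
    if payload.get("status") != "firing":
        return "nothing to do"
    action = ACTIONS.get(payload.get("alert_name"))
    if action is None:
        return "no automated action; page a human"
    return action(payload)
```

If an alert needs an `if` ladder inside its action, that is usually a sign the alert is not yet precise enough and should be split until each one maps to a single fix.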

alerts are actionable… sometimes

Not all alerts page out. When we do get a page we want to assimilate as much information as possible. As quickly as possible.

What else is going on in the system?

contextual alerts

We use alerts to do that too. Alerts should also be informative. We label them as INFO or SMOKE, and they help bring context. And since alerts are overlaid on Wavefront charts, we get even richer context.
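One way to picture what those informative alerts buy us during a page: given the time of the page, pull every INFO or SMOKE alert that fired close to it. This is a hypothetical sketch, not a Wavefront API; the event fields (`name`, `severity`, `fired_at`) and the 15-minute window are assumptions.

```python
def context_for(page_time, events, window_seconds=900):
    """Names of INFO/SMOKE alerts that fired within the window around a page."""
    return [
        e["name"]
        for e in events
        if e["severity"] in {"INFO", "SMOKE"}
        and abs(e["fired_at"] - page_time) <= window_seconds
    ]

# Hypothetical events, with times in epoch seconds.
events = [
    {"name": "deploy started", "severity": "INFO", "fired_at": 1000},
    {"name": "queue depth rising", "severity": "SMOKE", "fired_at": 1500},
    {"name": "old maintenance window", "severity": "INFO", "fired_at": 10},
    {"name": "ingest down", "severity": "SEVERE", "fired_at": 1600},
]
nearby = context_for(page_time=1600, events=events)
```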

#BeachOps

Ultimately we want to be at the beach. And an alert that fires is an alert that keeps us away from the beach.

Everyone talks about “single pane of glass” but we use Wavefront as our First Pane of Glass, consolidating disparate metrics sources into single charts and into single alerts. You might call this full stack alerting — we call it #BeachOps.

We leverage Wavefront’s analytics engine and query language to build alerts that are actionable by a computer, or by a human, in which case we use Wavefront to provide as much context as possible. We also constantly evolve alerts. Taken together, these have helped prevent alert fatigue and keep the team size small while the infrastructure has grown by 400%.

And since Wavefront alerts can trigger actions in automation tooling like Jenkins or StackStorm, we can spend our time at the beach. With tacos.
