Zimbra & Mozilla email, 4 months later

Four months ago was a very tough time in Operations. We suffered a catastrophic disk array failure on Mozilla’s mail server (I blogged about it too). A series of mistakes kept email offline for two days. This was the worst I’ve ever felt, both professionally and personally.

Fast forward to today. So. Much. Better.

We learned. We researched. We re-organized ourselves. Much like The Six Million Dollar Man, we rebuilt it better.

justdave posted his account, “Re-imagining Zimbra email at Mozilla,” but I wanted to add my own color.

Background

During my interview at Mozilla in 2006, I was asked a bunch of questions about Zimbra. It was the first I’d heard of it; by the time I started, I had learned quite a bit. Back then, mozilla.com email was hosted externally, and we began the process of moving it back in house. The company hosting our email couldn’t provide SSL and didn’t offer all the groupware features we needed.

Post Zimbra-gate (December)

I was mentally done with email. I looked at simply outsourcing. I looked at hosted Zimbra, hosted Exchange, hosted whatever.

We Mozillians, we’re a unique group.

  • We want to use the IMAP client of our choice. Some of us just want to use the web interface. Others prefer Microsoft Outlook. Or Thunderbird, or Mail.app or Postbox or mutt or pine or Sparrow or …
  • Calendaring is just as complex.
  • We need to support a wide range of mobile devices (iOS, Android, BlackBerry, and anything that speaks Microsoft’s ActiveSync) with both email and calendaring.
  • Some of us use Zimbra’s document sharing/storage.
  • We need something that supports IMAP, ActiveSync, CalDAV, and CardDAV.

We looked at what others at our scale and beyond use for email. Oracle uses Zimbra. Comcast uses Zimbra. At. Scale.

We talked to others hosting their corporate email with Google Apps (and their 15-person staff managing their Google Apps mail!). We learned that deploying Exchange requires a move from OpenLDAP to Active Directory and a particular skill set that we don’t have in house.

Moving Forward

This incident highlighted the need to have a team focused on infrastructure. Our primary focus (and priorities) has always tended to lean towards various Mozilla web properties or developer services.

So we did two things:

  1. Broke up a fairly flat Operations group and created an Infrastructure Operations team (and a couple of others) to focus on services like email & LDAP.
  2. Built a new environment for services that, when they break, cause work stoppage and a line to form behind my desk. This Hyper Critical Infrastructure, or HCI, is isolated from the rest of the production environment, has different change control processes, and is meant to hit as many “9s” as we can (the sketch after this list shows what those 9s translate to in downtime). It’s a very different way of planning than we’d done in the past, and the stack uses more corporate/enterprise technology than we’re used to at Mozilla.
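
For a sense of what chasing “9s” actually buys you, the quick Python sketch below converts availability targets into a yearly downtime budget. It’s just illustrative arithmetic, not a statement of our actual targets.

    # Downtime budget per year for common availability targets.
    # Illustrative arithmetic only; these are not our actual SLA numbers.

    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for label, availability in [("three 9s", 0.999),
                                ("four 9s", 0.9999),
                                ("five 9s", 0.99999)]:
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
        print("%s (%.3f%%): ~%.0f minutes of downtime per year"
              % (label, availability * 100, downtime_minutes))

Moving from three 9s to four takes the yearly budget from roughly nine hours down to under an hour, which is the kind of gap that justifies a separate environment and separate change control.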

HCI Today

HCI straddles two high-density (~15kW) racks. Its only connection to the rest of the Mozilla production network is two 10GbE fiber drops from the network core.

HCI has its own Juniper SRX 1440 firewalls. Its own Juniper EX4500 switching. Its own NetApp FAS3270. Its own five-node VMware ESX cluster, each host with two 6-core Xeons and 192GB of RAM.

In a couple months, services here will be replicated to SCL3 using various NetApp & VMware technologies.

We had planned to have HCI in production by the end of February, but no one wanted to rush this (plus someone decided to have a baby).

Instead, we slipped that to the last week of March, and I’m glad we did. We consulted with Zimbra and others. We sent Desktop & InfraOps to training. We tuned and fine-tuned.

Zimbra Today

We have mailboxes spread across seven mailbox servers, and we understand the metrics we’ll use to decide when to add more.
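
As a purely hypothetical illustration of what that kind of capacity check might look like, here’s a short Python sketch; the metric names and thresholds below are made up for illustration, not our real numbers.

    # Hypothetical capacity check for a single Zimbra mailbox server.
    # Metric names and thresholds are illustrative, not our real numbers.

    THRESHOLDS = {
        "mailbox_count": 400,        # accounts on this mailbox store
        "store_disk_pct": 70,        # % of the message store volume used
        "avg_imap_latency_ms": 250,  # IMAP response time under normal load
    }

    def needs_new_mailbox_server(stats):
        """Return True if any tracked metric has crossed its threshold."""
        return any(stats.get(metric, 0) > limit
                   for metric, limit in THRESHOLDS.items())

    # Example: stats for one server, pulled from trending data.
    print(needs_new_mailbox_server(
        {"mailbox_count": 150, "store_disk_pct": 55, "avg_imap_latency_ms": 120}))
    # -> False: no need to add a server yet.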

We migrated 1002 mailboxes from San Jose to Phoenix without anyone noticing, without any user impact, in just a couple days. In fact, we didn’t mention it until we were done.

We have instrumentation and trending and alerting on everything we could think of.
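
As one concrete example of that instrumentation, here’s a minimal Nagios-style check in Python built around Zimbra’s zmcontrol status command. Treat it as a sketch of the idea rather than one of our actual checks; it assumes it runs on the mailbox server as the zimbra user.

    #!/usr/bin/env python
    # Sketch of a Nagios-style service check for Zimbra.
    # Assumes `zmcontrol status` prints one line per service ending in
    # "Running" or "Stopped" (run as the zimbra user on the server itself).

    import subprocess
    import sys

    OK, CRITICAL = 0, 2  # standard Nagios exit codes

    def main():
        try:
            proc = subprocess.Popen(
                ["/opt/zimbra/bin/zmcontrol", "status"],
                stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                universal_newlines=True)
            output, _ = proc.communicate()
        except OSError as err:
            print("CRITICAL: could not run zmcontrol: %s" % err)
            return CRITICAL

        if not output.strip():
            print("CRITICAL: no output from zmcontrol status")
            return CRITICAL

        stopped = [line.split()[0] for line in output.splitlines()
                   if line.strip().endswith("Stopped")]
        if stopped:
            print("CRITICAL: Zimbra services stopped: %s" % ", ".join(stopped))
            return CRITICAL

        print("OK: all Zimbra services report Running")
        return OK

    if __name__ == "__main__":
        sys.exit(main())

The trending side is the same idea pointed at graphs instead of alerts: collect the same numbers over time so capacity decisions, like the mailbox-server question above, aren’t guesses.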

What’s next?

All is for naught without learning. We learned a lot and we’ve changed how we operate as a team.

Once bitten, twice shy.

Recovering from an email outage

If I could do this week over I would.  Too bad I can’t.

Email today is vital.  Not having it makes your heart palpitate. 

Monday morning, during a swap of a failed hard drive (something we’ve done countless times), the storage array we use for email went offline.  The whole thing.  And for various reasons, the last known good backup was from a while ago.

I painfully remember thinking “oh shit” when I realized what this meant.

[This isn’t a post about all the things I should have done to make sure I was never in this spot.  Everything’s obvious now.]

I learned a couple things this week:

  1. Hire the absolute best people (and geezus, hire people smarter than you!). You never know when you’ll need them.  You never know who will have the answer to the problem.  Hire people who care about each other.  You never know when you need them to look out for the one guy who, in 73 hours, forgot to sleep.  The same one guy who has to run point on The Next Big Step in 7 hours.
  2. Work somewhere where everyone realizes we’re all fighting the same fight. I’m surrounded by coders and when we needed coding, 1492 python coders lined up to help.  Not a single one of them reports to me.
  3. Get upset, yell, demand results.  But realize when it’s the right time to yell and when it’s not.  During a firefight, I need you to be on the best fucking game of your entire life.  It is not the time to be berating you.  It’s the time to treat you like a hero, a magician.  It’s when I do whatever you tell me you need.
  4. Communicate the heck out of everything.  Throughout this outage we found other tools to let users know what was going on and what to expect.  I’d post updates even when the information I had was incomplete, and I’d say so.  I hated having folks in the dark.
  5. Expect criticism.  Some of it will be searing.
  6. Realize that the people working under me on this are collectively smarter than I am.  Offer help whenever you can, but let them work.  Take point on handling communication.  Make sure #5 doesn’t get to them.  Remind yourself of #3.

It took nearly two days to get things back to an okay state, a state where we had new emails.  We’re still recovering data from backups and reconstructing state from a now-corrupt MySQL database.

I’ll probably never be able to express my gratitude to the team I manage for their efforts this week.  It sucks that we got here, but I wouldn’t think twice about going to battle with this team again.

We made mistakes that got us here, but we can talk about that later and make sure it doesn’t happen again.