Four months ago was a very tough time in Operations. We suffered a catastrophic disk array failure on Mozilla’s mail server (I blogged about it too). A series of mistakes kept email offline for two days. This was the worst I’ve ever felt, both professionally and personally.
Fast forward to today. So. Much. Better.
We learned. We researched. We re-organized ourselves. Much like The Six Million Dollar Man, we rebuilt it better.
justdave posted his account, “Re-imagining Zimbra email at Mozilla”, but I wanted to add my own color.
Background
During my interview at Mozilla in 2006, I was asked a bunch of questions about Zimbra. It was the first I had heard of it. By the time I started, I had learned quite a bit about Zimbra. Back in 2006, mozilla.com email was hosted externally and we began the process of moving email back in house. The company hosting email couldn’t provide SSL and wasn’t doing all the groupware things we needed.
Post Zimbra-gate (December)
I was mentally done with email. I looked at simply outsourcing. I looked at hosted Zimbra, hosted Exchange, hosted whatever.
We Mozillians, we’re a unique group.
- We want to use the IMAP client of our choice. Some of us just want to use the web interface. Others prefer Microsoft Outlook. Or Thunderbird, or Mail.app or Postbox or mutt or pine or Sparrow or …
- Calendaring is just as complex.
- We need to support a wide range of mobile devices – iOS, Android, BlackBerry, devices that support Microsoft’s ActiveSync – with both email and calendaring.
- Some use Zimbra’s document sharing/storage.
- We need something that supports IMAP, ActiveSync, CalDAV, CardDAV.
We looked at what others at our scale and beyond our scale use for email. Oracle uses Zimbra. Comcast uses Zimbra. At. Scale.
We talked to others hosting their corporate email with Google Apps (and their 15-person staff managing their Google Apps mail!). We learned that deploying Exchange requires a move from OpenLDAP to Active Directory and a particular skill set that we don’t have in house.
Moving Forward
This incident highlighted the need to have a team focused on infrastructure. Our focus and priorities always tend to lean towards various Mozilla web properties or developer services.
So we did two things –
- Broke up a fairly flat Operations group and created an Infrastructure Operations team (and a couple of others) to focus on services like email & LDAP, to name a few.
- Built a new environment for services that, when they break, cause work stoppage and cause a line to form behind my desk. This Hyper Critical Infrastructure, or HCI, is isolated from the rest of the production environment, has different change control processes and is meant to hit as many “9s” as we can hit. It’s a very different way of planning than we had done in the past. This technology stack uses more corporate/enterprise technology than we’re used to using at Mozilla.
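To put “9s” in perspective, here is a rough sketch of the arithmetic, not anything we actually run. It assumes a 365-day year, and the function names are made up for illustration. It shows the yearly downtime budget for a given number of nines and what a two-day outage like December’s does to a year’s availability.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: 8760 hours, ignoring leap years


def allowed_downtime_hours(nines: int) -> float:
    """Yearly downtime budget for N nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** (-nines)
    return HOURS_PER_YEAR * unavailability


def availability_after_outage(outage_hours: float) -> float:
    """Best possible yearly availability if this outage is the only downtime."""
    return 1 - outage_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {allowed_downtime_hours(n):.2f} hours/year")

    # A two-day outage, as in the December incident.
    two_day = availability_after_outage(48)
    print(f"A 48-hour outage alone caps the year at {two_day:.4%}")
```

By this math, three nines leaves under nine hours of downtime a year, so a single two-day outage keeps the year below even that.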
HCI Today
HCI straddles two high-density (~15kW) racks. Its only connection to the rest of the Mozilla production network is two 10GbE fiber drops from the network core.
HCI has its own Juniper SRX 1440 firewalls. Its own Juniper EX4500 switching. Its own NetApp FAS3270. Its own 5-node VMware ESX cluster, each machine having 2x 6-core Xeons & 192GB RAM.
In a couple of months, services here will be replicated to SCL3 using various NetApp & VMware technologies.
We had planned to have HCI in production by the end of February but no one wanted to rush this (plus someone decided to have a baby).
Instead, we slipped that to the last week of March and I’m glad we did. We consulted with Zimbra and others. We sent Desktop & InfraOps to training. We tuned and fine-tuned.
Zimbra Today
We have mailboxes spread across seven mailbox servers and understand the metrics we’ll use to determine when to add more mailbox servers.
We migrated 1002 mailboxes from San Jose to Phoenix without anyone noticing, without any user impact, in just a couple days. In fact, we didn’t mention it until we were done.
We have instrumentation and trending and alerting on everything we could think of.
What’s next?
All is for naught without learning. We learned a lot and we’ve changed how we operate as a team.
Once bitten, twice shy.