Analysis of a server outage and lessons learnt

Hi Mum, this post isn’t for you — if you’re not interested in the gory details of our near server disaster last night, then please look away now.

If you are, read on!

Background

Last night we managed to break several components on two of our servers — our Database server and our Indexing server. This resulted in the Database server being offline for around 15 minutes, and the Indexing server being offline for several hours.

In addition, because of the Indexing server problems, we had to stop crawling links. So while there wasn’t anything wrong with what we call our pipeline process (connecting to Twitter, extracting a link, crawling it, then submitting it to the indexer), we had to stop that as well.

This occurred around 7PM Melbourne time. It took over 12 hours to get back to a “working” recovery — more on that later. During this time the site remained available (with the exception of the 15-minute window when the DB server was unavailable), but Search, Indexing and Crawling were all offline, so no new links were being collected.

Issue 1 — Weekly Email

The motivation for the changes was a simple encoding problem in our weekly email. It was reported last week that our emails were NOT properly supporting multibyte character languages. For example, Chinese link titles were showing as unencoded strings.

In diagnosing this issue, several things were discovered.

  1. Our test environment (which WAS properly encoding) was running Python 2.6.6
  2. Our staging environment (which was NOT properly encoding) was running Python 2.6.5
  3. Turns out this is a Python bug, resolved between 2.6.5 and 2.6.6 as issue #1368247.
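
For reference, this is roughly the kind of multibyte-safe message construction the weekly email depends on. The snippet below is a minimal sketch using Python 2’s standard email module, not our actual mailer code, and the addresses, SMTP host and link title are made up:

    # -*- coding: utf-8 -*-
    # Minimal sketch: build and send a UTF-8 weekly email so multibyte link
    # titles (e.g. Chinese) arrive intact. Addresses, SMTP host and the title
    # are placeholders, not our real data.
    from email.mime.text import MIMEText
    from email.header import Header
    import smtplib

    title = u"每周链接摘要"  # an example Chinese link title
    body = u"Top link this week: %s\n" % title

    msg = MIMEText(body.encode("utf-8"), "plain", "utf-8")
    msg["Subject"] = Header(u"Your weekly links: %s" % title, "utf-8")
    msg["From"] = "digest@example.com"
    msg["To"] = "subscriber@example.com"

    server = smtplib.SMTP("localhost")
    server.sendmail(msg["From"], [msg["To"]], msg.as_string())
    server.quit()

Nothing exotic: the point is just that both the body and the headers need explicit UTF-8 handling, and the version-specific bug we hit sat inside the standard library rather than in code like this.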

We pride ourselves on our unicode support and decided to resolve this before sending the next weekly email by upgrading the Python environment. At this point we made several mistakes:

  1. We did not assess how long the upgrade would likely take.
  2. We did not research what was involved in the upgrade (as we later found, there was LOTS of doco saying not to do it the way we did).
  3. Most critically, we didn’t prioritise this. While fixing the bug is important, it had been there for at least four weeks. Did one more week matter?

Lessons: While we may still have had the problems, we would have been better off either holding the email or sending it “as is” until we could do more research.

Issue 2 — Upgrade One — Database / Staging Server

We then commenced upgrading our staging environment from Python 2.6.5. This is where it all started to go wrong. Staging is also the database server. We use it in this way as a “final check” of things that need a live database before being pushed to production. In upgrading Python, we made several mistakes.

  1. Committed a fatal error: we upgraded the Python installation that ships with Ubuntu 10.04. This broke many, many things, but it was not immediately obvious. Why? Well, Ubuntu uses Python as an internal scripting language, and a lot of its libraries have version dependencies. We didn’t realise the seriousness of this at the time.
  2. We tested the email with the new Python and it worked. Then the database went down; we tricked the path and got it back up again. We did our happy dance, convinced the issue was solved.

At this point there were fundamental, terminal flaws in this server caused by the upgrade of the core Python libraries which we overlooked.

Lessons: We were in too much of a hurry; we should have stopped and more thoroughly tested this stage. Several very basic things would have highlighted issues before we went further (for example, the site was still running from memory, but as soon as we restarted Apache later in the night, it fell over).

For the future: With the benefit of hindsight — we should configure our environment so that we have a SEPARATE installation of Python from the core server version. This would have let us upgrade independently without trashing the server’s core version.
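
As a small extra safety net, the site itself could refuse to start on anything other than its own dedicated interpreter, which would make a mix-up between the two installations obvious straight away. This is a hypothetical sketch; the /opt/python location is an assumption, not our actual layout:

    # Hypothetical startup guard: refuse to run on the Ubuntu system Python.
    # The /opt/python prefix is an assumed location, not our actual layout.
    import sys

    EXPECTED_PREFIX = "/opt/python"

    if not sys.prefix.startswith(EXPECTED_PREFIX):
        sys.exit("Refusing to start: interpreter %s (prefix %s) is not under %s. "
                 "The site must not run on the system Python."
                 % (sys.executable, sys.prefix, EXPECTED_PREFIX))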

Issue 3 — Upgrade Two — Indexing Server

Suspecting things weren’t quite perfect, we did table the idea of stopping at this point (one near disaster, but the server was back up), sleeping on it, and coming back tomorrow afternoon when things were quieter on the server. We decided to push on — the first upgrade was “fine” (wrong), so let’s do the second.

We shut down the search and processing pipeline so that we could update the indexer (nothing wrong with this, in fact it was a good decision. It’s not uncommon for us to do this for 10 minutes at a time to make some changes).

  1. Followed same upgrade path as with previous server. This time we couldn’t restart the indexer.
  2. Started researching and NOW realised what we’d done (twice) *sigh*.
  3. Began to try and recover. We realised one correct way would have been to upgrade the Ubuntu distro to 10.10, which includes Python 2.6.6 in the core. As noted above, this is probably not the best way (we should keep the site and core versions independent of each other), ALTHOUGH this would have worked. We successfully upgraded two other servers using this approach and verified everything was fine.

We never recovered the Indexing server.

On the positive side — we had a recovery path identified very early on (copy the files to an upgraded server and bring that up as the Indexer).

Lessons: Bugs / problems are HARD. By definition you don’t know the solution. Our biggest problem at this stage was not properly time boxing the effort. We should have said “we’ll try for an hour, then move to plan B”; instead we tried for almost 5.5 hours before falling back to plan B. From the time we finally called it (in the chat log) to when we actually STOPPED was still another 30 minutes. Starting plan B sooner would have shortened our recovery time by some 5–6 hours.

For the future: Time box; if it’s not working, move on to plan B.

Issue 4 — Recovery

This wasn’t bad, just time-consuming. It went reasonably smoothly.

For the future: We could have saved some time by better documenting core components and what they require. It was a bit of trial and error to bring a couple of things up (for example, there were firewall rules for the Mail queue which needed access back to the SQL server).
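
Even a throwaway script that records those dependencies and checks them before bringing a component up would have cut out most of the guesswork. A rough sketch follows; every hostname, port and component name in it is made up:

    # Hypothetical dependency map and reachability check.
    # All hostnames, ports and component names are made-up examples.
    import socket

    DEPENDENCIES = {
        "mail-queue": [("db.internal", 3306)],   # mail queue needs the SQL server
        "indexer":    [("db.internal", 3306)],
        "pipeline":   [("indexer.internal", 8983), ("db.internal", 3306)],
    }

    def reachable(host, port, timeout=3):
        try:
            socket.create_connection((host, port), timeout).close()
            return True
        except socket.error:
            return False

    for component, deps in sorted(DEPENDENCIES.items()):
        for host, port in deps:
            status = "ok" if reachable(host, port) else "BLOCKED (check firewall rules)"
            print "%-10s -> %s:%d  %s" % (component, host, port, status)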

Issue 5 — Current State

The end result of all this is one server totally out of commission, and one partially out of commission. While the site is all back up and running, we still have to rebuild the kaput server from scratch, swing the database on to that, then rebuild the partially working server.

Final observations

While there were lots of issues of our own making, it should be noted that a lot of things went right that are not easily tested except under this kind of stress. The databases were robust and survived (the loss of the slave hasn’t caused any issues), and the architecture “coped” with the downtime. Our splitting of components and configurations has been successful (with simple config changes we could move components around without any real problems once firewall rules etc. were sorted out).

While there were technical issues and mistakes made, I think the real root cause was tiredness and haste. We had several clear “backout / abort” points, but we glossed over errors to try and “get it done”.

Final Lessons

It’s not immediately obvious here, but one thing we are going to do is have a code freeze. Every time we deploy minor upgrades (and the email issue was minor, even if the upgrade wasn’t so much), there seems to be a follow on issue or two. What’s happening is that our major release of the new search engine keeps getting pushed back (and now pushed back even further) while we resolve small, irritating problems.

We might still have stuffed up, but if we had this next release out of the way, we’d have been in less of a hurry to deploy and push things through. It comes down to focus and prioritisation. We should have weighed up the cost of broken unicode (annoying) against the risk of destroying our server (significant impact, even if we might have assessed it as low risk).