
Tough decisions in real time: ROSL83

Tim Bull

12 Jun 2011 • 4 min read

ROSL83 — Reflections on startup life, week 83

I don’t think I ever expected that, among the many skills I would acquire, being a tightrope artist would be one. It turns out that startups are the world’s greatest balancing act.

We are constantly bombarded with decisions, and often the answers are in conflict: if we do this, then that happens; if we do that, then this won’t. Typically this is a question of resource allocation (we can do a or b, and if we do a we can’t do b yet), which, when you get down to it, is really about speed of execution. It’s not great, but as long as you can keep things in balance and move both a and b forward over time (or, my personal favourite, complete a and then worry about b), you’re heading in the right direction.

Every now and then, however, there are situations that send you off balance and spiraling out of control. To resolve them you’re faced with tough choices, like the one last week that caused us to re-enter closed beta.

The basic problem was the performance and stability of Trunk.ly. Over the last few weeks in particular, we’ve been fighting more and more issues with the processing pipeline: ever larger volumes of inbound links meant that the site was slowing down and becoming unstable, and we weren’t able to keep links up to date as often as we (and users) would like. The biggest problem was that it kept pulling our time away from things that would improve the site and into maintenance. As an example, a “critical” task we’d decided to address two weeks ago was still on the todo list because of these real-time stability priorities.

Eventually we decided to temporarily close off new signups for three main reasons:

We weren’t handling the link volume we already had, so taking in more links seemed silly.
There was a suspicion that the onboarding of links from new users was causing the issue, or at the least making it worse.
The experience for new users is one area that’s definitely an issue in Trunk.ly anyway, but with these processing problems, it was now way below optimal. We’d be better off collecting email addresses, resolving a few issues and then inviting people.

None of this is unique. The reality for all startups, I suspect, is that you hit a point where your own success begins to “eat you alive”: you’re now spending so much time on maintenance and performance that you’re not driving the product forward. To break out of this you have to change modes. Here are the five things we had to do to get through this.

Stop!

This was the toughest — we failed to recognise there was a real issue for a long time. So stop ignoring the signals, stop pushing it off.

Think!

Apply some critical analysis to the issue. Why is it happening? What might be the likely cause? What probably ISN’T the cause? You may find at this stage that you’re already quickly homing in on the likely issue. The key realisation for us was that there really was something critically wrong: given the number of users and some rough calculations, there was just no way we should have been collecting the number of links the pipeline was seeing. This meant that the answer wasn’t going to be “throw more hardware at it” but rather that something quite fundamental in the code was not working properly.
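To make the “rough calculations” idea concrete, here is a minimal sketch of that kind of sanity check. It is not our actual analysis: the function names and every number in it are made up for illustration. The idea is simply to compare the link volume you would expect from your user count with what the pipeline reports.

```python
def expected_links_per_day(active_users: int, links_per_user_per_day: float) -> float:
    """Rough estimate of the links the pipeline should see in a day."""
    return active_users * links_per_user_per_day


def overshoot_ratio(observed_links_per_day: float, expected: float) -> float:
    """How many times larger the observed volume is than the estimate."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed_links_per_day / expected


# All figures below are hypothetical, purely for illustration.
expected = expected_links_per_day(active_users=10_000, links_per_user_per_day=20)
ratio = overshoot_ratio(observed_links_per_day=2_000_000, expected=expected)
print(f"expected about {expected:,.0f} links/day, seeing {ratio:.0f}x that")
```

If the ratio comes out well above 1, growth alone is unlikely to explain the load, and more hardware is unlikely to be the fix.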

Clear the decks

This will challenge the more corporate of you, but for startups, sometimes the quickest way to resolve a problem is to take some hard actions and clear the decks. We did this by closing off new signups. It was a tough decision, but what it immediately showed us was that new users weren’t the issue after all, which let us move on to working out where we really needed to measure. You won’t always be able to do this, but by removing some pieces we could work out much more quickly where we needed to focus.

Gather

The reality is that as a startup you don’t have endless metrics at your fingertips. However, you need to be prepared to step up at this stage and put something in place. We now knew the problem was deep in the processing pipeline, but we had no visibility into that process. We didn’t know which users were storing the most links, and we didn’t know where most of the links were coming from. It took us a day or so to get this in place, but now we have a “good enough” pipeline monitoring solution which lets us see which links are being processed and so forth. This showed us a couple of key things which led us to investigate our connectors (which extract links) further.
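As a rough illustration of what “good enough” monitoring can look like, here is a minimal sketch that counts processed links per user and per source. It is not our actual monitoring code: the `LinkEvent` record, its fields, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class LinkEvent(NamedTuple):
    user: str    # hypothetical user identifier
    source: str  # hypothetical connector name, e.g. "rss" or "twitter"
    url: str


def summarise(events: Iterable[LinkEvent], top: int = 3):
    """Count processed links per user and per source connector."""
    per_user = Counter()
    per_source = Counter()
    for event in events:
        per_user[event.user] += 1
        per_source[event.source] += 1
    return per_user.most_common(top), per_source.most_common(top)


# Made-up sample events, purely for illustration.
events = [
    LinkEvent("alice", "rss", "http://example.com/1"),
    LinkEvent("alice", "rss", "http://example.com/1"),
    LinkEvent("alice", "rss", "http://example.com/2"),
    LinkEvent("bob", "twitter", "http://example.com/3"),
]
top_users, top_sources = summarise(events)
print(top_users)
print(top_sources)
```

Even something this crude answers the two questions we couldn’t answer before: who is storing the most links, and which connector most of them are coming from.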

Act

Try something, do something. Once you’ve got some reasonable information, you’re almost certainly at a point where doing something is better than doing nothing. Having identified a couple of likely places, we dived into the code and found that our RSS connector was at fault.

So that’s it. Last week was consumed by performance issues of various kinds, but by actually stopping to recognise that there was a real issue (beyond just “we’re growing”) we’ve finally got stuck in and resolved a real root cause, freeing up time to reinvest in the site.

I’ll leave you with this picture, which was taken by my Mum in rural India recently when she accompanied Dad on a trip there for work. Dad works in agriculture and was out there visiting some banana plantations (among many other places). These ladies carry the bananas from the plantation back to the truck along paths through the fields; each bunch weighs between 25kg and 35kg, so two bunches means they are carrying between 50kg and 70kg (110 to 154lbs). Now that’s a balancing act!