Wednesday, November 30, 2005

Amusing story from the back office...

I haven't had a chance to describe what it is I do for Amazon, and this story needs some background, so here goes.


I'm a software development engineer (read: "Code Monkey") in a group which variously goes by the names Customer Master Systems/Service and Identity (you might say we have a bit of an Identity crisis, but you wouldn't because we'd have to hurt you for such a pun). We're basically responsible for all of the core customer information (names, addresses, etc., though not credit cards or bank accounts) -- storing it reliably, making it available to the various applications inside the company that need it, and enforcing things like privacy policies.

The last bit is surprisingly non-trivial -- keep in mind that Amazon runs a number of third-party sites like Target, The Bombay Company, and the NBA store, whose privacy policies may state that they won't share personally identifiable information with third parties. It's up to us to make sure the bits associated with, say, your Target account information never mix with the programs running for Bombay or even Amazon itself (!). At the same time, we're constantly monitoring for suspicious activity -- we're often able to shut down hacked accounts before the real owner even knows they were hacked. Multiply this by a few million users on the site at any given moment and you have a lot of bits that need to go from point A to point B without crossing points C, Q, or Z.

So many bits, in fact, that even though we have some of the fastest network equipment in the world, we're always looking for ways to reduce the load. One of our other engineers implemented a way to offload a significant chunk of this traffic while maintaining all the other features and constraints. We tested it, beat on it, stomped out all the bugs we could find, and declared, "This is good. Let's deploy it."

So we pushed the new software live. We watched the traffic on it grow, and it seemed stable. Things went this way for 12 hours or so, and nobody noticed a thing.

Then a bug was filed. It said, "Hey, 1-click isn't working!" Worse, the bug was from Jeff Bezos.

For those who don't know, Jeff is the founder and CEO of Amazon. He's one of the inventors of 1-click (where you just click a single button next to each item you want and the items magically appear on your doorstep -- you don't need to explicitly check out, fill out payment info, etc., if you already have an account). Whether 1-click is a significant contributor to our bottom line is irrelevant. The rule is: Thou shalt not break 1-click. The corollary is: If you do break 1-click, you scramble to fix it.

You definitely do not leave 1-click broken for 12+ hours and let Jeff (no, he doesn't need a last name inside the company) find the bug for you.

(We did actually test this, incidentally; it's one of those cases where it was only intermittently broken and happened to work in the test case. These are usually the worst kinds of bugs.)

Anyway, I was one of the three engineers on the call fixing it. I was finding reproducible cases where it was broken and transcribing the phone call. Another engineer was actually doing the fix. The third was verifying the fix -- that is, he was going around to all of the broken cases I found and making sure 1-click did work.

Today, he came into the office and said: "Hey, Dave, remember the 1-click bug from last week?"
Me: "Yes..."
Him: "You know how I was testing and making sure it worked?"
Me: "Yes..."
Him: "Well, last night I had about 40 items show up on my doorstep."

The office pauses, then bursts out laughing.
