Tuesday, November 27, 2007

They say that multi-drive failures are infrequent enough that you generally don't need to worry about them. RAID-5 is good enough.

Well, buy me a lottery ticket. I had to deal with a multi-drive failure on one of our login databases on Sunday night -- first drive failed at 9:30 pm, then the second at 10:08 pm. It took us about six hours to recover the database. Fortunately, our applications are resilient to a single-database failure, so there wasn't any impact -- but those were a tense six hours nonetheless.

Friday was another interesting day. A few web devs thought it would be neat to implement an AJAX script that updated a progress bar showing the progress of a sale (what percentage of items had been sold). Alas, they didn't think through the impact their little script would have once a few million users were each hitting the site with a refresh request every half second. The script was written badly enough that it grabbed data from multiple services and then ignored 90% of the information it retrieved. The net result? A few services and networking devices melted down. And guess who got to help clean up the mess?
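For a sense of scale, here is a rough back-of-envelope sketch of the load that kind of polling generates. The specific numbers are my illustrative assumptions, not measurements from the incident: "a few million" is taken as 2 million users, and "multiple services" as 3 back-end calls per refresh. The function name `polling_load` is made up for this example.

```python
def polling_load(users, interval_s, backend_calls_per_request=1):
    """Estimate steady-state load from naive fixed-interval polling.

    Returns (frontend_rps, backend_rps): requests per second arriving at
    the web tier, and calls per second fanned out to back-end services.
    """
    frontend_rps = users / interval_s
    backend_rps = frontend_rps * backend_calls_per_request
    return frontend_rps, backend_rps


# Hypothetical inputs: 2 million users, one refresh every 0.5 s,
# each refresh fanning out to 3 back-end services.
frontend, backend = polling_load(
    users=2_000_000, interval_s=0.5, backend_calls_per_request=3
)
print(f"{frontend:,.0f} requests/s at the web tier")
print(f"{backend:,.0f} calls/s to back-end services")
```

Under those assumptions the progress bar alone would account for millions of requests per second, and every extra service call the script makes multiplies that again. Stretching the polling interval or serving one cached percentage to everyone shrinks the numbers by orders of magnitude.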
