Yesterday one of the sites hosted on my server hit the digg front page. It wasn’t one of my posts, but one from a good friend (nice one Jonny!). Anyway, the massive surge in traffic caused some issues…
Firstly, let me say that I was completely unprepared for this. I hadn’t done much performance tuning, because honestly, I’d never seen the point since the traffic was so low. Hitting the digg front page has shown me the error of my ways…
Roughly, this is what happened once the article hit the front page:
I think it was around 9:00pm (UK time) when it started. The site runs WordPress, and was on version 2.0 at the time. As the traffic ramped up, it (predictably) began to crawl. After a while (at around 9:15), MySQL decided to die. The site now (slowly) displayed the ‘Wordpress database error’ page (which I’m sure is familiar to digg users).
I attempted to SSH into the server to try and get things going again, but I was greeted only by ‘connection refused’. This is when things became awkward. With no shell access to the box I had no way of seeing what was going on. Apache was still serving pages, albeit pretty slowly, but the database was a no show. The page was still receiving loads of hits, all to the unfriendly error page. Not good…
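In hindsight, a quick way to tell ‘refused’ apart from ‘timed out’ from another machine would have been handy. A refused connection usually means the box answered but nothing accepted on that port, whereas a timeout suggests packets are being dropped or the host is unreachable. Here is a minimal sketch of such a check, not something I ran at the time; the `probe` helper and the example ports (22 for SSH, 80 for HTTP) are my own assumptions.

```python
import socket


def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port as 'open', 'refused', 'timeout' or 'error'."""
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host replied, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No reply within the timeout: dropped packets or dead host.
        return "timeout"
    except OSError:
        # Anything else (DNS failure, no route to host, ...).
        return "error"
```

Calling `probe(host, 22)` and `probe(host, 80)` against the server's address would show in a couple of seconds whether SSH alone is down while the web server still answers.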
I tried the admin console for my VM (the server is a UML instance hosted at bytemark), but I was unable to get a login via any of the ttys or serial consoles. I’m not sure if this was actually caused by the ‘diggpocalypse’: I’d never tried logging in this way before, so it may never have worked. With nothing else to try (what else can you do once you’re unable to log in?), I attempted a reboot.
Ctrl+Alt+Del did nothing - of course on the VM this is simply issuing the ‘cad’ command from the admin console. I waited around 10 mins, to allow any dead processes to time out in the reboot cycle, but still nothing. The machine was still up, Apache was still serving pages, and hits were still flooding in, but still no SSH.
Time for ‘emergency measures’. With bytemark’s admin system I performed a ‘rescuehalt’. This forcibly terminates the kernel, but it allows the discs etc to be cleanly stopped. It went down fine, which made me happy (dreading loss of data…). Unfortunately, it didn’t come back up…
Starting the VM in rescue mode provided me with this error when mounting the rootfs:
mount: wrong fs type, bad option, bad superblock on /dev/loop0,
Panic stations! I tried a couple of other things, but each time I started the VM, I got the same error. By this time it was getting pretty late, and I’d run out of things to try. I fired off an email to bytemark support, and went to bed…
I was greeted in the morning by two emails from bytemark - they’d got it sorted for me by 11am. Apparently, their scripts were having issues mounting my rootfs (which is loopback file / ext3) - this was fixed by manually adding a journal entry. They didn’t go into details of exactly why this happened, but it was apparently a bug in their scripts (now a fixed bug, of course).
Secondary to the mount issues, there were problems with the kernel version I was using. I was still on the 2.4 kernel, which was the default when I first got the machine. I’d never seen the need to upgrade, and I didn’t have the time to deal with it, so I just left it alone. The bug in the script was partly due to the old kernel (or more precisely udev misbehaving), so to fix it the bytemark guys upgraded me to 2.6.20 (with bytemark UML patches). The upgrade was trouble free, as far as I can tell.
Much kudos to bytemark - their level of service continues to impress me.
So, by lunchtime today I had the server back up and running.
I’ve spent a bit of time tuning the server to allow it to better cope with this kind of load in the future, but I’ll save that for another post…
{ 1 } Comments
I love reading about tuning small sites to meet big traffic demands. I guess because it’s about getting the most out of what you’ve got.