Jack Barber / Website Design

Server Trouble

Early in the morning on Sunday 26th May 2013 my server, which runs almost every site I've ever built and every associated email account, suffered a hard drive failure.  What follows is a frank assessment of what went wrong, why it went wrong, what I did to rectify the situation and what I'm doing to prevent something like this happening again.


What happened

I was woken (thankfully) by an eagle-eyed client who couldn't access his website.  Within a few seconds I'd worked out it was more than a caching issue or domain name renewal problem.  The firm I lease the server from, Hetzner, got me access to the server and it was clear that a hard drive had failed.

I requested that they replace the broken drive.

What Should've Happened...

The server runs two hard drives in a RAID - they're a copy of each other, and whilst a RAID is not a replacement for a backup, it's the first line of defence against hardware failure and data loss.  In this instance, a new drive should have been fitted, the server brought back online, and the second drive should have kept everything running as it was before, given that it was an exact copy of the drive which failed.
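As an aside, this is the sort of thing that can be checked automatically.  The sketch below is illustrative rather than what was actually running at the time - it assumes a Linux software RAID (mdadm) mirror, which is common on leased dedicated servers, and simply parses /proc/mdstat and complains if the array is degraded.

```python
# Minimal sketch: flag a degraded Linux software RAID (mdadm) array by
# parsing /proc/mdstat.  Assumes an md RAID1 mirror; hardware RAID
# controllers report health differently and won't show up here.
import re
import sys

def degraded_arrays(mdstat_path="/proc/mdstat"):
    failing = []
    with open(mdstat_path) as f:
        text = f.read()
    # Healthy status lines end in something like "[2/2] [UU]"; an
    # underscore means a missing or failed member, e.g. "[2/1] [U_]".
    for match in re.finditer(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", text):
        total, active, flags = match.groups()
        if "_" in flags or active != total:
            failing.append(match.group(0))
    return failing

if __name__ == "__main__":
    problems = degraded_arrays()
    if problems:
        print("RAID degraded:", problems)
        sys.exit(1)  # non-zero exit so cron or monitoring can raise an alert
    print("RAID healthy")
```

Run nightly from cron, something along these lines would have flagged the broken mirror within a day of it failing, rather than a year later.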

What Actually Happened...

On 20th May 2012 the RAID array had ceased to function as it should.  I was unaware of this, so when the server came back online, the last year's content, updates, emails, new sites etc. were all missing.

I turned to my backup solution, which had a 'full' server backup from a few weeks previously.  In reality, the software which ran that backup had failed to do its job properly and the backup files were both partially corrupt and incomplete.

All this had happened by about 8.30am on Sunday morning.


Why did this happen?

The simple answer is that I didn't do enough to keep track of the health of the server, its components and its software, for which I can only apologise to those adversely affected by this problem.

In a bit more detail: hardware failure is going to get the better of everyone who relies on computers at one point or another.  The hard drives in a server are really no different to those in a standard computer - and they fail often enough.  Server drives, though, are running, reading and writing 24 hours a day, 7 days a week, 365 days a year.

At the time of the failure, I was running the Plesk control panel on the server and using its built-in health monitoring software.  Despite having disk monitoring, it failed to alert me to anything suspicious in the days leading up to the failure.
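A standalone check that doesn't depend on the control panel would have been a sensible safety net.  The sketch below is one rough way of doing it - it assumes smartmontools (smartctl) is installed, and the device paths are placeholders rather than my actual configuration.

```python
# Rough sketch of an independent disk-health check using smartctl, so
# that alerts don't rely solely on the control panel's own monitoring.
# Assumes the smartmontools package is installed; the device paths are
# placeholders and should match the real hardware.
import subprocess

DRIVES = ["/dev/sda", "/dev/sdb"]

def check_drive(device):
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True
    )
    # smartctl -H prints a line such as:
    # "SMART overall-health self-assessment test result: PASSED"
    healthy = "PASSED" in result.stdout
    return healthy, result.stdout

if __name__ == "__main__":
    for drive in DRIVES:
        ok, report = check_drive(drive)
        if ok:
            print(f"{drive}: OK")
        else:
            # In a real setup this would send an email, not just print.
            print(f"WARNING: {drive} may be failing\n{report}")
```

SMART warnings don't catch every impending failure, but they're cheap to check and would have given at least a fighting chance of spotting a deteriorating drive before it died.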

Also, Plesk's backup software failed to work effectively.  It appeared to have backed up the entire server a few weeks earlier (I admit, I should have been running it more frequently, and am now running nightly backups).  However, upon downloading the backup files it was soon apparent that they were incomplete.
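The lesson here is as much about verification as frequency: a backup you haven't checked, and haven't copied off the server, isn't really a backup.  The sketch below shows the general shape of a nightly job that archives a site directory, verifies the archive can actually be read back, and ships it to another machine - the paths and remote host are made up for illustration, not my actual setup.

```python
# Sketch of a nightly backup with a basic integrity check and an
# off-server copy.  SOURCE, ARCHIVE and REMOTE are illustrative
# placeholders, not a real configuration.
import subprocess
import tarfile
from datetime import date

SOURCE = "/var/www"                                   # hypothetical web root
ARCHIVE = f"/backup/sites-{date.today():%Y%m%d}.tar.gz"
REMOTE = "backup@offsite.example.com:/backups/"       # placeholder host

def create_archive():
    with tarfile.open(ARCHIVE, "w:gz") as tar:
        tar.add(SOURCE)

def verify_archive():
    # Re-open and walk the archive: a truncated or corrupt file fails
    # here, rather than sitting quietly on disk until it's needed.
    with tarfile.open(ARCHIVE, "r:gz") as tar:
        return len(tar.getmembers()) > 0

def ship_offsite():
    # Copy the archive to a different machine - a backup stored only on
    # the failing server is no backup at all.
    subprocess.run(["scp", ARCHIVE, REMOTE], check=True)

if __name__ == "__main__":
    create_archive()
    if verify_archive():
        ship_offsite()
    else:
        print("Backup verification failed - do not trust this archive")
```

Database dumps would need the same treatment, and ideally an occasional test restore to prove the whole chain actually works.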

To recap, by mid-morning on Sunday, the entire server was set back to 20th May 2012 - a whole year's work plus emails, including any new sites, was gone.


What I did about it

On Sunday morning I began rebuilding and restoring all websites and accounts from available backup files (some local, some from the server) and original design files as well as Google's cache of the sites.

I requested delivery of the faulty hard drive from Germany so that I could attempt data recovery.

On Monday and Tuesday the rebuilding continued.  On Tuesday afternoon it became apparent that the second drive was also on the brink of failure.  At this point, I contacted my brother, a talented programmer and server administrator, who nursed the stricken server through Tuesday night and Wednesday.

On Wednesday evening the server was taken offline at around 22:30 whilst the data was copied from the original hard drive to the new drive.  This process failed at around 2am and was started again.  At 7am the hard drives were reconfigured, and the server was rebooted at 7:35am on Thursday morning.

By Friday many of the smaller sites had been restored and the faulty hard drive was delivered (via TNT).  I began data recovery on the disk and was able to copy all the database and website files from it over Friday and Saturday, continuing to rebuild more websites as I went.

By Monday this week most sites were back to normal.


What next?

So, it's now March 2015 - I never actually finished this post, but thought I'd publish it anyway. Oh, and my hosting's now with Bytemark in York - they're great, and I've now (almost 2 years later) recovered from the trauma of this event :).