As many of our customers know, today on Box20 we had an eventful day of the kind that comes along about once in a blue moon: an event that isn’t foreseeable and that, if not handled correctly, can devastate our company and our customers.

A little after 7am EDT (GMT -4), upon entering the office, we noticed an anomaly in Box20’s logfile: a very inconspicuous message from smartd (a Linux daemon that monitors the SMART self-monitoring system built into the hard drive itself to detect impending failure).  Our policy is to treat our customers and their data as our own: if we had company data on this server (which we did not), we would want it investigated, so we did just that.

Directly from the log:
Mar 18 08:53:55 box20 smartd[3962]: Device: /dev/sda, 4 Offline uncorrectable sectors
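
For readers who want to watch for the same warning on their own machines, here is a minimal sketch of how a log line like the one above could be picked apart. It is an illustration only, not necessarily the tooling we used: the function names parse_uncorrectable and is_getting_worse are made up for this example, and the pattern is written only for the exact message format quoted above (other smartd messages are worded differently).

```python
import re

# Matches only the message format quoted in this post; this is an assumption,
# since smartd emits other messages with different wording.
SMARTD_LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+), "
    r"(?P<count>\d+) Offline uncorrectable sectors"
)


def parse_uncorrectable(line):
    """Return (device, sector_count) if the line matches, otherwise None."""
    match = SMARTD_LINE.search(line)
    if match is None:
        return None
    return match.group("device"), int(match.group("count"))


def is_getting_worse(counts):
    """Return True if the count rose between any two consecutive readings."""
    return any(later > earlier for earlier, later in zip(counts, counts[1:]))


line = ("Mar 18 08:53:55 box20 smartd[3962]: Device: /dev/sda, "
        "4 Offline uncorrectable sectors")
print(parse_uncorrectable(line))  # prints ('/dev/sda', 4)
```

Feeding the counts from successive log entries into is_getting_worse gives a crude version of the "progressively getting worse" check: any rise between readings is treated as a reason to investigate.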

After running a SMART scan, we noticed the errors were progressively getting worse, which indicated the hard drive’s remaining lifespan was limited to weeks, if not minutes.  With this knowledge, most providers would have rolled the dice and gambled with their customers’ data to ensure they stay above the 99.9% uptime guarantee they offer, only to cite their terms and conditions later on, stating that they are not responsible for data loss or backups.  Even though our policy states the same thing, we believe we still have both a moral and an ethical obligation to our clients, which we upheld and honored.

Simply put, we do not gamble with our clients’ data.  We never have, and we never will.  Within seconds of confirming that SMART indicated failure, we shut the machine down and placed it on the bench to start cloning its contents to a new drive, leaving as little as possible to chance and luck.

We successfully cloned every bit (as in 1/8th of a byte) of data and were able to bring all of our clients’ sites back online with 100% data integrity.  Though a 1 TB hard drive takes a while to clone, we hope our clients are happy and appreciate our very quick call to action in getting this issue pinned down before it could become a real problem.   We were able to bring the server online just a little less than four hours later at 2:24PM EDT (GMT -4).  For us, those hours were well spent: waiting for a new server install and a data upload from backups could have taken eight times as long.
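
One common way to confirm that a clone really is bit-for-bit identical is to compare checksums of the source and the copy. The snippet below is a minimal sketch of that idea, not a description of the exact procedure we followed. The names sha256_of and clone_matches are hypothetical, and it assumes both paths point to images or devices of the same size (a larger target drive would hash differently unless only the cloned range is read).

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file or disk image in chunks so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as source:
        # Read until read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: source.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def clone_matches(original_path, clone_path):
    """Return True when both paths hold byte-for-byte identical contents."""
    return sha256_of(original_path) == sha256_of(clone_path)
```

Calling clone_matches with the path of the original image and the path of the copy returns True only when the two are byte-for-byte identical.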

See Event Logs

Want to see how this event transpired with precise times? Take a look at the event log!