18.4.13

“Disk Failure”, watch your S.M.A.R.T!


Yes, for those in any sort of "ops" or sysadmin position, the very phrase makes your blood run cold, causing cranial dermis spasms (usually around the eyes) along with that sinking feeling that you are exiting this reality at warp speed.

Well, it's that taboo topic that needs talking about.  You may say or think, “I have backups, I am fine” or “I use various RAID levels with redundancy, I’ll be fine.”  But are you really?  If this is you, READ ON!  :)

Recently I had a NAS failure.  Well, not so much a complete failure, but rather a NAS that was seemingly not happy, causing significant service disruptions and delays.  Big deal, right?  Deal with it and move on.  Well, after fully integrating a storage area network in my workplace, I was thinking this is great.  I could spin up a VM, move it from node to node, add disks when I needed more space, and all seemed hunky dory.

Many months into its service, services began to crash sporadically or suffer significant delays.  As I was diagnosing the issues, I thought there was a problem with my compute node setup in my ESXi cloud.  My virtual machines would simply freeze and show many disk errors.  I suspected an interruption in the transport layer (or on the wire, if you will), because once I rebooted the compute node all machines seemed to work again.  Slowly these issues began to bleed through to other servers connected to the SAN.

I thought, and I admit perhaps naively, that I was covered because the particular NAS (Network Attached Storage) unit serving these servers had RAID5.  If there was a disk issue, it would fail the disk over to the hot spare and, given the redundant nature of RAID5, it would continue to operate and alert me of the failed disk.  WRONG!
For those who are not familiar with RAID5: in a nutshell, RAID is the way you configure multiple hard disks to operate together.  RAID5, which is what I was using, combines 3 or more disks with fault tolerance, such that if any one disk fails the others keep functioning and there is no data loss.
So, in my case, I was using seven 2TB disks, effectively giving me six 2TB drives' worth, or 12TB of storage (less formatting).  RAID5 allows any one disk to fail without service-impacting issues, especially with the “hot spare” waiting.  (A hot spare is an installed disk on standby in case another fails.)
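
For anyone who wants to check that arithmetic, here is a minimal Python sketch.  The function name is my own invention, and it assumes all seven disks are members of the array (the hot spare adds no capacity while it sits idle).

```python
def raid5_usable_tb(disks_in_array, disk_size_tb):
    """Usable capacity of a RAID5 array, before formatting overhead.

    One disk's worth of space is consumed by parity, so the usable
    space is (number of disks - 1) * size of each disk.
    """
    if disks_in_array < 3:
        raise ValueError("RAID5 needs at least 3 disks")
    return (disks_in_array - 1) * disk_size_tb


# Seven 2TB disks, as in my array.
print(raid5_usable_tb(7, 2))  # 12
```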

Now back to the point.  I began to suspect that a particular NAS server was showing signs of complete failure.  After making sure all my backups were up to date, I began to take a look at the system.  When I looked at the console, it was violently spewing out disk error and retry messages.  Every time there was a service stall, the console would begin to projectile puke these messages that make you freeze in fear.  Could my NAS be completely failing???  Could RAID5 have deceived me?

With poise and professionalism, I regained my composure and exited my server room to face an office at 9am slowly growing full of people asking questions like “why is FogBugz not working?” and “why can't I access the shared files on the server?” … and on and on.  I was looking at a systems failure of about 50% of all the services I operate.

Looking at the web client of the NAS, I began to check the health of all the disks.  What was actually happening was a SMART drive abnormality.  (SMART, or Self-Monitoring, Analysis and Reporting Technology, is the function built into modern hard drives that can warn of pending failure.)  A disk in the RAID5 array had not completely failed; rather, it was about to.  As the drive frantically tried to reallocate bad sectors, it stalled the whole disk array.  Each time, it recovered just before the array would have timed it out and marked it as failed, so the disk was never kicked out, but every stall was service impacting.

So what I am trying to say here is that you cannot rely on RAID levels alone as your redundancy measure.  You MUST also monitor the SMART status of all the attached drives.  A drive with an imminent failure can take out an entire array and play havoc with your services.  All I needed to do was remove the bad disk, and the system failed over to the hot spare.  After the RAID rebuild completed, I added a new hot spare and all was good in the land of NAS once again.
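
To give a rough idea of what that monitoring can look like, here is a small Python sketch.  I found my problem through the NAS web client, so everything here is an assumption for illustration: it supposes a box where you can run `smartctl -A` from smartmontools, and that the drive reports the common ATA attribute names in the watch list (names vary by vendor).  The sample text is made up, not output from my failed disk.

```python
import subprocess

# Attributes that commonly hint at a dying disk (assumed names; vendors differ).
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(text):
    """Pull the raw values of the watched attributes out of `smartctl -A` style output."""
    found = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit():
                found[fields[1]] = int(fields[9])
    return found


def read_smart_attributes(device):
    """Run smartctl against one device (needs smartmontools, usually root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_smart_attributes(result.stdout)


# Made-up sample in the usual table layout, so the parser can be tried without a disk.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   095   095   036    Pre-fail  Always       -       212
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       9876
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
"""

print(parse_smart_attributes(SAMPLE))
```

On a real system you would call something like `read_smart_attributes("/dev/sda")` for each drive from cron and send yourself an alert when any of the counts is not what you expect.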

The moral of my story is to make sure all your monitoring systems are in check.  Set up regular SMART tests on all production systems.  In the case of this drive, the SMART system was telling me that there was an abnormally high rate of bad sector reallocation.  It was a bad disk.
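
Since it was the rate of reallocation that gave this disk away, not just the count, here is a hedged Python sketch of that check.  The function name, the sample history and the threshold of 10 new sectors per day are all hypothetical numbers I picked for illustration, not values from my NAS or from any vendor.

```python
def reallocation_alerts(samples, max_new_per_day=10.0):
    """Flag periods where the reallocated sector count grew too fast.

    samples: list of (day_number, reallocated_count), sorted by day.
    Returns a list of (day_number, new_sectors_per_day) for each period
    whose growth rate exceeds max_new_per_day.
    """
    alerts = []
    for (d0, c0), (d1, c1) in zip(samples, samples[1:]):
        days = d1 - d0
        if days <= 0:
            continue  # skip duplicate or out-of-order readings
        rate = (c1 - c0) / days
        if rate > max_new_per_day:
            alerts.append((d1, rate))
    return alerts


# Hypothetical weekly readings: slow growth, then a sudden jump.
history = [(0, 4), (7, 6), (14, 120)]
for day, rate in reallocation_alerts(history):
    print(f"day {day}: about {rate:.1f} new reallocated sectors per day")
```

A handful of reallocated sectors over the life of a disk is normal; a count that jumps from week to week is the thing to act on.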

As a side note, we tend to assume that modern hard drives are more stable than older hard drives.  In actuality, they have MUCH higher bad sector counts (more sectors, more failures).  They just reallocate these bad sectors and mark them as bad more frequently.  We never know about it unless we run disk utilities to view this internal data.

Also, if you rely on a NAS unit for as many services as I do here, I recommend that you have a redundant, mirrored NAS unit that you can fail over to in cases like this.  That way you can fix the issues and keep services running.  Sadly, this was not in my budget this year.

Until Next time,
//Ian\\

PS: If you have any topic you are interested in, please let me know, and feel free to leave comments and start discussion.  It would be nice to see an actual comment rather than spam :)  
