Drive Failures

Written by Michael Cole - July 3rd 2015

While preparing for tomorrow's party, I'm doing up some pulled pork, so I have plenty of time (around 14 hours) to monitor temperature, wood, etc. That gave me some time to work on computer stuff in between. All of my parts have arrived and I have things at least partially set up the way I want: new UPS, new switch, and some other things. I noticed that the current hardware RAID solution on my Linux server was being a bit flaky (yes, that's a technical term). This prompted me to move some important content to my new RAIDZ2 setup, so I copied several files over. The next day, my hardware RAID went down, and the device disappeared from Linux altogether. I guess it's lucky I copied that data when I did.

Except… my new RAIDZ2 was DEGRADED and resilvering. Really?! What could have happened? Then it hit too many errors and stopped. Luckily I had named the devices based on their physical location in my machine, so I knew right away it was drive 5. I checked all the cabling and it seemed fine, so I cleared the errors and tried again. Same thing… Next I tried a totally new cable and different connections (both the LSI and the motherboard itself). No matter what I did, it had the same issue. It actually got worse: now it couldn't probe the drive at all. I pulled it out and tried it in an external enclosure on a few computers, and every one of them reported bad health.
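For anyone who wants to spot this kind of thing without eyeballing the output, here is a rough Python sketch that picks the unhealthy drives out of `zpool status` text. To be clear, this is not what I ran at the time. The sample output, the pool name `tank`, and the `bay1` through `bay6` device names are all made up for illustration, and the parsing assumes the usual NAME/STATE table layout with child devices indented under their parent.

```python
# Hypothetical sample resembling `zpool status` output. The pool name and
# the bay-style device names are invented for this illustration.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz2-0  DEGRADED     0     0     0
            bay1    ONLINE       0     0     0
            bay2    ONLINE       0     0     0
            bay3    ONLINE       0     0     0
            bay4    ONLINE       0     0     0
            bay5    FAULTED      3    12     0  too many errors
            bay6    ONLINE       0     0     0

errors: No known data errors
"""

# Device states that can appear in the STATE column.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}


def unhealthy_devices(status_text):
    """Return (name, state) pairs for leaf devices that are not ONLINE."""
    rows = []  # (indent, name, state) for each row of the device table
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # the header row starts the device table
            continue
        if not in_config:
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) >= 2 and fields[1] in KNOWN_STATES:
            indent = len(line) - len(line.lstrip())
            rows.append((indent, fields[0], fields[1]))

    bad = []
    for i, (indent, name, state) in enumerate(rows):
        # A row followed by a more-indented row is a pool or vdev, not a drive.
        has_children = i + 1 < len(rows) and rows[i + 1][0] > indent
        if not has_children and state != "ONLINE":
            bad.append((name, state))
    return bad


if __name__ == "__main__":
    for name, state in unhealthy_devices(SAMPLE_STATUS):
        print(f"{name}: {state}")
```

On a real machine you would feed it the actual text, for example from `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`. This is also where naming devices by physical bay pays off: the name that comes back tells you which drive to pull.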

So now I am RMA'ing the drive to Western Digital. It doesn't make me happy that a brand new drive would be bad this quickly, but I know it can happen. That's why I'm glad I have RAIDZ2. So now I'm operating in a DEGRADED state, but I still have all the data.

I was also lucky enough to have a simple issue on the hardware RAID Linux setup: mainly airflow. I cleaned each drive, the backplane, and all the fans, and it seems to be working great.

Now it's just a matter of waiting.