Two weeks ago I started receiving automated messages from one of our 3ware 9650SE RAID cards about a rising number of SMART errors on one of the 2TB hard drives attached to the card. Within a few days of the first message, ZFS was nice enough to take the drive in question out of service and replace it with one of the drives we had set aside as an ‘online spare’ for that specific pool.
So far so good.
Two terabytes of data is a decent amount, so I assumed that the resilvering might take some time, and I was able to confirm that after logging in and looking at the output of the ‘zpool status’ command. The output indicated that it was going to take several more hours before the resilvering process would be complete.
So far so good.
The next day I logged into the server to check on the progress. Not only had the job not yet completed, but the ‘zpool status’ command had almost doubled its estimate of the time required to fully resilver the drive.
It was at this point that I started to suspect that our automated snapshot policy (which runs hourly, daily, weekly and monthly via cron) might be hampering the resilvering progress. A quick Google search indicated that at some point in the past, bug number ‘6343667’ had in fact been associated with degraded scrub and resilvering performance during periods in which snapshots were being taken. It appears that some older versions of ZFS restarted the entire resilvering process whenever a snapshot was taken.
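One way to test that suspicion is to capture the ‘zpool status’ output at intervals and watch whether the reported percentage ever drops, which would suggest that the resilver started over. The following is only a minimal sketch, not a tested monitoring tool. It assumes the status output contains a phrase of the form ‘12.34% done’ (the exact wording varies between ZFS releases), and the function names are made up for illustration.

```python
import re

# Assumption: the resilver progress line contains text such as "12.34% done".
# The exact layout of `zpool status` differs between ZFS releases.
PERCENT_DONE = re.compile(r"(\d+(?:\.\d+)?)% done")


def percent_done(status_text):
    """Return the reported completion percentage, or None if none is found."""
    match = PERCENT_DONE.search(status_text)
    return float(match.group(1)) if match else None


def find_restarts(samples):
    """Return indexes of samples whose percentage is lower than the previous one.

    `samples` is a list of `zpool status` outputs captured over time.
    A drop in the percentage hints that the resilver began again.
    """
    restarts = []
    previous = None
    for index, text in enumerate(samples):
        current = percent_done(text)
        if current is None:
            continue  # no progress line in this sample; skip it
        if previous is not None and current < previous:
            restarts.append(index)
        previous = current
    return restarts
```

If the indexes returned by find_restarts line up with the times at which cron takes snapshots, the snapshots are a likely culprit.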
According to bug number ‘6343667’, this issue was resolved with the release of ZFS pool version 11. I double-checked the version on the server in question and discovered that we were running version 13.
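For reference, the pool version can be read with ‘zpool get version’ followed by the pool name. A rough sketch of comparing it against the version in which the bug was reportedly fixed is below. The helper names are hypothetical, and the parser assumes the usual NAME / PROPERTY / VALUE / SOURCE table layout, which may not hold on every release.

```python
import subprocess

FIXED_IN_VERSION = 11  # pool version named in the bug report


def parse_pool_version(get_output):
    """Extract the version number from `zpool get version <pool>` output.

    Assumes a header row followed by a row like:
        tank  version  13  default
    Returns None if no numeric version is found (for example, "-").
    """
    for line in get_output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == "version" and fields[2].isdigit():
            return int(fields[2])
    return None


def pool_version(pool):
    """Run `zpool get version` for the given pool and parse the result."""
    result = subprocess.run(
        ["zpool", "get", "version", pool],
        capture_output=True, text=True, check=True,
    )
    return parse_pool_version(result.stdout)


def has_snapshot_fix(version):
    """True if the pool version is at or above the reported fix."""
    return version >= FIXED_IN_VERSION
```

Calling pool_version with your own pool name and passing the result to has_snapshot_fix tells you whether the pool should already include the fix. As my own experience shows, a True result does not guarantee that snapshots are harmless during a resilver.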
At this point I am unsure whether the problem I experienced had anything to do with that specific bug. What I do know is that after commenting out the automated snapshot entries in the crontab on that server, the drive resilvering finished quickly and without error, and I have not had any problems since.
Just remember to re-enable the snapshots after the resilver and you should be all set.
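If you would rather not edit the crontab by hand, the pause and restore steps can be sketched as a pair of text filters. This is an illustration under assumptions, not what I ran on the server: it assumes the snapshot jobs can be recognised by the string ‘zfs snapshot’ in the crontab line, and the marker text and function names are invented for the example.

```python
MARKER = "#RESILVER-PAUSED# "  # hypothetical tag so only our edits are undone


def disable_snapshot_jobs(crontab_text, keyword="zfs snapshot"):
    """Comment out active crontab lines that contain `keyword`."""
    lines = []
    for line in crontab_text.splitlines():
        is_active = not line.lstrip().startswith("#")
        if is_active and keyword in line:
            line = MARKER + line
        lines.append(line)
    return "\n".join(lines) + "\n"


def enable_snapshot_jobs(crontab_text):
    """Restore lines previously commented out by disable_snapshot_jobs."""
    lines = []
    for line in crontab_text.splitlines():
        if line.startswith(MARKER):
            line = line[len(MARKER):]
        lines.append(line)
    return "\n".join(lines) + "\n"
```

The idea is to pass the text printed by ‘crontab -l’ through disable_snapshot_jobs before the resilver and through enable_snapshot_jobs afterwards, installing the result with crontab each time. Because the restore step only touches lines carrying the marker, entries that were already commented out stay that way.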