ZFS crash during high I/O

After successfully completing a ‘zpool replace’ I was not so pleased to get the following error message back from ‘zpool detach’:

cannot detach c5t17d0: no valid replicas
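For context, the sequence I had run looked roughly like the following (a rough sketch; the pool name ‘tank’ and the replacement device name are placeholders, not my actual configuration):

# replace the failing disk with the new one and let the resilver run
zpool replace tank c5t17d0 c5t20d0

# watch the resilver progress until it reports completion
zpool status tank

# once the resilver is done, detach the old disk -- this is the step that returned the error
zpool detach tank c5t17d0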

I decided to upgrade this OpenSolaris 2008.11 instance to OpenSolaris 2009.06 to see whether the bug I was hitting had been resolved in the newer release. Since an OpenSolaris upgrade automatically creates a new boot environment, there is very little risk in updating: you can always boot back into the previous environment at any time.
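The upgrade itself is just a package image update; the steps below are a sketch of the general procedure, assuming the package publisher is already pointed at the 2009.06 repository:

# pull down the new release; this clones the current boot environment
# and applies the updates to the clone
pfexec pkg image-update

# confirm the new boot environment was created and is marked active on reboot
beadm list

# reboot into the newly activated boot environment
pfexec init 6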

The upgrade was a success, and after I booted into 2009.06 I was able to simply detach the failed drive from the pool and thus remove it from the system.

I recompiled Gluster and ran 2009.06 for a couple of days, until I started noticing that the server was rebooting during periods of high I/O. A peek inside ‘/var/adm/messages’ revealed the following kernel panic:

Aug 15 22:33:04 cybertron unix: [ID 836849 kern.notice]
Aug 15 22:33:04 cybertron ^Mpanic[cpu0]/thread=ffffff091060c900:
Aug 15 22:33:04 cybertron genunix: [ID 783603 kern.notice] Deadlock: cycle in blocking chain
Aug 15 22:33:04 cybertron unix: [ID 100000 kern.notice]
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9651f0 genunix:turnstile_block+795 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965250 unix:mutex_vector_enter+261 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9652f0 zfs:zfs_zget+be ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965380 zfs:zfs_zaccess+7c ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965400 zfs:zfs_lookup+333 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9654a0 genunix:fop_lookup+ed ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965550 genunix:xattr_dir_realdir+8b ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9655a0 genunix:xattr_dir_realvp+5e ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9655f0 genunix:fop_realvp+32 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965640 genunix:vn_compare+31 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965860 genunix:lookuppnvp+94c ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965900 genunix:lookuppnatcred+11b ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965990 genunix:lookuppnat+69 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965b30 genunix:vn_createat+13a ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965cf0 genunix:vn_openat+1fb ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965e50 genunix:copen+435 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965e80 genunix:openat64+25 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965ec0 genunix:fsat32+f5 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965f10 unix:brand_sys_sysenter+1e0 ()
Aug 15 22:33:04 cybertron unix: [ID 100000 kern.notice]
Aug 15 22:33:04 cybertron genunix: [ID 672855 kern.notice] syncing file systems…
Aug 15 22:33:04 cybertron genunix: [ID 904073 kern.notice] done

My efforts to find any further details about this bug are ongoing, so for now I have booted back into 2008.11 and will be running that until a fix or a workaround is found.
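Booting back was painless thanks to the boot environments; the rollback looks something like this (the boot environment name below is a placeholder for whatever ‘beadm list’ reports for the 2008.11 install):

# list the available boot environments and note the 2008.11 one
beadm list

# mark the older environment to be booted next, then reboot into it
pfexec beadm activate opensolaris-2008.11
pfexec init 6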

4 thoughts on “ZFS crash during high I/O”

  1. shainmiley Post author

    Yes…we are still using this solution in production. I am starting to think about the next-gen setup though, mainly because now that Red Hat has bought Gluster, our current setup, which was only loosely supported, will definitely not be supported. The support itself is not such a huge deal; however, I would like to take advantage of all the bugfixes and new features that have become available since 3.0.5.

    Right now I am leaning towards going with linux-raid (software), with XFS and maybe LVM (although from what I am hearing I should try to avoid combining software RAID and LVM). I may use the 3ware hardware RAID on the 9650s…but I am not sure I want to do that either.

    Luckily for us we are currently only using 22 of the 55 usable TB…so we have a little bit of time to decide what we want to do. The best case as far as I am concerned would be a stable ZFS on Linux implementation + Gluster…or maybe a production-ready Btrfs + Gluster setup…that way we would have online scrubbing and snapshots (a feature Gluster plans to release in Q3).

    So overall, yes…as long as I have stayed on OpenSolaris 2008.11, we have been up and running great for about a year.

  2. Csepulvedab

    Thanks for the answer!

    Do you use SSD disks for L2ARC?
    Do you know if I can use an SSD disk for L2ARC with ZFS on Linux?

    Thanks!

  3. shainmiley Post author

    We are not currently using any SSD drives for any of our storage, but I think we will be exploring those options going forward. I am not sure about the options concerning SSD/L2ARC for either zfsonlinux or zfs-fuse (for what it's worth, a generic sketch of adding a cache device to a pool follows these comments). If you find out the answer, please feel free to share it with us here.

    Thanks,

    Shain
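For reference, on ZFS implementations that support L2ARC, an SSD is attached to an existing pool as a cache device with something along these lines (pool and device names are placeholders):

# add an SSD as an L2ARC cache device
zpool add tank cache c7t0d0

# the device should now appear under a 'cache' section in the pool status
zpool status tank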
