Category Archives: Zfs

zfs on linux links part II

Proxmox 3.4 was recently released with additional integrated support for ZFS.  More details are provided here in the Proxmox ZFS wiki section. I also decided to start gathering up another more current round of links related to performance, best practices, benchmarking, etc.

If you are looking for up to date links to help you understand some of the more advanced aspects and use cases surrounding zfs, or if you are just getting started and are looking for some relevant reading material on the subject, you should find these links extremely useful:

1.The state of ZFS on linux
2.Arch linux ZFS wiki page
3.Gentoo linux ZFS wiki page
4.ZFS Raidz Performance, Capacity and Integrity
5.ZFS administration
6.KVM benchmarking using various filesystems
7.How to improve ZFS performance

ZFS Day videos

The folks over at ZFS Day 2012 have posted a nice and wide ranging series of videos from last years event.

Taken from the description of the event:

‘The first ever ZFS conference covered all aspects of using ZFS in production. You can find ZFS in the most demanding environments, from video servers to cloud platforms to databases to NFS servers to HPC. Come learn about what makes ZFS a great storage system for these and other applications.’

Of particular interest to me (because of our desire to run zfs on Linux in production environments (zfsonlinux.org)) are the following two videos:

ZFS for Linux Implementation – Brian Behlendorf:

Panel: The State of ZFS on various os versions:

You can find a complete list of videos, talks and slides here.

zfsonlinux and gluster so far….

Recently I started to revisit the idea of using zfs and linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our gluster storage infrastructure.  At this point we are using the Opensolaris version of zfs and an older (but stable) version of gluster (3.0.5).

The problem with staying with Opensolaris (besides the fact that it is no longer being actively supported itself),  is that we would be unable to upgrade gluster….and thus we would be unable to take advantage of some of the new and upcoming features that exist in the later versions (such as geo-replication, snapshots, active-active geo-replication and various other bugfixes, performance enhancements, etc).

Hardware:

Here are the specs for the current hardware I am using to test:

  • 2 x Intel Xeon E5410 @ 2.33GHz:CPU
  • 32 GB DDR2 DIMMS:RAM
  • 48 X 2TB Western Digital SATA II:HARD DRIVES
  • 2 x 3WARE 9650SE-24M8 PCIE:RAID CONTROLLER
  • Ubuntu 11.10
  • Glusterfs version 3.2.5
  • 1 Gbps interconnects (LAN)

ZFS installation:

I decided to use Ubuntu 11.10 for this round of testing, currently the daliy ppa has a lot of bugfixes and performance improvements that do not exist in the latest stable release ( 0.6.0-rc6) so the daily ppa is the version that should be used until either v0.6.0-rc7 or v0.6.0 final are released.

Here is what you will need to get zfs installed and running:

# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs

At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:

# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh

Now let’s check the status of the zpool:

# zpool status tank
pool: tank
state: ONLINE
scan: none requested
config:NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0errors: No known data errors

ZFS Benchmarks:

I ran a few tests to see what kind of performance I could expect out of zfs first, before I added gluster on top, that way I would have better idea about where the bottleneck (if any) existed.

linux 3.3-rc5 kernel untar:

single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s

dd using block size of 4096:

single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s

dd using block size of 1M:

single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s

Gluster + ZFS Benchmarks

Next I added gluster (version 3.2.5) to the mix to see how they performed together:

linux 3.3-rc5 kernel untar:

zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s

dd using block size of 4096:

zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s

dd using block size of 1M:

zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s

Conclusion

Well so far so good, I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand there is still a decent amount of work left to do around dedup and compression (neither of which I necessarily require for this particular setup).

The good news is that the zfsonlinux developers have not even really started looking into improving performance at this point, since their main focus thus far has been overall stability.

A good deal of development is also taking place in order to allow linux to boot using a zfs ‘/boot’ partition.  This is currently an option on several disto’s including Ubuntu and Gentoo, however the setup requires a fair amount of effort to get going, so it will be nice when this style setup is supported out of the box.

In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature currently planned for v3.4 to become fully implemented and available. Our current production setup (currently using two node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is currently hampered by the throughput of the T3 itself.

ZFS crash during high I/O

After successfully completing a ‘zfs replace’ I was not so pleased to get the following error message back from ‘zfs detach’:

cannot detach c5t17d0: no valid replicas

I decided that I would upgrade this OpenSolaris 2008.11 instance to OpenSolaris 2009.06 in order to see if the obvious bug that I was encountering was resolved in the newest version. Since upgrading in OpenSolaris supports automatic boot environment creation, there really is not much danger at all in updating because you can always boot back into the other environment at any time.

The upgrade was a success, and after I booted into 2009.06 I was able to simply detach the failed drive from the pool and thus remove it from the system.

I recompiled gluster and I ran 2009.06 for a couple of days, until I started noticing that the server was rebooting during times of high I/O. A peek inside ‘/var/adm/messages’ revealed the following errors:

Aug 15 22:33:04 cybertron unix: [ID 836849 kern.notice]
Aug 15 22:33:04 cybertron ^Mpanic[cpu0]/thread=ffffff091060c900:
Aug 15 22:33:04 cybertron genunix: [ID 783603 kern.notice] Deadlock: cycle in blocking chain
Aug 15 22:33:04 cybertron unix: [ID 100000 kern.notice]
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9651f0 genunix:turnstile_block+795 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965250 unix:mutex_vector_enter+261 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9652f0 zfs:zfs_zget+be ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965380 zfs:zfs_zaccess+7c ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965400 zfs:zfs_lookup+333 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9654a0 genunix:fop_lookup+ed ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965550 genunix:xattr_dir_realdir+8b ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9655a0 genunix:xattr_dir_realvp+5e ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d9655f0 genunix:fop_realvp+32 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965640 genunix:vn_compare+31 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965860 genunix:lookuppnvp+94c ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965900 genunix:lookuppnatcred+11b ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965990 genunix:lookuppnat+69 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965b30 genunix:vn_createat+13a ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965cf0 genunix:vn_openat+1fb ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965e50 genunix:copen+435 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965e80 genunix:openat64+25 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965ec0 genunix:fsat32+f5 ()
Aug 15 22:33:04 cybertron genunix: [ID 655072 kern.notice] ffffff003d965f10 unix:brand_sys_sysenter+1e0 ()
Aug 15 22:33:04 cybertron unix: [ID 100000 kern.notice]
Aug 15 22:33:04 cybertron genunix: [ID 672855 kern.notice] syncing file systems…
Aug 15 22:33:04 cybertron genunix: [ID 904073 kern.notice] done

My efforts to find any further detailes about this bug are ongoing, so at this point I have booted back into 2008.11 and I will be running that until a fix or a workaround is found.

SUNWattr_ro error:Permission denied on OpenSolaris using Gluster 3.0.5–PartII

Recently one of our 3ware 9650SE raid cards started spitting out errors indicating that the unit was repeatedly issuing a bunch of soft resets. The lines in the log look similar to this:

WARNING: tw1: tw_aen_task AEN 0x0039 Buffer ECC error corrected address=0xDF420
WARNING: tw1: tw_aen_task AEN 0x005f Cache synchronization failed; some data lost unit=22
WARNING: tw1: tw_aen_task AEN 0x0001 Controller reset occurred resets=13

I downloaded and installed the latest firmware for the card (version 4.10.00.021), which the release notes claimed had several fixes for cards experiencing soft resets.  Much to my disappointment the resets continued to occur despite the new revised firmware.

The card was under warranty, so I contacted 3ware support and had a new one sent overnight.  The new card seemed to resolve the issues associated with random soft resets, however the resets and the downtime had left this node little out of sync with the other Gluster server.

After doing a ‘zfs replace’ on two bad disks (at this point I am still unsure whether the bad drives where a symptom or the cause of the issues with the raid card, however what I do know is that the Cavier Geen Western Digital drives that are populating this card have a very high error rate, and we are currently in the process of replacing all 24 drives with hitachi ones), I set about trying to initiate a ‘self-heal’ on the known up to date node using the following command:

server2:/zpool/glusterfs# ls -laR *

After some time I decided to tail the log file to see if there were any errors that might indicate a problem with the self heal. Once again the Gluster error log begun to fill up with errors associated with setting extended attributes on SUNWattr_ro.

At that point I began to worry whether or not the AFR (Automatic File Replication) portion of the Replicate/AFR translator was actually working correctly or not.  I started running some tests to determine what exactly was going on.  I began by copying over a few files to test replication.  All the files showed up on both nodes, so far so good.

Next it was time to test AFR so I began deleting a few files off one node and then attempting to self heal those same deleted files.  After a couple of minutes, I re-listed the files and the deleted files had in fact been restored. Despite the successful copy, the errors continued to show up every single time the file/directory was accessed (via stat).  It seemed that even though AFR was able to copy all the files to the new node correctly, gluster for some reason continued to want to self heal the files over and over again.

After finding the function that sets the extended attributes on Solaris, the following patch was created:

— compat.c Tue Aug 23 13:24:33 2011
+++ compat_new.c Tue Aug 23 13:24:49 2011
@@ -193,7 +193,7 @@
{
int attrfd = -1;
int ret = 0;

+
attrfd = attropen (path, key, flags|O_CREAT|O_WRONLY, 0777);
if (attrfd >= 0) {
ftruncate (attrfd, 0);
@@ -200,13 +200,16 @@
ret = write (attrfd, value, size);
close (attrfd);
} else {
– if (errno != ENOENT)
– gf_log (“libglusterfs”, GF_LOG_ERROR,
+ if(!strcmp(key,”SUNWattr_ro”)&&!strcmp(key,”SUNWattr_rw”)) {
+
+ if (errno != ENOENT)
+ gf_log (“libglusterfs”, GF_LOG_ERROR,
“Couldn’t set extended attribute for %s (%d)”,
path, errno);
– return -1;
+ return -1;
+ }
+ return 0;
}

return 0;
}

The patch simply ignores the two Solaris specific extended attributes (SUNWattr_ro and SUNWattr_rw), and returns a ‘0’ to the posix layer instead of a ‘-1’ if either of these is encountered.

We’ve been running this code change on both Solaris nodes for several days and so far so good, the errors are gone and replicate and AFR both seem to be working very well.

Slow ZFS resilvering on OpenSolaris

Two weeks ago I started receiving automated messages from one of our 3ware 9650SE raid cards concerning an increase in the number of SMART errors on one of the 2TB hard drives attached to the card. Within a few days of the raid card starting to generate these messages, ZFS was nice enough to take the drive in question out of service, and replace it with one of the drives we had set aside as an ‘online spare’ for that specific pool.

So far so good.

Two terabytes of data is a decent amount, so I assumed that the resilvering might take some time, and i was able to confirm that after logging in and looking at the output from the ‘zpool status’ command. The output indicated that it was going to take several more hours before the resilvering process would be totally complete.

So far so good.

The next day I logged into to server to check on the progress, not only did I find that the job had not yet been completed, but I also discovered that now the ‘zpool status’ command had almost doubled the amount of time that it estimated would be required to fully resilver the drive.

It was at this point that I started to suspect that maybe our automated snapshotting policy (which runs hourly, daily, weekly and monthly via cron) may be hampering the resilvering progress. A quick google search indicated that at some point in the past, bug number ‘6343667‘ had in fact been associated with degraded scrub and resilvering performance during periods in which snapshots were being taken. It appears that some older versions of ZFS used to require a restart of the entire resilvering process after a snapshot was initiated.

According to bug number ‘6343667‘, this issue was resolved with the release of ZFS pool version 11. I double checked the version we are running on the server in question and discovered that we were running version 13.

At this point I am unsure if the problem that I experienced had anything to do with that specific bug number, however what I do know is that after commenting out the automated snapshot entries from the crontab on that server, the drive resilvering finished quickly and without error, and I have not had any problems since.

Just remember to re-enable to snapshots after the resilver and you should be all set.

DYI NAS: Part2

I was able to spend a little more time gathering some numbers around performance and power usage over the last couple of days so I figured I would share a little bit about what I was able to learn from the build I described in part1.

Power usage:

After purchasing a ‘Kill A Watt’ from Home Depot, I set out to try and determine how many watts were being drawn by my mini-itx NAS.

Luckily the ‘Kill A Watt’ is a device that is very easy to setup and use, so it did not take me long to discover the following: the server consumes around 50 watts during start-up. After the boot-up process has finished, the power consumption levels off to right around 34.5 watts.  According to the Kill A Watt, that amounts to less then $3.00 per month…not bad as far as I am concerned.

Various attempts to lower that usage even further resulted in very little drop in that number, I believe that I could have knocked off a couple of watts had I removed one of the DIMM’s and ran the server with 1 DIMM and 2GB of RAM instead of 2 DIMM’s and 4GB of ram. I decided that having the additional RAM was worth the few extra watts that could be saved.

Network performance:

Using a Netgear Gigabit switch and a couple of cat6 cables I was able to achieve transfer speeds of 29 MB/sec over samba between FreeNAS and my Mac laptop. Throughput numbers over my Linksys 10/100 WRT54G wired network ports were less stellar, I saw throughput of around 7.1 MB/sec.  The wireless numbers were not as good as I had hoped, however they are probably what I should have expected for a shared 802.11g link, between 1 and 2 MB/sec.

Being that this is an ‘archive’ of sorts I can certainly live with those numbers for now, my main focus on this particular project was stability, reliability and redundancy.

DYI NAS: Part1

A few weeks I decided to do some research into what it would take to build a NAS unit that would act as a storage server for all of my digital assets (audio, video, images, etc). I had purchased a 500GB external HD from Bestbuy about a year ago, however recently I was having problem reading the drive from my Linux laptop (since then I have not had any problems reading that same drive from my new MacBook Pro).

That was enough to scare me into investing a little bit more time and money into something that provided a higher level of  fault tolerance,  and might also lead to some more restful nights as well.

HARDWARE:

My initial requirements were not super hefty, I knew I wanted the following:

1)relatively small form factor case
2)a unit that consumed relativity little power
3)would allow scaling up to at least 4 drives
4)a 64 bit CPU and a motherboard that would handle at least 4GB of RAM
5)a setup that would allow me to use ZFS as the backend filesystem

After doing some initial research, I came across the case that I thought would be perfect for this build, the Chenbro ES34169. This case fit several of the requirements, it was small, I could use a mini-itx motherboard and it provided a backplane for at least four hot swappable 3.5 inch hard drives.

After finding a case that I was fairly sure I was going to use, I set out to find a motherboard that would allow for at least 4 SATA devices and work well with with OpenSolaris, EON, Nexenta or FreeNAS.

Initially I really liked the GA-D525TUD from GIGABYTE.  It  has 4 SATA ports, can handle up to 4 GB of RAM, has a built in Intel Atom D525, and was priced very reasonably priced at right about $100.  The one huge downside of this motherboard was the onboard Realtek NIC.  I came across several posts (here and here for example) that indicated that reliability issues existed with these types of NIC’s, and that I was better of using an Intel chipset instead.  Since this server was mainly going to be used as a file server, network performance and  reliability were imperative, so I was going to try and avoid a motherboard with a Realtek NIC if possible.

I also found the JNC98-525E-LF from JetWay, this motherboard had a lot of the same appeal of the GIGABYTE motherboard, however it also had an HDMI port, a DVI port and analog audio outputs as well.  I think this motherboard would be a good pick if I was building a media server instead of strictly a storage server . This motherboard also used the Realtek chipset for networking as well, so I decided to continue my search.

When everything was all said and done, I went with the MBD-X7SPA-H-O from Supermicro.  This board has 6 SATA ports, allowed 4GB of RAM, came with a 64-bit processor, an on board USB port (which would allow me to hide my boot device inside the case itself) and most importantly had two Intel based network cards.  This motherboard was a bit pricey at right around $200, however I guess you are paying extra for the extra SATA ports and the extra NIC.

I found 4 GB of cheap Kingston RAM, so the only thing left hardware wise was to decide what type of hard drives I would purchase and how many.  I decided that I would use two WD20EADS 2TB hard drives from Western Digital.

The only real issue that I ran into with this setup was the fact that there are very few mini-itx motherboards that support ECC RAM, which is a must have for enterprise level storage setups, however I was not willing to spend the extra money for an enterprise level mini-itx board which sells for about $1000.  My other option was to scrap my plans for a really small form factor mini-itx based setup and go with something bigger like an micro-atx or regular atx motherboard, where I am sure I would have an easier time finding ECC support.

I decided that I would give up the ECC RAM option in order to gain the benefits that come with a much smaller machine.

SOFTWARE:

I went back and fourth between using EON, OpenSolaris and FreeNAS.  I liked EON because it has a very small footprint, it could be installed on a USB flash drive, it was based on OpenSolaris and had a stable ZFS implementation.  The downside of using EON for this project is that it would require a bit more expertise to configure and administer.

I have a good amount of OpenSolaris experience, and they have obviously the most stable ZFS implementation that exists, but it think that it is overkill for this machine, and I am also not a big fan of what Oracle is doing right now in terms of the open source community, so I decided to look a little closer at what FreeNAS had to offer.

FreeNAS is a FreeBSD based NAS distro, that has a very nice web interface that allows you to configure almost all aspects of the server from any web browser.  Research indicated that they had a stable and reliable ZFS port in place, I would be able to install and boot the OS on my usb flash drive, and if I got hit by a bus, one of my family members would have a better chance of being able to figure out how to retrieve the data…so FreeNAS it was.

In Part 2 of my post I plan to provide some more details about overall power use and network performance.

Updated Native Linux ZFS benchmarks

Phornix.com just released some updated numbers from benchmarks they took using the recently released GA version of the native ZFS kernel module for Linux. They conducted a total of 10 tests using the ZFS kernel module, Ext4, Btrfs and XFS.

The tests were performed using Ubuntu 10.10 and kernel version 2.6.35 for the ZFS tests,  kernel version 2.6.37 was used when testing the other three filesystems.

It appears that these tests were all run using single disk setups, I think it would be really great if Phornix would also look into providing benchmarks on multi-disk setups such as ZFS mirrored disks vs hardware or software RAID1 on Linux. I would also like to see benchmarks comparing RAID5 on Linux vs RAIDZ on ZFS.  I think these kinds of tests might provide a more realistic comparison of real world enterprise level storage configurations.

Native Linux ZFS kernel module and stability.

UPDATE: If you are interested in ZFS on linux you have two options at this point:

I have been actively following the  zfsonlinux project because once stable and ready it should offer surperior performance due to the extra overhead that would be incurred by using fuse with the zfs-fuse project.

You can see another one of my posts concerning zfsonlinux here.

————————————————————————————————————————————————————-

There was a question posted in response to my previous blog post found here, about the stability of the native Linux ZFS kernel module release. I thought I would just make a post out of my response:

So far I have been able to perform some limited testing (given that the GA code was just released earlier this week), some time ago I had been given access to the beta builds,  so I had done some initial testing using those, I configured two mirrored vdevs consisting of two drives each. It seemed relatively stable as far as I was concerned, as I stated in my previous post…there is a known issue with the ‘zfs rollback’ command…which I tested using the GA release,  and I did in fact have problems with.

The work around at this point seems to be to perform a reboot after the rollback and then a ‘zfs scrub’ on the pool after the reboot. Personally I am hoping this gets fixed soon, because not everyone has the same level of flexibility, when it comes to rebooting their servers and storage nodes.

As far as I understand it, this module really consists of three pieces:

1)SPL -  a Linux kernel module which provides many of the Solaris kernel APIs. This layer makes it possible to run Solaris kernel code in the Linux kernel with relatively minimal modification.
2)ZFS – a Linux kernel module which provides a fully functional and stable SPA, DMU, and ZVOL layer.
3)LZFS – a Linux kernel module which provides the necessary POSIX layer.

Pieces #1 and #2 have been available for a while and are derived from code taken from the ZFS on Linux project found here. The folks at KQ Infotech are really building on that and providing piece #3, the missing POSIX layer.

Only time will tell how stable the code really is, my opinion at this point is that most software projects have some number of known bugs that exist (and even more have some unknown number of bugs as well), I know I am going to continue to test in a non production environment for the next few months.  At this point I have not experienced any instability (other then what was discussed above) or crashing, all the commands seem to work as advertised, there are a lot of features I have not been able to test yet, such as dedup, compression, etc, so there is lots more to look at in the upcoming weeks.

KQStor’s business model seems to be one where the source code is provided and support is charged for.  So far I have been able to have an open and productive dialog with their developers, and they have been very responsive to my inquiries, however it does not appear that they are going to be setting public tools such as mailing lists or forums, due to their current business model.  I am hoping that this will change in the near future, as I truly believe that everyone will be able to benefit from those kinds of public repositories, and there is no doubt in my mind that such tools will only lead to a more stable product in the long run.