Category Archives: Gluster

Ceph overview with videos

I recently started looking into Ceph as a possible replacement for our 2 node Gluster cluster. After reading numerous blog posts and watching several videos on the topic, I believe that the following three videos provide the best overview of, and necessary insight into, how Ceph works, what it offers, how to manage a cluster, and how it differs from Gluster.

The first video is a Ceph-only talk given by Florian Haas and Tim Serong at the 2013 linux.conf.au conference. It provides a complete overview of the Ceph software and its feature set, and goes into detail about the current status of each of Ceph's components (block storage, object storage, filesystem storage, etc).

The second video is a bit more lighthearted: it is a back-and-forth between Ceph's Sage Weil and Gluster's John Mark Walker, moderated by Florian Haas. It covers some of the basic use cases for each filesystem, offers some technical insight into the overall designs, and discusses some of the ways in which the filesystems are similar, some items on each of their roadmaps moving forward, as well as some of the ways the projects differ from one another.

The final video also takes place at the 2013 linux.conf.au conference and is also given by Sage Weil. It covers some of the same topics discussed in the prior two videos, however this one is geared more toward the operational aspects of managing a Ceph cluster.

GlusterFS sys admin overview

Here is another interesting video, this one from LinuxCon Japan 2012, in which Dustin Black provides an overview of Gluster from a system administrator's point of view.

The topics covered in this 1 hour presentation include:

  • An overview of GlusterFS and Red Hat Storage
  • A look at the logic behind the software
  • How to scale using GlusterFS
  • Redundancy and Fault Tolerance
  • Accessing your data
  • General system administration tasks
  • Real-world use cases for Gluster
  • Q and A

What’s new in GlusterFS 3.3?

Here is a link to a talk given by John Mark Walker at this year’s LinuxCon Japan, in which he discusses some of the internal details of the Gluster 3.3 release.

A few of the new features discussed during the presentation are:

  • UFO (universal file and object storage)
  • HDFS compatibility
  • Proactive self-heal
  • Granular locking
  • Quorum enforcement (for resolving split-brain scenarios)

zfsonlinux and gluster so far….

Recently I started to revisit the idea of using ZFS on Linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our Gluster storage infrastructure.  At this point we are using the OpenSolaris version of ZFS and an older (but stable) version of Gluster (3.0.5).

The problem with staying on OpenSolaris (besides the fact that it is no longer actively supported itself) is that we would be unable to upgrade Gluster, and thus unable to take advantage of some of the new and upcoming features in later versions (such as geo-replication, snapshots, active-active geo-replication, and various other bugfixes and performance enhancements).

Hardware:

Here are the specs for the current hardware I am using to test:

  • CPU: 2 x Intel Xeon E5410 @ 2.33GHz
  • RAM: 32 GB DDR2 DIMMs
  • Hard drives: 48 x 2TB Western Digital SATA II
  • RAID controllers: 2 x 3ware 9650SE-24M8 PCIe
  • OS: Ubuntu 11.10
  • GlusterFS version 3.2.5
  • 1 Gbps interconnects (LAN)

ZFS installation:

I decided to use Ubuntu 11.10 for this round of testing. Currently the daily ppa has a lot of bugfixes and performance improvements that do not exist in the latest stable release (0.6.0-rc6), so the daily ppa is the version that should be used until either v0.6.0-rc7 or v0.6.0 final is released.

Here is what you will need to get zfs installed and running:

# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs
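Once those packages are installed, a quick sanity check (shown in the same root-prompt style as above; exact dmesg wording may vary by build) confirms the module loads and the userland tools respond:

```shell
# modprobe zfs
# zfs list
# dmesg | grep -i zfs      <- should show the loaded module version
```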

At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:

# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh

Now let’s check the status of the zpool:

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors

ZFS Benchmarks:

I ran a few tests to see what kind of performance I could expect out of ZFS on its own, before adding Gluster on top; that way I would have a better idea of where the bottleneck (if any) existed.
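The exact invocations are not recorded here, so the following is only a rough sketch of the kind of runs behind the numbers below; the target path, file sizes, and counts are assumptions, not the values actually used.

```shell
# Illustrative benchmark sketch; TARGET would be a pool mount (e.g. /tank)
# on the real system, /tmp is used here only so the commands run anywhere.
TARGET=${TARGET:-/tmp/zfs-bench}
mkdir -p "$TARGET"

# kernel untar timing (tarball path is hypothetical):
# time tar xjf linux-3.3-rc5.tar.bz2 -C "$TARGET"

# sequential write with a 4096 byte block size (64 MiB total here)
dd if=/dev/zero of="$TARGET/ddfile" bs=4096 count=16384 conv=fdatasync 2>/dev/null

# sequential write with a 1M block size
dd if=/dev/zero of="$TARGET/ddfile" bs=1M count=64 conv=fdatasync 2>/dev/null
```

The `conv=fdatasync` flag forces the data to disk before dd reports its rate, which keeps the page cache from inflating the throughput numbers.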

linux 3.3-rc5 kernel untar:

single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s

dd using block size of 4096:

single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s

dd using block size of 1M:

single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s

Gluster + ZFS Benchmarks

Next I added gluster (version 3.2.5) to the mix to see how they performed together:

linux 3.3-rc5 kernel untar:

zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s

dd using block size of 4096:

zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s

dd using block size of 1M:

zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s

Conclusion

Well, so far so good: I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand there is still a decent amount of work left to do around dedup and compression (neither of which I require for this particular setup).

The good news is that the zfsonlinux developers have not yet really started on performance improvements, since their main focus thus far has been overall stability.

A good deal of development is also taking place to allow Linux to boot from a ZFS '/boot' partition.  This is currently an option on several distros, including Ubuntu and Gentoo, however the setup requires a fair amount of effort to get going, so it will be nice when this style of setup is supported out of the box.

In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature planned for v3.4 becoming fully implemented and available. Our current production setup (two node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is currently hampered by the throughput of the T3 itself.

The future of gluster.org

Now that Red Hat has purchased Gluster, and is in the process of releasing its storage software appliance, many people are wondering what all this means for the GlusterFS project and gluster.org as a whole.

John Mark Walker conducted a webinar last week entitled 'The Future of GlusterFS and Gluster.org'. At the beginning of the presentation John talks about the history and origins of the Gluster project; he then gives a basic overview of the features provided by GlusterFS, and finally he talks about what to expect from version 3.3 of GlusterFS and from the GlusterFS open source community going forward.

Here are some of the talking points that were discussed during the webinar:

  • Unstructured data is expected to grow 44X by 2020
  • Scale out storage will hold 63,000 PB by 2015
  • Red Hat is aggressively hiring developers with file system knowledge
  • Moving back to an open-source model from an open-core model
  • Open source version will be testing ground for new features
  • RHSSA will be more hardened and thoroughly tested
  • Beta 3 for 3.3 due in Feb/Mar 2012
  • GlusterFS 3.3 expected in Q2/Q3 of 2012

Here is the link to the entire presentation in a downloadable .mp4 format.

Here is a link to all the slides that were presented during the talk.

Red Hat to purchase Gluster

Red Hat released a statement today announcing its plans to acquire Gluster, the company behind the open source scalable filesystem GlusterFS.

Only time will tell exactly what this means for the project and its community, but given that Red Hat has a fairly good track record with the open source community, and given the statements made in their FAQ, I can only assume that we will continue to see GlusterFS grow and mature into a tool that extends reliably into the enterprise environment.

Gluster also provided several statements via their website today; you can read a statement from the founders here, as well as an additional Gluster press release here.

SUNWattr_ro error: Permission denied on OpenSolaris using Gluster 3.0.5 – Part II

Recently one of our 3ware 9650SE raid cards started spitting out errors indicating that the unit was repeatedly issuing a bunch of soft resets. The lines in the log looked similar to this:

WARNING: tw1: tw_aen_task AEN 0x0039 Buffer ECC error corrected address=0xDF420
WARNING: tw1: tw_aen_task AEN 0x005f Cache synchronization failed; some data lost unit=22
WARNING: tw1: tw_aen_task AEN 0x0001 Controller reset occurred resets=13

I downloaded and installed the latest firmware for the card (version 4.10.00.021), which the release notes claimed had several fixes for cards experiencing soft resets.  Much to my disappointment, the resets continued to occur despite the revised firmware.

The card was under warranty, so I contacted 3ware support and had a new one sent overnight.  The new card seemed to resolve the random soft resets, however the resets and the downtime had left this node a little out of sync with the other Gluster server.

After doing a 'zpool replace' on two bad disks (at this point I am still unsure whether the bad drives were a symptom or the cause of the issues with the raid card; what I do know is that the Caviar Green Western Digital drives populating this card have a very high error rate, and we are currently in the process of replacing all 24 drives with Hitachi ones), I set about trying to initiate a 'self-heal' from the known up to date node using the following command:

server2:/zpool/glusterfs# ls -laR *

After some time I decided to tail the log file to see if there were any errors that might indicate a problem with the self-heal. Once again the Gluster error log began to fill up with errors about setting the extended attribute SUNWattr_ro.

At that point I began to worry whether or not the AFR (Automatic File Replication) portion of the Replicate/AFR translator was actually working correctly.  I started running some tests to determine what exactly was going on, beginning by copying over a few files to test replication.  All the files showed up on both nodes; so far so good.

Next it was time to test AFR, so I deleted a few files off one node and then attempted to self-heal those same files.  After a couple of minutes I re-listed the files, and the deleted files had in fact been restored. Despite the successful heal, the errors continued to show up every single time the file or directory was accessed (via stat).  It seemed that even though AFR was able to copy all the files to the new node correctly, Gluster for some reason continued to want to self-heal the files over and over again.
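The test sequence looked roughly like the following, shown in the same prompt style as the rest of this post; the hostnames and paths are hypothetical stand-ins for the actual nodes:

```shell
# client1:/mnt/glusterfs# cp /tmp/test.file .         <- create via the client mount
# server1:/datapool/glusterfs# rm test.file           <- remove one replica directly
# client1:/mnt/glusterfs# ls -laR .                   <- the stat triggers self-heal in 3.0.x
# server1:/datapool/glusterfs# ls -l test.file        <- replica is restored
```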

After finding the function that sets the extended attributes on Solaris, the following patch was created:

--- compat.c	Tue Aug 23 13:24:33 2011
+++ compat_new.c	Tue Aug 23 13:24:49 2011
@@ -200,11 +200,15 @@
         ret = write (attrfd, value, size);
         close (attrfd);
     } else {
-        if (errno != ENOENT)
-            gf_log ("libglusterfs", GF_LOG_ERROR,
-                    "Couldn't set extended attribute for %s (%d)",
-                    path, errno);
-        return -1;
+        /* ignore the Solaris system attributes SUNWattr_ro/SUNWattr_rw */
+        if (strcmp (key, "SUNWattr_ro") && strcmp (key, "SUNWattr_rw")) {
+            if (errno != ENOENT)
+                gf_log ("libglusterfs", GF_LOG_ERROR,
+                        "Couldn't set extended attribute for %s (%d)",
+                        path, errno);
+            return -1;
+        }
+        return 0;
     }
     return 0;
 }

The patch simply ignores the two Solaris-specific extended attributes (SUNWattr_ro and SUNWattr_rw), returning 0 to the posix layer instead of -1 when either of them is encountered.

We've been running this code change on both Solaris nodes for several days and so far so good: the errors are gone, and Replicate and AFR both seem to be working very well.

GlusterFS and Extended Attributes

Jeff Darcy over at CloudFS.org has posted a very informative and in-depth writeup detailing GlusterFS and its use of extended filesystem attributes.  Documentation is one of the areas in which I believe many Gluster community members would like to see some improvement.

You do not need to spend much time on the Gluster-users mailing list to find frequent calls for more in-depth and up-to-date Gluster documentation.

There is no doubt that the situation has improved greatly over the last 6 to 12 months, however room for improvement still exists, and this writeup is an example of someone outside the inner circle at Gluster providing such a document.

For more information on CloudFS and how it relates to GlusterFS, you can check out these two postings: Why? and What?

SUNWattr_ro error: Permission denied on OpenSolaris using Gluster 3.0.5

Last week I noticed an apparently obscure error message in my glusterfsd logfile. I was getting errors similar to this:

[2011-01-15 18:59:45] E [compat.c:206:solaris_setxattr] libglusterfs: Couldn’t set extended attribute for /datapool/glusterfs/other_files (13)
[2011-01-15 18:59:45] E [posix.c:3056:handle_pair] posix1: /datapool/glusterfs/other_files: key:SUNWattr_ro error:Permission denied

on several directories, as well as on the files underneath those directories. These errors only occurred when Gluster attempted to stat the file or directory in question (ls -l vs ls).

After reviewing the entire logfile I was unable to see any real pattern to the error messages. The errors were not very widespread: I was only seeing them on maybe 75 or so files out of our total 3TB of data.

A Google search yielded very few results on the topic, with or without Gluster as a search term. What I was able to find out was this:

SUNWattr_ro and SUNWattr_rw are Solaris 'system extended attributes'. These attributes cannot be removed from a file or directory, however you can prevent users from setting them at all with xattr=off, either during the creation of the zpool or by changing the parameter after the fact.
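For reference, toggling the property looks like this (prompt style as elsewhere in this post; 'datapool' is a hypothetical pool name):

```shell
# zfs set xattr=off datapool
# zfs get xattr datapool      <- verify the current setting
```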

This was not a viable solution for me, due to the fact that several of Gluster's translators require extended attributes to be enabled on the underlying filesystem.

I was able to list the extended attributes using the following command:

user@solaris1# touch test.file
user@solaris1# runat test.file ls -l
total 2
-r--r--r-- 1 root root 84 Jan 15 11:58 SUNWattr_ro
-rw-r--r-- 1 root root 408 Jan 15 11:58 SUNWattr_rw

I also learned that some people were having problems with these attributes on Solaris 10 systems; the kernels used by those versions of Solaris do not include, nor understand how to translate, the 'system extended attributes' introduced in newer versions of Solaris. This has caused headaches for people trying to share files between Solaris 10 and Solaris 11 based servers.

In the end, the solution was not overly complex: I had to recursively copy the directories to a temporary location, delete the original folder, and rename the new one:

(cp -r folder folder.new;rm -rf folder;mv folder.new folder)

These commands must be done from a Gluster client mount point, so that Gluster can set or reset the necessary extended attributes.

Gluster on OpenSolaris so far…part 1.

We have been running Gluster in our production environment for about 1 month now, so I figured I would post some details about our setup and our experiences with Gluster and OpenSolaris so far.

Overview:

Currently we have a 2 node Gluster cluster using the replicate translator to provide RAID-1 style mirroring of the filesystem.  The initial requirements called for a solution that would house our digital media archive (audio, video, etc), scale up to around 150TB, support exports such as CIFS and NFS, and be extremely stable.

It was decided that we would use ZFS as our underlying filesystem, due to its data integrity features as well as its support for filesystem snapshots, both of which were also very high on the requirement list for this project.

Although FreeBSD has had ZFS support for quite some time, there were some known issues (with 32 vs 64 bit inode numbers) at the time of my research that prevented us from going that route.

Just this week KQstor released a version of their native ZFS kernel module for Linux that is supposed to fully support extended filesystem attributes, which are a requirement for Gluster to function properly.  At the time of my research the software was still beta and did not support extended attributes, so we were unable to consider and/or test this configuration either.

The choice was then made to go with ZFS on OpenSolaris (2008.11 specifically, due to the 3ware drivers available at the time).  There is currently no FUSE support under Solaris, so although a Solaris variant works without a problem on the server side, you will be required to use a head node running an OS that does support FUSE on the client side.

The latest version of Gluster fully supported on the Solaris platform is 3.0.5. The 3.1.x series introduced some nice new features, however we will have to either port our storage nodes to Linux, or wait until the folks at Gluster decide to release 3.1.x for Solaris (which I am not sure will happen anytime soon).

Here is the current hardware/software configuration:

  • CPU: 2 x Intel Xeon E5410 @ 2.33GHz
  • RAM: 32 GB DDR2 DIMMs
  • Hard drives: 48 x 2TB Western Digital SATA II
  • RAID controllers: 2 x 3ware 9650SE-24M8 PCIe
  • OS: OpenSolaris 2008.11
  • GlusterFS version 3.0.5
  • Samba version 3.2.5 (Gluster1)

ZFS Setup:

Setup for the two OS drives was pretty straightforward: we created a two disk mirrored rpool, which allows us to survive a disk failure in the root pool and still boot the system.

Since we have 48 disks to work with for our data pool, we created a total of 6 raidz2 vdevs, each consisting of 7 physical disks.  This setup gives us 75TB of space (53TB usable) per node, while leaving 6 disks available to use as spares.
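The layout described above would have been created with something along these lines (prompt style as elsewhere in this post; the cXtYdZ device names are hypothetical OpenSolaris-style identifiers, and only the first two of the six raidz2 vdevs are spelled out):

```shell
# zpool create datapool \
#   raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 \
#   raidz2 c1t7d0 c1t8d0 c1t9d0 c1t10d0 c1t11d0 c1t12d0 c1t13d0 \
#   ...   (four more 7 disk raidz2 vdevs)
```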

user@server1:/# zpool list
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
rpool     1.81T  19.6G  1.79T     1%  ONLINE  -
datapool  75.8T  9.01T  66.7T    11%  ONLINE  -

Gluster setup:

Creating the Gluster .vol configuration files is easily done via the glusterfs-volgen command:

user1@host1:/# glusterfs-volgen --name cluster01 --raid 1 server1.hostname.com:/data/path server2.hostname.com:/data/path

That command produces 2 volume files: 'glusterfsd.vol', used on the server side, and 'glusterfs.vol', used on the client.

Starting glusterfsd on the server side is straightforward:

user1@host1:/# /usr/glusterfs/sbin/glusterfsd

Starting gluster on the client side is straightforward as well:

user1@host2:/# /usr/glusterfs/sbin/glusterfs --volfile=/usr/glusterfs/etc/glusterfs/glusterfs.vol /mnt/glusterfs/

In a later blog post I plan to talk more about issues that we have encountered running this specific setup in a production environment.