GlusterFS sys admin overview

Here is another interesting video from LinuxCon Japan 2012. In it, Dustin Black provides an overview of Gluster from a system administrator’s point of view.

The topics covered in this one-hour presentation include:

  • An overview of GlusterFS and Red Hat Storage
  • A look at the logic behind the software
  • How to scale using GlusterFS
  • Redundancy and Fault Tolerance
  • Accessing your data
  • General system administration tasks
  • Real-world use cases for Gluster
  • Q and A

Using strace to debug issues with Apache

Today I had to track down the cause of an issue on one of our servers: shortly after a restart, requests would start to hang, and the number of Apache processes would grow rather large, rather quickly.

I started out using Apache’s mod_status to get some details about the state of each process.

I noticed that many of the processes ended up in the “W” (“Sending Reply”) state. I chose a random Apache process and fired up ‘strace’ to try to get some more information:

server7:/root# strace -p 11574
Process 11574 attached -- interrupt to quit
flock(26, LOCK_EX <unfinished …>

This process was stuck waiting for an exclusive lock on some file.  I used ‘readlink’ to find out the name of the file in question:

server7:/root# readlink /proc/11574/fd/26
/mnt/Pages/xml/0/1/list1055.xml
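
The two steps above can be strung together with a couple of small helpers. This is only a sketch, assuming a Linux ‘/proc’ filesystem and strace output shaped like the line shown above; the function names ‘flock_fd’ and ‘fd_path’ are made up for illustration.

```sh
# Pull the file descriptor number out of a strace line such as
#   flock(26, LOCK_EX <unfinished ...>
# Reads strace output on stdin and prints one fd number per flock() line.
flock_fd() {
    sed -n 's/^flock(\([0-9][0-9]*\),.*/\1/p'
}

# Resolve a descriptor of a running process to the file it points at.
# $1 = PID, $2 = fd number (needs permission to read /proc/PID/fd).
fd_path() {
    readlink "/proc/$1/fd/$2"
}

# Example with the values from this post:
#   fd_path 11574 26
```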

Once I had the name of the file I used ‘lsof’ to see if there were any other processes trying to access that file as well:

server7:/root# lsof | grep list1055.xml
httpd 11574 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11579 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11629 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)

Here we have several other processes waiting for an exclusive lock on the file as well.
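
If you just want the list of PIDs, the lsof output can be filtered on the file name column. A minimal sketch, assuming the nine-column layout shown above (some lsof versions add a TID column, which would shift the fields); ‘pids_holding’ is a hypothetical helper name.

```sh
# Read lsof output on stdin and print the PID (column 2) of every line
# whose NAME column (column 9) is exactly the given path.
pids_holding() {
    awk -v path="$1" '$9 == path { print $2 }'
}

# Example:
#   lsof | pids_holding /mnt/Pages/xml/0/1/list1055.xml
```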

At this point it appears as though a recent code change may be the cause of this issue. However, a closer look at the recent source code commits will be required to know for sure.

What’s new in GlusterFS 3.3?

Here is a link to a talk given by John Mark Walker at this year’s LinuxCon Japan, in which he discusses some of the internal details of the Gluster 3.3 release.

A few of the new features discussed during the presentation are:

  • UFO (universal file and object storage)
  • HDFS compatibility
  • Proactive self-heal
  • Granular locking
  • Quorum enforcement (for resolving split-brain scenarios)

zfsonlinux and gluster so far….

Recently I started to revisit the idea of using zfs and linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our gluster storage infrastructure.  At this point we are using the OpenSolaris version of zfs and an older (but stable) version of gluster (3.0.5).

The problem with staying with OpenSolaris (besides the fact that it is no longer actively supported) is that we would be unable to upgrade gluster, and thus unable to take advantage of the new and upcoming features in later versions (such as geo-replication, snapshots, and active-active geo-replication, along with various bugfixes and performance enhancements).

Hardware:

Here are the specs for the current hardware I am using to test:

  • CPU: 2 x Intel Xeon E5410 @ 2.33GHz
  • RAM: 32 GB DDR2 DIMMs
  • Hard drives: 48 x 2TB Western Digital SATA II
  • RAID controller: 2 x 3WARE 9650SE-24M8 PCIE
  • Ubuntu 11.10
  • Glusterfs version 3.2.5
  • 1 Gbps interconnects (LAN)

ZFS installation:

I decided to use Ubuntu 11.10 for this round of testing. Currently the daily ppa has a lot of bugfixes and performance improvements that do not exist in the latest stable release (0.6.0-rc6), so the daily ppa is the version that should be used until either v0.6.0-rc7 or v0.6.0 final is released.

Here is what you will need to get zfs installed and running:

# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs

At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:

# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh
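
As a rough capacity check: raidz2 gives up two disks’ worth of space to parity, so a 6 disk vdev built from 2TB drives ends up with roughly four drives of usable space. The arithmetic can be sketched as follows (‘raidz_usable_tb’ is a made-up helper, and the result ignores filesystem overhead and the TB/TiB difference):

```sh
# Approximate usable capacity of a single raidz vdev.
# $1 = number of disks, $2 = parity level (1, 2 or 3), $3 = disk size in TB
raidz_usable_tb() {
    echo $(( ($1 - $2) * $3 ))
}

raidz_usable_tb 6 2 2   # prints 8
```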

Now let’s check the status of the zpool:

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
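
For a quick health check it can be handy to print only the rows that are not ONLINE. Here is a small sketch that reads the output of ‘zpool status’ on stdin; it assumes the usual layout with a NAME/STATE header row and a trailing ‘errors:’ line, and ‘zpool_not_online’ is a hypothetical name.

```sh
# Print "name state" for every pool/vdev/disk row whose STATE is not ONLINE.
zpool_not_online() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }  # header row opens the device table
        /^errors:/                    { in_table = 0 }        # the errors line closes it
        in_table && NF >= 2 && $2 != "ONLINE" { print $1, $2 }
    '
}

# Example:
#   zpool status tank | zpool_not_online
```

No output means every device in the table is ONLINE.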

ZFS Benchmarks:

I ran a few tests to see what kind of performance I could expect out of zfs first, before I added gluster on top. That way I would have a better idea of where the bottleneck (if any) existed.

linux 3.3-rc5 kernel untar:

single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s

dd using block size of 4096:

single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s

dd using block size of 1M:

single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s
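
The exact dd invocations are not recorded here, so the following is only a sketch of how a sequential write test like this could be run. It assumes writing zeros from ‘/dev/zero’ and flushing to disk before dd reports its rate; the helper name, file name and counts are all made up. Keep in mind that zeros compress extremely well, so this would be misleading on a dataset with compression turned on.

```sh
# Write COUNT blocks of size BS into DIR, flush them to disk, and print
# dd's summary line (bytes copied, elapsed time, throughput).
# $1 = target directory, $2 = block size (e.g. 4096 or 1M), $3 = block count
dd_write_test() {
    target="$1/dd_write_test.$$"
    # dd prints its summary on stderr; keep only the last line of it.
    dd if=/dev/zero of="$target" bs="$2" count="$3" conv=fsync 2>&1 | tail -n 1
    rm -f "$target"
}

# Example (hypothetical mount point and sizes):
#   dd_write_test /tank 4096 262144
#   dd_write_test /tank 1M 1024
```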

Gluster + ZFS Benchmarks

Next I added gluster (version 3.2.5) to the mix to see how they performed together:

linux 3.3-rc5 kernel untar:

zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s

dd using block size of 4096:

zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s

dd using block size of 1M:

zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s
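
To put the gap between the two modes into numbers, the figures above can be compared directly. A small sketch (the helper names are mine, and ‘to_seconds’ only understands the ‘NmS.SSSs’ format used above):

```sh
# Convert a `time`-style duration such as 4m10.093s into seconds.
to_seconds() {
    echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

# Print a / b rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Kernel untar: replication took about 3.5x as long as geo replication.
ratio "$(to_seconds 4m10.093s)" "$(to_seconds 1m12.054s)"   # prints 3.5

# dd with 1M blocks: geo replication was about 3.4x faster.
ratio 155 45.7                                              # prints 3.4
```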

Conclusion

So far so good: I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand there is still a decent amount of work left to do around dedup and compression (neither of which I necessarily require for this particular setup).

The good news is that the zfsonlinux developers have not even really started looking into improving performance at this point, since their main focus thus far has been overall stability.

A good deal of development is also taking place in order to allow linux to boot using a zfs ‘/boot’ partition.  This is currently an option on several distros including Ubuntu and Gentoo, however the setup requires a fair amount of effort to get going, so it will be nice when this style of setup is supported out of the box.

In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature currently planned for v3.4 to become fully implemented and available. Our current production setup (currently using two node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is currently hampered by the throughput of the T3 itself.

Introduction to Btrfs

I have been waiting for the video presentation of a talk given by Chris Mason at this year’s Scale 10x to finally be posted online. The original Scale 10x talks were streamed live, and the website claims that the videos will be posted online soon, however at this point no date has been provided.

In the meantime, however, I found a link to another talk given by Chris, this time hosted at linuxfoundation.org. In order to view the full video you do need to provide your name and email address, but the process is painless and well worth the 30 seconds it takes to fill in the form.

It appears as though this was put together in December 2011, so it is relatively new and up to date. It provides a nice introduction to btrfs, a look at the upcoming feature set, and a list of work that still needs to be done in order to make btrfs production ready.

Here is a link to the first few minutes of the talk:

XFS: Adventures in Filesystem Scalability

Another file system talk came out of the recent Linux.conf.au conference. This one was given by Dave Chinner and was entitled ‘XFS: Recent and Future Adventures in Filesystem Scalability’.

Here Dave discusses some of the historical roadblocks which prevented XFS from scaling as well as it could have, provides some in-depth details about how these issues were eventually overcome, and shows off some benchmarks comparing throughput and overall scaling using XFS, EXT4 and BTRFS.

Dave finishes up the talk with some discussion about what you can expect next from XFS and then takes some questions from the audience.

The future of gluster.org

Now that Red Hat has purchased Gluster and is in the process of releasing its storage software appliance, many people are wondering what all this means for the GlusterFS project and gluster.org as a whole.

John Mark Walker conducted a webinar last week entitled ‘The Future of GlusterFS and Gluster.org’. At the beginning of the presentation John talks about the history and origins of the Gluster project. He then gives a basic overview of the features provided by GlusterFS, and finally he talks about what to expect from version 3.3 of GlusterFS and the GlusterFS open source community going forward.

Here are some of the talking points that were discussed during the webinar:

  • Unstructured data is expected to grow 44X by 2020
  • Scale out storage will hold 63,000 PB by 2015
  • Red Hat is aggressively hiring developers with file system knowledge
  • Moving back to an open-source model from an open-core model
  • Open source version will be testing ground for new features
  • RHSSA will be more hardened and thoroughly tested
  • Beta 3 for 3.3 due in Feb/Mar 2012
  • GlusterFS 3.3 expected in Q2/Q3 of 2012

Here is the link to the entire presentation in a downloadable .mp4 format.

Here is a link to all the slides that were presented during the talk.

A tour of btrfs by Avi Miller

Here is a YouTube video of a presentation from this year’s Linux.conf.au conference given by Avi Miller.  The video talks about the current state of btrfs and some of the upcoming features, and Avi also provides a demonstration of one of the filesystem recovery tools in action.

Here are a few of the highlights:

  • Lots of performance and stability fixes
  • Lots of code cleanup
  • New compression options (LZO and snappy)
  • Auto file defrag
  • Kernel 3.3 will allow larger block sizes (4k,8k,16k) for better meta-data throughput
  • A ZFS like send/receive is in the works
  • New filesystem checker (btrfsck) should be released by Feb 14th
  • Raid 5/6 code (from Intel) will go into mainline kernel after the release of btrfsck
  • Options exist/will exist to do mixed raid modes for data and meta-data
  • Btrfs will be production filesystem in next version of Oracle Unbreakable Linux

No doubt about it, if you are interested in the current state of btrfs you should check out this talk.

Mdadm cheat sheet

I have spent some time over the last few weeks getting familiar with mdadm and software RAID on Linux, so I thought I would write down some of the commands and example syntax that I have used while getting started.

1) If we would like to create a new RAID array from scratch we can use the following example commands:

RAID1 with 2 drives:

# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

RAID5 with 5 drives:

# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

RAID6 with 4 drives and 1 spare:

# mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
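
When creating an array, mdadm expects ‘--raid-devices’ plus ‘--spare-devices’ to account for every device listed on the command line. To avoid counting by hand, the command can be assembled by a small helper that only prints it, so you can review it before running anything. This is a sketch; ‘mdadm_create_cmd’ is a made-up name.

```sh
# Print (do not run) an mdadm --create command with matching device counts.
# $1 = md device, $2 = RAID level, $3 = number of spares, rest = member devices
mdadm_create_cmd() {
    md="$1"; level="$2"; spares="$3"
    shift 3
    active=$(( $# - spares ))   # every remaining device that is not a spare
    echo "mdadm --create --verbose $md --level=$level --raid-devices=$active --spare-devices=$spares $*"
}

# RAID6 with 4 active drives and 1 spare:
mdadm_create_cmd /dev/md0 6 1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
```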

2) If we would like to add a disk to an existing array:

# mdadm --add /dev/md0 /dev/sdf1 (the disk is only added as a spare)
# mdadm --grow /dev/md0 -n [new number of active disks, not counting spares] (grows the array onto the new disk)

3) If we would like to remove a disk from an existing array:

First we need to ‘fail’ the drive:

# mdadm --fail /dev/md0 /dev/sdc1

Next it can be safely removed from the array:

# mdadm --remove /dev/md0 /dev/sdc1

4) In order to make the array survive a reboot, you need to add the details to the mdadm configuration file (‘/etc/mdadm/mdadm.conf’ on Debian, ‘/etc/mdadm.conf’ elsewhere):

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf (Debian)
# mdadm --detail --scan >> /etc/mdadm.conf (Everyone else)
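
One thing to watch out for: because ‘>>’ appends, running the command a second time writes the same ARRAY lines into the file again. A small filter like the one below only adds lines that are not already there. It is a sketch, and ‘append_missing_lines’ is a hypothetical helper name.

```sh
# Append each non-empty line from stdin to the given file,
# skipping lines that are already present verbatim.
append_missing_lines() {
    conf="$1"
    touch "$conf"
    while IFS= read -r line; do
        [ -n "$line" ] || continue
        grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
    done
}

# Example:
#   mdadm --detail --scan | append_missing_lines /etc/mdadm/mdadm.conf
```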

5) In order to delete and remove the entire array:

First we need to ‘stop’ the array:

# mdadm --stop /dev/md0

Next it can be removed:

# mdadm --remove /dev/md0

6) Examining the status of your RAID array:

There are two options here:

# cat /proc/mdstat
or
# mdadm --detail /dev/md0
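
In ‘/proc/mdstat’ each array has a status line ending in something like ‘[UU]’, where an underscore marks a missing or failed member (for example ‘[U_]’). That makes it easy to pick out degraded arrays with a small filter. This is a sketch that assumes the usual mdstat layout; ‘degraded_arrays’ is a made-up name.

```sh
# Read /proc/mdstat-style text on stdin and print the name of every array
# whose member map (e.g. [U_]) contains an underscore.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }   # remember which array the following lines belong to
        # status line, e.g. "976630336 blocks super 1.2 [2/1] [U_]"
        name != "" && $NF ~ /^\[[U_]*_[U_]*\]$/ { print name }
    '
}

# Example:
#   degraded_arrays < /proc/mdstat
```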

Access files underneath an already mounted partition in Linux

Here is a quick tip for anyone who needs to access files that exist underneath an already mounted filesystem mount point.  For example suppose that you have some files located in a directory called ‘/tmp/docs’.

At some point someone might accidentally create an NFS or CIFS mount on top of that same directory. If you need to access the original files that existed before the new mount was put into place, you have two options.

  1. Unmount the NFS or CIFS filesystem and access your files and then remount.
  2. However, you may find yourself in a situation (such as I did) where it is extremely inconvenient or impossible for you to have the downtime associated with the umount/remount process. In that case you have another option: you can use a ‘bind’ mount.

All you need to do is something like the following (the target directory must already exist):

mount --bind /tmp /tmp/new_location

Now you should be able to access the original files here:

‘/tmp/new_location/docs’
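
Before reaching for the bind mount it is worth confirming what is actually mounted on top of the directory. The mount table in ‘/proc/mounts’ lists the source, mount point and filesystem type in its first three columns, so a one-line filter is enough. A sketch, with ‘mounts_at’ as a hypothetical helper name:

```sh
# Read a mount table (same format as /proc/mounts) on stdin and print
# "source fstype" for every filesystem mounted exactly at the given path.
mounts_at() {
    awk -v dir="$1" '$2 == dir { print $1, $3 }'
}

# Example:
#   mounts_at /tmp/docs < /proc/mounts
```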