GlusterFS sys admin overview

Here is another interesting video from LinuxCon Japan 2012. In it, Dustin Black provides an overview of Gluster from a system administrator’s point of view.

The topics covered in this one-hour presentation include:

An overview of GlusterFS and Red Hat Storage
A look at the logic behind the software
How to scale using GlusterFS
Redundancy and Fault Tolerance
Accessing your data
General system administration tasks
Real-world use cases for Gluster
Q and A

Using strace to debug issues with apache

Today I had to track down the cause of an issue on one of our servers: shortly after a restart, requests would start to hang and the number of Apache processes would grow rather large, rather quickly.

I started out using Apache’s mod_status to get some details about the state of each process.

I noticed that many of the processes ended up in the “W” (“Sending Reply”) state. I chose a random Apache process and fired up ‘strace’ to try to get some more information:

server7:/root# strace -p 11574
Process 11574 attached - interrupt to quit
flock(26, LOCK_EX <unfinished ...>

This process was stuck waiting for an exclusive lock on some file. I used ‘readlink’ to find out the name of the file in question:

server7:/root# readlink /proc/11574/fd/26
/mnt/Pages/xml/0/1/list1055.xml
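The two manual steps above (read the descriptor number off the blocked system call, then resolve it through /proc) can be wrapped in a pair of small helpers. This is only a sketch: the function names strace_fd and fd_target are made up for illustration, and it assumes a Linux /proc filesystem and a blocked call that looks like the flock() line shown above.

```shell
# Pull the descriptor number out of a blocked flock() line as printed by strace.
# Reads strace output on stdin and prints the number, one per matching line.
strace_fd() {
    sed -n 's/^flock(\([0-9][0-9]*\),.*/\1/p'
}

# Print the path behind a numeric file descriptor of a running process.
# $1 = PID, $2 = descriptor number
fd_target() {
    readlink "/proc/$1/fd/$2"
}
```

With the values from this session, ‘fd_target 11574 26’ is the same lookup as the readlink command above.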

Once I had the name of the file I used ‘lsof’ to see if there were any other processes trying to access that file as well:

server7:/root# lsof | grep list1055.xml
httpd 11574 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11579 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11629 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)

Here we have several other processes waiting for an exclusive lock on the file as well.
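For anyone who wants to see this state outside of Apache, here is a minimal sketch that reproduces it with the ‘flock’ utility from util-linux. It is not the code that was running on the server. The lock file is a throwaway temp file, descriptors 8 and 9 are arbitrary, and ‘-n’ is used so that the second attempt fails immediately instead of hanging the way the httpd processes did.

```shell
#!/bin/sh
# Reproduce a blocked flock() with two independent opens of one file.
lockfile=$(mktemp)

# First opener takes the exclusive lock and keeps fd 8 open to hold it.
exec 8>"$lockfile"
flock -x 8

# Second opener gets its own open file description, so its lock request
# conflicts with the first one. With -n it fails instead of waiting.
if ( exec 9>"$lockfile"; flock -n -x 9 ); then
    result="acquired"
else
    result="would block"
fi
echo "second opener: $result"

# Once the first lock is released, the same attempt goes through.
flock -u 8
if ( exec 9>"$lockfile"; flock -n -x 9 ); then
    after_release="acquired"
else
    after_release="would block"
fi
echo "after release: $after_release"

exec 8>&-
rm -f "$lockfile"
```

Without ‘-n’ the second opener would simply sit in flock() until the first one lets go, which is what strace showed above.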

At this point it appears as though a recent code change may be the cause of this issue. However, a closer look at the recent source code commits will be required to know for sure.

What’s new in GlusterFS 3.3?

Here is a link to a talk given by John Mark Walker at this year’s LinuxCon Japan, in which he discusses some of the internal details of the Gluster 3.3 release.

A few of the new features discussed during the presentation are:

UFO (Unified File and Object storage)
HDFS compatibility
Proactive self-heal
Granular locking
Quorum enforcement (for preventing split-brain scenarios)

zfsonlinux and gluster so far….

Recently I started to revisit the idea of using zfs and linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our gluster storage infrastructure. At this point we are using the OpenSolaris version of zfs and an older (but stable) version of gluster (3.0.5).

The problem with staying with OpenSolaris (besides the fact that it is no longer actively supported) is that we would be unable to upgrade gluster, and thus unable to take advantage of the new and upcoming features in later versions (such as geo-replication, snapshots and active-active geo-replication), along with various bugfixes and performance enhancements.

Hardware:

Here are the specs for the current hardware I am using to test:

CPU: 2 x Intel Xeon E5410 @ 2.33GHz
RAM: 32 GB DDR2 DIMMs
Hard drives: 48 x 2TB Western Digital SATA II
RAID controller: 2 x 3ware 9650SE-24M8 PCIe
OS: Ubuntu 11.10
GlusterFS: version 3.2.5
Network: 1 Gbps interconnects (LAN)

ZFS installation:

I decided to use Ubuntu 11.10 for this round of testing. Currently the daily ppa has a lot of bugfixes and performance improvements that do not exist in the latest stable release (0.6.0-rc6), so the daily ppa is the version that should be used until either v0.6.0-rc7 or v0.6.0 final is released.

Here is what you will need to get zfs installed and running:

# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs

At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:

# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh

Now let’s check the status of the zpool:

# zpool status tank
  pool: tank
 state: ONLINE
 scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
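As a rough sanity check on what a vdev like this should hold: raidz2 spends two disks' worth of space on parity, so the usable space is approximately (disks - 2) x disk size. The helper below is just that arithmetic. The name raidz_usable_tb is made up, and the result ignores metadata overhead, padding and the TB/TiB difference, so the real number reported by ‘zfs list’ will be somewhat lower.

```shell
# Approximate usable capacity of a single raidz vdev, in whole TB.
# $1 = number of disks, $2 = parity level (1, 2 or 3), $3 = disk size in TB
raidz_usable_tb() {
    disks=$1
    parity=$2
    size_tb=$3
    echo $(( (disks - parity) * size_tb ))
}

# The 6 disk raidz2 of 2TB drives created above:
raidz_usable_tb 6 2 2
```

For the pool above this comes out to 8, i.e. roughly 8 TB usable out of 12 TB raw.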

ZFS Benchmarks:

I ran a few tests to see what kind of performance I could expect out of zfs first, before adding gluster on top. That way I would have a better idea of where the bottleneck (if any) existed.

linux 3.3-rc5 kernel untar:

single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s

dd using block size of 4096:

single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s

dd using block size of 1M:

single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s
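The exact dd invocations are not recorded above, so the following is only a guess at a comparable sequential write test, not the command that produced these numbers. The helper name dd_write_test, the use of /dev/zero as input and ‘conv=fsync’ (a GNU dd option that flushes the file to disk before dd reports its throughput) are all assumptions.

```shell
# Sequential write test: write count blocks of size bs to target.
# $1 = target file, $2 = block size (e.g. 4096 or 1M), $3 = number of blocks
# dd prints the elapsed time and throughput on stderr when it finishes.
dd_write_test() {
    target=$1
    bs=$2
    count=$3
    dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fsync
}
```

Assuming the pool is mounted at its default location of /tank, ‘dd_write_test /tank/ddtest.bin 4096 262144’ and ‘dd_write_test /tank/ddtest.bin 1M 1024’ would each write 1 GiB, once in 4 KiB blocks and once in 1 MiB blocks. If compression is enabled on the dataset, a stream of zeros compresses away to almost nothing and the result is meaningless.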

Gluster + ZFS Benchmarks

Next I added gluster (version 3.2.5) to the mix to see how they performed together:

linux 3.3-rc5 kernel untar:

zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s

dd using block size of 4096:

zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s

dd using block size of 1M:

zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s

Conclusion

Well, so far so good. I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand there is still a decent amount of work left to do around dedup and compression (neither of which I necessarily require for this particular setup).

The good news is that the zfsonlinux developers have not even really started looking into improving performance at this point, since their main focus thus far has been overall stability.

A good deal of development is also taking place in order to allow linux to boot using a zfs ‘/boot’ partition. This is currently an option on several distros including Ubuntu and Gentoo. However, the setup requires a fair amount of effort to get going, so it will be nice when this style of setup is supported out of the box.

In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature, currently planned for v3.4, becoming fully implemented and available. Our production setup (currently using two-node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is hampered by the throughput of the T3 itself.

Introduction to Btrfs

I have been waiting for the video presentation of a talk given by Chris Mason at this year’s Scale 10x to finally be posted online. The original Scale 10x talks were streamed live, and the website claims that the videos will be posted online soon; however, at this point no date has been provided.

In the meantime, however, I found a link to another talk given by Chris, this time hosted at linuxfoundation.org. In order to view the full video you do need to provide your name and email address, but the process is painless and well worth the 30 seconds it takes to fill in the form.

It appears as though this was put together in December 2011, so it is relatively new and up to date. It provides a nice introduction to btrfs, a look at the upcoming feature set, and a list of work that still needs to be done in order to make btrfs production ready.

Here is a link to the first few minutes of the talk:

[youtube]http://www.youtube.com/watch?v=ZW2E4WgPlzc[/youtube]

XFS: Adventures in Filesystem Scalability

There was another file system talk to come out of the recent Linux.conf.au conference. This one was given by Dave Chinner and was entitled ‘XFS: Recent and Future Adventures in Filesystem Scalability’.

Here Dave discusses some of the historical roadblocks which prevented XFS from scaling as well as it could have, provides some in-depth details about how these issues were eventually overcome, and shows off some benchmarks comparing throughput and overall scaling using XFS, EXT4 and BTRFS.

Dave finishes up the talk with some discussion about what you can expect next from XFS and then takes some questions from the audience.

[youtube]http://www.youtube.com/watch?v=FegjLbCnoBw[/youtube]

The future of gluster.org

Now that Red Hat has purchased Gluster and is in the process of releasing its storage software appliance, many people are wondering what all this means for the GlusterFS project and gluster.org as a whole.

John Mark Walker conducted a webinar last week entitled ‘The Future of GlusterFS and Gluster.org’. At the beginning of the presentation John talks about the history and origins of the Gluster project. He then gives a basic overview of the features provided by GlusterFS, and finally he talks about what to expect from version 3.3 of GlusterFS and the GlusterFS open source community going forward.

Here are some of the talking points that were discussed during the webinar:

Unstructured data is expected to grow 44X by 2020
Scale out storage will hold 63,000 PB by 2015
Red Hat is aggressively hiring developers with file system knowledge
Moving back to an open-source model from an open-core model
Open source version will be testing ground for new features
RHSSA will be more hardened and thoroughly tested
Beta 3 for 3.3 due in Feb/Mar 2012
GlusterFS 3.3 expected in Q2/Q3 of 2012

Here is the link to the entire presentation in a downloadable .mp4 format.

Here is a link to all the slides that were presented during the talk.

A tour of btrfs by Avi Miller

Here is a YouTube video of a presentation from this year’s Linux.conf.au conference given by Avi Miller. The video talks about the current state of btrfs and some of the upcoming features, and Avi also provides a demonstration of one of the filesystem recovery tools in action.

Here are a few of the highlights:

Lots of performance and stability fixes
Lots of code cleanup
New compression options (LZO and snappy)
Auto file defrag
Kernel 3.3 will allow larger block sizes (4k,8k,16k) for better meta-data throughput
A ZFS like send/receive is in the works
New filesystem checker (btrfsck) should be released by Feb 14th
Raid 5/6 code (from Intel) will go into mainline kernel after the release of btrfsck
Options exist/will exist to do mixed raid modes for data and meta-data
Btrfs will be production filesystem in next version of Oracle Unbreakable Linux

No doubt about it, if you are interested in the current state of btrfs you should check out this talk.

[youtube]http://www.youtube.com/watch?v=hxWuaozpe2I[/youtube]

Mdadm cheat sheet

I have spent some time over the last few weeks getting familiar with mdadm and software RAID on Linux, so I thought I would write down some of the commands and example syntax that I have used while getting started.

1) If we would like to create a new RAID array from scratch we can use the following example commands:

RAID1 with 2 drives:

# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

RAID5 with 5 drives:

# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

RAID6 with 4 drives and 1 spare:

# mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

2) If we would like to add a disk to an existing array:

# mdadm --add /dev/md0 /dev/sdf1 (only added as a spare)
# mdadm --grow /dev/md0 -n [new number of active disks, not counting spares] (grow the size of the array)

3) If we would like to remove a disk from an existing array:

First we need to ‘fail’ the drive:

# mdadm --fail /dev/md0 /dev/sdc1

Next it can be safely removed from the array:

# mdadm --remove /dev/md0 /dev/sdc1

4) In order to make the array survive a reboot, you need to add the details to ‘/etc/mdadm/mdadm.conf’:

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf (Debian)
# mdadm --detail --scan >> /etc/mdadm.conf (Everyone else)
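One thing to watch with the ‘>>’ redirect is that running it a second time appends the same ARRAY lines again. A small filter can guard against that. This is only a sketch: append_missing_lines is a made-up helper, and it assumes that an exact, whole-line match is a good enough test for "already present".

```shell
# Append each line read from stdin to the given file, unless an identical
# line is already there. $1 = config file (created if it does not exist)
append_missing_lines() {
    conf=$1
    while IFS= read -r line; do
        # Skip empty lines.
        [ -n "$line" ] || continue
        # -x: whole line must match, -F: treat the line as a fixed string.
        grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done
}
```

It would be used in place of the redirect, for example ‘mdadm --detail --scan | append_missing_lines /etc/mdadm/mdadm.conf’.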

5) In order to delete and remove the entire array:

First we need to ‘stop’ the array:

# mdadm --stop /dev/md0

Next it can be removed:

# mdadm --remove /dev/md0
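One related step is not covered above: the member partitions keep their md superblocks after the array is stopped, so they can be picked up and assembled again later. mdadm has a ‘--zero-superblock’ option that erases that metadata. The sketch below only prints a command sequence and does not run anything. The helper name print_teardown_plan is made up, and the member list is taken from the RAID1 example above as a placeholder.

```shell
# Print (not run) the commands to stop an array and wipe the md metadata
# from its former members. $1 = array device, remaining args = members
print_teardown_plan() {
    array=$1
    shift
    echo "mdadm --stop $array"
    for member in "$@"; do
        echo "mdadm --zero-superblock $member"
    done
}

print_teardown_plan /dev/md0 /dev/sda1 /dev/sdb1
```

Zeroing the superblock is destructive for the array metadata, so double check the device names before running the printed commands.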

6) Examining the status of your RAID array:

There are two options here:

# cat /proc/mdstat
or
# mdadm --detail /dev/md0
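The part of ‘/proc/mdstat’ worth watching is the member map at the end of each status line, for example ‘[2/2] [UU]’ for a healthy two disk mirror, where an underscore in place of a ‘U’ marks a missing or failed member. Here is a minimal sketch that lists the arrays with at least one underscore. The function name degraded_arrays is made up, and the optional file argument exists only so the check can be tried against a saved copy.

```shell
# Print the name of every md array whose member map contains a "_".
# $1 = file to read (optional, defaults to /proc/mdstat)
degraded_arrays() {
    awk '
        /^md/                  { name = $1 }    # remember the current array
        /\[[U_]*_[U_]*\]/      { print name }   # map such as [U_] or [_U]
    ' "${1:-/proc/mdstat}"
}
```

Calling ‘degraded_arrays’ with no argument prints nothing when every array has all of its members.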

Access files underneath an already mounted partition in Linux

Here is a quick tip for anyone who needs to access files that exist underneath an already mounted filesystem mount point. For example, suppose that you have some files located in a directory called ‘/tmp/docs’.

At some point someone might accidentally mount an NFS or CIFS filesystem on top of that same directory. If you need to access the original files that existed before the new mount was put into place, you have two options.

The first is to unmount the NFS or CIFS filesystem, access your files, and then remount.

However, you may find yourself in a situation (as I did) where it is extremely inconvenient or impossible to have the downtime associated with the umount/remount process. In that case you have another option: you can use a ‘bind’ mount.

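All you need to do is something like the following (the target directory has to exist before you can mount onto it):

mkdir /tmp/new_location
mount --bind /tmp /tmp/new_location

A plain bind mount does not carry along the filesystems mounted beneath the source directory, so you should now be able to access the original files here:

‘/tmp/new_location/docs’

When you are done, ‘umount /tmp/new_location’ removes the bind mount again.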
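Before reaching for the bind mount it can help to confirm what is actually sitting on top of the directory. On Linux, ‘/proc/self/mountinfo’ lists every mount with its mount point in the fifth field and the filesystem type and source after the ‘-’ separator. The sketch below prints those two values for a given directory. The function name mounts_at is made up, and it assumes the mount options contain no spaces.

```shell
# Print "fstype source" for every filesystem mounted exactly at a directory.
# $1 = directory, $2 = mountinfo file (optional, defaults to /proc/self/mountinfo)
mounts_at() {
    dir=$1
    # Field 5 is the mount point; the last three fields are
    # filesystem type, mount source and superblock options.
    awk -v d="$dir" '$5 == d { print $(NF-2), $(NF-1) }' "${2:-/proc/self/mountinfo}"
}
```

In the scenario above, ‘mounts_at /tmp/docs’ would print one line naming the NFS or CIFS share, and nothing at all if the directory is not a mount point.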

ShainMiley.com

Techmology…what is it allabout?
