Category Archives: Linux

All things Linux.

ZFS Day videos

The folks over at ZFS Day 2012 have posted a nice, wide-ranging series of videos from last year's event.

Taken from the description of the event:

‘The first ever ZFS conference covered all aspects of using ZFS in production. You can find ZFS in the most demanding environments, from video servers to cloud platforms to databases to NFS servers to HPC. Come learn about what makes ZFS a great storage system for these and other applications.’

Of particular interest to me (because of our desire to run ZFS on Linux in production environments, via zfsonlinux.org) are the following two videos:

ZFS for Linux Implementation – Brian Behlendorf:

Panel: The State of ZFS on various OS versions:

You can find a complete list of videos, talks and slides here.

InnoDB Quick Reference Guide

We are currently in the process of upgrading our MySQL 5.1 instances to MySQL 5.6. Because of this, we are once again very focused on overall MySQL performance, and more specifically on InnoDB performance going forward.

Matt Reid recently published a book entitled ‘InnoDB Quick Reference Guide.’

I believe this book will come in very handy over the next few weeks and months, as we once again look to delve into MySQL behavior and InnoDB internals.

I have downloaded a copy of this e-book and I will be providing a more in-depth review shortly.

UPDATE:

Chapter 1: Getting Started with InnoDB
This chapter covers the features of the InnoDB storage engine, as well as its requirements, supported platforms, etc. The chapter does a good job of providing a clear overview of InnoDB and its overall features and use cases.

Chapter 2: Basic Configuration Parameters
This chapter covers various configuration variables and how they relate to and affect InnoDB. It provides a better understanding of some of the more basic InnoDB configuration options and how they affect InnoDB's behavior.
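
As a generic illustration (these values are not recommendations taken from the book, just commonly tuned options), the kind of settings discussed here live in my.cnf under the [mysqld] section:

[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 256M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1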

Chapter 3: Advanced Configuration Parameters
This chapter provides a much more in-depth look at some of the more advanced configuration variables used to control the behavior of InnoDB and how they relate to its overall performance. The chapter does a great job of covering the InnoDB-related parameters that really affect how InnoDB performs under real-world workloads.

Chapter 4: Load Testing InnoDB for Performance
This chapter focuses on the numerous open-source tools that can be used to test the performance of both the application (MySQL) and the OS (filesystem, etc.). All the major tools are covered here, and the chapter does a good job of explaining each tool and its use cases.
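
To give one concrete example (sysbench is a common choice for this kind of testing, though it is not necessarily the exact tool set covered in the book, and the options shown are illustrative), a basic InnoDB-backed OLTP run with the older sysbench 0.4 syntax looks something like:

# sysbench --test=oltp --mysql-user=root --mysql-db=test --oltp-table-size=1000000 prepare
# sysbench --test=oltp --mysql-user=root --mysql-db=test --num-threads=16 run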

Chapter 5: Maintenance and Monitoring
This chapter discusses some typical maintenance tasks associated with running InnoDB, along with some common methods for pulling runtime and performance information out of the storage engine.
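
As an illustration (these are standard MySQL commands rather than anything taken from the book), the usual starting points look something like:

mysql> SHOW ENGINE INNODB STATUS\G
mysql> SHOW GLOBAL STATUS LIKE 'Innodb%';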

Chapter 6: Troubleshooting InnoDB
This chapter provides some good insight into several of the more common issues you could face with an InnoDB deployment, from crash recovery to issues regarding backup and recovery.

Chapter 7: References and links
A small section that you can use to find further detail about InnoDB and MySQL.

Overall:
This book does a good job of covering the main features, parameters, and use cases for the MySQL InnoDB storage engine.

GlusterFS sys admin overview

Here is another interesting video from LinuxCon Japan 2012. Below, Dustin Black provides an overview of Gluster from a system administrator's point of view.

The topics covered in this one-hour presentation include:

  • An overview of GlusterFS and Red Hat Storage
  • A look at the logic behind the software
  • How to scale using GlusterFS
  • Redundancy and Fault Tolerance
  • Accessing your data
  • General system administration tasks
  • Real-world use cases for Gluster
  • Q and A

Using strace to debug issues with apache

Today I had to track down the cause of an issue we were having with a server: shortly after a restart, requests would start to hang, and the number of Apache processes would grow rather large, rather quickly.

I started out using Apache’s mod_status to get some details about the state of each process.
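
Assuming mod_status is enabled and the status handler is mapped to the usual /server-status location, the scoreboard can be pulled with something like:

server7:/root# curl 'http://localhost/server-status?auto'

(The '?auto' variant returns a plain-text, machine-readable summary; dropping it gives the full HTML status page.)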

I noticed that many of the processes ended up in a "W" or "Sending Reply" state. I chose a random Apache process and fired up 'strace' to try to get some more information:

server7:/root# strace -p 11574
Process 11574 attached - interrupt to quit
flock(26, LOCK_EX <unfinished ...>

This process was stuck waiting for an exclusive lock on some file.  I used ‘readlink’ to find out the name of the file in question:

server7:/root# readlink /proc/11574/fd/26
/mnt/Pages/xml/0/1/list1055.xml

Once I had the name of the file I used ‘lsof’ to see if there were any other processes trying to access that file as well:

server7:/root# lsof | grep list1055.xml
httpd 11574 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11579 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)
httpd 11629 nobody 26w REG 0,31 4232 925874559 /mnt/Pages/xml/0/1/list1055.xml (storage1.npr.org:/files/data)

Here we have several other processes waiting for an exclusive lock on the file as well.

At this point it appears as though a recent code change may be the cause of this issue; however, a closer look at the recent source code commits will be required to know for sure.
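
For what it's worth, the blocking behavior itself is easy to reproduce outside of Apache using the flock(1) utility (the file name here is just a placeholder):

server7:/root# flock -x /tmp/test.lock -c 'sleep 600' &
server7:/root# flock -x /tmp/test.lock -c 'echo got the lock'

The second command will sit and wait until the first releases its exclusive lock, which is exactly what the hung Apache processes were doing.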

What’s new in GlusterFS 3.3?

Here is a link to a talk given by John Mark Walker at this year’s LinuxCon Japan, in which he discusses some of the internal details of the Gluster 3.3 release.

A few of the new features discussed during the presentation are:

  • UFO (universal file and object storage)
  • HDFS compatibility
  • Proactive self-heal
  • Granular locking
  • Quorum enforcement (for resolving split-brain scenarios)

zfsonlinux and gluster so far….

Recently I started to revisit the idea of using ZFS on Linux (zfsonlinux) as the basis for a server that will eventually be the foundation of our gluster storage infrastructure. At this point we are using the OpenSolaris version of ZFS and an older (but stable) version of gluster (3.0.5).

The problem with staying on OpenSolaris (besides the fact that it is no longer actively supported itself) is that we would be unable to upgrade gluster, and thus unable to take advantage of some of the new and upcoming features in later versions (such as geo-replication, snapshots, active-active geo-replication, and various other bugfixes and performance enhancements).

Hardware:

Here are the specs for the current hardware I am using to test:

  • CPU: 2 x Intel Xeon E5410 @ 2.33GHz
  • RAM: 32 GB DDR2 DIMMs
  • Hard drives: 48 x 2TB Western Digital SATA II
  • RAID controllers: 2 x 3ware 9650SE-24M8 PCIe
  • Ubuntu 11.10
  • GlusterFS version 3.2.5
  • 1 Gbps interconnects (LAN)

ZFS installation:

I decided to use Ubuntu 11.10 for this round of testing. Currently the daily PPA has a lot of bugfixes and performance improvements that do not exist in the latest stable release (0.6.0-rc6), so the daily PPA is the version that should be used until either v0.6.0-rc7 or v0.6.0 final is released.

Here is what you will need to get zfs installed and running:

# apt-add-repository ppa:zfs-native/daily
# apt-get update
# apt-get install debootstrap ubuntu-zfs

At this point we can create our first zpool. Here is the syntax used to create a 6 disk raidz2 vdev:

# zpool create -f tank raidz2 sdc sdd sde sdf sdg sdh

Now let’s check the status of the zpool:

# zpool status tank
  pool: tank
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0

errors: No known data errors
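
Gluster only needs a directory to use as its brick, so a dedicated dataset can be carved out of the pool for that purpose (the dataset name here is just an example, not necessarily what was used in the tests below):

# zfs create tank/brick1
# zfs list tank/brick1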

ZFS Benchmarks:

I ran a few tests to see what kind of performance I could expect out of ZFS first, before I added gluster on top, so that I would have a better idea about where the bottleneck (if any) existed.
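
The untar and dd tests referenced below look roughly like this (paths, file names, and counts are illustrative rather than the exact invocations used):

# time tar xjf linux-3.3-rc5.tar.bz2 -C /tank
# dd if=/dev/zero of=/tank/ddtest bs=4096 count=1000000
# dd if=/dev/zero of=/tank/ddtest bs=1M count=4096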

linux 3.3-rc5 kernel untar:

single ext4 disk: 3.277s
zfs 2 disk mirror: 19.338s
zfs 6 disk raidz2: 8.256s

dd using block size of 4096:

single ext4 disk: 204 MB/s
zfs 2 disk mirror: 7.5 MB/s
zfs 6 disk raidz2: 174 MB/s

dd using block size of 1M:

single ext4 disk: 153.0 MB/s
zfs 2 disk mirror: 99.7 MB/s
zfs 6 disk raidz2: 381.2 MB/s

Gluster + ZFS Benchmarks

Next I added gluster (version 3.2.5) to the mix to see how they performed together:
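
Setting up a two-node replicated volume with the gluster CLI looks something like this (host names and brick paths are placeholders, not the actual test setup):

# gluster peer probe server2
# gluster volume create testvol replica 2 transport tcp server1:/tank/brick1 server2:/tank/brick1
# gluster volume start testvol
# mount -t glusterfs server1:/testvol /mnt/gluster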

linux 3.3-rc5 kernel untar:

zfs 6 disk raidz2 + gluster (replication): 4m10.093s
zfs 6 disk raidz2 + gluster (geo replication): 1m12.054s

dd using block size of 4096:

zfs 6 disk raidz2 + gluster (replication): 53.6 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 53.7 MB/s

dd using block size of 1M:

zfs 6 disk raidz2 + gluster (replication): 45.7 MB/s
zfs 6 disk raidz2 + gluster (geo replication): 155 MB/s

Conclusion

Well, so far so good: I have been running the zfsonlinux port for two weeks now without any real issues. From what I understand, there is still a decent amount of work left to do around dedup and compression (neither of which I necessarily require for this particular setup).

The good news is that the zfsonlinux developers have not even really started looking into improving performance at this point, since their main focus thus far has been overall stability.

A good deal of development is also taking place to allow Linux to boot from a ZFS '/boot' partition. This is currently an option on several distros, including Ubuntu and Gentoo; however, the setup requires a fair amount of effort to get going, so it will be nice when this style of setup is supported out of the box.

In terms of Gluster specifically, it performs quite well using geo-replication with larger file sizes. I am really looking forward to the active-active geo-replication feature planned for v3.4 becoming fully implemented and available. Our current production setup (two-node replication) has a T3 (WAN) interconnect, so having the option to use geo-replication in the future should really speed up our write throughput, which is currently hampered by the throughput of the T3 itself.

Introduction to Btrfs

I have been waiting for the video presentation of a talk given by Chris Mason at this year's Scale 10x to finally be posted online. The original Scale 10x talks were streamed live, and the website claims that the videos will be posted online soon; however, at this point no date has been provided.

In the meantime, however, I found a link to another talk given by Chris, this time hosted at linuxfoundation.org. In order to view the full video you do need to provide your name and email address, but the process is painless and well worth the 30 seconds it takes to fill in the form.

It appears as though this was put together in December 2011, so it is relatively new and up to date. It provides a nice introduction to btrfs, a look at the upcoming feature set, and a list of work that still needs to be done in order to make btrfs production ready.

Here is a link to the first few minutes of the talk:

[youtube]http://www.youtube.com/watch?v=ZW2E4WgPlzc[/youtube]

XFS: Adventures in Filesystem Scalability

There was another filesystem talk to come out of the recent Linux.conf.au conference; this one was given by Dave Chinner and was entitled 'XFS: Recent and Future Adventures in Filesystem Scalability'.

Here Dave discusses some of the historical roadblocks which prevented XFS from scaling as well as it could have, provides some in-depth details about how these issues were eventually overcome, and shows off some benchmarks comparing throughput and overall scaling for XFS, ext4, and btrfs.

Dave finishes up the talk with some discussion about what you can expect next from XFS and then takes some questions from the audience.

[youtube]http://www.youtube.com/watch?v=FegjLbCnoBw[/youtube]

A tour of btrfs by Avi Miller

Here is a YouTube video of a presentation from this year's Linux.conf.au conference, given by Avi Miller. The video covers the current state of btrfs and some of the upcoming features, and Avi also provides a demonstration of one of the filesystem recovery tools in action.

Here are a few of the highlights:

  • Lots of performance and stability fixes
  • Lots of code cleanup
  • New compression options (LZO and snappy)
  • Auto file defrag
  • Kernel 3.3 will allow larger block sizes (4k, 8k, 16k) for better meta-data throughput
  • A ZFS like send/receive is in the works
  • New filesystem checker (btrfsck) should be released by Feb 14th
  • Raid 5/6 code (from Intel) will go into mainline kernel after the release of btrfsck
  • Options exist/will exist to do mixed raid modes for data and meta-data (see the example after this list)
  • Btrfs will be a production filesystem in the next version of Oracle Unbreakable Linux
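
As an illustration of the mixed raid modes mentioned above (this is a generic mkfs.btrfs example, not something taken from the talk, and the device names are placeholders), the data and metadata profiles can be chosen independently:

# mkfs.btrfs -d raid0 -m raid1 /dev/sdb /dev/sdc
# mount /dev/sdb /mnt
# btrfs filesystem df /mnt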

No doubt about it, if you are interested in the current state of btrfs you should check out this talk.

[youtube]http://www.youtube.com/watch?v=hxWuaozpe2I[/youtube]

Mdadm cheat sheet

I have spent some time over the last few weeks getting familiar with mdadm and software RAID on Linux, so I thought I would write down some of the commands and example syntax that I have used while getting started.

1) If we would like to create a new RAID array from scratch, we can use the following example commands:

RAID 1 with 2 drives:

# mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1

RAID 5 with 5 drives:

# mdadm --create --verbose /dev/md0 --level=5 --raid-devices=5 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

RAID 6 with 4 drives and 1 spare:

# mdadm --create --verbose /dev/md0 --level=6 --raid-devices=4 --spare-devices=1 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

2) If we would like to add a disk to an existing array:

# mdadm --add /dev/md0 /dev/sdf1 (only added as a spare)
# mdadm --grow /dev/md0 -n [new number of active disks - spares] (grow the size of the array)
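
Note that growing the md device does not automatically grow the filesystem sitting on top of it. Assuming an ext3/ext4 filesystem directly on /dev/md0, something along these lines is needed once the reshape has finished (a sketch, not a universal recipe):

# mdadm --wait /dev/md0
# resize2fs /dev/md0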

3) If we would like to remove a disk from an existing array:

First we need to ‘fail’ the drive:

# mdadm --fail /dev/md0 /dev/sdc1

Next it can be safely removed from the array:

# mdadm --remove /dev/md0 /dev/sdc1

4) In order to make the array survive a reboot, you need to add the details to '/etc/mdadm/mdadm.conf':

# mdadm --detail --scan >> /etc/mdadm/mdadm.conf (Debian)
# mdadm --detail --scan >> /etc/mdadm.conf (Everyone else)

5) In order to delete and remove the entire array:

First we need to ‘stop’ the array:

# mdadm --stop /dev/md0

Next it can be removed:

# mdadm --remove /dev/md0

6) Examining the status of your RAID array:

There are two options here:

# cat /proc/mdstat
or
# mdadm --detail /dev/md0
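
While an array is building or resyncing, it can also be handy to watch the progress update in place (this assumes the standard 'watch' utility is installed):

# watch -n 5 cat /proc/mdstat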