Benchmarking Ceph

This is a post that I have had in draft mode for quite some time. At this point some of this information is out of date, so I am planning on writing a ‘part II’ post shortly, which will include some updated information.

Benchmarking Ceph:

Ever since we got our Ceph cluster up and running, I’ve been running various benchmarking applications against different cluster configurations. Just to review, the cluster that we recently built has the following specs:

Cluster specs:

  • 3 x Dell R-420, 32 GB of RAM, for MON/RADOSGW/MDS nodes
  • 6 x Dell R-720xd, 64 GB of RAM, for OSD nodes
  • 72 x 4TB SAS drives as OSDs
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space
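
The last two figures in the list can be sanity-checked with a little arithmetic. The sketch below assumes (the post does not say) that the 4TB drives are decimal terabytes, that the 261TB figure is the raw capacity expressed in binary units (TiB), and that the placement group count follows the commonly cited rule of thumb of roughly 100 PGs per OSD divided by a replication level of 3:

```python
# Sanity-check sketch for the capacity and placement group figures above.
# Assumptions (not stated in the post): decimal 4 TB drives, capacity
# reported in TiB, and the (OSDs * 100) / replicas rule of thumb for PGs.

OSD_COUNT = 72
DRIVE_BYTES = 4 * 10**12      # one 4 TB drive, decimal bytes
PGS_PER_OSD_TARGET = 100      # rule-of-thumb target (assumption)
REPLICAS = 3                  # assumed replication level


def raw_capacity_tib(osd_count, drive_bytes):
    """Raw capacity of all OSD drives, in TiB (2**40 bytes)."""
    return osd_count * drive_bytes / 2**40


def suggested_pg_count(osd_count, replicas, per_osd=PGS_PER_OSD_TARGET):
    """Rule-of-thumb placement group count for the whole cluster."""
    return osd_count * per_osd // replicas


capacity = raw_capacity_tib(OSD_COUNT, DRIVE_BYTES)
pgs = suggested_pg_count(OSD_COUNT, REPLICAS)
print(f"raw capacity: {capacity:.1f} TiB")
print(f"suggested placement groups: {pgs}")
```

Under these assumptions both numbers line up with the list. Note that raw capacity is before replication, so the space actually available for data shrinks as the replication level goes up.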

The main role for this cluster will be archiving audio and video assets. This being the case, we decided to maximize total cluster capacity (4TB drives, no SSDs, etc.), while still being able to achieve and maintain reasonable cluster throughput (10 GigE, 12 drives per OSD node, etc.).

Most of my benchmarking focused on rbd and radosgw, because one of these is most likely to be what we introduce into production when we are ready. We are very much awaiting a stable and supported cephfs release (which will hopefully be available sometime in mid-to-late 2014), which will allow us to switch out our rbd + samba setup for one based on cephfs.

Rados Benchmarks: 

I set up a pool called ‘test’ with 1600 PGs in order to run some benchmarks using the ‘rados bench’ tool that comes with Ceph. I started with a replication level of ‘1’ and worked my way up to a replication level of ‘3’.

root@hqceph1:/# rados -p test bench 20 write (rep size=1)
Total time run: 20.241290
Total writes made: 5646
Write size: 4194304
Bandwidth (MB/sec): 1115.739
Stddev Bandwidth: 246.027
Max bandwidth (MB/sec): 1136
Min bandwidth (MB/sec): 0
Average Latency: 0.0571572
Stddev Latency: 0.0262513
Max latency: 0.336378
Min latency: 0.02248
root@hqceph1:/# rados -p test bench 20 write (rep size=2)
Total time run: 20.547026
Total writes made: 2910
Write size: 4194304
Bandwidth (MB/sec): 566.505
Stddev Bandwidth: 154.643
Max bandwidth (MB/sec): 764
Min bandwidth (MB/sec): 0
Average Latency: 0.112384
Stddev Latency: 0.198579
Max latency: 2.5105
Min latency: 0.025391
root@hqceph1:/# rados -p test bench 20 write (rep size=3)
Total time run: 20.755272
Total writes made: 2481
Write size: 4194304
Bandwidth (MB/sec): 478.144
Stddev Bandwidth: 147.064
Max bandwidth (MB/sec): 728
Min bandwidth (MB/sec): 0
Average Latency: 0.133827
Stddev Latency: 0.229962
Max latency: 3.32957
Min latency: 0.029481
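
The bandwidth figure in each summary can be recomputed from the other fields (writes made x write size / run time), which is a handy check when comparing runs. The following is a minimal sketch that parses the first few lines of the rep size=1 summary above; the helper names are made up for illustration and are not part of Ceph:

```python
# Recompute the 'rados bench' bandwidth from the summary fields shown above.
# parse_summary() and bandwidth_mb_s() are illustrative helpers, not Ceph APIs.

SAMPLE = """\
Total time run: 20.241290
Total writes made: 5646
Write size: 4194304
Bandwidth (MB/sec): 1115.739
"""


def parse_summary(text):
    """Turn 'Key: value' lines into a dict of floats, skipping other lines."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        try:
            fields[key.strip()] = float(value)
        except ValueError:
            continue
    return fields


def bandwidth_mb_s(writes, write_size_bytes, seconds):
    """Bandwidth in MB/sec, where 1 MB = 2**20 bytes as rados bench reports it."""
    return writes * write_size_bytes / 2**20 / seconds


fields = parse_summary(SAMPLE)
recomputed = bandwidth_mb_s(
    fields["Total writes made"], fields["Write size"], fields["Total time run"]
)
print(f"reported:   {fields['Bandwidth (MB/sec)']:.3f} MB/sec")
print(f"recomputed: {recomputed:.3f} MB/sec")
```

Feeding in the numbers from the other two runs gives the same agreement, and dividing the results shows that write bandwidth at replication level 2 was roughly half of the level 1 figure.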

RBD Benchmarks:

Next I set up a 10GB block device using rbd and ran the following tests against it:

root@ceph1:/blockdev# dd bs=1M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.440333 s, 610 MB/s
root@ceph1:/blockdev# dd bs=4M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 1.07413 s, 1000 MB/s
root@ceph1:/mnt/blockdev# hdparm -Tt /dev/rbd1 (rep size=1)
Timing cached reads: 16296 MB in 2.00 seconds = 8155.69 MB/sec
Timing buffered disk reads: 246 MB in 3.10 seconds = 79.48 MB/sec
root@ceph1:/mnt/blockdev# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 1.29985 s, 207 MB/s
root@ceph1:/mnt/blockdev# dd bs=4M count=256 if=/dev/zero of=test2 conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 4.02375 s, 267 MB/s
root@cephmount1:/mnt/ceph-block-device/test# hdparm -Tt /dev/rbd1 (rep size=2)
Timing cached reads: 16434 MB in 2.00 seconds = 8225.55 MB/sec
Timing buffered disk reads: 152 MB in 3.01 seconds = 50.55 MB/sec
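
To compare the dd runs side by side, the throughput can be pulled out of each summary line and the replication levels set against each other. This is only a sketch: the regular expression assumes the GNU dd summary format shown above, and the labels are just descriptions of the four runs:

```python
import re

# dd summary lines copied from the runs above, keyed by (block size, rep size).
DD_LINES = {
    ("1M", 1): "268435456 bytes (268 MB) copied, 0.440333 s, 610 MB/s",
    ("4M", 1): "1073741824 bytes (1.1 GB) copied, 1.07413 s, 1000 MB/s",
    ("1M", 2): "268435456 bytes (268 MB) copied, 1.29985 s, 207 MB/s",
    ("4M", 2): "1073741824 bytes (1.1 GB) copied, 4.02375 s, 267 MB/s",
}

# Assumes the "<bytes> bytes (...) copied, <seconds> s, ..." layout seen above.
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s")


def dd_throughput_mb_s(line):
    """Throughput in decimal MB/s (10**6 bytes), the unit dd prints."""
    match = DD_SUMMARY.search(line)
    if match is None:
        raise ValueError(f"not a dd summary line: {line!r}")
    return int(match.group(1)) / float(match.group(2)) / 10**6


for block_size in ("1M", "4M"):
    rep1 = dd_throughput_mb_s(DD_LINES[(block_size, 1)])
    rep2 = dd_throughput_mb_s(DD_LINES[(block_size, 2)])
    print(f"bs={block_size}: rep1={rep1:.0f} MB/s, rep2={rep2:.0f} MB/s, "
          f"ratio={rep2 / rep1:.2f}")
```

For these runs, going from replication level 1 to 2 cut dd write throughput to roughly a third (1M blocks) or a quarter (4M blocks) of the single-copy figure.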

Radosgw Benchmarks:

Using s3cmd (s3tools) I was able to achieve about 70MB/s when pushing files to Ceph via the S3 RESTful API.


Ceph braindump part1

After spending about 4 months testing, benchmarking, setting up and breaking down various Ceph clusters, I thought I would spend some time documenting a few of the things I have learned while setting up cephfs, rbd and radosgw along the way.

First let me talk a little bit about the details of the cluster that we will be putting into production over the next several weeks.

Cluster specs:

  • 6 x Dell R-720xd, 64 GB of RAM, for OSD nodes
  • 72 x 4TB SAS drives as OSDs
  • 3 x Dell R-420, 32 GB of RAM, for MON/RADOSGW/MDS nodes
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space

The process I used to set up and tear down our cluster during testing was quite simple. After installing ‘ceph-deploy’ on the admin node:

  1. ceph-deploy new mon1 mon2 mon3
  2. ceph-deploy install  mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy mon create mon1 mon2 mon3
  4. ceph-deploy gatherkeys mon1
  5. ceph-deploy osd create osd1:sdb
  6. ceph-deploy osd create osd1:sdc
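
Steps 5 and 6 only show the first two drives on the first OSD node; the same command is repeated for every drive on every node. A small sketch that generates the full list is shown below. The host names osd1 through osd6 come from the install step above, but the drive letters (sdb through sdm, 12 per node) are an assumption, so check the actual device names before running anything it prints:

```python
from string import ascii_lowercase

# Generate one 'ceph-deploy osd create host:disk' command per OSD drive.
# Assumption: each node exposes its 12 OSD drives as sdb..sdm.


def osd_create_commands(hosts, drives_per_host, first_letter="b"):
    """Return the ceph-deploy commands for every host/drive pair."""
    start = ascii_lowercase.index(first_letter)
    letters = ascii_lowercase[start:start + drives_per_host]
    if len(letters) < drives_per_host:
        raise ValueError("not enough single-letter device names")
    return [
        f"ceph-deploy osd create {host}:sd{letter}"
        for host in hosts
        for letter in letters
    ]


hosts = [f"osd{n}" for n in range(1, 7)]
commands = osd_create_commands(hosts, drives_per_host=12)
print(f"{len(commands)} commands")
for command in commands[:3]:
    print(command)
```

Printing the commands rather than executing them keeps this a dry run; the output can be reviewed and then pasted into a shell on the admin node.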

The uninstall process went something like this:

  1. ceph-deploy disk zap osd1:sdb
  2. ceph-deploy purge mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy purgedata mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6

Additions to ceph.conf:

Since we wanted to configure an appropriate journal size for our 10GigE network, mount xfs with appropriate options and configure radosgw, we added the following to our ceph.conf (after ‘ceph-deploy new’ but before ‘ceph-deploy install’):

osd_journal_size = 10240
osd_mount_options_xfs = "rw,noatime,nodiratime,logbsize=256k,logbufs=8,inode64"
osd_mkfs_options_xfs = "-f -i size=2048"

host = mon1
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
admin_socket = /var/run/ceph/radosgw.asok
rgw_dns_name =
debug rgw = 20
rgw print continue = true
rgw should log = true
rgw enable usage log = true
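
As a rough guide to where a journal size like 10240 comes from, the Ceph documentation of this era suggests sizing the journal at twice the expected throughput multiplied by the filestore max sync interval. The sketch below applies that guideline; the 5 second interval is the documented default and is assumed here, and the throughput figure is simply what the 10240 MB setting implies rather than something measured:

```python
# Journal sizing sketch based on the guideline:
#   osd journal size = 2 * expected throughput * filestore max sync interval
# Assumption: filestore max sync interval is left at its default of 5 seconds.

DEFAULT_SYNC_INTERVAL_S = 5.0


def journal_size_mb(expected_throughput_mb_s, sync_interval_s=DEFAULT_SYNC_INTERVAL_S):
    """Suggested journal size in MB for a given sustained throughput."""
    return int(2 * expected_throughput_mb_s * sync_interval_s)


def implied_throughput_mb_s(journal_mb, sync_interval_s=DEFAULT_SYNC_INTERVAL_S):
    """Throughput (MB/s) that a given journal size was sized to absorb."""
    return journal_mb / (2 * sync_interval_s)


print(implied_throughput_mb_s(10240))   # throughput implied by the setting above
print(journal_size_mb(1024))            # and back again
```

Read this way, a 10240 MB journal corresponds to about 1 GB/s of expected throughput, which is in the neighbourhood of a single 10 GigE link.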


I used the following commands to benchmark rados, rbd, cephfs, etc.:

  1. rados -p rbd bench 20 write --no-cleanup
  2. rados -p rbd bench 20 seq
  3. dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
  4. dd bs=4M count=512 if=/dev/zero of=test conv=fdatasync

 Ceph blogs worth reading:

megacli cheat sheet

The information below is based heavily off of a post that can be found here:

I am providing the information on my blog in the event that the original blog post becomes unavailable at some point in the future, as we use this information quite regularly.

1-Gather Info: 

Controller information

megacli -AdpAllInfo -aALL
megacli -CfgDsply -aALL
megacli -adpeventlog -getevents -f controller-events.log -a0 -nolog

Enclosure information

megacli -EncInfo -aALL

Virtual drive information

megacli -LDInfo -Lall -aALL

Physical drive information

megacli -PDList -aALL
megacli -PDInfo -PhysDrv [E:S] -aALL

Battery backup information

megacli -AdpBbuCmd -aALL

Check Battery backup warning on boot

megacli -AdpGetProp BatWarnDsbl -a0

Controller management:

Silence active alarm

megacli -AdpSetProp AlarmSilence -aALL

Disable alarm

megacli -AdpSetProp AlarmDsbl -aALL

Enable alarm

megacli -AdpSetProp AlarmEnbl -aALL

Disable battery backup warning on system boot

megacli -AdpSetProp BatWarnDsbl -a0

Change the adapter rebuild rate to 60%:

megacli -AdpSetProp {RebuildRate -60} -aALL

2-Virtual drive management:

Create RAID 0, 1, 5 drive

megacli -CfgLdAdd -r(0|1|5) [E:S, E:S, …] -aN

Create RAID 10 drive

megacli -CfgSpanAdd -r10 -Array0[E:S,E:S] -Array1[E:S,E:S] -aN

Remove drive

megacli -CfgLdDel -Lx -aN

Physical drive management

Set state to offline

megacli -PDOffline -PhysDrv [E:S] -aN

Set state to online

megacli -PDOnline -PhysDrv [E:S] -aN

Mark as missing

megacli -PDMarkMissing -PhysDrv [E:S] -aN

Prepare for removal

megacli -PdPrpRmv -PhysDrv [E:S] -aN

Replace missing drive

megacli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN

The number N of the array parameter is the Span Reference you get using megacli -CfgDsply -aALL and the number N of the row parameter is the Physical Disk in that span or array starting with zero (it’s not the physical disk’s slot!).

Rebuild drive – Drive status should be “Firmware state: Rebuild”

megacli -PDRbld -Start -PhysDrv [E:S] -aN
megacli -PDRbld -Stop -PhysDrv [E:S] -aN
megacli -PDRbld -ShowProg -PhysDrv [E:S] -aN
megacli -PDRbld -ProgDsply -physdrv [E:S] -aN

Clear drive

megacli -PDClear -Start -PhysDrv [E:S] -aN
megacli -PDClear -Stop -PhysDrv [E:S] -aN
megacli -PDClear -ShowProg -PhysDrv [E:S] -aN

Bad to good

megacli -PDMakeGood -PhysDrv [E:S] -aN

Changes drive in state Unconfigured-Bad to Unconfigured-Good.

Hot spare management

Set global hot spare

megacli -PDHSP -Set -PhysDrv [E:S] -aN

Remove hot spare

megacli -PDHSP -Rmv -PhysDrv [E:S] -aN

Set dedicated hot spare

megacli -PDHSP -Set -Dedicated -ArrayN,M,… -PhysDrv [E:S] -aN

Walkthrough: Rebuild a Drive that is marked ‘Foreign’ when Inserted:

Bad to good

megacli -PDMakeGood -PhysDrv [E:S] -aALL

Clear the foreign setting

megacli -CfgForeign -Clear -aALL

Set global hot spare

megacli -PDHSP -Set -PhysDrv [E:S] -aN

Walkthrough: Change/replace a drive

a. Set the drive offline, if it is not already offline due to an error

megacli -PDOffline -PhysDrv [E:S] -aN

b. Mark the drive as missing

megacli -PDMarkMissing -PhysDrv [E:S] -aN

c. Prepare drive for removal

megacli -PDPrpRmv -PhysDrv [E:S] -aN

d. Change/replace the drive

e. If you’re using hot spares then the replaced drive should become your new hot spare drive

megacli -PDHSP -Set -PhysDrv [E:S] -aN

f. In case you’re not working with hot spares, you must re-add the new drive to your RAID virtual drive and start the rebuilding

megacli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN
megacli -PDRbld -Start -PhysDrv [E:S] -aN
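
Since the replacement walkthrough is the one we use most often, here is a sketch that fills in the placeholders and prints the commands in order for a given drive. It only builds strings from the commands listed above and does not run megacli; the enclosure, slot, array and row values in the example are hypothetical and must be taken from your own -PDList and -CfgDsply output:

```python
# Build the command sequence from the 'Change/replace a drive' walkthrough.
# Nothing is executed; the commands are returned as strings for review.


def replace_drive_commands(enclosure, slot, adapter, array=None, row=None,
                           hot_spare=False):
    """Return (before_swap, after_swap) command lists for one drive."""
    drive = f"[{enclosure}:{slot}]"
    before_swap = [
        f"megacli -PDOffline -PhysDrv {drive} -a{adapter}",
        f"megacli -PDMarkMissing -PhysDrv {drive} -a{adapter}",
        f"megacli -PDPrpRmv -PhysDrv {drive} -a{adapter}",
    ]
    if hot_spare:
        # Step e: the new drive becomes the hot spare.
        after_swap = [f"megacli -PDHSP -Set -PhysDrv {drive} -a{adapter}"]
    else:
        # Step f: re-add the drive to the array and start the rebuild.
        if array is None or row is None:
            raise ValueError("array and row are required without hot spares")
        after_swap = [
            f"megacli -PdReplaceMissing -PhysDrv {drive} "
            f"-Array{array} -row{row} -a{adapter}",
            f"megacli -PDRbld -Start -PhysDrv {drive} -a{adapter}",
        ]
    return before_swap, after_swap


# Hypothetical example: enclosure 32, slot 4, adapter 0, array 0, row 4.
before, after = replace_drive_commands(32, 4, 0, array=0, row=4)
print("\n".join(before))
print("(swap the physical drive)")
print("\n".join(after))
```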

3-Gathering Standard logs

#rm -f MegaSAS.log
#megacli -adpallinfo -a0
#megacli -encinfo -a0
#megacli -ldinfo -lall -a0
#megacli -pdlist -a0
#megacli -adpeventlog -getevents -f controller-events.log -a0 -nolog
#megacli -fwtermlog -dsply -a0 -nolog > controller-fwterm.log

Openmanage 7.3 on Proxmox 3.0 (Debian Wheezy)

I ran into a few issues while trying to install Dell Openmanage on the latest version of Proxmox (3.0).

In order to get things working correctly on Proxmox 3.x, here are the steps that are required:

#echo "deb wheezy openmanage" > /etc/apt/sources.list.d/
#gpg --keyserver --recv-key 1285491434D8786F
#gpg -a --export 1285491434D8786F | sudo apt-key add -
#sudo apt-get update
#apt-get install libcurl3
#sudo apt-get install srvadmin-all
#sudo service dataeng start
#sudo service dsm_om_connsvc start

Once you get everything installed correctly you will be able to log in to the Openmanage web interface here:

https://<hostname or ip address>:1311

The first time you log in you should use the ‘root’ username and associated password.

Ceph overview with videos

I recently started looking into Ceph as a possible replacement for our 2 node Gluster cluster. After reading numerous blog posts and watching several videos on the topic, I believe that the following three videos provide the best overview of and insight into how Ceph works, what it offers, how to manage a cluster, and how it differs from Gluster.

This video was given at the 2013 conference. The first one is a Ceph-only talk given by Florian Haas and Tim Serong. It provides a complete overview of the Ceph software and its feature set, and goes into detail about the current status of each of Ceph’s components (block storage, object storage, filesystem storage, etc).

The second video is a bit more lighthearted. It is a talk which involves some back and forth between Ceph’s Sage Weil and Gluster’s John Mark Walker, moderated by Florian Haas. It covers some of the basic use cases for each filesystem, offers some technical insight into the overall design, and discusses the ways in which the filesystems are similar, the ways in which the projects differ from one another, and some of the items on each of their roadmaps.

The final video also takes place at the 2013 conference. It covers some of the same topics that were discussed in the prior two videos, however this one is also geared toward the operations aspect of managing a Ceph cluster. This talk is given by Sage Weil as well.

Ubuntu 13.04 + Gnome 3.8 + ATI HD5450

I recently switched from using OpenSuse 12, to using Ubuntu 13.04 on my desktop machine at work. In order to get everything working correctly I had to use the latest ATI beta drivers (13.3 Beta).

After you unzip the downloaded file, simply run:

# ./

Initially the install script errored out, after it failed to locate ‘version.h’, so before I could successfully complete the driver install, I had to run:

# ln -s /lib/modules/3.8.0-19-generic/build/include/generated/uapi/linux/version.h /lib/modules/3.8.0-19-generic/build/include/linux/version.h

Since these are beta drivers, they have a watermark in the lower right hand corner of each monitor.  The solution to removing that watermark can be found here.

If you have dual monitors as I do, once you boot into Ubuntu you will want to configure them with the ATI Control Center:

# amdcccle

In order to get the latest version of Gnome you can add the Gnome3 team PPA:

# sudo add-apt-repository ppa:gnome3-team/gnome3
# sudo apt-get update
# sudo apt-get upgrade

If you want access to the latest bleeding edge applications and utilities, you can add the Ricotz testing PPA as well:

# sudo add-apt-repository ppa:ricotz/testing
# sudo apt-get update
# sudo apt-get upgrade

The Ricotz staging PPA can be used as well, however there is a chance that if you upgrade with it, your system may end up becoming unstable:

# sudo add-apt-repository ppa:ricotz/staging
# sudo apt-get update
# sudo apt-get upgrade

Finally, I ran into a bug in Gnome where my entire desktop background was gray/white. In order to get that issue fixed I followed the advice given in the comments section of this Youtube video.

Open Stack Ops Guide

Ted Neykov, currently a hacker at Rackspace, pointed me in the direction of the new OpenStack Operations Guide. I have only had a chance to browse the .pdf at this point, however I believe this will end up being a very informative and useful book for me going forward.

Taken from the guide’s summary:

‘This book offers hard-earned experience from OpenStack operators who have run OpenStack in production for six months or longer.  They’ve gathered their notes, shared their stories, and learned from each other in the room. We invite you to join in the quest for best practices in OpenStack cloud operations’

Here is a quick video that was released along with the guide, which briefly describes the process they used during its creation:

ZFS Day videos

The folks over at ZFS Day 2012 have posted a nice and wide-ranging series of videos from last year’s event.

Taken from the description of the event:

‘The first ever ZFS conference covered all aspects of using ZFS in production. You can find ZFS in the most demanding environments, from video servers to cloud platforms to databases to NFS servers to HPC. Come learn about what makes ZFS a great storage system for these and other applications.’

Of particular interest to me (because of our desire to run ZFS on Linux in production environments) are the following two videos:

ZFS for Linux Implementation – Brian Behlendorf:

Panel: The State of ZFS on various os versions:

You can find a complete list of videos, talks and slides here.

Maatkit is now Percona Toolkit for MySQL

I have written about Maatkit in the past, and more specifically how to use ‘mk-query-digest’. Development on Maatkit has stopped at this point and you should look to use the Percona Toolkit for MySQL going forward.

We use ‘pt-query-digest’ on a regular basis on our servers in order to profile running MySQL instances during periods of high load and at times when general query profiling is required. It appears as though there have been some changes to the way in which the script works during the transition from ‘mk-query-digest’ to ‘pt-query-digest’.

In order to use it at this point you should use the following example syntax:

# pt-query-digest --user user_name --password pass_word --processlist localhost --interval 0.01 --run-time 10m

You will also notice that there is a new command line parameter, ‘--run-time’, which determines how long the profiler runs before producing a report. In this case I would like to run the profiler for 10 minutes.

The output is also slightly different, in that the summary that was normally printed at the end of the report has been moved toward the beginning.

InnoDB Quick Reference Guide

We are currently in the process of upgrading our MySQL 5.1 instances to MySQL 5.6. Because of this, we are once again very focused on overall MySQL performance, and more specifically on InnoDB performance.

Matt Reid recently published a book entitled ‘InnoDB Quick Reference Guide.’

I believe that this book will come in very handy to us over the next few weeks and months, as we once again look to delve into MySQL behavior and InnoDB internals.

I have downloaded a copy of this e-book and I will be providing a more in-depth review shortly.


Chapter 1: Getting Started with Innodb
This chapter talks about the features of the InnoDB storage engine, as well as its requirements, supported platforms, etc. The chapter does a good job of providing a clear overview of InnoDB and its overall features and use cases.

Chapter 2: Basic Configuration Parameters
This chapter talks about various configuration variables and how they relate to and affect InnoDB. The chapter gives you a better understanding of some of the more basic InnoDB configuration options and how they affect InnoDB.

Chapter 3: Advanced Configuration Parameters
This chapter provides a much more in-depth look at some of the more advanced configuration variables used to control the behavior of InnoDB and how they relate to its overall performance. The chapter does a great job of covering all the necessary InnoDB related parameters that really affect how InnoDB performs under real world workloads.

Chapter 4: Load Testing InnoDB for Performance
This chapter focuses on the numerous open-source tools that can be used to test the performance of both the application (MySQL) and the OS (filesystem, etc). All the major tools are covered here, and the chapter does a good job of explaining each tool and its use cases.

Chapter 5: Maintenance and Monitoring
This section discusses some typical maintenance tasks that are associated with running InnoDB. It also covers some common methods for finding and pulling runtime and performance information from the storage engine.

Chapter 6: Troubleshooting InnoDB
This chapter provides some good insight into several of the more common issues that you could face if you have an InnoDB deployment, from crash recovery to issues regarding backup and recovery.

Chapter 7: References and links
A small section that you can use to find further detail about InnoDB and MySQL.

This book does a good job of covering the main features, parameters, and use cases for the MySQL InnoDB storage engine.