Category Archives: Ceph

Ceph upgrade to Luminous + Ubuntu 16.04

Our Ubuntu 14.04 cluster has been running Ceph Jewel without issue for about the last 9 months.  During that time the Ceph team released the latest LTS version, called Luminous.  Given that Jewel is slated for EOL at some point in the next 2 or 3 months, I thought it was a good time to once again upgrade the cluster.

Before upgrading the version of Ceph on the cluster, I decided to go ahead and upgrade the version of Ubuntu from 14.04 to 16.04.  The process was relatively simple; I ran the following chain of commands on each node:

apt-get update; apt-get upgrade; apt-get dist-upgrade ; apt-get install update-manager-core; do-release-upgrade

After the upgrade, the process for restarting Ceph daemons changes slightly, because Ubuntu 16.04 uses systemd instead of upstart.  For example, you will have to use the following commands to restart each of the services going forward:

systemctl restart ceph-mon.target
systemctl restart ceph-osd.target
systemctl restart ceph-mgr.target

After the Ubuntu upgrade I decided to start upgrading the version of Ceph on the cluster.
Using ‘ceph-deploy’, I issued the following commands for each set of nodes:

Monitor nodes first:
ceph-deploy install --release luminous hqceph1 hqceph2 hqceph3 (from the admin node)
systemctl restart ceph-mon.target (locally on each server)

OSD nodes second (for example):

Set ‘noout’ so your data does not try to rebalance during the OSD restarts:
ceph osd set noout

ceph-deploy install --release luminous hqosd1 hqosd2 hqosd3  (from admin node)
systemctl restart ceph-osd.target (locally on each server)

Next you should do the same for any RGW nodes that you might have.
Finally repeat the process for any other client nodes (librbd, KRBD, etc).

Don’t forget to unset noout before you finish:
ceph osd unset noout (from the admin node)
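The rolling-upgrade sequence above can be sketched as a single script.  This is a dry run, not the exact procedure: the run() helper (my addition) only echoes each command, and the ssh wrapping is an assumption about how you reach each node.

```shell
#!/bin/sh
# Dry-run sketch of the rolling Luminous upgrade; hostnames are the ones
# used above.  run() just echoes -- change it to execute for real.
run() { echo "+ $*"; }

upgrade_plan() {
    mons="hqceph1 hqceph2 hqceph3"
    osds="hqosd1 hqosd2 hqosd3"

    # Monitors first (install from the admin node, restart locally)
    run ceph-deploy install --release luminous $mons
    for h in $mons; do run ssh "$h" systemctl restart ceph-mon.target; done

    # Keep data from rebalancing while the OSDs bounce
    run ceph osd set noout
    run ceph-deploy install --release luminous $osds
    for h in $osds; do run ssh "$h" systemctl restart ceph-osd.target; done
    run ceph osd unset noout
}
upgrade_plan
```

RGW and client nodes would follow the same install/restart pattern at the end of the plan.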

This release also introduced a new daemon, the manager daemon (‘mgr’).  You can learn more about this new process here:  http://docs.ceph.com/docs/mimic/mgr/

I was able to install the mgr daemon on two of my existing nodes (one as the active and the second as the standby) using the following commands:

ceph-deploy mgr create hqceph2 (from the admin node)
ceph-deploy mgr create hqceph3 (from the admin node)

Upgrading Ceph from Hammer to Jewel

We recently upgraded our Ceph cluster from the latest version of Hammer to 10.2.7 (Jewel). Here are the steps that we used in order to complete the upgrade. Due to a change in Ceph daemon permissions, this specific upgrade required an additional step of using chmod to change file permissions for each daemon directory.

Set the cluster to the ‘noout’ state so that we can perform the upgrade without any data movement:
ceph osd set noout

From the ceph-deploy control node, upgrade the monitor nodes first:
ceph-deploy install --release jewel ceph-mon1 ceph-mon2 ceph-mon3

On each monitor node:
stop ceph-mon-all
cd /var/lib/ceph
chown -R ceph:ceph /var/lib/ceph/
start ceph-mon-all

Next move on to the OSD nodes:
ceph-deploy install --release jewel ceph-osd1 ceph-osd2 ceph-osd3 ceph-osd4

Add the following line to /etc/ceph/ceph.conf on each OSD node (this will allow the Ceph daemons to start up using the old permission scheme):

setuser match path = /var/lib/ceph/$type/$cluster-$id

Stop the OSDs and restart them on each node:
stop ceph-osd-all
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout

Once the cluster is healthy again and you have some time to make the necessary permission changes for the OSD daemons, you can do the following:

Set noout:
ceph osd set noout

Log onto each OSD node, one at a time, and run the following commands:
find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R root:root

stop ceph-osd-all

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R ceph:ceph

chown -R ceph:ceph /var/lib/ceph/

Comment out the setuser line in ceph.conf and restart the OSDs:
#setuser match path = /var/lib/ceph/$type/$cluster-$id
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout
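Collected in one place, the per-node permission migration looks like the following.  It is a dry-run sketch (run() echoes instead of executing, and is my addition); you would run the real commands on one OSD node at a time:

```shell
#!/bin/sh
# Dry-run sketch of the OSD permission migration above.
run() { echo "+ $*"; }

migrate_osd_perms() {
    # The first recursive chown runs while the OSDs are still up, so the
    # slow pass happens before any downtime; the pass after the stop only
    # has to touch what changed in the meantime.
    run "find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R root:root"
    run stop ceph-osd-all
    run "find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R ceph:ceph"
    run chown -R ceph:ceph /var/lib/ceph/
    # comment out the 'setuser match path' line in ceph.conf here, then:
    run start ceph-osd-all
}
migrate_osd_perms
```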

Replace failed Ceph disk on Dell hardware

We are using Dell R720 and R730xd servers for our Ceph OSD nodes. Here is the process that we use to replace a disk and/or remove a faulty OSD from service.

In this example we will attempt to replace OSD #45 (slot #9 of this particular server):

Stop the OSD and unmount its directory (num is 45 in this example):
stop ceph-osd id=45
umount /var/lib/ceph/osd/ceph-45

Reweight the OSD to 0 and wait for the cluster to rebalance:
ceph osd crush reweight osd.num 0.0

Then remove the OSD from the cluster:
ceph osd out osd.num
ceph osd crush remove osd.num
ceph auth del osd.num
ceph osd rm osd.num
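The same removal steps, as a parameterized dry run with the OSD id (45 here) substituted for num.  The run() helper is my addition and only echoes; the duplicate daemon stop from the original list is dropped:

```shell
#!/bin/sh
# Dry-run sketch of the OSD removal steps above, with the id as a parameter.
run() { echo "+ $*"; }

remove_osd() {
    id="$1"
    run stop ceph-osd id="$id"                  # on the OSD host itself
    run umount "/var/lib/ceph/osd/ceph-$id"
    run ceph osd crush reweight "osd.$id" 0.0   # then wait for the rebalance
    run ceph osd out "osd.$id"
    run ceph osd crush remove "osd.$id"
    run ceph auth del "osd.$id"
    run ceph osd rm "osd.$id"
}
remove_osd 45
```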

Locate the drive’s enclosure and slot numbers:
megacli -PDList -a0

If the drive is not already offline, offline it:
megacli -pdoffline -physdrv[32:9] -a0
Mark disk as missing:
megacli -pdmarkmissing -physdrv[32:9] -a0
Permanently remove drive from array:
megacli -pdprprmv -physdrv[32:9] -a0

NOW PHYSICALLY REPLACE THE BAD DRIVE WITH A NEW ONE.

Set drive state to online if not already:
megacli -PDOnline -PhysDrv [32:9] -a0
Create Raid-0 array on new drive:
megacli -CfgLdAdd -r0[32:9] -a0

You may need to discard the preserved cache before the last step will succeed.
First get the preserved cache list:
megacli -GetPreservedCacheList -a0
Clear whichever one you need to:
megacli -DiscardPreservedCache -L2 -a0

Recreate the OSD, using Bluestore as the new default:
ceph-deploy disk zap hqosdNUM /dev/sdx
ceph-deploy osd create --data /dev/sdx hqosdNUM
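The megacli steps above all revolve around one enclosure:slot pair (32:9 in this example), so a hypothetical helper can make the pairing explicit.  As before, run() is my addition and only echoes the commands:

```shell
#!/bin/sh
# Dry-run sketch of the megacli drive-replacement sequence above.
run() { echo "+ $*"; }

replace_drive() {
    es="$1"   # enclosure:slot, e.g. 32:9
    run megacli -pdoffline "-physdrv[$es]" -a0
    run megacli -pdmarkmissing "-physdrv[$es]" -a0
    run megacli -pdprprmv "-physdrv[$es]" -a0
    echo "## physically swap the drive now, then:"
    run megacli -PDOnline -PhysDrv "[$es]" -a0
    run megacli -CfgLdAdd "-r0[$es]" -a0
}
replace_drive 32:9
```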

Upgrading Ceph from Firefly to Hammer

Last week we decided to upgrade our 13 node Ceph cluster from version 0.80.11 (firefly) to version 0.94.6 (hammer).

Although we have not had any known issues with the cluster running firefly, official support for firefly ended in January 2016.  The jewel release is also due out soon, and it will be easier to upgrade to jewel from either hammer or infernalis.

The overall upgrade process was relatively painless. I used the ceph-deploy script to create the cluster initially, and I chose to use it again to upgrade the cluster to hammer.

1) First I pull in the current config file and keys:
root@admin:/ceph-deploy# ceph-deploy config pull mon1
root@admin:/ceph-deploy# ceph-deploy gatherkeys mon1

2) Next we upgrade each of the mon daemons:
root@admin:/ceph-deploy# ceph-deploy install --release hammer mon1 mon2 mon3

3) Now we can restart the daemons on each mon server
root@mon1:~# stop ceph-mon-all
root@mon1:~# start ceph-mon-all

4) Next it’s time to upgrade the osd server daemons:
root@admin:/ceph-deploy# ceph-deploy install --release hammer osd1 osd2 osd3 osd4

5) Now we can restart the daemons on each of the osd servers:
root@osd4:~# stop ceph-osd-all
root@osd4:~# start ceph-osd-all

6) Finally you can upgrade any client server daemons that you have:
root@admin:/ceph-deploy# ceph-deploy install --release hammer clientserver1

Remove objects from Ceph pool without deleting pool

I recently wanted to clean up a few of the pools I have been using for rados benchmarking. I did not want to delete the pools, just the objects inside them.

If you are trying to clear up ‘rados bench’ objects you can use something like this:

‘rados -p temp_pool cleanup --prefix benchmark’

If you are trying to remove all of the objects from a pool whose object names do not share a common prefix, you can use a loop like the following:

****WARNING THIS WILL ERASE ALL OBJECTS IN YOUR POOL*****
‘for i in `rados -p temp_pool ls`; do echo $i; rados -p temp_pool rm $i; done’
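A small helper can turn the same idea into a reviewable dry run: purge_cmds (a name I made up) reads object names on stdin and prints the corresponding rm commands, which you can inspect before piping them to sh:

```shell
#!/bin/sh
# Emit (but do not run) one 'rados rm' per object name read on stdin.
purge_cmds() {
    pool="$1"
    while IFS= read -r obj; do
        echo "rados -p $pool rm $obj"
    done
}

# dry run:  rados -p temp_pool ls | purge_cmds temp_pool
# execute:  rados -p temp_pool ls | purge_cmds temp_pool | sh
```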

Ceph cheatsheet

Here is a list of Ceph commands that we tend to use on a regular basis:

a)Display cluster status:
‘ceph -s’

b)Display running cluster status:
‘ceph -w’

c)Display pool usage stats:
‘ceph df’

d)List pools:
‘ceph osd lspools’

e)Display per pool placement group and replication levels:
‘ceph osd dump | grep "replicated size"’

f)Set pool placement group sizes:
‘ceph osd pool set pool_name pg_num 512’
‘ceph osd pool set pool_name pgp_num 512’

g)Display rbd images in a pool:
‘rbd -p pool_name list’

h)Create rbd snapshot:
‘rbd snap create pool_name/image_name@snap_name’

i)Display rbd snapshots:
‘rbd snap ls pool_name/image_name’

j)Display which images are mapped via kernel:
‘rbd showmapped’

k)Get rados statistics:
‘rados df’

l)List pieces of pool using rados:
‘rados -p pool_name ls’

Benchmarking Ceph

This is a post that I have had in draft mode for quite some time. At this point some of this information is out of date, so I am planning on writing a ‘part II’ post shortly, which will include some updated information.

Benchmarking Ceph:

Ever since we got our Ceph cluster up and running, I’ve been running various benchmarking applications against different cluster configurations. Just to review, the cluster that we recently built has the following specs:

Cluster specs:

  • 3 x Dell R-420;32 GB of RAM; for MON/RADOSGW/MDS nodes
  • 6 x Dell R-720xd;64 GB of RAM; for OSD nodes
  • 72 x 4TB SAS drives as OSD’s
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space

The main role for this cluster will be one primarily tied to archiving audio and video assets. This being the case, we decided to try and maximize total cluster capacity (4TB drives, no ssd’s, etc), while at the same time being able to achieve and maintain reasonable cluster throughput (10 GigE, 12 drives per osd nodes, etc).

Most of my benchmarking focused on rbd and radosgw, because one of these is most likely to be what we introduce into production when we are ready.  We are very much awaiting a stable and supported cephfs release (which will hopefully be available sometime in mid-late 2014), which will allow us to switch out our rbd + samba setup for one based on cephfs.

Rados Benchmarks: 

I set up a pool called ‘test’ with 1600 pgs in order to run some benchmarks using the ‘rados bench’ tool that comes with Ceph.  I started with a replication level of ‘1’ and worked my way up to a replication level of ‘3’.

root@hqceph1:/# rados -p test bench 20 write (rep size=1)
Total time run: 20.241290
Total writes made: 5646
Write size: 4194304
Bandwidth (MB/sec): 1115.739
Stddev Bandwidth: 246.027
Max bandwidth (MB/sec): 1136
Min bandwidth (MB/sec): 0
Average Latency: 0.0571572
Stddev Latency: 0.0262513
Max latency: 0.336378
Min latency: 0.02248
root@hqceph1:/# rados -p test bench 20 write (rep size=2)
Total time run: 20.547026
Total writes made: 2910
Write size: 4194304
Bandwidth (MB/sec): 566.505
Stddev Bandwidth: 154.643
Max bandwidth (MB/sec): 764
Min bandwidth (MB/sec): 0
Average Latency: 0.112384
Stddev Latency: 0.198579
Max latency: 2.5105
Min latency: 0.025391
root@hqceph1:/# rados -p test bench 20 write (rep size=3)
Total time run: 20.755272
Total writes made: 2481
Write size: 4194304
Bandwidth (MB/sec): 478.144
Stddev Bandwidth: 147.064
Max bandwidth (MB/sec): 728
Min bandwidth (MB/sec): 0
Average Latency: 0.133827
Stddev Latency: 0.229962
Max latency: 3.32957
Min latency: 0.029481
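When comparing runs like the three above, it helps to pull just the bandwidth figure out of the rados bench output; a one-line awk filter (matching the field layout printed above) does it:

```shell
#!/bin/sh
# Print only the average-bandwidth value from 'rados bench' output.
summarize() { awk -F': *' '/^Bandwidth \(MB\/sec\)/ {print $2}'; }

# usage: rados -p test bench 20 write | summarize
# fed the rep size=1 output above, it prints 1115.739
printf 'Total writes made: 5646\nBandwidth (MB/sec): 1115.739\n' | summarize
```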

RBD Benchmarks:

Next I set up a 10GB block device using rbd:
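The creation and mapping commands aren't shown here; a typical sequence for a 10GB image would look like the following (the image name test1 and the mount point are assumptions on my part), echoed via run() rather than executed:

```shell
#!/bin/sh
# Dry-run sketch of creating and mapping a 10GB rbd image for the dd tests.
run() { echo "+ $*"; }

run rbd create test1 --size 10240   # --size is in MB, so 10 GB
run rbd map test1                   # shows up as /dev/rbdN
run mkfs.xfs /dev/rbd1
run mount /dev/rbd1 /mnt/blockdev
```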

root@ceph1:/blockdev# dd bs=1M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.440333 s, 610 MB/s
root@ceph1:/blockdev# dd bs=4M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 1.07413 s, 1000 MB/s
root@ceph1:/mnt/blockdev# hdparm -Tt /dev/rbd1 (rep size=1)
/dev/rbd1:
Timing cached reads: 16296 MB in 2.00 seconds = 8155.69 MB/sec
Timing buffered disk reads: 246 MB in 3.10 seconds = 79.48 MB/sec
root@ceph1:/mnt/blockdev# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 1.29985 s, 207 MB/s
root@ceph1:/mnt/blockdev# dd bs=4M count=256 if=/dev/zero of=test2 conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 4.02375 s, 267 MB/s
root@cephmount1:/mnt/ceph-block-device/test# hdparm -Tt /dev/rbd1 (rep size=2)
/dev/rbd1:
Timing cached reads: 16434 MB in 2.00 seconds = 8225.55 MB/sec
Timing buffered disk reads: 152 MB in 3.01 seconds = 50.55 MB/sec

Radosgw Benchmarks:

Using s3cmd (s3tools), I was able to achieve about 70MB/s when pushing files to Ceph via the S3 REST API.

Ceph braindump part1

After spending about 4 months testing, benchmarking, setting up and breaking down various Ceph clusters, I thought I would spend some time documenting some of the things I have learned while setting up cephfs, rbd and radosgw along the way.

First let me talk a little bit about the details of the cluster that we will be putting into production over the next several weeks.

Cluster specs:

  • 6 x Dell R-720xd;64 GB of RAM; for OSD nodes
  • 72 x 4TB SAS drives as OSD’s
  • 3 x Dell R-420;32 GB of RAM; for MON/RADOSGW/MDS nodes
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space

The process I used to set up and tear down our cluster during testing was quite simple.  After installing ‘ceph-deploy’ on the admin node:

  1. ceph-deploy new mon1 mon2 mon3
  2. ceph-deploy install  mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy mon create mon1 mon2 mon3
  4. ceph-deploy gatherkeys mon1
  5. ceph-deploy osd create osd1:sdb
  6. ceph-deploy osd create osd1:sdc
    ……….
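Steps 5 and 6 repeat for every disk on every OSD host, so in practice a loop does the typing.  The host and disk names below are illustrative (the list above elides the full set), and the commands are echoed via run(), not executed:

```shell
#!/bin/sh
# Dry run: one 'ceph-deploy osd create' per host/disk pair.
run() { echo "+ $*"; }

create_osds() {
    for host in osd1 osd2 osd3 osd4 osd5 osd6; do
        for disk in sdb sdc sdd; do
            run ceph-deploy osd create "$host:$disk"
        done
    done
}
create_osds
```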

The uninstall process went something like this:

  1. ceph-deploy disk zap osd1:sdb
    ……….
  2. ceph-deploy purge mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy purgedata mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6

Additions to ceph.conf:

Since we wanted to configure an appropriate journal size for our 10GigE network, mount xfs with appropriate options, and configure radosgw, we added the following to our ceph.conf (after ‘ceph-deploy new’ but before ‘ceph-deploy install’):

[global]
osd_journal_size = 10240
osd_mount_options_xfs = "rw,noatime,nodiratime,logbsize=256k,logbufs=8,inode64"
osd_mkfs_options_xfs = "-f -i size=2048"

[client.radosgw.gateway]
host = mon1
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
admin_socket = /var/run/ceph/radosgw.asok
rgw_dns_name = yourdomain.com
debug rgw = 20
rgw print continue = true
rgw should log = true
rgw enable usage log = true

Benchmarking:

I used the following commands to benchmark rados, rbd, cephfs, etc.:

  1. rados -p rbd bench 20 write --no-cleanup
  2. rados -p rbd bench 20 seq
  3. dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
  4. dd bs=4M count=512 if=/dev/zero of=test conv=fdatasync

Ceph blogs worth reading:

http://ceph.com/community/blog/
http://www.sebastien-han.fr/blog/
http://dachary.org/

Ceph overview with videos

I recently started looking into Ceph as a possible replacement for our 2 node Gluster cluster. After reading numerous blog posts and watching several videos on the topic, I believe that the following three videos provide the best overview of and insight into how Ceph works, what it offers, how to manage a cluster, and how it differs from Gluster.

The first video was given at the 2013 linux.conf.au conference. It is a Ceph-only talk by Florian Haas and Tim Serong.  It provides a complete overview of the Ceph software and its feature set, and goes into detail about the current status of each of Ceph’s components (block storage, object storage, filesystem storage, etc.).

The second video is a bit more lighthearted: a talk involving some back and forth between Ceph’s Sage Weil and Gluster’s John Mark Walker, moderated by Florian Haas.  It covers some of the basic use cases for each filesystem, offers some technical insight into the overall designs, and discusses some of the ways in which the filesystems are similar, items on each of their roadmaps, and some of the ways the projects differ from one another.

The final video also took place at the 2013 linux.conf.au conference.  It covers some of the same topics as the prior two videos, but is geared more toward the operational aspects of managing a Ceph cluster.  This talk is also given by Sage Weil.