
Upgrading Ceph from Hammer to Jewel

We recently upgraded our Ceph cluster from the latest version of Hammer to 10.2.7 (Jewel). Here are the steps we used to complete the upgrade. Because the Ceph daemons now run as the ceph user instead of root, this particular upgrade required an additional step of using chown to change the ownership of each daemon's data directory.
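
Before starting, it's worth confirming the versions currently running and that the cluster is healthy. From the admin node (assuming it has a client keyring), something like the following works; ceph tell osd.* version asks each running OSD for its version:
ceph --version
ceph tell osd.* version
ceph -s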

Set the cluster to the ‘noout’ state so that we can perform the upgrade without any data movement:
ceph osd set noout

From the ceph-deploy control node, upgrade the monitor nodes first:
ceph-deploy install --release jewel ceph-mon1 ceph-mon2 ceph-mon3

On each monitor node:
stop ceph-mon-all
cd /var/lib/ceph
chown -R ceph:ceph /var/lib/ceph/
start ceph-mon-all
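
Before moving on to the OSDs, check that all of the monitors have rejoined the quorum, for example:
ceph mon stat
ceph -s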

Next, move on to the OSD nodes:
ceph-deploy install --release jewel ceph-osd1 ceph-osd2 ceph-osd3 ceph-osd4

Add the following line to /etc/ceph/ceph.conf on each OSD node (this allows the Ceph daemons to start up under the old permission scheme):

setuser match path = /var/lib/ceph/$type/$cluster-$id
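
For reference, a minimal sketch of how that might look in /etc/ceph/ceph.conf (placing it in an [osd] section is our assumption; it can also go under [global]):
[osd]
setuser match path = /var/lib/ceph/$type/$cluster-$id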

Stop and restart the OSDs on each node:
stop ceph-osd-all
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout
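
After unsetting noout, wait for the cluster to settle back to HEALTH_OK before doing anything else. A couple of generic ways to keep an eye on it:
ceph -s
ceph -w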

Once the cluster is healthy again and you have some time to make the necessary permission changes for the OSD daemons, you can do the following:

Set noout:
ceph osd set noout

Log on to each OSD node, one at a time, and run the following commands:
find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R root:root

stop ceph-osd-all

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R ceph:ceph

chown -R ceph:ceph /var/lib/ceph/
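
Before restarting anything, it doesn't hurt to spot-check that the ownership actually changed (paths assume the default OSD data directory layout):
ls -ld /var/lib/ceph/osd/ceph-*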

Comment out the setuser line in ceph.conf and restart the OSDs:
#setuser match path = /var/lib/ceph/$type/$cluster-$id
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout
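
At this point the OSD daemons should be running as the ceph user rather than root. One quick way to confirm on each node:
ps -o user,pid,cmd -C ceph-osd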

Replace failed Ceph disk on Dell hardware

We are using Dell R720 and R730xd servers for our Ceph OSD nodes. Here is the process we use to replace a disk and/or remove the faulty OSD from service.

In this example we will attempt to replace OSD #45 (slot #9 of this particular server):
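
Before touching the RAID controller, it helps to confirm which block device backs this OSD while it is still mounted. These are just two generic ways to do that on a Jewel node; the slot mapping itself still has to come from the controller:
df -h /var/lib/ceph/osd/ceph-45
ceph-disk list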

Stop the OSD and unmount the directory:
stop ceph-osd id=45
umount /var/lib/ceph/osd/ceph-45
Reweight the OSD to 0.0 and wait for the cluster to rebalance:
ceph osd crush reweight osd.45 0.0

Once the rebalance finishes, remove the OSD from the cluster:
ceph osd out osd.45
service ceph stop osd.45
ceph osd crush remove osd.45
ceph auth del osd.45
ceph osd rm osd.45
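
osd.45 should now be gone from the CRUSH map and the OSD list; a quick sanity check:
ceph osd tree
ceph osd stat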

List the physical drives to confirm the enclosure and slot numbers:
megacli -PDList -a0

If the drive is not already offline, take it offline:
megacli -pdoffline -physdrv[32:9] -a0
Mark the disk as missing:
megacli -pdmarkmissing -physdrv[32:9] -a0
Prepare the drive for removal so it can safely be pulled:
megacli -pdprprmv -physdrv[32:9] -a0

NOW PHYSICALLY REPLACE THE BAD DRIVE WITH A NEW ONE.

Set the drive state to online if it isn't already:
megacli -PDOnline -PhysDrv [32:9] -a0
Create a RAID-0 array on the new drive:
megacli -CfgLdAdd -r0[32:9] -a0

You may need to discard the preserved cache before doing the last step.
First, get the preserved cache list:
megacli -GetPreservedCacheList -a0
Then discard whichever entry you need to (the -L number comes from the list above):
megacli -DiscardPreservedCache -L2 -a0
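
Once the array is created and any preserved cache is cleared, you can confirm the controller sees the new logical drive before handing it back to Ceph; the exact output format varies by MegaCli version:
megacli -LDInfo -Lall -a0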

Recreate the OSD, with BlueStore as the new default backend:
ceph-deploy disk zap hqosdNUM /dev/sdx
ceph-deploy osd create --data /dev/sdx hqosdNUM
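
After ceph-deploy finishes, the new OSD should come up and backfill on its own. A final check that it joined the cluster and that recovery is progressing:
ceph osd tree
ceph -s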