Ceph upgrade to Luminous + Ubuntu 16.04

Our Ubuntu 14.04 cluster has been running Ceph Jewel without issue for about the last 9 months.  During that time the Ceph team released the latest LTS version, called Luminous.  Given that Jewel is slated for EOL at some point in the next 2 or 3 months, I thought it was a good time to once again upgrade the cluster.

Before I started working on upgrading the version of Ceph on the cluster, I decided to go ahead and upgrade the version of Ubuntu from 14.04 to 16.04.  The process was a relatively simple one; I used the following set of commands on each node:

apt-get update; apt-get upgrade; apt-get dist-upgrade ; apt-get install update-manager-core; do-release-upgrade

After the upgrade, the process for restarting Ceph daemons changes slightly because Ubuntu 16.04 uses systemd instead of upstart.  For example, you will have to use the following commands to restart each of the services going forward:

systemctl restart ceph-mon.target
systemctl restart ceph-osd.target
systemctl restart ceph-mgr.target
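
If you want to verify that an individual daemon came back up after a restart, systemd also lets you query specific units. A quick sanity check looks something like this (the OSD id below is just an example, substitute your own):

systemctl status ceph-osd@12 (locally on the OSD server)
systemctl list-units 'ceph*' (lists every Ceph unit on the node)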

After the Ubuntu upgrade I decided to start upgrading the version of Ceph on the cluster.
Using ‘ceph-deploy’, I issued the following commands for each set of nodes:

Monitor nodes first:
ceph-deploy install --release luminous hqceph1 hqceph2 hqceph3 (from the admin node)
systemctl restart ceph-mon.target (locally on each server)

OSD nodes second (for example):

Set ‘noout’ so your data does not try to rebalance during the OSD restarts:
ceph osd set noout

ceph-deploy install --release luminous hqosd1 hqosd2 hqosd3  (from admin node)
systemctl restart ceph-osd.target (locally on each server)

Next you should do the same for any RGW nodes that you might have.
Finally repeat the process for any other client nodes (librbd, KRBD, etc).
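
For example, a hypothetical RGW host named hqrgw1 (the hostname here is just a placeholder) would be upgraded the same way as the other nodes:

ceph-deploy install --release luminous hqrgw1 (from the admin node)
systemctl restart ceph-radosgw.target (locally on the RGW server)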

Don’t forget to unset noout before you finish:
ceph osd unset noout (from the admin node)

This release also introduced a new daemon called the manager daemon or ‘mgr’.  You can learn more about this new process here:  http://docs.ceph.com/docs/mimic/mgr/

I was able to install the mgr daemon on two of my existing nodes (one as the active and the second as the standby) using the following commands:

ceph-deploy mgr create hqceph2 (from the admin node)
ceph-deploy mgr create hqceph3 (from the admin node)
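
Once the mgr daemons have been created, you can confirm that one is active and the other is on standby by looking at the ‘mgr:’ line in the cluster status, and you can also list the modules the manager has loaded:

ceph -s (from the admin node)
ceph mgr module ls (from the admin node)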

Remove unused servers from PMM

We recently decommissioned some unused MySQL servers. Part of this process involved removing the nodes from Percona Monitoring and Management (PMM). I found that although this process is very simple overall, it is not very intuitive. The process is as follows:

1) Find node name:
curl -s 'http://pmm-hostname.domain.com/v1/internal/ui/nodes?dc=dc1' | python -mjson.tool

2) Remove node from consul:
curl -s -X PUT -d '{"Datacenter":"dc1","Node":"node_name"}' 'http://pmm-hostname.domain.com/v1/catalog/deregister?dc=dc1'

3) Remove node from prometheus:
curl -X DELETE 'http://pmm-hostname.domain.com/prometheus/api/v1/series?match\[\]=\{instance="node_name"\}'
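
If you have more than one or two nodes to clean up, steps 2 and 3 can be wrapped in a small shell script. This is just a sketch that assumes the same pmm-hostname.domain.com server and dc1 datacenter used in the examples above:

#!/bin/bash
# Sketch only: assumes the PMM server and datacenter names from the examples above.
# Usage: ./pmm-remove-node.sh node_name
NODE="$1"
PMM="http://pmm-hostname.domain.com"
DC="dc1"

# Step 2: remove the node from consul
curl -s -X PUT -d "{\"Datacenter\":\"$DC\",\"Node\":\"$NODE\"}" "$PMM/v1/catalog/deregister?dc=$DC"

# Step 3: remove the node's series from prometheus (-g disables curl URL globbing)
curl -g -X DELETE "$PMM/prometheus/api/v1/series?match[]={instance=\"$NODE\"}"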

Upgrading Ceph from Hammer to Jewel

We recently upgraded our Ceph cluster from the latest version of Hammer to 10.2.7 (Jewel). Here are the steps that we used in order to complete the upgrade. Due to a change in Ceph daemon permissions, this specific upgrade required an additional step of using chown to change the ownership of each daemon directory.

Set the cluster to the ‘noout’ state so that we can perform the upgrade without any data movement:
ceph osd set noout

From the ceph-deploy control node, upgrade the monitor nodes first:
ceph-deploy install --release jewel ceph-mon1 ceph-mon2 ceph-mon3

On each monitor node:
stop ceph-mon-all
cd /var/lib/ceph
chown -R ceph:ceph /var/lib/ceph/
start ceph-mon-all

Next move on to the OSD nodes:
ceph-deploy install --release jewel ceph-osd1 ceph-osd2 ceph-osd3 ceph-osd4

Add the following line to /etc/ceph/ceph.conf on each OSD node (this will allow the ceph daemons to start up using the old permission scheme):

setuser match path = /var/lib/ceph/$type/$cluster-$id

Stop OSD’s and restart them on each node:
stop ceph-osd-all
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout

Once the cluster is healthy again and you have some time to make the necessary ownership changes for the OSD daemons, you can do the following:

Set noout:
ceph osd set noout

Log on to each OSD node one at a time and run the following commands:
find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R root:root

stop ceph-osd-all

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R ceph:ceph

chown -R ceph:ceph /var/lib/ceph/

Comment out the setuser line in ceph.conf and restart OSD’s:
#setuser match path = /var/lib/ceph/$type/$cluster-$id
start ceph-osd-all
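
At this point it is worth double checking that the OSD processes actually came back up running as the ceph user instead of root. A quick sanity check (not part of the official upgrade steps) is:

ps -o user,pid,cmd -C ceph-osd (locally on each OSD node)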

Don’t forget to unset noout from the admin node:
ceph osd unset noout

Replace failed Ceph disk on Dell hardware

We are using Dell R720 and R730xd servers for our Ceph OSD servers. Here is the process that we use in order to replace a disk and/or remove the faulty OSD from service.

In this example we will attempt to replace OSD #45 (slot #9 of this particular server):

Stop the OSD and unmount the directory:
stop ceph-osd id=45
umount /var/lib/ceph/osd/ceph-45
ceph osd crush reweight osd.45 0.0 (wait for the cluster to rebalance)
ceph osd out osd.45
service ceph stop osd.45
ceph osd crush remove osd.45
ceph auth del osd.45
ceph osd rm osd.45

List the physical drives so you can identify the enclosure and slot of the failed disk:
megacli -PDList -a0
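
The full PDList output is quite verbose, so it can help to narrow it down to just the fields you care about. Treat the field names below as an example and adjust them to whatever your controller actually reports:
megacli -PDList -a0 | grep -E 'Slot Number|Firmware state'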

If not already offline…offline the drive:
megacli -pdoffline -physdrv[32:9] -a0
Mark disk as missing:
megacli -pdmarkmissing -physdrv[32:9] -a0
Permanently remove drive from array:
megacli -pdprprmv -physdrv[32:9] -a0

NOW PHYSICALLY REPLACE THE BAD DRIVE WITH A NEW ONE.

Set drive state to online if not already:
megacli -PDOnline -PhysDrv [32:9] -a0
Create Raid-0 array on new drive:
megacli -CfgLdAdd -r0[32:9] -a0

You may need to discard the preserved cache before doing the last step:
First get the preserved cache list:
megacli -GetPreservedCacheList -a0
Clear whichever one you need to:
megacli -DiscardPreservedCache -L2 -a0

Recreate the OSD using Bluestore (now the default backend):
ceph-deploy disk zap hqosdNUM /dev/sdX
ceph-deploy osd create --data /dev/sdX hqosdNUM
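
After the new OSD has been created and comes up, you can watch the cluster backfill data onto it before moving on to any other replacements:

ceph osd tree (confirm the new OSD shows as up and in)
ceph -w (watch the recovery/backfill progress)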

Upgrading Ceph from Firefly to Hammer

Last week we decided to upgrade our 13 node Ceph cluster from version 0.80.11 (firefly) to version 0.94.6 (hammer).

Although we have not had any known issues with the cluster running firefly, official support for firefly ended in January 2016, and the jewel release will be out soon; it will be easier to upgrade to jewel from either hammer or infernalis than from firefly.

The overall upgrade process was relatively painless. I used the ceph-deploy script to create the cluster initially, and I chose to use it again to upgrade the cluster to hammer.

1) First I pull in the current config file and keys:
root@admin:/ceph-deploy# ceph-deploy config pull mon1
root@admin:/ceph-deploy# ceph-deploy gatherkeys mon1

2) Next we upgrade each of the mon daemons:
root@admin:/ceph-deploy# ceph-deploy install --release hammer mon1 mon2 mon3

3) Now we can restart the daemons on each mon server
root@mon1:~# stop ceph-mon-all
root@mon1:~# start ceph-mon-all

4) Next it’s time to upgrade the osd server daemons:
root@admin:/ceph-deploy# ceph-deploy install --release hammer osd1 osd2 osd3 osd4

5) Now we can restart the daemons on each of the osd servers:
root@osd4:~# stop ceph-osd-all
root@osd4:~# start ceph-osd-all

6) Finally you can upgrade any client server daemons that you have:
root@admin:/ceph-deploy# ceph-deploy install --release hammer clientserver1
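
Once everything has been restarted it is worth confirming that all of the daemons are actually running the hammer code, for example:

root@admin:/ceph-deploy# ceph tell osd.* version
root@osd4:~# ceph --version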

ZFS on Linux links part II

Proxmox 3.4 was recently released with additional integrated support for ZFS.  More details are provided here in the Proxmox ZFS wiki section. I also decided to start gathering up another more current round of links related to performance, best practices, benchmarking, etc.

If you are looking for up to date links to help you understand some of the more advanced aspects and use cases surrounding zfs, or if you are just getting started and are looking for some relevant reading material on the subject, you should find these links extremely useful:

1. The state of ZFS on Linux
2. Arch Linux ZFS wiki page
3. Gentoo Linux ZFS wiki page
4. ZFS Raidz Performance, Capacity and Integrity
5. ZFS administration
6. KVM benchmarking using various filesystems
7. How to improve ZFS performance

Syslog broken after upgrade to Debian Wheezy

We have run into an issue on several of our Debian servers after upgrading to Wheezy. The issue is one that tends to go unnoticed for a while, until you are looking through the files in ‘/var/log’ and notice that none of the files have updated entries since the date of the last upgrade.

The solution to the issue is to install the following package:

‘apt-get install inetutils-syslogd’

After installing this missing package, syslogd should once again be running, and you should start to see new entries show up in your messages, syslog, etc files.
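
A quick way to confirm that logging is working again after installing the package is to write a test message with logger and make sure it shows up:

‘logger "syslog test after wheezy upgrade"’
‘tail -n 5 /var/log/syslog’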

Force10 troubleshooting commands

Here are a few of the more helpful commands that I have been using recently to troubleshoot some performance issues we have been having with our stacked S4820s.

1) Show interface statistics:
‘show interface tengigabitethernet 0/40’

2) Monitor interface statistics in real-time:
‘monitor interface tengigabitethernet 0/40’

3) Show port channel statistics:
‘show interface port-channel7’

4) Show overview of port channel groupings:
‘show interface port-channel brief’

5) Monitor port channel statistics in real-time:
‘monitor interface port-channel 12’

6) Display VLAN interface statistics:
‘show interface vlan 20’

7) Gather data for tech support:
‘show tech-support’

8) Display serial numbers, service tags, etc:
‘show inventory’

Remove objects from Ceph pool without deleting pool

I recently wanted to clean up a few of the pools I have been using for rados benchmarking. I did not want to delete the pools, just the objects inside them.

If you are trying to clear up ‘rados bench’ objects you can use something like this:

‘rados -p temp_pool cleanup --prefix benchmark’

If you are trying to remove all the objects from a pool where the object names do not share a common prefix, you can use something like this:

****WARNING THIS WILL ERASE ALL OBJECTS IN YOUR POOL*****
‘for i in `rados -p temp_pool ls`; do echo $i; rados -p temp_pool rm $i; done’
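
If the pool contains a large number of objects, the loop above can be slow because it runs one rados delete at a time. A slightly faster variation that runs the deletes in parallel (the same warning applies, this erases everything in the pool):

‘rados -p temp_pool ls | xargs -P8 -I{} rados -p temp_pool rm {}’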

Ceph cheatsheet

Here is a list of Ceph commands that we tend to use on a regular basis:

a) Display cluster status:
‘ceph -s’

b) Watch live cluster status:
‘ceph -w’

c) Display pool usage stats:
‘ceph df’

d) List pools:
‘ceph osd lspools’

e) Display per-pool placement group and replication levels:
‘ceph osd dump | grep "replicated size"’

f) Set pool placement group sizes:
‘ceph osd pool set pool_name pg_num 512’
‘ceph osd pool set pool_name pgp_num 512’

g) Display rbd images in a pool:
‘rbd -p pool_name list’

h) Create rbd snapshot:
‘rbd snap create pool_name/image_name@snap_name’

i) Display rbd snapshots:
‘rbd snap ls pool_name/image_name’

j) Display which images are mapped via kernel:
‘rbd showmapped’

k) Get rados statistics:
‘rados df’

l) List objects in a pool using rados:
‘rados -p pool_name ls’