Category Archives: Linux

All things Linux.

Ceph upgrade to Luminous + Ubuntu 16.04

Our Ubuntu 14.04 cluster has been running Ceph Jewel without issue for about the last 9 months.  During that time the Ceph team released the latest LTS version, called Luminous.  Given that Jewel is slated for EOL at some point in the next 2 or 3 months, I thought it was a good time to once again upgrade the cluster.

Before upgrading Ceph itself, I decided to go ahead and upgrade Ubuntu from 14.04 to 16.04.  The process was relatively simple; I ran the following set of commands on each node:

apt-get update; apt-get upgrade; apt-get dist-upgrade ; apt-get install update-manager-core; do-release-upgrade

After the upgrade, the process for restarting Ceph daemons changes slightly because Ubuntu 16.04 uses systemd instead of upstart.  For example, you will have to use the following commands to restart each of the services going forward:

systemctl restart ceph-mon.target
systemctl restart ceph-osd.target
systemctl restart ceph-mgr.target

After the Ubuntu upgrade I started upgrading Ceph on the cluster.
Using ‘ceph-deploy’, I ran the following commands (the note after each command says where it was run):

Monitor nodes first:
ceph-deploy install --release luminous hqceph1 hqceph2 hqceph3 (from the admin node)
systemctl restart ceph-mon.target (locally on each server)

OSD nodes second (for example):

Set ‘noout’ so your data does not try to rebalance during the OSD restarts:
ceph osd set noout

ceph-deploy install --release luminous hqosd1 hqosd2 hqosd3  (from admin node)
systemctl restart ceph-osd.target (locally on each server)

Next you should do the same for any RGW nodes that you might have.
Finally repeat the process for any other client nodes (librbd, KRBD, etc).

Don’t forget to unset noout before you finish:
ceph osd unset noout (from the admin node)
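The OSD phase above can be sketched as a small script run from the admin node. This is only an outline under assumptions: the helper names `run` and `upgrade_osd_nodes`, the `DRY_RUN` switch, and the use of ssh for the per-node restart are mine and not part of the procedure I actually followed; the Ceph and ceph-deploy commands are the ones shown above.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: upgrade and restart the OSD nodes given as arguments.
# If any step fails the function returns early and leaves noout set, so the
# cluster does not start rebalancing while you investigate.
upgrade_osd_nodes() {
    run ceph osd set noout || return 1
    run ceph-deploy install --release luminous "$@" || return 1
    for node in "$@"; do
        # Assumes passwordless ssh from the admin node to each OSD node.
        # In practice, check "ceph -s" between nodes before moving on.
        run ssh "$node" systemctl restart ceph-osd.target || return 1
    done
    run ceph osd unset noout
}
```

Setting DRY_RUN=1 before calling `upgrade_osd_nodes hqosd1 hqosd2 hqosd3` only prints the commands, which is a cheap way to review the sequence first.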

This release also introduced a new daemon called the manager daemon or ‘mgr’.  You can learn more about this new process here:  http://docs.ceph.com/docs/mimic/mgr/

I was able to install the mgr daemon on two of my existing nodes (one as the active and the second as the standby) using the following commands:

ceph-deploy mgr create hqceph2 (from the admin node)
ceph-deploy mgr create hqceph3 (from the admin node)

Upgrading Ceph from Hammer to Jewel

We recently upgraded our Ceph cluster from the latest version of Hammer to 10.2.7 (Jewel). Here are the steps that we used in order to complete the upgrade. Due to a change in Ceph daemon permissions, this specific upgrade required an additional step of using chown to change the file ownership of each daemon directory.

Set the cluster to the ‘noout’ state so that we can perform the upgrade without any data movement:
ceph osd set noout

From the Ceph-deploy control node upgrade monitor nodes first:
ceph-deploy install --release jewel ceph-mon1 ceph-mon2 ceph-mon3

On each monitor node:
stop ceph-mon-all
cd /var/lib/ceph
chown -R ceph:ceph /var/lib/ceph/
start ceph-mon-all

Next move on to the OSD nodes:
ceph-deploy install --release jewel ceph-osd1 ceph-osd2 ceph-osd3 ceph-osd4

Add the following line to /etc/ceph/ceph.conf on each OSD (this will allow the ceph daemons to startup using the old permission scheme):

setuser match path = /var/lib/ceph/$type/$cluster-$id

Stop the OSDs and restart them on each node:
stop ceph-osd-all
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout

Once the cluster is healthy again and you have some time to make the necessary ownership changes for the OSD daemons, you can do the following:

Set noout:
ceph osd set noout

Log on to each OSD node, one at a time, and run the following commands:
find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R root:root

stop ceph-osd-all

find /var/lib/ceph/osd -maxdepth 1 -mindepth 1 -print | xargs -P12 -n1 chown -R ceph:ceph

chown -R ceph:ceph /var/lib/ceph/

Comment out the setuser line in ceph.conf and restart the OSDs:
#setuser match path = /var/lib/ceph/$type/$cluster-$id
start ceph-osd-all

Don’t forget to unset noout from the admin node:
ceph osd unset noout
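A quick way to confirm that the ownership change took effect is to count anything under the data directory that is still owned by someone else. A minimal sketch, assuming a POSIX find; the function name `count_not_owned_by` is hypothetical:

```shell
# Count entries under directory $1 that are NOT owned by user $2.
# (File names containing newlines would be counted more than once.)
count_not_owned_by() {
    find "$1" ! -user "$2" -print | wc -l | tr -d ' '
}
```

Running `count_not_owned_by /var/lib/ceph ceph` on an OSD node should print 0 once the chown has finished; anything else means some files were missed.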

Replace failed Ceph disk on Dell hardware

We are using Dell 720 and 730xd servers for our Ceph OSD servers. Here is the process that we use in order to replace a disk and/or remove the faulty OSD from service.

In this example we will attempt to replace OSD #45 (slot #9 of this particular server):

Stop the OSD and unmount the directory:
stop ceph-osd id=45
umount /var/lib/ceph/osd/ceph-45

Drain the OSD and remove it from the cluster (after the reweight, wait for the cluster to rebalance before continuing):
ceph osd crush reweight osd.45 0.0
ceph osd out osd.45
service ceph stop osd.45
ceph osd crush remove osd.45
ceph auth del osd.45
ceph osd rm osd.45
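Since the same removal sequence gets typed for every failed disk, it can help to generate it for a given OSD id and review it before running anything. This is a sketch only: `osd_removal_commands` is a hypothetical helper that prints the ceph commands from the steps above and executes nothing.

```shell
# Hypothetical helper: print the cluster-side removal commands for OSD id $1.
osd_removal_commands() {
    id="$1"
    cat <<EOF
ceph osd crush reweight osd.$id 0.0
ceph osd out osd.$id
ceph osd crush remove osd.$id
ceph auth del osd.$id
ceph osd rm osd.$id
EOF
}
```

For the example in this post, `osd_removal_commands 45` prints the five commands with osd.45 filled in. Remember to wait for the rebalance after the reweight before running the rest.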

List the physical drives to confirm the enclosure and slot of the failed disk:
megacli -PDList -a0

If the drive is not already offline, take it offline:
megacli -pdoffline -physdrv[32:9] -a0
Mark disk as missing:
megacli -pdmarkmissing -physdrv[32:9] -a0
Permanently remove drive from array:
megacli -pdprprmv -physdrv[32:9] -a0

NOW PHYSICALLY REPLACE THE BAD DRIVE WITH A NEW ONE.

Set drive state to online if not already:
megacli -PDOnline -PhysDrv [32:9] -a0
Create Raid-0 array on new drive:
megacli -CfgLdAdd -r0[32:9] -a0

You may need to discard the preserved cache before creating the array:
First get the cache list:
megacli -GetPreservedCacheList -a0
Clear whichever one you need to:
megacli -DiscardPreservedCache -L2 -a0

Recreate the OSD using Bluestore, the new default (use the same device name in both commands):
ceph-deploy disk zap hqosdNUM /dev/sdx
ceph-deploy osd create --data /dev/sdx hqosdNUM

zfs on linux links part II

Proxmox 3.4 was recently released with additional integrated support for ZFS.  More details are provided in the Proxmox ZFS wiki section. I also decided to start gathering another, more current round of links related to performance, best practices, benchmarking, etc.

If you are looking for up to date links to help you understand some of the more advanced aspects and use cases surrounding zfs, or if you are just getting started and are looking for some relevant reading material on the subject, you should find these links extremely useful:

1. The state of ZFS on Linux
2. Arch Linux ZFS wiki page
3. Gentoo Linux ZFS wiki page
4. ZFS Raidz Performance, Capacity and Integrity
5. ZFS administration
6. KVM benchmarking using various filesystems
7. How to improve ZFS performance

Benchmarking Ceph

This is a post that I have had in draft mode for quite some time. At this point some of this information is out of date, so I am planning on writing a ‘part II’ post shortly, which will include some updated information.

Benchmarking Ceph:

Ever since we got our ceph cluster up and running, I’ve been running various benchmarking applications against different cluster configurations. Just to review, the cluster that we recently built has the following specs:

Cluster specs:

  • 3 x Dell R-420, 32 GB of RAM, for MON/RADOSGW/MDS nodes
  • 6 x Dell R-720xd, 64 GB of RAM, for OSD nodes
  • 72 x 4TB SAS drives as OSDs
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space
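Two of the figures in this list can be reproduced with simple arithmetic. I did not record how they were derived, so treat the following as one plausible reading: 2400 matches the common rule of thumb of about 100 placement groups per OSD divided by the replica count (assumed here to be 3), and 261 matches 72 x 4 TB of raw capacity expressed in binary terabytes (TiB). The function names are illustrative.

```shell
# Rule-of-thumb placement group count: (OSDs * target PGs per OSD) / replicas.
pg_count() {
    echo $(( $1 * $2 / $3 ))
}

# Convert decimal terabytes (10^12 bytes) to whole binary terabytes (2^40 bytes).
tb_to_tib() {
    awk -v tb="$1" 'BEGIN { printf "%d\n", tb * 1e12 / 2^40 }'
}

pg_count 72 100 3        # 72 OSDs, 100 PGs per OSD, 3 replicas
tb_to_tib $(( 72 * 4 ))  # 72 drives of 4 TB each
```

The two calls print 2400 and 261. Note that if the second reading is right, the 261TB figure is capacity before replication, so the space available for data at a replication level of 3 would be roughly a third of that.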

The main role for this cluster will be one primarily tied to archiving audio and video assets. This being the case, we decided to try and maximize total cluster capacity (4TB drives, no SSDs, etc.), while at the same time being able to achieve and maintain reasonable cluster throughput (10 GigE, 12 drives per OSD node, etc.).

Most of my benchmarking focused on rbd and radosgw, because either of these is most likely to be what we introduce into production when we are ready.  We are very much awaiting a stable and supported cephfs release (which will hopefully be available sometime in mid-late 2014), which will allow us to switch out our rbd + samba setup for one based on cephfs.

Rados Benchmarks: 

I set up a pool called ‘test’ with 1600 PGs in order to run some benchmarks using the ‘rados bench’ tool that came with Ceph.  I started with a replication level of ‘1’ and worked my way up to a replication level of ‘3’.

root@hqceph1:/# rados -p test bench 20 write (rep size=1)
Total time run: 20.241290
Total writes made: 5646
Write size: 4194304
Bandwidth (MB/sec): 1115.739
Stddev Bandwidth: 246.027
Max bandwidth (MB/sec): 1136
Min bandwidth (MB/sec): 0
Average Latency: 0.0571572
Stddev Latency: 0.0262513
Max latency: 0.336378
Min latency: 0.02248
root@hqceph1:/# rados -p test bench 20 write (rep size=2)
Total time run: 20.547026
Total writes made: 2910
Write size: 4194304
Bandwidth (MB/sec): 566.505
Stddev Bandwidth: 154.643
Max bandwidth (MB/sec): 764
Min bandwidth (MB/sec): 0
Average Latency: 0.112384
Stddev Latency: 0.198579
Max latency: 2.5105
Min latency: 0.025391
root@hqceph1:/# rados -p test bench 20 write (rep size=3)
Total time run: 20.755272
Total writes made: 2481
Write size: 4194304
Bandwidth (MB/sec): 478.144
Stddev Bandwidth: 147.064
Max bandwidth (MB/sec): 728
Min bandwidth (MB/sec): 0
Average Latency: 0.133827
Stddev Latency: 0.229962
Max latency: 3.32957
Min latency: 0.029481
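The bandwidth figure that rados bench reports is simply the total data written divided by the elapsed time, which makes it easy to sanity-check a run. A small sketch (the function name `bench_bandwidth` is mine) that recomputes it from the first three numbers of each result:

```shell
# Bandwidth in MB/s as rados bench reports it:
# writes * write size (bytes) / 2^20 / elapsed seconds.
bench_bandwidth() {
    awk -v w="$1" -v size="$2" -v t="$3" \
        'BEGIN { printf "%.3f\n", w * size / 1048576 / t }'
}

bench_bandwidth 5646 4194304 20.241290   # rep size=1
bench_bandwidth 2910 4194304 20.547026   # rep size=2
bench_bandwidth 2481 4194304 20.755272   # rep size=3
```

These print 1115.739, 566.505 and 478.144, matching the reported values. Going from one replica to two roughly halves the write bandwidth, while the third replica costs comparatively little on top of that.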

RBD Benchmarks:

Next I set up a 10GB block device using rbd:

root@ceph1:/blockdev# dd bs=1M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 0.440333 s, 610 MB/s
root@ceph1:/blockdev# dd bs=4M count=256 if=/dev/zero of=test1 conv=fdatasync (rep size=1)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 1.07413 s, 1000 MB/s
root@ceph1:/mnt/blockdev# hdparm -Tt /dev/rbd1 (rep size=1)
/dev/rbd1:
Timing cached reads: 16296 MB in 2.00 seconds = 8155.69 MB/sec
Timing buffered disk reads: 246 MB in 3.10 seconds = 79.48 MB/sec
root@ceph1:/mnt/blockdev# dd bs=1M count=256 if=/dev/zero of=test conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
268435456 bytes (268 MB) copied, 1.29985 s, 207 MB/s
root@ceph1:/mnt/blockdev# dd bs=4M count=256 if=/dev/zero of=test2 conv=fdatasync (rep size=2)
256+0 records in
256+0 records out
1073741824 bytes (1.1 GB) copied, 4.02375 s, 267 MB/s
root@cephmount1:/mnt/ceph-block-device/test# hdparm -Tt /dev/rbd1 (rep size=2)
/dev/rbd1:
Timing cached reads: 16434 MB in 2.00 seconds = 8225.55 MB/sec
Timing buffered disk reads: 152 MB in 3.01 seconds = 50.55 MB/sec

Radosgw Benchmarks:

Using s3cmd (s3tools) I was able to achieve about 70MB/s when pushing files to ceph via the s3 restful API.

 

Ceph braindump part1

After spending about 4 months testing, benchmarking, setting up and breaking down various Ceph clusters, I thought I would spend time documenting some of the things I have learned while setting up cephfs, rbd and radosgw along the way.

First let me talk a little bit about the details of the cluster that we will be putting into production over the next several weeks.

Cluster specs:

  • 6 x Dell R-720xd, 64 GB of RAM, for OSD nodes
  • 72 x 4TB SAS drives as OSDs
  • 3 x Dell R-420, 32 GB of RAM, for MON/RADOSGW/MDS nodes
  • 2 x Force10 S4810 switches
  • 4 x 10 GigE LACP bonded Intel cards
  • Ubuntu 12.04 (AMD64)
  • Ceph 0.72.1 (emperor)
  • 2400 placement groups
  • 261TB of usable space

The process I used to set up and tear down our cluster during testing was quite simple. After installing ‘ceph-deploy’ on the admin node:

  1. ceph-deploy new mon1 mon2 mon3
  2. ceph-deploy install  mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy mon create mon1 mon2 mon3
  4. ceph-deploy gatherkeys mon1
  5. ceph-deploy osd create osd1:sdb
  6. ceph-deploy osd create osd1:sdc
    ……….
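The ‘osd create’ step repeats for every disk on every OSD node, so the commands are easy to generate in a loop. A sketch with a hypothetical function name; the host and disk lists passed in are placeholders and need to match your own hardware:

```shell
# Print one "ceph-deploy osd create" command per host:disk pair.
# $1 = space-separated host names, $2 = space-separated disk names.
osd_create_commands() {
    for host in $1; do
        for disk in $2; do
            echo "ceph-deploy osd create $host:$disk"
        done
    done
}

osd_create_commands "osd1 osd2" "sdb sdc"
```

This prints four commands (osd1:sdb, osd1:sdc, osd2:sdb, osd2:sdc). Printing rather than executing makes it easy to check the list before piping it to a shell.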

The uninstall process went something like this:

  1. ceph-deploy disk zap osd1:sdb
    ……….
  2. ceph-deploy purge mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6
  3. ceph-deploy purgedata mon1 mon2 mon3 osd1 osd2 osd3 osd4 osd5 osd6

Additions to ceph.conf:

Since we wanted to configure an appropriate journal size for our 10GigE network, mount xfs with appropriate options and configure radosgw, we added the following to our ceph.conf (after ‘ceph-deploy new’ but before ‘ceph-deploy install’):

[global]
osd_journal_size = 10240
osd_mount_options_xfs = "rw,noatime,nodiratime,logbsize=256k,logbufs=8,inode64"
osd_mkfs_options_xfs = "-f -i size=2048"

[client.radosgw.gateway]
host = mon1
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
admin_socket = /var/run/ceph/radosgw.asok
rgw_dns_name = yourdomain.com
debug rgw = 20
rgw print continue = true
rgw should log = true
rgw enable usage log = true
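On the journal size: the Ceph documentation for filestore suggests sizing the journal as twice the expected throughput multiplied by ‘filestore max sync interval’. I did not note how 10240 MB was chosen, so the inputs below are an assumption: about 1024 MB/s of expected throughput and a sync interval of 5 seconds (the default, as far as I know) happen to give the same value.

```shell
# Journal size in MB = 2 * expected throughput (MB/s) * sync interval (s).
journal_size_mb() {
    echo $(( 2 * $1 * $2 ))
}

journal_size_mb 1024 5   # assumed throughput and sync interval
```

This prints 10240. The expected throughput should be the lower of what the disks and the network can sustain, so a cluster with slower links or drives would end up with a smaller journal.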

Benchmarking:

I used the following commands to benchmark rados, rbd, cephfs, etc.:

  1. rados -p rbd bench 20 write --no-cleanup
  2. rados -p rbd bench 20 seq
  3. dd bs=1M count=512 if=/dev/zero of=test conv=fdatasync
  4. dd bs=4M count=512 if=/dev/zero of=test conv=fdatasync

 Ceph blogs worth reading:

http://ceph.com/community/blog/
http://www.sebastien-han.fr/blog/
http://dachary.org/

megacli cheat sheet

The information below is based heavily on a post that can be found here:

http://erikimh.com/megacli-cheatsheet/

I am providing the information on my blog in the event that the original blog post becomes unavailable at some point in the future, as we use this information quite regularly.

1-Gather Info: 

Controller information

megacli -AdpAllInfo -aALL
megacli -CfgDsply -aALL
megacli -adpeventlog -getevents -f controller-events.log -a0 -nolog

Enclosure information

megacli -EncInfo -aALL

Virtual drive information

megacli -LDInfo -Lall -aALL

Physical drive information

megacli -PDList -aALL
megacli -PDInfo -PhysDrv [E:S] -aALL
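The -PDList output is long, so it helps to reduce it to one line per drive. A sketch using awk, assuming the output contains ‘Enclosure Device ID:’, ‘Slot Number:’ and ‘Firmware state:’ lines for each drive; the labels may differ between MegaCli versions, and the function name `pd_summary` is hypothetical:

```shell
# Summarise `megacli -PDList` output read from stdin as "[E:S] state" lines.
pd_summary() {
    awk -F': *' '
        /^Enclosure Device ID/ { enc = $2 }
        /^Slot Number/         { slot = $2 }
        /^Firmware state/      { printf "[%s:%s] %s\n", enc, slot, $2 }
    '
}
```

Used as `megacli -PDList -aALL | pd_summary`, this gives the [E:S] pair in the same form the other commands on this page expect, next to each drive's state.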

Battery backup information

megacli -AdpBbuCmd -aALL

Check Battery backup warning on boot

megacli -AdpGetProp BatWarnDsbl -a0

Controller management:

Silence active alarm

megacli -AdpSetProp AlarmSilence -aALL

Disable alarm

megacli -AdpSetProp AlarmDsbl -aALL

Enable alarm

megacli -AdpSetProp AlarmEnbl -aALL

Disable battery backup warning on system boot

megacli -AdpSetProp BatWarnDsbl -a0

Change the adapter rebuild rate to 60%:

megacli -AdpSetProp {RebuildRate -60} -aALL

2-Virtual drive management:

Create RAID 0, 1, 5 drive

megacli -CfgLdAdd -r(0|1|5) [E:S, E:S, …] -aN

Create RAID 10 drive

megacli -CfgSpanAdd -r10 -Array0[E:S,E:S] -Array1[E:S,E:S] -aN

Remove drive

megacli -CfgLdDel -Lx -aN

Physical drive management

Set state to offline

megacli -PDOffline -PhysDrv [E:S] -aN

Set state to online

megacli -PDOnline -PhysDrv [E:S] -aN

Mark as missing

megacli -PDMarkMissing -PhysDrv [E:S] -aN

Prepare for removal

megacli -PdPrpRmv -PhysDrv [E:S] -aN

Replace missing drive

megacli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN

The number N of the array parameter is the Span Reference you get using megacli -CfgDsply -aALL and the number N of the row parameter is the Physical Disk in that span or array starting with zero (it’s not the physical disk’s slot!).

Rebuild drive – Drive status should be “Firmware state: Rebuild”

megacli -PDRbld -Start -PhysDrv [E:S] -aN
megacli -PDRbld -Stop -PhysDrv [E:S] -aN
megacli -PDRbld -ShowProg -PhysDrv [E:S] -aN
megacli -PDRbld -ProgDsply -physdrv [E:S] -aN

Clear drive

megacli -PDClear -Start -PhysDrv [E:S] -aN
megacli -PDClear -Stop -PhysDrv [E:S] -aN
megacli -PDClear -ShowProg -PhysDrv [E:S] -aN

Bad to good

megacli -PDMakeGood -PhysDrv[E:S] -aN

Changes drive in state Unconfigured-Bad to Unconfigured-Good.

Hot spare management

Set global hot spare

megacli -PDHSP -Set -PhysDrv [E:S] -aN

Remove hot spare

megacli -PDHSP -Rmv -PhysDrv [E:S] -aN

Set dedicated hot spare

megacli -PDHSP -Set -Dedicated -ArrayN,M,… -PhysDrv [E:S] -aN

Walkthrough: Rebuild a Drive that is marked ‘Foreign’ when Inserted:

Bad to good

megacli -PDMakeGood -PhysDrv [E:S] -aALL

Clear the foreign setting

megacli -CfgForeign -Clear -aALL

Set global hot spare

megacli -PDHSP -Set -PhysDrv [E:S] -aN

Walkthrough: Change/replace a drive

a. Set the drive offline, if it is not already offline due to an error

megacli -PDOffline -PhysDrv [E:S] -aN

b. Mark the drive as missing

megacli -PDMarkMissing -PhysDrv [E:S] -aN

c. Prepare drive for removal

megacli -PDPrpRmv -PhysDrv [E:S] -aN

d. Change/replace the drive

e. If you’re using hot spares then the replaced drive should become your new hot spare drive

megacli -PDHSP -Set -PhysDrv [E:S] -aN

f. In case you’re not working with hot spares, you must re-add the new drive to your RAID virtual drive and start the rebuilding

megacli -PdReplaceMissing -PhysDrv [E:S] -ArrayN -rowN -aN
megacli -PDRbld -Start -PhysDrv [E:S] -aN

3-Gathering Standard logs

#rm -f MegaSAS.log
#megacli -adpallinfo -a0
#megacli -encinfo -a0
#megacli -ldinfo -lall -a0
#megacli -pdlist -a0
#megacli -adpeventlog -getevents -f controller-events.log -a0 -nolog
#megacli -fwtermlog -dsply -a0 -nolog > controller-fwterm.log

Openmanage 7.3 on Proxmox 3.0 (Debian Wheezy)

I ran into a few issues while trying to install Dell Openmanage on the latest version of Proxmox (3.0).

In order to get things working correctly on Proxmox 3.x, here are the steps that are required:

#echo "deb http://linux.dell.com/repo/community/ubuntu wheezy openmanage" > /etc/apt/sources.list.d/linux.dell.com.sources.list
#gpg --keyserver pool.sks-keyservers.net --recv-key 1285491434D8786F
#gpg -a --export 1285491434D8786F | sudo apt-key add -
#sudo apt-get update
#apt-get install libcurl3
#sudo apt-get install srvadmin-all
#sudo service dataeng start
#sudo service dsm_om_connsvc start

Once you get everything installed correctly you will be able to log in to the Openmanage web interface here:

https://<hostname or ip address>:1311

The first time you log in you should use the ‘root’ username and associated password.

Ceph overview with videos

I recently started looking into Ceph as a possible replacement for our 2 node Gluster cluster. After reading numerous blog posts and watching several videos on the topic, I believe that the following three videos provide the best overview and necessary insight into how Ceph works, what it offers, how to manage a cluster, and how it differs from Gluster.

The first video is from the 2013 linux.conf.au conference. It is a Ceph-only talk given by Florian Haas and Tim Serong.  It provides a complete overview of the Ceph software and its feature set, and goes into detail about the current status of each of Ceph’s components (block storage, object storage, filesystem storage, etc).

The second video is a bit more lighthearted. It is a talk that involves some back and forth between Ceph’s Sage Weil and Gluster’s John Mark Walker, moderated by Florian Haas.  It covers some of the basic use cases for each file system, offers some technical insight into the overall design, and discusses some of the ways in which the filesystems are similar, some items that are on each of their roadmaps, and some of the ways the projects differ from one another.

The final video also comes from the 2013 linux.conf.au conference.  It covers some of the same topics that were discussed in the prior two videos, but it is also geared toward the operations aspect of managing a Ceph cluster.  This talk is given by Sage Weil as well.

OpenStack Ops Guide

Ted Neykov, currently a hacker at Rackspace, pointed me in the direction of the new OpenStack Operations Guide. I have only had a chance to browse the .pdf at this point, but I believe this will end up being a very informative and useful book for me going forward.

Taken from the guide’s summary:

‘This book offers hard-earned experience from OpenStack operators who have run OpenStack in production for six months or longer.  They’ve gathered their notes, shared their stories, and learned from each other in the room. We invite you to join in the quest for best practices in OpenStack cloud operations’

Here is a quick video that was released along with the guide, which briefly describes the process they used during its creation: