Over the last few weeks we have been encountering some strange networking issues at our off-site data center. The problem was characterized by slow ping times, packet loss, and service timeouts. After checking all of the network devices on the outside-facing link, we turned to the internal infrastructure: our new Cisco 3750 series switches.
After reviewing the configuration with the ‘show running-config’ command, we checked each of our bonded network interfaces with the ‘show interface’ command to see if we could uncover any errors, dropped packets, etc.
Next we decided to check the switch’s available resources. We used ‘show proc cpu’ to check the CPU usage. This is when we saw the following line:
‘CPU utilization for five seconds: 97%/0%; one minute: 97%; five minutes: 98%’
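Since we planned to keep an eye on the CPU going forward, output like this is easy to check from a script. A minimal sketch, using the sample line above; the 90% threshold is an arbitrary example, not a Cisco recommendation:

```shell
# Pull the five-minute utilization figure out of the 'show proc cpu' line
# and warn when it crosses a (hypothetical) threshold.
line='CPU utilization for five seconds: 97%/0%; one minute: 97%; five minutes: 98%'
five_min=$(printf '%s\n' "$line" | sed -n 's/.*five minutes: \([0-9]*\)%.*/\1/p')
if [ "$five_min" -ge 90 ]; then
    echo "WARNING: sustained CPU at ${five_min}%"
fi
```

In practice you would feed this the live output of ‘show proc cpu’ (e.g. collected over SSH or SNMP) instead of a hard-coded sample line.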
We immediately knew that this was what we were looking for. After contacting Cisco support, we learned that there is an obscure bug in this version of IOS which causes this kind of behavior. If you are interested in learning more about this specific issue, refer to Cisco bug ID ‘CSCsd95669’.
The only known fixes at this point are to upgrade your version of IOS or to restart your switch. We decided to restart the switch and monitor the CPU at regular intervals. If the problem appears again, we will obviously choose to upgrade.
Recently I was given the task of putting together a storage solution that would be used to house a large amount of our digital assets. I was also asked to make sure there would be enough space to meet our needs over the next few years. The project called for a solution that could scale up to around 120TB of usable space. Depending on the price, this solution might also be used to store a majority of our digital archive (audio and video).
I will go into the specific hardware and software details of the project in another post. After about a month of research, however, we decided to go with a solution that was able to take advantage of the ZFS filesystem.
Here are a few documents that I found invaluable during my setup and overall planning:
ZFS Best Practices Guide
ZFS Configuration Guide
ZFS Troubleshooting Guide
ZFS Troubleshooting and Cheatsheet Guide
These links are a good starting point for anyone who wants to gain a better overall understanding of how to administer a server running ZFS. The Best Practices Guide is also a great resource to consult during the initial project planning stages.
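One thing worth doing during that planning stage is a back-of-the-envelope calculation of usable capacity before ordering hardware, since raidz2 gives up two disks’ worth of space per vdev to parity. A sketch in shell; the disk size, vdev width, and vdev count below are hypothetical examples, not our actual layout:

```shell
# Rough usable capacity of a raidz2 pool (hypothetical numbers).
disk_tb=4        # raw size of each drive, in TB
vdev_width=10    # disks per raidz2 vdev
vdev_count=5     # number of raidz2 vdevs in the pool

# Each raidz2 vdev loses two disks' worth of capacity to parity.
usable_tb=$(( (vdev_width - 2) * disk_tb * vdev_count ))
echo "Approximate usable capacity: ${usable_tb} TB"
```

Keep in mind this is raw usable space; filesystem overhead and the free-space headroom the Best Practices Guide recommends will reduce what you can actually fill.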
While doing research into poor write performance with Oracle, I discovered that the server was using the LSI SAS1068E. We had a RAID1 setup with 300GB 10K RPM SAS drives. Google provided some possible insight into why the write performance was so bad (1, 2). The main problem with this card is that it has no battery-backed write cache, which means the write cache is disabled by default. I was able to turn on the write cache using the LSI utility.
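To see whether a change like this actually helps, a crude sequential-write test is enough to compare throughput before and after. A sketch; the file path and sizes are arbitrary, and conv=fdatasync forces the data to be flushed so the cache setting is actually exercised:

```shell
# Crude sequential-write benchmark: time how long it takes to flush 1 GB.
# conv=fdatasync makes dd sync the data before exiting, so the controller's
# write-cache behavior shows up in the elapsed time.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
rm -f /tmp/ddtest
```

Running this against the array (not /tmp, if /tmp lives elsewhere) before and after toggling the cache gives a simple apples-to-apples comparison.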
This change, however, did not seem to make any difference in performance. At this point I came to the conclusion that the card itself was to blame. I believe this is an inexpensive RAID card that is fine for general RAID0 and RAID1 use, but for anything where write throughput is important, it might be better to spring for something a little more expensive.
When all was said and done, we ended up replacing all of these LSI cards with Dell Perc 6i cards. These cards came with battery-backed cache, which allowed us to enable the write cache; needless to say, performance improved significantly.
We recently deployed an Oracle virtual machine for development and testing purposes. Imports and database migration scripts were taking several hours on existing VMs, so we hoped this new machine with more RAM (32 GB) and more CPU horsepower (quad-core Intel Xeons) would allow those operations to move along much more quickly.
We soon got reports from users that this server was in fact much slower than the existing, less powerful Oracle VMs. After doing some poking around (with vztop) we discovered that there were no issues with CPU or memory resources; however, the server was performing terribly when it came to I/O.
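vztop surfaced the symptom for us, but on any Linux guest you can also watch the same thing directly from /proc/stat, where the sixth field of the ‘cpu’ line is cumulative iowait jiffies. A quick sketch; a value that climbs steadily under load means the CPUs are sitting idle waiting on disk:

```shell
# Print cumulative iowait from /proc/stat (Linux). Field 6 of the 'cpu'
# line is jiffies spent waiting for I/O; sample it twice under load and
# compare to see how much time is lost to the disk subsystem.
awk '/^cpu /{print "iowait jiffies:", $6}' /proc/stat
```

Tools like iostat give a friendlier per-device view, but this works even on minimal containers where nothing extra is installed.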