SUNWattr_ro error:Permission denied on OpenSolaris using Gluster 3.0.5–PartII

Recently one of our 3ware 9650SE raid cards started spitting out errors indicating that the unit was repeatedly issuing a bunch of soft resets. The lines in the log look similar to this:

WARNING: tw1: tw_aen_task AEN 0x0039 Buffer ECC error corrected address=0xDF420
WARNING: tw1: tw_aen_task AEN 0x005f Cache synchronization failed; some data lost unit=22
WARNING: tw1: tw_aen_task AEN 0x0001 Controller reset occurred resets=13

I downloaded and installed the latest firmware for the card (version 4.10.00.021), which the release notes claimed had several fixes for cards experiencing soft resets.  Much to my disappointment the resets continued to occur despite the new revised firmware.

The card was under warranty, so I contacted 3ware support and had a new one sent overnight.  The new card seemed to resolve the issues associated with random soft resets, however the resets and the downtime had left this node little out of sync with the other Gluster server.

After doing a ‘zfs replace’ on two bad disks (at this point I am still unsure whether the bad drives where a symptom or the cause of the issues with the raid card, however what I do know is that the Cavier Geen Western Digital drives that are populating this card have a very high error rate, and we are currently in the process of replacing all 24 drives with hitachi ones), I set about trying to initiate a ‘self-heal’ on the known up to date node using the following command:

server2:/zpool/glusterfs# ls -laR *

After some time I decided to tail the log file to see if there were any errors that might indicate a problem with the self heal. Once again the Gluster error log begun to fill up with errors associated with setting extended attributes on SUNWattr_ro.

At that point I began to worry whether or not the AFR (Automatic File Replication) portion of the Replicate/AFR translator was actually working correctly or not.  I started running some tests to determine what exactly was going on.  I began by copying over a few files to test replication.  All the files showed up on both nodes, so far so good.

Next it was time to test AFR so I began deleting a few files off one node and then attempting to self heal those same deleted files.  After a couple of minutes, I re-listed the files and the deleted files had in fact been restored. Despite the successful copy, the errors continued to show up every single time the file/directory was accessed (via stat).  It seemed that even though AFR was able to copy all the files to the new node correctly, gluster for some reason continued to want to self heal the files over and over again.

After finding the function that sets the extended attributes on Solaris, the following patch was created:

— compat.c Tue Aug 23 13:24:33 2011
+++ compat_new.c Tue Aug 23 13:24:49 2011
@@ -193,7 +193,7 @@
{
int attrfd = -1;
int ret = 0;

+
attrfd = attropen (path, key, flags|O_CREAT|O_WRONLY, 0777);
if (attrfd >= 0) {
ftruncate (attrfd, 0);
@@ -200,13 +200,16 @@
ret = write (attrfd, value, size);
close (attrfd);
} else {
– if (errno != ENOENT)
– gf_log (“libglusterfs”, GF_LOG_ERROR,
+ if(!strcmp(key,”SUNWattr_ro”)&&!strcmp(key,”SUNWattr_rw”)) {
+
+ if (errno != ENOENT)
+ gf_log (“libglusterfs”, GF_LOG_ERROR,
“Couldn’t set extended attribute for %s (%d)”,
path, errno);
– return -1;
+ return -1;
+ }
+ return 0;
}

return 0;
}

The patch simply ignores the two Solaris specific extended attributes (SUNWattr_ro and SUNWattr_rw), and returns a ‘0’ to the posix layer instead of a ‘-1’ if either of these is encountered.

We’ve been running this code change on both Solaris nodes for several days and so far so good, the errors are gone and replicate and AFR both seem to be working very well.

Leave a Reply

Your email address will not be published. Required fields are marked *