Here is an interesting issue I ran into recently-
-Alarms showing loss of path redundancy to storage
-Several hosts disconnect from a cluster
-Cannot access the host via vSphere client or SSH
-One or more datastores show as dead and cannot be accessed
-CPU on several hosts is at or near 100 percent
I saw these issues just after several hosts reported the "loss of redundant path to storage" alarms.
The storage is managed by a separate team, so I had them check the fabric and the storage presented to the cluster. They didn't see any issues, except an alarm around the same time as the first loss of redundant path alarms. So what is the next step? Try a rescan of the storage. I did that, and the rescan ran for several minutes and then timed out; then that host disconnected from the cluster! Going back to the storage team, I had them check the LUN ID of the datastore that showed dead; they said it showed online and they didn't see a problem. Finally, they removed and re-presented the LUN to the cluster's hosts.
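As a side note, a rescan can also be kicked off from the ESXi shell instead of the vSphere Client. This is just a sketch of the usual commands, not something from the support case, and if a host is already in this state they may hang or time out just like the client rescan did:

  # rescan all HBAs for added or removed devices
  esxcli storage core adapter rescan --all
  # refresh and rescan for VMFS volumes
  vmkfstools -V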
I tried another rescan, and again it took forever and failed. So, the next step: reboot a host? I had one host with only a single VM on it, so I rebooted it, and the previously dead datastore was back. A few minutes later, the hosts that had disconnected from the cluster reconnected and appeared fine.
I remember back in a 4.0 environment, when someone powered off an iSCSI array, the hosts disconnected from the cluster, so I assumed that having the storage pulled out from under the hosts was still an issue in vSphere 5.0.
After doing some research and opening a case with VMware, I found that this can still be an issue.
The link below is to a KB article that explains Permanent Device Loss (PDL) and All Paths Down (APD) errors. One note from the KB is-
“As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:
- Userworld I/O (hostd management agent)
- Virtual machine guest I/O”
Click here for a link to the KB article.
The KB also notes that the only way to recover is to resolve the storage access issue and reboot the hosts. Nice…
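When a datastore shows dead like this, it can also be worth checking how the host itself sees the device and its paths from the ESXi shell. These are generic commands, not steps from the KB; a LUN stuck in APD typically shows its paths as dead:

  # list SCSI devices and their status (look for the affected LUN)
  esxcli storage core device list
  # list the paths and their state for each device
  esxcli storage core path list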
It turns out there are some settings that can be added to keep this issue from happening in 5.1 and in 5.0 Update 2.
For more details, see Cormac Hogan's great info on the storage features in 5.1, starting here-
(Hope he doesn't mind me sharing this link)
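If I have the names right, the host-side pieces are the 5.1 APD handling options Misc.APDHandlingEnable and Misc.APDTimeout, and for PDL situations the per-host disk.terminateVMOnPDLDefault setting along with the das.maskCleanShutdownEnabled HA advanced option on the cluster; double-check the KB and Cormac's posts for your exact version before changing anything. As a rough sketch, the host-side APD options can be set from the ESXi shell:

  # enable the APD handling introduced in 5.1 (1 = enabled)
  esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1
  # seconds before the host declares the APD timeout and stops retrying new I/O (140 is the default)
  esxcli system settings advanced set -o /Misc/APDTimeout -i 140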
Another KB states that if Storage I/O Control (SIOC) is enabled, a host cannot unmount the datastore.
In my case SIOC was enabled on all of the datastores.
The KB details steps to stop the SIOC service on a host to allow the removal of the datastore.
Access this KB here-
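For reference, the SIOC logic on an ESXi host runs as the storageRM service, and the steps in that KB essentially come down to stopping it before the unmount and starting it again afterwards. A rough outline from the host shell (follow the KB itself for the full procedure):

  # check whether the SIOC service is running
  /etc/init.d/storageRM status
  # stop it so the datastore can be unmounted or detached
  /etc/init.d/storageRM stop
  # ...unmount or detach the datastore here...
  # start SIOC again afterwards
  /etc/init.d/storageRM start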
In my case, I think rebooting the hosts was the only option to clear the I/O to the lost datastore. Of course, what caused the issue on the storage side is still a mystery.
I have since added the settings to each of the hosts and to the cluster; if there is another issue like this one, I am hoping it makes a difference.
If you have experienced this or a similar issue, please share your experience.