Monthly Archives: May 2014

‘Not free; Lock’ error messages and high CPU on ESXi host cause VMs to momentarily freeze

Background:

We’re running vCloud Director 1.5.2 and vSphere 5.0 Update 1 in an 11-host cluster/PVDC. Customers began complaining that virtual machines were locking up and dropping pings for between 5 and 30 seconds, and we were seeing “Not free; Lock” errors like those below in /var/log/vmkernel.log.

2013-04-24T08:33:41.546Z cpu28:6869)DLX: 3901: vol DATALUN’: [Req mode: 1] Not free; Lock [type 10c00001 offset 207360000 v 123467, hb offset 3854336

gen 69, mode 1, owner 5176c50b-7452087c-b21a-mtime 133636 nHld 0 nOvf$

2013-04-24T08:33:48.541Z cpu21:6869)DLX: 3394: vol : [Req mode 1] Checking liveness of [type 10c00001 offset 207360000 v 123467, hb offset 3854336

gen 69, mode 1, owner 5176c50b-7452087c-b21a- mtime 133636 nHld 0$

2013-04-24T08:33:52.552Z cpu21:6869)DLX: 3901: vol : [Req mode: 1] Not free; Lock [type 10c00001 offset 207360000 v 123467, hb offset 3854336

gen 69, mode 1, owner 5176c50b-7452087c-b21a-mtime 133636 nHld 0 nOvf$

This was a real head scratcher, and we spent the best part of a week troubleshooting with VMware and EMC. The error messages appeared in /var/log/vmkernel.log on random ESXi hosts in the same cluster, and whenever they appeared, virtual machines running on the affected datastore and ESXi host would momentarily lock up. You can work out which ESXi host is holding the lock because its MAC address is visible in the owner field of the error message. We also noticed that CPU usage would shoot up to 100% on the host holding the lock.

We went through the storage configuration on the hosts and on the back end. We found some performance issues, which we addressed, but that did not fix the problem. At first we thought it might be a LUN zoning problem, but this all checked out and everything appeared to be in order. So we went back to VMware, and after a week or so of extensive troubleshooting they confirmed we were hitting the bug described below. The fix is to upgrade to ESXi 5.0 Update 2.
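If you have a lot of these messages to wade through, a small script can pull the lock owner out of each one. The following is only a rough sketch, not anything VMware supplied: it assumes the owner field is a full UUID whose last 12 hex digits are the owning host's MAC address, and that the owner may sit on the line after the "Not free; Lock" text, as in the log above. The function names and the MAC in the sample (001122334455) are made up for illustration, because the log lines in this post have that segment removed.

```python
import re
from collections import Counter

# Owner field of a DLX lock message. Assumption: the final segment is the
# 12-hex-digit MAC address of the host that holds the lock.
OWNER_RE = re.compile(
    r"owner\s+([0-9a-fA-F]{8}-[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})"
)


def count_lock_owners(log_text):
    """Count how often each owner UUID appears in 'Not free; Lock' entries."""
    owners = Counter()
    pending = False  # True after a 'Not free; Lock' line until its owner is found
    for line in log_text.splitlines():
        if "Not free; Lock" in line:
            pending = True
        if pending:
            match = OWNER_RE.search(line)
            if match:
                owners[match.group(1).lower()] += 1
                pending = False
    return owners


def mac_from_owner(owner_uuid):
    """Format the last 12 hex digits of an owner UUID as a MAC address."""
    tail = owner_uuid.rsplit("-", 1)[-1]
    return ":".join(tail[i:i + 2] for i in range(0, 12, 2))


# Hypothetical sample: same shape as the log above, with a placeholder MAC.
SAMPLE = """\
2013-04-24T08:33:41.546Z cpu28:6869)DLX: 3901: vol 'DATALUN': [Req mode: 1] Not free; Lock [type 10c00001 offset 207360000 v 123467, hb offset 3854336
gen 69, mode 1, owner 5176c50b-7452087c-b21a-001122334455 mtime 133636 nHld 0
"""

if __name__ == "__main__":
    for owner, hits in count_lock_owners(SAMPLE).most_common():
        print(hits, owner, mac_from_owner(owner))
```

Run against a copy of vmkernel.log from an affected host, this gives a quick tally of which MAC addresses turn up as lock owners, which you can then match against the NICs of the hosts in the cluster.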

https://www.vmware.com/support/vsphere5/doc/vsp_esxi50_u2_rel_notes.html

“ESXi hostd agent might consume very high CPU resulting in performance degradation”
“When vCloud Director fetches the screen shot of virtual machine desktop from the ESXi host, hostd agent might enter into an infinite loop resulting in 100% CPU usage and the CPU usage might not reduce until you restart hostd.”

Hopefully this post will save you some time and hair!!