HOW-TO: Recover from failed Storage VMotion


A while ago I received a request from the storage department to move a whole ESX cluster to another storage I/O-Group, which would be a disruptive action.
I wondered whether Storage VMotion could help me out here: instead of moving the current LUNs, they could assign me some new storage on the other I/O-Group.

If you are going to use sVMotion, I would strongly suggest using the sVMotion plug-in from lostcreations.

Well, sVMotion made it possible to move the cluster to the other I/O-Group online. This saved me a lot of time explaining to the customer why the entire cluster had to go offline. We are talking about 175+ VMs here.

So that was the path I chose, and it worked out fine, but I got so excited about sVMotion that I didn't pay attention to the available storage space left on my new LUNs. After a while the LUN filled up and the sVMotion process failed.

Whenever an sVMotion fails, you will probably end up in a situation where the config files have been moved to the new location while the .vmdk files and their accompanying snapshot files (sVMotion creates a snapshot in order to copy the .vmdk files) are still in the old location.

Issuing another sVMotion will generate the error “ERROR: A specified parameter was not correct. spec”, and if you power off the VM you get an extra option, “Complete Migration”. This option actually copies the .vmdk files to the same LUN, so it requires twice the space of the .vmdks on that LUN and, most importantly, it requires downtime.
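
To confirm you are in this split state, you can list the VM's folders on both datastores from the COS. This is only a minimal check, and the datastore and VM names used here (new-lun, old-lun, myvm) are hypothetical, so substitute your own:

  Code:
  # The new location should hold the .vmx and other config files
  ls /vmfs/volumes/new-lun/myvm/
  # The old location should still hold the .vmdk files and the sVMotion snapshot files
  ls /vmfs/volumes/old-lun/myvm/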


Here’s what I did to resolve this split:


  • Create a snapshot of the VM. Since this option is not available via the vCenter GUI in this state, you have to do it from the COS or connect your VIC directly to the ESX host (a worked example of all the COS commands follows this list).


    • Through an SSH console session:

      • Find the config_file_path of the VM:
        Code:
        vmware-cmd -l
      • Create a snapshot of the VM:
        Code:
        vmware-cmd <config_file_path> createsnapshot snapshot_name snapshot_description 1 1

    • Through the VIC:
      • Use the snapshot GUI as you normally would.


  • Remove (Commit) the snapshots:
    Code:
    vmware-cmd <config_file_path> removesnapshots
    This will remove the newly created snapshot AND the snapshot created by sVMotion.

  • vCenter still thinks the VM is in DMotion state, so you can't edit settings, perform a VMotion or do anything else via vCenter. To fix this, we need to clear the DMotionParent parameters in the .vmx file with the following commands from the COS:
    Code:
    vmware-cmd <config_file_path> setconfig scsi0:0.DMotionParent ""
    vmware-cmd <config_file_path> setconfig scsi0:1.DMotionParent ""
    Do this for every DMotionParent entry in the .vmx file, so be sure to check your .vmx file to get the right SCSI IDs (the sketch after this list shows one way to list them). Note that editing the .vmx file directly will not trigger a reload of the .vmx config file!

  • Now perform a new storage migration to move the .vmx configuration file back to its original location.

  • Clean up the destination LUN and remove any files/folders created by the failed sVMotion. We're done and back in business again without downtime!
    We can now retry the sVMotion.
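
Putting the COS steps together, here is a minimal sketch of the whole command sequence. The VM name, datastore name and SCSI IDs (myvm, new-lun, scsi0:0, scsi0:1) are hypothetical; use the config file path reported by vmware-cmd -l and the DMotionParent entries actually present in your own .vmx file:

  Code:
  # Find the config file path of the stuck VM (after the failed sVMotion it lives on the new LUN)
  vmware-cmd -l

  # Create a temporary snapshot (quiesce = 1, include memory = 1)
  vmware-cmd /vmfs/volumes/new-lun/myvm/myvm.vmx createsnapshot cleanup "recover from failed sVMotion" 1 1

  # Commit the new snapshot together with the snapshot left behind by sVMotion
  vmware-cmd /vmfs/volumes/new-lun/myvm/myvm.vmx removesnapshots

  # List the DMotionParent entries still present in the .vmx file
  grep -i dmotionparent /vmfs/volumes/new-lun/myvm/myvm.vmx

  # Clear every entry found above, one setconfig per SCSI ID
  vmware-cmd /vmfs/volumes/new-lun/myvm/myvm.vmx setconfig scsi0:0.DMotionParent ""
  vmware-cmd /vmfs/volumes/new-lun/myvm/myvm.vmx setconfig scsi0:1.DMotionParent ""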


Of course I didn't figure all this out by myself. All credit goes to Argyle from the VMTN Forum. You can read his original thread here.

Someone would probably say “Don’t try this at home”, but if you’re curious and do want to try this at home, use the following procedure to reproduce this split situation:

  • Perform an sVMotion of a TEST VM.
  • On the COS of the ESX host, issue:
    Code:
    service mgmt-vmware restart
  • Have fun!!
