vSAN Cluster Shutdown

A few weeks ago I had to shutdown a vSAN Cluster temporarily for a planned site-wide 24 hour power outage that was blacking out a datacentre. With the amount of warning and a multi-datacentre design this wasn’t an issue, but I made use of vSphere tags and some Powershell/PowerCLI to help with the evacuation and repopulation of the affected cluster. Hopefully some of this may be useful to others.

The infrastructure has two vSAN Clusters – Cluster-Alpha and Cluster-Beta. Cluster-Beta was the one being affected by the power outage, and there was sufficient space on Cluster-Alpha to absorb migrated workloads. Whilst they exist in different datacentres both clusters are on the same LAN and under the same vCenter.

I divided the VMs on Cluster-Beta into three categories:

  1. Powered-Off VMs and Templates. These were to stay in place, they would be inaccessible for the outage but I determined this wouldn’t present any issues.
  2. VMs which needed to migrate and stay on. These were tagged with the vSphere tag “July2019Migrate”
  3. VMs which needed to be powered off but not migrated. For example test/dev boxes which were not required for the duration. These were tagged with “July2019NOMigrate”

The tagging was important, not only to make sure I knew what was migrating and what was staying, but also what we needed to move back or power on once the electrical work had completed. PowerCLI was used to check that all powered-on VMs in Cluster-Beta were tagged one way or another.

Get the VMs in CLuster-Beta where the tag “July2019Migrate” is not assigned and the tag “July 2019NOMigrate” is not assigned and the VM is Powered On.

1Get-Cluster -Name "Cluster-Beta" |Get-VM | where {
2(Get-TagAssignment -Entity $_).Tag.Name –notcontains "July2019Migrate" –and
3(Get-TagAssignment -Entity $_).Tag.Name –notcontains "July2019NOMigrate" –and
4$_.PowerState –eq “PoweredOn”}

In the week approaching the shutdown the migration was kicked off:

1#Create a List of the VMs in the Source Cluster which are tagged to migrate
2$MyTag= Get-Tag -Name "July2019Migrate"
3$MyVMs=Get-Cluster "Cluster-Beta" | Get-VM | Where-Object {(Get-TagAssignment -Entity $_).Tags -contains $MyTag }
4#Do the Migration
5$TargetCluster= "Cluster-Alpha" #Target Cluster
6$TargetDatastore= "vSANDatastore-Alpha" #Target Datastore on Target Cluster
7$MyVMs | Move-VM -Destination (Get-Cluster -Name $TargetCluster) -Datastore (Get-Datastore -Name $TargetDatastore) -DiskStorageFormat Thin -VMotionPriority High

At shutdown time, a quick final check of the remaining powered on VMs was done and then all remaining VMs in Cluster-Beta were shut down. Once there were no running workloads on Beta it was time to shut down the vSAN cluster. This part I didn’t automate as I’m not planning on doing it a lot, and there’s comprehensive documentation in the VMware Docs site. The process is basically one of putting all the hosts into maintenance mode and then once the whole cluster is done, powering them off.

You are in a dark, quiet datacentre. There are many servers, all alike. There may be Grues here.

When power was restored, the process was largely reversed. I powered on the switches providing the network interconnect between the nodes, and then powered on those vSAN hosts and waited for them to come up. Once all the hosts were visible to vCenter, it was just a case of selecting them all and choosing “Exit Maintenance Mode”

There was a momentary flash of alerts as nodes come up and wonder where their friends are, but in under a minute the cluster was passing the vSAN Health Check

At this point it was all ready to power on the VMs that had been shutdown and left on the cluster, and vMotion the migrated virtual machines back across. Again, PowerCLI simplified this process:

 1#Create a List of the VMs in the Source Cluster which are tagged to stay but need powering on.
 2$MyTag= Get-Tag -Name "July2019NOMigrate"
 3$MyVMs=Get-Cluster “Cluster-Alpha” | Get-VM | Where-Object {(Get-TagAssignment -Entity $_).Tags -contains $MyTag }
 4#Power on those VMs
 5$MyVMs | Start-VM
 6
 7#Create a List of the VMs in the Source Cluster which are tagged to migrate (back)
 8$MyTag= Get-Tag -Name "July2019Migrate"
 9$MyVMs=Get-Cluster “Cluster-Alpha” | Get-VM | Where-Object {(Get-TagAssignment -Entity $_).Tags -contains $MyTag }
10#Do the Migration
11$TargetCluster= "Cluster-Beta" #New Target Cluster
12$TargetDatastore= "vSANDatastore-Beta" #Target Datastore on Target Cluster
13$MyVMs | Move-VM -Destination (Get-Cluster -Name $TargetCluster) -Datastore (Get-Datastore -Name $TargetDatastore) -DiskStorageFormat Thin -VMotionPriority High

Then it was just a case of waiting for the data to flow across the network and finally check that everything had migrated successfully and normality had been restored.

we have normality, I repeat we have normality…Anything you still can’t cope with is therefore your own problem. Please relax.

Trillian, via the keyboard of Douglas Adams. The Hitchhiker’s Guide to the Galaxy