A field report with some tips and tricks
Please keep in mind: this is not a step-by-step guide. Instead, I try to capture the process and mention some tips and tricks that helped us implement the update successfully.
As always, please check whether the update path you are planning is supported and whether it requires additional steps. Furthermore, you shouldn’t perform any domain-altering operations during the update, such as expanding, shrinking or creating workload domains, as this can have severe consequences.
Please keep in mind that many updates can have an operational impact, so try to perform them outside the main business hours, when your systems are less heavily used.
In our case we are updating a full VCF on VxRail environment with 1 Management Domain and 2 Workload Domains, totaling about 60 hosts. The starting version was 18.104.22.168 and the target version is 4.5. We are updating everything from the central SDDC Manager, which also includes downloading the bundles etc.
As usual in this environment, we start with updating the Management Domain, as this is mandatory within the update process. First we need to download the bundles and, as I mentioned, we will do this from the SDDC Manager via Inventory > Workload Domains > Updates/Patches. Here we are presented with all the available updates for the specific workload domain. Alternatively, you can also download the bundles via Lifecycle Management > Bundle Management. The order in which the updates are presented worked pretty well for us, and the steps described below follow that order.
In case you don’t see anything here, check that you are connected with your My VMware / Dell EMC credentials in the SDDC Manager via Administration > Repository Settings (in later versions you can find it under Online Depot). If your SDDC Manager has no direct internet access, you can configure a proxy for it; however, that is not part of this field report and can be looked up in the VMware documentation.
Be sure you have a valid backup before starting. In addition, you can take a snapshot, but be sure to delete it after the appliance has been updated successfully.
Via Inventory > Workload Domains we can now start the precheck for the desired update. Make sure you really review the reported issues, even the warnings, as these may not fail the update but can cause unexpected behavior. If you are unsure whether an issue needs further attention, you can contact VMware support and usually get a pretty fast answer. For further analysis, the tool “VCF_verify” by Dell can come in handy, as it delivers more detailed information about what fails and makes it easier to find the cause. I would recommend this tool in addition to the default GUI prechecks because it gives you extra assurance that everything will work.
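The rule of treating warnings as seriously as errors can be sketched as a small triage helper. This is only an illustration: the Finding record and its field names are assumptions made up for this example, not the real output format of the SDDC Manager prechecks or of VCF_verify, so you would have to fill it from whatever your precheck report actually shows.

```python
from dataclasses import dataclass


# Hypothetical record for one precheck finding. The field names are
# assumptions for illustration, not a real SDDC Manager data format.
@dataclass(frozen=True)
class Finding:
    component: str
    severity: str  # "OK", "WARNING" or "ERROR"
    message: str


def needs_review(findings):
    """Return every finding that is not plainly OK, errors first."""
    rank = {"ERROR": 0, "WARNING": 1}
    flagged = [f for f in findings if f.severity != "OK"]
    return sorted(flagged, key=lambda f: rank.get(f.severity, 2))


def ready_to_update(findings):
    """Warnings block the update as well, not only errors."""
    return not needs_review(findings)


if __name__ == "__main__":
    sample = [
        Finding("vCenter", "OK", "services healthy"),
        Finding("NSX-T Edge", "WARNING", "service not started"),
    ]
    for finding in needs_review(sample):
        print(finding.severity, finding.component, finding.message)
    print("ready:", ready_to_update(sample))
```

The point of the sketch is only the policy: nothing is considered green until both errors and warnings have been looked at.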
If everything is green at this point, we can schedule the update or start it immediately via Inventory > Workload Domains > Updates/Patches.
If the update fails for any reason, the GUI will at first show the next update step and no retry option. Do not click it yet. We experienced this a couple of times, and you just need to wait a few seconds or minutes until the retry button appears.
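If you want to express this "wait, don't click" behavior in a script or runbook, it boils down to polling with a timeout. The following is a minimal sketch under assumptions: retry_available is a hypothetical callable that you would have to implement yourself (for example by checking the update status by hand or through whatever interface you use), and the default timeout and interval are arbitrary values, not numbers from VMware or Dell.

```python
import time


def wait_for_retry(retry_available, timeout_s=600.0, interval_s=10.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until the retry option shows up or the timeout expires.

    retry_available is a hypothetical zero-argument callable returning
    True once the retry option is offered. Returns True if it appeared
    in time, False otherwise.
    """
    deadline = clock() + timeout_s
    while True:
        if retry_available():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated status source: the retry option appears on the third check.
    answers = iter([False, False, True])
    appeared = wait_for_retry(lambda: next(answers), interval_s=0.0)
    print("retry available:", appeared)
```

The sleep and clock parameters exist only so the waiting logic can be tried out without actually waiting.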
You can always view the status of the update with the View Status button.
If everything went fine, we can continue with the VCF Configuration Drift Bundle. The procedure is the same as for all the steps: PreCheck > Schedule Update > View Status > Finish.
The same procedure applies to the next components: vRealize (if applicable) and NSX-T Data Center. NSX-T offers quite a lot of individual settings here, such as upgrading only specific Edge clusters or host clusters; we went with the standard settings. A federated environment differs from this, so be careful. Running the NSX-T Upgrade Evaluation Tool before upgrading gives additional assurance.
If you are jumping across versions, I strongly recommend using the NSX-T Upgrade Evaluation Tool (integrated into the GUI prechecks in newer VCF versions).
Be sure all errors and problems in the NSX Manager are resolved before you start updating, as the update will fail otherwise. For example, we had one Edge in a failed state in one update scenario, and the NSX update failed because of it. It was just a service that had not started, which was fixed by restarting the Edge. The rule is simple: if not everything is clean, the update will fail.
We experienced a weird issue following the update in one scenario:
One NSX-T Manager was somehow out of sync in the cluster and threw errors like “unable to fetch cluster information”. In our case this did not go away on its own, and we had to restart the node. This can also happen if the node is the one holding the elected VIP, in which case your VIP address shows this failure. Don’t freak out; just restart the broken node and it will work fine.
For the next step, the vCenter, we experienced some weird issues caused by the proxy settings of the vCenter. We deactivated the proxy to make sure everything would be fine. Furthermore, we had issues with a service/database dedicated to statistics called vtsdb. The symptom was a vCenter update that never started. We were not able to solve this issue on our own and contacted Dell/VMware support, who helped us rebuild the database.
The last and most time-consuming part was the update of the VxRail Managers and nodes, which went quite cleanly for us without many issues. Here you can split the update if you have multiple clusters within one workload domain; e.g. if you have a WLD with a Database and a VDI cluster, you can start the update for both in parallel. If something goes wrong in the update process, you can also retry each cluster separately. You can upgrade up to 5 clusters in parallel.
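For planning purposes, the limit of 5 parallel cluster upgrades means that a larger number of clusters has to be split into waves. A minimal sketch of that planning step, assuming nothing more than a list of cluster names (the names below are made up, and the function does not talk to the SDDC Manager at all):

```python
MAX_PARALLEL_CLUSTERS = 5  # limit mentioned in this report


def plan_waves(clusters, max_parallel=MAX_PARALLEL_CLUSTERS):
    """Split cluster names into waves of at most max_parallel entries."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [clusters[i:i + max_parallel]
            for i in range(0, len(clusters), max_parallel)]


if __name__ == "__main__":
    # Hypothetical cluster names, for illustration only.
    names = ["db", "vdi", "web", "app", "dev", "test", "lab"]
    for number, wave in enumerate(plan_waves(names), start=1):
        print("wave", number, wave)
```

Each wave can then be started in parallel, and a failed cluster can be retried on its own without touching the others.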
The steps for the Workload Domains don’t differ much from those for the Management Domain. Just keep in mind that many management components reside only in the Management Domain, so you won’t have them within a Workload Domain. Usually the Workload Domains will only have the update process for NSX-T, vCenter and VxRail Manager/nodes.
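The difference between the two domain types can be written down as a simple checklist generator. This sketch lists only the components named in this report, in the order they were described; the initial appliance update that precedes the Configuration Drift Bundle is deliberately left out, and the step names are plain labels, not bundle identifiers from the SDDC Manager.

```python
# Steps named in this report that only exist in the Management Domain.
MANAGEMENT_ONLY_STEPS = [
    "VCF Configuration Drift Bundle",
    "vRealize (if applicable)",
]

# Steps that both Management and Workload Domains go through.
SHARED_STEPS = [
    "NSX-T Data Center",
    "vCenter",
    "VxRail Manager and nodes",
]


def steps_for(domain_type):
    """Return the ordered update steps for 'management' or 'workload'."""
    if domain_type == "management":
        return MANAGEMENT_ONLY_STEPS + SHARED_STEPS
    if domain_type == "workload":
        return list(SHARED_STEPS)
    raise ValueError(f"unknown domain type: {domain_type!r}")


if __name__ == "__main__":
    for domain in ("management", "workload"):
        print(domain)
        for step in steps_for(domain):
            # Every step follows the same procedure described above.
            print("  ", step, ": PreCheck > Schedule Update > View Status > Finish")
```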
In this section I would like to give you an idea of how long everything takes without troubleshooting or anything like that.
The scenario we take as a reference here is a pretty low-load management cluster with 4 + 4 hosts stretched over 2 locations. vCenter and NSX-T are, as recommended, in a 3-node HA cluster, and we have 1 Edge cluster with 2 Edge appliances. We came from version 22.214.171.124 and went to 4.5.
The whole process clocked in at close to 11 hours. This may take longer depending on the load you have in the cluster, so make sure you plan accordingly. You can also take a break between the steps, but keep in mind that you can’t stop at just any point.
All in all, I can say that the process provided to update a VCF environment is quite convenient and works pretty well. Just pay attention to the state of all appliances and do your prechecks. If you have any concerns regarding your update path or procedure, contact Dell support and let them double-check, because four eyes always see more than two.
Always check that you have a valid file-based backup of your SDDC Manager, NSX-T Manager and vCenter. In addition, as mentioned before, feel free to take a snapshot of the appliance you are updating next, so that in a failure scenario with some damage you can fall back quickly.