Flapping Alerts in vRealize Operations 8.x


I have just discovered this bug, courtesy of the tens of thousands of flapping alerts I’ve received in the last month.

While checking my federated vROps cluster to compile a report on the number of alerts generated over December, I was greeted with a significantly higher number than I was expecting, especially considering the Christmas Change Freeze, which would stop any non-urgent tasks. Further investigation showed that it was due to a few dozen alert definitions appearing thousands of times each (more than 6k alerts for one alert type on one cluster, for example).

This appears to affect any alert based on receiving a fault symptom, such as all the default vSAN Management Pack alerts.

This manifests itself as an alert going active, cancelling soon after, and then reactivating, aka flapping. See below for an example from one cluster where the HCL DB wasn’t up to date.

The cause of this bug can be seen in the Symptoms view on the object: vROps creates a new symptom every time instead of updating the existing fault symptom.

Looking at the “cancelled on” values, the symptoms were all showing active at the same time, and were all cancelled when the vSAN HCL DB was updated around 3:30pm on the 23rd December. The 50 minute regularity seems to tie in with the vSAN Health Check interval on the vCenter.

I am running vROps 8.1.1 (16522874) and am not sure whether this impacts all versions of vROps 8.x. If you see this on any other version, let me know.

Luckily there is a fix: HF4, which will take you to vROps version 8.1.1 (17258327).

As this pak file is 2.2GB in size, I am unable to host it on my blog for easy download, so I suggest you speak to your VMware TAM, Account Manager, or open a case with Global Support Services and reference this hotfix.

If all else fails I might be able to share it with you using OneDrive; however, I cannot promise a quick turnaround for that.

UPDATE: I have had it confirmed that this bug affects 8.0 and 8.2 as well, and there are hotfixes for those versions too. The next full release will have the fix built in.

If you are currently on 8.0.x or 8.1.x, I would suggest either applying the HF and then upgrading straight to 8.3 when it is released, or upgrading to 8.2 first and then applying the HF.

Troubleshooting vRealize Operations Networking


One of the first steps when troubleshooting vROps is to ensure that the correct ports are open.

This is best done via SSH, so first of all enable that via the admin screen and log in as root (you did set a root password, didn’t you? If not, go do that now via the vSphere Console).

Port Checking

echo -e "\e[4;1;37mNode Connectivity Check..\e[0m"; for port in {80,123,443,6061} {10000..10010} {20000..20010}; do (echo >/dev/tcp/OTHERNODE_IPADDRESS/$port) > /dev/null 2>&1 && echo -e "\e[1;32mport $port connectivity test successful\e[0m" || echo -e "\e[1;31mport $port connectivity test failure\e[0m";done

Copy and paste the above, changing OTHERNODE_IPADDRESS to the IP address of the node you are testing against, to get a nice simple output for the usual ports required between the nodes.
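If you would rather run the same check from a machine without bash's /dev/tcp support, the idea can be sketched in Python. This is a minimal sketch, not part of the original one-liner: the function names check_port and check_node are my own, and it only attempts a TCP connect to the same port list, exactly as the bash loop does.

```python
import socket

# Same ports the bash one-liner probes: 80, 123, 443, 6061,
# plus the ranges 10000-10010 and 20000-20010.
NODE_PORTS = [80, 123, 443, 6061, *range(10000, 10011), *range(20000, 20011)]


def check_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_node(host, ports=NODE_PORTS):
    """Map each port to the result of a TCP connect attempt (hypothetical helper)."""
    return {port: check_port(host, port) for port in ports}
```

Calling check_node with the other node's IP address returns a dictionary of port to True/False, which you can print in whatever format suits you.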

Full details of the ports and directions below:


If you want to test a single port you can use curl

curl -v telnet://OTHERNODE_IPADDRESS:443

Latency Checking

grep clusterMembership /storage/db/casa/webapp/hsqldb/casa.db.script | sed -n 1'p' | tr ',' '\n' | grep ip_address | cut -d ':' -f 2 | sed 's/\"//g' | while read nodeip; do echo -n "$nodeip avg latency: " && ping -c 10 -i 0.2 $nodeip | grep rtt | cut -d '/' -f 5; done

This command collects the IP addresses of all the nodes in the cluster from casa.db.script and pings each one ten times, outputting the average latency to each node.
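The last part of that pipeline (grep rtt | cut -d '/' -f 5) works because the Linux ping summary line looks like "rtt min/avg/max/mdev = a/b/c/d ms", so the fifth slash-separated field is the average. As a rough sketch of the same extraction in Python (parse_avg_rtt is a name I have made up, and it assumes the iputils summary format):

```python
import re


def parse_avg_rtt(ping_output):
    """Return the average round-trip time in ms from Linux ping output, or None.

    Assumes the iputils summary line, for example:
    rtt min/avg/max/mdev = 0.310/0.402/0.521/0.071 ms
    """
    match = re.search(
        r"rtt [^=]+= ([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+) ms", ping_output
    )
    # Group 2 is the avg value, the same field that cut -d '/' -f 5 selects.
    return float(match.group(2)) if match else None
```

If the summary line is missing (for example when every ping is lost), the function returns None rather than raising.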

vCenter Connectivity Checking

echo -e "\e[1;31mvCENTER CONNECTIVITY:\e[0m" >> $HOSTNAME-status.txt;M0RE="y";while [ "$M0RE" == "y" ];do echo $MORE;while read -p 'Enter vCenter F_Q_D_N or I_P: ' F_Q_D_N && [[ -z "$F_Q_D_N" ]] ; do echo 'F_Q_D_N or I_P cannot be blank';done;curl -k https://$F_Q_D_N >> /dev/null;if [ "$?" == "0" ]; then echo $F_Q_D_N 'Connectivity Test Successful' >> $HOSTNAME-status.txt;else echo $F_Q_D_N 'Connectivity Test Failed' >> $HOSTNAME-status.txt;fi; nslookup $F_Q_D_N >> $HOSTNAME-status.txt; echo -n "Check M0RE y or n: " && read M0RE;done;

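This command prompts for a vCenter FQDN or IP address, tests HTTPS connectivity to it with curl, runs nslookup against it, and appends the results to $HOSTNAME-status.txt. It repeats until you answer anything other than y.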


If you need to quickly check if the adapters have been distributed to all the nodes, run the following command to check the plugins folder size


du -h --max-depth=1

Estimate the equivalent number of VMs able to be reclaimed by rightsizing using vRealize Operations Supermetrics

When planning rightsizing events on a customer’s estate I am usually requested to estimate the number of new VMs which could be placed into the estate on the resources freed up by rightsizing.

This can be calculated relatively easily by hand, but who wants to do that when you can have something else do it for you, and even utilise it on a dashboard as a KPI?

My customer in this example has a guideline they use for an average machine on their estate, which is 4 vCPU and 32GB RAM.

So in the first example I will show the code with a fixed VM size.

This calculation uses the min function to take the lowest of an array of numbers, and the floor function to round the result down to a whole number. More details here:

Estimate remaining VM Overhead using vROps – Advanced Super Metrics

The calculations it’s using here are the excess vCPU metric divided by 4 vCPU for our guideline VM, and the excess memory metric converted from KB to GB and divided by 32GB RAM.

Remember the depth setting, which allows this supermetric to run at a higher grouping level such as vCenter or Custom Group.

floor(min([((sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|vcpus, depth=5}))/4),(((sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|memory, depth=5}))/1048576)/32)]))
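To sanity-check the super metric by hand, the same arithmetic can be sketched in Python. The input figures below are invented purely for illustration, and reclaimable_vms is my own name; only the 4 vCPU / 32GB guideline and the 1048576 KB-to-GB divisor come from the formula above.

```python
import math

KB_PER_GB = 1048576  # 1024 * 1024, the divisor used in the super metric


def reclaimable_vms(excess_vcpus, excess_mem_kb, vm_vcpus=4, vm_mem_gb=32):
    """Number of guideline-sized VMs that fit in the reclaimed resources."""
    by_cpu = excess_vcpus / vm_vcpus
    by_mem = (excess_mem_kb / KB_PER_GB) / vm_mem_gb
    # min picks the constraining resource, floor rounds down to whole VMs.
    return math.floor(min(by_cpu, by_mem))


# Made-up example: 130 excess vCPUs and 2000 GB of excess memory.
print(reclaimable_vms(130, 2000 * KB_PER_GB))  # prints 32
```

Here CPU is the constraint (130 / 4 = 32.5) rather than memory (2000 / 32 = 62.5), so the answer is 32 VMs.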

This can be expanded further: instead of using a fixed VM size, we could take the average VM size of the grouping we are running this supermetric against.

To do this we would replace the “4” and “32” with a calculation for average size

For vCPU this would be

avg(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=config|hardware|num_Cpu, depth=5}) 

for RAM this would be

avg((${adaptertype=VMWARE, objecttype=VirtualMachine, metric=config|hardware|memoryKB, depth=5})/1048576)

So our full calculation for estimating how many VMs of the average size could be reclaimed by rightsizing would be:

floor(min([((sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|vcpus, depth=5}))/avg(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=config|hardware|num_Cpu, depth=5})),(((sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|memory, depth=5}))/1048576)/avg((${adaptertype=VMWARE, objecttype=VirtualMachine, metric=config|hardware|memoryKB, depth=5})/1048576))]))
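As a rough cross-check of the average-size version, here is the same logic in Python over a small list of VMs. The dictionary keys and the three sample VMs are hypothetical stand-ins for the four metrics used in the formula, not real vROps output.

```python
import math
from statistics import mean

KB_PER_GB = 1048576


def reclaimable_avg_vms(vms):
    """Average-sized VMs that fit in the resources freed by rightsizing.

    vms is a list of dicts with hypothetical keys:
    num_cpu, memory_kb, oversized_vcpus, oversized_mem_kb.
    """
    excess_vcpus = sum(vm["oversized_vcpus"] for vm in vms)
    excess_mem_gb = sum(vm["oversized_mem_kb"] for vm in vms) / KB_PER_GB
    avg_vcpus = mean(vm["num_cpu"] for vm in vms)
    avg_mem_gb = mean(vm["memory_kb"] / KB_PER_GB for vm in vms)
    return math.floor(min(excess_vcpus / avg_vcpus, excess_mem_gb / avg_mem_gb))


# Invented sample data: average VM is 8 vCPU / 32 GB.
sample = [
    {"num_cpu": 8, "memory_kb": 64 * KB_PER_GB, "oversized_vcpus": 8, "oversized_mem_kb": 48 * KB_PER_GB},
    {"num_cpu": 4, "memory_kb": 16 * KB_PER_GB, "oversized_vcpus": 2, "oversized_mem_kb": 8 * KB_PER_GB},
    {"num_cpu": 12, "memory_kb": 16 * KB_PER_GB, "oversized_vcpus": 10, "oversized_mem_kb": 8 * KB_PER_GB},
]
print(reclaimable_avg_vms(sample))  # prints 2
```

With these numbers the CPU side gives 20 / 8 = 2.5 and the memory side gives 64 / 32 = 2.0, so memory is the constraint and the result is 2.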

Sizing your migration using vRealize Operations and Supermetrics

Today I’m going to talk about using vRealize Operations and Supermetrics to size your requirements for migrating from one estate to another.

I have a customer with a large sprawling legacy vSphere estate and they are planning their migration to a new VCF deployment using HCX.

They could simply keep everything the same size and purchase the appropriate number of nodes, however in this case that could become very expensive very quickly.

Luckily we have been monitoring the legacy estate with vROps 7.0 and 8.1 for the last year.

With this in mind I created a supermetric which would calculate the total number of hosts required if all the VMs were conservatively rightsized (which would reduce their resource allocation by up to 50%, based on the vROps analytics calculations for recommended size) and any idle VMs which are no longer required were removed.

This supermetric works to a depth of 5, which means that we can get a required number of hosts at cluster level as well as for a whole vCenter or even a custom group of multiple vCenters.

In my example my new hosts have 40 cores which we are allowing to over-allocate by up to 4:1 giving a maximum of 160 vCPU per host, along with 1.5TB of RAM which is not going to be over allocated.

Step One – Memory

(ceil(((sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=mem|memory_allocated_on_all_vms, depth=5}))-sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=reclaimable|idle_vms|mem, depth=5})-sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|memory, depth=5}))/1574400000)+1)

This first calculation takes the total memory allocated on a cluster, removes the memory reclaimable from deleting idle VMs, and removes the total of memory able to be reclaimed by rightsizing the VMs.

This number is then divided by the amount of memory available in each host in KB.

This number is then rounded up by using the CEIL function. More details on that here:

Estimate remaining VM Overhead using vROps – Advanced Super Metrics

Finally an additional host is added to this number to allow for N+1 High Availability. This can be set to your requirements.

Step Two – CPU

(ceil(((sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=cpu|vcpus_allocated_on_all_vms, depth=5}))-sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource,  metric=reclaimable|idle_vms|cpu, depth=5})-sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|vcpus, depth=5}))/(4*(40)))+1)

Similar to the memory calculation above, this takes the total number of vCPUs allocated on a cluster, removes the vCPUs able to be reclaimed from deleting idle VMs, and removes the total number of vCPUs able to be reclaimed by rightsizing the VMs.

This number is then divided by the number of cores available in each host multiplied by our maximum over-allocation of 4:1

Again this is rounded up using a CEIL function and then an additional host added for HA.

Step Three – Wrapping it up with a MAX function

This is the final super metric formula, which takes the two calculations above and puts them into an array, with the max function used to take the highest value so that the cluster is sized for whichever resource needs more hosts.

This function has the following format:

max( [ calc1 , calc2 , … calcN ] )

You may spot that I have added a “3” as the third value. This ensures that the super metric never recommends a cluster size of fewer than three hosts.

max([(ceil(((sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=mem|memory_allocated_on_all_vms, depth=5}))-sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=reclaimable|idle_vms|mem, depth=5})-sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|memory, depth=5}))/1574400000)+1),(ceil(((sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource, metric=cpu|vcpus_allocated_on_all_vms, depth=5}))-sum(${adaptertype=VMWARE, objecttype=ClusterComputeResource,  metric=reclaimable|idle_vms|cpu, depth=5})-sum(${adaptertype=VMWARE, objecttype=VirtualMachine, metric=summary|oversized|vcpus, depth=5}))/(4*(40)))+1),3])
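To check the three steps with real numbers before trusting the super metric, the arithmetic can be sketched in Python. The function name and the sample inputs are my own inventions; the defaults (1574400000 KB per host, 40 cores, 4:1 over-allocation, one HA host, minimum of three hosts) are the values used in the formula above.

```python
import math


def hosts_required(alloc_mem_kb, idle_mem_kb, oversized_mem_kb,
                   alloc_vcpus, idle_vcpus, oversized_vcpus,
                   host_mem_kb=1574400000, host_cores=40,
                   overcommit=4, ha_hosts=1, min_hosts=3):
    """Hosts needed after removing idle VMs and rightsizing the rest."""
    # Step One: memory, not over-allocated, rounded up, plus HA.
    mem_hosts = math.ceil(
        (alloc_mem_kb - idle_mem_kb - oversized_mem_kb) / host_mem_kb
    ) + ha_hosts
    # Step Two: vCPU, over-allocated at overcommit:1, rounded up, plus HA.
    cpu_hosts = math.ceil(
        (alloc_vcpus - idle_vcpus - oversized_vcpus) / (overcommit * host_cores)
    ) + ha_hosts
    # Step Three: the larger requirement wins, never below min_hosts.
    return max(mem_hosts, cpu_hosts, min_hosts)


# Made-up estate: memory needs 10 hosts, CPU needs 14, so CPU decides.
print(hosts_required(20_000_000_000, 2_000_000_000, 5_000_000_000,
                     3000, 200, 800))  # prints 14
```

Keeping the host size, over-allocation ratio and HA count as parameters makes it easy to try a different host specification without rewriting the formula.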

IF Function in vROps Super Metrics aka Ternary Expressions


Have you ever just wanted an IF Function when creating Super Metrics? Good news, there is one!

Leading on from the last post I did on determining the number of VMs which will fit into a cluster, I have decided to further expand it with an IF function to take the Host Admission Policy failure to tolerate level into account as well.

Previously we used a flat 20% overhead as that was the company policy. However, that reserved way too many resources on larger clusters, and setting it to a flat two host failures would have reserved too much on smaller ones.

We wanted any Cluster Compute Resource with fewer than 10 hosts to allow for only a single host failure, while clusters of 10 hosts and above should allow for two host failures.

In vROps terms this requires a Ternary Expression, or as most people know them, an IF Function.

You can use the ternary operator in an expression to run conditional expressions in the same way you would an IF Function.

This is done in the format:

expression_condition ? expression_if_true : expression_if_false

So for our example we want to take the metric summary|total_number_hosts and check if the number of hosts is less than 10.

This means our expression condition is:

${this, metric=summary|total_number_hosts}<10

As we want to return a “1” for one host failure if this is true, and a “2” for two host failures if there are 10 or more hosts, our full expression is:

(${this, metric=summary|total_number_hosts}<10?1:2)

This means our full code is:

floor(min([(((((${this, metric=cpu|corecount_provisioned})-(((${this, metric=cpu|corecount_provisioned})/${this, metric=summary|total_number_hosts}))*(${this, metric=summary|total_number_hosts}<10?1:2))*4)-(${this, metric=cpu|vcpus_allocated_on_all_vms}))/8),(((((${this, metric=mem|host_provisioned})-((${this, metric=mem|host_provisioned}/${this, metric=summary|total_number_hosts})*(${this, metric=summary|total_number_hosts}<10?1:2)))-(${this, metric=mem|memory_allocated_on_all_vms, depth=1}))/1048576)/32),((((${this, metric=diskspace|total_capacity})*0.7-(${this, metric=diskspace|total_provisioned, depth=1}))/1.33)/(500+32))]))
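To see what the ternary does to the numbers, here is a small Python sketch of just the CPU term: total cores, minus one host's worth of cores for each host failure we want to tolerate. The function names and the example cluster sizes are mine; the threshold of 10 hosts and the 1 or 2 result come from the expression above.

```python
def hosts_to_tolerate(total_hosts):
    """Python equivalent of (total_number_hosts<10?1:2)."""
    return 1 if total_hosts < 10 else 2


def usable_cores(total_cores, total_hosts):
    """Cores left once the failure-to-tolerate hosts are held back."""
    per_host = total_cores / total_hosts
    return total_cores - per_host * hosts_to_tolerate(total_hosts)


# Made-up clusters: 8 hosts x 20 cores, and 12 hosts x 40 cores.
print(usable_cores(160, 8))   # prints 140.0
print(usable_cores(480, 12))  # prints 400.0
```

The 8-host cluster holds back one host (20 cores) while the 12-host cluster holds back two (80 cores), which is exactly the behaviour the flat 20% overhead could not give us.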

vROps Summary Tab Fault for Certain Objects


I recently came across a client using vROps 7.5 with a fault with the vROps Summary tab for individual objects. It was working fine for some objects but not others.

The fault resulted in the Summary tab not working for certain object types. It would either show a blank grey screen or automatically forward to the “Manage Dashboards” screen. If you added “/alerts” to the end of the URL you could get to the Alerts tab and then click through to all the other tabs.

However, if you then clicked on the Summary tab, it would just show a blank screen or forward to Manage Dashboards again.

At first I thought it had to be some Licensing “feature” to annoy people who were breaking their allowed number of Licensed Objects, so applied a temporary 10k OSI Enterprise license and STILL had the issue.

Even taking the cluster offline and rebooting, and reinstalling Management Packs didn’t fix the issue.

I was scratching my head for two days trying to figure out why it was only affecting some object types, but thanks to a nudge from a colleague we discovered the problem.

Good news everyone, we fixed the vROps Summary tab fault!

Under Manage Summary Dashboards, the Detail Page entries were blank for these object types but set correctly for others.

The vROps Summary Tab Fix!

This annoying fault can be resolved using these steps: 

  1. Navigate to Dashboards
  2. Manage Dashboards
  3. Click the Cog Icon
  4. Manage Summary Dashboards
  5. Select the adapter type associated with your Object Types (vCenter Adapter in my case)
  6. Click on each of the items with blank ‘Detail Page’ entries
  7. Click the ‘Use Default’ button in the top left hand corner to re-add them to summary detail
  8. Save

Now go run and find a Virtual Machine and revel in the glow of a working Summary tab in details view.

I’ve not found this discussed anywhere else, so hopefully this will be useful for anyone else who has this issue.

Registration failed: Log Insight Adapter Object Missing

I recently came across a problem at a client site while integrating Log Insight (vRLI) with vROps. The connection tests successfully and alert integration works; however, launch in context returns the error “Registration failed: Log Insight Adapter Object Missing”.

After a discussion with GSS it was discovered that this is a known issue caused by the vROps cluster being behind a load balancer. The following errors are shown in the Log Insight log /storage/var/loginsight/vcenter_operations.log

[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/ INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [Open connection to URL https://vrops.domain.com/suite-api/api/versions/current]
[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/ INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [http connection, setting request method 'GET' and content type 'application/json; charset=utf-8']
[2018-05-15 09:51:02.621+0000] ["https-jsse-nio-443-exec-3"/ INFO] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [reading server response]
[2018-05-15 09:51:02.626+0000] ["https-jsse-nio-443-exec-3"/ ERROR] [com.vmware.loginsight.vcopssuite.VcopsSuiteApiRequest] [failed to post resource to vRealize Operations Manager]
javax.net.ssl.SSLProtocolException: handshake alert:  unrecognized_name

This is caused by some security updates to the Apache Struts, JRE, kernel-default, and other libraries from vRealize Log Insight 4.5.1. These updated libraries affect the SSL Handshake that takes place when testing the vRealize Operations Manager integration.

To resolve this issue we needed to add the FQDN of the vROps load balancer as an alias to the apache2 config. This can be done by following these steps.

  1. Log in to the vRealize Operations Manager Master node as root via SSH or Console.
  2. Open /usr/lib/vmware-vcopssuite/utilities/conf/vcops-apache.conf in a text editor.
  3. Find the ServerName ${VCOPS_APACHE_SERVER_NAME} line and insert a new line after it.
  4. On the new line enter the following:
ServerAlias vrops.domain.com

Note: Replace vrops.domain.com with the FQDN of vRealize Operations Manager’s load balancer.

5. Save and close the file.

6. Restart the apache2 service:

service apache2 restart

7. Repeat steps 1-6 on all nodes in the vRealize Operations Manager cluster.
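If you have a lot of nodes, steps 3 and 4 are the part worth scripting. The following is only a sketch of the text edit, written in Python: add_server_alias is a hypothetical helper of my own, it works on the file contents as a string, and it assumes the ServerName line appears exactly as quoted in step 3. Reading and writing the file, and restarting apache2, are left to you.

```python
SERVER_NAME_LINE = "ServerName ${VCOPS_APACHE_SERVER_NAME}"


def add_server_alias(conf_text, lb_fqdn):
    """Insert 'ServerAlias <lb_fqdn>' after the ServerName line if not present."""
    alias = f"ServerAlias {lb_fqdn}"
    lines = conf_text.splitlines()
    if any(line.strip() == alias for line in lines):
        return conf_text  # already configured, leave the text untouched
    out = []
    inserted = False
    for line in lines:
        out.append(line)
        if not inserted and line.strip() == SERVER_NAME_LINE:
            # Reuse the indentation of the ServerName line for the new line.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + alias)
            inserted = True
    if not inserted:
        raise ValueError("ServerName line not found; edit the file by hand")
    return "\n".join(out) + "\n"
```

Running the function a second time on its own output returns the text unchanged, so it is safe to repeat across nodes. Take a copy of vcops-apache.conf before writing anything back.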

Removing a Management Pack from vRealize Operations Manager (vROps)

I was recently asked by a colleague new to vROps how to remove a management pack in their client’s environment, and realised it’s not really well documented and used to be a GSS-only process.

Unfortunately removing a management pack from vROps is a CLI operation.

1. Log in to the vRealize Operations Manager Master node as root through SSH or Console.

2. Run this command to determine the existing management pack .pak files and make note of the name of the solution you want to remove:

$VMWARE_PYTHON_BIN $ALIVE_BASE/../vmware-vcopssuite/utilities/pakManager/bin/vcopsPakManager.py --action query_pak_files

3. Run this command to determine the management pack’s internal adapter name listed in the name section:

cat /storage/db/pakRepoLocal/<Adapter_Folder>/manifest.txt

4. Change to the /usr/lib/vmware-vcops/tools/opscli/ directory.

5. Run the ops-cli.sh script with the uninstall option for the management pack name

./ops-cli.sh solution uninstall "<adapter_name>"

6. Run the cleanup script:

$VMWARE_PYTHON_BIN $ALIVE_BASE/../vmware-vcopssuite/utilities/pakManager/bin/vcopsPakManager.py --action cleanup --remove_pak --pak "<adapter_name>"

7. Remove the management pack’s .pak file from the $STORAGE/db/casa/pak/dist_pak_files/VA_LINUX/ directory.

8. Open the vcopsPakManagerCommonHistory.json file using a text editor.

vi /storage/db/pakRepoLocal/vcopsPakManagerCommonHistory.json 

9. Delete the entries related to the removed management pack, each one from its opening { to its closing }

10. Save and close the file.


How to add Historic User Session Latency to vROps for Horizon.

This post was written by Cameron Fore and is reproduced here for posterity and for a future implementation. Original link HERE

vROps for Horizon provides end-to-end visibility into key user session statistics that make it easy for Horizon admins to visualise and alert on performance problems impacting the users of their environment. One of the key metrics used in determining how well users are connected to their virtual app or desktop session is Session Latency (ms), as it most visibly impacts the user’s perspective of their session performance. The lower the session latency, the quicker video, keyboard, and mouse inputs are redirected to and from a user’s endpoint client, giving the user a more native-like PC experience.

As the latency trends higher (>180ms), the experience begins to degrade, and the user can begin to notice “sluggishness“ – slow keyboard, mouse, and video responsiveness.

vROps for Horizon gives us direct visibility into when these issues are occurring across all of the Active User Sessions of the Horizon View environment. However, once the session becomes inactive, it will go into a stale object state and be removed from vROps during a clean-up window.

To be able to view this information historically on Pool and User objects, you can create Super Metrics that simply map the session latency to the objects you want to report on.

Creating the Super Metric

To create the Super Metric, navigate to Administration -> Configuration -> Super Metrics. Click the green + sign to create a new Super Metric.

Give the Super Metric a unique name; in this case we are using “Avg App Session Latency”. Search for the “Application Session” Object Type, and click “Round Trip Latency (ms)” to add it to the Super Metric. Since we are looking for the average latency, select “avg” from the available functions list, making sure that the average function applies to the metric by enclosing it in parentheses as demonstrated in the image below. Click Save to finish the Super Metric.

Next, you will need to add the Super Metric to the “User” object type. Click the green + sign under the “Object Types” section. Search and select the “User” object type.

Before the Super Metric will begin collecting data, you will need to navigate to Administration -> Policies, and edit the active monitoring policy to enable the metric for collection.

Once the metric has started to collect data, you can view the data on an individual “User” object by selecting “All Metrics” -> Super Metric -> select metric.

You can also create custom Views that display the historical latency for all users of the environment, as well as perform simple roll-up statistics.

Identifying VMs with RDMs for Categorisation in vRealize Operations

I had a customer with a very large legacy estate of very large VMs with RDMs attached, both physical and virtual. We were implementing vRealize Operations (vROps) and the customer wished for a way to automatically categorise and discover all VMs which had an RDM attached to them in the vROps dashboards and reports.

There are many ways to attempt to do this but it was decided that the simplest was to create a PowerCLI script to add a vCenter Custom Attribute to all VMs with an RDM attached. This Custom Attribute will automatically show as a property in vROps against the VM object, allowing for a new Custom Group to be created for VMs with and without RDMs. As vROps custom groups can be set for dynamic membership, the groups can be kept up to date without further configuration within vROps.

This script is designed to be run on a regular basis in order to account for new machines being added.

#load VMware PowerCLI module
if ((Get-PSSnapin | where {$_.Name -ilike "VMware.VimAutomation.Core"}).Name -ine "VMware.VimAutomation.Core"){
	Write-Host "Loading VMware PowerCLI"
	Add-PSSnapin VMware.VimAutomation.Core -ErrorAction SilentlyContinue
}

#Disconnect from any active vCenter sessions
If ($global:DefaultVIServers) {
	Disconnect-VIServer -Server $global:DefaultVIServers -Confirm:$false -Force:$true
}

#define variables

#Retrieve Local Hostname

#Connect to servers

Write-Host "Connecting to Local vCenter"
Connect-VIServer $LocalvCenter 

try {
	#Check if CustomAttributes Exist, if not create them
	if ((Get-CustomAttribute -Name 'RDMAttached' -TargetType VirtualMachine -ErrorAction:SilentlyContinue) -eq $null){
		Write-Host "Creating Custom Attribute RDMAttached"
		New-CustomAttribute -Name "RDMAttached" -TargetType VirtualMachine
	}

	#Write Annotations for VMs with RDMs Attached
	Write-Host "Writing Annotations" -NoNewline
	#Keep only VMs that have at least one physical or virtual mode RDM disk
	$VMwithRDM = Get-VM | Where-Object {$_ | Get-HardDisk -DiskType "RawPhysical","RawVirtual"}

	foreach($vm in $VMwithRDM){
		#Write annotations
		$vm|Set-Annotation -CustomAttribute "RDMAttached" -Value "True"
		Write-Host "." -NoNewline
	}#end Write Annotation
	Write-Host "`n"	

} #end of try
Catch {
	#Report any error raised while creating the attribute or writing annotations
	Write-Host $_.Exception.Message -ForegroundColor Red
}

Write-Host "Disconnecting from vCenter"
Disconnect-VIServer -Confirm:$false -Force:$true