vSphere Replication – Create an alarm for RPO violations

In our article series about vSphere Replication, we configured new replications and checked the possibilities for monitoring these replications.

These monitoring possibilities are useful when you create a new replication, or when you’re in a troubleshooting session. But in the long run, this is not an efficient way to monitor ongoing replications. In this article we will create an alarm that sends us an email when the RPO of a replication is violated.

Easy? We’ll see!

Alarm configuration

We’ll start by creating a new alarm in our vSphere client. From the Hosts and clusters view, select the vCenter, go to Manage, Alarm definitions and click the green “+” sign to add an alarm.

In the wizard, name your alarm and choose to monitor virtual machines for a specific event.

The event that we want to monitor is called RPO violated. Let’s select it in the drop-down list.

Then we add a second event (RPO restored) to clear the alarm when everything is fine again. Don’t forget to change the status to Normal!

Next, the alarm actions. Let’s have an email sent to us when the alarm is triggered.

Done! You can test the alarm by reducing the RPO of a replication to 15 minutes and copying enough new files to the VM that the replication can no longer meet its RPO. You should immediately receive the email.

This sounds nice, but…

… it’s not really finished. With your alarm in production, you will notice that you get way too many emails because the RPO was exceeded by a minute. This is caused by how the replication works: it waits as long as it can to start the synchronization (in order to get the latest changes), and often, the RPO gets exceeded by a few minutes.

How can we mitigate this problem? We should be able to say something like “trigger the alarm only if the RPO is exceeded by more than 10 minutes”, but that is not possible in the web client. The good old C# client can do slightly better, but it is limited too, as we will see. Let’s try!

Once connected, go to hosts and clusters, select the vCenter on the left, Alarms and Definition. Double-click the RPO alarm and go to the Triggers tab. Our alarm is the event com.vmware.vcHms.rpoViolatedEvent. Click on Advanced in the Conditions column to define our custom condition.

Type currentRpoViolation, not equal to, and 1.

The currentRpoViolation counter contains the number of minutes since the beginning of the RPO violation. By saying “not equal to 1”, we avoid triggering the alarm immediately. And don’t look for a “greater than” operator, there is no such thing :(.

Theoretically, our customization would trigger the alarm as soon as we reach the second minute of the RPO violation, but in my tests, the alarm was only triggered after 10 to 16 minutes, which may be acceptable. I guess that the currentRpoViolation counter is not checked every minute!
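To make the difference between the two conditions concrete, here is a small Python sketch. It only models the comparison itself: the function names and the 10-minute threshold are illustrative assumptions and are not part of vSphere.

```python
def client_condition(current_rpo_violation: int) -> bool:
    """Condition we can configure in the C# client: currentRpoViolation != 1."""
    return current_rpo_violation != 1


def desired_condition(current_rpo_violation: int, threshold_minutes: int = 10) -> bool:
    """Condition we would like to have: violation longer than a threshold.

    threshold_minutes is a hypothetical setting, chosen for illustration.
    """
    return current_rpo_violation > threshold_minutes


# Compare both conditions for a few sample violation durations (in minutes).
for minutes in (1, 2, 5, 11):
    print(minutes, client_condition(minutes), desired_condition(minutes))
```

In this model, the "not equal to 1" condition only filters out the very first minute, while a real "greater than" comparison would stay silent for the whole tolerated period.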

If you need to delay the alarm further, the best approach is probably to abandon the integrated alarm system and go for a PowerShell script that checks the value of the currentRpoViolation counter and sends an email as soon as your custom threshold is reached. If you’re interested in such a script, you can find interesting elements here (I found the trick to delay the alarm there) or here.
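The decision logic of such a script could look roughly like the sketch below. It is written in Python rather than PowerShell, and it deliberately leaves out how the counter is read from vCenter and how the mail is sent. The class name, the 10-minute threshold, the addresses and the convention that a value of 0 means "RPO restored" are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from email.message import EmailMessage
from typing import Optional


@dataclass
class RpoWatcher:
    """Hypothetical helper: alert once per violation, after a grace period."""
    threshold_minutes: int = 10                 # assumed tolerance
    alerted: set = field(default_factory=set)   # VMs already reported

    def observe(self, vm_name: str, violation_minutes: int) -> Optional[EmailMessage]:
        """Return a mail to send when a new alert is due, otherwise None."""
        if violation_minutes <= 0:
            # Assumption: 0 means the RPO is met again, so re-arm the alert.
            self.alerted.discard(vm_name)
            return None
        if violation_minutes > self.threshold_minutes and vm_name not in self.alerted:
            self.alerted.add(vm_name)
            msg = EmailMessage()
            msg["Subject"] = f"RPO violated on {vm_name} for {violation_minutes} min"
            msg["From"] = "vcenter@example.com"  # placeholder address
            msg["To"] = "admin@example.com"      # placeholder address
            msg.set_content(
                f"The replication of {vm_name} has exceeded its RPO "
                f"by {violation_minutes} minutes."
            )
            return msg
        return None


watcher = RpoWatcher()
for minutes in (3, 12, 13, 0, 11):
    mail = watcher.observe("vm1", minutes)
    print(minutes, "alert" if mail is not None else "no alert")
```

A scheduled job would call observe() with the latest counter value for each replicated VM and hand any returned message to a mail library such as smtplib.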