An Implementation of GoldenGate Monitoring configured in a Real Application Cluster Environment
This post is not mine; all credit for it goes to: http://maazanjum.com/2014/06/19/an-implementation-of-goldengate-monitoring-configured-in-a-real-application-cluster-environment/
Nearly a year ago I wrote about GoldenGate monitoring using a Metric Extension, and later expanded on the creation of that Metric Extension, which I have since installed and configured at several customer sites where GoldenGate runs on a standalone server.
Over the last few months, several people have approached me about the process, and some have improved on it. One such improvement is credited to Bobby Curtis (@dbasolved), who taught me how to buffer output in Perl. Bobby also has a neat collection of monitoring scripts for GoldenGate; you can find them here.
The latest implementation is a collaboration led by my tenacious and talented colleague Tucker Thompson (LinkedIn). He is responsible for maintaining and managing the Enterprise Manager environment, from an operational perspective, for (among many others) a rather large retail corporation – let’s call them Furry Feet (FF). Their environment contains multiple Exadata machines as well as several dozen non-Exadata environments.
A few weeks ago he approached me with a question about monitoring GoldenGate with Enterprise Manager without using the GoldenGate Plugin. In his own words, Tucker described the problem:
“The client was previously using [custom built] crontab scripts to monitor multiple items (including GoldenGate) in their large Exadata environment, despite having an OEM 12c implementation. Our desire was to move all of their crontab elements into a centralized strategy utilizing OEM 12c.
The current GoldenGate plugin for OEM 12c was tested, but seemed very buggy and the client was not ready to use it yet.
The client has AGCTL configured to assist in running multiple highly available GoldenGate instances in the same Exadata DBM.”
As per Oracle’s documentation, Agent Control (AGCTL) is the command line utility used to manage bundled agents (XAG) that provide application High Availability (HA) with Oracle Grid Infrastructure.
Tucker explains why this solution wasn’t always reliable by stating:
“The crontab scripts operated per compute node to check the logs for errors, send a lag status to the elements, and check the AGCTL status to determine where the instance was running. However, we found that any error in the alert.log triggered a critical alert through their ticketing system, as it grepped for any ORA-XXXXX error.
For instance, if there were any long running queries, a ticket would be created. Another major issue was that the status of GoldenGate in AGCTL did not always accurately reflect its actual status. For example, an instance could show as down through AGCTL while, through GGSCI, its status was RUNNING.”
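The noise Tucker describes is a classic failure mode of naive log grepping: every ORA- string in the alert.log becomes a ticket. One common fix is to suppress a known list of benign errors before alerting. A minimal sketch of that idea (the function name and the ignore list are my own illustration, not part of FF’s actual scripts):

```python
import re

# Benign/expected errors that should not open a ticket (illustrative list).
IGNORED_ERRORS = {"ORA-00054", "ORA-01555"}  # resource busy, snapshot too old

ORA_PATTERN = re.compile(r"ORA-\d{5}")

def actionable_errors(alert_log_lines):
    """Return only the ORA- errors worth alerting on, with their source line."""
    hits = []
    for line in alert_log_lines:
        for code in ORA_PATTERN.findall(line):
            if code not in IGNORED_ERRORS:
                hits.append((code, line.strip()))
    return hits
```

A filter like this keeps the long-running-query noise out of the ticketing system while still surfacing genuine failures.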
We seem to find a pattern with GoldenGate monitoring issues, don’t we? This doesn’t necessarily mean that the tool itself is at fault, but rather the available options. I had already come up with an adequate way to monitor GoldenGate using Metric Extensions in EM12c, but it was designed to run against a host target where GoldenGate runs in standalone mode. In FF’s environment, several GoldenGate instances ran across the various Exadata compute nodes and were configured to fail over and restart seamlessly. This made for an interesting problem to solve, because my initial script assumes a static GoldenGate home.
As an example, consider three nodes in a cluster, each with a different GoldenGate instance managed by XAG.
Tucker’s innovative solution was to retrieve the information from clusterware via AGCTL to run the GoldenGate check against the nodes where the instance is currently running.
“What this script does is execute against the Exadata Database Machine as a target. This means that it will first find an available compute node, then run AGCTL to determine the names of the GoldenGate instances and the respective nodes they are configured to run on. This information is always available from any node, and at this stage the script does not take into account where AGCTL thinks GoldenGate is currently running.
Next, with that information registered, the script runs olsnodes to grab the host names of all compute nodes registered in the DBM. It then uses information pulled from the AGCTL configuration per GoldenGate instance to ssh to each compute node and grep for the manager process for that specific GG instance. With the manager found running on a certain node, the script then runs ggsci from that node against that GG instance, and parses the results to tell us if the different components are running, stopped, or abended. It also takes lag into consideration and sets warning thresholds, rather than raising a critical alert for any amount of lag. The script even goes so far as to add the AGCTL status as an informational column, so we can see when AGCTL shows the instance as down but GGSCI shows all processes running fine.
If the manager is not found anywhere in the Exadata environment, it extracts the first node that the instance is configured to run on from the AGCTL configuration, and runs ggsci from that node. This allows the script to still show all of the components as stopped or abended, and their respective lags.”
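The heart of the check described above is parsing the output of GGSCI’s `info all` command, whose rows list each process (MANAGER, EXTRACT, REPLICAT) with its status and checkpoint lag. A simplified parser sketch, mirroring the behavior described – warning thresholds for lag, critical only for stopped or abended processes – with the threshold value being my own illustrative choice:

```python
from datetime import timedelta

def parse_hms(value):
    """Parse an HH:MM:SS lag string into a timedelta."""
    h, m, s = (int(x) for x in value.split(":"))
    return timedelta(hours=h, minutes=m, seconds=s)

def parse_info_all(text, warn_lag=timedelta(minutes=5)):
    """Parse GGSCI 'info all' output into per-process status rows.

    Returns dicts with program, group, status, lag, and an alert level:
    CRITICAL for ABENDED/STOPPED, WARNING for lag past the threshold,
    otherwise OK.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] not in ("MANAGER", "EXTRACT", "REPLICAT"):
            continue  # skip headers and blank lines
        program, status = parts[0], parts[1]
        group = parts[2] if len(parts) > 2 else ""      # MANAGER has no group
        lag = parse_hms(parts[3]) if len(parts) > 3 else timedelta(0)
        if status in ("ABENDED", "STOPPED"):
            level = "CRITICAL"
        elif lag >= warn_lag:
            level = "WARNING"
        else:
            level = "OK"
        rows.append({"program": program, "group": group,
                     "status": status, "lag": lag, "level": level})
    return rows
```

Distinguishing WARNING (lag) from CRITICAL (process down) is what stops every brief lag spike from becoming a ticket.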
Another thing to note is that if a GoldenGate instance relocates to a different node for some reason, instead of just getting an alert that the instance went down, we would get that alert followed by a clear alert once the GoldenGate objects (manager, extract, replicat, pump) are back up and running on a different node.
Once tested via the Metric Extension setup screens, the output looks like:
This strategy allows one script to monitor multiple diagnostics across GoldenGate instances configured to run in a large, highly available environment.
It outputs the following information when run from a prompt.
Once the Metric Extension is deployed, its information is accessible in the “All Metrics” section for the Exadata Database Machine.
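For context on how such a script hands its results to EM12c: Metric Extensions built on the OS command adapter read delimited rows from the script’s standard output, conventionally prefixed with `em_result=`. The exact column set below (instance, program, group, status, lag, AGCTL status) is my reconstruction from the columns described in this post, not the precise layout of Tucker’s script:

```python
def em_result_rows(checks):
    """Format check results as pipe-delimited em_result rows for an
    EM12c Metric Extension using the OS command adapter."""
    template = "em_result={instance}|{program}|{group}|{status}|{lag}|{agctl}"
    return "\n".join(template.format(**c) for c in checks)
```

Each pipe-delimited field maps to one Metric Extension column, which is what makes the per-component statuses and lags visible under “All Metrics”.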
The point of this exercise was to solve a particular use case where GoldenGate instances are configured as clusterware resources that can restart on different nodes. What I would like to see next is an adaptation of this GoldenGate monitoring script for a clustered environment that doesn’t use XAG.
Thanks again to Tucker for coming up with the idea of retrieving the information from clusterware; it was fun to incorporate my original script into his version.