In many system management environments heartbeat solutions are utilised to determine the availability of distributed servers, applications or application instances. Such solutions rely on the distributed resources sending regular heartbeat pulses to the central systems management infrastructure with alerts being generated for missed heartbeats, i.e. where the heartbeat pulse is not received within the expected time frame.
Such solutions are prevalent in TEC Server infrastructures and based on EIF event heartbeat pulses. This article considers the migration of such solutions to Netcool/OMNIbus, prompted by the fact IBM have announced that Netcool/OMNIbus is the defined upgrade path for TEC Server solutions, and provides a solution based on the discussed concepts.
The basic concept of the OMNIbus heartbeat solution is identical to that implemented with TEC Server. With OMNIbus, however, out-of-the-box features can be used to build a more scalable and resilient solution.
OMNIbus can receive non-TME EIF events using the Tivoli EIF Probe. Hence heartbeat pulses from distributed resources sent as an EIF event can be forwarded directly to the OMNIbus infrastructure. Existing resources that generate EIF heartbeat pulses may require minor updates, for example adapting existing scripts to call postzmsg in place of the wpostzmsg binary, or TEC Adapter configuration changes to generate non-TME events.
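As an illustration of the script change described above, the following minimal Python sketch builds a postzmsg command line for a heartbeat pulse. The event class `Heartbeat_Pulse`, the source `HEARTBEAT`, the `hb_interval` slot and the configuration file path are hypothetical names chosen for this example, not part of any shipped solution. The EIF configuration file passed with `-f` is assumed to point at the host and port of the Tivoli EIF Probe.

```python
import socket
from typing import List, Optional


def build_heartbeat_command(
    config_file: str,
    event_class: str = "Heartbeat_Pulse",   # hypothetical event class
    source: str = "HEARTBEAT",              # hypothetical event source
    hostname: Optional[str] = None,
    interval_seconds: int = 300,
) -> List[str]:
    """Return the argument list for a non-TME heartbeat pulse sent via postzmsg."""
    host = hostname or socket.gethostname()
    return [
        "postzmsg",
        "-f", config_file,                   # EIF config assumed to target the EIF Probe
        "-r", "HARMLESS",                    # event severity
        "-m", f"Heartbeat pulse from {host}",
        f"hostname={host}",
        f"hb_interval={interval_seconds}",   # hypothetical slot: expected pulse interval
        event_class,
        source,
    ]
```

A scheduled job on each resource could pass the returned list to `subprocess.run(cmd, check=True)`. Building the arguments as a list, rather than as a shell string, avoids quoting problems in the message text.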
The Tivoli EIF Probe is configured, using a rules file, to process the heartbeat pulses, and insert the data into a custom table within the ObjectServer database. A custom table is used, as opposed to the existing alerts table, for performance reasons.
ObjectServer triggers can then be used to generate alerts based on the information within the custom table, identifying resources that have missed a heartbeat, or where a heartbeat has been restored after a failure period. The solution offered below expands the logic to identify relationships between symptom and causal events, for example missed heartbeats from an application and from the node the application is hosted on, and will suppress events for the former.
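The detection and suppression logic described above can be sketched in Python. This is a simplified model for illustration only, not the ObjectServer trigger code. The table layout, the field names and the alert strings are assumptions, and a real implementation would be written as ObjectServer SQL triggers acting on the custom table.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class HeartbeatRow:
    """One row of a hypothetical custom heartbeat table."""
    resource: str            # e.g. "node1" or "node1:app1"
    parent: Optional[str]    # hosting node for an application, None for a node
    last_seen: float         # time of the latest pulse, in seconds
    interval: float          # expected seconds between pulses
    missed: bool = False     # True while the resource is considered down


def record_pulse(table: Dict[str, HeartbeatRow], resource: str, now: float,
                 interval: float, parent: Optional[str] = None) -> List[str]:
    """Model the probe inserting or updating a row when a pulse arrives."""
    row = table.get(resource)
    if row is None:
        table[resource] = HeartbeatRow(resource, parent, now, interval)
        return []
    row.last_seen = now
    row.interval = interval
    if row.missed:
        # Simplification: a restore is reported even if the original
        # missed-heartbeat alert was suppressed as a symptom.
        row.missed = False
        return [f"RESTORED {resource}"]
    return []


def check_heartbeats(table: Dict[str, HeartbeatRow], now: float) -> List[str]:
    """Model a periodic trigger that raises alerts for overdue resources."""
    # First pass: flag every resource whose pulse is overdue.
    newly_missed = []
    for row in table.values():
        if not row.missed and now - row.last_seen > row.interval:
            row.missed = True
            newly_missed.append(row)
    # Second pass: suppress symptoms whose hosting node is also down.
    alerts = []
    for row in newly_missed:
        parent = table.get(row.parent) if row.parent else None
        if parent is not None and parent.missed:
            continue  # the node outage is the causal event
        alerts.append(f"MISSED {row.resource}")
    return alerts
```

When both a node and an application hosted on it stop sending pulses, `check_heartbeats` reports only the node. Flagging all overdue rows before deciding what to report keeps the result independent of the order in which the rows are scanned.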
Finally, when a system is decommissioned, the custom table can be updated using WebGUI tools, enabling operators to directly action such processes.
The key advantages of using an OMNIbus heartbeat solution, in place of TEC Server, are:
- Performance: The ObjectServer event throughput is significantly higher than that of TEC Server, which reduces the impact of the heartbeat solution on the processing of standard alerts.
- Simplified Implementation: The solution is resilient to server restarts without the need to develop custom code (as was required with the TEC Server global variables). This is because the heartbeat data is maintained in a standard ObjectServer database table. All ObjectServer databases are written to disk at regular intervals, along with replay logs between those checkpoints.
- Resilience: The solution can integrate with the standard OMNIbus ObjectServer fail-over solution.
- Scalability: The OMNIbus heartbeat solution can easily be designed as a multi-level hierarchy, with individual collection-level ObjectServers maintaining local heartbeat data and escalating only missed or clearing heartbeat events, thus reducing the workload on the aggregation ObjectServer.
Orb Data Solution
The Orb Data solution, available from here, includes EIF Probe rules, ObjectServer triggers and procedures, plus a WebGUI tool. Various configuration options are available to customise the behaviour of the solution. The full details are documented in the solution readme, with the binaries available here.