Monday, January 27, 2014

OIM monitoring check-list

Systematic monitoring of OIM deployments helps reduce the risk of both technical and security-related issues. It can also help avoid the performance degradation that data growth causes over time. This post presents a set of topics about OIM and WebLogic monitoring, along with tools that can be used for both monitoring and diagnostics. This list is not intended to replace any official product documentation; instead, it should be used in conjunction with it.

This is another post in the OIM academy series. You can check the complete series here.

OIM Features

  • OIM scheduler: scheduled tasks are an essential feature of OIM. Administrators should check for failed tasks, long-running tasks, and unnecessary tasks that can be disabled, among other things.
  • Open provisioning tasks: when provisioning tasks fail, they are assigned to the system administrator group (unless configured differently) and show up in the open tasks list. The open tasks list should be checked frequently to make sure that tasks are not accumulating. A growing number of open tasks might be a symptom of an environmental problem. OIM is also capable of sending notifications when a task fails, but the task needs to be configured for that.
  • Pending approval tasks: approval tasks that are pending for longer than expected might be a symptom of a problem. For example, notifications may not be going out of OIM/SOA, leaving approvers unaware of their pending tasks. It could also be a symptom of communication problems between SOA and OIM.
  • Non-processed reconciliation events: accumulation of events in ‘Data Received’ or ‘Event Received’ status might be a symptom of a problem. A complete list of event statuses is available here. Administrators should periodically check the reconciliation events to make sure they are being processed correctly.
  • Pending audit events: when the audit event creation rate is higher than the audit event processing rate, events start to accumulate in the AUD_JMS table. The ‘Issue Audit Message’ scheduled task must be properly configured to handle the load. Accumulation of events in the AUD_JMS table can also be a symptom of event processing failure. Administrators should monitor the table growth and space consumption on the database side.
  • Data growth: OIM transaction data will grow over time if proper archival and purge processes are not in place. Make sure that the processes are in place and that they run frequently enough to keep up with the expected data growth. The archival and purge processes, which are documented here, take care of four different types of data:
    • Orchestration: data related to the transactions that happen on users, roles, organizations and provisioning.
    • Request: data related to requests raised in OIM.
    • Reconciliation: data generated by the connectors and the reconciliation engine.
    • Provisioning: data related to the connectors provisioning tasks.
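
Several of the checks above (open tasks, reconciliation events, rows in AUD_JMS) come down to the same question: is a backlog growing from one check to the next? A minimal sketch of that test follows. It assumes the counts are already being collected periodically by some other means; the function name and the four-sample window are illustrative choices, not part of OIM.

```python
from typing import Sequence


def is_accumulating(counts: Sequence[int], min_samples: int = 4) -> bool:
    """Return True when the backlog never shrank and grew overall in the last window.

    `counts` holds periodic samples, oldest first (for example, daily row counts).
    """
    if len(counts) < min_samples:
        return False  # not enough history to call it a trend
    window = counts[-min_samples:]
    never_shrank = all(later >= earlier for earlier, later in zip(window, window[1:]))
    return never_shrank and window[-1] > window[0]
```

A backlog that rises and then drains is normal under load, so the sketch only flags a window in which the count never went down. The window size should be tuned to how often the counts are sampled.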

Tools

  • DMS Metrics: OIM uses the Oracle Dynamic Monitoring Service (DMS) feature to report internal metrics. Many metrics are available through DMS, including average, maximum and minimum execution time for provisioning adapters, client API login, event handlers and scheduled tasks. DMS metrics are accessible through http://admin_server:port/dms, and the WebLogic domain administrator credentials can be used to access them. DMS metrics can, among other things, be used to find bottlenecks in OIM operations. More information about DMS is found here.
  • Diagnostic dashboard: the dashboard is a tool that provides diagnostics for an OIM deployment. It runs as a separate Web application deployed to the OIM server/cluster. It has no considerable performance impact on OIM. Instructions on how to deploy the Diagnostic Dashboard are found here.
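
As a rough illustration of using average execution times to look for bottlenecks, the sketch below ranks operations from slowest to fastest. It assumes the averages have already been copied out of the DMS pages into a dictionary; the operation labels and numbers are made-up sample values, not real DMS metric names or measurements.

```python
from typing import Dict, List, Tuple


def slowest_operations(avg_ms: Dict[str, float], top: int = 3) -> List[Tuple[str, float]]:
    """Return the `top` operations with the highest average execution time."""
    return sorted(avg_ms.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical sample values in milliseconds, for illustration only.
sample = {
    "provisioning adapter": 840.0,
    "client API login": 35.5,
    "event handler": 120.0,
    "scheduled task": 2300.0,
}

for name, avg in slowest_operations(sample, top=2):
    print(f"{name}: {avg} ms on average")
```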

Infrastructure

  • WebLogic resources: it is also important to monitor WebLogic resources:
    • Data sources: are the data sources well sized? The WebLogic console offers a page with data source usage numbers such as the peak number of connections in use, the average number of connections in use, the number of leaked connections, and so on.
    • JMS queues: make sure that the number of pending messages is not growing over time.
    • Cluster: the WebLogic console offers live cluster information, such as how frequently servers drop out of the cluster.
    • Stuck threads: WebLogic is capable of notifying administrators of threads that have been running for longer than a specific threshold. Although WebLogic considers such threads as stuck, it does nothing to address possible issues. Long-running threads might be an indication of problems.
  • JVM and OS: as with any other Java-based application, it is important to monitor the operating system and the JVM resources to make sure that CPU, memory, I/O and other factors are not imposing performance penalties. There are plenty of tools that can be used for that, but this is a subject for another post.
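
The data source numbers mentioned above can be turned into a simple review rule. The sketch below assumes the values were read from the WebLogic console and entered by hand; the class, the field names and the 90% threshold are hypothetical choices for illustration, not WebLogic APIs or recommended values.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataSourceStats:
    name: str          # data source name as shown in the console
    max_capacity: int  # configured maximum number of connections
    peak_in_use: int   # peak number of connections in use
    leaked: int        # number of leaked connections


def review(stats: DataSourceStats, near_full: float = 0.9) -> List[str]:
    """Return findings for one data source; an empty list means nothing stood out."""
    findings = []
    if stats.leaked > 0:
        findings.append(f"{stats.name}: {stats.leaked} leaked connection(s)")
    if stats.peak_in_use >= near_full * stats.max_capacity:
        findings.append(
            f"{stats.name}: peak usage {stats.peak_in_use} is close to the maximum of {stats.max_capacity}"
        )
    return findings
```

For example, review(DataSourceStats("exampleDS", 50, 48, 1)) reports both a leak and a pool running close to its limit, while a pool that peaked at 10 of 50 connections with no leaks yields no findings.
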
This post was created with the help of my colleagues Rob Otto and Pulkit Sharma. Thanks to them for sharing their ideas.
