How to configure OverOps to minimize alert spamming and help prioritize your issues by severity


Summary

After you install OverOps, all the exceptions and errors coming from your application show up in your OverOps dashboard. The amount of information can be overwhelming, which can make it hard to find the events you need to focus on. This best practice describes how, in four steps, you can reduce the noise, prioritize issues, and notify the right teams about critical or important events.

 

Step 1: Model your code using code filters (if needed)

OverOps allows you to blacklist or whitelist packages and classes so that it monitors and tracks the variable state and call stack of the events you are interested in. Properly blacklisting and whitelisting 3rd party code reduces benign noise and ensures that important events are captured.

The call stack tracks the methods and variable state for the event from the location in the code where the error occurred all the way back to its entry point, including the parameters that were passed in. The first method shown is the last method executed within your application's code. Typically, you are only interested in your own code, so by default we blacklist the classes and packages we consider standard 3rd party code (https://doc.overops.com/docs/3rd-party-packages). These classes are not monitored by default and therefore do not show in your stack trace either.

 

It is therefore important that you do one of two things. Either whitelist the packages and classes from our default 3rd party code blacklist that you need to monitor or that help identify entry points into a particular section of code, or blacklist the packages and classes that cause noise or prevent you from seeing the code where the real issues occur.

Keep in mind that OverOps uses a sophisticated algorithm based on unique fingerprints of the bytecode to deduplicate the events reported by the JVM. It provides details such as:

  • Where in the application code the event is coming from
  • How often it’s happening and how many calls are made into that specific code
  • Which transactions in the code are affected and at what rate
  • When it was first seen and last seen
  • On which JVMs and machines
  • In which deployment it was introduced

By blacklisting certain parts of your code, you can further improve deduplication by focusing on the area(s) that are relevant and important.
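To make the idea of deduplication concrete, here is a minimal sketch of folding raw events into one entry per fingerprint while tracking count, first seen, and last seen. This is only an illustration: OverOps fingerprints bytecode, whereas this toy version hashes an error type and a location string, and every name in it (`fingerprint`, `deduplicate`, the `com.shop.*` packages) is hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class EventStats:
    count: int
    first_seen: int
    last_seen: int


def fingerprint(error_type: str, location: str) -> str:
    """Toy fingerprint (assumption): a hash of the error type and code location."""
    return hashlib.sha256(f"{error_type}@{location}".encode()).hexdigest()[:12]


def deduplicate(raw_events):
    """Fold (timestamp, error_type, location) tuples into one entry per fingerprint."""
    stats = {}
    for timestamp, error_type, location in raw_events:
        key = fingerprint(error_type, location)
        if key not in stats:
            stats[key] = EventStats(0, timestamp, timestamp)
        entry = stats[key]
        entry.count += 1
        entry.first_seen = min(entry.first_seen, timestamp)
        entry.last_seen = max(entry.last_seen, timestamp)
    return stats


# Three raw events, two of which are the same underlying issue.
raw = [
    (100, "NullPointerException", "com.shop.payment.Charge.run"),
    (105, "NullPointerException", "com.shop.payment.Charge.run"),
    (110, "TimeoutException", "com.shop.shipping.Quote.fetch"),
]
summary = deduplicate(raw)
```

The three raw events collapse into two entries, which is the effect that keeps the dashboard readable when the same issue fires repeatedly.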

 

In the end, this allows you to structure your views not only by the teams responsible for the code but also by the severity of the issue. Routing, prioritizing, and communicating issues based on these views then becomes easy.

 

For example:

During checkout of an order in your shopping cart app, an error is thrown in your SSO package.

The SSO package is very commonly used and is called from various areas in your application, including the payment package, shipping package, order entry package, etc. In our case the SSO package is not the cause of the problem: the issue occurred further up in the stack but is hidden by the SSO package. Because the error is thrown in the SSO class, you also cannot route the problem to the team that is really responsible, which in our case is the team responsible for the payment processes.

To show where the issue really happens and route it to the responsible team, you need to blacklist the SSO package. The next time the issue occurs, the class / package that comes next in the stack (the payment processes) is shown as the class that threw the error. You can now build a view that shows the issues originating in the payment classes and route it to the correct team. As these issues prevent customers from completing their orders, you would also classify them as having a much higher severity / priority than issues occurring elsewhere in your application.
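The effect of blacklisting in the SSO example can be sketched as follows. This is not OverOps code or configuration, just a simplified model that assumes the event location is the innermost stack frame that is not in a blacklisted package. The package and method names are hypothetical.

```python
def event_location(stack, blacklist):
    """Return the first frame (innermost first) that is not in a blacklisted package."""
    for frame in stack:
        if not any(frame.startswith(prefix + ".") for prefix in blacklist):
            return frame
    return None


# Innermost frame first: the error is thrown inside the SSO package.
stack = [
    "com.shop.sso.TokenValidator.validate",
    "com.shop.payment.PaymentService.charge",
    "com.shop.checkout.CheckoutController.submit",
]

# Without a blacklist the SSO frame is reported as the location.
before = event_location(stack, blacklist=[])

# With the SSO package blacklisted, the payment frame is reported instead.
after = event_location(stack, blacklist=["com.shop.sso"])
```

With the blacklist in place, the same error is attributed to the payment code, so a view filtered on the payment package picks it up.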

 

Another scenario is where errors are thrown in a class but you do not know the actual entry point of the issue, because the entry point is in a third party package that we blacklist by default. As in the earlier example, this makes it difficult to route the issue to the correct team. To show the package as your entry point, you have to whitelist it. Going forward, this exposes the entry point of your error and allows you to route the issue to the correct team.

You might have to turn on “Show 3rd party/utility methods” to see the full stack and determine which 3rd party class you need to whitelist so that it is monitored.
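The whitelist scenario can be modeled in the same simplified way. The sketch below assumes the entry point is the outermost monitored frame and that a whitelist entry overrides the default blacklist. The `org.vendor.scheduler` package stands in for any third party package and is purely hypothetical, as are the function names.

```python
# Hypothetical stand-in for a package on the default 3rd party blacklist.
DEFAULT_BLACKLIST = ["org.vendor.scheduler"]


def is_monitored(frame, blacklist, whitelist):
    """A frame is monitored if it is whitelisted or not blacklisted."""
    def in_any(prefixes):
        return any(frame.startswith(prefix + ".") for prefix in prefixes)

    return in_any(whitelist) or not in_any(blacklist)


def entry_point(stack, blacklist, whitelist=()):
    """Return the outermost monitored frame; the stack is ordered innermost first."""
    for frame in reversed(stack):
        if is_monitored(frame, blacklist, whitelist):
            return frame
    return None


stack = [
    "com.shop.orders.OrderExporter.export",
    "com.shop.common.Retry.call",
    "org.vendor.scheduler.JobRunner.run",
]

# By default the third party frame is skipped, so the real entry point stays hidden.
hidden = entry_point(stack, DEFAULT_BLACKLIST)

# Whitelisting the package exposes it as the entry point.
exposed = entry_point(stack, DEFAULT_BLACKLIST, whitelist=["org.vendor.scheduler"])
```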

OverOps_-_PortUnreachableException.png

 

Step 2: Set up views for the relevant classifications

Now that you have modeled your code using these code filters, you can set up views accordingly.

In my example the teams are organized by code, so payment package issues go to the respective development team.

Example:  

Views for

  • Payment package issues Priority 1
    Issues which occurred in the payment package and have the highest customer / business impact.
  • Payment package issues Priority 2
  • Payment package issues Priority 3
  • etc.
  • Shipping package issues Priority 1
  • etc.
  • Authentication issues Priority 1
  • etc.

 

The important part is to structure these views based on

  • Customer/Business impact
  • Team/person responsible
  • Internal engineering/business processes.

Setting up views in this manner allows you and your engineering teams to quickly see what issues occurred and how to prioritize them for resolution. You define what counts as a priority 1 or a priority 2 based on where the issue occurs in your code or on the fixed or relative thresholds you set for your code in step 3 below.
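As a rough mental model, a view can be thought of as a saved filter with an owning team. The sketch below assumes views are matched on the package of the event location. The view names follow the example list above, while the package prefixes, team names, and the `views_for` helper are hypothetical and not part of OverOps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    name: str
    location_prefix: str  # package the event location must fall in (assumption)
    team: str             # team that owns the view (assumption)


VIEWS = [
    View("Payment package issues Priority 1", "com.shop.payment", "payments"),
    View("Shipping package issues Priority 1", "com.shop.shipping", "shipping"),
    View("Authentication issues Priority 1", "com.shop.sso", "identity"),
]


def views_for(location):
    """Return every view whose package filter matches the event location."""
    return [view for view in VIEWS if location.startswith(view.location_prefix + ".")]


matched = views_for("com.shop.payment.PaymentService.charge")
unmatched = views_for("com.shop.search.Index.query")
```

An event that matches no view is a hint that either a view is missing or the code filters from step 1 need another pass.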

 

How do I set up views:

In your dashboard, filter the events by Entry Point, Event Location, and Event, and save the result as a View. Details can be found here: https://doc.overops.com/docs/the-event-list#section-sorting-and-filtering-the-grid

All_Exceptions___OverOps.png

If some columns are not shown, click the + sign in the header of the grid view on the right of the screen and add or remove the columns that are relevant to you.

 

Step 3: Create alerting thresholds (1%, 5%, new error, ...)

Now that you have your views set up, the next step is to set up the alerting thresholds, which essentially identify the priorities and severities of the issues by view. Make sure your views represent the priority or severity of the issues collected in them. Learn more about setting up passive and dynamic thresholds in this article.
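The logic behind a relative threshold can be sketched in a few lines. This is an illustration only, assuming an alert fires for any new error or when the failure rate of a view reaches its threshold. The `should_alert` function and its parameters are hypothetical; in OverOps the thresholds are configured on the view itself.

```python
def should_alert(failures, calls, threshold, is_new=False):
    """Alert on any new error, or when the failure rate reaches the threshold."""
    if is_new:
        return True
    if calls == 0:
        return False  # no traffic, so no rate to compare
    return failures / calls >= threshold


# 3 failures in 200 calls is a 1.5% failure rate.
strict = should_alert(3, 200, threshold=0.01)   # 1% threshold
relaxed = should_alert(3, 200, threshold=0.05)  # 5% threshold
first_time = should_alert(1, 10_000, threshold=0.05, is_new=True)
```

The same event rate alerts under the 1% threshold but not under the 5% one, which is why a tighter threshold suits a higher priority view.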

 

Step 4: Set up routing rules - email, Slack, etc.

The last step is to set up your routing rules to notify the respective teams via the selected communication channels, such as email, Slack, Jira, or others. Details can be found in our documentation.

 

We strongly recommend first routing the events to a less structured application like Slack or email and fine-tuning the filters and thresholds to improve the signal-to-noise ratio. When you feel comfortable that you have achieved a “clean” alert channel, we recommend routing the events to a more structured system like PagerDuty or Jira, and improving your team's detection / identification metrics.

Finally, we highly recommend sharing the information with your NOC / DevOps teams. You can do that by creating views within OverOps, or you can publish the metrics to dashboards via Grafana, Splunk, or AppDynamics. To learn more about publishing metrics, please see here.
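The two-phase rollout described above can be pictured as a routing table with a tuning channel and a final channel per view. The sketch below is a plain data model, not an OverOps or vendor API. The channel strings, the `ROUTES` table, and the `route` helper are hypothetical.

```python
# Hypothetical routing table: one tuning channel and one final channel per view.
ROUTES = {
    "Payment package issues Priority 1": {
        "tuning": "slack:#payments-alerts",
        "clean": "pagerduty:payments-oncall",
    },
    "Shipping package issues Priority 1": {
        "tuning": "email:shipping-team@example.com",
        "clean": "jira:SHIP",
    },
}


def route(view_name, tuned):
    """Pick the channel for a view, or None if no rule exists for it."""
    rule = ROUTES.get(view_name)
    if rule is None:
        return None
    return rule["clean"] if tuned else rule["tuning"]


# While filters and thresholds are still being tuned, alerts go to Slack.
during_tuning = route("Payment package issues Priority 1", tuned=False)

# Once the channel is clean, the same view pages the on-call engineer.
after_tuning = route("Payment package issues Priority 1", tuned=True)
```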



 
