One of the primary objectives of observability is to help you identify errors and exceptions in your application. A great observability solution lets you troubleshoot these errors quickly, so that you can maintain high developer and business velocity.
In previous blog posts, we saw how to set up service maps and monitor different environments with Epsagon. This blog will focus on quickly identifying and troubleshooting errors in your application using distributed tracing with Epsagon.
Identifying and Visualizing Errors in Your Application
Visualizing errors in your application is a key to understanding which errors are important so that you can prioritize troubleshooting them. With Epsagon, you have multiple ways to visualize the errors, such as dashboards, service maps, issues manager, trace search, and our functions screen.
Epsagon’s out-of-the-box dashboards help identify applications that have errors. For example, the Application Overview dashboard gives a summary of the “Top 5 Application Errors”. Also, the Main Dashboard gives the count of errors. You can also create custom dashboards for specific errors of interest. We will see how to create these custom dashboards in the ‘Trace Search’ section.
Another way to visualize your errors is through service maps. The red arrows give a visual representation of errors associated with a service. For example, in the figure below, DynamoDB has a red arrow representing a problematic call with an error rate of more than 0.5%. To better understand the problem, you can click on the relevant node to see more information and jump to traces to explore the exact issue.
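To make the idea of a per-call error rate concrete, here is a minimal Python sketch of how such a rate could be computed for each edge of a service map. It is not Epsagon's implementation: the call records, the function names `edge_error_rates` and `problematic_edges`, and the data shape are all hypothetical, and the 0.5% threshold is simply taken from the example above.

```python
from collections import defaultdict

# Hypothetical call records: (caller, callee, succeeded).
# This shape is an assumption for illustration, not Epsagon's data model.
calls = (
    [("request-processor", "DynamoDB", True)] * 198
    + [("request-processor", "DynamoDB", False)] * 2
    + [("request-processor", "S3", True)] * 100
)


def edge_error_rates(calls):
    """Return {(caller, callee): error_rate} for every observed edge."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for caller, callee, succeeded in calls:
        totals[(caller, callee)] += 1
        if not succeeded:
            errors[(caller, callee)] += 1
    return {edge: errors[edge] / totals[edge] for edge in totals}


def problematic_edges(calls, threshold=0.005):
    """Return the edges whose error rate is above the threshold (0.5% by default)."""
    rates = edge_error_rates(calls)
    return sorted(edge for edge, rate in rates.items() if rate > threshold)


# Edges that would be highlighted in this sketch.
flagged = problematic_edges(calls)
```

In this sample data, 2 of the 200 calls to DynamoDB fail (a 1% error rate), so only that edge would be flagged.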
The Issues Manager screen provides an aggregated view of all the issues in one place. You can also set and assign alerts using this screen.
The Trace Search screen provides a customizable way to search for the specific errors you are interested in. With the trace search, you can set “error = true” together with the filter (such as application, AWS account id, Kubernetes pods, or even custom filters), the timeframe, and the interval you want.
To visualize and track key errors over time, you can click on visualize and add the graph to your custom dashboard. This allows easy access to important error metrics for you and your team.
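Conceptually, an error search combined with a timeframe and an interval amounts to filtering traces and counting them per time bucket. The following Python sketch illustrates that idea only. The trace dictionaries, field names, and the `error_counts` function are hypothetical and do not represent Epsagon's query API.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical traces; the field names are assumptions for illustration.
traces = [
    {"application": "checkout", "error": True, "timestamp": datetime(2021, 1, 1, 10, 2)},
    {"application": "checkout", "error": True, "timestamp": datetime(2021, 1, 1, 10, 7)},
    {"application": "checkout", "error": False, "timestamp": datetime(2021, 1, 1, 10, 8)},
    {"application": "checkout", "error": True, "timestamp": datetime(2021, 1, 1, 10, 21)},
    {"application": "billing", "error": True, "timestamp": datetime(2021, 1, 1, 10, 5)},
    {"application": "checkout", "error": True, "timestamp": datetime(2021, 1, 1, 11, 30)},
]


def error_counts(traces, application, start, end, interval):
    """Count errored traces of one application in [start, end), grouped by interval."""
    counts = Counter()
    for trace in traces:
        if trace["application"] != application or not trace["error"]:
            continue
        if not (start <= trace["timestamp"] < end):
            continue
        # Integer number of whole intervals since the start of the timeframe.
        bucket_index = (trace["timestamp"] - start) // interval
        counts[start + bucket_index * interval] += 1
    return dict(sorted(counts.items()))


series = error_counts(
    traces,
    application="checkout",
    start=datetime(2021, 1, 1, 10, 0),
    end=datetime(2021, 1, 1, 11, 0),
    interval=timedelta(minutes=15),
)
```

The resulting `series` maps the start of each 15-minute bucket to its error count, which is the kind of data a dashboard graph of errors over time would plot.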
If you want to learn more about this feature, check out this recently published blog post on how we improved Epsagon’s Trace Search capabilities using our own platform.
Specifically, for AWS Lambda, you can use the Function screen to see all the events of interest. You can see the Errors, Timeouts, and OOM events associated with your functions.
Troubleshooting Application Errors Using Correlation
Now that you can detect and visualize errors in your application, let’s see how you can troubleshoot them. Epsagon lets you correlate events, traces, metrics, and logs. Using this correlation, you can troubleshoot errors quickly. In the following example, you find out about an error using the service maps screen.
Let’s focus on the Request Processor Lambda, which has errors. Hovering over that microservice shows you all the relevant metrics associated with the Lambda.
Here you can see the success-error count over time. Clicking on Traces lets you see all the traces associated with the Lambda function. In the trace search view, you can see all the successful and errored traces associated with it.
Clicking on the relevant errored trace lets you focus on what exactly happened during a particular transaction.
Here, you can see the logs associated with the trace, the metadata for the trace, labels, and the payload information. Getting the payload data associated with the trace while troubleshooting is unique to Epsagon and greatly helps in troubleshooting. You can also create alerts based on payload data.
Using correlated logs, you can quickly identify why the trace was unsuccessful. Within a few clicks, you know the root cause of the issue.
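The underlying idea of correlating logs with traces can be sketched as a join on a shared trace identifier. The Python example below is an illustration under that assumption only: the `trace_id` field, the record shapes, and the function names are hypothetical and are not taken from Epsagon.

```python
from collections import defaultdict

# Hypothetical traces and log entries that share a trace identifier.
traces = [
    {"trace_id": "t-1", "service": "request-processor", "error": False},
    {"trace_id": "t-2", "service": "request-processor", "error": True},
]
logs = [
    {"trace_id": "t-1", "level": "INFO", "message": "request handled"},
    {"trace_id": "t-2", "level": "INFO", "message": "request received"},
    {"trace_id": "t-2", "level": "ERROR", "message": "item not found in table"},
]


def logs_by_trace(logs):
    """Group log entries by the trace they belong to."""
    grouped = defaultdict(list)
    for entry in logs:
        grouped[entry["trace_id"]].append(entry)
    return grouped


def error_messages_for_failed_traces(traces, logs):
    """For every errored trace, collect the ERROR-level log messages tied to it."""
    grouped = logs_by_trace(logs)
    return {
        trace["trace_id"]: [
            entry["message"]
            for entry in grouped.get(trace["trace_id"], [])
            if entry["level"] == "ERROR"
        ]
        for trace in traces
        if trace["error"]
    }


root_causes = error_messages_for_failed_traces(traces, logs)
```

With the logs grouped per trace, the error message for a failed transaction is available directly from the trace, without a separate search through the log stream.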
Similarly, in Kubernetes, you can easily correlate MELT (Metrics, Events, Logs, and Traces). The example below shows pod metrics correlated with traces.
With Epsagon, you can also use dashboard visualizations to track error and exception trends over time and then jump directly to the filtered trace-search view for a deep dive into what exactly happened.
You can also create custom errors. For more information, refer to Epsagon docs.
Thus, Epsagon’s unique ability to correlate metrics, logs, and traces helps you with troubleshooting these application errors quickly.
Alerting Based on Errors
Creating alerts based on errors helps you focus on the important errors. Using the Alerts screen, you can see all the alerts in one place, including who created the alert, what types of errors are included in the alert, and which channels are used for alerting. Epsagon integrates with popular alerting applications such as Slack, PagerDuty, and Teams.
You can create alerts by clicking on the “Create New Alert” button on the top right. Alerts can be created based on the environment (serverless or Kubernetes), the entity (application, Lambda function, or AWS tag), and the channel of your choice (Slack, Teams, etc.).
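As a rough mental model, an error-based alert pairs an entity with a threshold and a notification channel. The Python sketch below shows one way such a rule could be evaluated. It is a simplified illustration, not Epsagon's alerting engine: the `AlertRule` class, its fields, and the `evaluate` function are hypothetical, and it only builds a message instead of calling a real Slack or Teams integration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert definition; the field names are assumptions."""
    name: str
    entity: str           # e.g. an application or Lambda function name
    error_threshold: int  # fire when the error count exceeds this value
    channel: str          # e.g. "slack" or "teams"


def evaluate(rule: AlertRule, error_counts: Dict[str, int]) -> Optional[str]:
    """Return a notification message if the rule fires, otherwise None."""
    count = error_counts.get(rule.entity, 0)
    if count > rule.error_threshold:
        return (
            f"[{rule.channel}] {rule.name}: {rule.entity} had {count} errors "
            f"(threshold {rule.error_threshold})"
        )
    return None


rule = AlertRule(
    name="High error count",
    entity="request-processor",
    error_threshold=5,
    channel="slack",
)
message = evaluate(rule, {"request-processor": 7})
```

Keeping the rule as plain data separate from the evaluation step makes it straightforward to list the same rule definitions in one place, as the Alerts screen does.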
Conclusion and Next Steps
Epsagon is a complete observability solution that specializes in troubleshooting application errors and exceptions in your microservices-based architecture. By combining the tools Epsagon provides to correlate events, traces, metrics, and logs, you get an integrated process for error detection, analysis, and resolution.
Companies such as Vonage have used Epsagon to reduce their mean time to detection and resolution (MTTD/R) by at least 25%. Epsagon uses distributed tracing in a unique way to help you understand and quickly resolve errors.
To take the next step, start your Epsagon 14-day free trial.