About the author: Amit Lichtenberg is a Software Engineer at Epsagon.

Dogfooding, aka “eating your own dog food”, is the common practice of organizations using their own products in real-world scenarios. Organizations that do it right enjoy the advantages of early-stage use-case validation, bug hunting, and feedback. Developers who actively use their own product gain confidence in it and a deeper understanding of their clients.

Dogfooding is something we at the Epsagon team pride ourselves on. Using our observability, troubleshooting, and monitoring tools on our own stack should come naturally. Indeed, we use Epsagon on Epsagon on a daily basis, helping us develop a better, high-quality product, faster. 

In this blog post, we explore real-life high-throughput data ingestion issues, observed while testing recent enhancements to our trace indexing engine. We demonstrate how, using a divide-and-conquer approach backed by Epsagon’s own capabilities, we managed to tackle and resolve these issues, providing a better trace search experience for our customers. 

Epsagon Trace Search Explained

Trace search and visualization are central features in Epsagon. A trace is a single action that is recorded by Epsagon, such as a Lambda function invocation, a database write, or a Kinesis stream put-record operation. Epsagon records these actions seamlessly, by instrumenting your code with our agent, providing out-of-the-box visibility with no code changes required. 

Figure 1: a single invocation of blog-site-app-prod-Request-Processor, its payload data, and the chain of events leading to and from it

To provide rich event search and visualization, Epsagon supports indexing any field within the trace data. Indexed trace labels can be created from the Epsagon UI or using our instrumentation libraries’ APIs. Epsagon additionally comes with a set of built-in labels for frequent use cases. Labels can be used to index trace payload data, such as invocation arguments, to add context, such as custom error reasons, or even for business-oriented metrics, such as the total number of transactions. The combination of custom, instrumentation, and built-in labels enables filtering traces by pretty much anything.
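To make this concrete, a label key can address a field nested anywhere in the trace payload. Below is a minimal sketch of such a lookup, assuming a dotted-path key convention; the payload shape and the helper name are our illustration, not Epsagon’s actual trace format:

```python
from typing import Any, Dict, Optional


def lookup_label_value(full_trace: Dict, key: str) -> Optional[Any]:
    """Resolve a dotted label key (e.g. "http.status_code") against
    the nested trace payload; return None if any path segment is missing."""
    node: Any = full_trace
    for segment in key.split("."):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


# Hypothetical payload shape, for illustration only
trace = {"trace_id": "t-1", "http": {"status_code": 200, "response": {"body": "ok"}}}
```

With a convention like this, filtering on http.status_code is simply a matter of resolving the dotted path against each trace’s payload.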

Labels are also a powerful tool for visualizing trends over time, as well as for alerting based on trace metrics. They give our customers the power to identify issues, alert on them, analyze them, and ultimately solve them.

Figure 2: Average duration of blog-site-app-prod-Request-Processor invocations, grouped by user ID custom label

Epsagon’s trace search is implemented using Elasticsearch – a full-text search and analytics engine, typically used for log processing, text indexing, and more. AWS provides a managed Elasticsearch service, which makes it easy to integrate with serverless application stacks. At Epsagon, every trace is processed by our trace-search-ingest Lambda function, which extracts fields that require indexing and creates a minimal trace document. This document is indexed to Elasticsearch along with the trace ID. The full trace content is stored in Amazon RDS, indexed by its trace ID. Trace search filters are converted to Elasticsearch queries, which return IDs of traces that match those filters. To load the full trace data, the trace document is read from RDS. Visualizations and alerts are implemented using Elasticsearch aggregations.
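As a rough illustration of that last step, search filters can be folded into an Elasticsearch bool query that returns only trace IDs; the {field, op, value} filter shape below is our assumption, not Epsagon’s actual internal format:

```python
from typing import Dict, List


def filters_to_es_query(filters: List[Dict]) -> Dict:
    """Translate simple {field, op, value} filters into an Elasticsearch
    bool query. Only trace IDs are needed from the response, so _source
    is restricted accordingly; the full trace is then loaded from RDS."""
    must = []
    for f in filters:
        if f["op"] == "eq":
            must.append({"term": {f["field"]: f["value"]}})
        elif f["op"] == "gt":
            must.append({"range": {f["field"]: {"gt": f["value"]}}})
    return {"query": {"bool": {"must": must}}, "_source": ["id"]}
```

The resulting query body can be passed as-is to an Elasticsearch search request.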

Figure 3: trace-search high-level architecture

Applied Dogfooding and Epsagon-Meta

Dogfooding has been part of the Epsagon mindset from early on. Since Epsagon was initially designed to monitor AWS-based apps, we built most of our application stack on AWS serverless services. Some other parts of our stack run on AWS Fargate, which deepens our understanding of container-monitoring use cases (EKS and ECS). Finally, we designed our own Epsagon-over-Epsagon, which we nicknamed ‘Epsagon-Meta’ or just ‘meta’.

Meta is a complete copy of our production stack, deployed on a dedicated AWS account. Our production, staging, and development environments are all monitored by meta, using the same instrumentation libraries as used by our customers. Meta is fully maintained, deployed as frequently as our production environment, monitored closely, and operational at all times. 

Maintaining meta is not easy, though. It takes development time, operational cost, monitoring effort, and above all a willingness to commit. But it is worth it. We use meta all the time for debugging issues, developing and testing our features, identifying performance bottlenecks, monitoring for production issues, and even just providing cool visualizations and dashboards for our system.

Figure 4: Trace ingestion trends, visualized on meta using Epsagon custom dashboard

Perhaps one of the most important benefits of meta is the constant stream of real-life feature requests initiated by our own team. In turn, these requests yield exciting new features, which are rolled out to meta along with production. This positive feedback loop is amplified by our engineers’ passion for improving their own code. By using our own system, we are able to understand our customers better and provide a better product for them.

Improving Trace Search Using Meta

Trace search is a feature we are constantly improving and extending, as it is such a central part of our system. Support for aliases, indexing of lists and key-value mappings, and indexing of new payload types are just some of the features added to trace search in recent releases.

Every trace search feature is thoroughly tested using unit tests and end-to-end scenarios. We additionally test all of our features against real-life scenarios, which in this case posed a challenge in both scale and variety. The production trace ingestion pipeline handles millions of traces from our various customers every hour. During these real-life tests, we started experiencing some seemingly related issues: a few traces were missing some of their labels, while others were not indexed at all.

To understand the scope of the issue better, let us check out a simplified version of our trace-search-ingest Python Lambda:

from typing import Dict, List
import logging

import epsagon

# Helpers (custom_indexed_tags, get_tag_value, cast_tag_value,
# bulk_index, and the exception types) are elided for brevity.


def get_minimal_trace(full_trace: Dict) -> Dict:
    """ extract minimal indexed trace content from full trace """
    minimal_trace = {"id": full_trace["trace_id"]}
    for indexed_tag in custom_indexed_tags:
        value = get_tag_value(full_trace, indexed_tag.key)
        if indexed_tag.expected_type:
            try:
                value = cast_tag_value(value, indexed_tag.expected_type)
            except TypeCastException:
                logging.exception("Failed casting tag type")
                epsagon.error("Failed casting tag type")
                epsagon.label("indexed_tag", indexed_tag.key)
                epsagon.label("indexed_tag_type", indexed_tag.expected_type)
                epsagon.label("full_trace", full_trace)
        if value:
            minimal_trace[indexed_tag.alias] = value
    return minimal_trace


def ingest_traces(full_traces: List[Dict], index_name: str) -> None:
    """ trace ingestion logic: ingest and index minimal traces """
    minimal_traces = [get_minimal_trace(full_trace) for full_trace in full_traces]
    try:
        bulk_index(index_name, minimal_traces)
    except BulkIndexError as e:
        logging.exception("Failed bulk indexing minimal traces")
        epsagon.label("error_reason", e.reason)

This code is executed as an AWS Lambda function. Every ingested trace is mapped to its target index based on the customer it came from and its timestamp. Next, get_minimal_trace extracts the minimal trace by fetching the values of the custom indexed tags and converting them to the expected value type (when it is known). Finally, ingest_traces bulk indexes those traces to Elasticsearch, catching and reporting any errors in the process.

Going back to our indexing issues, we could have started analyzing them by exploring the CloudWatch logs, beginning with the exception tracebacks. But anyone who has ever had the pleasure of working with CloudWatch logs knows what a nightmare it can become, as Lambda function invocations are scattered across log streams and log groups. Extracting the full context of a single error reported to CloudWatch logs is oftentimes excruciating. Deriving insights, such as the number of occurrences of a particular issue, is practically impossible. All of this made debugging through the massive influx of ingested traces feel like looking for a needle in a haystack. We needed a tool to help us collect, classify, and analyze these real-life issues.

Luckily, we have Epsagon-Meta. The trace-search-ingest function is instrumented using epsagon-python and reports to meta, just like the rest of our stack. The code lines beginning with epsagon. indicate uses of the Epsagon API, added and modified while analyzing the issues in question:

  • epsagon.error marks the trace with a custom error message.
  • epsagon.label labels the trace with any custom label. We use it to label the traces with informative contextual data, such as the indexed tag’s key and the full trace content. 

After adding our labels and error marks, we applied a divide-and-conquer approach to our data. We first visualized trace error trends to classify them by categories, measuring the impact of each category.

Figure 5: Ingest errors grouped by indexed tag key (visualized using Epsagon-Meta)


Figure 6: ingest errors grouped by indexed tag type (visualized using Epsagon-Meta)

Next, we picked some categories with a significant impact on the total error rate and deep-dived into them using trace search and the full trace context. Here are some of the issues we were able to eliminate.

In this first scenario, we identified that http.status_code labels were failing due to type-cast errors. We traced the issue back to some of our older instrumentation libraries, where HTTP status codes were reported as strings (“200 OK”) rather than integers (200). We chose to add a fix to our ingestion pipeline that converts all HTTP status codes to integers. This almost trivial five-line fix resolved potential indexing issues on more than 20K traces per hour.
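The actual fix isn’t shown in the post, but a conversion along these lines would do the job (the helper name is ours):

```python
import re
from typing import Optional, Union


def normalize_status_code(value: Union[int, str, None]) -> Optional[int]:
    """Coerce HTTP status codes reported as strings ("200 OK") or
    integers (200) into a plain integer, or None if no code is found."""
    if isinstance(value, int):
        return value
    if isinstance(value, str):
        match = re.match(r"\s*(\d{3})", value)
        if match:
            return int(match.group(1))
    return None
```

Applied during ingestion, this makes every http.status_code value conform to the integer type expected by the index.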

Figure 7: http.status_code indexing error trace (visualized using Epsagon-Meta). Note how the full context, traceback, error reason, indexed tag & its type all appear on a single screen, helping us solve this issue without ever leaving Epsagon.

In the next example, we analyzed errors in indexing dictionary (key-value mappings) values. Up to this point, dictionary-value indices were not officially supported by Epsagon. Seeing these issues helped us measure the impact of adding this feature and the use cases for it, and indeed, indexing dictionaries as JSON-serialized strings is now officially supported. 
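The idea can be sketched in a few lines (the helper name is ours): dictionary values are serialized to deterministic JSON strings before indexing, so they fit a plain string-typed field:

```python
import json
from typing import Any


def serialize_dict_value(value: Any) -> Any:
    """Serialize dictionary values to JSON strings so they can be
    indexed as plain strings; other value types pass through unchanged.
    sort_keys keeps the serialization deterministic."""
    if isinstance(value, dict):
        return json.dumps(value, sort_keys=True)
    return value
```

The serialized form remains searchable as text, which covers the common use case of filtering traces by payload content.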

Figure 8: Details of a trace causing indexing error over the tag http.response.body (visualized using Epsagon-Meta).

The last case is particularly interesting. In some scenarios, we experienced bulk index errors, where complete traces failed to index to Elasticsearch, making them unavailable for search. Using Epsagon labels and visualizations on meta, along with the full trace context and data, we traced these issues to the way we use Elasticsearch.

Elasticsearch is essentially a schemaless database, providing a lot of flexibility in data indexing. To store and search documents efficiently, Elasticsearch uses index-level mappings, which define the data type of every field, whether or not it should be searchable, the analyzers applied to it, and more. Epsagon uses dynamic index mappings, which automatically infer field types based on the first indexed document that contains them. This allows the flexibility of user-defined labels where the value type is not known a priori. But what happens when future documents contain values which conflict with that first type?
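To make the failure mode concrete, here is a small simulation of strict dynamic mapping; this is a rough approximation of Elasticsearch’s behavior, not its actual implementation:

```python
from typing import Any, Dict


def infer_es_type(value: Any) -> str:
    """Rough approximation of Elasticsearch dynamic type inference."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "text"


def index_document(mapping: Dict[str, str], doc: Dict[str, Any]) -> bool:
    """Simulate strict dynamic mapping: the first document to use a
    field fixes its type; later documents whose value conflicts with
    that type are rejected, as Elasticsearch does by default."""
    for field, value in doc.items():
        inferred = infer_es_type(value)
        if field not in mapping:
            mapping[field] = inferred  # first use fixes the field type
        elif mapping[field] != inferred:
            return False  # mapper exception: the document is dropped
    return True
```

Once the first trace reports error_code as an integer, every later trace that reports it as a string is rejected wholesale, which is exactly the bulk index failure we observed.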

This turned out to be a valid use case, where different portions of a monitored application reported semantically different values under the same tag – e.g. error codes were sometimes reported as strings and other times as integers. Elasticsearch mappings are strict by default, so documents containing values conflicting with the existing mappings were dropped with errors. 

Dealing with conflicting mapping types in Elasticsearch is an interesting topic with multiple solution approaches, which is beyond the scope of this blog. We chose to apply a simple solution using the ignore_malformed mapping parameter. This way we could index the full traces while ignoring only the specific conflicting fields. By applying this very minimal fix, we were able to solve 99% of Elasticsearch bulk index failures.
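For illustration, ignore_malformed is set per field in the index mapping; a mapping body along these lines (the field name is hypothetical) keeps documents whose error_code is not numeric, dropping only that one field instead of the whole document:

```python
# Mapping body one might PUT to an index: with ignore_malformed, a
# document carrying a non-numeric error_code is still indexed, and
# only the offending field is left out of the index.
mapping_body = {
    "properties": {
        "error_code": {
            "type": "long",
            "ignore_malformed": True,
        }
    }
}
```

The same behavior can also be enabled index-wide via the index.mapping.ignore_malformed setting.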

Figure 9: Bulk index errors before and after adding the ignore_malformed parameter; the parameter was applied around UTC midnight, which is 22:00-23:00 in this figure (visualized using Epsagon-Meta).

Wrapping Up

Analyzing the root cause of issues in a high-throughput, dynamic environment can be difficult and time-consuming. Applying a divide-and-conquer methodology by labeling, classifying, deep-diving, solving, and re-iterating makes it simpler. Epsagon provides some powerful tools in the process, such as labels, visualization, and full-context trace search. 

This was just one example of how we use Epsagon for our own real-life use cases. Some other examples include Epsagon alerts, which we use to detect production issues before they even happen, and custom dashboards which are used to visualize and understand our own data pipelines better.

Many consider dogfooding a common best practice. But the truth about dogfooding is this: everybody talks about it, only some actually do it, and very few do it well. For an organization to dogfood in real-life scenarios requires a lot of effort, discipline, and complete trust in the architecture. Gaining insights from it requires being all-in and making it part of your team’s day-to-day routine. At Epsagon, we do just that, and the results are evident. We are able to constantly improve both our product and the underlying architecture while learning more about our customers’ wants and needs every day.

Try Epsagon free for 14 days >>

Read More:

How to Troubleshoot API Errors with Epsagon

Epsagon Announces New AWS Lambda Extension

Epsagon Operator Simplifies Kubernetes Monitoring