This post was originally published on Medium by Luther.ai’s Lead Engineer.
At Luther.ai, we run all our core real-time pipelines on the AWS serverless stack and Kubernetes, with data-driven execution across AWS services such as ECS, Lambda, SQS, and Fargate.
With hundreds of services and thousands of invocations each day, configuration, monitoring, log review, and latency measurement all become significantly complex. For configuration and CI/CD, we use the Serverless Framework to package and deploy AWS Lambda functions.
We tried multiple monitoring solutions, including relying solely on AWS-native options. However, our scale raised a number of issues and requirements, including:
- Multiple programming languages being used for AWS Lambda development.
- Containers / Tasks being used in ECS with EC2 and ECS with Fargate.
- External service access (outbound API calls) review, including unauthorized calls.
- Persistent storage access review and latency measurements.
- Unified access to execution logs, searchable by full text and by time range.
- Contextualization of the service-based view.
- Proactive notifications to places in which we work — Slack, PagerDuty, etc.
- History of events in a searchable format.
Along with the specifics above, the development team wanted to focus on the serverless functions themselves rather than on growing their monitoring footprint, which caused many worries for our on-call DevOps engineers.
After reviewing many services that could help us solve the issues listed above, we decided to implement Epsagon.
Below is the journey of how we saved hours and hours on our serverless implementation. Let us break it into installation, monitoring, latency measurements, and notifications:
Installation / Onboarding: If you develop serverless applications on AWS and are not using Lambda layers, you are missing a core feature that helps a lot. Auto-tracing for the Lambda functions in your AWS region is enabled through a simple workflow built on Lambda layers. This removes the per-language dependency work otherwise needed to enable monitoring across functions written in multiple programming languages (including custom logic per serverless function).
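To make the layer-based workflow concrete, here is a minimal Python sketch of attaching a tracing layer to a set of functions. It is an illustration under assumptions, not Epsagon's actual onboarding: the helper names (`layer_name`, `merge_layers`, `attach_layer`) are hypothetical, the layer ARN is a placeholder you would supply yourself, and `client` is assumed to behave like `boto3.client("lambda")`.

```python
from typing import Iterable, List

MAX_LAYERS = 5  # AWS Lambda allows at most five layers per function


def layer_name(arn: str) -> str:
    """Strip the trailing version number from a layer version ARN."""
    return arn.rsplit(":", 1)[0]


def merge_layers(existing: Iterable[str], tracing_arn: str) -> List[str]:
    """Return the layer list with tracing_arn present exactly once.

    If the exact ARN is already attached, the list is returned unchanged.
    An older version of the same layer is dropped and the new one appended.
    """
    current = list(existing)
    if tracing_arn in current:
        return current
    target = layer_name(tracing_arn)
    merged = [arn for arn in current if layer_name(arn) != target] + [tracing_arn]
    if len(merged) > MAX_LAYERS:
        raise ValueError(f"function would exceed {MAX_LAYERS} layers")
    return merged


def attach_layer(client, function_names: Iterable[str], tracing_arn: str) -> List[str]:
    """Attach the layer to each function and return the names that changed.

    `client` is assumed to expose get_function_configuration and
    update_function_configuration like the boto3 Lambda client.
    """
    changed = []
    for name in function_names:
        config = client.get_function_configuration(FunctionName=name)
        current = [layer["Arn"] for layer in config.get("Layers", [])]
        wanted = merge_layers(current, tracing_arn)
        if wanted != current:
            client.update_function_configuration(FunctionName=name, Layers=wanted)
            changed.append(name)
    return changed
```

Because the layer list is computed separately from the API call, the same merge logic applies to functions in any runtime, which is the point of using layers instead of per-language dependencies.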
Monitoring: Once auto-tracing is enabled, proactive monitoring helps you with alerts and notifications. We use the native Slack and email integrations to receive notifications. Each notification includes a contextual link to the alert within the CloudWatch log, the start time of the Lambda execution, and the service map of all the services used (external API calls, inbound and outbound AWS services).
If you're interested in patterns like I am, you can use the historical view (available for the last 7 days) to understand scenarios such as issues introduced by the last deployment, problems tied to a specific user action, or scalability issues (we uncovered an interesting issue using historical patterns, but that is for a future blog post).
Latency measurement: With hundreds of services and thousands of executions every day, even a couple of milliseconds of execution time added to one service in the real-time pipeline can result in a bad user experience. With Epsagon, we can isolate latency issues efficiently from several angles:
- As in the picture above, a unified view across all the functions is available with the Average Execution duration — which is a great start.
- For each Lambda execution, a contextual service map is available with the time duration of the service call(s).
- With the Service Map view, we can review all the services our Lambda functions depend on as a single landscape, with the capability to go back in time and understand the subtleties.
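The kind of comparison these views support can be sketched in a few lines of Python. This is only an illustration of the idea, not how Epsagon computes its metrics: the record shape `(function_name, duration_ms)`, the function names, and the 2 ms tolerance are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Invocation = Tuple[str, float]  # (function name, duration in milliseconds)


def average_duration_ms(invocations: Iterable[Invocation]) -> Dict[str, float]:
    """Average execution duration per function."""
    by_function = defaultdict(list)
    for name, duration_ms in invocations:
        by_function[name].append(duration_ms)
    return {name: mean(values) for name, values in by_function.items()}


def latency_regressions(
    baseline: Iterable[Invocation],
    current: Iterable[Invocation],
    tolerance_ms: float = 2.0,
) -> List[Tuple[str, float]]:
    """Functions whose average duration grew by more than tolerance_ms.

    Returns (function name, added milliseconds), largest increase first.
    Functions that appear in only one of the two windows are ignored.
    """
    base_avg = average_duration_ms(baseline)
    cur_avg = average_duration_ms(current)
    regressions = [
        (name, cur_avg[name] - base_avg[name])
        for name in cur_avg
        if name in base_avg and cur_avg[name] - base_avg[name] > tolerance_ms
    ]
    return sorted(regressions, key=lambda item: item[1], reverse=True)
```

Comparing a window before a deployment against a window after it is one way to spot the "couple of milliseconds" that creep into a single service in the pipeline.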
Notifications: Using the built-in integrations, we configured extensible alerts, with the PagerDuty integration reserved for core functions. The native Jira integration also lets us file a bug for each issue from the tool itself. The context captured in the bug is a key feature: no more worrying about log capture, issue timing, and so on.
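The routing policy described here (pager for core functions, chat and email for everything) can be written down as a small Python sketch. It is a hypothetical illustration of the policy only: in practice the routing is configured inside the monitoring tool, and the `Alert` type, the channel names, and the example function names are assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Alert:
    function_name: str
    message: str


def channels_for(alert: Alert, core_functions: FrozenSet[str]) -> List[str]:
    """Pick notification channels for an alert.

    Every alert goes to Slack and email; alerts from core functions
    additionally page the on-call engineer first.
    """
    channels = ["slack", "email"]
    if alert.function_name in core_functions:
        channels.insert(0, "pagerduty")
    return channels


# Hypothetical set of functions considered core to the real-time pipeline.
CORE = frozenset({"ingest-events", "score-request"})
```

Keeping the set of core functions small keeps paging rare, so a page signals a problem in the real-time pipeline rather than routine noise.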
Did I say that we self-configured everything from start to finish in a weekend? Yes, we did, for both our dev and prod environments. Here are the resources that have proven handy:
To conclude, with Epsagon, deploying and monitoring serverless workloads with hundreds of services and millions of invocations is no longer a “needle in a haystack” problem.