About the authors: Amit Lichtenberg, Software Engineer, and Hen Peretz, Solutions Architecture Lead, at Epsagon.

On Jan 4th, 2021, Slack experienced one of its most significant outages of all time. Just as its entire US customer base woke up for the first working day of 2021, Slack was hit with more than 3 hours of downtime, paralyzing its systems and APIs worldwide.

The outage and the retrospective that followed it provide a great story about the challenges of using third-party solutions in large-scale production environments. The story is interesting for two reasons. First, Slack is itself a third-party vendor for thousands of companies, providing messaging, monitoring, alerting, and more, all of which became unavailable during the incident. And second, Slack demonstrated an inspiring level of accountability for its (in)stability, even though the root cause was largely a third-party malfunction.

Using third-party solutions is a core principle of modern software architecture. By utilizing services offered by external software vendors, companies are able to move faster while maintaining focus on their own core business offerings. But what happens when a third-party service fails? Can you prepare for it? Detect it? And most importantly, can you keep your production systems operational and your clients happy?

We believe that you can, and Epsagon can help: from understanding those dependencies, through monitoring the stability of your third-party integrations, to troubleshooting incidents caused by third-party solutions if and when they happen.

The Slack New Year Outage: What Happened?

With over 12 million daily active users from over 750K organizations counting on a 99.99% uptime SLA, Slack being down is like cutting the phone lines of a small country. The impact was massive. It’s not just about organizations’ internal communication going down. Even Slack’s APIs, which are used for monitoring, troubleshooting, alerting, and more, were down. Losing these APIs for 3 hours left many operations teams flying blind.

Incident Analysis

On Feb 2nd, 4 weeks after the incident, the Slack team posted their postmortem of the outage. It is strongly recommended reading for anyone interested in the challenges of scalable cloud infrastructure.

Here’s the summary: on the morning of Jan 4th, Americans were just returning from the new year holiday season. Workers started Slacking, services started servicing, alerts started alerting. The post-holiday spike led to massive upscaling on all fronts. While most of Slack’s infrastructure was built to handle this spike, the networking infrastructure underneath it ultimately did not scale well enough.

At a high level, Slack’s network is based on multiple AWS Virtual Private Clouds (VPCs) connected through AWS Transit Gateway (TGW) as a traffic hub. As traffic grew rapidly on Jan 4th, the AWS TGW failed to scale fast enough, causing connectivity issues all over Slack’s network. This led to network saturation, packet loss, and health-check failures.

To top it all off, dashboards and monitoring systems were failing due to the same networking issues. This triggered an unexpected, massive upscaling storm, eventually hitting resource limits and overstressing the entire system (and its engineers). In a heroic, all-hands-on-deck operation, the Slack team managed to weather the storm, stabilizing the network and becoming operational again.

Postmortem

Slack’s postmortem of the outage offers a peek into the challenges of building resilient, cost-effective, large-scale operations. To handle peaks such as the one just after the new year, Slack’s infrastructure is designed to auto-scale quickly. But when a single component failed, everything else quickly started crumbling.

What’s even more interesting is that the scaling capabilities of that single component, the AWS TGW, were in fact out of the Slack engineering team’s hands. To them, the AWS TGW is a third-party black box provided by AWS, which, according to its label, should scale well enough. When it doesn’t, the only thing to do is to escalate the situation to the AWS team (which Slack did). 

But is that the only thing you could do when your dependencies fail?

Impact

Since Slack is itself a service provider, its outage had a great impact on many other services.

We at Epsagon felt that too. For a short time, our own UI was down due to the inaccessibility of the Slack API. One of our UI screens, used to configure alert targets (one of which is Slack), used the Slack API to list our customers’ Slack channels. With the Slack API down, this functionality broke. To make matters worse, we had a hard time communicating the issue within the company, as 99% of our day-to-day communication happens over Slack.

Luckily the issue was small, easy to fix, and had no effect on our production data pipeline. But with Slack being such a popular tool for alerting and monitoring, many other companies were at risk. Losing your alerting and notification capabilities for 3 hours during rush hour can have a huge negative impact.

Learning from Slack’s Accountability

One exceptional thing about Slack’s postmortem blog post is its complete accountability. While the root cause of the meltdown was out of their hands, they focus on how their own infrastructure can and should handle such a situation better, and on the immediate steps they are taking to improve it.

What the Slack engineers so clearly understand is that blaming others for the faults in your product is easy, but ultimately when your product malfunctions, your clients are going to hold you responsible no matter the excuses. Your third-party vendor is not accountable for your ability to deliver. Only you are. 

Troubleshooting Third-Party APIs Using Epsagon

Challenges in Using Third-Party APIs

So should we ditch all of our third-party solutions and start developing everything in-house? Of course not!

Many companies rely on third-party solutions as essential parts of their system – alerting and monitoring tools, infrastructure providers, data pipelining and analytics, and more. Using third-party solutions is a core principle in modern-day software architecture, as it allows companies to focus their time and resources on the business’s core value. Plus, there is no guarantee that your engineers will do any better than your third-party vendors. Sure, AWS TGW failed to scale fast enough in this specific Jan 4th peak, but would an in-house alternative do any better? Maybe, maybe not. 

Ultimately, third-party solutions are an essential part of any product stack. As engineers, it’s important to constantly tackle the crucial and non-trivial questions about our system, especially about those parts of the system that were not even written by our own team – such as third-party APIs.

Do you really know which third-party APIs are being used by your system, and when? How reliant is your system on them? What is their impact on your system and your end users? Who alerts you if they stop working or start sending malformed data that can cause harm? The answer to these questions lies in observability into your applications and the communication they hold with third-party solutions. Without a modern monitoring tool in place, it’s impossible to access this information quickly.

Listing Third-Party APIs Using Trace Search

Epsagon was built by engineers, for engineers, for exactly this purpose. Using Epsagon, you can gain full visibility into and monitoring of your third-party API integrations, and also use their data to build your own business-driven dashboards. It is worth mentioning that a few of the techniques shared below could potentially be applied with other monitoring solutions as well.

To demonstrate how you can use Epsagon to tackle third-party issues, we use a demo application called “blog-side-prod”, which mimics a customer-facing blog site. It consists of a few AWS Lambda functions communicating via asynchronous resources such as AWS SQS and AWS SNS, along with a few common third-party API integrations such as Auth0 and Stripe. This application is part of our demo environment, which you can view freely for a sneak peek of Epsagon.

So first, we need to know which third-party APIs are being used by the application. All of the services in this app eventually call these APIs, either synchronously or asynchronously, using common network communication protocols such as RESTful HTTP. By collecting those requests and looking at them in aggregate, we can list all of the external APIs the app is using.
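Conceptually, the aggregation boils down to grouping captured requests by their HTTP host. Here is a purely illustrative sketch of that idea (not Epsagon’s internals), assuming each captured event records the URL of an outgoing request:

```python
from collections import Counter
from urllib.parse import urlparse

# Illustrative sample of captured outgoing HTTP requests.
captured_requests = [
    {"url": "https://myapp.auth0.com/oauth/token", "status": 200},
    {"url": "https://api.twilio.com/2010-04-01/Messages", "status": 200},
    {"url": "https://myapp.auth0.com/userinfo", "status": 401},
]

# Group outgoing calls by host to list the external APIs in use.
calls_per_host = Counter(urlparse(r["url"]).netloc for r in captured_requests)

for host, count in calls_per_host.most_common():
    print(f"{host}: {count} request(s)")
```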

Epsagon’s tracing library captures those HTTP calls, allowing you to filter for them and count the number of requests per HTTP host using the Epsagon dashboard.
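Enabling this capture is mostly a matter of wrapping your function with Epsagon’s tracing library. Below is a minimal sketch for a Python Lambda; the token, app name, and Twilio URL are placeholders, not values from the demo app:

```python
import epsagon
import requests

# Placeholder token and app name; replace with your own.
epsagon.init(
    token='<your-epsagon-token>',
    app_name='blog-side-prod',
    metadata_only=False,  # capture payloads as well as metadata
)

# The wrapper traces the invocation, including outgoing HTTP calls
# made with common libraries such as requests.
@epsagon.lambda_wrapper
def handler(event, context):
    # Illustrative outbound call (real credentials would be needed);
    # it is captured and attributed to its HTTP host in the dashboard.
    response = requests.get('https://api.twilio.com/2010-04-01/Accounts.json')
    return {'statusCode': response.status_code}
```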

Figure 1: Third-party API metrics on Epsagon

This shows that this app has three major outgoing HTTP calls:

  • API Gateway: used as the front of our AWS Lambdas
  • Auth0: used as an identity provider that authenticates and logs in the blog site’s users
  • Twilio: used to send notifications to our blog readers

Using Epsagon’s Trace Search, we can understand how these APIs behave by observing their request rate, error rate, and latency.

Visualizing Third-Party Connections with Service Map

Now that we’ve listed our third-party APIs, we can start asking more specific questions, such as: which services are using those APIs, and when? For this purpose, we can use Epsagon’s Service Map, a real-time map of our application and all of its underlying services and resources, as well as its third-party APIs.

The Service Map shows exactly which services are calling third-party APIs, the frequency of these calls, and the average latency and error rate.

Figure 2: Service Map in Epsagon

So far, we know which third-party APIs are being used by the application and which services are calling them. We have gained insight into their average latency, and thus into the impact they have on our customer-facing services, and we can see how frequently they are invoked.

Troubleshooting API Issues with Distributed Tracing

What happens when those APIs start to slow down, break, or send malformed data? Answering this takes more than the logs of individual services and the operations they performed. We need to look at a group of services combined, as a single flow or end-to-end request. This is the essence of distributed tracing, which enables you to see the end-to-end flow of a request through your system.
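Epsagon’s library stitches these traces together automatically. To make the mechanics concrete, here is a hand-rolled sketch of the core idea: propagating a shared trace ID through an SQS hop so both sides of an asynchronous flow can be correlated. The queue URL and function names are hypothetical:

```python
import json
import uuid

import boto3

sqs = boto3.client('sqs')

# Hypothetical queue URL, for illustration only.
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/post-analysis'

def publish_with_trace(payload, trace_id=None):
    """Send a message with a trace ID attached, so the consuming
    service can correlate its work with the originating request."""
    trace_id = trace_id or str(uuid.uuid4())
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(payload),
        MessageAttributes={
            'trace_id': {'DataType': 'String', 'StringValue': trace_id},
        },
    )
    return trace_id

def consumer_handler(event, context):
    """SQS-triggered Lambda: read the trace ID back so log lines and
    errors can be tied to the same end-to-end request."""
    for record in event['Records']:
        trace_id = record['messageAttributes']['trace_id']['stringValue']
        body = json.loads(record['body'])
        print(f'[trace {trace_id}] processing post: {body}')
```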

Let’s follow a simple flow on our blog site: creating a new post. The flow is triggered by a customer attempting to create a new blog post. The request-processor service authenticates the customer and sends a notification to the post-analysis service, which creates the new post under the customer’s account in our database.

So what happens when third-party APIs break? Suppose Auth0 fails to authenticate the user. The unauthorized post request is forwarded to the post-analysis lambda, which in turn fails with exceptions when attempting to save the new blog post. In this case, the customer is not even notified of the failure, resulting in a very bad experience.
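One mitigation is for the calling service to validate the third-party response and fail fast, instead of forwarding a bad request downstream. A hypothetical sketch of such a check in the request-processor; the Auth0 tenant URL and handler shape are illustrative, not taken from the demo app’s actual code:

```python
import requests

# Hypothetical Auth0 tenant URL, for illustration only.
AUTH0_USERINFO_URL = 'https://myapp.auth0.com/userinfo'

def process_post_request(event):
    token = event['headers'].get('Authorization', '')
    # Validate with the identity provider *before* enqueueing the post.
    resp = requests.get(
        AUTH0_USERINFO_URL,
        headers={'Authorization': token},
        timeout=3,
    )
    if resp.status_code != 200:
        # Fail fast and tell the customer, instead of letting the
        # post-analysis service crash on an unauthorized request.
        return {'statusCode': 401, 'body': 'Authentication failed'}
    # ... authorized: forward the post for analysis and storage ...
    return {'statusCode': 202, 'body': 'Post accepted'}
```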

Epsagon’s Trace View can help shed light on the operations that took place as part of a single request flow, across multiple services. It shows, in a single pane of glass, both the distributed trace and the data and payloads sent as part of the flow. Here you can view the failed Auth0 request, the trace that contains it, and the resulting error in the post-analysis service, allowing you to understand and debug the issue, all in one view.

Figure 3: A Distributed Trace View in Epsagon

Preparing for Third-Party API Issues

What can we do to proactively prepare for exceptions, latency issues, or malformed data coming from our third-party APIs? 

Epsagon’s trace-based alerts allow us to do just that: get a notification when latency or error-rate thresholds are crossed, or simply when one of our services breaks. We can also filter all the failed requests as they were captured and access their distributed traces for further investigation.
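Alerts catch problems as they happen; on the code side, you can also bound the blast radius of a misbehaving dependency. Here is a generic, Epsagon-agnostic sketch of calling a third-party endpoint with a hard timeout and limited retries:

```python
import time

import requests

def call_third_party(url, retries=2, timeout=3):
    """Call a third-party endpoint with a hard timeout and bounded
    retries, so a slow or failing dependency cannot stall us forever."""
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return resp.json()
        except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
            if attempt == retries:
                raise  # surface the failure so tracing and alerting see it
            time.sleep(2 ** attempt)  # simple exponential backoff
```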

In addition to issue resolution, visibility into the payloads of third-party API calls can provide a lot of interesting insights. For example, if you’re using a SaaS payment provider such as Stripe, then beyond monitoring its behavior and ensuring it’s working properly, you’d probably be interested in monitoring the business aspects of this API: how many transactions were made today? Who are your top-paying customers? How many customers failed to pay? These questions can be easily answered using Epsagon. The automatic payload tagging feature allows you to choose specific fields in the payload and create alerts and dashboards around them.

Figure 4: Custom Dashboard based on third-party API payload in Epsagon
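To give a sense of how payload tagging can feed such a dashboard, here is a minimal sketch based on Epsagon’s label API for Python; the Stripe-style payload fields are hypothetical:

```python
import epsagon

@epsagon.lambda_wrapper
def charge_handler(event, context):
    # Hypothetical Stripe-style payload fields, for illustration only.
    charge = event['charge']

    # Attach business fields from the payload to the trace, so they
    # can drive custom dashboards and alerts.
    epsagon.label('customer_id', charge['customer'])
    epsagon.label('amount_usd', charge['amount'] / 100)
    epsagon.label('payment_status', charge['status'])

    return {'statusCode': 200}
```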

Takeaways

Like it or not, we all depend on third-party solutions in our own product stacks. Sure, you can (and should!) constantly question and test your third-party dependencies. But ultimately, like any software, third-party solutions can break. And when they do, it is up to you to show accountability for your product and deliver the best customer experience possible. For that, having the ability to capture and query data from third-party API integrations is crucial.

Leveraging distributed tracing, service maps, metrics, and alerts, Epsagon provides full observability into third-party APIs, giving you the visibility you need to improve your business and customer experience. To take the next step, start your Epsagon 14-day free trial here.