Infinite Scalability – Analyzing Billions of Events for a Ride Sharing Application

With Ynon Cohen, Development Group Manager at Via

Via is re-engineering public transit, from a regulated system of rigid routes and schedules to a fully dynamic, on-demand network. Via’s mobile app connects multiple passengers who are headed the same way, allowing riders to seamlessly share a premium vehicle. First launched in New York City in 2013, the Via platform operates across the globe in over 50 cities. Via has partnered with public transportation agencies, private transit operators, taxi fleets, private companies, and universities, seamlessly integrating with public transit infrastructure to power cutting-edge on-demand mobility.

Via’s mobile application

Technology Challenges

Via’s users – riders and drivers – are sending billions of events to their backend engine on a monthly basis. Via’s application handles, tracks, and routes cars simultaneously and matches them to riders using sophisticated algorithms, taking many parameters into consideration, such as the traffic in the city, the location and drop-off time, other riders in the city, and more.

Scalability-wise, these events create an extremely high and spiky load on the backend systems. For example, New York City, one of Via’s major operational regions, is particularly busy from 6 AM to 9 AM and from 4 PM to 7 PM, when people are commuting to and from work. Demand can grow up to 20x during rush hours compared to low-demand periods, which requires rapid scaling at massive magnitudes.

Previous Architecture

As a cloud-first company, Via had already built its traditional architecture on AWS. It was based on AWS ELB (Elastic Load Balancer) and standard EC2 servers. When scaling up, the team had to add servers to handle the load, growing from tens to hundreds.

Since Via’s microservices were connected to these ELBs and to one another, the entire system had to scale together: whenever one ELB scaled, the team was forced to scale several others along with it.

“Infinite Scalability”

Via wanted a way to be “infinitely scalable” and to do it fast – “to go from 1x to 200x in no-time”, according to Ynon.

They also wanted a way to split the traditional server into small pieces, each of them highly scalable without strong dependencies between one another – to remove the coupling between different services.

Moving to a Modern Architecture

Via’s backend runs fully on Kubernetes clusters, and the team weighed adding new services on the Kubernetes framework against using serverless Lambda as the infrastructure. Eventually, Via decided to go with Lambda for all new services.

“Less work on DevOps – more work on our business,” said Ynon.

Via started developing new services using AWS Lambda. They also re-wrote some of the existing services using Lambda, breaking large pieces of code into Lambda functions.

Looking back at the last year, 80% of the Lambda code is new. They used the Serverless Framework to manage and deploy their applications.
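With the Serverless Framework, functions and their triggers live together in a single `serverless.yml`. As an illustration only (the service, function, and resource names here are hypothetical, not Via’s actual configuration), a minimal service might look like:

```yaml
# Hypothetical example -- names do not reflect Via's real services
service: ride-events

provider:
  name: aws
  runtime: python3.9
  region: us-east-1

functions:
  matchRider:
    handler: handler.match_rider
    events:
      - sns: rider-requests          # invoked for each message on this SNS topic
  notifyDriver:
    handler: handler.notify_driver
    events:
      - sqs:
          arn: arn:aws:sqs:us-east-1:123456789012:driver-notifications
          batchSize: 10              # up to 10 messages per invocation
```

Running `serverless deploy` packages the handlers and provisions the functions, triggers, and permissions via CloudFormation, which is the kind of DevOps work the team wanted to minimize.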

In terms of cost, they saw an extensive reduction in their cloud costs. “Lambda is significantly cheaper than anything else out there – our total cost dropped drastically,” said Ynon.

New Architecture Challenges

Going serverless-first wasn’t free of problems. The team encountered different challenges, from cold starts in a VPC and concurrency and parameter limits, to dealing with the concept of stateless code, which required a mindset shift.
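The stateless mindset means a function cannot rely on in-memory state surviving between invocations, although objects created outside the handler are reused across warm invocations of the same container. A common pattern, sketched here in Python (the handler and field names are illustrative, not Via’s code), is to create expensive objects once at module load to soften the cold-start cost:

```python
import json
import time

# Created once per container, at cold start, and reused across warm
# invocations. In a real function this would be an expensive client,
# e.g. boto3.client("dynamodb").
_start_time = time.time()
_cache = {}

def handler(event, context=None):
    """Stateless handler: every piece of request data arrives in `event`.
    `_cache` may speed up warm invocations, but the code must never rely
    on it surviving -- the container can be recycled at any time."""
    rider_id = event.get("rider_id")
    if rider_id not in _cache:
        _cache[rider_id] = time.time() - _start_time  # seconds since cold start
    return {"statusCode": 200, "body": json.dumps({"rider_id": rider_id})}
```

Anything that must truly persist between invocations belongs in an external store, not in the container.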

Troubleshooting

The most significant challenge that Via’s team encountered with serverless infrastructure was troubleshooting their applications.

The team put effort into unit-testing their applications, but soon realized that many issues arose not because of a single Lambda function, but rather due to the wiring between the Lambdas, such as an SNS topic which triggers another Lambda.
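That wiring is invisible to a unit test: each function can pass in isolation while the failure lives in the event passing between them. As a hypothetical sketch of the pattern described (function and topic names are illustrative), one Lambda publishes to an SNS topic and a second, subscribed Lambda must unwrap the SNS envelope:

```python
import json

def publish_ride_event(sns_client, topic_arn, ride):
    """First Lambda: publishes a ride event to an SNS topic.
    `sns_client` would be boto3.client("sns") in production."""
    return sns_client.publish(TopicArn=topic_arn, Message=json.dumps(ride))

def downstream_handler(event, context=None):
    """Second Lambda, triggered by the topic. SNS wraps the payload,
    so each message must be unwrapped from event["Records"]."""
    rides = []
    for record in event["Records"]:
        rides.append(json.loads(record["Sns"]["Message"]))
    return rides
```

A bug in the envelope handling, or a topic wired to the wrong function, only surfaces when both ends run together, which is exactly the class of issue unit tests missed.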

Architecture visibility

One of the main challenges of a serverless microservice architecture is knowing “who is talking to who and why.”

A small part of Via’s architecture

Knowing what their backend architecture looks like would provide several benefits:

  • Onboarding new employees.
  • Gaining confidence in their application.
  • Discovering hidden issues – a wrong connection between services, or understanding “why is this call being made?” to an external service such as Mandrill or Twilio.

When it comes to monitoring and troubleshooting modern microservices, nothing beats Epsagon.

Epsagon @ Via

Via started using Epsagon after already running production workloads with billions of events and thousands of Lambda functions involved.

Enabling Epsagon’s instrumentation on Via’s functions immediately mapped out their production architecture. It gave them the confidence to monitor the connectivity mesh of their serverless microservices. They use it regularly to catch common issues, such as Lambda functions that are not connected to anything, or a Lambda that calls itself for no apparent reason.

Monitoring is primarily done via Epsagon’s Slack integration. Since every service has its own Slack channel, issues surface clearly and are easier to handle quickly. When an alert fires, jumping from it to the exact transaction, including all the relevant traces and logs, lets them troubleshoot issues faster than ever.

“When it comes to monitoring and troubleshooting our serverless mesh, we rely on Epsagon,” says Ynon.

Benefits from Using Epsagon

According to Ynon, 50 engineers across different teams use Epsagon. On average, it saves each of them up to half a day every week.

Above all, Epsagon saves them a lot of frustration, which lets them keep building and innovating.

Epsagon cuts Via’s troubleshooting time by 50%, which has a significant impact on their business and customer experience.

Conclusions

To summarize Via’s serverless journey, Ynon emphasized the following considerations:

  • Know the limitations of AWS Lambda.
  • Serverless requires a mindset change when it comes to the management of massive concurrency.
  • Many things happen in parallel – how will Lambdas behave when triggered by an SQS queue holding 50K messages concurrently?
  • It’s difficult to know which events were handled correctly, which were not, and how to fix them. You need to think about this in advance, and a tool like Epsagon helps greatly.
  • “Think nano-services” – it’s easy to take an existing Lambda service and pile more functionality on top, but we found that the right way is usually to write a new service. If the word “and” appears in the service’s name, that’s a good sign it should be broken into two separate services.
  • Define a process for building a new service: 1) write the code, 2) test, 3) deploy, 4) monitor. This way, you won’t get lost every time you deploy a new service.
  • “Start with monitoring in mind” – both technical and business monitoring are critical. Think about post-production from day one.
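One concrete way to know which events in a large SQS batch were handled correctly and which were not is to report partial batch failures, so that Lambda retries only the failed messages. This is a hedged Python sketch (the message shape and handler name are assumptions), following the `batchItemFailures` response format that Lambda’s SQS integration supports when `ReportBatchItemFailures` is enabled on the event source mapping:

```python
import json

def sqs_handler(event, context=None):
    """Processes an SQS batch and returns the IDs of failed messages,
    so Lambda retries only those instead of the whole batch."""
    failures = []
    for record in event["Records"]:
        try:
            body = json.loads(record["body"])
            if "ride_id" not in body:              # hypothetical validation
                raise ValueError("missing ride_id")
            # ... process the ride event ...
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

Without this kind of accounting, one bad message can poison an entire batch of otherwise valid events, which is exactly the “which events were handled correctly?” problem the team describes.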

Using Epsagon, Via continues to grow its modern cloud applications to expand its business. The monitoring, troubleshooting, and visualization technologies that Epsagon provides are key to their teams’ rapid development.