By Nir Soudry, Software Manager at Prospera Technologies
Agricultural systems produce many different types of data, and that data comes from many sources: sensor readings from field devices, qualitative data collected by people, and records from irrigation and ERP systems.
The challenge begins when you want to analyze all this data in order to draw high-impact conclusions.
Prospera Technologies is an agriculture technology data company that develops intelligent solutions for farmers to grow crops more efficiently. The company develops both hardware and software solutions that collect and analyze multi-sensor data with state-of-the-art machine learning algorithms.
The agricultural world is mostly traditional, and as such, various companies use their own information platforms. Data is sometimes collected with legacy systems, and in some cases, only on paper or in spreadsheets. There is no standard. One of the main challenges that Prospera addresses is streaming this highly diverse data to a single location and making sure it's coherent. Simply put, it means orchestrating multiple data pipelines.
To overcome this challenge, Prospera designed and developed a custom in-house data pipeline. Prospera runs its modern application on AWS, and when they had to choose the technology stack for this solution, they decided to take a serverless approach. They designed an architecture that relies on SES, SQS, Kinesis, and AWS Lambda for ingesting and analyzing the data, and on managed services such as S3 and RDS for storing and manipulating the data through the different stages of the pipeline.
Designing a Serverless Data Pipeline
The diagram below shows Prospera’s data pipeline and its different data sources. Data is streamed from dedicated field agents (written by Prospera), existing APIs, emails, files that are manually pushed to S3, scrapers that gather additional data, and more.
Data Pipelines Using Kinesis, S3, SQS, SES, EC2, Lambda, and RDS
Prospera’s data sources include AWS services such as Lambda, S3, and SQS, as well as containerized PC agents.
The data is going through the following stages:
- L0: original source
- L1: stored in S3
- L2: inserted into an RDS database with no additional processing
- L3: cleaned, deduplicated, validated, and standardized – ready to process
- L4: aggregated and enriched across multiple data sets – production ready
Eventually, every piece of data makes its way to an S3 bucket. Then, a dedicated AWS Lambda function is triggered and takes charge of handling that data.
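Such an S3-triggered handler might look roughly like the sketch below. The function and field names are illustrative, not Prospera's actual code; the event structure follows the standard S3 event notification that AWS delivers to Lambda.

```python
import urllib.parse

def parse_s3_records(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    records = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        records.append((bucket, key))
    return records

def handler(event, context):
    """Hypothetical Lambda entry point: route each new object to the next stage."""
    for bucket, key in parse_s3_records(event):
        # In the real pipeline this would fetch the object (e.g. with boto3)
        # and hand it to the stage-specific processing logic.
        print(f"Processing s3://{bucket}/{key}")
    return {"processed": len(event.get("Records", []))}
```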
A simple example of aggregation: calculating the median temperature and humidity every few minutes.
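That kind of windowed median could be sketched as follows; the tuple layout and window size are assumptions for illustration, not Prospera's schema.

```python
from collections import defaultdict
from statistics import median

def aggregate_medians(readings, window_minutes=5):
    """Group sensor readings into fixed time windows and compute per-window medians.

    `readings` is a list of (epoch_seconds, temperature, humidity) tuples.
    Returns {window_index: (median_temperature, median_humidity)}.
    """
    windows = defaultdict(lambda: {"temp": [], "hum": []})
    for ts, temp, hum in readings:
        bucket = ts // (window_minutes * 60)  # fixed-size time bucket
        windows[bucket]["temp"].append(temp)
        windows[bucket]["hum"].append(hum)
    return {
        bucket: (median(vals["temp"]), median(vals["hum"]))
        for bucket, vals in windows.items()
    }
```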
Data Pipeline Layers
Each step in the process is handled by a different Lambda function. Prospera stores the pipelines’ configuration in an RDS database, and each function reads its configuration when it loads. This provides a central configuration store for a large number of functions.
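Loading configuration once per container, at cold start, keeps the RDS round-trip off the hot path. A minimal sketch of that pattern, with `fetch_row` standing in for the actual database query (e.g. via psycopg2) and `PIPELINE_NAME` as a hypothetical environment variable:

```python
import os

_CONFIG_CACHE = None

def load_pipeline_config(fetch_row):
    """Load this function's pipeline configuration once per container.

    `fetch_row` is a callable standing in for the real RDS query; it takes
    a pipeline name and returns a dict of settings. The names here are
    illustrative, not Prospera's actual schema.
    """
    global _CONFIG_CACHE
    if _CONFIG_CACHE is None:
        pipeline = os.environ.get("PIPELINE_NAME", "default")
        _CONFIG_CACHE = fetch_row(pipeline)  # hits the database only on cold start
    return _CONFIG_CACHE
```

Because module-level state survives across warm invocations of the same Lambda container, subsequent calls return the cached configuration without touching RDS.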
Why Prospera Chose Serverless
The main reason for going serverless and relying on AWS Lambda is that it’s fully managed and reliable, and the cloud provider (AWS) is in charge of making sure the code runs as it should. Another advantage is scalability – since many files arrive in S3 at the same time, it’s very convenient to use the built-in scaling of AWS Lambda. Lastly, the architecture makes sense to Prospera because every Lambda function acts as a “barrier” for the data, ensuring it doesn’t move on unless it’s ready to.
How Prospera Makes Sure that Their Data is Up to Date
Prospera developed a data consistency process that runs once an hour. This process accesses the pipelines table and verifies the required validity of each data item. It checks the completeness and latency of the data, and if either fails to meet the requirements, it alerts the engineering team.
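The hourly check might boil down to logic like the following. The per-pipeline fields (`last_update`, `max_latency_sec`, `expected_items`, `received_items`) are a guess at what such a pipelines table could hold, not Prospera's actual schema.

```python
import time

def find_unhealthy_pipelines(pipelines, now=None):
    """Return names of pipelines that miss their completeness or latency targets.

    `pipelines` maps name -> {"last_update": epoch_seconds,
                              "max_latency_sec": int,
                              "expected_items": int,
                              "received_items": int}
    """
    now = time.time() if now is None else now
    unhealthy = []
    for name, p in pipelines.items():
        too_old = now - p["last_update"] > p["max_latency_sec"]      # latency check
        incomplete = p["received_items"] < p["expected_items"]        # completeness check
        if too_old or incomplete:
            unhealthy.append(name)
    return unhealthy
```

An hourly scheduled Lambda (e.g. via a CloudWatch Events rule) could run this and push the unhealthy list to the team's alerting channel.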
How Prospera Uses Epsagon to Troubleshoot Production Issues
Epsagon provides automated observability for troubleshooting complex production issues in serverless applications. Using distributed tracing and AI technologies, Epsagon maps the entire architecture, including AWS Lambda, AWS services such as S3, API Gateway, and databases, and also 3rd party APIs.
Prospera uses Epsagon to monitor the complex pipeline application and gets alerted on any exception, timeout, memory, or performance issue. When something happens, they go into Epsagon’s distributed transactions view to root-cause complex issues in no time.
Prospera’s distributed data pipelines in Epsagon
In addition to troubleshooting issues in distributed applications, Prospera also encountered memory challenges when working with AWS Lambda. Since the amount of data analyzed in a Lambda function can be large, choosing the correct memory limit was a challenge.
Prospera uses Epsagon to identify and investigate these out-of-memory conditions, and they get alerted about them even before they happen.
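A simple in-function check of the same idea, independent of any monitoring product: compare the process's peak memory to the configured Lambda limit and flag invocations that get close. This is an illustrative helper, not Prospera's code; on Linux (the Lambda runtime OS), `ru_maxrss` is reported in kilobytes.

```python
import resource

def memory_headroom(limit_mb, warn_ratio=0.9):
    """Return True while peak RSS is below `warn_ratio` of the memory limit.

    `limit_mb` would come from the Lambda context's memory_limit_in_mb;
    logging a warning when this returns False gives early out-of-memory signal.
    """
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss  # KB on Linux
    return peak_kb / 1024.0 < limit_mb * warn_ratio
```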
Memory Usage of AWS Lambda for Data Processing
As Prospera continues to grow and expand its customer base, more and more data analysis challenges arise. Prospera keeps growing its serverless footprint while designing a scalable data pipeline architecture.
To mitigate the potential observability challenges in their application, Prospera’s engineers use Epsagon to make sure that their data pipelines are healthy, stable, and consistent.