By Nir Soudry, Software Manager at Prospera Technologies
Agricultural systems produce many different types of data from many different sources: sensor readings from field devices, qualitative data collected by people, and data from irrigation and ERP systems.
The challenge begins when you want to analyze all of this data together to draw high-impact conclusions.
Prospera Technologies is an agriculture technology data company that develops intelligent solutions for farmers to grow crops more efficiently. The company develops both hardware and software solutions that collect and analyze multi-sensor data with state-of-the-art machine learning algorithms.
The agricultural world is mostly traditional, and as such, different companies use their own information platforms. Data is sometimes collected with legacy systems, and in some cases only on paper or in spreadsheets. There is no standard. One of the main challenges that Prospera addresses is streaming this highly diverse data to a single location and making sure it’s coherent – simply put, orchestrating multiple data pipelines.
To overcome this challenge, Prospera designed and developed a custom in-house data pipeline. Prospera runs its modern application on AWS, with an architecture that relies on ECS, SES, SQS, Kinesis, SNS, CloudWatch, Elastic Beanstalk, and AWS Lambda for ingesting and analyzing the data, and on managed services such as S3 and RDS for storing and manipulating the data through the different stages of the pipeline.
Designing a Distributed Data Pipeline
The diagram below shows Prospera’s data pipeline, including its different data sources. Data is streamed in from dedicated field agents (written by Prospera), existing APIs, emails, files that are manually pushed to S3, scrapers that collect additional data, and more.
The data is going through the following stages:
- L0: original source
- L1: stored in S3
- L2: inserted into an RDS database with no additional processing
- L3: cleaned, deduplicated, validated, standardized – ready to process
- L4: aggregated and enriched across multiple data sets – production-ready
Eventually, every piece of data makes its way to an S3 bucket.
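The stage progression above can be sketched as a simple promotion step. The stage names and `promote` helper below are illustrative assumptions, not Prospera's actual code; in practice each stage corresponds to a location in S3 or a table in RDS.

```python
# Illustrative sketch of the L0-L4 stage progression; the field
# names and promote() helper are assumptions, not Prospera's code.
ORDER = ["L0", "L1", "L2", "L3", "L4"]

def promote(record: dict) -> dict:
    """Move a record to the next pipeline stage, if one exists."""
    idx = ORDER.index(record["stage"])
    if idx + 1 < len(ORDER):
        record["stage"] = ORDER[idx + 1]
    return record

record = {"source": "field-agent", "stage": "L0",
          "payload": {"temp_c": 21.4}}
record = promote(record)   # now at L1: stored in S3
```

Each promotion only happens once the previous stage's checks have passed, which is what makes every component a gate for the data.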
A simple example of aggregation is calculating the median temperature and humidity every few minutes.
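A minimal sketch of that kind of aggregation, using only the Python standard library (the five-minute bucket size and field names are illustrative):

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

def median_by_bucket(readings, minutes=5):
    """Group (timestamp, temp_c, humidity) readings into fixed time
    buckets and compute the median of each field per bucket."""
    buckets = defaultdict(lambda: {"temp_c": [], "humidity": []})
    for ts, temp_c, humidity in readings:
        # Truncate the timestamp down to the start of its bucket.
        bucket = ts.replace(minute=ts.minute - ts.minute % minutes,
                            second=0, microsecond=0)
        buckets[bucket]["temp_c"].append(temp_c)
        buckets[bucket]["humidity"].append(humidity)
    return {
        bucket: {field: median(vals) for field, vals in fields.items()}
        for bucket, fields in sorted(buckets.items())
    }

readings = [
    (datetime(2019, 6, 1, 12, 0), 24.1, 55.0),
    (datetime(2019, 6, 1, 12, 2), 24.5, 54.0),
    (datetime(2019, 6, 1, 12, 4), 24.3, 56.0),
    (datetime(2019, 6, 1, 12, 6), 25.0, 52.0),
]
aggregated = median_by_bucket(readings)
```

The first three readings fall in the 12:00 bucket (median temperature 24.3), while the 12:06 reading starts a new 12:05 bucket.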
Data Pipeline Layers
Prospera stores the pipeline’s configuration in an RDS database. This enables a central configuration store for a large number of events.
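The article doesn't show the configuration schema, but a table along these lines would let one row describe each pipeline's source, target stage, and freshness requirements. The sketch below uses SQLite for illustration (the real store is RDS), and all column names are assumptions:

```python
import sqlite3

# Illustrative pipeline-configuration table; the real store is RDS,
# and the column names are assumptions, not Prospera's schema.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE pipelines (
        name            TEXT PRIMARY KEY,
        source_type     TEXT,     -- e.g. 'field-agent', 'api', 'email'
        target_stage    TEXT,     -- highest stage produced (L1..L4)
        schedule        TEXT,     -- cron-like schedule for batch sources
        max_latency_min INTEGER   -- freshness requirement, in minutes
    )
""")
conn.execute(
    "INSERT INTO pipelines VALUES (?, ?, ?, ?, ?)",
    ("field-sensors", "field-agent", "L4", "*/5 * * * *", 60),
)
row = conn.execute(
    "SELECT name, max_latency_min FROM pipelines"
).fetchone()
```

Keeping this in one central table is what lets both the pipeline workers and the hourly consistency check (described below) read the same requirements.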
Why Prospera Chose a Modern Application Architecture
The main reason for choosing a modern architecture built on AWS Lambda and ECS is that these services are fully managed and reliable: the cloud provider (AWS) is in charge of making sure the code runs as it should. Another advantage is scalability – since many files arrive in S3 at the same time, it’s very convenient to rely on the built-in scaling. Lastly, the architecture design makes sense to Prospera, since each component acts as a “barrier” for the data, making sure it doesn’t move on unless it’s ready.
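That scaling behavior comes from S3 event notifications fanning out to Lambda: each arriving object triggers an invocation, and AWS scales the concurrency automatically. A minimal handler sketch (the bucket layout and `handler` name are illustrative, not Prospera's code) pulls the object location out of the standard S3 event payload:

```python
def handler(event, context):
    """Entry point for a Lambda function triggered by S3
    object-created events; AWS scales invocations automatically."""
    processed = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # In the real pipeline, this is where the object would be
        # fetched, validated, and promoted to the next stage.
        processed.append((bucket, key))
    return processed

# Sample event in the shape S3 delivers to Lambda (trimmed).
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-raw-bucket"},
                "object": {"key": "l1/sensors/2019-06-01.csv"}}}
    ]
}
result = handler(sample_event, None)
```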
How Prospera Makes Sure that Their Data is Up to Date
Prospera developed a data consistency process that runs once an hour. The process reads the pipelines table and verifies the validity of each data source, checking both the completeness and the latency of its data. If either does not meet the requirements, it alerts the engineering team.
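A sketch of such a check (the thresholds and field names are assumptions, not Prospera's implementation): it scans per-pipeline status rows and flags any source whose data is incomplete or stale.

```python
from datetime import datetime, timedelta

def check_pipelines(pipelines, now):
    """Return alert messages for pipelines whose data is incomplete
    or whose last update exceeds the allowed latency."""
    alerts = []
    for p in pipelines:
        if p["rows_received"] < p["rows_expected"]:
            alerts.append(f"{p['name']}: incomplete "
                          f"({p['rows_received']}/{p['rows_expected']})")
        if now - p["last_update"] > p["max_latency"]:
            alerts.append(f"{p['name']}: stale since {p['last_update']}")
    return alerts

now = datetime(2019, 6, 1, 12, 0)
pipelines = [
    {"name": "field-sensors", "rows_received": 100,
     "rows_expected": 100, "last_update": now - timedelta(minutes=10),
     "max_latency": timedelta(hours=1)},
    {"name": "irrigation-api", "rows_received": 40,
     "rows_expected": 50, "last_update": now - timedelta(hours=3),
     "max_latency": timedelta(hours=1)},
]
alerts = check_pipelines(pipelines, now)
```

Here only the second pipeline is flagged, once for completeness and once for latency.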
How Prospera Uses Epsagon to Troubleshoot Production Issues
Epsagon provides automated observability for troubleshooting complex production issues in microservices, serverless, and container-based applications. Using distributed tracing and AI technologies, Epsagon maps the entire architecture, including AWS Lambda and ECS, AWS services such as Kinesis, CloudWatch, SNS, and RDS databases, as well as 3rd-party APIs.
Prospera uses Epsagon to monitor this complex pipeline application and gets alerted on any exception, timeout, memory, or performance issue. When that happens, they go into Epsagon’s distributed transactions view to root-cause complex issues in no time.
One of Prospera’s data pipelines as seen in Epsagon
Using the Epsagon tracing library, Prospera can track every request its Django API processes on the cluster, giving the team per-customer insight into performance issues. Epsagon follows each request from the moment it reaches the backend endpoint all the way to the RDS calls.
In addition to troubleshooting issues in distributed applications, Prospera also encountered memory challenges when working with AWS Lambda. Since the amount of data analyzed in a Lambda function can be large, choosing the right memory limit was a challenge. Prospera uses Epsagon to identify and investigate these out-of-memory conditions and is alerted about them even before they happen.
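One way to surface this kind of issue yourself (a sketch, not Epsagon's mechanism) is to log peak memory against the configured limit at the end of each invocation. `resource.ru_maxrss` reports the process's peak resident set size, in kilobytes on Linux:

```python
import resource

def memory_headroom_mb(limit_mb):
    """Return how many MB remain below the configured Lambda memory
    limit, based on this process's peak resident set size."""
    # ru_maxrss is in kilobytes on Linux (bytes on macOS); Lambda
    # runs on Linux, so kilobytes is assumed here.
    peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return limit_mb - peak_kb / 1024

# Hypothetical 512 MB function: log this at the end of each
# invocation and alert when the headroom shrinks toward zero.
headroom = memory_headroom_mb(limit_mb=512)
```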
Memory Usage for Data Processing
As Prospera continues to grow and expand its customer base, more and more data analysis challenges arise. Prospera keeps growing while refining its distributed, scalable data pipeline architecture.
To mitigate the potential observability challenges in their application, Prospera’s engineers use Epsagon to make sure that their data pipelines are healthy, stable, and consistent.