Prospera Technologies is an agriculture technology data company that develops intelligent solutions for farmers to grow crops more efficiently. The company develops both hardware and software solutions that collect and analyze multi-sensor data with state-of-the-art machine learning algorithms.
Analyzing High Volumes of Data
Agricultural systems produce many different types of data. Such data comes from different sources: sensor data from field devices, qualitative data collected by people, and data from irrigation and ERP systems. The challenge begins when you want to analyze all this data in order to get high-impact conclusions.
The agricultural world is mostly traditional, and companies tend to use their own information platforms. Data is sometimes collected with legacy systems and, in some cases, only on paper or in spreadsheets. There is no standard. According to Nir Soudry, Software Manager at Prospera Technologies, one of the main challenges Prospera addresses is streaming this highly diverse data to a single location and making sure it’s coherent; that is, orchestrating multiple data pipelines.
To overcome this data challenge, Prospera designed and developed a custom in-house data pipeline that runs on AWS. The architecture relies on ECS, SES, SQS, Kinesis, SNS, CloudWatch, Elastic Beanstalk, and AWS Lambda for ingesting and analyzing the data, and on managed services such as S3 and RDS for storing and manipulating the data through the different stages of the pipeline.
Designing a Distributed Data Pipeline
As the diagram Prospera Data Pipeline Layers shows, data streams in from different sources: dedicated field agents (written by Prospera), existing APIs, emails, files that are manually pushed to S3, scrapers that fetch additional data, and more.
The data goes through the following stages:
- L0: original source
- L1: store in S3
- L2: insert to an RDS with no additional processing
- L3: cleaning, deduplication, validation, standardization – ready to process
- L4: aggregation and enrichment across multiple data sets – production ready
Eventually, every piece of data makes its way to an S3 bucket.
A simple example of the L4 aggregation is calculating the median temperature and humidity every few minutes.
Data Pipeline Layers
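The L3 and L4 stages listed above can be illustrated with a minimal Python sketch. It is not Prospera’s implementation: the record fields (`sensor_id`, `ts`, `temperature`, `humidity`), the validity ranges, and the five-minute window are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median

# Assumed window for "every few minutes"; the source does not give a value.
WINDOW_SECONDS = 300


def clean(readings):
    """L3-style step: drop duplicate and invalid readings.

    Field names and validity ranges are hypothetical.
    """
    seen = set()
    cleaned = []
    for r in readings:
        key = (r.get("sensor_id"), r.get("ts"))
        if None in key or key in seen:
            continue  # missing identity or duplicate (sensor, timestamp)
        temp, hum = r.get("temperature"), r.get("humidity")
        if temp is None or hum is None:
            continue
        if not (-50 <= temp <= 70) or not (0 <= hum <= 100):
            continue  # outside the assumed plausible range
        seen.add(key)
        cleaned.append(r)
    return cleaned


def aggregate(readings):
    """L4-style step: median temperature and humidity per sensor per window."""
    buckets = defaultdict(list)
    for r in readings:
        window_start = r["ts"] - r["ts"] % WINDOW_SECONDS
        buckets[(r["sensor_id"], window_start)].append(r)
    return {
        key: {
            "temperature": median(x["temperature"] for x in rows),
            "humidity": median(x["humidity"] for x in rows),
        }
        for key, rows in buckets.items()
    }
```

Calling `aggregate(clean(rows))` on a list of reading dictionaries with integer epoch timestamps returns one median pair per sensor and window, keyed by `(sensor_id, window_start)`.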
Prospera stores the pipeline’s configuration in an RDS database, which serves as a central configuration store for a large number of events.
Why Prospera Chose Modern Applications
Prospera chose a modern architecture built on AWS Lambda and ECS because it can be fully managed and reliable. The cloud provider (AWS) is in charge of making sure that the code is running as it should. Another advantage is scalability. Since many files arrive in S3 at the same time, it’s very convenient to rely on the built-in scaling. Lastly, in terms of architecture design, each component is a “barrier” for the data, making sure that it doesn’t go through unless it’s ready to go.
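As a rough illustration of the “barrier” idea, the sketch below shows a Lambda-style handler that receives an S3 event notification and only lets through objects that pass a readiness check. The event layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys) is the standard S3 notification shape; the `is_ready` rule and the `raw/` prefix are hypothetical and not taken from Prospera’s pipeline.

```python
from urllib.parse import unquote_plus


def is_ready(key):
    """Hypothetical barrier rule: only CSV files under raw/ may proceed."""
    return key.startswith("raw/") and key.endswith(".csv")


def handler(event, context=None):
    """Handler in the shape AWS Lambda uses for S3 event notifications."""
    accepted, rejected = [], []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        target = accepted if is_ready(key) else rejected
        target.append((bucket, key))
    # A real function would forward accepted objects to the next stage;
    # here the split is simply returned so it can be inspected.
    return {"accepted": accepted, "rejected": rejected}
```

Because each uploaded file triggers its own invocation, a burst of uploads is handled by the platform running more copies of the handler concurrently, with no scaling logic in the function itself.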
How Prospera Makes Sure that Their Data is Up to Date
Prospera developed a data consistency process that runs once an hour. This process reads the pipelines table and checks each data item against its validity requirements. It verifies the completeness and latency of the data, and if either does not meet the requirements, it alerts the engineering team.
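A check of this kind might look like the following sketch. The `PipelineStatus` fields and the 95% completeness default are assumptions made for illustration; the source does not describe the actual schema of the pipelines table or the thresholds used.

```python
from dataclasses import dataclass


@dataclass
class PipelineStatus:
    """Hypothetical row of the pipelines table, joined with recent counts."""
    name: str
    expected_rows: int       # rows expected in the checked period
    received_rows: int       # rows that actually arrived
    last_arrival_ts: float   # epoch seconds of the newest item
    max_latency_s: float     # allowed age of the newest item
    min_completeness: float = 0.95  # assumed default threshold


def find_violations(statuses, now):
    """Return (pipeline, check, value) tuples for every failed requirement."""
    alerts = []
    for s in statuses:
        completeness = s.received_rows / s.expected_rows if s.expected_rows else 1.0
        latency = now - s.last_arrival_ts
        if completeness < s.min_completeness:
            alerts.append((s.name, "completeness", round(completeness, 3)))
        if latency > s.max_latency_s:
            alerts.append((s.name, "latency", latency))
    return alerts
```

An hourly scheduler would call `find_violations` with the current time and send any returned tuples to whatever alerting channel the team uses.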
How Prospera Uses Epsagon to Troubleshoot Production Issues
Epsagon provides automated observability for troubleshooting complex production issues in microservices, serverless and container-based applications. Using distributed tracing and AI technologies, Epsagon maps the entire architecture, including AWS Lambda and ECS, AWS services such as Kinesis, CloudWatch, SNS, and RDS databases, and also third-party APIs.
Prospera uses Epsagon to monitor the complex pipeline application and alert on any exception, timeout, memory, or performance issues. When alerts happen, the developers go into Epsagon’s distributed transactions view to rapidly discover the root cause for complex issues, Nir explained.
Using the Epsagon tracing library, Prospera can track every request their Django API processes on the cluster, giving them insight into performance issues that affect their customers. The trace follows each request from the moment it reaches the backend endpoint all the way to the RDS calls.
In addition to troubleshooting issues in distributed applications, Prospera also encountered memory challenges when working with AWS Lambda. Since the amount of data analyzed in the Lambda functions can be high, choosing the correct memory limit was a challenge. Prospera uses Epsagon to identify and investigate these out-of-memory conditions and to be alerted about issues even before they happen, Nir said.
Memory Usage for Data Processing
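To make the memory-limit problem concrete, the sketch below reads the `REPORT` line that AWS Lambda writes to CloudWatch Logs after each invocation and flags invocations that come close to the configured limit. This is not how Epsagon works internally; it is only an illustration of the underlying signal, and the 80% warning ratio is an assumption.

```python
import re

# Matches the memory fields of a Lambda REPORT log line, for example:
# "... Memory Size: 512 MB\tMax Memory Used: 460 MB"
REPORT_RE = re.compile(r"Memory Size: (\d+) MB\s+Max Memory Used: (\d+) MB")


def memory_usage_ratio(report_line):
    """Return used/limit for a REPORT line, or None if the fields are absent."""
    match = REPORT_RE.search(report_line)
    if match is None:
        return None
    limit_mb, used_mb = int(match.group(1)), int(match.group(2))
    return used_mb / limit_mb


def near_limit(report_line, warn_ratio=0.8):
    """True when peak memory reached the assumed warning ratio of the limit."""
    ratio = memory_usage_ratio(report_line)
    return ratio is not None and ratio >= warn_ratio
```

Flagging invocations that approach the limit, rather than waiting for one to exceed it, is one way to be warned before an out-of-memory failure actually occurs.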
Growing Prospera’s Business and Data Analysis
As Prospera continues to grow and expand its customer base, more and more data analysis challenges arise. Prospera keeps pace with that growth by designing a distributed, scalable data pipeline architecture.
To mitigate the potential observability challenges in their application, Prospera’s engineers use Epsagon to make sure that their data pipelines are healthy, stable, and consistent.