The Healthcare Improvement Studies (THIS) Institute, based in Cambridge, UK, produces the evidence needed to improve healthcare practices and processes.

THIS Institute Supports Multiple Health Studies

One of the core principles at THIS Institute is to involve patients and staff from across the UK’s National Health Service (NHS) in its research. This enables researchers to collect a wide range of data from staff delivering healthcare, as well as from patients and their families, to support evidence-based research into improving healthcare practices and processes. The institute, launched two years ago, now has multiple and varied projects focused on healthcare improvement.

Andy Paterson is the architect and lead developer for the Thiscovery platform, which supports the institute’s research and outreach. The development team’s job was to build a platform that makes it easy to gather the data researchers need to develop and validate standards, processes, and best practices in healthcare. Some of the studies are wide in scope with multiple sources of input, while others are very specific, requiring input from only a single group.

The development mandate was to create a platform that could handle large amounts of data and was robust, reliable and flexible in supporting multiple studies with different goals and objectives. 

The Need: Developer Efficiency and Velocity

Andy and one other developer built the platform backend, while THIS Institute leveraged a web agency for front-end development and a communications team for copywriting. Andy began with “absolutely nothing,” but knew he did not want to manage many systems. “With a two-person development team, THIS Institute needed to be incredibly efficient.”

THIS Institute decided to use Amazon Web Services (AWS) as the cloud host and to build the platform with microservices. Before the move to AWS, maintaining and scaling the legacy infrastructure was slow, reactive, and prone to technical difficulties. Infrastructure environments often drifted out of sync, making it hard to increase capacity or deploy changes to production without manual, time-consuming processes. The goal of the move was to minimize the operational load and the effort required to manage and maintain the platform. “I want it to work and I want to be able to forget about it,” Andy explained.

Today, THIS Institute’s environment is a mix of serverless functions (AWS Lambda), Amazon Relational Database Service (RDS), DynamoDB, API Gateway, Secrets Manager, CloudFormation, CodePipeline, and S3. The development team originally used Amazon CloudWatch but wanted a unified view of the production environment, logs, and traces.

Ensuring Business as Usual During Migration to AWS

During THIS Institute’s migration to AWS, the visibility provided by Epsagon was critical to maintaining platform reliability and ensuring business as usual for THIS Institute’s customers. THIS Institute trusted Epsagon during the cloud migration because it allowed the engineering team to:

    • Achieve holistic observability into both the legacy and the new cloud environments during the migration
    • Be notified instantly of conditions that could escalate into issues
    • Rapidly triage and understand any issues that do arise, so they can quickly implement a fix

Epsagon’s solution helped THIS Institute complete its cloud migration without apparent incidents. Understanding how the application was performing during the migration was crucial for the business, and Epsagon’s insights allowed the engineering team to make adjustments in real time so that the migration was completed successfully and delivered the desired benefits.

Epsagon: An Advanced Solution for Monitoring Microservices

“I saw Epsagon in 2018 and I have not seen anything yet as advanced in monitoring microservice applications,” Andy explained. “You can just plug Epsagon in and it works. However, to get the best out of it, you need to do a little custom integration, but it’s comparatively painless. It also works with one of the other development tools we use, Stackery.”

“With just two developers, I need to forget about my operational system unless something goes wrong. Epsagon has allowed me to do that.” The team receives alerts about system issues via Slack, one of the many communications integrations Epsagon offers. “Epsagon is really useful; it can help track down obscure bugs.”

Microservices have an inherent complexity of behavior, especially when they communicate asynchronously, Andy observed. “You really need to think about observability upfront, before you even build anything. I tried to build as much observability as I could into the platform.”

That complexity, Andy said, makes figuring out what went wrong difficult and time-consuming. “Microservices can make life more difficult. Having the right tools like Epsagon reduces complexity and your frustration.”

Microservice Production Complexity

Andy explained that “with microservices, the symptom of the problem is often not in the same place as the cause. Epsagon’s ability to trace back from the component showing the issue is important, since the real error may be with another component in the trace.”

With the ability to see everything in production in the architecture view and trace every request in a transaction, as well as seeing correlated traces, logs and payloads in a single interface, Andy is able to explore, troubleshoot, and find the issue fast.
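
The idea of tracing back from the symptom to the cause can be illustrated with a small sketch. The Python below is illustrative only: the Span shape, the find_root_cause helper, and the service names are assumptions made for this example and do not represent Epsagon's data model or API. It starts at the component showing the error, follows failing downstream calls, and returns the deepest failing one.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One hop in a trace (hypothetical shape, not Epsagon's data model)."""
    span_id: str
    parent_id: Optional[str]  # None for the entry point of the request
    service: str
    error: bool = False


def find_root_cause(spans: List[Span], symptom_id: str) -> Span:
    """Walk from the symptom span down through failing child calls."""
    by_id: Dict[str, Span] = {s.span_id: s for s in spans}
    children: Dict[str, List[Span]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s)

    current = by_id[symptom_id]
    while True:
        failing = [c for c in children.get(current.span_id, []) if c.error]
        if not failing:
            # No failing downstream call: this span is the deepest failure.
            return current
        # Simplification: follow only the first failing child.
        current = failing[0]


# Hypothetical trace: the error surfaces at the gateway but starts at the write.
trace = [
    Span("a", None, "api-gateway", error=True),
    Span("b", "a", "lambda-handler", error=True),
    Span("c", "b", "dynamodb-write", error=True),
    Span("d", "b", "notification-queue"),
]
root_cause = find_root_cause(trace, "a")
print(root_cause.service)
```

In this made-up trace the user-facing error appears at the gateway, while the walk ends at the database write two hops further down.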

Epsagon: Useful in Every Production Stage

THIS Institute uses Epsagon in every phase: development, automated testing, staging (including full integration testing with third parties), and production.

“I wanted to move toward a DevOps and continuous delivery model where every change goes into production ASAP. That is, I wanted to eliminate staging and deploy frequently and easily.”

The key, Andy said, was comprehensive testing combined with the ability to rely on the production system and on Epsagon for monitoring and troubleshooting.

End-to-end monitoring and error detection with visualization and alerting

“Observability within production is key and so is Epsagon. If we put something faulty into production, we will know about the error immediately through the visualization of the error and alerting so we can get the fix out fast.”

Continuous deployment and a focus on development

THIS Institute developers can look at Epsagon traces and see the node with the error, then look for the source of the problem within the trace. “Your architecture view with the correlation of traces, logs and payloads made debugging rapid and much easier. That meant I could deploy continuously without bugs and focus on increasing development rapidity.”

No need for log searches with architecture view and correlation

Epsagon is “far superior to just searching for or looking at logs,” Andy explained. “It’s helpful to see everything in production visualized because, with microservices, there is a distribution of calls, responsibilities, and processing. It’s not clear looking at code and logs where something is going wrong. Why is it calling 3 times? Epsagon provides insight into the behavior of your system with the architecture view and correlation of logs and traces, allowing you to see things that you did not expect.”

Production confidence

During production, Andy explained, the team relies on Epsagon to detect “anything that is failing or a problem, such as timeout, memory leaks, invoking too many times. We can see the pattern and take action.”
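
As a rough illustration of spotting such patterns, the sketch below scans a list of invocation records for functions called more often than expected within one request and for calls that ran up to a timeout. Everything in it is an assumption made for the example: the record layout, the flag_anomalies helper, the function names, and the thresholds are hypothetical and do not describe how Epsagon detects issues.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical invocation record: (request_id, function_name, duration_ms).
Invocation = Tuple[str, str, float]


def flag_anomalies(
    invocations: List[Invocation],
    max_calls: int = 1,
    timeout_ms: float = 3000.0,
) -> Dict[str, List[str]]:
    """Return findings per request: repeated calls and calls that hit the timeout."""
    findings: Dict[str, List[str]] = {}

    # Count how often each function ran within each request.
    calls = Counter((req, fn) for req, fn, _ in invocations)
    for (req, fn), count in calls.items():
        if count > max_calls:
            findings.setdefault(req, []).append(f"{fn} invoked {count} times")

    # A duration at or above the configured limit is treated as a timeout.
    for req, fn, duration in invocations:
        if duration >= timeout_ms:
            findings.setdefault(req, []).append(
                f"{fn} hit the {timeout_ms:.0f} ms timeout"
            )
    return findings


# Made-up records for three requests.
records = [
    ("req-1", "process-response", 120.0),
    ("req-1", "process-response", 118.0),
    ("req-1", "process-response", 131.0),
    ("req-2", "send-email", 3000.0),
    ("req-3", "process-response", 95.0),
]
report = flag_anomalies(records)
print(report)
```

With these made-up records, the first request is flagged for a function that ran three times and the second for a call that reached the timeout; the third request produces no finding.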

Zero production bugs and zero impact on users

Epsagon can also help reduce the impact on users when there are errors. “What we don’t want is a user encountering a problem and having to call us,” Andy noted. “We want to catch the issue before it impacts the user and reduce the MTTR. So far, by using Epsagon in development, testing and staging, we have had zero bugs in production back-end systems that have impacted users. With comprehensive automated testing plus Epsagon, we aim to catch 100% of the bugs before they get into production.”

Epsagon Histogram Showing an Error-Free Production Environment

“Best practice” solution for high production reliability

Epsagon contributes to high reliability by helping the organization achieve complete, end-to-end error detection and resolution. Right now, getting applications out the door fast is not an issue in this academic environment. But end-to-end error detection and fast problem resolution reduce risk. For the platform changes THIS Institute plans to make later in the year, this maximizes the likelihood that everything will be fine during deployment.

“Breaks, or issues, are inevitable. When breaks happen, we know we have Epsagon there monitoring and alerting us to the source of the issue — no matter where it is in the trace.”

Code-ready for third-party integration 

As THIS Institute increases the level of research activity and outreach, the group will become more service-focused than development-focused. How the Thiscovery platform accommodates that growth and new focus will validate the organization’s approach.

“Our backend platform is common to various different research tasks. It also must work with our partners and integrate with them seamlessly. Right now, we are putting more work into the core infrastructure, but later we will balance that with more of an integration focus. Like it does today, Epsagon observability will be instrumental in making sure our code is ready for that integration.”
