Streamlining AI Inference: Our Journey to Deep Snake Brain


Introducing our AI Inference Pipeline

At Teton, we are dedicated to creating a reliable, real-time solution that enables staff to respond promptly to patient falls or, better yet, to intervene and prevent them. We are proud to provide a service that protects residents’ privacy and runs 24/7, even in areas with poor internet connectivity. We achieve this by running our AI inference locally on an Nvidia Jetson Orin NX. Running the inference locally means that image data never leaves the room, which reduces the need for high-speed internet and improves both privacy and reliability.

Over the past few months, we have revamped our inference pipeline. In this post, we’ll walk you through its key components. These improvements have enabled us to iterate faster, made our pipeline more efficient and reliable, and streamlined debugging and testing. But before diving into the changes, let’s first look at where we were just a few months ago.

The Complex Way

Our previous pipeline involved three key components: Capture-and-Detect (CAD), brain, and sleeping-beauty. It was fully written in C++, leveraging the efficiency of a compiled language that could easily be deployed to devices. At the core of this setup was the CAD component. This component handled multiple tasks: grabbing frames, running model inference, maintaining a buffer for our data engine, and storing anonymized fall detection clips. It was implemented with a GStreamer pipeline to get the frames and a TensorRT engine to perform model inference.

The output from the CAD module was sent to another C++ component called brain, which interpreted the results and communicated back to CAD to trigger anonymized fall clips when necessary. CAD would then report back once the fall clip was uploaded. The brain also orchestrated two-way communication with our sleeping-beauty component, which ran the TensorRT engine of our sleep model. The animation below shows a simplified overview of the pipeline.

Our inference pipeline used to consist of several C++ components that communicated with each other through different internal communication methods. The pipeline’s complicated structure made iterative improvements slow, and the internal communication was prone to errors.

This pipeline posed several challenges:

Disjointed repositories: Code was spread across multiple repositories and services, making it cumbersome to manage and run locally and introducing the risk of devices running services with incompatible versions.
Brittle communication: We relied on shared memory to communicate between services, which was fragile and prone to breaking. If one service restarted while the other kept running, the shared state could get stuck in an inconsistent condition.
Complex communication logic: Since fall detection occurs in the brain module, while buffering and clip creation take place in CAD, the brain had to signal to CAD to trigger a fall event. CAD then had to respond once the process was complete. This back-and-forth communication increased system complexity, introduced potential failure points, and made the overall logic more convoluted than necessary.
Mismatch in development environments: The rest of our model development, including testing and training, was done in Python using PyTorch. This led to discrepancies in inference results, as we had to be cautious about implementation details like downsampling and interpolation, which were handled differently in Python and C++.
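To make the interpolation point concrete, here is a minimal sketch of how two common coordinate conventions for linear resizing give different results on the same input. The function `resize_linear` and its `align_corners` flag are illustrative names, not code from either of our pipelines; the sketch only shows the kind of detail that has to match between a training stack and an inference stack.

```python
from typing import List


def resize_linear(row: List[float], out_len: int, align_corners: bool) -> List[float]:
    """Resize a 1-D signal with linear interpolation under one of two conventions."""
    in_len = len(row)
    out = []
    for i in range(out_len):
        if align_corners:
            # First and last samples of input and output coincide.
            src = i * (in_len - 1) / (out_len - 1) if out_len > 1 else 0.0
        else:
            # Half-pixel centers: output pixel i maps to the center of its footprint.
            src = (i + 0.5) * in_len / out_len - 0.5
        src = min(max(src, 0.0), in_len - 1)  # clamp to the valid index range
        lo = int(src)
        hi = min(lo + 1, in_len - 1)
        frac = src - lo
        out.append(row[lo] * (1.0 - frac) + row[hi] * frac)
    return out


row = [0.0, 10.0, 20.0, 30.0]
aligned = resize_linear(row, 2, align_corners=True)      # [0.0, 30.0]
half_pixel = resize_linear(row, 2, align_corners=False)  # [5.0, 25.0]
```

Both outputs are "correct" downsamplings of the same row, yet a model trained on one and served with the other sees systematically shifted inputs.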

These challenges slowed down iteration and made end-to-end evaluation tedious.

Introducing Deep Snake Brain

To iterate faster, simplify the code base, and improve the reliability of our system, we decided to rewrite our inference pipeline in Python. To rewrite CAD, we chose to leverage an awesome Nvidia product: DeepStream.

DeepStream is a GPU-accelerated SDK for building AI-powered video analytics applications, integrating with GStreamer to enable efficient, real-time processing of multiple video streams.

With DeepStream, we integrated preprocessing and the TensorRT engine into the pipeline, which significantly reduced the boilerplate code required to run our real-time AI inference. This setup enabled scalability, reduced the amount of memory used, and made the pipeline extendable to support multiple cameras running on the same device. DeepStream manages frame batching efficiently, even when frames are received at different times, ensuring robustness.
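The batching behaviour described above can be illustrated with a small, framework-free sketch. This is not DeepStream code: `TimeoutBatcher`, its parameters, and the timestamps are hypothetical, and the sketch only checks the timeout when a new frame arrives, whereas a real stream muxer would use a timer. It shows the general idea of emitting a full batch when every camera has delivered a frame and a partial batch when one camera is late.

```python
from typing import List, Optional


class TimeoutBatcher:
    """Collect one frame per source; flush when full or when the batch is stale."""

    def __init__(self, num_sources: int, timeout_s: float) -> None:
        self.num_sources = num_sources
        self.timeout_s = timeout_s
        self.pending: List[int] = []  # source ids waiting in the current batch
        self.first_arrival: Optional[float] = None

    def _flush(self) -> List[int]:
        batch, self.pending = sorted(self.pending), []
        self.first_arrival = None
        return batch

    def push(self, source_id: int, now: float) -> List[List[int]]:
        """Add a frame and return any batches that became ready."""
        flushed: List[List[int]] = []
        timed_out = (
            self.first_arrival is not None
            and now - self.first_arrival >= self.timeout_s
        )
        # Emit a partial batch if it waited too long, or if the same source
        # delivers a second frame before the others delivered their first.
        if self.pending and (timed_out or source_id in self.pending):
            flushed.append(self._flush())
        if not self.pending:
            self.first_arrival = now
        self.pending.append(source_id)
        if len(self.pending) == self.num_sources:
            flushed.append(self._flush())
        return flushed


batcher = TimeoutBatcher(num_sources=2, timeout_s=0.04)
arrivals = [(0, 0.00), (1, 0.01), (0, 0.10), (0, 0.20), (1, 0.21)]
batches: List[List[int]] = []
for source_id, timestamp in arrivals:
    batches.extend(batcher.push(source_id, timestamp))
```

With these arrival times, camera 1 misses one cycle, so the batcher emits a full batch, then a partial batch containing only camera 0, then a full batch again.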

To further simplify our inference pipeline and get rid of communication between components, we rewrote our C++-based brain as a Python-based snake-brain. We integrated the sleeping-beauty model into the snake-brain component so that we could remove all internal communication via shared memory or MQTT. Getting rid of this internal communication made the application run more reliably and made testing much simpler, as we only had to run one application rather than three. Below is an animation of our enhanced pipeline.

Our enhanced pipeline, Deep Snake Brain, is written in Python and uses Nvidia DeepStream for efficient and reliable inference.

Contrary to our initial concerns, Python’s performance was not a bottleneck. The majority of time in our pipeline is spent on model inference. Since the brain can run in parallel with the next frame being processed, we have ample time for tracking and interpreting the raw model outputs.
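The overlap between inference and interpretation can be sketched as a simple producer and consumer. Everything here is a stand-in: `fake_inference`, `brain_worker`, and the score threshold are hypothetical, and the sketch assumes the inference step spends its time waiting on the GPU rather than holding the Python interpreter, which is what leaves room for the brain logic to run in between.

```python
import queue
import threading
from typing import List, Optional


def fake_inference(frame_id: int) -> dict:
    """Stand-in for the GPU-bound model inference step."""
    return {"frame": frame_id, "score": frame_id * 10}


def brain_worker(results: "queue.Queue[Optional[dict]]", events: List[int]) -> None:
    """Interpret raw outputs while the producer works on the next frame."""
    while True:
        item = results.get()
        if item is None:  # sentinel: the producer has finished
            break
        if item["score"] >= 30:  # hypothetical decision threshold
            events.append(item["frame"])


def run(num_frames: int) -> List[int]:
    # A small bounded queue keeps the brain at most a couple of frames behind.
    results: "queue.Queue[Optional[dict]]" = queue.Queue(maxsize=2)
    events: List[int] = []
    worker = threading.Thread(target=brain_worker, args=(results, events))
    worker.start()
    for frame_id in range(num_frames):
        results.put(fake_inference(frame_id))
    results.put(None)
    worker.join()
    return events


events = run(5)
```

The bounded queue is a design choice in this sketch: if the brain ever fell behind, the producer would block instead of letting stale results pile up.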

Pipeline Enhancements

Ease of local testing: We no longer have to start three applications, but can just start one.

Development speed: The switch to Python significantly accelerated development, allowing us to iterate faster than with C++.

DeepStream integration: Using DeepStream has proven to be more reliable and robust than our custom GStreamer pipeline. Additionally, we saw a significant reduction in memory usage, as GStreamer is now responsible for a larger share of the pipeline.

Reduced complexity: By leveraging DeepStream and reducing the need for interprocess communication, we reduced the number of lines of code by over 40%.

The rewrite reduced the number of lines in the codebase by more than 40% and decreased the number of repositories from three to one. This is despite adding more and better automated tests and several internal debugging tools.

Our rewrite has simplified our inference pipeline significantly and made it faster and more reliable. However, the main disadvantage we found with DeepStream is that it forces us to use a specific Python version, which for the JetPack version running on the fleet is Python 3.8. This restricts our ability to use newer Python features and libraries that require more recent Python versions. Despite this, the benefits of DeepStream, such as its optimized handling of video streams and its TensorRT integration, outweigh this drawback. The overall improvements in efficiency, reliability, and speed have positioned our solution to better serve our mission of fall prevention and intervention.

Deployment of Python code

One common hesitation about using Python for production applications, especially in embedded systems, is that it’s an interpreted language rather than a compiled one. This can lead to concerns about deployment size, startup time, and dependency management. However, we’ve found an effective solution to these challenges by using PyInstaller to package our application. PyInstaller allows us to bundle our Python application, interpreter, and dependencies into a single package, creating standalone executables that eliminate the need for a Python installation on the fleet.
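As a rough illustration, a PyInstaller build can be driven from Python through `PyInstaller.__main__.run`, which accepts the same arguments as the command line. The script name, application name, output directory, and the choice of a one-folder (`--onedir`) build below are assumptions made for the example, not a description of our actual build configuration.

```python
from typing import List


def build_pyinstaller_args(entry_script: str, app_name: str, dist_dir: str) -> List[str]:
    """Assemble command-line arguments for a one-folder PyInstaller build."""
    return [
        entry_script,
        "--name", app_name,
        "--onedir",      # bundle the interpreter and dependencies in one folder
        "--noconfirm",   # overwrite a previous build without prompting
        "--distpath", dist_dir,
    ]


def build(entry_script: str, app_name: str, dist_dir: str) -> None:
    """Run the build; requires PyInstaller to be installed."""
    # Imported lazily so the argument helper works without PyInstaller present.
    import PyInstaller.__main__

    PyInstaller.__main__.run(build_pyinstaller_args(entry_script, app_name, dist_dir))


# Hypothetical names, used only to show the resulting argument list.
args = build_pyinstaller_args("main.py", "deep-snake-brain", "dist")
```

Calling `build("main.py", "deep-snake-brain", "dist")` on a machine with PyInstaller installed would produce a folder under `dist/` containing the executable and its bundled dependencies.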


Average and minimum inference rates across our entire fleet, with devices constrained to operate below 10 FPS. All devices maintain consistent performance at the specified speed. The reduction in internal communication has significantly reduced the number of dropped messages.

To optimize our deployment process, we have separated our TensorRT engines, our dependencies and code into distinct packages that can be updated independently. This approach allows us to deploy the dependencies only when they undergo changes. As a result, most deployments become significantly smaller in size. This strategy is particularly beneficial for devices with unstable internet connections or slow download speeds, as it minimizes the amount of data that needs to be transferred during updates.
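One way to picture independently updatable packages is a manifest that maps each package to a content digest, so a device only downloads what changed. The package names, the manifest layout, and the helper `packages_to_update` are hypothetical; this is a sketch of the idea rather than our deployment tooling.

```python
import hashlib
from typing import Dict, List


def digest(payload: bytes) -> str:
    """Content hash used to tell whether a package changed."""
    return hashlib.sha256(payload).hexdigest()


def packages_to_update(installed: Dict[str, str], released: Dict[str, str]) -> List[str]:
    """Return the names of packages whose released digest differs from the installed one."""
    return sorted(
        name for name, released_digest in released.items()
        if installed.get(name) != released_digest
    )


# Hypothetical manifests: only the application code changed in this release.
installed = {
    "engines": digest(b"engine-v1"),
    "dependencies": digest(b"deps-v1"),
    "code": digest(b"code-v1"),
}
released = {
    "engines": digest(b"engine-v1"),
    "dependencies": digest(b"deps-v1"),
    "code": digest(b"code-v2"),
}
changed = packages_to_update(installed, released)
```

In this example the device would fetch only the `code` package and leave the much larger engines and dependencies untouched.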

Automated Testing

A key focus during the rewrite was automated testing. We opted for more functional, data-driven tests rather than classical unit tests that test the inputs and outputs of each function. We implemented three types of tests:

Fast Tests: Executed on every commit; the entire suite takes less than 10 minutes to run. These include strict type checking, linters, and runs on small human-labelled sequences of model detections. These tests fail if the output differs from the expected output, and they can easily be extended with additional edge cases.
Slow Tests: Run nightly on a local workstation and take over 2 hours to run. These include full pipeline tests with model inference, performance tracking, brain logic checks, and validation using a dedicated fall test set. These take longer to run as they run on a larger dataset and perform model inference.
Integration Tests: Run on multiple test devices identical to the ones in the fleet, where we validate the connection to everything external, such as the camera, the MQTT broker, and other local services running on the device.
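A data-driven fast test of the kind described above can be sketched as a list of labelled sequences with expected outputs. The rule in `detect_fall_frames` (a "lying" detection that persists for three frames) is a deliberately simple, hypothetical stand-in for the brain logic, and the case names and labels are invented for the example.

```python
from typing import Dict, List


def detect_fall_frames(labels: List[str], min_consecutive: int = 3) -> List[int]:
    """Hypothetical stand-in for brain logic: report the frame index at which
    a 'lying' detection has persisted for `min_consecutive` frames."""
    events: List[int] = []
    streak = 0
    for index, label in enumerate(labels):
        streak = streak + 1 if label == "lying" else 0
        if streak == min_consecutive:
            events.append(index)
    return events


# Each case is a small labelled sequence plus the output we expect from it.
CASES: List[Dict] = [
    {"name": "clear_fall",
     "labels": ["standing", "lying", "lying", "lying", "lying"],
     "expected": [3]},
    {"name": "brief_flicker",
     "labels": ["standing", "lying", "standing", "lying", "standing"],
     "expected": []},
    {"name": "two_falls",
     "labels": ["lying"] * 3 + ["standing"] + ["lying"] * 3,
     "expected": [2, 6]},
]


def run_cases(cases: List[Dict]) -> List[str]:
    """Return the names of cases whose output differs from the expectation."""
    return [c["name"] for c in cases if detect_fall_frames(c["labels"]) != c["expected"]]


failures = run_cases(CASES)
```

Adding a new edge case then means appending one entry to `CASES`, without writing any new test code.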

This automated test framework is managed with GitHub Actions and helps us ship high-quality code and continually improve.

We are excited about our improved and simplified AI inference pipeline. Deep Snake Brain has been running in production since July 2024. It has enabled us to iterate much faster and more comfortably thanks to reduced complexity and improved internal tooling and testing. We look forward to continuing to build on Deep Snake Brain by adding support for multiple cameras, inter-device communication, and automatic 3D calibration, further pushing the capabilities of our AI care companion to better support care staff and patients worldwide.
