TEO-2: Automatic data extraction - “What If Your Care Staff Never Had to Write Another Note?”

Frederik Warburg

TEO-2: Powering the Foundational Data Layer for the Point of Care

Our healthcare system today relies on manual data collection at the point of care. This is a tedious process, and the resulting documentation is sporadic and prone to human bias and error.

One of the biggest opportunities in healthcare today is to use AI to transition from this manual data collection process to an automatic one. This will increase the amount and the fidelity of data gathered by many orders of magnitude, resulting in massive savings in manual labor while unlocking much better insights and ultimately better outcomes for patients.

Our latest AI model, TEO-2, is trained to monitor patient health at the point of care. TEO-2 understands the room layout and the semantic value of each point in 3D. It understands what residents are doing, how they are sleeping, their respiration rate, their lying position, and their gait. It does all of this in real time, providing continuous metrics on patient health and the care residents receive.

TEO-2 understands the 3D geometry at the point of care. From this metric 3D understanding, TEO-2 can automatically monitor patient health and treatment and send alarms in real time.

While TEO-2 represents a breakthrough in Spatial Intelligence, its true value lies in the foundational data layer it creates. A continuous stream of rich care data powers a spectrum of applications that are already transforming healthcare delivery. These applications improve patient safety, deliver data-driven care insights, and drive operational excellence. We are proud of the value that TEO-2 is already delivering for patients and care staff:

  1. Proactive Patient Safety: Our reactive alarm system has achieved a 50% average reduction in falls across facilities, while cutting response times by 80%. When incidents do occur, anonymized fall clips provide valuable insights for prevention and damage assessment, all while maintaining patient privacy.
  2. Ward Overview: Providing care staff with a real-time overview of their ward leads to smarter resource allocation and reduces unnecessary room visits by 25%. This efficiency allows staff to deliver better care and spend more quality time with patients. During night rounds, fewer patient disruptions mean better sleep and recovery.
  3. AI Health Agents: Our AI health agent analyzes thousands of health metrics and care interactions to find daily patterns and insights that might otherwise go unnoticed. It helps staff identify and track changes in patient health and pinpoint opportunities for improved care.
  4. AI Operational Agents: Our AI Operational agent identifies workflow inefficiencies and standardizes best practices across care teams. By understanding exactly how care is delivered, patient health facilities can optimize processes, reduce variations in care quality, and improve overall operational efficiency.
  5. Predictive Health: Perhaps most powerfully, our foundational data layer enables early warning systems that forecast patient risks. Subtle changes in gait patterns or lower sleep quality can indicate increased fall risk days before an incident might occur. Changes in sleep patterns, movement frequency, or interaction levels can signal developing health issues, providing valuable information to staff and enabling them to deliver preventive care.

This foundational data layer represents a paradigm shift in healthcare: from reactive to proactive, from periodic checks to continuous monitoring, and from subjective observations to objective insights. By digitizing the point of care, we're not just collecting data; we're creating the foundation for a smarter, safer, and more compassionate healthcare system. TEO-2 represents our most sophisticated AI model, created to power the foundational data layer that's already transforming healthcare.

TEO-2: Advancing Spatial Intelligence in Healthcare

TEO-2 integrates multiple vision tasks into a single unified model. One of its most significant advancements is its ability to operate entirely in 3D metric space, rather than relying on methods that reason in pixel space. This shift is a game-changer for our AI’s ability to understand and document the point of care with unprecedented accuracy. The video below shows a 3D reconstruction of the point of care generated by TEO-2. The model creates an accurate 3D reconstruction from a single ceiling-mounted camera 10 times per second, yielding an unprecedented understanding of the care environment.

TEO-2 understands the 3D geometry of the scene. From a single sensor input, TEO-2 can accurately represent the scene's 3D geometry.

Moving from a 2D representation to a metric 3D representation of the point of care supercharges TEO-2's capabilities:

  1. Seamless Multi-Sensor Integration: A major benefit of moving into 3D is the ability to integrate multiple sensors into a shared spatial representation. With TEO-2, multiple cameras and sensors can be calibrated into a unified 3D space, allowing for seamless coverage of larger areas. This ensures a more complete and continuous representation at the point of care, further enhancing the system’s accuracy and reliability.
  2. Improved Tracking and Occlusion Handling: In a 2D system, tracking individuals can be challenging, particularly when they pass behind objects, move between rooms, or are momentarily out of view. TEO-2, however, understands their 3D location, allowing it to maintain robust tracking even when individuals are partially or temporarily obscured. Its accurate understanding of the room layout also lets us determine whether patients are inside the room or just visible through an open door, ensuring only relevant data is considered.
  3. Physical Priors for Better Predictions: By reasoning in a metric 3D space, TEO-2 can incorporate real-world physical constraints into its tracking models. For example, it understands natural movement speeds and physical limitations, preventing unrealistic jumps or false detections. This allows the system to produce more reliable and consistent insights about patient activity and care interactions.
  4. Advanced Gait and Motion Analysis: TEO-2’s ability to estimate accurate 3D positions enables precise gait analysis. By measuring people's 3D poses, we can accurately deduce step lengths, walking speed, and other motion characteristics. These are valuable insights into patient mobility, fall risk, and overall well-being (a rough sketch of this computation follows the list).
  5. Understanding Human Interactions: Using its 3D understanding, TEO-2 can more accurately assess how patients and caregivers interact. By analyzing spatial relationships, it can better estimate how a caregiver assists a patient, determine the nature of their interaction, and even quantify the level of support provided. This opens up new possibilities for measuring and optimizing the quality of care, and will reduce the documentation burden for care staff.
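As a rough sketch of the gait computation described in point 4 above, assuming per-frame 3D ankle positions from the pose estimator (the function and its logic are illustrative, not TEO-2's actual pipeline):

```python
import numpy as np

def gait_metrics(left_ankle, right_ankle, fps=10):
    """Estimate walking speed and mean step length from 3D ankle tracks.

    left_ankle / right_ankle: (T, 3) arrays of metric 3D positions,
    one row per frame, e.g. from skeletal tracking at 10 fps.
    """
    midpoints = (left_ankle + right_ankle) / 2.0
    # Walking speed: mean per-frame displacement of the body, scaled to m/s.
    displacement = np.linalg.norm(np.diff(midpoints, axis=0), axis=1)
    speed = displacement.mean() * fps

    # Step length: the inter-ankle distance peaks when one foot is planted
    # ahead of the other, so the mean of those local maxima approximates it.
    separation = np.linalg.norm(left_ankle - right_ankle, axis=1)
    peaks = [separation[i] for i in range(1, len(separation) - 1)
             if separation[i - 1] < separation[i] >= separation[i + 1]]
    step_length = float(np.mean(peaks)) if peaks else 0.0
    return speed, step_length
```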

As TEO-2 can natively reason in 3D, we can combine multiple sensors’ input into a single 3D representation that enables us to cover multiple rooms or even larger apartments.

Teton AI Gym: A Synthetic Data Generator

Our AI Gym uses procedural generation to composite an effectively unlimited number of indoor environments in which people interact with each other and with objects in the scene. These scenes are procedurally generated using Bayesian inference. More specifically, we use an MCMC process that samples thousands of times from the distribution of all possible scenes using Metropolis-Hastings sampling until it reaches a steady state, yielding a plausible room layout, including walls, window and door placements, and furniture.

Teton AI Gym uses procedural generation to create diverse 3D scenes of indoor environments, where people interact with the environment and the objects in the room.
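To make the sampling procedure concrete, here is a minimal Metropolis-Hastings accept/reject loop of the kind described above; the `propose` and `log_likelihood` callables are hypothetical stand-ins for the actual scene model:

```python
import math
import random

def sample_scene(scene, log_likelihood, propose, n_steps=10_000):
    """Draw a plausible scene layout via Metropolis-Hastings sampling.

    scene          -- any layout state (walls, door/window placements, furniture)
    log_likelihood -- scores how plausible a layout is
    propose        -- returns a randomly perturbed copy of the layout
    """
    current_ll = log_likelihood(scene)
    for _ in range(n_steps):
        candidate = propose(scene)
        candidate_ll = log_likelihood(candidate)
        # Metropolis criterion: always accept improvements; accept worse
        # layouts with probability exp(candidate_ll - current_ll).
        if candidate_ll >= current_ll or random.random() < math.exp(candidate_ll - current_ll):
            scene, current_ll = candidate, candidate_ll
    return scene  # after enough steps, approximately a draw from the scene distribution
```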

The sampling is divided into multiple stages: first the room layout is decided, then large objects like sofas and beds are placed, followed by medium-sized objects, and finally small objects on top of the medium or large ones. People are placed in the scene in realistic scenarios, interacting with their environment by optimizing their positions and poses relative to surrounding objects.

To make scenes look realistic, we use likelihood functions that encode, for example, that chairs are quite likely to be placed symmetrically around a table rather than strewn haphazardly about the floor, that a bed is most likely placed with its headboard against the wall, and so on. The Teton AI Gym is built on top of Infinigen, an open-source research project by Raistrick et al.
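One hypothetical likelihood term of the kind such a sampler could use, rewarding chairs that sit at a natural distance from a table and face it (the geometry and constants are our illustration, not Infinigen's or Teton's actual code):

```python
import numpy as np

def chair_table_log_likelihood(chair_pos, chair_heading, table_pos,
                               preferred_dist=0.7, sigma=0.2):
    """Score a chair placement relative to a table (higher is more plausible).

    chair_pos, table_pos: (2,) floor-plane positions in metres.
    chair_heading: unit vector in the direction the chair faces.
    """
    offset = table_pos - chair_pos
    dist = np.linalg.norm(offset)
    # Gaussian penalty on deviation from the preferred chair-table distance.
    dist_term = -((dist - preferred_dist) ** 2) / (2 * sigma ** 2)
    # Reward chairs whose heading points toward the table (cosine similarity).
    facing_term = float(np.dot(offset / (dist + 1e-8), chair_heading))
    return dist_term + facing_term
```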

Teton AI Gym generates 3D environments that we can view from different viewpoints.

We render images from these procedurally generated scenes using a camera model that closely mimics the sensor we have in production, which drastically reduces the domain gap between our synthetically generated data and our production data. The scenes are rendered from the viewpoints of several ceiling-mounted cameras, simulating the exact parameters of the sensor we use in production. This means the distortion and camera intrinsics of our rendered images perfectly match the ones we see in production.

We can render the scene with a fisheye lens that accurately mimics the one we use in production, so that TEO-2 learns the exact distortion of our lens.
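To illustrate what matching the production sensor means in practice, the sketch below projects 3D scene points through OpenCV's fisheye camera model; the intrinsics and distortion coefficients are placeholders, not our actual calibration:

```python
import numpy as np
import cv2

# Placeholder intrinsics and fisheye distortion (k1..k4) -- in practice these
# are the calibrated parameters of the production ceiling-mounted sensor.
K = np.array([[420.0, 0.0, 640.0],
              [0.0, 420.0, 480.0],
              [0.0, 0.0, 1.0]])
D = np.array([0.05, -0.01, 0.002, -0.0005])

def project_points(points_3d, rvec=np.zeros(3), tvec=np.zeros(3)):
    """Project Nx3 camera-frame points into the simulated fisheye image."""
    pts = points_3d.reshape(-1, 1, 3).astype(np.float64)
    image_points, _ = cv2.fisheye.projectPoints(pts, rvec, tvec, K, D)
    return image_points.reshape(-1, 2)

# Example: a point 2 m below a downward-looking ceiling camera.
uv = project_points(np.array([[0.3, 0.1, 2.0]]))
```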

Prior to TEO-2, our models were trained on a combination of human-annotated data and distillation from open-source models [TEO-1]. The Teton AI Gym has allowed us to train TEO-2 with perfect 3D labels and on a much wider distribution of data than previously possible. This has significantly accelerated our progress, as we are less restricted by human annotations. The Teton AI Gym gives us:

  1. Faster iteration speed: If we decide to implement a new feature that requires a specific type of annotation, which we haven’t collected previously, we can quickly and easily generate thousands of examples and start training in a matter of hours.
  2. Perfect 3D ground truth labels: Since the data are synthesized, the ground truth is known. This means that we have perfect 3D labels. Furthermore, we do not have to go through the costly and time-consuming process of human annotations, as they can be programmatically generated from the scene.
  3. Larger distribution of data: We can expose our model to a wider variety of scenes. This helps the model generalize to unseen scenarios. We can even seed the training dataset with more instances of scenes where the model has historically struggled to make predictions.

Teton AI Gym provides perfect ground truth geometric labels, such as depth, instance segmentations, and surface normals.

Crossing the Uncanny Valley

Creating synthetic data that looks and behaves like real-world data is a significant challenge due to the inherent domain gap. We have applied two strategies to bridge this gap: model distillation and dataset mixing.

In the latter, we carefully mix our synthetic data into our human-curated dataset during training. Because we do not have the same labels for synthetic and human-curated data (for example, we don't use action labels on our synthetic data and we don't use 3D poses on our real data), we made all our losses robust to missing labels, so gradients from each example only contribute to a subset of the network's parameters. We find that mixing the datasets improves the general performance of the model, as it sees a wider data distribution. More fascinating still, the model learns to predict accurately on real images for tasks where we only have labels on the synthetic dataset.
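A minimal PyTorch sketch of a loss made robust to missing labels in this spirit: each task's loss is simply skipped for examples without that task's annotation, so gradients only flow into the heads that have supervision (the task names and loss choices are illustrative):

```python
import torch
import torch.nn.functional as F

def multitask_loss(outputs, targets):
    """Sum per-task losses, skipping tasks whose labels are absent.

    targets[task] is None when an example has no annotation for that task,
    e.g. synthetic data without action labels, or real data without 3D poses.
    """
    total = outputs["depth"].new_zeros(())
    if targets.get("depth") is not None:    # synthetic-only supervision
        total = total + F.l1_loss(outputs["depth"], targets["depth"])
    if targets.get("pose3d") is not None:   # synthetic-only supervision
        total = total + F.mse_loss(outputs["pose3d"], targets["pose3d"])
    if targets.get("action") is not None:   # real-data-only supervision
        total = total + F.cross_entropy(outputs["action"], targets["action"])
    return total
```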

To further improve the model's performance, we fine-tune a large transformer model with a DINOv2 backbone on our synthetic dataset, leveraging the robust feature extraction it has learned from vast amounts of real-world imagery. By fine-tuning DINOv2 on our synthetic images, adapting it to our specific camera model and depth estimation tasks, we get a model that generalizes very well to real images, because the early layers change little during fine-tuning. We use this fine-tuned model to generate pseudo ground truth for real images, effectively transferring knowledge from the large model into our multitask model. This process distills valuable insights, improving realism and closing the uncanny valley between synthetic and real data.
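In pseudocode, such a distillation loop could look like the following: a fine-tuned teacher generates pseudo ground truth on unlabeled real images, and the production multitask model trains against it (the interfaces and loss choice are our illustration, not the exact implementation):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher, real_images):
    """Run the fine-tuned DINOv2-backbone teacher on unlabeled real images."""
    teacher.eval()
    return teacher(real_images)  # e.g. dense metric depth maps

def distillation_step(student, teacher, real_images, optimizer):
    """One step transferring the teacher's depth predictions to the student."""
    pseudo_depth = make_pseudo_labels(teacher, real_images)
    pred_depth = student(real_images)["depth"]
    loss = F.l1_loss(pred_depth, pseudo_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```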

This approach works not only for depth estimation, but also for estimating human body poses and 3D cuboids. The ability to transfer geometric knowledge from synthetic to real imagery is extremely powerful and has enabled several applications, such as human gait estimation and estimation of lying positions. Both are important for understanding a patient's health, e.g. how a patient has recovered from a fall, or as an early indicator of the need for a walking aid.

Gefion: One of Europe’s largest supercomputers

On October 23rd, 2024, Gefion was ceremonially inaugurated by Jensen Huang, CEO of NVIDIA, and His Majesty King Frederik X of Denmark. The supercomputer is an NVIDIA DGX SuperPOD powered by 1,528 NVIDIA H100 GPUs and is operated by the Danish Center for AI Innovation (DCAI). Teton was chosen as one of six projects (two from industry and four from academia) to serve as a pilot project for Gefion, granting us early access to a staggering amount of computational power. Over the last few months we have been hard at work harnessing this power, which has significantly shortened our iteration times and unlocked a new scale for the next generation of our model, TEO-2. Generating synthetic data with the Teton AI Gym across multiple nodes has significantly accelerated our progress toward releasing this model.

Gefion ranks #21 on the TOP500 list of the world's most powerful supercomputers for AI training.

TEO-2: A Unified 3D Vision Architecture

TEO-2 represents a significant improvement over its predecessor, maintaining operational efficiency while making a fundamental shift to 3D spatial reasoning. The model's skeletal tracking now estimates full 3D poses rather than 2D projections, providing a volumetric understanding of patient and caregiver positioning. Similarly, the detection system has evolved from 2D bounding boxes to 3D cuboids, allowing precise spatial localization of people and objects within the care environment. Perhaps most importantly, TEO-2's monocular depth estimation has been dramatically improved to deliver accurate metric depth measurements, creating a foundation for all other 3D reasoning tasks.
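Schematically, such a unified model can be pictured as one shared backbone feeding several task-specific heads; the layer shapes below are illustrative, not the actual TEO-2 architecture:

```python
import torch.nn as nn

class UnifiedVisionModel(nn.Module):
    """One shared backbone, one head per 3D task -- a schematic sketch."""

    def __init__(self, backbone: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.backbone = backbone                      # e.g. a ViT encoder
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)   # metric depth per pixel
        self.pose_head = nn.Linear(feat_dim, 17 * 3)  # 17 joints in 3D
        self.cuboid_head = nn.Linear(feat_dim, 9)     # 3D center, size, rotation

    def forward(self, image):
        feats = self.backbone(image)        # (B, C, H, W) feature map
        pooled = feats.mean(dim=(2, 3))     # global pooling for vector heads
        return {
            "depth": self.depth_head(feats),
            "pose3d": self.pose_head(pooled),
            "cuboid": self.cuboid_head(pooled),
        }
```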

Despite these substantial advancements, TEO-2 maintains the same operational efficiency as its predecessor, achieving 10 frames per second on an NVIDIA Jetson Orin module deployed at the point of care. This edge-based computation continues to ensure complete privacy, as no video footage leaves the patient's room — all processing occurs locally with only anonymized insights being transmitted.

By understanding the point of care within a comprehensive 3D framework, TEO-2 associates semantic meaning with each spatial coordinate. This enables the system to infer advanced biometrics such as gait parameters and lying positions with unprecedented accuracy, while also providing deeper insights into patient-staff interactions. The result is a digital twin of each care environment that empowers healthcare staff to deliver more efficient, personalized, and preventative care.

The spatial intelligence of TEO-2 represents a significant step forward, largely enabled by our procedurally generated synthetic training data. As we continue to enhance the photorealism and diversity of our Teton AI Gym, we anticipate even greater fidelity in our understanding of the point of care, further transforming how healthcare data is captured and utilized.
