We present DynOMo, a method for online point tracking from pose-free monocular videos through joint 3D reconstruction and camera localization based on a dynamic 3D Gaussian representation. We visualize DynOMo's emergent trajectories overlaid on the ground-truth RGB videos.
Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allows for geometric reconstruction, while geometric reconstruction of (dynamic) scenes allows points to be tracked in 3D over time. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting the tracking directly into 3D. However, these approaches either require offline processing or multi-view camera setups, both of which are unrealistic for real-world applications such as robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input by introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends the set of 3D Gaussians to capture new content and object motion while estimating camera movement from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular, unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding the applicability to diverse real-world scenarios.
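To make the similarity-enhanced regularization idea concrete, below is a minimal PyTorch-style sketch of a rigidity term whose per-neighbor weights are derived from feature similarity. The function name, tensor shapes, and the exact form of the weighting are our illustrative assumptions, not DynOMo's actual implementation.

import torch
import torch.nn.functional as F

def similarity_weighted_rigidity_loss(
    means_t: torch.Tensor,     # (N, 3) Gaussian centers at frame t
    means_prev: torch.Tensor,  # (N, 3) Gaussian centers at frame t-1
    feats: torch.Tensor,       # (N, D) per-Gaussian reconstructed image features
    knn_idx: torch.Tensor,     # (N, K) indices of the K nearest neighbors
) -> torch.Tensor:
    # Hypothetical sketch: penalize changes in the relative offsets between
    # neighboring Gaussians, weighted by their feature similarity, so that
    # Gaussians on the same object move rigidly while dissimilar neighbors
    # (e.g., across object boundaries) remain largely unconstrained.
    feats = F.normalize(feats, dim=-1)
    sim = (feats.unsqueeze(1) * feats[knn_idx]).sum(-1).clamp(min=0.0)  # (N, K)
    offset_t = means_t.unsqueeze(1) - means_t[knn_idx]                  # (N, K, 3)
    offset_prev = means_prev.unsqueeze(1) - means_prev[knn_idx]         # (N, K, 3)
    residual = (offset_t - offset_prev).norm(dim=-1)                    # (N, K)
    return (sim * residual).mean()

Intuitively, such a weighting encourages Gaussians with similar features to move rigidly together without any correspondence-level supervision, which is what allows point trajectories to emerge from the reconstruction itself.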
Our pipeline takes as input a video sequence, (predicted) depth maps, sparse segmentation masks, and image features. Building on [1, 2], we combine a powerful 3DGS-based dynamic scene representation with a flexible online tracking paradigm for pose-free videos through simultaneous scene reconstruction and camera localization. The key to DynOMo's performance lies in three technical adaptations for the online tracking setting: 1) enhancing the reconstruction signal with stronger image features and depth supervision, 2) semantics-based foreground and background separation to enable camera tracking, and 3) bootstrapping the motion regularization via a feature-similarity-guided weighting mechanism. The figure above gives an overview of our online reconstruction pipeline. For each frame, we optimize the camera pose C, add a set of new Gaussians following the densification concept of [1], optimize all Gaussians jointly, and forward-propagate G and C. Finally, we directly extract 3D point trajectories from individual Gaussians Gp and project them to the image plane to obtain 2D trajectories; a minimal sketch of this read-out is given below.
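The following self-contained PyTorch-style sketch illustrates the trajectory read-out step under our own assumptions (variable names, tensor shapes, and the simple pinhole projection are placeholders, not DynOMo's code): 3D tracks are taken directly from the per-Gaussian centers over time and projected with the estimated camera poses to obtain 2D tracks.

import torch

def gaussian_tracks(
    means_per_frame: torch.Tensor,  # (T, N, 3) Gaussian centers in world coordinates
    w2c_per_frame: torch.Tensor,    # (T, 4, 4) estimated world-to-camera poses
    K: torch.Tensor,                # (3, 3) pinhole intrinsics
):
    # The 3D track of a query point is simply the center of its Gaussian over time;
    # projecting those centers with the per-frame camera pose yields the 2D track.
    # Assumes points lie in front of the camera (positive depth).
    T, N, _ = means_per_frame.shape
    homog = torch.cat([means_per_frame, means_per_frame.new_ones(T, N, 1)], dim=-1)
    cam = torch.einsum('tij,tnj->tni', w2c_per_frame, homog)[..., :3]   # (T, N, 3)
    pix = torch.einsum('ij,tnj->tni', K, cam)
    tracks_2d = pix[..., :2] / pix[..., 2:3].clamp(min=1e-8)            # (T, N, 2)
    return means_per_frame, tracks_2d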
We compare 2D tracks from SpaTracker [3], the state-of-the-art method, and OmniMotion [4], another optimization-based approach. While SpaTracker was trained on trajectory data and OmniMotion requires heavily preprocessed optical flow as input and is optimized offline, DynOMo's trajectories emerge without trajectory supervision or precomputed flow.
We compare 2D and 3D tracking of Shape of Motion [6] and TAPIR [5] (both visualizations taken from Shape of Motion) against DynOMo. While TAPIR was trained on trajectory data and Shape of Motion requires TAPIR 2D tracks as input and is optimized offline, DynOMo obtains 2D and 3D tracks in an emergent manner. Due to the nature of the Gaussian representation, Gaussians representing non-rigid objects drift slightly over time.
DynOMo struggles with extreme camera motion, rotation, and acceleration. Additionally, severe occlusions, especially close to the camera, break both the camera pose optimization and the reconstruction.
@inproceedings{dynomo2025,
  title     = {DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction},
  author    = {Jenny Seidenschwarz and Qunjie Zhou and Bardienus Duisterhof and Deva Ramanan and Laura Leal{-}Taix{\'{e}}},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2025}
}