DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction

1Technical University of Munich   2Carnegie Mellon University  
3NVIDIA

We present DynOMo, a method for online monocular point tracking from pose-free videos via joint 3D reconstruction and camera localization based on a dynamic 3D Gaussian representation. We visualize DynOMo's emergent trajectories on the ground-truth RGB videos.


Abstract

Reconstructing scenes and tracking motion are two sides of the same coin. Tracking points allows for geometric reconstruction, while geometric reconstruction of (dynamic) scenes allows for 3D tracking of points over time. The latter was recently also exploited for 2D point tracking to overcome occlusion ambiguities by lifting tracking directly into 3D. However, these approaches either require offline processing or multi-view camera setups, both unrealistic for real-world applications like robot navigation or mixed reality. We target the challenge of online 2D and 3D point tracking from unposed monocular camera input, introducing Dynamic Online Monocular Reconstruction (DynOMo). We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion. Our approach extends 3D Gaussians to capture new content and object motions while estimating camera movements from a single RGB frame. DynOMo stands out by enabling the emergence of point trajectories through robust image feature reconstruction and a novel similarity-enhanced regularization term, without requiring any correspondence-level supervision. It sets the first baseline for online point tracking with monocular unposed cameras, achieving performance on par with existing methods. We aim to inspire the community to advance online point tracking and reconstruction, expanding its applicability to diverse real-world scenarios.



Overview



Our pipeline takes a video sequence, (predicted) depth maps, sparse segmentation masks, and image features as input. Building on [1, 2], we combine a powerful 3DGS-based dynamic scene representation with a flexible online tracking paradigm for pose-free videos through simultaneous scene reconstruction and camera localization. The key to DynOMo’s performance lies in three technical adaptations for the online tracking setting: 1) enhancing the reconstruction signal with stronger image feature and depth supervision, 2) semantic-based foreground and background separation to enable camera tracking, and 3) bootstrapping the motion regularization via a feature-similarity-guided weighting mechanism (sketched below). The figure above gives an overview of our online reconstruction pipeline. For each frame, we optimize the camera pose C, add a set of new Gaussians based on the densification concept of [1], jointly optimize all Gaussians, and forward-propagate G and C. Finally, we extract 3D point trajectories directly from individual Gaussians Gp and project them onto the image plane to obtain 2D trajectories.
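To make adaptation 3 concrete, below is a minimal, self-contained PyTorch sketch of a feature-similarity-weighted rigidity term. It is a simplified, isometry-style variant in the spirit of the rigidity regularizers of [1, 2], not the released DynOMo implementation; all function and parameter names (similarity_weighted_rigidity, k) are illustrative. The idea: the rigidity penalty between neighboring Gaussians is down-weighted for pairs with dissimilar features, so motion supervision is bootstrapped only between points that likely belong to the same surface.

import torch
import torch.nn.functional as F

def similarity_weighted_rigidity(mu_prev, mu_curr, feats, k=8):
    """Penalize non-rigid relative motion between neighboring Gaussians,
    softly weighted by the similarity of their per-Gaussian features.

    mu_prev, mu_curr: (N, 3) Gaussian centers at frames t-1 and t
    feats:            (N, D) per-Gaussian feature vectors
    """
    # k-nearest neighbors by center distance at t-1 (brute-force O(N^2),
    # adequate for a sketch).
    dists = torch.cdist(mu_prev, mu_prev)                     # (N, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]     # drop self -> (N, k)

    # Cosine feature similarity clamped to [0, 1] as the per-pair weight:
    # Gaussians on the same object part should move together, unrelated
    # Gaussians are free to move independently.
    f = F.normalize(feats, dim=-1)
    w = (f.unsqueeze(1) * f[knn]).sum(dim=-1).clamp(min=0.0)  # (N, k)

    # Locally rigid motion: neighbor offsets should be preserved over time.
    rel_prev = mu_prev.unsqueeze(1) - mu_prev[knn]            # (N, k, 3)
    rel_curr = mu_curr.unsqueeze(1) - mu_curr[knn]            # (N, k, 3)
    return (w * (rel_curr - rel_prev).norm(dim=-1)).mean()

In the full pipeline, such a term would be optimized per incoming frame jointly with the photometric, depth, and feature reconstruction losses.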



Emergent Trajectories and Online Reconstruction on DAVIS

Input Video

Emergent Trajectories

Online Reconstruction




Emergent Trajectories and Online Reconstruction on Panoptic Sports

Input Video

Emergent Trajectories

Online Reconstruction

Image Features




Emergent Trajectories and Online Reconstruction on the iPhone Dataset

Input Video

Emergent Trajectories

Online Reconstruction




Comparisons on 2D and 3D Tracking

2D Tracking Comparison on DAVIS

We visualize the comparison with 2D tracks from SpaTracker [3], the state-of-the-art method, and OmniMotion [4], another optimization-based approach. While SpaTracker was trained on trajectory data, and OmniMotion requires heavily preprocessed flow as input and is optimized offline, DynOMo generates emergent trajectories.


SpaTracker

OmniMotion

DynOMo


Tracking Comparison on the iPhone Dataset

We visualize the comparison of 2D and 3D tracking between TAPIR [5], Shape of Motion [6] (both visualizations taken from Shape of Motion), and DynOMo. While TAPIR was trained on trajectory data, and Shape of Motion requires TAPIR's 2D tracks as input and is optimized offline, DynOMo obtains 2D and 3D tracks in an emergent manner. Due to the nature of the Gaussian representation, Gaussians that represent non-rigid objects drift slightly over time.
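As an aside, turning the per-Gaussian 3D tracks into the 2D tracks shown here amounts to a standard pinhole projection of the Gaussian centers with the estimated camera poses. Below is a minimal sketch, assuming known intrinsics K and per-frame world-to-camera poses; tensor shapes and names are illustrative, not the released code.

import torch

def project_tracks(mu_seq, K, w2c_seq, eps=1e-8):
    """Project per-Gaussian 3D centers over time to 2D pixel tracks.

    mu_seq:  (T, N, 3) Gaussian centers per frame, world coordinates
    K:       (3, 3)    pinhole camera intrinsics
    w2c_seq: (T, 4, 4) estimated world-to-camera poses per frame
    Returns: (T, N, 2) pixel coordinates
    """
    T, N, _ = mu_seq.shape
    homo = torch.cat([mu_seq, mu_seq.new_ones(T, N, 1)], dim=-1)  # (T, N, 4)
    cam = torch.einsum('tij,tnj->tni', w2c_seq, homo)[..., :3]    # camera frame
    pix = torch.einsum('ij,tnj->tni', K, cam)                     # perspective
    return pix[..., :2] / pix[..., 2:].clamp(min=eps)             # dehomogenize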


TAPIR

Shape of Motion

DynOMo



Failure Cases

DynOMo struggles with extreme camera motion, rotation, and acceleration. Additionally, severe occlusions, especially close to the camera, break both the camera pose optimization and the reconstruction.


Emergent Trajectories

Online Reconstruction




Acknowledgements

We thank the following works for open-sourcing their code, data preprocessing, and project pages:
  1. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  2. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
  3. Shape of Motion: 4D Reconstruction from a Single Video

BibTeX


@inproceedings{seidenschwarz2025dynomo,
  title     = {DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction},
  author    = {Jenny Seidenschwarz and Qunjie Zhou and Bardienus Duisterhof and Deva Ramanan and Laura Leal{-}Taix{\'{e}}},
  booktitle = {International Conference on 3D Vision (3DV)},
  year      = {2025}
}