Tracking and Understanding Object Transformations

NeurIPS 2025

Cornell University

Track Any State: Given a video and a first-frame object mask as the prompt (shown as a cyan contour), we produce consistent object tracks (top-right) while building a state graph that records each detected transformation and its resulting effect.

Abstract

Real-world objects frequently undergo state transformations: an apple is cut into pieces, a butterfly emerges from its cocoon. Tracking objects through such changes is essential for understanding real-world objects and dynamics. However, existing methods often lose the target object after a transformation because its appearance changes significantly.

To address this limitation, we introduce TubeletGraph, a zero-shot system that recovers missing objects after transformation and maps out how object states evolve over time. TubeletGraph first identifies potentially overlooked tracks and decides, based on semantic and proximity priors, whether they should be integrated. It then reasons about each added track and generates a state graph describing every observed transformation. TubeletGraph achieves state-of-the-art tracking performance under transformations while demonstrating a deeper understanding of state changes, with promising capabilities in temporal grounding and semantic reasoning about complex object transformations.

 

Methodology

Overview of TubeletGraph: (1) We first partition the input video into a set of tubelets ("Init. & Later Tubelets", Middle). (2) We then use spatial and semantic proximity to the prompted object to decide which tracks to include in the final prediction ("Predicted Object Tracks", Top-right). (3) Finally, for each added track, we prompt a vision-language model to build a state graph describing each detected transformation and its resulting effect ("Predicted State Graph", Bottom). A simplified sketch of this pipeline is given below.
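For concreteness, here is a minimal sketch of how the three steps could fit together. All names here (Tubelet, spatial_proximity, select_tracks, vlm_describe) and the thresholds are hypothetical stand-ins, not the paper's actual implementation; the real system's segmentation, features, and prompting are more involved.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Tubelet:
    """One track produced in Step 1 (hypothetical stand-in structure)."""
    track_id: int
    start_frame: int
    masks: dict             # frame index -> binary mask (H x W bool array)
    embedding: np.ndarray   # pooled appearance feature for the whole track


def spatial_proximity(candidate: Tubelet, prompt: Tubelet) -> float:
    """IoU between the candidate's first mask and the prompted object's most
    recent mask at or before that frame -- a rough spatial prior."""
    cand = candidate.masks[candidate.start_frame]
    ref_frame = max(
        (f for f in prompt.masks if f <= candidate.start_frame),
        default=min(prompt.masks),
    )
    ref = prompt.masks[ref_frame]
    inter = np.logical_and(cand, ref).sum()
    union = np.logical_or(cand, ref).sum()
    return float(inter) / float(union) if union else 0.0


def semantic_similarity(a: Tubelet, b: Tubelet) -> float:
    """Cosine similarity between the two tracks' pooled embeddings."""
    return float(
        np.dot(a.embedding, b.embedding)
        / (np.linalg.norm(a.embedding) * np.linalg.norm(b.embedding))
    )


def select_tracks(later_tubelets, prompt, tau_spatial=0.3, tau_semantic=0.5):
    """Step 2: admit a later tubelet only if it appears near the prompted
    object AND looks semantically consistent with it (illustrative thresholds)."""
    selected = [prompt]
    for t in later_tubelets:
        if (spatial_proximity(t, prompt) > tau_spatial
                and semantic_similarity(t, prompt) > tau_semantic):
            selected.append(t)
    return selected


def build_state_graph(selected, vlm_describe):
    """Step 3: for each added track, query a vision-language model for the
    transformation that produced it, yielding (parent, action, child) edges."""
    prompt, added = selected[0], selected[1:]
    return [
        (prompt.track_id, vlm_describe(prompt, t), t.track_id)
        for t in added
    ]
```

In this sketch a candidate must pass both the spatial and the semantic test, mirroring the paper's use of proximity and semantic priors to filter overlooked tracks before any reasoning about transformations takes place.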

BibTeX

Coming Soon!