Related papers: TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References

URL: http://arxiv.org/abs/2512.21641v1
Date: Thu, 25 Dec 2025 12:02:56 GMT
Title: TrackTeller: Temporal Multimodal 3D Grounding for Behavior-Dependent Object References
Authors: Jiahong Yu, Ziqi Wang, Hailiang Zhao, Wei Zhai, Xueqiang Yan, Shuiguang Deng
Abstract summary: We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture.
Score: 33.32268865872059
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Understanding natural-language references to objects in dynamic 3D driving scenes is essential for interactive autonomous systems. In practice, many referring expressions describe targets through recent motion or short-term interactions, which cannot be resolved from static appearance or geometry alone. We study temporal language-based 3D grounding, where the objective is to identify the referred object in the current frame by leveraging multi-frame observations. We propose TrackTeller, a temporal multimodal grounding framework that integrates LiDAR-image fusion, language-conditioned decoding, and temporal reasoning in a unified architecture. TrackTeller constructs a shared UniScene representation aligned with textual semantics, generates language-aware 3D proposals, and refines grounding decisions using motion history and short-term dynamics. Experiments on the NuPrompt benchmark demonstrate that TrackTeller consistently improves language-grounded tracking performance, outperforming strong baselines with a 70% relative improvement in Average Multi-Object Tracking Accuracy and a 3.15-3.4 times reduction in False Alarm Frequency.

Related papers

Video Spatial Reasoning with Object-Centric 3D Rollout [58.12446467377404]
We propose Object-Centric 3D Rollout (OCR) to enable robust video spatial reasoning. OCR introduces structured perturbations to the 3D geometry of selected objects during training. OCR compels the model to reason holistically across the entire scene.
arXiv Detail & Related papers (2025-11-17T09:53:41Z)
LSVG: Language-Guided Scene Graphs with 2D-Assisted Multi-Modal Encoding for 3D Visual Grounding [15.944945244005952]
3D visual grounding aims to localize the unique target described by natural languages in 3D scenes. We propose a novel 3D visual grounding framework that constructs language-guided scene graphs with referred object discrimination.
arXiv Detail & Related papers (2025-05-07T02:02:15Z)
IRef-VLA: A Benchmark for Interactive Referential Grounding with Imperfect Language in 3D Scenes [10.139461308573336]
IRef-VLA is the largest real-world dataset for the referential grounding task consisting of over 11.5K scanned 3D rooms. We aim to provide a resource for 3D scene understanding that aids the development of robust, interactive navigation systems.
arXiv Detail & Related papers (2025-03-20T16:16:10Z)
Cognitive Disentanglement for Referring Multi-Object Tracking [28.325814292139686]
We propose a Cognitive Disentanglement for Referring Multi-Object Tracking (CDRMT) framework. CDRMT adapts the "what" and "where" pathways from the human visual processing system to RMOT tasks. Experiments on different benchmark datasets demonstrate that CDRMT achieves substantial improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-03-14T15:21:54Z)
AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring [49.78120051062641]
3D visual grounding aims to correlate a natural language description with the target object within a 3D scene. Existing approaches commonly encounter a shortage of text-3D pairs available for training. We propose AugRefer, a novel approach for advancing 3D visual grounding.
arXiv Detail & Related papers (2025-01-16T09:57:40Z)
Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking [73.05477052645885]
We introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes.
arXiv Detail & Related papers (2024-10-02T15:48:42Z)
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Hierarchical Temporal Context Learning paradigm for improving camera-based semantic scene completion. Our method ranks 1st on the SemanticKITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
Tracking Objects and Activities with Attention for Temporal Sentence Grounding [51.416914256782505]
Temporal sentence grounding (TSG) aims to localize the temporal segment that is semantically aligned with a natural language query in an untrimmed video. We propose a novel Temporal Sentence Tracking Network (TSTNet), which contains (A) a Cross-modal Targets Generator to generate multi-modal templates and search space, and (B) a Temporal Sentence Tracker to track multi-modal targets' behavior and to predict the query-related segment.
arXiv Detail & Related papers (2023-02-21T16:42:52Z)
Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query. Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information. We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding [15.617150859765024]
We exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data. We propose a TransRefer3D network to extract entity-and-relation aware multimodal context. Our proposed model significantly outperforms existing approaches by up to 10.6%.
arXiv Detail & Related papers (2021-08-05T05:47:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.