TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
- URL: http://arxiv.org/abs/2602.19768v2
- Date: Tue, 24 Feb 2026 03:15:34 GMT
- Title: TraceVision: Trajectory-Aware Vision-Language Model for Human-Like Spatial Understanding
- Authors: Fan Yang, Shurong Zheng, Hongyin Zhao, Yufei Zhan, Xin Li, Yousong Zhu, Chaoyang Zhao, Ming Tang, Jinqiao Wang
- Abstract summary: We propose a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis.
- Score: 28.808796664403342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent Large Vision-Language Models (LVLMs) demonstrate remarkable capabilities in image understanding and natural language generation. However, current approaches focus predominantly on global image understanding, struggling to simulate human visual attention trajectories and explain associations between descriptions and specific regions. We propose TraceVision, a unified vision-language model integrating trajectory-aware spatial understanding in an end-to-end framework. TraceVision employs a Trajectory-aware Visual Perception (TVP) module for bidirectional fusion of visual features and trajectory information. We design geometric simplification to extract semantic keypoints from raw trajectories and propose a three-stage training pipeline where trajectories guide description generation and region localization. We extend TraceVision to trajectory-guided segmentation and video scene understanding, enabling cross-frame tracking and temporal attention analysis. We construct the Reasoning-based Interactive Localized Narratives (RILN) dataset to enhance logical reasoning and interpretability. Extensive experiments on trajectory-guided captioning, text-guided trajectory prediction, understanding, and segmentation demonstrate that TraceVision achieves state-of-the-art performance, establishing a foundation for intuitive spatial interaction and interpretable visual understanding.
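The abstract mentions a geometric simplification step that extracts semantic keypoints from raw trajectories, but it does not say which algorithm is used. As a rough illustration only, the sketch below uses the Ramer-Douglas-Peucker algorithm as a stand-in; the function names, the `epsilon` tolerance, and the 2D point format are all assumptions for this example, not details taken from the paper.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def _point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Euclidean distance from point p to the line segment a-b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        # Degenerate segment: a and b coincide.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_trajectory(points: List[Point], epsilon: float) -> List[Point]:
    """Ramer-Douglas-Peucker simplification (assumed stand-in, not the
    paper's method): keep only points that deviate more than `epsilon`
    from the chord joining the endpoints of their sub-trajectory."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    max_dist, max_idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _point_segment_distance(points[i], first, last)
        if d > max_dist:
            max_dist, max_idx = d, i
    if max_dist <= epsilon:
        # Every interior point is close to the chord; drop them all.
        return [first, last]
    # Split at the farthest point and simplify both halves recursively.
    left = simplify_trajectory(points[: max_idx + 1], epsilon)
    right = simplify_trajectory(points[max_idx:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point
```

Calling `simplify_trajectory` on a noisy L-shaped trace with a small tolerance should keep only the two endpoints and the corner, which is the kind of keypoint reduction the abstract alludes to.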
Related papers
- Lifting Vision: Ground to Aerial Localization with Reasoning Guided Planning [0.0]
We introduce Visual Reasoning for localization, or ViReLoc, which performs planning and localization using only visual representations. The proposed framework learns spatial dependencies and geometric relations that text-based reasoning often struggles to capture. Experiments across diverse navigation and localization scenarios show consistent improvements in spatial reasoning accuracy and cross-view retrieval performance.
arXiv Detail & Related papers (2025-12-30T18:36:39Z)
- In-Video Instructions: Visual Signals as Generative Control [79.44662698914401]
We investigate whether capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions. In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. Experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions.
arXiv Detail & Related papers (2025-11-24T18:38:45Z)
- Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs [41.977702477816756]
Existing Vision-and-Language Navigation (VLN) agents encode images as textual scene descriptions. We improve the navigation agent's contextual understanding by incorporating textual descriptions from multiple perspectives. We evaluate our approach on the R2R dataset, where our experiments demonstrate significant improvements in navigation performance.
arXiv Detail & Related papers (2025-09-29T17:51:01Z)
- Context-Enhanced Multi-View Trajectory Representation Learning: Bridging the Gap through Self-Supervised Models [27.316692263196277]
MVTraj is a novel multi-view modeling method for trajectory representation learning.
It integrates diverse contextual knowledge, from GPS to road networks and points-of-interest, to provide a more comprehensive understanding of trajectory data.
Extensive experiments on real-world datasets demonstrate that MVTraj significantly outperforms existing baselines in tasks associated with various spatial views.
arXiv Detail & Related papers (2024-10-17T03:56:12Z)
- Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects [11.117055725415446]
Large Vision Language Models (LVLMs) have demonstrated impressive zero-shot capabilities in various vision-language dialogue scenarios.
The absence of fine-grained visual object detection hinders the model from understanding the details of images, leading to irreparable visual hallucinations and factual errors.
We propose Lyrics, a novel multi-modal pre-training and instruction fine-tuning paradigm that bootstraps vision-language alignment from fine-grained cross-modal collaboration.
arXiv Detail & Related papers (2023-12-08T09:02:45Z)
- VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search [51.9899504535878]
We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention is employed to extract visual-related textual features.
With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features.
arXiv Detail & Related papers (2023-11-13T17:56:54Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, where the cores are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation [52.65506307440127]
We propose GeoVLN, which learns Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information.
arXiv Detail & Related papers (2023-05-26T17:15:22Z)
- Cross-modal Map Learning for Vision and Language Navigation [82.04247028482244]
We consider the problem of Vision-and-Language Navigation (VLN).
In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations.
We propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions.
arXiv Detail & Related papers (2022-03-10T03:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.