Related papers: VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

URL: http://arxiv.org/abs/2509.19002v1
Date: Tue, 23 Sep 2025 13:46:31 GMT
Title: VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction
Authors: Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara
Abstract summary: We present VIR-Bench, a benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores. We conduct an in-depth case study in which we develop a prototype travel-planning agent.
Score: 14.873988791609127
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs' geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.

Related papers

VLN-MME: Diagnosing MLLMs as Language-guided Visual Navigation agents [12.383467162169703]
We introduce a unified evaluation framework to probe MLLMs as zero-shot agents. We simplify the evaluation with a highly modular and accessible design. We observe that enhancing our baseline agent with Chain-of-Thought (CoT) reasoning and self-language leads to an unexpected performance decrease.
arXiv Detail & Related papers (2025-12-31T13:21:21Z)
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph [29.737059125885057]
Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on ML-Bench. Code, model, and data will be released.
arXiv Detail & Related papers (2025-10-13T03:26:56Z)
A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
arXiv Detail & Related papers (2025-05-27T04:50:07Z)
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning [42.316341452766075]
This paper aims to enhance video perception with Reinforcement Fine-Tuning (RFT). We develop VideoChat-R1, a powerful video MLLM that achieves state-of-the-art performance on spatio-temporal tasks without sacrificing chat ability. Our findings underscore the potential of RFT for specialized task enhancement of Video MLLMs.
arXiv Detail & Related papers (2025-04-09T15:09:27Z)
ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models [63.12671761097701]
Vision-Language Models (VLMs) struggle to analyze elements like traveled distance and speed of moving objects. We construct a benchmark dataset referred to as STKit and ST-Bench. We show that ST-VLM generalizes robustly across diverse domains and tasks.
arXiv Detail & Related papers (2025-03-25T05:08:06Z)
Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. Tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes. Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes. We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
HiLM-D: Enhancing MLLMs with Multi-Scale High-Resolution Details for Autonomous Driving [44.06475712570428]
HiLM-D is a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the fact that the primary variations in autonomous driving scenarios are the motion trajectories. Experiments show HiLM-D's significant improvements over current MLLMs, with a 3.7% improvement in BLEU-4 for captioning and 8.7% in mIoU for detection.
arXiv Detail & Related papers (2023-09-11T01:24:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.