Object-centric Video Question Answering with Visual Grounding and Referring
- URL: http://arxiv.org/abs/2507.19599v1
- Date: Fri, 25 Jul 2025 18:11:23 GMT
- Title: Object-centric Video Question Answering with Visual Grounding and Referring
- Authors: Haochen Wang, Qirui Chen, Cilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, Weidi Xie, Stratis Gavves,
- Abstract summary: We introduce a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks.<n>We also propose STOM, a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video.<n>We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation.
- Score: 43.963739052764595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting the flexibility for object-centric, multiround interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM model, capable of performing both object referring for input and grounding for output in video reasoning tasks, i.e., allowing users to interact with videos using both textual and visual prompts; (ii) we propose STOM (Spatial-Temporal Overlay Module), a novel approach that propagates arbitrary visual prompts input at any single timestamp to the remaining frames within a video; (iii) we present VideoInfer, a manually curated object-centric video instruction dataset featuring questionanswering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation. The results on 12 benchmarks of 6 tasks show that our proposed model consistently outperforms baselines in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. Project page: https://qirui-chen.github.io/RGA3-release/.
Related papers
- Can Video LLMs Refuse to Answer? Alignment for Answerability in Video Large Language Models [21.966865098520277]
Video Large Language Models (Video-LLMs) are predominantly trained on questions generated directly from video content.<n>In real-world scenarios, users often pose questions that extend beyond the informational scope of the video.<n>We propose alignment for answerability, a framework that equips Video-LLMs with the ability to evaluate the relevance of a question based on the input video.
arXiv Detail & Related papers (2025-07-07T13:19:43Z) - SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained-temporal understanding in videos remains a major challenge for current Video Large Multimodels (Video LMMs)<n>We contribute in three core aspects: dataset, model, and benchmark.<n>First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically to enable joint learning of video understanding, grounding, and multi-turn video chat.<n>Second, we propose the SAMA model, which incorporates a versatile-temporal context aggregator and a Segment Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z) - Lost in Time: A New Temporal Benchmark for VideoLLMs [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning.<n>We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z) - ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We present ViLLa: Video reasoning segmentation with Large Language Model.<n>Our ViLLa manages to tackle these challenges through multiple core innovations.<n>To enable efficient processing of long videos, ViLLa incorporates (3) a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (LMLMs)<n>We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.<n>We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural
Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model is also comprised of dual-level attention (word/object and frame level), multi-head self-cross-integration for different sources (video and dense captions), and which pass more relevant information to gates.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.