TextVidBench: A Benchmark for Long Video Scene Text Understanding
- URL: http://arxiv.org/abs/2506.04983v1
- Date: Thu, 05 Jun 2025 12:54:56 GMT
- Title: TextVidBench: A Benchmark for Long Video Scene Text Understanding
- Authors: Yangyang Zhong, Ji Qi, Yuan Yao, Pengxin Luo, Yunfeng Yan, Donglian Qi, Zhiyuan Liu, Tat-Seng Chua
- Abstract summary: We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench spans 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. We also propose an efficient paradigm for improving large models by (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data.
- Score: 60.94150574231576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent progress on the short-video Text-Visual Question Answering (ViteVQA) task - largely driven by benchmarks such as M4-ViteVQA - existing datasets still suffer from limited video duration and narrow evaluation scopes, making it difficult to adequately assess the growing capabilities of powerful multimodal large language models (MLLMs). To address these limitations, we introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions: 1) Cross-domain long-video coverage: Spanning 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. 2) A three-stage evaluation framework: "Text Needle-in-Haystack -> Temporal Grounding -> Text Dynamics Captioning". 3) High-quality fine-grained annotations: Containing over 5,000 question-answer pairs with detailed semantic labeling. Furthermore, we propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on video-text data. Extensive experiments on multiple public datasets as well as TextVidBench demonstrate that our new benchmark presents significant challenges to existing models, while our proposed method offers valuable insights into improving long-video scene text understanding capabilities.
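The abstract names non-uniform positional encoding as one lever for handling long videos but gives no construction here. Below is a minimal sketch of the general idea, assuming a RoPE-based backbone: position indices are kept intact inside the trained context window and compressed beyond it, so distant video tokens still land in a positional range the model saw during training. The function names and the `compress` factor are illustrative assumptions, not the paper's IT-Rope.

```python
import torch

def nonuniform_positions(seq_len: int, train_len: int, compress: float = 4.0) -> torch.Tensor:
    # Identity spacing inside the trained window; compressed spacing beyond it,
    # so tokens from the tail of a long video map back into a familiar range.
    # `compress` is an illustrative knob, not a value from the paper.
    pos = torch.arange(seq_len, dtype=torch.float32)
    overflow = (pos - train_len).clamp(min=0.0)
    return pos.clamp(max=float(train_len)) + overflow / compress

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0):
    # Standard RoPE frequencies applied to the remapped (possibly fractional) positions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.outer(positions, inv_freq)  # (seq_len, head_dim // 2)
    return angles.cos(), angles.sin()

# Example: an 8k-token video sequence squeezed toward a 4k trained context.
cos, sin = rope_angles(nonuniform_positions(8192, 4096), head_dim=128)
```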
Related papers
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
- VidText: Towards Comprehensive Evaluation for Video Text Understanding [54.15328647518558]
VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding. It covers a wide range of real-world scenarios and supports multilingual content. It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
arXiv Detail & Related papers (2025-05-28T19:39:35Z)
- SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained spatio-temporal understanding in videos remains a major challenge for current Video Large Multimodal Models (Video LMMs). We contribute in three core aspects: dataset, model, and benchmark. First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos, built specifically to enable joint learning of video understanding, grounding, and multi-turn video chat. Second, we propose the SAMA model, which incorporates a versatile spatio-temporal context aggregator and a Segment Anything Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
- LoVR: A Benchmark for Long Video Retrieval in Multimodal Contexts [35.49959781944883]
We introduce LoVR, a benchmark specifically designed for long video-text retrieval. LoVR contains 467 long videos and over 40,804 fine-grained clips with high-quality captions. Our benchmark introduces longer videos, more detailed captions, and a larger-scale dataset.
arXiv Detail & Related papers (2025-05-20T04:49:09Z)
- Vidi: Large Multimodal Models for Video Understanding and Editing [33.56852569192024]
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing scenarios. The first release focuses on temporal retrieval: identifying the time ranges within the input videos that correspond to a given text query. We also present the VUE-TR benchmark, which introduces five key advancements.
arXiv Detail & Related papers (2025-04-22T08:04:45Z)
- Re-thinking Temporal Search for Long-Form Video Understanding [67.12801626407135]
Current temporal search methods achieve only a 2.1% temporal F1 score on the LongVideoBench subset. Inspired by visual search in images, we propose T*, a lightweight temporal search framework that reframes costly temporal search as spatial search. Extensive experiments show that integrating T* with existing methods significantly improves state-of-the-art long-form video understanding.
arXiv Detail & Related papers (2025-04-03T04:03:10Z)
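The T* entry above cites a temporal F1 score without defining the metric. As a hedged sketch (the names are illustrative, not the paper's evaluation code), one common frame-level formulation scores the predicted relevant frame indices against the annotated ones:

```python
def temporal_f1(pred_frames: set[int], gt_frames: set[int]) -> float:
    # precision = |pred ∩ gt| / |pred|, recall = |pred ∩ gt| / |gt|,
    # F1 = harmonic mean of the two; empty sets score zero.
    if not pred_frames or not gt_frames:
        return 0.0
    tp = len(pred_frames & gt_frames)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_frames), tp / len(gt_frames)
    return 2 * precision * recall / (precision + recall)

print(temporal_f1({10, 11, 12, 40}, {11, 12, 13}))  # ~0.571
```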
- MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval [61.414236415351446]
We propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR). MomentSeeker is built on long and diverse videos, averaging over 1200 seconds in duration. It covers a variety of real-world scenarios at three levels (global, event, and object), spanning common tasks such as action recognition, object localization, and causal reasoning.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
- Temporal Reasoning Transfer from Text to Video [51.68487044397409]
Video Large Language Models (Video LLMs) struggle with tracking temporal changes and reasoning about temporal relationships.
We introduce Textual Temporal reasoning Transfer (T3) to transfer temporal reasoning abilities from text to video domains.
The resulting LongVA-7B model achieves competitive performance on comprehensive video benchmarks.
arXiv Detail & Related papers (2024-10-08T16:10:29Z)
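The T3 entry above says temporal reasoning is transferred from text, but not how the text-only training tasks are built. A minimal sketch of one plausible construction (all names hypothetical, not the authors' pipeline): turn an ordered event list into a before/after question whose answer follows from the stated order alone.

```python
import random

def make_order_question(events: list[str], rng: random.Random) -> dict:
    # Sample two events from a temporally ordered list and ask which came
    # first; the textual context states the order, so events[i] is correct.
    i, j = sorted(rng.sample(range(len(events)), 2))
    context = ", then ".join(events).capitalize() + "."
    return {
        "context": context,
        "question": f"Which happened first: '{events[i]}' or '{events[j]}'?",
        "answer": events[i],
    }

rng = random.Random(0)
print(make_order_question(
    ["the pan heats up", "the chef cracks an egg", "the omelet is plated"], rng))
```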
- DTVLT: A Multi-modal Diverse Text Benchmark for Visual Language Tracking Based on LLM [23.551036494221222]
We propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks.
We offer four texts in our benchmark, considering the extent and density of semantic information.
We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance.
arXiv Detail & Related papers (2024-10-03T13:57:07Z)
- Chrono: A Simple Blueprint for Representing Time in MLLMs [34.036784478999245]
We investigate the challenge of contextual and temporal comprehension in video-language models by exploring the task of temporal localization in videos. We introduce Chrono, a universal sequence blueprint that can be applied to an image-text pretrained MLLM. Chrono achieves a new SOTA in moment retrieval on the widely used Charades-STA, QVHighlights, and ActivityNet Captions benchmarks, as well as in grounded video question answering on NeXT-GQA.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
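The Chrono entry calls the method a "universal sequence blueprint" without further detail. A hedged sketch of one common way to make time explicit to an image-text pretrained MLLM: interleave per-frame timestamps, written as plain text, with frame placeholders, then ask for moments in seconds. The prompt format and the `<image>` token are assumptions, not Chrono's actual interface.

```python
def build_timed_prompt(num_frames: int, duration_s: float, query: str) -> str:
    # Pair each uniformly sampled frame with its timestamp as text, so the
    # LLM can ground its answer in explicit times. "<image>" is an
    # illustrative placeholder for the frame's visual tokens.
    step = duration_s / max(num_frames - 1, 1)
    parts = [f"{round(i * step, 1)}s: <image>" for i in range(num_frames)]
    return (" ".join(parts)
            + f"\nQuestion: During which times does this happen: {query}?"
            + "\nAnswer with start and end seconds.")

print(build_timed_prompt(4, 90.0, "a goal is scored"))
```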
- Multi-Sentence Grounding for Long-term Instructional Video [63.27905419718045]
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct a high-quality video-text dataset with multiple descriptive steps supervision, named HowToStep.
arXiv Detail & Related papers (2023-12-21T17:28:09Z)
- Text-Conditioned Resampler For Long Form Video Understanding [94.81955667020867]
We present a text-conditioned video resampler (TCR) module that uses a pre-trained visual encoder and a large language model (LLM).
TCR can process more than 100 frames at a time with plain attention and without optimised implementations.
arXiv Detail & Related papers (2023-12-19T06:42:47Z)
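The TCR entry above describes compressing 100+ frames with plain attention but omits the mechanism. A minimal Perceiver-style sketch under that reading (dimensions and names are illustrative, not the authors' implementation): learned latent queries, conditioned on the text, cross-attend to all frame features and emit a short token sequence for the LLM.

```python
import torch
import torch.nn as nn

class TextConditionedResampler(nn.Module):
    # Fixed set of learned queries cross-attends to flattened frame features;
    # output length is independent of the number of input frames.
    def __init__(self, dim: int = 512, num_queries: int = 64, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, F*P, D) patch features from many frames, flattened;
        # text_feats:  (B, T, D) encoded query text.
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Condition the latent queries on the text (mean-pooled for brevity).
        q = q + self.text_proj(text_feats.mean(dim=1, keepdim=True))
        pooled, _ = self.attn(q, frame_feats, frame_feats)  # plain attention
        return self.out(pooled)                             # (B, num_queries, D)

# 128 frames x 16 patches in, 64 tokens out.
resampler = TextConditionedResampler()
video = torch.randn(2, 128 * 16, 512)
text = torch.randn(2, 12, 512)
print(resampler(video, text).shape)  # torch.Size([2, 64, 512])
```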
- StoryBench: A Multifaceted Benchmark for Continuous Story Visualization [42.439670922813434]
We introduce StoryBench: a new, challenging multi-task benchmark to reliably evaluate text-to-video models.
Our benchmark includes three video generation tasks of increasing difficulty: action execution, story continuation, and story generation.
We evaluate small yet strong text-to-video baselines, and show the benefits of training on story-like data algorithmically generated from existing video captions.
arXiv Detail & Related papers (2023-08-22T17:53:55Z)