Related papers: VidText: Towards Comprehensive Evaluation for Video Text Understanding

VidText: Towards Comprehensive Evaluation for Video Text Understanding

URL: http://arxiv.org/abs/2505.22810v1
Date: Wed, 28 May 2025 19:39:35 GMT
Title: VidText: Towards Comprehensive Evaluation for Video Text Understanding
Authors: Zhoufaran Yang, Yan Shu, Zhifei Yang, Yan Zhang, Yu Li, Keyang Lu, Gangyan Zeng, Shaohui Liu, Yu Zhou, Nicu Sebe,
Abstract summary: VidText is a benchmark for comprehensive and in-depth evaluation of video text understanding.<n>It covers a wide range of real-world scenarios and supports multilingual content.<n>It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks.
Score: 54.15328647518558
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual texts embedded in videos carry rich semantic information, which is crucial for both holistic video understanding and fine-grained reasoning about local human actions. However, existing video understanding benchmarks largely overlook textual information, while OCR-specific benchmarks are constrained to static images, limiting their ability to capture the interaction between text and dynamic visual contexts. To address this gap, we propose VidText, a new benchmark designed for comprehensive and in-depth evaluation of video text understanding. VidText offers the following key features: 1) It covers a wide range of real-world scenarios and supports multilingual content, encompassing diverse settings where video text naturally appears. 2) It introduces a hierarchical evaluation framework with video-level, clip-level, and instance-level tasks, enabling assessment of both global summarization and local retrieval capabilities. 3) The benchmark also introduces a set of paired perception reasoning tasks, ranging from visual text perception to cross-modal reasoning between textual and visual information. Extensive experiments on 18 state-of-the-art Large Multimodal Models (LMMs) reveal that current models struggle across most tasks, with significant room for improvement. Further analysis highlights the impact of both model-intrinsic factors, such as input resolution and OCR capability, and external factors, including the use of auxiliary information and Chain-of-Thought reasoning strategies. We hope VidText will fill the current gap in video understanding benchmarks and serve as a foundation for future research on multimodal reasoning with video text in dynamic environments.

Related papers

VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs.<n>We introduce VISTA-Bench, a benchmark from multimodal perception, reasoning, to unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z)
T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models [12.120541052871486]
T2VTextBench is the first human-evaluation benchmark dedicated to evaluating on-screen text fidelity and temporal consistency in text-to-video models.<n>We evaluate ten state-of-the-art systems, ranging from open-source solutions to commercial offerings, and find that most struggle to generate legible, consistent text.
arXiv Detail & Related papers (2025-05-08T04:49:52Z)
Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing.<n>Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z)
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.<n>We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
SHE-Net: Syntax-Hierarchy-Enhanced Text-Video Retrieval [11.548061962976321]
We propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net)<n>First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions.<n>Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation.
arXiv Detail & Related papers (2024-04-22T10:23:59Z)
Understanding Video Scenes through Text: Insights from Text-based Video Question Answering [40.01623654896573]
This paper explores two recently introduced datasets, NewsVideoQA and M4-ViteVQA, which aim to address video question answering based on textual content. We provide an analysis of the formulation of these datasets on various levels, exploring the degree of visual understanding and multi-frame comprehension required for answering the questions.
arXiv Detail & Related papers (2023-09-04T06:11:00Z)
Let's Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought [62.619076257298204]
We motivate framing video reasoning as the sequential understanding of a small number of video reasonings. We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought. We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
arXiv Detail & Related papers (2023-05-23T10:26:42Z)
TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents. Text reading and information extraction can reinforce each other via a well-designed multi-modal context block. The framework can be trained in an end-to-end trainable manner, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs. We propose a module for videotext-learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z)
Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review [1.0520692160489133]
This review categorizes and describes the state-of-the-art techniques for the video-to-text problem. It covers the main video-to-text methods and the ways to evaluate their performance. State-of-the-art techniques are still a long way from achieving human-like performance in generating or retrieving video descriptions.
arXiv Detail & Related papers (2021-03-27T02:12:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.