ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
- URL: http://arxiv.org/abs/2505.19091v1
- Date: Sun, 25 May 2025 11:02:01 GMT
- Title: ReadBench: Measuring the Dense Text Visual Reading Ability of Vision-Language Models
- Authors: Benjamin Clavié, Florian Brand
- Abstract summary: We introduce ReadBench, a benchmark to evaluate the reading comprehension capabilities of Large Vision-Language Models (VLMs). ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. We find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts.
- Score: 0.4453962606945739
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent advancements in Large Vision-Language Models (VLMs) have greatly enhanced their capability to jointly process text and images. However, despite extensive benchmarks evaluating visual comprehension (e.g., diagrams, color schemes, OCR tasks), there is limited assessment of VLMs' ability to read and reason about text-rich images effectively. To fill this gap, we introduce ReadBench, a multimodal benchmark specifically designed to evaluate the reading comprehension capabilities of VLMs. ReadBench transposes contexts from established text-only benchmarks into images of text while keeping textual prompts and questions intact. Evaluating leading VLMs with ReadBench, we find minimal-but-present performance degradation on short, text-image inputs, while performance sharply declines for longer, multi-page contexts. Our experiments further reveal that text resolution has negligible effects on multimodal performance. These findings highlight needed improvements in VLMs, particularly in their reasoning over extensive, visually presented textual content, a capability critical for practical applications. ReadBench is available at https://github.com/answerdotai/ReadBench.
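For illustration, below is a minimal sketch of how a ReadBench-style input could be constructed: render the benchmark context onto white page images and keep the question as plain text. This is not the authors' released code; the font, page size, and line-wrapping parameters are illustrative assumptions.

```python
# Minimal sketch of a ReadBench-style setup: render a text-only benchmark
# context to page images, then pair the images with the original textual
# question. Font, page size, and wrapping values are assumptions, not the
# benchmark's actual parameters.
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_text_to_pages(context: str, chars_per_line: int = 90,
                         lines_per_page: int = 55,
                         page_size=(1240, 1754)) -> list[Image.Image]:
    """Wrap the context and paint it onto white 'pages' (roughly A4 at 150 DPI)."""
    font = ImageFont.load_default()
    lines = textwrap.wrap(context, width=chars_per_line)
    pages = []
    for start in range(0, len(lines), lines_per_page):
        page = Image.new("RGB", page_size, "white")
        draw = ImageDraw.Draw(page)
        y = 40
        for line in lines[start:start + lines_per_page]:
            draw.text((40, y), line, fill="black", font=font)
            y += 30
        pages.append(page)
    return pages

def build_multimodal_prompt(context: str, question: str) -> list[dict]:
    """Move only the context into images; keep the question as text."""
    images = render_text_to_pages(context)
    content = [{"type": "image", "image": img} for img in images]
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]
```

In an evaluation loop, these page images plus the textual question would be passed to the VLM under test, while the text-only baseline keeps the context as a plain string.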
Related papers
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning multimodal perception, reasoning, and unimodal understanding domains.
arXiv Detail & Related papers (2026-02-04T17:48:55Z) - Visual Text Processing: A Comprehensive Review and Unified Evaluation [99.57846940547171]
We present a comprehensive, multi-perspective analysis of recent advancements in visual text processing. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing.
arXiv Detail & Related papers (2025-04-30T14:19:29Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding. It aims to localize instances of interest across multiple images based on open-ended text prompts. We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark [46.46727031818962]
The comprehension of text-rich visual scenes has become a focal point for evaluating Multi-modal Large Language Models (MLLMs).
We introduce MCTBench, a multimodal benchmark for text-rich visual scenes that evaluates the cognitive capabilities of MLLMs through visual reasoning and content-creation tasks.
MCTBench incorporates several perception tasks to ensure a consistent comparison of both the cognitive and perceptual capabilities of MLLMs.
arXiv Detail & Related papers (2024-10-15T12:13:42Z) - From Text to Pixel: Advancing Long-Context Understanding in MLLMs [70.78454154014989]
We introduce SEEKER, a multimodal large language model designed to tackle this issue.
SEEKER aims to optimize the compact encoding of long text by compressing the text sequence into the visual pixel space via images.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER can leverage fewer image tokens to convey the same amount of textual information compared with the OCR-based approach.
arXiv Detail & Related papers (2024-05-23T06:17:23Z) - SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z) - Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions [41.825273034537204]
Vision Language Models (VLMs) cannot accurately interpret images infused with text.
The present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant.
Our model significantly improves performance on text-rich VQA benchmarks as well as on general (not particularly text-rich) VQA benchmarks.
arXiv Detail & Related papers (2023-08-19T07:53:43Z)