MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
- URL: http://arxiv.org/abs/2505.20298v1
- Date: Mon, 26 May 2025 17:59:59 GMT
- Title: MangaVQA and MangaLMM: A Benchmark and Specialized Model for Multimodal Manga Understanding
- Authors: Jeonghun Baek, Kazuki Egashira, Shota Onohara, Atsuyuki Miyai, Yuki Imajuku, Hikaru Ikuta, Kiyoharu Aizawa
- Abstract summary: We introduce two benchmarks for multimodal manga understanding: MangaOCR and MangaVQA. MangaLMM is a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
- Score: 24.928256182137428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manga, or Japanese comics, is a richly multimodal narrative form that blends images and text in complex ways. Teaching large multimodal models (LMMs) to understand such narratives at a human-like level could help manga creators reflect on and refine their stories. To this end, we introduce two benchmarks for multimodal manga understanding: MangaOCR, which targets in-page text recognition, and MangaVQA, a novel benchmark designed to evaluate contextual understanding through visual question answering. MangaVQA consists of 526 high-quality, manually constructed question-answer pairs, enabling reliable evaluation across diverse narrative and visual scenarios. Building on these benchmarks, we develop MangaLMM, a manga-specialized model finetuned from the open-source LMM Qwen2.5-VL to jointly handle both tasks. Through extensive experiments, including comparisons with proprietary models such as GPT-4o and Gemini 2.5, we assess how well LMMs understand manga. Our benchmark and model provide a comprehensive foundation for evaluating and advancing LMMs in the richly narrative domain of manga.
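The abstract does not include code, but since MangaLMM is finetuned from the open-source Qwen2.5-VL, a minimal sketch of how a MangaVQA-style question might be posed to that base model through the Hugging Face transformers API is shown below. The model ID, image path, question, and generation settings are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: querying the open-source Qwen2.5-VL base model (which MangaLMM is
# finetuned from) with a manga page and a MangaVQA-style question. The model ID,
# file path, and question are placeholders for illustration only.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed base checkpoint, not MangaLMM weights

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One image plus one free-form question, mirroring a VQA-style evaluation item.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "manga_page.png"},  # hypothetical page image
        {"type": "text", "text": "Who is speaking in the last panel, and what do they say?"},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[prompt], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

MangaLMM itself would presumably be loaded the same way from its finetuned weights, but no checkpoint path is given in the abstract, so none is assumed here.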
Related papers
- OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning [72.57452266982642]
OCRBench v2 is a large-scale bilingual text-centric benchmark. It covers 31 diverse scenarios, 10,000 human-verified question-answering pairs, and thorough evaluation metrics. We find that most LMMs score below 50 (out of 100) and suffer from five types of limitations.
arXiv Detail & Related papers (2024-12-31T07:32:35Z) - Context-Informed Machine Translation of Manga using Multimodal Large Language Models [4.063595992745368]
We investigate to what extent multimodal large language models (LLMs) can provide effective manga translation. Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality. We introduce a new evaluation dataset, the first parallel Japanese-Polish manga translation dataset.
arXiv Detail & Related papers (2024-11-04T20:29:35Z) - MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z) - MangaUB: A Manga Understanding Benchmark for Large Multimodal Models [25.63892470012361]
Manga is a popular medium that combines stylized drawings and text to convey stories.
Recently, the adaptive nature of modern large multimodal models (LMMs) has opened up the possibility of more general approaches.
MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels.
arXiv Detail & Related papers (2024-07-26T18:21:30Z) - Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z) - The Manga Whisperer: Automatically Generating Transcriptions for Comics [55.544015596503726]
We present a unified model, Magi, that is able to detect panels, text boxes and character boxes.
We propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript.
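The two sentences above describe a learned detection-and-ordering pipeline. As a rough illustration of the ordering step only, the sketch below sorts hypothetical detected text boxes with a naive right-to-left, top-to-bottom heuristic that approximates Japanese manga reading order; this is not Magi's actual algorithm, and all names, box formats, and coordinates are invented for the example.

```python
# Hedged sketch (not Magi's learned ordering): sort detected text boxes on a single
# manga page into an approximate reading order, then emit a transcript.
from dataclasses import dataclass

@dataclass
class TextBox:
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # width
    h: float  # height
    text: str

def naive_reading_order(boxes: list[TextBox], row_tolerance: float = 40.0) -> list[TextBox]:
    """Group boxes into rough horizontal bands, then read each band right-to-left."""
    ordered = sorted(boxes, key=lambda b: b.y)  # top-to-bottom first
    rows: list[list[TextBox]] = []
    for box in ordered:
        if rows and abs(box.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(box)  # close enough vertically: same band
        else:
            rows.append([box])    # start a new band
    result: list[TextBox] = []
    for row in rows:
        # Within a band, manga is read right-to-left, so sort by right edge descending.
        result.extend(sorted(row, key=lambda b: -(b.x + b.w)))
    return result

# Example transcript from hypothetical detections:
boxes = [
    TextBox(600, 50, 120, 200, "Speech A"),
    TextBox(200, 60, 120, 200, "Speech B"),
    TextBox(500, 500, 120, 200, "Speech C"),
]
for b in naive_reading_order(boxes):
    print(b.text)  # -> Speech A, Speech B, Speech C
```

A fixed band tolerance stands in for the learned ordering; pages with irregular or overlapping panel layouts are exactly where the model-based approach the paper proposes would be needed.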
arXiv Detail & Related papers (2024-01-18T18:59:09Z) - M2C: Towards Automatic Multimodal Manga Complement [40.01354682367365]
Multimodal manga analysis focuses on enhancing manga understanding with visual and textual features.
Currently, most comics are hand-drawn and prone to problems such as missing pages, text contamination, and aging.
We first propose the Multimodal Manga Complement task by establishing a new M2C benchmark dataset covering two languages.
arXiv Detail & Related papers (2023-10-26T04:10:16Z) - BAMBOO: A Comprehensive Benchmark for Evaluating Long Text Modeling Capacities of Large Language Models [141.21603469555225]
Large language models (LLMs) have achieved impressive proficiency on NLP tasks of normal length.
We propose BAMBOO, a multi-task long context benchmark.
It consists of 10 datasets from 5 different long text understanding tasks.
arXiv Detail & Related papers (2023-09-23T11:36:15Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [153.37868034779385]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks. Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - Towards Fully Automated Manga Translation [8.45043706496877]
We tackle the problem of machine translation of manga, Japanese comics.
Obtaining context from the image is essential for manga translation.
First, we propose a multimodal context-aware translation framework.
Second, to train the model, we propose an approach for automatic corpus construction from pairs of original manga.
Third, we created a new benchmark to evaluate manga translation.
arXiv Detail & Related papers (2020-12-28T15:20:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.