MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
- URL: http://arxiv.org/abs/2407.19034v1
- Date: Fri, 26 Jul 2024 18:21:30 GMT
- Title: MangaUB: A Manga Understanding Benchmark for Large Multimodal Models
- Authors: Hikaru Ikuta, Leslie Wöhler, Kiyoharu Aizawa
- Abstract summary: Manga is a popular medium that combines stylized drawings and text to convey stories.
Recently, the adaptive nature of modern large multimodal models (LMMs) has opened up possibilities for more general approaches.
MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels.
- Score: 25.63892470012361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manga is a popular medium that combines stylized drawings and text to convey stories. As manga panels differ from natural images, computational systems traditionally had to be designed specifically for manga. Recently, the adaptive nature of modern large multimodal models (LMMs) has opened up possibilities for more general approaches. To analyze the current capabilities of LMMs on manga understanding tasks and to identify areas for their improvement, we design and evaluate MangaUB, a novel manga understanding benchmark for LMMs. MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as content conveyed across multiple panels, allowing for a fine-grained analysis of the various capabilities a model needs for manga understanding. Our results show strong performance on the recognition of image content, while understanding the emotion and information conveyed across multiple panels remains challenging, highlighting future work towards LMMs for manga understanding.
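As a rough illustration of how a benchmark of this kind can be driven, the sketch below runs an LMM over single-panel and multi-panel multiple-choice items and reports accuracy. The `query_lmm` wrapper, task names, and JSONL layout are hypothetical placeholders, not MangaUB's actual evaluation protocol.
```python
# Minimal sketch of driving a MangaUB-style evaluation loop.
# `query_lmm` is a hypothetical wrapper around whatever LMM is being tested;
# the task names and JSONL layout below are illustrative, not the official ones.
import json
from pathlib import Path


def query_lmm(images: list[Path], question: str, choices: list[str]) -> str:
    """Placeholder: send panel image(s) plus a multiple-choice question to an LMM."""
    raise NotImplementedError


def evaluate_task(task_file: Path) -> float:
    items = [json.loads(line) for line in task_file.read_text().splitlines()]
    correct = 0
    for item in items:
        # Single-panel tasks pass one image; multi-panel tasks pass several in story order.
        prediction = query_lmm(
            images=[Path(p) for p in item["panels"]],
            question=item["question"],
            choices=item["choices"],
        )
        correct += int(prediction.strip() == item["answer"])
    return correct / len(items)


# Usage (paths and task names are illustrative):
# print(evaluate_task(Path("mangaub/single_panel_recognition.jsonl")))
# print(evaluate_task(Path("mangaub/multi_panel_emotion.jsonl")))
```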
Related papers
- Context-Informed Machine Translation of Manga using Multimodal Large Language Models [4.063595992745368]
We investigate to what extent multimodal large language models (LLMs) can provide effective manga translation.
Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality.
We introduce a new evaluation dataset -- the first parallel Japanese-Polish manga translation dataset.
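As a hedged sketch of the general idea of context-informed translation, the snippet below passes the panel image together with the source dialogue so the model can use visual context (speaker, expression, scene) when translating. The `call_multimodal_llm` wrapper and the prompt wording are assumptions, not the paper's actual setup.
```python
# Rough sketch of vision-informed manga translation: the panel image is passed
# together with the source dialogue so the model can use visual context when
# translating. `call_multimodal_llm` is a hypothetical placeholder, not a real API.
from pathlib import Path


def call_multimodal_llm(image: Path, prompt: str) -> str:
    """Placeholder for a call to some multimodal LLM (model choice left open)."""
    raise NotImplementedError


def translate_panel(image: Path, source_lines: list[str], target_lang: str = "Polish") -> list[str]:
    prompt = (
        f"Translate the following manga dialogue into {target_lang}. "
        "Use the attached panel image as context for speaker and tone.\n"
        + "\n".join(f"{i + 1}. {line}" for i, line in enumerate(source_lines))
    )
    reply = call_multimodal_llm(image, prompt)
    # One translated line per input line is assumed here.
    return [line.strip() for line in reply.splitlines() if line.strip()]
```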
arXiv Detail & Related papers (2024-11-04T20:29:35Z) - MMR: Evaluating Reading Ability of Large Multimodal Models [52.953316772123586]
Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images.
Current benchmarks fail to accurately reflect the performance of different models.
We propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding.
arXiv Detail & Related papers (2024-08-26T19:26:50Z) - VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents [50.12414817737912]
Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents.
Existing benchmarks fail to sufficiently challenge or showcase the full potential of LMMs in complex, real-world environments.
VisualAgentBench (VAB) is a pioneering benchmark specifically designed to train and evaluate LMMs as visual foundation agents.
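For intuition only, evaluating a visual foundation agent typically reduces to an observe-think-act loop like the sketch below; the `Env` interface and `propose_action` wrapper are hypothetical stand-ins rather than VAB's actual API.
```python
# Illustrative observe-think-act loop for testing an LMM as a visual agent.
# The `Env` interface and `propose_action` wrapper are hypothetical stand-ins,
# not VisualAgentBench's actual API.
from dataclasses import dataclass


@dataclass
class Observation:
    screenshot: bytes   # rendered view of the environment
    instruction: str    # natural-language goal


class Env:
    def reset(self) -> Observation: ...

    def step(self, action: str) -> tuple[Observation, bool]:
        """Apply an action string; return the next observation and a done flag."""
        ...


def propose_action(obs: Observation, history: list[str]) -> str:
    """Placeholder: ask the LMM for the next action given screenshot, goal, and history."""
    raise NotImplementedError


def run_episode(env: Env, max_steps: int = 20) -> bool:
    obs, history = env.reset(), []
    for _ in range(max_steps):
        action = propose_action(obs, history)
        history.append(action)
        obs, done = env.step(action)
        if done:
            return True
    return False
```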
arXiv Detail & Related papers (2024-08-12T17:44:17Z) - Sketch2Manga: Shaded Manga Screening from Sketch with Diffusion Models [26.010509997863196]
We propose a novel sketch-to-manga framework that first generates a color illustration from the sketch and then generates a screentoned manga.
Our method significantly outperforms existing methods in generating high-quality manga with shaded high-frequency screentones.
arXiv Detail & Related papers (2024-03-13T05:33:52Z) - Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models [87.47400128150032]
We propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
Lumen first promotes fine-grained vision-language concept alignment.
Then the task-specific decoding is carried out by flexibly routing the shared representation to lightweight task decoders.
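A compact PyTorch sketch of that routing pattern is shown below: a shared vision-language representation is dispatched to lightweight task-specific decoders. The task names and dimensions are invented for illustration and do not reflect Lumen's actual architecture.
```python
# Sketch of routing a shared vision-language representation to lightweight
# task-specific decoders. Dimensions and task heads are illustrative only.
import torch
import torch.nn as nn


class TaskRouter(nn.Module):
    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        # Lightweight decoders, one per downstream task (names are placeholders).
        self.decoders = nn.ModuleDict({
            "detection": nn.Linear(hidden_dim, 4),      # box regression
            "segmentation": nn.Linear(hidden_dim, 1),   # per-token mask logit
            "grounding": nn.Linear(hidden_dim, 2),      # point prediction
        })

    def forward(self, shared_repr: torch.Tensor, task: str) -> torch.Tensor:
        # `shared_repr` is the aligned vision-language feature, shape (B, N, hidden_dim).
        return self.decoders[task](shared_repr)


router = TaskRouter()
features = torch.randn(2, 16, 1024)          # dummy shared representation
boxes = router(features, task="detection")   # shape (2, 16, 4)
```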
arXiv Detail & Related papers (2024-03-12T04:13:45Z) - The Manga Whisperer: Automatically Generating Transcriptions for Comics [55.544015596503726]
We present a unified model, Magi, that is able to detect panels, text boxes and character boxes.
We propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript.
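Manga is conventionally read top-to-bottom and right-to-left within a row; a naive geometric heuristic along those lines is sketched below purely for intuition. It is not the ordering approach used by Magi, and the box format and row threshold are assumptions.
```python
# Naive heuristic for ordering detected text boxes in manga reading order
# (top-to-bottom rows, right-to-left within a row). This is only a baseline
# for intuition, not the approach proposed in the paper.


def reading_order(boxes: list[tuple[float, float, float, float]],
                  row_tolerance: float = 0.5) -> list[int]:
    """boxes are (x1, y1, x2, y2); returns indices in estimated reading order."""
    order: list[int] = []
    remaining = sorted(range(len(boxes)), key=lambda i: boxes[i][1])  # by top edge
    while remaining:
        top_idx = remaining[0]
        top_y1, top_y2 = boxes[top_idx][1], boxes[top_idx][3]
        band = top_y1 + row_tolerance * (top_y2 - top_y1)
        # Boxes whose top edge falls within the current row band form one row.
        row = [i for i in remaining if boxes[i][1] <= band]
        row.sort(key=lambda i: boxes[i][2], reverse=True)  # right-to-left
        order.extend(row)
        remaining = [i for i in remaining if i not in row]
    return order


panels = [(300, 10, 390, 60), (20, 15, 110, 70), (150, 200, 260, 260)]
print(reading_order(panels))  # -> [0, 1, 2]
```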
arXiv Detail & Related papers (2024-01-18T18:59:09Z) - SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models [86.478087039015]
We present a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.
Based on our proposed joint mixing, we propose an efficient strategy aiming to better capture fine-grained appearances of high-resolution images.
We hope our work may shed light on the exploration of joint mixing in future MLLM research.
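As a rough illustration of the weight-mixing idea, the sketch below linearly interpolates two checkpoints of the same architecture (e.g., tuned on different data domains); it shows the generic technique only, not SPHINX's exact recipe.
```python
# Generic sketch of mixing model weights by linear interpolation of two
# checkpoints sharing one architecture. This illustrates the idea of weight
# mixing only; it is not SPHINX's exact procedure.
import torch


def mix_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return alpha * sd_a + (1 - alpha) * sd_b for matching tensor keys."""
    assert sd_a.keys() == sd_b.keys(), "checkpoints must share the same architecture"
    return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}


# Usage (paths are placeholders):
# sd_mixed = mix_state_dicts(torch.load("ckpt_domain_a.pt"), torch.load("ckpt_domain_b.pt"))
# model.load_state_dict(sd_mixed)
```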
arXiv Detail & Related papers (2023-11-13T18:59:47Z) - inkn'hue: Enhancing Manga Colorization from Multiple Priors with Alignment Multi-Encoder VAE [0.0]
We propose a specialized framework for manga colorization.
We leverage established models for shading and vibrant coloring using a multi-encoder VAE.
This structured workflow ensures clear and colorful results, with the option to incorporate reference images and manual hints.
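A minimal PyTorch sketch of the multi-encoder idea is given below: separate encoders for different priors (e.g., line art and a shaded or colored reference) produce latents that are fused before a shared decoder. The layer sizes and fusion scheme are illustrative assumptions, not the paper's architecture.
```python
# Minimal sketch of a multi-encoder VAE: two encoders (line art and a
# shaded/colored reference) produce latents that are fused and decoded together.
# Layer sizes and fusion are illustrative, not the architecture from the paper.
import torch
import torch.nn as nn


def conv_encoder(in_ch: int, latent: int) -> nn.Module:
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, 2 * latent),  # mean and log-variance
    )


class MultiEncoderVAE(nn.Module):
    def __init__(self, latent: int = 64):
        super().__init__()
        self.enc_sketch = conv_encoder(1, latent)   # grayscale line art
        self.enc_color = conv_encoder(3, latent)    # color/shaded reference
        self.decoder = nn.Sequential(nn.Linear(2 * latent, 3 * 64 * 64), nn.Sigmoid())

    def sample(self, stats: torch.Tensor) -> torch.Tensor:
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

    def forward(self, sketch: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        z = torch.cat([self.sample(self.enc_sketch(sketch)),
                       self.sample(self.enc_color(reference))], dim=-1)
        return self.decoder(z).view(-1, 3, 64, 64)


vae = MultiEncoderVAE()
out = vae(torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64))  # (1, 3, 64, 64)
```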
arXiv Detail & Related papers (2023-11-03T09:33:32Z) - M2C: Towards Automatic Multimodal Manga Complement [40.01354682367365]
Multimodal manga analysis focuses on enhancing manga understanding with visual and textual features.
Currently, most comics are hand-drawn and prone to problems such as missing pages, text contamination, and aging.
We first propose the Multimodal Manga Complement task by establishing a new M2C benchmark dataset covering two languages.
arXiv Detail & Related papers (2023-10-26T04:10:16Z) - Dense Multitask Learning to Reconfigure Comics [63.367664789203936]
We develop a MultiTask Learning (MTL) model to achieve dense predictions for comic panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
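For intuition, the shared-backbone, multi-head pattern that dense multitask learning typically uses is sketched below; the backbone, heads, and task names are placeholders loosely inspired by the summary above, not the paper's model.
```python
# Sketch of a shared backbone with per-task dense prediction heads
# (e.g., semantic-segmentation-style and depth-style outputs). Tasks and
# layer sizes are placeholders, not the model proposed in the paper.
import torch
import torch.nn as nn


class DenseMTL(nn.Module):
    def __init__(self, num_classes: int = 8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # One lightweight head per dense task, all sharing the backbone features.
        self.heads = nn.ModuleDict({
            "semantic": nn.Conv2d(64, num_classes, 1),  # per-pixel class logits
            "depth": nn.Conv2d(64, 1, 1),               # per-pixel depth estimate
        })

    def forward(self, panels: torch.Tensor) -> dict[str, torch.Tensor]:
        feats = self.backbone(panels)
        return {name: head(feats) for name, head in self.heads.items()}


model = DenseMTL()
outputs = model(torch.rand(2, 3, 128, 128))
print({k: tuple(v.shape) for k, v in outputs.items()})
```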
arXiv Detail & Related papers (2023-07-16T15:10:34Z) - MangaGAN: Unpaired Photo-to-Manga Translation Based on The Methodology of Manga Drawing [27.99490750445691]
We propose MangaGAN, the first method based on Generative Adversarial Network (GAN) for unpaired photo-to-manga translation.
Inspired by how experienced manga artists draw manga, MangaGAN generates the geometric features of manga faces with a purpose-designed GAN model.
To produce high-quality manga faces, we propose a structural smoothing loss that smooths stroke lines and avoids noisy pixels, together with a similarity preserving module.
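The abstract does not give the form of the structural smoothing loss; as a hedged example of one common way to encourage smooth stroke lines and suppress noisy pixels, the sketch below applies a total-variation-style penalty to the generated image. This is a generic illustration, not MangaGAN's actual loss.
```python
# Illustrative total-variation-style smoothness penalty. The abstract does not
# give the form of MangaGAN's structural smoothing loss, so this is only a
# generic example of penalizing noisy pixels along stroke lines, not the
# paper's actual loss.
import torch


def smoothness_penalty(img: torch.Tensor) -> torch.Tensor:
    """img: (B, C, H, W) generated manga face; returns a scalar penalty."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean()  # vertical differences
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean()  # horizontal differences
    return dh + dw


fake = torch.rand(1, 1, 256, 256, requires_grad=True)
loss = smoothness_penalty(fake)
loss.backward()  # gradients would flow back to the generator output
```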
arXiv Detail & Related papers (2020-04-22T15:23:42Z)