M2C: Towards Automatic Multimodal Manga Complement
- URL: http://arxiv.org/abs/2310.17130v1
- Date: Thu, 26 Oct 2023 04:10:16 GMT
- Title: M2C: Towards Automatic Multimodal Manga Complement
- Authors: Hongcheng Guo, Boyang Wang, Jiaqi Bai, Jiaheng Liu, Jian Yang, Zhoujun
Li
- Abstract summary: Multimodal manga analysis focuses on enhancing manga understanding with visual and textual features.
Currently, most comics are hand-drawn and prone to problems such as missing pages, text contamination, and aging.
We first propose the Multimodal Manga Complement task by establishing a new M2C benchmark dataset covering two languages.
- Score: 40.01354682367365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal manga analysis focuses on enhancing manga understanding with
visual and textual features, which has attracted considerable attention from
both natural language processing and computer vision communities. Currently,
most comics are hand-drawn and prone to problems such as missing pages, text
contamination, and aging, resulting in missing comic text content and seriously
hindering human comprehension. However, the Multimodal Manga Complement (M2C)
task, which aims to handle these issues by providing a shared semantic space
for vision and language understanding, has not yet been investigated. To this
end, we first propose the Multimodal Manga Complement
task by establishing a new M2C benchmark dataset covering two languages. First,
we design a manga augmentation method called MCoT to mine event knowledge in
comics with large language models. Then, an effective baseline FVP-M$^{2}$
using fine-grained visual prompts is proposed to support manga complement.
Extensive experimental results show the effectiveness of the FVP-M$^{2}$ method
for Multimodal Manga Complement.
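To make the fine-grained visual prompting idea concrete, here is a minimal PyTorch-style sketch of how region features from a manga panel could be projected into a text model's embedding space and prepended as prompts so the model fills in missing dialogue. All module names, dimensions, and the bidirectional backbone are assumptions for illustration, not details taken from the FVP-M$^{2}$ paper.

```python
import torch
import torch.nn as nn

class FineGrainedVisualPromptComplement(nn.Module):
    """Hypothetical sketch: fine-grained panel regions as visual prompts for text infilling."""

    def __init__(self, vis_dim=2048, txt_dim=768, vocab_size=32000, n_layers=4):
        super().__init__()
        self.prompt_proj = nn.Linear(vis_dim, txt_dim)   # project region features to prompt tokens
        self.tok_emb = nn.Embedding(vocab_size, txt_dim)
        layer = nn.TransformerEncoderLayer(d_model=txt_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)  # bidirectional, infilling-style
        self.lm_head = nn.Linear(txt_dim, vocab_size)

    def forward(self, region_feats, text_tokens):
        # region_feats: (B, R, vis_dim) features of fine-grained regions (e.g. speech-bubble crops)
        # text_tokens:  (B, T) surrounding dialogue with the missing span marked by mask tokens
        prompts = self.prompt_proj(region_feats)          # (B, R, txt_dim)
        tokens = self.tok_emb(text_tokens)                # (B, T, txt_dim)
        x = torch.cat([prompts, tokens], dim=1)           # visual prompts prefix the text
        h = self.backbone(x)                              # joint vision-language encoding
        return self.lm_head(h[:, prompts.size(1):])       # logits only over text positions
```

Training such a sketch would minimise cross-entropy between these logits and the ground-truth text at the missing positions; how the regions are actually extracted and fused follows the paper's own design.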
Related papers
- Context-Informed Machine Translation of Manga using Multimodal Large Language Models [4.063595992745368]
We investigate to what extent multimodal large language models (LLMs) can provide effective manga translation.
Specifically, we propose a methodology that leverages the vision component of multimodal LLMs to improve translation quality.
We introduce a new evaluation dataset -- the first parallel Japanese-Polish manga translation dataset.
arXiv Detail & Related papers (2024-11-04T20:29:35Z)
- MangaUB: A Manga Understanding Benchmark for Large Multimodal Models [25.63892470012361]
Manga is a popular medium that combines stylized drawings and text to convey stories.
Recently, the adaptability of modern large multimodal models (LMMs) has opened up the possibility of more general approaches.
MangaUB is designed to assess the recognition and understanding of content shown in a single panel as well as conveyed across multiple panels.
arXiv Detail & Related papers (2024-07-26T18:21:30Z)
- The Manga Whisperer: Automatically Generating Transcriptions for Comics [55.544015596503726]
We present a unified model, Magi, that is able to detect panels, text boxes and character boxes.
We propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript.
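Magi learns the reading order; purely as an illustration of what that step produces, the sketch below applies a naive geometric heuristic for manga (boxes read right-to-left, rows top-to-bottom within a panel). The data structure, tolerance value, and the assumption that panel indices already follow panel reading order are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class TextBox:
    x: float       # left edge
    y: float       # top edge
    w: float       # width
    h: float       # height
    text: str
    panel_id: int  # index of the detected panel containing this box (assumed already in reading order)

def naive_manga_reading_order(boxes: list[TextBox], row_tol: float = 30.0) -> list[TextBox]:
    """Sort boxes per panel into rows (top-to-bottom), right-to-left within each row."""
    def key(b: TextBox):
        row = round(b.y / row_tol)              # quantise y so roughly level boxes share a row
        return (b.panel_id, row, -(b.x + b.w))  # larger right edge comes first (right-to-left)
    return sorted(boxes, key=key)

def transcript(boxes: list[TextBox]) -> str:
    """Concatenate box texts in reading order into a simple dialogue transcript."""
    return "\n".join(b.text for b in naive_manga_reading_order(boxes))
```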
arXiv Detail & Related papers (2024-01-18T18:59:09Z)
- MaRU: A Manga Retrieval and Understanding System Connecting Vision and Language [10.226184504988067]
MaRU (Manga Retrieval and Understanding) is a multi-staged system that connects vision and language to facilitate efficient search of both dialogues and scenes within Manga frames.
The architecture of MaRU integrates an object detection model for identifying text and frame bounding boxes, a text encoder for embedding text, and a vision-text encoder that merges textual and visual information into a unified embedding space for scene retrieval.
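The scene-retrieval step implied by such a unified embedding space can be sketched as a simple cosine-similarity search over pre-computed scene embeddings; the encoder producing the shared space is treated as a black box here, and the class below is an illustrative assumption rather than MaRU's actual implementation.

```python
import numpy as np

class SceneIndex:
    """Illustrative nearest-neighbour index over scene embeddings in a shared vision-text space."""

    def __init__(self, scene_embeddings: np.ndarray, scene_ids: list[str]):
        # scene_embeddings: (N, D) vectors produced offline by the vision-text encoder
        norms = np.linalg.norm(scene_embeddings, axis=1, keepdims=True)
        self.embs = scene_embeddings / norms     # L2-normalise so dot product equals cosine similarity
        self.ids = scene_ids

    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> list[tuple[str, float]]:
        # query_embedding: (D,) vector of the text query embedded into the same space
        q = query_embedding / np.linalg.norm(query_embedding)
        scores = self.embs @ q                   # cosine similarity against every indexed scene
        best = np.argsort(-scores)[:top_k]
        return [(self.ids[i], float(scores[i])) for i in best]
```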
arXiv Detail & Related papers (2023-10-22T05:51:02Z)
- Dense Multitask Learning to Reconfigure Comics [63.367664789203936]
We develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
arXiv Detail & Related papers (2023-07-16T15:10:34Z)
- Language Is Not All You Need: Aligning Perception with Language Models [110.51362453720458]
We introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context, and follow instructions.
We train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data.
Experimental results show that Kosmos-1 achieves impressive performance on language understanding, generation, and even OCR-free NLP.
We also show that MLLMs can benefit from cross-modal transfer, i.e., transferring knowledge from language to multimodal, and from multimodal to language.
arXiv Detail & Related papers (2023-02-27T18:55:27Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
There exist two typical types of medical vision-and-language pre-training models, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
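A rough sketch of the prompt-as-feature-bank idea: a pool of learnable prompt vectors from which the entries most similar to the available modality are selected to stand in for the missing one, so fusion-style and dual-style inputs share one format. Pool size, selection rule, and dimensions here are my own assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class VisualPromptBank(nn.Module):
    """Hypothetical sketch: learnable prompts that stand in for a missing image."""

    def __init__(self, pool_size=256, dim=768, n_selected=16):
        super().__init__()
        self.pool = nn.Parameter(torch.randn(pool_size, dim) * 0.02)  # learnable "feature bank"
        self.n_selected = n_selected

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (B, T, dim) token features of a text-only input.
        # Select the bank entries most similar to the pooled text and return them as
        # pseudo-visual tokens, so the model always sees an (image, text)-shaped input.
        query = text_feats.mean(dim=1)                        # (B, dim)
        sims = query @ self.pool.t()                          # (B, pool_size)
        idx = sims.topk(self.n_selected, dim=-1).indices      # (B, n_selected)
        return self.pool[idx]                                 # (B, n_selected, dim)
```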
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- Towards Fully Automated Manga Translation [8.45043706496877]
We tackle the problem of machine translation of manga, Japanese comics.
Obtaining context from the image is essential for manga translation.
First, we propose a multimodal context-aware translation framework.
Second, for training the model, we propose an approach to automatic corpus construction from pairs of original manga.
Third, we created a new benchmark to evaluate manga translation.
arXiv Detail & Related papers (2020-12-28T15:20:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.