Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
- URL: http://arxiv.org/abs/2404.13993v2
- Date: Wed, 24 Apr 2024 06:00:47 GMT
- Title: Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
- Authors: Yingxuan Li, Ryota Hinami, Kiyoharu Aizawa, Yusuke Matsui
- Abstract summary: We propose a novel zero-shot approach to identify characters and predict speaker names based solely on unannotated comic images.
Because our method requires no training data or annotations, it can be used as-is on any comic series.
- Score: 35.25298023240529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing characters and predicting speakers of dialogue are critical for comic processing tasks, such as voice generation or translation. However, because characters vary by comic title, supervised learning approaches such as training character classifiers, which require specific annotations for each comic title, are infeasible. This motivates us to propose a novel zero-shot approach, allowing machines to identify characters and predict speaker names based solely on unannotated comic images. Despite their importance in real-world applications, these tasks have largely remained unexplored due to challenges in story comprehension and multimodal integration. Recent large language models (LLMs) have shown great capability for text understanding and reasoning, but their application to multimodal content analysis is still an open problem. To address this problem, we propose an iterative multimodal framework, the first to employ multimodal information for both character identification and speaker prediction tasks. Our experiments demonstrate the effectiveness of the proposed framework, establishing a robust baseline for these tasks. Furthermore, since our method requires no training data or annotations, it can be used as-is on any comic series.
Related papers
- Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models [83.7506131809624]
We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.
We present a novel, large-scale dataset derived from the MediaSum corpus, encompassing transcripts from a wide range of media sources.
We propose novel transformer-based models tailored for SpeakerID, leveraging contextual cues within dialogues to accurately attribute speaker names.
(2024-07-16T18:03:58Z)
- CoMix: A Comprehensive Benchmark for Multi-Task Comic Understanding [14.22900011952181]
We introduce a novel benchmark, CoMix, designed to evaluate the multi-task capabilities of models in comic analysis.
Our benchmark comprises three existing datasets with expanded annotations to support multi-task evaluation.
To mitigate the over-representation of manga-style data, we have incorporated a new dataset of carefully selected American comic-style books.
(2024-07-04T00:07:50Z)
- Dense Multitask Learning to Reconfigure Comics [63.367664789203936]
We develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
(2023-07-16T15:10:34Z)
- Manga109Dialog: A Large-scale Dialogue Dataset for Comics Speaker Detection [37.083051419659135]
Manga109Dialog is the world's largest comics speaker annotation dataset, containing 132,692 speaker-to-text pairs.
Unlike existing methods mainly based on distances, we propose a deep learning-based method using scene graph generation models.
Experimental results demonstrate that our scene-graph-based approach outperforms existing methods, achieving a prediction accuracy of over 75%.
(2023-06-30T08:34:08Z)
- A Benchmark for Understanding and Generating Dialogue between Characters in Stories [75.29466820496913]
We present the first study to explore whether machines can understand and generate dialogue in stories.
We propose two new tasks including Masked Dialogue Generation and Dialogue Speaker Recognition.
We show the difficulty of the proposed tasks by testing existing models with automatic and manual evaluation on DialStory.
(2022-09-18T10:19:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
(2022-05-24T00:52:40Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
(2020-10-04T20:12:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.