Comics for Everyone: Generating Accessible Text Descriptions for Comic
Strips
- URL: http://arxiv.org/abs/2310.00698v1
- Date: Sun, 1 Oct 2023 15:13:48 GMT
- Title: Comics for Everyone: Generating Accessible Text Descriptions for Comic
Strips
- Authors: Reshma Ramaprasad
- Abstract summary: We create natural language descriptions of comic strips that are accessible to the visually impaired community.
Our method consists of two steps: first, we use computer vision techniques to extract information about the panels, characters, and text of the comic images; second, we use this information as additional context to prompt a multimodal large language model (MLLM) to produce the descriptions.
We test our method on a collection of comics that have been annotated by human experts and measure its performance using both quantitative and qualitative metrics.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Comic strips are a popular and expressive form of visual storytelling that
can convey humor, emotion, and information. However, they are inaccessible to
the BLV (Blind or Low Vision) community, who cannot perceive the images,
layouts, and text of comics. Our goal in this paper is to create natural
language descriptions of comic strips that are accessible to the visually
impaired community. Our method consists of two steps: first, we use computer
vision techniques to extract information about the panels, characters, and text
of the comic images; second, we use this information as additional context to
prompt a multimodal large language model (MLLM) to produce the descriptions. We
test our method on a collection of comics that have been annotated by human
experts and measure its performance using both quantitative and qualitative
metrics. The outcomes of our experiments are encouraging.
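The abstract describes the two-step pipeline only at a high level and does not name its tools. Below is a minimal Python sketch of what such a pipeline could look like, assuming contour-based panel detection with OpenCV and OCR with pytesseract; the tool choices, threshold values, function names, and prompt wording are all illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the two-step pipeline, NOT the paper's code.
# Assumptions: white gutters and dark panel borders (so panels appear as
# separate dark blobs after inverse thresholding), and a local Tesseract
# install for OCR. Character detection from step 1 is omitted for brevity.
import cv2
import pytesseract


def extract_panels(image_path, min_area=10_000):
    """Step 1a: detect panel bounding boxes with a contour heuristic."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Inverse threshold: dark ink (borders, artwork) becomes white blobs
    # separated by the white gutters between panels.
    _, binary = cv2.threshold(gray, 220, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = [cv2.boundingRect(c) for c in contours
             if cv2.contourArea(c) >= min_area]
    # Coarse Western reading order: bucket by row, then left-to-right.
    boxes.sort(key=lambda b: (b[1] // 100, b[0]))
    return img, boxes


def build_mllm_prompt(image_path):
    """Step 1b + step 2: OCR each panel, then assemble the extracted
    information into the textual context for an MLLM prompt."""
    img, boxes = extract_panels(image_path)
    lines = [f"This comic strip has {len(boxes)} panels."]
    for i, (x, y, w, h) in enumerate(boxes, start=1):
        panel_text = pytesseract.image_to_string(img[y:y + h, x:x + w])
        lines.append(f"Panel {i} text: {panel_text.strip() or '(none)'}")
    lines.append("Using this context and the attached image, write a "
                 "natural-language description of the strip that is "
                 "accessible to blind and low-vision readers.")
    return "\n".join(lines)


if __name__ == "__main__":
    # The resulting prompt, together with the image itself, would then be
    # sent to a multimodal LLM; that API call is omitted here.
    print(build_mllm_prompt("comic_strip.png"))
```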
Related papers
- One missing piece in Vision and Language: A Survey on Comics Understanding [13.766672321462435]
This survey is the first to propose a task-oriented framework for comics intelligence.
It aims to guide future research by addressing critical gaps in data availability and task definition.
arXiv Detail & Related papers (2024-09-14T18:26:26Z)
- Toward accessible comics for blind and low vision readers [0.059584784039407875]
We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content.
We generate a comic book script with context-aware panel descriptions, including characters' appearance, posture, mood, dialogue, etc.
arXiv Detail & Related papers (2024-07-11T07:50:25Z)
- The Manga Whisperer: Automatically Generating Transcriptions for Comics [55.544015596503726]
We present a unified model, Magi, that is able to detect panels, text boxes and character boxes.
We propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript (a toy geometric heuristic for this sorting step is sketched after this list).
arXiv Detail & Related papers (2024-01-18T18:59:09Z)
- Dense Multitask Learning to Reconfigure Comics [63.367664789203936]
We develop a MultiTask Learning (MTL) model to achieve dense predictions for comics panels.
Our method can successfully identify the semantic units as well as the notion of 3D in comic panels.
arXiv Detail & Related papers (2023-07-16T15:10:34Z)
- Do Androids Laugh at Electric Sheep? Humor "Understanding" Benchmarks from The New Yorker Caption Contest [70.40189243067857]
Large neural networks can now generate jokes, but do they really "understand" humor?
We challenge AI models with three tasks derived from the New Yorker Cartoon Caption Contest.
We find that both multimodal models and language-only models struggle at all three tasks.
arXiv Detail & Related papers (2022-09-13T20:54:00Z)
- NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- ComicGAN: Text-to-Comic Generative Adversarial Network [1.4824891788575418]
We implement ComicGAN, a novel text-to-image GAN that synthesizes comics according to text descriptions.
We extensively evaluate the proposed ComicGAN in two scenarios, namely image generation from descriptions, and image generation from dialogue.
arXiv Detail & Related papers (2021-09-19T13:31:32Z)
- Automatic Comic Generation with Stylistic Multi-page Layouts and Emotion-driven Text Balloon Generation [57.10363557465713]
We propose a fully automatic system for generating comic books from videos without any human intervention.
Given an input video along with its subtitles, our approach first extracts informative keyframes by analyzing the subtitles.
Then, we propose a novel automatic multi-page layout framework, which can allocate the images across multiple pages.
arXiv Detail & Related papers (2021-01-26T22:15:15Z)
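As referenced in the Manga Whisperer entry above, ordering detected text boxes is a key subproblem in transcription. The sketch below is a toy geometric heuristic for it (cluster boxes into rows, then sort each row right-to-left per manga convention); Magi itself learns the ordering, so this stands in only to make the task concrete, and the function name and tolerance value are assumptions.

```python
def reading_order(boxes, row_tolerance=50):
    """Order (x, y, w, h) boxes: cluster into rows by top edge, then
    sort each row right-to-left (manga reading convention)."""
    ordered = []
    remaining = sorted(boxes, key=lambda b: b[1])  # top-to-bottom
    while remaining:
        top = remaining[0][1]
        # Boxes whose top edge lies within `row_tolerance` px share a row.
        row = [b for b in remaining if b[1] - top <= row_tolerance]
        remaining = [b for b in remaining if b[1] - top > row_tolerance]
        ordered.extend(sorted(row, key=lambda b: -b[0]))
    return ordered
```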