Toward accessible comics for blind and low vision readers
- URL: http://arxiv.org/abs/2407.08248v2
- Date: Tue, 10 Sep 2024 07:59:21 GMT
- Title: Toward accessible comics for blind and low vision readers
- Authors: Christophe Rigaud, Jean-Christophe Burie, Samuel Petit,
- Abstract summary: We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content.
We generate a comic book script with context-aware panel descriptions, including each character's appearance, posture, mood, dialogue, etc.
- Score: 0.059584784039407875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work explores how to fine-tune large language models using prompt engineering techniques with contextual information for generating an accurate text description of the full story, ready to be forwarded to off-the-shelf speech synthesis tools. We propose to use existing computer vision and optical character recognition techniques to build a grounded context from the comic strip image content, such as panels, characters, text, reading order and the association of bubbles and characters. Then we infer character identification and generate a comic book script with context-aware panel descriptions including each character's appearance, posture, mood, dialogue, etc. We believe that such enriched content descriptions can easily be used to produce audiobooks and eBooks with various voices for characters and captions, and to play sound effects.
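To make the pipeline concrete, the sketch below shows one possible way to serialize such a grounded context (panels, characters, dialogues, reading order) into a prompt for a language model. The Panel fields, prompt wording, and function names are illustrative assumptions, not the authors' published code.

```python
# Illustrative sketch only: the data layout and prompt wording below are
# assumptions, not the released implementation of the paper.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Panel:
    """Grounded context for one panel, as produced upstream by detection/OCR."""
    index: int                                   # position in the inferred reading order
    characters: List[str]                        # e.g. ["tall man in red coat", "small dog"]
    dialogues: List[Tuple[str, str]] = field(default_factory=list)  # (speaker, text)
    captions: List[str] = field(default_factory=list)


def build_prompt(panels: List[Panel]) -> str:
    """Serialize the grounded context into a prompt asking an LLM for a
    context-aware, panel-by-panel script (appearance, posture, mood, dialogue)."""
    lines = [
        "You are describing a comic strip for blind and low vision readers.",
        "Write a script with one description per panel, keeping character",
        "identities consistent across panels, then list the dialogue in order.",
        "",
    ]
    for panel in sorted(panels, key=lambda p: p.index):
        lines.append(f"Panel {panel.index}:")
        lines.append("  Characters: " + "; ".join(panel.characters))
        for caption in panel.captions:
            lines.append(f"  Caption: {caption}")
        for speaker, text in panel.dialogues:
            lines.append(f"  Speech bubble ({speaker}): {text}")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [Panel(index=1, characters=["woman with umbrella"],
                  dialogues=[("woman", "Looks like rain again...")])]
    print(build_prompt(demo))  # the resulting prompt would be sent to the LLM
```

The generated script text could then be handed to a speech synthesis tool, with one voice per identified character.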
Related papers
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM) for generating the text which is guided by a pre-trained audio-language model to produce captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
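As a rough illustration of this kind of zero-shot guidance, an LLM can propose candidate captions seeded with audio context keywords, and a pretrained audio-language model can select the candidate that best matches the audio. This is a simplified rerank-based variant, not the ZerAuCap decoding procedure itself; the callables are placeholders for whatever concrete models are available.

```python
# Hedged sketch of the general idea (not the ZerAuCap code): the language model
# proposes captions and an audio-language model picks the best-matching one.
from typing import Callable, List, Sequence

import numpy as np


def zero_shot_audio_caption(
    audio: np.ndarray,
    keywords: Sequence[str],
    llm_propose_captions: Callable[[str], List[str]],     # placeholder LLM sampler
    audio_text_similarity: Callable[[np.ndarray, str], float],  # placeholder CLAP-style scorer
) -> str:
    """Return the candidate caption that the audio-language model scores highest."""
    prompt = (
        "Describe a sound in one short sentence. "
        f"Possible context keywords: {', '.join(keywords)}."
    )
    candidates = llm_propose_captions(prompt)              # e.g. several sampled captions
    # Rerank candidates by how well they match the audio.
    return max(candidates, key=lambda caption: audio_text_similarity(audio, caption))
```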
- Comics for Everyone: Generating Accessible Text Descriptions for Comic Strips [0.0]
We create natural language descriptions of comic strips that are accessible to the visually impaired community.
Our method consists of two steps: first, we use computer vision techniques to extract information about the panels, characters, and text of the comic images.
We test our method on a collection of comics that have been annotated by human experts and measure its performance using both quantitative and qualitative metrics.
arXiv Detail & Related papers (2023-10-01T15:13:48Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- PromptTTS: Controllable Text-to-Speech with Text Descriptions [32.647362978555485]
We develop a text-to-speech (TTS) system that takes a prompt with both style and content descriptions as input to synthesize the corresponding speech.
PromptTTS consists of a style encoder and a content encoder to extract the corresponding representations from the prompt.
Experiments show that PromptTTS can generate speech with precise style control and high speech quality.
arXiv Detail & Related papers (2022-11-22T10:58:38Z)
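The style/content split can be pictured with a minimal module like the one below. This is a structural sketch under assumed dimensions, not the released PromptTTS architecture: the prompt is encoded twice, once for a global style vector and once for per-token content, and both condition a (stand-in) speech decoder.

```python
# Rough structural sketch (not the released PromptTTS model); all sizes are made up.
import torch
import torch.nn as nn


class TwoEncoderTTS(nn.Module):
    def __init__(self, vocab_size=256, dim=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.style_encoder = nn.GRU(dim, dim, batch_first=True)
        self.content_encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.Linear(2 * dim, n_mels)      # stand-in for a real speech decoder

    def forward(self, prompt_tokens):                   # prompt_tokens: (batch, seq)
        x = self.embed(prompt_tokens)
        _, style = self.style_encoder(x)                # global style representation
        content, _ = self.content_encoder(x)            # per-token content representation
        style = style[-1].unsqueeze(1).expand_as(content)
        return self.decoder(torch.cat([content, style], dim=-1))  # coarse mel-like frames
```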
- Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention [54.4258176885084]
How to accurately recognize ambiguous sounds is a major challenge for audio captioning.
We propose visually-aware audio captioning, which makes use of visual information to help the description of ambiguous sounding objects.
Our proposed method achieves state-of-the-art results on machine translation metrics.
arXiv Detail & Related papers (2022-10-28T22:45:41Z)
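One common way to realize such visually-aware captioning is to let audio features attend over visual features and gate how much visual evidence is used. The module below is a hedged sketch of that idea with arbitrary dimensions, not the paper's exact adaptive attention.

```python
# Illustrative audio-visual fusion with a learned gate (not the authors' exact module).
import torch
import torch.nn as nn


class GatedAudioVisualFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, dim), visual_feats: (batch, T_visual, dim)
        visual_ctx, _ = self.cross_attn(audio_feats, visual_feats, visual_feats)
        g = self.gate(audio_feats)            # per-frame weight on the visual evidence
        return audio_feats + g * visual_ctx   # fused features go to the caption decoder
```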
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
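The general recipe, prepending learned audio embeddings as a soft prompt to a frozen language model, can be sketched as follows. The model names, the projection layer, and the 16 kHz input assumption are illustrative; this is not the WavPrompt release.

```python
# Conceptual sketch of the soft-prompt idea (not the WavPrompt code).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Wav2Vec2Model

speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # example checkpoint
lm = GPT2LMHeadModel.from_pretrained("gpt2")                               # example frozen LM
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
for param in lm.parameters():          # the language model stays frozen
    param.requires_grad = False
project = torch.nn.Linear(speech_encoder.config.hidden_size, lm.config.n_embd)


def forward(waveform: torch.Tensor, question: str) -> torch.Tensor:
    """waveform: (1, num_samples) float tensor, assumed sampled at 16 kHz."""
    audio_embeds = project(speech_encoder(waveform).last_hidden_state)
    text_ids = tokenizer(question, return_tensors="pt").input_ids
    text_embeds = lm.transformer.wte(text_ids)
    inputs = torch.cat([audio_embeds, text_embeds], dim=1)   # audio embeddings as a prefix
    return lm(inputs_embeds=inputs).logits                   # next-token scores from the frozen LM
```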
- Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z)
- Automatic Comic Generation with Stylistic Multi-page Layouts and Emotion-driven Text Balloon Generation [57.10363557465713]
We propose a fully automatic system for generating comic books from videos without any human intervention.
Given an input video along with its subtitles, our approach first extracts informative keyframes by analyzing the subtitles.
Then, we propose a novel automatic multi-page layout framework, which can allocate the images across multiple pages.
arXiv Detail & Related papers (2021-01-26T22:15:15Z)
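The page-allocation step can be pictured as a simple packing problem. The greedy routine below is a toy sketch under assumed importance scores and page budgets; the paper's layout engine additionally handles stylistic templates and balloon placement.

```python
# Toy sketch of keyframe-to-page allocation only (not the paper's layout engine).
from typing import List


def allocate_pages(importances: List[float], max_panels_per_page: int = 6,
                   page_budget: float = 3.0) -> List[List[int]]:
    """Return lists of keyframe indices, one list per comic page."""
    pages, current, used = [], [], 0.0
    for i, score in enumerate(importances):
        # start a new page when either the panel count or the importance budget is full
        if current and (len(current) >= max_panels_per_page or used + score > page_budget):
            pages.append(current)
            current, used = [], 0.0
        current.append(i)
        used += score
    if current:
        pages.append(current)
    return pages


print(allocate_pages([0.9, 0.4, 1.2, 0.3, 0.8, 1.1, 0.2]))   # -> [[0, 1, 2, 3], [4, 5, 6]]
```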
- Text-Free Image-to-Speech Synthesis Using Learned Segmental Units [24.657722909094662]
We present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images.
We connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units.
We conduct experiments on the Flickr8k spoken caption dataset and a novel corpus of spoken audio captions collected for the popular MSCOCO dataset.
arXiv Detail & Related papers (2020-12-31T05:28:38Z)
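Architecturally, this amounts to an image captioner whose output vocabulary is discrete speech units, followed by a separate unit-to-waveform synthesizer. The module below sketches only the image-to-units part, with assumed feature sizes and unit counts; it is not the paper's model.

```python
# Structural sketch only: a captioner that emits discrete speech units instead of words.
import torch
import torch.nn as nn


class ImageToUnits(nn.Module):
    def __init__(self, num_units=100, dim=256):
        super().__init__()
        self.image_proj = nn.Linear(2048, dim)           # e.g. pooled CNN image features
        self.unit_embed = nn.Embedding(num_units, dim)
        layer = nn.TransformerDecoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_units)

    def forward(self, image_feats, unit_history):
        # image_feats: (batch, 2048); unit_history: (batch, T) of previously emitted unit ids
        memory = self.image_proj(image_feats).unsqueeze(1)        # (batch, 1, dim)
        x = self.decoder(self.unit_embed(unit_history), memory)
        return self.head(x)    # logits over the next discrete speech unit
```

A separate synthesizer (omitted here) would turn the predicted unit sequence into a waveform.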
- Transcription-Enriched Joint Embeddings for Spoken Descriptions of Images and Videos [4.419800664096478]
We propose an effective approach for training unique embedding representations by combining three simultaneous modalities: images, spoken narratives, and textual narratives.
Our experiments on the EPIC-Kitchens and Places Audio Caption datasets show that introducing the human-generated textual transcriptions of the spoken narratives helps the training procedure, yielding better embedding representations.
arXiv Detail & Related papers (2020-06-01T08:18:15Z)
- TextCaps: a Dataset for Image Captioning with Reading Comprehension [56.89608505010651]
Text is omnipresent in human environments and frequently critical to understand our surroundings.
To study how to comprehend text in the context of an image, we collect a novel dataset, TextCaps, with 145k captions for 28k images.
Our dataset challenges a model to recognize text, relate it to its visual context, and decide what part of the text to copy or paraphrase.
arXiv Detail & Related papers (2020-03-24T02:38:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.