Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
- URL: http://arxiv.org/abs/2401.12039v1
- Date: Mon, 22 Jan 2024 15:26:01 GMT
- Title: Look, Listen and Recognise: Character-Aware Audio-Visual Subtitling
- Authors: Bruno Korbar, Jaesung Huh, Andrew Zisserman
- Abstract summary: We propose an audio-visual method that generates a full transcript of the dialogue, with precise speech timestamps, and the character speaking identified.
We evaluate the method over a variety of TV sitcoms, including Seinfeld, Frasier, and Scrubs.
We envision this system being useful for the automatic generation of subtitles to improve the accessibility of videos available on modern streaming services.
- Score: 62.25533750469467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this paper is automatic character-aware subtitle generation.
Given a video and a minimal amount of metadata, we propose an audio-visual
method that generates a full transcript of the dialogue, with precise speech
timestamps, and the character speaking identified. The key idea is to first use
audio-visual cues to select a set of high-precision audio exemplars for each
character, and then use these exemplars to classify all speech segments by
speaker identity. Notably, the method does not require face detection or
tracking. We evaluate the method over a variety of TV sitcoms, including
Seinfeld, Frasier, and Scrubs. We envision this system being useful for the
automatic generation of subtitles to improve the accessibility of the vast
number of videos available on modern streaming services. Project page:
https://www.robots.ox.ac.uk/~vgg/research/look-listen-recognise/
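The abstract describes a two-stage idea: first select high-precision audio exemplars for each character, then classify every speech segment by speaker identity. The sketch below illustrates only the second, exemplar-matching stage under simplifying assumptions: segments are assumed to be already embedded by some speaker-embedding model, and the character names, similarity threshold, and random features are purely illustrative; this is not the authors' implementation.

```python
# Minimal sketch of exemplar-based speaker classification (not the authors' code).
# Assumes each speech segment has already been embedded with a speaker-embedding
# model; the embedding step itself is out of scope here.
import numpy as np

def l2_normalise(x: np.ndarray) -> np.ndarray:
    """Row-wise L2 normalisation so that dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify_segments(
    exemplars: dict[str, np.ndarray],  # character -> (n_i, d) exemplar embeddings
    segments: np.ndarray,              # (m, d) embeddings of unlabelled speech segments
    min_similarity: float = 0.5,       # illustrative threshold; below it we abstain
) -> list[str]:
    """Assign each segment to the character whose exemplar centroid is most similar."""
    names = list(exemplars)
    centroids = l2_normalise(
        np.stack([l2_normalise(exemplars[n]).mean(axis=0) for n in names])
    )                                              # (k, d), one centroid per character
    sims = l2_normalise(segments) @ centroids.T    # (m, k) cosine similarities
    best = sims.argmax(axis=1)
    return [
        names[j] if sims[i, j] >= min_similarity else "unknown"
        for i, j in enumerate(best)
    ]

# Toy usage with random vectors standing in for real speaker embeddings.
rng = np.random.default_rng(0)
exemplars = {"Jerry": rng.normal(size=(5, 256)), "Elaine": rng.normal(size=(5, 256))}
segments = rng.normal(size=(3, 256))
print(classify_segments(exemplars, segments))
```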
Related papers
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues (a generic sketch of such a pipeline follows this list).
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- Zero-shot audio captioning with audio-language model guidance and audio context keywords [59.58331215337357]
We propose ZerAuCap, a novel framework for summarising general audio signals in a text caption without requiring task-specific training.
Our framework exploits a pre-trained large language model (LLM), guided by a pre-trained audio-language model, to generate the captions.
Our proposed framework achieves state-of-the-art results in zero-shot audio captioning on the AudioCaps and Clotho datasets.
arXiv Detail & Related papers (2023-11-14T18:55:48Z)
- Towards Generating Diverse Audio Captions via Adversarial Training [33.76154801580643]
We propose a conditional generative adversarial network (C-GAN) to improve the diversity of audio captioning systems.
A caption generator and two hybrid discriminators compete and are learned jointly, where the caption generator can be any standard encoder-decoder captioning model.
The results show that our proposed model can generate captions with better diversity compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-12-05T05:06:19Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)
- Multi-modal Dense Video Captioning [18.592384822257948]
We present a new dense video captioning approach that is able to utilize any number of modalities for event description.
We show how audio and speech modalities may improve a dense video captioning model.
arXiv Detail & Related papers (2020-03-17T15:15:17Z)
- Unsupervised Audiovisual Synthesis via Exemplar Autoencoders [59.13989658692953]
We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially infinitely many output speakers.
We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target speech exemplar.
arXiv Detail & Related papers (2020-01-13T18:56:45Z)
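For the first related paper above, which combines speech recognition, speaker diarisation, and character recognition, the following is a generic, hypothetical sketch of how such components might be chained into a subtitling pipeline. The helper functions are stubs that return placeholder data; none of this corresponds to released code from the papers listed here.

```python
# Hypothetical outline of a character-aware subtitling pipeline.
# The three helpers are stubs standing in for real ASR, diarisation,
# and character-recognition components.
from dataclasses import dataclass

@dataclass
class SubtitleLine:
    start: float   # seconds
    end: float     # seconds
    speaker: str   # character name, or "unknown"
    text: str

def transcribe(video_path: str) -> list[tuple[float, float, str]]:
    """Stub for an ASR step returning (start, end, text) speech segments."""
    return [(1.0, 2.5, "Hello, Newman."), (2.8, 4.0, "Hello, Jerry.")]

def diarise(video_path: str, segments: list[tuple[float, float, str]]) -> list[int]:
    """Stub for speaker diarisation: one cluster id per segment."""
    return [0, 1]

def name_clusters(video_path: str, cluster_ids: list[int]) -> dict[int, str]:
    """Stub for audio-visual character recognition: cluster id -> character name."""
    return {0: "Jerry", 1: "Newman"}

def generate_subtitles(video_path: str) -> list[SubtitleLine]:
    segments = transcribe(video_path)            # 1. what was said, and when
    clusters = diarise(video_path, segments)     # 2. which voice said it
    names = name_clusters(video_path, clusters)  # 3. which character that voice is
    return [
        SubtitleLine(start, end, names.get(clusters[i], "unknown"), text)
        for i, (start, end, text) in enumerate(segments)
    ]

if __name__ == "__main__":
    for line in generate_subtitles("episode.mp4"):
        print(f"[{line.start:.1f}-{line.end:.1f}] {line.speaker}: {line.text}")
```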