Seeing wake words: Audio-visual Keyword Spotting
- URL: http://arxiv.org/abs/2009.01225v1
- Date: Wed, 2 Sep 2020 17:57:38 GMT
- Title: Seeing wake words: Audio-visual Keyword Spotting
- Authors: Liliane Momeni and Triantafyllos Afouras and Themos Stafylakis and
Samuel Albanie and Andrew Zisserman
- Abstract summary: KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves performance comparable to English with less language-specific data.
- Score: 103.12655603634337
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of this work is to automatically determine whether and when a word
of interest is spoken by a talking face, with or without the audio. We propose
a zero-shot method suitable for in-the-wild videos. Our key contributions are:
(1) a novel convolutional architecture, KWS-Net, that uses a similarity map
intermediate representation to separate the task into (i) sequence matching,
and (ii) pattern detection, to decide whether the word is there and when; (2)
we demonstrate that if audio is available, visual keyword spotting improves
performance for both clean and noisy audio signals. Finally, (3) we show that
our method generalises to other languages, specifically French and German, and
achieves performance comparable to English with less language-specific data,
by fine-tuning the network pre-trained on English. The method exceeds the
performance of the previous state-of-the-art visual keyword spotting
architecture when trained and tested on the same benchmark, and also that of a
state-of-the-art lip reading method.
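To make the similarity-map idea concrete, the sketch below shows one way such a two-part spotter could be wired up. It is a rough illustration only, not the authors' KWS-Net: the use of PyTorch, the GRU encoders, the cosine similarity, the detector CNN, and all names and tensor shapes are assumptions; only the split into (i) sequence matching via a keyword-by-frame similarity map and (ii) convolutional pattern detection over that map comes from the abstract.

```python
# Hypothetical sketch of a similarity-map keyword spotter.
# Not the authors' KWS-Net implementation; modules, shapes, and names are
# illustrative assumptions based only on the abstract.
import torch
import torch.nn as nn


class SimilarityMapKWS(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # (i) sequence matching: embed the keyword's phoneme sequence and the
        # per-frame visual features into a shared space.
        self.phoneme_encoder = nn.GRU(input_size=dim, hidden_size=dim,
                                      batch_first=True)
        self.video_encoder = nn.GRU(input_size=dim, hidden_size=dim,
                                    batch_first=True)
        # (ii) pattern detection: a small CNN scans the similarity map for the
        # pattern a spoken keyword would leave.
        self.detector = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, keyword_feats, video_feats):
        # keyword_feats: (B, P, dim) phoneme features of the query word
        # video_feats:   (B, T, dim) per-frame visual (lip) features
        k, _ = self.phoneme_encoder(keyword_feats)            # (B, P, dim)
        v, _ = self.video_encoder(video_feats)                # (B, T, dim)
        k = nn.functional.normalize(k, dim=-1)
        v = nn.functional.normalize(v, dim=-1)
        # cosine similarity map between phonemes and frames: (B, 1, P, T)
        sim = torch.bmm(k, v.transpose(1, 2)).unsqueeze(1)
        score_map = self.detector(sim).squeeze(1)             # (B, P, T)
        # "whether": maximum score over the map; "when": frame of the maximum
        frame_scores = score_map.max(dim=1).values            # (B, T)
        presence = frame_scores.max(dim=1).values             # (B,)
        location = frame_scores.argmax(dim=1)                 # (B,)
        return presence, location


if __name__ == "__main__":
    model = SimilarityMapKWS(dim=256)
    keyword = torch.randn(2, 8, 256)    # 2 keywords, 8 phonemes each
    video = torch.randn(2, 100, 256)    # 100 frames of visual features
    presence, location = model(keyword, video)
    print(presence.shape, location.shape)  # torch.Size([2]) torch.Size([2])
```

In such a setup, the presence score would be thresholded to decide "whether" and the argmax frame read off to decide "when"; the paper additionally exploits the audio stream when it is available, which this sketch omits.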
Related papers
- Zero-Shot Audio Captioning via Audibility Guidance [57.70351255180495]
We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii) faithfulness of the generated text to the input audio, and (iii) audibility.
Our method is a zero-shot method, i.e., we do not learn to perform captioning.
We present our results on the AudioCaps dataset, demonstrating that audibility guidance significantly enhances performance compared to the baseline.
arXiv Detail & Related papers (2023-09-07T17:45:58Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Class-aware Sounding Objects Localization via Audiovisual Correspondence [51.39872698365446]
We propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios.
We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas.
Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones.
arXiv Detail & Related papers (2021-12-22T09:34:33Z)
- LipSound2: Self-Supervised Pre-Training for Lip-to-Speech Reconstruction and Lip Reading [24.744371143092614]
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos.
We propose LipSound2, which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms.
arXiv Detail & Related papers (2021-12-09T08:11:35Z)
- Attention-Based Keyword Localisation in Speech using Visual Grounding [32.170748231414365]
We investigate whether visually grounded speech models can also do keyword localisation.
We show that attention provides a large gain in performance over previous visually grounded models.
As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions.
arXiv Detail & Related papers (2021-06-16T15:29:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.