Visual Keyword Spotting with Attention
- URL: http://arxiv.org/abs/2110.15957v1
- Date: Fri, 29 Oct 2021 17:59:04 GMT
- Title: Visual Keyword Spotting with Attention
- Authors: K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
- Abstract summary: We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
- Score: 82.79015266453533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the task of spotting spoken keywords in silent
video sequences -- also known as visual keyword spotting. To this end, we
investigate Transformer-based models that ingest two streams, a visual encoding
of the video and a phonetic encoding of the keyword, and output the temporal
location of the keyword if present. Our contributions are as follows: (1) We
propose a novel architecture, the Transpotter, that uses full cross-modal
attention between the visual and phonetic streams; (2) We show through
extensive evaluations that our model outperforms the prior state-of-the-art
visual keyword spotting and lip reading methods on the challenging LRW, LRS2,
LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to
spot words under the extreme conditions of isolated mouthings in sign language
videos.
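As a rough illustration of the two-stream design described in the abstract (a visual encoding of the video, a phonetic encoding of the keyword, cross-modal attention between them, and a temporal localisation output), here is a minimal PyTorch sketch. It is not the authors' Transpotter; the module names, dimensions, and decoder-style attention are illustrative assumptions only.

```python
# Minimal sketch of a two-stream cross-modal keyword spotter in PyTorch.
# Illustrative approximation of the idea described above, NOT the authors'
# implementation; all names and sizes are made up.
import torch
import torch.nn as nn

class CrossModalKeywordSpotter(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=3, n_phonemes=50):
        super().__init__()
        # Phonetic stream: embed the keyword's phoneme sequence.
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # Visual stream is assumed to arrive as per-frame features
        # (e.g. from a lip-reading backbone) already projected to d_model.
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Cross-modal attention: video frames attend to the phoneme tokens.
        self.cross_attention = nn.TransformerDecoder(decoder_layer, n_layers)
        # Per-frame head: probability that the keyword is being mouthed
        # at each time step (its temporal location if present).
        self.frame_head = nn.Linear(d_model, 1)
        # Clip-level head: is the keyword present anywhere in the clip?
        self.clip_head = nn.Linear(d_model, 1)

    def forward(self, video_feats, phoneme_ids):
        # video_feats: (batch, n_frames, d_model); phoneme_ids: (batch, n_phones)
        phone_feats = self.phoneme_emb(phoneme_ids)
        fused = self.cross_attention(tgt=video_feats, memory=phone_feats)
        frame_probs = torch.sigmoid(self.frame_head(fused)).squeeze(-1)
        clip_prob = torch.sigmoid(self.clip_head(fused.mean(dim=1))).squeeze(-1)
        return clip_prob, frame_probs

# Example usage with random inputs.
model = CrossModalKeywordSpotter()
video = torch.randn(2, 75, 256)             # 2 clips, 75 frames each
keyword = torch.randint(0, 50, (2, 6))      # a 6-phoneme keyword per clip
clip_prob, frame_probs = model(video, keyword)
print(clip_prob.shape, frame_probs.shape)   # torch.Size([2]) torch.Size([2, 75])
```

The paper describes full cross-modal attention between the two streams; the sketch approximates this with a decoder in which video frames attend to phoneme tokens, which is a simplification.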
Related papers
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Sign Language Production with Latent Motion Transformer [2.184775414778289]
We develop a new method to make high-quality sign videos without using human poses as an intermediate step.
Our model works in two main parts: first, a generator learns the video's hidden (latent) features, and then a second model learns the order of these hidden features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
arXiv Detail & Related papers (2023-12-20T10:53:06Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL): given an image depicting a keyword, detect and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
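To make the VPKL task above concrete, the following sketch scores each speech frame of an utterance against an image query embedding and returns a detection score plus a peak location. The encoders and feature sizes are placeholders, not the model from the paper.

```python
# Illustrative sketch of visually prompted keyword localisation (VPKL):
# score each speech frame against an image query embedding and pick the
# best-matching position. Encoders are placeholders, NOT the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VPKLScorer(nn.Module):
    def __init__(self, d_embed=128):
        super().__init__()
        # Placeholder encoders; in practice these would be an image CNN and a
        # speech encoder trained with a paired image-speech objective.
        self.image_proj = nn.Linear(2048, d_embed)   # e.g. pooled CNN features
        self.speech_proj = nn.Linear(39, d_embed)    # e.g. MFCC frames

    def forward(self, image_feat, speech_frames):
        # image_feat: (batch, 2048); speech_frames: (batch, n_frames, 39)
        q = F.normalize(self.image_proj(image_feat), dim=-1)
        s = F.normalize(self.speech_proj(speech_frames), dim=-1)
        # Frame-wise similarity between the visual query and the utterance.
        scores = torch.einsum("bd,btd->bt", q, s)
        detection = scores.max(dim=1).values         # is the keyword present?
        location = scores.argmax(dim=1)              # where does it peak?
        return detection, location

scorer = VPKLScorer()
det, loc = scorer(torch.randn(1, 2048), torch.randn(1, 400, 39))
print(det.shape, loc.item())
```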
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Thinking Hallucination for Video Captioning [0.76146285961466]
In video captioning, there are two kinds of hallucination: object and action hallucination.
We identify three main factors: (i) inadequate visual features extracted from pre-trained models, (ii) improper influences of source and target contexts during multi-modal fusion, and (iii) exposure bias in the training strategy.
Our method achieves state-of-the-art performance on the MSR-Video to Text (MSR-VTT) and the Microsoft Research Video Description Corpus (MSVD) datasets.
arXiv Detail & Related papers (2022-09-28T06:15:42Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection.
We show that our method generalises to other languages, specifically French and German, and achieves performance comparable to English with less language-specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
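The similarity-map idea in the KWS-Net entry above can be sketched as follows: compute a phoneme-by-frame similarity map between the keyword and the video, then run a small CNN over that map to detect and locate the keyword. This is an illustration with made-up layer sizes, not the published KWS-Net architecture.

```python
# Rough sketch of a similarity-map keyword spotter: a phoneme-by-frame
# similarity map is treated as an "image" and scanned by a small 2-D CNN.
# Shapes and layers are illustrative, NOT the published KWS-Net.
import torch
import torch.nn as nn

class SimilarityMapKWS(nn.Module):
    def __init__(self, d_model=256, n_phonemes=50):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # 2-D CNN over the single-channel similarity map.
        self.map_cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, video_feats, phoneme_ids):
        # video_feats: (batch, n_frames, d_model); phoneme_ids: (batch, n_phones)
        phones = self.phoneme_emb(phoneme_ids)
        # Similarity map: one score per (phoneme, frame) pair.
        sim = torch.einsum("bpd,btd->bpt", phones, video_feats)
        heat = self.map_cnn(sim.unsqueeze(1)).squeeze(1)  # (batch, n_phones, n_frames)
        frame_scores = heat.max(dim=1).values             # best phoneme match per frame
        return frame_scores

model = SimilarityMapKWS()
scores = model(torch.randn(2, 75, 256), torch.randint(0, 50, (2, 6)))
print(scores.shape)   # torch.Size([2, 75])
```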
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
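The FIAN entry above describes an iterative attention module for bilateral query-video information extraction. Below is a generic iterative co-attention sketch of that pattern; it is an assumption-laden illustration of the general idea, not the authors' FIAN architecture.

```python
# Generic iterative bilateral (co-)attention between a sentence query and
# video features, in the spirit of the FIAN summary above; NOT the authors'
# architecture, just a minimal illustration.
import torch
import torch.nn as nn

class IterativeCoAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_iters=2):
        super().__init__()
        self.n_iters = n_iters
        self.video_to_query = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.query_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_feats, query_feats):
        # video_feats: (batch, n_frames, d); query_feats: (batch, n_words, d)
        v, q = video_feats, query_feats
        for _ in range(self.n_iters):
            # Each stream repeatedly attends to the other, refining both.
            v = v + self.video_to_query(v, q, q)[0]
            q = q + self.query_to_video(q, v, v)[0]
        return v, q   # refined features for a downstream localisation head

coattn = IterativeCoAttention()
v, q = coattn(torch.randn(1, 120, 256), torch.randn(1, 12, 256))
print(v.shape, q.shape)
```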
This list is automatically generated from the titles and abstracts of the papers on this site.