Visual Keyword Spotting with Attention
- URL: http://arxiv.org/abs/2110.15957v1
- Date: Fri, 29 Oct 2021 17:59:04 GMT
- Title: Visual Keyword Spotting with Attention
- Authors: K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
- Abstract summary: We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
- Score: 82.79015266453533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the task of spotting spoken keywords in silent
video sequences -- also known as visual keyword spotting. To this end, we
investigate Transformer-based models that ingest two streams, a visual encoding
of the video and a phonetic encoding of the keyword, and output the temporal
location of the keyword if present. Our contributions are as follows: (1) We
propose a novel architecture, the Transpotter, that uses full cross-modal
attention between the visual and phonetic streams; (2) We show through
extensive evaluations that our model outperforms the prior state-of-the-art
visual keyword spotting and lip reading methods on the challenging LRW, LRS2,
LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to
spot words under the extreme conditions of isolated mouthings in sign language
videos.
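
Below is a minimal, illustrative sketch of the two-stream idea described in the abstract: visual features and a phonetic encoding of the keyword are embedded into a shared space, concatenated, and processed by a transformer encoder so that every layer applies full cross-modal attention; the model then predicts a per-frame localisation probability and a clip-level presence probability. The module names, dimensions, output heads, and the omission of positional encodings are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of a Transpotter-style two-stream model:
# visual features and phoneme embeddings are concatenated and processed by a
# transformer encoder, so every layer applies full cross-modal self-attention.
# Dimensions, layer counts, and the output heads are illustrative assumptions;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn


class VisualKeywordSpotter(nn.Module):
    def __init__(self, vocab_size=50, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.phoneme_embed = nn.Embedding(vocab_size, d_model)    # phonetic stream
        self.visual_proj = nn.Linear(512, d_model)                # project lip features
        self.type_embed = nn.Embedding(2, d_model)                # 0 = video, 1 = keyword
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.frame_head = nn.Linear(d_model, 1)   # per-frame keyword localisation
        self.clip_head = nn.Linear(d_model, 1)    # clip-level keyword presence

    def forward(self, visual_feats, phoneme_ids):
        # visual_feats: (B, T, 512) lip-region features; phoneme_ids: (B, P) phoneme IDs
        v = self.visual_proj(visual_feats) + self.type_embed.weight[0]
        p = self.phoneme_embed(phoneme_ids) + self.type_embed.weight[1]
        x = self.encoder(torch.cat([v, p], dim=1))          # full cross-modal attention
        frames = x[:, : visual_feats.size(1)]               # tokens aligned with frames
        frame_prob = torch.sigmoid(self.frame_head(frames)).squeeze(-1)  # (B, T)
        clip_prob = torch.sigmoid(self.clip_head(frames.mean(dim=1)))    # (B, 1)
        return frame_prob, clip_prob


if __name__ == "__main__":
    model = VisualKeywordSpotter()
    frame_prob, clip_prob = model(torch.randn(2, 75, 512), torch.randint(0, 50, (2, 12)))
    print(frame_prob.shape, clip_prob.shape)  # torch.Size([2, 75]) torch.Size([2, 1])
```
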
Related papers
- Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection [14.801564966406486]
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query.
We present a novel Video Context-aware Keyword Attention module that attends to query keywords within the context of the overall video.
We propose a keyword weight detection module with keyword-aware contrastive learning to enhance fine-grained alignment between visual and textual features.
arXiv Detail & Related papers (2025-01-05T11:01:27Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Sign Language Production with Latent Motion Transformer [2.184775414778289]
We develop a new method to generate high-quality sign videos without using human poses as an intermediate step.
Our model works in two main parts: first, it learns hidden (latent) features of the video with a generator, and then a second model learns the order of these hidden features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
arXiv Detail & Related papers (2023-12-20T10:53:06Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection (a minimal sketch of this similarity-map idea appears after this list).
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
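
As a companion to the "Seeing wake words" entry above, here is a minimal, illustrative sketch of a similarity-map style keyword spotter: cosine similarities between per-phoneme keyword embeddings and per-frame video embeddings form a 2-D map (sequence matching), and a small convolutional detector scores keyword-like patterns in that map (pattern detection). Shapes, module names, and output heads are assumptions for illustration, not the published KWS-Net architecture.

```python
# Illustrative sketch (assumed shapes and modules, not the published KWS-Net) of a
# similarity-map approach: sequence matching via a cosine-similarity map between
# keyword and video embeddings, followed by pattern detection with a small CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityMapKWS(nn.Module):
    def __init__(self):
        super().__init__()
        self.detector = nn.Sequential(            # pattern detection over the map
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, video_emb, keyword_emb):
        # video_emb: (B, T, D) per-frame embeddings; keyword_emb: (B, P, D) per-phoneme
        v = F.normalize(video_emb, dim=-1)
        k = F.normalize(keyword_emb, dim=-1)
        sim = torch.einsum("btd,bpd->bpt", v, k)                 # (B, P, T) cosine map
        score_map = self.detector(sim.unsqueeze(1)).squeeze(1)   # (B, P, T)
        frame_scores = score_map.max(dim=1).values   # best phoneme match per frame
        clip_score = frame_scores.max(dim=1).values  # keyword presence score per clip
        return torch.sigmoid(frame_scores), torch.sigmoid(clip_score)


if __name__ == "__main__":
    kws = SimilarityMapKWS()
    frame_p, clip_p = kws(torch.randn(2, 75, 256), torch.randn(2, 10, 256))
    print(frame_p.shape, clip_p.shape)  # torch.Size([2, 75]) torch.Size([2])
```
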