Visual Keyword Spotting with Attention
- URL: http://arxiv.org/abs/2110.15957v1
- Date: Fri, 29 Oct 2021 17:59:04 GMT
- Title: Visual Keyword Spotting with Attention
- Authors: K R Prajwal, Liliane Momeni, Triantafyllos Afouras, Andrew Zisserman
- Abstract summary: We investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword.
We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods.
We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
- Score: 82.79015266453533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the task of spotting spoken keywords in silent
video sequences -- also known as visual keyword spotting. To this end, we
investigate Transformer-based models that ingest two streams, a visual encoding
of the video and a phonetic encoding of the keyword, and output the temporal
location of the keyword if present. Our contributions are as follows: (1) We
propose a novel architecture, the Transpotter, that uses full cross-modal
attention between the visual and phonetic streams; (2) We show through
extensive evaluations that our model outperforms the prior state-of-the-art
visual keyword spotting and lip reading methods on the challenging LRW, LRS2,
LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to
spot words under the extreme conditions of isolated mouthings in sign language
videos.
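
Below is a minimal, illustrative sketch of the two-stream idea described in the abstract: visual features and a phonetic encoding of the keyword are embedded into a shared space, concatenated, and processed by a transformer encoder so that every layer applies full cross-modal attention; the model then predicts a per-frame localisation probability and a clip-level presence probability. The module names, dimensions, output heads, and the omission of positional encodings are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (not the authors' code) of a Transpotter-style two-stream model:
# visual features and phoneme embeddings are concatenated and processed by a
# transformer encoder, so every layer applies full cross-modal self-attention.
# Dimensions, layer counts, and the output heads are illustrative assumptions;
# positional encodings are omitted for brevity.
import torch
import torch.nn as nn


class VisualKeywordSpotter(nn.Module):
    def __init__(self, vocab_size=50, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        self.phoneme_embed = nn.Embedding(vocab_size, d_model)    # phonetic stream
        self.visual_proj = nn.Linear(512, d_model)                # project lip features
        self.type_embed = nn.Embedding(2, d_model)                # 0 = video, 1 = keyword
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=2048,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.frame_head = nn.Linear(d_model, 1)   # per-frame keyword localisation
        self.clip_head = nn.Linear(d_model, 1)    # clip-level keyword presence

    def forward(self, visual_feats, phoneme_ids):
        # visual_feats: (B, T, 512) lip-region features; phoneme_ids: (B, P) phoneme IDs
        v = self.visual_proj(visual_feats) + self.type_embed.weight[0]
        p = self.phoneme_embed(phoneme_ids) + self.type_embed.weight[1]
        x = self.encoder(torch.cat([v, p], dim=1))          # full cross-modal attention
        frames = x[:, : visual_feats.size(1)]               # tokens aligned with frames
        frame_prob = torch.sigmoid(self.frame_head(frames)).squeeze(-1)  # (B, T)
        clip_prob = torch.sigmoid(self.clip_head(frames.mean(dim=1)))    # (B, 1)
        return frame_prob, clip_prob


if __name__ == "__main__":
    model = VisualKeywordSpotter()
    frame_prob, clip_prob = model(torch.randn(2, 75, 512), torch.randint(0, 50, (2, 12)))
    print(frame_prob.shape, clip_prob.shape)  # torch.Size([2, 75]) torch.Size([2, 1])
```
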
Related papers
- Watch Video, Catch Keyword: Context-aware Keyword Attention for Moment Retrieval and Highlight Detection [14.801564966406486]
The goal of video moment retrieval and highlight detection is to identify specific segments and highlights based on a given text query.
We present a novel Video Context-aware Keyword Attention module that attends to query keywords within the context of the overall video.
We propose a keyword weight detection module with keyword-aware contrastive learning to enhance fine-grained alignment between visual and textual features.
arXiv Detail & Related papers (2025-01-05T11:01:27Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Sign Language Production with Latent Motion Transformer [2.184775414778289]
We develop a new method to generate high-quality sign videos without using human poses as an intermediate step.
Our model works in two main parts: first, it learns hidden (latent) features of the video with a generator, and then a second model learns the order of these hidden features.
Compared with previous state-of-the-art approaches, our model performs consistently better on two word-level sign language datasets.
arXiv Detail & Related papers (2023-12-20T10:53:06Z)
- Towards visually prompted keyword localisation for zero-resource spoken languages [27.696096343873215]
We formalise the task of visually prompted keyword localisation (VPKL): given an image of a keyword, detect and predict where in an utterance the keyword occurs.
We show that these innovations give improvements in VPKL over an existing speech-vision model.
arXiv Detail & Related papers (2022-10-12T14:17:34Z)
- VCSE: Time-Domain Visual-Contextual Speaker Extraction Network [54.67547526785552]
We propose a two-stage time-domain visual-contextual speaker extraction network named VCSE.
In the first stage, we pre-extract a target speech with visual cues and estimate the underlying phonetic sequence.
In the second stage, we refine the pre-extracted target speech with the self-enrolled contextual cues.
arXiv Detail & Related papers (2022-10-09T12:29:38Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- Seeing wake words: Audio-visual Keyword Spotting [103.12655603634337]
KWS-Net is a novel convolutional architecture that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection (a minimal sketch of this similarity-map idea appears after this list).
We show that our method generalises to other languages, specifically French and German, and achieves a comparable performance to English with less language specific data.
arXiv Detail & Related papers (2020-09-02T17:57:38Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
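
As a companion to the "Seeing wake words" entry above, here is a minimal, illustrative sketch of a similarity-map style keyword spotter: cosine similarities between per-phoneme keyword embeddings and per-frame video embeddings form a 2-D map (sequence matching), and a small convolutional detector scores keyword-like patterns in that map (pattern detection). Shapes, module names, and output heads are assumptions for illustration, not the published KWS-Net architecture.

```python
# Illustrative sketch (assumed shapes and modules, not the published KWS-Net) of a
# similarity-map approach: sequence matching via a cosine-similarity map between
# keyword and video embeddings, followed by pattern detection with a small CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityMapKWS(nn.Module):
    def __init__(self):
        super().__init__()
        self.detector = nn.Sequential(            # pattern detection over the map
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1),
        )

    def forward(self, video_emb, keyword_emb):
        # video_emb: (B, T, D) per-frame embeddings; keyword_emb: (B, P, D) per-phoneme
        v = F.normalize(video_emb, dim=-1)
        k = F.normalize(keyword_emb, dim=-1)
        sim = torch.einsum("btd,bpd->bpt", v, k)                 # (B, P, T) cosine map
        score_map = self.detector(sim.unsqueeze(1)).squeeze(1)   # (B, P, T)
        frame_scores = score_map.max(dim=1).values   # best phoneme match per frame
        clip_score = frame_scores.max(dim=1).values  # keyword presence score per clip
        return torch.sigmoid(frame_scores), torch.sigmoid(clip_score)


if __name__ == "__main__":
    kws = SimilarityMapKWS()
    frame_p, clip_p = kws(torch.randn(2, 75, 256), torch.randn(2, 10, 256))
    print(frame_p.shape, clip_p.shape)  # torch.Size([2, 75]) torch.Size([2])
```
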