Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
- URL: http://arxiv.org/abs/2009.11204v2
- Date: Fri, 16 Oct 2020 15:08:12 GMT
- Title: Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
- Authors: Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo and Radu Horaud
- Abstract summary: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow.
We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD.
- Score: 20.725871972294236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual voice activity detection (V-VAD) uses visual features to predict
whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD)
is inefficient either because the acoustic signal is difficult to analyze or
because it is simply missing. We propose two deep architectures for V-VAD, one
based on facial landmarks and one based on optical flow. Moreover, available
datasets, used for learning and for testing V-VAD, lack content variability. We
introduce a novel methodology to automatically create and annotate very large
datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face
detection and tracking. A thorough empirical evaluation shows the advantage of
training the proposed deep V-VAD models with this dataset.
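The automatic annotation idea (combining A-VAD with face detection and tracking) can be sketched in a few lines. The snippet below is an illustrative simplification, not the authors' code: all names are hypothetical, it assumes speech segments and face tracks have already been extracted as time intervals, and it omits the paper's additional constraints (such as requiring a single visible face during a speech segment).

```python
# Hedged sketch of WildVVAD-style automatic labeling: intersect audio-VAD
# speech segments with face-track intervals to produce weak labels.

def overlap(a, b):
    """Length of temporal overlap between intervals a=(start, end), b=(start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def label_tracks(speech_segments, face_tracks, min_ratio=0.9):
    """Label each face track by how much of it lies inside A-VAD speech.

    A track almost fully covered by speech is labeled "speaking"; a track
    with no speech overlap is labeled "not_speaking". Ambiguous tracks are
    discarded rather than mislabeled, trading dataset size for label quality.
    """
    labels = {}
    for track_id, (start, end) in face_tracks.items():
        duration = end - start
        covered = sum(overlap((start, end), seg) for seg in speech_segments)
        ratio = covered / duration if duration > 0 else 0.0
        if ratio >= min_ratio:
            labels[track_id] = "speaking"
        elif ratio == 0.0:
            labels[track_id] = "not_speaking"
        # else: partial overlap -> ambiguous, skip
    return labels

speech = [(0.0, 2.0), (5.0, 8.0)]          # seconds, from an A-VAD front end
tracks = {"face_a": (0.2, 1.8),            # inside a speech segment
          "face_b": (3.0, 4.5),            # entirely in silence
          "face_c": (1.5, 6.0)}            # straddles speech and silence
print(label_tracks(speech, tracks))
# {'face_a': 'speaking', 'face_b': 'not_speaking'}
```

Discarding the ambiguous middle case is what makes labels cheap yet reliable enough to train the deep V-VAD models on at scale.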
Related papers
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pre-trained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - DeepVATS: Deep Visual Analytics for Time Series [7.822594828788055]
We present DeepVATS, an open-source tool that brings the field of Deep Visual Analytics into time series data.
DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series.
We report on results that validate the utility of DeepVATS, running experiments on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-08T03:26:50Z) - EVA: Exploring the Limits of Masked Visual Representation Learning at Scale [46.952339726872374]
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the from-scratch counterpart with far fewer samples and less compute.
arXiv Detail & Related papers (2022-11-14T18:59:52Z) - Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision [6.8582563015193]
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, achieve performance comparable to that of some models trained with aligned pairs on various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
arXiv Detail & Related papers (2022-10-24T20:30:55Z) - EMA-VIO: Deep Visual-Inertial Odometry with External Memory Attention [5.144653418944836]
Visual-inertial odometry (VIO) algorithms exploit the information from camera and inertial sensors to estimate position and orientation.
Recent deep learning based VIO models have attracted attention because they provide pose information in a data-driven way.
We propose a novel learning based VIO framework with external memory attention that effectively and efficiently combines visual and inertial features for state estimation.
arXiv Detail & Related papers (2022-09-18T07:05:36Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - Off-policy Imitation Learning from Visual Inputs [83.22342811160114]
We propose OPIfVI, which is composed of an off-policy learning manner, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
arXiv Detail & Related papers (2021-11-08T09:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.