Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
- URL: http://arxiv.org/abs/2009.11204v2
- Date: Fri, 16 Oct 2020 15:08:12 GMT
- Title: Learning Visual Voice Activity Detection with an Automatically Annotated Dataset
- Authors: Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo and Radu Horaud
- Abstract summary: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow.
We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD.
- Score: 20.725871972294236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual voice activity detection (V-VAD) uses visual features to predict
whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD)
is inefficient either because the acoustic signal is difficult to analyze or
because it is simply missing. We propose two deep architectures for V-VAD, one
based on facial landmarks and one based on optical flow. Moreover, available
datasets, used for learning and for testing V-VAD, lack content variability. We
introduce a novel methodology to automatically create and annotate very large
datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face
detection and tracking. A thorough empirical evaluation shows the advantage of
training the proposed deep V-VAD models with this dataset.
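The automatic annotation idea (combining A-VAD with face detection and tracking) can be sketched in a few lines. The snippet below is an illustrative simplification, not the authors' code: all names are hypothetical, it assumes speech segments and face tracks have already been extracted as time intervals, and it omits the paper's additional constraints (such as requiring a single visible face during a speech segment).

```python
# Hedged sketch of WildVVAD-style automatic labeling: intersect audio-VAD
# speech segments with face-track intervals to produce weak labels.

def overlap(a, b):
    """Length of temporal overlap between intervals a=(start, end), b=(start, end)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def label_tracks(speech_segments, face_tracks, min_ratio=0.9):
    """Label each face track by how much of it lies inside A-VAD speech.

    A track almost fully covered by speech is labeled "speaking"; a track
    with no speech overlap is labeled "not_speaking". Ambiguous tracks are
    discarded rather than mislabeled, trading dataset size for label quality.
    """
    labels = {}
    for track_id, (start, end) in face_tracks.items():
        duration = end - start
        covered = sum(overlap((start, end), seg) for seg in speech_segments)
        ratio = covered / duration if duration > 0 else 0.0
        if ratio >= min_ratio:
            labels[track_id] = "speaking"
        elif ratio == 0.0:
            labels[track_id] = "not_speaking"
        # else: partial overlap -> ambiguous, skip
    return labels

speech = [(0.0, 2.0), (5.0, 8.0)]          # seconds, from an A-VAD front end
tracks = {"face_a": (0.2, 1.8),            # inside a speech segment
          "face_b": (3.0, 4.5),            # entirely in silence
          "face_c": (1.5, 6.0)}            # straddles speech and silence
print(label_tracks(speech, tracks))
# {'face_a': 'speaking', 'face_b': 'not_speaking'}
```

Discarding the ambiguous middle case is what makes labels cheap yet reliable enough to train the deep V-VAD models on at scale.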
Related papers
- VANP: Learning Where to See for Navigation with Self-Supervised Vision-Action Pre-Training [8.479135285935113]
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation.
Most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects.
We propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP).
arXiv Detail & Related papers (2024-03-12T22:33:08Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for pre-trained vision model (PVM) finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
arXiv Detail & Related papers (2023-12-22T17:34:01Z) - Unleashing Text-to-Image Diffusion Models for Visual Perception [84.41514649568094]
VPD (Visual Perception with a pre-trained diffusion model) is a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks.
We show that VPD can be quickly adapted to downstream visual perception tasks.
arXiv Detail & Related papers (2023-03-03T18:59:47Z) - DeepVATS: Deep Visual Analytics for Time Series [7.822594828788055]
We present DeepVATS, an open-source tool that brings the field of Deep Visual Analytics into time series data.
DeepVATS trains, in a self-supervised way, a masked time series autoencoder that reconstructs patches of a time series.
We report on results that validate the utility of DeepVATS, running experiments on both synthetic and real datasets.
arXiv Detail & Related papers (2023-02-08T03:26:50Z) - EVA: Exploring the Limits of Masked Visual Representation Learning at Scale [46.952339726872374]
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct the masked out image-text aligned vision features conditioned on visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize training and outperform the from-scratch counterpart with far fewer samples and less compute.
arXiv Detail & Related papers (2022-11-14T18:59:52Z) - Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision [6.8582563015193]
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, achieve performance comparable to that of some models trained with aligned pairs on various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
arXiv Detail & Related papers (2022-10-24T20:30:55Z) - EMA-VIO: Deep Visual-Inertial Odometry with External Memory Attention [5.144653418944836]
Visual-inertial odometry (VIO) algorithms exploit the information from camera and inertial sensors to estimate position and orientation.
Recent deep learning based VIO models have attracted attention because they provide pose information in a data-driven way.
We propose a novel learning based VIO framework with external memory attention that effectively and efficiently combines visual and inertial features for state estimation.
arXiv Detail & Related papers (2022-09-18T07:05:36Z) - Towards Scale Consistent Monocular Visual Odometry by Learning from the Virtual World [83.36195426897768]
We propose VRVO, a novel framework for retrieving the absolute scale from virtual data.
We first train a scale-aware disparity network using both monocular real images and stereo virtual data.
The resulting scale-consistent disparities are then integrated with a direct VO system.
arXiv Detail & Related papers (2022-03-11T01:51:54Z) - Off-policy Imitation Learning from Visual Inputs [83.22342811160114]
We propose OPIfVI, which is composed of an off-policy learning manner, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
arXiv Detail & Related papers (2021-11-08T09:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.