Repetitive Activity Counting by Sight and Sound
- URL: http://arxiv.org/abs/2103.13096v1
- Date: Wed, 24 Mar 2021 11:15:33 GMT
- Title: Repetitive Activity Counting by Sight and Sound
- Authors: Yunhua Zhang, Ling Shao, Cees G.M. Snoek
- Abstract summary: This paper strives for repetitive activity counting in videos.
Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process.
- Score: 110.36526333035907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper strives for repetitive activity counting in videos. Different from
existing works, which all analyze the visual video content only, we incorporate
for the first time the corresponding sound into the repetition counting
process. This benefits accuracy in challenging vision conditions such as
occlusion, dramatic camera view changes, low resolution, etc. We propose a
model that starts with analyzing the sight and sound streams separately. Then
an audiovisual temporal stride decision module and a reliability estimation
module are introduced to exploit cross-modal temporal interaction. For learning
and evaluation, an existing dataset is repurposed and reorganized to allow for
repetition counting with sight and sound. We also introduce a variant of this
dataset for repetition counting under challenging vision conditions.
Experiments demonstrate the benefit of sound, as well as the other introduced
modules, for repetition counting. Our sight-only model already outperforms the
state-of-the-art by itself; when we add sound, results improve notably,
especially under harsh vision conditions.
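The abstract only names the building blocks. As a rough illustration of how such a pipeline could be wired together (not the authors' implementation), the PyTorch sketch below encodes the sight and sound streams separately, predicts a temporal stride and a per-modality reliability from the joint representation, and fuses the two count estimates; all module choices, feature sizes, and names are assumptions.
```python
# Illustrative sketch of a sight-and-sound repetition counter.
# Backbones, feature sizes, and heads are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class SightSoundCounter(nn.Module):
    def __init__(self, dim=256, num_strides=4):
        super().__init__()
        # Separate temporal encoders for the visual and audio streams.
        self.visual_enc = nn.GRU(input_size=512, hidden_size=dim, batch_first=True)
        self.audio_enc = nn.GRU(input_size=128, hidden_size=dim, batch_first=True)
        # Temporal stride decision: scores candidate temporal sampling rates from both modalities.
        self.stride_head = nn.Linear(2 * dim, num_strides)
        # Reliability estimation: how much to trust each modality's count.
        self.reliability_head = nn.Linear(2 * dim, 2)
        # Per-modality count regressors.
        self.visual_count = nn.Linear(dim, 1)
        self.audio_count = nn.Linear(dim, 1)

    def forward(self, visual_feats, audio_feats):
        # visual_feats: (B, Tv, 512) frame features; audio_feats: (B, Ta, 128) spectrogram features.
        _, hv = self.visual_enc(visual_feats)   # (1, B, dim)
        _, ha = self.audio_enc(audio_feats)     # (1, B, dim)
        hv, ha = hv.squeeze(0), ha.squeeze(0)
        joint = torch.cat([hv, ha], dim=-1)
        # In a full system the chosen stride would drive how frames/audio are re-sampled.
        stride_logits = self.stride_head(joint)
        reliability = torch.softmax(self.reliability_head(joint), dim=-1)            # (B, 2)
        counts = torch.stack([self.visual_count(hv).squeeze(-1),
                              self.audio_count(ha).squeeze(-1)], dim=-1)              # (B, 2)
        # Fuse the two count estimates, weighted by the estimated reliability.
        fused_count = (reliability * counts).sum(dim=-1)
        return fused_count, stride_logits, reliability

model = SightSoundCounter()
fused, strides, rel = model(torch.randn(2, 64, 512), torch.randn(2, 100, 128))
```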
Related papers
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information, at the same time performing lightweight domain adaptation.
We show that these components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
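As a generic illustration of this pattern (not AVFormer's actual architecture), the sketch below prepends projected visual tokens to a frozen Transformer audio encoder and trains only a small adapter and the visual projection; all class names, layer counts, and dimensions are made up for the example.
```python
# Illustrative sketch of adapting a frozen audio model with visual tokens and a
# lightweight adapter; a generic pattern, not the AVFormer implementation.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small trainable bottleneck applied on top of frozen features."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class VisuallyAugmentedASR(nn.Module):
    def __init__(self, dim=256, vis_dim=512, layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        for p in self.audio_encoder.parameters():   # keep the speech model frozen
            p.requires_grad = False
        self.visual_proj = nn.Linear(vis_dim, dim)  # trainable: visual token projection
        self.adapter = Adapter(dim)                 # trainable: lightweight adaptation

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (B, Ta, dim); visual_feats: (B, Tv, vis_dim)
        vis_tokens = self.visual_proj(visual_feats)
        x = torch.cat([vis_tokens, audio_tokens], dim=1)  # prepend visual tokens
        return self.adapter(self.audio_encoder(x))

model = VisuallyAugmentedASR()
out = model(torch.randn(2, 50, 256), torch.randn(2, 8, 512))
```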
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
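For a concrete picture of the contrastive part only (the paper's exact objective and its temporal self-supervision are not spelled out here), a symmetric InfoNCE loss between clip-level video and audio embeddings could look like the sketch below; the function name and temperature value are assumptions.
```python
# Generic audio-visual contrastive (InfoNCE) objective, for illustration only.
import torch
import torch.nn.functional as F

def av_contrastive_loss(video_emb, audio_emb, temperature=0.07):
    """video_emb, audio_emb: (B, D) embeddings of the same B clips."""
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Matching video/audio pairs are positives; all other pairs act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = av_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```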
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
- The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning [2.28438857884398]
We present a contrastive framework to learn audiovisual representations from unlabeled videos.
We find that lossy temporal transformations which do not corrupt the temporal coherency of videos are the most effective.
Compared to self-supervised models pre-trained only with sampling-based temporal augmentation, self-supervised models pre-trained with our temporal augmentations achieve a gain of approximately 6.5% in linear evaluation performance on the AVE dataset.
arXiv Detail & Related papers (2021-10-13T23:48:58Z)
- LiRA: Learning Visual Speech Representations from Audio through Self-supervision [53.18768477520411]
We propose Learning visual speech Representations from Audio via self-supervision (LiRA).
Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech.
We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild dataset.
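To make the pretext task concrete, here is a rough stand-in that regresses acoustic features from lip-region video using a small 3D-conv front-end and a Transformer encoder; LiRA's actual ResNet+Conformer and its training targets are replaced by simpler assumed components.
```python
# Rough stand-in for a LiRA-style pretext task: predict acoustic features
# (e.g. filterbanks) from mouth-region video. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualToAudioRegressor(nn.Module):
    def __init__(self, dim=256, n_acoustic=80):
        super().__init__()
        # Tiny 3D-conv front-end standing in for a ResNet on mouth crops.
        self.frontend = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space
        )
        self.proj = nn.Linear(32, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)  # Conformer stand-in
        self.head = nn.Linear(dim, n_acoustic)

    def forward(self, frames):
        # frames: (B, 1, T, H, W) grayscale mouth crops
        x = self.frontend(frames).squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        x = self.encoder(self.proj(x))
        return self.head(x)  # (B, T, n_acoustic) predicted acoustic features

model = VisualToAudioRegressor()
pred = model(torch.randn(2, 1, 25, 96, 96))
loss = F.l1_loss(pred, torch.randn(2, 25, 80))  # regress towards acoustic targets
```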
arXiv Detail & Related papers (2021-06-16T23:20:06Z)
- Where and When: Space-Time Attention for Audio-Visual Explanations [42.093794819606444]
We propose a novel space-time attention network that uncovers the synergistic dynamics of audio and visual data over both space and time.
Our model is capable of predicting the audio-visual video events, while justifying its decision by localizing where the relevant visual cues appear.
arXiv Detail & Related papers (2021-05-04T14:16:55Z)
- Audiovisual Highlight Detection in Videos [78.26206014711552]
We present results from two experiments: an efficacy study of single features on the task, and an ablation study where we leave one feature out at a time.
For the video summarization task, our results indicate that the visual features carry the most information, and that adding audiovisual features improves over using visual-only information.
Results indicate that we can transfer knowledge from the video summarization task to a model trained specifically for the task of highlight detection.
arXiv Detail & Related papers (2021-02-11T02:24:00Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
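As a generic illustration of co-attention between the two modalities (not the exact network proposed in the paper), each stream can query the other with standard cross-attention, as in the sketch below; the dimensions and residual design are assumptions.
```python
# Generic audio-visual co-attention block (illustrative; not the paper's exact design).
import torch
import torch.nn as nn

class CoAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.v_from_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual, audio):
        # visual: (B, Tv, dim), audio: (B, Ta, dim)
        # Each modality queries the other, then keeps a residual connection.
        v_att, _ = self.v_from_a(query=visual, key=audio, value=audio)
        a_att, _ = self.a_from_v(query=audio, key=visual, value=visual)
        return visual + v_att, audio + a_att

block = CoAttention()
v, a = block(torch.randn(2, 16, 256), torch.randn(2, 32, 256))
```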
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions [64.43064637421007]
We introduce a novel task of audiovisual crowd counting, in which visual and auditory information are integrated for counting purposes.
We collect a large-scale benchmark, named the auDiovISual Crowd cOunting (DISCO) dataset.
We make use of a linear feature-wise fusion module that carries out an affine transformation on visual and auditory features.
arXiv Detail & Related papers (2020-05-14T16:05:47Z)
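The "linear feature-wise fusion module" described in the entry above resembles a FiLM-style affine modulation. Below is a minimal sketch under that assumption, with the conditioning direction (audio modulating visual feature maps) and all sizes chosen arbitrarily for illustration.
```python
# Minimal FiLM-style affine fusion of audio and visual features (illustrative).
import torch
import torch.nn as nn

class AffineAVFusion(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128):
        super().__init__()
        # Predict a per-channel scale (gamma) and shift (beta) from the audio feature.
        self.to_gamma = nn.Linear(aud_dim, vis_dim)
        self.to_beta = nn.Linear(aud_dim, vis_dim)

    def forward(self, visual, audio):
        # visual: (B, vis_dim, H, W) feature map; audio: (B, aud_dim) clip-level feature.
        gamma = self.to_gamma(audio)[:, :, None, None]
        beta = self.to_beta(audio)[:, :, None, None]
        return gamma * visual + beta  # feature-wise affine transformation

fusion = AffineAVFusion()
fused = fusion(torch.randn(2, 512, 32, 32), torch.randn(2, 128))
```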