Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions
- URL: http://arxiv.org/abs/2005.07097v2
- Date: Sat, 16 May 2020 20:56:26 GMT
- Title: Ambient Sound Helps: Audiovisual Crowd Counting in Extreme Conditions
- Authors: Di Hu, Lichao Mou, Qingzhong Wang, Junyu Gao, Yuansheng Hua, Dejing
Dou, Xiao Xiang Zhu
- Abstract summary: We introduce a novel task of audiovisual crowd counting, in which visual and auditory information are integrated for counting purposes.
We collect a large-scale benchmark, named auDiovISual Crowd cOunting dataset.
We make use of a linear feature-wise fusion module that carries out an affine transformation on visual and auditory features.
- Score: 64.43064637421007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual crowd counting has been recently studied as a way to enable people
counting in crowd scenes from images. Albeit successful, vision-based crowd
counting approaches could fail to capture informative features in extreme
conditions, e.g., imaging at night and occlusion. In this work, we introduce a
novel task of audiovisual crowd counting, in which visual and auditory
information are integrated for counting purposes. We collect a large-scale
benchmark, named auDiovISual Crowd cOunting (DISCO) dataset, consisting of
1,935 images and the corresponding audio clips, and 170,270 annotated
instances. In order to fuse the two modalities, we make use of a linear
feature-wise fusion module that carries out an affine transformation on visual
and auditory features. Finally, we conduct extensive experiments using the
proposed dataset and approach. Experimental results show that introducing
auditory information can benefit crowd counting under different illumination,
noise, and occlusion conditions. Code and data have been made available.
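The abstract describes the fusion as a linear feature-wise module that applies an affine transformation to visual and auditory features. The exact architecture is not given here, so the following is only a minimal sketch of one plausible reading (FiLM-style conditioning of visual feature maps on an audio embedding); the layer choices, dimensions, and names are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class LinearFeatureWiseFusion(nn.Module):
    """Sketch of a linear feature-wise (affine) fusion module.

    The audio embedding predicts per-channel scale (gamma) and shift (beta)
    parameters that are applied to the visual feature map. This is an
    assumption about how the affine transformation could be realized, not
    the paper's exact design.
    """

    def __init__(self, visual_channels: int = 512, audio_dim: int = 128):
        super().__init__()
        # Map the audio embedding to per-channel affine parameters.
        self.to_gamma = nn.Linear(audio_dim, visual_channels)
        self.to_beta = nn.Linear(audio_dim, visual_channels)

    def forward(self, visual_feat: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        # visual_feat: (B, C, H, W) feature map from a visual backbone
        # audio_feat:  (B, D) embedding from an audio encoder
        gamma = self.to_gamma(audio_feat).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(audio_feat).unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        return gamma * visual_feat + beta


# Toy usage with random tensors standing in for real backbone outputs.
if __name__ == "__main__":
    fusion = LinearFeatureWiseFusion(visual_channels=512, audio_dim=128)
    v = torch.randn(2, 512, 32, 32)   # visual features
    a = torch.randn(2, 128)           # audio features
    print(fusion(v, a).shape)         # torch.Size([2, 512, 32, 32])
```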
Related papers
- Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup.
We introduce a unified audio-visual few-shot video classification benchmark on three datasets.
We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
- Egocentric Audio-Visual Noise Suppression [11.113020254726292]
This paper studies audio-visual noise suppression for egocentric videos.
The egocentric video camera emulates the off-screen speaker's view of the outside world.
We first demonstrate that egocentric visual information is helpful for noise suppression.
arXiv Detail & Related papers (2022-11-07T15:53:12Z)
- Estimating Visual Information From Audio Through Manifold Learning [14.113590443352495]
We propose a new framework for extracting visual information about a scene only using audio signals.
Our framework is based on Manifold Learning and consists of two steps.
We show that our method is able to produce meaningful images from audio using a publicly available audio/visual dataset.
arXiv Detail & Related papers (2022-08-03T20:47:11Z)
- AV-Gaze: A Study on the Effectiveness of Audio Guided Visual Attention Estimation for Non-Profilic Faces [28.245662058349854]
In this paper, we explore whether audio-guided coarse head-pose can further enhance visual attention estimation performance for non-prolific faces.
We use off-the-shelf state-of-the-art models to facilitate cross-modal weak-supervision.
Our model can utilize any of the available modalities for task-specific inference.
arXiv Detail & Related papers (2022-07-07T02:23:02Z)
- Weakly-Supervised Action Detection Guided by Audio Narration [50.4318060593995]
We propose a model to learn from the narration supervision and utilize multimodal features, including RGB, motion flow, and ambient sound.
Our experiments show that noisy audio narration suffices to learn a good action detection model, thus reducing annotation expenses.
arXiv Detail & Related papers (2022-05-12T06:33:24Z)
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
- Repetitive Activity Counting by Sight and Sound [110.36526333035907]
This paper strives for repetitive activity counting in videos.
Different from existing works, which all analyze the visual video content only, we incorporate for the first time the corresponding sound into the repetition counting process.
arXiv Detail & Related papers (2021-03-24T11:15:33Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite being automatically constructed, achieve similar downstream performances to existing video datasets with similar scales.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
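The curation entry above mentions selecting videos with high audio-visual correspondence. As a toy illustration only (not the paper's actual subset-optimization procedure), one could rank clips by the cosine similarity between their audio and video embeddings and keep the top-scoring ones; the function and embedding shapes below are hypothetical.

```python
import numpy as np

def select_by_audiovisual_correspondence(video_embs: np.ndarray,
                                         audio_embs: np.ndarray,
                                         k: int) -> np.ndarray:
    """Rank clips by cosine similarity between their video and audio
    embeddings and return the indices of the k highest-scoring clips.

    Toy sketch of selecting a subset with high audio-visual correspondence;
    the paper's subset-optimization method is not described here.
    """
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    scores = np.sum(v * a, axis=1)   # per-clip cosine similarity
    return np.argsort(-scores)[:k]   # indices of the top-k clips

# Example with random embeddings standing in for real encoder outputs.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keep = select_by_audiovisual_correspondence(rng.normal(size=(100, 64)),
                                                rng.normal(size=(100, 64)),
                                                k=10)
    print(keep)
```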