Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- URL: http://arxiv.org/abs/2110.05122v1
- Date: Mon, 11 Oct 2021 09:58:05 GMT
- Title: Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
- Abstract summary: We propose a benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos.
Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding.
We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.
- Score: 42.32743253830288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 360$^\circ$ videos convey holistic views of the surroundings of a scene. They
provide audio-visual cues beyond pre-determined normal fields of view and
display distinctive spatial relations on a sphere. However, previous benchmark
tasks for panoramic videos remain limited in evaluating semantic
understanding of audio-visual relationships or spherical spatial properties of the
surroundings. We propose a novel benchmark named Pano-AVQA as a large-scale
grounded audio-visual question answering dataset on panoramic videos. Using
5.4K 360$^\circ$ video clips harvested online, we collect two types of novel
question-answer pairs with bounding-box grounding: spherical spatial relation
QAs and audio-visual relation QAs. We train several transformer-based models
on Pano-AVQA; the results suggest that our proposed spherical spatial
embeddings and multimodal training objectives contribute to a better
semantic understanding of the panoramic surroundings on the dataset.
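The paper does not spell out the embedding construction in this abstract, but one plausible sketch of a spherical spatial embedding is shown below: map an object's equirectangular bounding-box center to longitude/latitude angles on the sphere, then encode each angle with sine and cosine so the representation stays continuous across the panorama's left-right seam. The function names and the four-dimensional output are hypothetical illustrations, not the authors' actual model.

```python
import math

def equirect_to_spherical(cx, cy, width, height):
    """Map an equirectangular pixel center (cx, cy) to spherical angles.

    Returns (theta, phi): longitude theta in [-pi, pi] and latitude phi
    in [-pi/2, pi/2], both measured from the image center.
    """
    theta = (cx / width - 0.5) * 2.0 * math.pi
    phi = (0.5 - cy / height) * math.pi
    return theta, phi

def spherical_embedding(theta, phi):
    """Encode the angles as a small, seam-friendly feature vector.

    Using sin/cos of each angle keeps the embedding continuous where
    theta wraps from +pi to -pi at the panorama boundary.
    """
    return [math.sin(theta), math.cos(theta), math.sin(phi), math.cos(phi)]
```

For example, a box centered in a 1920x960 frame maps to angles (0, 0) and the embedding [0, 1, 0, 1], while boxes at the left and right image edges receive nearly identical embeddings, which is the property a naive pixel-coordinate encoding lacks on panoramas.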
Related papers
- Panoptic Video Scene Graph Generation [110.82362282102288]
We propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
We contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs.
arXiv Detail & Related papers (2023-11-28T18:59:57Z)
- Panoramic Video Salient Object Detection with Ambisonic Audio Guidance [24.341735475632884]
We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion blocks.
The block equipped with spherical positional encoding enables the fusion in the 3D context to capture the spatial correspondence between pixels and sound sources.
Our method achieves state-of-the-art performance on the ASOD60K dataset.
arXiv Detail & Related papers (2022-11-26T00:50:02Z)
- From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to attend to different visual areas related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention on object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-06-04T07:03:18Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- A proto-object based audiovisual saliency map [0.0]
We develop a proto-object based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes.
Such a map can be useful in surveillance, robotic navigation, video compression, and related applications.
arXiv Detail & Related papers (2020-03-15T08:34:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.