Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- URL: http://arxiv.org/abs/2110.05122v1
- Date: Mon, 11 Oct 2021 09:58:05 GMT
- Title: Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
- Abstract summary: We propose a benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos.
Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding.
We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings.
- Score: 42.32743253830288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 360$^\circ$ videos convey holistic views of the surroundings of a scene. They provide audio-visual cues beyond a pre-determined normal field of view and exhibit distinctive spatial relations on a sphere. However, previous benchmark tasks for panoramic videos remain limited in evaluating the semantic understanding of audio-visual relationships or spherical spatial properties of the surroundings. We propose a novel benchmark, Pano-AVQA, a large-scale grounded audio-visual question answering dataset on panoramic videos. Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding: spherical spatial relation QAs and audio-visual relation QAs. We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.
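The abstract does not spell out how the spherical spatial embeddings are computed, so the snippet below is only a minimal sketch under assumed conventions: a bounding box detected on an equirectangular frame is mapped to spherical coordinates (longitude, latitude, and angular extent) and encoded with sinusoidal features before being attached to its region token. Function names such as `equirect_box_to_sphere` and `spherical_embedding` are illustrative and not from the paper.

```python
# A minimal sketch (not the authors' code): encode a region on the sphere by
# sinusoidal features of its spherical coordinates, so that a transformer can
# consume panoramic box positions without the wrap-around discontinuity of
# raw equirectangular x/y coordinates.
import numpy as np

def equirect_box_to_sphere(x, y, w, h, img_w, img_h):
    """Map an equirectangular box (center x, y and size w, h in pixels)
    to its spherical center (longitude theta, latitude phi) and angular extent."""
    theta = (x / img_w) * 2 * np.pi - np.pi        # longitude in [-pi, pi)
    phi = (0.5 - y / img_h) * np.pi                # latitude in [-pi/2, pi/2]
    d_theta = (w / img_w) * 2 * np.pi              # angular width
    d_phi = (h / img_h) * np.pi                    # angular height
    return theta, phi, d_theta, d_phi

def spherical_embedding(theta, phi, d_theta, d_phi, dim=16):
    """Sinusoidal embedding of spherical box parameters (hypothetical design)."""
    freqs = 2.0 ** np.arange(dim // 8)             # a few frequency octaves per parameter
    params = np.array([theta, phi, d_theta, d_phi])
    angles = params[:, None] * freqs[None, :]      # (4, dim // 8)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=None)

# Example: a box centered slightly behind the viewer in a 1920x960 frame.
emb = spherical_embedding(*equirect_box_to_sphere(1700, 480, 200, 150, 1920, 960))
print(emb.shape)   # (16,) positional feature for the region token
```

In practice such an embedding would be added to, or concatenated with, each region's visual feature before the transformer layers; the exact fusion used by the paper is not described in this summary.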
Related papers
- Boosting Audio Visual Question Answering via Key Semantic-Aware Cues [8.526720031181027]
The Audio Visual Question Answering (AVQA) task aims to answer questions related to various visual objects, sounds, and their interactions in videos.
We propose a Temporal-Spatial Perception Model (TSPM), which aims to empower the model to perceive key visual and auditory cues related to the questions.
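As a rough illustration of perceiving question-relevant cues (not TSPM itself), the sketch below scores per-segment audio and visual features against a question embedding and pools them with the resulting attention weights; all shapes and names are assumptions made for the example.

```python
# A minimal sketch: question-guided attention that weights per-segment audio and
# visual features so that segments relevant to the question dominate the pooled
# representation. Illustrative only; not the TSPM architecture.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_guided_pool(q, segment_feats):
    """q: (d,) question embedding; segment_feats: (T, d) per-segment features.
    Returns a (d,) pooled feature emphasizing question-relevant segments."""
    scores = segment_feats @ q / np.sqrt(q.shape[0])   # (T,) relevance scores
    weights = softmax(scores)                          # attention over time
    return weights @ segment_feats                     # weighted temporal pooling

rng = np.random.default_rng(0)
q = rng.normal(size=128)
visual = rng.normal(size=(10, 128))   # 10 video segments
audio = rng.normal(size=(10, 128))    # 10 audio segments
fused = np.concatenate([question_guided_pool(q, visual),
                        question_guided_pool(q, audio)])
print(fused.shape)  # (256,) fed to an answer classifier
```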
arXiv Detail & Related papers (2024-07-30T09:41:37Z) - Panoptic Video Scene Graph Generation [110.82362282102288]
We propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
We contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs.
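To make the annotation structure concrete, here is a hypothetical in-memory representation of a panoptic video scene graph: tracked entities carry per-frame segmentation masks, and relations are predicate triplets with a temporal span. The class and field names are illustrative and do not mirror the PVSG dataset's actual schema.

```python
# A toy data structure for a panoptic video scene graph (illustrative only).
from dataclasses import dataclass, field

@dataclass
class TrackedEntity:
    track_id: int
    category: str                               # e.g. "person", "dog", "grass"
    masks: dict = field(default_factory=dict)   # frame index -> segmentation mask

@dataclass
class Relation:
    subject_id: int
    object_id: int
    predicate: str                              # e.g. "holding", "walking on"
    frames: tuple = ()                          # (start_frame, end_frame) span

@dataclass
class PanopticVideoSceneGraph:
    video_id: str
    entities: list = field(default_factory=list)
    relations: list = field(default_factory=list)

g = PanopticVideoSceneGraph(
    video_id="demo",
    entities=[TrackedEntity(0, "person"), TrackedEntity(1, "dog")],
    relations=[Relation(0, 1, "walking with", frames=(0, 150))],
)
print(len(g.relations), "temporal relation(s)")
```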
arXiv Detail & Related papers (2023-11-28T18:59:57Z)
- From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to target different visual areas that are related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention to object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
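A minimal sketch of question-guided object-level (spatial) and channel attention in the spirit described above follows; the projection matrices would be learned in a real model, and all names and dimensions are assumptions rather than the authors' implementation.

```python
# A toy combination of object (spatial) attention and channel gating conditioned
# on the question; illustrative only, not the CVA model.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cubic_attention(regions, q, Wq_s, Wq_c):
    """regions: (K, C) object-region features; q: (dq,) question embedding.
    Wq_s projects q for object-level attention, Wq_c for channel gating."""
    obj_scores = regions @ (q @ Wq_s)          # (K,) question-object relevance
    obj_weights = softmax(obj_scores)          # attention over object regions
    channel_gate = sigmoid(q @ Wq_c)           # (C,) question-conditioned channel gate
    attended = obj_weights @ regions           # (C,) spatially attended feature
    return attended * channel_gate             # channel-wise re-weighting

rng = np.random.default_rng(1)
K, C, dq = 36, 512, 256
out = cubic_attention(rng.normal(size=(K, C)), rng.normal(size=dq),
                      rng.normal(size=(dq, C)) * 0.05, rng.normal(size=(dq, C)) * 0.05)
print(out.shape)  # (512,) fused visual feature
```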
arXiv Detail & Related papers (2022-06-04T07:03:18Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception, and our model outperforms recent audio-only, visual-only, and audio-visual question answering approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
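Conceptually, such a field is a network that maps a 3D point plus time to color and density, which a volume renderer then integrates along camera rays. The toy MLP below illustrates only that input/output structure; the layer sizes, positional encoding, and function names are chosen for the example and are not the paper's architecture.

```python
# A toy space-time field: MLP mapping (x, y, z, t) -> (rgb, density).
import numpy as np

def posenc(p, n_freqs=6):
    """Sinusoidal positional encoding of the (x, y, z, t) input."""
    freqs = 2.0 ** np.arange(n_freqs) * np.pi
    angles = p[:, None] * freqs[None, :]
    return np.concatenate([p, np.sin(angles).ravel(), np.cos(angles).ravel()])

def mlp_field(xyzt, params):
    h = posenc(np.asarray(xyzt, dtype=float))
    for W, b in params[:-1]:
        h = np.maximum(W @ h + b, 0.0)               # ReLU hidden layers
    W, b = params[-1]
    out = W @ h + b
    rgb = 1.0 / (1.0 + np.exp(-out[:3]))             # color in [0, 1]
    sigma = np.log1p(np.exp(out[3]))                 # non-negative density
    return rgb, sigma

rng = np.random.default_rng(2)
dims = [4 + 4 * 2 * 6, 128, 128, 4]                  # encoded input -> hidden -> (rgb, sigma)
params = [(rng.normal(scale=0.1, size=(dims[i + 1], dims[i])), np.zeros(dims[i + 1]))
          for i in range(len(dims) - 1)]
print(mlp_field([0.1, -0.2, 0.5, 0.3], params))
```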
arXiv Detail & Related papers (2020-11-25T18:59:28Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
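One plausible instantiation of a spatial-alignment pretext objective is sketched below: visual and audio embeddings extracted for the same viewing directions of a 360° clip are matched with an InfoNCE-style contrastive loss. This is an assumed formulation for illustration, not the paper's exact objective.

```python
# A toy spatial-alignment loss: direction i of the video should match the audio
# rendered for direction i (contrastive matching over viewing directions).
import numpy as np

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def spatial_alignment_loss(vis, aud, temperature=0.07):
    """vis, aud: (N, d) L2-normalized embeddings for N viewing directions."""
    logits = vis @ aud.T / temperature            # (N, N) similarity matrix
    labels = np.arange(len(vis))                  # matching pairs lie on the diagonal
    return -log_softmax(logits, axis=-1)[labels, labels].mean()

rng = np.random.default_rng(3)
v = rng.normal(size=(4, 128)); v /= np.linalg.norm(v, axis=1, keepdims=True)
a = rng.normal(size=(4, 128)); a /= np.linalg.norm(a, axis=1, keepdims=True)
print(spatial_alignment_loss(v, a))
```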
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- A proto-object based audiovisual saliency map [0.0]
We develop a proto-object based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes.
Such an audiovisual saliency map can be useful in surveillance, robotic navigation, video compression, and related applications.
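As a toy version of the fusion step only, the sketch below normalizes a visual saliency map and an auditory saliency map projected onto the same image grid and combines them with fixed weights; the proto-object grouping used in the actual model is considerably more involved.

```python
# A toy audiovisual saliency fusion: normalize and combine two saliency maps
# defined on the same image grid. Illustrative only.
import numpy as np

def normalize_map(m, eps=1e-8):
    m = m - m.min()
    return m / (m.max() + eps)

def fuse_saliency(visual_map, audio_map, w_v=0.5, w_a=0.5):
    """Weighted combination of per-pixel visual saliency and an auditory
    saliency map (e.g., obtained from sound localization) on the same grid."""
    return normalize_map(w_v * normalize_map(visual_map) + w_a * normalize_map(audio_map))

rng = np.random.default_rng(4)
avsm = fuse_saliency(rng.random((60, 80)), rng.random((60, 80)))
print(avsm.shape, float(avsm.max()))  # (60, 80) map with values in [0, 1]
```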
arXiv Detail & Related papers (2020-03-15T08:34:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.