Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- URL: http://arxiv.org/abs/2110.05122v1
- Date: Mon, 11 Oct 2021 09:58:05 GMT
- Title: Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos
- Authors: Heeseung Yun, Youngjae Yu, Wonsuk Yang, Kangil Lee, Gunhee Kim
- Abstract summary: We propose a benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos.
Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding.
We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute to a better semantic understanding of the panoramic surroundings on the dataset.
- Score: 42.32743253830288
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 360$^\circ$ videos convey holistic views of the surroundings of a scene. They
provide audio-visual cues beyond pre-determined normal fields of view and
display distinctive spatial relations on a sphere. However, previous benchmark
tasks for panoramic videos remain limited in evaluating semantic
understanding of audio-visual relationships or spherical spatial properties of the
surroundings. We propose a novel benchmark named Pano-AVQA as a large-scale
grounded audio-visual question answering dataset on panoramic videos. Using
5.4K 360$^\circ$ video clips harvested online, we collect two types of novel
question-answer pairs with bounding-box grounding: spherical spatial relation
QAs and audio-visual relation QAs. We train several transformer-based models
on Pano-AVQA; the results suggest that our proposed spherical spatial
embeddings and multimodal training objectives contribute to a better
semantic understanding of the panoramic surroundings on the dataset.
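The paper does not spell out the embedding construction in this abstract, but one plausible sketch of a spherical spatial embedding is shown below: map an object's equirectangular bounding-box center to longitude/latitude angles on the sphere, then encode each angle with sine and cosine so the representation stays continuous across the panorama's left-right seam. The function names and the four-dimensional output are hypothetical illustrations, not the authors' actual model.

```python
import math

def equirect_to_spherical(cx, cy, width, height):
    """Map an equirectangular pixel center (cx, cy) to spherical angles.

    Returns (theta, phi): longitude theta in [-pi, pi] and latitude phi
    in [-pi/2, pi/2], both measured from the image center.
    """
    theta = (cx / width - 0.5) * 2.0 * math.pi
    phi = (0.5 - cy / height) * math.pi
    return theta, phi

def spherical_embedding(theta, phi):
    """Encode the angles as a small, seam-friendly feature vector.

    Using sin/cos of each angle keeps the embedding continuous where
    theta wraps from +pi to -pi at the panorama boundary.
    """
    return [math.sin(theta), math.cos(theta), math.sin(phi), math.cos(phi)]
```

For example, a box centered in a 1920x960 frame maps to angles (0, 0) and the embedding [0, 1, 0, 1], while boxes at the left and right image edges receive nearly identical embeddings, which is the property a naive pixel-coordinate encoding lacks on panoramas.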
Related papers
- Panoptic Video Scene Graph Generation [110.82362282102288]
We propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
We contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs.
arXiv Detail & Related papers (2023-11-28T18:59:57Z)
- Panoramic Video Salient Object Detection with Ambisonic Audio Guidance [24.341735475632884]
We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion blocks.
The block equipped with spherical positional encoding enables the fusion in the 3D context to capture the spatial correspondence between pixels and sound sources.
Our method achieves state-of-the-art performance on the ASOD60K dataset.
arXiv Detail & Related papers (2022-11-26T00:50:02Z)
- From Pixels to Objects: Cubic Visual Attention for Visual Question Answering [132.95819467484517]
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to attend to different visual areas related to the answer.
We propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention on object regions to improve the VQA task.
Experimental results show that our proposed method significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2022-06-04T07:03:18Z)
- Learning to Answer Questions in Dynamic Audio-Visual Scenarios [81.19017026999218]
We focus on the Audio-Visual Question Answering (AVQA) task, which aims to answer questions regarding different visual objects, sounds, and their associations in videos.
Our dataset contains more than 45K question-answer pairs spanning over different modalities and question types.
Our results demonstrate that AVQA benefits from multisensory perception and our model outperforms recent A-SIC, V-SIC, and AVQA approaches.
arXiv Detail & Related papers (2022-03-26T13:03:42Z)
- Space-time Neural Irradiance Fields for Free-Viewpoint Video [54.436478702701244]
We present a method that learns a neural irradiance field for dynamic scenes from a single video.
Our learned representation enables free-view rendering of the input video.
arXiv Detail & Related papers (2020-11-25T18:59:28Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- A proto-object based audiovisual saliency map [0.0]
We develop a proto-object based audiovisual saliency map (AVSM) for the analysis of dynamic natural scenes.
Such a map can be useful in surveillance, robotic navigation, video compression, and related applications.
arXiv Detail & Related papers (2020-03-15T08:34:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.