Audio Visual Scene-Aware Dialog Generation with Transformer-based Video
Representations
- URL: http://arxiv.org/abs/2202.09979v1
- Date: Mon, 21 Feb 2022 04:09:32 GMT
- Title: Audio Visual Scene-Aware Dialog Generation with Transformer-based Video
Representations
- Authors: Yoshihiro Yamazaki, Shota Orihashi, Ryo Masumura, Mihiro Uchida,
Akihiko Takashima
- Abstract summary: We apply a Transformer-based video feature that captures temporally and spatially global representations more efficiently than CNN-based features.
Our model achieves a subjective score close to that of human answers in DSTC10.
- Score: 20.619819743960868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There have been many attempts to build multimodal dialog systems that can
respond to a question about given audio-visual information, and the
representative task for such systems is the Audio Visual Scene-Aware Dialog
(AVSD). Most conventional AVSD models adopt a Convolutional Neural Network
(CNN)-based video feature extractor to understand visual information. While a
CNN tends to capture temporally and spatially local information, global
information is also crucial for video understanding because AVSD requires
long-term temporal visual dependencies and holistic visual information. In
this study, we apply a Transformer-based video feature that can capture
temporally and spatially global representations more efficiently than a
CNN-based feature. Our AVSD model with this Transformer-based feature attains
higher objective performance scores for answer generation. In addition, our
model achieves a subjective score close to that of human answers in DSTC10. We
observed that the Transformer-based visual feature is beneficial for the AVSD
task because our model tends to correctly answer questions that require a
temporally and spatially broad range of visual information.
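As a rough illustration of the contrast drawn in the abstract, the sketch below shows how a Transformer encoder can attend jointly over patch tokens from all sampled frames, giving every token temporally and spatially global context, whereas a CNN extractor aggregates mostly local neighborhoods. This is a minimal sketch under assumed shapes and module names, not the authors' implementation; the paper's actual feature extractor is not specified here.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): a
# Transformer encoder over spatio-temporal patch tokens, so self-attention
# spans every frame and every spatial position of the clip at once.
import torch
import torch.nn as nn

class GlobalVideoFeature(nn.Module):
    def __init__(self, patch_dim=768, feat_dim=512, num_layers=4, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, frames * patches_per_frame, patch_dim),
        # i.e. all spatio-temporal tokens flattened into one sequence.
        x = self.encoder(self.proj(patch_tokens))  # global self-attention
        return x.mean(dim=1)                       # one pooled clip-level feature

# Example: 16 frames x 49 patches each -> a single 512-d video feature.
video_feature = GlobalVideoFeature()(torch.randn(2, 16 * 49, 768))
print(video_feature.shape)  # torch.Size([2, 512])
```

In an AVSD-style model, such a clip-level (or token-level) video feature would then be combined with audio and dialog-history features before answer generation.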
Related papers
- Robust Audiovisual Speech Recognition Models with Mixture-of-Experts [67.75334989582709]
We introduce EVA, leveraging a mixture-of-experts approach for audio-visual ASR to perform robust speech recognition on "in-the-wild" videos.
We first encode visual information into a sequence of visual tokens and map them into the speech space with a lightweight projection.
Experiments show our model achieves state-of-the-art results on three benchmarks.
arXiv Detail & Related papers (2024-09-19T00:08:28Z)
- Audio-Visual Glance Network for Efficient Video Recognition [17.95844876568496]
We propose the Audio-Visual Glance Network (AVGN) to efficiently process the spatio-temporally important parts of a video.
We use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame.
We incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN.
arXiv Detail & Related papers (2023-08-18T05:46:20Z)
- AVSegFormer: Audio-Visual Segmentation with Transformer [42.24135756439358]
A new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video.
This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges.
We propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
arXiv Detail & Related papers (2023-07-03T16:37:10Z)
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time within a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key! [25.436683033432086]
Video saliency detection (VSD) aims at quickly locating the most attention-grabbing objects/things/patterns in a given video clip.
This paper provides an extensive review to bridge the gap between audio-visual fusion and saliency detection.
arXiv Detail & Related papers (2022-06-20T07:25:13Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial- and temporal-level information and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene-aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information (a minimal sketch of such a gate follows this list).
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
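As a rough illustration of the frame-selection gating mentioned in the last entry above, the sketch below gates frame features with a question-conditioned sigmoid score; the module name, dimensions, and gating form are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (illustrative assumptions, not the paper's code): a
# question-conditioned gate that down-weights temporally irrelevant frames.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, frame_dim=512, question_dim=512):
        super().__init__()
        self.score = nn.Linear(frame_dim + question_dim, 1)

    def forward(self, frame_feats, question_feat):
        # frame_feats: (batch, num_frames, frame_dim)
        # question_feat: (batch, question_dim)
        q = question_feat.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        gate = torch.sigmoid(self.score(torch.cat([frame_feats, q], dim=-1)))
        return frame_feats * gate  # per-frame relevance weighting

gated = FrameSelectionGate()(torch.randn(2, 16, 512), torch.randn(2, 512))
print(gated.shape)  # torch.Size([2, 16, 512])
```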