Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers
- URL: http://arxiv.org/abs/2007.03848v2
- Date: Tue, 2 Mar 2021 20:04:33 GMT
- Title: Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers
- Authors: Shijie Geng, Peng Gao, Moitreya Chatterjee, Chiori Hori, Jonathan Le
Roux, Yongfeng Zhang, Hongsheng Li, Anoop Cherian
- Abstract summary: We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
- Score: 89.00926092864368
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an input video, its associated audio, and a brief caption, the
audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a
question-answer dialog with a human about the audio-visual content. This task
thus poses a challenging multi-modal representation learning and reasoning
scenario, advancements into which could influence several human-machine
interaction applications. To solve this task, we introduce a
semantics-controlled multi-modal shuffled Transformer reasoning framework,
consisting of a sequence of Transformer modules, each taking a modality as
input and producing representations conditioned on the input question. Our
proposed Transformer variant uses a shuffling scheme on their multi-head
outputs, demonstrating better regularization. To encode fine-grained visual
information, we present a novel dynamic scene graph representation learning
pipeline that consists of an intra-frame reasoning layer producing
spatio-semantic graph representations for every frame, and an inter-frame
aggregation module capturing temporal cues. Our entire pipeline is trained
end-to-end. We present experiments on the benchmark AVSD dataset, both on
answer generation and selection tasks. Our results demonstrate state-of-the-art
performances on all evaluation metrics.
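As a rough illustration of the shuffling scheme described above, the sketch below permutes the per-head outputs of a multi-head attention layer before the output projection during training; this reading of the scheme, and all module and tensor names, are assumptions for illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShuffledMultiHeadAttention(nn.Module):
    """Self-attention whose per-head outputs are shuffled during training."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape each to (batch, heads, time, d_head)
        q, k, v = (z.view(b, t, self.num_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        heads = attn @ v                              # (batch, heads, time, d_head)
        if self.training:
            # Randomly permute the head axis so the output projection cannot
            # latch onto a fixed head ordering; this acts as a regularizer.
            heads = heads[:, torch.randperm(self.num_heads, device=x.device)]
        heads = heads.transpose(1, 2).reshape(b, t, d)
        return self.out(heads)


layer = ShuffledMultiHeadAttention()
tokens = torch.randn(2, 16, 512)        # e.g. question-conditioned modality tokens
print(layer(tokens).shape)              # torch.Size([2, 16, 512])
```

Likewise, a minimal sketch of the two-stage scene graph pipeline, assuming self-attention over per-frame object nodes as a stand-in for the intra-frame reasoning layer and a GRU as the inter-frame aggregation module; the paper's actual graph construction is not reproduced here.

```python
import torch
import torch.nn as nn

d = 256
intra_frame = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
inter_frame = nn.GRU(d, d, batch_first=True)

# (batch, frames, objects, dim): per-frame object/region features, e.g. from a detector
objects = torch.randn(2, 8, 10, d)
b, t, n, _ = objects.shape

nodes = objects.reshape(b * t, n, d)
related, _ = intra_frame(nodes, nodes, nodes)        # spatio-semantic relations within a frame
frame_graphs = related.mean(dim=1).reshape(b, t, d)  # one summary vector per frame

temporal, _ = inter_frame(frame_graphs)              # temporal cues across frames
video_repr = temporal[:, -1]                         # (batch, dim) video-level representation
print(video_repr.shape)
```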
Related papers
- VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing [81.32613443072441]
For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired.
We propose a method called Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP), which uses the cross-modal sequence transcoder to bring text and speech into a joint space.
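As a rough illustration of pulling text and speech into a joint frame-level space, a hedged sketch of a symmetric contrastive (InfoNCE-style) alignment objective follows; the projection sizes, temperature, and function name are illustrative, and VQ-CTAP's quantization step is not modeled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def frame_level_contrastive_loss(text_feats, speech_feats, temperature=0.07):
    """text_feats, speech_feats: (num_frames, dim), assumed frame-aligned pairs."""
    text = F.normalize(text_feats, dim=-1)
    speech = F.normalize(speech_feats, dim=-1)
    logits = text @ speech.t() / temperature           # frame-to-frame similarity matrix
    targets = torch.arange(text.size(0), device=text.device)
    # symmetric InfoNCE: each text frame should match its own speech frame and vice versa
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2


text_proj = nn.Linear(768, 256)    # text encoder output -> joint space (illustrative dims)
speech_proj = nn.Linear(80, 256)   # e.g. mel-frame features -> joint space
t = text_proj(torch.randn(32, 768))
s = speech_proj(torch.randn(32, 80))
print(frame_level_contrastive_loss(t, s).item())
```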
arXiv Detail & Related papers (2024-08-11T12:24:23Z)
- Zorro: the masked multimodal transformer [68.99684436029884]
Zorro is a technique that uses masks to control how inputs from each modality are routed inside Transformers.
We show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks.
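A minimal sketch of the masking idea, assuming unimodal audio and video token streams that attend only within their own modality while a small set of fusion tokens attends to everything; the helper name and token counts are illustrative.

```python
import torch


def modality_routing_mask(num_audio: int, num_video: int, num_fusion: int) -> torch.Tensor:
    """Boolean mask where True means attention is allowed."""
    n = num_audio + num_video + num_fusion
    mask = torch.zeros(n, n, dtype=torch.bool)
    a = slice(0, num_audio)
    v = slice(num_audio, num_audio + num_video)
    f = slice(num_audio + num_video, n)
    mask[a, a] = True                 # audio tokens attend to audio only
    mask[v, v] = True                 # video tokens attend to video only
    mask[f, :] = True                 # fusion tokens attend to every token
    return mask


mask = modality_routing_mask(num_audio=4, num_video=6, num_fusion=2)
# nn.MultiheadAttention treats True in attn_mask as "blocked", so pass the inverse: ~mask
print(mask.int())
```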
arXiv Detail & Related papers (2023-01-23T17:51:39Z)
- VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning [119.49605266839053]
We propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model).
The proposed VATLM employs a unified backbone network to model the modality-independent information.
In order to integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task of unified tokens.
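A hedged sketch of masked prediction over a shared token vocabulary: positions in the unified token sequence are masked and a backbone is trained to recover them. The vocabulary size, backbone, and masking ratio are placeholders, not VATLM's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, mask_ratio = 1000, 256, 0.15
embed = nn.Embedding(vocab_size + 1, d_model)      # last index reserved for [MASK]
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, 50))     # unified tokens from any modality
mask = torch.rand(tokens.shape) < mask_ratio
inputs = tokens.masked_fill(mask, vocab_size)      # replace masked slots with [MASK]

logits = head(backbone(embed(inputs)))
loss = F.cross_entropy(logits[mask], tokens[mask])  # predict only the masked positions
print(loss.item())
```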
arXiv Detail & Related papers (2022-11-21T09:10:10Z)
- End-to-End Multimodal Representation Learning for Video Dialog [5.661732643450332]
This study proposes a new framework that combines a 3D-CNN and transformer-based networks into a single visual encoder.
The visual encoder is jointly trained end-to-end with other input modalities such as text and audio.
Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.
arXiv Detail & Related papers (2022-10-26T06:50:07Z)
- Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR).
With multiscale sampling, RMI iterates the interaction between appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
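A minimal sketch of the multiscale, question-guided idea: pool the frame features at several temporal scales and let the question embedding attend over each scale; the pooling, dimensions, and module reuse are assumptions rather than the authors' RMI/PVR implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
frames = torch.randn(2, 64, d)                 # (batch, time, dim) frame features
question = torch.randn(2, 1, d)                # pooled question embedding used as query
cross_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

level_reprs = []
for stride in (1, 2, 4):                       # finer to coarser temporal scales
    sampled = F.avg_pool1d(frames.transpose(1, 2), kernel_size=stride,
                           stride=stride).transpose(1, 2)
    guided, _ = cross_attn(question, sampled, sampled)
    level_reprs.append(guided.squeeze(1))      # (batch, dim) representation per level

multilevel = torch.stack(level_reprs, dim=1)   # (batch, levels, dim)
print(multilevel.shape)                        # torch.Size([2, 3, 256])
```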
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval [36.50847375135979]
Multi-modal learning from video data has seen increased attention recently as it allows training semantically meaningful embeddings without human annotation.
We present a multi-modal, modality fusion transformer approach that learns to exchange information between multiple modalities, such as video, audio, and text, and integrate them into a joined multi-modal representation.
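A hedged sketch of the fusion idea: concatenate token sequences from the modalities, let a shared transformer encoder exchange information across them, and pool into one joint embedding; the sizes and mean-pooling choice are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), num_layers=2)

video = torch.randn(2, 30, d)    # e.g. clip features
audio = torch.randn(2, 20, d)    # e.g. spectrogram patch features
text = torch.randn(2, 12, d)     # e.g. caption token features

tokens = torch.cat([video, audio, text], dim=1)    # all modalities in one sequence
fused = encoder(tokens)                            # cross-modal information exchange
joint_embedding = fused.mean(dim=1)                # (batch, d) joint representation
print(joint_embedding.shape)
```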
arXiv Detail & Related papers (2021-12-08T18:14:57Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
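A minimal sketch of grounding with a plain encoder stack, assuming a learnable regression token is appended to the visual and language tokens and a small head reads a normalized box off it; the token name, box parameterization, and head are illustrative.

```python
import torch
import torch.nn as nn

d = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d, nhead=8, batch_first=True), num_layers=6)
reg_token = nn.Parameter(torch.zeros(1, 1, d))     # learnable regression token
box_head = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 4), nn.Sigmoid())

visual = torch.randn(2, 400, d)    # flattened image feature-map tokens
language = torch.randn(2, 16, d)   # query token embeddings

tokens = torch.cat([reg_token.expand(2, -1, -1), visual, language], dim=1)
fused = encoder(tokens)
box = box_head(fused[:, 0])        # normalized (cx, cy, w, h) read off the regression token
print(box.shape)                   # torch.Size([2, 4])
```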
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
- Multiresolution and Multimodal Speech Recognition with Transformers [22.995102995029576]
This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture.
We focus on the scene context provided by the visual information, to ground the ASR.
Our results are comparable to state-of-the-art Listen, Attend and Spell-based architectures.
arXiv Detail & Related papers (2020-04-29T09:32:11Z)
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on challenging tasks, including (i) inferring the temporal ordering of a set of videos and (ii) action recognition.
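A hedged sketch of the temporal-ordering proxy task as a classification over the possible orderings of a small set of shuffled clips; the clip count, features, and classifier are illustrative stand-ins for the paper's model.

```python
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

num_clips, d = 3, 256
perms = list(itertools.permutations(range(num_clips)))   # 3! = 6 candidate orderings
classifier = nn.Sequential(nn.Linear(num_clips * d, 512), nn.ReLU(),
                           nn.Linear(512, len(perms)))

clip_feats = torch.randn(4, num_clips, d)                # multimodal per-clip features
target = torch.randint(0, len(perms), (4,))              # index of the true ordering
shuffled = torch.stack([clip_feats[i, list(perms[int(t)])]   # present clips out of order
                        for i, t in enumerate(target)])

logits = classifier(shuffled.flatten(1))                 # classify which ordering was shown
loss = F.cross_entropy(logits, target)
print(loss.item())
```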
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.