Extending Compositional Attention Networks for Social Reasoning in Videos
- URL: http://arxiv.org/abs/2210.01191v1
- Date: Mon, 3 Oct 2022 19:03:01 GMT
- Title: Extending Compositional Attention Networks for Social Reasoning in Videos
- Authors: Christina Sartzetaki, Georgios Paraskevopoulos, Alexandros Potamianos
- Abstract summary: We propose a novel deep architecture for the task of reasoning about social interactions in videos.
We leverage the multi-step reasoning capabilities of Compositional Attention Networks (MAC) and propose a multimodal extension (MAC-X).
- Score: 84.12658971655253
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel deep architecture for the task of reasoning about social
interactions in videos. We leverage the multi-step reasoning capabilities of
Compositional Attention Networks (MAC), and propose a multimodal extension
(MAC-X). MAC-X is based on a recurrent cell that performs iterative mid-level
fusion of input modalities (visual, auditory, text) over multiple reasoning
steps, by use of a temporal attention mechanism. We then combine MAC-X with
LSTMs for temporal input processing in an end-to-end architecture. Our ablation
studies show that the proposed MAC-X architecture can effectively leverage
multimodal input cues using mid-level fusion mechanisms. We apply MAC-X to the
task of Social Video Question Answering in the Social IQ dataset and obtain a
2.5% absolute improvement in terms of binary accuracy over the current
state-of-the-art.
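The iterative mid-level fusion described in the abstract can be sketched as a toy NumPy example. This is an illustrative reading of the idea only: the function names, dimensions, the dot-product attention scoring, and the mean-based fusion of per-modality reads are assumptions for the sketch, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(control, memory_seq):
    """Attend over the time axis of one modality's sequence,
    scored against the current control (reasoning) state."""
    scores = memory_seq @ control          # (T,) similarity per time step
    weights = softmax(scores)              # attention weights over time
    return weights @ memory_seq            # (d,) attended summary

def mac_x_step(control, modalities):
    """One reasoning step: read each modality with temporal
    attention, then fuse the reads at mid-level (here: a mean)."""
    reads = [temporal_attention(control, m) for m in modalities]
    fused = np.mean(reads, axis=0)
    return control + fused                 # toy residual control update

rng = np.random.default_rng(0)
d, T = 8, 5
control = rng.normal(size=d)
# three modality sequences standing in for visual, auditory, text features
modalities = [rng.normal(size=(T, d)) for _ in range(3)]

for _ in range(4):                         # multiple reasoning steps
    control = mac_x_step(control, modalities)
print(control.shape)                       # (8,)
```

In the actual architecture these sequences would come from LSTM encoders and the cell state update would be learned; the sketch only shows how a shared control state can attend to several modalities per step.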
Related papers
- MacFormer: Semantic Segmentation with Fine Object Boundaries [38.430631361558426]
We introduce a new semantic segmentation architecture, "MacFormer", which features two key components.
Firstly, using learnable agent tokens, a Mutual Agent Cross-Attention (MACA) mechanism effectively facilitates the bidirectional integration of features across encoder and decoder layers.
Secondly, a Frequency Enhancement Module (FEM) in the decoder leverages high-frequency and low-frequency components to boost features in the frequency domain.
MacFormer is demonstrated to be compatible with various network architectures and outperforms existing methods in both accuracy and efficiency on the benchmark datasets ADE20K and Cityscapes.
arXiv Detail & Related papers (2024-08-11T05:36:10Z)
- Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation [12.455034591553506]
Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields.
Previous work ignored the inter-modal alignment process and the intra-modal noise information before multimodal fusion.
We have developed a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem.
arXiv Detail & Related papers (2024-07-23T02:23:51Z)
- MACO: A Modality Adversarial and Contrastive Framework for Modality-missing Multi-modal Knowledge Graph Completion [18.188971531961663]
We propose a modality adversarial and contrastive framework (MACO) to solve the modality-missing problem in MMKGC.
MACO trains a generator and discriminator adversarially to generate missing modality features that can be incorporated into the MMKGC model.
arXiv Detail & Related papers (2023-08-13T06:29:38Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach by mining the Cross-Modal Semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning [45.98103968842858]
The StarCraft Multi-Agent Challenge (SMAC) is a popular testbed for centralised training with decentralised execution.
We show that SMAC lacks the partial observability needed to require complex closed-loop policies.
We introduce SMACv2, a new version of the benchmark where scenarios are procedurally generated and require agents to generalise to previously unseen settings.
arXiv Detail & Related papers (2022-12-14T20:15:19Z)
- Accelerated Gradient Descent Learning over Multiple Access Fading Channels [9.840290491547162]
We consider a distributed learning problem in a wireless network, consisting of N distributed edge devices and a parameter server (PS).
We develop a novel Accelerated Gradient-descent Multiple Access (AGMA) algorithm that uses momentum-based gradient signals over noisy fading MAC to improve the convergence rate as compared to existing methods.
arXiv Detail & Related papers (2021-07-26T19:51:40Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- XCM: An Explainable Convolutional Neural Network for Multivariate Time Series Classification [64.41621835517189]
We present XCM, an eXplainable Convolutional neural network for MTS classification.
XCM is a new compact convolutional neural network which extracts information relative to the observed variables and time directly from the input data.
We first show that XCM outperforms the state-of-the-art MTS classifiers on both the large and small public UEA datasets.
arXiv Detail & Related papers (2020-09-10T11:55:53Z)
- Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization [77.21951145754065]
We propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative message passing over a joint graph.
Our CSMGAN is able to effectively capture high-order interactions between two modalities, thus enabling a further precise localization.
arXiv Detail & Related papers (2020-08-04T08:25:24Z)
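As one hedged illustration of a mechanism from the list above, the over-the-air momentum aggregation idea behind AGMA can be sketched in NumPy. The function name, the unit fading gains, the quadratic local losses, and the specific momentum update are illustrative assumptions for the sketch, not the paper's algorithm.

```python
import numpy as np

def agma_round(params, grads, momentum, h, noise, beta=0.9, lr=0.1):
    """One round of momentum-based over-the-air aggregation:
    each device updates its momentum term, the fading channel
    sums the transmissions, and the server applies the update."""
    momentum = [beta * m + g for m, g in zip(momentum, grads)]
    # multiple-access channel: fading-weighted sum plus receiver noise
    rx = sum(hi * m for hi, m in zip(h, momentum)) + noise
    return params - lr * rx / len(grads), momentum

rng = np.random.default_rng(1)
N, d = 4, 3
targets = rng.normal(size=(N, d))          # each device's local optimum
params = np.zeros(d)
momentum = [np.zeros(d) for _ in range(N)]

for _ in range(200):
    # gradient of each device's local loss 0.5 * ||params - c_i||^2
    grads = [params - c for c in targets]
    h = np.ones(N)                          # ideal unit fading for the demo
    noise = 0.01 * rng.normal(size=d)       # receiver noise
    params, momentum = agma_round(params, grads, momentum, h, noise)

# params should settle near the mean of the local optima
print(params)
```

Despite the channel noise, the momentum-corrected updates drive the shared model toward the average of the devices' local optima, which is the intuition behind accelerating convergence over a noisy fading MAC.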
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.