Modular Blended Attention Network for Video Question Answering
- URL: http://arxiv.org/abs/2311.12866v1
- Date: Thu, 2 Nov 2023 14:22:17 GMT
- Title: Modular Blended Attention Network for Video Question Answering
- Authors: Mingjie Zhou
- Abstract summary: We present an approach that facilitates question answering with a reusable and composable neural unit.
We have conducted experiments on three commonly used datasets.
- Score: 1.131316248570352
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In multimodal machine learning tasks, the complexity of the
assignments means that the network structure is, in most cases, assembled in a
sophisticated way. The holistic architecture can be separated into several
logical parts according to the respective ends that its modules are devised to
achieve. As the number of modalities of information representation increases,
constructing ad hoc subnetworks for processing the data from divergent
modalities while mediating the fusion of different information types becomes a
cumbersome and expensive problem. In this paper, we present an approach that
facilitates question answering with a reusable and composable neural unit; by
connecting the units in series or in parallel, the arduous network construction
of multimodal machine learning tasks can be accomplished in a much more
straightforward way. Additionally, through parameter sharing (weight
replication) among the units, the space complexity is significantly reduced. We
have conducted experiments on three commonly used datasets; our method achieves
impressive performance compared with several video QA baselines.
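The abstract fixes no implementation, but its core idea (one reusable attention unit, replicated with shared weights and wired in series or in parallel) can be sketched in a few lines of PyTorch. The unit name, dimensions, and wiring below are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a reusable, composable attention unit (illustrative only;
# names, dimensions, and wiring are assumptions, not the paper's actual code).
import torch
import torch.nn as nn


class BlendedAttentionUnit(nn.Module):
    """One reusable unit: attends a query sequence over a context sequence."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(query, context, context)
        return self.norm(query + out)  # residual + norm


dim = 256
# Parameter sharing (weight replication): one unit instance is reused for
# every modality pair, so parameters do not grow with the number of modalities.
unit = BlendedAttentionUnit(dim)

question = torch.randn(2, 12, dim)   # (batch, tokens, dim)
video = torch.randn(2, 40, dim)      # (batch, frames, dim)
subtitles = torch.randn(2, 30, dim)  # (batch, tokens, dim)

# Parallel composition: the same unit grounds the question in each modality.
q_video = unit(question, video)
q_subs = unit(question, subtitles)

# Series composition: chain the unit to refine one branch with the other.
fused = unit(q_video, q_subs)
print(fused.shape)  # torch.Size([2, 12, 256])
```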
Related papers
- General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
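A minimal sketch of the 3D-convolution idea, assuming the modality axis is treated as the convolution's depth axis so local context and cross-modal mixing happen in one kernel; shapes and layers are illustrative, not the paper's implementation.

```python
# Illustrative only: a 3D convolution over stacked modality channels.
import torch
import torch.nn as nn

rgb = torch.randn(1, 3, 64, 64)    # (batch, channels, H, W) optical imagery
sar = torch.randn(1, 1, 64, 64)    # single-channel radar imagery

# Treat the modality axis as the 3D "depth" axis: (batch, 1, modality, H, W).
x = torch.stack([rgb.mean(1), sar.squeeze(1)], dim=1).unsqueeze(1)

conv3d = nn.Conv3d(1, 16, kernel_size=(2, 3, 3), padding=(0, 1, 1))
features = conv3d(x)               # kernel spans both modalities at once
print(features.shape)              # torch.Size([1, 16, 1, 64, 64])
```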
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
- Modular Deep Learning [120.36599591042908]
Transfer learning has recently become the dominant paradigm of machine learning.
It remains unclear how to develop models that specialise towards multiple tasks without incurring negative interference.
Modular deep learning has emerged as a promising solution to these challenges.
arXiv Detail & Related papers (2023-02-22T18:11:25Z)
- Complexity of Representations in Deep Learning [2.0219767626075438]
We analyze the effectiveness of the learned representations in separating the classes from a data complexity perspective.
We show how the data complexity evolves through the network, how it changes during training, and how it is impacted by the network design and the availability of training samples.
arXiv Detail & Related papers (2022-09-01T15:20:21Z)
- Routing with Self-Attention for Multimodal Capsule Networks [108.85007719132618]
We present a new multimodal capsule network that allows us to leverage the strength of capsules in the context of a multimodal learning framework.
To adapt the capsules to large-scale input data, we propose a novel routing by self-attention mechanism that selects relevant capsules.
This allows not only for robust training with noisy video data but also for scaling up the size of the capsule network compared to traditional routing methods.
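A hedged sketch of what routing by self-attention can look like: attention scores, rather than iterative dynamic routing, decide which lower-level capsules feed each output capsule. All names and shapes below are assumptions.

```python
# Illustrative single-pass attention routing between capsule layers.
import torch
import torch.nn.functional as F

batch, n_in, n_out, dim = 2, 32, 8, 16
lower = torch.randn(batch, n_in, dim)  # lower-level capsule poses
queries = torch.randn(n_out, dim)      # one learned query per output capsule

# Attention scores select the relevant input capsules for each output capsule.
scores = torch.einsum("od,bnd->bon", queries, lower) / dim ** 0.5
weights = F.softmax(scores, dim=-1)    # (batch, n_out, n_in)
upper = torch.einsum("bon,bnd->bod", weights, lower)
print(upper.shape)                     # torch.Size([2, 8, 16])
```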
arXiv Detail & Related papers (2021-12-01T19:01:26Z)
- Neural Function Modules with Sparse Arguments: A Dynamic Approach to Integrating Information across Layers [84.57980167400513]
Neural Function Modules (NFM) aims to introduce the same structural capability into deep learning.
Most of the work in the context of feed-forward networks combining top-down and bottom-up feedback is limited to classification problems.
The key contribution of our work is to combine attention, sparsity, top-down and bottom-up feedback, in a flexible algorithm.
arXiv Detail & Related papers (2020-10-15T20:43:17Z)
- Multi-Task Learning with Deep Neural Networks: A Survey [0.0]
Multi-task learning (MTL) is a subfield of machine learning in which multiple tasks are simultaneously learned by a shared model.
We give an overview of multi-task learning methods for deep neural networks, with the aim of summarizing both the well-established and most recent directions within the field.
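As a concrete reference point, the most common pattern such surveys cover, hard parameter sharing, fits in a few lines; the trunk and heads below are illustrative assumptions.

```python
# Hard parameter sharing: one shared trunk, one small head per task.
import torch
import torch.nn as nn

shared = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # shared representation
heads = nn.ModuleDict({
    "classify": nn.Linear(64, 10),
    "regress": nn.Linear(64, 1),
})

x = torch.randn(4, 128)
h = shared(x)  # computed once, reused by every task
outputs = {name: head(h) for name, head in heads.items()}
# Joint training: sum (or weight) the per-task losses before backprop.
```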
arXiv Detail & Related papers (2020-09-10T19:31:04Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
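A conceptual sketch of the decomposition loop only; the real ModularQA uses a trained next-question generator, so the stub functions below are purely hypothetical stand-ins.

```python
# Hypothetical stand-ins for ModularQA's sub-modules (illustration only).
def answer_factoid(question: str, context: str) -> str:
    """Stand-in for a neural single-span QA model."""
    return "1969" if "first" in question else "2023"

def calculate(expression: str) -> str:
    """Stand-in for the symbolic calculator module."""
    return str(eval(expression))  # toy only; never eval untrusted input

# A multi-hop question decomposed into sub-questions routed to each module.
context = "..."
year_a = answer_factoid("When was the first moon landing?", context)
year_b = answer_factoid("When was the latest lunar mission?", context)
gap = calculate(f"{year_b} - {year_a}")
print(f"Answer: {gap} years")  # Answer: 54 years
```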
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
- Revealing the Invisible with Model and Data Shrinking for Composite-database Micro-expression Recognition [49.463864096615254]
We analyze the influence of learning complexity, including input complexity and model complexity.
We propose a recurrent convolutional network (RCN) to explore shallower architectures and lower-resolution input data.
We develop three parameter-free modules to integrate with RCN without increasing any learnable parameters.
arXiv Detail & Related papers (2020-06-17T06:19:24Z)
- Deep Multimodal Neural Architecture Search [178.35131768344246]
We devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks.
Given multimodal input, we first define a set of primitive operations, and then construct a deep encoder-decoder based unified backbone.
On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks.
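An illustrative sketch of this structure (not the search algorithm itself): a pool of primitive operations, a backbone expressed as an ordered choice of them, and task-specific heads. The module choices are assumptions.

```python
# Illustrative MMnas-style structure: primitive ops, searched backbone, heads.
import torch
import torch.nn as nn

dim = 128
primitive_ops = {
    "self_att": nn.MultiheadAttention(dim, 4, batch_first=True),
    "ffn": nn.Sequential(nn.Linear(dim, dim), nn.ReLU()),
}

def apply_op(name: str, x: torch.Tensor) -> torch.Tensor:
    op = primitive_ops[name]
    return op(x, x, x)[0] if name == "self_att" else op(x)

# A "searched" backbone is just an ordered selection of primitive ops.
backbone = ["self_att", "ffn", "self_att"]
heads = nn.ModuleDict({"vqa": nn.Linear(dim, 1000), "ground": nn.Linear(dim, 4)})

x = torch.randn(2, 10, dim)
for name in backbone:
    x = apply_op(name, x)
outputs = {task: head(x.mean(dim=1)) for task, head in heads.items()}
print({k: v.shape for k, v in outputs.items()})
```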
arXiv Detail & Related papers (2020-04-25T07:00:32Z)
- Multiresolution Convolutional Autoencoders [5.0169726108025445]
We propose a multi-resolution convolutional autoencoder architecture that integrates and leverages three successful mathematical architectures.
Basic learning techniques are applied to ensure information learned from previous training steps can be rapidly transferred to the larger network.
The performance gains are illustrated through a sequence of numerical experiments on synthetic examples and real-world spatial data.
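A hedged sketch of the transfer step, assuming a fully convolutional autoencoder so that weights trained at a coarse resolution load directly into the finer-resolution model; the architecture is illustrative, not the paper's exact design.

```python
# Coarse-to-fine weight transfer between same-architecture autoencoders.
import torch
import torch.nn as nn

def make_autoencoder() -> nn.Module:
    # Fully convolutional, so the same weights apply at any input resolution.
    return nn.Sequential(
        nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
        nn.Conv2d(8, 1, 3, padding=1),
    )

coarse = make_autoencoder()
# ... train `coarse` on downsampled data here ...

fine = make_autoencoder()
fine.load_state_dict(coarse.state_dict())  # rapid transfer to the larger net

x = torch.randn(1, 1, 128, 128)            # finer-resolution input
print(fine(x).shape)                       # torch.Size([1, 1, 128, 128])
```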
arXiv Detail & Related papers (2020-04-10T08:31:59Z)