Adaptive Transformers for Learning Multimodal Representations
- URL: http://arxiv.org/abs/2005.07486v3
- Date: Wed, 8 Jul 2020 12:26:12 GMT
- Title: Adaptive Transformers for Learning Multimodal Representations
- Authors: Prajjwal Bhargava
- Abstract summary: We extend adaptive approaches to learn more about model interpretability and computational efficiency.
We study adaptive attention spans, sparse attention, and structured dropout to help understand how the attention mechanism extends to vision and language tasks.
- Score: 6.09170287691728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study adaptive attention spans, sparse attention, and structured dropout to understand how the attention mechanism extends to vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.
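For intuition, here is a minimal PyTorch sketch of one of the adaptive approaches the paper studies, the adaptive attention span of Sukhbaatar et al. (2019): each attention head learns a span parameter, and a soft piecewise-linear mask ramps attention weights beyond that span down to zero, letting heads shrink their context. This is an illustrative reimplementation, not the authors' code; names and defaults are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask for adaptive attention spans (after Sukhbaatar et al., 2019).
    A learned scalar z in [0, 1] sets the span; weights for keys farther away
    than the span are linearly ramped down to zero."""

    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp                       # width of the soft transition
        self.z = nn.Parameter(torch.zeros(1))  # in practice, one span per head

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (..., key_len) attention weights for a causal query position,
        # ordered oldest key first.
        key_len = attn.size(-1)
        span = self.z.clamp(0, 1) * self.max_span
        # Distance from the current (last) position back to each key.
        distance = torch.arange(key_len - 1, -1, -1, device=attn.device)
        mask = ((span + self.ramp - distance) / self.ramp).clamp(0, 1)
        attn = attn * mask
        # Renormalize so the surviving weights still sum to one.
        return attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```

During training, an L1 penalty on z pushes spans to stay short, which is where the computational savings come from.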
Related papers
- Scalable Representation Learning for Multimodal Tabular Transactions [14.18267117657451]
We present an innovative and scalable solution to these challenges.
We propose a parameter efficient decoder that interleaves transaction and text modalities.
We validate the efficacy of our solution on a large-scale dataset of synthetic payments transactions.
arXiv Detail & Related papers (2024-10-10T12:18:42Z)
- Explaining Text Similarity in Transformer Models [52.571158418102584]
Recent advances in explainable AI have made it possible to mitigate limitations by leveraging improved explanations for Transformers.
We use BiLRP, an extension developed for computing second-order explanations in bilinear similarity models, to investigate which feature interactions drive similarity in NLP models.
Our findings contribute to a deeper understanding of different semantic similarity tasks and models, highlighting how novel explainable AI methods enable in-depth analyses and corpus-level insights.
arXiv Detail & Related papers (2024-05-10T17:11:31Z)
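To make the second-order idea concrete: for a bilinear similarity s = f(x1) · f(x2), the score decomposes over pairs of input features, one from each input. The sketch below uses plain Jacobians, a Gradient × Product simplification of BiLRP's LRP propagation rules; `embed` is a stand-in for any PyTorch embedding model.

```python
import torch

def pairwise_attributions(embed, x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    """Attribute a dot-product similarity s = embed(x1) . embed(x2) to pairs
    of input features: R[i, j] = sum_k J1[k, i] * J2[k, j], where J1 and J2
    are Jacobians of the embedding with respect to each input. R sums to s
    exactly when embed is linear."""
    J1 = torch.autograd.functional.jacobian(embed, x1)  # (embed_dim, n1)
    J2 = torch.autograd.functional.jacobian(embed, x2)  # (embed_dim, n2)
    # Entry (i, j): interaction of feature i of x1 with feature j of x2.
    return torch.einsum('ki,kj->ij', J1, J2)
```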
- Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z)
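The mechanism summarized above is straightforward to picture: a vision encoder pools the visual context into a single embedding, which is injected into the prompt's token embeddings before they condition the diffusion model. A hedged sketch follows; the module and argument names are invented for illustration, and the real iPromptDiff design differs in detail.

```python
import torch
import torch.nn as nn

class VisualContextModulatedPrompt(nn.Module):
    """Illustrative sketch: project pooled visual-context features into the
    text-embedding space and splice them into the prompt at a placeholder
    position, yielding visually modulated text guidance."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.project = nn.Linear(vision_dim, text_dim)

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor,
                placeholder_idx: int) -> torch.Tensor:
        # vision_feats: (batch, vision_dim) pooled features of context images
        # text_embeds:  (batch, seq_len, text_dim) prompt token embeddings
        ctx = self.project(vision_feats)   # (batch, text_dim)
        out = text_embeds.clone()
        out[:, placeholder_idx] = ctx      # modulate the prompt in place
        return out  # fed to the diffusion model's cross-attention
```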
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
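The joint query-key embedding at the core of AttentionViz can be approximated in a few lines: stack the query and key vectors of one head and project them with a single shared dimensionality reduction, so related queries and keys land near each other. This sketch uses t-SNE and omits the key translation and scaling normalizations described in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def joint_query_key_embedding(queries: np.ndarray, keys: np.ndarray):
    """Project the query and key vectors of one attention head into a shared
    2-D space. Both inputs have shape (n_tokens, head_dim), extracted from a
    trained transformer."""
    # Embedding both sets together places queries and keys in one coordinate
    # system; with suitable normalization, nearby query-key pairs tend to
    # correspond to larger attention weights.
    joint = np.concatenate([queries, keys], axis=0)
    coords = TSNE(n_components=2).fit_transform(joint)
    n = len(queries)
    return coords[:n], coords[n:]  # 2-D positions for queries, then keys
```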
- Self-paced Multi-grained Cross-modal Interaction Modeling for Referring Expression Comprehension [21.000045864213327]
Referring expression comprehension (REC) generally requires a large amount of multi-grained information from visual and linguistic modalities to realize accurate reasoning.
How to aggregate multi-grained information from different modalities and extract abundant knowledge from hard examples is crucial in the REC task.
We propose a Self-paced Multi-grained Cross-modal Interaction Modeling framework, which improves the language-to-vision localization ability.
arXiv Detail & Related papers (2022-04-21T08:32:47Z)
- Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z)
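As a concrete picture of input-conditioned discretization tightness, the sketch below selects how many codebook entries a vector-quantization bottleneck may use per batch. It is a simplification: the hard argmax selection here is not differentiable, whereas the paper learns the selection end-to-end.

```python
import torch
import torch.nn as nn

class DynamicVQ(nn.Module):
    """Discrete bottleneck whose tightness (number of usable codebook
    entries) is chosen conditioned on the input. Illustrative sketch only."""

    def __init__(self, dim: int, codebook_size: int = 64, n_levels: int = 4):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        # Allowed tightness levels, e.g. the first 8, 16, 32, or 64 codes.
        self.levels = [codebook_size // 2 ** i for i in range(n_levels)][::-1]
        self.selector = nn.Linear(dim, n_levels)  # scores each level

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Pick a tightness level conditioned on the input.
        # NOTE: argmax is non-differentiable; a sketch-level shortcut.
        level = self.selector(x.mean(dim=0)).argmax().item()
        codes = self.codebook.weight[: self.levels[level]]  # (n_codes, dim)
        nearest = torch.cdist(x, codes).argmin(dim=1)       # hard assignment
        q = codes[nearest]
        # Straight-through estimator: forward uses q, gradients flow to x.
        return x + (q - x).detach()
```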
- Dodrio: Exploring Transformer Models with Interactive Visualization [10.603327364971559]
Dodrio is an open-source interactive visualization tool to help NLP researchers and practitioners analyze attention mechanisms in transformer-based models with linguistic knowledge.
To facilitate the visual comparison of attention weights and linguistic knowledge, Dodrio applies different graph visualization techniques to represent attention weights with longer input text.
arXiv Detail & Related papers (2021-03-26T17:39:37Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary-learning-based method to learn relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Incidental Supervision: Moving beyond Supervised Learning [72.4859717204905]
This paper describes several learning paradigms that are designed to alleviate the supervision bottleneck.
It will illustrate their benefit in the context of multiple problems, all pertaining to inducing various levels of semantic representations from text.
arXiv Detail & Related papers (2020-05-25T18:44:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.