TxT: Crossmodal End-to-End Learning with Transformers
- URL: http://arxiv.org/abs/2109.04422v1
- Date: Thu, 9 Sep 2021 17:12:20 GMT
- Title: TxT: Crossmodal End-to-End Learning with Transformers
- Authors: Jan-Martin O. Steitz, Jonas Pfeiffer, Iryna Gurevych, Stefan Roth
- Abstract summary: Reasoning over multiple modalities requires an alignment of semantic concepts across domains.
TxT is a transformer-based crossmodal pipeline that enables fine-tuning both language and visual components on the downstream task.
Our model achieves considerable gains from end-to-end learning for multimodal question answering.
- Score: 84.55645255507461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning over multiple modalities, e.g. in Visual Question Answering (VQA),
requires an alignment of semantic concepts across domains. Despite the
widespread success of end-to-end learning, today's multimodal pipelines by and
large leverage pre-extracted, fixed features from object detectors, typically
Faster R-CNN, as representations of the visual world. The obvious downside is
that the visual representation is not specifically tuned to the multimodal task
at hand. At the same time, while transformer-based object detectors have gained
popularity, they have not been employed in today's multimodal pipelines. We
address both shortcomings with TxT, a transformer-based crossmodal pipeline
that enables fine-tuning both language and visual components on the downstream
task in a fully end-to-end manner. We overcome existing limitations of
transformer-based detectors for multimodal reasoning regarding the integration
of global context and their scalability. Our transformer-based multimodal model
achieves considerable gains from end-to-end learning for multimodal question
answering.
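To make the end-to-end claim concrete, here is a minimal PyTorch-style sketch of a crossmodal VQA pipeline in the spirit of the abstract: a trainable transformer detector produces region features that are fused with question tokens, so a single answer loss updates both visual and language parameters. All module names, dimensions, and the classification head are illustrative assumptions, not the actual TxT architecture.

```python
# Hypothetical end-to-end crossmodal VQA sketch (not the TxT code): a trainable
# transformer-style detector feeds region features into a fusion transformer, and
# gradients from the answer loss reach both the visual and language components.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Stand-in for a transformer-based detector that emits region features."""
    def __init__(self, dim=256, num_queries=36):
        super().__init__()
        self.backbone = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # toy patch embedding
        self.queries = nn.Parameter(torch.randn(num_queries, dim))    # object queries
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, images):                                   # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        queries = self.queries.unsqueeze(0).expand(images.size(0), -1, -1)
        return self.decoder(queries, feats)                       # (B, num_queries, dim)


class CrossmodalVQA(nn.Module):
    """Fuses question tokens with region features; trained jointly with the detector."""
    def __init__(self, vocab_size=30522, dim=256, num_answers=3129):
        super().__init__()
        self.visual = VisualEncoder(dim)
        self.token_embed = nn.Embedding(vocab_size, dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True), num_layers=4)
        self.head = nn.Linear(dim, num_answers)

    def forward(self, images, question_ids):
        regions = self.visual(images)          # gradients flow back into the detector
        words = self.token_embed(question_ids)
        fused = self.fusion(torch.cat([words, regions], dim=1))
        return self.head(fused[:, 0])          # answer logits read from the first token
```

Because the detector sits inside the same computation graph as the fusion transformer, the answer loss also shapes the visual representation, which is the property the abstract contrasts with pipelines built on pre-extracted Faster R-CNN features.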
Related papers
- CT-MVSNet: Efficient Multi-View Stereo with Cross-scale Transformer [8.962657021133925]
Cross-scale transformer (CT) processes feature representations at different stages without additional computation.
We introduce an adaptive matching-aware transformer (AMT) that employs different interactive attention combinations at multiple scales.
We also present a dual-feature guided aggregation (DFGA) that embeds the coarse global semantic information into the finer cost volume construction.
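One plausible reading of the dual-feature guided aggregation described above is a cross-attention step that injects coarse, global context into a finer feature map. The PyTorch-style sketch below illustrates only that reading, with assumed shapes and names; the paper's AMT and DFGA modules are more involved.

```python
# Hedged sketch: fine-resolution features attend to coarse features to pick up
# global semantic context. Illustrative approximation only, not CT-MVSNet's modules.
import torch
import torch.nn as nn


class CoarseToFineAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fine, coarse):
        # fine:   (B, C, Hf, Wf) high-resolution features
        # coarse: (B, C, Hc, Wc) low-resolution features carrying global context
        b, c, hf, wf = fine.shape
        q = fine.flatten(2).transpose(1, 2)      # (B, Hf*Wf, C) queries from the fine scale
        kv = coarse.flatten(2).transpose(1, 2)   # (B, Hc*Wc, C) keys/values from the coarse scale
        out, _ = self.attn(q, kv, kv)
        out = self.norm(q + out)                 # residual connection + layer norm
        return out.transpose(1, 2).reshape(b, c, hf, wf)
```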
arXiv Detail & Related papers (2023-12-14T01:33:18Z)
- Exchanging-based Multimodal Fusion with Transformer [19.398692598523454]
We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods, proposed for vision-vision fusion, aim to exchange embeddings learned in one modality with the other.
We propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer.
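As a rough illustration of the exchanging idea, the sketch below replaces text tokens that a learned score marks as uninformative with a projected summary of the vision features. The scoring rule, mean pooling, and projection are assumptions made for illustration, not MuSE's actual design.

```python
# Hedged sketch of exchanging-based text-vision fusion: weak text tokens are
# overwritten by a projected summary of the vision tokens. Illustrative only.
import torch
import torch.nn as nn


class ExchangingFusion(nn.Module):
    def __init__(self, dim=256, threshold=0.1):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # per-token informativeness score (assumed)
        self.proj_v2t = nn.Linear(dim, dim)   # maps vision features into the text space
        self.threshold = threshold

    def forward(self, text_tokens, vision_tokens):
        # text_tokens: (B, Lt, dim), vision_tokens: (B, Lv, dim)
        scores = torch.sigmoid(self.score(text_tokens))                           # (B, Lt, 1)
        vision_summary = self.proj_v2t(vision_tokens.mean(dim=1, keepdim=True))   # (B, 1, dim)
        exchange = (scores < self.threshold).float()   # 1 where a text token looks uninformative
        return exchange * vision_summary + (1 - exchange) * text_tokens
```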
arXiv Detail & Related papers (2023-09-05T12:48:25Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
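The substitution step can be pictured with a short sketch: per-token scores flag uninformative tokens in each modality, which are then overwritten by a projection of the aligned token from the other modality. The thresholding and projections below are illustrative assumptions, not TokenFusion's exact pruning criterion.

```python
# Hedged sketch of token substitution between two spatially aligned modalities:
# low-scoring tokens are replaced by projected tokens from the other modality.
import torch
import torch.nn as nn


class TokenSubstitution(nn.Module):
    def __init__(self, dim=256, threshold=0.02):
        super().__init__()
        self.score_a = nn.Linear(dim, 1)     # informativeness score for modality A tokens
        self.score_b = nn.Linear(dim, 1)     # informativeness score for modality B tokens
        self.proj_ba = nn.Linear(dim, dim)   # projects modality B tokens into A's space
        self.proj_ab = nn.Linear(dim, dim)   # projects modality A tokens into B's space
        self.threshold = threshold

    def forward(self, tokens_a, tokens_b):
        # tokens_a, tokens_b: (B, L, dim), aligned token sequences of two modalities
        weak_a = (torch.sigmoid(self.score_a(tokens_a)) < self.threshold).float()
        weak_b = (torch.sigmoid(self.score_b(tokens_b)) < self.threshold).float()
        fused_a = weak_a * self.proj_ba(tokens_b) + (1 - weak_a) * tokens_a
        fused_b = weak_b * self.proj_ab(tokens_a) + (1 - weak_b) * tokens_b
        return fused_a, fused_b
```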
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers [47.581265194864585]
Internal mechanisms of vision and multimodal transformers remain largely opaque.
With the success of these transformers, it is increasingly critical to understand their inner workings.
We propose VL-InterpreT, which provides novel interactive visualizations for interpreting the attentions and hidden representations in multimodal transformers.
arXiv Detail & Related papers (2022-03-30T05:25:35Z)
- StreaMulT: Streaming Multimodal Transformer for Heterogeneous and Arbitrary Long Sequential Data [0.0]
StreaMulT, a Streaming Multimodal Transformer, relies on cross-modal attention and a memory bank to process arbitrarily long input sequences at training time and to run in a streaming fashion at inference.
StreaMulT improves the state-of-the-art metrics on the CMU-MOSEI dataset for the Multimodal Sentiment Analysis task, while handling much longer inputs than other multimodal models.
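A minimal sketch of a streaming cross-modal step, assuming a simple FIFO memory bank, is given below: queries from one modality's current chunk attend over the other modality's chunk concatenated with stored past states, so arbitrarily long inputs can be consumed chunk by chunk. This illustrates the general mechanism only, not StreaMulT's exact memory design.

```python
# Hedged sketch of streaming cross-modal attention with a bounded memory bank.
# The FIFO truncation and detach-based memory update are illustrative assumptions.
import torch
import torch.nn as nn


class StreamingCrossmodalBlock(nn.Module):
    def __init__(self, dim=128, heads=4, memory_size=64):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.memory_size = memory_size
        self.memory = None                      # (B, <=memory_size, dim) past context states

    def forward(self, chunk_a, chunk_b):
        # chunk_a, chunk_b: (B, L, dim) current chunks from two modalities
        context = chunk_b if self.memory is None else torch.cat([self.memory, chunk_b], dim=1)
        out, _ = self.cross_attn(chunk_a, context, context)    # A queries attend to B + memory
        out = self.norm(chunk_a + out)
        self.memory = context[:, -self.memory_size:].detach()  # keep only the most recent states
        return out
```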
arXiv Detail & Related papers (2021-10-15T11:32:17Z)
- UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture that fits different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent decision process more explainable.
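A hedged sketch of an entity-wise transformer policy in this spirit follows: each agent's observation is split into per-entity features, a shared transformer encodes them, and action logits are read out per entity, so the same weights can serve tasks with varying numbers of entities. The decomposition and readout are assumptions for illustration, not UPDeT's exact policy-decoupling scheme.

```python
# Hedged sketch of a transformer policy over observation entities; the entity split
# and per-entity action readout are assumptions, not the UPDeT implementation.
import torch
import torch.nn as nn


class EntityTransformerPolicy(nn.Module):
    def __init__(self, entity_dim=32, dim=64, actions_per_entity=1):
        super().__init__()
        self.embed = nn.Linear(entity_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.action_head = nn.Linear(dim, actions_per_entity)

    def forward(self, entities):
        # entities: (B, num_entities, entity_dim); num_entities may differ across tasks
        hidden = self.encoder(self.embed(entities))
        return self.action_head(hidden).flatten(1)   # (B, num_entities * actions_per_entity)
```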
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.