Compound Tokens: Channel Fusion for Vision-Language Representation Learning
- URL: http://arxiv.org/abs/2212.01447v1
- Date: Fri, 2 Dec 2022 21:09:52 GMT
- Title: Compound Tokens: Channel Fusion for Vision-Language Representation Learning
- Authors: Maxwell Mbabilla Aladago and AJ Piergiovanni
- Abstract summary: We present an effective method for fusing visual-and-language representations for question answering tasks.
By fusing on the channels, the model is able to more effectively align the tokens compared to standard methods.
We demonstrate the effectiveness of compound tokens using an encoder-decoder vision-language model trained end-to-end in the open-vocabulary setting.
- Score: 36.19486792701684
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an effective method for fusing visual-and-language representations
for several question answering tasks including visual question answering and
visual entailment. In contrast to prior works that concatenate unimodal
representations or use only cross-attention, we compose multimodal
representations via channel fusion. By fusing on the channels, the model is
able to more effectively align the tokens compared to standard methods. These
multimodal representations, which we call compound tokens, are generated with
cross-attention transformer layers. First, vision tokens are used as queries to
retrieve compatible text tokens through cross-attention. We then chain the
vision tokens and the queried text tokens along the channel dimension. We call
the resulting representations compound tokens. A second group of compound
tokens are generated using an analogous process where the text tokens serve as
queries to the cross-attention layer. We concatenate all the compound tokens
for further processing with a multimodal encoder. We demonstrate the
effectiveness of compound tokens using an encoder-decoder vision-language model
trained end-to-end in the open-vocabulary setting. Compound Tokens achieve
highly competitive performance across a range of question answering tasks
including GQA, VQA2.0, and SNLI-VE.
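To make the fusion procedure concrete, below is a minimal PyTorch sketch of the compound-token construction as described in the abstract. The module name CompoundTokenFusion, the shared dimension d_model, and the linear projection back to d_model after channel concatenation are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of channel-fusion "compound tokens", following the abstract.
# Names and the final projection are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class CompoundTokenFusion(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention with vision tokens as queries over text tokens ...
        self.vis_to_txt = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # ... and, analogously, text tokens as queries over vision tokens.
        self.txt_to_vis = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        # Project the doubled channel dimension back to d_model (an assumption;
        # the abstract only specifies channel-wise concatenation).
        self.proj_v = nn.Linear(2 * d_model, d_model)
        self.proj_t = nn.Linear(2 * d_model, d_model)

    def forward(self, vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, Nv, d_model), text_tokens: (B, Nt, d_model)
        # 1) Vision queries retrieve compatible text tokens via cross-attention.
        txt_for_vis, _ = self.vis_to_txt(vision_tokens, text_tokens, text_tokens)
        # 2) Chain vision tokens and the queried text tokens along the channel dim.
        compound_v = torch.cat([vision_tokens, txt_for_vis], dim=-1)   # (B, Nv, 2*d)
        # 3) Analogous process with text tokens serving as the queries.
        vis_for_txt, _ = self.txt_to_vis(text_tokens, vision_tokens, vision_tokens)
        compound_t = torch.cat([text_tokens, vis_for_txt], dim=-1)     # (B, Nt, 2*d)
        # 4) Concatenate both groups of compound tokens along the token dimension
        #    for further processing by a multimodal encoder.
        return torch.cat([self.proj_v(compound_v), self.proj_t(compound_t)], dim=1)


if __name__ == "__main__":
    fusion = CompoundTokenFusion()
    v = torch.randn(2, 196, 512)   # e.g. patch tokens from a vision backbone
    t = torch.randn(2, 32, 512)    # e.g. text tokens from a language encoder
    print(fusion(v, t).shape)      # torch.Size([2, 228, 512])
```

The distinguishing step is the concatenation along the channel (feature) dimension rather than along the token dimension, so each compound token carries both its own features and the features it retrieved from the other modality.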
Related papers
- ElasticTok: Adaptive Tokenization for Image and Video [109.75935878130582]
We introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens.
During inference, ElasticTok can dynamically allocate tokens when needed.
Our evaluations on images and video demonstrate the effectiveness of our approach in efficient token usage.
arXiv Detail & Related papers (2024-10-10T20:54:15Z)
- Balancing Performance and Efficiency: A Multimodal Large Language Model Pruning Method based Image Text Interaction [6.467840081978855]
Multimodal large language models (MM-LLMs) have achieved great success in many multimodal tasks, but their high computational costs limit their wider adoption and application.
We studied the visual tokens of MM-LLMs and designed a dynamic pruning algorithm to address this issue.
Our proposed method achieves performance competitive with the original model while using, on average, only 22% of the original number of tokens.
arXiv Detail & Related papers (2024-09-02T10:49:10Z)
- Nested Music Transformer: Sequentially Decoding Compound Tokens in Symbolic Music and Audio Generation [2.668651175000492]
Representing symbolic music with compound tokens, where each token consists of several different sub-tokens, offers the advantage of reducing sequence length.
We introduce the Nested Music Transformer (NMT), an architecture tailored for decoding compound tokens autoregressively, similar to processing flattened tokens, but with low memory usage.
Experimental results show that applying the NMT to compound tokens improves perplexity on various symbolic music datasets and on discrete audio tokens from the MAESTRO dataset.
arXiv Detail & Related papers (2024-08-02T11:02:38Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- SEP: Self-Enhanced Prompt Tuning for Visual-Language Model [93.94454894142413]
We introduce a novel approach named Self-Enhanced Prompt Tuning (SEP).
SEP explicitly incorporates discriminative prior knowledge to enhance both textual-level and visual-level embeddings.
Comprehensive evaluations across various benchmarks and tasks confirm SEP's efficacy in prompt tuning.
arXiv Detail & Related papers (2024-05-24T13:35:56Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit contrastive learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Improving Visual Quality of Image Synthesis by A Token-based Generator with Transformers [51.581926074686535]
We present a new perspective of achieving image synthesis by viewing this task as a visual token generation problem.
The proposed TokenGAN has achieved state-of-the-art results on several widely-used image synthesis benchmarks.
arXiv Detail & Related papers (2021-11-05T12:57:50Z)