Cross-stitched Multi-modal Encoders
- URL: http://arxiv.org/abs/2204.09227v1
- Date: Wed, 20 Apr 2022 05:09:36 GMT
- Title: Cross-stitched Multi-modal Encoders
- Authors: Karan Singla, Daniel Pressel, Ryan Price, Bhargav Srinivas Chinnari,
Yeon-Jun Kim, Srinivas Bangalore
- Abstract summary: We combine pretrained speech and text encoders using multi-headed cross-modal attention.
The resultant architecture can be used for continuous token-level classification or utterance-level prediction.
Our model architecture is compact, resource efficient, and can be trained on a single consumer GPU card.
- Score: 17.387919594858463
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we propose a novel architecture for multi-modal speech and
text input. We combine pretrained speech and text encoders using multi-headed
cross-modal attention and jointly fine-tune on the target problem. The
resultant architecture can be used for continuous token-level classification or
utterance-level prediction acting on simultaneous text and speech. The
resultant encoder efficiently captures both acoustic-prosodic and lexical
information. We compare the benefits of multi-headed attention-based fusion for
multi-modal utterance-level classification against a simple concatenation of
pre-pooled, modality-specific representations. Our model architecture is
compact, resource efficient, and can be trained on a single consumer GPU card.
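The abstract describes the fusion mechanism only at a high level. Below is a minimal PyTorch sketch of how multi-headed cross-modal attention could combine pretrained text and speech encoder outputs, alongside the pre-pooled concatenation baseline it is compared against; module names, dimensions, and pooling choices are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch: fuse pretrained text and speech encoder outputs with
    multi-headed cross-modal attention (all sizes are assumptions)."""
    def __init__(self, text_dim=768, speech_dim=512, fused_dim=768,
                 num_heads=8, num_labels=2):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, fused_dim)
        self.speech_proj = nn.Linear(speech_dim, fused_dim)
        # Text tokens attend over speech frames (queries=text, keys/values=speech).
        self.cross_attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.token_head = nn.Linear(fused_dim, num_labels)      # continuous token-level tags
        self.utterance_head = nn.Linear(fused_dim, num_labels)  # utterance-level prediction

    def forward(self, text_states, speech_states):
        # text_states: (B, T_text, text_dim) from a pretrained text encoder
        # speech_states: (B, T_speech, speech_dim) from a pretrained speech encoder
        q = self.text_proj(text_states)
        kv = self.speech_proj(speech_states)
        fused, _ = self.cross_attn(q, kv, kv)                   # (B, T_text, fused_dim)
        token_logits = self.token_head(fused)
        utterance_logits = self.utterance_head(fused.mean(dim=1))
        return token_logits, utterance_logits

class ConcatBaseline(nn.Module):
    """Comparison baseline: concatenate pre-pooled, modality-specific vectors."""
    def __init__(self, text_dim=768, speech_dim=512, num_labels=2):
        super().__init__()
        self.classifier = nn.Linear(text_dim + speech_dim, num_labels)

    def forward(self, text_states, speech_states):
        pooled = torch.cat([text_states.mean(dim=1), speech_states.mean(dim=1)], dim=-1)
        return self.classifier(pooled)
```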
Related papers
- MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer [90.72238747690972]
We present Manzano, a simple and scalable unified framework for multimodal large language models.
A single vision encoder feeds two adapters that produce continuous embeddings for image-to-text understanding and discrete tokens for text-to-image generation.
A unified autoregressive LLM predicts high-level semantics in the form of text and image tokens, with an auxiliary diffusion decoder subsequently translating the image tokens into pixels.
arXiv Detail & Related papers (2025-09-19T17:58:00Z)
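A rough, hypothetical sketch of the "single vision encoder, two adapters" idea summarized above: a continuous adapter projects patch features for understanding, while a nearest-codebook lookup yields discrete tokens for generation. The codebook size and dimensions are assumptions, not Manzano's actual design.

```python
import torch
import torch.nn as nn

class HybridVisionTokenizer(nn.Module):
    """Sketch: one shared vision encoder feeds a continuous adapter and a
    discrete (codebook lookup) adapter. Sizes are illustrative assumptions."""
    def __init__(self, feat_dim=1024, llm_dim=4096, codebook_size=8192):
        super().__init__()
        self.continuous_adapter = nn.Linear(feat_dim, llm_dim)
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    def forward(self, patch_features):
        # patch_features: (B, N_patches, feat_dim) from the shared vision encoder
        cont = self.continuous_adapter(patch_features)            # for image-to-text understanding
        flat = patch_features.reshape(-1, patch_features.size(-1))
        dists = torch.cdist(flat, self.codebook.weight)           # (B*N, codebook_size)
        token_ids = dists.argmin(dim=-1).reshape(patch_features.shape[:-1])
        return cont, token_ids                                    # discrete ids for text-to-image generation
```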
METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR).
For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy.
For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
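The summary above does not spell out the rank-guided assignment rule; the sketch below shows only a generic score-based top-k pruning per encoder followed by concatenation, as one way such multi-encoder pruning can be arranged.

```python
import torch

def prune_tokens(tokens, keep_ratio=0.5):
    """Generic top-k token pruning sketch (not METEOR's exact rule):
    score each visual token by its L2 norm and keep the highest-scoring ones."""
    # tokens: (B, N, D)
    scores = tokens.norm(dim=-1)                        # (B, N)
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices                 # (B, k)
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
    return tokens.gather(1, idx)                        # (B, k, D)

# Multi-encoder usage: prune within each encoder, then fuse by concatenation.
feats_a = torch.randn(2, 196, 768)   # toy tokens from vision encoder A
feats_b = torch.randn(2, 256, 768)   # toy tokens from vision encoder B
fused = torch.cat([prune_tokens(feats_a), prune_tokens(feats_b)], dim=1)
print(fused.shape)                   # (2, 98 + 128, 768)
```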
FuseLIP: Multimodal Embeddings via Early Fusion of Discrete Tokens [56.752362642658504]
We present FuseLIP, an alternative architecture for multimodal embedding.
We propose a single transformer model which operates on an extended vocabulary of text and image tokens.
We show that FuseLIP outperforms other approaches in multimodal embedding tasks such as VQA and text-guided image transformation retrieval.
arXiv Detail & Related papers (2025-06-03T17:27:12Z)
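A minimal sketch of early fusion over an extended vocabulary, as described above: image token ids are offset past the text vocabulary and both streams share a single transformer. Vocabulary sizes and model dimensions are assumptions.

```python
import torch
import torch.nn as nn

class EarlyFusionEmbedder(nn.Module):
    """Sketch: one transformer over an extended vocabulary of text + image tokens."""
    def __init__(self, text_vocab=32000, image_vocab=8192, dim=512, heads=8, layers=6):
        super().__init__()
        self.embed = nn.Embedding(text_vocab + image_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)
        self.text_vocab = text_vocab

    def forward(self, text_ids, image_ids):
        # text_ids: (B, T_text) ids from a text tokenizer
        # image_ids: (B, T_img) ids from a discrete image tokenizer
        tokens = torch.cat([text_ids, image_ids + self.text_vocab], dim=1)
        hidden = self.encoder(self.embed(tokens))
        return hidden.mean(dim=1)     # one multimodal embedding per text-image pair
```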
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment [15.899112804399193]
We present TESU-LLM, a novel framework that enables training speech-capable language models using only text data.
Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space.
Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks.
arXiv Detail & Related papers (2025-06-01T09:27:55Z)
Rethinking Multimodal Sentiment Analysis: A High-Accuracy, Simplified Fusion Architecture [2.3272964989267626]
We propose a lightweight, yet effective fusion-based deep learning model tailored for utterance-level emotion classification.
Our approach demonstrates that with careful feature engineering and modular design, simpler fusion strategies can outperform or match more complex models.
arXiv Detail & Related papers (2025-05-05T02:31:11Z)
Hierarchical Banzhaf Interaction for General Video-Language Representation Learning [60.44337740854767]
Multimodal representation learning plays an important role in the artificial intelligence domain.
We introduce a new approach that models video and text as game players using multivariate cooperative game theory.
We extend our original structure into a flexible encoder-decoder framework, enabling the model to adapt to various downstream tasks.
arXiv Detail & Related papers (2024-12-30T14:09:15Z)
Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment [16.733970553781887]
Recent findings suggest high semantic similarity between well-trained unimodal encoders.
We propose a novel framework that aligns vision and language using frozen unimodal encoders.
arXiv Detail & Related papers (2024-09-28T17:57:32Z)
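A hedged sketch of aligning frozen unimodal encoders as described above: only small projection heads are trained, using a CLIP-style symmetric contrastive loss. The dimensions and loss form are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenEncoderAligner(nn.Module):
    """Sketch: trainable projection heads align outputs of frozen encoders."""
    def __init__(self, vision_dim=1024, text_dim=768, shared_dim=512):
        super().__init__()
        self.vision_head = nn.Linear(vision_dim, shared_dim)
        self.text_head = nn.Linear(text_dim, shared_dim)

    def forward(self, vision_feats, text_feats, temperature=0.07):
        # vision_feats: (B, vision_dim), text_feats: (B, text_dim), both from frozen encoders
        v = F.normalize(self.vision_head(vision_feats), dim=-1)
        t = F.normalize(self.text_head(text_feats), dim=-1)
        logits = v @ t.t() / temperature                      # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)    # matching pairs on the diagonal
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```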
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities [6.9522425458326635]
We propose a multi-tower decoder architecture that flexibly composes multimodal generative models from independently pre-trained unimodal decoders.
We show the proposed architecture performs very competitively in scenarios with limited aligned text-speech data.
In cross-modal tasks such as text-to-speech generation (TTS) where the output modality is speech, we show that using a pre-trained speech backbone results in superior performance to the baseline.
arXiv Detail & Related papers (2024-05-29T00:23:55Z)
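A simplified illustration of composing two pretrained decoder towers: one tower's hidden states cross-attend to the other's through a residual attention block. Zipper's actual layer placement and gating may differ; this is only a sketch of the multi-tower idea.

```python
import torch
import torch.nn as nn

class TowerCrossAttention(nn.Module):
    """Sketch: fuse two unimodal decoder towers via residual cross-attention."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, speech_hidden, text_hidden):
        # speech_hidden: (B, T_s, dim) from a speech decoder layer
        # text_hidden:   (B, T_t, dim) from a text decoder layer
        attended, _ = self.attn(speech_hidden, text_hidden, text_hidden)
        return self.norm(speech_hidden + attended)   # residual fusion of the two towers
```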
Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A visual-text aggregation module based on the Transformer is further designed to incorporate cross-modal-temporal complementary information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation [0.9285295512807729]
We propose a data augmentation technique for generating hallucinated audio captions and show that similarity based on an audio-text shared latent space is suitable for detecting hallucination.
We then propose a parameter efficient inference time faithful decoding algorithm that enables smaller audio captioning models with performance equivalent to larger models trained with more data.
arXiv Detail & Related papers (2023-09-06T19:42:52Z)
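A toy sketch of the similarity check implied above: a generated caption whose embedding lies far from the audio embedding in a shared audio-text latent space is flagged as hallucinated. The threshold and embedding size are assumptions.

```python
import torch
import torch.nn.functional as F

def is_hallucinated(audio_embedding, caption_embedding, threshold=0.3):
    """Sketch: flag captions whose shared-space embedding is too far from the audio."""
    sim = F.cosine_similarity(audio_embedding, caption_embedding, dim=-1)
    return sim < threshold

# Toy usage with random vectors standing in for shared-space embeddings.
audio = torch.randn(4, 512)
caption = torch.randn(4, 512)
print(is_hallucinated(audio, caption))
```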
A novel multimodal dynamic fusion network for disfluency detection in spoken utterances [43.79216238760557]
We propose a novel multimodal architecture for disfluency detection from individual utterances.
Our architecture leverages a multimodal dynamic fusion network that adds minimal parameters over an existing text encoder.
We show that our proposed model achieves state-of-the-art results on the widely used English Switchboard corpus for disfluency detection.
arXiv Detail & Related papers (2022-11-27T01:54:22Z)
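A small, hypothetical sketch of adding minimal fusion parameters on top of an existing text encoder, as the summary above describes: a per-token gate mixes aligned acoustic features into the text token states. The gating form is an assumption, not the paper's exact module.

```python
import torch
import torch.nn as nn

class DynamicGatedFusion(nn.Module):
    """Sketch: a lightweight gate mixes acoustic features into text token states."""
    def __init__(self, text_dim=768, audio_dim=512):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        self.gate = nn.Linear(2 * text_dim, 1)

    def forward(self, text_states, aligned_audio):
        # text_states: (B, T, text_dim); aligned_audio: (B, T, audio_dim),
        # i.e. acoustic features already aligned to the text tokens.
        audio = self.audio_proj(aligned_audio)
        g = torch.sigmoid(self.gate(torch.cat([text_states, audio], dim=-1)))
        return text_states + g * audio   # gated residual fusion over the text encoder
```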
i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
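A brief sketch of one of the pretraining objectives named above, masked modality unit modeling: discrete units of one modality are randomly masked so the fused model can be trained to reconstruct them. The mask rate and mask id are assumptions.

```python
import torch

def mask_modality_units(units, mask_prob=0.15, mask_id=0):
    """Sketch: randomly mask discrete modality units for reconstruction training."""
    # units: (B, T) discrete unit ids for one modality (e.g. speech or vision units)
    mask = torch.rand(units.shape) < mask_prob
    corrupted = units.masked_fill(mask, mask_id)
    return corrupted, mask   # the model is trained to predict units[mask]

# Toy usage.
units = torch.randint(1, 1000, (2, 20))
corrupted, mask = mask_modality_units(units)
print(mask.sum().item(), "units masked")
```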
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding [93.16866430882204]
In prior works, frame-level features from one layer are aggregated to form an utterance-level representation.
Inspired by the Transformer network, our proposed method utilizes the hierarchical architecture of stacked self-attention mechanisms.
With more layers stacked, the neural network can learn more discriminative speaker embeddings.
arXiv Detail & Related papers (2021-07-14T05:38:48Z)
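A rough sketch of attention-based aggregation of frame-level features into an utterance-level speaker embedding, with per-layer aggregation summed across stacked self-attention layers; hyperparameters are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentivePooling(nn.Module):
    """Sketch: aggregate frame-level features into one utterance-level vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frames):
        # frames: (B, T, dim) frame-level features
        w = torch.softmax(self.score(frames), dim=1)   # (B, T, 1) attention weights
        return (w * frames).sum(dim=1)                 # (B, dim) utterance-level embedding

# Serialized usage sketch: aggregate after each stacked self-attention layer
# and sum the per-layer utterance vectors into the final speaker embedding.
frames = torch.randn(4, 300, 256)
layers = nn.ModuleList([nn.MultiheadAttention(256, 4, batch_first=True) for _ in range(3)])
poolers = nn.ModuleList([AttentivePooling(256) for _ in range(3)])
embedding = 0
for attn, pool in zip(layers, poolers):
    frames, _ = attn(frames, frames, frames)
    embedding = embedding + pool(frames)
print(embedding.shape)  # torch.Size([4, 256])
```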
Retrieve Fast, Rerank Smart: Cooperative and Joint Approaches for Improved Cross-Modal Retrieval [80.35589927511667]
Current state-of-the-art approaches to cross-modal retrieval process text and visual input jointly, relying on Transformer-based architectures with cross-attention mechanisms that attend over all words and objects in an image.
We propose a novel fine-tuning framework which turns any pretrained text-image multi-modal model into an efficient retrieval model.
Our experiments on a series of standard cross-modal retrieval benchmarks in monolingual, multilingual, and zero-shot setups, demonstrate improved accuracy and huge efficiency benefits over the state-of-the-art cross-encoders.
arXiv Detail & Related papers (2021-03-22T15:08:06Z)
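A compact sketch of the cooperative retrieve-then-rerank pattern described above: a cheap embedding dot product shortlists candidates, and a more expensive cross-encoder-style scorer reranks only that shortlist. The stand-in scorer below is a placeholder for a real joint cross-attention model.

```python
import torch

def retrieve_then_rerank(query_vec, doc_vecs, cross_scorer, top_k=10):
    """Sketch: fast embedding retrieval followed by expensive reranking of a shortlist."""
    # Stage 1: retrieve fast with precomputed embeddings.
    sims = doc_vecs @ query_vec                         # (N,) dot-product similarities
    shortlist = sims.topk(min(top_k, len(sims))).indices
    # Stage 2: rerank smart, scoring only the shortlisted candidates.
    rerank_scores = torch.stack([cross_scorer(query_vec, doc_vecs[i]) for i in shortlist])
    return shortlist[rerank_scores.argsort(descending=True)]

# Toy usage: a stand-in scorer (a real cross-encoder would jointly encode the pair).
docs = torch.randn(1000, 256)
query = torch.randn(256)
fake_cross_scorer = lambda q, d: (q * d).sum()
print(retrieve_then_rerank(query, docs, fake_cross_scorer)[:3])
```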
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training a sequence encoder.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.