CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning
- URL: http://arxiv.org/abs/2512.00496v1
- Date: Sat, 29 Nov 2025 14:04:27 GMT
- Title: CACARA: Cross-Modal Alignment Leveraging a Text-Centric Approach for Cost-Effective Multimodal and Multilingual Learning
- Authors: Diego A. B. Moreira, Alef I. Ferreira, Jhessica Silva, Gabriel O. dos Santos, Gustavo Bonil, João Gondim, Marina dos Santos, Helena Maia, Simone Hashiguti, Nádia da Silva, Carolina Scarton, Helio Pedrini, Sandra Avila
- Abstract summary: We propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages. Our strategy achieves up to a 14.24 percentage-point improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models.
- Score: 6.162206820356373
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As deep learning models evolve, new applications and challenges are rapidly emerging. Tasks that once relied on a single modality, such as text, images, or audio, are now enriched by seamless interactions between multimodal data. These connections bridge information gaps: an image can visually materialize a text, while audio can add context to an image. Researchers have developed numerous multimodal models, but most rely on resource-intensive training across multiple modalities. Similarly, extending these models to new languages often follows the same resource-heavy training strategy. In this work, we propose a multimodal and multilingual architecture, CACARA, trained through emergent alignment learning, enabling the seamless integration of new modalities into an existing bimodal/multimodal model without requiring full retraining. This work breaks new ground by demonstrating that this emergent alignment paradigm can unlock multilingual capabilities from monolingual training. By fine-tuning the newly incorporated modality only on data aligned with the English language, our model develops support for over 100 languages without explicit multilingual pretraining or tuning of the text encoder. Such emergent multimodal and multilingual properties are gained efficiently, preserving previously learned knowledge at a training cost comparable to that of a monolingual model. Our strategy achieves up to a 14.24 percentage-point improvement in R@1 audio-to-text retrieval, outperforming state-of-the-art multimodal models -- all without the heavy computational cost of retraining across every modality and language.
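The abstract describes training only the newly added modality branch against an existing text-centric model, so that cross-modal (and, emergently, cross-lingual) alignment is inherited from a text embedding space that is left untouched. The paper itself does not provide this pseudocode; the snippet below is a minimal PyTorch-style sketch of that idea under the assumption of a frozen text encoder, a trainable audio encoder plus projection head, and a symmetric CLIP-style contrastive loss over English-only audio-caption pairs. Names such as AudioProjection, contrastive_alignment_loss, and train_step are illustrative, not the authors' API.

```python
# Minimal sketch (not the authors' released code) of text-centric emergent
# alignment: a new audio branch is fine-tuned against a FROZEN text encoder
# using only English audio-caption pairs. All names here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioProjection(nn.Module):
    """Maps audio features into the (frozen) text embedding space."""

    def __init__(self, audio_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize so cosine similarity is a plain dot product.
        return F.normalize(self.proj(audio_feats), dim=-1)


def contrastive_alignment_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings."""
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_a2t = F.cross_entropy(logits, targets)              # audio -> text
    loss_t2a = F.cross_entropy(logits.t(), targets)          # text -> audio
    return 0.5 * (loss_a2t + loss_t2a)


def train_step(audio_encoder, projection, frozen_text_encoder, optimizer, batch):
    """One alignment step. The optimizer covers only the audio branch; the
    text encoder stays frozen, so its embedding space is preserved."""
    audio_feats = audio_encoder(batch["audio"])               # trainable branch
    with torch.no_grad():
        # frozen_text_encoder is a placeholder callable that maps (tokenized)
        # English captions to embeddings.
        text_emb = F.normalize(frozen_text_encoder(batch["english_caption"]), dim=-1)
    audio_emb = projection(audio_feats)
    loss = contrastive_alignment_loss(audio_emb, text_emb)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, audio-to-text retrieval (the R@1 metric cited above) reduces to ranking cosine similarities between an audio embedding and candidate caption embeddings, which the untouched text encoder can produce for any language it already supports.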
Related papers
- jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images [5.753626355995653]
jina-clip-v2 is a contrastive vision-language model trained on text pairs, triplets and image-text pairs. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages. We evaluate the model's performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models.
arXiv Detail & Related papers (2024-12-11T22:28:12Z)
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.56746545958522]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities. We build a multimodal text-centric dataset for multimodal alignment pre-training. We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild [102.93338424976959]
We introduce TextBind, an almost annotation-free framework for empowering large language models with multi-turn interleaved multimodal instruction-following capabilities.
Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model.
To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models.
arXiv Detail & Related papers (2023-09-14T15:34:01Z)
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages [76.35234803589412]
MPM is an effective paradigm for training large multimodal models in non-English languages.
We build large multimodal models VisCPM in image-to-text and text-to-image generation, which achieve state-of-the-art (open-source) performance in Chinese.
arXiv Detail & Related papers (2023-08-23T09:55:41Z)
- ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst [24.517389691825667]
ChatBridge is a novel multimodal language model that leverages the expressive capabilities of language to bridge the gap between various modalities.
All codes, data, and models of ChatBridge will be open-sourced.
arXiv Detail & Related papers (2023-05-25T14:34:08Z)
- TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models [5.668457303716451]
We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks.
Our approach significantly reduces model complexity, adds interpretability to the model's decision, and can be applied for a diverse set of tasks.
arXiv Detail & Related papers (2023-03-27T17:54:32Z)
- Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
- Large-scale Bilingual Language-Image Contrastive Learning [17.19890778916312]
We collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP.
We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation.
Experiments demonstrate that a model trained with such training schemes shows competitive performance in both languages.
arXiv Detail & Related papers (2022-03-28T03:02:03Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
- M3P: Learning Universal Representations via Multitask Multilingual Multimodal Pre-training [119.16007395162431]
M3P is a Multilingual Multimodal Pre-trained model that combines multilingual pre-training and multimodal pre-training.
We show that M3P can achieve comparable results for English and new state-of-the-art results for non-English languages.
arXiv Detail & Related papers (2020-06-04T03:54:29Z)