Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
- URL: http://arxiv.org/abs/2510.01845v1
- Date: Thu, 02 Oct 2025 09:38:25 GMT
- Title: Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models
- Authors: Ece Takmaz, Lisa Bylinina, Jakub Dotlacil
- Abstract summary: This paper presents our approach to the multimodal track of the BabyLM challenge, which addresses the discrepancy between the linguistic data available to models and to children. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets.
- Score: 2.3193211674050516
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: State-of-the-art vision-and-language models consist of many parameters and learn from enormous datasets, surpassing the amounts of linguistic data that children are exposed to as they acquire a language. This paper presents our approach to the multimodal track of the BabyLM challenge addressing this discrepancy. We develop language-only and multimodal models in low-resource settings using developmentally plausible datasets, with our multimodal models outperforming previous BabyLM baselines. One finding in the multimodal language model literature is that these models tend to underperform in \textit{language-only} tasks. Therefore, we focus on maintaining language-only abilities in multimodal models. To this end, we experiment with \textit{model merging}, where we fuse the parameters of multimodal models with those of language-only models using weighted linear interpolation. Our results corroborate the findings that multimodal models underperform in language-only benchmarks that focus on grammar, and model merging with text-only models can help alleviate this problem to some extent, while maintaining multimodal performance.
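The merging step described in the abstract (fusing the parameters of a multimodal model with those of a language-only model by weighted linear interpolation) can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the authors' implementation: the checkpoint names and the merge weight alpha are hypothetical, and only parameters shared by both models are interpolated.

```python
# Minimal sketch of model merging via weighted linear interpolation.
# Checkpoint paths and the merge weight `alpha` are illustrative assumptions,
# not details taken from the paper.
import torch


def merge_state_dicts(multimodal_sd, text_only_sd, alpha=0.5):
    """Interpolate shared parameters: alpha * multimodal + (1 - alpha) * text-only.

    Parameters present only in the multimodal model (e.g. a vision encoder or
    projection layers) are copied over unchanged.
    """
    merged = {}
    for name, mm_param in multimodal_sd.items():
        txt_param = text_only_sd.get(name)
        if txt_param is not None and txt_param.shape == mm_param.shape:
            merged[name] = alpha * mm_param + (1.0 - alpha) * txt_param
        else:
            merged[name] = mm_param  # modality-specific weights stay as-is
    return merged


if __name__ == "__main__":
    # Hypothetical checkpoint files standing in for the paper's BabyLM-scale models.
    multimodal_sd = torch.load("multimodal_babylm.pt", map_location="cpu")
    text_only_sd = torch.load("text_only_babylm.pt", map_location="cpu")

    merged_sd = merge_state_dicts(multimodal_sd, text_only_sd, alpha=0.5)
    torch.save(merged_sd, "merged_babylm.pt")
```

In this scheme, an alpha closer to 1 keeps the merged model nearer the multimodal weights, while values nearer 0 favor the text-only model, which is the knob the interpolation exposes for trading multimodal performance against language-only performance.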
Related papers
- Multilingual Definition Modeling [1.9409995498330783]
We use monolingual dictionary data for four new languages (Spanish, French, Portuguese, and German). We test the performance of pre-trained multilingual language models on definition modeling of monosemic words when finetuned on this data. Results show that multilingual language models can perform on par with English but cannot leverage potential cross-lingual synergies.
arXiv Detail & Related papers (2025-06-02T09:48:37Z) - OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging [124.91183814854126]
Model merging seeks to combine multiple expert models into a single model. We introduce a benchmark for model merging research that clearly divides the tasks for MLLM training and evaluation. We find that model merging offers a promising way for building improved MLLMs without requiring training data.
arXiv Detail & Related papers (2025-05-26T12:23:14Z) - xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation [2.9998889086656586]
We propose an adaptation methodology for Large Vision-Language Models trained on English language data to improve their multilingual performance. We introduce a benchmark to evaluate the effectiveness of multilingual and multimodal embedding models.
arXiv Detail & Related papers (2025-03-12T12:04:05Z) - InkubaLM: A small language model for low-resource African languages [9.426968756845389]
InkubaLM is a small language model with 0.4 billion parameters.
It achieves performance comparable to models with significantly larger parameter counts.
It demonstrates remarkable consistency across multiple languages.
arXiv Detail & Related papers (2024-08-30T05:42:31Z) - IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities [4.269326314400742]
We introduce the Inner-Adaptor Architecture (IAA) for multimodal large language models (MLLMs). The architecture incorporates multiple multimodal adaptors at varying depths within the large language model to facilitate direct interaction with the inherently text-oriented transformer layers. Unlike previous approaches with frozen language models, which require large-scale aligned data, our proposed architecture achieves superior performance on small-scale datasets.
arXiv Detail & Related papers (2024-08-23T08:10:13Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z) - AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.56746545958522]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities. We build a multimodal text-centric dataset for multimodal alignment pre-training. We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - Structure-Level Knowledge Distillation For Multilingual Sequence Labeling [73.40368222437912]
We propose to reduce the gap between monolingual models and the unified multilingual model by distilling the structural knowledge of several monolingual models into the unified multilingual model (the student).
Our experiments on 4 multilingual tasks with 25 datasets show that our approaches outperform several strong baselines and have stronger zero-shot generalizability than both the baseline model and teacher models.
arXiv Detail & Related papers (2020-04-08T07:14:01Z) - InterBERT: Vision-and-Language Interaction for Multi-modal Pretraining [76.32065400614162]
We propose a novel model, namely InterBERT (BERT for Interaction), which is the first model of our series of multimodal pretraining methods M6.
The model has a strong capability for modeling interaction between the information flows of different modalities.
We propose a large-scale dataset for multi-modal pretraining in Chinese, and we develop the Chinese InterBERT, which is the first Chinese multi-modal pretrained model.
arXiv Detail & Related papers (2020-03-30T03:13:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.