Related papers: Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

URL: http://arxiv.org/abs/2511.09396v1
Date: Thu, 13 Nov 2025 01:52:13 GMT
Title: Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque
Authors: Lukas Arana, Julen Etxaniz, Ander Salaberria, Gorka Azkune,
Abstract summary: We develop our own training and evaluation image-text datasets for Basque. We show that low ratios of Basque multimodal data are already enough to obtain solid results on Basque benchmarks. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.
Score: 9.472204782610122
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Current Multimodal Large Language Models exhibit very strong performance for several demanding tasks. While commercial MLLMs deliver acceptable performance in low-resource languages, comparable results remain unattained within the open science community. In this paper, we aim to develop a strong MLLM for a low-resource language, namely Basque. For that purpose, we develop our own training and evaluation image-text datasets. Using two different Large Language Models as backbones, the Llama-3.1-Instruct model and a Basque-adapted variant called Latxa, we explore several data mixtures for training. We show that: i) low ratios of Basque multimodal data (around 20%) are already enough to obtain solid results on Basque benchmarks, and ii) contrary to expectations, a Basque instructed backbone LLM is not required to obtain a strong MLLM in Basque. Our results pave the way to develop MLLMs for other low-resource languages by openly releasing our resources.

Related papers

BYOL: Bring Your Own Language Into LLMs [12.151176703151428]
Large Language Models (LLMs) exhibit strong multilingual capabilities, yet remain constrained by the severe imbalance in global language resources. This disparity leads to systematic underperformance, cultural misalignment, and limited accessibility for speakers of low-resource and extreme-low-resource languages. We introduce Bring Your Own Language (BYOL), a unified framework for scalable, language-aware LLM development tailored to each language's digital footprint.
arXiv Detail & Related papers (2026-01-15T19:15:13Z)
Instructing Large Language Models for Low-Resource Languages: A Systematic Study for Basque [28.653468015270732]
Instructing language models with user intent requires large instruction datasets, which are only available for a limited set of languages. We assume a realistic scenario for low-resource languages, where only the following are available: corpora in the target language, existing open-weight multilingual base and instructed backbone LLMs, and synthetically generated instructions sampled from the instructed backbone.
arXiv Detail & Related papers (2025-06-09T09:54:47Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages [0.0]
Large Language Models (LLMs) have shown incredible proficiency at natural language processing tasks. LLMs often struggle to perform well on low-resource languages because there is so little training data available. In this work, we explore training LLaMA-2 to speak Amharic, a language which is spoken by over 50 million people worldwide.
arXiv Detail & Related papers (2024-03-11T01:04:36Z)
Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora. Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
Introducing Bode: A Fine-Tuned Large Language Model for Portuguese Prompt-Based Task [1.158680734110387]
This work proposes a fine-tuned LLaMA 2-based model for Portuguese prompts named Bode. We evaluate the performance of this model in classification tasks using the zero-shot approach with in-context learning.
arXiv Detail & Related papers (2024-01-05T17:15:01Z)
Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages [51.301942056881146]
We investigate how large language models (LLMs) function as rerankers in cross-lingual information retrieval systems for African languages. Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba). We examine cross-lingual reranking with queries in English and passages in the African languages.
arXiv Detail & Related papers (2023-12-26T18:38:54Z)
PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training. Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to effectively perform tasks by simply observing few exemplars. We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English. Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
arXiv Detail & Related papers (2023-06-20T08:27:47Z)

This list is automatically generated from the titles and abstracts of the papers on this site.