Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models
- URL: http://arxiv.org/abs/2506.01592v1
- Date: Mon, 02 Jun 2025 12:28:03 GMT
- Title: Statement-Tuning Enables Efficient Cross-lingual Generalization in Encoder-only Models
- Authors: Ahmed Elshabrawy, Thanh-Nhi Nguyen, Yeeun Kang, Lihan Feng, Annant Jain, Faadil Abdullah Shaikh, Jonibek Mansurov, Mohamed Fazli Mohamed Imam, Jesus-German Ortiz-Barajas, Rendi Chevi, Alham Fikri Aji,
- Abstract summary: Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models has been challenging.<n>Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates.<n>We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization.
- Score: 7.467951065154891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) excel in zero-shot and few-shot tasks, but achieving similar performance with encoder-only models like BERT and RoBERTa has been challenging due to their architecture. However, encoders offer advantages such as lower computational and memory costs. Recent work adapts them for zero-shot generalization using Statement Tuning, which reformulates tasks into finite templates. We extend this approach to multilingual NLP, exploring whether encoders can achieve zero-shot cross-lingual generalization and serve as efficient alternatives to memory-intensive LLMs for low-resource languages. Our results show that state-of-the-art encoder models generalize well across languages, rivaling multilingual LLMs while being more efficient. We also analyze multilingual Statement Tuning dataset design, efficiency gains, and language-specific generalization, contributing to more inclusive and resource-efficient NLP models. We release our code and models.
Related papers
- Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages [9.066355705304984]
We propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages.<n>By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder.<n>Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks.
arXiv Detail & Related papers (2025-02-15T16:53:10Z) - Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.<n>For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.<n>We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z) - LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language.<n>We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z) - Deriving Coding-Specific Sub-Models from LLMs using Resource-Efficient Pruning [4.762390044282733]
Large Language Models (LLMs) have demonstrated their exceptional performance in various complex code generation tasks.<n>To mitigate such requirements, model pruning techniques are used to create more compact models with significantly fewer parameters.<n>In this work, we explore the idea of efficiently deriving coding-specific sub-models through unstructured pruning.
arXiv Detail & Related papers (2025-01-09T14:00:01Z) - LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision.<n>LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks.<n>We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z) - Can General-Purpose Large Language Models Generalize to English-Thai Machine Translation ? [2.1969983462375318]
Large language models (LLMs) perform well on common tasks but struggle with generalization in low-resource and low-computation settings.
We examine this limitation by testing various LLMs and specialized translation models on English-Thai machine translation and code-switching datasets.
arXiv Detail & Related papers (2024-10-22T16:26:03Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - The Ups and Downs of Large Language Model Inference with Vocabulary Trimming by Language Heuristics [74.99898531299148]
This research examines vocabulary trimming (VT) inspired by restricting embedding entries to the language of interest to bolster time and memory efficiency.
We apply two languages to trim the full vocabulary - Unicode-based script filtering and corpus-based selection - to different language families and sizes.
It is found that VT reduces the memory usage of small models by nearly 50% and has an upper bound of 25% improvement in generation speed.
arXiv Detail & Related papers (2023-11-16T09:35:50Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.