Related papers: ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

ConLID: Supervised Contrastive Learning for Low-Resource Language Identification

URL: http://arxiv.org/abs/2506.15304v1
Date: Wed, 18 Jun 2025 09:35:33 GMT
Title: ConLID: Supervised Contrastive Learning for Low-Resource Language Identification
Authors: Negar Foroutan, Jakhongir Saydaliev, Ye Eun Kim, Antoine Bosselut,
Abstract summary: We propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages.<n>We show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%.
Score: 14.504528263331075
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language identification (LID) is a critical step in curating multilingual LLM pretraining corpora from web crawls. While many studies on LID model training focus on collecting diverse training data to improve performance, low-resource languages -- often limited to single-domain data, such as the Bible -- continue to perform poorly. To resolve these class imbalance and bias issues, we propose a novel supervised contrastive learning (SCL) approach to learn domain-invariant representations for low-resource languages. Through an extensive analysis, we show that our approach improves LID performance on out-of-domain data for low-resource languages by 3.2%, demonstrating its effectiveness in enhancing LID models.

Related papers

Bridging Language Gaps: Enhancing Few-Shot Language Adaptation [32.157041759856]
The disparity in language resources poses a challenge in multilingual NLP.<n>High-resource languages benefit from extensive data, while low-resource languages lack sufficient data for effective training.<n>Our Contrastive Language Alignment with Prompting (CoLAP) method addresses this gap by integrating contrastive learning with cross-lingual representations.
arXiv Detail & Related papers (2025-08-26T22:49:17Z)
Improving Multilingual Language Models by Aligning Representations through Steering [10.159957091670883]
This paper investigates how Large Language Models (LLMs) represent non-English tokens.<n>We propose a lightweight intervention method using representation steering, where a learned vector is added to the residual stream at a single model layer to enhance multilingual performance.
arXiv Detail & Related papers (2025-05-19T00:14:43Z)
Cross-Lingual Transfer for Low-Resource Natural Language Processing [0.32634122554914]
Cross-lingual transfer learning is a research area aimed at leveraging data and models from high-resource languages to improve NLP performance.<n>This thesis presents a new method to improve data-based transfer with T-Projection, a state-of-the-art annotation projection method.<n>For model-based transfer, we introduce a constrained decoding algorithm that enhances cross-lingual Sequence Labeling in zero-shot settings.<n>Finally, we develop Medical mT5, the first multilingual text-to-text medical model.
arXiv Detail & Related papers (2025-02-04T21:17:46Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages.<n>For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively.<n>We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL)<n>We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process. Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
Self-training Large Language Models through Knowledge Detection [26.831873737733737]
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across downstream tasks. This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on unknown data samples. Empirical evaluations demonstrate significant improvements in reducing hallucination in generation across multiple subjects.
arXiv Detail & Related papers (2024-06-17T07:25:09Z)
Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking [1.3716808114696444]
Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations.
arXiv Detail & Related papers (2024-05-07T21:58:45Z)
Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning. We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups. Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) have been pre-trained on multilingual corpora. Their performance still lags behind in most languages compared to a few resource-rich languages.
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
From Classification to Generation: Insights into Crosslingual Retrieval Augmented ICL [8.065775937617417]
We introduce a novel approach that leverages cross-lingual retrieval-augmented in-context learning (CREA-ICL) By extracting semantically similar prompts from high-resource languages, we aim to improve the zero-shot performance of multilingual pre-trained language models (MPLMs) Though our approach yields steady improvements in classification tasks, it faces challenges in generation tasks.
arXiv Detail & Related papers (2023-11-11T15:40:21Z)
Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements. We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations. Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.