CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark
- URL: http://arxiv.org/abs/2601.08331v1
- Date: Tue, 13 Jan 2026 08:42:03 GMT
- Title: CLaS-Bench: A Cross-Lingual Alignment and Steering Benchmark
- Authors: Daniil Gurgurov, Yusser Al Ghussin, Tanja Baeumel, Cheng-Ting Chou, Patrick Schramowski, Marius Mosbach, Josef van Genabith, Simon Ostermann
- Abstract summary: We introduce CLaS-Bench, a benchmark for evaluating language-forcing behavior in large language models (LLMs) across 32 languages. We find that, across languages, the simple residual-based DiffMean method consistently outperforms all other methods.
- Score: 21.574271160875046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding and controlling the behavior of large language models (LLMs) is an increasingly important topic in multilingual NLP. Beyond prompting or fine-tuning, steering, i.e., manipulating internal representations during inference, has emerged as a more efficient and interpretable technique for adapting models to a target language. Yet, no dedicated benchmarks or evaluation protocols exist to quantify the effectiveness of steering techniques. We introduce CLaS-Bench, a lightweight parallel-question benchmark for evaluating language-forcing behavior in LLMs across 32 languages, enabling systematic evaluation of multilingual steering methods. We evaluate a broad array of steering techniques, including residual-stream DiffMean interventions, probe-derived directions, language-specific neurons, PCA/LDA vectors, Sparse Autoencoders, and prompting baselines. Steering performance is measured along two axes, language control and semantic relevance, combined into a single harmonic-mean steering score. We find that, across languages, the simple residual-based DiffMean method consistently outperforms all other methods. Moreover, a layer-wise analysis reveals that language-specific structure emerges predominantly in later layers and that steering directions cluster by language family. CLaS-Bench is the first standardized benchmark for multilingual steering, enabling both rigorous scientific analysis of language representations and practical evaluation of steering as a low-cost adaptation alternative.
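The two quantities the abstract names can be sketched in a few lines: a difference-of-means (DiffMean) steering direction added to the residual stream, and the harmonic-mean steering score combining language control and semantic relevance. This is an illustrative reconstruction, not the authors' code; the function names and the plain list-of-floats activation format are assumptions.

```python
def diffmean_direction(h_target, h_source):
    """Difference-of-means direction: mean activation over target-language
    prompts minus mean over source-language prompts, per dimension."""
    d = len(h_target[0])
    mt = [sum(v[i] for v in h_target) / len(h_target) for i in range(d)]
    ms = [sum(v[i] for v in h_source) / len(h_source) for i in range(d)]
    return [t - s for t, s in zip(mt, ms)]

def steer(residual, direction, alpha=1.0):
    """Add the scaled direction to one layer's residual-stream vector."""
    return [r + alpha * u for r, u in zip(residual, direction)]

def steering_score(language_control, semantic_relevance):
    """Harmonic mean of the two evaluation axes named in the abstract."""
    s = language_control + semantic_relevance
    return 0.0 if s == 0 else 2 * language_control * semantic_relevance / s
```

The harmonic mean rewards methods that do well on both axes at once: a method that forces the target language but destroys meaning (or vice versa) scores near zero.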
Related papers
- What Language is This? Ask Your Tokenizer [32.28976119949841]
Language Identification (LID) is an important component of many multilingual natural language processing pipelines. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm. Our formulation is data- and compute-efficient and supports incremental addition of new languages without retraining existing models.
arXiv Detail & Related papers (2026-02-19T18:58:39Z)
- Language Steering for Multilingual In-Context Learning [10.932074928744568]
Large language models' performance on non-English languages remains substantially inferior to English. We propose language vectors, a training-free language steering approach. We show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested.
arXiv Detail & Related papers (2026-02-02T16:52:09Z)
- LangGPS: Language Separability Guided Data Pre-Selection for Joint Multilingual Instruction Tuning [49.22807995935406]
Joint multilingual instruction tuning is a widely adopted approach to improving the multilingual instruction-following ability and downstream performance of large language models (LLMs). Existing selection methods, often based on features like text quality, diversity, or task relevance, typically overlook the intrinsic linguistic structure of multilingual data. We propose LangGPS, a lightweight two-stage pre-selection framework guided by language separability, which quantifies how well samples in different languages can be distinguished in the model's representation space.
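The snippet above does not spell out how LangGPS computes separability, so the sketch below uses one plausible proxy: leave-one-out nearest-centroid accuracy, i.e., how often a sample's language can be recovered from its representation. The function name and the choice of metric are assumptions, not taken from the paper.

```python
def language_separability(reps, labels):
    """Leave-one-out nearest-centroid accuracy over language labels.
    1.0 means languages are perfectly separable in representation space."""
    langs = sorted(set(labels))
    correct = 0
    for i, (x, y) in enumerate(zip(reps, labels)):
        best, best_dist = None, float("inf")
        for lang in langs:
            # centroid of this language's samples, excluding the held-out one
            pts = [reps[j] for j, l in enumerate(labels) if l == lang and j != i]
            if not pts:
                continue
            c = [sum(p[k] for p in pts) / len(pts) for k in range(len(x))]
            dist = sum((a - b) ** 2 for a, b in zip(x, c))
            if dist < best_dist:
                best, best_dist = lang, dist
        correct += best == y
    return correct / len(labels)
```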
arXiv Detail & Related papers (2025-11-13T12:02:32Z)
- Language steering in latent space to mitigate unintended code-switching [1.1330938617817454]
Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations. Our approach mitigates code-switching while preserving semantics with negligible computational overhead.
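The PCA step described here can be sketched in pure Python: take per-pair activation differences between parallel sentences in two languages, center them, and extract the top principal component via power iteration. The function name and iteration details are illustrative assumptions, not the paper's implementation.

```python
def top_language_direction(h_a, h_b, iters=200):
    """Top principal component of per-pair activation differences between
    row-aligned parallel sentences in languages A and B."""
    d = len(h_a[0])
    diffs = [[a - b for a, b in zip(va, vb)] for va, vb in zip(h_a, h_b)]
    mean = [sum(row[k] for row in diffs) / len(diffs) for k in range(d)]
    centered = [[x - m for x, m in zip(row, mean)] for row in diffs]
    v = [1.0] * d
    for _ in range(iters):
        # power iteration: apply X^T X implicitly (scale does not
        # matter for the direction, only for the eigenvalue)
        proj = [sum(x * u for x, u in zip(row, v)) for row in centered]
        w = [sum(p * row[k] for p, row in zip(proj, centered)) for k in range(d)]
        norm = sum(x * x for x in w) ** 0.5 or 1.0
        v = [x / norm for x in w]
    return v
```

The returned unit vector is the "language direction"; steering then amounts to adding or subtracting a multiple of it from the residual stream at inference time.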
arXiv Detail & Related papers (2025-10-11T19:49:38Z)
- Improving Multilingual Language Models by Aligning Representations through Steering [10.159957091670883]
This paper investigates how Large Language Models (LLMs) represent non-English tokens. We propose a lightweight intervention method using representation steering, where a learned vector is added to the residual stream at a single model layer to enhance multilingual performance.
arXiv Detail & Related papers (2025-05-19T00:14:43Z)
- Investigating and Scaling up Code-Switching for Multilingual Language Model Pre-Training [58.696660064190475]
We find that the existence of code-switching, alternating between different languages within a context, is key to multilingual capabilities. To better explore the power of code-switching for language alignment during pre-training, we investigate the strategy of synthetic code-switching.
arXiv Detail & Related papers (2025-04-02T15:09:58Z)
- Assessing Code Generation with Intermediate Languages [6.999311675957218]
This study explores the utilization of intermediate languages, including various programming languages, natural language solutions, and pseudo-code.
Our findings reveal that intermediate languages generally exhibit greater efficacy in larger models that have not yet achieved state-of-the-art performance.
arXiv Detail & Related papers (2024-07-07T15:35:41Z)
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT) and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements.
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)
- Multilingual Chart-based Constituency Parse Extraction from Pre-trained Language Models [21.2879567125422]
We propose a novel method for extracting complete (binary) parses from pre-trained language models.
By applying our method on multilingual PLMs, it becomes possible to induce non-trivial parses for sentences from nine languages.
arXiv Detail & Related papers (2020-04-08T05:42:26Z)
- XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
XTREME (the Cross-lingual TRansfer Evaluation of Multilingual Encoders) is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks.
We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.