Related papers: OLA: Output Language Alignment in Code-Switched LLM Interactions

URL: http://arxiv.org/abs/2601.03589v1
Date: Wed, 07 Jan 2026 05:07:22 GMT
Title: OLA: Output Language Alignment in Code-Switched LLM Interactions
Authors: Juhyun Oh, Haneul Yoo, Faiz Ghifari Haznitrama, Alice Oh
Abstract summary: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean--English code-switching and spans simple intra-sentential mixing to instruction-content mismatches.
Score: 31.119553472916234
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models (LLMs). When a user code-switches in their prompt to an LLM, they typically do not specify the expected language of the LLM response, and thus LLMs must infer the output language from contextual and pragmatic cues. We find that current LLMs systematically fail to align with this expectation, responding in undesired languages even when cues are clear to humans. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean--English code-switching and spans simple intra-sentential mixing to instruction-content mismatches. Even frontier models frequently misinterpret implicit language expectation, exhibiting a bias toward non-English responses. We further show this bias generalizes beyond Korean to Chinese and Indonesian pairs. Models also show instability through mid-response switching and language intrusions. Chain-of-Thought prompting fails to resolve these errors, indicating weak pragmatic reasoning about output language. However, Code-Switching Aware DPO with minimal data (about 1K examples) substantially reduces misalignment, suggesting these failures stem from insufficient alignment rather than fundamental limitations. Our results highlight the need to align multilingual LLMs with users' implicit expectations in real-world code-switched interactions.
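
The abstract mentions two observable failure modes: responding in an undesired language, and switching language mid-response. As a rough illustration only (this is not the OLA benchmark's metric or code), the Python sketch below labels a Korean--English response by counting Hangul versus ASCII Latin letters, then checks whether the dominant script changes between sentences. The function names, the sentence-splitting rule, and the character ranges used are assumptions made for this example.

```python
import re

# Assumed Unicode ranges for Korean text: Hangul Jamo, Hangul Compatibility
# Jamo, and Hangul Syllables. Other scripts are ignored by this heuristic.
HANGUL_RANGES = ((0x1100, 0x11FF), (0x3130, 0x318F), (0xAC00, 0xD7A3))


def script_counts(text):
    """Count Hangul characters ('ko') and ASCII letters ('en') in text."""
    counts = {"ko": 0, "en": 0}
    for ch in text:
        cp = ord(ch)
        if any(lo <= cp <= hi for lo, hi in HANGUL_RANGES):
            counts["ko"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["en"] += 1
    return counts


def dominant_language(text):
    """Return 'ko' or 'en' by character count, or None if neither appears.

    Ties resolve to 'ko' because it is listed first; a real evaluation
    would need a more careful rule.
    """
    counts = script_counts(text)
    if sum(counts.values()) == 0:
        return None
    return max(counts, key=counts.get)


def sentence_languages(text):
    """Label each sentence, splitting naively after '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [dominant_language(s) for s in sentences if s]


def has_mid_response_switch(text):
    """True if sentences in the response differ in dominant script."""
    labels = {lang for lang in sentence_languages(text) if lang is not None}
    return len(labels) > 1


if __name__ == "__main__":
    response = "안녕하세요. This part is written in English."
    print(dominant_language(response))
    print(sentence_languages(response))
    print(has_mid_response_switch(response))
```

A character-count heuristic like this will mislabel responses that legitimately contain English loanwords, code identifiers, or quoted text inside a Korean answer, so it is best treated as a first-pass filter rather than a measure of alignment with the user's expected language.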

Related papers

Can Large Language Models Understand, Reason About, and Generate Code-Switched Text? [26.210664542372168]
Code-switching is a pervasive phenomenon in multilingual communication, yet the robustness of large language models (LLMs) in mixed-language settings remains insufficiently understood. We introduce CodeMixQA, a novel benchmark with high-quality human annotations, comprising 16 diverse parallel code-switched language-pair variants. We analyze the reasoning behavior of LLMs on code-switched question-answering tasks, shedding light on how models process and reason over mixed-language inputs.
arXiv Detail & Related papers (2026-01-12T02:52:38Z)
Language Confusion Gate: Language-Aware Decoding Through Model Self-Distillation [50.93756215410832]
This paper introduces the Language Confusion Gate (LCG), a lightweight, plug-in solution that filters tokens during decoding. The LCG is trained using norm-adjusted self-distillation to predict appropriate language families and apply masking only when needed.
arXiv Detail & Related papers (2025-10-20T14:02:37Z)
Lost in the Mix: Evaluating LLM Understanding of Code-Switched Text [25.05270733872823]
Code-switching (CSW) is the act of alternating between two or more languages within a single discourse. Large Language Models (LLMs) are now central to content and communication generation.
arXiv Detail & Related papers (2025-06-16T21:19:27Z)
CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models [18.378069426713]
Code-switching (CS) poses a significant challenge for Large Language Models (LLMs). We introduce CS-Sum to evaluate how well LLMs comprehend CS, through summarization of CS dialogue into English. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English, Tamil-English, and Malay-English.
arXiv Detail & Related papers (2025-05-19T09:18:14Z)
Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models [49.16690802656554]
We find that multilingual factual models struggle to provide consistent responses to semantically equivalent prompts in different languages. We propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency.
arXiv Detail & Related papers (2025-04-05T19:43:10Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Let Models Speak Ciphers: Multiagent Debate through Embeddings [84.20336971784495]
We introduce CIPHER (Communicative Inter-Model Protocol Through Embedding Representation) to address this issue. By deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights. This showcases the superiority and robustness of embeddings as an alternative "language" for communication among LLMs.
arXiv Detail & Related papers (2023-10-10T03:06:38Z)
InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning [66.31509106146605]
Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages. However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data. We propose InstructAlign which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages.
arXiv Detail & Related papers (2023-05-23T02:51:34Z)
Augmented Language Models: a Survey [55.965967655575454]
This survey reviews works in which language models (LMs) are augmented with reasoning skills and the ability to use tools. We refer to them as Augmented Language Models (ALMs). The missing token objective allows ALMs to learn to reason, use tools, and even act, while still performing standard natural language tasks.
arXiv Detail & Related papers (2023-02-15T18:25:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.