Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
- URL: http://arxiv.org/abs/2512.02841v1
- Date: Tue, 02 Dec 2025 14:54:54 GMT
- Title: Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
- Authors: Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens
- Abstract summary: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior.
- Score: 61.18573330164572
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we uncover that certain prompt components, such as chain-of-thought (CoT), emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns, while reducing unnecessary language-switching. Together, these findings highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.
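To make the setup concrete, below is a minimal sketch of how candidate system prompts could be scored across languages and selected. The abstract does not name the four evaluation dimensions, so mean accuracy and a best-vs-worst-language consistency proxy stand in for them here; `query_model` is a hypothetical inference wrapper, and the greedy search is only a stand-in for the paper's optimization framework.

```python
# A minimal sketch, assuming a benchmark of items with per-language questions
# and a gold answer. Metric names are illustrative stand-ins, not the paper's
# four dimensions; `query_model` is a hypothetical inference wrapper.
from statistics import mean

LANGUAGES = ["en", "zh", "es", "hi", "ru"]  # illustrative language set


def query_model(system_prompt: str, question: str) -> str:
    """Hypothetical wrapper around an LLM chat/completions API."""
    raise NotImplementedError


def evaluate_prompt(system_prompt: str, benchmark: list) -> tuple:
    """Score one system prompt on every language of the benchmark."""
    per_lang = {}
    for lang in LANGUAGES:
        correct = [
            query_model(system_prompt, item["question"][lang]) == item["answer"]
            for item in benchmark
        ]
        per_lang[lang] = mean(correct)
    accuracy = mean(per_lang.values())
    # Robustness proxy: penalize the gap between best and worst language.
    consistency = 1.0 - (max(per_lang.values()) - min(per_lang.values()))
    return accuracy, consistency


def select_best(candidates: list, benchmark: list) -> str:
    """Greedy search standing in for the paper's optimization framework."""
    return max(candidates, key=lambda p: sum(evaluate_prompt(p, benchmark)))
```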
Related papers
- An Empirical Study on the Effects of System Prompts in Instruction-Tuned Models for Code Generation [4.76360912129794]
We systematically evaluate how system prompts affect code assistants. We find that increasing system-prompt constraint specificity does not monotonically improve correctness. For larger code-specialized models, few-shot examples can degrade performance relative to zero-shot generation.
arXiv Detail & Related papers (2026-02-16T22:11:21Z)
- PolyPrompt: Automating Knowledge Extraction from Multilingual Language Models with Dynamic Prompt Generation [0.0]
We introduce PolyPrompt, a novel, parameter-efficient framework for enhancing the multilingual capabilities of large language models (LLMs). Our method learns a set of trigger tokens for each language through a gradient-based search, identifies the input query's language, and prepends the corresponding trigger tokens to the prompt during inference (a rough sketch follows this entry). We perform experiments on two 1-billion-parameter models, with evaluations on the global MMLU benchmark across fifteen typologically and resource-diverse languages, demonstrating accuracy gains of 3.7%-19.9% over naive and translation-pipeline baselines.
arXiv Detail & Related papers (2025-02-27T04:41:22Z)
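As a rough illustration of the PolyPrompt inference step, the sketch below prepends per-language learned trigger embeddings to the input. The class name, tensor shapes, and the external language detector are assumptions for illustration, not the paper's published API.

```python
# A minimal sketch, assuming per-language trigger embeddings learned offline
# via gradient-based search. Names and shapes are illustrative assumptions.
import torch


class TriggerTokenPrompter(torch.nn.Module):
    def __init__(self, languages, num_triggers: int, hidden_size: int):
        super().__init__()
        # One learnable [num_triggers, hidden_size] block per language.
        self.triggers = torch.nn.ParameterDict({
            lang: torch.nn.Parameter(0.02 * torch.randn(num_triggers, hidden_size))
            for lang in languages
        })

    def forward(self, input_embeds: torch.Tensor, lang: str) -> torch.Tensor:
        # input_embeds: [batch, seq_len, hidden_size]. Prepend the detected
        # language's trigger embeddings before the frozen LLM consumes them.
        batch = input_embeds.size(0)
        trig = self.triggers[lang].unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([trig, input_embeds], dim=1)


# Usage, assuming an external language detector and an embedding layer:
#   lang = detect_language(query)              # hypothetical detector
#   embeds = prompter(llm.embed(tokens), lang)
```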
- LUSIFER: Language Universal Space Integration for Enhanced Multilingual Embeddings with Large Language Models [89.13128402847943]
We present LUSIFER, a novel zero-shot approach that adapts LLM-based embedding models for multilingual tasks without requiring multilingual supervision. LUSIFER's architecture combines a multilingual encoder, serving as a language-universal learner, with an LLM-based embedding model optimized for embedding-specific tasks. We introduce a new benchmark encompassing 5 primary embedding tasks, 123 diverse datasets, and coverage across 14 languages.
arXiv Detail & Related papers (2025-01-01T15:43:07Z)
- Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions.
Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks.
Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities (a sketch of such a multi-turn harness follows this entry).
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
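A hedged sketch of what a Multi-IF-style multi-turn harness might look like: each turn's reply is checked against that turn's constraints while the conversation history accumulates. The message format and checker interface are assumptions about the benchmark's design, not its published code.

```python
# A minimal sketch, assuming each turn pairs an instruction with programmatic
# constraint checkers; the chat format and checkers are assumptions.
def run_dialogue(model, turns: list) -> float:
    """turns: list of (instruction, [checker, ...]) pairs in one language."""
    history, passed = [], 0
    for instruction, checkers in turns:
        history.append({"role": "user", "content": instruction})
        reply = model(history)  # hypothetical chat callable
        history.append({"role": "assistant", "content": reply})
        # A turn counts only if every constraint on the reply is satisfied,
        # so early mistakes propagate through the shared history.
        passed += all(check(reply) for check in checkers)
    return passed / len(turns)  # per-dialogue instruction-following accuracy
```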
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT). We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z)
- Zero-shot Cross-lingual Transfer of Prompt-based Tuning with a Unified Multilingual Prompt [98.26682501616024]
We propose UniPrompt, a novel model that uses a single unified prompt for all languages.
The unified prompt is computed by a multilingual PLM to produce a language-independent representation (see the sketch after this entry).
Our proposed method significantly outperforms strong baselines across different languages.
arXiv Detail & Related papers (2022-02-23T11:57:52Z)
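To illustrate the UniPrompt idea of one prompt shared by all languages, here is a minimal sketch in which a multilingual PLM encodes the prompt once and its hidden states serve as language-independent soft prompt vectors. The model choice and the downstream wiring are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch, assuming the shared prompt is encoded by a multilingual
# PLM and its hidden states are reused as soft prompt vectors for every
# query language. Model choice is an illustrative assumption.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")


def unified_prompt(prompt_text: str) -> torch.Tensor:
    """Encode one prompt string shared by all languages into soft vectors."""
    inputs = tok(prompt_text, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state  # [1, prompt_len, hidden]
    return hidden


soft = unified_prompt("Classify the sentiment of the sentence.")
# `soft` would be concatenated with the task input's embeddings downstream,
# so the prompt representation itself stays language-independent.
```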