Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models
- URL: http://arxiv.org/abs/2601.05366v1
- Date: Thu, 08 Jan 2026 20:44:28 GMT
- Title: Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models
- Authors: Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu
- Abstract summary: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. We introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo.
- Score: 5.6688028729584055
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user's language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.
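The dominant failure mode described above can be made concrete with a small sketch. The tool schema, function name, and validator below are hypothetical illustrations, not the MLCL benchmark's actual harness: the model selects the right tool and understands intent, but emits an enum parameter value in the user's language, violating the language-invariant execution convention.

```python
# Hypothetical illustration of the "parameter value language mismatch"
# failure mode: the schema and example calls are invented for this sketch.

WEATHER_TOOL = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string"},  # free-form, language-tolerant
        "unit": {"type": "string",
                 "enum": ["celsius", "fahrenheit"]},  # language-invariant
    },
}

def validate_call(tool, args):
    """Return the parameters whose values violate the tool's
    language-invariant conventions (here: enum membership)."""
    errors = []
    for name, spec in tool["parameters"].items():
        allowed = spec.get("enum")
        if allowed is not None and args.get(name) not in allowed:
            errors.append(name)
    return errors

# Correct intent and tool selection, but the enum value was generated
# in the user's language (Chinese, "摄氏度" = Celsius), so execution fails.
bad_call = {"city": "北京", "unit": "摄氏度"}
good_call = {"city": "北京", "unit": "celsius"}

print(validate_call(WEATHER_TOOL, bad_call))   # ['unit']
print(validate_call(WEATHER_TOOL, good_call))  # []
```

Note that the free-form `city` parameter passes in either language; only the value bound by an execution convention (the enum) breaks, which matches the paper's observation that intent understanding and tool selection are often correct when these failures occur.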
Related papers
- Layer-Targeted Multilingual Knowledge Erasure in Large Language Models [15.409568435026015]
We identify intervention depth as the key factor determining multilingual generalization. We propose MUTE, a framework that uses Centered Kernel Alignment (CKA) and Linguistic Regions Development Score (LRDS) to identify intermediate, language-agnostic layers.
arXiv Detail & Related papers (2026-02-26T03:00:07Z) - SteerEval: Inference-time Interventions Strengthen Multilingual Generalization in Neural Summarization Metrics [33.30877107523988]
A major empirical bottleneck in this area is the shortage of accurate and robust evaluation metrics for many languages. Recent studies suggest that multilingual language models often use English as an internal pivot language. Motivated by the hypothesis that this mismatch could also apply to multilingual neural metrics, we ask whether steering their activations toward an English pivot can improve correlation with human judgments.
arXiv Detail & Related papers (2026-01-22T09:49:29Z) - Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks [22.908904483320953]
The performance of Large Language Models (LLMs) in coding tasks often reflects their extensive pre-training corpora. We propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks.
arXiv Detail & Related papers (2026-01-16T09:06:47Z) - Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation [4.45354703148321]
Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance. We present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation.
arXiv Detail & Related papers (2025-11-28T12:40:30Z) - Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation [11.110312833458421]
We study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our experiments reveal that the drift results from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language.
arXiv Detail & Related papers (2025-11-13T05:36:31Z) - Improving Large Language Models Function Calling and Interpretability via Guided-Structured Templates [56.73907811047611]
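As a rough illustration of what a soft, training-free decoding constraint of this kind might look like: the SCD paper's actual mechanism is not specified in the summary above, so the toy vocabulary, bias value, and additive-bias scheme below are assumptions, not the paper's implementation.

```python
# Sketch of a soft decoding constraint: instead of hard-masking
# non-target-language tokens, add a small additive bias to tokens
# belonging to the target language, leaving all tokens reachable.
import math

def soft_constrained_logits(logits, target_lang_ids, bias=2.0):
    """Gently steer next-token selection toward the target language
    (a soft constraint, not a hard mask)."""
    return [lp + (bias if i in target_lang_ids else 0.0)
            for i, lp in enumerate(logits)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy vocabulary: ids 0-1 are English tokens, ids 2-3 target-language tokens.
logits = [2.0, 1.5, 1.0, 0.5]  # English tokens dominate (drift)
steered = soft_constrained_logits(logits, target_lang_ids={2, 3})
probs = softmax(steered)
# Target-language mass now exceeds half, yet English tokens retain
# nonzero probability, so fluency-critical tokens are never forbidden.
print(probs[2] + probs[3] > 0.5)  # True
```

The design point worth noting is the "soft" part: a hard constraint (masking logits to negative infinity) would forbid code-switching and borrowed terms entirely, whereas an additive bias only shifts the distribution toward the target language.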
Large language models (LLMs) have demonstrated strong reasoning and tool-use capabilities. However, LLMs often fail in real-world tool interactions due to incorrect parameterization, poor tool selection, or misinterpretation of user intent. We introduce a curriculum-inspired framework that leverages structured reasoning templates to guide LLMs through more deliberate step-by-step instructions for generating function calls.
arXiv Detail & Related papers (2025-09-22T17:55:14Z) - Teaching a Language Model to Speak the Language of Tools [0.0]
This work presents a methodology for adapting existing language models to enable robust tool use in any target language. The research introduces TUCAN, which achieves up to a 28.75% improvement in function-calling accuracy over base models.
arXiv Detail & Related papers (2025-06-29T20:47:27Z) - OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities [54.152681077418805]
Current detection approaches are fallible and are particularly susceptible to attacks that exploit mismatched generalizations of model capabilities. We propose OMNIGUARD, an approach for detecting harmful prompts across languages and modalities. Our approach improves harmful prompt classification accuracy by 11.57% over the strongest baseline in a multilingual setting.
arXiv Detail & Related papers (2025-05-29T05:25:27Z) - Paths Not Taken: Understanding and Mending the Multilingual Factual Recall Pipeline [36.2731426595852]
We find that multilingual large language models (LLMs) exhibit significantly better performance in factual recall tasks in English than in other languages. We identify two primary sources of error: insufficient engagement of the reliable English-centric mechanism for factual recall, and incorrect translation from English back into the target language. Our interventions combined increase the recall accuracy by over 35 percent for the lowest-performing language.
arXiv Detail & Related papers (2025-05-26T22:20:45Z) - Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion. We show that confusion points (CPs) are central to this phenomenon. We show that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z) - Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey [52.00674453604779]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z) - SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences arising from its use.