Related papers: Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation

Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation

URL: http://arxiv.org/abs/2509.14477v1
Date: Wed, 17 Sep 2025 23:13:47 GMT
Title: Ticket-Bench: A Kickoff for Multilingual and Regionalized Agent Evaluation
Authors: Thales Sales Almeida, João Guilherme Alves Santos, Thiago Laitz, Giovana Kerche Bonás,
Abstract summary: We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios.<n>Ticket-Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French.<n>We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages.
Score: 4.563830993050022
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed as task-oriented agents, where success depends on their ability to generate accurate function calls under realistic, multilingual conditions. However, existing agent evaluations largely overlook cultural and linguistic diversity, often relying on monolingual or naively translated benchmarks. We introduce Ticket-Bench, a benchmark for multilingual agent evaluation in task-oriented scenarios. Ticket-Bench simulates the domain of soccer ticket purchases across six major languages: Portuguese, English, Spanish, German, Italian, and French. Using localized teams, cities, and user profiles to provide a higher level of realism. We evaluate a wide range of commercial and open-source LLMs, measuring function-calling accuracy and consistency across languages. Results show that reasoning-oriented models (e.g., GPT-5, Qwen3-235B) dominate performance but still exhibit notable cross-lingual disparities. These findings underscore the need for culturally aware, multilingual benchmarks to guide the development of robust LLM agents.

Related papers

"Be My Cheese?": Assessing Cultural Nuance in Multilingual LLM Translations [0.0]
This pilot study explores the localisation capabilities of state-of-the-art multilingual AI models when translating figurative language.<n>It focuses on cultural appropriateness and overall localisation quality - critical factors for real-world applications like marketing and e-commerce.
arXiv Detail & Related papers (2025-09-25T20:55:36Z)
The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks [0.0]
We introduce the AI Language Monitor, a comprehensive benchmark that assesses large language models (LLMs) performance across up to 200 languages.<n>Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC.<n>We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance.
arXiv Detail & Related papers (2025-07-11T12:38:02Z)
MuBench: Assessment of Multilingual Capabilities of Large Language Models Across 61 Languages [33.450081592217074]
We introduce MuBench, a benchmark covering 61 languages and evaluating a broad range of capabilities.<n>We evaluate several state-of-the-art multilingual LLMs and find notable gaps between claimed and actual language coverage.
arXiv Detail & Related papers (2025-06-24T09:53:00Z)
GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models [24.450836403716547]
GlotEval is a lightweight framework designed for massively multilingual evaluation.<n>It supports seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation) spanning over dozens to hundreds of languages.<n>It enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts.
arXiv Detail & Related papers (2025-04-05T12:30:58Z)
MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [86.7047714187813]
MMLU-ProX is a benchmark covering 29 languages, built on an English benchmark.<n>Each language version consists of 11,829 identical questions, enabling direct cross-linguistic comparisons.<n>To meet efficient evaluation needs, we provide a lite version containing 658 questions per language.
arXiv Detail & Related papers (2025-03-13T15:59:20Z)
P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.<n>P-MMEval delivers consistent language coverage across various datasets and provides parallel samples.<n>We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models [62.91524967852552]
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora.<n>But can these models relate corresponding concepts across languages, i.e., be crosslingual?<n>This study evaluates state-of-the-art LLMs on inherently crosslingual tasks.
arXiv Detail & Related papers (2024-06-23T15:15:17Z)
Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT)<n>We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level.<n>We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
Breaking Language Barriers in Multilingual Mathematical Reasoning: Insights and Observations [59.056367787688146]
This paper pioneers exploring and training powerful Multilingual Math Reasoning (xMR) LLMs. We construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages. By utilizing translation, we construct the first multilingual math reasoning instruction dataset, MGSM8KInstruct, encompassing ten distinct languages.
arXiv Detail & Related papers (2023-10-31T08:09:20Z)
Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages. In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
Multilingual Large Language Models Are Not (Yet) Code-Switchers [41.47534626749588]
Large Language Models (LLMs) have recently shown great capabilities in a wide range of tasks. The practice of alternating languages within an utterance remains relatively uncharted. We argue that current "multilingualism" in LLMs does not inherently imply proficiency with code-switching texts.
arXiv Detail & Related papers (2023-05-23T16:50:48Z)
XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization [128.37244072182506]
Cross-lingual TRansfer Evaluation of Multilinguals XTREME is a benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models.
arXiv Detail & Related papers (2020-03-24T19:09:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.