Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
- URL: http://arxiv.org/abs/2506.21445v1
- Date: Thu, 26 Jun 2025 16:31:10 GMT
- Title: Text2Cypher Across Languages: Evaluating Foundational Models Beyond English
- Authors: Makbule Gulcin Ozsoy, William Tai
- Abstract summary: This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large language models have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses solely on English, with limited evaluation in other languages. This paper investigates the performance of foundational LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual test set by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. We evaluate multiple foundational models using standardized prompts and metrics. Our results show a consistent performance pattern: highest on English, then Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic characteristics. Additionally, we explore the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Our findings highlight the need for more inclusive evaluation and development in multilingual query generation. Future work includes schema localization and fine-tuning across diverse languages.
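A minimal sketch of the evaluation setup the abstract describes, assuming a hypothetical `call_llm` wrapper and exact string match in place of the paper's standardized metrics (the schema, prompt template, and test item below are illustrative, not taken from the released dataset):

```python
# Cross-lingual Text2Cypher evaluation sketch.
# Assumptions: `call_llm` is a hypothetical wrapper around some foundational
# model API; exact string match stands in for the paper's actual metrics.

SCHEMA = "(:Person {name: STRING})-[:ACTED_IN]->(:Movie {title: STRING})"

PROMPT_TEMPLATE = (
    "Task: Generate a Cypher statement to answer the question below.\n"
    "Schema: {schema}\n"
    "Question: {question}\n"
    "Cypher:"
)

# One test item: a single gold Cypher query paired with the same question
# in English, Spanish, and Turkish.
TEST_ITEM = {
    "gold_cypher": (
        "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'Inception'}) "
        "RETURN p.name"
    ),
    "questions": {
        "en": "Who acted in Inception?",
        "es": "¿Quién actuó en Inception?",
        "tr": "Inception filminde kim oynadı?",
    },
}


def call_llm(prompt: str) -> str:
    """Hypothetical model call; replace with a real API client."""
    raise NotImplementedError


def evaluate(item: dict) -> dict:
    """Score one item per language; the gold query is shared across languages."""
    scores = {}
    for lang, question in item["questions"].items():
        prompt = PROMPT_TEMPLATE.format(schema=SCHEMA, question=question)
        scores[lang] = call_llm(prompt).strip() == item["gold_cypher"]
    return scores
```

Because the gold Cypher query is identical across the three languages, any per-language score differences can be attributed to the question language rather than to the target query.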
Related papers
- Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate [36.641755706551336]
Large language models (LLMs) provide detailed and impressive responses to queries in English. But are they really consistent when responding to the same query in other languages? We propose a framework to evaluate an LLM's cross-lingual consistency based on a simple Translate-then-Evaluate strategy (a minimal sketch of this strategy appears after this list).
arXiv Detail & Related papers (2025-05-28T06:00:21Z)
- mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models. We present results for both multilingual (XX-XX) and cross-lingual (En-XX) settings. We see strong cross-lingual performance from English-based retrievers trained with instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z)
- The Roles of English in Evaluating Multilingual Language Models [6.396057276543912]
We argue that these roles have different goals: task performance versus language understanding. We recommend moving away from this imprecise method and instead focusing on furthering language understanding.
arXiv Detail & Related papers (2024-12-11T14:02:55Z)
- The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
- BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format. BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer. Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z)
- Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD, a parallel and large-scale multilingual conversation dataset for cross-lingual alignment pretraining. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
- MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, covering seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Experimental results under three typical settings (zero-shot, monolingual, and multilingual) reveal a 6.1% absolute drop in accuracy for non-English languages. We also propose a simple schema-augmentation framework, SAVe (Schema-Augmentation-with-Verification), which boosts overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
- XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks. This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query. We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)
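As referenced in the first entry above, here is a minimal sketch of the Translate-then-Evaluate consistency idea. All three helpers are hypothetical stand-ins; that paper does not prescribe a specific model, translator, or similarity metric:

```python
# Translate-then-Evaluate consistency check (sketch).
# Assumptions: `call_llm`, `translate`, and `similarity` are hypothetical
# helpers standing in for a model under test, an MT system, and any
# answer-comparison metric (e.g. embedding cosine similarity).

def call_llm(query: str) -> str:
    """Hypothetical: answer a query with the LLM under test."""
    raise NotImplementedError

def translate(text: str, target_lang: str) -> str:
    """Hypothetical: translate text into the target language."""
    raise NotImplementedError

def similarity(a: str, b: str) -> float:
    """Hypothetical: score agreement between two answers in [0, 1]."""
    raise NotImplementedError

def cross_lingual_consistency(query_en: str, lang: str) -> float:
    """Compare the English answer against the back-translated answer
    obtained by asking the same query in another language."""
    answer_en = call_llm(query_en)
    answer_xx = call_llm(translate(query_en, lang))
    return similarity(answer_en, translate(answer_xx, "en"))
```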
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.