Text2Cypher Across Languages: Evaluating and Finetuning LLMs
- URL: http://arxiv.org/abs/2506.21445v2
- Date: Thu, 04 Sep 2025 18:01:56 GMT
- Title: Text2Cypher Across Languages: Evaluating and Finetuning LLMs
- Authors: Makbule Gulcin Ozsoy, William Tai,
- Abstract summary: This paper investigates the performance of both foundational and finetuned LLMs on the Text2Cypher task across multiple languages.<n>We create and release a multilingual dataset by translating English questions into Spanish and Turkish.<n>Using standardized prompts and metrics, we evaluate several foundational models and observe a consistent performance pattern.
- Score: 0.17188280334580197
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in large language models (LLMs) have enabled natural language interfaces that translate user questions into database queries, such as Text2SQL, Text2SPARQL, and Text2Cypher. While these interfaces enhance database accessibility, most research today focuses on English, with limited evaluation in other languages. This paper investigates the performance of both foundational and finetuned LLMs on the Text2Cypher task across multiple languages. We create and release a multilingual dataset by translating English questions into Spanish and Turkish while preserving the original Cypher queries, enabling fair cross-lingual comparison. Using standardized prompts and metrics, we evaluate several foundational models and observe a consistent performance pattern: highest on English, followed by Spanish, and lowest on Turkish. We attribute this to differences in training data availability and linguistic features. We also examine the impact of translating task prompts into Spanish and Turkish. Results show little to no change in evaluation metrics, suggesting prompt translation has minor impact. Furthermore, we finetune a foundational model on two datasets: one in English only, and one multilingual. Finetuning on English improves overall accuracy but widens the performance gap between languages. In contrast, multilingual finetuning narrows the gap, resulting in more balanced performance. Our findings highlight the importance for multilingual evaluation and training to build more inclusive and robust query generation systems.
Related papers
- Found in Translation: Measuring Multilingual LLM Consistency as Simple as Translate then Evaluate [36.641755706551336]
Large language models (LLMs) provide detailed and impressive responses to queries in English.<n>But are they really consistent at responding to the same query in other languages?<n>We propose a framework to evaluate LLM's cross-lingual consistency based on a simple Translate then Evaluate strategy.
arXiv Detail & Related papers (2025-05-28T06:00:21Z) - MULTITAT: Benchmarking Multilingual Table-and-Text Question Answering [44.89146464166763]
Existing TATQA datasets are limited to English.<n>They overlook the challenges of multilingual TAT-QA.<n>They do not reflect real-world scenarios where tables and texts frequently appear in non-English languages.
arXiv Detail & Related papers (2025-02-24T15:34:09Z) - mFollowIR: a Multilingual Benchmark for Instruction Following in Retrieval [61.17793165194077]
We introduce mFollowIR, a benchmark for measuring instruction-following ability in retrieval models.<n>We present results for both multilingual (XX-XX) and cross-lingual (En-XX) performance.<n>We see strong cross-lingual performance with English-based retrievers that trained using instructions, but find a notable drop in performance in the multilingual setting.
arXiv Detail & Related papers (2025-01-31T16:24:46Z) - The Roles of English in Evaluating Multilingual Language Models [6.396057276543912]
We argue that these roles have different goals: task performance versus language understanding.<n>We recommend to move away from this imprecise method and instead focus on furthering language understanding.
arXiv Detail & Related papers (2024-12-11T14:02:55Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual
Transfer [81.5984433881309]
We introduce BUFFET, which unifies 15 diverse tasks across 54 languages in a sequence-to-sequence format.
BUFFET is designed to establish a rigorous and equitable evaluation framework for few-shot cross-lingual transfer.
Our findings reveal significant room for improvement in few-shot in-context cross-lingual transfer.
arXiv Detail & Related papers (2023-05-24T08:06:33Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational
Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic
Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to- schema- dataset which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese)
Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages.
We also propose a simple framework augmentation framework SAVe (Augmentation-with-Verification) which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z) - XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for
Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks.
This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query.
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z) - GLUECoS : An Evaluation Benchmark for Code-Switched NLP [17.066725832825423]
We present an evaluation benchmark, GLUECoS, for code-switched languages.
We present results on several NLP tasks in English-Hindi and English-Spanish.
We fine-tune multilingual models on artificially generated code-switched data.
arXiv Detail & Related papers (2020-04-26T13:28:34Z) - Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.