Related papers: MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation

URL: http://arxiv.org/abs/2502.11022v1
Date: Sun, 16 Feb 2025 07:12:47 GMT
Title: MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation
Authors: Zhiqian Qin, Yuanfeng Song, Jinwei Lu, Yuanwei Song, Shuaimin Li, Chen Jason Zhang
Abstract summary: This paper introduces MultiTEND, the first and largest multilingual benchmark for natural language to NoSQL query generation. We analyze challenges in translating natural language to NoSQL queries across diverse linguistic structures. We introduce MultiLink, a novel framework that bridges the multilingual input to NoSQL query generation gap through a Parallel Linking Process.
Score: 6.142748564599452
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Natural language interfaces for NoSQL databases are increasingly vital in the big data era, enabling users to interact with complex, unstructured data without deep technical expertise. However, most recent advancements focus on English, leaving a gap for multilingual support. This paper introduces MultiTEND, the first and largest multilingual benchmark for natural language to NoSQL query generation, covering six languages: English, German, French, Russian, Japanese and Mandarin Chinese. Using MultiTEND, we analyze challenges in translating natural language to NoSQL queries across diverse linguistic structures, including lexical and syntactic differences. Experiments show that performance accuracy in both English and non-English settings remains relatively low, with a 4%-6% gap across scenarios like fine-tuned SLM, zero-shot LLM, and RAG for LLM. To address the aforementioned challenges, we introduce MultiLink, a novel framework that bridges the multilingual input to NoSQL query generation gap through a Parallel Linking Process. It breaks down the task into multiple steps, integrating parallel multilingual processing, Chain-of-Thought (CoT) reasoning, and Retrieval-Augmented Generation (RAG) to tackle lexical and structural challenges inherent in multilingual NoSQL generation. MultiLink shows enhancements in all metrics for every language against the top baseline, boosting execution accuracy by about 15% for English and averaging a 10% improvement for non-English languages.

Related papers

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following [51.18383180774354]
We introduce Multi-IF, a new benchmark designed to assess Large Language Models' proficiency in following multi-turn and multilingual instructions. Our evaluation of 14 state-of-the-art LLMs on Multi-IF reveals that it presents a significantly more challenging task than existing benchmarks. Languages with non-Latin scripts (Hindi, Russian, and Chinese) generally exhibit higher error rates, suggesting potential limitations in the models' multilingual capabilities.
arXiv Detail & Related papers (2024-10-21T00:59:47Z)
Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT). We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level. We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism. LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields. In this work, we construct an open-source multilingual supervised fine-tuning dataset. The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
Evaluating Inter-Bilingual Semantic Parsing for Indian Languages [9.838755823660147]
We propose an Inter-bilingual Seq2seq Semantic parsing dataset IE-SEMPARSE for 11 distinct Indian languages. We highlight the proposed task's practicality, and evaluate existing multilingual seq2seq models across several train-test strategies.
arXiv Detail & Related papers (2023-04-25T17:24:32Z)
Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks. We introduce XSGD for cross-lingual alignment pretraining, a parallel and large-scale multilingual conversation dataset. To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
arXiv Detail & Related papers (2023-04-03T18:46:01Z)
MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing [48.216386761482525]
We present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Experimental results under three typical settings (zero-shot, monolingual and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. We also propose a simple augmentation framework, SAVe (Schema-Augmentation-with-Verification), which boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.
arXiv Detail & Related papers (2022-12-27T13:58:30Z)
XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing [70.40401197026925]
In-context learning using large language models has recently shown surprising results for semantic parsing tasks. This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query. We also include global translation exemplars for a target language to facilitate the translation process for large language models.
arXiv Detail & Related papers (2022-10-25T01:33:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
