Related papers: MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases

MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases

URL: http://arxiv.org/abs/2410.18406v1
Date: Thu, 24 Oct 2024 03:42:43 GMT
Title: MoMQ: Mixture-of-Experts Enhances Multi-Dialect Query Generation across Relational and Non-Relational Databases
Authors: Zhisheng Lin, Yifu Liu, Zhiling Luo, Jinyang Gao, Yu Li,
Abstract summary: Cloud service providers are looking for a unified database manager service that can support multiple dialects. MoMQ is a novel Mixture-of-Experts-based multi-dialect query generation framework. MoMQ employs a dialect expert group for each dialect and a multi-level routing strategy to handle dialect-specific knowledge.
Score: 15.59894560371822
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The improvement in translating natural language to structured query language (SQL) can be attributed to the advancements in large language models (LLMs). Open-source LLMs, tailored for specific database dialects such as MySQL, have shown great performance. However, cloud service providers are looking for a unified database manager service (e.g., Cosmos DB from Azure, Amazon Aurora from AWS, Lindorm from AlibabaCloud) that can support multiple dialects. This requirement has led to the concept of multi-dialect query generation, which presents challenges to LLMs. These challenges include syntactic differences among dialects and imbalanced data distribution across multiple dialects. To tackle these challenges, we propose MoMQ, a novel Mixture-of-Experts-based multi-dialect query generation framework across both relational and non-relational databases. MoMQ employs a dialect expert group for each dialect and a multi-level routing strategy to handle dialect-specific knowledge, reducing interference during query generation. Additionally, a shared expert group is introduced to address data imbalance, facilitating the transfer of common knowledge from high-resource dialects to low-resource ones. Furthermore, we have developed a high-quality multi-dialect query generation benchmark that covers relational and non-relational databases such as MySQL, PostgreSQL, Cypher for Neo4j, and nGQL for NebulaGraph. Extensive experiments have shown that MoMQ performs effectively and robustly even in resource-imbalanced scenarios.

Related papers

MultiTEND: A Multilingual Benchmark for Natural Language to NoSQL Query Translation [6.142748564599452]
This paper introduces MultiTEND, the first largest multilingual benchmark for natural language to query generation. We analyze challenges in translating natural language to queries across diverse linguistic structures. We introduce MultiLink, a novel framework that bridges the multilingual input to query generation gap through a Parallel Linking Process.
arXiv Detail & Related papers (2025-02-16T07:12:47Z)
Evaluating and Enhancing LLMs for Multi-turn Text-to-SQL with Multiple Question Types [11.391598870596392]
Large language models (LLMs) have significantly advanced text-to-speech systems. LLMs often narrowly focus on SQL generation, neglecting the complexities of real-world conversational queries. We propose MM, a test suite designed to evaluate the question classification and SQL generation capabilities of LLMs.
arXiv Detail & Related papers (2024-12-21T10:13:45Z)
Towards Evaluating Large Language Models for Graph Query Generation [49.49881799107061]
Large Language Models (LLMs) are revolutionizing the landscape of Generative Artificial Intelligence (GenAI) This paper presents a comparative study addressing the challenge of generating queries a powerful language for interacting with graph databases using open-access LLMs. Our empirical analysis of query generation accuracy reveals that Claude Sonnet 3.5 outperforms its counterparts in this specific domain.
arXiv Detail & Related papers (2024-11-13T09:11:56Z)
SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark [4.049028351548513]
Different database models have a big impact on query complexity and performance. We present SM3-Text-to-Query, the first multi-model medical Text-to-Query benchmark.
arXiv Detail & Related papers (2024-11-08T12:27:13Z)
Table Question Answering for Low-resourced Indic Languages [71.57359949962678]
TableQA is the task of answering questions over tables of structured information, returning individual cells or tables as output. We introduce a fully automatic large-scale tableQA data generation process for low-resource languages with limited budget. We incorporate our data generation method on two Indic languages, Bengali and Hindi, which have no tableQA datasets or models.
arXiv Detail & Related papers (2024-10-04T16:26:12Z)
Text2SQL is Not Enough: Unifying AI and Databases with TAG [47.45480855418987]
Table-Augmented Generation (TAG) is a paradigm for answering natural language questions over databases. We develop benchmarks to study the TAG problem and find that standard methods answer no more than 20% of queries correctly.
arXiv Detail & Related papers (2024-08-27T00:50:14Z)
MST5 -- Multilingual Question Answering over Knowledge Graphs [1.6470999044938401]
Knowledge Graph Question Answering (KGQA) simplifies querying vast amounts of knowledge stored in a graph-based model using natural language. Existing multilingual KGQA systems face challenges in achieving performance comparable to English systems. We propose a simplified approach to enhance multilingual KGQA systems by incorporating linguistic context and entity information directly into the processing pipeline of a language model.
arXiv Detail & Related papers (2024-07-08T15:37:51Z)
UQE: A Query Engine for Unstructured Databases [71.49289088592842]
We investigate the potential of Large Language Models to enable unstructured data analytics. We propose a new Universal Query Engine (UQE) that directly interrogates and draws insights from unstructured data collections.
arXiv Detail & Related papers (2024-06-23T06:58:55Z)
The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants. This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z)
Querying Large Language Models with SQL [16.383179496709737]
In many use-cases, information is stored in text but not available in structured data. With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of text documents. We present Galois, a prototype based on a traditional database architecture, but with new physical operators for querying the underlying LLM.
arXiv Detail & Related papers (2023-04-02T06:58:14Z)
Uni-Parser: Unified Semantic Parser for Question Answering on Knowledge Base and Database [86.03294330305097]
We propose a unified semantic element for question answering (QA) on both knowledge bases (KB) and databases (DB) We introduce the primitive (relation and entity in KB, table name, column name and cell value in DB) as an essential element in our framework. We leverage the generator to predict final logical forms by altering and composing topranked primitives with different operations.
arXiv Detail & Related papers (2022-11-09T19:33:27Z)
Translating synthetic natural language to database queries: a polyglot deep learning framework [0.0]
Polyglotter supports the mapping of natural language searches to database queries. It does not require the creation of manually annotated data for training. Our results indicate that our framework performs well on both synthetic and real databases.
arXiv Detail & Related papers (2021-04-14T17:43:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.