Fugu-MT Paper Translation (Abstract): IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Paper summary: IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

arxiv url: http://arxiv.org/abs/2604.13686v1
Date: Wed, 15 Apr 2026 10:07:37 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.479526
Title: IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Title (reference translation): IndicDB -- A Benchmark of Multilingual Text-to-SQL Capabilities in Indian Languages
Authors: Aviral Dawar, Roshan Karanth, Vikram Goyal, Dhruv Kumar
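Abstract summary: IndicDB is a benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. Its relational schemas are sourced from open-data platforms such as NDAP and the India Data Portal. IndicDB comprises 20 databases spanning 237 tables and generates 15,617 tasks across English, Hindi, and five Indic languages.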
Reference score (independently computed attention score): 5.678442494739663
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/
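Abstract (reference translation): Large Language Models (LLMs) have substantially improved Text-to-SQL performance, but existing benchmarks focus mainly on Western contexts and simplified schemas, leaving a gap for real-world, non-Western applications. IndicDB is a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. Its relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP). IndicDB consists of 20 databases spanning 237 tables. To convert denormalized government data into rich relational structures, an iterative three-agent framework (Architect, Auditor, Refiner) is used to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). The pipeline is value-aware, difficulty-calibrated, and join-enforced, and generates 15,617 tasks across English, Hindi, and five Indic languages. The cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) is evaluated across seven linguistic variants. The results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/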
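
As a quick consistency check on the figures quoted in the abstract, the stated relational density of 11.85 tables per database follows directly from the 237 tables and 20 databases. The sketch below only reproduces that arithmetic; the function name `tables_per_database` is a hypothetical helper introduced here for illustration and is not taken from the paper's code.

```python
# Figures quoted in the abstract.
NUM_DATABASES = 20
NUM_TABLES = 237


def tables_per_database(num_tables: int, num_databases: int) -> float:
    """Average number of tables per database (hypothetical helper)."""
    return num_tables / num_databases


density = tables_per_database(NUM_TABLES, NUM_DATABASES)
print(f"{density:.2f} tables per database")
```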
