Fugu-MT Paper Translation (Abstract): BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement

Paper summary: BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement

arxiv url: http://arxiv.org/abs/2604.04708v1
Date: Mon, 06 Apr 2026 14:22:46 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.225624
Title: BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement
Title (reference translation): BiST: A Gold Standard Bangla-English Bilingual Corpus for Sentence Structure and Tense Classification with Inter-Annotator Agreement (English)
Authors: Abdullah Al Shafi, Swapnil Kundu Argha, M. A. Moyeen, Abdul Muntakim, Shoumik Barman Polok,
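Abstract summary: BiST is a rigorously curated Bangla-English corpus for sentence-level grammatical classification. The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text. BiST supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning.
Reference score (independently computed attention metric): 0.17398560678845076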
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: High-quality bilingual resources remain a critical bottleneck for advancing multilingual NLP in low-resource settings, particularly for Bangla. To mitigate this gap, we introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated across two fundamental dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, followed by systematic preprocessing and automated language identification, resulting in 30,534 sentences, including 17,465 English and 13,069 Bangla instances. Annotation quality is ensured through a multi-stage framework with three independent annotators and dimension-wise Fleiss' kappa ($\kappa$) agreement, yielding reliable and reproducible labels with $\kappa$ values of 0.82 and 0.88 for structural and temporal annotation, respectively. Statistical analyses demonstrate realistic structural and temporal distributions, while baseline evaluations show that dual-encoder architectures leveraging complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST provides explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.
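Abstract (reference translation): High-quality bilingual resources continue to be a key bottleneck for progress in multilingual NLP in low-resource settings, especially for Bangla. To narrow this gap, the authors introduce BiST, a rigorously curated Bangla-English corpus for sentence-level grammatical classification, annotated along two basic dimensions: syntactic structure (Simple, Complex, Compound, Complex-Compound) and tense (Present, Past, Future). The corpus is compiled from open-licensed encyclopedic sources and naturally composed conversational text, then passed through systematic preprocessing and automated language identification, yielding 30,534 sentences, of which 17,465 are English and 13,069 are Bangla. Annotation quality is ensured by a multi-stage framework with three independent annotators and dimension-wise Fleiss' kappa ($\kappa$) agreement, giving reliable and reproducible labels with $\kappa$ values of 0.82 for structural annotation and 0.88 for temporal annotation. Statistical analyses show realistic structural and temporal distributions, and baseline evaluations show that dual-encoder architectures using complementary language-specific representations consistently outperform strong multilingual encoders. Beyond benchmarking, BiST offers explicit linguistic supervision that supports grammatical modeling tasks, including controlled text generation, automated feedback generation, and cross-lingual representation learning. The corpus establishes a unified resource for bilingual grammatical modeling and facilitates linguistically grounded multilingual research.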
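As an illustration of the agreement statistic named in the abstract, the following is a minimal sketch of Fleiss' kappa for a fixed number of annotators per sentence. The function name `fleiss_kappa` and the toy tense labels are assumptions made for this example; they are not taken from the BiST corpus, and the abstract does not describe how the authors computed their values.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that are each labeled by the same number of raters.

    ratings    -- list of per-item label lists, e.g. [["Past", "Past", "Present"], ...]
    categories -- list of all possible labels
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned category j to item i
    counts = [[Counter(item)[c] for c in categories] for item in ratings]

    # Observed agreement: mean over items of the share of agreeing rater pairs
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    p_cats = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cats)

    # Undefined (division by zero) if every rating falls in a single category
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data: 4 sentences, 3 annotators, tense dimension only
TENSES = ["Present", "Past", "Future"]
toy_ratings = [
    ["Past", "Past", "Past"],
    ["Present", "Present", "Past"],
    ["Future", "Future", "Future"],
    ["Present", "Present", "Present"],
]
toy_kappa = fleiss_kappa(toy_ratings, TENSES)
print(f"Fleiss' kappa (toy tense labels): {toy_kappa:.3f}")
```

Computing the statistic dimension-wise, as the abstract describes, would mean calling such a function once with the structure labels and once with the tense labels.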


Related papers