Fugu-MT Paper Translation (Abstract): Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

Paper overview: Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon

arxiv url: http://arxiv.org/abs/2604.00023v1
Date: Wed, 11 Mar 2026 05:23:53 GMT
Status: Translation complete
Last updated in system: 2026-04-06 02:36:13.201747
Title: Phonological Fossils: Machine Learning Detection of Non-Mainstream Vocabulary in Sulawesi Basic Lexicon
Authors: Mukhlis Amien, Go Frendi Gunawan
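Abstract summary: Using 1,357 forms from six Sulawesi languages, the study identifies 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763. Clustering yields no coherent word families, providing no evidence for a single pre-Austronesian language layer.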
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Basic vocabulary in many Sulawesi Austronesian languages includes forms resisting reconstruction to any proto-form with phonological patterns inconsistent with inherited roots, but whether this non-conforming vocabulary represents pre-Austronesian substrate or independent innovation has not been tested computationally. We combine rule-based cognate subtraction with a machine learning classifier trained on phonological features. Using 1,357 forms from six Sulawesi languages in the Austronesian Basic Vocabulary Database, we identify 438 candidate substrate forms (26.5%) through cognate subtraction and Proto-Austronesian cross-checking. An XGBoost classifier trained on 26 phonological features distinguishes inherited from non-mainstream forms with AUC=0.763, revealing a phonological fingerprint: longer forms, more consonant clusters, higher glottal stop rates, and fewer Austronesian prefixes. Cross-method consensus (Cohen's kappa=0.61) identifies 266 high-confidence non-mainstream candidates. However, clustering yields no coherent word families (silhouette=0.114; cross-linguistic cognate test p=0.569), providing no evidence for a single pre-Austronesian language layer. Application to 16 additional languages confirms geographic patterning: Sulawesi languages show higher predicted non-mainstream rates (mean P_sub=0.606) than Western Indonesian languages (0.393). This study demonstrates that phonological machine learning can complement traditional comparative methods in detecting non-mainstream lexical layers, while cautioning against interpreting phonological non-conformity as evidence for a shared substrate language.
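Two of the quantities named in the abstract, per-form phonological features and Cohen's kappa for cross-method agreement, can be illustrated with a minimal sketch. This is not the authors' pipeline: the four features, the prefix list, and the glottal-stop symbols are assumptions chosen for illustration, whereas the paper uses 26 features and an XGBoost classifier, neither of which is reproduced here.

```python
# Illustrative sketch only. The feature definitions, the prefix list, and the
# glottal-stop symbols are assumptions, not taken from the paper.
VOWELS = set("aeiou")
GLOTTAL = {"ʔ", "'"}                 # assumed transcriptions of the glottal stop
PREFIXES = ("ma", "pa", "ka", "ta")  # hypothetical Austronesian-style prefixes


def phonological_features(form):
    """Return a small dict of surface phonological features for one word form."""
    form = form.lower()
    # Count adjacent pairs of non-vowel letters as consonant clusters.
    # Note that "ʔ" is alphabetic, so it is treated as a consonant here.
    clusters = sum(
        1
        for a, b in zip(form, form[1:])
        if a.isalpha() and b.isalpha() and a not in VOWELS and b not in VOWELS
    )
    return {
        "length": len(form),
        "consonant_clusters": clusters,
        "glottal_stops": sum(1 for ch in form if ch in GLOTTAL),
        "has_prefix": int(form.startswith(PREFIXES)),
    }


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    # Chance agreement from each method's marginal label frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1:
        return 1.0  # both methods always give the same single label
    return (observed - expected) / (1 - expected)
```

In a setup like the one the abstract describes, `labels_a` could hold the rule-based cognate-subtraction labels and `labels_b` the thresholded classifier predictions for the same forms; that mapping is also an assumption of this sketch.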
