Fugu-MT Paper Translation (Abstract): How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

Paper abstract: How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

arxiv url: http://arxiv.org/abs/2603.07346v1
Date: Sat, 07 Mar 2026 21:15:04 GMT
Status: Translation complete
Last updated in system: 2026-03-10 15:13:14.312958
Title: How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection
Title (reference translation): How much noise can BERT handle? Insights from multilingual sentence difficulty detection
Authors: Nouran Khallaf, Serge Sharoff
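Abstract summary: Noisy training data can significantly degrade the performance of language-model-based classifiers. The paper explores a range of denoising strategies for sentence-level difficulty detection. It also addresses cross-lingual transfer, where a multilingual language model is trained in one language and tested in another.
Reference score (attention score computed independently by the site): 1.9746060146273674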
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty
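Abstract (reference translation): Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. This study devises a methodological framework for assessing the impact of denoising. More specifically, it examines a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, it also addresses cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. Several noise reduction techniques are evaluated, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. The results suggest that BERT-based models are inherently robust to noise, but that incorporating explicit noise detection further improves performance. For the smaller dataset, GMM-based noise filtering is particularly effective, raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when denoising methods are combined. For the larger dataset, however, the intrinsic regularisation of pre-trained language models provides a strong baseline: denoising methods yield only marginal gains (from 0.92 to 0.94), and a combination of two denoising methods made no contribution. Nonetheless, removing noisy sentences (about 20% of the dataset) helps to produce a cleaner corpus. As a result, the authors have released the largest multilingual corpus for sentence difficulty prediction: see https://github.com/Nouran-Khallaf/denoising-difficulty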
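
The abstract names GMM-based noise filtering but does not say how it is applied. The sketch below shows one common reading, which is an assumption and not the authors' implementation: fit a two-component Gaussian mixture to per-sample training losses, treat the low-loss component as "clean", and keep only samples likely to belong to it. The function names, the 0.5 threshold, and the example loss values are all hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, iterations=100, min_var=1e-6):
    """Fit a 1-D two-component Gaussian mixture with plain EM.

    Returns (weights, means, variances), each a list of length 2.
    """
    n = len(values)
    grand_mean = sum(values) / n
    grand_var = max(sum((v - grand_mean) ** 2 for v in values) / n, min_var)
    # Initialise one component at the smallest value and one at the largest.
    means = [min(values), max(values)]
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, min_var)
    return weights, means, variances


def clean_probabilities(losses):
    """Posterior probability that each loss belongs to the low-mean component."""
    weights, means, variances = fit_two_component_gmm(losses)
    clean = 0 if means[0] <= means[1] else 1
    probs = []
    for v in losses:
        dens = [weights[k] * gaussian_pdf(v, means[k], variances[k]) for k in range(2)]
        total = sum(dens)
        probs.append(dens[clean] / total if total > 0 else 0.5)
    return probs


def keep_clean_indices(losses, threshold=0.5):
    """Indices of samples whose 'clean' posterior reaches the (assumed) threshold."""
    return [i for i, p in enumerate(clean_probabilities(losses)) if p >= threshold]


if __name__ == "__main__":
    # Hypothetical per-sample losses: most are small, two are suspiciously large.
    example_losses = [0.12, 0.20, 0.15, 0.18, 2.1, 0.22, 1.9, 0.17]
    print(keep_clean_indices(example_losses))
```

In practice the losses would come from a classifier trained for a few epochs on the noisy labels, and the retained indices would define the filtered training set. The threshold trades off how much data is discarded against how much label noise remains.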

Related papers