Fugu-MT Paper Translation (Abstract): How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

Paper abstract: How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation

arxiv url: http://arxiv.org/abs/2511.09748v1
Date: Fri, 14 Nov 2025 01:07:40 GMT
Status: Translation complete
System update date: 2025-11-14 22:53:22.46074
Title: How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Title (reference translation): How Small Can You Go? Compact Language Models for On-Device Critical Error Detection in Machine Translation
Authors: Muskaan Chopra, Lorenz Sparrenberg, Sarthak Khanna, Rafet Sifa
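Abstract summary: We benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Gemma-3-1B provides the best quality-efficiency trade-off.
Reference score (independently computed attention score): 1.3288901827225499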
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. We ask: how small can you get while still detecting meaning-altering translation errors? Focusing on English->German Critical Error Detection (CED), we benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) across WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Results reveal a clear sweet spot around one billion parameters: Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025 after merged-weights fine-tuning, while maintaining 400 ms single-sample latency on a MacBook Pro M4 Pro (24 GB). At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but with higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration yet under-detect entity and number errors. Overall, compact, instruction-tuned LLMs augmented with lightweight calibration and small-sample supervision can deliver trustworthy, on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available at our GitHub repository.
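Abstract (reference translation): Large Language Models (LLMs) excel at evaluating machine translation (MT), but their scale and cost hinder deployment on edge devices and in privacy-sensitive workflows. How small can a model get while still detecting meaning-altering translation errors? We benchmark sub-2B models (LFM2-350M, Qwen-3-0.6B/1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B) on WMT21, WMT22, and SynCED-EnDe-2025. Our framework standardizes prompts, applies lightweight logit-bias calibration and majority voting, and reports both semantic quality (MCC, F1-ERR/F1-NOT) and compute metrics (VRAM, latency, throughput). Gemma-3-1B provides the best quality-efficiency trade-off, reaching MCC=0.77 with F1-ERR=0.98 on SynCED-EnDe-2025. At larger scale, Qwen-3-1.7B attains the highest absolute MCC (+0.11 over Gemma) but at higher compute cost. In contrast, ultra-small models (0.6B) remain usable with few-shot calibration but under-detect entity and number errors. Overall, compact, instruction-tuned LLMs with lightweight calibration and small-sample supervision can deliver trustworthy on-device CED for MT, enabling private, low-cost error screening in real-world translation pipelines. All datasets, prompts, and scripts are publicly available in the GitHub repository.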
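
Illustrative sketch (not from the paper): the abstract names logit-bias calibration, majority voting, and MCC / F1-ERR / F1-NOT without giving details, so the Python below shows one plausible reading of how these pieces could fit together. Everything in it is an assumption made for illustration: the function names, the single scalar bias added to the ERR logit, the grid search that picks the bias by MCC on a small labelled dev set, and the rule that ties in the vote fall back to ERR. The authors' actual scripts are in their repository and may differ.

```python
from collections import Counter
from math import sqrt

ERR, NOT = "ERR", "NOT"  # label names taken from the F1-ERR / F1-NOT metric names


def calibrated_label(logit_err, logit_not, bias=0.0):
    """Choose a label after adding a scalar bias to the ERR logit (assumed scheme)."""
    return ERR if logit_err + bias > logit_not else NOT


def majority_vote(labels):
    """Combine several predictions for one sample; ties go to ERR (assumption)."""
    counts = Counter(labels)
    return ERR if counts[ERR] >= counts[NOT] else NOT


def confusion(gold, pred, positive=ERR):
    """Return (tp, fp, fn, tn) with `positive` treated as the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    tn = sum(g != positive and p != positive for g, p in zip(gold, pred))
    return tp, fp, fn, tn


def mcc(gold, pred):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    tp, fp, fn, tn = confusion(gold, pred)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1(gold, pred, positive):
    """F1 for one class: F1-ERR with positive=ERR, F1-NOT with positive=NOT."""
    tp, fp, fn, _ = confusion(gold, pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_bias(logit_pairs, gold, candidates):
    """Grid-search the bias that maximises MCC on a small dev set (assumed procedure)."""
    def score(bias):
        pred = [calibrated_label(e, n, bias) for e, n in logit_pairs]
        return mcc(gold, pred)
    return max(candidates, key=score)


# Toy dev set: (ERR logit, NOT logit) pairs from a model that under-predicts ERR.
dev_logits = [(0.2, 1.0), (0.5, 0.9), (-1.0, 1.0), (-0.5, 1.2)]
dev_gold = [ERR, ERR, NOT, NOT]
best_bias = fit_bias(dev_logits, dev_gold, [0.0, 0.5, 1.0, 1.5, 2.0])
dev_pred = [calibrated_label(e, n, best_bias) for e, n in dev_logits]
print(best_bias, mcc(dev_gold, dev_pred), f1(dev_gold, dev_pred, ERR))
```

In this toy setup the uncalibrated model labels every sample NOT, which gives an MCC of 0.0; shifting the ERR logit recovers the two missed errors. Voting would apply on top of this, for example `majority_vote([ERR, NOT, ERR])` over three sampled outputs for the same sentence pair.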
