Fugu-MT paper translation (abstract): Automated alignment is harder than you think

Paper overview: Automated alignment is harder than you think

arxiv url: http://arxiv.org/abs/2605.06390v2
Date: Wed, 13 May 2026 14:04:46 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.775353
Title: Automated alignment is harder than you think
Title (reference translation): Automated alignment is harder than you think
Authors: Aleksandr Bowkis, Marie Davidsen Buhl, Jacob Pfau, Geoffrey Irving
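Abstract summary: A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments. This could happen because alignment research involves many hard-to-supervise fuzzy tasks.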
Reference score (independently computed attention score): 41.94180208011558
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate an increasing fraction of alignment research as capabilities improve. We argue that, even when research agents are not scheming to deliberately sabotage alignment work, this plan could produce compelling but catastrophically misleading safety assessments resulting in the unintentional deployment of misaligned AI. This could happen because alignment research involves many hard-to-supervise fuzzy tasks (tasks without clear evaluation criteria, for which human judgement is systematically flawed). Consequently, research outputs will contain systematic, undetected errors, and even correct outputs could be incorrectly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research for several reasons: 1) optimisation pressure means agent-generated mistakes are concentrated among those that human reviewers are least likely to catch; 2) agents are likely to produce errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than human equivalents. Therefore, agents must be trained to reliably perform hard-to-supervise fuzzy tasks. Generalisation and scalable oversight are the leading candidates for achieving this but both face novel challenges in the context of automated alignment.
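Abstract (reference translation): A leading proposal for aligning artificial superintelligence (ASI) is to use AI agents to automate a growing share of alignment research as capabilities improve. We argue that, even if the research agents are not deliberately trying to sabotage alignment work, this plan could produce persuasive but catastrophically misleading safety assessments, leading to the unintended deployment of misaligned AI. The reason is that alignment research involves many hard-to-supervise fuzzy tasks (tasks that lack clear evaluation criteria and on which human judgement is systematically flawed). As a result, research outputs will contain systematic, undetected errors, and even correct outputs could be wrongly aggregated into overconfident safety assessments. This problem is likely to be worse for automated alignment research than for human-generated alignment research, for several reasons: 1) optimisation pressure concentrates agent-generated mistakes among those that human reviewers are least likely to catch; 2) agents are likely to make errors that do not resemble human mistakes; 3) AI-generated alignment solutions may involve arguments that humans cannot evaluate; and 4) shared weights, data and training processes may make AI outputs more correlated than their human equivalents. Agents must therefore be trained to perform hard-to-supervise fuzzy tasks reliably. Generalisation and scalable oversight are the leading candidates for achieving this, but both face novel challenges in the context of automated alignment.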
