Fugu-MT paper translation (abstract): Neologism Learning for Controllability and Self-Verbalization

Paper summary: Neologism Learning for Controllability and Self-Verbalization

arxiv url: http://arxiv.org/abs/2510.08506v1
Date: Thu, 09 Oct 2025 17:41:57 GMT
Status: Translation complete
System update date: 2025-10-10 17:54:15.258873
Title: Neologism Learning for Controllability and Self-Verbalization
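Title (reference translation): Neologism Learning for Controllability and Self-Verbalization
Authors: John Hewitt, Oyvind Tafjord, Robert Geirhos, Been Kim
Abstract summary: We explore the idea of introducing new words to better understand and control models. The method introduces a new word by adding a new word embedding and training on examples that exhibit the concept. We show that adding a new word allows control of concepts such as flattery, incorrect answers, text length, and more complex concepts in AxBench.
Reference score (independently computed attention score): 23.932433693726182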
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means "a lack of complete, coherent, or meaningful answers...". To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
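Abstract (reference translation): Humans invent new words when demand rises for a new useful concept (e.g., doomscrolling). We introduce new words to better understand and control models, expanding on the recently introduced neologism learning. The method introduces a new word by adding a new word embedding and training it, with no other changes to the model parameters. We show that adding a new word allows control of concepts such as flattery, incorrect answers, text length, and more complex concepts in AxBench. Models can describe in natural language what each new word means to them; for example, a model explains that a word representing the concept of incorrect answers means "a lack of complete, coherent, or meaningful answers...". To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the model's context and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.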
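
The two mechanisms the abstract describes (training only a new word embedding, then checking by plug-in evaluation whether some inserted text controls the same concept) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the authors' implementation: the "model" is a frozen logistic read-out over the mean of token embeddings, the token `<neo>`, the vocabulary, the weights `w` and `b`, and the helper names `concept_prob`, `learn_neologism`, and `plug_in_gain` are all hypothetical, and the candidate verbalizations are hand-picked placeholders rather than descriptions generated by a model.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def concept_prob(tokens, emb, w, b):
    """Frozen toy 'model': probability that the output exhibits the concept."""
    mean = [sum(emb[t][i] for t in tokens) / len(tokens) for i in range(len(w))]
    return sigmoid(sum(wi * mi for wi, mi in zip(w, mean)) + b)


def learn_neologism(examples, emb, w, b, new_word, steps=200, lr=0.5):
    """Add one embedding for new_word and update only that vector."""
    rng = random.Random(0)
    emb[new_word] = [rng.uniform(-0.1, 0.1) for _ in w]
    for _ in range(steps):
        for tokens, label in examples:
            p = concept_prob(tokens, emb, w, b)
            share = tokens.count(new_word) / len(tokens)
            # Log-loss gradient with respect to the new embedding only;
            # w, b, and every other embedding are never written to.
            for i in range(len(w)):
                emb[new_word][i] -= lr * (p - label) * w[i] * share
    return emb[new_word]


def plug_in_gain(prompts, inserted, emb, w, b):
    """Mean change in concept probability when `inserted` joins the context."""
    gains = [
        concept_prob(p + inserted, emb, w, b) - concept_prob(p, emb, w, b)
        for p in prompts
    ]
    return sum(gains) / len(gains)


# Hypothetical frozen parameters (3-dimensional embeddings, linear read-out).
emb = {
    "write": [0.2, -0.1, 0.0],
    "a": [0.0, 0.1, -0.2],
    "reply": [-0.1, 0.0, 0.1],
    "detailed": [1.0, -1.0, 0.5],
    "brief": [-1.0, 1.0, -0.5],
}
w, b = [1.0, -2.0, 0.5], -1.0
frozen_before = {t: list(v) for t, v in emb.items()}

prompt = ["write", "a", "reply"]
base = concept_prob(prompt, emb, w, b)

# One training example: the prompt plus the new word should show the concept.
learn_neologism([(prompt + ["<neo>"], 1.0)], emb, w, b, "<neo>")
with_neo = concept_prob(prompt + ["<neo>"], emb, w, b)

# Plug-in check: does a candidate verbalization steer the same concept?
gain_detailed = plug_in_gain([prompt], ["detailed"], emb, w, b)
gain_brief = plug_in_gain([prompt], ["brief"], emb, w, b)

print(f"base={base:.3f} with_neo={with_neo:.3f}")
print(f"gain(detailed)={gain_detailed:.3f} gain(brief)={gain_brief:.3f}")
```

In this sketch the prompt without `<neo>` scores exactly as it did before training, because nothing except the new vector is modified. A real setup would presumably replace `concept_prob` with generations from a language model plus a concept scorer, which is outside what this toy covers.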

Related papers