Fugu-MT Paper Translation (Abstract): ETCHR: Editing To Clarify and Harness Reasoning

Paper Abstract: ETCHR: Editing To Clarify and Harness Reasoning

arxiv url: http://arxiv.org/abs/2605.23897v1
Date: Fri, 22 May 2026 17:58:28 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.462402
Title: ETCHR: Editing To Clarify and Harness Reasoning
Authors: Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin
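Abstract summary: ETCHR (Editing To Clarify and Harness Reasoning) is a question-conditioned, reasoning-aware image editor. It is trained with a two-stage recipe targeted at two gaps: Reasoning Imitation via supervised fine-tuning, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs without additional training.
Reference score (independently computed attention level): 70.02956047187827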
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed, predefined toolkits or produce noisy intermediate images with unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor that is decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.
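
The decoupling described in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration, not the authors' interface: the names `DecoupledPipeline`, `editor`, and `reasoner` are hypothetical, images are stood in for by raw bytes, and passing both the original and the edited image to the understanding model is a guess about how the edit is consumed.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical placeholder: a real system would use an actual image type.
Image = bytes


@dataclass
class DecoupledPipeline:
    """Illustrative only: an editor and a reasoner joined by plain function calls."""

    # Question-conditioned editor: (image, question) -> edited image.
    editor: Callable[[Image, str], Image]
    # Any frozen understanding model: (images, question) -> answer text.
    reasoner: Callable[[list[Image], str], str]

    def answer(self, image: Image, question: str) -> str:
        # The editor produces an intermediate image meant to clarify the question.
        edited = self.editor(image, question)
        # Assumption: the reasoner sees both the original and the edited image.
        return self.reasoner([image, edited], question)


# Toy stand-ins so the sketch runs without any model.
def toy_editor(image: Image, question: str) -> Image:
    return image + b"|edited"


def toy_reasoner(images: list[Image], question: str) -> str:
    return f"{len(images)} images seen"


pipeline = DecoupledPipeline(editor=toy_editor, reasoner=toy_reasoner)
```

Because the reasoner is only a callable in this sketch, swapping in a different understanding model means replacing one field and leaving the editor untouched, which is the sense in which the abstract calls the integration training-free.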

Related Papers