Fugu-MT paper translation (abstract): MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

Paper overview: MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

arxiv url: http://arxiv.org/abs/2505.06176v1
Date: Fri, 09 May 2025 16:38:27 GMT
Status: Translation complete
System update date: 2025-05-12 20:40:10.340427
Title: MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills
Title (reference translation): MonetGPT: Puzzle Solving That Enhances MLLMs' Image Retouching Skills
Authors: Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra
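Abstract summary: This paper shows that a multimodal large language model (MLLM) can be applied to critiquing raw photographs. It demonstrates that MLLMs can first be made aware of the underlying image processing operations, and then synthesizes a reasoning dataset by procedurally manipulating expert-edited photos.
Reference score (independently computed attention score): 37.48977077142813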
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing operations, by training them to solve specially designed visual puzzles. Subsequently, such an operation-aware MLLM can both plan and propose edit sequences. To facilitate training, given a set of expert-edited photos, we synthesize a reasoning dataset by procedurally manipulating the expert edits and then grounding a pretrained LLM on the visual adjustments, to synthesize reasoning for finetuning. The proposed retouching operations are, by construction, understandable by the users, preserve object details and resolution, and can be optionally overridden. We evaluate our setup on a variety of test examples and show advantages, in terms of explainability and identity preservation, over existing generative and other procedural alternatives. Code, data, models, and supplementary results can be found via our project website at https://monetgpt.github.io.
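Abstract (reference translation): Retouching is an important task in the post-processing of raw photographs. Generative editing guided by text or strokes gives users an accessible new tool, but it can easily alter the identity of the original objects. In contrast, traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative yet still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are difficult for most novices to plan. This paper asks whether a multimodal large language model (MLLM) can critique raw photographs, suggest suitable remedies, and realize them with a given set of procedural image operations. It demonstrates that an MLLM can first be made aware of the underlying image processing operations by training it to solve specially designed visual puzzles. Such an operation-aware MLLM can then both plan and propose edit sequences. Given a set of expert-edited photos, a reasoning dataset is synthesized by procedurally manipulating the expert edits and grounding a pretrained LLM on the visual adjustments, producing reasoning for fine-tuning. The proposed retouching operations are, by construction, understandable to users, preserve object details and resolution, and can optionally be overridden. The method is evaluated on a variety of test examples and shows advantages in explainability and identity preservation over existing generative and other procedural alternatives. Code, data, models, and supplementary results are available on the project website at https://monetgpt.github.io.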
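
The abstract mentions training an MLLM on "specially designed visual puzzles" so that it becomes aware of procedural image operations, but it does not describe the puzzles themselves. The following is only a rough sketch of what one such puzzle could look like, under assumptions of our own: the two operations (`adjust_brightness`, `adjust_contrast`), the value grid, the `make_puzzle` helper, and the question wording are all hypothetical and are not taken from the paper. Images are represented as plain nested lists of grayscale values in [0, 1] to keep the example dependency-free.

```python
import random

# Hypothetical procedural operations on a grayscale image stored as rows of
# floats in [0, 1]. The operation names and value ranges are assumptions made
# for illustration; the abstract does not list the actual operation set.


def clamp(x):
    """Keep a pixel value inside the valid [0, 1] range."""
    return max(0.0, min(1.0, x))


def adjust_brightness(image, amount):
    """Add a constant offset to every pixel."""
    return [[clamp(p + amount) for p in row] for row in image]


def adjust_contrast(image, amount):
    """Scale pixel values around mid-gray (0.5) by a factor of 1 + amount."""
    return [[clamp(0.5 + (p - 0.5) * (1.0 + amount)) for p in row] for row in image]


OPERATIONS = {
    "brightness": adjust_brightness,
    "contrast": adjust_contrast,
}


def make_puzzle(image, rng):
    """Apply one randomly chosen operation and keep the answer as the label."""
    name = rng.choice(sorted(OPERATIONS))
    amount = rng.choice([-0.3, -0.15, 0.15, 0.3])
    edited = OPERATIONS[name](image, amount)
    return {
        "before": image,
        "after": edited,
        "question": "Which operation was applied, and by how much?",
        "answer": {"operation": name, "amount": amount},
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    source = [[0.2, 0.4], [0.6, 0.8]]
    puzzle = make_puzzle(source, rng)
    print(puzzle["answer"])
```

Because each puzzle is generated by a known operation with a known value, the label comes for free. That is the general appeal of procedural supervision, whatever form the paper's actual puzzles take.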

Related papers