Fugu-MT Paper Translation (Abstract): MAOAM: Unified Object and Material Selection with Vision-Language Models

Paper summary: MAOAM: Unified Object and Material Selection with Vision-Language Models

arxiv url: http://arxiv.org/abs/2606.04880v1
Date: Tue, 02 Jun 2026 17:59:57 GMT
Status: Translation complete
System update date: 2026-06-04 20:44:18.785822
Title: MAOAM: Unified Object and Material Selection with Vision-Language Models
Title (reference translation): MAOAM: Unified Object and Material Selection Using Vision-Language Models
Authors: Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer
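Abstract summary: Mask Any Object And Material (MAOAM) is a unified selection framework for interactive image editing. It enables precise object- and material-level selection across both text- and click-based interactions. A key challenge is the lack of material selection datasets with text annotations.
Reference score (independently computed attention score): 51.308025632008366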
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual-semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.
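Abstract (reference translation): Selection is a core operation in interactive image editing. To be practical, a user must be able to specify and disambiguate the desired selection region through text- or click-based interactions. Material-based selection is useful for tasks such as re-texturing surfaces or editing instances of a specific material. However, existing vision-language model (VLM) based selection methods are object-centric and typically support only a single interaction modality, which limits their applicability. In this work, we therefore propose Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM uses a VLM with a segmentation head to generate pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object or material level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline that collects real and synthetic images with material masks and uses VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click- and text-based selection, together with an auxiliary VQA task derived from the material descriptions, to facilitate deeper material understanding. Despite being trained with uni-modal prompts, the model shows an emergent improvement in selection when text and clicks are combined at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.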
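
Illustrative note: the abstract describes a multi-task objective over click-based selection, text-based selection, and an auxiliary VQA task, but does not say how the three terms are combined. The sketch below assumes a plain weighted sum. The names `TaskWeights` and `multi_task_loss` and the default weights of 1.0 are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskWeights:
    """Hypothetical per-task weights; the abstract does not state any."""
    click: float = 1.0
    text: float = 1.0
    vqa: float = 1.0


def multi_task_loss(click_loss, text_loss, vqa_loss, weights=TaskWeights()):
    """Combine the three per-task losses as a weighted sum.

    The weighted sum is an assumed combination rule, used here only to
    illustrate what a multi-task objective with an auxiliary task can look like.
    """
    return (weights.click * click_loss
            + weights.text * text_loss
            + weights.vqa * vqa_loss)
```

Under this assumption, lowering `weights.vqa` would reduce the influence of the auxiliary VQA task relative to the two selection tasks.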
