Fugu-MT Paper Translation (Abstract): PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

Paper overview: PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

arxiv url: http://arxiv.org/abs/2505.20759v1
Date: Tue, 27 May 2025 06:03:56 GMT
Status: Translation complete
Last updated in system: 2025-05-28 17:05:58.438294
Title: PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Title (reference translation): PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding
Authors: Ansel Blume, Jeonghwan Kim, Hyeonjeong Ha, Elen Chatikyan, Xiaomeng Jin, Khanh Duy Nguyen, Nanyun Peng, Kai-Wei Chang, Derek Hoiem, Heng Ji
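Abstract summary: We introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
Reference score (independently computed attention metric): 114.47739645594204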
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Real-world objects are composed of distinctive, object-specific parts. Identifying these parts is key to performing fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle to perform this seemingly straightforward task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We construct PARTONOMY from existing part datasets and our own rigorously annotated set of images, encompassing 862 part labels and 534 object labels for evaluation. Unlike existing datasets that simply ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane), and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments demonstrate significant limitations in state-of-the-art LMMs (e.g., LISA-13B achieves only 5.9% gIoU), highlighting a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining, which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. We find that pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.
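Abstract (reference translation): Real-world objects are made up of distinctive, object-specific parts. Identifying these parts is key to fine-grained, compositional reasoning, yet large multimodal models (LMMs) struggle with this seemingly simple task. In this work, we introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We build PARTONOMY from existing part datasets and our own rigorously annotated set of images, covering 862 part labels and 534 object labels for evaluation. Unlike existing datasets that only ask models to identify generic parts, PARTONOMY uses specialized concepts (e.g., agricultural airplane) and challenges models to compare objects' parts, consider part-whole relationships, and justify textual predictions with visual segmentations. Our experiments show that state-of-the-art LMMs are severely limited (e.g., LISA-13B achieves only 5.9% gIoU), which highlights a critical gap in their part grounding abilities. We note that existing segmentation-enabled LMMs (segmenting LMMs) have two key architectural shortcomings: they use special [SEG] tokens not seen during pretraining, which induce distribution shift, and they discard predicted segmentations instead of using past predictions to guide future ones. To address these deficiencies, we train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens and that conditions on prior predictions in a feedback loop. Pretrained PLUM outperforms existing segmenting LMMs on reasoning segmentation, VQA, and visual hallucination benchmarks. In addition, PLUM finetuned on our proposed Explanatory Part Segmentation task is competitive with segmenting LMMs trained on significantly more segmentation data. Our work opens up new avenues towards enabling fine-grained, grounded visual understanding in LMMs.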
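
Note on the gIoU figure: the abstract reports gIoU without defining it. The sketch below assumes the convention commonly used in reasoning-segmentation evaluation, where gIoU is the mean of per-image intersection-over-union scores between predicted and ground-truth binary masks. The function names (mask_iou, giou) and the choice to score two empty masks as 1.0 are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union between two binary masks of the same shape."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of rows")
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        if len(pred_row) != len(target_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: two empty masks are treated as a perfect match.
    if union == 0:
        return 1.0
    return intersection / union


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-image IoUs (the assumed meaning of gIoU here)."""
    if len(preds) != len(targets):
        raise ValueError("need one target mask per predicted mask")
    if not preds:
        raise ValueError("need at least one mask pair")
    total = sum(mask_iou(p, t) for p, t in zip(preds, targets))
    return total / len(preds)


if __name__ == "__main__":
    predictions = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    ground_truth = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    # Per-image IoUs are 1/2 and 1/1, so the mean is 0.75.
    print(giou(predictions, ground_truth))
```

Under this definition every image contributes equally regardless of mask size, so small part masks that are missed entirely pull the score down sharply.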
