Fugu-MT Paper Translation (Abstract): A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

Paper overview: A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding

arxiv url: http://arxiv.org/abs/2604.19689v1
Date: Tue, 21 Apr 2026 17:11:48 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.895967
Title: A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Title (reference translation): A-MAR: Agent-Based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
Authors: Shuai Wang, Hongyi Zhu, Jia-Hong Huang, Yixian Shen, Chengxi Zeng, Stevan Rudinac, Monika Kackovic, Nachoem Wijnberg, Marcel Worring
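Abstract summary: A-MAR is an agent-based multimodal art retrieval framework that explicitly conditions retrieval on structured reasoning plans. It consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding.
Reference score (attention score computed independently by the site): 22.108285993445552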
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Understanding artworks requires multi-step reasoning over visual content and cultural, historical, and stylistic context. While recent multimodal large language models show promise in artwork explanation, they rely on implicit reasoning and internalized knowledge, limiting interpretability and explicit evidence grounding. We propose A-MAR, an Agent-based Multimodal Art Retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goals and evidence requirements for each step. Retrieval is then conditioned on this plan, enabling targeted evidence selection and supporting step-wise, grounded explanations. To evaluate agent-based multimodal reasoning within the art domain, we introduce ArtCoT-QA. This diagnostic benchmark features multi-step reasoning chains for diverse art-related queries, enabling a granular analysis that extends beyond simple final answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality, while evaluations on ArtCoT-QA further demonstrate its advantages in evidence grounding and multi-step reasoning ability. These results highlight the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems, with particular relevance to cultural industries. The code and data are available at: https://github.com/ShuaiWang97/A-MAR.
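Abstract (reference translation): Understanding an artwork requires multi-step reasoning over its visual content and its cultural, historical, and stylistic context. Recent multimodal large language models show promise in explaining artworks, but they depend on implicit reasoning and internalized knowledge, which limits interpretability and explicit grounding in evidence. This paper proposes A-MAR, an agent-based multimodal art retrieval framework that explicitly conditions retrieval on structured reasoning plans. Given an artwork and a user query, A-MAR first decomposes the task into a structured reasoning plan that specifies the goal and evidence requirements of each step. Retrieval is then conditioned on this plan, which enables targeted evidence selection and supports step-wise, grounded explanations. To evaluate agent-based multimodal reasoning in the art domain, the authors introduce ArtCoT-QA. This diagnostic benchmark provides multi-step reasoning chains for diverse art-related queries, enabling fine-grained analysis beyond simple final-answer accuracy. Experiments on SemArt and Artpedia show that A-MAR consistently outperforms static, non-planned retrieval and strong MLLM baselines in final explanation quality. These results underline the importance of reasoning-conditioned retrieval for knowledge-intensive multimodal understanding and position A-MAR as a step toward interpretable, goal-driven AI systems with particular relevance to cultural industries. The code and data are available at https://github.com/ShuaiWang97/A-MAR.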
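The abstract describes a two-stage idea: decompose the query into a reasoning plan, then run retrieval once per plan step instead of once for the raw query. The sketch below illustrates only that control flow. It is not the authors' implementation: the `PlanStep` fields, the hard-coded `make_plan`, the word-overlap `score`, and the toy `CORPUS` are all assumptions made for illustration, and a real system would presumably use a model-based planner and a learned multimodal retriever.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PlanStep:
    goal: str      # what this reasoning step should establish (hypothetical field)
    evidence: str  # what kind of evidence the step needs (hypothetical field)


# Toy text corpus standing in for an art knowledge base (assumption).
CORPUS = [
    "The painting shows depicted figures and objects on a table.",
    "Baroque is a historical period and artistic style of the 17th century.",
    "The museum opens at nine.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of punctuation-stripped words."""
    return {w.strip(".,;:!?").lower() for w in text.split()} - {""}


def score(query: str, doc: str) -> int:
    """Word-overlap score; a stand-in for a real similarity model."""
    return len(tokenize(query) & tokenize(doc))


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]


def make_plan(user_query: str) -> list[PlanStep]:
    """Hypothetical fixed planner; ignores the query and returns two steps."""
    return [
        PlanStep("identify the subject", "depicted figures and objects"),
        PlanStep("place the work in context", "historical period and artistic style"),
    ]


def plan_conditioned_retrieve(
    user_query: str, corpus: list[str], k: int = 1
) -> dict[str, list[str]]:
    """Run one retrieval per plan step, conditioning the query on that step."""
    evidence_by_goal = {}
    for step in make_plan(user_query):
        step_query = f"{user_query} {step.goal} {step.evidence}"
        evidence_by_goal[step.goal] = retrieve(step_query, corpus, k)
    return evidence_by_goal
```

Calling `plan_conditioned_retrieve("What does this painting mean?", CORPUS)` returns a mapping from each step's goal to the evidence selected for it, so every step of the final explanation can point to its own retrieved passage. A static retriever would instead call `retrieve` once on the raw query and share one result across all steps.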

