Fugu-MT Paper Translation (Abstract): Aesthetic Image Captioning with Saliency Enhanced MLLMs

Paper overview: Aesthetic Image Captioning with Saliency Enhanced MLLMs

arxiv url: http://arxiv.org/abs/2509.04378v3
Date: Tue, 09 Sep 2025 08:09:40 GMT
Status: Translation complete
Last updated in system: 2025-09-10 12:33:22.809391
Title: Aesthetic Image Captioning with Saliency Enhanced MLLMs
Authors: Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao
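Abstract summary: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics. This paper introduces the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. It also designs IAS-ViT as the image encoder for MLLMs.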
Reference score (attention measure computed independently by this site): 26.924932114765596
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.
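
The abstract says only that IAS-ViT fuses aesthetic saliency features with the original image features via cross-attention; it gives no layer details. The following is a minimal single-head sketch of that general idea in NumPy, not the authors' implementation. Every name here (`cross_attention_fuse`, `w_q`, `w_k`, `w_v`), the token counts, the choice of image tokens as queries with saliency tokens as keys and values, and the residual addition are assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(image_tokens, saliency_tokens, w_q, w_k, w_v):
    """Hypothetical fusion step: image tokens attend to saliency tokens.

    image_tokens:    (N, dim) patch features from an image encoder
    saliency_tokens: (M, dim) features from a saliency module
    w_q, w_k, w_v:   (dim, dim) projection matrices (assumed square here)
    """
    q = image_tokens @ w_q                      # queries from image tokens, (N, dim)
    k = saliency_tokens @ w_k                   # keys from saliency tokens, (M, dim)
    v = saliency_tokens @ w_v                   # values from saliency tokens, (M, dim)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # scaled dot-product scores, (N, M)
    weights = softmax(scores, axis=-1)          # each image token's weights sum to 1
    fused = image_tokens + weights @ v          # residual connection (an assumption)
    return fused, weights


# Toy inputs: 16 image patch tokens and 4 saliency tokens of width 8.
rng = np.random.default_rng(0)
dim = 8
image_tokens = rng.normal(size=(16, dim))
saliency_tokens = rng.normal(size=(4, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

fused, weights = cross_attention_fuse(image_tokens, saliency_tokens, w_q, w_k, w_v)
print(fused.shape, weights.shape)  # (16, 8) (16, 4)
```

In this sketch the fused output keeps the shape of the image tokens, so it could replace them as input to a downstream language model; whether the paper uses this direction of attention, multiple heads, or a residual path is not stated in the abstract.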

