Fugu-MT Paper Translation (Abstract): Generating Accurate and Detailed Captions for High-Resolution Images

Paper overview: Generating Accurate and Detailed Captions for High-Resolution Images

arxiv url: http://arxiv.org/abs/2510.27164v1
Date: Fri, 31 Oct 2025 04:22:22 GMT
Status: Translation complete
Last updated in system: 2025-11-03 17:52:15.976018
Title: Generating Accurate and Detailed Captions for High-Resolution Images
Title (reference translation): Generating Accurate and Detailed Captions for High-Resolution Images
Authors: Hankyeol Lee, Gawon Seo, Kyounggyu Lee, Dogun Kim, Kyungwoo Song, Jiyoung Jung
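Abstract summary: This paper proposes a novel pipeline that integrates vision-language models, large language models, and object detection systems. The proposed pipeline refines captions through a novel multi-stage process. Experiments on a curated dataset of high-resolution images show that the pipeline produces more detailed and reliable image captions.
Reference score (independently computed attention score): 13.538521042598502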
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) often struggle to generate accurate and detailed captions for high-resolution images since they are typically pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels). Downscaling high-resolution images to these dimensions may result in the loss of visual details and the omission of important objects. To address this limitation, we propose a novel pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to enhance caption quality. Our proposed pipeline refines captions through a novel, multi-stage process. Given a high-resolution image, an initial caption is first generated using a VLM, and key objects in the image are then identified by an LLM. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects not mentioned in the initial caption undergo focused, region-specific captioning to ensure they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, along with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images demonstrate that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.
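Abstract (reference translation): Vision-language models (VLMs) are usually pre-trained on low-resolution inputs (e.g., 224x224 or 336x336 pixels), so they often struggle to generate accurate and detailed captions for high-resolution images. Downscaling high-resolution images to these dimensions can lose visual detail and omit important objects. To address this limitation, we propose a pipeline that integrates vision-language models, large language models (LLMs), and object detection systems to improve caption quality. The proposed pipeline refines captions through a novel multi-stage process. Given a high-resolution image, a VLM first generates an initial caption, and an LLM then identifies the key objects in the image. The LLM predicts additional objects likely to co-occur with the identified key objects, and these predictions are verified by object detection systems. Newly detected objects that are not mentioned in the initial caption receive focused, region-specific captioning to ensure that they are incorporated. This process enriches caption detail while reducing hallucinations by removing references to undetected objects. We evaluate the enhanced captions using pairwise comparison and quantitative scoring from large multimodal models, together with a benchmark for hallucination detection. Experiments on a curated dataset of high-resolution images show that our pipeline produces more detailed and reliable image captions while effectively minimizing hallucinations.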
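The multi-stage process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every function name, the callable signatures, the box format, and the toy stand-ins are assumptions introduced here. In particular, the sketch assumes that "undetected objects" means objects mentioned in the initial caption that the detector could not find, which the abstract does not spell out.

```python
from typing import Callable, Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); hypothetical box format


def refine_caption(
    image,
    caption_image: Callable[[object], str],                  # stand-in for the VLM
    extract_objects: Callable[[str], List[str]],             # stand-in for the LLM (key objects)
    predict_cooccurring: Callable[[List[str]], List[str]],   # stand-in for the LLM (co-occurrence)
    detect: Callable[[object, str], Optional[Box]],          # stand-in for the object detector
    caption_region: Callable[[object, Box], str],            # stand-in for region-specific captioning
    rewrite: Callable[[str, Dict[str, str], List[str]], str],  # stand-in for the final merge step
) -> str:
    # Stage 1: initial caption, then the key objects it mentions.
    initial = caption_image(image)
    mentioned = extract_objects(initial)
    # Stage 2: objects likely to co-occur, excluding ones already mentioned.
    predicted = [o for o in predict_cooccurring(mentioned) if o not in mentioned]
    # Stage 3: verify every candidate with the detector (None means not found).
    boxes = {name: detect(image, name) for name in mentioned + predicted}
    # Mentioned but not detected: candidates for removal (assumed reading).
    undetected = [o for o in mentioned if boxes[o] is None]
    # Predicted and detected: caption their regions so they can be added.
    additions = {
        o: caption_region(image, boxes[o]) for o in predicted if boxes[o] is not None
    }
    # Stage 4: merge additions and drop undetected references.
    return rewrite(initial, additions, undetected)


def toy_rewrite(initial: str, additions: Dict[str, str], undetected: List[str]) -> str:
    # Toy merge step: only reports what a real LLM rewrite would add or drop.
    drop = ", ".join(undetected)
    add = "; ".join(f"{name}: {text}" for name, text in additions.items())
    return f"{initial} | drop: {drop} | add: {add}"


def demo() -> str:
    # Toy "image": a mapping from object name to its bounding box.
    image = {"dog": (10, 10, 50, 50), "leash": (40, 30, 90, 60)}
    return refine_caption(
        image,
        caption_image=lambda img: "A dog sits next to a cat.",
        extract_objects=lambda cap: [w for w in ("dog", "cat") if w in cap],
        predict_cooccurring=lambda objs: ["leash", "bowl"],
        detect=lambda img, name: img.get(name),
        caption_region=lambda img, box: f"seen in region {box}",
        rewrite=toy_rewrite,
    )


if __name__ == "__main__":
    print(demo())
```

In the toy run, "cat" is mentioned but not detected, so it is flagged for removal; "leash" is predicted and detected, so it gets a region caption; "bowl" is predicted but not detected, so it is ignored. Passing the models in as callables keeps the sketch independent of any particular VLM, LLM, or detector.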

Related papers