Fugu-MT Paper Translation (Abstract): Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

Paper overview: Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation

arxiv url: http://arxiv.org/abs/2509.08489v1
Date: Wed, 10 Sep 2025 11:00:12 GMT
Status: Translation complete
Last updated in system: 2025-09-11 15:16:52.399172
Title: Prompt-Driven Image Analysis with Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation
Title (reference translation): Prompt-Driven Image Analysis Using Multimodal Generative AI: Detection, Segmentation, Inpainting, and Interpretation
Authors: Kaleem Ahmad
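Abstract summary: This paper presents a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description. It highlights integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults.
Reference score (independently computed attention measure): 0.0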
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prompt-driven image analysis converts a single natural-language instruction into multiple steps: locate, segment, edit, and describe. We present a practical case study of a unified pipeline that combines open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system works end to end from a single prompt, retains intermediate artifacts for transparent debugging (such as detections, masks, overlays, edited images, and before and after composites), and provides the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. In a small, single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases with an accuracy above 85% based on our criteria. On a high-end GPU, inpainting makes up 60 to 75% of total runtime under typical guidance and sampling settings, which highlights the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. Our contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt, with clear guardrails and operational practices that improve reliability in object replacement, scene augmentation, and removal.
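Abstract (reference translation): Prompt-driven image analysis converts a single natural-language instruction into multiple steps (locate, segment, edit, describe). This paper presents a practical case study of a unified pipeline that integrates open-vocabulary detection, promptable segmentation, text-conditioned inpainting, and vision-language description into a single workflow. The system runs end to end from a single prompt, retains intermediate artifacts for transparent debugging (detections, masks, overlays, edited images, and before-and-after composites), and offers the same functionality through an interactive UI and a scriptable CLI for consistent, repeatable runs. We highlight integration choices that reduce brittleness, including threshold adjustments, mask inspection with light morphology, and resource-aware defaults. On a small single-word prompt segment, detection and segmentation produced usable masks in over 90% of cases, with accuracy above 85% by the authors' criteria. On a high-end GPU, inpainting accounts for 60 to 75% of total runtime under typical guidance and sampling settings, which underscores the need for careful tuning. The study offers implementation-guided advice on thresholds, mask tightness, and diffusion parameters, and details version pinning, artifact logging, and seed control to support replay. The contribution is a transparent, reliable pattern for assembling modern vision and multimodal models behind a single prompt.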
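
The abstract mentions "mask inspection with light morphology" without saying which operations are used. As a rough illustration only, the sketch below assumes a 3x3 morphological closing (dilation followed by erosion) to fill pinholes in a binary mask, plus a simple coverage check to flag empty or implausibly large masks. The function names (`close_small_holes`, `is_usable`) and the coverage bounds are hypothetical and are not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # 1 = object pixel, 0 = background


def _window(mask: Mask, r: int, c: int) -> List[int]:
    """Values in the 3x3 window around (r, c); out-of-bounds cells are ignored."""
    rows, cols = len(mask), len(mask[0])
    return [
        mask[rr][cc]
        for rr in range(max(0, r - 1), min(rows, r + 2))
        for cc in range(max(0, c - 1), min(cols, c + 2))
    ]


def dilate(mask: Mask) -> Mask:
    """A pixel becomes 1 if any pixel in its 3x3 window is 1."""
    return [
        [int(any(_window(mask, r, c))) for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


def erode(mask: Mask) -> Mask:
    """A pixel stays 1 only if every in-bounds pixel in its 3x3 window is 1."""
    return [
        [int(all(_window(mask, r, c))) for c in range(len(mask[0]))]
        for r in range(len(mask))
    ]


def close_small_holes(mask: Mask) -> Mask:
    """Morphological closing: dilate, then erode, which fills small gaps."""
    return erode(dilate(mask))


def coverage(mask: Mask) -> float:
    """Fraction of pixels that belong to the mask."""
    total = len(mask) * len(mask[0])
    return sum(map(sum, mask)) / total


def is_usable(mask: Mask, min_cov: float = 0.01, max_cov: float = 0.9) -> bool:
    """Illustrative sanity check; the bounds are assumptions, not paper values."""
    return min_cov <= coverage(mask) <= max_cov


if __name__ == "__main__":
    # 7x7 image with a 3x3 object that has a one-pixel hole in the middle.
    ring = [[0] * 7 for _ in range(7)]
    for r in range(2, 5):
        for c in range(2, 5):
            ring[r][c] = 1
    ring[3][3] = 0

    closed = close_small_holes(ring)
    print(closed[3][3], is_usable(closed))
```

In a real pipeline the same idea would more likely be applied with an image library's closing operator on array masks; the pure-Python version here only keeps the example self-contained.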

Related papers