Fugu-MT Paper Translation (Abstract): One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist

Paper overview: One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist

arxiv url: http://arxiv.org/abs/2508.21451v1
Date: Fri, 29 Aug 2025 09:29:27 GMT
Status: Translation complete
Last updated in system: 2025-09-01 19:45:10.990377
Title: One More Glance with Sharp Eyes: Rethinking Lightweight Captioning as a Practical Visual Specialist
Title (reference translation): One of the "Sharp Eyes": Rethinking Lightweight Captioning as a Real-World Visual Specialist
Authors: Junha Song, Yongsik Jo, So Yeon Min, Quanting Xie, Taehwan Kim, Yonatan Bisk, Jaegul Choo
Abstract summary: We developed a lightweight captioning model based on a language model 56 times smaller than LLaMA-7B. Our model can achieve performance comparable to large multimodal generalists. We also developed Sharp-Eyed Refinement.
Reference score (independently computed attention score): 58.89538703878721
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Image captioning is fundamental for applications like video instruction systems and exploration robots, yet deploying such models on local devices is challenging due to the high computational demands of multimodal large language models (MLLMs). To address this, we first explore lightweight captioning by implementing a specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluating its performance on both single-sentence and detailed captioning tasks. Surprisingly, we find that our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. While promising, our model also exhibits a limitation: like other MLLMs, it suffers from visual blindness, occasionally resulting in semantic captioning errors. We carry out toy experiments and investigate the underlying causes, where we observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we develop a novel captioning framework, Sharp-Eyed Refinement, which enhances caption quality through improved visual grounding. At its core, our DeepLens extracts detailed visual representations by concentrating on informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.
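Abstract (reference translation): Image captioning is fundamental to applications such as video instruction systems and exploration robots, but the high computational demands of multimodal large language models (MLLMs) make it difficult to deploy such models on local devices. We therefore first implement a lightweight captioning specialist based on a 125M-parameter language model, 56 times smaller than LLaMA-7B, and evaluate its performance on single-sentence and detailed captioning tasks. Surprisingly, our model can achieve performance comparable to large multimodal generalists, suggesting its potential to serve as a strong visual specialist for on-device applications. Like other MLLMs, however, it suffers from visual blindness, which occasionally causes semantic captioning errors. We carry out toy experiments to investigate the underlying causes and observe that the problems arise from ineffective attention mechanisms and limited visual representations. To alleviate them, we developed a captioning framework, Sharp-Eyed Refinement. At its core, DeepLens extracts detailed visual representations by concentrating on the informative regions identified during the initial glance. Our experiments confirm both the advantages of our specialist over prior small captioning models and large generalists and the effectiveness of our framework.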

Related papers