Fugu-MT Paper Translation (Abstract): FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

Paper abstract: FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions

arxiv url: http://arxiv.org/abs/2603.17326v1
Date: Wed, 18 Mar 2026 03:39:04 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.50177
Title: FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
Title (reference translation): FineViT: Progressively Unlocking Fine-Grained Perception with Dense Recaptions
Authors: Peisen Zhao, Xiaopeng Zhang, Mingxing Xu, Ruoyu Sun, Zewei Du, Dunzheng Wang, Guanghao Zheng, Haohang Xu, Zhibo Zhang, Yuhang Zhang, Yi Ai, Lin Liu, Qi Tian
Abstract summary: FineViT is a novel vision encoder specifically designed to unlock fine-grained perception.
Reference score (independently calculated attention score): 52.366937743884314
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Multimodal Large Language Models (MLLMs) have experienced rapid advancements, their visual encoders frequently remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks due to the loss of visual details caused by low-resolution pretraining and the reliance on noisy, coarse web-crawled image-text pairs. To overcome these limitations, we introduce FineViT, a novel vision encoder specifically designed to unlock fine-grained perception. By replacing coarse web data with dense recaptions, we systematically mitigate information loss through a progressive training paradigm: first, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Subsequently, we further enhance its local perception through LLM alignment, utilizing our curated FineCap-450M dataset that comprises over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. We hope FineViT could serve as a powerful new baseline for fine-grained visual perception.
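Abstract (reference translation): Multimodal Large Language Models (MLLMs) have advanced rapidly, but their visual encoders often remain a performance bottleneck. Conventional CLIP-based encoders struggle with dense spatial tasks because low-resolution pretraining loses visual detail and because they rely on noisy, coarse web-crawled image-text pairs. To overcome these limitations, the authors introduce FineViT, a vision encoder designed for fine-grained perception. Replacing coarse web data with dense recaptions, they systematically mitigate information loss through a progressive training paradigm. First, the encoder is trained from scratch at a high native resolution on billions of global recaptioned image-text pairs, establishing a robust, detail-rich semantic foundation. Its local perception is then further enhanced through LLM alignment using the curated FineCap-450M dataset, which contains over 450 million high-quality local captions. Extensive experiments validate the effectiveness of the progressive strategy. FineViT achieves state-of-the-art zero-shot recognition and retrieval performance, especially in long-context retrieval, and consistently outperforms multimodal visual encoders such as SigLIP2 and Qwen-ViT when integrated into MLLMs. The authors hope FineViT will serve as a powerful new baseline for fine-grained visual perception.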

List of related papers