Fugu-MT Paper Translation (Abstract): EffiMiniVLM: A Compact Dual-Encoder Regression Framework

Paper overview: EffiMiniVLM: A Compact Dual-Encoder Regression Framework

arxiv url: http://arxiv.org/abs/2604.03172v1
Date: Fri, 03 Apr 2026 16:48:59 GMT
Status: translation complete
Last updated in system: 2026-04-06 17:20:24.537181
Title: EffiMiniVLM: A Compact Dual-Encoder Regression Framework
Authors: Yin-Loon Khor, Yi-Jie Wong, Yan Chai Hum
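Abstract summary: EffiMiniVLM is a compact vision-language regression framework. It integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. It is trained on only 20% of the Amazon Reviews 2023 dataset.
Reference score (attention level computed by this site): 2.194788968762689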
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Predicting product quality from multimodal item information is critical in cold-start scenarios, where user interaction history is unavailable and predictions must rely on images and textual metadata. However, existing vision-language models typically depend on large architectures and/or extensive external datasets, resulting in high computational cost. To address this, we propose EffiMiniVLM, a compact dual-encoder vision-language regression framework that integrates an EfficientNet-B0 image encoder and a MiniLM-based text encoder with a lightweight regression head. To improve training sample efficiency, we introduce a weighted Huber loss that leverages rating counts to emphasize more reliable samples, yielding consistent performance gains. Trained using only 20% of the Amazon Reviews 2023 dataset, the proposed model contains 27.7M parameters and requires 6.8 GFLOPs, yet achieves a CES score of 0.40 with the lowest resource cost in the benchmark. Despite its small size, it remains competitive with significantly larger models, achieving comparable performance while being approximately 4x to 8x more resource-efficient than other top-5 methods and being the only approach that does not use external datasets. Further analysis shows that scaling the data to 40% alone allows our model to overtake other methods, which use larger models and datasets, highlighting strong scalability despite the model's compact design.
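The abstract mentions a weighted Huber loss that uses rating counts to emphasize more reliable samples, but it does not give the weighting function. The following is a minimal sketch of one way such a loss could look, not the authors' implementation. The weight log(1 + rating_count), the threshold delta = 1.0, and the function names are all assumptions made for illustration.

```python
import math


def huber(residual, delta=1.0):
    """Standard Huber penalty: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def weighted_huber_loss(preds, targets, rating_counts, delta=1.0):
    """Weighted mean of per-sample Huber penalties.

    ASSUMPTION: each sample is weighted by log(1 + rating_count), so items
    with many ratings count more. The abstract does not state the actual
    weighting function used in the paper.
    """
    if not (len(preds) == len(targets) == len(rating_counts)):
        raise ValueError("preds, targets and rating_counts must have equal length")
    weights = [math.log1p(c) for c in rating_counts]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one sample needs a positive rating count")
    weighted_sum = sum(
        w * huber(p - t, delta) for w, p, t in zip(weights, preds, targets)
    )
    return weighted_sum / total_weight
```

With equal rating counts this reduces to the ordinary mean Huber loss. When one item has far more ratings than the others, its residual dominates the average, which is the intended effect of trusting well-rated items more.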
