Fugu-MT Paper Translation (Abstract): InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Paper overview: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

arxiv url: http://arxiv.org/abs/2501.12368v1
Date: Tue, 21 Jan 2025 18:47:32 GMT
Status: Translation complete
Last updated in system: 2025-01-22 19:37:19.762523
Title: InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Title (reference translation): InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Authors: Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Ziyu Liu, Shengyuan Ding, Shenxi Wu, Yubo Ma, Haodong Duan, Wenwei Zhang, Kai Chen, Dahua Lin, Jiaqi Wang
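Abstract summary: IXC-2.5-Reward is a simple yet effective multi-modal reward model that aligns large vision-language models with human preferences. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks.
Reference score (independently computed attention score): 80.93387166769679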
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.
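Abstract (reference translation): Although Large Vision Language Models (LVLMs) show promising performance in visual understanding, they occasionally generate incorrect outputs. Reward models (RMs) combined with reinforcement learning or test-time scaling could improve generation quality, but publicly available multi-modal RMs for LVLMs are scarce and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward). To ensure the robustness and versatility of IXC-2.5-Reward, we built a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further describe three key applications of IXC-2.5-Reward: (1) providing a supervisory signal for RL training, where integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) selecting the best response from candidate responses for test-time scaling; and (3) filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer.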
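
Applications (2) and (3) in the abstract both reduce to scoring responses with a reward model and then either picking the top one or discarding low scorers. The sketch below illustrates that pattern only; it is not the IXC-2.5-Reward API. The names `ScoreFn`, `best_of_n`, `filter_low_reward`, and `toy_score` are hypothetical, the scorer takes text only (no image or video input), and `toy_score` is a word-overlap stand-in for a real reward model.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical signature: (prompt, response) -> scalar reward.
# This is an assumption for illustration, not the IXC-2.5-Reward interface.
ScoreFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], score_fn: ScoreFn) -> str:
    """Return the candidate with the highest reward (ties go to the earliest)."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda response: score_fn(prompt, response))


def filter_low_reward(
    samples: Sequence[Tuple[str, str]], score_fn: ScoreFn, threshold: float
) -> List[Tuple[str, str]]:
    """Keep (prompt, response) pairs whose reward is at least `threshold`."""
    return [(p, r) for p, r in samples if score_fn(p, r) >= threshold]


def toy_score(prompt: str, response: str) -> float:
    """Stand-in scorer: fraction of prompt words that appear in the response."""
    prompt_words = set(prompt.lower().split())
    if not prompt_words:
        return 0.0
    response_words = set(response.lower().split())
    return len(prompt_words & response_words) / len(prompt_words)


if __name__ == "__main__":
    prompt = "describe the red car"
    candidates = ["a blue boat", "the red car is parked", "red"]

    # Test-time scaling: sample several responses, keep the best-scoring one.
    print(best_of_n(prompt, candidates, toy_score))

    # Data cleaning: drop training pairs that the scorer rates below 0.5.
    pairs = [(prompt, c) for c in candidates]
    print(filter_low_reward(pairs, toy_score, threshold=0.5))
```

In practice the scorer would wrap a real reward model call, and the filtering threshold would need to be tuned on held-out data; the value 0.5 here is arbitrary.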

Related papers