Fugu-MT Paper Translation (Abstract): CPPO: Contrastive Perception for Vision Language Policy Optimization

Paper summary: CPPO: Contrastive Perception for Vision Language Policy Optimization

arxiv url: http://arxiv.org/abs/2601.00501v1
Date: Thu, 01 Jan 2026 22:48:26 GMT
Status: Translation complete
Last updated in system: 2026-01-05 15:04:33.478474
Title: CPPO: Contrastive Perception for Vision Language Policy Optimization
Authors: Ahmad Rezaei, Mohsen Gholami, Saeed Ranjbar Alvar, Kevin Cannons, Mohammad Asiful Hossain, Zhou Weimin, Shunbo Zhou, Yong Zhang, Mohammad Akbari
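Abstract summary: CPPO (Contrastive Perception Policy Optimization) is a method for finetuning vision-language models. It detects perception tokens via entropy shifts in the model outputs under perturbed input images. It then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones.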
Reference score (independently computed attention score): 15.695586206709566
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We introduce CPPO, a Contrastive Perception Policy Optimization method for finetuning vision-language models (VLMs). While reinforcement learning (RL) has advanced reasoning in language models, extending it to multimodal reasoning requires improving both the perception and reasoning aspects. Prior works tackle this challenge mainly with explicit perception rewards, but disentangling perception tokens from reasoning tokens is difficult, requiring extra LLMs, ground-truth data, forced separation of perception from reasoning by the policy model, or applying rewards indiscriminately to all output tokens. CPPO addresses this problem by detecting perception tokens via entropy shifts in the model outputs under perturbed input images. CPPO then extends the RL objective function with a Contrastive Perception Loss (CPL) that enforces consistency under information-preserving perturbations and sensitivity under information-removing ones. Experiments show that CPPO surpasses previous perception-rewarding methods, while avoiding extra models, making training more efficient and scalable.
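Illustrative sketch (not from the paper): the abstract gives no formulas, so the following is only one plausible reading of "entropy shift" detection and a contrastive perception loss. The top-fraction selection rule, the use of KL divergence, the hinge margin, and all function names (`detect_perception_tokens`, `contrastive_perception_loss`) are assumptions made for illustration and may differ from the authors' actual method.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def detect_perception_tokens(logits_clean, logits_removed, top_fraction=0.5):
    """Hypothetical rule: flag the tokens whose output entropy rises most
    when the image is perturbed so that information is removed."""
    shifts = [entropy(softmax(r)) - entropy(softmax(c))
              for c, r in zip(logits_clean, logits_removed)]
    k = max(1, int(len(shifts) * top_fraction))
    ranked = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    return sorted(ranked[:k]), shifts

def contrastive_perception_loss(logits_clean, logits_preserved, logits_removed,
                                token_ids, margin=0.5):
    """Hypothetical hinge loss over the selected tokens: stay close to the
    information-preserving view, stay at least `margin` further from the
    information-removing view."""
    total = 0.0
    for i in token_ids:
        p = softmax(logits_clean[i])
        kl_pos = kl_divergence(p, softmax(logits_preserved[i]))
        kl_neg = kl_divergence(p, softmax(logits_removed[i]))
        total += max(0.0, margin + kl_pos - kl_neg)
    return total / len(token_ids)

# Toy per-token logits over a 3-word vocabulary (made-up numbers).
logits_clean = [[4.0, 0.0, 0.0], [1.0, 1.0, 1.0]]      # original image
logits_preserved = [[3.9, 0.1, 0.0], [1.0, 1.0, 1.0]]  # information-preserving perturbation
logits_removed = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]    # information-removing perturbation

tokens, shifts = detect_perception_tokens(logits_clean, logits_removed)
loss = contrastive_perception_loss(logits_clean, logits_preserved,
                                   logits_removed, tokens)
print(tokens, loss)
```

In this toy setup only the first token depends on the image, so it is the one selected. In an actual training loop, a term of this kind would presumably be added to the RL objective with some weighting, which the abstract does not specify.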
