Fugu-MT Paper Translation (Abstract): Pixelis: Reasoning in Pixels, from Seeing to Acting

Paper overview: Pixelis: Reasoning in Pixels, from Seeing to Acting

arxiv url: http://arxiv.org/abs/2603.25091v1
Date: Thu, 26 Mar 2026 06:57:00 GMT
Status: Translation complete
Last updated in system: 2026-03-27 20:52:48.144071
Title: Pixelis: Reasoning in Pixels, from Seeing to Acting
Authors: Yunpeng Zhou
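Abstract summary: We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations. Across six public image and video benchmarks, Pixelis yields consistent improvements.
Reference score (independently computed attention score): 2.5754366051855837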
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most vision-language systems are static observers: they describe pixels, do not act, and cannot safely improve under shift. This passivity limits generalizable, physically grounded visual intelligence. Learning through action, not static description, is essential beyond curated data. We present Pixelis, a pixel-space agent that operates directly on images and videos via a compact set of executable operations (zoom/crop, segment, track, OCR, temporal localization) and learns from its consequences. Pixelis trains in three phases: (1) Supervised Fine-Tuning learns a pixel-tool grammar from Chain-of-Thought-Action traces with a masked imitation loss that upweights operation/argument tokens and auxiliary heads to stabilize pixel-grounded arguments; (2) Curiosity-Coherence Reward Fine-Tuning optimizes a dual-drive objective marrying prediction-error curiosity with adjacent-step coherence and a mild efficiency prior under a KL anchor, yielding short, valid, structured toolchains; (3) Pixel Test-Time RL performs label-free adaptation by retrieving neighbors, voting over complete trajectories rather than answers, and updating toward short, high-fidelity exemplars while constraining drift with a KL-to-EMA safety control. Across six public image and video benchmarks, Pixelis yields consistent improvements: the average relative gain is +4.08% over the same 8B baseline (peaking at +6.03% on VSI-Bench), computed as (ours-baseline)/baseline, while producing shorter, auditable toolchains and maintaining in-corridor KL during test-time learning. Acting within pixels, rather than abstract tokens, grounds multimodal perception in the physical world, linking visual reasoning with actionable outcomes, and enables embodied adaptation without external feedback.
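The abstract defines its headline metric as (ours-baseline)/baseline. A minimal Python sketch of that arithmetic follows. The function names and example scores are hypothetical placeholders, not values from the paper. Averaging per-benchmark gains (rather than taking the gain of averaged scores) is an assumption, since the abstract does not say which is used.

```python
def relative_gain(ours: float, baseline: float) -> float:
    """Relative gain in percent: (ours - baseline) / baseline * 100."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (ours - baseline) / baseline * 100.0


def average_relative_gain(scores: dict[str, tuple[float, float]]) -> float:
    """Mean of per-benchmark relative gains.

    `scores` maps a benchmark name to (ours, baseline).
    Assumption: gains are averaged per benchmark, not computed on pooled scores.
    """
    if not scores:
        raise ValueError("at least one benchmark is required")
    gains = [relative_gain(ours, base) for ours, base in scores.values()]
    return sum(gains) / len(gains)


# Hypothetical placeholder scores for illustration only, NOT results from the paper.
example_scores = {
    "bench_a": (52.0, 50.0),  # roughly +4% relative gain
    "bench_b": (42.4, 40.0),  # roughly +6% relative gain
}

if __name__ == "__main__":
    for name, (ours, base) in example_scores.items():
        print(f"{name}: {relative_gain(ours, base):+.2f}%")
    print(f"average: {average_relative_gain(example_scores):+.2f}%")
```

A relative gain is scale-dependent on the baseline: the same absolute improvement yields a larger percentage on a benchmark where the baseline score is low, which is worth keeping in mind when comparing a per-benchmark peak with the average.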

Related papers