Fugu-MT Paper Translation (Abstract): Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

Paper overview: Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration

arxiv url: http://arxiv.org/abs/2602.11241v1
Date: Wed, 11 Feb 2026 17:29:17 GMT
Status: translation complete
System update date: 2026-02-13 21:07:25.479761
Title: Active Zero: Self-Evolving Vision-Language Models through Active Environment Exploration
Authors: Jinghan He, Junfeng Fang, Feng Xiong, Zijun Yao, Fei Shen, Haiyun Guo, Jinqiao Wang, Tat-Seng Chua
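Abstract summary: The paper proposes Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents, including a Searcher that retrieves images from open-world repositories based on the model's capability frontier. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement).
Reference score (attention score computed independently by this site): 72.84714132070404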
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Self-play has enabled large language models to autonomously improve through self-generated challenges. However, existing self-play methods for vision-language models rely on passive interaction with static image collections, resulting in strong dependence on initial datasets and inefficient learning. Without the ability to actively seek visual data tailored to their evolving capabilities, agents waste computational effort on samples that are either trivial or beyond their current skill level. To address these limitations, we propose Active-Zero, a framework that shifts from passive interaction to active exploration of visual environments. Active-Zero employs three co-evolving agents: a Searcher that retrieves images from open-world repositories based on the model's capability frontier, a Questioner that synthesizes calibrated reasoning tasks, and a Solver refined through accuracy rewards. This closed loop enables self-scaffolding auto-curricula where the model autonomously constructs its learning trajectory. On Qwen2.5-VL-7B-Instruct across 12 benchmarks, Active-Zero achieves 53.97 average accuracy on reasoning tasks (5.7% improvement) and 59.77 on general understanding (3.9% improvement), consistently outperforming existing self-play baselines. These results highlight active exploration as a key ingredient for scalable and adaptive self-evolving vision-language systems.
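The closed loop described in the abstract (Searcher, then Questioner, then Solver, with accuracy rewards feeding back) can be illustrated with a toy simulation. This is only a sketch under strong assumptions and not the authors' implementation: images are reduced to a single hypothetical "difficulty" number, the model's capability is one scalar `skill`, and the function names, the success-probability rule, and the update rule are all invented for illustration.

```python
import random

def searcher(pool, skill, k):
    """Toy Searcher: pick the k items whose difficulty is closest to the
    current skill (a stand-in for the 'capability frontier')."""
    return sorted(pool, key=lambda d: abs(d - skill))[:k]

def questioner(image_difficulty, skill):
    """Toy Questioner: 'calibrate' a task by pulling its difficulty
    halfway toward the solver's current skill (assumed rule)."""
    return 0.5 * (image_difficulty + skill)

def solver_attempt(task_difficulty, skill, rng):
    """Toy Solver: return an accuracy reward of 1.0 or 0.0.
    Success gets less likely as the task exceeds the skill (assumed rule)."""
    p_success = min(1.0, max(0.0, 0.5 + (skill - task_difficulty)))
    return 1.0 if rng.random() < p_success else 0.0

def run_loop(pool, skill=0.2, rounds=20, k=4, lr=0.05, seed=0):
    """Run the Searcher -> Questioner -> Solver loop for a few rounds."""
    rng = random.Random(seed)  # seeded for reproducibility
    history = []
    for _ in range(rounds):
        images = searcher(pool, skill, k)
        tasks = [questioner(d, skill) for d in images]
        rewards = [solver_attempt(t, skill, rng) for t in tasks]
        mean_reward = sum(rewards) / len(rewards)
        skill += lr * mean_reward  # hypothetical update: skill grows with reward
        history.append((skill, mean_reward))
    return skill, history

if __name__ == "__main__":
    pool = [i / 10 for i in range(11)]  # hypothetical difficulties 0.0 .. 1.0
    final_skill, history = run_loop(pool)
    print(f"final skill: {final_skill:.3f}")
```

The point of the sketch is the data flow: because the Searcher re-selects items around the current `skill` each round, the sampled tasks track the model's ability instead of staying fixed, which is the contrast the abstract draws with static image collections.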
