Fugu-MT Paper Translation (Summary): MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

Paper summary: MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

arxiv url: http://arxiv.org/abs/2604.08516v1
Date: Thu, 09 Apr 2026 17:54:02 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:06.061066
Title: MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Authors: Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, Harsh Trivedi, Taylor Blanton, Caleb Ouellette, Winson Han, Ali Farhadi, Ranjay Krishna
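Abstract summary: MolmoWebMix combines browser task demonstrations with web-GUI perception data. MolmoWeb is a family of fully open multimodal web agents. We will release model checkpoints, training data, code, and a unified evaluation harness to enable open research on web agents.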
Reference score (independently computed attention score): 60.29597961827816
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Web agents--autonomous systems that navigate and execute tasks on the web on behalf of users--have the potential to transform how people interact with the digital world. However, the most capable web agents today rely on proprietary models with undisclosed training data and recipes, limiting scientific understanding, reproducibility, and community-driven progress. We believe agents for the open web should be built in the open. To this end, we introduce (1) MolmoWebMix, a large and diverse mixture of browser task demonstrations and web-GUI perception data and (2) MolmoWeb, a family of fully open multimodal web agents. Specifically, MolmoWebMix combines over 100K synthetic task trajectories from multiple complementary generation pipelines with 30K+ human demonstrations, atomic web-skill trajectories, and GUI perception data, including referring expression grounding and screenshot question answering. MolmoWeb agents operate as instruction-conditioned visual-language action policies: given a task instruction and a webpage screenshot, they predict the next browser action, requiring no access to HTML, accessibility trees, or specialized APIs. Available in 4B and 8B size, on browser-use benchmarks like WebVoyager, Online-Mind2Web, and DeepShop, MolmoWeb agents achieve state-of-the-art results outperforming similar scale open-weight-only models such as Fara-7B, UI-Tars-1.5-7B, and Holo1-7B. MolmoWeb-8B also surpasses set-of-marks (SoM) agents built on much larger closed frontier models like GPT-4o. We further demonstrate consistent gains through test-time scaling via parallel rollouts with best-of-N selection, achieving 94.7% and 60.5% pass@4 (compared to 78.2% and 35.3% pass@1) on WebVoyager and Online-Mind2Web respectively. We will release model checkpoints, training data, code, and a unified evaluation harness to enable reproducibility and accelerate open research on web agents.
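The pass@4 and pass@1 figures in the abstract come from running several rollouts per task. As a rough illustration only, the sketch below computes an "any of the first k rollouts succeeded" pass@k and the accuracy of a best-of-N pick over toy data. The `Rollout` type, the `selector_score` field, and the toy values are hypothetical; the abstract does not specify how rollouts are judged or selected, so this is not the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    success: bool          # whether this rollout completed the task (assumed judge verdict)
    selector_score: float  # hypothetical score used to choose among rollouts


def pass_at_k(tasks: list[list[Rollout]], k: int) -> float:
    """Fraction of tasks where at least one of the first k rollouts succeeds."""
    solved = sum(any(r.success for r in rollouts[:k]) for rollouts in tasks)
    return solved / len(tasks)


def best_of_n_accuracy(tasks: list[list[Rollout]], n: int) -> float:
    """Fraction of tasks where the highest-scoring of the first n rollouts succeeds."""
    picked = [max(rollouts[:n], key=lambda r: r.selector_score) for rollouts in tasks]
    return sum(r.success for r in picked) / len(tasks)


# Toy data: 4 tasks with 4 rollouts each (values invented for illustration).
tasks = [
    [Rollout(False, 0.2), Rollout(True, 0.9), Rollout(False, 0.1), Rollout(False, 0.3)],
    [Rollout(True, 0.8), Rollout(True, 0.7), Rollout(False, 0.4), Rollout(True, 0.6)],
    [Rollout(False, 0.5), Rollout(False, 0.6), Rollout(False, 0.2), Rollout(False, 0.1)],
    [Rollout(False, 0.9), Rollout(True, 0.4), Rollout(False, 0.3), Rollout(False, 0.2)],
]

print(pass_at_k(tasks, 1))           # 0.25
print(pass_at_k(tasks, 4))           # 0.75
print(best_of_n_accuracy(tasks, 4))  # 0.5
```

In this toy data, pass@4 is an upper bound on best-of-4 accuracy: the fourth task has a successful rollout, but the hypothetical selector picks a failed one.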

Related papers