Fugu-MT Paper Translation (Abstract): VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

Paper overview: VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers

arxiv url: http://arxiv.org/abs/2606.20728v1
Date: Wed, 17 Jun 2026 04:52:22 GMT
Status: Retrieving information
Last updated in system: 2026-06-23 11:21:13.61167
Title: VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers
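Title (reference translation): VTOS: Learning to Orchestrate Vision Tools by Co-Searching Solutions and Observers
Authors: Jinchao Ge, Lingqiao Liu, Shuwen Zhao, Lei Wang
Abstract summary: This paper introduces VTOS, a framework for adaptive visual tool orchestration through joint solution-observer search. VTOS is evaluated through two case studies: dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD.
Reference score (independently computed attention score): 23.939374004639756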
License:
Abstract: Vision foundation tools such as open-vocabulary detectors, segmentation models, and post-processing operators are powerful building blocks for computer vision, but their effectiveness depends heavily on how they are orchestrated: which tools are used, in what order, with what parameters, and under what visual conditions. Existing visual-programming agents typically generate a fixed solution pipeline, making them brittle under dense objects, occlusion, small targets, and domain shift. We introduce VTOS (Vision Tools Orchestration Search), a framework for adaptive visual tool orchestration through joint solution-observer search. VTOS co-searches executable solution programs that compose vision tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs that diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations are accumulated in a shared VisionThoughts knowledge base to guide subsequent search. We evaluate VTOS through two case studies: dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, which stress different orchestration challenges including threshold calibration, NMS, slicing, mask refinement, and domain generalization. Across both tasks, VTOS outperforms static tool pipelines and agentic visual-programming baselines, showing that co-searching solutions and observers is an effective strategy for adapting vision tools to challenging computer vision tasks.
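Abstract (reference translation): Vision foundation tools, such as open-vocabulary detectors, segmentation models, and post-processing operators, are powerful building blocks for computer vision, but how well they work depends heavily on how they are orchestrated: which tools are used, in what order, with what parameters, and under what visual conditions. Existing visual-programming agents usually generate a fixed solution pipeline, which makes them brittle under dense objects, occlusion, small targets, and domain shift. VTOS (Vision Tools Orchestration Search) is a framework for adaptive visual tool orchestration through joint solution-observer search. VTOS co-searches executable solution programs, which compose vision tools such as Grounding DINO, SAM, NMS, and slice-and-detect, together with observer programs, which diagnose candidate solutions, identify failure modes, and generate actionable feedback. These observations are accumulated in a shared VisionThoughts knowledge base and guide subsequent search. VTOS is evaluated through two case studies, dense object counting on LVIS-Count and zero-shot plant-disease segmentation on PlantSeg-OOD, which stress different orchestration challenges including threshold calibration, NMS, slicing, mask refinement, and domain generalization. On both tasks, VTOS outperforms static tool pipelines and agentic visual-programming baselines, showing that co-searching solutions and observers is an effective strategy for adapting vision tools to challenging computer vision tasks.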
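
Note (illustrative sketch, not from the paper): the abstract names threshold calibration and NMS as orchestration choices that matter for dense object counting. The Python sketch below shows why, using plain greedy non-maximum suppression on hand-made boxes. The function names `iou` and `count_objects`, the parameter names `score_thr` and `iou_thr`, and the example boxes are all assumptions introduced here for illustration; this is not the VTOS implementation and says nothing about how VTOS searches these parameters.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_objects(boxes, scores, score_thr=0.3, iou_thr=0.5):
    """Count detections after score filtering and greedy NMS.

    Hypothetical helper: returns (count, kept_indices).
    """
    # Drop low-confidence detections, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in kept):
            kept.append(i)
    return len(kept), kept


# Made-up detections: two heavily overlapping boxes, one separate box,
# and one low-confidence box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7, 0.1]

strict_count, _ = count_objects(boxes, scores, iou_thr=0.5)
loose_count, _ = count_objects(boxes, scores, iou_thr=0.7)
print(strict_count, loose_count)
```

The two overlapping boxes have an IoU of about 0.68, so the count changes depending on whether `iou_thr` is set below or above that value. In a dense scene, that is the difference between merging two adjacent objects and counting both, which is the kind of parameter sensitivity the abstract describes.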


Related papers