Fugu-MT Paper Translation (Abstract): Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

Paper overview: Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding

arxiv url: http://arxiv.org/abs/2605.07141v1
Date: Fri, 08 May 2026 02:20:40 GMT
Status: Translation complete
System update date: 2026-05-11 19:43:38.745577
Title: Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Title (reference translation): Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation via Vision-Language Grounding
Authors: Yuan Yao, Qiushi Yang, Humen Zhong, Jiangning Wei, Yifang Men, Shuai Bai, Miaomiao Cui, Zhibo Yang
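Abstract summary: Qwen3-VL-Seg is a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior. At its core is a lightweight box-guided mask decoder that combines multi-scale spatial feature injection, spatial-semantic query construction, and box-guided high-resolution pixel fusion. The paper shows that Qwen3-VL-Seg performs strongly in both closed-set and open-world settings.
Reference score (attention score computed by this site): 26.30521907946121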
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) exhibit strong open-world visual grounding, but their outputs remain limited to sparse bounding-box coordinates and are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either directly predict sparse contour coordinates, struggling to reconstruct continuous object boundaries, or rely on external segmentation foundation models such as the Segment Anything Model (SAM), introducing substantial architectural and deployment overhead. We present Qwen3-VL-Seg, a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core, a lightweight box-guided mask decoder combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, we construct SA1B-ORS, an SA-1B-derived dataset with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, we curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse referring expression types. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly across closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves general-purpose multimodal competence after segmentation-oriented adaptation.
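Abstract (reference translation): Open-world referring segmentation requires grounding unconstrained language expressions to precise pixel-level regions. Existing multimodal large language models (MLLMs) show strong open-world visual grounding, but their outputs are limited to sparse bounding-box coordinates, which are insufficient for dense visual prediction. Recent MLLM-based segmentation methods either predict sparse contour coordinates directly, and so struggle to reconstruct continuous object boundaries, or depend on external segmentation foundation models such as the Segment Anything Model (SAM), which adds substantial architectural and deployment overhead. Qwen3-VL-Seg is a parameter-efficient framework that treats the MLLM-predicted box as a semantically grounded structural prior and decodes it into pixel-level referring segmentation. At its core is a lightweight box-guided mask decoder that combines multi-scale spatial feature injection, spatial-semantic query construction, box-guided high-resolution pixel fusion, and iterative mask-aware query refinement, while introducing only 17M parameters (about 0.4% of the base model). For scalable open-world training, the authors construct SA1B-ORS, a dataset derived from SA-1B with two subsets: SA1B-CoRS (category-oriented samples) and SA1B-DeRS (descriptive, instance-specific samples). For evaluation, they curate ORS-Bench, a manually screened benchmark with in-distribution and out-of-distribution subsets covering diverse types of referring expression. Extensive experiments on referring expression segmentation, visual grounding, and ORS-Bench show that Qwen3-VL-Seg performs strongly in both closed-set and open-world settings, with clear advantages on language-intensive instructions and strong out-of-distribution generalization. Evaluations on general multimodal benchmarks further show that the model broadly preserves its general-purpose multimodal competence after segmentation-oriented adaptation.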
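
Illustrative note (not from the paper): the abstract argues that a bounding box alone is too coarse for dense prediction, yet is still useful as a structural prior. The sketch below is a toy illustration of that gap, not the Qwen3-VL-Seg decoder. It rasterizes a box into a binary mask and measures its IoU against a round object. The function names (`box_to_mask`, `mask_iou`, `disk_mask`), the 64x64 image size, and the disk-shaped target are all assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open pixel coordinates
Mask = List[List[int]]           # rows of 0/1 values


def box_to_mask(box: Box, height: int, width: int) -> Mask:
    """Rasterize a box into a binary mask, clipping it to the image bounds."""
    x0, y0, x1, y1 = box
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    return [[1 if (y0 <= y < y1 and x0 <= x < x1) else 0 for x in range(width)]
            for y in range(height)]


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = sum(pa & pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    union = sum(pa | pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return inter / union if union else 0.0


def disk_mask(cx: float, cy: float, radius: float, height: int, width: int) -> Mask:
    """Toy ground-truth object (hypothetical): a filled disk."""
    return [[1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
             for x in range(width)] for y in range(height)]


if __name__ == "__main__":
    h = w = 64
    target = disk_mask(32, 32, 16, h, w)
    prior = box_to_mask((16, 16, 49, 49), h, w)  # tight box around the disk
    print(f"IoU of box prior vs. disk: {mask_iou(prior, target):.3f}")
```

Even a perfectly tight box includes the corner pixels that the round object does not cover, so its IoU stays clearly below 1. A mask decoder conditioned on the box has to remove that residual region, which is the role the abstract assigns to the box-guided mask decoder.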

