Fugu-MT Paper Translation (Abstract): UGround: Towards Unified Visual Grounding with Unrolled Transformers

Paper overview: UGround: Towards Unified Visual Grounding with Unrolled Transformers

arxiv url: http://arxiv.org/abs/2510.03853v1
Date: Sat, 04 Oct 2025 15:56:52 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.29848
Title: UGround: Towards Unified Visual Grounding with Unrolled Transformers
Title (reference translation): UGround: Toward Unified Visual Grounding with Unrolled Transformers
Authors: Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou
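Abstract summary: UGround is a Unified visual Grounding paradigm that dynamically selects intermediate layers of Unrolled transformers as "mask as prompt". Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP).
Reference score (independently computed attention metric): 42.58167803005241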
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding within a single framework from an attribute perspective, spanning from traditional referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.
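Abstract (reference translation): UGround is a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", departing from the prevailing pipeline that uses the fixed last hidden layer as "<SEG> as prompt". UGround addresses two main challenges: (1) reliance on the fixed last hidden layer, which sequentially amplifies the cumulative errors that arise from layer-by-layer propagation without intermediate correction, and (2) the use of <SEG> as a prompt, which implicitly projects text embeddings into visual space without explicit spatial cues. Central to UGround is Policy-Prompted Masking, which consists of two main components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, through stochastic sampling, lets each <SEG> token slide across the unrolled transformer layers, enabling dynamic selection of the layer that connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and the image tokens as a soft logit mask to guide SAM in mask generation, providing explicit spatial cues through its activation regions. To validate the effectiveness of UGround, visual grounding is unified within a single framework from an attribute perspective for the first time, ranging from traditional referring expression segmentation to newly proposed reasoning segmentation, from single-target to multi-target, and from positive queries to false premises (empty targets). All code and models are publicly available at https://github.com/rui-qian/UGround.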
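
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names `sample_layer` and `similarity_map`, the softmax over hypothetical per-layer scores, and the use of a plain dot product as the similarity are all assumptions made for illustration. The reinforcement learning training of the policy and the interface to SAM are omitted entirely.

```python
import math
import random


def sample_layer(layer_logits, rng):
    """Stochastically pick a layer index (hypothetical stand-in for SSC).

    Assumption: the policy is a softmax over one score per layer.
    """
    top = max(layer_logits)
    weights = [math.exp(score - top) for score in layer_logits]
    return rng.choices(range(len(layer_logits)), weights=weights, k=1)[0]


def similarity_map(seg_vec, image_tokens, height, width):
    """Score each image token against the <SEG> embedding (stand-in for MasP).

    Assumption: similarity is a dot product; the result is reshaped into a
    height x width grid that could serve as a soft logit mask.
    """
    assert len(image_tokens) == height * width
    scores = [sum(s * t for s, t in zip(seg_vec, tok)) for tok in image_tokens]
    return [scores[r * width:(r + 1) * width] for r in range(height)]


# Toy hidden states: one <SEG> vector and a 2x2 grid of image tokens per layer.
rng = random.Random(0)
num_layers, dim, height, width = 3, 4, 2, 2
seg_states = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
image_states = [
    [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(height * width)]
    for _ in range(num_layers)
]

layer_logits = [0.1, 0.5, -0.2]  # hypothetical policy scores, one per layer
layer = sample_layer(layer_logits, rng)
soft_mask = similarity_map(seg_states[layer], image_states[layer], height, width)
```

In this sketch the grid in `soft_mask` is taken from whichever layer was sampled, so the spatial cue passed on as a prompt does not have to come from the last hidden layer.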

Related papers