Fugu-MT Paper Translation (Summary): Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection

Paper summary: Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection

arxiv url: http://arxiv.org/abs/2509.24192v1
Date: Mon, 29 Sep 2025 02:14:26 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.689888
Title: Talk in Pieces, See in Whole: Disentangling and Hierarchical Aggregating Representations for Language-based Object Detection
Authors: Sojung An, Kwanyong Park, Yong Jae Lee, Donghyun Kim
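Abstract summary: This paper proposes restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. The key insight is the need to disentangle textual tokens into core components (objects, attributes, and relations; "talk in pieces") and then aggregate them into hierarchically structured sentence-level representations. Experimental results on the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.
Reference score (independently computed attention metric): 39.748035737067745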
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While vision-language models (VLMs) have made significant progress in multimodal perception (e.g., open-vocabulary object detection) with simple language queries, state-of-the-art VLMs still show limited ability to perceive complex queries involving descriptive attributes and relational clauses. Our in-depth analysis shows that these limitations mainly stem from the text encoders in VLMs. Such text encoders behave like bag-of-words models and fail to separate target objects from their descriptive attributes and relations in complex queries, resulting in frequent false positives. To address this, we propose restructuring linguistic representations according to the hierarchical relations within sentences for language-based object detection. A key insight is the necessity of disentangling textual tokens into core components (objects, attributes, and relations; "talk in pieces") and subsequently aggregating them into hierarchically structured sentence-level representations ("see in whole"). Building on this principle, we introduce the TaSe framework with three main contributions: (1) a hierarchical synthetic captioning dataset spanning three tiers from category names to descriptive sentences; (2) Talk in Pieces, a three-component disentanglement module that, guided by a novel disentanglement loss function, transforms text embeddings into subspace compositions; and (3) See in Whole, which learns to aggregate the disentangled components into hierarchically structured embeddings under the guidance of the proposed hierarchical objectives. The proposed TaSe framework strengthens the inductive bias of hierarchical linguistic structures, resulting in fine-grained multimodal representations for language-based object detection. Experimental results on the OmniLabel benchmark show a 24% performance improvement, demonstrating the importance of linguistic compositionality.
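The bag-of-words behaviour described in the abstract can be illustrated with a toy sketch. This is not the TaSe framework or any real text encoder: the vocabulary, the random embeddings, and the helper names (`bag_of_words`, `role_aware`, `max_abs_diff`) are all hypothetical, and the object/attribute/relation parse is supplied by hand. It only shows that order-free pooling cannot tell two queries with swapped roles apart, while keeping one slot per role can.

```python
import random

DIM = 8
_rng = random.Random(0)

# Hypothetical toy vocabulary with random embeddings (stand-in for a text encoder).
VOCAB = {w: [_rng.gauss(0.0, 1.0) for _ in range(DIM)]
         for w in ("dog", "cat", "brown", "chasing")}


def mean_pool(vectors):
    """Average a list of vectors elementwise; zero vector if the list is empty."""
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def bag_of_words(tokens):
    """Order-free sentence embedding: the mean of all token embeddings."""
    return mean_pool([VOCAB[t] for t in tokens])


def role_aware(parsed):
    """Toy role-slotted embedding (an assumption, not the paper's module).

    `parsed` maps 'object', 'attributes', 'relations' to token lists.
    Each role is pooled separately and the results are concatenated,
    so every role occupies its own block of the output vector.
    """
    out = []
    for role in ("object", "attributes", "relations"):
        out.extend(mean_pool([VOCAB[t] for t in parsed[role]]))
    return out


def max_abs_diff(a, b):
    """Largest elementwise absolute difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    q1 = "brown dog chasing cat".split()
    q2 = "brown cat chasing dog".split()
    print("bag-of-words diff:", max_abs_diff(bag_of_words(q1), bag_of_words(q2)))

    p1 = {"object": ["dog"], "attributes": ["brown"], "relations": ["chasing", "cat"]}
    p2 = {"object": ["cat"], "attributes": ["brown"], "relations": ["chasing", "dog"]}
    print("role-aware diff:", max_abs_diff(role_aware(p1), role_aware(p2)))
```

Both queries contain the same four words, so the pooled embeddings agree up to floating-point rounding even though the target object differs. The role-slotted version separates them because the object slot holds a different vector in each query.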

Related papers