Fugu-MT Paper Translation (Abstract): Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

Paper summary: Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

arxiv url: http://arxiv.org/abs/2606.12985v1
Date: Thu, 11 Jun 2026 07:21:58 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.636482
Title: Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video
Title (reference translation): Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video
Authors: Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan
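Abstract summary: BabyMind is an object-first bias for child-view contrastive learning under sparse, noisy supervision. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks.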
Reference score (independently computed attention measure): 19.989128399393085
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.
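Abstract (reference translation): Learning grounded word meanings from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears, and where it is located in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and only weakly synchronized with the egocentric video, so single-frame contrastive pairing produces noisy positives in which the intended object is missing or entangled with distractors. This work proposes BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.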
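Illustrative note: the abstract's idea of aligning an utterance to a bag of object files can be pictured with a generic multiple-instance contrastive loss in the MIL-NCE style. The sketch below is an assumption-laden illustration, not the authors' objective: it omits the prototype space, the tracking step, and both regularizers, and the names `mil_contrastive_loss`, `cosine`, `logsumexp`, and the `temperature` value are hypothetical. For utterance i, every object in its own bag counts as a candidate positive, and objects in all other bags act as negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def mil_contrastive_loss(utterances, bags, temperature=0.1):
    """Mean MIL-NCE-style loss (illustrative, not the paper's objective).

    utterances[i] is one embedding; bags[i] is the list of candidate
    object embeddings paired with it. The loss for utterance i is
    log(sum over all objects) - log(sum over objects in bag i).
    """
    losses = []
    for i, u in enumerate(utterances):
        # Temperature-scaled similarity to every object in every bag.
        logits = [[cosine(u, o) / temperature for o in bag] for bag in bags]
        positive = logsumexp(logits[i])
        everything = logsumexp([s for bag_logits in logits for s in bag_logits])
        losses.append(everything - positive)
    return sum(losses) / len(losses)


# Toy data: two utterances, each with a bag of candidate objects.
utterances = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[[1.0, 0.1], [0.2, 1.0]], [[0.1, 1.0]]]  # bag 0 holds a distractor
swapped = [aligned[1], aligned[0]]                  # deliberately mismatched bags

print("aligned loss:", mil_contrastive_loss(utterances, aligned))
print("swapped loss:", mil_contrastive_loss(utterances, swapped))
```

Because the positive term sums over the whole bag, a bag that contains the referent alongside distractors can still yield a low loss, which is the property that makes multiple-instance objectives attractive for noisy pairings; the mismatched pairing gives a higher loss than the aligned one.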

Related papers