Fugu-MT Paper Translation (Summary): Contrastive Instruction-Trajectory Learning for Vision-Language Navigation

Paper summary: Contrastive Instruction-Trajectory Learning for Vision-Language Navigation

arxiv url: http://arxiv.org/abs/2112.04138v2
Date: Thu, 9 Dec 2021 06:36:57 GMT
Status: Translation complete
Last updated in system: 2021-12-10 12:55:50.593756
Title: Contrastive Instruction-Trajectory Learning for Vision-Language Navigation
Title (reference translation): Contrastive Learning for Vision-Language Navigation
Authors: Xiwen Liang, Fengda Zhu, Yi Zhu, Bingqian Lin, Bing Wang, Xiaodan Liang
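Abstract summary: The vision-language navigation (VLN) task requires an agent to reach a target under the guidance of natural language instructions. Prior work may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs, and it ignores the temporal continuity of sub-instructions. This paper proposes the Contrastive Instruction-Trajectory Learning framework, which explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
Reference score (attention score computed independently by this site): 66.16980504844233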
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction. Previous works learn to navigate step-by-step following an instruction. However, these works may fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions. These problems hinder agents from learning distinctive vision-and-language representations, harming the robustness and generalizability of the navigation policy. In this paper, we propose a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, we propose: (1) a coarse-grained contrastive learning objective to enhance vision-and-language representations by contrasting semantics of full trajectory observations and instructions, respectively; (2) a fine-grained contrastive learning objective to perceive instructions by leveraging the temporal information of the sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning to mine hard samples and hence mitigate the influence of data sampling bias in contrastive learning. Our CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.
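Abstract (reference translation): In the vision-language navigation (VLN) task, an agent must reach a target under the guidance of a natural language instruction. Previous works learn to navigate step by step by following an instruction. However, these works may fail to distinguish the similarities and discrepancies across instruction-trajectory pairs, and they ignore the temporal continuity of sub-instructions. These problems prevent agents from learning distinctive vision-and-language representations and harm the robustness and generalizability of the navigation policy. This paper proposes a Contrastive Instruction-Trajectory Learning (CITL) framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation. Specifically, it proposes: (1) a coarse-grained contrastive learning objective that enhances vision-and-language representations by contrasting the semantics of full trajectory observations and of instructions, respectively; (2) a fine-grained contrastive learning objective that perceives instructions by leveraging the temporal information of the sub-instructions; (3) a pairwise sample-reweighting mechanism for contrastive learning that mines hard samples and thereby mitigates the influence of data sampling bias. CITL can be easily integrated with VLN backbones to form a new learning paradigm and achieve better generalizability in unseen environments. Extensive experiments show that the model with CITL surpasses the previous state-of-the-art methods on R2R, R4R, and RxR.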
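To make the coarse-grained objective in the abstract more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss between instruction and trajectory embeddings. It is not the CITL formulation from the paper: the function names `cosine` and `info_nce`, the use of cosine similarity, and the `temperature` default are all assumptions made for illustration, and the fine-grained sub-instruction objective and the pairwise sample reweighting are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(instr_embs, traj_embs, temperature=0.1):
    """Mean InfoNCE loss for a batch where instr_embs[i] pairs with traj_embs[i].

    Hypothetical helper: every other trajectory in the batch is treated as a
    negative for instruction i. The temperature value is an assumption.
    """
    losses = []
    for i, instr in enumerate(instr_embs):
        logits = [cosine(instr, traj) / temperature for traj in traj_embs]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of the matching (positive) trajectory.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings: matched pairs give a lower loss than mismatched pairs.
instructions = [[1.0, 0.0], [0.0, 1.0]]
loss_matched = info_nce(instructions, [[1.0, 0.0], [0.0, 1.0]])
loss_mismatched = info_nce(instructions, [[0.0, 1.0], [1.0, 0.0]])
print(loss_matched < loss_mismatched)
```

In this sketch, minimizing the loss pulls each instruction toward its own trajectory and pushes it away from the other trajectories in the batch, which is the general idea behind contrasting instruction-trajectory pairs.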

Related papers

EvolveNav: Self-Improving Embodied Reasoning for LLM-Based Vision-Language Navigation [111.0993686148283]
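This paper proposes EvolveNav, a novel self-improving embodied reasoning framework for enhancing vision-language navigation. EvolveNav consists of two stages: (1) model training with formalized CoT labels, and (2) self-reflective post-training, in which the model is iteratively trained with its own reasoning outputs as self-enriched CoT labels to increase supervision diversity.
Paper reference translation (metadata) (2025-06-02T11:28:32Z)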
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
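This paper proposes a Rewriting-driven AugMentation (RAM) paradigm for vision-language navigation (VLN). By applying the rewriting mechanism, new observation-instruction pairs can be obtained in a manner that is both simulator-free and labor-saving, which promotes generalization. Experiments in both discrete environments (R2R, REVERIE, R4R) and a continuous environment (R2R-CE) show the strong performance and generalization ability of the method.
Paper reference translation (metadata) (2025-03-23T13:18:17Z)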
Vision-and-Language Navigation via Causal Learning [13.221880074458227]
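The cross-modal causal transformer (GOAT) is a pioneering solution rooted in the paradigm of causal inference. Its BACL and FACL modules promote unbiased learning by comprehensively mitigating potential spurious correlations. To capture global confounder features, the paper proposes a cross-modal feature pooling module supervised by contrastive learning.
Paper reference translation (metadata) (2024-04-16T02:40:35Z)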
TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
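This paper proposes a vision-language navigation (VLN) agent based on Large Language Models (LLMs). It introduces a thinking, interaction, and action framework to compensate for the shortcomings of LLMs in environmental perception. The approach also outperforms supervised learning-based methods, highlighting its effectiveness in zero-shot navigation.
Paper reference translation (metadata) (2024-03-13T05:22:39Z)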
Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning [125.61772424068903]
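Vision-language navigation (VLN) asks an agent to navigate a real 3D environment by following a given language instruction. This paper proposes a model-agnostic training paradigm, called Progressive Perturbation-aware Contrastive Learning (PROPER), to enhance the generalization ability of existing VLN agents.
Paper reference translation (metadata) (2024-03-09T02:34:13Z)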
Anticipating the Unseen Discrepancy for Vision and Language Navigation [63.399180481818405]
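In vision-language navigation, an agent must follow natural language instructions to reach a specific target. The large discrepancy between seen and unseen environments makes it difficult for the agent to generalize well. This work proposes Unseen Discrepancy Anticipating Vision and Language Navigation (DAVIS), which learns to generalize to unseen environments by encouraging test-time visual consistency.
Paper reference translation (metadata) (2022-09-10T19:04:40Z)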
Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
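This paper describes an approach that learns the two tasks (instruction following and instruction generation) simultaneously and exploits their intrinsic correlation to boost the training of each. The proposed approach improves the performance of various follower models and generates accurate navigation instructions.
Paper reference translation (metadata) (2022-03-30T18:15:26Z)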
Adversarial Reinforced Instruction Attacker for Robust Vision-Language Navigation [145.84123197129298]
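Language instruction plays an essential role in natural language grounded navigation tasks. This work trains a more robust navigator that dynamically extracts crucial factors from long instructions. Specifically, it proposes a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator into moving to the wrong target.
Paper reference translation (metadata) (2021-07-23T14:11:31Z)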
Language-guided Navigation via Cross-Modal Grounding and Alternate Adversarial Learning [66.9937776799536]
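The emerging vision-and-language navigation (VLN) problem aims to learn to navigate an agent toward a target location in unseen photo-realistic environments. The main challenges of VLN arise from two aspects: first, the agent needs to attend to the meaningful paragraphs of the language instruction that correspond to the dynamically varying visual environment. To this end, the paper proposes a cross-modal grounding module that equips the agent with the ability to track the correspondence between the textual and visual modalities.
Paper reference translation (metadata) (2020-11-22T09:13:46Z)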

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

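This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and assumes no responsibility for any consequences arising from the use of this site (including all information and translations).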