Accessible Instruction-Following Agent
- URL: http://arxiv.org/abs/2305.06358v1
- Date: Mon, 8 May 2023 23:57:26 GMT
- Title: Accessible Instruction-Following Agent
- Authors: Kairui Zhou
- Abstract summary: We introduce UVLN, a novel framework for cross-lingual vision-language navigation that augments instructions via machine translation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room-Across-Room dataset demonstrate the effectiveness of our approach.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans can collaborate and complete tasks based on visual signals and
instructions from the environment. Training such a robot is difficult, mainly
because it must understand both the instructions and a complicated environment.
Previous instruction-following agents are biased toward English-centric corpora,
which makes them impractical for users who speak other languages, including
low-resource languages. Moreover, these agents are pre-trained in a mode that
assumes the user can observe the environment, which limits their accessibility.
In this work, we aim to generalize the success of instruction-following agents
to non-English languages with few corpus resources, and to improve their
interactivity and accessibility. We introduce UVLN
(Universal Vision-Language Navigation), a novel framework for cross-lingual
vision-language navigation that augments instructions via machine translation
and combines a state-of-the-art large language model (GPT-3) with an image
captioning model (BLIP). We first collect a multilingual vision-language
navigation dataset via machine translation. Then we extend the
standard VLN training objectives to a multilingual setting via a cross-lingual
language encoder. The alignment between different languages is captured through
a shared vision and action context via a cross-modal transformer, which encodes
the inputs of language instruction, visual observation, and action decision
sequences. To improve the intractability, we connect our agent with the large
language model that informs the situation and current state to the user and
also explains the action decisions. Experiments over Room Across Room Dataset
prove the effectiveness of our approach. And the qualitative results show the
promising intractability and accessibility of our instruction-following agent.
Related papers
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- On the cross-lingual transferability of multilingual prototypical models across NLU tasks [2.44288434255221]
Supervised deep learning-based approaches have been applied to task-oriented dialog and have proven to be effective for limited domain and language applications.
In practice, these approaches suffer from the drawbacks of domain-driven design and under-resourced languages.
This article investigates the cross-lingual transferability of combining few-shot learning with prototypical neural networks and multilingual Transformer-based models.
arXiv Detail & Related papers (2022-07-19T09:55:04Z)
- Learning Flexible Translation between Robot Actions and Language Descriptions [16.538887534958555]
We propose paired gated autoencoders (PGAE) for flexible translation between robot actions and language descriptions.
We train our model in an end-to-end fashion by pairing each action with appropriate descriptions that contain a signal informing about the translation direction.
With the option to use a pretrained language model as the language encoder, our model has the potential to recognise unseen natural language input.
arXiv Detail & Related papers (2022-07-15T12:37:05Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by cross-modal encoders' success in visual-language tasks while we alter the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- VECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
"vokenization" is trained on relatively small image captioning datasets and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.