Target-dependent UNITER: A Transformer-Based Multimodal Language
Comprehension Model for Domestic Service Robots
- URL: http://arxiv.org/abs/2107.00811v1
- Date: Fri, 2 Jul 2021 03:11:02 GMT
- Title: Target-dependent UNITER: A Transformer-Based Multimodal Language
Comprehension Model for Domestic Service Robots
- Authors: Shintaro Ishikawa and Komei Sugiura
- Abstract summary: We propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image.
Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets.
Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, domestic service robots have an insufficient ability to interact
naturally through language. This is because understanding human instructions is
complicated by various ambiguities and missing information. In existing
methods, the referring expressions that specify the relationships between
objects are insufficiently modeled. In this paper, we propose Target-dependent
UNITER, which learns the relationship between the target object and other
objects directly by focusing on the relevant regions within an image, rather
than the whole image. Our method is an extension of the UNITER-based
Transformer that can be pretrained on general-purpose datasets. We extend the
UNITER approach by introducing a new architecture for handling the target
candidates. Our model is validated on two standard datasets, and the results
show that Target-dependent UNITER outperforms the baseline method in terms of
classification accuracy.
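The core idea of focusing on regions relevant to a target candidate, rather than pooling over the whole image, can be sketched in a few lines. The following is an illustrative toy (scaled dot-product attention in NumPy), not the paper's actual Target-dependent UNITER architecture; the function name, feature shapes, and use of the candidate's own region feature as the query are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def target_conditioned_attention(region_feats, target_idx):
    """Attend from one target candidate over all image regions.

    region_feats: (num_regions, dim) visual features, e.g. from an
    object detector; target_idx selects the candidate target object.
    Returns a context vector that mixes the regions most similar to
    the target, instead of pooling the whole image uniformly.
    """
    query = region_feats[target_idx]                      # (dim,)
    scores = region_feats @ query / np.sqrt(region_feats.shape[1])
    weights = softmax(scores)                             # (num_regions,)
    return weights @ region_feats, weights

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 8))                           # 5 regions, dim 8
context, w = target_conditioned_attention(feats, target_idx=2)
```

In a transformer such as UNITER, this weighting emerges from learned query/key projections rather than raw feature similarity; the sketch only shows why conditioning on the target candidate restricts which regions influence the prediction.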
Related papers
- Learning-To-Rank Approach for Identifying Everyday Objects Using a
Physical-World Search Engine [0.8749675983608172]
We focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting.
We propose MultiRankIt, which is a novel approach for the learning-to-rank physical objects task.
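A generic learning-to-rank formulation for such a retrieval task can be sketched as follows. This is a standard RankNet-style pairwise loss, shown only to illustrate the learning-to-rank framing; it is not MultiRankIt's actual objective, and both function names are hypothetical.

```python
import math

def pairwise_logistic_loss(score_pos, score_neg):
    """RankNet-style pairwise loss: penalizes the model when a
    relevant item is not scored above an irrelevant one."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))

def rank_items(scores):
    """Return item indices ordered from highest to lowest score,
    i.e. the ranked list shown to the human in the loop."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])
```

The loss shrinks toward zero as the margin between relevant and irrelevant scores grows, so minimizing it over many (relevant, irrelevant) pairs pushes target objects toward the top of the ranking.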
arXiv Detail & Related papers (2023-12-26T01:40:31Z)
- Exploiting Contextual Target Attributes for Target Sentiment
  Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- One-for-All: Towards Universal Domain Translation with a Single
  StyleGAN [86.33216867136639]
We propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains.
The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations.
UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
arXiv Detail & Related papers (2023-10-22T08:02:55Z)
- Switching Head-Tail Funnel UNITER for Dual Referring Expression
  Comprehension with Fetch-and-Carry Tasks [3.248019437833647]
This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions.
Most existing multimodal language understanding methods are computationally impractical for this task.
We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model.
arXiv Detail & Related papers (2023-07-14T05:27:56Z)
- Efficient Spoken Language Recognition via Multilabel
  Classification [53.662747523872305]
We show that our models obtain competitive results while being orders of magnitude smaller and faster than current state-of-the-art methods.
Our multilabel strategy is more robust to unseen non-target languages compared to multiclass classification.
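The robustness claim follows from a basic difference between the two decision rules, sketched below. This is a generic illustration of multilabel vs. multiclass prediction (independent sigmoids vs. a single softmax), not the paper's actual model; the threshold of 0.5 is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_predict(logits, threshold=0.5):
    """One independent sigmoid per language: an utterance can match
    zero languages (unseen / non-target) or several at once."""
    probs = sigmoid(np.asarray(logits, dtype=float))
    return [i for i, p in enumerate(probs) if p >= threshold]

def multiclass_predict(logits):
    """Softmax/argmax forces exactly one language, even when the
    input belongs to none of the known classes."""
    return int(np.argmax(logits))
```

When every logit is low, the multilabel rule can abstain (return an empty set), whereas the multiclass rule must still commit to one of the known languages, which is why the former degrades more gracefully on unseen non-target languages.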
arXiv Detail & Related papers (2023-06-02T23:04:19Z)
- Object-centric Inference for Language Conditioned Placement: A
  Foundation Model based Approach [12.016988248578027]
We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial constraints in language instructions.
We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable.
arXiv Detail & Related papers (2023-04-06T06:51:15Z)
- Compositional Generalization in Grounded Language Learning via Induced
  Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product
  Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- Incorporating Linguistic Knowledge for Abstractive Multi-document
  Summarization [20.572283625521784]
We develop a neural network based abstractive multi-document summarization (MDS) model.
We incorporate dependency information into a linguistic-guided attention mechanism.
With the help of linguistic signals, sentence-level relations can be correctly captured.
arXiv Detail & Related papers (2021-09-23T08:13:35Z)
- BURT: BERT-inspired Universal Representation from Learning Meaningful
  Segment [46.51685959045527]
This work introduces and explores universal representation learning, i.e., embedding different levels of linguistic units in a uniform vector space.
We present a universal representation model, BURT, to encode different levels of linguistic unit into the same vector space.
Specifically, we extract and mask meaningful segments based on point-wise mutual information (PMI) to incorporate different granular objectives into the pre-training stage.
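Point-wise mutual information for adjacent word pairs, the statistic named above, can be computed as follows. This is a minimal sketch of the standard PMI formula over corpus counts, not BURT's actual segment-extraction pipeline; the function name and the restriction to bigrams are assumptions.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ) for adjacent pairs.

    High-PMI pairs co-occur far more often than chance, so they are
    natural candidates for 'meaningful segments' to mask jointly."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    return {
        (x, y): math.log((c / n_bi) /
                         ((unigrams[x] / n_uni) * (unigrams[y] / n_uni)))
        for (x, y), c in bigrams.items()
    }
```

A pair like ("new", "york") that nearly always occurs together scores high, while a pair of frequent but independent words scores near zero, so thresholding PMI separates segments worth masking as a unit from incidental neighbors.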
arXiv Detail & Related papers (2020-12-28T16:02:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.