Enabling Robots to Draw and Tell: Towards Visually Grounded Multimodal
Description Generation
- URL: http://arxiv.org/abs/2101.12338v1
- Date: Thu, 14 Jan 2021 23:40:23 GMT
- Title: Enabling Robots to Draw and Tell: Towards Visually Grounded Multimodal
Description Generation
- Authors: Ting Han, Sina Zarrieß
- Abstract summary: Socially competent robots should be equipped with the ability to perceive the world that surrounds them and communicate about it in a human-like manner.
Representative skills that exhibit such ability include generating image descriptions and visually grounded referring expressions.
We propose to model the task of generating natural language together with free-hand sketches/hand gestures to describe visual scenes and real-life objects.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Socially competent robots should be equipped with the ability to perceive the
world that surrounds them and communicate about it in a human-like manner.
Representative skills that exhibit such ability include generating image
descriptions and visually grounded referring expressions. In the NLG community,
these generation tasks are largely investigated in non-interactive and
language-only settings. However, in face-to-face interaction, humans often
deploy multiple modalities to communicate, forming seamless integration of
natural language, hand gestures and other modalities like sketches. To enable
robots to describe what they perceive with speech and sketches/gestures, we
propose to model the task of generating natural language together with
free-hand sketches/hand gestures to describe visual scenes and real-life
objects, namely, visually-grounded multimodal description generation. In this
paper, we discuss the challenges and evaluation metrics of the task, and how
the task can benefit from progress recently made in the natural language
processing and computer vision realms, where related topics such as visually
grounded NLG, distributional semantics, and photo-based sketch generation have
been extensively studied.
Related papers
- EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning
This paper introduces a framework, called EMOTION, for generating expressive motion sequences in humanoid robots.
We conduct online user studies comparing the naturalness and understandability of the motions generated by EMOTION and its human-feedback version, EMOTION++.
arXiv Detail & Related papers (2024-10-30T17:22:45Z)
- Context-Aware Command Understanding for Tabletop Scenarios
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- Contextual Emotion Recognition using Large Vision Language Models
Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision.
In this paper, we examine two major approaches enabled by recent large vision language models.
We demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines.
arXiv Detail & Related papers (2024-05-14T23:24:12Z)
- Self-Explainable Affordance Learning with Embodied Caption
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Human-oriented Representation Learning for Robotic Manipulation
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion can handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z)
- Learning to Model the World with Language
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- Language-Driven Representation Learning for Robotics
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state of the art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
We present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation.
A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA) or a grasp action instruction.
We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes.
arXiv Detail & Related papers (2022-10-03T12:21:45Z) - Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.