HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
- URL: http://arxiv.org/abs/2308.12537v1
- Date: Thu, 24 Aug 2023 03:47:27 GMT
- Title: HuBo-VLM: Unified Vision-Language Model designed for HUman roBOt interaction tasks
- Authors: Zichao Dong, Weikun Zhang, Xufeng Huang, Hang Ji, Xin Zhan, Junbo Chen
- Abstract summary: Human-robot interaction is an exciting task that aims to guide robots to follow instructions from humans.
HuBo-VLM is proposed to tackle perception tasks associated with human-robot interaction.
- Score: 5.057755436092344
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-robot interaction is an exciting task that aims to guide robots to follow instructions from humans. Because a huge gap lies between human natural language and machine code, building end-to-end human-robot interaction models is fairly challenging. Furthermore, the visual information received from a robot's sensors is also difficult for the robot to interpret. In this work, HuBo-VLM is proposed to tackle perception tasks associated with human-robot interaction, including object detection and visual grounding, with a unified transformer-based vision-language model. Extensive experiments on the Talk2Car benchmark demonstrate the effectiveness of our approach. Code will be publicly available at
https://github.com/dzcgaara/HuBo-VLM.
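The abstract treats object detection and visual grounding as outputs of a single transformer-based vision-language model. As a rough, non-authoritative sketch (not the released HuBo-VLM code; the architecture sizes, toy tokenization, and coordinate-token scheme below are assumptions), the following shows one common way a unified encoder-decoder can perform grounding: encode image patches together with the instruction tokens, then decode the referred bounding box as a short sequence of quantized coordinate tokens.

```python
# Illustrative sketch only: a toy unified encoder-decoder that grounds a
# language instruction to one bounding box, decoded as coordinate tokens.
# All sizes, the dummy tokenization, and the bin scheme are assumptions.
import torch
import torch.nn as nn

NUM_BINS = 1000     # quantized coordinate vocabulary size (assumption)
TEXT_VOCAB = 30000  # toy text vocabulary size (assumption)
D_MODEL = 256

class ToyGroundingVLM(nn.Module):
    def __init__(self):
        super().__init__()
        # 224x224 RGB image -> 14x14 grid of 16x16 patch embeddings.
        self.patch_embed = nn.Conv2d(3, D_MODEL, kernel_size=16, stride=16)
        self.text_embed = nn.Embedding(TEXT_VOCAB, D_MODEL)
        self.coord_embed = nn.Embedding(NUM_BINS + 1, D_MODEL)  # +1 for a BOS token
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.head = nn.Linear(D_MODEL, NUM_BINS)  # logits over coordinate bins

    def forward(self, image, text_ids, coord_ids):
        # Encoder input: image patch tokens concatenated with instruction tokens.
        patches = self.patch_embed(image).flatten(2).transpose(1, 2)
        src = torch.cat([patches, self.text_embed(text_ids)], dim=1)
        # Decoder input: previously emitted coordinate tokens, causally masked.
        tgt = self.coord_embed(coord_ids)
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        out = self.transformer(src, tgt, tgt_mask=mask)
        return self.head(out)

    @torch.no_grad()
    def ground(self, image, text_ids):
        # Greedily decode four tokens interpreted as (x1, y1, x2, y2).
        seq = torch.full((image.size(0), 1), NUM_BINS, dtype=torch.long)  # BOS
        for _ in range(4):
            next_tok = self.forward(image, text_ids, seq)[:, -1].argmax(-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        # Map coordinate bins back to normalized [0, 1] image coordinates.
        return seq[:, 1:].float() / (NUM_BINS - 1)

if __name__ == "__main__":
    model = ToyGroundingVLM().eval()
    image = torch.randn(1, 3, 224, 224)                  # dummy camera frame
    instruction = torch.randint(0, TEXT_VOCAB, (1, 12))  # dummy tokenized command
    print(model.ground(image, instruction))              # normalized box coordinates
```

On a benchmark such as Talk2Car, a box decoded this way would typically be scored against the annotated referred object using an IoU threshold (commonly 0.5).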
Related papers
- LLM Granularity for On-the-Fly Robot Control [3.5015824313818578]
In circumstances where visuals become unreliable or unavailable, can we rely solely on language to control robots?
This work takes the initial steps to answer this question by: 1) evaluating the responses of assistive robots to language prompts of varying granularities; and 2) exploring the necessity and feasibility of controlling the robot on-the-fly.
arXiv Detail & Related papers (2024-06-20T18:17:48Z)
- HumanoidBench: Simulated Humanoid Benchmark for Whole-Body Locomotion and Manipulation [50.616995671367704]
We present a high-dimensional, simulated robot learning benchmark, HumanoidBench, featuring a humanoid robot equipped with dexterous hands.
Our findings reveal that state-of-the-art reinforcement learning algorithms struggle with most tasks, whereas a hierarchical learning approach achieves superior performance when supported by robust low-level policies.
arXiv Detail & Related papers (2024-03-15T17:45:44Z)
- HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs [9.806227900768926]
We propose a Human-Object Interaction (HOI) anticipation framework for collaborative robots.
We propose an efficient and robust transformer-based model to detect and anticipate HOIs from videos.
Our model surpasses state-of-the-art results in HOI detection and anticipation on the VidHOI dataset.
arXiv Detail & Related papers (2023-09-28T15:34:49Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Robots with Different Embodiments Can Express and Influence Carefulness in Object Manipulation [104.5440430194206]
This work investigates the perception of object manipulations performed with a communicative intent by two robots.
We designed the robots' movements to communicate either carefulness or its absence during the transportation of objects.
arXiv Detail & Related papers (2022-08-03T13:26:52Z)
- Body Gesture Recognition to Control a Social Robot [5.557794184787908]
We propose a gesture-based language to allow humans to interact with robots using their body in a natural way.
We created a new gesture-detection model using neural networks, trained on a custom dataset of humans performing a set of body gestures.
arXiv Detail & Related papers (2022-06-15T13:49:22Z)
- Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks [83.37025218216888]
We propose a novel explainable AI (XAI) framework for achieving human-like communication in human-robot collaborations.
The robot builds a hierarchical mind model of the human user and generates explanations of its own mind as a form of communications.
Results show that the explanations generated by our approach significantly improve collaboration performance and user perception of the robot.
arXiv Detail & Related papers (2020-07-24T23:35:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.