MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models
- URL: http://arxiv.org/abs/2412.00103v1
- Date: Wed, 27 Nov 2024 21:59:29 GMT
- Title: MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models
- Authors: Angus Fung, Aaron Hao Tan, Haitong Wang, Beno Benhabib, Goldie Nejat
- Abstract summary: We present MLLM-Search, a novel zero-shot person search architecture for mobile robots.
Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment.
Experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.
- Score: 5.28115111932163
- License:
- Abstract: Robotic search for people in human-centered environments, including healthcare settings, is challenging, as autonomous robots need to locate people without complete or any prior knowledge of their schedules, plans, or locations. Furthermore, robots need to be able to adapt to real-time events that can influence a person's plan in an environment. In this paper, we present MLLM-Search, a novel zero-shot person search architecture that leverages multimodal large language models (MLLMs) to address the mobile robot problem of searching for a person under event-driven scenarios with varying user schedules. Our approach introduces a novel visual prompting method that provides robots with spatial understanding of the environment by generating a spatially grounded waypoint map, representing navigable waypoints by a topological graph and regions by semantic labels. This is incorporated into an MLLM with a region planner that selects the next search region based on its semantic relevance to the search scenario, and a waypoint planner that generates a search path by considering the semantically relevant objects and the local spatial context through our unique spatial chain-of-thought prompting approach. Extensive 3D photorealistic experiments were conducted to validate the performance of MLLM-Search in searching for a person with a changing schedule in different environments. An ablation study was also conducted to validate the main design choices of MLLM-Search. Furthermore, a comparison study with state-of-the-art search methods demonstrated that MLLM-Search outperforms existing methods with respect to search efficiency. Real-world experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new, unseen environment.
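The pipeline described in the abstract can be pictured as a two-stage prompting loop over a spatially grounded waypoint map: a region planner picks the next region, then a waypoint planner orders that region's waypoints. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the `Waypoint` structure, the prompt wording, and the `query_mllm` callable are hypothetical, and the real system additionally grounds the map visually in the MLLM.

```python
# Minimal sketch of a region planner + waypoint planner loop over a topological
# waypoint map with semantic region labels. Illustrative only; prompts, data
# structures, and the query_mllm callable are assumptions, not MLLM-Search itself.
from dataclasses import dataclass, field

@dataclass
class Waypoint:
    name: str                                      # e.g. "w3"
    region: str                                    # semantic region label, e.g. "break room"
    nearby_objects: list = field(default_factory=list)
    neighbors: list = field(default_factory=list)  # connected waypoint names (topological graph)

def plan_region(waypoints, scenario, query_mllm):
    """Region planner: ask the model which region is most relevant to the search scenario."""
    regions = sorted({w.region for w in waypoints})
    prompt = (f"Search scenario: {scenario}\nRegions: {', '.join(regions)}\n"
              "Which single region is the person most likely in? Answer with the region name.")
    answer = query_mllm(prompt)
    return next((r for r in regions if r.lower() in answer.lower()), regions[0])

def plan_waypoints(waypoints, region, scenario, query_mllm):
    """Waypoint planner: order a region's waypoints with a step-by-step (chain-of-thought) prompt."""
    local = [w for w in waypoints if w.region == region]
    listing = "\n".join(
        f"{w.name}: near {', '.join(w.nearby_objects) or 'nothing notable'}, "
        f"connected to {', '.join(w.neighbors) or 'nothing'}"
        for w in local)
    prompt = (f"Search scenario: {scenario}\nWaypoints in {region}:\n{listing}\n"
              "Reason step by step about which objects and connections matter, then list the "
              "waypoint names in the order the robot should visit them.")
    answer = query_mllm(prompt)
    ranked = [w.name for w in local if w.name in answer]
    return ranked or [w.name for w in local]

# Example with a stubbed model; a real system would query a multimodal LLM here.
waypoints = [Waypoint("w1", "break room", ["coffee machine", "table"], ["w2"]),
             Waypoint("w2", "hallway", [], ["w1", "w3"]),
             Waypoint("w3", "office", ["desk", "monitor"], ["w2"])]
stub = lambda prompt: "break room: likely during a coffee break. Visit w1."
region = plan_region(waypoints, "Find Alex, who takes a coffee break around 10 am", stub)
path = plan_waypoints(waypoints, region, "Find Alex around 10 am", stub)
print(region, path)
```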
Related papers
- Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video [18.14234312389889]
We present a system which trains on spatially-localized egocentric videos in order to connect visual input and task descriptions.
We show our approach outperforms a baseline that uses a VLM to score the similarity of a task's description against a set of location-tagged images.
The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
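The baseline mentioned above can be approximated with an off-the-shelf vision-language model; the sketch below uses CLIP as a stand-in assumption, since the summary does not name the paper's actual baseline model or data format.

```python
# Hedged sketch of a VLM-similarity baseline: pick the location whose tagged image
# best matches the task description. CLIP is an assumed stand-in model.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def localize_task(task_description, location_tagged_images):
    """Return the location tag whose image is most similar to the task description.

    `location_tagged_images` is a list of (location_id, PIL.Image) pairs."""
    locations, images = zip(*location_tagged_images)
    inputs = processor(text=[task_description], images=list(images),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    scores = outputs.logits_per_text[0]   # shape (num_images,): text-image similarity
    return locations[int(scores.argmax())]
```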
arXiv Detail & Related papers (2024-07-18T18:55:56Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
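A rough sketch of this kind of language-biased exploration is shown below, under the assumptions that candidate frontiers come with short text descriptions and that an LLM is reachable through a `query_llm` callable; the prompt and weighting are illustrative, not LFG's exact formulation.

```python
# Hedged sketch: use an LLM relevance score to bias frontier selection on top of
# geometric path cost. Prompt, scoring scale, and weighting are assumptions.
import re

def choose_frontier(frontiers, target, distance_to, query_llm, w_semantic=1.0):
    """Pick the frontier minimizing (path cost - weight * LLM relevance score).

    `frontiers` maps a frontier id to a short text description of what is visible there,
    `distance_to(fid)` returns its geometric path cost, `query_llm(prompt)` returns text."""
    best, best_cost = None, float("inf")
    for fid, description in frontiers.items():
        prompt = (f"A robot is looking for: {target}.\n"
                  f"Near one unexplored frontier it sees: {description}.\n"
                  "On a scale of 0 to 10, how promising is this frontier? Answer with a number.")
        match = re.search(r"\d+", query_llm(prompt))
        score = min(int(match.group()), 10) if match else 0   # clamp the LLM's relevance score
        cost = distance_to(fid) - w_semantic * score          # language bias on top of path cost
        if cost < best_cost:
            best, best_cost = fid, cost
    return best
```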
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
- Active Visual Localization for Multi-Agent Collaboration: A Data-Driven Approach [47.373245682678515]
This work investigates how active visual localization can be used to overcome challenges of viewpoint changes.
Specifically, we focus on the problem of selecting the optimal viewpoint at a given location.
The result demonstrates the superior performance of the data-driven approach when compared to existing methods.
arXiv Detail & Related papers (2023-10-04T08:18:30Z)
- AI planning in the imagination: High-level planning on learned abstract search spaces [68.75684174531962]
We propose a new method, called PiZero, that gives an agent the ability to plan in an abstract search space that the agent learns during training.
We evaluate our method on multiple domains, including the traveling salesman problem, Sokoban, 2048, the facility location problem, and Pacman.
arXiv Detail & Related papers (2023-08-16T22:47:16Z)
- Learning Hierarchical Interactive Multi-Object Search for Mobile Manipulation [10.21450780640562]
We introduce a novel interactive multi-object search task in which a robot has to open doors to navigate rooms and search inside cabinets and drawers to find target objects.
These new challenges require combining manipulation and navigation skills in unexplored environments.
We present HIMOS, a hierarchical reinforcement learning approach that learns to compose exploration, navigation, and manipulation skills.
arXiv Detail & Related papers (2023-07-12T12:25:33Z)
- Generalized Object Search [0.9137554315375919]
This thesis develops methods and systems for (multi-)object search in 3D environments under uncertainty.
I implement a robot-independent, environment-agnostic system for generalized object search in 3D.
I deploy it on the Boston Dynamics Spot robot, the Kinova MOVO robot, and the Universal Robots UR5e robotic arm.
arXiv Detail & Related papers (2023-01-24T16:41:36Z)
- Active Visual Search in the Wild [12.354788629408933]
We propose a system where a user can enter target commands using free-form language.
We call this system Active Visual Search in the Wild (AVSW).
AVSW detects and plans to search for a target object input by the user through a semantic grid map represented by static landmarks.
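A toy sketch of planning over such a landmark-based semantic grid map is given below; the map format, the `similarity` function, and the nearest-first ordering are assumptions for illustration, not AVSW's implementation.

```python
# Hedged sketch: rank grid cells whose static landmark is semantically related to a
# free-form target command, then visit the relevant cells nearest-first.
from math import hypot

def plan_search_goals(landmark_cells, target, similarity, robot_xy, threshold=0.5):
    """Return candidate cells to search for a free-form target command.

    `landmark_cells` maps (x, y) grid cells to a static landmark label (e.g. "sink"),
    and `similarity(a, b)` returns a semantic relatedness score in [0, 1]."""
    candidates = [(cell, label) for cell, label in landmark_cells.items()
                  if similarity(target, label) >= threshold]
    # Visit semantically relevant landmarks nearest-first.
    candidates.sort(key=lambda c: hypot(c[0][0] - robot_xy[0], c[0][1] - robot_xy[1]))
    return [cell for cell, _ in candidates]
```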
arXiv Detail & Related papers (2022-09-19T07:18:46Z)
- Incremental 3D Scene Completion for Safe and Efficient Exploration Mapping and Planning [60.599223456298915]
We propose a novel way to integrate deep learning into exploration by leveraging 3D scene completion for informed, safe, and interpretable mapping and planning.
We show that our method can speed up coverage of an environment by 73% compared to the baselines with only minimal reduction in map accuracy.
Even if scene completions are not included in the final map, we show that they can be used to guide the robot to choose more informative paths, speeding up the measurement of the scene with the robot's sensors by 35%.
arXiv Detail & Related papers (2022-08-17T14:19:33Z)
- Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state-space guided by a modest number of human provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
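One simple way to realize this kind of example-guided exploration is to reward states that resemble the human-provided images in an embedding space; the sketch below is illustrative only, and the `embed` function is an assumption rather than BEE's actual learned relevance model.

```python
# Hedged sketch: exploration bonus from similarity to a few human-provided example
# states, compared in a shared embedding space. Not BEE's actual mechanism.
import numpy as np

def relevance_bonus(observation, example_embeddings, embed):
    """Cosine similarity of the current observation to the closest example image."""
    z = embed(observation)
    z = z / (np.linalg.norm(z) + 1e-8)
    sims = [float(z @ (e / (np.linalg.norm(e) + 1e-8))) for e in example_embeddings]
    return max(sims)
```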
arXiv Detail & Related papers (2020-10-22T17:49:25Z)