ObjectFinder: An Open-Vocabulary Assistive System for Interactive Object Search by Blind People
- URL: http://arxiv.org/abs/2412.03118v2
- Date: Wed, 30 Apr 2025 17:42:40 GMT
- Title: ObjectFinder: An Open-Vocabulary Assistive System for Interactive Object Search by Blind People
- Authors: Ruiping Liu, Jiaming Zhang, Angela Schön, Karin Müller, Junwei Zheng, Kailun Yang, Anhong Guo, Kathrin Gerling, Rainer Stiefelhagen,
- Abstract summary: We present ObjectFinder, an open-vocabulary wearable system for interactive object search by blind people. ObjectFinder allows users to query target objects using flexible wording. It provides egocentric localization information in real-time, including distance and direction.
- Score: 42.050924675417654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Searching for objects in unfamiliar scenarios is a challenging task for blind people. It involves specifying the target object, detecting it, and then gathering detailed information according to the user's intent. However, existing description- and detection-based assistive technologies do not sufficiently support the multifaceted nature of interactive object search tasks. We present ObjectFinder, an open-vocabulary wearable assistive system for interactive object search by blind people. ObjectFinder allows users to query target objects using flexible wording. Once the target object is detected, it provides egocentric localization information in real-time, including distance and direction. Users can then initiate different branches to gather detailed information based on their intent towards the target object, such as navigating to it or perceiving its surroundings. ObjectFinder is powered by a seamless combination of open-vocabulary models, namely an open-vocabulary object detector and a multimodal large language model. The ObjectFinder design concept and its development were carried out in collaboration with a blind co-designer. To evaluate ObjectFinder, we conducted an exploratory user study with eight blind participants. We compared ObjectFinder to BeMyAI and Google Lookout, popular description- and detection-based assistive applications. Our findings indicate that most participants felt more independent with ObjectFinder and preferred it for object search, as it enhanced scene context gathering and navigation, and allowed for active target identification. Finally, we discuss the implications for future assistive systems to support interactive object search.
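The abstract describes the architecture only at a high level: an open-vocabulary detector finds the queried object, a multimodal large language model handles intent-specific follow-ups, and the system reports distance and direction from the wearer's egocentric view. The Python sketch below illustrates roughly how such a distance/direction cue could be produced; it is not the authors' implementation, and the function names, the 70° field of view, the canned detector output, and the spoken phrasing are all invented for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str
    cx: float        # bounding-box centre x, normalised to [0, 1]
    depth_m: float   # metric depth at the box centre (e.g. from a depth sensor)

def detect_open_vocab(frame, query: str) -> List[Detection]:
    """Placeholder for an open-vocabulary detector; returns canned output here."""
    return [Detection(label=query, cx=0.72, depth_m=2.3)]

def egocentric_cue(det: Detection, horizontal_fov_deg: float = 70.0) -> str:
    """Convert a detection into a distance/direction cue of the kind described."""
    angle = (det.cx - 0.5) * horizontal_fov_deg   # offset from image centre, in degrees
    if abs(angle) < 10:
        direction = "straight ahead"
    elif angle < 0:
        direction = f"about {abs(angle):.0f} degrees to your left"
    else:
        direction = f"about {angle:.0f} degrees to your right"
    return f"{det.label}: roughly {det.depth_m:.1f} metres, {direction}"

def object_search(frame, query: str) -> str:
    """One step of a hypothetical search loop: detect, then announce the nearest match."""
    detections = detect_open_vocab(frame, query)
    if not detections:
        return f"No {query} found yet, keep scanning."
    target = min(detections, key=lambda d: d.depth_m)
    # A multimodal LLM would handle follow-up intents here (e.g. "what is around it?").
    return egocentric_cue(target)

print(object_search(frame=None, query="coffee mug"))
# -> coffee mug: roughly 2.3 metres, about 15 degrees to your right
```

In the described system, `detect_open_vocab` would wrap a real open-vocabulary detector and the commented intent branch would be served by the multimodal large language model.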
Related papers
- Interacted Object Grounding in Spatio-Temporal Human-Object Interactions [70.8859442754261]
We introduce a new open-world benchmark: Grounding Interacted Objects (GIO).
An object grounding task is proposed, in which vision systems are expected to discover interacted objects.
We propose a 4D question-answering framework (4D-QA) to discover interacted objects from diverse videos.
arXiv Detail & Related papers (2024-12-27T09:08:46Z)
- V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results [142.5704093410454]
The V3Det Challenge 2024 aims to push the boundaries of object detection research.
The challenge consists of two tracks: Vast Vocabulary Object Detection and Open Vocabulary Object Detection.
We aim to inspire future research directions in vast vocabulary and open-vocabulary object detection.
arXiv Detail & Related papers (2024-06-17T16:58:51Z)
- Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors [49.99728312519117]
The aim of this work is to establish how accurately a recent semantic-based active perception model is able to complete visual tasks that are regularly performed by humans.
This model exploits the ability of current object detectors to localize and classify a large number of object classes and to update a semantic description of a scene across multiple fixations.
In the task of scene exploration, the semantic-based method demonstrates superior performance compared to the traditional saliency-based model.
arXiv Detail & Related papers (2024-04-16T18:15:57Z)
- CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection [42.2847114428716]
Task driven object detection aims to detect object instances suitable for affording a task in an image.
The challenge is that the object categories suitable for a given task are too diverse to be covered by the closed vocabulary of traditional object detection.
We propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task.
arXiv Detail & Related papers (2023-09-03T06:18:39Z)
- Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
- DetGPT: Detect What You Need via Reasoning [33.00345609506097]
We introduce a new paradigm for object detection that we call reasoning-based object detection.
Unlike conventional object detection methods that rely on specific object names, our approach enables users to interact with the system using natural language instructions.
Our proposed method, called DetGPT, leverages state-of-the-art multi-modal models and open-vocabulary object detectors.
arXiv Detail & Related papers (2023-05-23T15:37:28Z)
- Discovering a Variety of Objects in Spatio-Temporal Human-Object Interactions [45.92485321148352]
In daily HOIs, humans often interact with a variety of objects, e.g., holding and touching dozens of household items while cleaning.
Here, we introduce a new benchmark based on AVA: Discovering Interacted Objects (DIO), which includes 51 interactions and 1,000+ objects.
An ST-HOI learning task is proposed, in which vision systems are expected to track human actors, detect interactions, and simultaneously discover objects.
arXiv Detail & Related papers (2022-11-14T16:33:54Z)
- Towards Open-Set Object Detection and Discovery [38.81806249664884]
We present a new task, namely Open-Set Object Detection and Discovery (OSODD).
We propose a two-stage method that first uses an open-set object detector to predict both known and unknown objects.
Then, we study the representation of predicted objects in an unsupervised manner and discover new categories from the set of unknown objects.
arXiv Detail & Related papers (2022-04-12T08:07:01Z)
- One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z)
- Detecting Human-Object Interaction via Fabricated Compositional Learning [106.37536031160282]
Human-Object Interaction (HOI) detection is a fundamental task for high-level scene understanding.
Humans have an extremely powerful compositional perception ability that lets them recognize rare or unseen HOI samples.
We propose Fabricated Compositional Learning (FCL) to address the problem of open long-tailed HOI detection.
arXiv Detail & Related papers (2021-03-15T08:52:56Z)
- GO-Finder: A Registration-Free Wearable System for Assisting Users in Finding Lost Objects via Hand-Held Object Discovery [23.33413589457104]
GO-Finder is a registration-free, wearable-camera-based system for assisting people in finding objects.
GO-Finder automatically detects and groups hand-held objects to form a visual timeline of the objects.
arXiv Detail & Related papers (2021-01-18T20:04:56Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
- Semantic Linking Maps for Active Visual Object Search [14.573513188682183]
We exploit background knowledge about common spatial relations between landmark and target objects.
We propose an active visual object search strategy by introducing the Semantic Linking Maps (SLiM) model.
Based on SLiM, we describe a hybrid search strategy that selects the next best view pose when searching for the target object.
arXiv Detail & Related papers (2020-06-18T18:59:44Z)
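The last entry above (Semantic Linking Maps) describes selecting the next best view pose from background knowledge about landmark-target spatial relations. As a toy illustration only, with made-up landmarks, probabilities, and scoring rule rather than anything taken from the SLiM paper, such a landmark-conditioned view score might look like:

```python
# Toy scoring of candidate view poses for active object search, loosely in the
# spirit of landmark-conditioned search. All names and numbers are invented.

# Assumed background knowledge: how likely the target is to sit near each landmark.
co_occurrence = {"mug": {"coffee_machine": 0.6, "sink": 0.3, "sofa": 0.05}}

# Landmarks visible from each candidate view pose, with detection confidence.
views = {
    "view_kitchen_counter": {"coffee_machine": 0.9, "sink": 0.7},
    "view_living_room":     {"sofa": 0.95},
}

def view_score(target: str, landmarks: dict) -> float:
    """Expected chance of finding the target, summed over visible landmarks."""
    prior = co_occurrence[target]
    return sum(conf * prior.get(name, 0.0) for name, conf in landmarks.items())

def next_best_view(target: str) -> str:
    """Pick the candidate view pose with the highest landmark-based score."""
    return max(views, key=lambda v: view_score(target, views[v]))

print(next_best_view("mug"))   # -> view_kitchen_counter
```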