Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization
- URL: http://arxiv.org/abs/2410.22707v1
- Date: Wed, 30 Oct 2024 05:34:52 GMT
- Title: Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization
- Authors: Kento Kawaharazuka, Yoshiki Obinata, Naoaki Kanazawa, Kei Okada, Masayuki Inaba
- Abstract summary: We propose a robotic state recognition method using a pre-trained vision-language model.
It is possible to recognize the open/closed state of transparent doors, whether water is running from a faucet, and even the qualitative state of whether a kitchen is clean.
- Score: 17.164384202639496
- Abstract: State recognition of the environment and objects, such as the open/closed state of doors and the on/off state of lights, is indispensable for robots that perform daily life support and security tasks. Until now, state recognition methods have been based on training neural networks from manual annotations, preparing special sensors for the recognition, or manually programming feature extraction from point clouds or raw images. In contrast, we propose a robotic state recognition method using a pre-trained vision-language model, which is capable of Image-to-Text Retrieval (ITR) tasks. We prepare several kinds of language prompts in advance, calculate the similarity between these prompts and the current image by ITR, and perform state recognition. By applying an optimal weighting to each prompt using black-box optimization, state recognition can be performed with higher accuracy. Experiments show that this method enables a variety of state recognition tasks by simply preparing multiple prompts, without retraining neural networks or manual programming. In addition, since only prompts and their weights need to be prepared for each recognizer, there is no need to maintain multiple models, which facilitates resource management. Through language alone, it is possible to recognize the open/closed state of transparent doors, whether water is running from a faucet, and even the qualitative state of whether a kitchen is clean, all of which have been challenging so far.
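To make the pipeline concrete, below is a minimal sketch of the idea described in the abstract, assuming CLIP (via the Hugging Face transformers library) as the pre-trained vision-language model and CMA-ES (via the cma package) as the black-box optimizer; the prompt texts, threshold, and labelled-set variables are hypothetical illustrations, not the paper's actual configuration.

```python
import cma
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical prompts for one binary recognizer (door open vs. closed).
PROMPTS = ["a door that is open", "a door that is closed"]

def prompt_similarities(image: Image.Image) -> np.ndarray:
    """ITR step: cosine similarity between the image and every prompt."""
    inputs = processor(text=PROMPTS, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds and text_embeds are L2-normalized projections, so the
    # matrix product below yields one cosine similarity per prompt.
    return (out.image_embeds @ out.text_embeds.T).squeeze(0).numpy()

def recognize(image: Image.Image, weights: np.ndarray,
              threshold: float = 0.0) -> bool:
    """Weighted sum of per-prompt similarities, thresholded into a state."""
    return float(weights @ prompt_similarities(image)) > threshold

def fit_weights(images, labels) -> np.ndarray:
    """Black-box optimization of the prompt weights with CMA-ES, treating
    classification error on a small labelled image set as the objective."""
    sims = np.stack([prompt_similarities(img) for img in images])
    y = np.asarray(labels, dtype=bool)

    def error(w):
        return float(np.mean((sims @ w > 0.0) != y))

    best, _ = cma.fmin2(error, x0=np.zeros(len(PROMPTS)), sigma0=0.5)
    return best
```

Because a recognizer here is just a prompt list plus a weight vector, adding another recognizer (say, faucet on/off) only requires a new PROMPTS list and a re-run of fit_weights; the vision-language model is shared and never retrained, which is the resource-management benefit the abstract highlights.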
Related papers
- Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization [17.164384202639496]
We perform unified environmental state recognition for robots through spoken language.
We show that it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed.
We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.
arXiv Detail & Related papers (2024-09-26T04:02:20Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Continuous Object State Recognition for Cooking Robots Using Pre-Trained Vision-Language Models and Black-box Optimization [18.41474014665171]
We propose a method for cooking robots to recognize the continuous state changes of food through spoken language.
We show that by adjusting the weighting of each text prompt, more accurate and robust continuous state recognition can be achieved.
arXiv Detail & Related papers (2024-03-13T04:45:40Z)
- Deep Learning-based Spatio Temporal Facial Feature Visual Speech Recognition [0.0]
We present an alternate authentication process that makes use of both facial recognition and the individual's distinctive temporal facial feature motions while they speak a password.
The suggested model attained an accuracy of 96.1% when tested on the industry-standard MIRACL-VC1 dataset.
arXiv Detail & Related papers (2023-04-30T18:52:29Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Towards End-to-end Unsupervised Speech Recognition [120.4915001021405]
We introduce wav2vec-U 2.0, which does away with all audio-side pre-processing and improves accuracy through a better architecture.
In addition, we introduce an auxiliary self-supervised objective that ties model predictions back to the input.
Experiments show that wav2vec-U 2.0 improves unsupervised recognition results across different languages while being conceptually simpler.
arXiv Detail & Related papers (2022-04-05T21:22:38Z)
- Learning to Prompt for Vision-Language Models [82.25005817904027]
Vision-language pre-training has emerged as a promising alternative for representation learning.
It shifts from the tradition of using images and discrete labels for learning a fixed set of weights, seen as visual concepts, to aligning images and raw text for two separate encoders.
Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks.
arXiv Detail & Related papers (2021-09-02T17:57:31Z)
- Skeleton Based Sign Language Recognition Using Whole-body Keypoints [71.97020373520922]
Sign language is used by deaf or speech-impaired people to communicate.
Skeleton-based recognition is becoming popular because it can be further ensembled with RGB-D-based methods to achieve state-of-the-art performance.
Inspired by the recent development of whole-body pose estimation (Jin et al., 2020), we propose recognizing sign language based on whole-body keypoints and features.
arXiv Detail & Related papers (2021-03-16T03:38:17Z)
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
- Multi-modal embeddings using multi-task learning for emotion recognition [20.973999078271483]
General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks.
We extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks.
arXiv Detail & Related papers (2020-09-10T17:33:16Z)
- Online Visual Place Recognition via Saliency Re-identification [26.209412893744094]
Existing methods often formulate visual place recognition as feature matching.
Inspired by the fact that human beings always recognize a place by remembering salient regions or landmarks, we formulate visual place recognition as saliency re-identification.
Meanwhile, we propose to perform both saliency detection and re-identification in the frequency domain, where all operations become element-wise.
arXiv Detail & Related papers (2020-07-29T01:53:45Z)