Continuous Object State Recognition for Cooking Robots Using Pre-Trained
Vision-Language Models and Black-box Optimization
- URL: http://arxiv.org/abs/2403.08239v1
- Date: Wed, 13 Mar 2024 04:45:40 GMT
- Title: Continuous Object State Recognition for Cooking Robots Using Pre-Trained
Vision-Language Models and Black-box Optimization
- Authors: Kento Kawaharazuka and Naoaki Kanazawa and Yoshiki Obinata and Kei
Okada and Masayuki Inaba
- Abstract summary: We propose a method to recognize the continuous state changes of food for cooking robots through spoken language.
We show that by adjusting the weighting of each text prompt, more accurate and robust continuous state recognition can be achieved.
- Score: 18.41474014665171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robots generally recognize the state of the environment and of
objects by judging the current state as a classification problem. State
changes of food during cooking, however, happen continuously and need to be
captured not only at a single point in time but over time. In addition, the
state changes of food are complex and cannot easily be described by manual
programming. We therefore propose a method to recognize the continuous state
changes of food for cooking robots through spoken language
using pre-trained large-scale vision-language models. By using models that can
compute the similarity between images and texts continuously over time, we can
capture the state changes of food while cooking. We also show that adjusting
the weighting of each text prompt, by fitting the similarity changes to a
sigmoid function and then performing black-box optimization, achieves more
accurate and robust continuous state recognition. We demonstrate the
effectiveness and limitations of this method by recognizing water boiling,
butter melting, egg cooking, and onion stir-frying.
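As a concrete illustration of the pipeline described in the abstract, here is a minimal sketch, not the authors' implementation: CLIP (via Hugging Face transformers) stands in for the pre-trained vision-language model, the prompts and frame paths are hypothetical, and SciPy's differential_evolution is used as a stand-in for the paper's unspecified black-box optimizer. One plausible reading of the abstract's objective is used here: per-frame image-text similarities are combined with per-prompt weights, and the weights are chosen so that the weighted similarity curve best fits a sigmoid over the course of the cooking video.

import numpy as np
from PIL import Image
from scipy.optimize import curve_fit, differential_evolution
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical text prompts describing the pre- and post-change states.
prompts = ["still water in a pot", "boiling water in a pot"]

def similarities(frames):
    # Cosine similarity between every video frame and every prompt.
    inputs = processor(text=prompts, images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).detach().numpy()  # shape: (n_frames, n_prompts)

def sigmoid(t, a, b, c, d):
    return a / (1.0 + np.exp(-b * (t - c))) + d

def fit_residual(weights, sims, t):
    # Weight the per-prompt similarities, fit a sigmoid to the resulting
    # curve, and return the fitting error (lower = cleaner state transition).
    signal = sims @ weights
    try:
        popt, _ = curve_fit(sigmoid, t, signal, maxfev=5000)
    except RuntimeError:
        return np.inf
    return float(np.mean((signal - sigmoid(t, *popt)) ** 2))

# Usage (frame_paths is a placeholder for extracted video frames):
# frames = [Image.open(p) for p in frame_paths]
# sims = similarities(frames)
# t = np.linspace(0.0, 1.0, len(frames))
# result = differential_evolution(fit_residual,
#                                 bounds=[(-1, 1)] * len(prompts),
#                                 args=(sims, t))
# state_signal = sims @ result.x  # continuous state estimate over time

The weighted signal state_signal then serves as the continuous state estimate; the sigmoid-fit residual as the optimization objective is an assumption made for this sketch.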
Related papers
- Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization [17.164384202639496]
We propose a robotic state recognition method using a pre-trained vision-language model.
It is possible to recognize the open/closed state of transparent doors, whether water is running from a faucet, and even the qualitative state of whether a kitchen is clean or not.
arXiv Detail & Related papers (2024-10-30T05:34:52Z)
- ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z)
- Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization [17.164384202639496]
We perform unified environmental state recognition for robots through spoken language.
We show that it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed.
We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.
arXiv Detail & Related papers (2024-09-26T04:02:20Z)
- Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food Types [17.835835270751176]
We introduce a novel visual imitation network with a spatial attention module for robotic assisted feeding (RAF).
We propose a framework that integrates visual perception with imitation learning to enable the robot to handle diverse scenarios during scooping.
Our approach, named AVIL (adaptive visual imitation learning), exhibits adaptability and robustness across different bowl configurations.
arXiv Detail & Related papers (2024-03-19T16:40:57Z)
- Food Image Classification and Segmentation with Attention-based Multiple Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z)
- Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared: one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
- Rethinking Cooking State Recognition with Vision Transformers [0.0]
The self-attention mechanism of the Vision Transformer (ViT) architecture is proposed for the cooking state recognition task.
The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset.
Our framework has an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-12-16T17:06:28Z)
- Counterfactual Recipe Generation: Exploring Compositional Generalization in a Realistic Scenario [60.20197771545983]
We design the counterfactual recipe generation task, which asks models to modify a base recipe according to the change of an ingredient.
We collect a large-scale recipe dataset in Chinese for models to learn culinary knowledge.
Results show that existing models have difficulties in modifying the ingredients while preserving the original text style, and often miss actions that need to be adjusted.
arXiv Detail & Related papers (2022-10-20T17:21:46Z)
- A Bayesian Treatment of Real-to-Sim for Deformable Object Manipulation [59.29922697476789]
We propose a novel methodology for extracting state information from image sequences via a technique to represent the state of a deformable object as a distribution embedding.
Our experiments confirm that we can estimate posterior distributions of physical properties, such as elasticity, friction and scale of highly deformable objects, such as cloth and ropes.
arXiv Detail & Related papers (2021-12-09T17:50:54Z)
- Classifying States of Cooking Objects Using Convolutional Neural Network [6.127963013089406]
The main aim is to make the cooking process easier and safer, and to improve human welfare.
It is important for robots to understand the cooking environment and recognize the objects, especially correctly identifying the state of the cooking objects.
In this project, several parts of the experiment were conducted to design a robust deep convolutional neural network for classifying the state of the cooking objects from scratch.
arXiv Detail & Related papers (2021-04-30T22:26:40Z)
- HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [54.051025148533554]
We develop a Hidden Markov Model approach for visual place recognition in autonomous driving.
Our algorithm, dubbed HM4, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory.
We show that this allows constant time and space inference for a fixed coverage area.
arXiv Detail & Related papers (2020-11-01T08:49:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.