Continuous Object State Recognition for Cooking Robots Using Pre-Trained
Vision-Language Models and Black-box Optimization
- URL: http://arxiv.org/abs/2403.08239v1
- Date: Wed, 13 Mar 2024 04:45:40 GMT
- Title: Continuous Object State Recognition for Cooking Robots Using Pre-Trained
Vision-Language Models and Black-box Optimization
- Authors: Kento Kawaharazuka and Naoaki Kanazawa and Yoshiki Obinata and Kei
Okada and Masayuki Inaba
- Abstract summary: We propose a method to recognize the continuous state changes of food for cooking robots through spoken language.
We show that by adjusting the weighting of each text prompt, more accurate and robust continuous state recognition can be achieved.
- Score: 18.41474014665171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robots generally recognize the state of the environment and of
objects by judging the current state as a classification problem. State
changes of food during cooking, however, happen continuously and need to be
captured not only at a single point in time but over time. In addition, the
state changes of food are complex and cannot easily be described by manual
programming. We therefore propose a method to recognize the continuous state
changes of food for cooking robots through spoken language
using pre-trained large-scale vision-language models. By using models that can
compute the similarity between images and texts continuously over time, we can
capture the state changes of food while cooking. We also show that adjusting
the weighting of each text prompt, by fitting the similarity changes to a
sigmoid function and then performing black-box optimization, achieves more
accurate and robust continuous state recognition. We demonstrate the
effectiveness and limitations of this method by recognizing water boiling,
butter melting, egg cooking, and onion stir-frying.
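As a concrete illustration of the pipeline described in the abstract, here is a minimal sketch, not the authors' implementation: CLIP (via Hugging Face transformers) stands in for the pre-trained vision-language model, the prompts and frame paths are hypothetical, and SciPy's differential_evolution is used as a stand-in for the paper's unspecified black-box optimizer. One plausible reading of the abstract's objective is used here: per-frame image-text similarities are combined with per-prompt weights, and the weights are chosen so that the weighted similarity curve best fits a sigmoid over the course of the cooking video.

import numpy as np
from PIL import Image
from scipy.optimize import curve_fit, differential_evolution
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical text prompts describing the pre- and post-change states.
prompts = ["still water in a pot", "boiling water in a pot"]

def similarities(frames):
    # Cosine similarity between every video frame and every prompt.
    inputs = processor(text=prompts, images=frames,
                       return_tensors="pt", padding=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).detach().numpy()  # shape: (n_frames, n_prompts)

def sigmoid(t, a, b, c, d):
    return a / (1.0 + np.exp(-b * (t - c))) + d

def fit_residual(weights, sims, t):
    # Weight the per-prompt similarities, fit a sigmoid to the resulting
    # curve, and return the fitting error (lower = cleaner state transition).
    signal = sims @ weights
    try:
        popt, _ = curve_fit(sigmoid, t, signal, maxfev=5000)
    except RuntimeError:
        return np.inf
    return float(np.mean((signal - sigmoid(t, *popt)) ** 2))

# Usage (frame_paths is a placeholder for extracted video frames):
# frames = [Image.open(p) for p in frame_paths]
# sims = similarities(frames)
# t = np.linspace(0.0, 1.0, len(frames))
# result = differential_evolution(fit_residual,
#                                 bounds=[(-1, 1)] * len(prompts),
#                                 args=(sims, t))
# state_signal = sims @ result.x  # continuous state estimate over time

The weighted signal state_signal then serves as the continuous state estimate; the sigmoid-fit residual as the optimization objective is an assumption made for this sketch.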
Related papers
- Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization [17.164384202639496]
We propose a robotic state recognition method using a pre-trained vision-language model.
It is possible to recognize the open/closed state of transparent doors, whether water is running from a faucet, and even the qualitative state of whether a kitchen is clean or not.
arXiv Detail & Related papers (2024-10-30T05:34:52Z)
- ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z)
- Robotic Environmental State Recognition with Pre-Trained Vision-Language Models and Black-Box Optimization [17.164384202639496]
We perform unified environmental state recognition for robots through spoken language.
We show that it is possible to recognize not only whether a room door is open/closed, but also whether a transparent door is open/closed.
We experimentally demonstrate the effectiveness of our method and apply it to the recognition behavior on a mobile robot, Fetch.
arXiv Detail & Related papers (2024-09-26T04:02:20Z)
- Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food Types [17.835835270751176]
We introduce a novel visual imitation network with a spatial attention module for robotic assisted feeding (RAF).
We propose a framework that integrates visual perception with imitation learning to enable the robot to handle diverse scenarios during scooping.
Our approach, named AVIL (adaptive visual imitation learning), exhibits adaptability and robustness across different bowl configurations.
arXiv Detail & Related papers (2024-03-19T16:40:57Z)
- Food Image Classification and Segmentation with Attention-based Multiple Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z)
- Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared: one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z)
- Rethinking Cooking State Recognition with Vision Transformers [0.0]
The self-attention mechanism of the Vision Transformer (ViT) architecture is proposed for the cooking state recognition task.
The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset.
Our framework has an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-12-16T17:06:28Z)
- Counterfactual Recipe Generation: Exploring Compositional Generalization in a Realistic Scenario [60.20197771545983]
We design the counterfactual recipe generation task, which asks models to modify a base recipe according to the change of an ingredient.
We collect a large-scale recipe dataset in Chinese for models to learn culinary knowledge.
Results show that existing models have difficulties in modifying the ingredients while preserving the original text style, and often miss actions that need to be adjusted.
arXiv Detail & Related papers (2022-10-20T17:21:46Z)
- A Bayesian Treatment of Real-to-Sim for Deformable Object Manipulation [59.29922697476789]
We propose a novel methodology for extracting state information from image sequences via a technique to represent the state of a deformable object as a distribution embedding.
Our experiments confirm that we can estimate posterior distributions of physical properties, such as elasticity, friction and scale of highly deformable objects, such as cloth and ropes.
arXiv Detail & Related papers (2021-12-09T17:50:54Z)
- Classifying States of Cooking Objects Using Convolutional Neural Network [6.127963013089406]
The main aim is to make the cooking process easier and safer, and to improve human welfare.
It is important for robots to understand the cooking environment and recognize the objects, especially correctly identifying the state of the cooking objects.
In this project, several parts of the experiment were conducted to design a robust deep convolutional neural network for classifying the state of the cooking objects from scratch.
arXiv Detail & Related papers (2021-04-30T22:26:40Z)
- HM4: Hidden Markov Model with Memory Management for Visual Place Recognition [54.051025148533554]
We develop a Hidden Markov Model approach for visual place recognition in autonomous driving.
Our algorithm, dubbed HM4, exploits temporal look-ahead to transfer promising candidate images between passive storage and active memory.
We show that this allows constant time and space inference for a fixed coverage area.
arXiv Detail & Related papers (2020-11-01T08:49:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.