OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking
- URL: http://arxiv.org/abs/2503.05962v1
- Date: Fri, 07 Mar 2025 22:03:21 GMT
- Title: OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking
- Authors: Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington
- Abstract summary: Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR, a novel approach that provides recipe progress tracking and context-aware feedback. We evaluated OSCAR's recipe-following functionality using 173 YouTube cooking videos and 12 real-world non-visual cooking videos.
- Score: 24.6085205199758
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR (Object Status and Contextual Awareness for Recipes), a novel approach that provides recipe progress tracking and context-aware feedback on the completion of cooking tasks by tracking object statuses. OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object statuses, and provide a cooking progress tracking log. We evaluated OSCAR's recipe-following functionality on 173 YouTube cooking videos and 12 real-world non-visual cooking videos to demonstrate its capability to track cooking steps and provide contextual guidance. Our results show that using object status improves performance over the baseline by more than 20% across different VLMs, and we present factors that impact prediction performance. Furthermore, we contribute a dataset of real-world non-visual cooking videos with step annotations as an evaluation benchmark.
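The abstract outlines a two-stage pipeline: an LLM rewrites each recipe step into the object statuses it should produce, and a VLM checks video frames against those expected statuses to infer cooking progress. Below is a minimal, hypothetical sketch of that structure, assuming generic `query_llm(prompt)` and `query_vlm(frame, question)` wrappers around whatever models are used; the prompts, data structures, and function names are illustrative only, not the authors' implementation.

```python
# Hypothetical sketch of an OSCAR-style pipeline (not the paper's code).
# `query_llm` and `query_vlm` stand in for any LLM / VLM API call.
from dataclasses import dataclass


@dataclass
class StepStatus:
    step: str                   # recipe step text
    object_statuses: list[str]  # e.g. "onion: diced", "pan: heated"
    completed: bool = False


def extract_object_statuses(recipe_steps: list[str], query_llm) -> list[StepStatus]:
    """Ask an LLM for the object statuses each step should produce once done."""
    statuses = []
    for step in recipe_steps:
        prompt = (
            "List the objects involved in this cooking step and the status each "
            f"should have once the step is complete, one per line.\nStep: {step}"
        )
        lines = [l for l in query_llm(prompt).strip().splitlines() if l]
        statuses.append(StepStatus(step=step, object_statuses=lines))
    return statuses


def track_progress(frame, statuses: list[StepStatus], query_vlm) -> int:
    """Align a video frame with expected object statuses.

    Marks steps whose statuses are visible as completed (the progress log)
    and returns the index of the first step that is still incomplete.
    """
    for i, s in enumerate(statuses):
        if s.completed:
            continue
        question = (
            "Do ALL of these object statuses appear in the image? "
            "Answer yes or no.\n" + "\n".join(s.object_statuses)
        )
        if query_vlm(frame, question).strip().lower().startswith("yes"):
            s.completed = True   # log this step as done
        else:
            return i             # the cook is currently on this step
    return len(statuses)         # all steps completed
```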
Related papers
- CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion [58.92430755180394]
We present CookingDiffusion, a novel approach to generate photo-realistic images of cooking steps. These prompts encompass text prompts, image prompts, and multi-modal prompts, ensuring the consistent generation of cooking procedural images. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images.
arXiv Detail & Related papers (2025-01-15T06:58:53Z) - ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z) - COM Kitchens: An Unedited Overhead-view Video Dataset as a Vision-Language Benchmark [13.623338371949337]
We propose a new dataset, COM Kitchens, which consists of unedited overhead-view videos captured by smartphones.
We propose a novel video-to-text retrieval task, Online Recipe Retrieval (OnRR), and a new video captioning domain, Dense Video Captioning on unedited Overhead-View videos (DVC-OV).
Our experiments verified the capabilities and limitations of current web-video-based SOTA methods in handling these tasks.
arXiv Detail & Related papers (2024-08-05T07:00:10Z) - Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food Types [17.835835270751176]
We introduce a novel visual imitation network with a spatial attention module for robotic assisted feeding (RAF).
We propose a framework that integrates visual perception with imitation learning to enable the robot to handle diverse scenarios during scooping.
Our approach, named AVIL (adaptive visual imitation learning), exhibits adaptability and robustness across different bowl configurations.
arXiv Detail & Related papers (2024-03-19T16:40:57Z) - Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z) - Learning and Verification of Task Structure in Instructional Videos [85.511888642497]
We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos.
Compared to prior work which learns step representations locally, our approach involves learning them globally.
We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order.
arXiv Detail & Related papers (2023-03-23T17:59:54Z) - Rethinking Cooking State Recognition with Vision Transformers [0.0]
The self-attention mechanism of the Vision Transformer (ViT) architecture is proposed for the cooking state recognition task.
The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset.
Our framework has an accuracy of 94.3%, which significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2022-12-16T17:06:28Z) - Structured Vision-Language Pretraining for Computational Cooking [54.0571416522547]
Vision-Language Pretraining and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks.
We propose to leverage these techniques for structured-text based computational cuisine tasks.
arXiv Detail & Related papers (2022-12-08T13:37:17Z) - Classifying States of Cooking Objects Using Convolutional Neural Network [6.127963013089406]
The main aim is to make the cooking process easier and safer, and to improve human welfare.
It is important for robots to understand the cooking environment and recognize the objects, especially correctly identifying the state of the cooking objects.
In this project, several experiments were conducted to design, from scratch, a robust deep convolutional neural network for classifying the states of cooking objects.
arXiv Detail & Related papers (2021-04-30T22:26:40Z) - Multi-modal Cooking Workflow Construction for Food Recipes [147.4435186953995]
We build MM-ReS, the first large-scale dataset for cooking workflow construction.
We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
arXiv Detail & Related papers (2020-08-20T18:31:25Z)