VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
- URL: http://arxiv.org/abs/2506.18569v1
- Date: Mon, 23 Jun 2025 12:23:21 GMT
- Title: VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
- Authors: Oleh Kuzyk, Zuoyue Li, Marc Pollefeys, Xi Wang
- Abstract summary: We introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
- Score: 50.84305074983752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cooking requires not only following instructions but also understanding, executing, and monitoring each step - a process that can be challenging without visual guidance. Although recipe images and videos offer helpful cues, they often lack consistency in focus, tools, and setup. To better support the cooking process, we introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios. Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object, while preserving the initial frame's environment. Previous work aims to integrate knowledge extracted from large language models by generating detailed textual descriptions to guide image generation, which requires fine-grained visual-textual alignment and involves additional annotations. In contrast, VisualChef simplifies alignment through mask-based visual grounding. Our key insight is identifying action-relevant objects and classifying them to enable targeted modifications that reflect the intended action and outcome while maintaining a consistent environment. In addition, we propose an automated pipeline to extract high-quality initial, action, and final state frames. We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
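The abstract's key insight, identifying action-relevant objects so that only their regions are modified while the environment is preserved, can be illustrated with a minimal sketch. This is not the paper's actual pipeline; the function name, the box format, and the relevance flags here are hypothetical, and a real system would obtain detections and relevance labels from an object detector and classifier before handing the mask to an inpainting model.

```python
import numpy as np

def build_inpaint_mask(frame_shape, detections):
    """Build a binary inpainting mask: 1 where the generator may
    change pixels, 0 where the environment must be preserved.

    detections: list of (label, (x0, y0, x1, y1), relevant) tuples,
    where `relevant` is True for action-relevant objects.
    """
    h, w = frame_shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for label, (x0, y0, x1, y1), relevant in detections:
        if relevant:
            # Only action-relevant regions are opened for inpainting.
            mask[y0:y1, x0:x1] = 1
    return mask

# Hypothetical example: a 100x100 frame where a knife is action-relevant
# but a wall clock belongs to the environment and must stay untouched.
dets = [
    ("knife", (10, 10, 40, 40), True),
    ("clock", (60, 60, 90, 90), False),
]
mask = build_inpaint_mask((100, 100), dets)
```

The mask covers only the knife's bounding box; everything else, including the clock, remains zero, which is how mask-based grounding keeps the initial frame's environment consistent.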
Related papers
- Chain-of-Cooking: Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance [6.4337734580551365]
We present a cooking process visualization model, called Chain-of-Cooking. To generate correct appearances of ingredients, we retrieve previously generated image patches as references. To enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance.
arXiv Detail & Related papers (2025-07-29T06:34:59Z) - OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking [24.6085205199758]
Following recipes while cooking is an important but difficult task for visually impaired individuals. We developed OSCAR, a novel approach that provides recipe progress tracking and context-aware feedback. We evaluated OSCAR's recipe following functionality using 173 YouTube cooking videos and 12 real-world non-visual cooking videos.
arXiv Detail & Related papers (2025-03-07T22:03:21Z) - CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion [58.92430755180394]
We present CookingDiffusion, a novel approach to generate photo-realistic images of cooking steps. Its prompts encompass text prompts, image prompts, and multi-modal prompts, ensuring the consistent generation of cooking procedural images. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images.
arXiv Detail & Related papers (2025-01-15T06:58:53Z) - ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions [66.20773952864802]
We develop a dataset consisting of 8.5k images and 59.3k inferences about actions grounded in those images.
We propose ActionCOMET, a framework to discern knowledge present in language models specific to the provided visual input.
arXiv Detail & Related papers (2024-10-17T15:22:57Z) - InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion could handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z) - 50 Ways to Bake a Cookie: Mapping the Landscape of Procedural Texts [15.185745028886648]
We propose an unsupervised learning approach for summarizing multiple procedural texts into an intuitive graph representation.
We demonstrate our approach on recipes, a prominent example of procedural texts.
arXiv Detail & Related papers (2022-10-31T11:41:54Z) - SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z) - Multi-modal Cooking Workflow Construction for Food Recipes [147.4435186953995]
We build MM-ReS, the first large-scale dataset for cooking workflow construction.
We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
arXiv Detail & Related papers (2020-08-20T18:31:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.