Related papers: Chain-of-Cooking:Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance

Chain-of-Cooking:Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance

URL: http://arxiv.org/abs/2507.21529v1
Date: Tue, 29 Jul 2025 06:34:59 GMT
Title: Chain-of-Cooking:Cooking Process Visualization via Bidirectional Chain-of-Thought Guidance
Authors: Mengling Xu, Ming Tao, Bing-Kun Bao,
Abstract summary: We present a cooking process visualization model, called Chain-of-Cooking.<n>To generate correct appearances of ingredients, we retrieve previously generated image patches as references.<n>To enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance.
Score: 6.4337734580551365
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cooking process visualization is a promising task in the intersection of image generation and food analysis, which aims to generate an image for each cooking step of a recipe. However, most existing works focus on generating images of finished foods based on the given recipes, and face two challenges to visualize the cooking process. First, the appearance of ingredients changes variously across cooking steps, it is difficult to generate the correct appearances of foods that match the textual description, leading to semantic inconsistency. Second, the current step might depend on the operations of previous step, it is crucial to maintain the contextual coherence of images in sequential order. In this work, we present a cooking process visualization model, called Chain-of-Cooking. Specifically, to generate correct appearances of ingredients, we present a Dynamic Patch Selection Module to retrieve previously generated image patches as references, which are most related to current textual contents. Furthermore, to enhance the coherence and keep the rational order of generated images, we propose a Semantic Evolution Module and a Bidirectional Chain-of-Thought (CoT) Guidance. To better utilize the semantics of previous texts, the Semantic Evolution Module establishes the semantical association between latent prompts and current cooking step, and merges it with the latent features. Then the CoT Guidance updates the merged features to guide the current cooking step remain coherent with the previous step. Moreover, we construct a dataset named CookViz, consisting of intermediate image-text pairs for the cooking process. Quantitative and qualitative experiments show that our method outperforms existing methods in generating coherent and semantic consistent cooking process.

Related papers

CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation [34.977083209936815]
CookAnything is a framework that generates coherent, semantically distinct image sequences from cooking instructions of arbitrary length.<n>It supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media, and procedural content creation.
arXiv Detail & Related papers (2025-12-03T08:01:48Z)
VisualChef: Generating Visual Aids in Cooking via Mask Inpainting [50.84305074983752]
We introduce VisualChef, a method for generating contextual visual aids tailored to cooking scenarios.<n>Given an initial frame and a specified action, VisualChef generates images depicting both the action's execution and the resulting appearance of the object.<n>We evaluate VisualChef quantitatively and qualitatively on three egocentric video datasets and show its improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-06-23T12:23:21Z)
CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion [58.92430755180394]
We present textbfCookingDiffusion, a novel approach to generate photo-realistic images of cooking steps.<n>These prompts encompass text prompts, image prompts, and multi-modal prompts, ensuring the consistent generation of cooking procedural images.<n>Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images.
arXiv Detail & Related papers (2025-01-15T06:58:53Z)
OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation [43.65207396061584]
OVFoodSeg is a framework that enhances text embeddings with visual context. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. By addressing the deficiencies of previous models, OVFoodSeg demonstrates a significant improvement, achieving a 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset.
arXiv Detail & Related papers (2024-04-01T18:26:29Z)
PizzaCommonSense: Learning to Model Commonsense Reasoning about Intermediate Steps in Cooking Recipes [7.839338724237275]
A model to effectively reason about cooking recipes must accurately discern and understand the inputs and outputs of intermediate steps within the recipe. We present a new corpus of cooking recipes enriched with descriptions of intermediate steps that describe the input and output for each step.
arXiv Detail & Related papers (2024-01-12T23:33:01Z)
Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives. We propose a simplified end-to-end model based on well established and high performing encoders for text and images. Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks. Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
Structure-Aware Generation Network for Recipe Generation from Images [142.047662926209]
We investigate an open research task of generating cooking instructions based on only food images and ingredients. Target recipes are long-length paragraphs and do not have annotations on structure information. We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task.
arXiv Detail & Related papers (2020-09-02T10:54:25Z)
Multi-modal Cooking Workflow Construction for Food Recipes [147.4435186953995]
We build MM-ReS, the first large-scale dataset for cooking workflow construction. We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
arXiv Detail & Related papers (2020-08-20T18:31:25Z)
Decomposing Generation Networks with Structure Prediction for Recipe Generation [142.047662926209]
We propose a novel framework: Decomposing Generation Networks (DGN) with structure prediction. Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase. Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure.
arXiv Detail & Related papers (2020-07-27T08:47:50Z)
Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities. We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.