Related papers: Multi-modal Cooking Workflow Construction for Food Recipes

URL: http://arxiv.org/abs/2008.09151v1
Date: Thu, 20 Aug 2020 18:31:25 GMT
Title: Multi-modal Cooking Workflow Construction for Food Recipes
Authors: Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, Tat-Seng Chua
Abstract summary: We build MM-ReS, the first large-scale dataset for cooking workflow construction. We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
Score: 147.4435186953995
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing its temporal workflow. This is a non-trivial task that involves common-sense reasoning. However, existing efforts rely on hand-crafted features to extract the workflow graph from recipes due to the lack of large-scale labeled datasets. Moreover, they fail to utilize the cooking images, which constitute an important part of food recipes. In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps are multi-modal, featuring both text instructions and cooking images. We then propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow, which achieves a performance gain of over 20% over existing hand-crafted baselines.

Related papers

CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion [58.92430755180394]
We present CookingDiffusion, a novel approach to generate photo-realistic images of cooking steps. These prompts encompass text prompts, image prompts, and multi-modal prompts, ensuring the consistent generation of cooking procedural images. Our experimental results demonstrate that our model excels at generating high-quality cooking procedural images.
arXiv Detail & Related papers (2025-01-15T06:58:53Z)
Retrieval Augmented Recipe Generation [96.43285670458803]
We propose a retrieval augmented large multimodal model for recipe generation. It retrieves recipes semantically related to the image from an existing datastore as a supplement. It calculates the consistency among generated recipe candidates, which use different retrieval recipes as context for generation.
arXiv Detail & Related papers (2024-11-13T15:58:50Z)
FIRE: Food Image to REcipe generation [10.45344523054623]
Food computing aims to develop end-to-end intelligent systems capable of autonomously producing recipe information for a food image. This paper proposes FIRE, a novel methodology tailored to recipe generation in the food computing domain. We showcase two practical applications that can benefit from integrating FIRE with large language model prompting.
arXiv Detail & Related papers (2023-08-28T08:14:20Z)
Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural Networks [23.378813327724686]
We formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings. We first present Large-RG, a new recipe graph dataset with over half a million nodes, making it the largest recipe graph to date. We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information.
arXiv Detail & Related papers (2022-05-24T23:04:02Z)
Learning Program Representations for Food Images and Cooking Recipes [26.054436410924737]
We propose to represent cooking recipes and food images as cooking programs. A model is trained to learn a joint embedding between recipes and food images via self-supervision. We show that projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results.
arXiv Detail & Related papers (2022-03-30T05:52:41Z)
Learning Structural Representations for Recipe Generation and Food Retrieval [101.97397967958722]
We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task. Our proposed model can produce high-quality and coherent recipes, and achieves state-of-the-art performance on the benchmark Recipe1M dataset.
arXiv Detail & Related papers (2021-10-04T06:36:31Z)
Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives. We propose a simplified end-to-end model based on well established and high performing encoders for text and images. Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks. Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
Structure-Aware Generation Network for Recipe Generation from Images [142.047662926209]
We investigate an open research task of generating cooking instructions based on only food images and ingredients. Target recipes are long-length paragraphs and do not have annotations on structure information. We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task.
arXiv Detail & Related papers (2020-09-02T10:54:25Z)
Decomposing Generation Networks with Structure Prediction for Recipe Generation [142.047662926209]
We propose a novel framework: Decomposing Generation Networks (DGN) with structure prediction. Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase. Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure.
arXiv Detail & Related papers (2020-07-27T08:47:50Z)
Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another. We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities. We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.