Multi-modal Cooking Workflow Construction for Food Recipes
- URL: http://arxiv.org/abs/2008.09151v1
- Date: Thu, 20 Aug 2020 18:31:25 GMT
- Title: Multi-modal Cooking Workflow Construction for Food Recipes
- Authors: Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, Tat-Seng Chua
- Abstract summary: We build MM-ReS, the first large-scale dataset for cooking workflow construction.
We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
- Score: 147.4435186953995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding food recipes requires anticipating the implicit causal effects
of cooking actions, such that the recipe can be converted into a graph
describing the temporal workflow of the recipe. This is a non-trivial task that
involves common-sense reasoning. However, existing efforts rely on hand-crafted
features to extract the workflow graph from recipes due to the lack of
large-scale labeled datasets. Moreover, they fail to utilize the cooking
images, which constitute an important part of food recipes. In this paper, we
build MM-ReS, the first large-scale dataset for cooking workflow construction,
consisting of 9,850 recipes with human-labeled workflow graphs. Cooking steps
are multi-modal, featuring both text instructions and cooking images. We then
propose a neural encoder-decoder model that utilizes both visual and textual
information to construct the cooking workflow, which achieved over 20%
performance gain over existing hand-crafted baselines.
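The idea of a recipe as a workflow graph can be illustrated with a minimal sketch. The steps and dependency edges below are hypothetical and are not taken from MM-ReS, and the "depends on" edge semantics is an assumption made for illustration; the paper's annotation scheme and model are not reproduced here. The sketch only shows how such a graph yields a valid execution order and reveals which steps are independent of each other.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical recipe steps (illustrative only, not from MM-ReS).
steps = {
    1: "Chop the onions.",
    2: "Boil the pasta.",
    3: "Fry the onions until golden.",
    4: "Toss the pasta with the fried onions.",
}

# Assumed edge semantics: each step maps to the steps whose output it needs.
depends_on = {
    1: set(),
    2: set(),
    3: {1},
    4: {2, 3},
}


def workflow_order(graph):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def ancestors(step, graph):
    """Collect every step that must finish (directly or indirectly) before `step`."""
    seen = set()
    stack = list(graph[step])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def independent_pairs(graph):
    """List step pairs with no dependency path between them in either direction."""
    nodes = sorted(graph)
    return [
        (a, b)
        for i, a in enumerate(nodes)
        for b in nodes[i + 1:]
        if a not in ancestors(b, graph) and b not in ancestors(a, graph)
    ]


if __name__ == "__main__":
    for step_id in workflow_order(depends_on):
        print(step_id, steps[step_id])
    print("Independent pairs:", independent_pairs(depends_on))
```

In this toy graph, chopping the onions and boiling the pasta have no path between them, so they could be done in either order or in parallel, which is the kind of structure a linear list of instructions leaves implicit.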
Related papers
- Retrieval Augmented Recipe Generation [96.43285670458803]
We propose a retrieval augmented large multimodal model for recipe generation.
It retrieves recipes semantically related to the image from an existing datastore as a supplement.
It calculates the consistency among generated recipe candidates, which use different retrieval recipes as context for generation.
arXiv Detail & Related papers (2024-11-13T15:58:50Z)
- FIRE: Food Image to REcipe generation [10.45344523054623]
Food computing aims to develop end-to-end intelligent systems capable of autonomously producing recipe information for a food image.
This paper proposes FIRE, a novel methodology tailored to recipe generation in the food computing domain.
We showcase two practical applications that can benefit from integrating FIRE with large language model prompting.
arXiv Detail & Related papers (2023-08-28T08:14:20Z)
- Recipe2Vec: Multi-modal Recipe Representation Learning with Graph Neural Networks [23.378813327724686]
We formalize the problem of multi-modal recipe representation learning to integrate the visual, textual, and relational information into recipe embeddings.
We first present Large-RG, a new recipe graph dataset with over half a million nodes, making it the largest recipe graph to date.
We then propose Recipe2Vec, a novel graph neural network based recipe embedding model to capture multi-modal information.
arXiv Detail & Related papers (2022-05-24T23:04:02Z)
- Learning Program Representations for Food Images and Cooking Recipes [26.054436410924737]
We propose to represent cooking recipes and food images as cooking programs.
A model is trained to learn a joint embedding between recipes and food images via self-supervision.
We show that projecting the image-recipe embeddings into programs leads to better cross-modal retrieval results.
arXiv Detail & Related papers (2022-03-30T05:52:41Z)
- Learning Structural Representations for Recipe Generation and Food Retrieval [101.97397967958722]
We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task.
Our proposed model produces high-quality, coherent recipes and achieves state-of-the-art performance on the benchmark Recipe1M dataset.
arXiv Detail & Related papers (2021-10-04T06:36:31Z)
- Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning [17.42688184238741]
Cross-modal recipe retrieval has recently gained substantial attention due to the importance of food in people's lives.
We propose a simplified end-to-end model based on well established and high performing encoders for text and images.
Our proposed method achieves state-of-the-art performance in the cross-modal recipe retrieval task on the Recipe1M dataset.
arXiv Detail & Related papers (2021-03-24T10:17:09Z)
- CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks.
Our experiments show that by making use of efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework, we are able to identify the main ingredients and cooking actions in the recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
- Structure-Aware Generation Network for Recipe Generation from Images [142.047662926209]
We investigate an open research task of generating cooking instructions based on only food images and ingredients.
Target recipes are long paragraphs and carry no annotations of their structure.
We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task.
arXiv Detail & Related papers (2020-09-02T10:54:25Z)
- Decomposing Generation Networks with Structure Prediction for Recipe Generation [142.047662926209]
We propose a novel framework: Decomposing Generation Networks (DGN) with structure prediction.
Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase.
Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure.
arXiv Detail & Related papers (2020-07-27T08:47:50Z)
- Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.