A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
- URL: http://arxiv.org/abs/2005.09606v1
- Date: Tue, 19 May 2020 17:27:00 GMT
- Title: A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks
- Authors: Angela S. Lin, Sudha Rao, Asli Celikyilmaz, Elnaz Nouri, Chris
Brockett, Debadeepta Dey, Bill Dolan
- Abstract summary: In the cooking domain, the web offers many partially-overlapping text and video recipes that describe how to make the same dish.
We use an unsupervised alignment algorithm that learns pairwise alignments between instructions of different recipes for the same dish.
We then use a graph algorithm to derive a joint alignment between multiple text and multiple video recipes for the same dish.
- Score: 48.39191088844315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many high-level procedural tasks can be decomposed into sequences of
instructions that vary in their order and choice of tools. In the cooking
domain, the web offers many partially-overlapping text and video recipes (i.e.
procedures) that describe how to make the same dish (i.e. high-level task).
Aligning instructions for the same dish across different sources can yield
descriptive visual explanations that are far richer semantically than
conventional textual instructions, providing commonsense insight into how
real-world procedures are structured. Learning to align these different
instruction sets is challenging because: a) different recipes vary in their
order of instructions and use of ingredients; and b) video instructions can be
noisy and tend to contain far more information than text instructions. To
address these challenges, we first use an unsupervised alignment algorithm that
learns pairwise alignments between instructions of different recipes for the
same dish. We then use a graph algorithm to derive a joint alignment between
multiple text and multiple video recipes for the same dish. We release the
Microsoft Research Multimodal Aligned Recipe Corpus containing 150K pairwise
alignments between recipes across 4,262 dishes with rich commonsense
information.
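To make the two-step pipeline concrete, here is a minimal Python sketch of the idea: score pairwise alignments between the instruction lists of two recipes, then merge all pairwise alignments for a dish into a graph whose connected components act as joint alignment clusters. The bag-of-characters encoder, the greedy matcher, and the 0.5 similarity threshold are illustrative stand-ins, not the algorithm behind the released corpus.
```python
import itertools

import networkx as nx
import numpy as np


def embed(instruction):
    """Stand-in sentence encoder: a normalized bag-of-characters vector.

    Purely illustrative; any real sentence encoder could be dropped in here.
    """
    v = np.zeros(256)
    for ch in instruction.lower():
        v[ord(ch) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)


def pairwise_align(steps_a, steps_b, threshold=0.5):
    """Greedy unsupervised alignment between two instruction lists.

    Returns (i, j) index pairs whose cosine similarity clears `threshold`,
    matching each step in steps_a to at most one unused step in steps_b.
    """
    emb_a = [embed(s) for s in steps_a]
    emb_b = [embed(s) for s in steps_b]
    pairs, used = [], set()
    for i, ea in enumerate(emb_a):
        candidates = [(float(ea @ eb), j)
                      for j, eb in enumerate(emb_b) if j not in used]
        if not candidates:
            break
        sim, j = max(candidates)
        if sim >= threshold:
            pairs.append((i, j))
            used.add(j)
    return pairs


def joint_align(recipes):
    """Merge all pairwise alignments for one dish into joint clusters.

    `recipes` maps a recipe id (text or video source) to its instruction
    list; nodes are (recipe_id, step_index) pairs and each connected
    component is one cluster of mutually aligned instructions.
    """
    g = nx.Graph()
    for (a, steps_a), (b, steps_b) in itertools.combinations(recipes.items(), 2):
        for i, j in pairwise_align(steps_a, steps_b):
            g.add_edge((a, i), (b, j))
    return [sorted(c) for c in nx.connected_components(g)]


recipes = {
    "text_1": ["Preheat the oven.", "Mix flour and sugar.", "Bake for 20 minutes."],
    "text_2": ["Combine the flour with the sugar.", "Bake about twenty minutes."],
}
print(joint_align(recipes))
```
Every node in the merged graph is a (recipe_id, step_index) pair and every edge records one pairwise match, so a connected component groups instructions that describe the same underlying step across text and video sources.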
Related papers
- PizzaCommonSense: Learning to Model Commonsense Reasoning about Intermediate Steps in Cooking Recipes [7.839338724237275]
A model to effectively reason about cooking recipes must accurately discern and understand the inputs and outputs of intermediate steps within the recipe.
We present a new corpus of cooking recipes enriched with descriptions of intermediate steps that describe the input and output for each step.
arXiv Detail & Related papers (2024-01-12T23:33:01Z)
- Towards End-User Development for IoT: A Case Study on Semantic Parsing of Cooking Recipes for Programming Kitchen Devices [4.863892359772122]
We provide a unique corpus that aims to support the transformation of cooking recipe instructions into machine-understandable commands for IoT devices in the kitchen.
Based on this corpus, we developed machine-learning-based sequence labelling methods, namely conditional random fields (CRFs) and a neural network model; a minimal CRF tagging sketch appears after this list.
Our results show that while it is feasible to train semantic parsers on our annotations, most natural-language instructions are incomplete, so transforming them into formal meaning representations is not straightforward.
arXiv Detail & Related papers (2023-09-25T14:21:24Z)
- 50 Ways to Bake a Cookie: Mapping the Landscape of Procedural Texts [15.185745028886648]
We propose an unsupervised learning approach for summarizing multiple procedural texts into an intuitive graph representation.
We demonstrate our approach on recipes, a prominent example of procedural texts.
arXiv Detail & Related papers (2022-10-31T11:41:54Z)
- Counterfactual Recipe Generation: Exploring Compositional Generalization in a Realistic Scenario [60.20197771545983]
We design the counterfactual recipe generation task, which asks models to modify a base recipe according to a change in one of its ingredients.
We collect a large-scale recipe dataset in Chinese for models to learn culinary knowledge.
Results show that existing models have difficulty modifying the ingredients while preserving the original text style, and often miss actions that need to be adjusted.
arXiv Detail & Related papers (2022-10-20T17:21:46Z)
- CHEF: Cross-modal Hierarchical Embeddings for Food Domain Retrieval [20.292467149387594]
We introduce a novel cross-modal learning framework to jointly model the latent representations of images and text in the food image-recipe association and retrieval tasks.
Our experiments show that using an efficient tree-structured Long Short-Term Memory as the text encoder in our computational cross-modal retrieval framework allows us to identify the main ingredients and cooking actions in recipe descriptions without explicit supervision.
arXiv Detail & Related papers (2021-02-04T11:24:34Z)
- Structure-Aware Generation Network for Recipe Generation from Images [142.047662926209]
We investigate an open research task of generating cooking instructions based only on food images and ingredients.
Target recipes are long paragraphs and carry no structure annotations.
We propose a novel framework of Structure-aware Generation Network (SGN) to tackle the food recipe generation task.
arXiv Detail & Related papers (2020-09-02T10:54:25Z)
- Multi-modal Cooking Workflow Construction for Food Recipes [147.4435186953995]
We build MM-ReS, the first large-scale dataset for cooking workflow construction.
We propose a neural encoder-decoder model that utilizes both visual and textual information to construct the cooking workflow.
arXiv Detail & Related papers (2020-08-20T18:31:25Z)
- Decomposing Generation Networks with Structure Prediction for Recipe Generation [142.047662926209]
We propose a novel framework: Decomposing Generation Networks (DGN) with structure prediction.
Specifically, we split each cooking instruction into several phases, and assign different sub-generators to each phase.
Our approach includes two novel ideas: (i) learning the recipe structures with the global structure prediction component and (ii) producing recipe phases in the sub-generator output component based on the predicted structure.
arXiv Detail & Related papers (2020-07-27T08:47:50Z)
- A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos [126.66212285239624]
We propose a benchmark of structured procedural knowledge extracted from cooking videos.
Our manually annotated open-vocabulary resource includes 356 instructional cooking videos and 15,523 video clip/sentence-level annotations.
arXiv Detail & Related papers (2020-05-02T05:15:20Z)
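As a companion to the sequence-labelling entry above (Towards End-User Development for IoT), here is a hedged sketch of CRF tagging over recipe instructions using sklearn-crfsuite. The feature template, the toy examples, and the tag set (B-ACTION, B-OBJECT, B-AMOUNT, O) are assumptions made for illustration, not that paper's annotation scheme.
```python
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite


def token_features(tokens, i):
    """Simple per-token features: the word itself, its shape, and neighbors."""
    word = tokens[i]
    return {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def featurize(sentence):
    tokens = sentence.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


# Toy training data with a hypothetical tag set; a real corpus would supply
# thousands of labelled instruction steps.
sentences = ["Preheat the oven to 180 degrees", "Chop the onions finely"]
labels = [
    ["B-ACTION", "O", "B-OBJECT", "O", "B-AMOUNT", "I-AMOUNT"],
    ["B-ACTION", "O", "B-OBJECT", "O"],
]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit([featurize(s) for s in sentences], labels)
print(crf.predict([featurize("Slice the bread")]))
```
With two toy sentences the learned weights are meaningless; the point is the shape of the data a CRF tagger expects: one feature dict per token and one label string per token for each sentence.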
This list is automatically generated from the titles and abstracts of the papers on this site.