Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
- URL: http://arxiv.org/abs/2602.15862v1
- Date: Mon, 26 Jan 2026 10:06:18 GMT
- Title: Enhancing Action and Ingredient Modeling for Semantically Grounded Recipe Generation
- Authors: Guoshan Liu, Bin Zhu, Yian Li, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
- Abstract summary: We propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
- Score: 69.90401008542368
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enabled recipe generation from food images, yet outputs often contain semantically incorrect actions or ingredients despite high lexical scores (e.g., BLEU, ROUGE). To address this gap, we propose a semantically grounded framework that predicts and validates actions and ingredients as internal context for instruction generation. Our two-stage pipeline combines supervised fine-tuning (SFT) with reinforcement fine-tuning (RFT): SFT builds foundational accuracy using an Action-Reasoning dataset and ingredient corpus, while RFT employs frequency-aware rewards to improve long-tail action prediction and ingredient generalization. A Semantic Confidence Scoring and Rectification (SCSR) module further filters and corrects predictions. Experiments on Recipe1M show state-of-the-art performance and markedly improved semantic fidelity.
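The abstract does not give the form of the frequency-aware reward, so the following is only a minimal sketch of the general idea, not the authors' formulation. It assumes an inverse log-frequency weight per action and a weighted F1 between predicted and reference action sets; the names `action_weights` and `frequency_aware_reward` are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def action_weights(corpus_actions):
    """Map each action to an inverse log-frequency weight.

    ASSUMPTION: the weighting 1 / log(2 + count) is an illustrative
    choice, not taken from the paper. Rare actions get larger weights.
    """
    counts = Counter(a for recipe in corpus_actions for a in recipe)
    return {a: 1.0 / math.log(2.0 + c) for a, c in counts.items()}


def frequency_aware_reward(predicted, gold, weights, default_weight=1.0):
    """Weighted F1 between predicted and reference action sets.

    Actions missing from `weights` fall back to `default_weight`
    (hypothetical handling of unseen actions).
    """
    pred, ref = set(predicted), set(gold)

    def mass(actions):
        return sum(weights.get(a, default_weight) for a in actions)

    hit = mass(pred & ref)
    if hit == 0.0:
        return 0.0  # nothing correct, or one of the sets is empty
    precision = hit / mass(pred)
    recall = hit / mass(ref)
    return 2.0 * precision * recall / (precision + recall)


# Toy corpus: "mix" is frequent, "julienne" is a long-tail action.
corpus = [["mix", "bake"], ["mix", "boil"], ["mix", "julienne"]]
weights = action_weights(corpus)
gold = ["mix", "julienne"]

reward_common = frequency_aware_reward(["mix"], gold, weights)
reward_rare = frequency_aware_reward(["julienne"], gold, weights)
```

In this toy setup, recovering only the rare action scores higher than recovering only the frequent one, which is the kind of pressure toward long-tail actions that a frequency-aware reward is meant to apply.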
Related papers
- Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models [39.290072292743226]
We introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline. Experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%.
arXiv Detail & Related papers (2026-03-02T07:54:29Z)
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms. We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets. By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z)
- Toward Faithful and Complete Answer Construction from a Single Document [1.0742675209112622]
We present EVE, a structured framework for document-grounded reasoning. Unlike free-form prompting, EVE constrains generation to a structured, verifiable pipeline that decomposes high-rigor reasoning into extraction, validation, and enumeration.
arXiv Detail & Related papers (2026-02-05T18:22:08Z)
- GLOW: Graph-Language Co-Reasoning for Agentic Workflow Performance Prediction [51.83437071408662]
We propose GLOW, a unified framework for AW performance prediction. GLOW combines the graph-structure modeling capabilities of GNNs with the reasoning power of LLMs. Experiments on FLORA-Bench show that GLOW outperforms state-of-the-art baselines in prediction accuracy and ranking utility.
arXiv Detail & Related papers (2025-12-11T13:30:46Z)
- SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization [62.958457694151384]
We introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models. In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms.
arXiv Detail & Related papers (2025-10-22T16:11:22Z)
- Sculpting Features from Noise: Reward-Guided Hierarchical Diffusion for Task-Optimal Feature Transformation [18.670626228472877]
DIFFT redefines Feature Transformation as a reward-guided generative task. It produces structured, discrete features, preserving intra-feature dependencies while allowing parallel inter-feature generation. It consistently outperforms state-of-the-art baselines in predictive accuracy and robustness, with significantly lower training and inference times.
arXiv Detail & Related papers (2025-05-21T06:18:42Z)
- Harnessing Mixed Features for Imbalance Data Oversampling: Application to Bank Customers Scoring [5.091061468748012]
We introduce MGS-GRF, an oversampling strategy designed for mixed features. We show that MGS-GRF exhibits two important properties: (i) coherence, i.e., the ability to generate only combinations of categorical features that are already present in the original dataset, and (ii) association, i.e., the ability to preserve the dependence between continuous and categorical features.
arXiv Detail & Related papers (2025-03-26T08:53:40Z)
- ROSE: Revolutionizing Open-Set Dense Segmentation with Patch-Wise Perceptual Large Multimodal Model [75.750699619993]
We propose ROSE, a Revolutionary Open-set dense SEgmentation LMM, which enables dense mask prediction and open-category generation. Our method treats each image patch as an independent region of interest candidate, enabling the model to predict both dense and sparse masks simultaneously.
arXiv Detail & Related papers (2024-11-29T07:00:18Z)
- Retrieval Augmented Recipe Generation [96.43285670458803]
We propose a retrieval augmented large multimodal model for recipe generation. It retrieves recipes semantically related to the image from an existing datastore as a supplement. It calculates the consistency among generated recipe candidates, which use different retrieved recipes as context for generation.
arXiv Detail & Related papers (2024-11-13T15:58:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.