Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering
- URL: http://arxiv.org/abs/2511.01213v1
- Date: Mon, 03 Nov 2025 04:13:24 GMT
- Title: Thought-For-Food: Reasoning Chain Induced Food Visual Question Answering
- Authors: Riddhi Jain, Manasi Patwardhan, Parijat Deshpande, Venkataramana Runkana
- Abstract summary: Food VQA requires a multi-step reasoning process to arrive at an accurate answer. We create reasoning chains on top of the QA pairs with minimal human intervention. We observed an average accuracy improvement of 10 percentage points over the baseline.
- Score: 5.290249856411331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The immense cultural and culinary diversity of Indian cuisine exposes a major shortcoming of existing Visual Question Answering (VQA) systems, which are biased towards Western foods. A recent attempt at building a VQA dataset for Indian food is a step towards addressing this challenge. However, that approach follows a two-step process in which the answer is generated first, followed by an explanation of the expected answer. In this work, we claim that food VQA requires a multi-step reasoning process to arrive at an accurate answer, especially in the context of Indian food, which involves understanding complex culinary context and identifying relationships between various food items. Based on this hypothesis, we create reasoning chains on top of the QA pairs with minimal human intervention. We fine-tune smaller LLMs and VLMs with auto-validated reasoning chains and further train them using reinforcement learning with larger data. With the augmentation of reasoning chains, we observed an average accuracy improvement of 10 percentage points over the baseline. We provide a detailed analysis of the effect of adding reasoning chains for the Indian Food VQA task. Index Terms - FoodVQA, Reasoning Chains, Reinforcement Learning, Knowledge Graph.
Related papers
- Inferential Question Answering [67.54465021408724]
We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages. We show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements.
arXiv Detail & Related papers (2026-02-01T14:02:43Z) - NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning [49.06840168630573]
Diet plays a critical role in human health, yet tailoring dietary reasoning to individual health conditions remains a major challenge. Nutrition Question Answering (QA) has emerged as a popular method for addressing this problem. We introduce the Nutritional Graph Question Answering (NGQA) benchmark, the first graph question answering dataset designed for personalized nutritional health reasoning.
arXiv Detail & Related papers (2024-12-20T04:13:46Z) - FoodieQA: A Multimodal Dataset for Fine-Grained Understanding of Chinese Food Culture [60.51749998013166]
We introduce FoodieQA, a manually curated, fine-grained image-text dataset capturing the intricate features of food cultures across various regions in China.
We evaluate vision-language models (VLMs) and large language models (LLMs) on newly collected, unseen food images and corresponding questions.
Our findings highlight that understanding food and its cultural implications remains a challenging and under-explored direction.
arXiv Detail & Related papers (2024-06-16T17:59:32Z) - Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses to VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z) - Improving Embedded Knowledge Graph Multi-hop Question Answering by introducing Relational Chain Reasoning [8.05076085499457]
Knowledge Base Question Answering (KBQA) aims to answer user questions from a knowledge base (KB) by identifying the reasoning between the topic entity and the answer.
As a complex branch task of KBQA, multi-hop KGQA requires reasoning over multi-hop relational chains preserved in a structured KG.
arXiv Detail & Related papers (2021-10-25T06:53:02Z) - Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin.
arXiv Detail & Related papers (2020-03-09T07:41:17Z) - SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5%, also marginally improving performance on the Reasoning questions in VQA, while also displaying better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)