Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone
- URL: http://arxiv.org/abs/2511.08215v1
- Date: Wed, 12 Nov 2025 01:47:05 GMT
- Title: Evaluating Gemini LLM in Food Image-Based Recipe and Nutrition Description with EfficientNet-B4 Visual Backbone
- Authors: Rizal Khoirul Anam
- Abstract summary: We evaluate a system integrating a specialized visual backbone with a powerful generative large language model. The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of digital food applications necessitates robust methods for automated nutritional analysis and culinary guidance. This paper presents a comprehensive comparative evaluation of a decoupled, multimodal pipeline for food recognition. We evaluate a system integrating a specialized visual backbone (EfficientNet-B4) with a powerful generative large language model (Google's Gemini LLM). The core objective is to evaluate the trade-offs between visual classification accuracy, model efficiency, and the quality of generative output (nutritional data and recipes). We benchmark this pipeline against alternative vision backbones (VGG-16, ResNet-50, YOLOv8) and a lightweight LLM (Gemma). We introduce a formalization for "Semantic Error Propagation" (SEP) to analyze how classification inaccuracies from the visual module cascade into the generative output. Our analysis is grounded in a new Custom Chinese Food Dataset (CCFD) developed to address cultural bias in public datasets. Experimental results demonstrate that while EfficientNet-B4 (89.0% Top-1 Acc.) provides the best balance of accuracy and efficiency and Gemini (9.2/10 Factual Accuracy) provides superior generative quality, the system's overall utility is fundamentally bottlenecked by the visual front-end's perceptive accuracy. We conduct a detailed per-class analysis, identifying high semantic similarity as the most critical failure mode.
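The decoupled pipeline the abstract describes can be sketched in a few lines: a visual classifier maps an image to a food label, and that label alone is handed to the generative LLM via a text prompt. This is a minimal illustration, not the paper's implementation; the function names (`classify_food`, `build_prompt`) and the hard-coded label are hypothetical stand-ins.

```python
# Sketch of a decoupled food-recognition pipeline: visual backbone -> label -> LLM.
# All names and values here are illustrative assumptions, not from the paper.

def classify_food(image_path: str) -> tuple[str, float]:
    """Stand-in for the EfficientNet-B4 backbone: returns (label, confidence).

    A real system would run the image through a fine-tuned CNN here.
    """
    return "mapo tofu", 0.89


def build_prompt(label: str) -> str:
    """Compose the text prompt handed to the generative LLM (e.g. Gemini)."""
    return (
        f"The dish in the image is '{label}'. "
        "Provide an estimated nutritional breakdown per serving "
        "and a step-by-step recipe."
    )


def recognize_and_describe(image_path: str, llm) -> str:
    """Run the two-stage pipeline: classify, then generate."""
    label, confidence = classify_food(image_path)
    # Note: any misclassification here cascades into the generated text --
    # the failure mode the paper formalizes as Semantic Error Propagation.
    return llm(build_prompt(label))


if __name__ == "__main__":
    # A placeholder callable in place of a real Gemini API client.
    fake_llm = lambda prompt: f"[generated output for: {prompt}]"
    print(recognize_and_describe("dish.jpg", fake_llm))
```

Because the LLM never sees the image, the generative stage cannot recover from a wrong label, which is why the abstract calls the visual front-end the bottleneck.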
Related papers
- FaithLens: Detecting and Explaining Faithfulness Hallucination [63.905100627300925]
We introduce FaithLens, a cost-efficient and effective faithfulness hallucination detection model. We apply a well-defined data filtering strategy to ensure label correctness, explanation quality, and data diversity. FaithLens can produce high-quality explanations, delivering a distinctive balance of trustworthiness, efficiency, and effectiveness.
arXiv Detail & Related papers (2025-12-23T09:20:32Z) - Deep Feature Optimization for Enhanced Fish Freshness Assessment [0.05599792629509228]
Assessing fish freshness is vital for ensuring food safety and minimizing economic losses in the seafood industry. Recent advances in deep learning have automated visual freshness prediction, but challenges related to accuracy and feature transparency persist. This study introduces a unified three-stage framework that refines and leverages deep visual representations for reliable fish freshness assessment.
arXiv Detail & Related papers (2025-10-28T09:02:10Z) - Reliable and Reproducible Demographic Inference for Fairness in Face Analysis [63.46525489354455]
We propose a fully reproducible DAI pipeline that replaces conventional end-to-end training with a modular transfer learning approach. We audit this pipeline across three dimensions: accuracy, fairness, and a newly introduced notion of robustness, defined via intra-identity consistency. Our results show that the proposed method outperforms strong baselines, particularly on ethnicity, which is the more challenging attribute.
arXiv Detail & Related papers (2025-10-23T12:22:02Z) - Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z) - A Fuzzy Logic-Based Framework for Explainable Machine Learning in Big Data Analytics [0.0]
This paper presents a novel framework that combines type-2 fuzzy sets, granular computing, and clustering to boost explainability and fairness in big data environments. When applied to the UCI Air Quality dataset, the framework effectively manages uncertainty in noisy sensor data, produces linguistic rules, and assesses fairness using silhouette scores and entropy.
arXiv Detail & Related papers (2025-09-29T18:02:31Z) - Comprehensive Evaluation of Large Multimodal Models for Nutrition Analysis: A New Benchmark Enriched with Contextual Metadata [16.03960240895014]
Large Multimodal Models (LMMs) are increasingly applied to meal images for nutrition analysis. This work investigates how interpreting contextual metadata can enhance LMM performance in estimating key nutritional values. Empirical results show that integrating metadata intelligently, when applied through straightforward prompting strategies, can significantly reduce the Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) in predicted nutritional values.
arXiv Detail & Related papers (2025-07-09T17:10:33Z) - A Plug-and-Play Learning-based IMU Bias Factor for Robust Visual-Inertial Odometry [27.62788405443008]
We propose a novel plug-and-play module featuring the Inertial Prior Network (IPNet). IPNet infers an IMU bias prior by implicitly capturing the motion characteristics of specific platforms. In this work, we first directly infer the bias prior from the raw IMU data using a sliding window approach.
arXiv Detail & Related papers (2025-03-16T14:45:19Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - GCAM: Gaussian and causal-attention model of food fine-grained recognition [5.198198193921202]
We propose the adoption of a Gaussian and causal-attention model for fine-grained object recognition.
To counteract data drift resulting from uneven data distributions, we employ a counterfactual reasoning approach.
We experimentally show that GCAM surpasses state-of-the-art methods on the ETH-FOOD101, UECFOOD256, and Vireo-FOOD172 datasets.
arXiv Detail & Related papers (2024-03-18T03:39:54Z) - Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs).
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z) - Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Optimistic Agent: Accurate Graph-Based Value Estimation for More Successful Visual Navigation [18.519303422753534]
We argue that this ability is largely due to three main reasons: the incorporation of prior knowledge (or experience), the adaptation of it to the new environment using the observed visual cues and optimistically searching without giving up early.
This is currently missing in the state-of-the-art visual navigation methods based on Reinforcement Learning (RL).
In this paper, we propose to use externally learned prior knowledge of the relative object locations and integrate it into our model by constructing a neural graph.
arXiv Detail & Related papers (2020-04-07T09:31:07Z) - Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.