Food Classification using Joint Representation of Visual and Textual
Data
- URL: http://arxiv.org/abs/2308.02562v2
- Date: Wed, 30 Aug 2023 11:47:05 GMT
- Title: Food Classification using Joint Representation of Visual and Textual
Data
- Authors: Prateek Mittal, Puneet Goyal, Joohi Chauhan
- Abstract summary: We propose a multimodal classification framework that uses the modified version of EfficientNet with the Mish activation function for image classification.
The proposed network and the other state-of-the-art methods are evaluated on a large open-source dataset, UPMC Food-101.
- Score: 45.94375447042821
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Food classification is an important task in health care. In this work, we
propose a multimodal classification framework that uses the modified version of
EfficientNet with the Mish activation function for image classification, and
the traditional BERT transformer-based network is used for text classification.
The proposed network and the other state-of-the-art methods are evaluated on a
large open-source dataset, UPMC Food-101. The experimental results show that
the proposed network outperforms the other methods: accuracy gains of 11.57%
and 6.34% are observed for image and text classification, respectively, over
the second-best performing method. We also
compared the performance in terms of accuracy, precision, and recall for text
classification using both machine learning and deep learning-based models. The
comparative analysis from the prediction results of both images and text
demonstrated the efficiency and robustness of the proposed approach.
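To make the described pipeline concrete, below is a minimal, hypothetical sketch of two ideas named in the abstract: the Mish activation, defined as x · tanh(softplus(x)), and a simple late-fusion step that averages per-class probabilities from the image and text branches. The helper names (`mish`, `softmax`, `late_fuse`), the fusion-by-averaging rule, and the example logits are illustrative assumptions, not the paper's exact implementation.

```python
import math

def mish(x):
    """Mish activation: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    return x * math.tanh(math.log1p(math.exp(x)))

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def late_fuse(image_logits, text_logits):
    """Hypothetical late fusion: average the per-class probabilities
    produced by the image branch and the text branch. The paper's exact
    combination rule may differ."""
    p_img, p_txt = softmax(image_logits), softmax(text_logits)
    return [(a + b) / 2 for a, b in zip(p_img, p_txt)]
```

For example, fusing image logits [2.0, 1.0, 0.0] with text logits [0.0, 2.0, 1.0] yields a valid probability distribution whose highest-probability class is the one both branches jointly favor.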
Related papers
- AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis [1.0858565995100635]
We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA). Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score.
arXiv Detail & Related papers (2025-07-17T00:06:43Z)
- Multi-Level Attention and Contrastive Learning for Enhanced Text Classification with an Optimized Transformer [0.0]
This paper studies a text classification algorithm based on an improved Transformer to improve the performance and efficiency of the model in text classification tasks.
The improved Transformer model outperforms comparative models such as BiLSTM, CNN, the standard Transformer, and BERT in classification accuracy, F1 score, and recall.
arXiv Detail & Related papers (2025-01-23T08:32:27Z)
- GAMED: Knowledge Adaptive Multi-Experts Decoupling for Multimodal Fake News Detection [18.157900272828602]
Multimodal fake news detection often involves modelling heterogeneous data sources, such as vision and language. This paper develops a novel approach, GAMED, for multimodal modelling. It focuses on generating distinctive and discriminative features through modal decoupling to enhance cross-modal synergies.
arXiv Detail & Related papers (2024-12-11T19:12:22Z)
- Self-Supervised Learning in Deep Networks: A Pathway to Robust Few-Shot Classification [0.0]
We first pre-train the model with self-supervision so that it learns common feature representations from a large amount of unlabeled data.
We then fine-tune it on the few-shot dataset Mini-ImageNet to improve the model's accuracy and generalization ability under limited data.
arXiv Detail & Related papers (2024-11-19T01:01:56Z)
- SceneGraMMi: Scene Graph-boosted Hybrid-fusion for Multi-Modal Misinformation Veracity Prediction [10.909813689420602]
We propose SceneGraMMi, a Scene Graph-boosted Hybrid-fusion approach for Multi-modal Misinformation veracity prediction.
Experimental results across four benchmark datasets show that SceneGraMMi consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-10-20T21:55:13Z)
- GCM-Net: Graph-enhanced Cross-Modal Infusion with a Metaheuristic-Driven Network for Video Sentiment and Emotion Analysis [2.012311338995539]
This paper presents a novel framework that leverages multi-modal contextual information from utterances and applies metaheuristic algorithms to optimize learning for utterance-level sentiment and emotion prediction.
To show the effectiveness of our approach, we have conducted extensive evaluations on three prominent multimodal benchmark datasets.
arXiv Detail & Related papers (2024-10-02T10:07:48Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose a comprehensive framework NativE to achieve MMKGC in the wild.
NativE proposes a relation-guided dual adaptive fusion module that enables adaptive fusion for any modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Density Adaptive Attention is All You Need: Robust Parameter-Efficient Fine-Tuning Across Multiple Modalities [0.9217021281095907]
DAAM integrates learnable mean and variance into its attention mechanism, implemented in a multi-head framework.
DAAM exhibits superior adaptability and efficacy across a diverse range of tasks, including emotion recognition in speech, image classification, and text classification.
We introduce the Importance Factor, a new learning-based metric that enhances the explainability of models trained with DAAM-based methods.
arXiv Detail & Related papers (2024-01-20T06:42:32Z)
- From Text to Pixels: A Context-Aware Semantic Synergy Solution for Infrared and Visible Image Fusion [66.33467192279514]
We introduce a text-guided multi-modality image fusion method that leverages the high-level semantics from textual descriptions to integrate semantics from infrared and visible images.
Our method not only produces visually superior fusion results but also achieves a higher detection mAP over existing methods, achieving state-of-the-art results.
arXiv Detail & Related papers (2023-12-31T08:13:47Z)
- Enhancing Instance-Level Image Classification with Set-Level Labels [12.778150812879034]
We present a novel approach to enhance instance-level image classification by leveraging set-level labels.
We conduct experiments on two categories of datasets: natural image datasets and histopathology image datasets.
Our algorithm achieves a 13% improvement in classification accuracy compared to the strongest baseline on the histopathology image classification benchmarks.
arXiv Detail & Related papers (2023-11-09T03:17:03Z)
- Convolutional autoencoder-based multimodal one-class classification [80.52334952912808]
One-class classification refers to approaches of learning using data from a single class only.
We propose a deep learning one-class classification method suitable for multimodal data.
arXiv Detail & Related papers (2023-09-25T12:31:18Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- A Visual Interpretation-Based Self-Improved Classification System Using Virtual Adversarial Training [4.722922834127293]
This paper proposes a visual interpretation-based self-improving classification model that combines virtual adversarial training (VAT) with BERT models.
Specifically, a fine-tuned BERT model is used as a classifier to classify the sentiment of the text.
The predicted sentiment labels are used as part of the input to another BERT model for spam classification, trained in a semi-supervised manner.
arXiv Detail & Related papers (2023-09-03T15:07:24Z)
- Fine-grained Recognition with Learnable Semantic Data Augmentation [68.48892326854494]
Fine-grained image recognition is a longstanding computer vision challenge.
We propose diversifying the training data at the feature-level to alleviate the discriminative region loss problem.
Our method significantly improves the generalization performance on several popular classification networks.
arXiv Detail & Related papers (2023-09-01T11:15:50Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
- EAML: Ensemble Self-Attention-based Mutual Learning Network for Document Image Classification [1.1470070927586016]
We design a self-attention-based fusion module that serves as a block in our ensemble trainable network.
It allows the network to simultaneously learn the discriminant features of the image and text modalities throughout the training stage.
This is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification.
arXiv Detail & Related papers (2023-05-11T16:05:03Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing-modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part.
Considering the texts associated with the images can help improve accuracy, depending on the goal.
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- Revisiting The Evaluation of Class Activation Mapping for Explainability: A Novel Metric and Experimental Analysis [54.94682858474711]
Class Activation Mapping (CAM) approaches provide an effective visualization by taking weighted averages of the activation maps.
We propose a novel set of metrics to quantify explanation maps, which show better effectiveness and simplify comparisons between approaches.
arXiv Detail & Related papers (2021-04-20T21:34:24Z)
- Region Comparison Network for Interpretable Few-shot Image Classification [97.97902360117368]
Few-shot image classification has been proposed to effectively use only a limited number of labeled examples to train models for new classes.
We propose a metric learning based method named Region Comparison Network (RCN), which is able to reveal how few-shot learning works.
We also present a new way to generalize the interpretability from the level of tasks to categories.
arXiv Detail & Related papers (2020-09-08T07:29:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.