Rethinking Cooking State Recognition with Vision Transformers
- URL: http://arxiv.org/abs/2212.08586v1
- Date: Fri, 16 Dec 2022 17:06:28 GMT
- Title: Rethinking Cooking State Recognition with Vision Transformers
- Authors: Akib Mohammed Khan, Alif Ashrafee, Reeshoon Sayera, Shahriar Ivan, and
Sabbir Ahmed
- Abstract summary: The self-attention mechanism of the Vision Transformer (ViT) architecture is leveraged for the Cooking State Recognition task.
The proposed approach encapsulates the globally salient features from images, while also exploiting the weights learned from a larger dataset.
Our framework achieves an accuracy of 94.3%, significantly outperforming the state-of-the-art.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To ensure proper knowledge representation of the kitchen environment, it is
vital for kitchen robots to recognize the states of the food items that are
being cooked. Although the domain of object detection and recognition has been
extensively studied, the task of object state classification has remained
relatively unexplored. The high intra-class similarity of ingredients during
different states of cooking makes the task even more challenging. Researchers
have proposed adopting Deep Learning-based strategies in recent times; however,
these are yet to achieve high performance. In this study, we utilized the
self-attention mechanism of the Vision Transformer (ViT) architecture for the
Cooking State Recognition task. The proposed approach encapsulates the globally
salient features from images, while also exploiting the weights learned from a
larger dataset. This global attention allows the model to withstand the
similarities between samples of different cooking objects, while the employment
of transfer learning helps to overcome the lack of inductive bias by utilizing
pretrained weights. To improve recognition accuracy, several augmentation
techniques have been employed as well. Our proposed framework, evaluated on
the 'Cooking State Recognition Challenge Dataset', achieves an accuracy of
94.3%, which significantly outperforms the state-of-the-art.
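As a rough illustration of the recipe the abstract describes (an ImageNet-pretrained ViT fine-tuned on augmented cooking-state images), consider the minimal sketch below. It uses torchvision and is not the authors' code: the ViT-B/16 backbone, the particular augmentations, the learning rate, and the number of state classes (NUM_STATES) are all assumptions.

```python
import torch
import torch.nn as nn
from torchvision import transforms
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Augmentations in the spirit of those described; the paper's exact choices may differ.
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

NUM_STATES = 11  # hypothetical number of cooking-state classes

# Transfer learning: load ImageNet-pretrained weights, then replace the classification head.
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_STATES)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # illustrative hyperparameters
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One fine-tuning step on a batch of augmented cooking-state images."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Pretraining supplies the inductive bias that a ViT trained from scratch on a small dataset lacks, which is why the head swap plus full fine-tuning is the usual pattern here.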
Related papers
- Computer Vision in the Food Industry: Accurate, Real-time, and Automatic Food Recognition with Pretrained MobileNetV2
This study employs the pretrained MobileNetV2 model, which is efficient and fast, for food recognition on the public Food11 dataset, comprising 16,643 images.
It also utilizes various techniques such as dataset understanding, transfer learning, data augmentation, regularization, dynamic learning rates, hyperparameter tuning, and consideration of images of different sizes to enhance performance and robustness.
Despite employing a lightweight model with a simpler structure and fewer trainable parameters than deeper, denser models, it achieved commendable accuracy in a short time.
arXiv Detail & Related papers (2024-05-19T17:20:20Z) - Adaptive Visual Imitation Learning for Robotic Assisted Feeding Across Varied Bowl Configurations and Food Types [17.835835270751176]
We introduce a novel visual imitation network with a spatial attention module for robot-assisted feeding (RAF).
We propose a framework that integrates visual perception with imitation learning to enable the robot to handle diverse scenarios during scooping.
Our approach, named AVIL (adaptive visual imitation learning), exhibits adaptability and robustness across different bowl configurations.
arXiv Detail & Related papers (2024-03-19T16:40:57Z) - From Canteen Food to Daily Meals: Generalizing Food Recognition to More
Practical Scenarios [92.58097090916166]
We present two new benchmarks, namely DailyFood-172 and DailyFood-16, designed to curate food images from everyday meals.
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
arXiv Detail & Related papers (2024-03-12T08:32:23Z) - Food Image Classification and Segmentation with Attention-based Multiple
Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism (an attention-pooling sketch appears after this list).
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
arXiv Detail & Related papers (2023-08-22T13:59:47Z) - Transferring Knowledge for Food Image Segmentation using Transformers
and Convolutions [65.50975507723827]
Food image segmentation is an important task that has ubiquitous applications, such as estimating the nutritional value of a plate of food.
One challenge is that food items can overlap and mix, making them difficult to distinguish.
Two models are trained and compared: one based on convolutional neural networks and the other on Bidirectional Encoder representation from Image Transformers (BEiT).
The BEiT model outperforms the previous state-of-the-art model by achieving a mean intersection over union of 49.4 on FoodSeg103.
arXiv Detail & Related papers (2023-06-15T15:38:10Z) - A Mobile Food Recognition System for Dietary Assessment [6.982738885923204]
We focus on developing a mobile friendly, Middle Eastern cuisine focused food recognition application for assisted living purposes.
Using the MobileNetV2 architecture for this task is beneficial in terms of both accuracy and memory usage.
The developed mobile application has the potential to serve the visually impaired via automatic food recognition from images.
arXiv Detail & Related papers (2022-04-20T12:49:36Z) - Classifying States of Cooking Objects Using Convolutional Neural Network [6.127963013089406]
The main aim is to make the cooking process easier and safer, and to improve human welfare.
It is important for robots to understand the cooking environment and recognize objects, especially to correctly identify the states of cooking objects.
In this project, several experiments were conducted to design, from scratch, a robust deep convolutional neural network for classifying the states of cooking objects.
arXiv Detail & Related papers (2021-04-30T22:26:40Z) - Large Scale Visual Food Recognition [43.43598316339732]
We introduce Food2K, which is the largest food recognition dataset with 2,000 categories and over 1 million images.
Food2K surpasses existing datasets in both categories and images by one order of magnitude.
We propose a deep progressive region enhancement network for food recognition.
arXiv Detail & Related papers (2021-03-30T06:41:42Z) - ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked
Global-Local Attention Network [50.7720194859196]
We introduce the dataset ISIA Food-500, with 500 categories drawn from a Wikipedia list and 399,726 images.
This dataset surpasses existing popular benchmark datasets in category coverage and data volume.
We propose a stacked global-local attention network, which consists of two sub-networks for food recognition.
arXiv Detail & Related papers (2020-08-13T02:48:27Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student (the standard distillation term this builds on is sketched after this list).
arXiv Detail & Related papers (2020-06-12T12:18:52Z) - Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images
and Recipes with Semantic Consistency and Attention Mechanism [70.85894675131624]
We learn an embedding of images and recipes in a common feature space, such that the corresponding image-recipe embeddings lie close to one another.
We propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities through aligning output semantic probabilities.
We show that we can outperform several state-of-the-art cross-modal retrieval strategies for food images and cooking recipes by a significant margin (a joint-embedding loss sketch appears after this list).
arXiv Detail & Related papers (2020-03-09T07:41:17Z)
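For the attention-based multiple instance learning entry above, the core mechanism is an attention-weighted pooling of instance features into a bag representation (in the style of Ilse et al., 2018). The sketch below is illustrative; the feature and hidden dimensions are assumptions, and the paper's full pipeline is not reproduced.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Attention pooling over a bag of instance features; dimensions are illustrative."""
    def __init__(self, feat_dim=512, hidden_dim=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, instances):  # instances: (num_instances, feat_dim)
        weights = torch.softmax(self.attention(instances), dim=0)  # (N, 1)
        bag = (weights * instances).sum(dim=0)                     # (feat_dim,)
        return bag, weights  # weights double as a weak localization signal
```

The learned weights indicate which instances (e.g., image regions) drive the bag-level prediction, which is what makes the approach usable for weakly supervised segmentation.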
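For the 'Knowledge Distillation Meets Self-Supervision' entry, the standard distillation term that the self-supervised auxiliary task is added on top of looks roughly as follows. This sketch omits the paper's self-supervision component; the temperature T and mixing weight alpha are illustrative.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Softened teacher-student KL plus hard-label cross-entropy (Hinton-style KD)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```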
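For the cross-modal food retrieval entry, a joint-embedding objective of the kind described (a retrieval loss plus a term aligning the two modalities' output semantic probabilities) might be sketched as below. This is an illustrative reconstruction, not the published SCAN formulation; the margin and the unweighted sum of the two terms are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(img_emb, rec_emb, img_logits, rec_logits, margin=0.3):
    """Hardest-negative triplet loss over matched image-recipe pairs, plus a
    semantic-consistency term aligning class probabilities across modalities."""
    img_emb = F.normalize(img_emb, dim=1)
    rec_emb = F.normalize(rec_emb, dim=1)
    sim = img_emb @ rec_emb.t()                       # pairwise cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarities of matching pairs
    cost_i2r = (margin + sim - pos).clamp(min=0)      # image -> recipe hinge
    cost_r2i = (margin + sim - pos.t()).clamp(min=0)  # recipe -> image hinge
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    retrieval = (cost_i2r.masked_fill(mask, 0).max(dim=1).values.mean()
                 + cost_r2i.masked_fill(mask, 0).max(dim=0).values.mean())
    consistency = F.kl_div(F.log_softmax(img_logits, dim=1),
                           F.softmax(rec_logits, dim=1),
                           reduction="batchmean")
    return retrieval + consistency
```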