Structured Vision-Language Pretraining for Computational Cooking
- URL: http://arxiv.org/abs/2212.04267v1
- Date: Thu, 8 Dec 2022 13:37:17 GMT
- Title: Structured Vision-Language Pretraining for Computational Cooking
- Authors: Mustafa Shukor, Nicolas Thome, Matthieu Cord
- Abstract summary: Vision-Language Pretraining and Foundation models have been the go-to recipe for achieving SoTA performance on general benchmarks.
We propose to leverage these techniques for structured-text based computational cuisine tasks.
- Score: 54.0571416522547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Pretraining (VLP) and Foundation models have been the go-to
recipe for achieving SoTA performance on general benchmarks. However,
leveraging these powerful techniques for more complex vision-language tasks,
such as cooking applications, with more structured input data, remains largely
unexplored. In this work, we propose to leverage these techniques for
structured-text based computational cuisine tasks. Our strategy, dubbed VLPCook
(Structured Vision-Language Pretraining for Computational Cooking), first
transforms existing image-text pairs to image and structured-text pairs. This
allows us to pretrain our VLPCook model with VLP objectives adapted to the
structured data of the resulting datasets, and then to finetune it on downstream
computational cooking tasks. During finetuning, we also enrich the visual
encoder, leveraging pretrained foundation models (e.g. CLIP) to provide local
and global textual context. VLPCook outperforms current SoTA by a significant
margin (+3.3 Recall@1 absolute improvement) on the task of Cross-Modal Food
Retrieval on the large Recipe1M dataset. Finally, we conduct further
experiments to validate the importance of VLP, especially on the Recipe1M+
dataset. The code will be made publicly available.
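To make the abstract's recipe concrete, below is a minimal, self-contained PyTorch sketch of the two ingredients it describes: a flat caption is converted into a hypothetical structured form (title / ingredients / instructions), and image and structured-text embeddings are aligned with a CLIP-style symmetric contrastive objective. The helper names (to_structured_text, ToyEncoder, contrastive_vlp_loss), the field split, and the feature dimensions are illustrative assumptions, not the authors' released implementation.

    # Sketch of the VLPCook idea, under the assumptions stated above.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    def to_structured_text(caption: str) -> dict:
        """Hypothetical transform of a flat caption into structured fields."""
        lines = [l.strip() for l in caption.split("\n") if l.strip()]
        return {
            "title": lines[0] if lines else "",
            "ingredients": lines[1:-1],
            "instructions": lines[-1] if len(lines) > 1 else "",
        }


    class ToyEncoder(nn.Module):
        """Stand-in encoder projecting features into a shared embedding space."""
        def __init__(self, in_dim: int, out_dim: int = 256):
            super().__init__()
            self.proj = nn.Linear(in_dim, out_dim)

        def forward(self, x):
            return F.normalize(self.proj(x), dim=-1)


    def contrastive_vlp_loss(img_emb, txt_emb, temperature: float = 0.07):
        """Symmetric InfoNCE over matched image / structured-text pairs."""
        logits = img_emb @ txt_emb.t() / temperature
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


    if __name__ == "__main__":
        batch = 8
        image_feats = torch.randn(batch, 512)   # placeholder pooled visual features
        text_feats = torch.randn(batch, 768)    # placeholder pooled structured-text features
        img_enc, txt_enc = ToyEncoder(512), ToyEncoder(768)
        loss = contrastive_vlp_loss(img_enc(image_feats), txt_enc(text_feats))
        print(f"pretraining loss: {loss.item():.4f}")

The CLIP-derived local and global textual context that the paper injects into the visual encoder during finetuning is not shown here; the pooled features above are random placeholders standing in for real encoder outputs.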
Related papers
- Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation [105.23631749213729]
We propose a novel method for unsupervised pre-training in low-data regimes.
Inspired by the recently successful prompting technique, we introduce a new method, Unsupervised Pre-training with Language-Vision Prompts.
We show that our method can converge faster and perform better than CNN-based models in low-data regimes.
arXiv Detail & Related papers (2024-05-22T06:48:43Z) - DeepSeek-VL: Towards Real-World Vision-Language Understanding [24.57011093316788]
We present DeepSeek-VL, an open-source Vision-Language (VL) Model for real-world vision and language understanding applications.
Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios.
We create a use case taxonomy from real user scenarios and construct an instruction tuning dataset.
arXiv Detail & Related papers (2024-03-08T18:46:00Z) - ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models [45.040292339670096]
Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities.
This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data.
arXiv Detail & Related papers (2024-02-18T19:26:49Z) - Leveraging Vision-Language Foundation Models for Fine-Grained Downstream Tasks [17.367599062853156]
Vision-language foundation models such as CLIP have shown impressive zero-shot performance on many tasks and datasets.
We propose a multitask fine-tuning strategy based on a positive/negative prompt formulation to further leverage the capacities of the vision-language foundation models.
arXiv Detail & Related papers (2023-07-13T15:05:34Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Weakly Supervised Vision-and-Language Pre-training with Relative Representations [76.63610760577214]
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
arXiv Detail & Related papers (2023-05-24T18:10:24Z) - CLIP4STR: A Simple Baseline for Scene Text Recognition with Pre-trained Vision-Language Model [55.321010757641524]
We introduce CLIP4STR, a simple yet effective STR method built upon image and text encoders of CLIP.
We scale CLIP4STR in terms of the model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks.
arXiv Detail & Related papers (2023-05-23T12:51:20Z) - Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search [17.360982091304137]
Text-based Person Search (TPS) is targeted on retrieving pedestrians to match text descriptions instead of query images.
Recent Vision-Language Pre-training models can bring transferable knowledge to downstream TPS tasks, resulting in more efficient performance gains.
However, existing TPS methods only utilize pre-trained visual encoders, neglecting the corresponding textual representation.
arXiv Detail & Related papers (2023-03-08T10:41:22Z) - A Flexible Clustering Pipeline for Mining Text Intentions [6.599344783327053]
We create a flexible and scalable clustering pipeline within the Verint Intent Manager.
It integrates the fine-tuning of language models, a high-performing k-NN library, and community detection techniques.
As deployed in the VIM application, this clustering pipeline produces high quality results.
arXiv Detail & Related papers (2022-02-01T22:54:18Z)