Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan
- URL: http://arxiv.org/abs/2602.06654v1
- Date: Fri, 06 Feb 2026 12:29:13 GMT
- Title: Multimodal Generative Retrieval Model with Staged Pretraining for Food Delivery on Meituan
- Authors: Boyu Chen, Tai Guo, Weiyu Cui, Yuqing Li, Xingxing Wang, Chuan Shi, Cheng Yang
- Abstract summary: Multimodal retrieval models are increasingly important in scenarios such as food delivery. We propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage. To better utilize the semantic IDs that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks.
- Score: 30.893121144130664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal retrieval models are becoming increasingly important in scenarios such as food delivery, where rich multimodal features can meet diverse user needs and enable precise retrieval. Mainstream approaches typically employ a dual-tower architecture between queries and items, and perform joint optimization of intra-tower and inter-tower tasks. However, we observe that joint optimization often leads to certain modalities dominating the training process, while other modalities are neglected. In addition, inconsistent training speeds across modalities can easily result in the one-epoch problem. To address these challenges, we propose a staged pretraining strategy, which guides the model to focus on specialized tasks at each stage, enabling it to effectively attend to and utilize multimodal features, and allowing flexible control over the training process at each stage to avoid the one-epoch problem. Furthermore, to better utilize the semantic IDs that compress high-dimensional multimodal embeddings, we design both generative and discriminative tasks to help the model understand the associations between SIDs, queries, and item features, thereby improving overall performance. Extensive experiments on large-scale real-world Meituan data demonstrate that our method achieves improvements of 3.80%, 2.64%, and 2.17% on R@5, R@10, and R@20, and 5.10%, 4.22%, and 2.09% on N@5, N@10, and N@20 compared to mainstream baselines. Online A/B testing on the Meituan platform shows that our approach achieves a 1.12% increase in revenue and a 1.02% increase in click-through rate, validating the effectiveness and superiority of our method in practical applications.
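The abstract reports gains on R@K and N@K without expanding the abbreviations. Assuming they denote Recall@K and NDCG@K with binary relevance (a common convention in retrieval work, not confirmed by this page), the metrics can be sketched as follows. The function names and the toy item list are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    # A hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (at most k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy data: one query, five retrieved items, three relevant items.
ranked = ["noodles", "pizza", "sushi", "burger", "salad"]
relevant = {"pizza", "salad", "tacos"}

print("R@5:", recall_at_k(ranked, relevant, 5))
print("N@5:", ndcg_at_k(ranked, relevant, 5))
```

Under this reading, a reported improvement such as 3.80% on R@5 would be a relative or absolute change in the first quantity averaged over queries; the page does not say which.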
Related papers
- TADS: Task-Aware Data Selection for Multi-Task Multimodal Pre-Training [29.962039479618543]
We introduce TADS (Task-Aware Data Selection), a novel framework for multi-task multimodal pre-training. TADS integrates Intrinsic Quality, Task Relevance, and Distributional Diversity into a learnable value function. A feedback-driven meta-learning mechanism adaptively refines the selection strategy based on proxy model performance.
arXiv Detail & Related papers (2026-02-05T03:08:45Z)
- Bridging VLMs and Embodied Intelligence with Deliberate Practice Policy Optimization [72.20212909644017]
Deliberate Practice Policy Optimization (DPPO) is a metacognitive "Metaloop" training framework. DPPO alternates between supervised fine-tuning (competence expansion) and reinforcement learning (skill refinement). Empirically, training a vision-language embodied model with DPPO, referred to as Pelican-VL 1.0, yields a 20.3% performance improvement over the base model. We are open-sourcing both the models and code, providing the first systematic framework that alleviates the data and resource bottleneck.
arXiv Detail & Related papers (2025-11-20T17:58:04Z)
- SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model [49.65930977591188]
Multimodal embedding models aim to yield informative unified representations that empower diverse cross-modal tasks. We introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. Specifically, the content-aware progressive training aims to enhance the model's adaptability to diverse downstream tasks and master enriched cross-modal proficiency. The collaboration-aware recommendation enhancement training further adapts multimodal representations for recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings.
arXiv Detail & Related papers (2025-10-14T16:43:22Z)
- Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward [54.708851958671794]
We propose a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential.
arXiv Detail & Related papers (2025-09-01T10:04:20Z)
- Improving Task Diversity in Label Efficient Supervised Finetuning of LLMs [14.531280062127442]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse domains, but developing high-performing models for specialized applications often requires substantial human annotation. We address the label-efficient learning problem for supervised finetuning (SFT) by leveraging task-diversity as a fundamental principle for effective data selection. Our approach is based on two key observations: 1) task labels for different prompts are often readily available; 2) pre-trained models have significantly varying levels of confidence across tasks.
arXiv Detail & Related papers (2025-07-29T03:51:00Z)
- Multimodal-Guided Dynamic Dataset Pruning for Robust and Efficient Data-Centric Learning [49.10890099624699]
We introduce a dynamic dataset pruning framework that adaptively selects training samples based on task-driven difficulty and cross-modality semantic consistency. Our work highlights the potential of integrating cross-modality alignment for robust sample selection, advancing data-centric learning toward more efficient and robust practices across application domains.
arXiv Detail & Related papers (2025-07-17T03:08:26Z)
- Is Diversity All You Need for Scalable Robotic Manipulation? [50.747150672933316]
We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions: task (what to do), embodiment (which robot to use), and expert (who demonstrates), challenging the conventional intuition of "more diverse is better". We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios. We propose a distribution debiasing method to mitigate velocity ambiguity; the resulting GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times the pre-training data.
arXiv Detail & Related papers (2025-07-08T17:52:44Z)
- Leveraging Foundation Models for Multi-modal Federated Learning with Incomplete Modality [41.79433449873368]
We propose a novel multi-modal federated learning method, Federated Multi-modal contrastiVe training with Pre-trained completion (FedMVP).
FedMVP integrates the large-scale pre-trained models to enhance the federated training.
We demonstrate that the model achieves superior performance over two real-world image-text classification datasets.
arXiv Detail & Related papers (2024-06-16T19:18:06Z)
- T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning [31.276142111455847]
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. We design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX). Rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.