Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
- URL: http://arxiv.org/abs/2602.13704v1
- Date: Sat, 14 Feb 2026 10:13:48 GMT
- Title: Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
- Authors: Lei Chen, Chen Ju, Xu Chen, Zhicheng Wang, Yuheng Jiao, Hongfeng Zhan, Zhaoyang Li, Shihao Xu, Zhixiang Zhao, Tong Jia, Jinsong Lan, Xiaoyong Zhu, Bo Zheng,
- Abstract summary: Pailitao-VL is a comprehensive multi-modal retrieval system engineered for real-time industrial search. We address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and a prohibitive efficiency-performance gap. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.
- Score: 31.824675235862532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present Pailitao-VL, a comprehensive multi-modal retrieval system engineered for high-precision, real-time industrial search. We address three critical challenges in the current SOTA solution: insufficient retrieval granularity, vulnerability to environmental noise, and a prohibitive efficiency-performance gap. Our primary contribution lies in two fundamental paradigm shifts. First, we transition the embedding paradigm from traditional contrastive learning to an absolute ID-recognition task. By anchoring instances to a globally consistent latent space defined by billions of semantic prototypes, we overcome the stochasticity and granularity bottlenecks inherent in existing embedding solutions. Second, we evolve the generative reranker from isolated pointwise evaluation to a compare-and-calibrate listwise policy. By combining chunk-based comparative reasoning with calibrated absolute relevance scoring, the system achieves nuanced discriminative resolution while avoiding the prohibitive latency typically associated with conventional reranking methods. Extensive offline benchmarks and online A/B tests on the Alibaba e-commerce platform confirm that Pailitao-VL achieves state-of-the-art performance and delivers substantial business impact. This work demonstrates a robust and scalable path for deploying advanced MLLM-based retrieval architectures in demanding, large-scale production environments.
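The abstract's first paradigm shift (embedding as an absolute ID-recognition task over a fixed bank of semantic prototypes, rather than pairwise contrastive learning) can be illustrated with a minimal sketch. Everything below is an assumption for illustration only: the stand-in encoder, the tiny prototype-bank size, and the names `embed`, `prototype_logits`, and `assign_id` do not come from the paper, which operates at the scale of billions of prototypes with a learned MLLM encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative scale: the paper describes billions of semantic prototypes;
# a tiny bank of unit-norm vectors stands in for them here.
NUM_PROTOTYPES = 1000
DIM = 64

prototype_bank = rng.standard_normal((NUM_PROTOTYPES, DIM))
prototype_bank /= np.linalg.norm(prototype_bank, axis=1, keepdims=True)

def embed(x):
    """Stand-in encoder: L2-normalize a raw feature vector."""
    return x / np.linalg.norm(x)

def prototype_logits(z, bank, temperature=0.05):
    """Cosine similarity to every prototype, temperature-scaled so a
    cross-entropy loss over prototype IDs could be trained on top."""
    return bank @ z / temperature

def assign_id(z, bank):
    """Inference: an embedding's 'absolute ID' is its nearest prototype."""
    return int(np.argmax(prototype_logits(z, bank)))

# Two slightly perturbed views of the same underlying instance should map
# to the same prototype ID, giving the globally consistent latent space
# the abstract argues for (versus batch-dependent contrastive targets).
base = rng.standard_normal(DIM)
view_a = embed(base + 0.001 * rng.standard_normal(DIM))
view_b = embed(base + 0.001 * rng.standard_normal(DIM))
print(assign_id(view_a, prototype_bank), assign_id(view_b, prototype_bank))
```

The design point this sketch isolates: because the prototype bank is global and fixed at inference time, the "label" an instance receives does not depend on which other instances happen to share its batch, which is the stochasticity bottleneck the abstract attributes to contrastive training.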
Related papers
- BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes: resolution scope and knowledge scope. To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z) - Mine and Refine: Optimizing Graded Relevance in E-commerce Search Retrieval [3.1241290518951197]
Large-scale e-commerce search demands embeddings that generalize to long-tail, noisy queries. We propose a two-stage "Mine and Refine" contrastive training framework for semantic text embeddings.
arXiv Detail & Related papers (2026-02-19T18:56:36Z) - SAGE: Scalable AI Governance & Evaluation [10.238041570564395]
SAGE is a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. SAGE was deployed within LinkedIn Search ecosystems and powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics.
arXiv Detail & Related papers (2026-02-08T06:42:50Z) - SSVP: Synergistic Semantic-Visual Prompting for Industrial Zero-Shot Anomaly Detection [55.54007781679915]
We propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. SSVP achieves state-of-the-art performance with 93.0% Image-AUROC and 92.2% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
arXiv Detail & Related papers (2026-01-14T04:42:19Z) - Contamination Detection for VLMs using Multi-Modal Semantic Perturbation [73.76465227729818]
Open-source Vision-Language Models (VLMs) have achieved state-of-the-art performance on benchmark tasks. Pretraining corpora raise a critical concern for both practitioners and users: inflated performance due to test-set leakage. We show that existing detection approaches either fail outright or exhibit inconsistent behavior. We propose a novel, simple yet effective detection method based on multi-modal semantic perturbation.
arXiv Detail & Related papers (2025-11-05T18:59:52Z) - Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model [62.889356203346985]
We propose DUal-STream diffusion (DUST), a world-model augmented VLA framework that handles the modality conflict. DUST achieves up to 6% gains over a standard VLA baseline and implicit world-modeling methods. On real-world tasks with the Franka Research 3, DUST outperforms baselines in success rate by 13%.
arXiv Detail & Related papers (2025-10-31T16:32:12Z) - AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection [40.34270276536052]
Industrial Anomaly Detection (IAD) poses a formidable challenge due to the scarcity of defective samples. Traditional approaches, often constrained by hand-crafted features or domain-specific expert models, struggle to address this limitation. We introduce AnomalyR1, a pioneering framework that leverages VLM-R1, a Multimodal Large Language Model (MLLM) renowned for its exceptional generalization and interpretability.
arXiv Detail & Related papers (2025-04-16T09:48:41Z) - EIAD: Explainable Industrial Anomaly Detection Via Multi-Modal Large Language Models [23.898938659720503]
Industrial Anomaly Detection (IAD) is critical to ensure product quality during manufacturing. We propose a novel approach that introduces a dedicated multi-modal defect localization module to decouple the dialog functionality from the core feature extraction. We also contribute the first multi-modal industrial anomaly detection training dataset, named Defect Detection Question Answering (DDQA).
arXiv Detail & Related papers (2025-03-18T11:33:29Z) - Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection [4.691083532629246]
We propose Vision-Language Anomaly Detection via Contrastive Cross-Modal Training (CLAD). CLAD aligns visual and textual features into a shared embedding space using contrastive learning. We demonstrate that CLAD outperforms state-of-the-art methods in both image-level anomaly detection and pixel-level anomaly localization.
arXiv Detail & Related papers (2024-12-01T17:00:43Z) - Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z) - MMRNet: Improving Reliability for Multimodal Object Detection and Segmentation for Bin Picking via Multimodal Redundancy [68.7563053122698]
We propose a reliable object detection and segmentation system with MultiModal Redundancy (MMRNet).
This is the first system that introduces the concept of multimodal redundancy to address sensor failure issues during deployment.
We present a new label-free multi-modal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty.
arXiv Detail & Related papers (2022-10-19T19:15:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.