Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings
- URL: http://arxiv.org/abs/2601.11565v1
- Date: Thu, 25 Dec 2025 13:41:53 GMT
- Title: Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings
- Authors: Pakorn Ueareeworakul, Shuman Liu, Jinghao Feng, Ling Hu, Zhantang Shi, Chengqi Sun, Liang Yao, Panyi Ouyang, Haibo Zhang, Anxiang Zeng
- Abstract summary: We present a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios. Compass-Embedding v4 addresses three core challenges. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction.
- Score: 12.049937870582113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As global e-commerce rapidly expands into emerging markets, the lack of high-quality semantic representations for low-resource languages has become a decisive bottleneck for retrieval, recommendation, and search systems. In this work, we present Compass-Embedding v4, a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios, where data scarcity, noisy supervision, and strict production constraints jointly challenge representation learning. Compass-Embedding v4 addresses three core challenges. First, large-batch contrastive training under mixed task supervision introduces systematic false negatives that degrade semantic alignment. We propose Class-Aware Masking (CAM), a lightweight modification to the InfoNCE objective that suppresses invalid in-batch negatives and improves semantic discrimination without altering training efficiency. Second, low-resource SEA languages suffer from limited and uneven data coverage. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction, enabling robust multilingual and domain-specific learning. Third, production deployment requires high-throughput inference while preserving embedding quality. We combine robustness-driven large-batch training with spherical model merging to mitigate catastrophic forgetting, and optimize inference via vLLM and FP8 quantization. Extensive evaluations across multilingual benchmarks and proprietary e-commerce tasks show that Compass-Embedding v4 achieves state-of-the-art performance on major SEA languages, significantly outperforming general-purpose embedding models in domain-specific retrieval and classification, while maintaining competitive performance on high-resource languages.
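The abstract describes Class-Aware Masking (CAM) only at a high level. As a minimal sketch, assuming paired query/positive batches with a per-example class or task label, the mask can be applied directly to the in-batch InfoNCE logits; the function name, temperature value, and masking rule below are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def cam_info_nce(query_emb, pos_emb, labels, temperature=0.05):
    """In-batch InfoNCE with Class-Aware Masking (illustrative sketch).

    query_emb, pos_emb: (B, D) L2-normalized query / positive embeddings.
    labels:             (B,) class or task IDs used to flag invalid negatives.
    """
    B = query_emb.size(0)
    logits = query_emb @ pos_emb.T / temperature          # (B, B) similarity matrix
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)
    diag = torch.eye(B, dtype=torch.bool, device=logits.device)
    # Mask off-diagonal pairs that share the query's class: they are likely
    # relevant, so treating them as negatives would inject false-negative gradient.
    logits = logits.masked_fill(same_class & ~diag, float("-inf"))
    targets = torch.arange(B, device=logits.device)       # positives sit on the diagonal
    return F.cross_entropy(logits, targets)
```

Note that invalid negatives are excluded from the softmax rather than re-weighted, so the batch layout and per-step cost are unchanged, consistent with the abstract's claim that CAM preserves training efficiency. A sketch of the spherical model merging step mentioned in the abstract follows the "Unlocking the Potential of Model Merging for Low-Resource Languages" entry below.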
Related papers
- Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation [49.82863380286994]
In-context learning may offer novel ways to adapt Large Language Models for low-resource machine translation. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window.
arXiv Detail & Related papers (2026-02-04T17:02:22Z)
- GigaEmbeddings: Efficient Russian Language Embedding Model [1.3460582882338625]
GigaEmbeddings is a framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning. Our three-stage pipeline addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. GigaEmbeddings achieves state-of-the-art results (69.1 avg. score) on the ruMTEB benchmark spanning 23 multilingual tasks.
arXiv Detail & Related papers (2025-10-25T17:26:05Z)
- A Data-Centric Approach to Multilingual E-Commerce Product Search: Case Study on Query-Category and Query-Item Relevance [4.017203385311908]
Multilingual e-commerce search suffers from severe data imbalance across languages. We present a practical, architecture-agnostic, data-centric framework to enhance performance on two core tasks.
arXiv Detail & Related papers (2025-10-24T17:27:35Z)
- Aligning LLMs for Multilingual Consistency in Enterprise Applications [39.24743635297897]
Large language models (LLMs) remain unreliable for global enterprise applications. We propose a practical, batch-wise alignment strategy for fine-tuning LLMs. This approach improves non-English accuracy by up to 23.9% without compromising English performance, model reasoning, or retrieval quality.
arXiv Detail & Related papers (2025-09-28T05:51:22Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
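The Compass-Embedding v4 abstract names spherical model merging as its defense against catastrophic forgetting, and the entry above reaches a similar conclusion for low-resource languages. Neither spells out the procedure, so the following is a minimal sketch of one common instantiation, spherical linear interpolation (slerp) applied tensor by tensor across two checkpoints; the 50/50 blend, the flattening, and the checkpoint paths are assumptions.

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    v0, v1 = w0.flatten(), w1.flatten()
    # Angle between the two weight vectors, treated as points on a sphere.
    cos = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    omega = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    so = torch.sin(omega)
    if so.abs() < eps:  # nearly parallel: slerp degenerates to linear interpolation
        out = (1 - t) * v0 + t * v1
    else:
        out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
    return out.view_as(w0)

# Merge a base checkpoint with a domain-adapted one, parameter by parameter
# (hypothetical file names; assumes both files hold dicts of matching tensors).
base, tuned = torch.load("base.pt"), torch.load("tuned.pt")
merged = {name: slerp(base[name], tuned[name]) for name in base}
```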
- Mitigating Data Imbalance and Representation Degeneration in Multilingual Machine Translation [103.90963418039473]
Bi-ACL is a framework that uses only target-side monolingual data and a bilingual dictionary to improve the performance of the MNMT model.
We show that Bi-ACL is more effective both in long-tail languages and in high-resource languages.
arXiv Detail & Related papers (2023-05-22T07:31:08Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
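The iterated best response scheme mentioned in the entry above alternates two moves: the adversary re-weights languages toward the current worst losses, and the learner minimizes the re-weighted loss. A generic sketch of one such round follows, using an exponentiated-gradient update for the weights; the step size and update rule are our assumptions, not the paper's exact recipe.

```python
import torch

def dro_step(per_lang_loss: torch.Tensor, weights: torch.Tensor, eta: float = 0.1):
    """One iterated-best-response round for distributionally robust training.

    per_lang_loss: (K,) current loss of each language/group.
    weights:       (K,) distribution over languages (non-negative, sums to 1).
    Returns the robust training loss and the updated language weights.
    """
    # Adversary's best response: shift weight toward high-loss languages via an
    # exponentiated-gradient (mirror ascent) update; detach so the weight update
    # does not backpropagate into the model.
    new_w = weights * torch.exp(eta * per_lang_loss.detach())
    new_w = new_w / new_w.sum()
    # Learner's best response: minimize the re-weighted loss.
    robust_loss = (new_w * per_lang_loss).sum()
    return robust_loss, new_w
```

In a training loop, `robust_loss.backward()` drives the model update while `new_w` carries over to the next batch; practical variants also constrain how far the weights may drift from the empirical language distribution, which this sketch omits.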
- Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose a novel cross-lingual model transferring framework for PrLMs: TreLM.
To handle the symbol order and sequence length differences between languages, we propose an intermediate "TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z)
- Improving Low-resource Reading Comprehension via Cross-lingual Transposition Rethinking [0.9236074230806579]
Extractive Reading Comprehension (ERC) has made tremendous advances, enabled by the availability of large-scale, high-quality ERC training data.
Despite such rapid progress and widespread application, datasets in languages other than high-resource languages such as English remain scarce.
We propose a Cross-Lingual Transposition ReThinking (XLTT) model by modelling existing high-quality extractive reading comprehension datasets in a multilingual environment.
arXiv Detail & Related papers (2021-07-11T09:35:16Z)
- Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from high-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)