GigaEmbeddings: Efficient Russian Language Embedding Model
- URL: http://arxiv.org/abs/2510.22369v1
- Date: Sat, 25 Oct 2025 17:26:05 GMT
- Title: GigaEmbeddings: Efficient Russian Language Embedding Model
- Authors: Egor Kolodin, Daria Khomich, Nikita Savushkin, Anastasia Ianina, Fyodor Minkin
- Abstract summary: GigaEmbeddings is a framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning. Our three-stage pipeline addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. GigaEmbeddings achieves state-of-the-art results (69.1 avg. score) on the ruMTEB benchmark spanning 23 multilingual tasks.
- Score: 1.3460582882338625
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GigaEmbeddings, a novel framework for training high-performance Russian-focused text embeddings through hierarchical instruction tuning of a decoder-only LLM designed specifically for Russian (GigaChat-3B). Our three-stage pipeline, comprising large-scale contrastive pre-training on web-scale corpora, fine-tuning with hard negatives, and multitask generalization across retrieval, classification, and clustering tasks, addresses key limitations of existing methods by unifying diverse objectives and leveraging synthetic data generation. Architectural innovations include bidirectional attention for contextual modeling, latent attention pooling for robust sequence aggregation, and strategic pruning of 25% of transformer layers to enhance efficiency without compromising performance. Evaluated on the ruMTEB benchmark spanning 23 multilingual tasks, GigaEmbeddings achieves state-of-the-art results (69.1 avg. score), outperforming strong baselines with larger parameter counts.
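The abstract ships no code, but the pooling idea is easy to illustrate. Below is a minimal PyTorch sketch of latent attention pooling over bidirectional hidden states: a few learned latent queries cross-attend to the token states and their outputs are averaged into one embedding. The class name, dimensions, and mean-over-latents readout are illustrative assumptions, not GigaEmbeddings' actual implementation.

```python
import torch
import torch.nn as nn

class LatentAttentionPooling(nn.Module):
    """Aggregate variable-length hidden states into one embedding by
    letting a few learned latent queries cross-attend to the tokens.
    Sizes and the mean-over-latents readout are illustrative."""

    def __init__(self, hidden_dim: int = 2048, num_latents: int = 4, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, hidden_states: torch.Tensor, pad_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, dim); pad_mask: (batch, seq), True = padding.
        queries = self.latents.unsqueeze(0).expand(hidden_states.size(0), -1, -1)
        attended, _ = self.cross_attn(queries, hidden_states, hidden_states,
                                      key_padding_mask=pad_mask)
        # Average the latent slots, project, and L2-normalize the embedding.
        return nn.functional.normalize(self.proj(attended.mean(dim=1)), dim=-1)

pooler = LatentAttentionPooling()
states = torch.randn(2, 16, 2048)
print(pooler(states, torch.zeros(2, 16, dtype=torch.bool)).shape)  # (2, 2048)
```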
Related papers
- AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis [13.528308058170479]
We present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA). Our methodology combines fine-tuning of language-appropriate backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, reducing training and inference requirements while maintaining strong effectiveness.
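A minimal sketch of the LoRA instruction-tuning step mentioned above, using the peft library; the base model id and all hyperparameters are placeholder assumptions, not the authors' settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base model id and all hyperparameters are illustrative placeholders.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor for the update
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights remain trainable
```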
arXiv Detail & Related papers (2026-03-05T08:30:59Z) - Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization. We introduce an EOS-based reconstruction task, encouraging the model to reconstruct the input from the corresponding <EOS> embeddings. Experiments on MMEB-V1 demonstrate that CoCoA, built upon Qwen2-VL and Qwen2.5-VL, significantly improves embedding quality.
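As a rough illustration of what an EOS-based reconstruction objective can look like, the sketch below asks the final <EOS> hidden state to predict the input tokens back through a lightweight head. This is a hypothetical construction under assumptions; CoCoA's actual head and loss may differ.

```python
import torch
import torch.nn as nn

class EOSReconstructionHead(nn.Module):
    """Predict the input token ids from the <EOS> embedding alone
    (hypothetical head; CoCoA's actual design may differ)."""

    def __init__(self, hidden_dim: int, vocab_size: int, max_len: int):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(max_len, hidden_dim) * 0.02)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, eos_state: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        # eos_state: (batch, dim); input_ids: (batch, seq). Broadcast the EOS
        # vector across positions, add positional codes, score the vocabulary.
        logits = self.out(eos_state.unsqueeze(1) + self.pos[: input_ids.size(1)])
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), input_ids.reshape(-1)
        )

head = EOSReconstructionHead(hidden_dim=768, vocab_size=32000, max_len=128)
loss = head(torch.randn(2, 768), torch.randint(0, 32000, (2, 32)))
```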
arXiv Detail & Related papers (2026-03-02T05:34:45Z) - MagicAgent: Towards Generalized Agent Planning [73.21129030631421]
We present MagicAgent, a series of foundation models specifically designed for generalized agent planning. We introduce a lightweight and scalable synthetic data framework that generates high-quality trajectories across diverse planning tasks. We show that MagicAgent-32B and MagicAgent-30B-A3B achieve superior performance across diverse open-source benchmarks.
arXiv Detail & Related papers (2026-02-22T01:39:16Z) - Compass-Embedding v4: Robust Contrastive Learning for Multilingual E-commerce Embeddings [12.049937870582113]
We present a high-efficiency multilingual embedding framework specifically optimized for Southeast Asian (SEA) e-commerce scenarios. Compass-Embedding v4 addresses three core challenges. We construct a diversified training corpus through context-grounded synthetic data generation, cross-lingual translation, and structured e-commerce data construction.
arXiv Detail & Related papers (2025-12-25T13:41:53Z) - CLaC at DISRPT 2025: Hierarchical Adapters for Cross-Framework Multi-lingual Discourse Relation Classification [0.0509780930114934]
Task 3 introduces a unified set of 17 discourse relation labels across 39 corpora in 16 languages and six discourse frameworks. We first benchmark the task by fine-tuning multilingual BERT-based models with two argument-ordering strategies and progressive unfreezing ratios. We then evaluate prompt-based large language models in zero-shot and few-shot settings to understand how LLMs respond to the newly proposed unified labels.
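Progressive unfreezing of a multilingual BERT backbone can be sketched as follows; the schedule of unfreezing ratios is an illustrative assumption, not the authors' configuration.

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=17  # the 17 unified relation labels
)

def unfreeze_top_fraction(model, fraction: float) -> None:
    """Freeze everything, then re-enable gradients for the top `fraction`
    of encoder layers plus the classifier head."""
    for param in model.parameters():
        param.requires_grad = False
    layers = model.bert.encoder.layer
    cutoff = int(len(layers) * (1 - fraction))
    for layer in layers[cutoff:]:
        for param in layer.parameters():
            param.requires_grad = True
    for param in model.classifier.parameters():
        param.requires_grad = True

# Expose progressively more of the encoder as training proceeds.
for ratio in [0.25, 0.5, 1.0]:
    unfreeze_top_fraction(model, ratio)
    # ... run one training epoch here ...
```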
arXiv Detail & Related papers (2025-09-21T03:34:31Z) - Resource-Efficient Adaptation of Large Language Models for Text Embeddings via Prompt Engineering and Contrastive Fine-tuning [3.9914181590063884]
Large Language Models (LLMs) have become a cornerstone in Natural Language Processing (NLP). We explore several adaptation strategies for pre-trained, decoder-only LLMs.
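One common adaptation recipe in this space, an instruction prefix plus last-token pooling over a decoder-only LLM, can be sketched as follows; the model id and prompt wording are assumptions, not the paper's choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen2-0.5B"  # illustrative base model, not the paper's choice
tok = AutoTokenizer.from_pretrained(name)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
tok.padding_side = "right"  # last-token pooling below assumes right padding
model = AutoModel.from_pretrained(name)

def embed(texts, instruction="Represent this sentence for retrieval: "):
    batch = tok([instruction + t for t in texts],
                padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq, dim)
    # Pool the hidden state of each sequence's last non-padding token.
    last = batch["attention_mask"].sum(dim=1) - 1
    pooled = hidden[torch.arange(hidden.size(0)), last]
    return torch.nn.functional.normalize(pooled, dim=-1)

print(embed(["Hello world", "Bonjour"]).shape)
```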
arXiv Detail & Related papers (2025-07-30T14:49:30Z) - NAP-Tuning: Neural Augmented Prompt Tuning for Adversarially Robust Vision-Language Models [72.58372335140241]
Adversarial Prompt Tuning (AdvPT) introduced learnable text prompts to enhance adversarial robustness in Vision-Language Models (VLMs). We present the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). Our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 33.5% on ViT-B16 and 33.0% on ViT-B32 architectures.
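The core ingredient of adversarial robustness training, generating perturbed inputs to train against, can be illustrated with a generic PGD sketch; NAP-Tuning's actual prompt parameterization and training loop are more involved than this.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=8 / 255, alpha=2 / 255, steps=10):
    """Generic projected gradient descent on image inputs; a standard
    building block, not NAP-Tuning's specific procedure."""
    adv = (images + torch.empty_like(images).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(adv), labels), adv)[0]
        adv = adv.detach() + alpha * grad.sign()        # ascend the loss
        adv = images + (adv - images).clamp(-eps, eps)  # project to the eps-ball
        adv = adv.clamp(0, 1)                           # stay a valid image
    return adv.detach()
```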
arXiv Detail & Related papers (2025-06-15T03:34:23Z) - Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models [90.54780244175511]
We introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series. The Qwen3 Embedding series offers a spectrum of model sizes for both embedding and reranking tasks. The series achieves state-of-the-art results across diverse benchmarks.
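A retrieval-style usage sketch with sentence-transformers might look like the following; the model id is assumed from the series naming and should be verified on the Hugging Face hub.

```python
from sentence_transformers import SentenceTransformer, util

# Model id assumed from the series naming; verify on the Hugging Face hub.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
docs = ["Gradient descent minimizes a loss.", "Borscht is a beet soup."]
query_emb = model.encode(["How do optimizers work?"])
doc_emb = model.encode(docs)
print(util.cos_sim(query_emb, doc_emb))  # cosine similarities, shape (1, 2)
```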
arXiv Detail & Related papers (2025-06-05T15:49:48Z) - Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only one-third of the parameters.
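The exponential-moving-average piece is easy to illustrate in isolation; the generic sketch below shows the standard EMA update, not the paper's exact coefficient-learning rule.

```python
def ema_update(prev: float, obs: float, beta: float = 0.9) -> float:
    """Standard EMA: new = beta * prev + (1 - beta) * obs."""
    return beta * prev + (1.0 - beta) * obs

coeff = 0.0
for obs in [1.0, 0.8, 1.2, 0.9]:
    coeff = ema_update(coeff, obs)
print(round(coeff, 4))  # smoothed coefficient after four observations
```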
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Exploring the State-of-the-Art Language Modeling Methods and Data Augmentation Techniques for Multilingual Clause-Level Morphology [3.8498574327875947]
We present our work on all three parts of the shared task: inflection, reinflection, and analysis.
We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting the state-of-the-art language modeling techniques for morphological analysis.
Our methods achieved first place in each of the three tasks and outperform the mT5 baseline with 89% for inflection, 80% for reinflection, and 12% for analysis.
arXiv Detail & Related papers (2022-11-03T11:53:39Z) - Exploring Dimensionality Reduction Techniques in Multilingual Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensional reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers.
It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58\% \pm 2.59\%$ and $54.65\% \pm 32.20\%$, respectively.
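An experiment in this spirit can be reproduced in a few lines with scikit-learn; the encoder id and the 95% variance target below are illustrative assumptions.

```python
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

# Encoder id is an illustrative multilingual Siamese model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
sentences = ["The cat sits.", "Le chat est assis.", "Die Katze sitzt."] * 20
embeddings = model.encode(sentences)  # shape (60, 384)

# Keep just enough principal components to retain ~95% of the variance.
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(embeddings)
print(embeddings.shape[1], "->", reduced.shape[1], "dimensions")
```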
arXiv Detail & Related papers (2022-04-18T17:20:55Z) - Improving Context Modeling in Neural Topic Segmentation [18.92944038749279]
We enhance a segmenter based on a hierarchical attention BiLSTM network to better model context.
Our optimized segmenter outperforms SOTA approaches when trained and tested on three datasets.
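A minimal sketch of the kind of building block such a segmenter stacks, one BiLSTM with word-level attention per sentence, is shown below; all dimensions are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """One building block of a hierarchical-attention segmenter: a BiLSTM
    over words with additive attention pooling (dimensions illustrative)."""

    def __init__(self, embed_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)

    def forward(self, word_embeds: torch.Tensor) -> torch.Tensor:
        # word_embeds: (sentences, words, embed_dim)
        states, _ = self.lstm(word_embeds)                 # (S, W, 2*hidden)
        weights = torch.softmax(self.attn(states), dim=1)  # attention over words
        return (weights * states).sum(dim=1)               # (S, 2*hidden)

enc = SentenceEncoder()
print(enc(torch.randn(5, 20, 128)).shape)  # torch.Size([5, 128])
```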
arXiv Detail & Related papers (2020-10-07T03:40:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.