Towards Robust Text Retrieval with Progressive Learning
- URL: http://arxiv.org/abs/2311.11691v1
- Date: Mon, 20 Nov 2023 11:44:01 GMT
- Title: Towards Robust Text Retrieval with Progressive Learning
- Authors: Tong Wu, Yulei Qin, Enwei Zhang, Zihan Xu, Yuting Gao, Ke Li, Xing Sun
- Abstract summary: We propose the PEG, a progressively learned embedding for robust text retrieval.
We increase the training in-batch negative samples to 80,000, and for each query, we extracted five hard negatives.
PEG is trained on more than 100 million data, encompassing a wide range of domains.
- Score: 31.81063977662941
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval augmentation has become an effective solution to empower large
language models (LLMs) with external and verified knowledge sources from the
database, which overcomes the limitations and hallucinations of LLMs in
handling up-to-date and domain-specific information. However, existing
embedding models for text retrieval usually have three non-negligible
limitations. First, the number and diversity of samples in a batch are too
restricted to supervise the modeling of textual nuances at scale. Second, the
high proportional noise are detrimental to the semantic correctness and
consistency of embeddings. Third, the equal treatment to easy and difficult
samples would cause sub-optimum convergence of embeddings with poorer
generalization. In this paper, we propose the PEG, a progressively learned
embeddings for robust text retrieval. Specifically, we increase the training
in-batch negative samples to 80,000, and for each query, we extracted five hard
negatives. Concurrently, we incorporated a progressive learning mechanism,
enabling the model to dynamically modulate its attention to the samples
throughout the entire training process. Additionally, PEG is trained on more
than 100 million data, encompassing a wide range of domains (e.g., finance,
medicine, and tourism) and covering various tasks (e.g., question-answering,
machine reading comprehension, and similarity matching). Extensive experiments
conducted on C-MTEB and DuReader demonstrate that PEG surpasses
state-of-the-art embeddings in retrieving true positives, highlighting its
significant potential for applications in LLMs. Our model is publicly available
at https://huggingface.co/TownsWu/PEG.
Related papers
- Observe-R1: Unlocking Reasoning Abilities of MLLMs with Dynamic Progressive Reinforcement Learning [3.364797975300393]
We present Observe-R1, a novel framework aimed at enhancing the reasoning capabilities of multimodal large language models (MLLMs)<n>We construct the NeuraLadder dataset, which is organized and sampled according to the difficulty and complexity of data samples for RL training.<n>Experiments with the Qwen2.5-VL-3B and Qwen2.5-VL-7B models on 20k samples from the NeuraLadder dataset show that Observe-R1 outperforms a series of larger reasoning models on both reasoning and general benchmarks.
arXiv Detail & Related papers (2025-05-18T14:08:03Z) - Multimodal Distillation-Driven Ensemble Learning for Long-Tailed Histopathology Whole Slide Images Analysis [16.01677300903562]
Multiple Instance Learning (MIL) plays a significant role in computational pathology, enabling weakly supervised analysis of Whole Slide Image (WSI) datasets.
We propose an ensemble learning method based on MIL, which employs expert decoders with shared aggregators to learn diverse distributions.
We introduce a multimodal distillation framework that leverages text encoders pre-trained on pathology-text pairs to distill knowledge.
Our method, MDE-MIL, integrates multiple expert branches focusing on specific data distributions to address long-tailed issues.
arXiv Detail & Related papers (2025-03-02T14:31:45Z) - Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning [37.54523122932728]
We propose a pipeline-based data augmentation method via large language models (LLMs)
To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities.
To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples.
arXiv Detail & Related papers (2024-09-19T16:29:58Z) - THaMES: An End-to-End Tool for Hallucination Mitigation and Evaluation in Large Language Models [0.0]
Hallucination, the generation of factually incorrect content, is a growing challenge in Large Language Models.
This paper introduces THaMES, an integrated framework and library addressing this gap.
THaMES offers an end-to-end solution for evaluating and mitigating hallucinations in LLMs.
arXiv Detail & Related papers (2024-09-17T16:55:25Z) - A Framework for Fine-Tuning LLMs using Heterogeneous Feedback [69.51729152929413]
We present a framework for fine-tuning large language models (LLMs) using heterogeneous feedback.
First, we combine the heterogeneous feedback data into a single supervision format, compatible with methods like SFT and RLHF.
Next, given this unified feedback dataset, we extract a high-quality and diverse subset to obtain performance increases.
arXiv Detail & Related papers (2024-08-05T23:20:32Z) - MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training [9.023648972811458]
RagVL is a novel framework with knowledge-enhanced reranking and noise-injected training.
We instruction-tune the MLLM with a simple yet effective instruction template to induce its ranking ability.
For generation, we inject visual noise during training at the data and token levels to enhance the generator's robustness.
arXiv Detail & Related papers (2024-07-31T08:43:17Z) - Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models [0.8399688944263842]
Large Language Models (LLMs) have the capability to understand and generate human-like text from input queries.
This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines.
We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding.
arXiv Detail & Related papers (2024-06-17T04:35:17Z) - Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Negotiated Representations for Machine Mearning Application [0.0]
Overfitting is a phenomenon that occurs when a machine learning model is trained for too long and focused too much on the exact fitness of the training samples to the provided training labels.
We present an approach that increases the classification accuracy of machine learning models by allowing the model to negotiate output representations of the samples with previously determined class labels.
arXiv Detail & Related papers (2023-11-19T19:53:49Z) - Generative Negative Text Replay for Continual Vision-Language
Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z) - Few-shot Instruction Prompts for Pretrained Language Models to Detect
Social Biases [55.45617404586874]
We propose a few-shot instruction-based method for prompting pre-trained language models (LMs)
We show that large LMs can detect different types of fine-grained biases with similar and sometimes superior accuracy to fine-tuned models.
arXiv Detail & Related papers (2021-12-15T04:19:52Z) - Bridging the Gap Between Clean Data Training and Real-World Inference
for Spoken Language Understanding [76.89426311082927]
Existing models are trained on clean data, which causes a textitgap between clean data training and real-world inference.
We propose a method from the perspective of domain adaptation, by which both high- and low-quality samples are embedding into similar vector space.
Experiments on the widely-used dataset, Snips, and large scale in-house dataset (10 million training examples) demonstrate that this method not only outperforms the baseline models on real-world (noisy) corpus but also enhances the robustness, that is, it produces high-quality results under a noisy environment.
arXiv Detail & Related papers (2021-04-13T17:54:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.