Related papers: Improving embedding with contrastive fine-tuning on small datasets with expert-augmented scores

Related papers

An Enhanced Model-based Approach for Short Text Clustering [58.60681789677676]
Short text clustering has become increasingly important with the popularity of social media like Twitter, Google+, and Facebook.<n>Existing methods can be broadly categorized into two paradigms: topic model-based approaches and deep representation learning-based approaches.<n>We propose a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model (GSDMM), which effectively handles the sparsity and high dimensionality of short texts.<n>Based on several aspects of GSDMM that warrant further refinement, we propose an improved approach, GSDMM+, designed to further optimize its performance.
arXiv Detail & Related papers (2025-07-18T10:07:42Z)
LGAI-EMBEDDING-Preview Technical Report [41.68404082385825]
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks.<n>Our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings.<n>Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score.
arXiv Detail & Related papers (2025-06-09T05:30:35Z)
READ: Reinforcement-based Adversarial Learning for Text Classification with Limited Labeled Data [7.152603583363887]
Pre-trained transformer models such as BERT have shown massive gains across many text classification tasks. This paper proposes a method that encapsulates reinforcement learning-based text generation and semi-supervised adversarial learning approaches. Our method READ, Reinforcement-based Adversarial learning, utilizes an unlabeled dataset to generate diverse synthetic text through reinforcement learning.
arXiv Detail & Related papers (2025-01-14T11:39:55Z)
Optimizing Multi-Stage Language Models for Effective Text Retrieval [0.0]
We introduce a novel two-phase text retrieval pipeline optimized for Japanese legal datasets. Our method leverages advanced language models to achieve state-of-the-art performance. To further enhance robustness and adaptability, we incorporate an ensemble model that integrates multiple retrieval strategies.
arXiv Detail & Related papers (2024-12-26T16:05:19Z)
Movie2Story: A framework for understanding videos and telling stories in the form of novel text [0.0]
We propose a novel benchmark to evaluate text generation capabilities in scenarios enriched with auxiliary information. Our work introduces an innovative automatic dataset generation method to ensure the availability of accurate auxiliary information. Our experiments reveal that current Multi-modal Large Language Models (MLLMs) perform suboptimally under the proposed evaluation metrics.
arXiv Detail & Related papers (2024-12-19T15:44:04Z)
Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z)
Beyond Coarse-Grained Matching in Video-Text Retrieval [50.799697216533914]
We introduce a new approach for fine-grained evaluation. Our approach can be applied to existing datasets by automatically generating hard negative test captions. Experiments on our fine-grained evaluations demonstrate that this approach enhances a model's ability to understand fine-grained differences.
arXiv Detail & Related papers (2024-10-16T09:42:29Z)
Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling [0.0]
APEX-Embedding-7B is a 7-billion parameter decoder-only text Feature Extraction Model. Our approach employs two training techniques that yield an emergent improvement in factual focus. Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
arXiv Detail & Related papers (2024-10-08T17:36:48Z)
Exploring Information Retrieval Landscapes: An Investigation of a Novel Evaluation Techniques and Comparative Document Splitting Methods [0.0]
In this study, the structured nature of textbooks, the conciseness of articles, and the narrative complexity of novels are shown to require distinct retrieval strategies. A novel evaluation technique is introduced, utilizing an open-source model to generate a comprehensive dataset of question-and-answer pairs. The evaluation employs weighted scoring metrics, including SequenceMatcher, BLEU, METEOR, and BERT Score, to assess the system's accuracy and relevance.
arXiv Detail & Related papers (2024-09-13T02:08:47Z)
Enhancing Embedding Performance through Large Language Model-based Text Enrichment and Rewriting [0.0]
This paper proposes a novel approach to improve embedding performance by leveraging large language models (LLMs) to enrich and rewrite input text before the embedding process. The effectiveness of this approach is evaluated on three datasets: Banking77Classification, TwitterSemEval 2015, and Amazon Counter-factual Classification.
arXiv Detail & Related papers (2024-04-18T15:58:56Z)
Improving Sampling Methods for Fine-tuning SentenceBERT in Text Streams [49.3179290313959]
This study explores the efficacy of seven text sampling methods designed to selectively fine-tune language models. We precisely assess the impact of these methods on fine-tuning the SBERT model using four different loss functions. Our findings indicate that Softmax loss and Batch All Triplets loss are particularly effective for text stream classification.
arXiv Detail & Related papers (2024-03-18T23:41:52Z)
One-Shot Learning as Instruction Data Prospector for Large Language Models [108.81681547472138]
textscNuggets uses one-shot learning to select high-quality instruction data from extensive datasets. We show that instruction tuning with the top 1% of examples curated by textscNuggets substantially outperforms conventional methods employing the entire dataset.
arXiv Detail & Related papers (2023-12-16T03:33:12Z)
EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: An instruction-based, benchmark and evaluation suite for automatic evaluation of editing capabilities. We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA. Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
Assisted Text Annotation Using Active Learning to Achieve High Quality with Little Effort [9.379650501033465]
We propose a tool that enables researchers to create large, high-quality, annotated datasets with only a few manual annotations. We combine an active learning (AL) approach with a pre-trained language model to semi-automatically identify annotation categories. Our preliminary results show that employing AL strongly reduces the number of annotations for correct classification of even complex and subtle frames.
arXiv Detail & Related papers (2021-12-15T13:14:58Z)
Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore. We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
Automated Concatenation of Embeddings for Structured Prediction [75.44925576268052]
We propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks. We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model.
arXiv Detail & Related papers (2020-10-10T14:03:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.