Related papers: Do We Need Domain-Specific Embedding Models? An Empirical Investigation

URL: http://arxiv.org/abs/2409.18511v3
Date: Thu, 3 Oct 2024 01:44:40 GMT
Title: Do We Need Domain-Specific Embedding Models? An Empirical Investigation
Authors: Yixuan Tang, Yi Yang
Abstract summary: We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to the Massive Text Embedding Benchmark (MTEB). We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns.
Score: 18.990655668481075
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets like Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: Is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, choosing the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of financial domain-specific text datasets. We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB's higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns. Moreover, we find that the performance of general-purpose embedding models on MTEB is not correlated with their performance on FinMTEB, indicating the need for domain-specific embedding benchmarks for domain-specific embedding models. This study sheds light on developing domain-specific embedding models in the LLM era. FinMTEB comes with open-source code at https://github.com/yixuantt/FinMTEB
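The abstract's claim that MTEB performance is not correlated with FinMTEB performance can be illustrated with a rank correlation over per-model scores. The sketch below is only an illustration: the abstract does not say which statistic the authors use, so Spearman's rank correlation is an assumption, and the score lists are hypothetical placeholders, not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical average scores for five models (NOT taken from the paper).
general_scores = [0.70, 0.68, 0.66, 0.64, 0.62]  # general-purpose benchmark
domain_scores = [0.55, 0.60, 0.52, 0.61, 0.58]   # domain-specific benchmark

rho = spearman(general_scores, domain_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value of rho near zero (or negative) would mean that a model's ranking on the general benchmark says little about its ranking on the domain benchmark, which is the situation the abstract describes.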

Related papers

Mirror, Mirror on the Wall -- Which is the Best Model of Them All? [2.2022484178680877]
Large Language Models (LLMs) have become one of the most transformative tools across many applications. We argue that there are two main dimensions that need to be taken into consideration when selecting a model for further training.
arXiv Detail & Related papers (2025-11-25T20:52:45Z)
Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
LimiX: Unleashing Structured-Data Modeling Capability for Generalist Intelligence [61.46575527504109]
LimiX-16M and LimiX-2M treat structured data as a joint distribution over variables and missingness. We evaluate LimiX models across 11 large structured-data benchmarks with broad regimes of sample size, feature dimensionality, class number, categorical-to-numerical feature ratio, missingness, and sample-to-feature ratios.
arXiv Detail & Related papers (2025-09-03T17:39:08Z)
Dewey Long Context Embedding Model: A Technical Report [0.0]
dewey_en_beta is a novel text embedding model that achieves excellent performance on MTEB (Eng, v2) and LongEmbed benchmark. This report presents the training methodology and evaluation results of the open-source dewey_en_beta embedding model.
arXiv Detail & Related papers (2025-03-26T09:55:00Z)
FinMTEB: Finance Massive Text Embedding Benchmark [18.990655668481075]
We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a specialized counterpart to MTEB designed for the financial domain. FinMTEB comprises 64 financial domain-specific embedding datasets across 7 tasks. We show three key findings: (1) performance on general-purpose benchmarks shows limited correlation with financial domain tasks; (2) domain-adapted models consistently outperform their general-purpose counterparts; and (3) surprisingly, a simple Bag-of-Words approach outperforms sophisticated dense embeddings in financial Semantic Textual Similarity tasks.
arXiv Detail & Related papers (2025-02-16T04:23:52Z)
On Domain-Adaptive Post-Training for Multimodal Large Language Models [72.67107077850939]
This paper systematically investigates domain adaptation of MLLMs via post-training. We focus on data synthesis, training pipeline, and task evaluation. We conduct experiments in high-impact domains such as biomedicine, food, and remote sensing.
arXiv Detail & Related papers (2024-11-29T18:42:28Z)
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers. LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs. We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models' true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z)
Learning to Generalize Unseen Domains via Multi-Source Meta Learning for Text Classification [71.08024880298613]
We study the multi-source Domain Generalization of text classification. We propose a framework to use multiple seen domains to train a model that can achieve high accuracy in an unseen domain.
arXiv Detail & Related papers (2024-09-20T07:46:21Z)
Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models [19.015202590038996]
We evaluate the factuality of different models tuned by various preference learning algorithms. We propose APEFT (Atomic Preference Enhanced Factuality Tuning) to enhance the model's awareness of factuality.
arXiv Detail & Related papers (2024-06-18T09:07:30Z)
What matters when building vision-language models? [52.8539131958858]
We develop Idefics2, an efficient foundational vision-language model with 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks. We release the model (base, instructed, and chat) along with the datasets created for its training.
arXiv Detail & Related papers (2024-05-03T17:00:00Z)
Tabular Embedding Model (TEM): Finetuning Embedding Models For Tabular RAG Applications [0.0]
Tabular Embedding Model (TEM) is a novel approach to fine-tune embedding models for tabular Retrieval-Augmented Generation (RAG) applications. TEM not only outperforms current SOTA embedding models in this domain but also does so with a notably smaller and more efficient model structure.
arXiv Detail & Related papers (2024-04-28T14:58:55Z)
Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains. In this paper, we introduce how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
Multi-document Summarization: A Comparative Evaluation [0.0]
This paper is aimed at evaluating state-of-the-art models for Multi-document Summarization (MDS) on different types of datasets in various domains. We analyzed the performance of PRIMERA and PEG models on Big-Survey and MS^2 datasets.
arXiv Detail & Related papers (2023-09-10T07:43:42Z)
Benchmarking Diverse-Modal Entity Linking with Generative Models [78.93737257356784]
We construct a benchmark for diverse-modal EL (DMEL) from existing EL datasets. To approach the DMEL task, we proposed a generative diverse-modal model (GDMM) following a multimodal-encoder-decoder paradigm. GDMM builds a stronger DMEL baseline, outperforming state-of-the-art task-specific EL models by 8.51 F1 score on average.
arXiv Detail & Related papers (2023-05-27T02:38:46Z)
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction [67.54420015049732]
Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. Existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings.
arXiv Detail & Related papers (2023-05-23T18:01:49Z)
Attention is Not Always What You Need: Towards Efficient Classification of Domain-Specific Text [1.1508304497344637]
For large-scale IT corpora with hundreds of classes organized in a hierarchy, accurate classification of classes at the higher levels of the hierarchy is crucial. In the business world, an efficient and explainable ML model is preferred over an expensive black-box model, especially if the performance increase is marginal. Despite the widespread use of PLMs, there is a lack of a clear and well-justified rationale as to why these models are being employed for domain-specific text classification.
arXiv Detail & Related papers (2023-03-31T03:17:23Z)
WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain [42.093876880881886]
We propose a novel domain-specific Financial LANGuage model (FLANG). It uses financial keywords and phrases for better masking, together with span boundary objective and in-filing objective. Our models, code and benchmark data are publicly available on Github and Huggingface.
arXiv Detail & Related papers (2022-10-31T18:35:18Z)