Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping
- URL: http://arxiv.org/abs/2506.07658v1
- Date: Mon, 09 Jun 2025 11:30:12 GMT
- Title: Beyond Benchmarks: A Novel Framework for Domain-Specific LLM Evaluation and Knowledge Mapping
- Authors: Nitin Sharma, Thomas Wolfers, Çağatay Yıldız
- Abstract summary: The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation. Our approach generates domain-specific keywords and related word lists using term frequency (TF) and TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The paper addresses two critical challenges in language model (LM) evaluation: creating reliable domain-specific benchmarks and understanding knowledge representation during domain adaptation. We introduce a deterministic pipeline that converts raw domain corpora into completion-type benchmarks without relying on LMs or human curation, eliminating benchmark contamination issues while enabling evaluation on the latest domain data. Our approach generates domain-specific keywords and related word lists using term frequency (TF) and TF-IDF methods and constructs prompt-target pairs. We evaluate models by measuring their ability to complete these prompts with the correct domain-specific targets, providing a direct assessment of domain knowledge with low computational cost. Through comprehensive experiments across multiple models (GPT-2 medium/XL, Llama-2/3.1, OLMo-2, Qwen-2, Mistral) and domains, we demonstrate that our benchmark strongly correlates with expert-generated benchmarks while providing a more accurate measure of domain knowledge than traditional perplexity metrics. We reveal that domain adaptation happens rapidly in smaller models (within 500 steps) and illustrate a new approach to evaluating domain knowledge in base models during training, enabling early stopping. By extending mechanistic analysis to domain adaptation, we discover that initial-to-mid layers are primarily responsible for attribute extraction, while later layers focus on next-token prediction. Furthermore, we show that during adaptation, forgetting begins in the middle layers, where attribute extraction happens, and is amplified in later layers. Our work provides both a practical evaluation methodology for domain-specific LMs and novel insights into knowledge representation during adaptation, with implications for more efficient fine-tuning strategies and targeted approaches to mitigate catastrophic forgetting.
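The pipeline the abstract describes (TF/TF-IDF keyword extraction, prompt-target pair construction, and completion-based scoring) can be illustrated with a short sketch. The code below is not the authors' implementation: the library choices (scikit-learn, Hugging Face transformers), all function names, and the end-of-sentence pair-construction heuristic are assumptions made for illustration.

```python
# Minimal sketch of the benchmark-construction pipeline from the abstract.
# Illustrative only: the paper does not specify an implementation, and the
# library choices plus every name below are assumptions.
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import AutoModelForCausalLM, AutoTokenizer


def extract_domain_keywords(domain_docs, top_k=50):
    """Rank candidate domain keywords by aggregate TF-IDF over the corpus."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(domain_docs)
    scores = tfidf.sum(axis=0).A1  # one aggregate score per vocabulary term
    vocab = vectorizer.get_feature_names_out()
    ranked = sorted(zip(vocab, scores), key=lambda p: p[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]


def build_prompt_target_pairs(domain_docs, keywords):
    """Turn sentences that end in a domain keyword into (prompt, target) pairs."""
    keyword_set = set(keywords)
    pairs = []
    for doc in domain_docs:
        for sentence in doc.split("."):
            tokens = sentence.split()
            if len(tokens) > 3 and tokens[-1].lower() in keyword_set:
                pairs.append((" ".join(tokens[:-1]), tokens[-1]))
    return pairs


@torch.no_grad()
def completion_accuracy(model_name, pairs):
    """Fraction of prompts whose greedy next token matches the domain target."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    hits = 0
    for prompt, target in pairs:
        inputs = tokenizer(prompt, return_tensors="pt")
        next_token_logits = model(**inputs).logits[0, -1]
        predicted = tokenizer.decode(next_token_logits.argmax().item()).strip()
        hits += predicted.lower() == target.lower()
    return hits / max(len(pairs), 1)
```

As a usage sketch, `completion_accuracy("gpt2-medium", build_prompt_target_pairs(docs, extract_domain_keywords(docs)))` would score one of the smaller evaluated models on a raw domain corpus `docs`. The single-token greedy match is a deliberate simplification; the paper's actual pair construction and its handling of multi-token targets may differ.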
Related papers
- A Unified Analysis of Generalization and Sample Complexity for Semi-Supervised Domain Adaptation [1.9567015559455132]
Domain adaptation seeks to leverage the abundant label information in a source domain to improve classification performance in a target domain with limited labels. Most existing theoretical analyses focus on simplified settings where the source and target domains share the same input space. We present a comprehensive theoretical study of domain adaptation algorithms based on domain alignment.
arXiv Detail & Related papers (2025-07-30T12:53:08Z) - Topology-Aware Modeling for Unsupervised Simulation-to-Reality Point Cloud Recognition [63.55828203989405]
We introduce a novel Topology-Aware Modeling (TAM) framework for Sim2Real UDA on object point clouds. Our approach mitigates the domain gap by leveraging global spatial topology, characterized by low-level, high-frequency 3D structures. We propose an advanced self-training strategy that combines cross-domain contrastive learning with self-training.
arXiv Detail & Related papers (2025-06-26T11:53:59Z) - Context-Aware Self-Adaptation for Domain Generalization [32.094290282897894]
Domain generalization aims at developing learning algorithms in source training domains such that the learned model generalizes well to unseen target domains. We present a novel two-stage approach called Context-Aware Self-Adaptation (CASA) for domain generalization.
arXiv Detail & Related papers (2025-04-03T22:33:38Z) - TestAgent: A Framework for Domain-Adaptive Evaluation of LLMs via Dynamic Benchmark Construction and Exploratory Interaction [29.72874725703848]
Large language models (LLMs) are increasingly deployed in various vertical domains. Current evaluation methods rely on static and resource-intensive datasets that are not aligned with real-world requirements. We introduce two key concepts: Benchmark+, which extends the traditional question-answer benchmark into a more flexible "strategy-criterion" format. We propose TestAgent, an agent-based evaluation framework that implements these concepts using retrieval-augmented generation and reinforcement learning.
arXiv Detail & Related papers (2024-10-15T11:20:42Z) - DG-PIC: Domain Generalized Point-In-Context Learning for Point Cloud Understanding [41.49771026674969]
We introduce a novel, practical, multi-domain multi-task setting, handling multiple domains and multiple tasks within one unified model for domain generalized point cloud understanding.
Our DG-PIC does not require any model updates during testing and can handle unseen domains and multiple tasks, i.e., point cloud reconstruction, denoising, and registration, within one unified model.
arXiv Detail & Related papers (2024-07-11T18:21:40Z) - Understanding the Cross-Domain Capabilities of Video-Based Few-Shot Action Recognition Models [3.072340427031969]
Few-shot action recognition (FSAR) aims to learn a model capable of identifying novel actions in videos using only a few examples.
By assuming that the base dataset seen during meta-training and the novel dataset used for evaluation can come from different domains, cross-domain few-shot learning alleviates data collection and annotation costs.
We systematically evaluate existing state-of-the-art single-domain, transfer-based, and cross-domain FSAR methods on new cross-domain tasks.
arXiv Detail & Related papers (2024-06-03T07:48:18Z) - StyDeSty: Min-Max Stylization and Destylization for Single Domain Generalization [85.18995948334592]
Single domain generalization (single DG) aims at learning a robust model generalizable to unseen domains from only one training domain.
State-of-the-art approaches have mostly relied on data augmentations, such as adversarial perturbation and style enhancement, to synthesize new data.
We propose StyDeSty, which explicitly accounts for the alignment of the source and pseudo domains in the process of data augmentation.
arXiv Detail & Related papers (2024-06-01T02:41:34Z) - Towards Efficient Methods in Medical Question Answering using Knowledge Graph Embeddings [3.944219308229571]
In Natural Language Processing (NLP), Machine Reading Comprehension (MRC) is the task of answering a question based on a given context. To handle questions in the medical domain, modern language models such as BioBERT, SciBERT, and even ChatGPT are trained on vast amounts of in-domain medical corpora. We propose a resource-efficient approach for injecting domain knowledge into a model without relying on such domain-specific pre-training.
arXiv Detail & Related papers (2024-01-15T21:43:46Z) - Improving Domain Generalization with Domain Relations [77.63345406973097]
This paper focuses on domain shifts, which occur when the model is applied to new domains that are different from the ones it was trained on.
We propose a new approach called D^3G to learn domain-specific models.
Our results show that D^3G consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-02-06T08:11:16Z) - Inferring Latent Domains for Unsupervised Deep Domain Adaptation [54.963823285456925]
Unsupervised Domain Adaptation (UDA) refers to the problem of learning a model in a target domain where labeled data are not available.
This paper introduces a novel deep architecture which addresses the problem of UDA by automatically discovering latent domains in visual datasets.
We evaluate our approach on publicly available benchmarks, showing that it outperforms state-of-the-art domain adaptation methods.
arXiv Detail & Related papers (2021-03-25T14:33:33Z) - Cluster, Split, Fuse, and Update: Meta-Learning for Open Compound Domain Adaptive Semantic Segmentation [102.42638795864178]
We propose a principled meta-learning based approach to OCDA for semantic segmentation.
We cluster target domain into multiple sub-target domains by image styles, extracted in an unsupervised manner.
A meta-learner is thereafter deployed to learn to fuse sub-target domain-specific predictions, conditioned upon the style code.
We learn to update the model online via the model-agnostic meta-learning (MAML) algorithm, further improving generalization.
arXiv Detail & Related papers (2020-12-15T13:21:54Z) - Domain Adaptation for Semantic Parsing [68.81787666086554]
We propose a novel semantic parser for domain adaptation, where we have much fewer annotated data in the target domain compared to the source domain.
Our semantic parser benefits from a two-stage coarse-to-fine framework, and thus can provide different and accurate treatments for the two stages.
Experiments on a benchmark dataset show that our method consistently outperforms several popular domain adaptation strategies.
arXiv Detail & Related papers (2020-06-23T14:47:41Z) - Learning Meta Face Recognition in Unseen Domains [74.69681594452125]
We propose a novel face recognition method via meta-learning named Meta Face Recognition (MFR).
MFR synthesizes the source/target domain shift with a meta-optimization objective.
We propose two benchmarks for generalized face recognition evaluation.
arXiv Detail & Related papers (2020-03-17T14:10:30Z)