OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation
- URL: http://arxiv.org/abs/2602.08896v1
- Date: Mon, 09 Feb 2026 16:57:35 GMT
- Title: OmniReview: A Large-scale Benchmark and LLM-enhanced Framework for Realistic Reviewer Recommendation
- Authors: Yehua Huang, Penglei Sun, Zebin Chen, Zhenheng Tang, Xiaowen Chu
- Abstract summary: Profiling Scholars with Multi-gate Mixture-of-Experts (Pro-MMoE) is a novel framework that synergizes Large Language Models (LLMs) with Multi-task Learning. Pro-MMoE achieves state-of-the-art performance across six of seven metrics, establishing a new benchmark for realistic reviewer recommendation.
- Score: 22.223973340236594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Academic peer review remains the cornerstone of scholarly validation, yet the field faces persistent challenges in both data and methods. From the data perspective, existing research is hindered by the scarcity of large-scale, verified benchmarks and by oversimplified evaluation metrics that fail to reflect real-world editorial workflows. To bridge this gap, we present OmniReview, a comprehensive dataset constructed by integrating multi-source academic platforms and encompassing comprehensive scholarly profiles through a disambiguation pipeline, yielding 202,756 verified review records. Based on this data, we introduce a three-tier hierarchical evaluation framework to assess recommendations from coarse recall to precise expert identification. From the method perspective, existing embedding-based approaches suffer from the information bottleneck of semantic compression and from limited interpretability. To resolve these limitations, we propose Profiling Scholars with Multi-gate Mixture-of-Experts (Pro-MMoE), a novel framework that synergizes Large Language Models (LLMs) with Multi-task Learning. Specifically, it utilizes LLM-generated semantic profiles to preserve fine-grained expertise nuances and interpretability, while employing a Task-Adaptive MMoE architecture to dynamically balance conflicting evaluation goals. Comprehensive experiments demonstrate that Pro-MMoE achieves state-of-the-art performance on six of seven metrics, establishing a new benchmark for realistic reviewer recommendation.
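As a rough illustration of the multi-gate mixture-of-experts idea named in the abstract (shared expert networks whose outputs are mixed by a separate gate per evaluation task), the following PyTorch sketch is a minimal, hypothetical implementation. The layer sizes, the number of experts and tasks, and the assumption that scholar-paper pairs arrive as pre-fused embeddings of LLM-generated profiles are illustrative choices, not details taken from the paper.

```python
# Minimal MMoE sketch for scoring a scholar-paper pair on several tasks.
# All dimensions and task counts are assumptions for illustration only.
import torch
import torch.nn as nn


class MMoE(nn.Module):
    def __init__(self, in_dim=768, expert_dim=256, n_experts=4, n_tasks=3):
        super().__init__()
        # Shared experts: small MLPs over the fused pair representation.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU(),
                          nn.Linear(expert_dim, expert_dim))
            for _ in range(n_experts)
        ])
        # One softmax gate per task, mixing the experts differently per task.
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        # One scoring head per task (e.g., coarse recall vs. precise expert match).
        self.heads = nn.ModuleList([nn.Linear(expert_dim, 1) for _ in range(n_tasks)])

    def forward(self, x):
        # x: (batch, in_dim) joint embedding of a scholar-paper pair.
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)   # (B, E, D)
        scores = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)            # (B, E, 1)
            mixed = (w * expert_out).sum(dim=1)                         # (B, D)
            scores.append(head(mixed).squeeze(-1))                      # (B,)
        return scores  # one relevance score per task


if __name__ == "__main__":
    model = MMoE()
    pair = torch.randn(8, 768)  # e.g., fused profile + submission embeddings
    print([s.shape for s in model(pair)])
```

The per-task gates are what let such an architecture trade off conflicting objectives: each task reweights the same shared experts rather than learning a fully separate model.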
Related papers
- MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - AllSummedUp: An Open-Source Framework for Comparing Summarization Evaluation Metrics [2.2153783542347805]
This paper investigates challenges in automatic text summarization evaluation. Based on experiments conducted across six representative metrics, we highlight significant discrepancies between reported performances in the literature and those observed in our experimental setting. We introduce a unified, open-source framework, applied to the SummEval dataset and designed to support fair and transparent comparison of evaluation metrics.
arXiv Detail & Related papers (2025-08-29T08:05:00Z) - Beyond "Not Novel Enough": Enriching Scholarly Critique with LLM-Assisted Feedback [81.0031690510116]
We present a structured approach for automated novelty evaluation that models expert reviewer behavior through three stages. Our method is informed by a large-scale analysis of human-written novelty reviews. Evaluated on 182 ICLR 2025 submissions, the approach achieves 86.5% alignment with human reasoning and 75.3% agreement on novelty conclusions.
arXiv Detail & Related papers (2025-08-14T16:18:37Z) - OpenUnlearning: Accelerating LLM Unlearning via Unified Benchmarking of Methods and Metrics [82.0813150432867]
We introduce OpenUnlearning, a standardized framework for benchmarking large language model (LLM) unlearning methods and metrics. OpenUnlearning integrates 13 unlearning algorithms and 16 diverse evaluations across 3 leading benchmarks. We also benchmark diverse unlearning methods and provide a comparative analysis against an extensive evaluation suite.
arXiv Detail & Related papers (2025-06-14T20:16:37Z) - Leveraging LLMs to Evaluate Usefulness of Document [25.976948104719746]
We introduce a new user-centric evaluation framework that integrates users' search context and behavioral data into large language models. Our study demonstrates that when well-guided with context and behavioral information, LLMs can accurately evaluate usefulness. We also apply the labels produced by our method to predict user satisfaction, with real-world experiments indicating that these labels substantially improve the performance of satisfaction prediction models.
arXiv Detail & Related papers (2025-06-10T09:44:03Z) - Benchmarking and Rethinking Knowledge Editing for Large Language Models [34.80161437154527]
Knowledge editing aims to update embedded knowledge within Large Language Models (LLMs). Existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
arXiv Detail & Related papers (2025-05-24T13:32:03Z) - Model Utility Law: Evaluating LLMs beyond Performance through Mechanism Interpretable Metric [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications. One core challenge of evaluation in the large language model (LLM) era is the generalization issue. We propose the Model Utilization Index (MUI), a mechanism-interpretability-enhanced metric that complements traditional performance scores.
arXiv Detail & Related papers (2025-04-10T04:09:47Z) - Training an LLM-as-a-Judge Model: Pipeline, Insights, and Practical Lessons [9.954960702259918]
This paper introduces Themis, a fine-tuned large language model (LLM) judge that delivers context-aware evaluations. We provide a comprehensive overview of the development pipeline for Themis, highlighting its scenario-dependent evaluation prompts. We introduce two human-labeled benchmarks for meta-evaluation, demonstrating that Themis can achieve high alignment with human preferences in an economical manner.
arXiv Detail & Related papers (2025-02-05T08:35:55Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation [15.895295957106772]
We propose an ID-induced prompt synthesis framework for evaluating Large Language Models (LLMs).
Our data synthesis framework prioritizes both breadth and specificity. It can generate prompts that comprehensively evaluate the capabilities of LLMs.
We will release a dataset of over 3,000 carefully crafted prompts to facilitate evaluation research of LLMs.
arXiv Detail & Related papers (2024-09-27T16:29:12Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experiences.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)