SciRepEval: A Multi-Format Benchmark for Scientific Document
Representations
- URL: http://arxiv.org/abs/2211.13308v4
- Date: Mon, 13 Nov 2023 18:25:27 GMT
- Title: SciRepEval: A Multi-Format Benchmark for Scientific Document
Representations
- Authors: Amanpreet Singh, Mike D'Arcy, Arman Cohan, Doug Downey, Sergey Feldman
- Abstract summary: We introduce SciRepEval, the first comprehensive benchmark for training and evaluating scientific document representations.
We show how state-of-the-art models like SPECTER and SciNCL struggle to generalize across the task formats.
A new approach that learns multiple embeddings per document, each tailored to a different format, can improve performance.
- Score: 52.01865318382197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learned representations of scientific documents can serve as valuable input
features for downstream tasks without further fine-tuning. However, existing
benchmarks for evaluating these representations fail to capture the diversity
of relevant tasks. In response, we introduce SciRepEval, the first
comprehensive benchmark for training and evaluating scientific document
representations. It includes 24 challenging and realistic tasks, 8 of which are
new, across four formats: classification, regression, ranking and search. We
then use this benchmark to study and improve the generalization ability of
scientific document representation models. We show how state-of-the-art models
like SPECTER and SciNCL struggle to generalize across the task formats, and
that simple multi-task training fails to improve them. However, a new approach
that learns multiple embeddings per document, each tailored to a different
format, can improve performance. We experiment with task-format-specific
control codes and adapters and find they outperform the existing
single-embedding state-of-the-art by over 2 points absolute. We release the
resulting family of multi-format models, called SPECTER2, for the community to
use and build on.
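The control-code idea from the abstract can be illustrated with a minimal toy sketch. This is not the authors' implementation: the encoder here is a deterministic hash-based stand-in for a Transformer, and the control-token names are hypothetical. It shows only the mechanism, that prepending a task-format-specific token to the input lets one shared encoder produce a separate, format-tailored embedding per document.

```python
# Toy sketch of task-format-specific control codes (hypothetical tokens,
# stand-in encoder): one embedding per document per task format.
import hashlib

# Hypothetical control tokens for the four SciRepEval task formats.
FORMAT_CODES = {
    "classification": "[CLF]",
    "regression": "[RGN]",
    "ranking": "[PRX]",
    "search": "[QRY]",
}

def toy_encode(text: str, dim: int = 8) -> tuple:
    """Deterministic stand-in for a Transformer encoder: hash bytes -> vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:dim])

def multi_format_embeddings(title: str, abstract: str) -> dict:
    """Encode the same document once per format; the prepended control
    code changes the input, so each format gets its own embedding."""
    doc = f"{title} {abstract}"
    return {fmt: toy_encode(f"{code} {doc}") for fmt, code in FORMAT_CODES.items()}

embs = multi_format_embeddings("SciRepEval", "A multi-format benchmark ...")
```

In the released SPECTER2 family, the analogous role is played by per-format adapters loaded on top of a shared base model; consult the SPECTER2 model cards for the exact loading API, as the details above are illustrative only.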
Related papers
- DocTTT: Test-Time Training for Handwritten Document Recognition Using Meta-Auxiliary Learning [7.036629164442979]
We introduce the DocTTT framework to address challenges in handwritten document recognition.
Key innovation of our approach is that it uses test-time training to adapt the model to each specific input during testing.
We propose a novel Meta-Auxiliary learning approach that combines meta-learning with a self-supervised Masked Autoencoder (MAE).
arXiv Detail & Related papers (2025-01-22T14:18:47Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models [63.466265039007816]
We present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community.
We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
arXiv Detail & Related papers (2024-06-17T15:13:52Z) - Beyond Document Page Classification: Design, Datasets, and Challenges [32.94494070330065]
This paper highlights the need to bring document classification benchmarking closer to real-world applications.
We identify the lack of public multi-page document classification datasets, formalize different classification tasks arising in application scenarios, and motivate the value of targeting efficient multi-page document representations.
arXiv Detail & Related papers (2023-08-24T16:16:47Z)
- MIReAD: Simple Method for Learning High-quality Representations from Scientific Documents [77.34726150561087]
We propose MIReAD, a simple method that learns high-quality representations of scientific papers.
We train MIReAD on more than 500,000 PubMed and arXiv abstracts across over 2,000 journal classes.
arXiv Detail & Related papers (2023-05-07T03:29:55Z)
- Large-scale learning of generalised representations for speaker recognition [52.978310296712834]
We develop a speaker recognition model to be used in diverse scenarios.
We investigate several new training data configurations combining a few existing datasets.
We find that MFA-Conformer with the least inductive bias generalises the best.
arXiv Detail & Related papers (2022-10-20T03:08:18Z)
- Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity [11.157086694203201]
We present a new scientific document similarity model based on matching fine-grained aspects.
Our model is trained using co-citation contexts that describe related paper aspects as a novel form of textual supervision.
arXiv Detail & Related papers (2021-11-16T11:12:30Z)
- SPECTER: Document-level Representation Learning using Citation-informed Transformers [51.048515757909215]
SPECTER generates document-level embedding of scientific documents based on pretraining a Transformer language model.
We introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction to document classification and recommendation.
arXiv Detail & Related papers (2020-04-15T16:05:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.