Multi-document Summarization: A Comparative Evaluation
- URL: http://arxiv.org/abs/2309.04951v2
- Date: Tue, 12 Sep 2023 04:19:49 GMT
- Title: Multi-document Summarization: A Comparative Evaluation
- Authors: Kushan Hewapathirana (1 and 2), Nisansa de Silva (1), C.D. Athuraliya
(2) ((1) Department of Computer Science & Engineering, University of
Moratuwa, Sri Lanka, (2) ConscientAI, Sri Lanka)
- Abstract summary: This paper is aimed at evaluating state-of-the-art models for Multi-document Summarization (MDS) on different types of datasets in various domains.
We analyzed the performance of PRIMERA and PEG models on Big-Survey and MS$2$ datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is aimed at evaluating state-of-the-art models for Multi-document
Summarization (MDS) on different types of datasets in various domains and
investigating the limitations of existing models to determine future research
directions. To address this gap, we conducted an extensive literature review to
identify state-of-the-art models and datasets. We analyzed the performance of
PRIMERA and PEGASUS models on BigSurvey-MDS and MS$^2$ datasets, which posed
unique challenges due to their varied domains. Our findings show that the
General-Purpose Pre-trained Model LED outperforms PRIMERA and PEGASUS on the
MS$^2$ dataset. We used the ROUGE score as a performance metric to evaluate the
identified models on different datasets. Our study provides valuable insights
into the models' strengths and weaknesses, as well as their applicability in
different domains. This work serves as a reference for future MDS research and
contributes to the development of accurate and robust models which can be
utilized on demanding datasets with academically and/or scientifically complex
data as well as generalized, relatively simple datasets.
Related papers
- Exploring the Efficacy of Meta-Learning: Unveiling Superior Data Diversity Utilization of MAML Over Pre-training [1.3980986259786223]
We show that dataset diversity can impact the performance of vision models.
Our study shows positive correlations between test set accuracy and data diversity.
These findings support our hypothesis and demonstrate a promising way for a deeper exploration of how formal data diversity influences model performance.
arXiv Detail & Related papers (2025-01-15T00:56:59Z) - Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark [54.93461228053298]
We introduce our benchmark, textbfScenario-Wise Rec, which comprises 6 public datasets and 12 benchmark models, along with a training and evaluation pipeline.
We aim for this benchmark to offer researchers valuable insights from prior work, enabling the development of novel models.
arXiv Detail & Related papers (2024-12-23T08:15:34Z) - Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [64.61315565501681]
Multi-modal Retrieval Augmented Multi-modal Generation (M$2$RAG) is a novel task that enables foundation models to process multi-modal web content.
Despite its potential impact, M$2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z) - LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content [62.816876067499415]
We propose LiveXiv: a scalable evolving live benchmark based on scientific ArXiv papers.
LiveXiv accesses domain-specific manuscripts at any given timestamp and proposes to automatically generate visual question-answer pairs.
We benchmark multiple open and proprietary Large Multi-modal Models (LMMs) on the first version of our benchmark, showing its challenging nature and exposing the models true abilities.
arXiv Detail & Related papers (2024-10-14T17:51:23Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of the interplay between model architecture and dataset characteristics on task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - State-of-the-art Models for Object Detection in Various Fields of
Application [0.0]
COCO minival, COCO test, Pascal VOC 2007, ADE20K, and ImageNet are reviewed.
The datasets are handpicked after closely comparing them with the rest in terms of diversity, quality of data, minimal bias, labeling quality etc.
It lists the top models and their optimal use cases for each of the respective datasets.
arXiv Detail & Related papers (2022-11-01T20:25:32Z) - Deep Learning Models for Knowledge Tracing: Review and Empirical
Evaluation [2.423547527175807]
We review and evaluate a body of deep learning knowledge tracing (DLKT) models with openly available and widely-used data sets.
The evaluated DLKT models have been reimplemented for assessing and replicability of previously reported results.
arXiv Detail & Related papers (2021-12-30T14:19:27Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information
Retrieval Models [41.45240621979654]
We introduce BEIR, a heterogeneous benchmark for information retrieval.
We study the effectiveness of nine state-of-the-art retrieval models in a zero-shot evaluation setup.
Dense-retrieval models are computationally more efficient but often underperform other approaches.
arXiv Detail & Related papers (2021-04-17T23:29:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.