LFOSum: Summarizing Long-form Opinions with Large Language Models
- URL: http://arxiv.org/abs/2410.13037v1
- Date: Wed, 16 Oct 2024 20:52:39 GMT
- Title: LFOSum: Summarizing Long-form Opinions with Large Language Models
- Authors: Mir Tafseer Nayeem, Davood Rafiei
- Abstract summary: This paper introduces (1) a new dataset of long-form user reviews, each entity comprising over a thousand reviews, (2) two training-free LLM-based summarization approaches that scale to long inputs, and (3) automatic evaluation metrics.
Our dataset of user reviews is paired with in-depth and unbiased critical summaries by domain experts, serving as a reference for evaluation.
Our evaluation reveals that LLMs still face challenges in balancing sentiment and format adherence in long-form summaries, though open-source models can narrow the gap when relevant information is retrieved in a focused manner.
- Score: 7.839083566878183
- Abstract: Online reviews play a pivotal role in influencing consumer decisions across various domains, from purchasing products to selecting hotels or restaurants. However, the sheer volume of reviews -- often containing repetitive or irrelevant content -- leads to information overload, making it challenging for users to extract meaningful insights. Traditional opinion summarization models face challenges in handling long inputs and large volumes of reviews, while newer Large Language Model (LLM) approaches often fail to generate accurate and faithful summaries. To address those challenges, this paper introduces (1) a new dataset of long-form user reviews, each entity comprising over a thousand reviews, (2) two training-free LLM-based summarization approaches that scale to long inputs, and (3) automatic evaluation metrics. Our dataset of user reviews is paired with in-depth and unbiased critical summaries by domain experts, serving as a reference for evaluation. Additionally, our novel reference-free evaluation metrics provide a more granular, context-sensitive assessment of summary faithfulness. We benchmark several open-source and closed-source LLMs using our methods. Our evaluation reveals that LLMs still face challenges in balancing sentiment and format adherence in long-form summaries, though open-source models can narrow the gap when relevant information is retrieved in a focused manner.
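The abstract above stresses that open-source models narrow the gap "when relevant information is retrieved in a focused manner". As a rough, hypothetical illustration of that retrieve-then-summarize idea (not the paper's actual method), the sketch below ranks review sentences against an aspect query before building a summarization prompt; the embedding model, prompt wording, and function names are assumptions, and the LLM call itself is left as a stub.

```python
# Hypothetical retrieve-then-summarize sketch -- NOT the paper's implementation.
# Assumptions: sentence-transformers for the embedding step; the aspect query,
# function names, and prompt wording are illustrative; the LLM call is a stub.
import numpy as np
from sentence_transformers import SentenceTransformer

def retrieve_relevant(review_sentences, aspect_query, top_k=20,
                      model_name="all-MiniLM-L6-v2"):
    """Rank review sentences by similarity to an aspect query and keep the top_k."""
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode(review_sentences, normalize_embeddings=True)
    query_vec = model.encode([aspect_query], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
    top_idx = np.argsort(-scores)[:top_k]
    return [review_sentences[i] for i in top_idx]

def build_prompt(snippets, entity="this hotel"):
    """Pack only the retrieved snippets into the summarization prompt."""
    joined = "\n".join(f"- {s}" for s in snippets)
    return (f"Summarize the opinions about {entity} using only these review excerpts:\n"
            f"{joined}\n\nSummary:")

# llm_generate(build_prompt(...)) would call whichever open- or closed-source
# LLM is being benchmarked; that part is intentionally left out here.
```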
Related papers
- Real World Conversational Entity Linking Requires More Than Zeroshots [50.5691094768954]
We design targeted evaluation scenarios to measure the efficacy of EL models under resource constraints.
We assess EL models' ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset.
Results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training.
arXiv Detail & Related papers (2024-09-02T10:37:53Z) - UserSumBench: A Benchmark Framework for Evaluating User Summarization Approaches [25.133460380551327]
Large language models (LLMs) have shown remarkable capabilities in generating user summaries from a long list of raw user activity data.
These summaries capture essential user information such as preferences and interests, and are invaluable for personalization applications.
However, the development of new summarization techniques is hindered by the lack of ground-truth labels, the inherent subjectivity of user summaries, and the cost of human evaluation.
arXiv Detail & Related papers (2024-08-30T01:56:57Z) - Multi-Aspect Reviewed-Item Retrieval via LLM Query Decomposition and Aspect Fusion [15.630734768499826]
We propose several novel aspect fusion (AF) strategies to address natural language product queries.
For imbalanced review corpora, AF improves over late fusion (LF), raising MAP@10 from 0.36 to 0.52, while matching its performance on balanced review corpora.
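For reference, MAP@10 here denotes mean average precision over the top 10 retrieved items per query; the small helper below (written for illustration, not taken from the paper) shows how such figures are computed.

```python
# Self-contained MAP@10 helper (written for illustration, not taken from the paper).
def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average precision over the top-k ranked items for a single query."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            score += hits / rank  # precision at this rank
    return score / min(len(relevant_ids), k) if relevant_ids else 0.0

def map_at_k(all_rankings, all_relevant, k=10):
    """Mean of per-query average precision, i.e. MAP@k."""
    aps = [average_precision_at_k(r, rel, k) for r, rel in zip(all_rankings, all_relevant)]
    return sum(aps) / len(aps) if aps else 0.0

# Example: relevant items {"a", "c"}, ranking ["a", "b", "d", "c", ...]
# AP@10 = (1/1 + 2/4) / 2 = 0.75
```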
arXiv Detail & Related papers (2024-08-01T19:04:10Z) - Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z) - FENICE: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction [85.26780391682894]
We propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE).
FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary.
Our metric sets a new state of the art on AGGREFACT, the de facto benchmark for factuality evaluation.
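Roughly, the recipe is: extract atomic claims from the summary, then check each claim against the source with an NLI model. The sketch below is a hedged illustration of that recipe only; it substitutes a generic MNLI checkpoint and naive sentence splitting for FENICE's dedicated claim-extraction and alignment components.

```python
# Hedged sketch of a claim-extraction + NLI faithfulness check -- NOT FENICE itself.
# Assumptions: a generic MNLI checkpoint; naive sentence splitting stands in for
# FENICE's dedicated claim-extraction model; long sources would need chunking,
# since they exceed the NLI model's input limit.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def naive_claims(summary):
    """Very rough stand-in for atomic-claim extraction: split on sentence boundaries."""
    return [s.strip() for s in summary.split(".") if s.strip()]

def faithfulness_score(source, summary):
    """Fraction of summary claims the NLI model judges entailed by the source."""
    claims = naive_claims(summary)
    entailed = 0
    for claim in claims:
        result = nli({"text": source, "text_pair": claim})
        label = (result[0] if isinstance(result, list) else result)["label"]
        if label.upper().startswith("ENTAIL"):
            entailed += 1
    return entailed / len(claims) if claims else 0.0
```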
arXiv Detail & Related papers (2024-03-04T17:57:18Z) - RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models [17.782410287625645]
This paper proposes a benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing.
The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation.
arXiv Detail & Related papers (2024-02-21T01:39:56Z) - L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs).
We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs.
Results show that popular n-gram matching metrics generally do not correlate well with human judgment.
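To make concrete what an n-gram matching metric is in this context, here is a minimal unigram-overlap F1 in the spirit of ROUGE-1, written from scratch rather than taken from L-Eval.

```python
# Minimal unigram-overlap F1 (ROUGE-1 style), written from scratch for illustration.
def rouge1_f1(candidate, reference):
    """Clipped unigram overlap between a candidate and a reference, as F1."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    ref_counts = {}
    for w in ref:
        ref_counts[w] = ref_counts.get(w, 0) + 1
    overlap = 0
    for w in cand:
        if ref_counts.get(w, 0) > 0:
            overlap += 1
            ref_counts[w] -= 1
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```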
arXiv Detail & Related papers (2023-07-20T17:59:41Z) - AaKOS: Aspect-adaptive Knowledge-based Opinion Summarization [5.4138734778206]
The rapid growth of information on the Internet has led to an overwhelming amount of opinions and comments on various activities, products, and services.
This makes it difficult and time-consuming for users to process all the available information when making decisions.
We propose an Aspect-adaptive Knowledge-based Opinion Summarization model for product reviews.
arXiv Detail & Related papers (2023-05-26T03:44:35Z) - Towards Personalized Review Summarization by Modeling Historical Reviews from Customer and Product Separately [59.61932899841944]
Review summarization is a non-trivial task that aims to summarize the main idea of a product review on an e-commerce website.
We propose the Heterogeneous Historical Review aware Review Summarization Model (HHRRS).
We employ a multi-task framework that performs review sentiment classification and summarization jointly.
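For readers unfamiliar with such setups, a minimal sketch of a joint summarization-plus-sentiment objective follows; the loss weighting and tensor shapes are illustrative assumptions, not the HHRRS architecture.

```python
# Hypothetical multi-task objective sketch -- illustrative only, NOT the HHRRS model.
import torch
import torch.nn.functional as F

def joint_loss(summary_logits, summary_targets,
               sentiment_logits, sentiment_targets, alpha=0.5):
    """Weighted sum of a token-level summarization loss and a review-level sentiment loss.

    summary_logits:    (batch, seq_len, vocab) decoder outputs
    summary_targets:   (batch, seq_len)        gold summary token ids
    sentiment_logits:  (batch, num_classes)    classifier-head outputs
    sentiment_targets: (batch,)                gold sentiment labels
    """
    sum_loss = F.cross_entropy(
        summary_logits.reshape(-1, summary_logits.size(-1)),
        summary_targets.reshape(-1),
    )
    cls_loss = F.cross_entropy(sentiment_logits, sentiment_targets)
    return alpha * sum_loss + (1.0 - alpha) * cls_loss

# Example shapes: batch of 4 summaries of length 20 over a 32k vocabulary, 3 sentiment classes.
loss = joint_loss(
    torch.randn(4, 20, 32000), torch.randint(0, 32000, (4, 20)),
    torch.randn(4, 3), torch.randint(0, 3, (4,)),
)
```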
arXiv Detail & Related papers (2023-01-27T12:32:55Z) - Learning Opinion Summarizers by Selecting Informative Reviews [81.47506952645564]
We collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training.
The content of many reviews is not reflected in the human-written summaries; as a result, a summarizer trained on random review subsets hallucinates.
We formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets.
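As a rough stand-in for the review-selection step (which the paper learns jointly with summarization), the hypothetical greedy word-coverage heuristic below illustrates what selecting an informative subset can look like.

```python
# Greedy word-coverage heuristic as a hypothetical stand-in for learned review selection.
def greedy_select_reviews(reviews, budget=8):
    """Pick up to `budget` reviews, each adding the most not-yet-covered words."""
    covered, chosen = set(), []
    for _ in range(min(budget, len(reviews))):
        best, best_gain = None, -1
        for review in reviews:
            if review in chosen:
                continue
            gain = len(set(review.lower().split()) - covered)
            if gain > best_gain:
                best, best_gain = review, gain
        if best is None:
            break
        chosen.append(best)
        covered |= set(best.lower().split())
    return chosen
```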
arXiv Detail & Related papers (2021-09-09T15:01:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.