SOUL: Towards Sentiment and Opinion Understanding of Language
- URL: http://arxiv.org/abs/2310.17924v1
- Date: Fri, 27 Oct 2023 06:48:48 GMT
- Title: SOUL: Towards Sentiment and Opinion Understanding of Language
- Authors: Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, Lidong Bing
- Abstract summary: We propose a new task called Sentiment and Opinion Understanding of Language (SOUL)
SOUL aims to evaluate sentiment understanding through two subtasks: Review (RC) and Justification Generation (JG)
- Score: 96.74878032417054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sentiment analysis is a well-established natural language processing task,
with sentiment polarity classification being one of its most popular and
representative tasks. However, despite the success of pre-trained language
models in this area, they often fall short of capturing the broader
complexities of sentiment analysis. To address this issue, we propose a new
task called Sentiment and Opinion Understanding of Language (SOUL). SOUL aims
to evaluate sentiment understanding through two subtasks: Review Comprehension
(RC) and Justification Generation (JG). RC seeks to validate statements that
focus on subjective information based on a review text, while JG requires
models to provide explanations for their sentiment predictions. To enable
comprehensive evaluation, we annotate a new dataset comprising 15,028
statements from 3,638 reviews. Experimental results indicate that SOUL is a
challenging task for both small and large language models, with a performance
gap of up to 27% when compared to human performance. Furthermore, evaluations
conducted with both human experts and GPT-4 highlight the limitations of the
small language model in generating reasoning-based justifications. These
findings underscore the challenging nature of the SOUL task for existing
models, emphasizing the need for further advancements in sentiment analysis to
address its complexities. The new dataset and code are available at
https://github.com/DAMO-NLP-SG/SOUL.
Related papers
- The Empirical Impact of Data Sanitization on Language Models [1.1359551336076306]
This paper empirically analyzes the effects of data sanitization across several benchmark language-modeling tasks.
Our results suggest that for some tasks such as sentiment analysis or entailment, the impact of redaction is quite low, typically around 1-5%.
For tasks such as comprehension Q&A there is a big drop of >25% in performance observed in redacted queries as compared to the original.
arXiv Detail & Related papers (2024-11-08T21:22:37Z) - Single Ground Truth Is Not Enough: Add Linguistic Variability to Aspect-based Sentiment Analysis Evaluation [41.66053021998106]
Aspect-based sentiment analysis (ABSA) is the challenging task of extracting sentiment along with its corresponding aspects and opinions from human language.
Current evaluation methods for this task often restrict answers to a single ground truth, penalizing semantically equivalent predictions that differ in surface form.
We propose a novel, fully automated pipeline that augments existing test sets with alternative valid responses for aspect and opinion terms.
arXiv Detail & Related papers (2024-10-13T11:48:09Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score.
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Can Large Language Models Understand Context? [17.196362853457412]
This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models.
Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models.
As LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings.
arXiv Detail & Related papers (2024-02-01T18:55:29Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z) - TAPE: Assessing Few-shot Russian Language Understanding [1.9859374437454114]
TAPE (Text Attack and Perturbation Evaluation) is a novel benchmark that includes six more complex NLU tasks for Russian.
The detailed analysis of testing the autoregressive baselines indicates that simple spelling-based perturbations affect the performance the most.
We publicly release TAPE to foster research on robust LMs that can generalize to new tasks when little to no supervision is available.
arXiv Detail & Related papers (2022-10-23T18:28:25Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of
Language Models [86.02610674750345]
Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks.
We apply 14 adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.
All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy.
arXiv Detail & Related papers (2021-11-04T12:59:55Z) - AM2iCo: Evaluating Word Meaning in Context across Low-ResourceLanguages
with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts.
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.