From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
- URL: http://arxiv.org/abs/2408.17026v1
- Date: Fri, 30 Aug 2024 05:50:15 GMT
- Title: From Text to Emotion: Unveiling the Emotion Annotation Capabilities of LLMs
- Authors: Minxue Niu, Mimansa Jaiswal, Emily Mower Provost
- Abstract summary: We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training.
We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance of GPT-4.
- Score: 12.199629860735195
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Training emotion recognition models has relied heavily on human-annotated data, which presents diversity, quality, and cost challenges. In this paper, we explore the potential of Large Language Models (LLMs), specifically GPT-4, in automating or assisting emotion annotation. We compare GPT-4 with supervised models and/or humans in three aspects: agreement with human annotations, alignment with human perception, and impact on model training. We find that common metrics that use aggregated human annotations as ground truth can underestimate the performance of GPT-4, and our human evaluation experiment reveals a consistent preference for GPT-4 annotations over humans across multiple datasets and evaluators. Further, we investigate the impact of using GPT-4 as an annotation filtering process to improve model training. Together, our findings highlight the great potential of LLMs in emotion annotation tasks and underscore the need for refined evaluation methodologies.
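The abstract's point about aggregated ground truth can be made concrete with a small sketch. The snippet below is illustrative only: the emotion labels, the majority-vote aggregation, and the macro-F1 metric are assumptions for demonstration, not the paper's datasets or exact evaluation protocol. It contrasts agreement with a majority-voted label against agreement with individual annotators, which is one way an aggregate-based metric can understate how well an annotator tracks human perception.

```python
# Minimal sketch (not the paper's exact protocol): compare LLM annotations
# against a majority-voted "ground truth" and against individual annotators.
from collections import Counter
from sklearn.metrics import f1_score

# Hypothetical per-item annotations: three human annotators per item, plus one LLM.
human_labels = [
    ["joy", "joy", "neutral"],
    ["anger", "sadness", "anger"],
    ["neutral", "joy", "joy"],
]
llm_labels = ["joy", "sadness", "joy"]

# Aggregate humans by majority vote (a common ground-truth construction).
majority = [Counter(votes).most_common(1)[0][0] for votes in human_labels]

# Agreement with the aggregate can look low even when the LLM matches
# individual annotators' perceptions.
print("F1 vs. majority vote:", f1_score(majority, llm_labels, average="macro"))
for i in range(len(human_labels[0])):
    single = [votes[i] for votes in human_labels]
    print(f"F1 vs. annotator {i}:", f1_score(single, llm_labels, average="macro"))
```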
Related papers
- Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI [0.0]
We use GPT-4 to replicate 27 annotation tasks across 11 password-protected datasets.
For each task, we compare GPT-4 annotations against human-annotated ground-truth labels and against annotations from separate supervised classification models fine-tuned on human-generated labels.
Our findings underscore the importance of a human-centered workflow and careful evaluation standards.
arXiv Detail & Related papers (2024-09-14T15:27:43Z) - Open Source Language Models Can Provide Feedback: Evaluating LLMs' Ability to Help Students Using GPT-4-As-A-Judge [4.981275578987307]
Large language models (LLMs) have shown great potential for the automatic generation of feedback in a wide range of computing contexts.
However, concerns have been voiced around the privacy and ethical implications of sending student work to proprietary models.
This has sparked considerable interest in the use of open source LLMs in education, but the quality of the feedback that such open models can produce remains understudied.
arXiv Detail & Related papers (2024-05-08T17:57:39Z) - Automated Assessment of Encouragement and Warmth in Classrooms Leveraging Multimodal Emotional Features and ChatGPT [7.273857543125784]
Our work explores a multimodal approach to automatically estimating encouragement and warmth in classrooms.
We employed facial and speech emotion recognition with sentiment analysis to extract interpretable features from video, audio, and transcript data.
We demonstrated our approach on the GTI dataset, comprising 367 16-minute video segments from 92 authentic lesson recordings.
arXiv Detail & Related papers (2024-04-01T16:58:09Z) - GPT as Psychologist? Preliminary Evaluations for GPT-4V on Visual Affective Computing [74.68232970965595]
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple sources, such as text, speech, images, and videos.
This paper assesses the application of MLLMs on 5 crucial abilities for affective computing, spanning visual affective tasks and reasoning tasks.
arXiv Detail & Related papers (2024-03-09T13:56:25Z) - Human vs. LMMs: Exploring the Discrepancy in Emoji Interpretation and Usage in Digital Communication [68.40865217231695]
This study examines the behavior of GPT-4V in replicating human-like use of emojis.
The findings reveal a discernible discrepancy between human and GPT-4V behaviors, likely due to the subjective nature of human interpretation.
arXiv Detail & Related papers (2024-01-16T08:56:52Z) - CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z) - Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z) - What's Next in Affective Modeling? Large Language Models [3.0902630634005797]
GPT-4 performs well across multiple emotion tasks.
It can distinguish emotion theories and come up with emotional stories.
We suggest that LLMs could play an important role in affective modeling.
arXiv Detail & Related papers (2023-10-03T16:39:20Z) - Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance of several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with humans on the summarization task, outperforming all previous methods by a large margin.
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)