Large Language Models for Semantic Monitoring of Corporate Disclosures:
A Case Study on Korea's Top 50 KOSPI Companies
- URL: http://arxiv.org/abs/2309.00208v1
- Date: Fri, 1 Sep 2023 01:51:28 GMT
- Title: Large Language Models for Semantic Monitoring of Corporate Disclosures:
A Case Study on Korea's Top 50 KOSPI Companies
- Authors: Junwon Sung, Woojin Heo, Yunkyung Byun, Youngsam Kim
- Abstract summary: State-of-the-art language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented opportunities for automating complex tasks.
This research paper delves into the capabilities of these models for semantically analyzing corporate disclosures in the Korean context.
- Score: 0.08192907805418582
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the rapidly advancing domain of artificial intelligence, state-of-the-art
language models such as OpenAI's GPT-3.5-turbo and GPT-4 offer unprecedented
opportunities for automating complex tasks. This research paper delves into the
capabilities of these models for semantically analyzing corporate disclosures
in the Korean context, specifically for timely disclosure. The study focuses on
the top 50 publicly traded companies listed on the Korean KOSPI, based on
market capitalization, and scrutinizes their monthly disclosure summaries over
a period of 17 months. Each summary was assigned a sentiment rating on a scale
ranging from 1 (very negative) to 5 (very positive). To gauge the effectiveness
of the language models, their sentiment ratings were compared with those
generated by human experts. Our findings reveal a notable performance disparity
between GPT-3.5-turbo and GPT-4, with the latter demonstrating significant
accuracy in human evaluation tests. The Spearman correlation coefficient was
registered at 0.61, while the simple concordance rate was recorded at 0.82.
This research contributes valuable insights into the evaluative characteristics
of GPT models, thereby laying the groundwork for future innovations in the
field of automated semantic monitoring.
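For illustration, the two agreement measures reported above can be computed from parallel lists of model-assigned and human-assigned ratings. The sketch below is a minimal, hypothetical example: the rating values are invented, and the definition of the "simple concordance rate" as an exact-match rate is an assumption, not taken from the paper's code.
```python
# Minimal sketch: comparing model sentiment ratings (1-5) against human
# expert ratings using Spearman correlation and a simple concordance rate.
# The example ratings below are made up for illustration only.
from scipy.stats import spearmanr

def simple_concordance(model_ratings, human_ratings):
    """Fraction of summaries where model and human assign the same rating.
    (Assumed definition; the paper may define concordance differently,
    e.g. allowing off-by-one agreement.)"""
    assert len(model_ratings) == len(human_ratings)
    matches = sum(m == h for m, h in zip(model_ratings, human_ratings))
    return matches / len(model_ratings)

# Hypothetical ratings for a handful of monthly disclosure summaries.
gpt4_ratings  = [4, 3, 5, 2, 3, 4, 1, 3]
human_ratings = [4, 3, 4, 2, 3, 5, 1, 3]

rho, p_value = spearmanr(gpt4_ratings, human_ratings)
print(f"Spearman correlation: {rho:.2f} (p={p_value:.3f})")
print(f"Simple concordance rate: {simple_concordance(gpt4_ratings, human_ratings):.2f}")
```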
Related papers
- Evaluating Named Entity Recognition Models for Russian Cultural News Texts: From BERT to LLM [0.0]
The study utilizes the unique SPbLitGuide dataset, a collection of event announcements from Saint Petersburg spanning 1999 to 2019. A comparative evaluation of diverse NER models is presented, encompassing established transformer-based architectures. The research contributes to a deeper understanding of current NER model capabilities and limitations when applied to morphologically rich languages like Russian.
arXiv Detail & Related papers (2025-06-03T08:11:16Z) - WorldPM: Scaling Human Preference Modeling [130.23230492612214]
We propose World Preference Modeling (WorldPM) to emphasize this scaling potential. We collect preference data from public forums covering diverse user communities. We conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
arXiv Detail & Related papers (2025-05-15T17:38:37Z) - Unraveling the Capabilities of Language Models in News Summarization [0.0]
This work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones for the news summarization task.
The study concentrates on zero-shot and few-shot learning settings and applies a robust evaluation methodology.
We highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities.
arXiv Detail & Related papers (2025-01-30T04:20:16Z) - Optimizing Performance: How Compact Models Match or Exceed GPT's Classification Capabilities through Fine-Tuning [0.0]
Non-generative, small-sized models can outperform GPT-3.5 and GPT-4 models in zero-shot learning settings.
Fine-tuned compact models show results comparable to GPT-3.5 when the latter is itself fine-tuned on the task of determining market sentiment.
arXiv Detail & Related papers (2024-08-22T09:10:43Z) - Evaluation Metrics in the Era of GPT-4: Reliably Evaluating Large
Language Models on Sequence to Sequence Tasks [9.801767683867125]
We provide a preliminary and hybrid evaluation on three NLP benchmarks using both automatic and human evaluation.
We find that ChatGPT consistently outperforms many other popular models according to human reviewers on the majority of metrics.
We also find that human reviewers rate the gold reference as much worse than the best models' outputs, indicating the poor quality of many popular benchmarks.
arXiv Detail & Related papers (2023-10-20T20:17:09Z) - Large Language Models on Wikipedia-Style Survey Generation: an Evaluation in NLP Concepts [21.150221839202878]
Large Language Models (LLMs) have achieved significant success across various general tasks.
In this work, we examine the proficiency of LLMs in generating succinct survey articles specific to the niche field of NLP in computer science.
We compare both human and GPT-based evaluation scores and provide in-depth analysis.
arXiv Detail & Related papers (2023-08-21T01:32:45Z) - Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings [63.35165397320137]
This study investigates the consistency of feedback ratings generated by OpenAI's GPT-4.
The model rated responses to tasks within the Higher Education subject domain of macroeconomics in terms of their content and style.
arXiv Detail & Related papers (2023-08-03T12:47:17Z) - Cross-lingual Cross-temporal Summarization: Dataset, Models, Evaluation [20.675242617417677]
Cross-lingual cross-temporal summarization (CLCTS) is a largely unexplored area that has the potential to improve cross-cultural accessibility and understanding.
This paper comprehensively addresses the CLCTS task, including dataset creation, modeling, and evaluation.
arXiv Detail & Related papers (2023-06-22T14:31:18Z) - FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long
Form Text Generation [176.56131810249602]
Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial.
We introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source.
arXiv Detail & Related papers (2023-05-23T17:06:00Z) - Information Redundancy and Biases in Public Document Information
Extraction Benchmarks [0.0]
Despite the good performance of KIE models when fine-tuned on public benchmarks, they still struggle to generalize on complex real-life use-cases lacking sufficient document annotations.
Our research highlighted that KIE standard benchmarks such as SROIE and FUNSD contain significant similarity between training and testing documents and can be adjusted to better evaluate the generalization of models.
arXiv Detail & Related papers (2023-04-28T15:48:26Z) - How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language
Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z) - News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z) - SummEval: Re-evaluating Summarization Evaluation [169.622515287256]
We re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion.
We benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics.
We assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset.
arXiv Detail & Related papers (2020-07-24T16:25:19Z) - Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring
Systems [64.4896118325552]
We evaluate the current state-of-the-art AES models using a model adversarial evaluation scheme and associated metrics.
We find that AES models are highly overstable. Even heavy modifications (as much as 25%) with content unrelated to the topic of the questions do not decrease the score produced by the models.
arXiv Detail & Related papers (2020-07-14T03:49:43Z)