Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models
- URL: http://arxiv.org/abs/2508.08262v1
- Date: Tue, 22 Jul 2025 17:54:45 GMT
- Title: Argument Quality Annotation and Gender Bias Detection in Financial Communication through Large Language Models
- Authors: Alaa Alhamzeh, Mays Al Rebdawi
- Abstract summary: We evaluate the capabilities of three state-of-the-art LLMs in annotating financial arguments. We introduce an adversarial attack that injects gender bias in order to analyse the models' responses. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than their human counterparts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Financial arguments play a critical role in shaping investment decisions and public trust in financial institutions. Nevertheless, assessing their quality remains poorly studied in the literature. In this paper, we examine the capabilities of three state-of-the-art LLMs, GPT-4o, Llama 3.1, and Gemma 2, in annotating argument quality within financial communications, using the FinArgQuality dataset. Our contributions are twofold. First, we evaluate the consistency of LLM-generated annotations across multiple runs and benchmark them against human annotations. Second, we introduce an adversarial attack designed to inject gender bias, allowing us to analyse the models' responses and assess their fairness and robustness. Both experiments are conducted across three temperature settings to assess their influence on annotation stability and alignment with human labels. Our findings reveal that LLM-based annotations achieve higher inter-annotator agreement than their human counterparts, though the models still exhibit varying degrees of gender bias. We provide a multifaceted analysis of these outcomes and offer practical recommendations to guide future research toward more reliable, cost-effective, and bias-aware annotation methodologies.
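As a rough illustration of the two experiments, the sketch below pairs a minimal gendered-term swap (a stand-in for the paper's adversarial attack) with a simple inter-run agreement score. The label set, swap table, raw percent-agreement statistic, and all function names are assumptions for exposition, not the authors' released implementation; the paper's actual attack and agreement metric may differ.

```python
# Illustrative sketch only: a naive gendered-term swap perturbation and a
# simple agreement score over multiple annotation runs. Everything here
# (labels, swap list, helper names) is assumed, not taken from the paper.
import re
from itertools import combinations
from statistics import mean

# Minimal bidirectional swap table; real attacks would handle pronoun
# case ambiguity ("her" -> "his"/"him") and named entities more carefully.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "him": "her", "chairman": "chairwoman", "chairwoman": "chairman"}

def gender_swap(text: str) -> str:
    """Swap gendered tokens so the same argument can be re-annotated."""
    pattern = r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b"
    def repl(m: re.Match) -> str:
        word = m.group(0)
        out = SWAPS[word.lower()]
        return out.capitalize() if word[0].isupper() else out
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

def mean_pairwise_agreement(runs: list[list[str]]) -> float:
    """Raw percent agreement averaged over run pairs; a chance-corrected
    statistic (Cohen's kappa, Krippendorff's alpha) is what one would
    typically report for inter-annotator agreement."""
    return mean(
        sum(a == b for a, b in zip(r1, r2)) / len(r1)
        for r1, r2 in combinations(runs, 2)
    )

if __name__ == "__main__":
    arg = "He argued that his fund outperformed the index."
    print(gender_swap(arg))  # -> "She argued that her fund outperformed ..."
    # Quality labels from three annotation runs (e.g., one per temperature):
    runs = [["high", "low", "high"],
            ["high", "low", "medium"],
            ["high", "low", "high"]]
    print(round(mean_pairwise_agreement(runs), 3))  # -> 0.778
```

Comparing the agreement score on original versus swapped arguments is one plausible way to surface the kind of annotation instability under gender perturbation that the paper investigates.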
Related papers
- Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research [39.146761527401424]
We evaluate large language models (LLMs) on the task of identifying the top three human values expressed in long-form interviews. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts.
arXiv Detail & Related papers (2026-03-05T07:38:37Z) - Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection [64.75447949495307]
Large language models (LLMs) have been widely applied across various domains of finance. Behavioral biases can lead to instability and uncertainty in decision-making. MFMDScen is a benchmark for evaluating behavioral biases in multilingual financial misinformation detection (MFMD) across diverse economic scenarios.
arXiv Detail & Related papers (2026-01-08T22:00:32Z) - FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering [57.43420753842626]
FinLFQA is a benchmark designed to evaluate the ability of Large Language Models to generate long-form answers to complex financial questions. We provide an automatic evaluation framework covering both answer quality and attribution quality.
arXiv Detail & Related papers (2025-10-07T20:06:15Z) - Your AI, Not Your View: The Bias of LLMs in Investment Analysis [55.328782443604986]
Large Language Models (LLMs) face frequent knowledge conflicts due to discrepancies between pre-trained parametric knowledge and real-time market data. This paper offers the first quantitative analysis of confirmation bias in LLM-based investment analysis. We observe a consistent preference for large-cap stocks and contrarian strategies across most models.
arXiv Detail & Related papers (2025-07-28T16:09:38Z) - Evaluating Large Language Models (LLMs) in Financial NLP: A Comparative Study on Financial Report Analysis [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide variety of Financial Natural Language Processing (FinNLP) tasks. This study conducts a thorough comparative evaluation of five leading LLMs (GPT, Claude, Perplexity, Gemini, and DeepSeek) using 10-K filings from the 'Magnificent Seven' technology companies.
arXiv Detail & Related papers (2025-07-24T20:10:27Z) - Advanced Financial Reasoning at Scale: A Comprehensive Evaluation of Large Language Models on CFA Level III [0.0]
This paper presents a benchmark evaluating 23 state-of-the-art Large Language Models (LLMs) on the Chartered Financial Analyst (CFA) Level III exam. We assess both multiple-choice questions (MCQs) and essay-style responses using multiple prompting strategies, including Chain-of-Thought and Self-Discover. Our evaluation reveals that leading models demonstrate strong capabilities, with composite scores such as 79.1% (o4-mini) and 77.3% (Gemini 2.5 Flash) on CFA Level III.
arXiv Detail & Related papers (2025-06-29T19:54:57Z) - Assessing Consistency and Reproducibility in the Outputs of Large Language Models: Evidence Across Diverse Finance and Accounting Tasks [0.0]
This study provides the first comprehensive assessment of the consistency and accuracy of Large Language Model (LLM) outputs in finance and accounting research. Using three OpenAI models, we generate over 3.4 million outputs from diverse financial source texts and data. LLMs significantly outperform expert human annotators in consistency, even where human experts disagree.
arXiv Detail & Related papers (2025-03-21T09:43:37Z) - FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting [58.70072722290475]
Financial time series (FinTS) record the behavior of human-brain-augmented decision-making. FinTSB is a comprehensive and practical benchmark for financial time series forecasting.
arXiv Detail & Related papers (2025-02-26T05:19:16Z) - Evaluating and Advancing Multimodal Large Language Models in Perception Ability Lens [30.083110119139793]
We introduce AbilityLens, a unified benchmark designed to evaluate MLLMs across six key perception abilities. We identify the strengths and weaknesses of current mainstream MLLMs, highlighting stability patterns and revealing a notable performance gap between state-of-the-art open-source and closed-source models.
arXiv Detail & Related papers (2024-11-22T04:41:20Z) - Evaluating Large Language Models on Financial Report Summarization: An Empirical Study [9.28042182186057]
We conduct a comparative study on three state-of-the-art Large Language Models (LLMs).
Our primary motivation is to explore how these models can be harnessed within finance, a field demanding precision, contextual relevance, and robustness against erroneous or misleading information.
We introduce an innovative evaluation framework that integrates both quantitative metrics (e.g., precision, recall) and qualitative analyses (e.g., contextual fit, consistency) to provide a holistic view of each model's output quality.
arXiv Detail & Related papers (2024-11-11T10:36:04Z) - Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective [66.34066553400108]
We conduct a rigorous evaluation of large language models' implicit bias towards certain demographics. Inspired by psychometric principles, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Our methods can elicit LLMs' inner bias more effectively than competitive baselines.
arXiv Detail & Related papers (2024-06-20T06:42:08Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and to model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z) - Investigating Fairness Disparities in Peer Review: A Language Model
Enhanced Approach [77.61131357420201]
We conduct a thorough and rigorous study on fairness disparities in peer review with the help of large language models (LMs).
We collect, assemble, and maintain a comprehensive relational database for the International Conference on Learning Representations (ICLR) conference from 2017 to date.
We postulate and study fairness disparities on multiple protective attributes of interest, including author gender, geography, and author and institutional prestige.
arXiv Detail & Related papers (2022-11-07T16:19:42Z)