Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge
- URL: http://arxiv.org/abs/2503.01672v1
- Date: Mon, 03 Mar 2025 15:46:01 GMT
- Title: Automated Annotation of Evolving Corpora for Augmenting Longitudinal Network Data: A Framework Integrating Large Language Models and Expert Knowledge
- Authors: Xiao Liu, Zirui Wu, Jiayi Li, Zhicheng Shao, Xun Pang, Yansong Feng
- Abstract summary: This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.
- Score: 27.879485905967577
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Longitudinal network data are essential for analyzing political, economic, and social systems and processes. In political science, these datasets are often generated through human annotation or supervised machine learning applied to evolving corpora. However, as semantic contexts shift over time, inferring dynamic interaction types on emerging issues among a diverse set of entities poses significant challenges, particularly in maintaining timely and consistent annotations. This paper presents the Expert-Augmented LLM Annotation (EALA) approach, which leverages Large Language Models (LLMs) in combination with historically annotated data and expert-constructed codebooks to extrapolate and extend datasets into future periods. We evaluate the performance and reliability of EALA using a dataset of climate negotiations. Our findings demonstrate that EALA effectively predicts nuanced interactions between negotiation parties and captures the evolution of topics over time. At the same time, we identify several limitations inherent to LLM-based annotation, highlighting areas for further improvement. Given the wide availability of codebooks and annotated datasets, EALA holds substantial promise for advancing research in political science and beyond.
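To make the EALA workflow concrete, below is a minimal sketch of how an expert codebook and historically annotated examples might be assembled into a few-shot annotation prompt for an LLM. The codebook excerpt, interaction labels, and the build_prompt/call_llm helpers are illustrative assumptions for exposition, not the interface published in the paper.

```python
# Hypothetical sketch of an EALA-style annotation step: combine an expert
# codebook with historically annotated examples into a few-shot prompt,
# then ask an LLM to label a new statement. The codebook excerpt, labels,
# and call_llm() stub are illustrative assumptions, not the authors' code.

from dataclasses import dataclass


@dataclass
class AnnotatedExample:
    text: str          # statement from a past negotiation session
    actor: str         # negotiating party making the statement
    target: str        # party the statement is directed at
    interaction: str   # codebook category, e.g. "support" or "opposition"


CODEBOOK_EXCERPT = """
Interaction types (assign exactly one per statement):
- support:    the actor endorses the target's position or proposal
- opposition: the actor objects to or criticizes the target's position
- proposal:   the actor introduces a new position or draft text
"""


def build_prompt(codebook: str, examples: list[AnnotatedExample], new_text: str) -> str:
    """Assemble codebook rules and historical annotations into one few-shot prompt."""
    shots = "\n".join(
        f'Statement: "{ex.text}"\n'
        f"Annotation: actor={ex.actor}; target={ex.target}; interaction={ex.interaction}"
        for ex in examples
    )
    return (
        "You annotate climate-negotiation statements using the codebook below.\n"
        f"{codebook}\n"
        f"Annotated examples from earlier sessions:\n{shots}\n\n"
        f'Statement: "{new_text}"\n'
        "Annotation:"
    )


def call_llm(prompt: str) -> str:
    """Placeholder for whichever LLM API is used; returns the raw completion."""
    raise NotImplementedError("plug in an LLM client here")


if __name__ == "__main__":
    history = [
        AnnotatedExample(
            text="We welcome the EU's draft text on adaptation finance.",
            actor="AOSIS", target="EU", interaction="support",
        ),
    ]
    prompt = build_prompt(CODEBOOK_EXCERPT, history,
                          "We cannot accept the proposed reporting timeline.")
    print(prompt)  # inspect the prompt; pass it to call_llm() in practice
```

In this reading of the approach, the codebook supplies the label definitions while the historical annotations anchor the model's outputs to past coding decisions, which is what allows the dataset to be extended consistently into new periods.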
Related papers
- A Review on Large Language Models for Visual Analytics [0.2209921757303168]
The paper outlines the theoretical underpinnings of visual analytics and the transformative potential of Large Language Models (LLMs).
The review further investigates how the synergy between LLMs and visual analytics enhances data interpretation, visualization techniques, and interactive exploration capabilities.
The paper discusses their functionalities, strengths, and limitations in supporting data exploration, visualization enhancement, automated reporting, and insight extraction.
arXiv Detail & Related papers (2025-03-19T13:02:01Z)
- Empowering Time Series Analysis with Synthetic Data: A Survey and Outlook in the Era of Foundation Models [104.17057231661371]
Time series analysis is crucial for understanding the dynamics of complex systems.
Recent advances in foundation models have led to task-agnostic Time Series Foundation Models (TSFMs) and Large Language Model-based Time Series Models (TSLLMs).
Their success depends on large, diverse, and high-quality datasets, which are challenging to build due to regulatory, diversity, quality, and quantity constraints.
This survey provides a comprehensive review of synthetic data for TSFMs and TSLLMs, analyzing data generation strategies, their role in model pretraining, fine-tuning, and evaluation, and identifying future research directions.
arXiv Detail & Related papers (2025-03-14T13:53:46Z)
- The dynamics of meaning through time: Assessment of Large Language Models [2.5864824580604515]
This study aims to evaluate the capabilities of various large language models (LLMs) in capturing temporal dynamics of meaning.
Our comparative analysis includes prominent models like ChatGPT, GPT-4, Claude, Bard, Gemini, and Llama.
Findings reveal marked differences in each model's handling of historical context and semantic shifts, highlighting both strengths and limitations in temporal semantic understanding.
arXiv Detail & Related papers (2025-01-09T19:56:44Z)
- Political Events using RAG with LLMs [1.6385815610837167]
This work leverages Large Language Models (LLMs) driven by Generative Artificial Intelligence (GenAI) within a Retrieval-Augmented Generation (RAG) framework.
It presents a Political EE (event extraction) system specifically tailored to extract political event information from news articles.
arXiv Detail & Related papers (2025-01-06T08:16:24Z)
- Knowledge Graphs, Large Language Models, and Hallucinations: An NLP Perspective [5.769786334333616]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP)-based applications, including automated text generation, question answering, and others.
They face a significant challenge: hallucinations, where models produce plausible-sounding but factually incorrect responses.
This paper discusses these open challenges covering state-of-the-art datasets and benchmarks as well as methods for knowledge integration and evaluating hallucinations.
arXiv Detail & Related papers (2024-11-21T16:09:05Z)
- Formality is Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge [55.65162959527848]
Large language models have shown excellent performance on many knowledge-intensive tasks.
However, pretraining data tends to contain misleading and even conflicting information.
This study systematically analyzes LLMs' learning preferences for data with conflicting knowledge.
arXiv Detail & Related papers (2024-10-07T06:49:41Z)
- RelationVLM: Making Large Vision-Language Models Understand Visual Relations [66.70252936043688]
We present RelationVLM, a large vision-language model capable of comprehending various levels and types of relations, whether across multiple images or within a video.
Specifically, we devise a multi-stage relation-aware training scheme and a series of corresponding data configuration strategies to bestow RelationVLM with the capabilities of understanding semantic relations.
arXiv Detail & Related papers (2024-03-19T15:01:19Z)
- Large Language Models for Data Annotation and Synthesis: A Survey [49.8318827245266]
This survey focuses on the utility of Large Language Models for data annotation and synthesis.
It includes an in-depth taxonomy of data types that LLMs can annotate, a review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation and synthesis.
arXiv Detail & Related papers (2024-02-21T00:44:04Z)
- Data Distribution Bottlenecks in Grounding Language Models to Knowledge Bases [9.610231090476857]
Language models (LMs) have already demonstrated remarkable abilities in understanding and generating both natural and formal language.
This paper is an experimental investigation aimed at uncovering the challenges that LMs encounter when tasked with knowledge base question answering (KBQA).
Our experiments reveal that even when employed with our proposed data augmentation techniques, advanced small and large language models exhibit poor performance in various dimensions.
arXiv Detail & Related papers (2023-09-15T12:06:45Z)
- ConvFinQA: Exploring the Chain of Numerical Reasoning in Conversational Finance Question Answering [70.6359636116848]
We propose a new large-scale dataset, ConvFinQA, to study the chain of numerical reasoning in conversational question answering.
Our dataset poses a great challenge in modeling long-range, complex numerical reasoning paths in real-world conversations.
arXiv Detail & Related papers (2022-10-07T23:48:50Z)
- Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods.
We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments.
The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
- Detecting ESG topics using domain-specific language models and data augmentation approaches [3.3332986505989446]
Natural language processing tasks in the financial domain remain challenging due to the paucity of appropriately labelled data.
Here, we investigate two approaches that may help to mitigate these issues.
Firstly, we experiment with further language model pre-training using large amounts of in-domain data from business and financial news.
We then apply augmentation approaches to increase the size of our dataset for model fine-tuning.
arXiv Detail & Related papers (2020-10-16T11:20:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.