Towards Coding Social Science Datasets with Language Models
- URL: http://arxiv.org/abs/2306.02177v1
- Date: Sat, 3 Jun 2023 19:11:34 GMT
- Title: Towards Coding Social Science Datasets with Language Models
- Authors: Christopher Michael Rytting, Taylor Sorensen, Lisa Argyle, Ethan
Busby, Nancy Fulda, Joshua Gubler, David Wingate
- Abstract summary: Researchers often rely on humans to code (label, annotate, etc.) large sets of texts.
Recent advances in a specific kind of artificial intelligence tool - language models (LMs) - provide a solution.
We find that GPT-3 can match the performance of typical human coders and offers benefits over other machine learning methods of coding text.
- Score: 4.280286557747323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers often rely on humans to code (label, annotate, etc.) large sets
of texts. This kind of human coding forms an important part of social science
research, yet the coding process is both resource intensive and highly variable
from application to application. In some cases, efforts to automate this
process have achieved human-level accuracies, but to achieve this, these
attempts frequently rely on thousands of hand-labeled training examples, which
makes them inapplicable to small-scale research studies and costly for large
ones. Recent advances in a specific kind of artificial intelligence tool -
language models (LMs) - provide a solution to this problem. Work in computer
science makes it clear that LMs are able to classify text, without the cost (in
financial terms and human effort) of alternative methods. To demonstrate the
possibilities of LMs in this area of political science, we use GPT-3, one of
the most advanced LMs, as a synthetic coder and compare it to human coders. We
find that GPT-3 can match the performance of typical human coders and offers
benefits over other machine learning methods of coding text. We find this
across a variety of domains using very different coding procedures. This
provides exciting evidence that language models can serve as a critical advance
in the coding of open-ended texts in a variety of applications.
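Although the paper's exact prompts and codebooks are not reproduced here, the basic workflow it describes (prompting a language model to assign codebook categories to open-ended text, then checking its agreement with human coders) can be sketched in a few lines. The snippet below is a minimal illustration, assuming the OpenAI Python client, a hypothetical four-category codebook, and Cohen's kappa as the agreement metric; the model name, prompt wording, and metric choice are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: use an LLM as a "synthetic coder" for open-ended text,
# then compare its labels to a human coder's with Cohen's kappa.
# The codebook, prompt, and model identifier below are illustrative placeholders.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["economy", "healthcare", "immigration", "other"]  # hypothetical codebook

def code_response(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LM to assign exactly one codebook category to a response."""
    prompt = (
        "You are coding open-ended survey responses for a social science study.\n"
        f"Assign exactly one category from this list: {', '.join(CATEGORIES)}.\n"
        f'Response: "{text}"\n'
        "Answer with the category name only."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the coding as deterministic as the API allows
    )
    label = reply.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"  # guard against off-codebook answers

# Toy agreement check against a human coder.
responses = ["Jobs and wages are my top concern.", "Hospital bills are out of control."]
human_labels = ["economy", "healthcare"]
lm_labels = [code_response(r) for r in responses]
print("Cohen's kappa (LM vs. human):", cohen_kappa_score(human_labels, lm_labels))
```

In practice, the prompt would carry the study's full codebook instructions, and agreement would be computed over many responses and multiple human coders rather than the two toy examples above.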
Related papers
- The why, what, and how of AI-based coding in scientific research [0.0]
Generative AI, particularly large language models (LLMs), has the potential to transform coding into intuitive conversations.
We dissect AI-based coding through three key lenses.
We address the limitations and future outlook of AI in coding.
arXiv Detail & Related papers (2024-10-03T02:36:30Z) - LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection [87.43727192273772]
It is often hard to tell whether a piece of text was human-written or machine-generated.
We present LLM-DetectAIve, designed for fine-grained detection.
It supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished.
arXiv Detail & Related papers (2024-08-08T07:43:17Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning
Matches Human Performance in Some Hermeneutic Tasks [0.0]
We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not.
Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis.
arXiv Detail & Related papers (2024-01-26T19:25:43Z) - Cheap Learning: Maximising Performance of Language Models for Social
Data Science Using Minimal Data [1.8692054990918079]
We review three 'cheap' techniques that have emerged in recent years: weak supervision, transfer learning and prompt engineering.
For the latter, we review the particular case of zero-shot prompting of large language models.
We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost.
arXiv Detail & Related papers (2024-01-22T19:00:11Z) - A Comparative Study of Code Generation using ChatGPT 3.5 across 10
Programming Languages [0.0]
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training.
This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022.
The model's skill in creating code snippets is evaluated across 10 different programming languages and 4 different software domains.
arXiv Detail & Related papers (2023-08-08T15:02:32Z) - Stealing the Decoding Algorithms of Language Models [56.369946232765656]
A key component of generating text from modern language models (LM) is the selection and tuning of decoding algorithms.
In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms.
Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo.
arXiv Detail & Related papers (2023-03-08T17:15:58Z) - Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study is to examine half of the most significant code advances of the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
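For readers unfamiliar with cloze-style probing, the sketch below shows the general form of such a probe using the Hugging Face transformers fill-mask pipeline. The multilingual model and the English example sentence are assumptions chosen for illustration; they are not taken from the X-FACTR benchmark itself.

```python
# Minimal sketch of a cloze-style factual probe, in the spirit of X-FACTR.
# The model and the probe sentence are illustrative placeholders.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Ask a multilingual masked LM to complete a factual statement.
for candidate in fill("Paris is the capital of [MASK].", top_k=3):
    print(f"{candidate['token_str']:>10s}  (score={candidate['score']:.3f})")
```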
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.