Towards Coding Social Science Datasets with Language Models
- URL: http://arxiv.org/abs/2306.02177v1
- Date: Sat, 3 Jun 2023 19:11:34 GMT
- Title: Towards Coding Social Science Datasets with Language Models
- Authors: Christopher Michael Rytting, Taylor Sorensen, Lisa Argyle, Ethan
Busby, Nancy Fulda, Joshua Gubler, David Wingate
- Abstract summary: Researchers often rely on humans to code (label, annotate, etc.) large sets of texts.
Recent advances in a specific kind of artificial intelligence tool - language models (LMs) - provide a solution.
We find that GPT-3 can match the performance of typical human coders and offers benefits over other machine learning methods of coding text.
- Score: 4.280286557747323
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Researchers often rely on humans to code (label, annotate, etc.) large sets
of texts. This kind of human coding forms an important part of social science
research, yet the coding process is both resource intensive and highly variable
from application to application. In some cases, efforts to automate this
process have achieved human-level accuracies, but to achieve this, these
attempts frequently rely on thousands of hand-labeled training examples, which
makes them inapplicable to small-scale research studies and costly for large
ones. Recent advances in a specific kind of artificial intelligence tool -
language models (LMs) - provide a solution to this problem. Work in computer
science makes it clear that LMs are able to classify text, without the cost (in
financial terms and human effort) of alternative methods. To demonstrate the
possibilities of LMs in this area of political science, we use GPT-3, one of
the most advanced LMs, as a synthetic coder and compare it to human coders. We
find that GPT-3 can match the performance of typical human coders and offers
benefits over other machine learning methods of coding text. We find this
across a variety of domains using very different coding procedures. This
provides exciting evidence that language models can serve as a critical advance
in the coding of open-ended texts in a variety of applications.
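Although the paper's exact prompts and codebooks are not reproduced here, the basic workflow it describes (prompting a language model to assign codebook categories to open-ended text, then checking its agreement with human coders) can be sketched in a few lines. The snippet below is a minimal illustration, assuming the OpenAI Python client, a hypothetical four-category codebook, and Cohen's kappa as the agreement metric; the model name, prompt wording, and metric choice are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: use an LLM as a "synthetic coder" for open-ended text,
# then compare its labels to a human coder's with Cohen's kappa.
# The codebook, prompt, and model identifier below are illustrative placeholders.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIES = ["economy", "healthcare", "immigration", "other"]  # hypothetical codebook

def code_response(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LM to assign exactly one codebook category to a response."""
    prompt = (
        "You are coding open-ended survey responses for a social science study.\n"
        f"Assign exactly one category from this list: {', '.join(CATEGORIES)}.\n"
        f'Response: "{text}"\n'
        "Answer with the category name only."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the coding as deterministic as the API allows
    )
    label = reply.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"  # guard against off-codebook answers

# Toy agreement check against a human coder.
responses = ["Jobs and wages are my top concern.", "Hospital bills are out of control."]
human_labels = ["economy", "healthcare"]
lm_labels = [code_response(r) for r in responses]
print("Cohen's kappa (LM vs. human):", cohen_kappa_score(human_labels, lm_labels))
```

In practice, the prompt would carry the study's full codebook instructions, and agreement would be computed over many responses and multiple human coders rather than the two toy examples above.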
Related papers
- The why, what, and how of AI-based coding in scientific research [0.0]
Generative AI, particularly large language models (LLMs), has the potential to transform coding into intuitive conversations.
We dissect AI-based coding through three key lenses.
We address the limitations and future outlook of AI in coding.
arXiv Detail & Related papers (2024-10-03T02:36:30Z) - LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection [87.43727192273772]
It is often hard to tell whether a piece of text was human-written or machine-generated.
We present LLM-DetectAIve, designed for fine-grained detection.
It supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished.
arXiv Detail & Related papers (2024-08-08T07:43:17Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Scalable Qualitative Coding with LLMs: Chain-of-Thought Reasoning
Matches Human Performance in Some Hermeneutic Tasks [0.0]
We show that GPT-4 is capable of human-equivalent interpretations, whereas GPT-3.5 is not.
Our results indicate that for certain codebooks, state-of-the-art LLMs are already adept at large-scale content analysis.
arXiv Detail & Related papers (2024-01-26T19:25:43Z) - Cheap Learning: Maximising Performance of Language Models for Social
Data Science Using Minimal Data [1.8692054990918079]
We review three 'cheap' techniques that have emerged in recent years: weak supervision, transfer learning and prompt engineering.
For the latter, we review the particular case of zero-shot prompting of large language models.
We show good performance for all techniques, and in particular we demonstrate how prompting of large language models can achieve high accuracy at very low cost.
arXiv Detail & Related papers (2024-01-22T19:00:11Z) - A Comparative Study of Code Generation using ChatGPT 3.5 across 10
Programming Languages [0.0]
Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training.
This research investigates the coding proficiency of ChatGPT 3.5, an LLM released by OpenAI in November 2022.
The model's skill in creating code snippets is evaluated across 10 different programming languages and 4 different software domains.
arXiv Detail & Related papers (2023-08-08T15:02:32Z) - Stealing the Decoding Algorithms of Language Models [56.369946232765656]
A key component of generating text from modern language models (LM) is the selection and tuning of decoding algorithms.
In this work, we show, for the first time, that an adversary with typical API access to an LM can steal the type and hyperparameters of its decoding algorithms.
Our attack is effective against popular LMs used in text generation APIs, including GPT-2, GPT-3 and GPT-Neo.
arXiv Detail & Related papers (2023-03-08T17:15:58Z) - Chatbots As Fluent Polyglots: Revisiting Breakthrough Code Snippets [0.0]
The research applies AI-driven code assistants to analyze a selection of influential computer code that has shaped modern technology.
The original contribution of this study is to examine half of the most significant code advances of the last 50 years.
arXiv Detail & Related papers (2023-01-05T23:17:17Z) - Enhancing Semantic Code Search with Multimodal Contrastive Learning and
Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z) - X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained
Language Models [103.75890012041366]
Language models (LMs) have proven surprisingly successful at capturing factual knowledge.
However, studies on LMs' factual representation ability have almost invariably been performed on English.
We create a benchmark of cloze-style probes for 23 typologically diverse languages.
arXiv Detail & Related papers (2020-10-13T05:29:56Z)
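For readers unfamiliar with cloze-style probing, the sketch below shows the general form of such a probe using the Hugging Face transformers fill-mask pipeline. The multilingual model and the English example sentence are assumptions chosen for illustration; they are not taken from the X-FACTR benchmark itself.

```python
# Minimal sketch of a cloze-style factual probe, in the spirit of X-FACTR.
# The model and the probe sentence are illustrative placeholders.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

# Ask a multilingual masked LM to complete a factual statement.
for candidate in fill("Paris is the capital of [MASK].", top_k=3):
    print(f"{candidate['token_str']:>10s}  (score={candidate['score']:.3f})")
```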
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.