Polling Latent Opinions: A Method for Computational Sociolinguistics
Using Transformer Language Models
- URL: http://arxiv.org/abs/2204.07483v2
- Date: Tue, 19 Apr 2022 18:09:39 GMT
- Title: Polling Latent Opinions: A Method for Computational Sociolinguistics
Using Transformer Language Models
- Authors: Philip Feldman, Aaron Dant, James R. Foulds, Shimei Pan
- Abstract summary: We use the capacity for memorization and extrapolation of Transformer Language Models to learn the linguistic behaviors of a subgroup within larger corpora of Yelp reviews.
We show that even in cases where a specific keyphrase is limited or not present at all in the training corpora, the GPT is able to accurately generate large volumes of text that have the correct sentiment.
- Score: 4.874780144224057
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text analysis of social media for sentiment, topic analysis, and other
analyses depends initially on the selection of keywords and phrases that will
be used to create the research corpora. However, keywords that researchers
choose may occur infrequently, leading to errors that arise from using small
samples. In this paper, we use the capacity for memorization, interpolation,
and extrapolation of Transformer Language Models such as the GPT series to
learn the linguistic behaviors of a subgroup within larger corpora of Yelp
reviews. We then use prompt-based queries to generate synthetic text that can
be analyzed to produce insights into specific opinions held by the populations
that the models were trained on. Once the model has been trained, more specific
sentiment queries can be made of it with high accuracy when compared to
traditional keyword searches. We show that even in cases where a specific
keyphrase is limited or not present at all in the training corpora, the GPT is
able to accurately generate large volumes of text that have the correct
sentiment.
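The workflow described above is easy to prototype. The following is a minimal sketch, assuming a GPT-2 checkpoint that has already been fine-tuned on the subgroup's Yelp reviews; the checkpoint name, prompt, and off-the-shelf sentiment classifier are illustrative placeholders rather than the authors' exact setup. The idea: condition generation on a keyphrase prompt, sample many synthetic reviews, and treat the aggregated sentiment labels as a proxy "poll" of the subgroup's opinion.

```python
# Hedged sketch of prompt-based opinion polling with a fine-tuned language model.
# "yelp-subgroup-gpt2" is a hypothetical checkpoint name, not a real model ID.
from transformers import pipeline, set_seed

set_seed(42)

# Generator fine-tuned on the subgroup's reviews; generic sentiment classifier.
generator = pipeline("text-generation", model="yelp-subgroup-gpt2")
sentiment = pipeline("sentiment-analysis")

# Prompt-based query: the keyphrase may be rare or absent in the training corpus.
prompt = "The gluten-free pizza here was"
samples = generator(
    prompt,
    max_new_tokens=40,
    num_return_sequences=50,
    do_sample=True,
    top_p=0.95,
)

# Score each synthetic review and aggregate the label distribution.
labels = [sentiment(s["generated_text"])[0]["label"] for s in samples]
positive_share = labels.count("POSITIVE") / len(labels)
print(f"Estimated positive sentiment for the keyphrase: {positive_share:.0%}")
```

The aggregated share can then be compared against sentiment statistics from a traditional keyword search over the original corpus, which is the comparison the abstract describes.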
Related papers
- Towards Human Understanding of Paraphrase Types in ChatGPT [7.662751948664846]
Atomic paraphrase types (APT) decompose paraphrases into different linguistic changes.
We introduce APTY (Atomic Paraphrase TYpes), a dataset of 500 sentence-level and word-level annotations by 15 annotators.
Our results reveal that ChatGPT can generate simple APTs, but struggles with complex structures.
arXiv Detail & Related papers (2024-07-02T14:35:10Z) - Retrieval is Accurate Generation [99.24267226311157]
We introduce a novel method that selects context-aware phrases from a collection of supporting documents.
Our model achieves the best performance and the lowest latency among several retrieval-augmented baselines.
arXiv Detail & Related papers (2024-02-27T14:16:19Z) - Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language
Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z) - ChatGPT as a Text Simplification Tool to Remove Bias [0.0]
The presence of specific linguistic signals particular to a certain sub-group can be picked up by language models during training.
We explore a potential technique for bias mitigation in the form of simplification of text.
arXiv Detail & Related papers (2023-05-09T13:10:23Z) - Beyond Contrastive Learning: A Variational Generative Model for
Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z) - How much do language models copy from their training data? Evaluating
linguistic novelty in text generation using RAVEN [63.79300884115027]
Current language models can generate high-quality text.
Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions?
We introduce RAVEN, a suite of analyses for assessing the novelty of generated text.
arXiv Detail & Related papers (2021-11-18T04:07:09Z) - Did the Cat Drink the Coffee? Challenging Transformers with Generalized
Event Knowledge [59.22170796793179]
Transformer Language Models (TLMs) were tested on a benchmark for the dynamic estimation of thematic fit.
Our results show that TLMs can reach performances that are comparable to those achieved by SDM.
However, additional analysis consistently suggests that TLMs do not capture important aspects of event knowledge.
arXiv Detail & Related papers (2021-07-22T20:52:26Z) - Corpus-Based Paraphrase Detection Experiments and Review [0.0]
Paraphrase detection is important for a number of applications, including plagiarism detection, authorship attribution, question answering, text summarization, etc.
In this paper, we give a performance overview of various types of corpus-based models, especially deep learning (DL) models, with the task of paraphrase detection.
arXiv Detail & Related papers (2021-05-31T23:29:24Z) - Text Mining for Processing Interview Data in Computational Social
Science [0.6820436130599382]
We use commercially available text analysis technology to process interview text data from a computational social science study.
We find that topical clustering and terminological enrichment provide for convenient exploration and quantification of the responses.
We encourage studies in social science to use text analysis, especially for exploratory open-ended studies.
arXiv Detail & Related papers (2020-11-28T00:44:35Z) - Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to be able to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z) - An Empirical Investigation of Pre-Trained Transformer Language Models
for Open-Domain Dialogue Generation [23.343006562849126]
We present an empirical investigation of pre-trained Transformer-based auto-regressive language models for the task of open-domain dialogue generation.
A pre-training and fine-tuning paradigm is employed for learning.
Experiments are conducted on the typical single-turn and multi-turn dialogue corpora such as Weibo, Douban, Reddit, DailyDialog, and Persona-Chat.
arXiv Detail & Related papers (2020-03-09T15:20:21Z)